Methods in Molecular Biology
Series Editor John M. Walker School of Life Sciences University of Hertfordshire Hatfield, Hertfordshire, AL10 9AB, UK
For other titles published in this series, go to www.springer.com/series/7651
Statistical Methods in Molecular Biology Edited by
Heejung Bang, Xi Kathy Zhou, and Madhu Mazumdar Weill Medical College of Cornell University, New York, NY, USA
Heather L. Van Epps Rockefeller University Press, New York, NY, USA
Editors Heejung Bang Division of Biostatistics and Epidemiology Department of Public Health Weill Cornell Medical College 402 East 67th St. New York, NY 10065 USA
[email protected]
Xi Kathy Zhou Division of Biostatistics and Epidemiology Department of Public Health Weill Cornell Medical College 402 East 67th St. New York, NY 10065 USA
[email protected]
Heather L. Van Epps Rockefeller University Press Journal of Experimental Medicine 1114 First Ave., 3rd Floor New York, NY 10021 USA
[email protected]
Madhu Mazumdar Division of Biostatistics and Epidemiology Department of Public Health Weill Cornell Medical College 402 East 67th St. New York, NY 10065 USA
[email protected]
ISSN 1064-3745    e-ISSN 1940-6029
ISBN 978-1-60761-578-1    e-ISBN 978-1-60761-580-4
DOI 10.1007/978-1-60761-580-4
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2009942427

© Springer Science+Business Media, LLC 2010

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Humana Press, c/o Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Cover illustration: Adapted from Figure 19 of Chapter 2. Traditional MDS map showing genes clustered by coregulation (background) and significance of the univariate p-values (size of red circles). The overlaid network indicates the "most significant" gene pairs (green) and trios (blue) (right). Font size indicates the smallest of the uni-, bi-, and tri-variate p-values.

Printed on acid-free paper

Humana Press is part of Springer Science+Business Media (www.springer.com)
Knowing is not enough; we must apply. Willing is not enough; we must do.
– Johann Wolfgang von Goethe
Preface

This book is intended for molecular biologists who perform quantitative analyses on data emanating from their field and for the statisticians who work with molecular biologists and other biomedical researchers. There are many excellent textbooks that provide the fundamental components of statistical training curricula. There are also many "by experts, for experts" books in statistics and molecular biology that require in-depth knowledge of both subjects to be taken full advantage of. So far, however, no statistics book has been published that provides the basic principles of proper statistical analysis and progresses to more advanced statistics in response to the rapidly developing technologies and methodologies in the field of molecular biology. This book aims to bridge the gap between these two extremes. Molecular biologists will benefit from the progressive style of the book, in which basic statistical methods are introduced and gradually elevated to an intermediate level. Similarly, statisticians will benefit from learning about the various types of biological data generated in molecular biology, the questions of interest to molecular biologists, and the statistical approaches to analyzing the data.

The statistical concepts and methods relevant to studies in molecular biology are presented in a simple and practical manner. Specifically, the book covers basic and intermediate statistics that are useful for classical and molecular biology settings, as well as advanced statistical techniques that can help solve problems commonly encountered in modern molecular biology studies, such as supervised and unsupervised learning, hidden Markov models, manipulation and analysis of data from high-throughput microarray and proteomic platforms, and synthesis of such evidence. A tutorial-type format is used in some chapters to maximize learning. Advice from journal editors on peer-reviewed publication and some useful information on software implementation are also provided. This book is recommended as supplementary material both inside and outside classrooms, and as a self-learning guide for students, scientists, and researchers who deal with numeric data in molecular biology and related fields. Those who start as beginners but wish to reach an intermediate level will find it especially useful along their learning pathway.

We want to thank John Walker (series editor); Patrick Marton, David Casey, and Anne Meagher (editors at Springer and Humana); and Shanthy Jaganathan (Integra-India). The following persons provided useful advice and comments on selection of topics, referral to experts in each topic, and/or chapter reviews, which we truly appreciate: Stephen Looney (a former editor of this book), Stan Young, Dmitri Zaykin, Douglas Hawkins, Wei Pan, Alexandre Almeida, John Ho, Rebecca Doerge, Paula Trushin, Kevin Morgan, Jason Osborne, Peter Westfall, Jenny Xiang, Ya-lin Chiu, Yolanda Barron, Huibo Shao, Alvin Mushlin, and Ronald Fanta. Drs. Bang, Zhou, and Mazumdar were partially supported by a Clinical Translational Science Center (CTSC) grant (UL1-RR024996).

Heejung Bang
Contents

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Contributors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

PART I: BASIC STATISTICS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1. Experimental Statistics for Biological Sciences . . . . . . . . . . . . . . . . . . . 3
   Heejung Bang and Marie Davidian
2. Nonparametric Methods for Molecular Biology . . . . . . . . . . . . . . . . . . 105
   Knut M. Wittkowski and Tingting Song
3. Basics of Bayesian Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
   Sujit K. Ghosh
4. The Bayesian t-Test and Beyond . . . . . . . . . . . . . . . . . . . . . . . . . 179
   Mithat Gönen

PART II: DESIGNS AND METHODS FOR MOLECULAR BIOLOGY . . . . . . . . . . 201
5. Sample Size and Power Calculation for Molecular Biology Studies . . . . . . . . 203
   Sin-Ho Jung
6. Designs for Linkage Analysis and Association Studies of Complex Diseases . . . . 219
   Yuehua Cui, Gengxin Li, Shaoyu Li, and Rongling Wu
7. Introduction to Epigenomics and Epigenome-Wide Analysis . . . . . . . . . . . 243
   Melissa J. Fazzari and John M. Greally
8. Exploration, Visualization, and Preprocessing of High-Dimensional Data . . . . 267
   Zhijin Wu and Zhiqiang Wu

PART III: STATISTICAL METHODS FOR MICROARRAY DATA . . . . . . . . . . . . 285
9. Introduction to the Statistical Analysis of Two-Color Microarray Data . . . . . . 287
   Martina Bremer, Edward Himelblau, and Andreas Madlung
10. Building Networks with Microarray Data . . . . . . . . . . . . . . . . . . . . . 315
    Bradley M. Broom, Waree Rinsurongkawong, Lajos Pusztai, and Kim-Anh Do

PART IV: ADVANCED OR SPECIALIZED METHODS FOR MOLECULAR BIOLOGY . . 345
11. Support Vector Machines for Classification: A Statistical Portrait . . . . . . . . . 347
    Yoonkyung Lee
12. An Overview of Clustering Applied to Molecular Biology . . . . . . . . . . . . 369
    Rebecca Nugent and Marina Meila
13. Hidden Markov Model and Its Applications in Motif Findings . . . . . . . . . . 405
    Jing Wu and Jun Xie
14. Dimension Reduction for High-Dimensional Data . . . . . . . . . . . . . . . . 417
    Lexin Li
15. Introduction to the Development and Validation of Predictive Biomarker Models from High-Throughput Data Sets . . . . . . . . . . . . . . . . . . . . 435
    Xutao Deng and Fabien Campagne
16. Multi-gene Expression-based Statistical Approaches to Predicting Patients' Clinical Outcomes and Responses . . . . . . . . . . . . . . . . . . . . 471
    Feng Cheng, Sang-Hoon Cho, and Jae K. Lee
17. Two-Stage Testing Strategies for Genome-Wide Association Studies in Family-Based Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
    Amy Murphy, Scott T. Weiss, and Christoph Lange
18. Statistical Methods for Proteomics . . . . . . . . . . . . . . . . . . . . . . . . 497
    Klaus Jung

PART V: META-ANALYSIS FOR HIGH-DIMENSIONAL DATA . . . . . . . . . . . . 509
19. Statistical Methods for Integrating Multiple Types of High-Throughput Data . . 511
    Yang Xie and Chul Ahn
20. A Bayesian Hierarchical Model for High-Dimensional Meta-analysis . . . . . . . 531
    Fei Liu
21. Methods for Combining Multiple Genome-Wide Linkage Studies . . . . . . . . 541
    Trecia A. Kippola and Stephanie A. Santorico

PART VI: OTHER PRACTICAL INFORMATION . . . . . . . . . . . . . . . . . . . 561
22. Improved Reporting of Statistical Design and Analysis: Guidelines, Education, and Editorial Policies . . . . . . . . . . . . . . . . . . . . . . . . . 563
    Madhu Mazumdar, Samprit Banerjee, and Heather L. Van Epps
23. Stata Companion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
    Jennifer Sousa Brennan

Subject Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627
Contributors

CHUL AHN • Division of Biostatistics, Department of Clinical Sciences, The Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA
SAMPRIT BANERJEE • Division of Biostatistics and Epidemiology, Department of Public Health, Weill Cornell Medical College, New York, NY, USA
HEEJUNG BANG • Division of Biostatistics and Epidemiology, Weill Cornell Medical College, New York, NY, USA
MARTINA BREMER • Department of Mathematics, San Jose State University, San Jose, CA, USA
JENNIFER SOUSA BRENNAN • Department of Biostatistics, Yale University, New Haven, CT, USA
BRADLEY M. BROOM • Department of Bioinformatics and Computational Biology, University of Texas M. D. Anderson Cancer Center, Houston, TX, USA
FABIEN CAMPAGNE • HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Department of Physiology and Biophysics, Weill Cornell Medical College, New York, NY, USA
FENG CHENG • Department of Biophysics, University of Virginia, Charlottesville, VA, USA
SANG-HOON CHO • Department of Public Health Sciences, University of Virginia, Charlottesville, VA, USA
YUEHUA CUI • Department of Statistics and Probability, Michigan State University, East Lansing, MI, USA
MARIE DAVIDIAN • Department of Statistics, North Carolina State University, Raleigh, NC, USA
XUTAO DENG • HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Cornell Medical College, New York, NY, USA
KIM-ANH DO • Department of Biostatistics, University of Texas M. D. Anderson Cancer Center, Houston, TX, USA
MELISSA J. FAZZARI • Division of Biostatistics, Department of Epidemiology and Population Health, Department of Genetics, Albert Einstein College of Medicine, Bronx, NY, USA
SUJIT K. GHOSH • Department of Statistics, North Carolina State University, Raleigh, NC, USA
MITHAT GÖNEN • Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY, USA
JOHN M. GREALLY • Department of Genetics, Department of Medicine, Albert Einstein College of Medicine, Bronx, NY, USA
EDWARD HIMELBLAU • Department of Biological Science, California Polytechnic State University, San Luis Obispo, CA, USA
KLAUS JUNG • Department of Medical Statistics, Georg-August-University Göttingen, Göttingen, Germany
SIN-HO JUNG • Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA
TRECIA A. KIPPOLA • Department of Statistics, Oklahoma State University, Stillwater, OK, USA
CHRISTOPH LANGE • Center for Genomic Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA
JAE K. LEE • Department of Public Health Sciences, University of Virginia, Charlottesville, VA, USA
YOONKYUNG LEE • Department of Statistics, The Ohio State University, Columbus, OH, USA
GENGXIN LI • Department of Statistics and Probability, Michigan State University, East Lansing, MI, USA
LEXIN LI • Department of Statistics, North Carolina State University, Raleigh, NC, USA
SHAOYU LI • Department of Statistics and Probability, Michigan State University, East Lansing, MI, USA
FEI LIU • Department of Statistics, University of Missouri, Columbia, MO, USA
ANDREAS MADLUNG • Department of Biology, University of Puget Sound, Tacoma, WA, USA
MADHU MAZUMDAR • Division of Biostatistics and Epidemiology, Department of Public Health, Weill Cornell Medical College, New York, NY, USA
MARINA MEILA • Department of Statistics, University of Washington, Seattle, WA, USA
AMY MURPHY • Channing Laboratory, Center for Genomic Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
REBECCA NUGENT • Department of Statistics, Carnegie Mellon University, Pittsburgh, PA, USA
LAJOS PUSZTAI • Department of Breast Medical Oncology, University of Texas M. D. Anderson Cancer Center, Houston, TX, USA
WAREE RINSURONGKAWONG • Department of Biostatistics, University of Texas M. D. Anderson Cancer Center, Houston, TX, USA
STEPHANIE A. SANTORICO • Department of Mathematical and Statistical Sciences, University of Colorado, Denver, CO, USA
TINGTING SONG • Center for Clinical and Translational Science, The Rockefeller University, New York, NY, USA
HEATHER L. VAN EPPS • Journal of Experimental Medicine, Rockefeller University Press, New York, NY, USA
SCOTT T. WEISS • Channing Laboratory, Center for Genomic Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
KNUT M. WITTKOWSKI • Center for Clinical and Translational Science, The Rockefeller University, New York, NY, USA
JING WU • Department of Statistics, Carnegie Mellon University, Pittsburgh, PA, USA
RONGLING WU • Departments of Public Health Sciences and Statistics, Pennsylvania State University, Hershey, PA, USA
ZHIJIN WU • Center for Statistical Sciences, Brown University, Providence, RI, USA
ZHIQIANG WU • Department of Electrical Engineering, Wright State University, Dayton, OH, USA
JUN XIE • Department of Statistics, Purdue University, West Lafayette, IN, USA
YANG XIE • Division of Biostatistics, Department of Clinical Sciences, The Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA
Part I Basic Statistics
Chapter 1

Experimental Statistics for Biological Sciences

Heejung Bang and Marie Davidian

Abstract

In this chapter, we cover basic and fundamental principles and methods in statistics – from "What are Data and Statistics?" to "ANOVA and linear regression" – which are the basis of any statistical thinking and undertaking. Readers can easily find the selected topics in most introductory statistics textbooks, but we have tried to assemble and structure them in a succinct and reader-friendly manner in a stand-alone chapter. This text has long been used in real classroom settings for both undergraduate and graduate students who do or do not major in statistical sciences. We hope that from this chapter readers will understand key statistical concepts and terminology, how to design a study (experimental or observational), how to analyze the data (e.g., describe the data and/or estimate the parameter(s) and make inference), and how to interpret the results. This text will be most useful as supplemental material while readers take their own statistics courses, or as a self-teaching guide used alongside the manual for a statistical software package.

Key words: ANOVA, correlation, data, estimation, experimental design, frequentist, hypothesis testing, inference, regression, statistics.
1. Introduction to Statistics

1.1. Basis for Statistical Methodology

The purpose of the discussion in this section is to stimulate you to start thinking about the important issues upon which statistical methodology is based. Heejung Bang adapted Dr. Davidian's lecture notes used for the course "Experimental Statistics for Biological Sciences" at the North Carolina State University. The complete version of the lecture notes is available at http://www4.stat.ncsu.edu/~davidian/.
H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_1, © Springer Science+Business Media, LLC 2010
Statistics: The development and application of theory and methods to the collection (design), analysis, and interpretation of observed information from planned (or unplanned) experiments.

Typical Objectives: Some examples are as follows:
(i) Determine which of three fertilizer compounds produces the highest yield
(ii) Determine which of two drugs is more effective for controlling a certain disease in humans
(iii) Determine whether an activity such as smoking causes a response such as lung cancer

Examples (i) and (ii) represent situations where the scientist has the opportunity to plan (design) the experiment. Such a preplanned investigation is called a controlled experiment. The goal is to compare treatments (fertilizers, drugs). In example (iii), the scientist may only observe the phenomenon of interest. The treatment is smoking, but the experimenter has no control over who smokes. Such an investigation is called an observational study. In this section, we will focus mostly on controlled experiments, which leads us to thinking about design of experiments.

Key Issue: We would like to make conclusions based on the data arising as the result of an experiment, and we would moreover like the conclusions to apply in general. For example, in (i), we would like to claim that the fertilizers produce different yields in general based on the particular data from a single experiment. Let us introduce some terminology.

Population: The entire "universe" of possibilities. For example, in (ii), the population is all patients afflicted with the disease.

Sample: A part of the population that we can observe. Observation of a sample gives rise to information on the phenomenon of interest, the data.

Using this terminology, we may refine the statement of our objective: we would like to make statements about the population based on observation of samples. For example, in (ii), we obtain two samples of diseased patients and subject one to drug 1 and the other to drug 2.
In agricultural, medical, and other biological applications, the most common objective is the comparison of two or more treatments. We will thus often talk about statistical inference in the context of comparing treatments.

Problem: A sample we observe is only one of many possible samples we might have seen instead. That is, in (ii), one sample of patients would be expected to be similar to another, but not identical. Plants will differ due to biological variability, which may cause different reactions to fertilizers in example (i). There is uncertainty about the inference we make on a population based on observation of samples.
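To make sample-to-sample variability concrete, here is a minimal simulation sketch (all numbers are invented for illustration, not from the text): repeated random samples from the same population yield different sample means, which is exactly the uncertainty statistical inference must account for.

```python
import random

random.seed(1)

# A hypothetical population: responses of 10,000 patients to a drug
population = [random.gauss(50, 10) for _ in range(10_000)]

def sample_mean(pop, n):
    """Draw a random sample of size n and return its mean."""
    draw = random.sample(pop, n)
    return sum(draw) / n

# Five different samples of 25 patients each give five different means,
# all hovering around the population value but never identical
means = [sample_mean(population, 25) for _ in range(5)]
print([round(m, 1) for m in means])
```

Each run of the loop stands in for "an experiment we might have performed instead"; the spread of the five means is the variation that inference must acknowledge.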
The premise of statistical inference is that we attempt to control and assess the uncertainty of inferences we make on the population of interest based on observation of samples.

Key: Allow explicitly for variation in and among samples. Statistics is thus often called the "study of variation."

First Step: Set up or design experiments to control variability as much as possible. This is certainly possible in situations such as field trials in agriculture, clinical trials in medicine, and reliability studies in engineering. It is not entirely possible in observational studies, where the samples "determine themselves."

Principles of Design: Common sense is the basis for most of the ideas for designing scientific investigations:

• Acknowledgment of potential sources of variation. Suppose it is suspected that males may react differently to a certain drug than females. In this situation, it would make sense to assign the drugs to the samples with this in mind instead of with no regard to the gender of participants in the experiment. If this assignment is done correctly, differences in treatments may be assessed despite differences in response due to gender. If gender is ignored, actual differences in treatments could be obscured by differences due to gender.

• Confounding. Suppose in example (i), we select all plants to be treated with fertilizer 1 from one nursery, all for fertilizer 2 from another nursery, etc. Or, alternatively, suppose we keep all fertilizer 1 plants in one greenhouse, all fertilizer 2 plants in another greenhouse, etc. These choices may be made for convenience or simplicity, but they introduce problems: under such an arrangement, we will not know whether any differences we observe are due to actual differences in the effects of the treatments or to differences in the origin or handling of plants. In such a case, the effects of treatments are said to be confounded with the effects of, in this case, nursery or greenhouse.

Here is another example. Consider a clinical trial to compare a new, experimental treatment to the standard treatment. Suppose a doctor assigns patients with advanced cases of disease to the new experimental drug and patients with mild cases to the standard treatment, thinking that the new drug is promising and should thus be given to the sicker patients. The new drug may then perform poorly relative to the standard drug: the effects of the drugs are confounded with the severity of the disease.

To take adequate account of variation and to avoid confounding, we would like the elements of our samples to be as alike as possible except for the treatments. Assignment of treatments to the samples should be done so that potential sources of variation do not obscure treatment differences. No amount of fancy statistical analysis can help an experiment that was conducted without paying attention to these issues!
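The clinical-trial example of confounding can be sketched in a small simulation (a toy model with invented numbers): here both drugs are, by construction, equally effective, yet the new drug appears worse because it is given only to the severe cases.

```python
import random

random.seed(0)

def outcome(severity):
    """Recovery score in this toy model depends only on disease
    severity, not on which drug is given: the two drugs are, by
    construction, identical in effect."""
    return 100 - 30 * severity + random.gauss(0, 5)

# Confounded assignment: new drug -> severe patients only,
# standard drug -> mild patients only
new_drug = [outcome(severity=1.0) for _ in range(200)]   # severe cases
standard = [outcome(severity=0.2) for _ in range(200)]   # mild cases

def mean(xs):
    return sum(xs) / len(xs)

# The new drug's average recovery looks far worse, even though the
# drugs are identical here: drug effect is confounded with severity
print(round(mean(new_drug), 1), round(mean(standard), 1))
```

The comparison of the two group means is meaningless as a drug comparison; randomizing severity across the two arms would remove the artifact.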
Second Step: Interpret the data by taking appropriate account of the sources and magnitude of variation and of the design, that is, by assessing variability.

Principle of Statistical Inference: Because variation is involved, and samples represent only a part of the population, we may not make statements that are absolute. Rather, we must temper our statements by acknowledging the uncertainty due to variation and sampling. Statistical methods incorporate this uncertainty into statements interpreting the data. The appropriate methods to use are dictated to a large extent by the design used to collect the data. For example:

• In example (ii) comparing two drugs, if we are concerned about possible gender differences, we might obtain two samples of men and two samples of women and treat one of each with drug 1 and the other of each with drug 2. With such a design, we should be able to gain insight into differences actually due to the drugs as well as variability due to gender. A method to assess drug differences that takes into account the fact that part of the variation observed is attributable to a known feature, gender, would then be appropriate. Intuitively, a method that did not take this into account would be less useful.

Moral: Design and statistical analysis go hand in hand.

1.2. First Notions of Experimental Design
It is obvious from the above discussion that a successful experiment will be one where we consider the issues of variation, confounding, and so on prior to collecting data. That is, ideally, we design an experiment before performing it. Some fundamental concepts of experimental design are as follows.

Randomization: This device is used to ensure that samples are indeed as alike as possible except for the treatments. Randomization is a treatment assignment mechanism – rather than assign the treatments systematically, assign them so that, once all acknowledged sources of variation are accounted for, it can be assumed that no obscuring or confounding effects remain.

Here is an example. Suppose we wish to compare two fertilizers in a certain type of plant (we restrict attention to just two fertilizers for simplicity). Suppose we are going to obtain plants and then assign them to receive one fertilizer or the other. How best to do this? We would like to perform the assignment so that no plant is more likely to get a potentially "better" treatment than another. We would also like to be assured that, prior to receiving treatment, the plants are otherwise basically alike. This way, we may feel confident that differences among the treatments that might show up reflect a "real" phenomenon. A simple way to do this is to obtain a sample of plants from a single nursery, which ensures that they are basically alike, and then determine the two samples by a coin flip:
• Heads = Fertilizer 1, Tails = Fertilizer 2
• Ensures that all plants had an equal chance of getting either fertilizer, that is, all plants are alike except for the treatment
• This is a random process

Such an assignment mechanism, based on chance, is called randomization. Randomization is the cornerstone of the design of controlled experiments.

Potential Problem: Ultimately, we wish to make general statements about the efficacy of the treatments. Although using randomization ensures the plants are as alike as possible except for the treatments and that the treatments were fairly assigned, we have only used plants from one nursery. If plants are apt to respond to fertilizers differently because of nursery of origin, this may limit our ability to make general statements.

Scope of Inference: The scope of inference for an experimental design is limited to the population from which the samples are drawn. In the design above, the population is plants from the single nursery. To avoid limited scope of inference, we might like to instead use plants from more than one nursery. However, to avoid confounding, we do not assign the treatments systematically by nursery.

Extension: Include more nurseries but use the same principles of treatment assignment. For example, suppose we identify three nurseries. By using plants from all three, we broaden the scope of inference. If we repeat the above process of randomization to assign the treatments within each nursery, we increase our scope of inference while ensuring that, once we have accounted for the potential source of variation, nursery, plants receiving each treatment are basically alike except for the treatments. One can imagine that this idea of recognizing potential sources of variation and then assigning treatments randomly to ensure fair comparisons, thus increasing the scope of inference, may be extended to more complicated situations.
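The coin-flip assignment, and its extension to coin flips repeated separately within each nursery, might be sketched as follows (plant and nursery labels are invented for illustration):

```python
import random

random.seed(42)

def randomize(plants):
    """Assign each plant to fertilizer 1 or 2 by a fair coin flip."""
    return {plant: random.choice([1, 2]) for plant in plants}

# Simple randomization: all plants from a single nursery
plants = [f"plant_{i}" for i in range(10)]
assignment = randomize(plants)

# Stratified randomization: repeat the coin flips separately within
# each nursery, so nursery (a known source of variation) cannot be
# confounded with the fertilizer assignment
nurseries = {"A": plants, "B": [f"B_plant_{i}" for i in range(8)]}
stratified = {name: randomize(group) for name, group in nurseries.items()}
print(stratified)
```

Flipping the coin within each nursery broadens the scope of inference to all three (here, two) nurseries while keeping the comparison fair inside each one.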
For example, in our example, another source of variation might be the greenhouses in which the treated plants are kept, and there may be more than two treatments of interest. We will return to these issues later. For now, keep in mind that sound experimental design rests on these key issues:
• Identifying and accounting for potential sources of variation
• Randomly assigning treatments after accounting for sources of variation
• Making sure the scope of the experiment is sufficiently broad

Aside: In an observational study, the samples are already determined by the treatments themselves. Thus, a natural question is whether there is some hidden (confounding) factor that
causes the responses observed. For example, in (iii), is there some underlying trait that in fact causes both smoking and lung cancer? This possibility limits the inference we may make from such studies. In particular, we cannot infer a causal relationship between the treatment (smoking) and the response (lung cancer, yes or no). We may only observe that an association appears to exist. Because the investigator does not have control over how the treatment is applied, interpretation of the results is not straightforward. If aspects of an experiment can be controlled, that is, if the experiment may be designed up-front (treatment assignment determined in advance), there is an obvious advantage in terms of what conclusions we may draw.

1.3. Data
The purpose of this section is to introduce concepts and associated terminology regarding the collection, display, and summary of data.

Data: The actual observations of the phenomenon of interest based on samples from the population.

A Goal of Statistics: Present and summarize data in a meaningful way.

Terminology: A variable is a characteristic that changes (i.e., shows variability) from unit to unit (e.g., subjects, plots). Data are observations of variables. The types of variables are:

Qualitative Variables: Numerical measurements on the phenomenon of interest are not possible. Rather, the observations are categorical.

Quantitative Variables: The observations are in the form of numerical values.
• Discrete: Possible values of the variable differ by fixed amounts.
• Continuous: All values in a given range are possible. We may be limited in recording the exact values by the precision and/or accuracy of the measuring device.

Appropriate statistical methods are dictated in part by the nature of the variable in question.
1.3.1. Random Samples
We have already discussed the notion of randomization in the design of an experiment. The underlying goal is to ensure that samples may be assumed to be "representative" of a population of interest. In our nursery example, then, one population of interest might be that of plants subjected to treatment 1. The random assignment determining which plants receive treatment 1 may thus be thought of as an attempt to obtain a representative sample from this population, so that data obtained from the sample will be free from confounding and bias. In general, the way in which we will view a "representative sample" is one chosen so that any member of the population has an equal chance
of being in the sample. In the nursery example, it is assumed that the plants ending up in the sample receiving treatment 1 may be thought of as being chosen from the population of all plants were they to receive treatment 1. All plants in the overall population could equally well have ended up in this sample. The justification for this assumption is that randomization was used to determine the sample. The idea that samples have been chosen randomly is a foundation of much of the theory underlying the statistical methods that we will study. It is instructive, in order to shape our thinking later when we talk about the formal theory, to think about a conceptual model for random sampling.

1.3.2. Sampling – A Model
One way to think about random sampling in a simple way is to think about drawing from a box.

Population: Slips of paper in a box, one slip per individual (plant, patient, plot, etc.).

Sample: To obtain a random sample, draw slips of paper from the box such that, on each draw, all slips have an equal chance of being chosen, i.e., completely random selection.

Two Ways to Draw:
• With replacement: On each draw, all population members have the same chance of being in the sample. This is simple to think about, but the drawback is that an individual could conceivably end up in the sample more than once.
• Without replacement: The number of slips in the box decreases with each draw, because slips are not replaced. Thus, the chance of being in the sample increases with the number of draws (size of the sample). If the population is large relative to the size of the sample to be chosen, this increase is negligible, and we may for practical purposes view drawing without replacement as drawing with replacement. The populations of interest are usually huge relative to the size of the samples we use for experimentation.

WHY Is This Important? To simplify the theory, standard statistical methods are predicated on the notion that the samples are completely random, which would follow if the samples were indeed chosen with replacement. This fact allows us to view the samples as effectively having been drawn in this way, i.e., with replacement from the populations of interest. This is the model that we will use when thinking about the properties of data in order to develop statistical methods.

Practical Implementation: We have already talked about using the flip of a coin to randomize treatments and hence determine random samples. When there are several treatments, the randomization is accomplished by a random process that is a little more sophisticated than a simple coin flip, but is still in the same spirit. We will assume that if sampling from the population has been
Bang and Davidian
carried out with careful objectivity in procuring members of the population to which to assign treatments, and the assignment has been made using these principles, then this model applies. Henceforth, then, we will use the term sample interchangeably with random sample, and we will develop methods based on the assumption of random sampling. We will use the term sample to refer to both the physical process and to the data arising from it. 1.3.3. Descriptive Statistics
Idea: Summarize data by quantifying the notions of “center” and “spread” or “dispersion.” That is, define relevant measures that summarize these notions. One Objective: By quantifying center and spread for a sample, we hope to get an idea of these same notions for the population from which the sample was drawn. As above, if we could measure the diameter of every cherry tree in the forest, and quantify the “center” and “spread” of all of these diameters, we would hope that the “center” and “spread” values for our sample of 31 trees from this population would bear resemblance to the population values. Notation: For the remainder of this discussion, we adopt the following standard notation to describe a sample of data: n = size of the sample, Y = the variable of interest, and Y1, Y2, …, Yn = observations on the variable for the sample. Measures of Center: There are several different ways to define the notion of “center.” Here, we focus on the two most common: • Mean or average: This is the most common notion of center. For our data, the sample mean is defined as

Y¯ = (1/n) Σᵢ₌₁ⁿ Yi = sample mean.   [1]
That is, the sample mean is the average of the sample values. The notation “ Y¯ ” is standard; the “bar” indicates the “averaging” operation being performed. The population mean is the corresponding quantity for the population. We often use the Greek symbol μ to denote the population mean: μ = mean of all possible values for the variable for the population. One may think of μ as the value that would be obtained if the averaging operation in [1] were applied to all the values in the population.
Experimental Statistics for Biological Sciences
• Median: The median is the value such that, when the data are put into ascending order, 50% of the observations are above and 50% are below this value. Thus, the median quantifies the notion of center differently from the mean – the assessment of “center” is based on the ordering of the observations (their distribution) rather than on averaging. It should be clear that, with this definition, the value chosen as the median of a sample need not be unique. If n is odd, then the definition may be applied exactly – the median is the “center” observation. When n is even, the median can be defined as the average of the two “middle” values. The median better represents a typical value for skewed data. Fact: It is clear that the mean and median do not coincide in general. Measures of Spread: Two data sets (or populations) may have the same mean, but may be “spread” about this mean value very differently. For a set of data, a measure of spread for an individual observation is the sample deviation (Yi − Y¯). Intuition suggests that the mean or average of these deviations ought to be a good measure of how the observations are spread about the mean for the sample. However, it is straightforward to show that Σᵢ₌₁ⁿ (Yi − Y¯) = 0 for any set of data. A sensible measure of “spread” is concerned with how spread out the observations are; direction is not important. The most common way of ignoring the sign of the deviations is to square them. • Sample variance: For a sample of size n, this is defined as

s² = (1/(n − 1)) Σᵢ₌₁ⁿ (Yi − Y¯)².   [2]
This looks like an average, but with (n − 1) instead of n as the divisor – the reason for this will be discussed later. [(n − 1) is called the degrees of freedom (df), as we’ll see.] Note that if the original data are measured in some units, then the sample variance has (units)². Thus, the sample variance does not measure spread on the same scale of measurement as the data. • Sample standard deviation (SD): To get a measure of spread on the original scale of measurement, take the square root of the sample variance. This quantity, which thus measures spread in data units, is called the sample SD:

s = √s² = √[ (1/(n − 1)) Σᵢ₌₁ⁿ (Yi − Y¯)² ].
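These definitions are easy to check numerically. The sketch below (data values invented purely for illustration) computes the sample mean, median, variance with the (n − 1) divisor, and SD directly, and can be compared against Python's statistics module:

```python
# Computing the sample mean, median, variance, and SD "by hand".
# The data values are invented for illustration only.
import math
import statistics

Y = [11.2, 9.8, 10.5, 12.1, 10.9, 9.4, 11.7]
n = len(Y)

ybar = sum(Y) / n                                # sample mean, Eq. [1]
med = statistics.median(Y)                       # middle value for odd n
s2 = sum((y - ybar) ** 2 for y in Y) / (n - 1)   # sample variance, Eq. [2]
s = math.sqrt(s2)                                # sample SD, in data units

print(ybar, med, s2, s)   # mean ≈ 10.8, median = 10.9
```

Note that the variance and SD agree with `statistics.variance(Y)` and `statistics.stdev(Y)`, which also use the (n − 1) divisor.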
The sample variance and SD are generally available on handheld calculators. It is instructive, however, to examine hand calculation of them. First, it can be shown that

Σᵢ₌₁ⁿ (Yi − Y¯)² = Σᵢ₌₁ⁿ Yi² − (1/n)(Σᵢ₌₁ⁿ Yi)².   [3]

This formula is of interest for several reasons: Formula [3] is preferred over [2] when doing hand calculation, to avoid propagation of error. Breaking the sum of squared deviations into two pieces highlights a concept that will be important to our thinking later. Write SS = Σᵢ₌₁ⁿ (Yi − Y¯)². This is the sum of squared deviations, but is often called the sum of squares adjusted for the mean. The reason may be deduced from [3]. The two components are called

Σᵢ₌₁ⁿ Yi² = unadjusted sum of squares of the data,

(1/n)(Σᵢ₌₁ⁿ Yi)² = correction or adjustment for the mean = nY¯².
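The identity [3] can be verified numerically; a small sketch with invented data:

```python
# Numerical check of identity [3]: the sum of squared deviations equals
# the unadjusted sum of squares minus the correction for the mean.
# Data values are arbitrary illustration values.
Y = [3.0, 7.0, 4.0, 6.0, 5.0]
n = len(Y)
ybar = sum(Y) / n

ss_direct = sum((y - ybar) ** 2 for y in Y)                 # definition of SS
ss_shortcut = sum(y ** 2 for y in Y) - (sum(Y) ** 2) / n    # formula [3]

print(ss_direct, ss_shortcut)   # both equal 10.0
```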
Thus, we may interpret SS as taking each observation and “centering” (correcting) its magnitude about the sample mean, which itself is calculated from the data. • Range: Difference between highest and lowest values, another simple measure of spread. • Interquartile range: Difference between the third and first quartiles, also called the midspread or middle fifty. 1.3.4. Statistical Inference
We already discussed the notion of a population mean, denoted by μ. We may define in an analogous fashion the population variance and population SD. These quantities may be thought of as the measures that would be obtained if we could calculate the variance and SD based on all the values in the population. Terminology: A parameter is a quantity describing the population, e.g., • μ = population mean • σ² = population variance Parameter values are usually unknown. Thus, in practice, sample values are used to get an idea of the values of the parameters for the population. • Estimator: An estimator is a quantity describing the sample that is used as a “guess” for the value of the corresponding population parameter. For example, Y¯ is an estimator for μ and s² is an estimator for σ².
– The term estimator is usually used to denote the “abstract” quantity, i.e., the formula. – The term estimate is usually used to denote the actual numerical value of the estimator. • Statistic: A quantity derived from the sample observations; e.g., Y¯ and s² are statistics. The larger the sample, the closer estimates will be to the true (but unknown) population parameter values, assuming that sampling was done correctly. • If we are going to use a statistic to estimate a parameter, we would like to have some sense of “how close.” • As the data values exhibit variability, so do statistics, because statistics are based on the data. • A standard way of quantifying “closeness” of an estimator to the parameter is to calculate a measure of how variable the estimator is. • Thus, the variability of the statistic is with respect to the spread of all possible values it could take on. – So we need a measure of how variable sample means from samples of size n are across all the possible data sets of size n we might have ended up with. • Think now of the population of all possible values of the sample mean Y¯ – this population consists of all Y¯ values that may be calculated from all possible samples of size n that may be drawn from the population of data Y values. • This population will itself exhibit spread and itself may be thought of as having a mean and variance. • Statistical theory may be used to show that this mean and variance are as follows:

Mean of population of Y¯ values = μ ≡ μ_Y¯,   [4]

Variance of population of Y¯ values = σ²/n ≡ σ²_Y¯,   [5]
where the symbols on the far right in each expression are the customary notation for representing these quantities – the subscript “Y¯” emphasizes that these are the mean and variance of the population of Y¯ values, not the population of Y values. • Equation [4] shows that the mean of Y¯ values is the same as the mean of Y values! This makes intuitive sense – we expect Y¯ to be similar to the mean of the population of Y values, μ. Equation [4] confirms this and suggests that Y¯ is a sensible estimator for μ.
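Equation [4] can be illustrated by simulation: draw many samples of size n, compute Y¯ for each, and average. All numeric settings below are invented for illustration:

```python
# Simulation sketch of Eq. [4]: averaging the sample means from many
# samples of size n recovers the population mean mu.
# mu, sigma, n, and the repetition count are invented values.
import random

random.seed(1)
mu, sigma, n = 50.0, 10.0, 25

means = []
for _ in range(20000):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(sum(sample) / n)

grand_mean = sum(means) / len(means)
print(grand_mean)   # close to mu = 50
```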
• Furthermore, the quantity σ²/n in [5] represents how variable Y¯ values are. Note that this depends on n. In practice, the quantity σ²_Y¯ depends on σ², which is unknown. Thus, if we want to provide a measure of how variable Y¯ values are, the standard approach is to estimate σ²_Y¯ by replacing σ², the unknown population (of Y values) variance, by its estimate, the sample variance s². • That is, calculate the following and use it as an estimate of σ²_Y¯:

s²_Y¯ = s²/n.

• Define

s_Y¯ = √(s²/n) = s/√n,   [6]
s_Y¯ is referred to as the standard error (SE) (of the mean) and is an estimate of the SD of all possible Y¯ values from samples of size n. • It is important to keep the distinction between s and s_Y¯ clear: s = SD of Y values and s_Y¯ = SD of Y¯ values.
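The distinction between σ/√n (the actual SD of Y¯ values) and the SE s/√n (its estimate from one sample) can likewise be illustrated by simulation; again, all numeric settings are invented:

```python
# Simulation sketch of Eq. [5]/[6]: the SD of sample means across many
# samples of size n is close to sigma/sqrt(n); the SE s/sqrt(n)
# estimates this quantity from a single sample. Settings are invented.
import math
import random
import statistics

random.seed(2)
mu, sigma, n = 50.0, 10.0, 25

means = []
for _ in range(20000):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(sum(sample) / n)

m = sum(means) / len(means)
sd_of_means = math.sqrt(sum((x - m) ** 2 for x in means) / (len(means) - 1))

one_sample = [random.gauss(mu, sigma) for _ in range(n)]
se = statistics.stdev(one_sample) / math.sqrt(n)   # SE from a single sample

# First two values are near sigma/sqrt(n) = 2.0; se is a noisier estimate.
print(sd_of_means, sigma / math.sqrt(n), se)
```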
• The SE is an estimate of the variability (on the original scale of measurement) associated with using the sample mean Y¯ as an estimate of μ. Important: As n increases, σ_Y¯ and s_Y¯ decrease like 1/√n. Thus, the larger the n, the less variable (more precise, reliable) the sample mean Y¯ will be as an estimator of μ! As intuition suggests, larger n’s give “better” estimates. Coefficient of Variation: Often, we wish to get an idea of variability on a relative basis; that is, we would like a unitless measure that describes variation in the data (population). • This is particularly useful if we wish to compare the results of several experiments for which the data are observations on the same variable. • The problem is that different experimental conditions may lead to different variability. Thus, even though the variable Y may be the same, the variability may be different. • The coefficient of variation is a relative measure that expresses variability as a proportion (percentage) of the mean. For a population with mean μ and SD σ, the definition is

CV = σ/μ (as a proportion) = (σ/μ) × 100% (as a percentage).
• As μ and σ are unknown parameters, CV is estimated by replacing them by their estimates from the data, e.g., the estimated CV = s/Y¯. • CV is also a useful quantity when attention is focused on a single set of data – it provides an impression of the amount of variation in the data relative to the size of the thing being measured; thus, if CV is large, it is an indication that we will have a difficult time learning about the “signal” (μ) because of the magnitude of the “noise” (σ). 1.4. Probability Distributions 1.4.1. Probability
In order to talk about chance associated with random samples, it is necessary to talk about probability. It is best to think about things very simply at first, so, as is customary, we do so. This may seem simplistic and even irrelevant, but it is standard to develop the terminology and properties in very simple situations first and then extend the ideas behind them to real situations. Thus, we describe the ideas in terms of probably the simplest situation where chance is involved, the flip of a coin. The ideas, however, are more generally applicable. Terminology: We illustrate the generic terminology in the context of flipping a coin. • (Random) Experiment: A process for which no outcome may be predicted with certainty. For example, if we toss a coin once, the outcome, “heads” (H) or “tails” (T), may not be declared prior to the coin landing. With this definition, we may think of choosing a (random) sample and observing the results as a random experiment – the eventual values of the data we collect may not be predicted with certainty. • Sample space: The sample space is the set of all possible (mutually exclusive) outcomes of an experiment. We will use the notation S in this section to denote the sample space. For example, for the toss of a single coin, the sample space is
S = {H , T }. By mutually exclusive, we mean that the outcomes do not overlap, i.e., they describe totally distinct possibilities. For example, the coin comes up either H or T on any toss – it can’t do both! • Event: An event is a possible result of an experiment. For example, if the experiment consists of two tosses of a coin, the sample space is S = {HH , TH , HT , TT }.
Each element in S is a possible result of this experiment. We will use the notation E in this section to denote an event. We may think of other events as well, e.g., E1 = {see exactly 1 H in 2 tosses} = {TH, HT}, E2 = {see at least 1 H in 2 tosses} = {HH, TH, HT}. Thus, events may be combinations of elements in the sample space. • Probability function: P assigns a number between 0 and 1 to an event. Thus, P quantifies the notion of the chance of an event occurring. The properties are as follows: – For any event E, 0 ≤ P(E) ≤ 1. – If S is composed of mutually exclusive outcomes denoted by Oi, i.e., S = {O1, O2, …}, then

P(S) = Σᵢ P(Oi) = 1.

Thus, S describes everything that could possibly happen, since, intuitively, we assign the probability “1” to an event that must occur. A probability of “0” (zero) implies that an event cannot occur. We may think of the probability of an event E occurring intuitively as

P(E) = (no. of outcomes in S associated with E) / (total no. of possible outcomes in S).

For example, in our experiment consisting of two tosses of a coin,

P(E1) = 2/4 = 1/2,   P(E2) = 3/4.

1.4.2. Random Variables
The development of statistical methods for analyzing data has its foundations in probability. Because our sample is chosen at random, an element of chance is introduced, as noted above. Thus, our observations are themselves best viewed as random, that is, subject to chance. We thus term the variable of interest a random variable (often denoted by r.v.) to emphasize this principle. Data represent observations on the r.v. These may be (for quantitative r.v.’s) discrete or continuous. Events of interest may be formulated in terms of r.v.’s.
In our coin toss experiment, let Y = no. of H in 2 coin tosses. Then we may represent our events as E1 = {Y = 1}, E2 = {Y ≥ 1}. Furthermore, the probabilities of the events may also be written in terms of Y, e.g., P(E1) = P(Y = 1), P(E2) = P(Y ≥ 1). If we did our coin toss experiment n times (each time consisting of two tosses of the coin), and recorded the value of Y each time, we would have data Y1, …, Yn, where Yi is the number of heads seen the ith time we did the experiment. 1.4.3. Probability Distribution of a Random Variable
To understand this concept, it is easiest to first consider the case of a discrete r.v. Y. Thus, Y takes on values that we can think about separately. Our r.v. Y corresponding to the coin tossing experiment is a discrete r.v.; it may only take on the values 0, 1, or 2. Probability Distribution Function: Let y denote a possible value of Y. The function f(y) = P(Y = y) is the probability distribution function for Y. f(y) is the probability we associate with an observation on Y taking the value y. If we think in terms of data, that is, observations on the r.v. Y, then we may think of the population of all possible data values. From this perspective, f(y) may be thought of as the relative frequency with which the value y occurs in the population. A histogram for a sample summarizes the relative frequencies with which the data take on values, and these relative frequencies are represented by area. This gives a pictorial view of how the values are distributed. If we think of probabilities as the relative frequencies with which Y would take on values, then it seems natural to represent probabilities in a similar way. Binomial Distribution: Consider more generally the following experiment: (i) k unrelated (independent) trials are performed. (ii) Each trial has two possible outcomes, e.g., for a coin toss, H or T; for a clinical trial, dead or alive. (iii) For each trial, the probability of the outcome we are interested in is equal to some value p, 0 ≤ p ≤ 1.
For the k trials, we are interested in the number of trials resulting in the outcome of interest. Let S = “success” denote this outcome. Then the r.v. of interest is Y = no. of S in k trials. Y may thus take on the values 0, 1, …, k. To fix ideas, consider an experiment consisting of k coin tosses, and suppose we are interested in the number of H observed in the k tosses. For more realistic situations, the principles are the same, so we use coin tossing as an easy, “all-purpose” illustration. Then Y = no. of H in k tosses and S = H. Furthermore, if our coin is “fair,” then p = 1/2 = P(H). It turns out that the form of the probability distribution function f(y) of Y may be derived mathematically. The expression is

f(y) = P(Y = y) = (k choose y) pʸ (1 − p)ᵏ⁻ʸ,  y = 0, 1, …, k,

where (k choose y) = k!/{y!(k − y)!}. The notation x! is “x factorial” = x(x − 1)(x − 2)···(2)(1), that is, the product of x with all positive whole numbers smaller than x. By convention, 0! = 1. If we do k trials, and y are “successes,” then k − y of them are not “successes.” There are a number of ways that this can happen in k trials – the expression (k choose y) turns out to quantify this number of ways. We are not so interested in this particular f(y) here – our main purpose is to establish the idea that probability distribution functions do exist and may be calculated. Mean and Variance: Thinking about f(y) as a model for the population suggests that we consider other population quantities. In particular, for data from an experiment like this, what would the population mean μ and variance σ² be like? It turns out that mathematical derivations may be used to show that

μ = kp,  σ² = kp(1 − p).
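A short sketch coding f(y) directly from the formula; it checks that the probabilities sum to 1 and that the mean and variance come out to kp and kp(1 − p). The values k = 10 and p = 0.5 are arbitrary illustration choices:

```python
# Binomial probability distribution function, coded from the formula,
# with checks of the population mean and variance. k and p are arbitrary.
from math import comb

def binom_pmf(y, k, p):
    """P(Y = y) for Y = number of successes in k independent trials."""
    return comb(k, y) * p ** y * (1 - p) ** (k - y)

k, p = 10, 0.5
probs = [binom_pmf(y, k, p) for y in range(k + 1)]

total = sum(probs)
mean = sum(y * f for y, f in zip(range(k + 1), probs))
var = sum((y - mean) ** 2 * f for y, f in zip(range(k + 1), probs))

print(total, mean, var)   # 1.0, k*p = 5.0, k*p*(1-p) = 2.5
```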
We would thus expect, if we did the experiment n times, that the sample mean Y¯ and sample variance s 2 would be “close” to these values (and be estimators for them). Continuous Random Variables: Many of the variables of interest in scientific investigations are continuous; thus, they can take on any value. For example, suppose we obtain a sample of n pigs and weigh them. Thus, the r.v. of interest is Y = weight of a pig and the data are Y1 , . . . , Yn , the observed weights for our n pigs.
Y is a r.v. because the pigs were drawn at random from the population of all pigs. Furthermore, all pigs do not weigh exactly the same; they exhibit random variation due to biological and other factors. Goal: Find a function like f(y) for a discrete r.v. that describes the probability of observing a pig weighing y units. This function, f, would thus serve as a model describing the population of pig weights – how they are distributed and how they vary. Probability Density: A function f(y) that describes the distribution of values taken on by a continuous r.v. is called a probability density function. If we could determine f for a particular r.v. of interest, its graph would have the same interpretation as a probability histogram did for a discrete r.v. If we were to take a sample of size n and construct the sample histogram, we would expect it to have roughly the same shape as f – as n increases, we’d expect it to look more and more like f. Thus, a probability density function describes the population of Y values, where Y is a continuous r.v. Normal Approximation to the Binomial Distribution: Recall that, as the number of trials k grows, the probability histogram for a binomial r.v. with p = 1/2 has a shape suspiciously like the normal density function. The former is jagged, while the normal density is a smooth curve; however, as k gets larger, the jagged edges get smoother. Mathematically, it may be shown that, as k gets large, for p near 1/2, the probability distribution function and histogram for the binomial distribution look more and more like the normal probability density function and its graph! Thus, if one is dealing with a discrete binomial r.v., and the number of trials is relatively large, the smooth, continuous normal distribution is often used to approximate the binomial. This allows one to take advantage of the statistical methods we will discuss that are based on the assumption that the normal density is a good description of the population.
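A quick numerical sketch of this approximation: for k = 100 and p = 1/2 (arbitrary illustration values), the binomial probability at a value y is close to the N(kp, kp(1 − p)) density evaluated at y:

```python
# Comparing the binomial pmf with the N(kp, kp(1-p)) density at a few
# values of y, for a fairly large k. k and p are arbitrary choices.
import math

def binom_pmf(y, k, p):
    return math.comb(k, y) * p ** y * (1 - p) ** (k - y)

def normal_pdf(x, mu, sigma):
    return (math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
            / (sigma * math.sqrt(2 * math.pi)))

k, p = 100, 0.5
mu, sigma = k * p, math.sqrt(k * p * (1 - p))   # 50 and 5

for y in (45, 50, 55):
    print(y, binom_pmf(y, k, p), normal_pdf(y, mu, sigma))   # pairs agree closely
```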
We thus confine much of our attention to methods for data that may be viewed (at least approximately) as arising from a population that may be described by a normal distribution. 1.4.4. The Standard Normal Distribution
The probability distribution function f for a normal distribution has a very complicated form. Thus, it is not possible to easily evaluate probabilities for a normal r.v. Y the way we could for a binomial. Luckily, however, these probabilities may be calculated on a computer. Some computer packages have these probabilities “built in”; they are also widely available in tables in the special case when μ = 0 and σ² = 1, e.g., in statistical textbooks. It turns out that such tables are all that is needed to evaluate probabilities for any μ and σ². It is instructive to learn how to use such tables, because the necessary operations give one a better understanding of how probabilities are represented by area. We will first learn how to evaluate normal probabilities when μ and σ² are known; we
will see later how we may use this knowledge to develop statistical methods for estimating them when they are not known. Suppose Y ∼ N(μ, σ²). We wish to calculate probabilities such as

P(Y ≤ y), P(Y ≥ y), P(y1 ≤ Y ≤ y2),   [7]
that is, probabilities for intervals associated with values of Y. Technical Note: When dealing with probabilities for any continuous r.v., we do not make the distinction between strict inequalities like “<” and “>” and inequalities like “≤” and “≥”, because the limitations on our ability to measure exact values make it impossible to distinguish between Y being exactly equal to a value y and Y being equal to a value extremely close to y. Thus, the probabilities in [7] could equally well be written

P(Y < y), P(Y > y), P(y1 < Y < y2).   [8]
The Standard Normal Distribution: Consider the event (y1 ≤ Y ≤ y2). If we think of the r.v. Y − μ, it is clear that this is also a normal r.v., but now with mean 0 and the same variance σ². If we furthermore know σ², and thus the SD σ (same units as Y), suppose we divide every possible value of Y by the value σ. This will yield all possible values of the r.v. Y/σ. Note that this r.v. is “unitless”; e.g., if Y is measured in grams, so is σ, so the units “cancel.” Rather, this r.v. has “units” of SD; that is, if it takes the value 1, this says that the value of Y is 1 SD to the right of the mean. In particular, the SD of Y/σ is 1. Define

Z = (Y − μ)/σ.
Then Z will have mean zero and SD 1. (It has units of SD of the original r.v. Y ). It will also be normally distributed, just like Y , as all we have done is shift the mean and scale by the SD. Hence, we call Z a standard normal r.v., and we write Z ∼ N (0, 1). Applying this, we see that
(y1 ≤ Y ≤ y2) ⇔ ((y1 − μ)/σ ≤ Z ≤ (y2 − μ)/σ)  and  (Y ≥ y) ⇔ (Z ≥ (y − μ)/σ).
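A sketch of this standardization in code, with Φ (the standard normal CDF) written via math.erf; μ, σ, and the interval below are invented illustration values:

```python
# Standardization in code: P(y1 <= Y <= y2) for Y ~ N(mu, sigma^2) via
# the standard normal CDF Phi. mu, sigma, y1, y2 are invented values.
import math

def Phi(z):
    """CDF of the standard normal distribution N(0, 1)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 100.0, 15.0
y1, y2 = 85.0, 115.0   # one SD on either side of the mean

prob = Phi((y2 - mu) / sigma) - Phi((y1 - mu) / sigma)
print(prob)   # about 0.6827, the familiar "68% within one SD"
```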
If we want to find probabilities about events concerning Y , and we know μ and σ , all we need is a table of probabilities for a standard normal r.v. Z. 1.4.5. Statistical Inference
We assume that we are interested in a r.v. Y that may be viewed (exactly or approximately) as following a N (μ, σ 2 ) distribution.
Suppose that we have observed data Y1, …, Yn. In real situations, we may be willing to assume that our data arise from some normal distribution, but we do not know the values of μ or σ². As we have discussed, one goal is to use Y1, …, Yn to estimate μ and σ². We use statistics like Y¯ and s² as estimators for these unknown parameters. Because Y¯ and s² are based on observations on a r.v., they themselves are r.v.’s. Thus, we may think about the populations of all possible values they may take on (from all possible samples of size n). It is natural to thus think of the probability distributions associated with these populations. The fundamental principle behind the methods we will discuss is as follows. We base what we are willing to say about μ and σ² on how likely the values of the statistics Y¯ and s² we saw from our data would be if some values μ0 and σ0² were the true values of these parameters. To assess how likely, we need to understand the probabilities with which Y¯ and s² take on values. That is, we need the probability distributions associated with Y¯ and s². Probability Distribution of Y¯: It turns out that if Y is normal, then the distribution of all possible values of Y¯ is also normal! Thus, if we want to make statements about “how likely” it is that Y¯ would take on certain values, we may calculate these using the normal distribution. We have

Y¯ ∼ N(μ_Y¯, σ²_Y¯),  μ_Y¯ = μ,  σ_Y¯ = σ/√n.

We may use these facts to transform events about Y¯ into events about a standard normal r.v. Z. 1.4.6. The χ² Distribution
Probability Distribution of s²: When the data are observations on a r.v. Y ∼ N(μ, σ²), then it may be shown mathematically that the values taken on by

(n − 1)s²/σ²

are well represented by another distribution different from the normal. This quantity is still continuous, so the distribution has a probability density function. The distribution with this density, and thus the distribution that plays a role in describing probabilities associated with s² values, is called the chi-square distribution with (n − 1) df. This is often written χ².
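A simulation sketch of this fact (all numeric settings invented): the average of many simulated values of (n − 1)s²/σ² should be close to the χ² mean, which equals its df:

```python
# For normal data, (n - 1) s^2 / sigma^2 behaves like a chi-square r.v.
# with n - 1 df, whose mean equals the df. Settings are invented.
import random
import statistics

random.seed(3)
mu, sigma, n = 0.0, 2.0, 8

values = []
for _ in range(20000):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    values.append((n - 1) * statistics.variance(sample) / sigma ** 2)

avg = sum(values) / len(values)
print(avg)   # close to n - 1 = 7
```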
1.4.7. Student’s t Distribution
Recall that one of our objectives will be to develop statistical methods to estimate μ, the population mean, using the obvious estimator, Y¯. We would like to be able to make statements about “how likely” it is that Y¯ would take on certain values. We saw above that this involves appealing to the normal distribution, as Y¯ ∼ N(μ, σ²_Y¯). A problem with this in real life is, of course, that
σ², and hence σ²_Y¯, is not known, but must itself be estimated. Thus, even if we are not interested in σ² in its own right, if we are interested in μ, we still need σ² to make the inferences we desire! An obvious approach would be to replace σ_Y¯ in our standard normal statistic

(Y¯ − μ)/σ_Y¯

by the obvious estimator, s_Y¯ in [6], and consider instead the statistic

(Y¯ − μ)/s_Y¯.   [9]
The value for μ would be a “candidate” value for which we are trying to assess the likelihood of seeing a value of Y¯ like the one we saw. It turns out that when we replace σY¯ by the estimate sY¯ , the resulting statistic [9] no longer follows a standard normal distribution. The presence of the estimate sY¯ in the denominator serves to add variation. Rather, the statistic has a different distribution, which is centered at 0 and has the same, symmetric, bell shape as the normal, but whose probabilities in the extreme “tails” are larger than those of the normal distribution. Student’s Distribution: The probability distribution describing the probabilities associated with the values taken on by the quantity [9] is called the (Student’s) t distribution with (n − 1) df for a sample of size n. 1.4.8. Degrees of Freedom
For both the statistics

(n − 1)s²/σ²  and  (Y¯ − μ)/s_Y¯,

which follow the χ² and t distributions, respectively, the notion of df has arisen. The probabilities associated with each of these statistics depend on n through the df value (n − 1). What is the meaning of this? Note that both statistics depend on s², and recall that s² = (1/(n − 1)) Σᵢ₌₁ⁿ (Yi − Y¯)². Recall also that it is always true that Σᵢ₌₁ⁿ (Yi − Y¯) = 0. Thus, if we know (n − 1) of the deviations in our sample, we may always compute the last one, because the deviations about Y¯ of all n of them must sum to zero. Thus, s² may be thought of as being based on (n − 1) “independent” deviations – the final deviation can be obtained from the other (n − 1). The term df thus has to do with the fact that there are (n − 1) “free” or “independent” quantities upon which the r.v.’s above are based. Thus, we would expect their distributions to depend on this notion as well.
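The “one deviation is determined by the others” point is easy to see numerically (data values arbitrary):

```python
# Degrees-of-freedom sketch: the n deviations from the sample mean
# always sum to zero, so the last one is determined by the other n - 1.
Y = [4.0, 9.0, 6.0, 1.0, 10.0]
ybar = sum(Y) / len(Y)

dev = [y - ybar for y in Y]
print(sum(dev))          # 0 (up to rounding)

# Recover the final deviation from the first n - 1:
last = -sum(dev[:-1])
print(last, dev[-1])     # the same number twice
```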
2. Estimation and Inference 2.1. Estimation, Inference, and Sampling Distributions
Estimation: A particular way to say something about the population based on a sample is to assign likely values, based on the sample, to the unknown parameters describing the population. We have already discussed this notion of estimation, e.g., Y¯ is an estimator for μ, s² is an estimator for σ². Now we get a little more precise about estimation. Note that these estimators are not the only possibilities. For example, recall that

s² = (1/(n − 1)) Σᵢ₌₁ⁿ (Yi − Y¯)²;

an alternative estimator for σ² would be

s∗² = (1/n) Σᵢ₌₁ⁿ (Yi − Y¯)²,
where we have replaced the divisor (n − 1) by n. An obvious question would be: if we can identify competing estimators for population parameters of interest, how can we decide among them? Recall that estimators such as Y¯ and s² may be thought of as having their own underlying populations (that may be described by probability distributions). That is, for example, we may think of the population of all possible Y¯ values corresponding to all of the possible samples of size n we might have ended up with. For this population, we know that

Mean of population of Y¯ = μ_Y¯ = μ.   [10]
The property [10] says that the mean of the probability distribution of Y¯ values is equal to the parameter we try to estimate by Y¯. This seems intuitively like a desirable quality. Unbiasedness: In fact, it is, and the property has a formal name! An estimator is said to be “unbiased” if the mean of its probability distribution is equal to the population parameter to be estimated. Thus, Y¯ is an unbiased estimator of μ. It can also be shown that s², with divisor (n − 1), is unbiased for σ², whereas the alternative s∗², with divisor n, is biased: the mean of its distribution is slightly smaller than σ². Clearly, if we have two competing estimators, then we would prefer the one that is unbiased. Thus, unbiasedness may be used as a criterion for choosing among competing estimators. Minimum Variance: What if we can identify two competing estimators that are both unbiased? On what grounds might
we prefer one over the other? Unbiasedness is clearly a desirable property, but we can also think of other desirable properties an estimator might have. For example, as we have discussed previously, we would also like our estimator to be as “close” to the true values as possible – that is, in terms of the probability distribution of the estimator, we’d like it to have small variance. This would mean that the possible values that the estimator could take on (across all possible samples we might have ended up with) exhibit only small variation. Thus, if we have two unbiased estimators, choose the one with smaller variance. Ideally, then, we’d like to use an estimator that is unbiased and has the smallest variance among all such candidates. Such an estimator is given the name minimum variance unbiased. It turns out that, for normally distributed data Y , the estimators Y¯ (for μ) and s 2 (for σ 2 ) have this desirable property. 2.1.1. Confidence Interval for μ
An estimator gives a “likely” value. Because of chance, it is too much to expect that Y¯ and s² would be exactly equal to μ and σ², respectively, for any given data set of size n. Although they may not be “exactly” equal to the value they are estimating, because they are “good” estimators in the above sense, they are likely to be “close.” • Instead of reporting only the single value of the estimator, we report an interval (based on the estimator) and state that it is likely that the true value of the parameter is in the interval. • “Likely” means that probability is involved. Here, we discuss the notion of such an interval, known as a confidence interval (CI), for the particular situation where we wish to estimate μ by the sample mean Y¯. Suppose for now that Y ∼ N(μ, σ²), where μ and σ are unknown. We have a random sample of observations Y1, …, Yn and wish to estimate μ. Of course, our estimator is Y¯. If we wish to make probability statements involving Y¯, this is complicated by the fact that σ² is unknown. Thus, even if we are not interested in σ² in its own right, we cannot ignore it and must estimate it anyway. In particular, we will be interested in the statistic

(Y¯ − μ)/s_Y¯

if we wish to make probability statements about Y¯ without knowledge of σ². What kind of probability statement do we wish to make? The chance or randomness that we must contend with arises because Y¯ is based on a random sample. The value μ we wish to estimate is a fixed (but unknown) quantity. Thus, our probability statements intuitively should have something to do with the uncertainty of trying to get an understanding of the fixed
Experimental Statistics for Biological Sciences
value of μ using the variable estimator Ȳ. We have

P(−t_{n−1,α/2} ≤ (Ȳ − μ)/s_Ȳ ≤ t_{n−1,α/2}) = 1 − α.   [11]
If we rewrite [11] by algebra, we obtain

P(Ȳ − t_{n−1,α/2} s_Ȳ ≤ μ ≤ Ȳ + t_{n−1,α/2} s_Ȳ) = 1 − α.   [12]
It is important to interpret [12] correctly. Even though μ appears in the middle, this is not a probability statement about μ – remember, μ is a fixed constant! Rather, the probability in [12] has to do with the quantities on either side of the inequalities – these quantities depend on Ȳ and s_Ȳ and thus are subject to chance. Definition: The interval (Ȳ − t_{n−1,α/2} s_Ȳ, Ȳ + t_{n−1,α/2} s_Ȳ) is called a 100(1 − α)% CI for μ. For example, if α = 0.05, then (1 − α) = 0.95, and the interval would be called a 95% CI. In general, the value (1 − α) is called the confidence coefficient. Interpretation: As noted above, the probability associated with the CI has to do with the endpoints, not with probabilities about the value μ. We might be tempted to say "the probability that μ falls in the interval is 0.95," thinking that the endpoints are fixed and μ may or may not be between them. But the endpoints are what vary here, while μ is fixed (see Fig. 1.1). Once a particular interval has been constructed, it does not make sense to talk about the probability that it covers μ. We instead talk about our confidence in the statement "the interval covers μ."
Fig. 1.1. Illustration of Confidence Interval.
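To make the definition concrete, the interval can be computed directly once Ȳ, s_Ȳ, and the t critical value are in hand. The Python sketch below is our own illustration, not part of the chapter; it borrows the rat weight-gain summary used later in the text (n = 5, Ȳ = 41.0 mg, s_Ȳ = 4.472) and the table value t_{4,0.025} = 2.776.

```python
# Sketch: 95% CI for a population mean mu from a small sample,
# using the rat weight-gain numbers quoted later in this chapter.
# t_crit is t_{n-1, alpha/2} looked up in a t table (df = 4, alpha = 0.05).

ybar = 41.0      # sample mean (mg)
se = 4.472       # estimated standard error s_Ybar
t_crit = 2.776   # t_{4, 0.025} from a t table

lower = ybar - t_crit * se
upper = ybar + t_crit * se
print(f"95% CI for mu: ({lower:.2f}, {upper:.2f})")  # (28.59, 53.41)
```

Note that the interval is wide: with only n = 5 observations, s_Ȳ is large and the t critical value is much larger than the normal value 1.96.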
2.1.2. Confidence Interval for a Difference of Population Means
Rarely in real life is our interest confined to a single population. Rather, we are usually interested in conducting experiments to compare populations. For example, an experiment may be set up because we wish to gather evidence about the difference among yields (on the average) obtained with several different rates of fertilizer application. More precisely, we would like to make a
Bang and Davidian
statement about whether treatments give truly different responses. The simplest case is that in which we wish to compare just two such treatments. We will develop this case here first, then discuss the extension to more than two treatments later. Experimental Procedure: Take two random samples of experimental units (plants, subjects, plots, rats, etc.). Each unit in the first sample receives treatment 1; each in the other receives treatment 2. We would like to make a statement about the difference in responses to the treatments based on this setup. Suppose we wish to compare the effects of two concentrations of a toxic agent on weight loss in rats. We select a random sample of rats from the population of interest and then randomly assign each rat to receive either concentration 1 or concentration 2. The variable of interest is Y = weight loss for a rat. Until the rats receive the treatments, we may assume them all to have arisen from a common population for which Y has some mean μ and variance σ². Because these are continuous measurement data, it is reasonable to assume that Y ∼ N(μ, σ²). Once the treatments are administered, however, the two samples become different. Populations 1 and 2 may be thought of as the original population with all possible rats treated with treatment 1 and 2, respectively. We may thus regard our samples as being randomly selected from these two populations. Because of the nature of the data, it is further reasonable to think about two r.v.'s, Y₁ and Y₂, one corresponding to each population, and to think of them as being normally distributed:
Population 1: Y₁ ∼ N(μ₁, σ₁²)
Population 2: Y₂ ∼ N(μ₂, σ₂²)
Notation: Because we are now thinking of two populations, we must adjust our notation accordingly, so we may talk about two different r.v.'s and observations on each of them. Write Yᵢⱼ to denote the observation from the jth unit receiving the ith treatment, that is, the jth value observed on r.v. Yᵢ.
With this definition, we may thus view our data as follows:
Y₁₁, Y₁₂, . . . , Y₁ₙ₁ (n₁ = no. of units in sample from population 1)
Y₂₁, Y₂₂, . . . , Y₂ₙ₂ (n₂ = no. of units in sample from population 2).
In this framework, we may now cast our question as follows:
(μ₁ − μ₂) = 0: no difference
(μ₁ − μ₂) ≠ 0: difference.
An obvious strategy would be to base investigation of this population mean difference on the data from the two samples by
estimating the difference. It may be shown mathematically that, if both of the r.v.'s (one for each treatment) are normally distributed, then the following facts are true:
• The r.v. (Y₁ − Y₂) satisfies (Y₁ − Y₂) ∼ N(μ₁ − μ₂, σ_D²), where σ_D² = σ₁² + σ₂².
• Define D̄ = Ȳ₁ − Ȳ₂, where Ȳ₁ = (1/n₁) Σⱼ Y₁ⱼ and Ȳ₂ = (1/n₂) Σⱼ Y₂ⱼ.
That is, D̄ is the difference in sample means for the two samples. Then, just as for the sample mean from a single sample, the difference in sample means is also normally distributed, i.e.,

D̄ ∼ N(μ₁ − μ₂, σ_D̄²), where σ_D̄² = σ₁²/n₁ + σ₂²/n₂.
Thus, the mean of the population of all possible differences in sample means from all possible samples from the two populations is the difference in means for the original populations, by analogy to the single-population case. Thus, the statistic

[D̄ − (μ₁ − μ₂)] / σ_D̄

would follow a standard normal distribution. Intuitively, we can use D̄ as an estimator of (μ₁ − μ₂). As before, for a single population, we would like to report an interval assessing the quality of the sample evidence, that is, give a CI for (μ₁ − μ₂). In practical situations, σ₁² and σ₂² will be unknown. The obvious strategy would be to replace them by estimates. We will consider this first in the simplest case:
• n₁ = n₂ = n, i.e., the two samples are of the same size.
• σ₁² = σ₂² = σ², i.e., the variances of the two populations are the same. This essentially says that application of the two different treatments affects only the mean of the response, not the variability. In many circumstances, this is not unreasonable.
These simplifying assumptions are not necessary but simply make the notation a bit easier so that the concepts will not be obscured by so many symbols. Under these conditions,

Ȳ₁ = (1/n) Σⱼ Y₁ⱼ, Ȳ₂ = (1/n) Σⱼ Y₂ⱼ, and σ_D̄² = 2σ²/n.
If we wish to replace σ_D̄², and hence σ², by an estimate, we must first determine a suitable estimate under these conditions. Because the variance is considered to be the same in both populations, it makes sense to use the information from both samples to come up with such an estimate. That is, pool information from the two samples to arrive at a single estimate. A "pooled" estimate of the common σ² is given by

s² = [(n − 1)s₁² + (n − 1)s₂²] / [2(n − 1)] = (s₁² + s₂²)/2,   [13]

where s₁² and s₂² are the sample variances from each sample. We use the same notation as for a single sample, s², as again we are estimating a single population variance (but now from two samples). Note that the pooled estimate is just the average of the two sample variances, which makes intuitive sense. We have written s² as we have in the middle of [13] to highlight the general form – when we discuss unequal n's later, it will turn out that the estimator for a common σ² will be a "weighted average" of the two sample variances. Here, the weighting is equal because the n's are equal. Thus, an obvious estimator for σ_D̄² is s_D̄² = 2s²/n. Just as in the single-population case, we would consider the statistic

[D̄ − (μ₁ − μ₂)] / s_D̄, where s_D̄ = √(2s²/n).   [14]

It may be shown that this statistic has a Student's t distribution with 2(n − 1) df. CI for (μ₁ − μ₂): By the same reasoning as in the single-population case, then, we may use the last fact to construct a CI for (μ₁ − μ₂). In particular, by writing down the same type of probability statement and rearranging, and using the fact that the statistic [14] has a t_{2(n−1)} distribution, we have for confidence coefficient (1 − α)

P(−t_{2(n−1),α/2} ≤ [D̄ − (μ₁ − μ₂)]/s_D̄ ≤ t_{2(n−1),α/2}) = 1 − α,
P(D̄ − t_{2(n−1),α/2} s_D̄ ≤ μ₁ − μ₂ ≤ D̄ + t_{2(n−1),α/2} s_D̄) = 1 − α.

The CI for (μ₁ − μ₂) thus is

(D̄ − t_{2(n−1),α/2} s_D̄, D̄ + t_{2(n−1),α/2} s_D̄).

2.2. Inference on Means and Hypothesis Testing
In the last section, we began to discuss formal statistical inference, focusing in particular on inference on a population mean for a single population and on the difference of two means under some simplifying conditions (equal n’s and variances). We saw in these situations how to
• Estimate a single population mean or difference of means.
• Make a formal, probabilistic statement about the sampling procedure used to obtain the data, i.e., how to construct a CI for a single population mean or difference of means.
Both estimation and construction of CIs are ways of getting an idea of the value of a population parameter of interest (e.g., mean or difference of means) while taking into account the uncertainty involved because of sampling and biological variation. In this section, we delve more into the notions of statistical inference. As with our first exposure to these ideas, probabilistic statements will play an important role. Normality Assumption: As we have previously stated, all of the methods we will discuss are based on the assumption that the data are "approximately" normally distributed, that is, that the normal distribution provides a reasonable description of the population(s) of the r.v.(s) of interest. It is important to recognize that this is exactly that, an assumption. It is often valid, but does not necessarily have to be. Always keep this in mind. The procedures we describe may lead to misleading inferences if this assumption is seriously violated.

2.2.1. Hypothesis Tests or Tests of Significance
Problem: Often, we take observations on a sample with a specific question in mind. For example, consider the data on weight gains of rats treated with vitamin A discussed in the last section. Suppose that we know from several years of experience that the average (mean) weight gain of rats of this age and type during a 3-week period when they are not treated with vitamin A is 27.8 mg. Question: If we treat rats of this age and type with 2.5 units of vitamin A, how does this affect 3-week weight gain? That is, if we could administer 2.5 units of vitamin A to the entire population of rats of this age and type, would the (population) mean weight gain change from what it would be if we did not? Of course, we cannot administer vitamin A to all rats, nor are we willing to wait for several years of accumulated experience to comment. The obvious strategy, as we have discussed, is to plan to obtain a sample of such rats, treat them with vitamin A, and view the sample as being drawn (randomly) from the (unobservable) population of rats treated with vitamin A. This population has (unknown) mean μ. We carry out this procedure and obtain data (weight gains for each rat in the sample). Clearly, our question of interest may be regarded as a question about μ. Either
(i) μ = 27.8 mg, that is, vitamin A treatment does not affect weight gain, and the mean is what we know from vast past experience, 27.8 mg, despite administration of vitamin A, or
(ii) μ ≠ 27.8 mg, that is, vitamin A treatment does have an effect on weight gain.
Statements (i) and (ii) are called statistical hypotheses, H0: μ = 27.8 vs. H1: μ ≠ 27.8, which are the "null" hypothesis and "alternative" hypothesis (often denoted by HA as well), respectively. A formal statistical procedure for deciding between H0 and H1 is called a hypothesis test or test of significance. We base our "decision" on observation of a sample from the population with mean μ; thus, our decision is predicated on the quality of the sampling procedure and the inherent biological variation in the thing we are studying in the population. Thus, as with CIs, probability will be involved. Suppose in truth that μ = 27.8 mg (i.e., vitamin A has no effect, H0). For the particular sample we ended up with, recall that we observed Ȳ = 41.0 mg, say, from n = 5. The key question would be: How "likely" is it that we would see a sample yielding a sample mean Ȳ = 41.0 mg if it is indeed true that the population mean μ = 27.8 mg?
• If it is likely, then 41.0 is not particularly unusual; thus, we would not discount H0 as an explanation for what is going on. (We do not reject H0 as an explanation.)
• If it is not likely, then 41.0 is unusual and unexpected. This would cause us to think that perhaps H0 is not a good explanation for what is going on. (Reject H0 as an explanation, as it seems unlikely.)
Consider the generic situation where H0: μ = μ₀ vs. H1: μ ≠ μ₀, where μ₀ is the value of interest (μ₀ = 27.8 mg in the rat example). If we assume ("pretend") H0 is true, then we assume that μ = μ₀, and we would like to determine the probability of seeing a sample mean Ȳ (our "best guess" for the value of μ) like the one we ended up with. Recall that if the r.v. of interest Y is normal, then we know that, under our assumption that μ = μ₀,

(Ȳ − μ₀)/s_Ȳ ∼ t_{n−1}.   [15]
That is, a sample mean calculated from a sample drawn from the population of all Y values, when "centered" and "scaled," behaves like a t r.v. Intuitively, a "likely" value of Ȳ would be one for which Ȳ is "close" to μ₀. Equivalently, we would expect the value of the statistic [15] to be "small" (close to zero) in magnitude, i.e., (Ȳ − μ₀)/s_Ȳ close to 0. To formalize the notion of "unlikely," suppose we decide that if the probability of seeing the value of Ȳ that we saw is less than some small value α, say α =
0.05, then things are sufficiently unlikely for us to be concerned that our pretend assumption μ = μ₀ may not hold. We would thus feel that the evidence in the sample (the value of Ȳ we saw) is strong enough to refute our original assumption that H0 is true. We know that the probabilities corresponding to values of the statistic in [15] follow the t distribution with (n − 1) df. We thus know that there is a value t_{n−1,α/2} such that

P(|(Ȳ − μ₀)/s_Ȳ| > t_{n−1,α/2}) = α.

Values of the statistic that are greater in magnitude than t_{n−1,α/2} are thus "unlikely" in the sense that the chance of seeing them is less than α, the "cut-off" probability for "unlikeliness" we have specified. The value of the statistic [15] we saw in our sample is thus a realization of a t r.v. Thus, if the value we saw is greater in magnitude than t_{n−1,α/2}, we would consider the Ȳ we got to be "unlikely," and we would reject H0 as the explanation for what is really going on in favor of the other explanation, H1. Implementation: Compare the value of the statistic [15] to the appropriate value t_{n−1,α/2}. In the rat example, we take μ₀ = 27.8 mg (H0 assumed true). If we have n = 5 and s_Ȳ = 4.472, then

(Ȳ − μ₀)/s_Ȳ = (41.0 − 27.8)/4.472 = 2.952.

From the table of the t distribution, if we take α = 0.05, we have t_{4,0.025} = 2.776. Comparing the value of the statistic we saw to this value gives 2.952 > 2.776. We thus reject H0 – the evidence in our sample is strong enough to support the contention that mean weight gain is different from μ₀ = 27.8, that is, H1, i.e., vitamin A does have an effect on weight gain. Terminology: The statistic (Ȳ − μ₀)/s_Ȳ is called a test statistic. A test statistic is a function of the sample information that is used as a basis for "deciding" between H0 and H1. Another, equivalent way to think about this procedure is in terms of probabilities rather than the value of the test statistic. Our test statistic is a r.v. with a t_{n−1} distribution.
Thus, instead of finding the “cut-off” value for our chosen α and comparing the magnitude of the statistic for our sample to this value, we find the probability of seeing a value of the statistic with the same magnitude as that we saw, and compare this probability to α. That is, if tn−1 represents a r.v. with the t distribution with (n − 1) df, find P (|tn−1 | > value of test statistic we saw)
and compare this probability to α. In our example, from the t table with n − 1 = 4, we find 0.02 < P(|t₄| > 2.952) < 0.05. Thus, the probability of seeing what we saw is between 0.02 and 0.05 (small enough) and thus less than α = 0.05. The value of the test statistic we saw, 2.952, is sufficiently unlikely, and we reject H0. These two ways of conducting the hypothesis test are equivalent.
• In the first, we think about the size of the value of the test statistic for our data. If it is "large," then it is "unlikely." "Large" depends on the probability α we have chosen to define "unlikely."
• In the second, we think directly about the probability of seeing what we saw. If the probability is "small" (smaller than α), then the test statistic value we saw was "unlikely."
• A large test statistic and a small probability are equivalent.
• An advantage of performing the hypothesis test the second way is that we calculate the probability of seeing what we saw – this is useful for thinking about just how "strong" the evidence in the data really is.
Terminology: The value α, which is chosen in advance to quantify the notion of "likely," is called the significance level or error rate for the hypothesis test. Formally, because we perform the hypothesis test assuming H0 is true, α is thus the probability of rejecting H0 when it really is true. When will we reject H0? There are two scenarios:
(i) H0 really is not true, and this caused the large value of the test statistic we saw (equivalently, the small probability of seeing a statistic like the one we saw).
(ii) H0 is in fact true, but it turned out that we ended up with an unusual sample that caused us to reject H0 nonetheless.
The situation in (ii) is a mistake – we end up making an incorrect judgment between H0 and H1. Unfortunately, because we are dealing with a chance mechanism (random sampling), it is always possible that we might make such a mistake because of uncertainty.
A mistake like that in (ii) is called a Type I Error. The hypothesis testing procedure above ensures that we make a Type I error with probability at most α. This explains why α is often called the “error rate.” Terminology: When we reject H0 , we say formally that we “reject H0 at level of significance α.” This states clearly what criterion we used to determine “likely”; if we do not state the level of significance, others have no sense of how stringent or lenient we were in our determination. An observed value of the test statistic
leading to rejection of H0 is said to be (statistically) significant at level α. Again, stating α is essential. One-Sided and Two-Sided Tests: For the weight gain example, we just considered the particular set of hypotheses H0: μ = 27.8 vs. H1: μ ≠ 27.8 mg. Suppose we are fairly hopeful that vitamin A not only has an effect of some sort on weight gain but in fact causes rats to gain more weight than they would if they were untreated. Under these conditions, it might be of more interest to specify a different alternative hypothesis: H1: μ > 27.8. How would we test H0 against this alternative? As we now see, the principles underlying the approach are similar to those above, but the procedure must be modified to accommodate the particular direction of a departure from H0 in which we are interested. Using the same intuition as before, a "likely" value of Ȳ under these conditions would be one where the value of the statistic would be close to zero. On the other hand, a value of Ȳ that we would expect if H0 were not true but instead H1 were would be large and positive. We thus are interested only in the situation where Ȳ is sufficiently far from μ₀ in the positive direction. We know that there is a value t_{n−1,α} such that

P((Ȳ − μ₀)/s_Ȳ > t_{n−1,α}) = α.

Terminology: The test of hypotheses of the form H0: μ = μ₀ vs. H1: μ ≠ μ₀ is called a two-sided hypothesis test – the alternative hypothesis specifies that μ is different from μ₀, but may be on "either side" of it. Similarly, a test of hypotheses of the form H0: μ = μ₀ vs. H1: μ > μ₀ or H0: μ = μ₀ vs. H1: μ < μ₀ is called a one-sided hypothesis test. Terminology: For either type of test, the value we look up in the t distribution table, to which we compare the value of the test statistic, is called the critical value for the test. In the one-sided test above, the critical value was t_{4,0.05} = 2.132; in the two-sided test, it was t_{4,0.025} = 2.776. Note that the critical value depends on the chosen level of significance α and on n.
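A minimal sketch of the two-sided and one-sided tests just described (our own illustration, not the chapter's) can be run side by side, using the rat example quantities Ȳ = 41.0, μ₀ = 27.8, s_Ȳ = 4.472 and the t-table critical values 2.776 (two-sided) and 2.132 (one-sided) quoted in the text.

```python
# Sketch: the one-sample t test of H0: mu = mu_0 for the rat example,
# run both two-sided and one-sided. Critical values are from a t table
# with df = n - 1 = 4 and alpha = 0.05.

ybar, mu0, se = 41.0, 27.8, 4.472
t_stat = (ybar - mu0) / se            # 2.952, as computed in the text

t_two_sided = 2.776                    # t_{4, 0.025}
t_one_sided = 2.132                    # t_{4, 0.05}

print(f"t = {t_stat:.3f}")
print("two-sided: reject H0" if abs(t_stat) > t_two_sided
      else "two-sided: do not reject H0")
print("one-sided (H1: mu > mu0): reject H0" if t_stat > t_one_sided
      else "one-sided: do not reject H0")
```

Note that the one-sided test uses the smaller critical value, so evidence in the hypothesized direction rejects more easily.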
The region of the t distribution that leads to rejection of H0 is called the critical region. For example, in the
one-sided test above, the critical region was (Ȳ − μ₀)/s_Ȳ > 2.132. If we think in terms of probabilities, there is a similar notion. Consider the two-sided test. The probability
P((Ȳ − μ₀)/s_Ȳ > what we saw)

is called the p-value. The p-value is compared to α/2 in a two-sided test in the alternative method of testing the hypotheses. Reporting a p-value gives more information than just reporting whether or not H0 was rejected. For example, if α = 0.05 and the p-value = 0.049, yes, we might reject H0, but the evidence in the sample might be viewed as "borderline." On the other hand, if the p-value = 0.001, clearly, we reject H0; the p-value indicates that the chance of seeing what we saw is very small (1/1000). How to Choose α: So far, our discussion has assumed that we have a priori specified a value α that quantifies our feelings about "likely." How does one decide on an appropriate value for α in real life? Recall that we mentioned the notion of a particular type of mistake we might make, that of a Type I error. Because we perform the hypothesis test under the assumption that H0 is true, this means that the probability we reject H0 when it is true is at most α. Choosing α thus has to do with how serious a mistake a Type I error might be in the particular applied situation. Important Example: Suppose the question of interest concerns the efficacy of a costly new drug for the treatment of a certain disease in humans, and the new drug has potentially dangerous side effects. Suppose a study is conducted where sufferers of the disease are randomly assigned to receive either the standard treatment or the new drug (this is called a randomized clinical trial), and suppose that the r.v. of interest Y is survival time for sufferers of the disease. It is known from years of experience with the standard drug that the mean survival time is some value μ₀. We hope that the new drug is more effective in the sense that it increases survival, in which case it would be worth its additional expense and the risk of side effects. We thus consider H0: μ = μ₀ vs. H1: μ > μ₀, where μ = mean survival time under treatment with the new drug.
The data are analyzed, and suppose, unbeknownst to us, our sample leads us to commit a Type I error – we end up rejecting H0 when it is in fact true, and claim that the new drug is more effective than the standard drug when in reality it is not! Because the new drug is so expensive and carries the possibility of dangerous side effects, this could be a costly mistake, as patients would be paying more, with the risk of dangerous side effects, for no real gain over the standard treatment. In a situation like this, it is intuitively clear that we would probably like α to be very small, so that the chance we end
up rejecting H0 when we really shouldn't is small. In situations where the consequences are not so serious if we make a Type I error, we might choose α to be larger. Another Kind of Mistake: The sample data might be "unusual" in such a way that we end up not rejecting H0 when it really is not true (so we should have rejected). This type of mistake is called a Type II Error. Because a Type II error is also a mistake, we would like the probability of committing such an error, β, say, to also be small. In many situations, a Type II error is not as serious a mistake as a Type I error (think about a verdict of "innocent" vs. "guilty"!). In our drug example, if we commit a Type II error, we infer that the new drug is not effective when it really is. Although this, too, is undesirable, as we are discarding a potentially better treatment, we are no worse off than before we conducted the test, whereas, if we commit a Type I error, we will unduly expose patients to unnecessary costs and risks for no gain. General Procedure for Hypothesis Testing: Here we summarize the steps in conducting a test of hypotheses about a single population mean. The same principles we discuss here will be applicable in any testing situation.
1. Determine the question of interest. This is the first and foremost issue – no experiment should be conducted unless the scientific questions are well formulated.
2. Express the question of interest in terms of null and alternative (one- or two-sided) hypotheses about μ (before collecting/seeing data!).
3. Choose the significance level α, usually a small value like 0.05. The particular situation (severity of making a Type I error) will dictate the value.
4. Conduct the experiment, collect the data, determine the critical value, and calculate the test statistic. Perform the hypothesis test, either rejecting or not rejecting H0 in favor of H1.
Remarks: • We do not even collect data until the question of interest has been established!
• You will often see the phrase "Accept H0" used in place of "do not reject H0." This terminology may be misleading. If we do reject H0, we are saying that the sample evidence is sufficiently strong to suggest that H0 is probably not true. On the other hand, if we do not reject H0, it is because the sample does not contain enough evidence to say it is probably not true. This does not imply that there is enough evidence to say that it probably is true! Tests of hypotheses are set up so that we assume H0 is true and then try to refute it – if we can't, this doesn't mean the assumption is true, only that we couldn't reject it. It could well be that the
alternative hypothesis H1 is indeed true, but, because we got an "unusual" sample, we couldn't reject H0 – this doesn't make H0 true. Another way to think of this: We conduct the test based on the presumption of some value μ₀; in the rat example, μ₀ = 27.8. Suppose that we conducted a hypothesis test and did not reject H0. Now suppose that we changed the value of μ₀, say, in the rat example, to μ₀ = 27.7, and performed a hypothesis test and, again, did not reject H0. Both 27.8 and 27.7 cannot be true! If we "accepted" H0 in each case, we would have a conflict.
• The significance level and critical region are not cast in stone. The results of hypothesis tests should not be viewed with an absolute yes/no interpretation, but rather as guidelines for aiding us in interpreting experimental results and deciding what to do next. Often, experimental conditions are so complicated that we can never be entirely assured that the assumptions necessary to validate our statistical methods exactly are satisfied. For example, the assumption of normality may be only an approximation, or may in fact be downright unsuitable. It is thus important to keep this firmly in mind. It has become very popular in a number of applied disciplines to do all tests of hypotheses at level 0.05 regardless of the setting and to strive to find a "p-value less than 0.05"; however, if one is realistic, a p-value of, say, 0.049 must be interpreted with these cautionary remarks in mind. Also, multiplicity of testing should be accounted for as necessary (see Section 3.2).

2.2.2. Relationship between Hypothesis Testing and Confidence Intervals
Before we discuss more exotic hypotheses and hypothesis tests, we point out something alluded to at the beginning of our discussion. We have already seen that a CI for a single population mean is based on the probability statement

P(−t_{n−1,α/2} ≤ (Ȳ − μ)/s_Ȳ ≤ t_{n−1,α/2}) = 1 − α.   [16]
As we have seen, a two-sided hypothesis test is based on a probability statement of the form

P(|(Ȳ − μ)/s_Ȳ| > t_{n−1,α/2}) = α.   [17]
Comparing [16] and [17], a little algebra shows that they are the same (except for the strict vs. not-strict inequalities ≤ and >, which are irrelevant for continuous r.v.’s). Thus, choosing a “small” level of significance α in a hypothesis test is the same as choosing a “large” confidence coefficient (1 − α). Furthermore,
for the same choice of α, [16] and [17] show that the following two statements are equivalent:
(i) Reject H0: μ = μ₀ at level α based on Ȳ.
(ii) μ₀ is not contained in a 100(1 − α)% CI for μ based on Ȳ.
That is, two-sided hypothesis tests about μ and CIs for μ yield the same information. A similar notion/procedure can be extended to the one-sided hypothesis counterparts.

2.2.3. Tests of Hypotheses for the Mean of a Single Population
We introduced the basic underpinnings of tests of hypotheses in the context of a single population mean. Here, we summarize the procedure for convenience. The same reasoning underlies the development of tests for other situations of interest; these are described in subsequent sections.
General form: H0: μ = μ₀ vs. one-sided: H1: μ > μ₀ or H1: μ < μ₀; two-sided: H1: μ ≠ μ₀.
Test statistic:

t = (Ȳ − μ₀)/s_Ȳ.
Procedure: For level of significance α, reject H0 if
one-sided: t > t_{n−1,α} or t < −t_{n−1,α},
two-sided: |t| > t_{n−1,α/2}.

2.2.4. Testing the Difference of Two Population Means
As we have discussed, the usual situation in practice is that in which we would like to compare two competing treatments or compare a treatment to a control. Scenario: With two populations,
Population 1: Y₁ ∼ N(μ₁, σ₁²), with sample Y₁₁, Y₁₂, . . . , Y₁ₙ₁ ⇒ Ȳ₁, s₁²
Population 2: Y₂ ∼ N(μ₂, σ₂²), with sample Y₂₁, Y₂₂, . . . , Y₂ₙ₂ ⇒ Ȳ₂, s₂².
Because the two samples do not involve the same "experimental units" (the things giving rise to the responses), they may be thought of as independent (totally unrelated).
General form: H0: μ₁ − μ₂ = δ vs. one-sided: H1: μ₁ − μ₂ > δ; two-sided: H1: μ₁ − μ₂ ≠ δ, where δ is some value. Most often, δ = 0, so that the null hypothesis corresponds to the hypothesis of no difference between the population means.
Test statistic: As in the case of constructing CIs for μ₁ − μ₂, intuition suggests that we base inference on Ȳ₁ − Ȳ₂. The test statistic is

t = (D̄ − δ)/s_D̄,
where D̄ = Ȳ₁ − Ȳ₂ and s_D̄ is an estimate of σ_D̄. Note that the test statistic is constructed under the assumption that the mean of D̄, μ₁ − μ₂, is equal to δ. This is analogous to the one-sample case – we perform the test under the assumption that H0 is true. The Case of Equal Variances, σ₁² = σ₂² = σ²: As we have already discussed, it is often reasonable to assume that the two populations have common variance σ². One interpretation is that the phenomenon of interest (say, two different treatments) affects only the mean of the response, not its variability ("signal" may change with changing treatment, but "noise" stays the same). In this case, we "pool" the data from both samples to estimate the common variance σ². The obvious estimator is

s² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / [(n₁ − 1) + (n₂ − 1)],   [18]
where s1² and s2² are the sample variances for each sample. Thus, [18] is a weighted average of the two sample variances, where the "weighting" is in accordance with the n's. We have already discussed such "pooling" when the n is the same, in which case this reduces to a simple average; [18] is a generalization that allows differential weighting of the sample variances when the n's are different. Recall that, in general,

σ²_D̄ = σ1²/n1 + σ2²/n2.
When the variances are the same, this reduces to

σ²_D̄ = σ²(1/n1 + 1/n2),

which can be estimated by plugging in the "pooled" estimator for σ². We thus arrive at

s²_D̄ = s²(1/n1 + 1/n2),   s_D̄ = s √(1/n1 + 1/n2).

Note that the total df across both samples is (n1 − 1) + (n2 − 1) = n1 + n2 − 2, so the t_{n1+n2−2} distribution is relevant.

Procedure:
For level of significance α, reject H0 if
one-sided: t > t_{n1+n2−2, α},
two-sided: |t| > t_{n1+n2−2, α/2}.
CIs: Extending our previous results, a 100(1 − α)% CI for μ1 − μ2 would be

(D̄ − t_{n1+n2−2, α/2} s_D̄,  D̄ + t_{n1+n2−2, α/2} s_D̄).
Experimental Statistics for Biological Sciences
Also, a one-sided lower confidence bound for μ1 − μ2 would be

D̄ − t_{n1+n2−2, α} s_D̄.
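The equal-variance procedure above can be sketched in a few lines of Python. This is a minimal illustration, not code from the chapter; the function name and the data are made up, and only the standard library is used.

```python
# Hedged sketch of the equal-variance ("pooled") two-sample t statistic.
# The data are illustrative, not from the text.
import math
import statistics

def pooled_two_sample_t(sample1, sample2, delta=0.0):
    """Return (t, df) for H0: mu1 - mu2 = delta, assuming equal variances."""
    n1, n2 = len(sample1), len(sample2)
    s1_sq = statistics.variance(sample1)   # sample variance, divisor n1 - 1
    s2_sq = statistics.variance(sample2)
    # Pooled variance [18]: weighted average of the two sample variances
    s_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    se = math.sqrt(s_sq * (1.0 / n1 + 1.0 / n2))   # s_Dbar
    t = (statistics.mean(sample1) - statistics.mean(sample2) - delta) / se
    return t, n1 + n2 - 2   # compare |t| with t_{n1+n2-2, alpha/2}

t, df = pooled_two_sample_t([5.1, 4.9, 5.6, 5.4], [4.2, 4.4, 4.0, 4.6])
```

The returned t would then be compared with the tabled t_{n1+n2−2} critical value, exactly as in the procedure above.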
The relationship between hypothesis tests and CIs is the same as in the single-sample case.

The Case of Unequal Variances, σ1² ≠ σ2²: Cases arise in practice where it is unreasonable to assume that the variances of the two populations are the same. Unfortunately, things get a bit more complicated when the variances are unequal. In this case, we may not "pool" information. Instead, we use the two sample variances to estimate σ_D̄ in the obvious way:

s_D̄ = √(s1²/n1 + s2²/n2).
Note that because we can't pool the sample variances, we can't pool df's, either. It may be shown mathematically that, under these conditions, if we use s_D̄ calculated in this way in the denominator of our test statistic (D̄ − δ)/s_D̄, the statistic no longer has exactly a t distribution! In particular, it is not clear what to use for df, as we have estimated two variances separately! It turns out that an approximation is available that may be used under these circumstances. One calculates the quantity, which is not an integer,

"effective df" = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ].
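The SE and "effective df" above are easy to compute directly; here is a minimal sketch (illustrative data and function name, not from the chapter), with the effective df rounded to the nearest integer.

```python
# Hedged sketch of the unequal-variance SE s_Dbar and the "effective df".
# The data are illustrative, not from the text.
import math
import statistics

def welch_se_and_edf(sample1, sample2):
    n1, n2 = len(sample1), len(sample2)
    v1 = statistics.variance(sample1) / n1   # s1^2 / n1
    v2 = statistics.variance(sample2) / n2   # s2^2 / n2
    se = math.sqrt(v1 + v2)                  # s_Dbar
    # "effective df", rounded to the nearest integer
    edf = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return se, round(edf)

se, edf = welch_se_and_edf([5.1, 4.9, 5.6, 5.4], [4.2, 4.4, 4.0, 4.6, 4.1, 4.5])
```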
One then rounds the "effective df" to the nearest integer. The approximate effective df are then used as if they were exact; for the critical value, use t_{edf, α} (for a one-sided test) or t_{edf, α/2} (for a two-sided test), where edf is the rounded "effective df." It is important to recognize that this is only an approximation – the true distribution of the test statistic is no longer exactly t, but we use the t distribution and the edf as an approximation to the true distribution. Thus, care must be taken in interpreting the results – one should be aware that "borderline" results may not be trustworthy.

2.2.5. Testing Equality of Variances
It turns out that it is possible to construct a hypothesis test of whether the variances of two populations, based on two independent samples drawn from them, are equal.
Warning: Testing hypotheses about variances is a harder problem than testing hypotheses about means. This is because it is easier to get an understanding of the "signal" in a set of data than of the "noise" – sample means are better estimators of the population means than sample variances are of the population variances for the same n. Moreover, for the test we are about to discuss to be valid, the assumption of normality is critical. Thus, tests for equality of variances should be interpreted with caution. We wish to test

H0: σ1² = σ2² vs. H1: σ1² ≠ σ2².
It turns out that the appropriate test statistic is the ratio of the two sample variances,

F = larger of (s1², s2²) / smaller of (s1², s2²).
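Forming this ratio, with the larger sample variance in the numerator, can be sketched as follows (illustrative data and function name, not from the chapter):

```python
# Hedged sketch of the variance-ratio statistic F with its df pair.
import statistics

def variance_ratio(sample1, sample2):
    """Return (F, numerator df, denominator df), larger variance on top."""
    s1_sq = statistics.variance(sample1)
    s2_sq = statistics.variance(sample2)
    if s1_sq >= s2_sq:
        return s1_sq / s2_sq, len(sample1) - 1, len(sample2) - 1
    return s2_sq / s1_sq, len(sample2) - 1, len(sample1) - 1

F, df_num, df_den = variance_ratio([5.1, 4.9, 5.6, 5.4],
                                   [4.2, 4.4, 4.0, 4.6, 4.1, 4.5])
# Compare F with the tabled critical value for these df, as described next.
```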
The F Distribution: The sampling distribution of the test statistic F may be derived mathematically, just as for the distributions of our test statistics for means. In general, for normal data, a ratio of two sample variances from independent samples with sample sizes nN (numerator) and nD (denominator) has what is known as the F distribution with (nN − 1) and (nD − 1) df's, denoted F_{nN−1, nD−1}. This distribution has shape similar to that of the χ² distribution. Tables of the probabilities associated with this distribution are widely available in statistical textbooks.

Procedure: Reject H0 at level of significance α if F > F_{r, s, α/2}, with df r (numerator) and s (denominator).

2.2.6. Comparing Population Means Using Fully Paired Comparisons
As we will see over and over, the analysis of a set of data is dictated by the design. For example, in the developments so far on testing for differences in means and variances, it is necessary that the two samples be independent (i.e., completely unrelated); this is a requirement for the underlying mathematical theory to be valid. Furthermore, although we haven't made much of it, the fact that the samples are independent is a consequence of experimental design – the experimental units in each sample do not overlap and were assigned treatments randomly. Thus, the methods we have discussed are appropriate for the particular experimental design. If we design the experiment differently, then different methods will be appropriate. If it is suspected in advance that σ1² and σ2² may not be equal, an alternative strategy is to use a different experimental design to conduct the experiment. It turns out that for this design, the appropriate methods of analysis do not depend on whether the variances for the two populations are the same. In addition, the design may make more efficient use of experimental resources. The idea is to make comparisons within pairs of experimental
units that may tend to be more alike than other pairs. The effect is that appropriate methods for analysis are based on considering differences within pairs rather than differences between the two samples overall, as in our previous work. The fact that we deal with pairs may serve to eliminate a source of uncertainty, and thus lead to more precise comparisons. The type of design is best illustrated by example.

Example: (Dixon and Massey, 1969, Introduction to Statistical Analysis, p. 122). A certain stimulus is thought to produce an increase in mean systolic blood pressure in middle-aged men. One way to design an experiment to investigate this issue would be to randomly select a group of middle-aged men and then randomly assign each man to either receive the stimulus or not. We would think of two populations, those of all middle-aged men with and without the stimulus, and would be interested in testing whether the mean for the stimulated population is greater than that for the unstimulated population. We would have two independent samples, one from each population, and the methods of the previous sections would be applicable. In this setup, variability among all men as well as variability within the two groups may make it difficult for us to detect differences. In particular, recall that variability in the sample mean difference D̄ is characterized by the estimate s_D̄, which appears in the denominator of our test statistic. If s_D̄ is large, the test statistic will be small, and it is likely that H0 will not be rejected. Even if there is a real difference, the statistical procedure may have a difficult time identifying it because of all the variability. If we could eliminate the effect of some of the variability inherent in experimental material, we might be able to overcome this problem.
In particular, if we designed the experiment in a different way, we might be able to eliminate the impact of a source of variation, thus ending up with a more sensitive statistical test (one that will be more likely to detect a real difference if one exists). A better design in this spirit is as follows, and seems like a natural approach in a practical sense as well. Rather than assigning men to receive one treatment or the other, obtain a response from each man under each treatment! That is, obtain a random sample of middle-aged men and take two readings on each man, with and without the stimulus. This might be carried out using a before–after strategy, or, alternatively, the ordering for each man could be different. We will thus assume for simplicity that measurements on each man are taken in a before–after fashion and that there is no consequence to order. To summarize:

Design   Type of difference   Sources of variation
1        Across men           Among men, within men
2        Within men           Within men
In this second design, we still may think of two populations, those of all men with and without the stimulus. What changes in the second design is how we have "sampled" from these populations. The two samples are no longer independent, because they involve the same men. Thus, different statistical methods are needed.

Let Y1 and Y2 be the two r.v.'s representing the paired observations, e.g., in our example, Y1 = systolic blood pressure after stimulus, Y2 = systolic blood pressure before stimulus. The data are the pairs (Y1j, Y2j), pairs of observations from the jth man, j = 1, ..., n. Let

Dj = Y1j − Y2j = difference for the jth pair.

The relevant population is thus the population of the r.v. D, on which we have observations D1, ..., Dn! If we think of the r.v.'s Y1 and Y2 as having the means μ1 and μ2, our hypotheses are

H0: μ1 − μ2 = δ vs. H1: μ1 − μ2 ≠ δ

for a two-sided test, where, as before, δ is often 0. These hypotheses may be regarded as a test about the mean of the population of all possible differences (i.e., the r.v. D), which has this same mean. To remind ourselves of our perspective, we could think of the mean of the population of differences as diff = μ1 − μ2 and express our hypotheses equivalently in terms of diff, e.g., H0: diff = δ vs. H1: diff ≠ δ. Note that once we begin thinking this way, it is clear what we have – we are interested in testing hypotheses concerning the value of a single population mean, diff (that of the hypothetical population of differences)! Thus, the appropriate analysis is that for a single population mean based on a single sample, applied to the observed differences D1, ..., Dn. Compute

D̄ = (1/n) Σ_{j=1}^n Dj = sample mean

and

s²_D = [1/(n − 1)] [ Σ_{j=1}^n Dj² − (Σ_{j=1}^n Dj)²/n ] = sample variance.
It turns out that the sample mean of the differences is algebraically equivalent to the difference of the individual sample means, that is,

D̄ = Ȳ1 − Ȳ2,

so that the calculation may be done either way. The SE for the sample mean D̄ is, by analogy to the single-sample case,

s_D̄ = s_D / √n.
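The paired quantities D̄, s_D, and s_D̄ just defined, and the resulting paired t statistic, can be computed as in this minimal sketch (the blood-pressure-style numbers and the function name are illustrative, not from the chapter):

```python
# Hedged sketch of the paired comparison: one before/after pair per man.
# The data are illustrative, not from the text.
import math
import statistics

def paired_t(after, before, delta=0.0):
    """Return (t, df) for H0: mean of the pair differences = delta."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    d_bar = statistics.mean(diffs)   # equals Ybar_1 - Ybar_2
    s_d = statistics.stdev(diffs)    # SD of the within-pair differences
    se = s_d / math.sqrt(n)          # s_Dbar
    return (d_bar - delta) / se, n - 1

t, df = paired_t([132.0, 128.0, 141.0, 137.0, 130.0],
                 [128.0, 125.0, 136.0, 134.0, 127.0])
```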
Note that we have used the same notation, s_D̄, as we did in the case of two independent samples, but the calculation and interpretation are different here. We thus have the following:

Test statistic:

t = (D̄ − δ) / s_D̄.

Note that the denominator of our test statistic depends on s_D, the sample SD of the differences. This makes formal the important point that the relevant variation for comparing differences is that within pairs – this sample SD measures precisely that quantity, the variability among the within-pair differences.

Procedure:
For level of significance α, reject H0 if
one-sided: t > t_{n−1, α},
two-sided: |t| > t_{n−1, α/2}.

2.2.7. Power, Sample Size, and Detection of Differences
We now return to the notion of incorrect inferences in hypothesis tests. Recall that there are two types of "mistakes" we might make when conducting such a test:

Type I error: reject H0 when it really is true, P(Type I error) = α
Type II error: do not reject H0 when it really isn't true, P(Type II error) = β.

Because both Type I and II errors are mistakes, we would ideally like both α and β to be small. Because a Type I error is often more serious, the usual approach is to fix the Type I error (i.e., the level of significance α) first. So far, we have not discussed the implications of Type II error and how it might be taken into account when setting up an experiment.

Power of a Test: If we do not commit a Type II error, then we reject H0 when H0 is not true, i.e., infer H1 is true when H1 really is true. Thus, if we do not commit a Type II error, we have made a correct judgment; moreover, we have done precisely what we
hoped to do – detect a departure from H0 when in fact such a departure (difference) really does exist. We call

1 − β = P(reject H0 when H0 is not true)

the power of the hypothesis test. Clearly, high power is a desirable property for a test to have: low probability of Type II error ⇔ high power ⇔ high probability of detecting a difference if one exists.

Intuition: Power of a test is a function of how different the true value of μ or μ1 − μ2 is from the null hypothesis value μ0 or δ. If the difference between the true value and the null hypothesis value is small, we might not be too successful at detecting this. If the difference is large, we are apt to be more successful. We can be more formal about this. To illustrate, we will consider the simple case of testing hypotheses about the value of the mean of a single population, μ. The same principles hold for tests on differences of means and, in fact, any test. Consider in particular the one-sided test

H0: μ = μ0 vs. H1: μ > μ0.

Recall that the test statistic is (Ȳ − μ0)/s_Ȳ. This test statistic is based on the idea that a large observed value of Ȳ is evidence that the true value of μ is greater than μ0. We reject H0 when (Ȳ − μ0)/s_Ȳ > t_{n−1, α}. To simplify our discussion, let us assume that we know σ², the variance of the population, and hence we know σ_Ȳ. Then

(Ȳ − μ0)/σ_Ȳ ∼ N(0, 1),   [19]
if H0 is true. In this situation, rather than compare the statistic to the t distribution, we would compare it to the standard normal distribution. Let z_α denote the value satisfying P(Z > z_α) = α for a standard normal r.v. Z. Then we would conduct the test at level α by rejecting H0 when

(Ȳ − μ0)/σ_Ȳ > z_α.

Now if H0 is not true, then it must be that μ ≠ μ0 but, instead, μ = some other value, say μ1 > μ0. Under these conditions, in reality, the statement in [19] is not true. Instead, the statistic that really has a standard normal distribution is

(Ȳ − μ1)/σ_Ȳ ∼ N(0, 1).   [20]
What we would like to do is evaluate power; thus, we need probabilities under these conditions (when H0 is not true). The power of the test is

1 − β = P(reject H0 when μ = μ1) = P( (Ȳ − μ0)/σ_Ȳ > z_α )
      = P( (Ȳ − μ1)/σ_Ȳ + (μ1 − μ0)/σ_Ȳ > z_α ),

where this last expression is obtained by adding and subtracting the quantity μ1/σ_Ȳ on the left-hand side of the inequality. Rearranging, we get

1 − β = P( (Ȳ − μ1)/σ_Ȳ > z_α − (μ1 − μ0)/σ_Ȳ ).

Now, under what is really going on, the quantity on the left-hand side of the inequality in this probability statement is a standard normal r.v., as noted in [20]. Thus, the probability 1 − β is the probability that a standard normal r.v. Z exceeds the value z_α − (μ1 − μ0)/σ_Ȳ, i.e.,

1 − β = P( Z > z_α − (μ1 − μ0)/σ_Ȳ ).
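This power formula can be evaluated numerically with nothing more than the standard normal CDF, which the Python standard library gives us through the error function. The numbers below are made up for illustration; they are not from the chapter.

```python
# Hedged numeric sketch of 1 - beta = P(Z > z_alpha - (mu1 - mu0)/sigma_Ybar).
import math

def std_normal_cdf(x):
    """P(Z <= x) for a standard normal r.v., via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_one_sided(mu0, mu1, sigma, n, z_alpha=1.645):
    """Power of the one-sided level-alpha z test when the true mean is mu1."""
    sigma_ybar = sigma / math.sqrt(n)
    return 1.0 - std_normal_cdf(z_alpha - (mu1 - mu0) / sigma_ybar)

# Illustration: mu1 one SD-of-the-mean above mu0 (sigma = 10, n = 25)
p = power_one_sided(mu0=50.0, mu1=52.0, sigma=10.0, n=25)
```

As the formula says, moving μ1 farther from μ0 increases the power returned by this function.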
That is, power is a function of μ1 – of what is really going on.

2.2.8. Balancing α and β and Sample Size Determination
The theoretical results of the previous subsection show that n helps to determine power. We also discussed the idea of a "meaningful" (in a scientific sense) difference to be detected. This suggests that if, for a given application, we can state what we believe to be a scientifically meaningful difference, we might be able to determine the appropriate n to ensure high power to detect this difference. We now see that, once α has been set, we might also like to choose β, and thus the power 1 − β of detecting a meaningful difference at this level of significance. It follows that once we have determined
• The level of significance, α
• A scientifically meaningful departure from H0
• The power, 1 − β, with which we would like to be able to detect such a difference
we would like to determine the appropriate n to achieve these objectives. For example, we might want α = 0.05 and an 80% chance of detecting a particular difference of interest. Let z_α be the value such that P(Z > z_α) = α for a standard normal r.v. Z.

Procedure: For a test between two population means μ1 and μ2, choose n so that

one-sided test: n = (z_α + z_β)² ζ_D / diff²
two-sided test: n = (z_{α/2} + z_β)² ζ_D / diff².
Here, diff is the meaningful difference we would like to detect, and, depending on the type of design,

two independent samples: ζ_D = 2σ²
paired comparison: ζ_D = σ_D²,

where σ² is the (assumed common) variance for the two populations and σ_D² is the true variance of the population of differences Dj. For the two independent samples case, the value of n obtained is the number of experimental units needed in each sample. For the paired comparison case, n is the total number of experimental units (each will be seen twice).

Slight Problem: We usually do not know σ_D² or σ². Some practical solutions are as follows:
• Rather than express the "meaningful difference," diff, in terms of the actual units of the response (e.g., diff = 5 mg for the rat experiment), express it in units of the SD of the appropriate response. For example, for a test based on two independent samples, we might state that we wish to detect a difference the size of one SD of the response. We would thus take diff = σ, so that the factor

ζ_D / diff² = 2σ²/σ² = 2.
For a paired comparison, we might specify the difference to be detected in terms of the SD of a response difference D. If we wanted a one-SD difference, we would take diff = σ_D.
• Another approach is to substitute estimates for σ_D² or σ² from previous studies.

Another Slight Problem: The tests we actually carry out in practice are based on the t distribution, because we don't know σ² or σ_D² but rather estimate them from the data. In the procedures above, however, we used the values z_α, z_{α/2}, and z_β from the standard normal distribution. There are several things one can do:
• Nothing formal. The n's calculated this way are only rough guidelines, because so much approximation is involved, e.g., having to estimate or "guess at" σ² or σ_D², assume the data are normal, and so on. Thus, one might regard the calculated n as a conservative choice and use a slightly bigger n in real application.
• Along these lines, theoretical calculations may be used to adjust for the fact that the tests are based on the t rather than the standard normal distribution. Specifically, one may calculate an appropriate "correction factor" to inflate the n slightly:
(i) Use the appropriate formula to get n
(ii) Multiply n by the correction factor: (2n + 1)/(2n − 1) for two independent samples and (n + 2)/n for a paired design.

Remark: Calculation of an appropriate n, at least as a rough guideline, should always be undertaken. There is no point in spending resources (subjects/animals, time, and money) to do an experiment that has very little chance of detecting the scientifically meaningful difference in which one is interested! This should always be carried out in advance of performing an experiment – learning that n was too small after the fact is not very helpful. Also note that we estimate a "minimal n" rather than a suitable n.

Fun Fact about the History of the t-test (from Wikipedia): The t statistic was introduced by William Sealy Gosset for cheaply monitoring the quality of beer brews ("Student" was his pen name). Gosset was a statistician for the Guinness brewery in Dublin, Ireland, and was hired due to Claude Guinness's innovative policy of recruiting the best graduates from Oxford and Cambridge to apply biochemistry and statistics to Guinness's industrial processes. Gosset published the t-test in Biometrika in 1908, but was forced to use a pen name by his employer, who regarded the fact that they were using statistics as a trade secret. Today, the t-test is more generally applied to assess the confidence that can be placed in judgments made from small samples.
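Putting the pieces of the sample-size recipe together (two-sided test, difference expressed in SD units so that ζ_D/diff² = 2 for two independent samples), the calculation can be sketched as below. The z values and the rounding-up choice are my illustrative assumptions, in the spirit of the "use a slightly bigger n" advice above.

```python
# Hedged sketch of the two-sided, two-independent-samples sample-size formula:
# n = (z_{alpha/2} + z_beta)^2 * zeta_D / diff^2, with diff in units of sigma.
import math

def n_per_group_two_sided(diff_in_sd, z_alpha_2=1.96, z_beta=0.84):
    """n per group; z defaults correspond roughly to alpha=0.05, power~0.80."""
    zeta_over_diff_sq = 2.0 / diff_in_sd ** 2   # zeta_D/diff^2 with zeta_D = 2 sigma^2
    n = (z_alpha_2 + z_beta) ** 2 * zeta_over_diff_sq
    return math.ceil(n)                         # round up, to be conservative

# Detect a one-SD difference with alpha = 0.05 (two-sided) and ~80% power
n = n_per_group_two_sided(1.0)
```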
3. Analysis of Variance (ANOVA)

3.1. One-Way Classification and ANOVA
The purpose of an experiment is often to investigate differences among treatments. In particular, in our statistical model framework, we would like to compare the (population) means of the responses to each treatment. We have already discussed designs (two independent samples, pairing) for comparing two treatment means. In this section, we begin our study of more complicated problems and designs by considering the comparison of more than two treatment means. Recall that in order to detect differences if they really exist, we must try to control the effects of experimental error, so that any variation we observe can be attributed mainly to the effects of the treatments rather than to differences among the experimental units to which the treatments are applied. We discussed the idea that designs involving meaningful grouping of experimental units are the key to reducing the effects of experimental error,
by identifying components of variation among experimental units that may be due to something besides inherent biological variation among them. The paired design for comparing two treatments is an example of such a design. Before we can talk about grouping in the more complicated scenario involving more than two treatments, it makes sense to talk about the simplest setting in which we compare several treatment means. This is basically an extension of the "two independent samples" design to more than two treatments.

One-Way Classification: Consider an experiment to compare several treatment means set up as follows. We obtain (randomly, of course) experimental units for the experiment and randomly assign them to treatments so that each experimental unit is observed under one of the treatments. In this situation, the samples corresponding to the treatment groups are independent (the experimental units in each treatment sample are unrelated). We do not attempt to group experimental units according to some factor (e.g., gender). In this experiment, then, the only way in which experimental units may be "classified" is with respect to which treatment they received. Other than the treatments, they are viewed as basically alike. Hence, such an arrangement is often called a one-way classification.

Example: Consider a laboratory experiment in which the experimental material is some chemical mixture divided into beakers (the experimental units) to which treatments will be applied, with experimental conditions the same for all beakers. We would not expect much variation among the beakers before the treatments are applied, so grouping beakers would be pointless. We would thus expect that, once we apply the treatments, any variation in responses across beakers will mainly be due to the treatments, as the beakers are pretty much alike otherwise.
Complete Randomization: When experimental units are thought to be basically alike, and are thus expected to exhibit a small amount of variation from unit-to-unit, grouping them really would not add much precision to an experiment. If there is no basis for grouping, and thus treatments are to be simply assigned to experimental units without regard to any other factors, then, as noted above, this should be accomplished according to some chance (random) mechanism. All experimental units should have an equal chance of receiving any of the treatments. When randomization is carried out in this way, it is called complete randomization. This is to distinguish the scheme for treatment allocation from more complicated methods involving grouping, which we will talk about later. Advantages: • Simplicity of implementation and analysis.
• The size of the experiment is limited only by the availability of experimental units. No special considerations for different types of experimental units are required.

Disadvantages:
• Experimental error, our assessment of variation believed to be inherent among experimental units (not systematic), includes all (both inherent and potential systematic) sources. If it turns out unexpectedly that some of the variation among experimental units is indeed due to a systematic component, it will not be possible to "separate it out" of experimental error, and comparisons will lack precision. In such a situation, a more complicated design involving grouping should have been used up-front. Thus, we run the risk of low precision and power if something unexpected arises.

3.1.1. ANOVA
We wish to determine if differences exist among the means for responses to treatments; the general procedure for inferring whether such differences exist is, however, called "analysis of variance."

ANOVA: This is the name given to a general class of procedures that are based roughly on the following idea. We have already spoken loosely of attributing variation to treatments as being "equivalent" to determining if a difference exists in the underlying population treatment means. It turns out that there is actually a more formal basis to this loose way of speaking, and it is this basis that gives the procedure its name. It is easiest to understand this in the context of the one-way classification; however, the basic premise is applicable to more complicated designs that we will discuss later.

Notation: To facilitate our further development, we will slightly change our notation for denoting a sample mean. As we will see shortly, we will need to deal with several different types of means for the data, and the new notation will make this a bit easier. Let t denote the number of treatments, and let Yij = response on the jth experimental unit on treatment i, where i = 1, ..., t. (We consider first the case where only one observation is taken on each experimental unit, so that the experimental unit = the sampling unit.) We will consider for simplicity the case where the same number of experimental units, that is, replicates, is assigned to each treatment. To highlight the term replication, we let r = number of experimental units, or replicates, per treatment.

Remark: Thus, r replaces our previous notation n. We will denote the sample mean for treatment i by

Ȳi· = (1/r) Σ_{j=1}^r Yij.
The only difference between this notation and our previous notation is the use of the "·" in the subscript. This usage is fairly standard and reminds us that the mean was taken by summing over the subscript in the second position, j. Also define

Ȳ·· = (1/rt) Σ_{i=1}^t Σ_{j=1}^r Yij.

Note that the total number of observations in the experiment is r × t = rt, r replicates on each of t treatments. Thus, Ȳ·· represents the sample mean of all the data, across all replicates and treatments. The double dots make it clear that the summing has been performed over both subscripts.

Setup: Consider first the case of t = 2 treatments with two independent samples. Suppose that the population variance is the same for each treatment and equal to σ². Recall that our test statistic for the hypotheses H0: μ1 − μ2 = 0 vs. H1: μ1 − μ2 ≠ 0 was (in our new notation)

(Ȳ1· − Ȳ2·) / s_D̄,

where now D̄ = Ȳ1· − Ȳ2·. Here, we have taken δ = 0 and considered the two-sided alternative hypothesis, as we are interested in just a difference. For t > 2 treatments, there is no obvious generalization of this setup. Now, we have t population means, say μ1, μ2, ..., μt. Thus, the null and alternative hypotheses are now

H0: the μi are all equal vs. H1: the μi are not all equal.

Idea: The idea to generalize the t = 2 case is to think instead about estimating variances, as follows. This may seem totally irrelevant, but when you see the end result, you'll see why! Assume that the data are normally distributed with the same variance for all t treatment populations, that is, Yij ∼ N(μi, σ²). How would we estimate σ²? The obvious approach is to generalize what we did for two treatments and "pool" the sample
variances across all t treatments. If we write si² to denote the sample variance for the data on treatment i, then

si² = [1/(r − 1)] Σ_{j=1}^r (Yij − Ȳi·)².
The estimate would be the average of all t sample variances (because r is the same for all samples), so the "pooled" estimate would be

[(r − 1)s1² + (r − 1)s2² + ··· + (r − 1)st²] / [t(r − 1)]
  = [ Σ_{j=1}^r (Y1j − Ȳ1·)² + Σ_{j=1}^r (Y2j − Ȳ2·)² + ··· + Σ_{j=1}^r (Ytj − Ȳt·)² ] / [t(r − 1)].   [21]

As in the case of two treatments, this estimate makes sense regardless of whether H0 is true. It is based on deviations from each mean separately, through the sample variances, so it doesn't matter whether the true means are different or the same – it is still a sensible estimate.

Now recall that, if a sample arises from a normal population, then the sample mean is also normally distributed. Thus, this should hold for each of our t samples, which may be written as

Ȳi· ∼ N(μi, σ²/r)   [22]

(that is, σ²_Ȳi· = σ²/r). Now consider the null hypothesis; under H0, all the treatment means are the same and thus equal the same value, μ, say. That is, under H0, μi = μ, i = 1, ..., t. Under this condition, [22] becomes

Ȳi· ∼ N(μ, σ²/r)

for all i = 1, ..., t. Thus, if H0 really were true, we could view the sample means Ȳ1·, Ȳ2·, ..., Ȳt· as being just a random sample from a normal population with mean μ and variance σ²/r. Consider under these conditions how we might estimate the variance of this population, σ²/r. The obvious estimate would be the sample variance of our "random sample" from this population, the t sample means. Recall that a sample variance is just the sum of squared deviations from the sample mean, divided
by the sample size − 1. Here, our sample size is t, and the sample mean is

(1/t) Σ_{i=1}^t Ȳi· = (1/t) Σ_{i=1}^t (1/r) Σ_{j=1}^r Yij = (1/rt) Σ_{i=1}^t Σ_{j=1}^r Yij = Ȳ··.

That is, the mean of the sample means is just the sample mean of all the data (this is not always true, but it is in this case because r is the same sample size for all treatments). Thus, the sample variance we would use as an estimator for σ²/r is [1/(t − 1)] Σ_{i=1}^t (Ȳi· − Ȳ··)². This suggests another estimator for σ², namely, r times this, or

r × [1/(t − 1)] Σ_{i=1}^t (Ȳi· − Ȳ··)².   [23]
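The two variance estimates [21] and [23], and their ratio, can be computed as in this minimal sketch for t treatments with r replicates each (illustrative data and function name, not from the chapter):

```python
# Hedged sketch of the pooled within-treatment estimate [21], the
# between-means estimate [23], and their ratio F = [23]/[21].
import statistics

def one_way_f(groups):
    """groups: list of t equal-sized samples. Returns (F, df1, df2)."""
    t, r = len(groups), len(groups[0])
    # [21]: pooled within-treatment variance (unaffected by mean differences)
    within = sum((r - 1) * statistics.variance(g) for g in groups) / (t * (r - 1))
    # [23]: r times the sample variance of the t treatment means
    means = [statistics.mean(g) for g in groups]
    between = r * statistics.variance(means)
    return between / within, t - 1, t * (r - 1)

F, df1, df2 = one_way_f([[5.1, 4.9, 5.6, 5.4],
                         [4.2, 4.4, 4.0, 4.6],
                         [4.8, 5.0, 4.7, 5.1]])
# Compare F with the F_{t-1, t(r-1)} distribution.
```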
Remark: Note that we derived the estimator for σ² given in [23] under the assumption that the treatment means were all the same. If they really were not the same, then, intuitively, this estimate of σ² would tend to be too big, because the deviations of the Ȳi· about the sample mean Ȳ·· will include two components:
1. A component attributable to "random variation" among the Ȳi·
2. A component attributable to the "systematic difference" among the means μi
The first component will be present even if the means are the same; the second component will only be present when they differ.

Result: We now have derived two estimators for σ²:
• The first, the "pooled" estimate given in [21], will not be affected by whether or not the means are different. This estimate reflects how individual observations differ from their means, regardless of the values of those means; thus, it reflects only variation attributable to how experimental units differ among themselves.
• The second, derived assuming the means are the same, and given in [23], will be affected by whether the means are different. This estimate reflects not only how individual observations vary, through their sample means, but also how the means might differ.

Implication – The F Ratio: Recall that we derived the second estimator for σ² under the assumption that H0 is true. Thus, if H0 really were true, we would expect both estimators for σ² to be about the same size, since in this case both would reflect only variation attributable to experimental units. If, on the other hand, H0 really is not true, we would expect the second estimator to be
larger. With this in mind, consider the ratio

F = (estimator for σ² based on sample means, [23]) / (estimator for σ² based on individual deviations, [21]).

We now see that if H0 is true, the ratio should be close to 1, while if H0 is not true, the ratio will tend to be large. The result is that we may base inference on treatment means (whether they differ) on this ratio of estimators for variance: we may use this ratio as a test statistic for testing H0 vs. H1. Recall that a ratio of two sample variances for two independent populations has an F distribution (Sec. 2.2.5). Recall also that our approach to hypothesis testing is to assume H0 is true, look at the value of a test statistic, and evaluate how "likely" it is if H0 is true. If H0 is true in our situation here, then
• The numerator is a sample variance of data Ȳi·, i = 1, ..., t, from a N(μ, σ²/r) population.
• The denominator is a (pooled) sample variance of the data Yij.
It turns out, as we will see shortly, that if H0 is true we may further view these two sample variances as independent, even though they are based on the same observations. It thus follows that we have the ratio of two "independent" sample variances if H0 is true, so that F ∼ F_{t−1, t(r−1)}.

Interesting Fact: It turns out that, in the case of t = 2 treatments, the ratio F reduces to

F = (Ȳ1· − Ȳ2·)² / s²_D̄ = t²,

the square of the usual t statistic. Here, F will have an F_{1, 2(r−1)} distribution. It is furthermore true that when the numerator df for an F distribution is equal to 1 and the denominator df = some value ν, say, then the square root of the F r.v. has a t_ν distribution. Thus, when t = 2, comparing the ratio F to the F distribution is the same as comparing the usual t statistic to the t distribution.

3.1.2. Linear Additive Model
It is convenient to write down a model for an observation that highlights the possible sources of variation. For the general one-way classification with t treatments, we may write an individual observation, the jth experimental unit in the ith treatment group, as

Yij = μ + τi + εij,  i = 1, …, t,  j = 1, …, ri,
Bang and Davidian
where
• t = number of treatments
• ri = number of replicates on treatment i. In general, this may differ across treatments, so we add the subscript i. (What we called ni in the case t = 2 is now ri.)
• μi = μ + τi is the mean of the population describing responses on experimental units receiving the ith treatment.
• μ may be thought of as the "overall" mean in the absence of treatments.
• τi is the change in mean (deviation from μ) associated with treatment i.
We have written the model generally to allow for unequal replication. We will see that the idea of an F ratio may be generalized to this case. This model is just an extension of the one we used in the case of two treatments and, as in that case, shows that we may think of observations varying about an overall mean because of the systematic effect of treatments and the random variation in experimental units.

3.1.3. Fixed vs. Random Effects
Recall that τi represents the "effect" (deviation) associated with receiving treatment i. Depending on the situation, our further interpretation of τi, and in fact of the treatments themselves, may differ. Consider the following examples. Example 1: Suppose t = 3 and that each treatment is a different fertilizer mixture for which mean yields are to be compared. Here, we are interested in comparing three specific treatments. If we repeated the experiment again, these three fertilizers would always constitute the treatments of interest. Example 2: Suppose a factory operates a large number of machines to produce a product and wishes to determine whether the mean yield of these machines differs. It is impractical for the company to keep track of yield for all of the many machines it operates, so a random sample of five such machines is selected, and observations on yield are made on these five machines. The hope is that the results for the five machines involved in the experiment may be generalized to gain insight into the behavior of all of the machines. In Example 1, there is a particular set of treatments of interest. If we started the experiment next week instead of this week, we would still be interested in this same particular set – it would not vary across other possible experiments we might do. In Example 2, the treatments are the five machines chosen by random selection from all machines at the company. If we started the experiment next week instead of this week, we might end up with a different set of five machines with which to do the experiment. In fact, whatever five machines we end up with,
these particular machines are not the specific machines of interest. Rather, interest focuses on the population of all machines operated by the company. The question of interest now is not about the particular treatments involved in the experiment, but the population of all such treatments. • In a case like Example 1, the τi are best regarded as fixed quantities, as they describe a particular set of conditions. In this situation, the τi are referred to as fixed effects. • In a case like Example 2, the τi are best regarded as r.v.’s. Here, the particular treatments in the experiment may be thought of as being drawn from a population of all such treatments, so there is chance involved. We hence think of the τi as r.v.’s with some mean and variance στ2 . This variance characterizes the variability in the population of all possible treatments, in our example, the variability across all machines owned by the company. If machines are quite different in terms of yield, στ2 will be large. In this situation, the τi are referred to as random effects. You might expect that these two situations would lead to different considerations for testing. In the random treatment effects case, there is additional uncertainty involved, because the treatments we use aren’t the only ones of interest. It turns out that in the particular simple case of assessing treatment differences for the one-way classification, the methods we will discuss are valid for either case. However, in more complicated designs, this is not necessarily the case. 3.1.4. Model Restriction
We have seen that Y¯ i . is an estimator for μi for a sample from population i. Y¯ i . is our “best” indication of the mean response for population i. But if we think about our model, μi = μ + τi , which breaks μi into two components, we do not know how much of what we see, Y¯ i ., is due to the original population of experimental units before treatments were applied (μ) and how much is due to the effect of the treatment (τi ). In particular, the linear additive model we write down to describe the situation actually contains elements we can never hope to get a sense of from the data at hand. More precisely, the best we can do is estimate the individual means μi using Y¯ i· ; we cannot hope to estimate the individual treatment effects τi without additional knowledge or assumptions. Terminology: Mathematically speaking, a model that contains components that cannot be estimated is said to be overparameterized. We have more parameters than we can estimate from the available information. Thus, although the linear additive model is a nice device for focusing our thinking about the data, it is overparameterized from a mathematical point of view.
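The overparameterization is easy to exhibit numerically: the design matrix of the linear additive model has one more column than its rank, so μ and the τi cannot all be estimated without a restriction. A minimal sketch (the layout and numbers below are hypothetical, not from the text):

```python
import numpy as np

# Hypothetical one-way layout: t = 3 treatments, r = 2 replicates each.
t, r = 3, 2

# Design matrix for Yij = mu + tau_i: an intercept column for mu plus
# one indicator column per treatment for the tau_i.
X = np.zeros((t * r, t + 1))
X[:, 0] = 1.0                          # column for the overall mean mu
for i in range(t):
    X[i * r:(i + 1) * r, i + 1] = 1.0  # indicator for treatment i

# The treatment indicators sum to the intercept column, so there are
# t + 1 parameters but only rank t: mu and the tau_i are not separately
# estimable without a restriction such as sum(tau_i) = 0.
print(X.shape[1], np.linalg.matrix_rank(X))  # 4 3
```

The rank deficiency of exactly one is what the single restriction Σ τi = 0, introduced next, removes.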
One Approach: It may seem that this is an “artificial” problem – why write down a model for which one can’t estimate all its components? The reason is, as above, to give a nice framework for thinking about the data – for example, the model allows us to think of fixed or random treatment effects, depending on the type of experiment. To reconcile our desire to have a helpful model for thinking and the mathematics, the usual approach is to impose some sort of assumption. A standard way to think about things is to suppose that the overall mean μ can be thought of as the mean or average of the individual treatment means μi , that is,
μ = (1/t) Σi μi,  just as  Ȳ·· = (1/t) Σi Ȳi·.

This implies that

μ = (1/t) Σi (μ + τi) = μ + (1/t) Σi τi,

so that it must be that

Σi τi = 0.
The condition Σi τi = 0 goes along with the interpretation of the τi as "deviations" from an overall mean. This restriction is one you will see often in work on ANOVA. Basically, it has no effect on our objective, investigating differences among treatment means. All the restriction does is impose a particular interpretation on our linear additive model. You will often see the null and alternative hypotheses written in terms of the τi instead of the μi. Note that, under this interpretation, if all treatment means were the same (H0), then the τi must all be zero. This interpretation is valid in the case where the τi are fixed effects. When they are random effects, the interpretation is similar: we think of the τi themselves as having population mean 0, analogous to their averaging to zero above. This is the analog of the restriction in the case of random effects. If there are no differences across treatments, then they do not vary, that is, στ² = 0.

3.1.5. Assumptions for ANOVA
Before we turn to using the above framework and ideas to develop formal methods, we restate for completeness the assumptions underlying our approach. • The observations, and hence the errors, are normally distributed. • The observations have the same variance σ 2 .
• All observations, both across and within samples, are unrelated (independent). The assumptions provide the basis for concluding that the sampling distribution of the statistic F upon which we will base our inferences is really the F distribution. Important: The assumptions above are not necessarily true for any given situation. In fact, they are probably never exactly true. For many data sets, they may be a reasonable approximation, in which case the methods we will discuss will be fairly reliable. In other cases, they may be seriously violated; here, the resulting inferences may be misleading. If the underlying distribution of the data really is not normal and/or the variances across treatment groups are not the same, then the rationale we used to develop the statistic is lost. If the data really are not normal, hypothesis tests may be flawed in the sense that the true level of significance is greater than the chosen level α, and we may claim there is a difference when there really is not a difference. We may think we are seeing a difference in means, when actually we are seeing lack of normality. For some data, it may be possible to get around these issues somewhat. So far, we have written down our model in an additive form. However, there are physical situations where a more plausible model is one that has error enter in a multiplicative way: Yij = μ∗ τi∗ εij∗ . Such a model is often appropriate for growth data, or many situations where the variability in response tends to get larger as the response gets larger. If we take logarithms, we may write this as log Yij = μ + τi + εij , μ = log μ∗ , τi = log τi∗ , εij = log εij∗ . Thus, the logarithms of the observations satisfy a linear, additive model. This is the rationale behind the common practice of transforming the data. It is often the case that many types of biological data seem to be close to normally distributed with constant variance on the logarithm scale, but not at all on their original scale. 
The data are thus analyzed on this scale instead. Other transformations may be more appropriate in some circumstances.

3.1.6. ANOVA for One-Way Classification with Equal Replication
We begin with the simplest case, where ri = r for all i. We will also assume that the τi are fixed effects. Recall our argument to derive the form of the F ratio statistic:

F = (estimator for σ² based on sample means [23]) / (estimator for σ² based on individual deviations [21]).

In particular, the components may be written as follows (sums run over i = 1, …, t and j = 1, …, r):
• Numerator: r Σi (Ȳi· − Ȳ··)² / (t − 1) = Treatment SS / df for treatments
• Denominator: Σi Σj (Yij − Ȳi·)² / (t(r − 1)) = Error SS / df for error
Here, we define the quantities Treatment SS and Error SS and their df as given above, where, as before, SS = sum of squares. These names make intuitive sense. The Treatment SS is part of the estimator for σ 2 that includes a component due to variation in treatment means. The use of the term Error SS is as before – the estimator for σ 2 in the denominator only assesses apparent variation across experimental units. “Correction” Term: It is convenient to define
C = (Σi Σj Yij)² / (rt).
Algebraic Facts: It is convenient for getting insight (and for hand calculation) to express the SS's differently. It is possible to show that

Treatment SS = r Σi (Ȳi· − Ȳ··)² = Σi (Σj Yij)² / r − C,

Error SS = Σi Σj (Yij − Ȳi·)² = Σi Σj Yij² − Σi (Σj Yij)² / r.

Consider that the overall, "total" variation in all the data, if we do not consider that different treatments were applied, would obviously be well represented by the sample variance of all the data, lumping them together without regard to treatment. There are rt total observations; thus, this sample variance would be

Σi Σj (Yij − Ȳ··)² / (rt − 1);

each deviation is taken about the overall mean of all rt observations.
Algebraic Facts: The numerator of the overall sample variance may be written as

Σi Σj (Yij − Ȳ··)² = Σi Σj Yij² − C.
This quantity is called the Total SS. Because it is the numerator of the overall sample variance, it may be thought of as measuring how observations vary about the overall mean, without regard to treatments. That is, it measures the total variation. We are now in a position to gain insight. From the algebraic facts above, note that

Treatment SS + Error SS = Total SS.   [24]
Equation [24] illustrates a fundamental point – the Total SS, which characterizes overall variation in the data without regard to the treatments, may be partitioned into two independent components:
• Treatment SS, measuring how much of the overall variation is in fact due to the treatments (in that the treatment means differ).
• Error SS, measuring the remaining variation, which we attribute to inherent variation among experimental units.
F Statistic: If we now define

MST = Treatment MS = Treatment SS / df for treatments = Treatment SS / (t − 1),
MSE = Error MS = Error SS / df for error = Error SS / (t(r − 1)),

then we may write our F statistic as

F = Treatment MS / Error MS.
We now get some insight into why F has an Ft−1,t(r−1) distribution. The components in the numerator and denominator are "independent" in the sense that they partition the Total SS into two "orthogonal" components. (A formal mathematical argument is possible.) We summarize this information in Table 1.1.
Statistical Hypotheses: The question of interest in this setting is to determine whether the means of the t treatment populations are different. We may write this formally as
Table 1.1
One-way ANOVA table – Equal replication

Source of variation        DF        Definition            SS                    MS    F
Among Treatments           t − 1     r Σi (Ȳi· − Ȳ··)²     Σi (Σj Yij)²/r − C    MST   F
Error (within treatments)  t(r − 1)  Σi Σj (Yij − Ȳi·)²    by subtraction        MSE
Total                      rt − 1    Σi Σj (Yij − Ȳ··)²    Σi Σj Yij² − C
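For readers who prefer to check the arithmetic by machine, the quantities in Table 1.1 can be computed directly from the formulas above; the data below are invented purely for illustration:

```python
import numpy as np

# Invented data: t = 3 treatments, r = 4 replicates each.
Y = np.array([[12.0, 14.0, 11.0, 13.0],
              [15.0, 17.0, 16.0, 14.0],
              [20.0, 22.0, 19.0, 21.0]])
t, r = Y.shape

C = Y.sum() ** 2 / (r * t)                   # "correction" term
total_ss = (Y ** 2).sum() - C                # Total SS
trt_ss = (Y.sum(axis=1) ** 2).sum() / r - C  # Treatment SS (computational form)
err_ss = total_ss - trt_ss                   # Error SS, by subtraction

# The computational forms agree with the defining formulas:
assert np.isclose(trt_ss, r * ((Y.mean(axis=1) - Y.mean()) ** 2).sum())
assert np.isclose(err_ss, ((Y - Y.mean(axis=1, keepdims=True)) ** 2).sum())

MST = trt_ss / (t - 1)
MSE = err_ss / (t * (r - 1))
F = MST / MSE
print(round(F, 2))  # 39.2
```

The resulting F would be compared to the F distribution with t − 1 = 2 and t(r − 1) = 9 df, as in the test procedure below.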
H0,T: μ1 = μ2 = ··· = μt vs. H1,T: the μi are not all equal.

Under the restriction Σi τi = 0, this may also be written in terms of the τi as

H0,T: τ1 = τ2 = ··· = τt = 0 vs. H1,T: the τi are not all zero,

where the subscript "T" is added to remind ourselves that this particular test is with regard to treatment means. It is important to note that the alternative hypothesis does not specify the way in which the treatment means (or deviations) differ. The best we can say on the basis of our statistic is that they differ somehow. The numerator of the statistic can be large because the means differ in a huge variety of configurations: some of the means may be different while the others are the same, all might differ, and so on.
Test Procedure: Reject H0,T in favor of H1,T at level of significance α if

F > F(t−1),t(r−1),α.

This is analogous to a two-sided test when t = 2 – we do not state in H1,T the order in which the means differ, only that they do. The range of possibilities for how they differ is just more complicated when t > 2. We use α instead of α/2 here because we have no choice as to which MS appears in the numerator and which appears in the denominator. (Compare to the test of equality of variance in the case of two treatments.)

3.1.7. ANOVA for One-Way Classification with Unequal Replication
We now generalize the ideas of ANOVA to the case where the ri are not all equal. This may be the case by design or because of mishaps during the experiment that result in lost or unusable data. Here, again, our discussion assumes that the τi are fixed effects. When the ri are not all equal, we redefine the correction factor and the total number of observations as
C = (Σi Σj Yij)² / N,   N = Σi ri,

where now the inner sum runs over j = 1, …, ri.
Using the same logic as before, we arrive at the following ANOVA table (Table 1.2):
Table 1.2
One-way ANOVA table – Unequal replication

Source of variation        DF                    Definition             SS                     MS    F
Among Treatments           t − 1                 Σi ri (Ȳi· − Ȳ··)²     Σi (Σj Yij)²/ri − C    MST   F
Error (within treatments)  Σi ri − t (= N − t)   Σi Σj (Yij − Ȳi·)²     by subtraction         MSE
Total                      N − 1                 Σi Σj (Yij − Ȳ··)²     Σi Σj Yij² − C
Under the same hypotheses as in the equal-replication case, Test Procedure: Reject H0,T in favor of H1,T at level of significance α if F > Ft−1,N−t,α.

3.2. Multiple Comparisons
When we test the usual hypotheses regarding differences among more than two treatment means, if we reject the null hypothesis, the best we can say is that there is a difference among the treatment means somewhere. Based on this analysis, we cannot say how these differences occur. For example, it may be that all the means are the same except one. Alternatively, all means may differ from all others. However, on the basis of this test, we cannot tell. The concept of multiple comparisons is related to trying to glean more information from the data on the nature of the differences among means. For reasons that will become clear shortly, this is a difficult issue, and even statisticians do not always agree. As you will see, the issue is one of “philosophy” to some extent. Understanding the principles and the problem underlying the idea of multiple comparisons is thus much more important than being familiar with the many formal statistical procedures. In this section, we will discuss the issues and then consider only a few procedures. The discussion of multiple comparisons is really only meaningful when the number of treatments t ≥ 3.
3.2.1. Principles – “Planned” Vs. “Families” of Comparisons
Recall that when we perform an F test (a test based on an F ratio) in the ANOVA framework for the difference among t treatment means, the only inference we make as a result of the test is that the means, considered as a group, differ somehow if we reject the null hypothesis:

H0,T: μ1 = ··· = μt vs. H1,T: the μi are not all equal.
We do not and cannot make inference as to how they differ. Consider the following scenarios:
• If H0,T is not rejected, it could be that there is a real difference between, say, two of the t treatments, but it is "getting lost" by being considered with all the other possible comparisons among treatment means.
• If H0,T is rejected, we are naturally interested in the specific nature of the differences among the t means. Are they all different from one another? Are only some of them different, the rest all the same? One is naturally tempted to look at the sample means for each treatment – are there differences that are "suggested" by these means?
To consider these issues, we must recall the framework in which we test hypotheses. Recall that the level of significance for any hypothesis test is α = P(reject H0 when H0 is true) = P(Type I error). Thus, the level α specifies the probability of making a mistake and saying that there is evidence to suggest the alternative hypothesis is true when it really isn't. The probability of making such an error is controlled to be no more than α by the way we do the test. Important: The level α, and thus the probability of making such a mistake, applies only to the particular H0 under consideration. Thus, in the ANOVA situation above, where we test H0,T vs. H1,T, α applies only to the comparison among all the means together – either they differ somehow or they do not. The probability that we end up saying they differ somehow when they don't is thus no more than α; that is all. α does not pertain to any further consideration of the means, say, comparing them two at a time. To see this, consider some examples. Example – t = 3 Treatments: Suppose there are t = 3 treatments under consideration. We are certainly interested in the question "do the means differ?" but we may also be interested in how. Specifically, does, say, μ1 = μ2, but μ3 ≠ μ1 or μ2? Here, we are interested in what we can say about three separate comparisons.
We'd like to be able to combine the three comparisons somehow to make a statement about how the means differ. Suppose, to address this, we decide to perform three separate t-tests for the three possible differences of means, each with level of significance α. Then, we have
P(Type I error comparing μ1 vs. μ2) = α
P(Type I error comparing μ1 vs. μ3) = α
P(Type I error comparing μ2 vs. μ3) = α.
That is, in each test, we have probability of α of inferring that the two means under consideration differ when they really do not. Because we are performing more than one hypothesis test, the probability we make this mistake in at least one of the tests is no longer α! This is simply because we are performing more than one test. For example, it may be shown mathematically that if α = 0.05, P(Make at least one Type I error in the 3 tests) ≈ 0.14! Suppose we perform the three separate tests, and we reject the null hypothesis in the first two tests, but not in the third (μ2 vs. μ3 ), with each test at level α = 0.05. We then proclaim at the end that “there is sufficient evidence in these data to say that μ1 differs from μ2 and μ1 differs from μ3 . We do not have sufficient evidence to say that μ2 and μ3 differ from each other.” Next to this statement, we say “at level of significance α = 0.05.” What is wrong with doing this? From the calculation above, the chance that we said something wrong in this statement is not 0.05! Rather, it is 0.14! The chance we have made a mistake in claiming a difference in at least one of the tests is 0.14 – almost three times the chance we are claiming! In fact, for larger values of t (more treatments), if we make a separate test for each possible pairwise comparison among the t means, each at level α, things only get worse. For example, if α = 0.05 and t = 10, the chance we make at least one Type I error, and say in our overall statement that there is evidence for a difference in a pair when there isn’t, is almost 0.90! That is, if we try to combine the results of all these tests together into a statement that tries to sort out all the differences, it is very possible (almost certain) that we will claim a difference that really doesn’t exist somewhere in our statement. 
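The numbers quoted here follow from treating the tests as independent, in which case P(at least one Type I error in k level-α tests) = 1 − (1 − α)^k; pairwise t-tests on shared data are not exactly independent, but the approximation conveys the point:

```python
from math import comb

def familywise(k, alpha=0.05):
    # P(at least one Type I error) among k independent level-alpha tests.
    return 1 - (1 - alpha) ** k

print(round(familywise(3), 2))            # 3 pairwise tests for t = 3 -> 0.14
print(round(familywise(comb(10, 2)), 2))  # all 45 pairs for t = 10 -> 0.9
```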
If we wish to "sort out" differences among all the treatment means, we cannot just compare them all separately without having a much higher chance of concluding something wrong! Another Problem: Recognizing the problems associated with "sorting out" differences above, suppose we decide to look at the sample treatment means Ȳi· and compare only those that appear to possibly be different. Won't this get around the problem, as we're not likely to do as many separate tests? No! To see this, suppose we inspect the sample means and decide to conduct a t-test at level α = 0.05 for a difference in the two treatment means observed to have the highest and lowest sample means Ȳi· among all those in the experiment. Recall that the data are just random samples from the treatment populations of interest. Thus, the sample means Ȳi· could have ended up the way they did because
• There really is a difference in the population means, OR
• We got "unusual" samples, and there really is no difference!
Because of the chance mechanism involved, either of these explanations is possible. Returning to our largest and smallest sample means, then, the large difference we have observed could be due just to chance – we got some unusual samples, even though the means really don't differ. Because we have already seen a large difference in our samples, however, it turns out that, even if this is the case, we will still be more likely to reject the null hypothesis of no difference in the two means. That is, although we claim that we only have a 5% chance of rejecting the null hypothesis when it's true, the chance we actually do this is higher. Here, with α = 0.05, it turns out that the true probability that we reject the null hypothesis that the means with the smallest and largest observed Ȳi· values are the same, when it is true, is
• actually 0.13 if t = 3
• actually 0.60 if t = 10!
As a result, if we test something on the basis of what we observed in our samples, our chance of making a Type I error is much greater than we think. The above discussion shows that there are clearly problems with trying to get a handle on how treatment means differ. In fact, this brings to light some philosophical issues. Planned Comparisons: This is best illustrated by an example. Suppose that some university researchers are planning to conduct an experiment involving four different treatments:
• The standard treatment, which is produced by a certain company and is in widespread use.
• A new commercial treatment manufactured by a rival company, which hopes to show it is better than the standard, so that it can market the new treatment for enormous profits.
• Two experimental treatments developed by the university researchers. These are being developed by two new procedures, one designed by the university researchers, the other by rival researchers at another, more prestigious university.
The university researchers hope to show that their treatment is better. The administration of the company is trying to decide whether they should begin planning marketing strategy for their treatment; they are thus uninterested for this purpose in the two university treatments. Because the university researchers are setting up an experiment, the company finds it less costly to simply pay the researchers to include their treatment and the standard in the study rather than for the company to do a separate experiment of their own. On the other hand, the university researchers are mainly interested in the experimental treatments. In fact, their main question is which of them shows more promise. In this situation, the main comparison of interest as far as the company is
concerned is that between their new treatment and the standard. Thus, although the actual experiment involves four treatments, they really only care about the pairwise comparison of the two, that is, the difference

μstandard − μnew (company).   [25]
Similarly, for the university researchers, their main question involves the difference in the pair μexperimental (us) − μexperimental (them) . In this situation, the comparison of interest depends on the interested party. As usual, each party would like to control the probability of making a Type I error in their particular comparison. For example, the company may be satisfied with level of significance α = 0.05, as usual, and probably would have used this same significance level had they conducted the experiment with just the two treatments on their own. In this example, prior to conduct of the experiment, specific comparisons of interest among various treatment means have been identified, regardless of the overall outcome of the experiment. Because these comparisons were identified in advance, we do not run into the problem we did in comparing the treatments with the largest and smallest sample means after the experiment, because the tests will be performed regardless. Nor do we run into the problem of sorting out differences among all means. As far as the company is concerned, their question about [25] is a separate test, for which they want the probability of concluding a difference when there isn’t one to be at most 0.05. A similar statement could be made about university researchers. Of course, this made-up scenario is a bit idealized, but it illustrates the main point. If specific questions involving particular treatment means are of independent interest, are identified as such in advance of seeing experimental results, and would be investigated regardless of outcome, then it may be legitimate to perform the associated tests at level of significance α, without concern about the outcome of other tests. In this case, the level of significance α applies only to the test in question and may be proclaimed only in talking about that single test! 
If a specific comparison is made in this way from data from a larger experiment, we are controlling the comparisonwise (Type I) error rate at α. For each comparison of this type that is made, the probability of a Type I error is α. The results of several comparisons may not be combined into a single statement with claimed Type I error rate α.
Families of Comparisons: Imagine a pea section experiment in which there were four sugar treatments and a control (no sugar).
Suppose that the investigators would like to make a single statement about the differences. This statement involves four pairwise comparisons – each sugar treatment against the control. From our previous discussion, it is clear that testing each pairwise comparison against the control at level of significance α and then making such a statement would involve a chance of declaring differences that really don’t exist greater than that indicated by α. Solution: For such a situation, then, the investigators would like to avoid this difficulty. They are interested in ensuring that the family of four comparisons they wish to consider (all sugars against the control) has overall probability of making at least one Type I error to be no more than α, i.e., P(at least one Type I error among all four tests in the family) ≤ α. Terminology: When a question of interest about treatment means involves a family of several comparisons, we would like to control the family-wise error rate at α, so that the overall probability of making at least one mistake (declaring a difference that doesn’t exist) is controlled at α. It turns out that a number of statistical methods have been developed that ensure that the level of significance for a family of comparisons is no more than a specified α. These are often called multiple comparison procedures. Various procedures are available in statistical software (e.g., PROC MULTTEST in SAS, “mtest” command in Stata). 3.2.2. The Least Significant Difference
First, we consider the situation where we have planned, in advance of the experiment, to make certain comparisons among treatment means. Each comparison is of interest in its own right and thus is to be viewed as separate. Statements regarding each such comparison will not be combined. Thus, here, we wish to control the comparisonwise error rate at α.
Idea: Despite the fact that each comparison is to be viewed separately, we can still take advantage of all the information in the experiment on experimental error. To fix ideas, suppose we have an experiment involving t treatments and we are interested in comparing two treatments a and b, with means μa and μb. That is, we wish to test H0: μa = μb vs. H1: μa ≠ μb, controlling the comparisonwise error rate at α. Then
Test Statistic: As our test statistic for H0 vs. H1, use

(Ȳa· − Ȳb·) / sȲa·−Ȳb·,   sȲa·−Ȳb· = s √(1/ra + 1/rb),   s = √MSE.
That is, instead of basing the estimate of σ 2 on only the two treatments in question, use the estimate from all t treatments.
Result: This will lead to a more precise test: the estimate of the SD of the treatment mean difference from all t treatments will be more precise, because it is based on more information! Specifically, the SE based on only the two treatments would have only ra + rb − 2 df, while that based on all t treatments has Σi ri − t = N − t df.
Test Procedure: Perform the test using all the information by rejecting H0 in favor of H1 if

|Ȳa· − Ȳb·| / sȲa·−Ȳb· > tN−t,α/2.

A quick look at the t table will reveal the advantage of this test. The df, N − t, for estimating σ² (experimental error) are greater than those from only two treatments. The corresponding critical value is thus smaller, giving more chance for rejection of H0 when it is really false.
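As a concrete sketch of this procedure (the data are invented; scipy's t distribution supplies the critical value):

```python
import numpy as np
from scipy import stats

# Invented data: t = 4 treatments, r = 4 each; compare treatments a and b.
groups = [np.array([12.0, 14.0, 13.0, 11.0]),   # a
          np.array([16.0, 18.0, 17.0, 15.0]),   # b
          np.array([14.0, 13.0, 15.0, 14.0]),
          np.array([15.0, 16.0, 14.0, 17.0])]
t_trt = len(groups)
N = sum(len(g) for g in groups)

# Pooled estimate of sigma^2 (the Error MS) uses ALL t treatments,
# not just the two being compared, and so carries N - t df.
mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / (N - t_trt)
s = np.sqrt(mse)

a, b = groups[0], groups[1]
se_diff = s * np.sqrt(1 / len(a) + 1 / len(b))
lsd = stats.t.ppf(1 - 0.05 / 2, N - t_trt) * se_diff

# Reject H0: mu_a = mu_b at level 0.05 if |Ybar_a - Ybar_b| > LSD.
print(abs(a.mean() - b.mean()) > lsd)  # True
```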
1 1 sY¯ a. −Y¯ b. tN −t,α/2 = s + tN −t,α/2 , s = MSE ra rb is called the least significant difference (LSD) for the test of H0 vs. H1 above, based on the entire experiment. From above, we reject H0 at level α if Y¯ a. − Y¯ b. > LSD. Warning: It is critical to remember that the LSD procedure is only valid if the paired comparisons are genuinely of independent interest. 3.2.3. Contrasts
Before we discuss the notion of multiple comparisons (for families of comparisons), we consider the case where we are interested in comparisons that cannot be expressed as a difference in a pair of means.

Example: Suppose that for the pea section experiment, a particular question of interest was to compare sugar treatments containing fructose to those that do not:
• μ3, μ4: treatments containing fructose
• μ2, μ5: treatments not containing fructose
It is suspected that treatments 3 and 4 are similar in terms of resulting pea section length, treatments 2 and 5 are similar, and
Bang and Davidian
lengths from 3 and 4, on the average, are different from lengths from 2 and 5, on the average. Thus, the question of interest is to compare the average mean pea section length for treatments 3 and 4 to the average mean length for treatments 2 and 5. To express this formally in terms of the treatment means, we are thus interested in the comparison between (μ3 + μ4)/2 and (μ2 + μ5)/2, that is, the average of μ3 and μ4 vs. the average of μ2 and μ5. We may express this formally as a set of hypotheses:
$$ H_0: \frac{\mu_3+\mu_4}{2} = \frac{\mu_2+\mu_5}{2} \quad \text{vs.} \quad H_1: \frac{\mu_3+\mu_4}{2} \ne \frac{\mu_2+\mu_5}{2}. $$
These may be rewritten by algebra as
$$ H_0: \mu_3+\mu_4-\mu_2-\mu_5 = 0 \quad \text{vs.} \quad H_1: \mu_3+\mu_4-\mu_2-\mu_5 \ne 0. \qquad [26] $$
Similarly, suppose a particular question of interest was whether the sugar treatments differ on average in terms of mean pea section length from the control. We thus would like to compare the average mean length for treatments 2–5 to the mean for treatment 1. We may express this as a set of hypotheses:
$$ H_0: \frac{\mu_2+\mu_3+\mu_4+\mu_5}{4} = \mu_1, \ \text{or}\ 4\mu_1-\mu_2-\mu_3-\mu_4-\mu_5 = 0 $$
vs.
$$ H_1: \frac{\mu_2+\mu_3+\mu_4+\mu_5}{4} \ne \mu_1, \ \text{or}\ 4\mu_1-\mu_2-\mu_3-\mu_4-\mu_5 \ne 0. \qquad [27] $$
Terminology: A linear function of the treatment means of the form
$$ \sum_{i=1}^{t} c_i \mu_i $$
such that the constants c_i sum to zero, i.e., $\sum_{i=1}^{t} c_i = 0$, is called a contrast. Both of the functions in [26] and [27] are contrasts:
• μ3 + μ4 − μ2 − μ5: c1 = 0, c2 = −1, c3 = 1, c4 = 1, c5 = −1, with $\sum_{i=1}^{5} c_i = 0$
• 4μ1 − μ2 − μ3 − μ4 − μ5: c1 = 4, c2 = −1, c3 = −1, c4 = −1, c5 = −1, with $\sum_{i=1}^{5} c_i = 0$.
Interpretation: Note that, in each case, if the means really were all equal, then, because the coefficients ci sum to zero, the contrast itself will be equal to zero. If they are different, it will not. Thus, the null hypothesis says that there are no differences. The alternative says that the particular combination of means is different from zero, which reflects real differences among functions of the means. Note that, of course, a pairwise comparison is a contrast, e.g., μ2 − μ1 is a contrast with c1 = −1, c2 = 1, c3 = c4 = c5 = 0.
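The contrast coefficients above are easy to verify numerically; the common mean used below is a hypothetical value, chosen only to illustrate that any contrast in equal means is zero:

```python
import numpy as np

# The contrasts in [26] and [27], and a pairwise comparison, written as
# coefficient vectors (c1, ..., c5). A valid contrast must have
# coefficients summing to zero.
c_fructose = np.array([0, -1, 1, 1, -1])    # mu3 + mu4 - mu2 - mu5
c_control  = np.array([4, -1, -1, -1, -1])  # 4 mu1 - mu2 - ... - mu5
c_pairwise = np.array([-1, 1, 0, 0, 0])     # mu2 - mu1

for c in (c_fructose, c_control, c_pairwise):
    assert c.sum() == 0, "not a contrast"

# With all population means equal, every contrast is exactly zero:
equal_means = np.full(5, 60.0)   # hypothetical common mean
print(c_fructose @ equal_means, c_control @ equal_means)
```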
Estimation: As intuition suggests, the best (in fact, unbiased) estimator of any contrast $\sum_{i=1}^{t} c_i \mu_i$ is
$$ Q = \sum_{i=1}^{t} c_i \bar{Y}_{i\cdot}. $$
It turns out that, for the population of all such Q's for a particular set of c_i's, the variance of a contrast is
$$ \sigma_Q^2 = \sigma^2 \sum_{i=1}^{t} \frac{c_i^2}{r_i}, $$
which may be estimated by replacing σ² by the pooled estimate MS_E. Thus, a SE estimate for the contrast is
$$ s_Q = s\sqrt{\sum_{i=1}^{t} \frac{c_i^2}{r_i}}, \quad s = \sqrt{MS_E}. $$
It may be easily verified (try it) that this expression reduces to our usual expression when the contrast is a pairwise comparison.

Test Procedure: For testing hypotheses of the general form
$$ H_0: \sum_{i=1}^{t} c_i\mu_i = 0 \quad \text{vs.} \quad H_1: \sum_{i=1}^{t} c_i\mu_i \ne 0 $$
at level of significance α when the comparison is planned, we use the same procedure as for the LSD test. We reject H0 in favor of H1 if
$$ |Q| = \left| \sum_{i=1}^{t} c_i \bar{Y}_{i\cdot} \right| > s_Q\, t_{N-t,\,\alpha/2}. $$
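This planned-contrast test can be sketched as follows; the sample means, replication counts, and MS_E are hypothetical stand-ins, not values from the pea section data:

```python
import numpy as np
from scipy import stats

# Planned test of H0: sum_i c_i mu_i = 0 for the fructose contrast [26].
# Means, replication, and MSE below are hypothetical illustrations.
c = np.array([0.0, -1.0, 1.0, 1.0, -1.0])
ybar = np.array([70.1, 59.3, 58.2, 58.0, 64.1])  # treatment sample means
r = np.full(5, 10)                               # replicates per treatment
MSE, alpha = 5.5, 0.05
N, t_trt = int(r.sum()), len(r)

Q = c @ ybar                              # estimated contrast
sQ = np.sqrt(MSE * np.sum(c**2 / r))      # its standard error
t_crit = stats.t.ppf(1 - alpha/2, N - t_trt)
reject = abs(Q) > sQ * t_crit             # two-sided test at level alpha
print(round(Q, 2), round(sQ, 3), reject)
```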
The result of this test is not to be combined with any other; the level of significance α pertains only to the single question at hand.

3.2.4. Families of Comparisons
We now consider the problem of combining statements. Suppose we wish to make a single statement about a family of contrasts (e.g., several pairwise comparisons) at some specified level of significance α. There are a number of methods for making statements about families of comparisons that control the overall family-wise level of significance at a value α. We discuss three of these here; the basic premise behind each is the same.

Bonferroni Method: This method is based on modifying individual t-tests. Suppose we specify c contrasts C1, C2, ..., Cc and wish to test the hypotheses
$$ H_{0,k}: C_k = 0 \quad \text{vs.} \quad H_{1,k}: C_k \ne 0, \quad k = 1, \ldots, c, $$
while controlling the overall family-wise Type I error rate for all c tests as a group to be ≤ α. It may be shown mathematically that, if one makes c tests, each at level of significance α/c, then P(at least 1 Type I error in the c tests) ≤ α. Thus, in the Bonferroni procedure, we use for each test the t value corresponding to the appropriate df and α/c and conduct each test as usual. That is, for each contrast $C_k = \sum_{i=1}^{t} c_i \mu_i$, reject H_{0,k} if
$$ |Q_k| > s_Q\, t_{N-t,\,\alpha/(2c)}, \qquad Q_k = \sum_{i=1}^{t} c_i \bar{Y}_{i\cdot}. $$
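The Bonferroni cutoff (and, for comparison, the Scheffé and Tukey cutoffs defined next) can be computed directly. The settings below (t = 5, N − t = 45, c = 4) are illustrative choices matching the text's q_0.05(5, 45) example, and the studentized-range call assumes scipy 1.7 or later:

```python
import numpy as np
from scipy import stats

# Per-comparison cutoffs for |Q_k| under each procedure, for t = 5
# treatments, N - t = 45 error df, and c = 4 planned contrasts.
t_trt, df, alpha, c_tests = 5, 45, 0.05, 4

# Naive familywise bound if the c tests were independent, each at alpha:
fw_naive = 1 - (1 - alpha) ** c_tests   # about 0.185, well above 0.05

t_unadj = stats.t.ppf(1 - alpha / 2, df)              # single contrast
t_bonf  = stats.t.ppf(1 - alpha / (2 * c_tests), df)  # Bonferroni
S = np.sqrt((t_trt - 1) * stats.f.ppf(1 - alpha, t_trt - 1, df))  # Scheffe
q = stats.studentized_range.ppf(1 - alpha, t_trt, df)  # q_0.05(5, 45)
T = q / np.sqrt(2)                                     # Tukey

print(round(fw_naive, 3),
      [round(v, 3) for v in (t_unadj, t_bonf, S, T)])
```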
The Bonferroni method is valid with any type of contrast and may be used with unequal replication.

Scheffé's Method: The idea is to ensure that, for any family of contrasts, the family level is α by requiring that the probability of a Type I error is ≤ α for the family of all possible contrasts. That is, control things so that P(at least 1 Type I error among tests of all possible contrasts) ≤ α. Because the level is controlled for any family of contrasts, it will be controlled for the one of interest. The procedure is to compute the quantity
$$ S = \sqrt{(t-1)\, F_{t-1,\,N-t,\,\alpha}}. $$
For all contrasts of interest C_k, reject the corresponding null hypothesis H_{0,k} if |Q_k| > s_Q S.

Tukey's Method: For this method, we must have equal replication, and the contrasts of interest must all be pairwise comparisons; because only pairwise contrasts are of interest, the method takes advantage of this fact. It ensures that the family-wise error rate for all possible pairwise comparisons is controlled at α; if this is true, then any subset of pairwise comparisons will also have this property. The procedure is to compute
$$ T = \frac{1}{\sqrt{2}}\, q_\alpha(t, N-t), $$
where q_α(t, N − t) can be found in tables in relevant statistical textbooks. For example, q_0.05(5, 45) ≈ 4.04 (the closest value in the table). If C_k is the kth pairwise comparison, we reject the corresponding null hypothesis H_{0,k} if |Q_k| > s_Q T.

It is legitimate to conduct the hypothesis tests using all of these methods (where applicable) and then choose the one that rejects most often. This is valid because the "cut-off" values S, T, and t_{N−t,α/(2c)} do not depend on the data; thus, we are not choosing on the basis of the observed data (which, of course, would be illegitimate, as discussed earlier)!

A Problem With Multiple Comparisons: Regardless of which method one uses, there is always a problem when conducting multiple comparisons. Because the number of comparisons being made may be large (e.g., all possible pairwise comparisons for t = 10 treatments!), and we wish to control the overall probability of at least one Type I error to be small, we are likely to have low power (that is, likely to have a difficult time detecting real differences among the comparisons in our family). The reason is simple: to ensure we achieve overall level α, we must use larger critical values for each comparison than we would if the comparisons were each made separately at level α (inspect the pea section results for an example). Thus, although the goal of sorting out differences is worthy, we may be quite unsuccessful at achieving it! This problem has tempted some investigators to try to work around the issue, for example, by claiming that certain comparisons were of interest in advance when they really were not, so as to salvage an experiment with no "significant" results. This is, of course, inappropriate! The only way to ensure enough power to test all questions of interest is to design the experiment with a large enough sample size!

3.3. Multi-Way Classification and ANOVA
Recall that when no sources of variation other than the treatments are anticipated, grouping observations will probably not add very much precision, i.e., reduce our assessment of experimental error. If the experimental units are expected to be fairly uniform, then a completely random design will probably be sufficient. In many situations, however, other sources of variation are anticipated. Examples:
• In an agricultural field experiment, adjacent plots in a field will tend to be "more alike" than those far apart.
• Observations made with a particular measuring device or by a particular individual may be more alike than those made by different devices or individuals.
• Plants kept in the same greenhouse may be more alike than those from different greenhouses.
• Patients treated at the same hospital may be more alike than those treated at different hospitals.
In such cases, there is clearly a potential source of systematic variation we may identify in advance. This suggests that we
may wish to group experimental units in a meaningful way on this basis. When experimental units are considered in meaningful groups, they may be thought of as being classified not only according to treatment but also according to
• Position in the field
• Device or observer
• Greenhouse
• Hospital

Objective: In an experiment, we seek to investigate differences among treatments. By accounting for differences due to effects of phenomena such as those above, a possible source of variation will (hopefully) be excluded from our assessment of experimental error. The result will be increased ability to detect treatment differences if they exist. Designs involving meaningful grouping of experimental units are the key to reducing the effects of experimental error, by identifying components of variation among experimental units that may be due to something besides inherent biological variation among them. The paired design for comparing two treatments is an example of such a design.

Multi-Way Classification: If experimental units may be classified not only according to treatment but also according to other meaningful factors, things obviously become more complicated. We will discuss designs involving more than one way of classifying experimental units. In particular:
• Two-way classification, where experimental units may be classified by treatment and another meaningful grouping factor.
• A form of three-way classification (one factor of which is treatment) called a Latin square (we will not cover this design here).

3.3.1. Randomized Complete Block Design
When experimental units may be meaningfully grouped, clearly, a completely randomized design will be suboptimal. In this situation, an alternative strategy for assigning treatments to experimental units, which takes advantage of the grouping, may be used. Randomized Complete Block Design: • The groups are called blocks. • Each treatment appears the same number of times in each block; hence, the term complete block design. • The simplest case is that where each treatment appears exactly once in each block. Here, because number of replicates = number of experimental units for each treatment,
we have number of replicates = number of blocks = r.
• Blocks are often called replicates for this reason.
• To set up such a design, randomization is used in the following way:
- Assign experimental units to blocks on the basis of the meaningful grouping factor (greenhouse, device, etc.).
- Then randomly assign the treatments to experimental units within each block.
Hence the term randomized complete block design: each block is complete, and randomization occurs within each block.

Rationale: Experimental units within blocks are as alike as possible, so observed differences among them should be mainly attributable to the treatments. To ensure this interpretation holds, in the conduct of the experiment, all experimental units within a block should be treated as uniformly as possible:
• In a field, all plots should be harvested at the same time of day.
• All measurements using a single device should be made by the same individual if different people use it in different ways.
• All plants in a greenhouse should be watered at the same time of day and by the same amount.

Advantages:
• Greater precision is possible than with a completely random design with one-way classification.
• Increased scope of inference is possible because more experimental conditions may be included.

Disadvantages:
• If there is a large number of treatments, a large number of experimental units per block will be required. Large variation among experimental units within blocks might then still arise, with the result that no precision is gained, while the experimental procedure is more complicated. In this case, other designs may be more appropriate.

3.3.2. Linear Additive Model for Two-Way Classification
We assume here that one observation is taken on each experimental unit (i.e., sampling unit = experimental unit). Assume that a randomized complete block design is used with exactly one experimental unit per treatment per block. For the two-way classification with t treatments, we may classify an individual observation as being from the jth block on the ith treatment:
$$ Y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij} = \mu_i + \beta_j + \varepsilon_{ij} = \mu_{ij} + \varepsilon_{ij}, $$
i = 1, . . . , t; j = 1, . . . , r, where
• t = number of treatments
• r = number of replicates on treatment i, that is, the number of blocks
• μ = overall mean (as before)
• τi = effect of the ith treatment (as before)
• μi = μ + τi = mean of the population for the ith treatment
• βj = effect of the jth block
• μij = μ + τi + βj = μi + βj = mean of the population for the ith treatment in the jth block
• εij = "error" describing all other sources of variation (e.g., inherent variation among experimental units not attributable to treatments or blocks)

In this model, the effect of the jth block, βj, is a deviation from the overall mean μ attributable to being an experimental unit in that block. It is a systematic deviation, the same for all experimental units in the block, thus formally characterizing the fact that the experimental units in the same block are "alike." In fact, this model is just an extension of that for the paired design with two treatments considered previously. In that model, we had a term ρj, the effect of the jth pair. Here, it should be clear that this model just extends the idea behind meaningful pairing of observations to groups larger than two (and more than two treatments). Thus, the paired design is just a special case of a randomized complete block design in the case of two treatments.

Fixed and Random Effects: As in the one-way classification, the treatment and block effects, τi and βj, may be regarded as fixed or random. We have already discussed the notion of regarding treatments as having fixed or random effects; we may apply the same reasoning to blocks, and, as expected, there are a number of possibilities. It turns out that, unlike in the case of the one-way classification with one sampling unit per experimental unit, the distinction between fixed and random effects becomes very important in higher-way classifications.
In particular, although the computation of quantities in an ANOVA table may be the same, the interpretation of these quantities, i.e., what the MSs involved estimate, will depend on what’s fixed and what’s random. Model Restriction: As in the one-way classification, the linear additive model above is overparameterized. If we think about the mean for the ith treatment in the jth block, μij , it should be clear that we do not know how much of what we see is attributable to each of the components μ, τi , and βj .
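To make the model concrete, the following sketch randomizes treatments within blocks and then generates one observation per treatment–block cell from Yij = μ + τi + βj + εij; every numeric value is hypothetical:

```python
import numpy as np

# Simulate a randomized complete block design under the linear additive
# model Y_ij = mu + tau_i + beta_j + eps_ij (all numbers hypothetical),
# with the restrictions sum(tau_i) = sum(beta_j) = 0 imposed.
rng = np.random.default_rng(1)
t_trt, r = 5, 4
mu = 60.0
tau = np.array([5.0, -1.0, -2.0, -2.0, 0.0])   # treatment effects, sum 0
beta = np.array([1.5, -0.5, -1.0, 0.0])        # block effects, sum 0
sigma = 2.0                                    # SD of the errors eps_ij

# Design stage: an independent random ordering of treatments per block.
layout = np.array([rng.permutation(t_trt) for _ in range(r)])

# Response stage: one observation per treatment-block combination.
Y = mu + tau[:, None] + beta[None, :] + rng.normal(0.0, sigma, (t_trt, r))
print(layout.shape, Y.shape)
```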
One Approach: The usual approach is to impose the restrictions
$$ \sum_{i=1}^{t} \tau_i = 0 \quad \text{and} \quad \sum_{j=1}^{r} \beta_j = 0. $$
One way to think about this is to suppose that the overall mean μ is the average of the μij:
$$ \mu = \frac{1}{rt}\sum_{i=1}^{t}\sum_{j=1}^{r} \mu_{ij} = \frac{1}{rt}\sum_{i=1}^{t}\sum_{j=1}^{r} (\mu + \tau_i + \beta_j) = \mu + \frac{1}{t}\sum_{i=1}^{t}\tau_i + \frac{1}{r}\sum_{j=1}^{r}\beta_j. $$

3.3.3. ANOVA for Two-Way Classification – Randomized Complete Block Design with No Subsampling
Assumptions: Before we turn to the analysis, we reiterate the assumptions that underlie the validity of the methods:
• The observations, and hence the errors, are normally distributed.
• The observations have the same variance.
These assumptions are necessary for the F ratios we construct to have the F sampling distribution. Recall our discussion on the validity of assumptions – the same issues apply here (and, indeed, always).

Idea: As for the one-way classification, we partition the Total SS, which measures all variation in the data from all sources, into "independent" components describing variation attributable to different sources. These components are the numerators of estimators for "variance." We develop the idea by thinking of both the treatment and the block effects τi and βj as being fixed.

Notation: Define
$$ \bar{Y}_{\cdot\cdot} = \frac{1}{rt}\sum_{i=1}^{t}\sum_{j=1}^{r} Y_{ij} = \text{overall sample mean}, $$
$$ \bar{Y}_{i\cdot} = \frac{1}{r}\sum_{j=1}^{r} Y_{ij} = \text{sample mean for treatment } i \text{ (over all blocks)}, $$
$$ \bar{Y}_{\cdot j} = \frac{1}{t}\sum_{i=1}^{t} Y_{ij} = \text{sample mean for block } j \text{ (over all treatments)}, $$
$$ C = \frac{\left(\sum_{i=1}^{t}\sum_{j=1}^{r} Y_{ij}\right)^2}{rt} = \text{correction factor}. $$
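The notation above can be checked on a tiny data matrix (rows = treatments, columns = blocks); all values are hypothetical:

```python
import numpy as np

# A toy 3-treatment, 3-block data matrix, one observation per cell
# (values are hypothetical).
Y = np.array([[12.0, 14.0, 11.0],
              [15.0, 17.0, 16.0],
              [10.0, 12.0, 11.0]])
t_trt, r = Y.shape

grand_mean = Y.mean()            # Ybar.. , average of all rt observations
trt_means  = Y.mean(axis=1)      # Ybar_i. , one mean per treatment
blk_means  = Y.mean(axis=0)      # Ybar_.j , one mean per block
C = Y.sum() ** 2 / (r * t_trt)   # correction factor

print(round(grand_mean, 3), round(C, 3))
```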
The Total SS is, as usual, the numerator of the overall sample variance for all the data, without regard to treatments or, now, blocks:
$$ \text{Total SS} = \sum_{i=1}^{t}\sum_{j=1}^{r} (Y_{ij} - \bar{Y}_{\cdot\cdot})^2 = \sum_{i=1}^{t}\sum_{j=1}^{r} Y_{ij}^2 - C. $$
As before, this is based on rt − 1 independent quantities (df). As before, the numerator of an F ratio for testing treatment mean differences ought to involve a variance estimate containing a component due to variation among the treatment means. This quantity will be identical to the one we defined previously, by the same rationale. We thus have
$$ \text{Treatment SS} = r\sum_{i=1}^{t} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2 = \sum_{i=1}^{t} \frac{\left(\sum_{j=1}^{r} Y_{ij}\right)^2}{r} - C. $$
This will have t − 1 df, again by the same argument. This assessment of variation effectively ignores the blocks, as it is based on averaging over them. By an entirely similar argument with blocks instead of treatments, we arrive at analogous quantities:
$$ \text{Block SS} = t\sum_{j=1}^{r} (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot})^2 = \sum_{j=1}^{r} \frac{\left(\sum_{i=1}^{t} Y_{ij}\right)^2}{t} - C. $$
This will have r − 1 df. This assessment of variation effectively ignores the treatments, as it is based on averaging over them.

Error SS: Again, we need a SS that represents the variation we are attributing to experimental error. If we are to have the same situation as in the one-way classification, where all SSs are additive and sum to the Total SS, we must have Block SS + Treatment SS + Error SS = Total SS. Solving this equation for Error SS and doing the algebra, we arrive at
$$ \text{Error SS} = \sum_{i=1}^{t}\sum_{j=1}^{r} (Y_{ij} - \bar{Y}_{i\cdot} - \bar{Y}_{\cdot j} + \bar{Y}_{\cdot\cdot})^2. \qquad [28] $$
That is, if it were indeed true that we could partition Total SS into these components, then [28] would have to be the quantity characterizing experimental error. Inspection of this quantity seems pretty hopeless for attaching such an interpretation. In fact,
it is not hopeless. Recall our model restrictions
$$ \sum_{i=1}^{t} \tau_i = 0 \quad \text{and} \quad \sum_{j=1}^{r} \beta_j = 0 $$
and what they imply, namely, that the τi and βj may be regarded as deviations from an overall mean μ. Consider then estimation of these quantities under the restrictions and this interpretation; the obvious estimators are
$$ \hat{\mu} = \bar{Y}_{\cdot\cdot}, \quad \hat{\tau}_i = \bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}, \quad \hat{\beta}_j = \bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot}, $$
where the "hats" denote estimation of these quantities. Thus, if we wanted to estimate μij = μ + τi + βj, the estimator would be
$$ \hat{\mu}_{ij} = \bar{Y}_{\cdot\cdot} + (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}) + (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot}). $$
Because in our linear additive model εij = Yij − μij characterizes whatever variation we are attributing to experimental error, we would hope that an appropriate Error SS would be based on an estimate of εij. Such an estimate is
$$ \hat{\varepsilon}_{ij} = Y_{ij} - \hat{\mu}_{ij} = Y_{ij} - \{\bar{Y}_{\cdot\cdot} + (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}) + (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot})\} = Y_{ij} - \bar{Y}_{i\cdot} - \bar{Y}_{\cdot j} + \bar{Y}_{\cdot\cdot} $$
by algebra. Note that this quantity is precisely that in [28]; thus, our candidate quantity for Error SS does indeed make sense – it is the sum of squared deviations of the individual observations Yij from their "sample mean," an estimate of all "left over" variation, ε̂ij. The df associated with Error SS is the number of independent quantities on which it depends. Our partition and associated df are
$$ \text{Block SS} + \text{Treatment SS} + \text{Error SS} = \text{Total SS}, \qquad (r-1) + (t-1) + (t-1)(r-1) = rt - 1. $$
An ANOVA table is as follows (Table 1.3). Here,
$$ MS_T = \frac{\text{Treatment SS}}{t-1}, \quad MS_B = \frac{\text{Block SS}}{r-1}, \quad MS_E = \frac{\text{Error SS}}{(t-1)(r-1)}. $$
Statistical Hypotheses: The primary question of interest in this setting is to determine whether the means of the t treatment populations are different.

Table 1.3
Two-way ANOVA table – Randomized complete block design

| Source of variation | df              | SS (definition)                            | SS (computing form)                         | MS   | F               |
| Among blocks        | r − 1           | t ∑_{j=1}^{r} (Ȳ·j − Ȳ··)²                 | ∑_{j=1}^{r} (∑_{i=1}^{t} Yij)² / t − C      | MS_B | F_B = MS_B/MS_E |
| Among treatments    | t − 1           | r ∑_{i=1}^{t} (Ȳi· − Ȳ··)²                 | ∑_{i=1}^{t} (∑_{j=1}^{r} Yij)² / r − C      | MS_T | F_T = MS_T/MS_E |
| Error               | (t − 1)(r − 1)  | ∑_{i=1}^{t} ∑_{j=1}^{r} (Yij − Ȳi· − Ȳ·j + Ȳ··)² | by subtraction                        | MS_E |                 |
| Total               | rt − 1          | ∑_{i=1}^{t} ∑_{j=1}^{r} (Yij − Ȳ··)²       | ∑_{i=1}^{t} ∑_{j=1}^{r} Yij² − C            |      |                 |

We may write this formally as
$$ H_{0,T}: \mu_1 = \mu_2 = \cdots = \mu_t \quad \text{vs.} \quad H_{1,T}: \text{the } \mu_i \text{ are not all equal}. $$
This may also be written in terms of the τi, under the restriction $\sum_{i=1}^{t} \tau_i = 0$, as
$$ H_{0,T}: \tau_1 = \tau_2 = \cdots = \tau_t = 0 \quad \text{vs.} \quad H_{1,T}: \text{the } \tau_i \text{ are not all equal to } 0. $$
Test Procedure: Reject H0,T in favor of H1,T at level of significance α if F_T > F_{t−1,(t−1)(r−1),α}.

A secondary question of interest might be whether there is a systematic effect of blocks. We may write a set of hypotheses for this under our model restrictions as
$$ H_{0,B}: \beta_1 = \cdots = \beta_r = 0 \quad \text{vs.} \quad H_{1,B}: \text{the } \beta_j \text{ are not all equal to } 0. $$
Test Procedure: Reject H0,B in favor of H1,B at level of significance α if F_B > F_{r−1,(t−1)(r−1),α}.

Note: In most experiments, whether or not there are block differences is not really a main concern, because, by considering blocks up-front, we have acknowledged them as a possible nontrivial source of variation. If we test whether block effects differ and reject H0,B, then, by blocking, we have probably increased the precision of our experiment, our original objective.

Expected Mean Squares: We developed the above in the case where both the τi and βj have fixed effects. It is instructive to examine
the expected mean squares under our linear additive model in this case, as well as in the cases where (i) both τi and βj are random and (ii) τi is fixed but the βj are random (the "mixed" case). This allows insight into the suitability of the test statistics. Here, σε² is the variance associated with the εij, that is, corresponding to what we are attributing to experimental error in this situation. From Table 1.4, we have

Table 1.4
Expected mean squares

| Source of variation | Both fixed                       | Both random   | Mixed                            |
| MS_B                | σε² + t ∑_{j=1}^{r} βj²/(r − 1)  | σε² + t σβ²   | σε² + t σβ²                      |
| MS_T                | σε² + r ∑_{i=1}^{t} τi²/(t − 1)  | σε² + r στ²   | σε² + r ∑_{i=1}^{t} τi²/(t − 1)  |
| MS_E                | σε²                              | σε²           | σε²                              |
• Both τi and βj fixed: F_T and F_B are both appropriate. The MSs in the numerators of these statistics, MS_T and MS_B, estimate σε² (estimated by MS_E) plus an additional term that is equal to zero under H0,T and H0,B, respectively.
• Both τi and βj random: Note here that MS_T and MS_B estimate σε² plus an extra term involving the variances στ² and σβ², respectively, characterizing variability in the populations of all possible treatments and blocks. Thus, under these conditions, F_T and F_B are appropriate for testing
$$ H_{0,T}: \sigma_\tau^2 = 0 \ \text{vs.}\ H_{1,T}: \sigma_\tau^2 > 0, \qquad H_{0,B}: \sigma_\beta^2 = 0 \ \text{vs.}\ H_{1,B}: \sigma_\beta^2 > 0. $$
These hypotheses are the obvious ones of interest when the τi and βj are random.
• "Mixed" model: For τi fixed and βj random, the same observations as above apply. F_T and F_B are appropriate, respectively, for testing
$$ H_{0,T}: \tau_1 = \tau_2 = \cdots = \tau_t = 0 \ \text{vs.}\ H_{1,T}: \text{the } \tau_i \text{ are not all equal to } 0 $$
and
$$ H_{0,B}: \sigma_\beta^2 = 0 \ \text{vs.}\ H_{1,B}: \sigma_\beta^2 > 0. $$
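The sum-of-squares partition and F ratios for the randomized complete block design can be sketched end to end on simulated data; every parameter value below is a hypothetical illustration:

```python
import numpy as np
from scipy import stats

# Two-way ANOVA for a randomized complete block design, computed from
# the defining formulas in the text; data are simulated for illustration.
rng = np.random.default_rng(2)
t_trt, r = 4, 6
tau = np.array([3.0, 0.0, -1.0, -2.0])       # fixed treatment effects
beta = rng.normal(0.0, 1.0, r)               # random block effects
Y = 50.0 + tau[:, None] + beta[None, :] + rng.normal(0.0, 1.5, (t_trt, r))

grand = Y.mean()
total_ss = ((Y - grand) ** 2).sum()
trt_ss = r * ((Y.mean(axis=1) - grand) ** 2).sum()
blk_ss = t_trt * ((Y.mean(axis=0) - grand) ** 2).sum()
err_ss = total_ss - trt_ss - blk_ss          # "by subtraction"

MST = trt_ss / (t_trt - 1)
MSB = blk_ss / (r - 1)
MSE = err_ss / ((t_trt - 1) * (r - 1))
FT, FB = MST / MSE, MSB / MSE

# Reject H0,T at level 0.05 if FT exceeds F_{t-1,(t-1)(r-1),0.05}:
f_crit = stats.f.ppf(0.95, t_trt - 1, (t_trt - 1) * (r - 1))
print(round(FT, 2), round(FB, 2), round(f_crit, 2))
```

The Error SS obtained by subtraction agrees with the direct formula [28], i.e., with the sum of squared terms (Yij − Ȳi· − Ȳ·j + Ȳ··)².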
• Blocking may be an effective means of explaining variation (increasing precision), so that differences among treatments that really exist are more likely to be detected.
• The data from an experiment set up according to a particular design should be analyzed according to the
appropriate procedure for that design! The above shows that if we set up the experiment according to a randomized complete block design, but then analyzed it as if it had been set up according to a completely randomized design, erroneous inferences would result – in this case, failure to identify real differences among treatments. The design of an experiment dictates the analysis.

Remark: For more advanced or complex designs, readers should refer to Dr. Davidian's original lecture notes or relevant statistical textbooks.

3.3.4. More on Violation of Assumptions
Throughout this section, we have made the point that the statistical methods we are studying are based on certain assumptions. ANOVA methods rely on the assumptions of normality and constant variance, with additive error. We have already discussed the notion that this may be reasonable for many forms of continuous measurement data, and that a logarithmic transformation is often useful for achieving approximate normality and constant variance for many types of continuous data as well. However, there are many situations where our data are in the form of counts or proportions, which are not continuous across a large range of values. In these situations, our interest still lies in assessing differences among treatments; however, the assumptions of normality and constant variance are certainly violated. It is well known for both count and proportion data that the variance is not constant but rather depends on the size of the mean. Furthermore, histograms for such data are often highly asymmetric. The methods we have discussed may still be used in these situations provided that a suitable transformation is applied. That is, although the distribution of Y may not be normal with constant variance for all treatments and blocks, it may be possible to transform the data and analyze them on the transformed scale, where these assumptions are more realistic. Selection of an appropriate transformation of the data, h, say, is often based on the type of data; the values h(Yij) are treated as the data and analyzed in the usual way. Some common transformations are
• Square root: h(Y) = √Y. This is often appropriate for count data with small values.
• Logarithmic: h(Y) = log Y. We have already discussed the use of this transformation for data where errors tend to have a multiplicative effect, such as growth data. Sometimes the log transformation is useful for count data over a large range.
• Arc sine: h(Y) = arcsin(√Y), also written sin⁻¹(√Y). This transformation is appropriate when the data are in the form of percentages or proportions.
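Applying the three transformations is straightforward; the data vectors below are hypothetical, each chosen to match the data type its transformation suits:

```python
import numpy as np

# Common variance-stabilizing transformations, each applied to a small
# hypothetical data vector of the type it suits.
counts = np.array([0.0, 1.0, 3.0, 8.0, 15.0])   # small counts
growth = np.array([1.2, 3.5, 9.8, 27.1])        # multiplicative errors
props  = np.array([0.05, 0.30, 0.70, 0.95])     # proportions in [0, 1]

h_sqrt   = np.sqrt(counts)            # square root
h_log    = np.log(growth)             # logarithmic
h_arcsin = np.arcsin(np.sqrt(props))  # arc sine (result in radians)

print(h_sqrt.round(2), h_log.round(2), h_arcsin.round(2))
```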
4. Simple Linear Regression and Correlation
So far, we have focused our attention on problems where the main issue is identifying differences among treatment means. In this setting, we based our inferences upon observations on a r.v. Y under the various experimental conditions (treatments, blocks, etc.). Another problem that arises in the biological and physical sciences, economics, industrial applications, and biomedical settings is that of investigating the relationship between two (or more) variables. Depending on the nature of the variables and the observations on them (more on this in a moment), the methods of regression analysis or correlation analysis are appropriate. In reality, our development of the methods for identifying differences among treatment means, those of ANOVA, is in fact very similar to regression analysis, as will become apparent in our discussion. Both sets of methods are predicated on representing the data by a linear, additive model, where the model includes components representing both systematic and random sources of variation. The common features will become evident over the course of our discussion. In this section, we will restrict our study to the simplest case in which we have two variables for which the relationship between them is reasonably assumed to be a straight line. Note, however, that the areas of regression analysis and correlation analysis are much broader than indicated by our introduction here.

Terminology: It is important to clarify the usage of the term linear in statistics. Linear refers to how a component of an equation describing a relationship enters that relationship. For example, in the one-way classification model, recall that we represented an observation as Yij = μ + τi + εij. This equation is said to be linear because the components μ, τi, and εij enter directly. Contrast this with an example of an equation nonlinear in μ and τi: Yij = exp(μ + τi) + εij.
This equation still has an additive error, but the parameters of interest μ and τi enter in a nonlinear fashion, through the exponential function. The term linear is thus used in statistical applications in this broad way to indicate that parameters enter in a straightforward rather than complicated way. The term linear regression has a similar interpretation, as we will see. The term simple linear regression refers to the particular case where the relationship is a straight line. The use of the term linear thus refers
to how parameters enter the model for the data in a general sense and does not necessarily mean that the relationship is a straight line, except when prefaced by the term simple.

Scenario: We are interested in the relationship between two variables, which we will call X and Y. We observe pairs of X and Y values on each of a sample of experimental units, and we wish to use them to say something about the relationship. How we view the relationship is dictated by the situation:

"Experimental" Data: Here, observations on X and Y are planned as the result of an experiment, laboratory procedure, etc. For example,
• X = dose of a drug, Y = response such as change in blood pressure for a human subject
• X = concentration of toxic substance, Y = number of mutant offspring observed for a pregnant rat
In these examples, we are in control of the values of X (e.g., we choose the doses or concentrations) and we observe the resulting Y.

"Observational" Data: Here, we observe both X and Y values, neither of which is under our control. For example,
• X = weight, Y = height of a human subject
• X = average height of plants in a plot, Y = yield
In the experimental data situations, there is a distinction between what we call X and what we call Y, because the former is dictated by the investigator. It is standard to use these symbols in this way. In the observational data examples, there is not necessarily such a distinction. In any event, if we use the symbols in this way, then what we call Y is always understood to be something we observe, while X may or may not be.

Relationships Between Two Variables: In some situations, scientific theory may suggest that two variables are functionally related, e.g., Y = g(X), where g is some function. The form of g may follow from some particular theory.
Even if there is no suitable theory, we may still suspect some kind of systematic relationship between X and Y and may be able to identify a function g that provides a reasonable empirical description. Objective: Based on a sample of observations on X and Y , formally describe and assess the relationship between them. Practical Problem: In most situations, the values we observe for Y (and sometimes X, certainly in the case of observational data) are not exact. In particular, due to biological variation among experimental units and the sampling of them, imprecision and/or inaccuracy of measuring devices, and so on, we may only
Experimental Statistics for Biological Sciences
83
observe values of Y (and also possibly X) with some error. Thus, based on a sample of (X,Y) pairs, our ability to see the relationship exactly is obscured by this error. Random vs. Fixed X: Given these issues, it is natural to think of Y (and perhaps X) as r.v.’s. How we do this is dictated by the situation, as above: • Experimental data: Here, X (dose, concentration) is fixed at predetermined levels by the experimenter. Thus, X is best viewed as a fixed quantity (like treatment in our previous situations). Y , on the other hand, which is subject to biological and sampling variation and error in measurement, is a r.v. Clearly, the values for Y we do get to see will be related to the fixed values of X. • Observational data: Consider Y = height, X = weight. In this case, neither weight nor height is a fixed quantity; both are subject to variation. Thus, both X and Y must be viewed as r.v.’s. Clearly, the values taken on by these two r.v.’s are related or associated somehow. Statistical Models: These considerations dictate how we think of a formal statistical model for the situation: • Experimental data: A natural way to think about Y is by representing it as Y = g(X ) + ε.
[29]
Here, then, we believe the function g describes the relationship, but values of Y we observe are not exactly equal to g(X) because of the errors mentioned above. The additive "error" ε characterizes this, just as in our previous models. In this situation, the following terminology is often used: Y = response or dependent variable; X = concomitant or independent variable, or covariate.
• Observational data: In this situation, there is really not much distinction between X and Y, as both are seen with error. Here, the terms independent and dependent variable may be misleading. For example, if we have observed pairs of X = weight, Y = height, it is not necessarily clear whether we should be interested in a relationship Y = g(X) or X = h(Y), say. Even if we have in our mind that we want to think of the relationship in a particular way, say Y = g(X), it is clear that the above model [29] is not really appropriate, as it does not take into account "error" affecting X.
84
Bang and Davidian
4.1. Simple Linear Regression Model
Straight Line Model: Consider the particular situation of experimental data, where it is legitimate to regard X as fixed. It is often reasonable to suppose that the relationship between Y and X, which we have called in general g, is in fact a straight line. We may write this as Y = β0 + β1 X + ε
[30]
for some values β0 and β1. Here, then, g is a straight line with
Intercept β0: the value taken on at X = 0
Slope β1: expresses the rate of change in Y, i.e., β1 = change in Y brought about by a change of one unit in X.
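As a quick illustration (with hypothetical numbers, not from the text), the intercept is the mean at X = 0 and the slope is the change in the mean per unit change in X:

```python
# Straight-line model: mean of Y at X is b0 + b1*X (hypothetical values).
b0, b1 = 2.0, 0.5

def mean_y(x):
    return b0 + b1 * x

print(mean_y(0.0))                # intercept: mean at X = 0 -> 2.0
print(mean_y(4.0) - mean_y(3.0))  # slope: change per unit of X -> 0.5
```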
Issue: The problem is that we do not know β0 or β1. To get information on their values, the typical experimental setup is to choose values Xi, i = 1, . . . , n, and observe the resulting responses Y1, . . . , Yn, so that the data consist of the pairs (Xi, Yi), i = 1, . . . , n. The data are then used to estimate β0 and β1, i.e., to fit the model to the data, in order to
• quantify the relationship between Y and X.
• use the relationship to predict a new response Y0 we might observe at a given value X0 (perhaps one not included in the experiment).
• use the relationship to calibrate – given a new Y0 value we might see, for which the corresponding value X0 is unknown, estimate the value X0.
The model [30] is referred to as a simple linear regression model. The term regression refers to the postulated relationship.
• The regression relationship in this case is the straight line β0 + β1 X.
• The parameters β0 and β1 characterizing this relationship are called regression coefficients or regression parameters.
If we think of our data (Xi, Yi), we may thus think of a model for Yi as follows:
Yi = β0 + β1 Xi + εi = μi + εi, μi = β0 + β1 Xi.
This looks very much like our linear additive model in the ANOVA, but with only one observation on each "treatment." That is, μi = β0 + β1 Xi is the mean of observations we would see at the particular setting Xi. In the one-way classification situation, we would certainly not be able to estimate each mean μi with a single observation; however, here, because we also have the variable X, and are willing to represent μi as a particular function of X (here, the straight line), we will be able to take advantage of this functional relation to estimate μi. Hence, we only need the single subscript i. The "errors" εi characterize all the sources of inherent variation that cause Yi to not exactly equal its mean, μi = β0 + β1 Xi. We may think of this as experimental error – all unexplained inherent variation due to the experimental unit. At any Xi value, the mean of responses Yi we might observe is μi = β0 + β1 Xi. The means are on a straight line over all X values. Because of "error," at any Xi value, the Yi vary about the mean μi, so they do not lie on the line but are scattered about it.
Objective: For the simple linear regression model, fit the line to the data to serve as our "best" characterization of the relationship based on the available data. More precisely, estimate the parameters β0 and β1 that characterize the mean at any X value.
Remark: We may now comment more precisely on the meaning of the term linear regression. In practice, the regression need not be a straight line, nor need there be a single independent variable X. For example, the underlying relationship between Y and X (that is, the mean) may be more adequately represented by a curve like β0 + β1 X + β2 X².
[31]
Or, if Y is some measure of growth of a plant and X is time, we would eventually expect the relationship to level off when X gets large, as plants cannot continue to get large without bound! A popular model for this is the logistic growth function
β1/(1 + β2 e^(β3 X)).
[32]
The curve for this looks much like that above, but would begin to “flatten out” if we extended the picture for large values of X. In the quadratic model [31], note that, although the function is no longer a straight line, it is still a straightforward function of the regression parameters characterizing the curve, β0 , β1 , β2 . In particular, β0 , β1 , and β2 enter in a linear fashion. Contrast this with the logistic growth model [32]. Here, the regression parameters characterizing the curve do not enter the model in a straightforward fashion. In particular, the parameters β2 and β3 appear in a quite complicated way, in the denominator. This function is thus not linear as a function of β1 , β2 , and β3 ; rather, it is better described as nonlinear. It turns out that linear functions are much easier to work with than nonlinear functions. Although we will work strictly with the simple linear regression model, be aware that the methods we discuss
extend easily to more complex linear models like [31], but they do not extend as easily to nonlinear models, which need more advanced techniques.
4.1.1. The Bivariate Normal Distribution
Consider the situation of observational data. Because both X and Y are subject to error, both are r.v.'s that are somehow related. Recall that a probability distribution provides a formal description of the population of possible values that might be taken on by a r.v. So far, we have only discussed this notion in the context of a single r.v. It is possible to extend the idea of a probability distribution to two r.v.'s. Such a distribution is called a bivariate probability distribution. This distribution describes not only the populations of possible values that might be taken on by the two r.v.'s, but also how those values are taken on together. Consider our X and Y. Formally, we would think of a probability distribution function f(x, y) that describes the populations of X and Y and how they are related; i.e., how X and Y vary together.
Bivariate Normal Distribution: Recall that the normal distribution is often a reasonable description of a population of continuous measurements. When both X and Y are continuous measurements, a reasonable assumption is that they are both normally distributed. However, we also expect them to vary together. The bivariate normal distribution is a probability distribution with a probability density function f(x, y) for both X and Y such that
• The two r.v.'s X and Y each have normal distributions with means μX and μY and variances σ²X and σ²Y, respectively.
• The relationship between X and Y is characterized by a quantity ρXY such that −1 ≤ ρXY ≤ 1. ρXY is referred to as the correlation coefficient between the two r.v.'s X and Y and measures the linear association between values taken on by X and values taken on by Y:
ρXY = 1: all possible values of X and Y lie on a straight line with positive slope
ρXY = −1: all possible values of X and Y lie on a straight line with negative slope
ρXY = 0: there is no linear association between X and Y
Objective: Given our observed data pairs (Xi, Yi), we would like to quantify the degree of association. To do this, we estimate ρXY.
4.1.2. Comparison of Regression and Correlation Models
We have identified two appropriate statistical models for thinking about the problem of assessing association between two variables X and Y . These may be thought of as
• Fixed X: Postulate a model for the mean of the r.v. Y as a function of the fixed quantity X (in particular, we focused on a straight line in X). Estimate the parameters in the model to characterize the relationship. • Random X: Characterize the (linear) relationship between X and Y by the correlation between them (in a bivariate normal probability model) and estimate the correlation parameter. It turns out that the arithmetic operations for regression analysis under the first scenario and correlation analysis under the second are the same! That is, to fit the regression model by estimating the intercept and slope parameters and to estimate the correlation coefficient, we use the same operations on our data! The important issue is in the interpretation of the results. Subtlety: In settings where X is best regarded as a r.v., many investigators still want to fit regression models treating X as fixed. This is because, although correlation describes the “degree of association” between X and Y , it doesn’t characterize the relationship in a way suitable for some purposes. For example, an investigator may desire to predict the yield of a plot based on observing the average height of plants in the plot. The correlation coefficient does not allow this. He thus would rather fit a regression model, even though X is random. Is this legitimate? If we are careful about the interpretation, it may be. If X and Y are really both observed r.v.’s, and we fit a regression to characterize the relationship, technically, any subsequent analyses based on this are regarded as conditional on the values of X involved. This means that we essentially regard X as “fixed,” even though it isn’t. However, this may be okay for the prediction problem above. Conditional on having seen a particular average height, he wants to get a “best guess” for yield. 
He is not saying that he could control heights and thereby influence yields, only that, given he sees a certain height, he might be able to say something about the associated yield. This subtlety is an important one. Inappropriate use of statistical techniques may lead one to erroneous or irrelevant inferences. It's best to consult a statistician for help in identifying both a suitable model framework and the conditions under which regression analysis may be used with observational data.
4.1.3. Fitting a Simple Linear Regression Model – The Method of Least Squares
Now that we have discussed some of the conceptual issues involved in studying the relationship between two variables, we are ready to describe practical implementation. We do this first for fitting a simple linear regression model. Throughout this discussion, assume that it is legitimate to regard the X’s as fixed. For observations (Xi , Yi ), i = 1, . . . , n, we postulate the simple linear regression model Yi = β0 + β1 Xi + εi , i = 1, . . . , n.
We wish to fit this model by estimating the intercept and slope parameters β0 and β1.
Assumptions: For the purposes of making inference about the true values of intercept and slope, making predictions, and so on, we make the following assumptions. These are often reasonable, and we will discuss violations of them later in this section.
1. The observations Y1, . . . , Yn are independent in the sense that they are not related in any way. [For example, they are derived from different animals, subjects, etc. They might also be measurements on the same subject, but taken far enough apart in time that the value at one time is totally unrelated to that at another.]
2. The observations Y1, . . . , Yn have the same variance, σ². [Each Yi is observed at a possibly different Xi value and is thought to have mean μi = β0 + β1 Xi. At each Xi value, we may thus think of the possible values for Yi and how they might vary. This assumption says that, regardless of which Xi we consider, this variation in possible Yi values is the same.]
3. The observations Yi are each normally distributed with mean μi = β0 + β1 Xi, i = 1, . . . , n, and variance σ² (the same, as in 2 above). [That is, for each Xi value, we think of all the possible values taken on by Y as being well represented by a normal distribution.]
The Method of Least Squares: There is no one way to estimate β0 and β1. The most widely accepted method is that of least squares. This idea is intuitively appealing. It also turns out to be, mathematically speaking, the appropriate way to estimate these parameters under the normality assumption 3 above. For each Yi, note that Yi − (β0 + β1 Xi) = εi; that is, the deviation Yi − (β0 + β1 Xi) is a measure of the vertical distance of the observation Yi from the line β0 + β1 Xi that is due to the inherent variation (represented by εi). This deviation may be negative or positive.
A natural way to measure the overall deviation of the observed data Yi from their means, the regression line β0 + β1 Xi, due to this error is
Σ_{i=1}^{n} {Yi − (β0 + β1 Xi)}².
This has the same appeal as a sample variance – we ignore the signs of the deviations but account for their magnitude. The method of least squares derives its name from thinking about this measure. In particular, we want to find the estimates of β0 and β1 that are the “most plausible” to have generated the data. Thus,
Fig. 1.2. Illustration of Linear Least Squares.
a natural way to think about this is to choose as estimates the values βˆ0 and βˆ1 that make this measure of overall variation as small as possible (that is, which minimize it). This way, we are attributing as much of the overall variation in the data as possible to the assumed straight line relationship. Formally, then, βˆ0 and βˆ1 minimize
Σ_{i=1}^{n} (Yi − βˆ0 − βˆ1 Xi)².
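As a sketch (with made-up toy data, not from the text), this criterion can be written as a function and checked numerically: the least squares solution should give a smaller sum of squared deviations than any nearby line.

```python
# Sum of squared vertical deviations for a candidate line (b0, b1).
def sse(b0, b1, X, Y):
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))

# Toy data (hypothetical, for illustration only).
X = [0, 1, 2, 3]
Y = [1.1, 1.9, 3.2, 3.8]

# Closed-form least squares solution (derived in the text).
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / sum((x - xbar) ** 2 for x in X)
b0 = ybar - b1 * xbar

# The fitted line minimizes the criterion: perturbing it can only increase SSE.
assert sse(b0, b1, X, Y) <= sse(b0, b1 + 0.1, X, Y)
assert sse(b0, b1, X, Y) <= sse(b0 + 0.1, b1, X, Y)
```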
This is illustrated in Fig. 1.2. The line fitted by least squares is the one that makes the sum of the squares of all the vertical discrepancies as small as possible. To find the form of the estimators βˆ0 and βˆ1, calculus may be used to solve this minimization problem. Define

SXY = Σ_{i=1}^{n} (Xi − X¯)(Yi − Y¯) = Σ_{i=1}^{n} Xi Yi − (Σ_{i=1}^{n} Xi)(Σ_{i=1}^{n} Yi)/n,
SXX = Σ_{i=1}^{n} (Xi − X¯)² = Σ_{i=1}^{n} Xi² − (Σ_{i=1}^{n} Xi)²/n,
SYY = Σ_{i=1}^{n} (Yi − Y¯)² = Σ_{i=1}^{n} Yi² − (Σ_{i=1}^{n} Yi)²/n,

where X¯ and Y¯ are the sample means of the Xi and Yi values, respectively. Then the calculus arguments show that the values βˆ0 and βˆ1 minimizing the sum of squared deviations above satisfy

βˆ1 = SXY/SXX,  βˆ0 = Y¯ − βˆ1 X¯.
Thus, the fitted straight line is given by Yˆi = βˆ0 + βˆ1 Xi. The "hat" on the Yi emphasizes the fact that these values are our "best guesses" for the means at each Xi value and that the actual values Y1, . . . , Yn we observed may not fall on the line. The Yˆi are often called the predicted values; they are the estimated values of the means at the Xi.
Example: (Zar, Biostatistical Analysis, p. 225) The following data are rates of oxygen consumption of birds (Y) measured at different temperatures (X). Here, the temperatures were set by the investigator, and the Y was measured, so the assumption of fixed X is justified.

X [°C]:        −18   −15   −10   −5    0     5     10    19
Y [(ml/g)/hr]: 5.2   4.7   4.5   3.6   3.4   3.1   2.7   1.8
Calculations: We have n = 8.

Σ_{i=1}^{n} Yi = 29, Y¯ = 3.625, Σ_{i=1}^{n} Yi² = 114.04
Σ_{i=1}^{n} Xi = −14, X¯ = −1.75, Σ_{i=1}^{n} Xi² = 1160, Σ_{i=1}^{n} Xi Yi = −150.4

SXY = −150.4 − (29)(−14)/8 = −99.65
SXX = 1160 − (−14)²/8 = 1135.5
SYY = 114.04 − (29)²/8 = 8.915

Thus, we obtain
βˆ1 = −99.65/1135.5 = −0.0878,
βˆ0 = 3.625 − (−0.0878)(−1.75) = 3.4714.

The fitted line is Yˆi = 3.4714 − 0.0878 Xi.
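These hand calculations are easy to verify by machine; a minimal sketch recomputing the least squares fit for the oxygen data:

```python
# Least squares fit of the oxygen-consumption data via SXY/SXX.
X = [-18, -15, -10, -5, 0, 5, 10, 19]
Y = [5.2, 4.7, 4.5, 3.6, 3.4, 3.1, 2.7, 1.8]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n        # -1.75, 3.625
SXY = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))
SXX = sum((x - xbar) ** 2 for x in X)
b1 = SXY / SXX                             # slope estimate
b0 = ybar - b1 * xbar                      # intercept estimate
print(round(b1, 4), round(b0, 4))          # -> -0.0878 3.4714
```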
Remark: It is always advisable to plot the data before analysis, to ensure that the model assumptions seem valid.
4.1.4. Assessing the Fitted Regression
Recall for a single sample, we use Y¯ as our estimate of the mean and use the SE sY¯ as our estimate of precision of Y¯ as an estimator of the mean. Here, we wish to do the same thing. How precisely have we estimated the intercept and slope parameters, and, for that matter, the line overall? Specifically, we would like to quantify
• The precision of the estimate of the line
• The variability in the estimates βˆ0 and βˆ1.
Consider the identity Yi − Y¯ = (Yˆi − Y¯) + (Yi − Yˆi). Algebra and the fact that Yˆi = βˆ0 + βˆ1 Xi = Y¯ + βˆ1(Xi − X¯) may be used to show that
Σ_{i=1}^{n} (Yi − Y¯)² = Σ_{i=1}^{n} (Yˆi − Y¯)² + Σ_{i=1}^{n} (Yi − Yˆi)².    [33]
The quantity on the left-hand side of this expression is one you should recognize – it is the Total SS for the set of data. For any set of data, we may always compute the Total SS as the sum of squared deviations of the observations from the (overall) mean, and it serves as a measure of the overall variation in the data. Thus, [33] represents a partition of our assessment of overall variation in the data, Total SS, into two independent components.
• (Yˆi − Y¯) is the deviation of the predicted value of the ith observation from the overall mean. Y¯ would be the estimate of mean response at all X values we would use if we did not believe X played a role in the values of Y. Thus, this deviation measures the difference between going to the trouble to have a separate mean for each X value and just using a single, common mean as the model. We would expect the sum of squared deviations
Σ_{i=1}^{n} (Yˆi − Y¯)²
to be large if using separate means via the regression model is much better than using a single mean. Using a single mean effectively ignores the Xi, so we may think of this as measuring the variation in the observations that may be explained by the regression line β0 + β1 Xi.
• (Yi − Yˆi) is the deviation between the predicted value for the ith observation (our "best guess" for its mean) and the observation itself (that we observed). Hence, the sum of squared deviations
Σ_{i=1}^{n} (Yi − Yˆi)²
measures any additional variation of the observations about the regression line, that is, the inherent variation in the data at each Xi value that causes observations not to lie on the line. Thus, the overall variation in the data, as measured by Total SS, may be broken down into two components that each characterize parts of the variation:
• Regression SS = Σ_{i=1}^{n} (Yˆi − Y¯)², which measures that portion of the variability that may be explained by the regression relationship (so is actually attributable to a systematic source, the assumed straight line relationship between Y and X).
• Error SS (also called Residual SS) = Σ_{i=1}^{n} (Yi − Yˆi)², which measures the inherent variability in the observations (e.g., experimental error).
Result: We hope that the (straight line) regression relationship explains a good part of the variability in the data. A large value of Regression SS would in some sense indicate this.
Coefficient of Determination: One measure of this is the ratio
R² = Regression SS / Total SS.
R² is called the coefficient of determination or the multiple correlation coefficient. (This second name arises from the fact that it turns out to be, algebraically, the value we would use to "estimate" the correlation between the Yi and Yˆi values and is not to be confused with correlation as we have discussed it previously.) Intuitively, R² is a measure of the "proportion of total variation in the data explained by the assumed straight line relationship with X." Note that we must have 0 ≤ R² ≤ 1, because both components are nonnegative and the numerator can be no larger than the denominator. Thus, an R² value close to 1 is often taken as evidence that the regression model does "a good job" at describing the variability in the data, better than if we just assumed a common mean (and ignored Xi). It is critical to understand what R² does and does not measure. R² is computed under the assumption that the simple linear regression model is correct, i.e., that it is a good description of the underlying relationship between Y and X. Thus, it assesses, if the relationship between X and Y really is a straight line, how much of the variation in the data may actually be attributed to
Table 1.5 ANOVA – Simple linear regression

Source      DF    SS                          MS               F
Regression  1     Σ_{i=1}^{n} (Yˆi − Y¯)²     MSR = SS/1       FR = MSR/MSE
Error       n−2   Σ_{i=1}^{n} (Yi − Yˆi)²     MSE = SS/(n−2)
Total       n−1   Σ_{i=1}^{n} (Yi − Y¯)²
that relationship rather than just to inherent variation. If R² is small, it may be that there is a lot of random inherent variation in the data, so that, although the straight line is a reasonable model, it can only explain so much of the observed overall variation.
ANOVA: The partition of Total SS above has the same interpretation as in the situations we have already discussed. Thus, it is common practice to summarize the results in an ANOVA table as in Table 1.5. Note that Total SS has n − 1 df, as always. It may be shown that
Regression SS = Σ_{i=1}^{n} (Yˆi − Y¯)² = βˆ1² Σ_{i=1}^{n} (Xi − X¯)².
βˆ1 is a single function of the Yi; thus, it is a single independent quantity, and we see that Regression SS has 1 df. By subtraction, Error SS has n − 2 df.
Calculations:
Regression SS = S²XY/SXX.
Total SS (= SYY) is calculated in the usual way. Thus, Error (Residual) SS may be found by subtraction. It turns out that the expected mean squares, that is, the values estimated by MSR and MSE, are
MSR: σ² + β1² Σ_{i=1}^{n} (Xi − X¯)²
MSE: σ².
Thus, if β1 ≈ 0, the two MSs we observe should be about the same, and we would expect FR to be small. However, note that β1 = 0 implies that the true regression model is Yi = β0 + εi; that is, there is no association with X (slope = 0) and thus all Yi have the same mean β0, regardless of the value of Xi. There is a straight line relationship, but it has slope 0, which effectively means no relationship.
Table 1.6 ANOVA table for the oxygen data

Source            DF   SS      MS      F
Regression        1    8.745   8.745   308.927
Error (Residual)  6    0.170   0.028
Total             7    8.915
If β1 ≠ 0, then we would expect the ratio FR to be large. It is possible to show mathematically that a test of the hypotheses H0: β1 = 0 vs. H1: β1 ≠ 0 may be carried out by comparing FR to the appropriate value from an F1,n−2 distribution. That is, the statistic FR may be shown to have this distribution if H0 is true. Thus, the procedure would be: Reject H0 at level of significance α if FR > F1,n−2,α.
The interpretation of the test is as follows. Under the assumption that a straight line relationship exists, we are testing whether or not the slope of this relationship is in fact zero. A zero slope means that there is no systematic change in mean along with change in X, that is, no association. It is important to recognize that if the true relationship is not a straight line, then this may be a meaningless test.
Example: For the oxygen consumption data, assume that the data are approximately normally distributed with constant variance. We have Total SS = SYY = 8.915, SXY = −99.65, SXX = 1135.5 from before, with n = 8. Thus,
Regression SS = (−99.65)²/1135.5 = 8.745,
Error (Residual) SS = 8.915 − 8.745 = 0.170
(see Table 1.6).
We have F1,6,0.05 = 5.99. FR = 308.927 > 5.99; thus, we reject H0 at level of significance α = 0.05. There is strong evidence in these data to suggest that, under the assumption that the simple linear regression model is appropriate, the slope is not zero, so that an association appears to exist. Note that
R² = 8.745/8.915 = 0.981;
thus, as the straight line assumption appears to be consistent with the visual evidence in the plot of the data, it is reasonable to conclude that the straight line relationship explains a very high proportion of the variation in the data (the fact that Y i values are different is mostly due to the relationship with X). Estimate of σ 2 : If we desire an estimate of the variance σ 2 associated with inherent variation in the Y i values (due to variation among experimental units, sampling, and measurement error), from the expected mean squares above, the obvious estimate is the Error (Residual) MS. That is, denoting the estimate by s 2 , s 2 = MSE .
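The SS decomposition, R², and F ratio for the oxygen example can be verified numerically; a minimal sketch:

```python
# ANOVA quantities for the oxygen-consumption data, from the fitted line.
X = [-18, -15, -10, -5, 0, 5, 10, 19]
Y = [5.2, 4.7, 4.5, 3.6, 3.4, 3.1, 2.7, 1.8]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / sum((x - xbar) ** 2 for x in X)
b0 = ybar - b1 * xbar
Yhat = [b0 + b1 * x for x in X]

reg_ss = sum((yh - ybar) ** 2 for yh in Yhat)          # Regression SS
err_ss = sum((y - yh) ** 2 for y, yh in zip(Y, Yhat))  # Error (Residual) SS
tot_ss = sum((y - ybar) ** 2 for y in Y)               # Total SS
R2 = reg_ss / tot_ss
FR = (reg_ss / 1) / (err_ss / (n - 2))                 # MSR / MSE
print(round(reg_ss, 3), round(err_ss, 3), round(R2, 3))
```

The printed values match the hand calculations in the text (≈ 8.745, 0.170, and 0.981), and FR exceeds the critical value 5.99.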
4.1.5. Confidence Intervals for Regression Parameters and Means
Because β0, β1, and in fact the entire regression line, are population parameters that we have estimated, we wish to attach some measure of precision to our estimates of them.
SEs and CIs for Regression Parameters: It may be shown, under our assumptions, that, if the relationship really is a straight line, the SDs of the populations of all possible βˆ1 and βˆ0 values are
SD(βˆ1) = σ/√SXX,  SD(βˆ0) = σ √( Σ_{i=1}^{n} Xi² / (n SXX) ),
respectively. Because σ is not known, we estimate these SDs by replacing σ by the estimate s. We thus obtain the estimated SDs
EST SD(βˆ1) = s/√SXX,  EST SD(βˆ0) = s √( Σ_{i=1}^{n} Xi² / (n SXX) ),
respectively. These are often referred to, analogous to a single sample mean, as the SEs of βˆ1 and βˆ0. It may also be shown under our assumptions that
(βˆ1 − β1)/EST SD(βˆ1) ∼ tn−2 and (βˆ0 − β0)/EST SD(βˆ0) ∼ tn−2.
[34]
These results are similar in spirit to those for a single mean and difference of means; the t distribution is relevant rather than the normal because we have replaced σ by an estimate (with n − 2 df). Because we are estimating the true parameters β1 and β0 by these estimates, it is common practice to provide a CI for the true values β1 and β0 , just as we did for a sample mean or difference of means. The derivation and the interpretation are the same. By
“inverting” probability statements about the quantities in [34] in the same fashion, we arrive at the following 100(1 − α)% CIs: Interval for β1 : βˆ1 − tn−2,α/2 EST SD(βˆ1 ) , βˆ1 + tn−2,α/2 EST SD(βˆ1 ) Interval for β0 : βˆ0 − tn−2,α/2 EST SD(βˆ0 ) , βˆ0 + tn−2,α/2 EST SD(βˆ0 ) . The interpretation is as follows: Suppose that zillions of experiments were conducted using the same (fixed) Xi values as those in the observed experiment. Suppose that for each of these, we fitted the regression line by the above procedures and calculated 100(1 − α)% CIs for β1 (and β0 ). Then for 100(1 − α)% of these, the true value of β1 (β0 ) would fall between the endpoints. The endpoints are a function of the data; thus, whether or not β1 (β0 ) falls within the endpoints is a function of the experimental procedure. Thus, just as in our earlier cases, the CI is a statement about the quality of the experimental procedure for learning about the value of β1 (β0 ). SE and CI for the Mean: Our interest in the values of β0 and β1 is usually because we are interested in the characteristics of Y at particular X values. Recall that μi = β0 + β1 Xi is the mean of the Y i values at the value of Xi . Thus, just as we are interested in estimating the mean of a single sample to give us an idea of the “center” of the distribution of possible values, we may be interested in estimating the mean of Y i values at a particular value of X. For example, an experiment may have been conducted to characterize the relationship between concentration of a suspected toxicant (X) and a response like number of mutated cells (Y ). Based on the results, the investigators may wish to estimate the numbers of mutations that might be seen on average at other concentrations not considered in the experiment. That is, they are interested in the “typical” (mean) number of mutations at particular X values. Consider this problem in general. 
Suppose we have fitted a regression line to data, and we wish to estimate the mean response at a new value X0. That is, we wish to estimate μ0 = β0 + β1 X0. The obvious estimator for μ0 is the value of this expression with the estimates of the regression parameters plugged in, that is, μˆ0 = βˆ0 + βˆ1 X0.
In the example, μˆ0 is thus our estimate of the average number of mutations at some concentration X0. Note that, of course, the estimate of the mean will depend on the value of X0. Because μˆ0 is an estimate based on our sample, we again would like to attach to it some estimate of precision. It may be shown mathematically that the variance of the distribution of all possible μˆ0 values (based on the results of all possible experiments giving rise to all possible values of βˆ0 and βˆ1) is
σ² (1/n + (X0 − X¯)²/SXX).
We may thus estimate the SD of μˆ0 by
EST SD(μˆ0) = s √(1/n + (X0 − X¯)²/SXX).
It may be shown that, under our assumptions,
(μˆ0 − μ0)/EST SD(μˆ0) ∼ tn−2.
Thus, using our standard argument to construct a 100(1 − α)% CI for a population parameter based on such a result, we have that a 100(1 − α)% CI for μ0, the true mean of all possible Y values at the fixed value X0, is
{μˆ0 − tn−2,α/2 EST SD(μˆ0), μˆ0 + tn−2,α/2 EST SD(μˆ0)}.
Length of CI: The CI for μ0 will of course be different depending on the value of X0. In fact, the expression for EST SD(μˆ0) above will be smallest if we choose X0 = X¯ and will get larger the farther X0 is from X¯ in either direction. This implies that the precision with which we expect to estimate the mean value of Y decreases the farther X0 is from the "middle" of the original data. This makes intuitive sense – we would expect to have the most "confidence" in our fitted line as an estimate of the true line in the "center" of the observed data. The result is that the CIs for μ0 will be wider the farther X0 is from X¯.
Implication: If the fitted line will be used to estimate means for values of X besides those used in the experiment, it is important to use a range of X's which contains the future values of interest, X0, preferably more toward the "center."
Extrapolation: It is sometimes desired to estimate the mean based on the fit of the straight line for values of X0 outside the range of X's used in the original experiment. This is called extrapolation. In order for this to be valid, we must believe that the straight line relationship holds for X's outside the range where we have observed data! In some situations, this may be reasonable;
in others, we may have no basis for making such a claim without data to support it. It is thus very important that the investigator has an honest sense of the relevance of the straight line model for values outside those used in the experiment if inferences such as estimating the mean for such X0 values are to be reliable. In the event that such an assumption is deemed to be relevant, note from the above discussion that the quality of the estimates of μ0 for X0 outside the range is likely to be poor. Note that we may in fact be interested in the mean at values of X that were included in the experiment. The procedure above is valid in this case.
4.1.6. Prediction and Calibration
Prediction: Sometimes, depending on the context, we may be interested not in the mean of possible Y values at a particular X0 value, but in the actual value of Y we might observe at X0. This distinction is important. The estimate of the mean at X0 provides just a general sense about values of Y we might see there – just the "center" of the distribution. This may not be adequate for some applications. For example, consider a stockbroker who would like to learn about the value of a stock based on previously observed information. The stockbroker does not want to know about what might happen "on the average" at some future time X0; she is concerned with the actual value of the stock at that time, so that she may make sound judgments for her clients. The stockbroker would like to predict or forecast the actual value, Y0 say, of the stock that might be observed at X0. In this kind of situation, we are interested not in the population parameter μ0, but rather in the actual value that might be taken on by a r.v., Y0, say. In the context of our model, we are thus interested in the "future" observation Y0 = β0 + β1X0 + ε0, where ε0 is the "error" associated with Y0 that makes it differ from the mean at X0, μ0. It is important to recognize that Y0 is not a parameter but a r.v.; thus, we do not wish to estimate a fixed quantity, but instead to learn about the value of a random quantity. Now, our "best guess" for the value Y0 is still our estimate of the "central" value at X0, the mean μ0. We will write Yˆ0 = βˆ0 + βˆ1X0 to denote this "best guess." Note that this is identical to our estimate for the mean, μˆ0; however, we use a different symbol in this context to remind ourselves that we are interested in Y0, not μ0. We call Yˆ0 a prediction or forecast rather than an "estimate" to make the distinction clear. Just as we do in estimation of fixed parameters, we would still like to have some idea of how well we can predict/forecast.
To get an idea, we would like to
Experimental Statistics for Biological Sciences
99
characterize the uncertainty that we have about Yˆ0 as a guess for Y0, but, because Y0 is not a parameter, it is not clear what to do. Our usual notion of a SE and CI does not seem to apply. We can, however, write down an assessment of the likely size of the error we might make in using Yˆ0 to characterize Y0. Intuitively, there will be two sources of error:
• Part of the error in Yˆ0 comes from the fact that we don't know β0 and β1 but must estimate them from the observed data.
• Additional error arises from the fact that what we are really doing is trying to "hit a moving target!" That is, Y0 itself is a r.v., so is itself variable! Thus, additional uncertainty is introduced because we are trying to characterize a quantity that itself is uncertain.
The assessment of uncertainty thus should be composed of two components. An appropriate measure of uncertainty is the SD of Yˆ0 − Y0, that is, the variability in the deviations between Yˆ0 and the thing we are trying to "hit," Y0. This variance turns out to be

$$\sigma^2 + \sigma^2\left(\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{S_{XX}}\right).$$

The extra σ² added on accounts for the additional variation above (i.e., the fact that Y0 itself is variable). We estimate the SD associated with this variance by substituting s² for σ²:

$$\mathrm{EST\ ERR}(\hat{Y}_0) = s\sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{S_{XX}}}.$$

We call this "EST ERR" to remind ourselves that this is an estimate of the "error" between Yˆ0 and Y0, each of which is random. The usual procedure is to use this estimated uncertainty to construct what might be called an "uncertainty interval" for Y0 based on our observed data. Such an interval is usually called a 'prediction interval'. A 100(1 − α)% interval is given by

$$\left\{\hat{Y}_0 - t_{n-2,\alpha/2}\,\mathrm{EST\ ERR}(\hat{Y}_0),\ \hat{Y}_0 + t_{n-2,\alpha/2}\,\mathrm{EST\ ERR}(\hat{Y}_0)\right\}.$$

Note that this interval is wider than the CI for the mean μ0. This is because we are trying to forecast the value of a r.v. rather than estimate just a single population parameter.
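The prediction interval above can be computed in the same way as the CI for the mean; a minimal sketch (plain Python; data and names are invented, and the caller supplies the t_{n−2, α/2} critical value, e.g. t_{3, 0.025} ≈ 3.182 when n = 5):

```python
from math import sqrt

def prediction_interval(x, y, x0, tcrit):
    """100(1-alpha)% prediction interval for a future Y at X = x0.
    tcrit = t_{n-2, alpha/2}; the extra '1 +' term reflects the fact that
    the future Y0 is itself a random variable."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    s = sqrt(sum((yi - (b0 + b1 * xi)) ** 2
                 for xi, yi in zip(x, y)) / (n - 2))
    y0_hat = b0 + b1 * x0                 # "best guess" for Y0
    est_err = s * sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)
    return y0_hat - tcrit * est_err, y0_hat + tcrit * est_err
```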
Understandably, we cannot do the former as well as the latter, because Y 0 varies as well. Calibration: Suppose we have fitted the regression line and now, for a value Y 0 of Y we have observed, we would like to estimate the unknown corresponding value of X, say X0 .
As an example, consider a situation where interest focuses on two different methods of calculating the age of a tree. One way is by counting tree rings. This is considered to be very accurate, but requires sacrificing the tree. Another way is by a carbon-dating process. Suppose that data are obtained for n trees on X = age by the counting method, Y = age by carbon dating. Here, technically, both of these might be considered r.v.'s; however, the goal of the investigators was to determine the relationship of the more variable, less reliable carbon-dating method to the accurate counting method. That is, given a tree is of age X (which could be determined exactly by the counting method), what would the associated carbon-dating value look like? Thus, for their purposes, regression analysis is appropriate. From the observed pairs (Xi, Yi), a straight-line model seems reasonable and is fitted to the data. Now suppose that the carbon-dating method is applied to a tree not in the study, yielding an age Y0. What can we say about the true age of the tree, X0 (that is, its age by the very accurate counting method), without sacrificing the tree? The idea is to use the fitted line to estimate X0. Note that X0 is a fixed value; thus, it is perfectly legitimate to want to estimate it. The obvious choice for an estimator, Xˆ0, say, is found by "inverting" the fitted regression line:

$$\hat{X}_0 = \frac{Y_0 - \hat{\beta}_0}{\hat{\beta}_1}.$$

Because Xˆ0 is an estimate of a parameter, based on information from our original sample, we ought to also report a SE and/or CI. It turns out that deriving such quantities is a much harder mathematical exercise than for estimating a mean or for prediction, and a description of this is beyond the scope of our discussion here. This is because the estimated regression parameters appear in both the numerator and denominator of the estimate, which leads to mathematical difficulties.
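Computing the calibration point estimate itself is elementary; a minimal sketch (the fitted intercept/slope and the tree-age numbers below are invented for illustration):

```python
def calibrate_x(y0, b0_hat, b1_hat):
    """Point estimate of the X at which the fitted line b0 + b1*x equals y0.
    (Attaching a SE or CI to this estimate is a much harder problem; see text.)"""
    if b1_hat == 0:
        raise ValueError("slope is zero; the fitted line cannot be inverted")
    return (y0 - b0_hat) / b1_hat

# Illustrative numbers only: a fitted line  carbon_age = 1.5 + 0.95 * ring_age
x0_hat = calibrate_x(96.5, b0_hat=1.5, b1_hat=0.95)  # estimated ring-count age
```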
Be aware that this may indeed be done; if you have occasion to make calibration inferences, you should consult a statistician for help with attaching estimates of precision to your calibrated values (the estimates Xˆ0). 4.1.7. Violation of Assumptions
Earlier in this section, we stated assumptions under which the methods for simple linear regression we have discussed yield valid inferences. In practice, it is often the case that one or more of these assumptions is violated, and this can happen in several ways. • Nonconstant variance: Recall in the regression situation that the mean response changes with X. Thus, in a given experiment, the responses may arise from distributions with means across a large range. The usual assumption is that the variance of these distributions is the same. However, it is often
the case that the variability in responses changes, most commonly in an increasing fashion, with changing X and mean values. This is thus likely to be of concern in problems where the response means cover a large range. We have already discussed the idea of data transformation as a way of handling this violation of the usual assumptions. In the regression context, this may be done in a number of ways. One way is to invoke an appropriate transformation and then postulate a regression model on the transformed scale. Sometimes, in fact, it may be that, although the data do not appear to follow a straight line relationship with X on the original scale, they may on some transformed scale. Another approach is to proceed with a modified method known as weighted least squares. This method, however, requires that the variances are known, which is rarely the case in practice. A number of diagnostic procedures have been developed for helping to determine if nonconstant variance is an issue and how to handle it. Other approaches to transforming data are also available. Discussion of these and of weighted least squares is an advanced topic. The best strategy is to consult a statistician to help with both diagnosis and identification of the best methods for a particular problem. • Nonnormality: Also, as we have discussed previously, the normal distribution may not provide a realistic model for some types of data, such as those in the form of counts or proportions. Transformations may be used in the regression context as well. In addition, there are other approaches to investigating the association between such responses Y and a covariate X that we have not discussed here. Again, a statistician can help you determine the best approach. Remark-Outliers: Another phenomenon that can make the normal approximation unreasonable is the problem of outliers, i.e., data points that do not seem to fit well with the pattern of the rest of the data. 
In the context of straight line regression, an outlier might be an observation that falls far off the apparent approximate straight line trajectory followed by the remaining observations. Practitioners may often “toss out” such anomalous points, which may or may not be a good idea, depending on the problem. If it is clear that an “outlier” is the result of a mishap or a gross recording error, then this may be acceptable. On the other hand, if no such basis may be identified, the outlier may in fact be a genuine response; in this case, it contains information about the process under study and may be reflecting a legitimate phenomenon. In this case, “throwing out” an outlier may lead to misleading conclusions, because a legitimate feature is being ignored. Again, there are sophisticated diagnostic procedures for identifying outliers and deciding how to handle them.
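To make the weighted least squares idea mentioned above concrete, here is a minimal sketch. It assumes the weights (ideally the inverse variances of the responses) are known, which, as noted in the text, is rarely the case in practice; all names and data are invented:

```python
def weighted_least_squares(x, y, w):
    """Straight-line fit minimizing sum_i w_i * (y_i - b0 - b1*x_i)^2.
    The weights w (ideally 1/Var(Y_i), here assumed known) downweight
    the more variable responses."""
    sw = sum(w)
    xw = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    yw = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    b1 = (sum(wi * (xi - xw) * (yi - yw) for wi, xi, yi in zip(w, x, y))
          / sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x)))
    b0 = yw - b1 * xw
    return b0, b1
```

With all weights equal, this reduces to ordinary least squares; unequal weights shift the fit toward the more precisely measured observations.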
4.2. Correlation Analysis
Throughout this discussion, we regard both Y and X as r.v.'s such that the bivariate normal distribution provides an appropriate, approximate model for their joint distribution. Correlation: Recall that the correlation coefficient ρXY is a measure of the degree of (linear) association between two r.v.'s. Interpretation: It is very important to understand what correlation does not measure. Investigators sometimes confuse the value of the correlation coefficient and the slope of an apparent underlying straight–line relationship. These do not have anything to do with each other:
• The correlation coefficient may be virtually equal to 1, implying an almost perfect association. But the slope may be very small at the same time. Although there is indeed an almost perfect association, the rate of change of Y values with X values may be very slow.
• The correlation coefficient may be very small, but the apparent "slope" of the relationship could be very steep. In this situation, it may be that, although the rate of change of Y values with X values is fast, there is large inherent variation in the data.
Estimation: For a particular set of data, of course, ρXY is unknown. We may estimate ρXY from a set of n pairs of observations (Xi, Yi), i = 1, . . . , n, by the sample correlation coefficient

$$r_{XY} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2 \sum_{i=1}^{n}(Y_i - \bar{Y})^2}} = \frac{S_{XY}}{\sqrt{S_{XX}\,S_{YY}}}.$$

Note that the same calculations are involved as for regression analysis! Recall in regression analysis that

$$\mathrm{Regression\ SS} = \frac{S_{XY}^2}{S_{XX}}.$$
The quantity r²XY is thus often called the coefficient of determination (like R²) in this setting, where correlation analysis is appropriate. However, it is important to recognize that the interpretation is different. Here, we are not positing a straight–line relationship; rather, we are just modeling the data in terms of a bivariate normal distribution with correlation ρXY. Thus, the former interpretation for the quantity r²XY has no meaning here. Likewise, the idea of correlation really only has meaning when both variables Y and X are r.v.'s. CI: Because rXY is an estimator of the population parameter ρXY, it would be desirable to report, along with the estimate itself, a CI for ρXY. There is no one way to carry out these analyses. One common approach is an approximation known as Fisher's Z transformation.
The method is based on the mathematical result that the quantity

$$Z = 0.5 \log\left(\frac{1 + r_{XY}}{1 - r_{XY}}\right)$$

has an approximate normal distribution with mean and variance

$$0.5 \log\left(\frac{1 + \rho_{XY}}{1 - \rho_{XY}}\right) \quad \text{and} \quad \frac{1}{n-3},$$

respectively, when n is large, where log is the natural logarithm. This result may be used to construct an approximate 100(1 − α)% CI for the mean 0.5 log{(1 + ρXY)/(1 − ρXY)}. This CI is

$$\left\{Z - z_{\alpha/2}\sqrt{\frac{1}{n-3}},\ Z + z_{\alpha/2}\sqrt{\frac{1}{n-3}}\right\}, \qquad [35]$$

where, as before, zα/2 is the value such that, for a standard normal r.v. Z, P(Z > zα/2) = α/2. This CI may then be transformed to obtain a CI for ρXY itself as follows. Let ZL and ZU be the lower and upper endpoints of the interval [35]. We illustrate the approach for ZL. To obtain the lower endpoint of an approximate CI for ρXY itself, calculate

$$\frac{\exp(2Z_L) - 1}{\exp(2Z_L) + 1}.$$

This value is the lower endpoint of the ρXY interval; to obtain the upper endpoint, apply the same formula to ZU. Hypothesis Test: We may also be interested in testing hypotheses about the value of ρXY. The usual hypotheses tested are analogous in spirit to what is done in straight-line regression – the null hypothesis is the hypothesis of "no association." Here, then, we test H0: ρXY = 0 vs. H1: ρXY ≠ 0. It is important to recognize what is being tested here. The alternative states simply that ρXY is different from 0. The true value of the correlation coefficient could be quite small and H1 would still be true. Thus, if the null hypothesis is rejected, this is not necessarily an indication that there is a "strong" association, just that there is evidence of some association. Of course, as always, if we do not reject H0, this does not mean we have enough evidence to infer that there is no association! This is particularly critical here. The procedure for testing H0 vs. H1 is
intuitively reasonable: we reject H0 if the CI does not contain 0. It is possible to modify this procedure to test whether ρXY is equal to some other value besides zero. However, be aware that most statistical packages provide by default the test for H0: ρXY = 0 only. Warning: This procedure is only approximate, even under our bivariate normal assumption. It is an example of the type of approximation that is often made in difficult problems, that of approximating the behavior of a statistic under the condition that n is large. If n is small, the procedure is likely to be unreliable. Moreover, it is worth noting that, intuitively, trying to understand the underlying association between two r.v.'s is likely to be very difficult with a small number of pairs of observations. Thus, testing aside, one should be very wary of over-interpretation of the estimate of ρXY when n is small – one "outlying" or "unusual" observation could be enough to affect the computed value substantially! It thus may be very difficult to detect when ρXY is different from 0 with a small n!
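The calculations of this section are easy to script; the following sketch (plain Python; the function names, the example data, and the hard-coded 95% critical value 1.96 are ours, not from the text) computes rXY and the Fisher-Z-based CI for ρXY:

```python
from math import log, exp, sqrt

def pearson_r(x, y):
    """Sample correlation coefficient r_XY = S_XY / sqrt(S_XX * S_YY)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def fisher_ci(r, n, z_crit=1.96):
    """Approximate CI for rho via Fisher's Z transformation (large n);
    z_crit = z_{alpha/2}, e.g. 1.96 for a 95% interval."""
    z = 0.5 * log((1 + r) / (1 - r))
    half = z_crit * sqrt(1 / (n - 3))
    back = lambda v: (exp(2 * v) - 1) / (exp(2 * v) + 1)  # inverse transform
    return back(z - half), back(z + half)
```

The test of H0: ρXY = 0 then amounts to checking whether the returned interval contains 0.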
Chapter 2 Nonparametric Methods for Molecular Biology Knut M. Wittkowski and Tingting Song Abstract In 2003, the completion of the Human Genome Project (1) together with advances in computational resources (2) were expected to launch an era where the genetic and genomic contributions to many common diseases would be found. In the years following, however, researchers became increasingly frustrated as most reported ‘findings’ could not be replicated in independent studies (3). To improve the signal/noise ratio, it was suggested to increase the number of cases to be included to tens of thousands (4), a requirement that would dramatically restrict the scope of personalized medicine. Similarly, there was little success in elucidating the gene–gene interactions involved in complex diseases or even in developing criteria for assessing their phenotypes. As a partial solution to these enigmata, we here introduce a class of statistical methods as the ‘missing link’ between advances in genetics and informatics. As a first step, we provide a unifying view of a plethora of nonparametric tests developed mainly in the 1940s, all of which can be expressed as u-statistics. Then, we will extend this approach to reflect categorical and ordinal relationships between variables, resulting in a flexible and powerful approach to deal with the impact of (1) multiallelic genetic loci, (2) poly-locus genetic regions, and (3) oligo-genetic and oligo-genomic collaborative interactions on complex phenotypes. Key words: Genome-wide Association Study (GWAS), Family-based Association Test (FBAT), High-Density Oligo-Nucleotide Assay (HDONA), coregulation, collaboration, multiallelic, multilocus, multivariate, gene–gene interaction, epistasis, personalized medicine.
1. Introduction As functional genetics and genomics advance and prices for sequencing and expression profiling drop, new possibilities arise, but also new challenges. The initial successes in identifying the causes of rare, mono-causal diseases serve as a proof-of-concept that new diagnostics and therapies can be developed. Common diseases have even more impact on public health, but as they H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_2, © Springer Science+Business Media, LLC 2010
106
Wittkowski and Song
typically involve genetic epistasis, genomic pathways, and proteomic patterns, they impose new requirements on database systems and statistical analysis tools. Biological systems are regulated by various, often unknown feedback loops, so that the functional form of relationships between measurement and activity or efficacy is typically unknown, except within narrowly controlled experiments. Still, many statistical methods are based on the linear model (5), i.e., on the assumption that the above (unknown) relationship is linear. The linear model has the advantage of computational efficiency. Moreover, assuming independence and additivity yields conveniently bell-shaped distributions and parameters of alluring simplicity. The prayer that biology be linear, independent, and additive, however, is rarely answered, and the Central Limit Theorem (CLT) neither applies to small samples nor rescues from model misspecification. When John Arbuthnot (1667–1735) argued that the chance of more male than female babies being born in London for the last 82 consecutive years was only 1/2^82 "if mere Chance govern'd", and, thus, "it is Art, not Chance, that governs" (6), he was arguably the first to ever apply the concept of hypothesis testing to obtain what is now known as a 'p-value' and, interestingly, he applied it to a problem in molecular biology. The test, now known as the sign or McNemar (7) test (see below), belongs to the class of 'nonparametric' tests, which differ from their 'parametric' counterparts in that the distribution of the data is not assumed to be known up to a single parameter to be estimated. As they require fewer unjustifiable assumptions, nonparametric methods are, in general, more adequate for biological systems. Moreover, they tend to be easier to understand.
The median, for instance, is easily explained as the cut-off with as many observations above as below, and the interquartile range as the range with 25% of the data both below and above. In contrast to the mean and standard deviation, these 'quartiles' do not change with (monotone) scale transformations (such as log and inverse), are robust to outliers, and often reflect more closely the goals of the investigator (8): 'Do people in this group tend to score higher than people in that group?', 'Is the order on this variable similar to the order on that one?' If the questions are ordinal, it seems preferable to use ordinal methods to answer them (9). The reason for the popularity of methods based on the linear model, ranging from the t-test(s) and analysis of variance (ANOVA) to stepwise linear regression and factor analysis, is not their biological plausibility, but merely their computational efficiency. While mean and standard deviation are easily computed with a pocket calculator, quartiles are not. Another disadvantage of ordinal methods is the relative scarcity of experimental designs that can be
Nonparametric Methods for Molecular Biology
107
analyzed. Moreover, nonparametric methods are often presented as a confusing 'hodgepodge' of seemingly unrelated methods. On the one hand, equivalent methods can have different names, such as the Wilcoxon (rank-sum) (10), the Mann–Whitney (u-) (11), and the Kruskal–Wallis (12) test (when applied to two groups). On the other hand, substantially different tests may be attributed to the same author, such as Wilcoxon's rank-sum and signed-rank tests (10). In the field of molecular biology, the need for computational (bioinformatics) approaches in response to a rapidly evolving technology producing large amounts of data from small samples has created a new field of "in silico" analyses, the term a hybrid of the Latin in silice (from silex, silicis m.: hard stone (13, 14)) and phrases such as in vivo, in vitro, and in utero, which seem to be better known than, for instance, in situ, in perpetuum, in memoriam, in nubibus, and in capite. Many in silico methods have added even more confusion to the field. For instance, a version of the t-test with a more conservative variance estimate is often referred to as SAM (significance analysis of microarrays) (15), and adding a similar "fudge factor" to the Wilcoxon signed-rank test yields SAM-RS (16). Referring to 'new' in silico approaches by names intended to highlight the particular application or manufacturer, rather than the underlying concepts, has often prevented these methods from being thoroughly evaluated. With sample sizes increasing and novel technologies (e.g., RNAseq) replacing less robust technologies (e.g., microarrays) (17) that often relied on empirically motivated standardization and normalization, the focus can now shift from ad hoc bioinformatics approaches to well-founded biostatistical concepts.
Below we will present a comprehensive theory for a wide range of nonparametric in silice methods based on recent advances in understanding the underlying fundamentals, novel methodological developments, and improved algorithms.
2. Univariate Unstructured Data 2.1. Independence 2.1.1. Introduction
For single-sample designs, independence of the observations constituting the sample is a key principle to guarantee when applying any statistical test to observed data. All asymptotic (large-sample) versions of the tests discussed here are based on the CLT, which applies when (i) many (ii) independent observations contribute in an (iii) additive fashion to the test statistic. Of these three requirements, additivity is typically fulfilled by the way the test statistic is formed, which may be, for instance, based on the sum of the data, often after rank or log transformation. Independence, then,
applies to how the data are being aggregated into the test statistic. Often, one will allow only a single child per family to be included, or at least only one of a pair of identical twins. The rule that 'more is better' underlies the Law of Large Numbers, which states that the accuracy of the estimates increases with the square root of the number of independent samples included. Finally, one will try to minimize the effect of unwanted 'confounders.' For instance, when one compares the effects of two interventions within the same subject, one would typically aim at a 'cross-over' design, where the two interventions are applied in a random order to minimize 'carry-over' effects, where the first intervention's effects might affect the results observed under the subsequent intervention. Another strategy is 'stratification', where subjects are analyzed as subsamples forming 'blocks' of subjects that are comparable with respect to the confounding factor(s), such as genetic factors (e.g., race and ethnicity), exposure to environmental factors (smoking), or experimental conditions. A third strategy would be to 'model' the confounding factor, e.g., by subtracting or dividing by its presumed effect on the outcome. As the focus of this chapter is on nonparametric (aka 'model-free') methods, this strategy will be used only sparingly, i.e., when the form of the functional relationship among the confounding and observed variables is, in fact, known with certainty. 2.1.2. The Sign Test for Untied Observations
The so-called sign test is the simplest of all statistical methods, applying to a single sample of binary outcomes from independent observations. The outcomes can be discrete, such as the sexes 'M' or 'F' in Arbuthnot's case (6), or 'correct vs. false', as in R.A. Fisher's famous (Lady Tasting) Tea test (7), where a lady claims to be able to decide whether milk was added to tea, or vice versa. When applied to paired observations ('increase' vs. 'decrease'), the sign test is often named after McNemar (18). (Here, we assume that all sexes, answers, and changes can be unambiguously determined. The case of 'ties' will be discussed in Section 2.1.3 below.) When the observations are independent, the number of positive 'signs' follows a binomial distribution, so that the probability of results deviating at least as much as the result observed from the result expected under the null hypothesis H0 (the 'p-value') is easily obtained. Let X be the sum of positive signs among n replications, each having a 'success' probability of p (e.g., p = 1/2). Then, the probability of having k 'successes' (see Fig. 2.1) is

$$b_{n,p}(k) = P\left(X = k \mid n, p\right) = \binom{n}{k} p^k (1-p)^{n-k} = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}.$$
Fig. 2.1. Binomial distribution for n=10 under the hypothesis p=0.5.
The (one-sided) p-value is easily computed as $\sum_{x=k}^{n} b_{n,p}(x)$. Let $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$; then, for binomial random variables, the parameter estimate $\hat{p} = X/n$ and its standard deviation

$$s_n(\hat{p}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \sqrt{\frac{1}{n}\left[\hat{p}\,(1-\hat{p})^2 + (1-\hat{p})\,(0-\hat{p})^2\right]} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

are also easily derived. More than 100 years after Arbuthnot (6), Gauss (19) proved the CLT, by which the binomial distribution can be approximated asymptotically (as.) by the familiar Gaussian 'Normal' distribution:

$$\sqrt{n}\,\frac{\hat{p} - p_0}{\sqrt{\hat{p}(1-\hat{p})}} \sim_{as.} N(0,1), \quad \text{i.e.,} \quad P\!\left(\sqrt{n}\,\frac{\hat{p} - p_0}{\sqrt{\hat{p}(1-\hat{p})}} \le z\right) \;\xrightarrow{n\to\infty}\; \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\, du.$$

The former (exact) form of the sign test can be derived and applied using only elementary mathematical operations, while the latter, like the parametric equivalent, requires theoretical results that most statisticians find challenging to derive, including an integral without a closed-form solution. For p0 = 1/2, McNemar (18) noted that this yields a (two-sided) test statistic in a particularly simple form:

$$\frac{(N_+ - N_-)^2}{N_+ + N_-} \sim_{as.} \chi^2_1.$$
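The exact and asymptotic versions can be sketched as follows (plain Python; function names are ours). The exact form needs only elementary operations, as noted above:

```python
from math import comb

def sign_test_pvalue(k, n, p0=0.5):
    """Exact one-sided p-value P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, x) * p0 ** x * (1 - p0) ** (n - x)
               for x in range(k, n + 1))

def mcnemar_chi2(n_pos, n_neg):
    """Asymptotic (two-sided) McNemar statistic, ~ chi^2 with 1 df."""
    return (n_pos - n_neg) ** 2 / (n_pos + n_neg)
```

For Arbuthnot's data, `sign_test_pvalue(82, 82)` reproduces the 1/2^82 calculation.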
2.1.3. Sign Tests for Tied Observations
In reality, the situations where the sign test is to be applied are often more complicated, because the sign of an observation (or pair of observations) may be neither positive nor negative. When such 'ties' are present, the McNemar test (18) yields the same results for two distinctively different situations:
• either we may have nine positive and one negative observation out of a total of ten,
• or we have the same nine positive and one negative observation, but also 9,990 tied observations.
A 'tie' might be a subject being heterozygous or, in Arbuthnot's case, having gonadal mosaicism, the rare condition where a subject has cells with both XX and XY chromosomes. Ignoring ties is often referred to as 'correction for ties.' Still, one may feel uncomfortable with a result being 'significant', although only nine observations 'improved' and one even 'worsened', while 9,990 were (more or less) 'unchanged.' This observation has long puzzled statisticians. In fact, several versions of 'the' sign test have been proposed (20). Originally, Dixon, in 1946 (21), suggested that ties be split equally between the positive and negative signs, but in 1951 (22) followed McNemar (18) in suggesting that ties be dropped from the analysis. To explicate the discussion, we divide the estimated proportion of positive signs by its standard deviation to achieve a common asymptotic distribution. Let N−, N+, and N= denote the number of negative, positive, and tied observations, respectively, among a total of n. The original sign test (21) can then be written as

$$T^* = \frac{(N_+ + p_0 N_=)/n - p_0}{\sqrt{p_0(1-p_0)/n}} = \frac{(N_+ + p_0 N_=) - p_0 n}{\sqrt{p_0(1-p_0)\,n}} \overset{\{p_0 = 1/2\}}{=} \frac{N_+ - N_-}{\sqrt{n}} \sim_{as.} N(0,1)$$

and the alternative (18, 22) as

$$T = \frac{N_+/(n - N_=) - p_0}{\sqrt{p_0(1-p_0)/(n - N_=)}} = \frac{N_+ - p_0(n - N_=)}{\sqrt{p_0(1-p_0)(n - N_=)}} \overset{\{p_0 = 1/2\}}{=} \frac{N_+ - N_-}{\sqrt{N_+ + N_-}} \sim_{as.} N(0,1).$$
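The two statistics are easily compared numerically; a sketch (function and variable names are ours) using the 9 'improved' / 1 'worsened' / 9,990 'unchanged' example from above:

```python
from math import sqrt

def tied_sign_tests(n_pos, n_neg, n_tie, p0=0.5):
    """Unconditional T* (ties kept) and conditional T (ties dropped);
    both are asymptotically N(0,1) under H0."""
    n = n_pos + n_neg + n_tie
    t_star = ((n_pos + p0 * n_tie) - p0 * n) / sqrt(p0 * (1 - p0) * n)
    t_cond = (n_pos - p0 * (n - n_tie)) / sqrt(p0 * (1 - p0) * (n - n_tie))
    return t_star, t_cond

# 9 positive, 1 negative, 9,990 ties: T* stays near 0, T looks 'significant'
t_star, t_cond = tied_sign_tests(9, 1, 9990)
```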
The first and last terms in the above series of equations show that the 'correction for ties' increases the test statistic by a factor of √(n/(n − N=)), thereby yielding more 'significant' results. The center term is helpful in understanding the nature of the difference. T* distributes the ties as the
numbers of observations expected under H0 and then uses the unconditional variance (21). T excludes the ties from the analysis altogether, thereby 'conditioning' the test statistic. The sign test is often applied when the distribution of the data (or differences of data) cannot be assumed to be symmetric. (Otherwise, the paired t-test would be a robust alternative (5).) Assume, for simplicity of the argument, that the differences follow a triangular distribution with a median of zero and a symmetric discretization interval with (band-)width 2b around the origin (Fig. 2.2).
Fig. 2.2. Triangular distribution, discretized at zero with bandwidth b=0.5.
Then, with T, ‘significance’ increases with the discretization bandwidth, i.e., with the inaccuracy of the measurements; the test statistic increases from 0.00 to 2.34 (Table 2.1). With T ∗ , in contrast, the estimate for the probability of a positive sign remains within a narrow limit between 0.4 and 0.5 and the test statistic never exceeds the value of 1.0. To resolve the seeming discrepancy between theoretical results suggesting a specific treatment of ties as ‘optimal’ and the counterintuitive consequences of using this ‘optimal’ strategy, one needs to consider the nature of ties. In genetics, for instance, tied observations can often be assumed to indicate identical phenomena (e.g., mutation present or absent (23)). Often, however, a thorough formulation of the problem refers to an underlying continuous or unmeasurable factor, rather than the observed discretized variable. For instance, ties may be due to rounding of
Table 2.1
Expected estimates and test statistics (n = 25) by bandwidth b

b      | p_=  | p_+  | p_+/(1−p_=) | T_25 | p_+ + p_=/2 | T*_25 | T_25/T*_25
0.00   | 0.00 | 0.50 | 0.50        | 0.00 | 0.50        | 0.00  | 1.0
0.25   | 0.21 | 0.39 | 0.49        | 0.06 | 0.49        | 0.05  | 1.1
0.50   | 0.41 | 0.27 | 0.46        | 0.28 | 0.48        | 0.21  | 1.3
0.75   | 0.62 | 0.14 | 0.37        | 0.78 | 0.45        | 0.48  | 1.6
0.90   | 0.75 | 0.06 | 0.23        | 1.38 | 0.43        | 0.69  | 2.0
0.95   | 0.79 | 0.03 | 0.14        | 1.68 | 0.42        | 0.77  | 2.2
0.99   | 0.82 | 0.01 | 0.03        | 1.98 | 0.42        | 0.84  | 2.4
1.00   | 0.83 | 0    | 0           | 2.07 | 0.41        | 0.86  | 2.4
1.50   | 0.93 | 0    | 0           | 2.34 | 0.46        | 0.36  | 3.7
>2.41  | 1.00 | 0    | –           | –    | 0.50        | 0.00  | 8.2
continuous variables (temperature, Fig. 2.2) or to the use of discrete surrogate variables for continuous phenomena (parity for fertility), in which case the unconditional sign test should be used. Of course, when other assumptions can be reasonably made, such as the existence of a relevance threshold (24), or a linear model for the paired comparison preference profiles (25), ties could be treated in many other ways (26). 2.2. Applications 2.2.1. Family-Based Association Studies
Family-based association tests (FBAT) control for spurious associations between disease and specific marker alleles due to population stratification (27). Thus, the transmission/disequilibrium test (TDT) for bi-allelic markers, proposed in 1993 by Spielman et al. (28), has become one of the most frequently used statistical methods in genetics. Part of its appeal stems from its computationally simple form, (b − c)²/(b + c), which resembles the conditional sign test (Section 2.1.3). Here, b and c represent the number of transmitted wild-type (P) and risk (Q) alleles, respectively, whose parental origin can be identified in affected children. Let n_XY denote the number of affected XY children and let XY ∼ X′Y′ denote a parental mating type (X, Y, X′, Y′ ∈ {P, Q}; the order of genotypes and parental genders is arbitrary). Alleles transmitted from homozygous parents are noninformative, so that children with two homozygous parents are excluded from the analysis as ‘exact ties’ (29, 30). If one parent is homozygous, only the allele transmitted from the other parent is informative. When both parents are heterozygous, the number of PP and QQ children, n_PP and n_QQ, respectively, represents both the number
Nonparametric Methods for Molecular Biology
of P or Q alleles transmitted from either parent; n_PQ represents the number of P alleles transmitted from one and the number of Q alleles transmitted from the other parent, yet the origin of the alleles is not known. To compare “the frequency with which [among heterozygous parents] the associated allele P or its alternate Q is transmitted to the affected offspring” (28), the term b − c in the numerator of the TDT can be decomposed into the contributions from families stratified by the parental mating types:

b − c = [n_PQ − n_QQ]_(PQ∼QQ) + [2(n_PP − n_QQ) + (n_PQ − n_PQ)]_(PQ∼PQ) + [n_PP − n_PQ]_(PP∼PQ).

Of course [n_PQ − n_PQ]_(PQ∼PQ) = 0, so that this equation can be rewritten as (31)

b − c = [n_PQ − n_QQ]_(PQ∼QQ) + 2[n_PP − n_QQ]_(PQ∼PQ) + [n_PP − n_PQ]_(PP∼PQ).   [1]
As an ‘exact tie’ (29, 30), the term [n_PQ − n_PQ]_(PQ∼PQ) can be ignored when computing the variance of [1], because these children are as noninformative as the alleles transmitted from homozygous parents. As noted above, independence of observations is a key concept in building test statistics. While “the contributions from both parents are independent” (28), the observations are not. Because the effects of the two alleles transmitted to the same child are subject to the same genetic and environmental confounders, one does not have a sample of independently observed alleles on which to build the test statistic, but “a sample of n affected children” (28). As a consequence, the factor ‘2’ in [1] does not increase the sample size (32), but is a weight indicating that the PP and the QQ children are ‘two alleles apart’, which implies a larger risk difference under the assumption of co-dominance. Each of the three components in [1] follows a binomial distribution with a probability of success equal to 1/2, so that the variance of [1] follows from “the standard approximation to a binomial test of the equality of two proportions” (28):

σ₀²(b − c) = [n_PQ + n_QQ]_(PQ∼QQ) + 4[n_PP + n_QQ]_(PQ∼PQ) + [n_PP + n_PQ]_(PP∼PQ).   [2]
Dividing the square of estimate [1] by its variance [2] yields a stratified McNemar test (SMN) which combines the estimates
and variances of three McNemar tests in a fashion typical for stratified tests (33):

(b − c)²/σ₀²(b − c) = ([n_PQ − n_QQ]_(PQ∼QQ) + 2[n_PP − n_QQ]_(PQ∼PQ) + [n_PP − n_PQ]_(PP∼PQ))² / ([n_PQ + n_QQ]_(PQ∼QQ) + 4[n_PP + n_QQ]_(PQ∼PQ) + [n_PP + n_PQ]_(PP∼PQ)) ∼as. χ²₁.   [3]
(‘Stratification’, i.e., blocking children by parental mating type, here is merely a matter of computational convenience. Formally, one could treat each child as a separate block, with the same results.) The TDT, in contrast, divides the same numerator by a “variance estimate” (28)

σ̂₀²(b − c) = b + c = [n_PQ + n_QQ]_(PQ∼QQ) + [2(n_PP + n_QQ) + 2 n_PQ]_(PQ∼PQ) + [n_PP + n_PQ]_(PP∼PQ),   [4]
which relies on the counts of noninformative PQ children. Replacing the observed n_PP + n_QQ by its estimate n_PQ under H0 would require an adjustment, similar to the replacement of the Gaussian by the t-distribution when using the empirical standard deviation with the t-test (34). Hence, the SMN has advantages over the TDT for finite samples (31), in part, because it has a smaller variance (more and, thus, lower steps, see Fig. 2.3) under H0 (see Table 2.2). Because the TDT estimates the variance under the assumption of co-dominance, it overestimates the variance (has lower power) for alleles with high dominance, i.e., when being heterozygous for the risk allele carries (almost) as much risk as being homozygous (31). On the other hand, the TDT underestimates the variance (yielding higher power) for recessive alleles (31).

Fig. 2.3. Cumulative distribution functions F(x) of the TDT and SMN statistics for the example of Table 2.2.

Table 2.2
Exact distribution of the TDT (top) and SMN (bottom) for a sample of three children

TDT:
g1  g2  g3   b   p1    p2    p3    f      n_P  n_Q  n_P−n_Q  n_P+n_Q  x       f
PP  PP  PP   1   0.25  0.25  0.25  0.016  6    0    6        6        2.449   0.016
PP  PP  PQ   3   0.25  0.25  0.50  0.094  5    1    4        6        1.633   0.094
PP  PQ  PQ   3   0.25  0.50  0.50  0.188  4    2    2        6        0.816   0.188
PP  PP  QQ   3   0.25  0.25  0.25  0.047  4    2    2        6        0.816   0.047
PQ  PQ  PQ   1   0.50  0.50  0.50  0.125  3    3    0        6        0.000   0.125
PP  PQ  QQ   6   0.25  0.50  0.25  0.188  3    3    0        6        0.000   0.188
PP  QQ  QQ   3   0.25  0.25  0.25  0.047  2    4    −2       6        −0.816  0.047
PQ  PQ  QQ   3   0.50  0.50  0.25  0.188  2    4    −2       6        −0.816  0.188
PQ  QQ  QQ   3   0.50  0.25  0.25  0.094  1    5    −4       6        −1.633  0.094
QQ  QQ  QQ   1   0.25  0.25  0.25  0.016  0    6    −6       6        −2.449  0.016
                                   1.000                              0.000   1.000

SMN:
g1  g2  g3   b   p1    p2    p3    f      n_PP  n_QQ  n_PP−n_QQ  n_PP+n_QQ  x       f*
PP  PP  PP   1   0.25  0.25  0.25  0.016  3     0     3          3          1.732   0.018
PP  PP  QQ   3   0.25  0.25  0.25  0.047  2     1     1          3          0.577   0.054
PP  QQ  QQ   3   0.25  0.25  0.25  0.047  1     2     −1         3          −0.577  0.054
QQ  QQ  QQ   1   0.25  0.25  0.25  0.016  0     3     −3         3          −1.732  0.018
PP  PP  PQ   3   0.25  0.25  0.50  0.094  2     0     2          2          1.414   0.107
PP  PQ  QQ   6   0.25  0.50  0.25  0.188  1     1     0          2          0.000   0.214
PQ  QQ  QQ   3   0.50  0.25  0.25  0.094  0     2     −2         2          −1.414  0.107
PP  PQ  PQ   3   0.25  0.50  0.50  0.188  1     0     1          1          1.000   0.214
PQ  PQ  QQ   3   0.50  0.50  0.25  0.188  0     1     −1         1          −1.000  0.214
PQ  PQ  PQ   1   0.50  0.50  0.50  0.125  0     0     0          0          –       –
                                   0.875                                    0.000   1.000

Legend: g_i: genotype of child i; b: number of combinations; p_i: E0(g_i); f: b × ∏ p_i; n_P, n_Q: numbers of transmitted P and Q alleles; n_PP, n_QQ: numbers of PP and QQ children; x: (n_P − n_Q)/√(n_P + n_Q) (TDT) or (n_PP − n_QQ)/√(n_PP + n_QQ) (SMN); f*: f conditional on informative trios.

2.2.2. Extensions of the Stratified McNemar
Understanding the role of parental mating types, the sensitivity of the SMN can easily be focused on either dominant or recessive alleles. When screening for recessive alleles, one excludes trios where one parent is homozygous for the wild type and assigns equal weights to the remaining trios (35). Conversely, when screening for dominant alleles, one excludes trios where one parent is homozygous for the putative risk allele. A better understanding of the statistical principles underlying family-based association studies leads not only to test statistics with better small-sample properties and better targeted tests for alleles with low or high dominance but also to extensions that open new areas of application.
For decades it has been known that the HLA-DRB1 gene is a major factor in determining the risk for multiple sclerosis (36). As HLA-DRB1 is one of the few genes where more than two alleles per locus are routinely observed, the SMN is easily generalized to multiallelic loci by increasing the number of parental mating-type strata and identifying within each stratum the number of informative children (35). In early 2009, applying this extension of the SMN for multiallelic loci allowed risk determinants to be narrowed down (Table 2.3) to amino acid 13 at the center of the HLA-DRB1 P4 binding pocket, while amino acid 60, which had earlier been postulated based on structural features (37), was seen as unlikely to play a major role (35, 38) (Fig. 2.4).
Table 2.3
Nucleotides (N#: nucleotide number, #A: amino acid number) found in 94 HLA-DRB1 SNPs discriminating the 13 main two-digit allelic groups. The nucleotide found in the HLA-DRB1∗15 (15) allele is highlighted across allelic groups (modified from BMC Medical Genetics, Ramagopalan et al. (35), available from http://www.biomedcentral.com/1471-2350/10/10/ © 2009, with permission from BioMed Central Ltd).
[Table body: one column per allelic group (02/15, 01, 04, 16, 03/05/17/18, 11, 12, 06/13, 08, 10, 09, 07, 14) giving the nucleotide at each discriminating position, with the number of cases per group in the last row (#Cases).]
2.2.3. Microarray Quality Control
Another simple method based on u-statistics is the median, on which the Bioconductor package Harshlight, a program to identify and mask artifacts on Affymetrix microarrays, is based (39, 40). After extensive discussion (41, 42), this approach was recently adapted for Illumina bead arrays as “BASH: a tool for managing BeadArray spatial artifacts” (43) (Fig. 2.6).
2.2.4. Implementation and Availability
Figure 2.5 demonstrates the relationships between the different versions of the sign and McNemar tests and how easily even their exact versions, where all possible permutations of the data need to be considered, can be implemented in code compatible with R
Fig. 2.4. (a) Extension of the stratified McNemar to multi-allelic loci identifies amino acid 13 (at the center of the P4 pocket) as a major risk factor for multiple sclerosis (modified from BMC Medical Genetics, Ramagopalan et al. (35), available from http://www.biomedcentral.com/1471-2350/10/10/ © 2009, with permission from BioMed Central Ltd). (b) Positions of the amino acids identified in Fig. 2.4a within the β2 domain.
Fig. 2.5. Implementation of the conditional and unconditional sign tests as part of the ‘muStat’ package for both R and S-PLUS. SMN: stratified McNemar, TDT: transmission disequilibrium test, MCN: McNemar, DMM: Dixon/Mood/Massey. The function MC() avoids scoping problems in S-PLUS and is part of the compatibility package muS2R.
and S-PLUS. For dominant and recessive alleles, the parameters wP/wQ are set to .0/.5, and .5/.0, respectively. HarshLight and BASH (as part of the beadarray package) are available from http://bioconductor.org.
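The decomposition [1]–[4] is simple enough to sketch outside of R as well. The following Python function is a hypothetical helper of our own (not part of muStat); it computes the SMN [3] and TDT statistics from the counts of informative children in the three informative mating-type strata:

```python
def smn_tdt(pq_qq, pq_pq, pp_pq):
    """Stratified McNemar (SMN) and TDT statistics per Eqs. [1]-[4].

    pq_qq: (n_PQ, n_QQ) children of PQ~QQ matings
    pq_pq: (n_PP, n_PQ, n_QQ) children of PQ~PQ matings
    pp_pq: (n_PP, n_PQ) children of PP~PQ matings
    """
    # shared numerator b - c, Eq. [1]
    num = (pq_qq[0] - pq_qq[1]) + 2 * (pq_pq[0] - pq_pq[2]) + (pp_pq[0] - pp_pq[1])
    # SMN variance, Eq. [2]
    var_smn = (pq_qq[0] + pq_qq[1]) + 4 * (pq_pq[0] + pq_pq[2]) + (pp_pq[0] + pp_pq[1])
    # TDT denominator b + c, Eq. [4]: noninformative PQ children of
    # doubly heterozygous matings contribute here, but not to Eq. [2]
    b_plus_c = (pq_qq[0] + pq_qq[1]) + 2 * sum(pq_pq) + (pp_pq[0] + pp_pq[1])
    return num ** 2 / var_smn, num ** 2 / b_plus_c
```

The two statistics share the numerator (b − c)² and differ only in the treatment of the PQ children of PQ∼PQ matings.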
3. Univariate Data

3.1. u-Scores

3.1.1. Definitions
Our aim is first to develop a computationally efficient procedure for scoring data based on ordinal information only. We will not make any assumptions regarding the functional relationship between a variable and the latent factor of interest, except that the variable has an orientation, i.e., that an increase in this variable is either ‘good’ or ‘bad.’ Without loss of generality, we will assume that for each of the variables ‘more’ means ‘better.’ Here, the index k is used for subjects. Whenever this does not cause confusion, we will identify patients with their vector of L ≥ 1 observations to simplify the notation. The scoring mechanism is based on the principle that each patient
Fig. 2.6. Quality control images generated by Harshlight for Affymetrix chips (top, modified from BMC Bioinformatics (39, 40) © 2005, with permission from BioMed Central Ltd) and by the ‘Bead Array Subversion of Harshlight’ (bottom, modified from Bioinformatics (43) © 2008, with permission from Oxford University Press).
x_k = (x_k1, …, x_kL), k = 1, …, m_j,

is compared to every other patient in a pairwise fashion. More specifically, a score u(x_k) is assigned to each patient x_k by counting the number of patients being inferior and subtracting the number of subjects being superior:

u(x_k) = Σ_k′ I(x_k′ < x_k) − Σ_k′ I(x_k′ > x_k).   [5]

3.1.2. Computational Aspects
When Gustav Deuchler (44) in 1914 developed what would, more than 30 years later, become known as the Mann–Whitney test (11) (although he missed one term in the asymptotic variance), he proposed a computational scheme of striking simplicity, namely to create a square table with the data as both the row and the column headers (Fig. 2.7). With this, the row sums of the signs of the paired comparisons yield the u-scores. Since the matrix is symmetric, only one half (minus the main diagonal) needs to be filled. Moreover, as Kehoe and Cliff (46) pointed out, even some of this information is redundant. The interactive program Interord computes which additional entries can be implied by transitivity from the entries already made.
Fig. 2.7. Matrices of paired comparisons for two variables (Y1, Y2), each observed in four subjects (modified from Statistical Applications in Genetics and Molecular Biology, Morales et al. (45) available at http://www.bepress.com/sagmb/vol7/iss1/art19/, with permission from The Berkeley Electronic Press, © 2008).
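Deuchler's tabular scheme amounts to summing the signs of all paired comparisons. A minimal Python sketch (the function name u_scores is ours, not from the text) makes [5] concrete:

```python
def u_scores(x):
    """u(x_k) = #{k': x_k' < x_k} - #{k': x_k' > x_k}, Eq. [5]."""
    return [sum(xj < xk for xj in x) - sum(xj > xk for xj in x) for xk in x]
```

For Y1 = (3, 3, 2, 2) of Fig. 2.7 this yields (2, 2, −2, −2), the row sums of the left matrix; since u-scores are a linear function of ranks, u = 2r − (n + 1), the same scores follow from mid-ranks.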
3.2. Randomization and Stratification

3.2.1. Introduction
When several conditions are to be compared, the key principle to be observed is that of randomization. In most situations, randomization is the only practical strategy to ensure that the results seen can be causally linked to the interventions to be compared. When observational units vary, it is often useful to reduce the effect of confounding variables through stratification. A special case of stratification is the sign test, when two interventions are to be compared within a stratum of two closely related subjects (e.g., twins). Here we will consider situations where more than two subjects form a stratum, e.g., where subjects are stratified by sex, race, education, or the presence of genetic or environmental risk factors. The fundamental concept underlying the sign test is easily generalized to situations where a stratum contains more than two observations. Under H0, the expectation of T_j = Σ_{k=1..M_j} u(x_jk) is zero. For the two-sample case, Mann and Whitney, in 1947 (11), reinvented the ‘u-test’ (this time with the correct variance). As u-scores are a linear function of ranks (u = 2r − (n + 1)), their test is equivalent (47) to the rank-sum test proposed in 1945 by Wilcoxon (10). The Wilcoxon–Mann–Whitney (WMW) test, in turn, is a special case of the 1952 Kruskal–Wallis (KW) rank test for >2 groups (12). The notation used below will prove to be particularly useful for generalizing these results. The test statistic W_KW is based on the sums of u-scores U_j = Σ_{k=1..M_j} u(X_jk) within each group j = 1, …, p. It can be computed as a ‘quadratic form:’
W_KW = U′ V⁻ U = Σ_{j,j′=1..p} U_j v⁻_jj′ U_j′ ∼as. χ²_(p−1),   [6]
where V⁻ is a generalized inverse of the variance–covariance matrix V. For the KW test, the matrix V, which describes the experimental design and the variance among the scores, is:

V_KW = s₀² ( diag(M_1, …, M_p) − (1/M_+) [M_j M_j′]_{j,j′=1..p} ),   [7]

s₀² = 1/(M_+ − 1) × { Σ_{x=1..M_+} u²(x)  (unconditional);  Σ_{j=1..p} Σ_{k=1..M_j} u²(X_jk)  (conditional on ties) }.
Traditionally, the WMW and KW tests have been presented with formulae designed to facilitate numerical computation at the expense of conceptual clarity, e.g.:

V⁻_KW = (1/s₀²) diag(1/M_1, …, 1/M_p),

s₀² = (M_+(M_+ + 1))/3 × { 1  (unconditional);  1 − Σ_{g=1..G} (W_g³ − W_g)/(M_+³ − M_+)  (conditional on ties) },
where W_g indicates the size of the g-th tie (group of identical observations). (The constant ‘3’ in the denominator of s₀² is replaced by ‘12’ for ranks.) The unconditional version (equivalent to the conditional version in the absence of ties) then yields the well-known and computationally simple form:

W_KW = (3/(M_+(M_+ + 1))) Σ_{j=1..p} T_j²/M_j ∼as. χ²_(p−1).
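The quadratic form [6]–[7] and the simple formula agree in the absence of ties. A minimal Python sketch (our illustration, not the muStat code) uses the closed-form generalized inverse V⁻_KW = (1/s₀²) diag(1/M_j):

```python
def u_scores(x):
    # u-scores of the pooled sample, Eq. [5]
    return [sum(v < xk for v in x) - sum(v > xk for v in x) for xk in x]

def kruskal_wallis_u(groups):
    """W_KW = U'V-U with V- = (1/s0^2) diag(1/M_j), conditional on ties."""
    pooled = [v for g in groups for v in g]
    u = u_scores(pooled)
    sizes = [len(g) for g in groups]
    U, pos = [], 0
    for m in sizes:                          # group sums U_j of the u-scores
        U.append(sum(u[pos:pos + m]))
        pos += m
    s02 = sum(s * s for s in u) / (len(pooled) - 1)
    return sum(Uj * Uj / Mj for Uj, Mj in zip(U, sizes)) / s02
```

For two untied groups {1, 2, 3} and {4, 5, 6} this reproduces the textbook Kruskal–Wallis value 27/7 ≈ 3.857.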
With the now abundant computer power, one can shift focus to the underlying design aspects and features of the score function, represented in the left and right parts of [7], respectively. In particular, the form presented in [6] can be more easily extended to other designs (below).

3.2.2. Complete Balanced Design
To extend u-tests to stratified designs, we add an index i = 1, . . . , n for the blocks to be analyzed. With this, the (unconditional) sign test can be formally rewritten as
V_ST = n s₀² ( I − (1/2) J ) = n s₀² ( [1 0; 0 1] − (1/2)[1 1; 1 1] ),

V⁻_ST = (1/(n s₀²)) I,  s₀² = (1/(p − 1)) Σ_{j=1,2} u_j² = 2,

W_ST = (1/(2n)) Σ_{j=1,2} T_j² = (1/n)(N_+ − N_−)² ∼as. χ²_(p−1).
Coming from a background in econometrics, Milton Friedman, winner of the 1976 Nobel prize in economics, had presented a similar test (48) in 1937, yet for more than two conditions:

V_FM = n s₀² ( I − (1/p) J ),

V⁻_FM = (1/(n s₀²)) I,  s₀² = (1/(p − 1)) Σ_{j=1..p} u_j² = p(p + 1)/3,

W_FM = (3/(n p (p + 1))) Σ_{j=1..p} T_+j² ∼as. χ²_(p−1).
In either case, noninformative blocks are excluded with ‘conditioning on ties’. In the field of voting theory, the blocks represent voters and the u-scores are equivalent to the Borda counts, originally proposed in 1781 (49), around the time of Arbuthnot’s contributions, although the concept of summing ranks across voters had already been introduced by Ramon Llull (1232–1315) (50).

3.2.3. Incomplete and/or Unbalanced Designs
In the 1950s (51, 52), it was demonstrated that the Kruskal–Wallis (12) and Friedman (48) tests, and also the tests of Durbin (53) and Bradley–Terry (54), can be combined and extended by allowing for both several strata and several observations within each cell (combination of stratum and group). However, when blocks represent populations of different size (proportionate to m_i, say) and/or have missing data (M_i < m_i), the problem arises of how to deal with unbalanced and/or incomplete designs. Between 1979 and 1987, several approaches for weighting blocks were suggested (33). In 1988, Wittkowski (33) used the marginal likelihood principle to prove that Mike Prentice’s ‘intuition’ (55) was correct, i.e., that u-scores (or ranks) should be scaled by a factor of (m_+ + 1)/(M_+ + 1) to reflect differences in population sizes and missing data. In 2005, Alvo et al.
(56, 57) confirmed these results using a different approach:

u(x_ijk) = Σ_{j′k′} I(x_ij′k′ < x_ijk) − Σ_{j′k′} I(x_ij′k′ > x_ijk),

U_ij = ((m_i+ + 1)/(M_i+ + 1)) Σ_{k=1..M_ij} u(X_ijk),

V_i = s_i0² ( diag(M_i1, …, M_ip) − (1/M_i+) [M_ij M_ij′]_{j,j′=1..p} ),

s_i0² = (1/(M_i+ − 1)) ((m_i+ + 1)/(M_i+ + 1))² × { Σ_{x=1..M_i+} u²(x)  (unconditional);  Σ_{j=1..p} Σ_{k=1..M_ij} u²(X_ijk)  (conditional on ties) }.   [8]

In contrast to the above special cases, no generalized inverse with a closed-form solution is known, so that one has to rely on a numerical solution:

W = U_+′ V_+⁻ U_+ = Σ_{i=1..n} Σ_{j,j′=1..p} U_ij v⁻_i,jj′ U_ij′ ∼as. χ²_(p−1).   [9]
When the conditions to be compared are genotypes, a situation may arise where the genotypes of some subjects are not known. Still, phenotype information from those subjects can be used when computing the scores, by assigning these subjects to a pseudo group j = 0, which is used for the purpose of scoring only.

3.3. Applications

3.3.1. Binary Data/Mantel–Haenszel
Genome-wide association studies (GWAS) often aim at finding a locus where the proportions of alleles differ between cases and controls. For most human single-nucleotide polymorphisms (SNPs), only two alleles have been seen. Thus, the data can be organized in a 2 × 2 table:

            Controls   Cases
Allele 1    M_1^(1)    M_2^(1)    M_+^(1)
Allele 0    M_1^(0)    M_2^(0)    M_+^(0)
            M_1        M_2        M_+
While data for the sign test can also be arranged in a 2 × 2 table, the question here is not whether M_1^(1) = M_2^(0), but whether M_1^(1)/M_1 = M_2^(1)/M_2. Thus, the appropriate test here is not the sign test, but Fisher’s exact or, asymptotically, the χ² test
for independence or homogeneity. This χ² test is asymptotically equivalent to the WMW test for two response categories. As with the sign test, ties are considered ‘exact’ in genetics and, thus, the variance conditional on the ties is applied. When the data are stratified, e.g., by sex, one obtains a 2 × 2 table for each block, and the generalization of the χ² test is known as the Cochran–Mantel–Haenszel (CMH) test, which can also be seen as a special case of the W test:

W_CMH = [ Σ_{i=1..n} ((m_i+ + 1)/(M_i+ + 1)) M_i1 M_i2 (P_i1 − P_i2) ]² / [ Σ_{i=1..n} ((m_i+ + 1)/(M_i+ + 1))² (M_i+²/(M_i+ − 1)) M_i1 M_i2 P_i+ (1 − P_i+) ]

= [ Σ_{i=1..n} ((m_i+ + 1)/(M_i+ + 1)) (M_i2^(1) M_i1^(0) − M_i1^(1) M_i2^(0)) ]² / [ Σ_{i=1..n} ((m_i+ + 1)/(M_i+ + 1))² (1/(M_i+ − 1)) M_i+^(1) M_i+^(0) M_i1 M_i2 ] ∼as. χ²₁.
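For complete data (m_i+ = M_i+), the Prentice factors (m_i+ + 1)/(M_i+ + 1) reduce to 1 and W_CMH can be sketched as follows (an illustration with variable names of our own, not library code):

```python
def w_cmh(tables):
    """W_CMH over strata of 2x2 tables ((a, b), (c, d)):
    rows = allele 1 / allele 0, columns = controls / cases;
    unweighted case m_i+ = M_i+, so each Prentice factor is 1."""
    num = den = 0.0
    for (a, b), (c, d) in tables:
        M1, M2 = a + c, b + d                  # group (column) totals
        R1, R0 = a + b, c + d                  # allele (row) totals
        M = M1 + M2
        num += b * c - a * d                   # M_i2^(1) M_i1^(0) - M_i1^(1) M_i2^(0)
        den += M1 * M2 * R1 * R0 / (M - 1.0)   # variance conditional on margins
    return num * num / den
```

For a single stratum this equals ((M − 1)/M) times the Pearson χ², i.e., the version conditional on the ties.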
3.3.2. Degrees of Freedom in GWAS
For most statistical applications, the degrees of freedom for a χ² test equal the rank of the variance–covariance matrix V. In genetic and genomic screening studies, however, the dimension of the variance–covariance matrix may vary, depending on the number of groups present for a particular gene or locus. Clearly, it would be inappropriate in a screening study to decrease the degrees of freedom (df) for the χ² distribution if one (or more) of the groups is (are) missing, but to revert to the full df when a single observation for this group is available for another gene or locus.
3.3.3. Implementation and Availability
Among the many obstacles that have prevented statistical methods based on u-scores from being used more widely is that they are traditionally presented as an often confusing hodgepodge of procedures, rather than as a unified approach backed by a comprehensive theory. For instance, the Wilcoxon rank-sum and the Wilcoxon signed-rank tests are easily confused. Both were published in the same paper (10), but they are not special cases of a more general approach. The Lam–Longnecker test (58), on the other hand, directly extends the WMW to paired data. Finally, the Wilcoxon rank-sum and the Mann–Whitney u tests are equivalent (47), although they were independently developed based on different theoretical approaches. Above, we have demonstrated that several rank or u-tests can be presented as a quadratic form W = U_+′ V_+⁻ U_+ ∼as. χ²_(p−1), where the variance–covariance matrix V reflects the various designs. The muStat package (available for both R and S-PLUS) takes advantage of this fact, both by providing the unified
#----------------------------------------------------------------------
# mu.test          (y, groups, blocks, ...)  # most general form
# mu.friedman.test (y, groups, blocks, ...)  # one observation per cell
# mu.kruskal.test  (y, groups, ...)          # single block
# mu.wilcox.test   (y, groups, ...)          # single block/two groups
#----------------------------------------------------------------------
#   y,                  # data (NA allowed)
#   groups,             # groups (unbalanced allowed)
#   blocks,             # blocks (unequal sz allowed)
#   score  = "rank",    # NULL: y already scored
#   paired = FALSE,     # wilcox only
#   exact  = NULL,      # wilcox only
#   optim  = TRUE,      # optimize for special cases
#   df     = -1,        # >0: as entered, =0: by design, -1: by data
#----------------------------------------------------------------------
mu.test          <- function(y, groups, blocks, ...)
mu.friedman.test <- function(y, groups, blocks, ...)
mu.wilcox.test   <- function(y, groups, blocks = NULL, paired = F, ...)
mu.kruskal.test  <- function(y, groups, blocks = NULL, ...)
...
ok <- is.orderable(y)
mi <- rowSums(m <- xTable(blocks,     groups    ))  # exp. block size
Mi <- rowSums(M <- xTable(blocks[ok], groups[ok]))  # obs. block size
Wi <- mi + 1                                        # Prentice weights
Tijk <- Centered(
  Score(applyBy(y, blocks, wRank)/(Mi[blocks]+1), blocks, Mi)) * Wi[blocks]
T1 <- qapply(Tijk, groups, sum)
V0 <- structure(dim = c(P, P),
  (1/(Mi-1)) * qapply(Tijk^2, blocks, sum) %*% (       # blkvar
    t(apply(M, 1, function(x) diag(x)))  -             # design
    (1/Mi) * t(apply(M, 1, function(x) outer(x, x))))  # matrix
)
W <- T1 %*% ginv(V0) %*% T1

Fig. 2.8. Implementation of (conditional) rank tests within the ‘muStat’ package for both R and S-PLUS. The notation corresponds directly to the details published in 1988 (33), as given in Equations [8] and [9]. The code has been simplified for clarity. See the muStat package for details and for definitions of the functions xTable, applyBy, and qapply, which are optimized variants of the standard functions table and apply.
function mu.test() (in addition to replacement functions for the original R and S-PLUS functions, for compatibility) and by being based on a single copy of the code (Fig. 2.8), although internally an optimized version may be used for special cases. (The case of censored data will be handled as a special case of multivariate data; see Section 4.2.1.) To ensure that p-values are comparable within a study (see Section 3.3.2), the function mu.test() allows the number of degrees of freedom to be fixed (df>0) or to be determined by the design matrix (df=0), in addition to being computed from the observed data (df=-1).
4. Multivariate Data

4.1. Introduction
Few biological systems can be sufficiently characterized by a single variable only. A single measure often does not appropriately reflect the effect of all relevant genetic or environmental risk factors, clinical or epidemiological interventions, or personal preferences. Sometimes the definite measure is not easily obtained, so that several surrogate measures need to be evaluated. At other times, e.g., when assessing a complex syndrome or a chronic disease, a definite measure may not even exist. Still, most statistical methods for multivariate data are based on the (generalized) linear model, either explicitly, as in regression, factor, discriminant, and cluster analysis, or implicitly, as in neural networks. One scores each variable individually on a comparable scale (present/absent, low/intermediate/high, 1–10, or a z-transformation) and then defines a global score as the linear combination (weighted average) of these scores. Thus, it is assumed that it is known how to transform each variable to a common scale, so that a weighted average of these transformed variables can be meaningfully interpreted, and that these weights are constant. The linear model became popular mainly because its mathematical elegance led to computational efficiency and parameters of alluring simplicity. When applied to real-world data, however, this approach may have shortcomings, because biological systems are typically regulated by various, often unknown, feedback loops, so that the functional form of relationships between measurement and activity or efficacy is typically unknown, except within narrowly controlled experiments. Since the relative importance of the variables, the correlation among them, and the functional relationship of each variable with the immeasurable latent factor ‘safety’, ‘activity’, or ‘effectiveness’ are typically unknown, construct validity (59) cannot be established on theoretical grounds, and one needs to resort to empirical ‘validation’, choosing weights and functions to provide a reasonable fit with a ‘gold standard’ when applied to a sample, a process of questionable validity by itself (60).
The Delphi oracle, where women intoxicated by fumes predicted the future, was often difficult to interpret. The ‘Delphi method’ (61) approach to scoring systems, where weights and functions are agreed upon by a group of experts, may facilitate comparison between studies, yet comparability along a scale with questionable validity may still yield questionable results. The diversity of scoring systems in use attests to the subjective nature of this process. Nonparametric methods, in general, are designed for ordinal data, where a one-unit difference may not carry the same ‘meaning’ across the range of possible values (62), and, thus, avoid artifacts created by making unrealistic assumptions. Consequently, nonparametric methods are particularly well suited for analyses in human phenomics (63). The marginal likelihood principle (MrgL) provides a general framework to extend rank tests to censored and missing data (33, 64–66). In 1992, it was demonstrated that MrgL procedures for censored data can be generalized to
cover multivariate ordinal observations (67), in general. While this approach proved eminently useful (68–72), the computational effort was prohibitive. Drawing on the analogy of the Wilcoxon rank sum (10) (MrgL) with the Mann–Whitney test (73) (u-statistics), a 1914 algorithm (44) yielded a computationally more efficient approach, which can easily be extended to more complex partial orderings (62, 74) and designs (31, 33, 75). Below, it will be demonstrated how μ-scores (u-scores for multivariate data) cover situations where a one-unit difference may carry a different ‘meaning’ across variables (76) and, thus, can integrate information even when the events counted are incomparable or the variables’ scales differ, as long as each variable has the same ‘orientation.’

4.2. Partial and Incomplete Orderings
We will now extend the notation introduced in Section 3.1 (using i = 1, …, n for blocks, j = 1, …, p for groups, and k = 1, …, m_ij for replications) by adding an additional index ℓ = 1, …, L for the variables to be analyzed. The scoring mechanism is based on the principle that each subject x_jk = (x_jk1, …, x_jkL), j = 1, …, p; k = 1, …, m_j, is compared to every other subject in a pairwise manner. For stratified designs, these comparisons are made within each stratum only. When the observed outcomes can be assumed to be correlated with an unobservable latent factor, a partial ordering (67) among the patients can easily be defined. If the second of two patients has values higher in at least one variable and at least as high in all variables ℓ = 1, …, L (or, equivalently, lower in none), it will be called ‘superior’:

x_jk < x_j′k′ ⇔ { ∀ℓ=1,…,L: x_jkℓ ≤ x_j′k′ℓ } ∧ { ∃ℓ=1,…,L: x_jkℓ < x_j′k′ℓ }.   [10]

Many partial orderings can be defined, all of which, by definition, are transitive: (a < b) ∧ (b < c) ⇒ (a < c). Orderings such as [10], which treat all variables equally, are called ‘regular.’ The partial ordering for an (interval) censored variable is just one example; several more examples will be given below. Even though a partial ordering does not guarantee that all patients can be ordered on a pairwise basis, all patients can be scored. One assigns a score in exactly the same fashion as described in [5], using the partial order [10] instead of the simple univariate order. By definition, μ-scores are ‘intrinsically valid’, i.e., independent of the choice of (nonzero) weights and (monotonous) transformations assigned to the variables.
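The partial ordering [10] and the scoring rule [5] translate directly into code. A minimal Python sketch (function names are ours, introduced for illustration):

```python
def superior(a, b):
    """b is 'superior' to a under the regular partial ordering [10]:
    no variable lower, at least one variable higher."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def mu_scores(subjects):
    """mu-score per subject: number of inferior minus number of superior
    subjects, as in Eq. [5] with the partial order [10]."""
    return [sum(superior(o, s) for o in subjects)
            - sum(superior(s, o) for o in subjects)
            for s in subjects]
```

Note that incomparable pairs (e.g., (0, 1) vs. (1, 0)) contribute zero to both counts, so the scores remain well defined even when the ordering is only partial.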
4.2.1. Censored Data
Many phenotypes are ‘censored.’ A typical case is ‘survival’, where some subjects may still be alive at the end of a study. Similarly, a subject may be surveyed as not yet having cancer. Other examples are events that are not directly observable, like the recurrence of cancer, where often only the last date with a negative and the first date with a positive result are known. In genetic studies, censored information arises, for instance, when the prevalence of a disease that develops over time is observed in subjects that vary in age, such as in late-onset diabetes, cardiovascular diseases, cancer, etc. For such studies, a negative observation merely means that this subject has not yet developed the disease, rather than that the subject is ‘immune.’ In 1965, Gehan (77, 78) demonstrated how u-scores can be applied to censored (including interval-censored) variables (see Fig. 2.9). Assuming that the variables represent the last date negative (LDN) and the first date positive (FDP), subject A experiences the event under investigation ‘later’ than subject B if LDN(A) > FDP(B). In clinical or epidemiological research, such data may arise when the exact date of an event, e.g., infection or recurrence, is not known, but the event is known to have happened between the date of the last negative test x_jk1 and the date of the first positive test x_jk2. Right-censored data are a special case (x_jk2 = x_jk1: event; x_jk2 = ∞: censoring). Thus, pairs of intervals can be ordered if they do not overlap. For left- and right-censored observations, LDN and FDP are −∞ and +∞, respectively. Tests for censored data are often presented in an equivalent form, where the first variable is the time point, and the second is the ‘censoring indicator.’ The above representation, however,
Subject  X1  X2    U
A         1   2    0
B         2   2    3
C         ?   1   −3
D         1   2    0
Fig. 2.9. Matrix of paired comparisons for a single interval-censored variable observed in four subjects (modified from Statistical Applications in Genetics and Molecular Biology, Morales et al. (45), available at http://www.bepress.com/sagmb/vol7/iss1/art19/, with permission from The Berkeley Electronic Press, © 2008).
Nonparametric Methods for Molecular Biology
is more easily generalized, e.g., to ‘interval-censored’ data, where x_jk1 < x_jk2 < ∞. To further clarify the relation between censored and multivariate data, it is convenient to consider the most general case, interval-censored observations. In clinical or epidemiological research, such data may arise when the exact date of an event, e.g., infection or recurrence, is not known, but the event is known to have happened between the date of the last negative test x_jk1 and the date of the first positive test x_jk2. Right-censored data are a special case (x_jk2 = x_jk1: event; x_jk2 = ∞: censoring). Thus, pairs of intervals can be ordered if they do not overlap or, equivalently, if both time points in one subject are earlier than both time points in the other subject:

x_jk < x_j′k′ ⇔ x_jk2 < x_j′k′1.

Schemper (79) generalized the Gehan test to more than two groups. Also, from the work of Hoeffding (73) and Lehmann (80), u-scores for censored data can be analyzed for all the stratified designs mentioned above.

4.2.2. Ambiguous Paired Comparisons
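For a single interval-censored variable, the Gehan ordering of non-overlapping intervals translates directly into code (a minimal sketch under the (LDN, FDP) representation above; the data are hypothetical):

```python
import math

def gehan_cmp(a, b):
    """Pairwise order of interval-censored observations a = (LDN, FDP):
    -1 if a is earlier, 1 if later, 0 if the intervals overlap (undetermined).
    Right-censoring: FDP = math.inf; left-censoring: LDN = -math.inf."""
    if a[1] < b[0]:
        return -1          # interval a lies entirely before b
    if a[0] > b[1]:
        return 1           # interval a lies entirely after b
    return 0               # overlapping intervals cannot be ordered

def gehan_scores(intervals):
    """u-score = (# subjects known to be earlier) - (# known to be later)."""
    n = len(intervals)
    return [sum(gehan_cmp(intervals[i], intervals[j])
                for j in range(n) if j != i)
            for i in range(n)]

# exact event at t=3, right-censored at t=5, event between t=1 and t=2
data = [(3, 3), (5, math.inf), (1, 2)]
print(gehan_scores(data))   # -> [0, 2, -2]
```

Note that an exact event time is simply the degenerate interval (t, t), so one comparison function covers exact, right-censored, left-censored, and interval-censored observations.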
If only a single variable needs to be considered (L = 1) and all observations are different, the order is complete (Fig. 2.10a). If identical observations (ties) are present, two cases need to be considered, as with the sign test above: ties may be due to the underlying phenomena, or they may be caused by discretization or by observing a discrete surrogate variable for a continuous phenomenon. In both cases, there are three possibilities for each pair of patients: in the former case <, >, or = (Fig. 2.10b); in the latter, where ties reflect some ambiguity (29), <, >, or ≈ (Fig. 2.10c). Intervals, however, can only be ordered if they are disjoint, so that some paired comparisons may be undetermined. In Fig. 2.10d, for instance, it is not known whether the patient infected between the first and the third follow-up visit (1..3) was infected earlier than the patient infected between the second and the third visit (2..3). The same rationale applies to situations with several (L > 1) variables (Fig. 2.10e). For multivariate data, the order between two subjects is undetermined if x_jkℓ < x_j′k′ℓ for some variable ℓ, while x_jkℓ′ > x_j′k′ℓ′ for another variable ℓ′. In either case, the ordering may be ‘partial’, rather than ‘complete.’ From Fig. 2.10d and e, the Gehan test (77, 78) can easily be extended to multivariate data, using the approach originally suggested by Hoeffding (81), because it relies only on the existence of a partial order, irrespective of how that partial ordering was created.

Fig. 2.10. Orderings: (a) simple, (b) exact, (c) inexact, (d) interval, (e) multivariate (modified from Computing Science and Statistics, Wittkowski (74), available from http://ideas.repec.org/p/pra/mprapa/4570.html, with permission from the Interface Foundation of North America, Inc., and Statistics in Medicine, Wittkowski et al. (76) © 2004, with permission from John Wiley & Sons, Inc.)

4.2.3. Information Content
As an alternative to the ‘weak’ partial ordering [10], one might require its ‘strong’ cousin (82):

x_jk < x_j′k′ ⇔ {∀ℓ=1,…,L: x_jkℓ < x_j′k′ℓ}.   [11]
At first sight, the strong order [11] may seem more appropriate for discretized variables, because the true order of the observations discretized into a tie is unknown. If each variable is presumed to be a surrogate for the same latent factor, however, the strong partial order highlights a potential downside of μ-scores, namely that the number of ambiguous paired comparisons increases with the number of variables, unless the variables are highly correlated. As a result, a larger sample may be needed to achieve the desired power. Allowing for ties to be broken by observations in other variables again yields the weak partial ordering. Thus, the weak regular partial ordering [10] will be called ‘natural’ for applications where each variable can be assumed to be a surrogate for the same underlying latent factor. Formally, we will define the proportion of paired comparisons involving subject A that can be decided as the ‘information content’ of subject A’s score. Unless the variables are highly correlated, so that adding more variables merely breaks ties, information content often decreases with the number of variables and the rigidity of the partial ordering. On the other hand, the more information about the underlying models is available to justify combining several variables into a linear combination, the more paired comparisons can be decided. Below, we will discuss two nonparametric approaches to increase information content by reflecting prior knowledge. The first is to transform the data using less stringent assumptions than those imposed by the linear model. Then, we will demonstrate how allowing for hierarchical structures among variables can increase information content when variables are known to belong to different factors (e.g., safety and efficacy) or sub-factors (45, 83). In turn, identifying structures that maximize information content then leads to a nonparametric alternative to traditional factor analyses, including more efficient questionnaires.
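The effect of the ordering’s rigidity on information content can be checked numerically (an illustrative Python sketch with hypothetical bivariate profiles; `weak_lt` implements [10] and `strong_lt` implements [11]):

```python
def weak_lt(a, b):
    """Weak regular partial order [10]."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def strong_lt(a, b):
    """Strong partial order [11]: strictly lower in every variable."""
    return all(x < y for x, y in zip(a, b))

def information_content(subjects, lt):
    """Per-subject proportion of paired comparisons that can be decided
    (a < b, b < a, or a identical to b) under the given partial order."""
    n = len(subjects)
    return [sum(lt(a, b) or lt(b, a) or a == b
                for j, b in enumerate(subjects) if j != i) / (n - 1)
            for i, a in enumerate(subjects)]

profiles = [(1, 2), (2, 3), (0, 1), (2, 1)]
print(information_content(profiles, weak_lt))    # every pair but one decided
print(information_content(profiles, strong_lt))  # fewer comparisons decided
```

On these four profiles, the weak order leaves only one pair undecided, whereas under the strong order one subject cannot be compared to anyone, illustrating why rigidity costs information content.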
4.2.4. Computational Aspects

Deuchler’s (44) univariate algorithm to depict a complete ordering as a symmetric matrix (82) is easily extended to a partial and imperfect (non-transitive) ordering for data profiles comprising several variables. Morales et al. (45) separated this process into two steps, as outlined in Fig. 2.11. First, Deuchler’s (44) univariate paired comparisons are represented as the matrices U^(ℓ) = (u^(ℓ)_kk′) (middle row), with u^(ℓ)_kk′ = ? if x^(ℓ)_k or x^(ℓ)_k′ is missing and u^(ℓ)_kk′ = I(x^(ℓ)_k > x^(ℓ)_k′) − I(x^(ℓ)_k < x^(ℓ)_k′) otherwise (76), extending Deuchler’s (44) algorithm to incomplete (not necessarily partial) orderings with missing data. The matrix U obtained by the ‘AND’ operation

U = ∧_ℓ U^(ℓ) = (u_kk′), where

u_kk′ =  1  if ∃ℓ: u^(ℓ)_kk′ = 1 and ∀ℓ: u^(ℓ)_kk′ ≠ −1,
         0  if ∃ℓ: u^(ℓ)_kk′ = 0 and ∀ℓ: u^(ℓ)_kk′ ≠ ±1,
        −1  if ∃ℓ: u^(ℓ)_kk′ = −1 and ∀ℓ: u^(ℓ)_kk′ ≠ 1,
         ?  otherwise,

Fig. 2.11. Ambiguity caused by discordant paired comparisons across variables. Hypothetical example with four variables X1, X2, Y1, and Y2, observed in four subjects A, B, C, and D. The node on top shows the data, the center row the four univariate partial (X1) and complete (X2, Y1, Y2) orderings. The node at the bottom shows the data, the multivariate partial ordering, the μ-scores (U), and their information content (W). Each matrix shows whether the row element is smaller (−1), identical (0), or larger (1) than the column element. Ambiguous paired comparisons are indicated as ‘?’. Resolving ambiguities (such as those in X1) when the corresponding paired comparisons in the other matrices are unambiguous increases information content, but may result in non-transitivity and, thus, an ‘imperfect’ order. On the other hand, ambiguities can arise (such as in the upper right and lower left corner of the bottom matrix) if some paired comparisons are negative, while others are positive (modified from Statistical Applications in Genetics and Molecular Biology, Morales et al. (45), available at http://www.bepress.com/sagmb/vol7/iss1/art19/, with permission from The Berkeley Electronic Press, © 2008).
with ‘?’ indicating ambiguity, is the same as the matrix one would obtain by applying the imperfect ordering defined earlier (76), and, thus, the scores obtained from U = ∧_ℓ U^(ℓ) (bottom of Fig. 2.11) are the nonhierarchical μ-scores. Recently, Cherchye and Vermeulen (84) proposed to replace the matrix of paired comparisons, which has entries ‘+1’, ‘0’, ‘?’, and ‘−1’, by a computationally simpler ‘greater-equal’ (GE) matrix with binary entries u_kk′ = I(x_k ≥ x_k′). Combining these two approaches, one can compute univariate GE matrices u^(ℓ)_kk′ = I(x^(ℓ)_k ≥ x^(ℓ)_k′) first and then combine them into a multivariate GE matrix. Several of these matrices can then be combined, again, thereby providing a high level of flexibility. When comparing the severity of ‘damage’ seen in ultrasound or radiologic images (85), for instance, it may not be possible to define all relevant variables. In these situations, one might present a rater with pairs of images (A, B) to be judged as A < B, A > B, or A ≈ B. As in the univariate case (see Section 3.1.2), an interactive system could then reduce the number of questions to be asked by inferring which paired comparisons are implied by the answers already given. If the number of pairs that each rater can be presented with is to be limited, the unanswered questions would be considered ambiguous.

4.2.5. Implementation and Availability
The muStat package for R and S-PLUS (available from http://cran.r-project.org and http://csan.insightful.com, respectively) implements these steps as the functions mu.PwO and mu.AND, followed by computation of scores and information content (Fig. 2.12).

mu.PwO <- function(x, y = x)
  if (length(y) > 1) apply(rbind(x, y), 2, mu.PwO, nrow(x))
  else as.numeric(NAtoZer(outer(x[1:y], x[-(1:y)], ">=")))

mu.AND <- function(PwO, frml = NULL)
  if (is.null(frml)) {
    GE <- sq.array(PwO); AND <- GE[,,1]^0
    # ... combine the univariate GE matrices, counting NAs ... (code omitted)
  } else {
    # ... deal with the formula ... (code omitted)
  }

mu.Sums <- function(PwO) {
  GE <- sq.matrix(PwO)
  wght <- colSums(GE | t(GE))
  list(score  = (rowSums(GE) - colSums(GE)) * ifelse(wght == 0, NA, 1),
       weight = wght)
}

Fig. 2.12. Simplified code for the core functions of the muStat package.
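To make the two-step logic concrete, the same pipeline (univariate GE matrices, an ‘AND’ combination, then scores and weights as in mu.Sums) can be sketched in Python. This is an illustration of the logic only, not the muStat code; the two variables are hypothetical and complete, so missing values are not handled here:

```python
def ge_matrix(values):
    """Univariate greater-equal matrix: GE[k][k2] = 1 if x_k >= x_k2."""
    return [[int(a >= b) for b in values] for a in values]

def and_combine(matrices):
    """Multivariate GE matrix: k >= k2 holds only if it holds in every variable."""
    n = len(matrices[0])
    return [[int(all(M[i][j] for M in matrices)) for j in range(n)]
            for i in range(n)]

def mu_sums(GE):
    """Scores and weights as in mu.Sums: score = rowSums - colSums,
    weight = number of comparisons decided in either direction."""
    n = len(GE)
    score  = [sum(GE[i]) - sum(GE[j][i] for j in range(n)) for i in range(n)]
    weight = [sum(GE[i][j] or GE[j][i] for j in range(n)) for i in range(n)]
    return score, weight

X1 = [1, 2, 0, 1]          # hypothetical variables over four subjects
X2 = [2, 2, 1, 2]
GE = and_combine([ge_matrix(X1), ge_matrix(X2)])
print(mu_sums(GE))         # -> ([0, 3, -3, 0], [4, 4, 4, 4])
```

As in the R code of Fig. 2.12, the weight counts the self-comparison as well, since colSums(GE | t(GE)) includes the diagonal.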
Using the asymptotic results of Hoeffding (73), the resulting μ-scores can then be analyzed with mu.test (Fig. 2.8) or through bootstrapping (86).

4.3. Relationships between Variables

4.3.1. Graded Variables
Often, a simple transformation of the variables may suffice to reflect additional knowledge. For graded variables, where one unit has less impact in a ‘lower grade’ than in a ‘higher grade’ variable (74, 84), one can split each value of variable (ℓ) (sorted by grade) into the value of the lowest-grade variable and incremental values of the higher-grade variables ℓ′ = 2, …, ℓ:

x_k,(1) = x_k,(1)1,
x_k,(2) = x_k,(2)1 + x_k,(2)2,
…
x_k,(L) = x_k,(L)1 + x_k,(L)2 + … + x_k,(L)L.

Thus, the profile of counts sorted by grade (x_k,(1), x_k,(2), …, x_k,(L)) can be expressed as the column sums (x_k,(≥1), …, x_k,(≥L)), where x_k,(≥ℓ) = Σ_ℓ′=ℓ,…,L x_k,(ℓ′). The partial ordering for graded variables

(x_k,(1), …, x_k,(L)) < (x_k′,(1), …, x_k′,(L)) ⇔ {∀ℓ=1,…,L: x_k,(≥ℓ) ≤ x_k′,(≥ℓ)} ∧ {∃ℓ=1,…,L: x_k,(≥ℓ) < x_k′,(≥ℓ)}   [12]

is equivalent to the regular ordering [10] applied to the cumulative variables x_k,(≥ℓ). Although each profile’s outcomes are decomposed into additive components, substantially weaker assumptions are made than with linear-weight (lw) scores, because the additive components can be unknown and may even differ between pairs: for subjects far apart (subject k lower than subject k′ in each of the variables) or incomparable (some variables higher for subject k and some higher for subject k′), the weights are irrelevant. Thus, the weights only need to be ‘locally similar’, rather than ‘globally constant.’

4.3.2. Hierarchically Structured Variables
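The cumulation step for graded variables is a one-liner (illustrative Python; the counts of ‘mild’/‘moderate’/‘severe’ events are hypothetical):

```python
def cumulative_profile(counts):
    """Transform counts sorted by grade (lowest grade first) into the
    cumulative variables x_(>=1), ..., x_(>=L), where x_(>=l) is the
    number of events of grade l or higher."""
    return [sum(counts[l:]) for l in range(len(counts))]

# hypothetical counts of (mild, moderate, severe) events for two subjects
a = cumulative_profile([3, 1, 0])   # -> [4, 1, 0]
b = cumulative_profile([1, 1, 1])   # -> [3, 2, 1]
print(a, b)
```

Applying the regular ordering [10] to the cumulative profiles [4, 1, 0] and [3, 2, 1] leaves the two subjects incomparable (the first has more events overall, the second more severe ones), which is exactly the behavior [12] prescribes without fixing numeric weights per grade.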
If the variables are related to different ‘factors’ and the order between subjects A and B is ambiguous with respect to variables related to one factor (e.g., genetics), unambiguous results with respect to another factor (e.g., environment) can ‘overwrite’ this ambiguity. The advantage of creating the matrices reflecting the univariate orderings first (mu.PwO) and combining them in a separate step (mu.AND) before computing the scores (mu.Sums) is that incorporating knowledge about the sub-factor hierarchy by hierarchically combining the matrices typically reduces the loss of information content (45) (the number of unambiguous paired comparisons contributing to a score). If the variables X1 and X2 in Fig. 2.11 were related to one factor, while the variables Y1 and Y2 are related to another factor, one could replace U_NH = ∧_ℓ U^(ℓ) by U_H = (∧_ℓ∈{X1,X2} U^(ℓ)) ∧ (∧_ℓ∈{Y1,Y2} U^(ℓ)). Figure 2.13 demonstrates how reflecting the structure increases information content by resolving the ambiguity related to comparing A vs. D. Figure 2.13 also shows that reflecting more hierarchical information can never decrease and typically increases information content, because it reduces the effect of ‘noise’ that may have caused paired comparisons within a factor to be ambiguous. If all ambiguities are resolved, the μ-scores become ranks, which are uniformly spaced across the widest possible range.

4.3.3. Doubly Interval-Censored Variables

Since Gehan, u-scores have been used to handle singly interval-censored data, where only the last date the subject is known to have been negative (LDN) and the first date the subject is known to have been positive (FDP) are available. Subject A experiences the event under investigation ‘later’ than subject B if LDN(A) > FDP(B). For left- and right-censored observations, LDN and FDP are −∞ and +∞, respectively. μ-Scores, consequently, can easily handle censored (including interval-censored) variables (Fig. 2.14).
Fig. 2.13. Resolving ambiguity from discordant paired comparisons: Using hierarchical structure in the example of Fig. 2.11. Note that adding the intermediate step resolves the ambiguity in the lower left and upper right corner of the matrices representing the partial orderings (modified from Statistical Applications in Genetics and Molecular Biology, Morales et al. (45) available at http://www.bepress.com/sagmb/vol7/iss1/art19/, with permission from The Berkeley Electronic Press, © 2008).
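The resolution mechanism can be made concrete with a small Python sketch (an illustration only, not the muStat implementation; the two-subject data are hypothetical and chosen so that the two X variables conflict while the two Y variables agree):

```python
def cmp_matrix(values):
    """Signed paired-comparison matrix: 1, 0, or -1 per pair; None ('?')
    if either value is missing."""
    def cmp(a, b):
        if a is None or b is None:
            return None
        return (a > b) - (a < b)
    return [[cmp(a, b) for b in values] for a in values]

def mu_and(matrices):
    """'AND' of signed comparison matrices: a pair is 1 (or -1) if some input
    decides it that way and none decides the opposite; 0 if some input ties
    it and none decides either way; otherwise None ('?')."""
    def combine(entries):
        es = [e for e in entries if e is not None]
        if 1 in es and -1 not in es:
            return 1
        if -1 in es and 1 not in es:
            return -1
        if 0 in es and 1 not in es and -1 not in es:
            return 0
        return None
    n = len(matrices[0])
    return [[combine([M[i][j] for M in matrices]) for j in range(n)]
            for i in range(n)]

# two subjects A, B; factor X = (X1, X2) conflicts, factor Y = (Y1, Y2) agrees
X1, X2, Y1, Y2 = [2, 1], [1, 2], [3, 2], [3, 2]
U_flat = mu_and([cmp_matrix(v) for v in (X1, X2, Y1, Y2)])
U_X = mu_and([cmp_matrix(X1), cmp_matrix(X2)])
U_Y = mu_and([cmp_matrix(Y1), cmp_matrix(Y2)])
U_hier = mu_and([U_X, U_Y])
print(U_flat[0][1], U_hier[0][1])   # -> None 1
```

The flat combination of all four variables leaves A vs. B ambiguous (conflicting signs), while combining within factors first turns the X conflict into a ‘?’ that the unambiguous Y factor then overwrites, which is the resolution mechanism described above.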
Fig. 2.14. Reflecting censoring during the creation of matrices of paired comparisons. Subjects A and B can be ordered if X1(A)>X2(B) or Y2(A)
Adding a hierarchical ordering to interval-censored data also provides a solution to analyzing doubly interval-censored data, where, for instance, both the date of exposure and the date of disease manifestation are interval censored (Fig. 2.15).

Fig. 2.15. Doubly interval-censored data as a special case of hierarchically structured interval-censored data. O0 and O1 indicate the last negative and first positive date of onset, respectively, while R0 and R1 indicate the last negative and first positive date of response, respectively. As neither the onset nor the response ranges overlap, subjects A and B can be ordered.
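Under the same logic, a doubly interval-censored comparison can be sketched as two Gehan comparisons combined without conflict (illustrative Python; the subjects and intervals are hypothetical):

```python
def interval_cmp(a, b):
    """-1 / 1 if interval a = (last negative, first positive) lies entirely
    before / after interval b; None if the intervals overlap."""
    if a[1] < b[0]:
        return -1
    if a[0] > b[1]:
        return 1
    return None

def doubly_censored_cmp(onset_a, onset_b, response_a, response_b):
    """Hierarchical comparison for doubly interval-censored data: an
    ambiguous onset comparison may be resolved by the response comparison,
    unless the two comparisons conflict."""
    decided = {c for c in (interval_cmp(onset_a, onset_b),
                           interval_cmp(response_a, response_b))
               if c is not None}
    return decided.pop() if len(decided) == 1 else None

# hypothetical subjects: onset interval (O0, O1), response interval (R0, R1)
A = ((0, 1), (2, 3))
B = ((2, 3), (4, 5))
print(doubly_censored_cmp(A[0], B[0], A[1], B[1]))   # -> -1 (A earlier)
```

If onset and response disagree in sign, the pair stays ambiguous, matching the ‘AND’ semantics used throughout this section.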
4.3.4. Genetical Structure (Epistasis Between Diplotypes)
With SNP arrays, which can have an even larger number of variables (currently > 500 K) than expression arrays, the known sequence of the SNPs on the chromosome (ordinal structure) can be utilized to reduce complexity. From Fig. 2.16, a disease locus in close proximity to a particular SNP (e.g., SNP A) is likely to be highly correlated (in linkage disequilibrium, LD) with this SNP.
[Fig. 2.16 panels: a proximal disease locus in LD primarily with SNP A; an intermediate disease locus in LD with SNPs C and D (and B).]

Fig. 2.16. Hypothetical relative distance of disease loci to a single ‘proximal’ SNP or a pair of ‘distal’ SNPs and corresponding muStat formulae. Top: conceptual interval structure. Alternating intervals around a SNP and between two adjacent SNPs (first row); diplotype ranges of various lengths (second and third rows). Bottom: muStat formulae. To conserve space, single-SNP intervals at the end of a diplotype, which may or may not increase association, are not capitalized; ‘linked polarity’ among the same SNP appearing in adjacent intervals is connected by dotted lines.
A disease locus approximately equidistant from two adjacent loci (SNPs C and D) may be in LD with both of the adjacent SNPs. Traditionally, one would look at the plot of the univariate −log10 p by locus and give more confidence to a locus if several neighboring loci are also in LD. With μ-scores, one can now directly aggregate evidence from neighboring SNPs by assuming that each chromosome consists of intervals around and between SNPs. (Note that we will not require the demarcations between these intervals to be known.) In particular, we will treat each genetic interval as two variables, each contributing to the same disease locus as a ‘factor’ (either side of Fig. 2.13). With high-density SNPs, several adjacent intervals may form diplotypes in LD with a single disease locus, the most informative diplotypes being determined by the LD/noise ratio. Epistasis is defined as an interaction between diplotypes that is associated with a phenotype. Within muStat, epistasis is indicated as a hierarchical structure among diplotypes (Fig. 2.17).
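A windowed diplotype score along these lines might be sketched as follows (a simplified illustration, not muStat: the 0/1/2 genotype coding, the window, and the conflict handling, where ambiguous pairs simply contribute 0 to the score, are all assumptions made for this sketch):

```python
def diplotype_scores(snps, start, width):
    """mu-scores from the window of `width` adjacent SNPs starting at `start`:
    subject i counts +1 over subject j only if no SNP in the window ranks
    i below j and at least one ranks i above j (and vice versa for -1).
    Ties and conflicting windows both contribute 0."""
    window = snps[start:start + width]
    n = len(window[0])
    def cmp(i, j):
        cs = {(g[i] > g[j]) - (g[i] < g[j])
              for g in window if g[i] is not None and g[j] is not None}
        if 1 in cs and -1 not in cs:
            return 1
        if -1 in cs and 1 not in cs:
            return -1
        return 0
    return [sum(cmp(i, j) for j in range(n) if j != i) for i in range(n)]

# hypothetical genotypes (rows = adjacent SNPs, columns = subjects)
snps = [[0, 1, 2, 1],
        [0, 2, 2, 1],
        [1, 1, 2, 0]]
print(diplotype_scores(snps, start=0, width=3))   # -> [-2, 1, 3, -2]
```

Sliding `start` across a chromosome and varying `width` mimics scanning diplotypes of different lengths, with the most informative width set by the LD/noise ratio as discussed above.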
4.3.5. Summary
While u-statistics for univariate data (11, 18) and censored data (77–79) are widely used, u-statistics for multivariate data (73, 80) are rarely applied, presumably because they were not presented in an easy-to-use form and transistors had just been put to practical use (87). A more fundamental problem may have been an even
Fig. 2.17. Epistasis between diplotypes. Left: sequence of hierarchical and combinatorial steps. Right: corresponding mu.AND formulae. (See Fig. 2.16 for details.)
more important hindrance. As the number of variables increases, information content (the proportion of paired comparisons that can be decided) drops fast, until all μ-scores (u-scores for multivariate data) become NA, especially with a ‘strong’ order (82), as compared to its ‘weak’ counterpart (67). As pointed out recently (88), averaging univariate u-scores (89) or using the lexicographical order avoids this problem, yet requires the often unrealistic assumptions that the relative importance of the variables is constant and known or that less important variables contribute only by breaking ties. Two recently developed extensions of u-statistics resolve this conundrum. First, the process of scoring multivariate data was separated into determining the univariate orders first (44) and then combining such orders (45) into an incomplete order. As the proposed combination of incomplete orders results in yet another incomplete order, a ‘tree’ of incomplete orders can be defined, wherein hierarchical (sub-)factor structures can reflect functional, topological, or temporal relationships. Conversely, ‘μ-factor analysis’ can identify hierarchical factor structures that maximize information content (90). Second, cumulation of count variables (62) allows for ‘graded’ variables (e.g., ‘mild’ < ‘moderate’ < ‘severe’) but avoids the infinite relative weights implicitly assigned with a lexicographical order. μ-Scores can be used for various analyses, including testing differences between groups defined by simple genotypes with respect to complex phenotypes, correlating complex genotypes with complex phenotypes, or identifying genetic variables that explain (correlate best with) a complex phenotype.
Once GE matrices (or scores) are created, further strategies can be employed to incorporate design or model knowledge. For instance, if variables can be ordered, but the outcomes cannot be cumulated (as in Section 4.3.1), Kendall’s (91) correlation coefficient r_K or, equivalently, the Jonckheere–Terpstra test (92, 93) could be used. For paired observations, the Wilcoxon signed-rank test (10) (WSR) is often proposed as a ‘nonparametric’ alternative to the paired t-test. Since the WSR is based on the arithmetic difference between the paired observations, however, its results are not independent of scale transformations. The Lam–Longnecker test (58), instead, is directly based on the WMW test, except that the variance is reduced by a factor 1 − r_S, where r_S is Spearman’s rank correlation coefficient. Replacing r_S by the closely related r_K yields a truly ‘nonparametric’ alternative to the WSR, which fits seamlessly into the above unified concept of u-tests.

4.4. Applications

4.4.1. μ-Discriminant Analysis: Collaboration vs. Coregulation (Supervised)
The focus of many traditional methods is on ‘correlation’ as an indicator of ‘coregulation’, i.e., of variables along the same ‘pathway’. Often, however, different subjects may respond along different pathways, i.e., several pathways may be able to “share the load” (94). Then, including several variables (or pathways) may provide better discrimination between categories than using individual variables, until the variables included begin to model noise, in which case the information content of the resulting scores would begin to decline (see Section 4.2.3). μ-Scores enable a paradigm shift by introducing a novel concept of interaction. With the linear model, finding the best discriminating function was often futile, because the choice of the best discriminating set depended on the linearizing transformations and the relative weights chosen. Selecting one among the many ‘best solutions’ generated by the various combinations of subjective transformations and weights was difficult at best. Moreover, the mere assumption that the relative importance of two variables would not depend on the magnitude of these and other variables could easily bias the results. μ-Scores are independent of (monotonous) transformations and (positive) weights, so that solutions are less dependent on the assumptions made and thus allow, for the first time, a new type of question to be approached. Instead of focusing on ‘coregulated’ variables within a pathway, one could search for variables representing ‘collaborating’ pathways. This paradigm shift has direct implications for applications in systems biology. In Spangler et al. (94), who compared activity of dopaminergic receptors between rats addicted to sugar and control rats, the response in the caudatus putamen (CPU, Fig. 2.18, bottom) was clearly more ‘coordinated’ than in the nucleus accumbens (NAC, Fig. 2.18, top). In the NAC, bivariate profiles including D3 and either of D1, D2, pD, or pT discriminated better than their components, and trivariate profiles containing both D3 and pT discriminated even better (p < 0.0001). In the CPU, in contrast, D2 alone (p = 0.008) discriminated better than any combination.

Fig. 2.18. Coregulation vs. collaboration. Dopaminergic activity in the caudatus putamen (left) and nucleus accumbens (right) of rats addicted to sugar (closed circles) vs. control rats (open circles).
A recent study comparing patients with Fanconi anemia type C (FA-C) to normal controls (95) identified, in a first step, 200 differentially expressed genes in univariate comparisons (Fig. 2.19, top; significance indicated by size of nodes). These genes showed some clustering, albeit without hinting at any suggestive interpretation. In multivariate analyses, AURKA and RRM2 emerged as members of many of the most significant pairs and triplets, respectively. Both genes are targets of drugs in cancer treatment, suggesting that these drugs might become the first treatments for FANCC patients. Thus, adding cooperation as a novel concept in systems biology opens an additional dimension for the interpretation of functional relationships between molecular variables. By integrating the muStat output with systems such as Cytoscape (http://www.cytoscape.org/), collaborative relationships are easily visualized (see Fig. 2.19).

4.4.2. Genome-Wide Association Studies
After the human genome was decoded in 2004, it was widely expected that medicine would enter a new era of personalized medicine, where common diseases, such as hypertension, obesity, and cancer, could be successfully treated. Many of the early results, however, could not be confirmed in subsequent studies. Figure 2.20 exemplifies the problems experienced. Traditionally, each SNP is analyzed using either the Cochran–Armitage χ² test for trend (96) with weights (1, 1, 0), (1, 2, 3), and (0, 1, 1) for dominant, additive, and recessive effects, respectively, the 2 × 3 ‘genetic’ χ² test, or the 2 × 2 ‘allelic’ test based on counts of alleles (which is less appropriate (32) for the same reason as the TDT, see Section 2.2.1, and is thus omitted here). To the dismay of investigators, even the most significant loci often turned out to be false positives, while important loci were often overlooked. In Fig. 2.20, for instance, 177 cases of children with childhood absence epilepsy (CAE) are compared against three sets of 354 controls matched against different subsets of ancestry informative markers (AIMs). Not surprisingly, the traditional and the
Fig. 2.19. Collaboration between pairs and trios of genes against a background of genes clustered by pairwise correlation. AURKA (top) and RRM2 (bottom) are identified through gene–gene interactions as relevant for the pathology of FANCC. Gene ontology information suggests that subjects may differ with respect to how three major pathways (signal transduction, repair/recombination/replication, protein metabolism/transcription) are involved. Background: MDS (multidimensional scaling) map (with 1 − r² as distance) showing genes clustered by coregulation and significance of the univariate p-values (size of nodes). Foreground: the network indicates the ‘most significant’ pairs and trios of genes, with font size indicating the smallest of the uni-, bi-, and trivariate p-values (95).
Fig. 2.20. GWAS comparing 177 children with childhood absence epilepsy to three sets of differently matched normal controls. Top heat map: LD; bottom heat map: −log10(p-value). χA/B/C: max χ² (recessive, trend, dominant) by control set; μ1…μ6: μ-test by length of diplotype. Bold lines indicate the promoter region and a particular exon of the longer splice form implicated by the most significant μ-test results.
μ-score analyses yield consistent results, although only one SNP has a consistent max(χ²) p-value below 10^−3 for all three control sets. Hence, this locus would not have drawn any attention. While the results of traditional single-SNP χ²-tests point to the haplotype block as a whole at best, the results of the diplotype μ-tests (stratified for control sets, using the method of Section 3.2) point to two specific regions with high biologic plausibility (Fig. 2.20). One ‘significance triangle’ points at a single exon, the other at the promoter region of a splice variant that is expressed in embryonal brain only and is involved in neuronal development, so that it clearly is a strong candidate gene. GWAS is primarily a screening strategy to generate hypotheses, which then need to be studied further. As a criterion in a selection procedure (98) (balancing not the risk of false-positive results, but the number of loci selected vs. the risk of overlooking the most relevant loci), rather than as an indication of the confidence in a particular result, p-values must not be judged by their absolute value (mainly a function of sample size). Instead, one may give high priority for further investigation to those hypotheses that have both low p-values and high biologic plausibility. Given the high “false-positive rate” in the early GWAS studies, sample size requirements have since been increased to thousands or even tens of thousands of cases. As an alternative, μ-scores make it possible to increase the signal/noise ratio by increasing the number of neighboring SNPs that are comprehensively analyzed.
As demonstrated in Fig. 2.20, investing more computational effort into better statistical methods can substantially reduce the sample size requirements for GWAS, even for common, complex diseases, such as CAE. While the results of traditional single-SNP χ²-tests merely point to the haplotype block, in general, the results of the stratified (Section 3.2) diplotype μ-tests point to two specific regions with high biologic plausibility. In some cases, variations in the promoter region may affect regulation of this splice variant, while variations in the implicated exon may affect function. Figure 2.21 shows how representing genetic structure increases sensitivity for detecting association, in general, and epistasis, in particular. Among the 13 chromosomes selected, univariate analysis (bottom border of both diagrams) points to a short range at the beginning of Chr 3 and, possibly, Chr 12 as a whole. Moving from SNPs to diplotypes by itself (top border of the right diagram) merely shifts the focus on Chr 3 slightly higher. Epistasis between SNPs, as indicated in the square areas, suggests that the range identified on Chr 3 might interact with Chr 11 as a whole and suggests epistasis between Chr 10 and Chr 11. Allowing for epistasis between diplotypes confirms these results and, in addition, points to ranges on Chr 5 as interacting with ranges on Chr 3, 4, 11, and 12. Moreover, one now sees that the SNPs on several chromosomes are separated into ‘clusters’ interspersed by SNPs that seem to be unrelated.
Fig. 2.21. Heatmap of correlations between atherosclerosis and μ-scores of combinations of diplotypes on mouse chromosomes 8–Y (Left: individual SNPs; right: diplotypes of up to 4 SNPs).
4.4.3. Complex Phenotypes
The above examples tried to explain relatively simple phenotypes, such as addiction to sugar (Fig. 2.18), presence vs. absence of a disease (Fig. 2.19), or degree of atherosclerosis in mice (Fig. 2.21). Even with some ‘monogenic’ diseases, such as Fanconi anemia, however, the phenotype involved can be anything
Wittkowski and Song
but simple. With FA, the phenotype comprises various congenital malformations (often binary), life span, time of cancer manifestation or hematological failure (each often censored), and laboratory measurements of chromosomal stability (mostly quantitative) (Fig. 2.22, top).

4.4.4. μ-Factor Analysis: Hierarchical Coregulation Structures (Unsupervised)
Being able to increase information content by introducing a hierarchical structure offers the opportunity to reverse the goals, i.e., to find the hierarchical structure for a given set of variables that maximizes information content. Diana et al. (90), for instance, compared three different models (hierarchical structures) with respect to the information content of the resulting scores. Clearly, an exhaustive search among all possible hierarchical factor structures is often not computationally feasible, even for relatively small numbers of variables, unless a limited number (up to a few thousand) is preselected. In Diana et al. (90), this approach was used to decide whether commuters categorize transportation modes by
• C1: Constraint: bicycle < motorcycle < car (driver, pax) < transit (bus, tram, metro)
• C2: Typology: two-wheeled (bicycle, motorcycle), car (driver, pax), transit (bus, tram, metro)
• C3: Autonomy: passenger (car pax, bus, tram, metro) < driver (bicycle, motorcycle, car driver).
In this case, the data suggested autonomy as the concept most relevant for individual decision making. In molecular biology, one might easily use the same approach as an alternative to traditional factor (often called 'cluster') analysis. Figure 2.23 shows another possible use of this method, namely to classify objects with respect to how much they gain from the use of the hierarchy. With RDM, in particular, it is clear that the hierarchical and nonhierarchical μ-scores are essentially the same, but there exists a population whose scores are improved by using the hierarchy.
4.4.5. μ-Signatures and Personalized Medicine
A frequent problem in molecular biology is to find a molecular 'signature' that discriminates between diseases that present with similar phenotypes but require different therapeutic interventions. Better-targeted diagnostics could improve patient health by avoiding the common trial-and-error approach to identifying the best treatment option. In a study of 33 cellular surface markers and six plasma biomarkers as diagnostic markers for two hemophagocytic syndromes (98), univariate u-tests discriminated poorly, with 30.0, 21.0, 30.2, 37.7, 38.5, and 31.8% misclassifications, respectively. Using μ-scores, the subset of biomarkers that discriminated best between MAS and HLH in this population (p = 10⁻¹⁵,
Fig. 2.22. Reflecting a hierarchical structure among variables increases information content and improves discrimination. Hierarchical data structure of the Fanconi anemia phenotype (top), hierarchy increasing information content (center), and discriminatory power (bottom) (45).
Fig. 2.23. Model selection by information content (C3 > C1 > NH). Left: Subjective Mobility (SM); right: Relative Desired Mobility (RDM). Inserts of correlations between nonhierarchical (NH) and either C1 or C3 hierarchical scores show two categories of subjects, especially with respect to RDM (90).
9.4% misclassifications) consisted of two plasma and two surface markers. In contrast to traditional methods, discrimination based on μ-scores does not necessarily improve when more variables are added. In fact, reduced significance with more variables may suggest that noise is being modeled. Moreover, μ-scores are invariant to (monotone) transformations and (positive) weights. Thus, neither the selection of the variables nor the choice of transformations or weights needs to be 'validated,' so that decisions can be made in a timely fashion. For complex diseases, finding signatures that can be applied to the population as a whole has proven difficult, if not impossible (99), in part because different pathways may
be involved in different subpopulations. As signatures based on μ-scores are 'intrinsically valid' for the population in which they are developed (requiring validation for neither the number of variables chosen nor the transformations applied to them), they may provide a solution to this conundrum. Traditionally, when the same 'signature' is used for a wide range of patients, the representativity of the case population for the wide range of subjects to be diagnosed needs to be 'validated.' However, when a signature is developed from a subpopulation of cases specifically selected to match the subject on criteria considered relevant by the treating physician (a process which could itself involve μ-scores for similarity), the need to 'validate' representativity is substantially lessened, so that a personalized signature can be generated ad hoc. Most importantly, this subpopulation of cases is likely to be more homogeneous with respect to the risk factors involved, so that highly predictive signatures will be easier to identify.

4.4.6. Implementation and Availability
In the muStat package for R and S-PLUS, the function mu.PwO generates Deuchler's univariate orderings, mu.AND combines them into an incomplete ordering, and mu.Sums computes scores and weights from an incomplete ordering (Fig. 2.12). The pseudo-code below demonstrates calculation of the hierarchical (UH) and the nonhierarchical (UNH) μ-scores of Fig. 2.22. The parentheses indicate the hierarchy; variables separated by a colon indicate the bounds of interval-censored observations.

    frml <- "( ((BLnB,BLpC),((DEBnB,DEBpC),Mosaic)),
               LS0:LS1,
               (CA0:CA1,LK0:LK1),
               (HM0:HM1,(P0:P1,R0:R1,W0:W1),BMT0:BMT1),
               ((HZ%,(EAR,...)),(HT%,(BMRK,...))) )"
    x    <- importData(...)
    PO   <- mu.PwO(x, frml)      # creates univariate pw orderings, using censoring info
    POH  <- mu.AND(PO, frml)     # creates partial ordering w/hierarchy
    PONH <- mu.AND(PO)           # creates partial ordering wo/hierarchy
    UH   <- mu.Sums(POH)$score   # creates scores w/hierarchy
    UNH  <- mu.Sums(PONH)$score  # creates scores wo/hierarchy
When using the elementary functions of Fig. 2.12, the statements mu.Score(x,frml) and mu.Sums(mu.AND(mu.PwO(x,frml), frml))$score are equivalent to the following pseudo-code:
    mu.Sums(
      mu.AND(cbind(
        mu.AND(cbind(
          mu.AND(mu.PwO(x[,c(BLnB,BLpC)])),
          mu.AND(cbind(
            mu.AND(mu.PwO(x[,c(DEBnB,DEBpC)])),
            mu.PwO(x[,Mosaic]))))),
        mu.PwO(x[,LS0],x[,LS1]),
        mu.AND(cbind(
          mu.PwO(x[,CA0],x[,CA1]),
          mu.PwO(x[,LK0],x[,LK1]))),
        mu.AND(cbind(
          mu.PwO(x[,HM0],x[,HM1]),
          mu.AND(cbind(
            mu.PwO(x[,P0],x[,P1]),
            ...
            mu.PwO(x[,W0],x[,W1]))),
          mu.PwO(x[,BMT0],x[,BMT1]))),
        mu.AND(cbind(
          mu.AND(cbind(mu.PwO(x[,HZp]),
                       mu.PwO(x[,c(EAR,...)]))),
          mu.AND(cbind(mu.PwO(x[,HTp]),
                       mu.PwO(x[,c(BMRK,...)]))))))))$score
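For readers without access to muStat, the core logic of mu.PwO, mu.AND, and mu.Sums can be sketched in a few lines of Python. This is a simplified illustration, not the package's implementation: ties among comparable pairs score 0, conflicting variables render a pair incomparable, and censoring and hierarchies are omitted.

```python
import numpy as np

def mu_pwo(col):
    # Pairwise ordering for a single variable: +1 if the row subject is
    # higher than the column subject, -1 if lower, 0 if tied.
    return np.sign(col[:, None] - col[None, :])

def mu_and(orderings):
    # Combine per-variable orderings into a partial ordering: a subject is
    # 'higher' than another only if no variable ranks it lower; pairs with
    # conflicting variables become incomparable (0).
    P = np.stack(orderings)
    pos = (P == 1).any(axis=0).astype(int)
    neg = (P == -1).any(axis=0).astype(int)
    return pos - neg

def mu_sums(po):
    # mu-score of a subject: number of subjects clearly below it
    # minus the number of subjects clearly above it.
    return po.sum(axis=1)

x = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0]])   # 3 subjects, 2 variables
po = mu_and([mu_pwo(x[:, k]) for k in range(x.shape[1])])
print(mu_sums(po))  # [-1  1  0]
```

Here subject 2 dominates subject 1 on both variables, while subject 3 conflicts with the others and contributes nothing to their scores, which is exactly the partial-ordering behavior the text describes.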
5. Conclusions

Nonparametric tests are often considered the second choice, to be used primarily, as Friedman implied, "to avoid the assumption of normality" (48). Being based on least-squares estimates, rather than maximum likelihood, ANOVA is asymptotically distribution-free. As Scheffé demonstrated in Chapter 10 of The Analysis of Variance (5), "Nonnormality has little effect on inferences about means," in general, and in ANOVA, in particular, because the central limit theorem holds even for moderately sized samples. Hence, when methods based on both the linear model and u-statistics are available, the choice between these approaches should rely primarily on the hypothesis to be tested (difference in tendency vs. difference in mean), rather than on the empirical distribution of the residuals (100). If differences of one unit have the same 'meaning' across the scale, the linear model is more appropriate, even for "non-normal" distributions. If, as in many biological applications, no truly 'linearizing' transformation exists, tests based on u-statistics often give more meaningful results (9).
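The distinction between a difference in means and a difference in tendency can be illustrated with a small simulation (a hypothetical sketch using SciPy; the log-normal scale is chosen so that a one-unit difference does not carry the same meaning everywhere on the scale):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Skewed (log-normal) responses in two groups differing by a shift on the log scale.
a = rng.lognormal(mean=0.0, sigma=1.0, size=30)
b = rng.lognormal(mean=0.8, sigma=1.0, size=30)

# Linear-model view: is there a difference in means?
t_p = stats.ttest_ind(a, b, equal_var=False).pvalue

# u-statistics view: is one group stochastically larger (difference in tendency)?
u_p = stats.mannwhitneyu(a, b, alternative="two-sided").pvalue
```

Both tests are valid here; they simply answer different questions, which is the criterion suggested above for choosing between them.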
Another reason for using methods based on the linear model is that, with the exception of some special cases (101, 102), nonparametric methods have not yet been generalized to complex factorial designs. On the other hand, factor and sub-factor structures for graded and censored variables often require restrictive assumptions for the linear model to apply. Hence, a new distinction between parametric and nonparametric methods emerges: the former are more appropriate for reflecting controlled structures among independent variables, while the latter may be better suited to reflect uncontrolled structures among dependent variables. When Fisher introduced structured experimental designs (7), he did so with an emphasis on agricultural problems, where the assumption of a linear relationship between water, sunshine, and nutrients and the single outcome (yield) is often easily justified. In molecular biology, in contrast, experiments are often simple, but the confounding variables and outcomes can be complex, with a structure that is only partially known. Thus, molecular biology may eventually prove to be a field where the recent advances in nonparametric statistics are especially useful. In particular, the recent generalizations to structured multivariate data may, together with sequencing methods and grid or cloud computing, prove equally important in allowing personalized medicine to become a successful strategy for improving health while reducing overall healthcare costs.
Acknowledgments

The work was supported in part by Grant No. UL1RR024143 from the U.S. National Center for Research Resources (NCRR). Of the many colleagues who have contributed to this chapter through discussions and suggestions, I would like to thank, in particular, Jose F. Morales, Ephraim Sehayek, Sreeram Ramagopalan, and Martina Durner for their input on the biological background, Sreeram Ramagopalan, Bill Raynor, and Norman Cliff for their helpful comments, an anonymous reviewer for an inspiring discussion, and Daniel Eckardt for help with Latin grammar.

References

1. Collins, F. S., Green, E. D., Guttmacher, A. E., and Guyer, M. S. (2003) A vision for the future of genomics research, Nature 422, 835–847.
2. Butler, D. (2003) The Grid: tomorrow's computing today, Nature 422, 799–800.
3. Pearson, T. A., and Manolio, T. A. (2008) How to interpret a genome-wide association study, JAMA 299, 1335–1344.
4. Psychiatric GWAS Consortium Coordinating Committee (2009) Genomewide association studies: history, rationale, and prospects for psychiatric disorders, Am J Psychiatry 166, 540–556.
5. Scheffé, H. (1959) The Analysis of Variance, Wiley, New York, NY.
6. Arbuthnot, J. (1710) An argument for divine providence taken from the constant regularity observ'd in the births of both sexes, Philos Trans R Soc London 27, 186–190.
7. Fisher, R. A. (1935) The Design of Experiments, Oliver & Boyd, Edinburgh.
8. Cliff, N. (1996) Answering ordinal questions with ordinal data using ordinal statistics, Multivariate Behav Res 31, 331–350.
9. Cliff, N. (1996) Ordinal Methods for Behavioral Data Analysis, Lawrence Erlbaum, Mahwah, NJ.
10. Wilcoxon, F. (1945) Individual comparisons by ranking methods, Biometrics 1, 80–83.
11. Mann, H. B., and Whitney, D. R. (1947) On a test of whether one of two random variables is stochastically larger than the other, Ann Math Stat 18, 50–60.
12. Kruskal, W. H., and Wallis, W. A. (1952) Use of ranks in one-criterion variance analysis, J Am Stat Assoc 47, 583–631.
13. Lewis, C. T., and Short, C. (1879) A Latin Dictionary, Clarendon, Oxford.
14. Georges, K. E. (1918) Ausführliches lateinisch-deutsches Handwörterbuch, Hahn, Hannover.
15. Tusher, V. G., Tibshirani, R., and Chu, G. (2001) Significance analysis of microarrays applied to the ionizing radiation response (vol 98, pg 5116), Proc Natl Acad Sci USA 98, 10515.
16. van de Wiel, M. A. (2004) Significance analysis of microarrays using rank scores, Kwantitatieve Methoden 71, 25–37.
17. Wang, Z., Gerstein, M., and Snyder, M. (2009) RNA-Seq: a revolutionary tool for transcriptomics, Nat Rev Genet 10, 57–63.
18. McNemar, Q. (1947) Note on the sampling error of the differences between correlated proportions or percentages, Psychometrika 12, 153–157.
19. Gauss, C. F. (1823) Theoria combinationis observationum erroribus minimis obnoxiae, Dieterich, Goettingen.
20. Coakley, C. W., and Heise, M. A. (1996) Versions of the sign test in the presence of ties, Biometrics 52, 1242–1251.
21. Dixon, W. J., and Mood, A. M. (1946) The statistical sign test, J Am Stat Assoc 41, 557–566.
22. Dixon, W. J., and Massey, F. J. J. (1951) An Introduction to Statistical Analysis, McGraw-Hill, New York.
23. Rayner, J. C. W., and Best, D. J. (1999) Modelling ties in the sign test, Biometrics 55, 663–665.
24. Rao, P. V., and Kupper, L. L. (1967) Ties in paired-comparison experiments: a generalization of the Bradley–Terry model, J Am Stat Assoc 62, 194–204.
25. David, H. A. (1988) The Method of Paired Comparisons, 2nd ed., Griffin, London.
26. Stern, H. A. L. (1990) A continuum of paired comparisons models, Biometrika 77, 265–273.
27. Yan, T., Yang, Y. N., Cheng, X., DeAngelis, M. M., Hoh, J., and Zhang, H. (2009) Genotypic association analysis using discordant-relative-pairs, Ann Hum Genet 73, 84–94.
28. Spielman, R. S., McGinnis, R. E., and Ewens, W. J. (1993) Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM), Am J Hum Genet 52, 506–516.
29. Wittkowski, K. M. (1998) Versions of the sign test in the presence of ties, Biometrics 54, 789–791.
30. Wittkowski, K. M. (1989) An asymptotic UMP sign test for discretized data, Statistician 38, 93–96.
31. Wittkowski, K. M., and Liu, X. (2002) A statistically valid alternative to the TDT, Hum Hered 54, 157–164.
32. Sasieni, P. D. (1997) From genotypes to genes: doubling the sample size, Biometrics 53, 1253–1261.
33. Wittkowski, K. M. (1988) Friedman-type statistics and consistent multiple comparisons for unbalanced designs, J Am Stat Assoc 83, 1163–1170.
34. Student. (1908) On the probable error of a mean, Biometrika 6, 1–25.
35. Ramagopalan, S. V., McMahon, R., Dyment, D. A., Sadovnick, A. D., Ebers, G. C., and Wittkowski, K. M. (2009) An extension to a statistical approach for family based association studies provides insights into genetic risk factors for multiple sclerosis in the HLA-DRB1 gene, BMC Med Genetics 10, 10.
36. Hafler, D. A., Compston, A., Sawcer, S., Lander, E. S., Daly, M. J., De Jager, P. L., de Bakker, P. I. W., Gabriel, S. B., Mirel, D. B., Ivinson, A. J., Pericak-Vance, M. A., Gregory, S. G., Rioux, J. D., McCauley, J. L., Haines, J. L., Barcellos, L. F., Cree, B., Oksenberg, J. R., and Hauser, S. L. (2007) Risk alleles for multiple sclerosis identified by a genomewide study, N Engl J Med 357, 851–862.
37. Barcellos, L. F., Sawcer, S., Ramsay, P. P., Baranzini, S. E., Thomson, G., Briggs, F., Cree, B. C., Begovich, A. B., Villoslada, P., Montalban, X., Uccelli, A., Savettieri, G., Lincoln, R. R., DeLoa, C., Haines, J. L., Pericak-Vance, M. A., Compston, A., Hauser, S. L., and Oksenberg, J. R. (2006) Heterogeneity at the HLA-DRB1 locus and risk for multiple sclerosis, Hum Mol Genet 15, 2813–2824.
38. Ramagopalan, S., and Ebers, G. (2009) Multiple sclerosis: major histocompatibility complexity and antigen presentation, Genome Med 1, 105.
39. Suárez-Fariñas, M., Haider, A., and Wittkowski, K. M. (2005) "Harshlighting" small blemishes on microarrays, BMC Bioinformatics 6, 65.
40. Suarez-Farinas, M., Pellegrino, M., Wittkowski, K. M., and Magnasco, M. O. (2005) Harshlight: a "corrective make-up" program for microarray chips, BMC Bioinformatics 6, 294.
41. Arteaga-Salas, J. M., Harrison, A. P., and Upton, G. J. G. (2008) Reducing spatial flaws in oligonucleotide arrays by using neighborhood information, Stat Appl Genet Mol Biol 7, 19.
42. Arteaga-Salas, J. M., Zuzan, H., Langdon, W. B., Upton, G. J. G., and Harrison, A. P. (2008) An overview of image-processing methods for Affymetrix GeneChips, Brief Bioinform 9, 25–33.
43. Cairns, J. M., Dunning, M. J., Ritchie, M. E., Russell, R., and Lynch, A. G. (2008) BASH: a tool for managing BeadArray spatial artefacts, Bioinformatics 24, 2921–2922.
44. Deuchler, G. (1914) Über die Methoden der Korrelationsrechnung in der Pädagogik und Psychologie, Z pädagog Psychol 15, 114–131, 145–159, 229–242.
45. Morales, J. F., Song, T., Auerbach, A. D., and Wittkowski, K. M. (2008) Phenotyping genetic diseases using an extension of μ-scores for multivariate data, Stat Appl Genet Mol Biol 7, 19.
46. Kehoe, J. F., and Cliff, N. (1975) Interord: a computer-interactive Fortran IV program for developing simple orders, Educ Psychol Meas 35, 675–678.
47. Kruskal, W. H. (1957) Historical notes on the Wilcoxon unpaired two-sample test, J Am Stat Assoc 52, 356–360.
48. Friedman, M. (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance, J Am Stat Assoc 32, 675–701.
49. Iain, M., and Urken, A. B. (1995) On elections by ballot, in Classics of Social Choice (Iain, M., and Urken, A. B., Eds.), pp. 83–89, University of Michigan Press, Ann Arbor, MI.
50. Hägerle, G., and Puckelsheim, F. (2001) Llull's writings on electoral systems, Stud Lulliana 41, 3–38.
51. Benard, A., and Van Elteren, P. H. (1953) A generalization of the method of m rankings, Indagationes Math 15, 358–369.
52. van Elteren, P., and Noether, G. E. (1959) The asymptotic efficiency of the χr²-test for a balanced incomplete block design, Biometrika 46, 475–477.
53. Durbin, J. (1951) Incomplete blocks in ranking experiments, Br J Psychol 4, 85–90.
54. Bradley, R. A., and Milton, E. T. (1952) Rank analysis of incomplete block designs: I. The method of paired comparisons, Biometrika 39, 324–345.
55. Prentice, M. J. (1979) On the problem of m incomplete rankings, Biometrika 66, 167–170.
56. Alvo, M., and Cabilio, P. (2005) General scores statistics on ranks in the analysis of unbalanced designs, Can J Stat 33, 115–129.
57. Gao, X., and Alvo, M. (2005) A unified nonparametric approach for unbalanced factorial designs, J Am Stat Assoc 100, 926–941.
58. Lam, F. C., and Longnecker, M. T. (1983) A modified Wilcoxon rank sum test for paired data, Biometrika 70, 510–513.
59. Cronbach, L. J., and Meehl, P. E. (1955) Construct validity in psychological tests, Psychol Bull 52, 281–302.
60. Popper, K. R. (1937) Logik der Forschung, Julius Springer, Wien.
61. Delbecq, A. (1975) Group Techniques for Program Planning, Scott Foresman, Glenview, IL.
62. Wittkowski, K. M., Song, T., Anderson, K., and Daniels, J. E. (2008) U-scores for multivariate data in sports, J Quant Anal Sports 4, 7.
63. Freimer, N., and Sabatti, C. (2003) The human phenome project, Nat Genet 34, 15–21.
64. Wittkowski, K. M. (1980) Ein nichtparametrischer Test im Stufenblockplan [A nonparametric test for the step-down design], Institut für Medizinische Statistik, Georg-August-Universität, Göttingen.
65. Wittkowski, K. M. (1984) Semiquantitative Merkmale in der nichtparametrischen Statistik, in Der Beitrag der Informationsverarbeitung zum Fortschritt der Medizin (Köhler, C. O., Wagner, E., and Tautu, P., Eds.), pp. 100–105, Springer, Berlin.
66. Wittkowski, K. M. (1988) Small sample properties of rank tests for incomplete unbalanced designs, Biom J 30, 799–808.
67. Wittkowski, K. M. (1992) An extension to Wittkowski, J Am Stat Assoc 87, 258.
68. Einsele, H., Ehninger, G., Hebart, H., Wittkowski, K. M., Schuler, U., Jahn, G., Mackes, P., Herter, M., Klingebiel, T., Löffler, J., et al. (1995) Polymerase chain reaction monitoring reduces the incidence of cytomegalovirus disease and the duration and side effects of antiviral therapy after bone marrow transplantation, Blood 86, 2815–2820.
69. Talaat, M., Wittkowski, K. M., Husein, M. H., and Barakat, R. (1998) A new procedure to assess individual risk of exposure to cercariae from multivariate questionnaire data, in Reproductive Health and Infectious Diseases in the Middle East (Barlow, R., and Brown, J. W., Eds.), pp. 167–174, Ashgate, Aldershot, UK.
70. Susser, E., Desvarieux, M., and Wittkowski, K. M. (1998) Reporting sexual risk behavior for HIV: a practical risk index and a method for improving risk indices, Am J Public Health 88, 671–674.
71. Wittkowski, K. M., Susser, E., and Dietz, K. (1998) The protective effect of condoms and nonoxynol-9 against HIV infection, Am J Public Health 88, 590–596, 972.
72. Banchereau, J., Palucka, A. K., Dhodapkar, M., Kurkeholder, S., Taquet, N., Rolland, A., Taquet, S., Coquery, S., Wittkowski, K. M., Bhardwj, N., Pineiro, L., Steinman, R., and Fay, J. (2001) Immune and clinical responses after vaccination of patients with metastatic melanoma with CD34+ hematopoietic progenitor-derived dendritic cells, Cancer Res 61, 6451–6458.
73. Hoeffding, W. (1948) A class of statistics with asymptotically normal distribution, Ann Math Stat 19, 293–325.
74. Wittkowski, K. M. (2003) Novel methods for multivariate ordinal data applied to genetic diplotypes, genomic pathways, risk profiles, and pattern similarity, Comput Sci Stat 35, 626–646.
75. Wittkowski, K. M., and Liu, X. (2004) Beyond the TDT: rejoinder to Ewens and Spielman, Hum Hered 58, 60–61.
76. Wittkowski, K. M., Lee, E., Nussbaum, R., Chamian, F. N., and Krueger, J. G. (2004) Combining several ordinal measures in clinical studies, Stat Med 23, 1579–1592.
77. Gehan, E. A. (1965) A generalised two-sample Wilcoxon test for doubly censored samples, Biometrika 52, 650–653.
78. Gehan, E. A. (1965) A generalised Wilcoxon test for comparing arbitrarily singly censored samples, Biometrika 52, 203–223.
79. Schemper, M. (1983) A nonparametric k-sample test for data defined by intervals, Stat Neerl 37, 69–71.
80. Lehmann, E. L. (1951) Consistency and unbiasedness of certain nonparametric tests, Ann Math Stat 22, 165–179.
81. Hoeffding, W. (1994) The Collected Works of Wassily Hoeffding, Springer, New York.
82. Rosenbaum, P. G. (1994) Coherence in observational studies, Biometrics 50, 368–374.
83. Song, T., Coffran, C., and Wittkowski, K. M. (2007) Screening for gene expression profiles and epistasis between diplotypes with S-Plus on a grid, Stat Comput Graph 18, 20–25.
84. Cherchye, L., and Vermeulen, F. (2006) Robust rankings of multidimensional performances: an application to Tour de France racing cyclists, J Sports Econ 7, 359–373.
85. Quaia, E., D'Onofrio, M., Cabassa, P., Vecchiato, F., Caffarri, S., Pittiani, F., Wittkowski, K. M., and Cova, M. A. (2007) Diagnostic value of hepatocellular nodule vascularity after microbubble injection for characterizing malignancy in patients with cirrhosis, Am J Roentgenol 189, 1474–1483.
86. Ramamoorthi, R. V., Rossano, M. G., Paneth, N., Gardiner, J. C., Diamond, M. P., Puscheck, E., Daly, D. C., Potter, R. C., and Wirth, J. J. (2008) An application of multivariate ranks to assess effects from combining factors: metal exposures and semen analysis outcomes, Stat Med 27, 3503–3514.
87. Shockley, W., Bardeen, J., and Brattain, W. H. (1948) The electronic theory of the transistor, Science 108, 678–679.
88. Haberle, L., Pfahlberg, A., and Gefeller, O. (2009) Assessment of multiple ordinal endpoints, Biom J 51, 217–226.
89. O'Brien, P. C. (1984) Procedures for comparing samples with multiple endpoints, Biometrics 40, 1079–1087.
90. Diana, M., Song, T., and Wittkowski, K. M. (2009) Studying travel-related individual assessments and desires by combining hierarchically structured ordinal variables, Transp 36, 187–206.
91. Kendall, M. G. (1938) A new measure of rank correlation, Biometrika 30, 81–93.
92. Jonckheere, A. R. (1954) A distribution-free k-sample test against ordered alternatives, Biometrika 41, 133–145.
93. Terpstra, T. J. (1952) The asymptotic normality and consistency of Kendall's test against trend when ties are present in one ranking, Indagationes Math 14, 327–333.
94. Spangler, R., Wittkowski, K. M., Goddard, N. L., Avena, N. M., Hoebel, B. G., and Leibowitz, S. F. (2004) Opiate-like effects of sugar on gene expression in reward areas of the rat brain, Mol Brain Res 124, 134–142.
95. Morales, J. F., Song, T., Wittkowski, K. M., and Auerbach, A. D. (submitted) A statistical systems biology approach to FANCC gene expression suggests drug targets for Fanconi anemia.
96. Armitage, P. (1955) Tests for linear trends in proportions and frequencies, Biometrics 11, 375–386.
97. Janka, G. E., and Schneider, E. M. (2004) Modern management of children with haemophagocytic lymphohistiocytosis, Br J Haematol 124, 4–14.
98. Seybold, M. P., Wittkowski, K. M., and Schneider, E. M. (2008) Biomarker analysis using a non-parametric selection procedure to discriminate the phagocytic syndromes HLH (hemophagocytic lymphohistiocytosis) and MAS (macrophage activation syndrome), Shock 29, 90.
99. Kraft, P., and Hunter, D. J. (2009) Genetic risk prediction – are we there yet?, N Engl J Med 360, 1701–1703.
100. Wittkowski, K. M. (1990) Statistical knowledge-based systems – critical remarks and requirements for approval, Comput Methods Programs Biomed 33, 255–259.
101. Akritas, M. G., Arnold, S. F., and Brunner, E. (1997) Nonparametric hypotheses and rank statistics for unbalanced factorial designs. Part I, J Am Stat Assoc 92, 258–265.
102. Brunner, E., Munzel, U., and Puri, M. L. (1999) Rank-score tests in factorial designs with repeated measures, J Multivar Anal 70, 286–317.
Chapter 3

Basics of Bayesian Methods

Sujit K. Ghosh

Abstract

Bayesian methods are rapidly becoming popular tools for making statistical inference in various fields of science including biology, engineering, finance, and genetics. One of the key aspects of the Bayesian inferential method is its logical foundation, which provides a coherent framework to utilize not only empirical but also scientific information available to a researcher. Prior knowledge arising from scientific background, expert judgment, or previously collected data is used to build a prior distribution, which is then combined with current data via the likelihood function to characterize the current state of knowledge using the so-called posterior distribution. Bayesian methods allow the use of models of complex physical phenomena that were previously too difficult to estimate (e.g., using asymptotic approximations). Bayesian methods offer a means of more fully understanding issues that are central to many practical problems by allowing researchers to build integrated models based on hierarchical conditional distributions that can be estimated even with limited amounts of data. Furthermore, advances in numerical integration methods, particularly those based on Monte Carlo methods, have made it possible to compute optimal Bayes estimators. However, there is a reasonably wide gap between the background of empirically trained scientists and the full weight of Bayesian statistical inference. Hence, one of the goals of this chapter is to bridge that gap by offering elementary to advanced concepts that emphasize linkages between standard approaches and full probability modeling via Bayesian methods.

Key words: Bayesian inference, hierarchical models, likelihood function, Monte Carlo methods, posterior distribution, prior distribution.
H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_3, © Springer Science+Business Media, LLC 2010

1. Introduction

The Bayesian inferential method of analyzing data provides a logical framework to utilize all available sources of information in making a decision. In practice, the source of information is often the current data collected from the field of application, but in many cases one might have prior knowledge arising from
experience, expert opinion, or from data collected in previous studies that were based on a protocol similar to the one used to collect the current data. In such cases it would be a waste if such externally available scientific information were not utilized properly within a statistical framework. Despite their ability to systematically use the available background scientific knowledge, Bayesian statistical methods are often criticized as being subjective. However, it can be argued that almost any scientific knowledge and theory is at best subjective in nature. In an interesting article (1), the authors cite many scientific theories (mainly from physics) where subjectivity played a major role; they concluded "Subjectivity occurs, and should occur, in the work of scientists; it is not just a factor that plays a minor role that we need to ignore as a flaw. . ." and further added that "Total objectivity in science is a myth. Good science inevitably involves a mixture of subjective and objective parts." The Bayesian inferential framework provides a logical foundation to accommodate the objective (modeling the observed data) and subjective (using a prior distribution for the parameters) parts involved in data analysis. In fact, the so-called classical (frequentist) statistical methods are not as objective as often claimed in practice. For example, many frequentist test procedures (e.g., US Food and Drug Administration (FDA) guidelines) routinely advocate rejection of the null hypothesis when the p-value of the test falls below 0.05. To the best of our knowledge there is no formal basis that justifies the choice of the cut-off value of 0.05; it is unclear in what sense the value 0.05 is an "optimal" or "objective" choice. In fact, blind use of 0.05 might inflate the type II error rate, resulting in a test with much lower power.
Further, sample size calculations based on power and size analyses are often based on subjective choices of the parameter values in the null and alternative hypotheses. Even the so-called nonparametric methods are not completely objective. For instance, the assumption that the variance exists within a linear (or more general regression) model framework may be violated if the data arise from long-tailed distributions (e.g., the Cauchy family), and the assumption of independence among the observations may not hold when the data are collected from several nearby regions and across different time points. These types of structural assumptions are often taken for granted in analyzing biomedical data for making statistical conclusions. Almost any scientific method is built around a set of assumptions (e.g., regularity conditions for asymptotic methods) which are often necessary to build a scientific theory, and the collection of such assumptions is certainly subjective. When such assumptions are violated, or we suspect them to be violated in specific applications, it is of course possible to account for the structural background information often available in a data set and develop subjectively chosen models (e.g., by using
autoregressive models that account for dependence among observations). Hence, a good scientific practice would be to state upfront all of the model assumptions used for data analysis and then make an effort to validate those assumptions using future test cases or by withholding a portion of the current data. There is nothing wrong with having a subjective but reasonably flexible model, as long as we can exhibit some form of sensitivity analysis when the assumptions of the model are mildly violated.

In recent years, there has been tremendous interest in applying Bayesian methodologies to various fields of science including biology, genetics, engineering, and finance. Several articles and books on Bayesian methodologies have appeared covering theory, computation, and applications. Introductory books on Bayesian methods include (2–6). Readers with an advanced knowledge of calculus may find the following books on Bayesian methods useful: (7–14), among many others. In 2006, the Center for Devices and Radiological Health (CDRH) within the FDA issued a guidance for the use of Bayesian statistics in medical device clinical trials (see http://www.fda.gov/cdrh/osb/guidance/1601.pdf).

In this chapter, a brief introduction to Bayesian inferential methods is presented, and the methods are illustrated with simple real-data examples. First, we develop some notation to be used throughout this chapter. Consider an experiment where the goal is to infer about plausible value(s) of the parameter vector θ (or a function g(θ) of it) based on observing a response vector X = (x1, x2, . . ., xn). In many scientific experiments, the response vector X is usually the "effect" that we get to observe, which might have been triggered by the (hidden) "cause" θ. One of the basic goals of such scientific experiments is to make inference about such hidden cause(s) based on observing the effect(s).
However, any physical experiment is prone to measurement errors and other uncertainties that cannot always be controlled in an experiment and hence we assume that the observations x1 , . . . , xn have been generated by a probability density f (x|θ), the conditional density of the observation x defined on the sample space X for a given value of the parameter θ. Henceforth we will call this conditional density f (x|θ) as the sampling density of x given θ. If we use the above “cause and effect” analogy it turns out that the goal of statistical (scientific) inference is to obtain the so-called inverse distribution, i.e., the conditional distribution of θ given the observed response vector X. It is well established in probability theory that such an inversion of probability distribution is not possible without constructing a probability distribution for θ itself. Hence we assume that we have a probability density π(θ) of θ and we will call this marginal density, π(θ) of θ to be the prior density of θ defined on the parameter space .
Ghosh
The actual determination of the prior density is as critical as the determination of the sampling density f(x|θ). The prior density can be determined from the subject matter of interest or, in the event of very little information about the parameter (e.g., models involving many nuisance parameters), the prior density can be specified as uniform over the domain of the parameter space. In what follows, we consider only the parametric class of sampling densities {f(x|θ): θ ∈ Θ}, which assumes that the specific form of f(·|θ) is completely known once the value of θ is determined. Further, we assume that the parameter space is finite-dimensional, i.e., Θ ⊆ R^m for some integer m ≥ 1. It is, however, possible to extend the inferential procedures to the nonparametric family of densities, which allows the parameter space to be infinite-dimensional, but such methods are beyond the scope of this chapter.
1.1. Bayes’ Rule
Once we have determined the sampling density f(x|θ) of x ∈ X for a given value of θ ∈ Θ and the prior density π(θ) of θ, it follows from probability theory ((15), (16), and (17)) that the conditional density of θ given x is given by

p(θ|x) = f(x|θ)π(θ)/m(x) = f(x|θ)π(θ) / ∫ f(x|θ)π(θ)dθ.   [1]
The conditional density p(θ|x) will be called the posterior density of θ given x, and m(x) = ∫ f(x|θ)π(θ)dθ is known as the marginal density of x. The formula [1] is defined only for those x which satisfy m(x) > 0. Notice that if we observe a vector of responses X = (x1, x2, . . . , xn) and f(X|θ) denotes the joint density of X given θ, then we would simply replace f(x|θ) by f(X|θ) in [1]. In many applications we assume that, given θ, the observations are independently and identically distributed (iid), each having a sampling density f(x|θ), i.e., xi|θ ∼ f(x|θ) independently for i = 1, . . . , n. In such a case, the joint density is given by f(X|θ) = ∏_{i=1}^n f(xi|θ). On the other hand, if we assume that the xi are independently distributed, each having a density fi(x|θ), then we have f(X|θ) = ∏_{i=1}^n fi(xi|θ). More generally, if fi(x|x1, . . . , xi−1, θ) denotes the conditional density of xi given x1, . . . , xi−1 and θ for i = 2, 3, . . . , n, then f(X|θ) = f1(x1|θ) ∏_{i=2}^n fi(xi|x1, . . . , xi−1, θ). In each of the previous cases, we can simply replace f(x|θ) by f(X|θ) in [1] to obtain the posterior density of θ given X and a prior density π(θ). Thus a Bayesian model for an observed data set X consists of two quantities, (a) the sampling density f(X|θ) and (b) the prior density π(θ), and is summarized in the following table:
prior density of θ: π(θ)
sampling density of x given θ: f(x|θ)
marginal density of x: m(x) = ∫ f(x|θ)π(θ)dθ
posterior density of θ given x: p(θ|x) = f(x|θ)π(θ)/m(x)
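The table above translates directly into computation: for a scalar θ, the posterior in [1] can be approximated by discretizing the parameter space, multiplying likelihood by prior at each knot, and normalizing. A minimal Python sketch (the chapter's own examples use R; the data and function names here are illustrative):

```python
import math

def posterior_on_grid(loglik, prior, grid):
    """Approximate p(theta|x) on a grid of theta values via Bayes' rule [1]."""
    # unnormalized posterior kernel: f(x|theta) * pi(theta) at each knot
    kern = [math.exp(loglik(t)) * prior(t) for t in grid]
    m = sum(kern)  # proportional to the marginal density m(x)
    return [k / m for k in kern]

# Hypothetical data: 7 successes out of 10 Bernoulli trials, uniform prior on theta
n, s = 10, 7
grid = [(i + 0.5) / 200 for i in range(200)]  # midpoint knots in (0, 1)
post = posterior_on_grid(lambda t: s * math.log(t) + (n - s) * math.log(1 - t),
                         lambda t: 1.0, grid)
post_mean = sum(t * p for t, p in zip(grid, post))  # close to (s + 1) / (n + 2)
```

The grid approximation converges to the exact posterior (here a Beta(8, 4), with mean 8/12) as the grid is refined.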
In defining the so-called Bayes rule (or formula) in [1], we have used the integral notation m(x) = ∫ f(x|θ)π(θ)dθ to define the marginal density, but it should be understood that if the parameter θ were discrete-valued, the integral in the denominator would be replaced by a summation. More generally, one may define the marginal density, and hence the posterior density, with respect to a σ-finite measure (e.g., Lebesgue measure or counting measure), but we will avoid such technical formalism throughout this introductory chapter. Also, we have suppressed the domain of integration while defining m(x) by assuming that π(θ) = 0 for all θ ∉ Θ.

Remark: It is to be noted that we are not following the customary textbook notation of using uppercase letters to represent random variables and corresponding lowercase letters to represent their observed values. The reason is that in Bayesian analysis the parameter θ is also random, and the usual textbook notation might create confusion when we use Θ to represent the parameter space. Finally, we are also not using the traditional notation of bold letters for vectors; when required, we will explicitly indicate when a quantity is a vector or matrix.

Suppose π(θ) = C · k(θ), where C = (∫ k(θ)dθ)^{−1} > 0 is the normalizing constant and k(·) is a nonnegative function defined on the parameter space Θ. Then k(θ) is called the prior kernel function of the prior density. For example, θ^{5.2}(1 − θ)^{2.5} I_(0,1)(θ) is the kernel of a beta distribution with shape parameters 6.2 and 3.5, and θ^5 e^{−θ} I_(0,∞)(θ) is the kernel of a gamma distribution with parameters 6 and 1, where I_A(θ) denotes the indicator function of the set A and satisfies I_A(θ) = 1 if θ ∈ A and I_A(θ) = 0 otherwise. Also let f(X|θ) = C(X)L(θ; X), where C(X) > 0 is a function of the data X only. Then L(θ; X) is called the likelihood function of θ. For example, if the xi|θ are iid N(θ1, θ2), where θ = (θ1, θ2) ∈ Θ = R × (0, ∞), i.e., the xi's are normally distributed with mean θ1 and variance θ2, then f(x|θ) = (2πθ2)^{−1/2} exp{−(x − θ1)²/2θ2} and hence L(θ; X) = θ2^{−n/2} exp{−∑_{i=1}^n (xi − θ1)²/2θ2}.

From Bayes rule [1] it is clear that the posterior density is determined completely by the posterior kernel function given by

K(θ; X) = L(θ; X)k(θ),   [2]
where L(θ; X) is the likelihood function of θ and k(θ) is a prior kernel function of θ. In fact, it follows that we can express the posterior density in the following alternative form,

p(θ|X) = K(θ; X) / ∫ K(θ; X)dθ = L(θ; X)k(θ) / ∫ L(θ; X)k(θ)dθ.   [3]
As the denominator of [3] does not depend on θ (though it does depend on X), the above equation is often expressed as the Bayesian mantra: the posterior is proportional to the likelihood times the prior kernel, i.e., p(θ|X) ∝ L(θ; X)k(θ). The above formalism also provides an extension of the Bayes rule which allows one to use a prior kernel function k(θ) that is not necessarily finitely integrable. In other words, we can even use a prior kernel function k(θ) ≥ 0 for which ∫ k(θ)dθ = ∞. If one uses such a prior kernel function, then the prior is said to be an improper prior. Notice that an improper prior is determined only up to a constant, because ∫ k(θ)dθ = ∞ implies C ∫ k(θ)dθ = ∞ for any C > 0. Also, it no longer makes sense to call it an improper prior (probability) distribution, as π(θ) is no longer a (probability) density function.

Remark: In general, researchers should be very careful when using an improper prior, as the posterior is no longer guaranteed to be proper for all values of X. Thus, when an improper prior π(θ) = C · k(θ) is used, one should analytically verify that the posterior kernel is finitely integrable, i.e., ∫ K(θ; X)dθ < ∞ for almost all X. Notice that, by an application of Fubini's theorem (on interchanging the order of integration), it follows that an improper prior necessarily leads to an improper marginal, i.e., ∫ k(θ)dθ = ∞ implies (and is also implied by) ∫ m(X)dX = ∞. The following result is often useful to check whether one can use an improper prior:

Lemma: If the likelihood function is bounded below, i.e., inf_θ L(θ; X) ≥ L0(X) for some L0(X) > 0, then any improper prior leads to an improper posterior.

Example 1: Does vitamin C cure the common cold?
Let us begin with a very simple experiment in which a randomly chosen group of patients suffering from the common cold took (the same amount of) vitamin C for 1 week and reported whether or not the cold was cured by the end of that week.
Suppose that x = 1 indicates that the common cold was cured and x = 0 that the cold was not cured within a week. In this experiment, the parameter of interest is θ = Pr[common cold is cured within a week]. Clearly, in this case the only probability distribution that can be used to model the outcome of a given patient is the so-called Bernoulli distribution, and hence f(x|θ) = θ^x(1 − θ)^{1−x}, where x = 0, 1 and θ ∈ [0, 1]. If we obtain n iid observations xi ∈ {0, 1}, then the joint density of the response vector X = (x1, x2, . . . , xn) is given by f(X|θ) = ∏_{i=1}^n f(xi|θ) = θ^s(1 − θ)^{n−s}, where s = ∑_{i=1}^n xi is the total number of patients (out of n) who were cured within a week of taking vitamin C. It is well known that s has a binomial distribution, henceforth denoted by Bin(n, θ), whose density function is given by f(s|θ) = (n choose s) θ^s(1 − θ)^{n−s}, where s = 0, 1, . . . , n. For any prior density π(θ), the posterior density of θ in this case is given by

p(θ|X) = θ^s(1 − θ)^{n−s} π(θ) / ∫_0^1 θ^s(1 − θ)^{n−s} π(θ)dθ.   [4]
Remark: Notice that in this case (for any prior density) the posterior distribution p(θ|X) depends on the data X only through the sufficient statistic s = ∑_i xi. In other words, p(θ|X) = p(θ|s), i.e., the conditional density of θ given the entire data vector X is the same as the conditional density of θ given only the sum s (and the sample size n). This interesting phenomenon of the posterior distribution is not just a coincidence; it is true in general. More specifically, from [3] and the Fisher–Neyman factorization theorem it follows that, if f(X|θ) denotes the joint density of X given θ and if S = S(X) is a sufficient statistic (vector), then p(θ|X) = p(θ|S) for any prior density π(θ).

Now, returning to the vitamin C example, once the sampling density is determined, we need to specify a prior density to complete the Bayesian model specification. A careful look at the expression [4] suggests that if we choose π(θ) = Cθ^{a−1}(1 − θ)^{b−1} for some constant C = C(a, b), where a > 0 and b > 0 are known quantities, then the posterior density is also of the form C*θ^{a*−1}(1 − θ)^{b*−1} for some constant C* = C(a*, b*), where a* = a + s and b* = b + n − s. By routine integral calculus it can be shown that C(a, b) = Γ(a + b)/(Γ(a)Γ(b)), where Γ(a) = ∫_0^∞ t^{a−1}e^{−t}dt is the gamma function. Thus, the prior θ ∼ Beta(a, b) leads to the posterior θ|X ∼ Beta(a*, b*), where Beta(a, b) denotes a beta distribution with mean a/(a + b).

Remark: When a family of prior densities leads to posterior densities which belong to the same family as the prior densities, such a family of prior densities is called a conjugate family. For example, for the binomial sampling density, i.e., when s|θ ∼ Bin(n, θ), the family of beta densities {Beta(a, b): a > 0, b > 0} forms a conjugate family of prior densities. Notice that the choice of a conjugate family is not unique. For instance, for the binomial sampling density, the family of densities {C(a, b)θ^{a−1}(1 − θ)^{b−1}π0(θ): a > 0, b > 0} also forms a conjugate family, where π0(·) can be chosen to be any nonnegative continuous function defined on [0, 1] and the normalizing constant is C(a, b) = (∫_0^1 θ^{a−1}(1 − θ)^{b−1}π0(θ)dθ)^{−1}. In general, when the sampling density belongs to the so-called exponential family of densities, which admits a sufficient statistic of constant dimension (see (18) and (19)), we can always find a conjugate family of prior densities (20). More generally, one can construct a natural conjugate family for iid observations obtained from a sampling density f(x|θ), for x ∈ X and θ ∈ Θ, by simply defining the family of prior densities as {∏_{j=1}^m f(xj|θ): xj ∈ X, m = m0, m0 + 1, . . .}, provided there exists an integer m0 ≥ 1 such that ∫ ∏_{j=1}^{m0} f(xj|θ)dθ < ∞. There has been a great deal of work on developing conjugate prior families and their extensions (see (21–23)).

When the parameters of the prior density are elicited using previously collected data or expert knowledge, such priors are often called subjective or informative priors. However, when no such prior information is available, or very little is known about the parameter θ, one may use what are known as noninformative priors. For instance, (24) developed noninformative priors that are invariant under monotone transformations, Hartigan (1998) developed the maximum likelihood prior (25), and Bernardo (1979) developed the reference prior (26), among others.
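The conjugate beta–binomial update a* = a + s, b* = b + n − s is simple enough to verify directly. A small Python sketch (the counts below are made up for illustration; the chapter's own code is in R):

```python
def beta_binomial_update(a, b, s, n):
    """Conjugate update: Beta(a, b) prior + s successes in n Bernoulli trials
    gives the Beta(a + s, b + n - s) posterior."""
    return a + s, b + n - s

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Hypothetical data: prior Beta(2, 2); 15 of 20 patients cured.
a_post, b_post = beta_binomial_update(2, 2, s=15, n=20)
# The posterior mean (2 + 15)/(2 + 2 + 20) = 17/24 sits between the
# prior mean 0.5 and the sample proportion 0.75.
```

The posterior mean is a weighted compromise between the prior mean and the data, with the prior's influence shrinking as n grows.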
However, it is not always necessary to use a conjugate family of prior densities, since we can use advanced numerical methods to compute posterior summary estimates. For example, for the binomial sampling density, Zellner (27) proposed the prior π(θ) = Cθ^θ(1 − θ)^{1−θ} for θ ∈ (0, 1), where the constant C = 1.61857 (obtained by numerical integration). In the literature, this prior is known as Zellner's prior for the binomial sampling density. Also, the Beta(a = 0.5, b = 0.5) prior is known as Jeffreys' prior (24), and the improper prior π(θ) = [θ(1 − θ)]^{−1} is known as Haldane's prior for the binomial sampling density (though this prior also appeared in (28)). Bayesian inference for θ using the binomial sampling density was originally solved by Bayes (15), who used the so-called uniform prior, which is the same as the Beta(a = 1, b = 1) prior. It can be shown that posterior inference about θ, the odds ρ = θ/(1 − θ), or the log-odds η = log ρ is relatively insensitive to the choice among these priors (i.e., the Zellner, Jeffreys, Haldane/Lhoste, and uniform priors).
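This insensitivity can be checked numerically. The Python sketch below (the chapter's computations use R, and the data s = 14 cures out of n = 20 are hypothetical) computes the posterior mean of θ under the four prior kernels by midpoint integration of the posterior kernel:

```python
import math

def post_mean_binomial(log_prior_kernel, s, n, m=4000):
    """Posterior mean of theta for Binomial(n, theta) data under a given prior
    kernel, via midpoint-rule integration of the posterior kernel [2] on (0, 1)."""
    num = den = 0.0
    for j in range(m):
        t = (j + 0.5) / m
        k = math.exp(s * math.log(t) + (n - s) * math.log(1 - t)
                     + log_prior_kernel(t))
        num += t * k
        den += k
    return num / den

s, n = 14, 20  # hypothetical: 14 of 20 patients cured
priors = {
    "uniform":  lambda t: 0.0,                                     # Beta(1, 1)
    "Jeffreys": lambda t: -0.5 * (math.log(t) + math.log(1 - t)),  # Beta(.5, .5)
    "Haldane":  lambda t: -(math.log(t) + math.log(1 - t)),        # improper
    "Zellner":  lambda t: t * math.log(t) + (1 - t) * math.log(1 - t),
}
means = {name: post_mean_binomial(lp, s, n) for name, lp in priors.items()}
# All four posterior means land near the sample proportion s/n = 0.7.
```

Under the Haldane prior the posterior is exactly Beta(14, 6), whose mean equals the sample proportion; the other priors pull the estimate only slightly toward their own centers.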
Results using numerical integration are presented in later sections, and R code is given in Section 6.1.

1.2. Hierarchical and Empirical Bayes Approaches
In practice, it is often difficult to completely specify the prior density based on available expert opinion or background scientific information. In such cases, one may consider a class of prior densities {π1(θ|λ): λ ∈ Λ}, where λ is referred to as the hyperparameter and π1(·|·) denotes a conditional density of θ given λ. For example, if we consider again the class of (conjugate) beta densities {Beta(a, b): (a, b) ∈ (0, ∞) × (0, ∞)} for the vitamin C example, then (a, b) would be the hyperparameter for the family of beta densities. A natural question is what would be a reasonable choice for (a, b), or more generally for λ? In practice, two methods are usually used to determine the hyperparameter λ.

The first alternative is to assign another prior distribution for λ, i.e., λ ∼ π2(λ), where π2(·) is a probability density defined over Λ. This results in a more flexible prior for θ: π(θ) = ∫ π1(θ|λ)π2(λ)dλ. Thus, the complete Bayes model can be specified in hierarchical stages:

X|θ, λ ∼ f(X|θ),   θ|λ ∼ π1(θ|λ)   and   λ ∼ π2(λ).

The above model will be referred to as the three-stage Bayesian hierarchical model (BHM). The prior π2(·) can often be chosen suitably to obtain estimators with good frequentist properties (see (29)). Although the BHM provides a more flexible framework for data analysis, it often becomes difficult to obtain the posterior distribution of θ given X analytically, as inference based on p(θ|X) requires integration over both θ and λ. In recent years, Monte Carlo (MC) methods (in particular, Markov chain Monte Carlo (MCMC) methods) have been routinely used to obtain samples from the posterior density p(θ|X) (see Section 5.2 for further details).

Another alternative to determine the hyperparameter λ is to use the so-called empirical Bayes (EB) method, which is based on the so-called marginal likelihood m(X|λ) = ∫ f(X|θ)π1(θ|λ)dθ. One may view m(X|λ) as the marginal likelihood function of λ and "estimate" λ by maximizing m(X|λ), i.e., λ̂ = λ̂(X) = arg max_{λ∈Λ} m(X|λ). Alternatively, one may use a moment-based method to "estimate" λ; e.g., if λ is a q-dimensional parameter, we can use a set of q suitable moments of X with respect to the marginal density m(X|λ) and equate them to the corresponding q empirical moments of the data vector X = (x1, x2, . . . , xn). In either case, once we obtain an empirical estimate λ̂, we can use π(θ) = π1(θ|λ̂) as the prior for θ. Often, such a procedure is criticized as an "un-pure" Bayesian method, as the data X are used to obtain the prior density π(θ)! However, the posterior estimators obtained by using EB procedures have been shown to possess good frequentist properties (see (30)).
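As an illustration of the marginal likelihood approach, the Python sketch below estimates the hyperparameters (a, b) of a Beta(a, b) prior from several binomial groups by maximizing the product of beta-binomial marginal likelihoods. The clinic counts are invented, and the grid search is a crude stand-in for a proper optimizer; the chapter's code is in R:

```python
import math

def log_beta(a, b):
    """log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal(a, b, counts, n):
    """log of prod_j m(s_j | a, b), where m(s|a,b) = C(n,s) B(a+s, b+n-s)/B(a,b)
    is the beta-binomial marginal likelihood (the theta_j's are iid Beta(a, b))."""
    out = 0.0
    for s in counts:
        out += (math.lgamma(n + 1) - math.lgamma(s + 1) - math.lgamma(n - s + 1)
                + log_beta(a + s, b + n - s) - log_beta(a, b))
    return out

counts, n = [8, 16, 10, 18, 12, 13], 20   # hypothetical cure counts at 6 clinics
grid = [0.5 * k for k in range(1, 61)]    # crude grid of candidate values
a_hat, b_hat = max(((a, b) for a in grid for b in grid),
                   key=lambda ab: log_marginal(ab[0], ab[1], counts, n))
prior_mean = a_hat / (a_hat + b_hat)      # close to the pooled cure rate 77/120
```

Because these counts are more variable than a single binomial would allow, the marginal likelihood favors a moderately dispersed Beta prior rather than a degenerate one.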
2. Point Estimation

In general, suppose f(X|θ) denotes the joint (conditional) density of the data vector X = (x1, x2, . . . , xn) given a parameter vector θ. Recall that if x1, x2, . . . , xn are conditionally iid, each with density f(x|θ), then f(X|θ) = ∏_{i=1}^n f(xi|θ). However, it is not necessary that the xi be iid to obtain posterior estimates of θ; but for most of the applications in this chapter we will restrict our attention to conditionally iid observations. Next, assume that the prior density of θ is given by π(θ), i.e., θ ∼ π(θ), which could have been determined by considering a hierarchical prior or by an EB method. All relevant statistical inference about θ can be obtained from the posterior kernel given by [2]. First, notice that the posterior density can be obtained as p(θ|X) = K(θ; X)/b(X), where b(X) = ∫ K(θ; X)dθ. However, in many situations we do not have to determine the normalizing constant b(X) to obtain posterior estimates (e.g., by using the so-called Monte Carlo methods, discussed later in Section 5).

Suppose that we are interested in estimating a specific function η = η(θ). For example, if θ = (θ1, . . . , θm) is an m-dimensional parameter vector, we might be interested in making inference about η = η(θ) = θ1, i.e., the first component of θ, or about η = (θ1, θ2 − θ1²), etc. Of course, we might as well be interested in making inference about the entire parameter vector θ, in which case η = η(θ) = θ. Several point estimates of η can be obtained by computing the mean, median, or mode of the posterior distribution of η given X. For instance, the estimator of η defined as the posterior mean of η is given by

η̂(X) = E[η|X] = E[η(θ)|X] = ∫ η(θ)K(θ; X)dθ / ∫ K(θ; X)dθ,   [5]
where K(θ; X) is the kernel of the posterior density defined in [2]. It can be shown that the posterior mean of η as defined in [5] is the optimal (often called the Bayes) estimator of η if we wish to minimize the squared error loss. More precisely, one can show that E[||η − η̂(X)||²] ≤ E[||η − T(X)||²] for any other estimator T(X) of η, where the expectation is taken with respect to both X and θ, i.e., with respect to the joint density of (X, θ), and ||·|| denotes the Euclidean norm. In general, it can be shown that an (optimal) Bayes estimator exists for more general (convex) loss functions (e.g., weighted squared error loss, asymmetric loss, etc.) and that the Bayes estimator can be expressed in closed form as suitable integrals of the posterior kernel. This is a major advantage of Bayesian inference, as the Bayes estimators can be obtained in a straightforward way and can be expressed in closed form (though involving high-dimensional integrals). However, until recently this advantage was actually a roadblock to the application of Bayesian methods in practice, as computing high-dimensional integrals is not an easy task in most situations. But with the advent of modern computing methods, we can now routinely use efficient numerical methods (see Section 5) to compute such (optimal) Bayes estimators.
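For a scalar parameter, the ratio of integrals in [5] can be approximated directly by quadrature over the posterior kernel. A Python sketch (assuming, for illustration, the Beta(8, 4) posterior kernel from 7 successes in 10 trials under a uniform prior; the chapter's code is in R):

```python
import math

def posterior_mean_of(eta, log_kernel, lo, hi, m=4000):
    """Bayes estimator [5]: E[eta(theta)|X] as the ratio
    of midpoint-rule sums for  ∫ eta K dtheta  and  ∫ K dtheta."""
    h = (hi - lo) / m
    num = den = 0.0
    for j in range(m):
        t = lo + (j + 0.5) * h
        k = math.exp(log_kernel(t))
        num += eta(t) * k
        den += k
    return num / den

# Posterior kernel theta^7 (1-theta)^3, i.e., Beta(8, 4)
log_kern = lambda t: 7 * math.log(t) + 3 * math.log(1 - t)
theta_hat = posterior_mean_of(lambda t: t, log_kern, 0.0, 1.0)  # near 8/12
logodds_hat = posterior_mean_of(lambda t: math.log(t / (1 - t)),
                                log_kern, 0.0, 1.0)
```

The second estimate approximates ψ(8) − ψ(4) ≈ 0.76 (ψ the digamma function), the exact posterior mean of the log-odds under a Beta(8, 4); no closed-form table entry is needed because the same ratio of integrals handles any η(θ).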
3. Hypothesis Testing

Suppose Θ denotes the entire parameter space, and let the two competing hypotheses that we want to test be expressed as H0: θ ∈ Θ0 vs. Ha: θ ∈ Θa, where Θ0 ∩ Θa = ∅ and Θ0 ∪ Θa ⊆ Θ. For example, in the vitamin C example, suppose we want to test H0: θ = 0.5 vs. Ha: θ > 0.5; then we have Θ = [0, 1], Θ0 = {0.5}, and Θa = (0.5, 1]. To compare the two hypotheses we can simply compare the posterior probabilities of the null set Θ0 and the alternative set Θa and then decide in favor of the hypothesis with the larger probability. In other words, we would reject H0 if and only if Pr[θ ∈ Θ0|X] < Pr[θ ∈ Θa|X]. Notice that the posterior probabilities can be computed from the posterior kernel as follows:

Pr[Hj|X] = Pr[θ ∈ Θj|X] = ∫_{Θj} K(θ; X)dθ / ∫_Θ K(θ; X)dθ   for j = 0, a,   [6]

where K(θ; X) is as defined in [2]. Notice that, as the denominators of the posterior probabilities in [6] are the same, we can simply compare the numerators to make a decision. Thus, while comparing H0 vs. Ha we may use the following simple rule:

Reject H0 if and only if ∫_{Θ0} K(θ; X)dθ < ∫_{Θa} K(θ; X)dθ,   [7]
where K(θ; X) is as defined in [2]. Although the above formulation for testing or comparing two competing hypotheses works in general, the researcher needs to be careful in constructing the prior distribution, especially when one of the hypotheses (e.g., the null hypothesis) is a singleton set (e.g., Θ0 = {0.5}) or is contained in a lower-dimensional plane (e.g., Θ0 = {(θ1, θ2) ∈ [0, 1]²: θ1 = θ2}), so that the prior distribution allows a positive probability for the null set Θ0. In other words, we should use a prior distribution which guarantees that Pr[θ ∈ Θj] > 0 for j = 0, a. Further, unless there is substantial prior information about the two hypotheses, it is a good idea to use a prior distribution which assures that Pr[Θ0] = Pr[Θa], i.e., a priori we assume that the two hypotheses are equi-probable before observing any data.

In practice, there is another popular quantity called the Bayes factor (BF) which is also used to choose between two hypotheses. The BF is defined as the ratio of posterior odds to prior odds. More specifically, the BF of Ha to H0 is defined as

BF(X) = BF(a|0) = (Pr[θ ∈ Θa|X] / Pr[θ ∈ Θ0|X]) / (Pr[θ ∈ Θa] / Pr[θ ∈ Θ0]) = (∫_{Θa} K(θ; X)dθ / ∫_{Θ0} K(θ; X)dθ) · (∫_{Θ0} k(θ)dθ / ∫_{Θa} k(θ)dθ),   [8]
where K(θ; X) and k(θ) denote the posterior and prior kernel functions, respectively. Thus, we can use the following rule to choose between the two hypotheses:

Reject H0 if and only if log(BF(X)) > 0,   [9]

where BF(X) is defined in [8]. Notice that rule [7] is equivalent to rule [9] if we choose to use an equi-probable prior that satisfies Pr[θ ∈ Θ0] = Pr[θ ∈ Θa], or equivalently ∫_{Θ0} k(θ)dθ = ∫_{Θa} k(θ)dθ.

The traditional frequentist methods of testing hypotheses are constructed to maintain a specific (but arbitrary) level of significance α (e.g., α = 0.05). More precisely, if T(X) is a test statistic and the rule is to reject H0 if T(X) > T0, then the cut-off value T0 is chosen so that the type I error rate satisfies sup_{θ∈Θ0} Pr[T(X) > T0|θ] ≤ α, where Pr[T(X) > T0|θ] can be computed using the conditional joint density f(X|θ). Although Bayesian tests are not generally constructed to maintain a specific type I error rate, we can use either the posterior odds or the BF as a test statistic to construct similar traditional tests. For example, if we define T(X) = log(BF(X)) as the test statistic, we can find T0 to satisfy Pr[T(X) > T0|θ] ≤ α for all θ ∈ Θ0. In this case rule [9] would need to be modified accordingly, i.e., to reject H0 if and only if log(BF(X)) > T0, instead of using T0 = 0. However, the researcher should be aware that such a choice of T0 might unnecessarily inflate the type II error rate Pr[T(X) ≤ T0|θ] for θ ∈ Θa. Alternatively, one may simply
report the posterior probabilities Pr[θ ∈ Θj|X] for j = 0, a and let the researcher decide on a cut-off value, say p0, so that H0 is rejected if and only if Pr[θ ∈ Θa|X] > p0. For instance, instead of using the default value p0 = 0.5, one may use a more conservative value, say p0 = 0.8. Admittedly, the choice of p0 is arbitrary, but so is the choice of the level of significance α (or the cut-off value of 0.05 for the p-value). A more formal approach would be to control the Bayesian type I error rate defined as BE1(T0) = Pr[T(X) > T0|θ ∈ Θ0]. Notice that if a frequentist method is used to select the cut-off value T0 that satisfies Pr[T(X) > T0|θ] ≤ α for all θ ∈ Θ0, then the same T0 also satisfies BE1(T0) ≤ α for any prior π(θ). In a similar manner, one defines the Bayesian type II error rate as BE2(T0) = Pr[T(X) ≤ T0|θ ∈ Θa]. Clearly, as T0 increases, BE1(T0) decreases while BE2(T0) increases. One possible way to control both types of error is to find a T0 that minimizes the total weighted error, TWE(T0) = w1·BE1(T0) + w2·BE2(T0), for suitable nonnegative weights w1 and w2; in other words, we can determine T̂0 = arg min_{T0} TWE(T0). Notice that choosing w2 = 0 corresponds to controlling only the (Bayesian) type I error rate, while choosing a large value for w2 is equivalent to controlling the (Bayesian) power 1 − BE2(T0). The above procedure can also be used to determine the required sample size once T̂0 is determined. In other words, once T̂0 is determined for a given sample size n, we can find the optimal (minimum) sample size so that the total error rate BE1(T̂0) + BE2(T̂0) ≤ α.
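To make the posterior-probability and Bayes-factor rules concrete, here is a Python sketch for the vitamin C setting with an interval null H0: θ ≤ 0.5 vs. Ha: θ > 0.5 (an interval null is used instead of the point null so that a uniform prior gives both hypotheses positive prior probability; the data s = 14, n = 20 are hypothetical, and the chapter's code is in R):

```python
import math

def log_kern(t, s, n):
    """Posterior kernel [2] under a uniform prior: theta^s (1-theta)^(n-s)."""
    return s * math.log(t) + (n - s) * math.log(1 - t)

def mass(lo, hi, s, n, m=4000):
    """Midpoint-rule approximation of  ∫ K(theta; X) dtheta  over (lo, hi)."""
    h = (hi - lo) / m
    return sum(math.exp(log_kern(lo + (j + 0.5) * h, s, n))
               for j in range(m)) * h

s, n = 14, 20                      # hypothetical: 14 of 20 patients cured
m0, ma = mass(0.0, 0.5, s, n), mass(0.5, 1.0, s, n)
post_prob_Ha = ma / (m0 + ma)      # Pr[theta > 0.5 | X], eq. [6]
# The uniform prior splits (0, 1) evenly, so prior odds are 1
# and the Bayes factor [8] reduces to the posterior odds.
bf = ma / m0
reject_H0 = math.log(bf) > 0       # rule [9]
```

With these data the posterior probability of Ha is high and the Bayes factor favors Ha, so both rules reject H0.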
4. Region Estimation

In addition to computing a point estimator for η or testing whether θ lies in a given region, we can determine an interval, or more generally a region (a subset of the parameter space Θ), such that the posterior probability of that region exceeds a given threshold value (e.g., 0.90). In other words, we would like to determine a subset R = R(X) such that Pr[θ ∈ R(X)|X] ≥ 1 − α for a given value α ∈ (0, 1). The set R(X) is said to be a 100(1 − α)% credible set for θ. The definition of a credible set may sound similar to that of the traditional confidence set (computed by means of frequentist methods), but the two sets have very different interpretations. A 100(1 − α)% credible set R = R(X) guarantees that the (posterior) probability that θ is in R(X) is at least 1 − α, whereas a 100(1 − α)% confidence set C(X) for θ merely asserts that if the method of computing the confidence set were repeated many times, then at least a 1 − α proportion of those confidence sets would contain θ, i.e., Pr[C(X) ∋ θ|θ] ≥ 1 − α for all θ ∈ Θ. Notice that, given an observed data vector X, the chance that a confidence set C(X) contains θ is either 0 or 1. In many applied sciences this fundamental distinction between the interpretations of credible and confidence sets is often overlooked, and many applied researchers wrongly interpret a confidence set as a credible set.

Given a specific level 1 − α of a credible set R(X), it might be of interest to find the "smallest" such set or region that maintains the given level. Such a region is obtained by computing the highest probability density (HPD) region. A region R(X) is said to be an HPD region of level 1 − α if

R(X) = {θ ∈ Θ: K(θ; X) > K0(X)},   [10]
where K0(X) > 0 is chosen to satisfy Pr[θ ∈ R(X)|X] ≥ 1 − α and K(θ; X) denotes the posterior kernel function as defined in [2]. In practice, it might not be straightforward to compute the HPD region R(X) as defined in [10], but many numerical methods (including Monte Carlo methods) are available to compute such regions (see Section 5 for more details). Notice that even when η = η(θ) is a real-valued parameter of interest, the HPD region for η may not be a single interval but may consist of a union of intervals. For example, when the posterior density is bimodal, the HPD region may turn out to be the union of two intervals, each centered around one of the two modes. However, when the posterior density of a real-valued parameter η is unimodal, the HPD region will necessarily be an interval of the form R(X) = (a(X), b(X)) ⊆ Θ, where −∞ ≤ a(X) < b(X) ≤ ∞.
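A crude but serviceable way to approximate an HPD interval for a scalar parameter is to evaluate the posterior kernel on a grid and keep the highest-kernel knots until the desired mass is reached, exactly as [10] suggests. A Python sketch (the Beta(8, 4) posterior is an illustrative assumption; the chapter's code is in R):

```python
import math

def hpd_interval(log_kern, lo, hi, level=0.9, m=4000):
    """Approximate the HPD region [10] on a grid: keep the knots with the
    highest kernel values until their share of the total posterior mass
    reaches `level`."""
    h = (hi - lo) / m
    knots = [lo + (j + 0.5) * h for j in range(m)]
    kern = [math.exp(log_kern(t)) for t in knots]
    total = sum(kern)
    kept, acc = [], 0.0
    for t, k in sorted(zip(knots, kern), key=lambda tk: -tk[1]):
        kept.append(t)
        acc += k
        if acc >= level * total:
            break
    # For a unimodal posterior the kept knots form one interval; for a
    # bimodal posterior they could form a union of intervals instead.
    return min(kept), max(kept)

# 90% HPD interval for the Beta(8, 4) posterior (7 successes in 10 trials,
# uniform prior); mode 0.7 lies inside the interval.
a, b = hpd_interval(lambda t: 7 * math.log(t) + 3 * math.log(1 - t), 0.0, 1.0)
```

Returning min and max of the kept knots implicitly assumes unimodality, as the surrounding text notes; a multimodal posterior would require reporting the kept knots as separate intervals.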
5. Numerical Integration Methods
It is evident that almost any posterior inference based on the kernel K(θ; X) involves high-dimensional integration when the parameter space is itself a high-dimensional subset of the Euclidean space R^m. For instance, in order to compute the posterior mean η̂(X) as defined in [5], we need to compute two (possibly) high-dimensional integrals, one for the numerator and another for the denominator. Similarly, to compute the posterior probabilities in [6] we need to compute two high-dimensional integrals, and to compute the HPD region in [10] we again need high-dimensional integration. Thus, it turns out that, in general, given a generic function g(θ) of the parameter θ, posterior inference requires the computation of the integral

I = I(X) = ∫_Θ g(θ)K(θ; X)dθ.   [11]
Notice that we can take g(θ) = 1 to compute the denominator of [5] or [6], and g(θ) = η(θ) to compute the numerator of [5]. Similarly, we can choose g(θ) = I_{Θj}(θ), the indicator of Θj, to compute the numerator of [6]. There are primarily two numerical approaches to computing integrals of the form I(X) defined in [11]. First, when the dimension m of the parameter space is small (say m < 5), the classical (deterministic) numerical methods usually perform quite well in practice, provided the functions g(θ) and K(θ; X) satisfy certain smoothness conditions. Second, when the dimension m is large (say m ≥ 5), the stochastic integration methods, more popularly known as MC methods, are more suitable. One of the advantages of MC methods is that very few regularity conditions are required of the function g(θ). It is well known that if a (deterministic or stochastic) numerical method requires N functional evaluations to compute the integral I in [11], then the order of accuracy of a deterministic numerical integration is generally O(N^{−2/m}), whereas the order of accuracy of a stochastic integration is O(N^{−1/2}). Thus, the rate at which the error of the deterministic integral converges to zero depends on the dimension m, and hence any such deterministic numerical integration suffers from the so-called curse of dimensionality. On the other hand, the rate at which the error of the Monte Carlo integration converges (in probability) to zero does not depend on the dimension m. However, whenever a deterministic numerical integration can be performed, it usually provides a far more accurate approximation to the targeted integral than MC integration with the same number of function evaluations. In recent years, much progress has been made with both types of numerical integration methods, and we provide a very brief overview of the methods and related software to perform such integrations.

5.1. Deterministic Methods
The commonly used deterministic numerical methods to compute integrals usually depend on some adaptation of the quadrature rules that have been known for centuries and were primarily developed for one-dimensional problems (i.e., when the dimension m of the parameter space is unity). In order to understand the basic ideas, we first consider the one-dimensional (i.e., m = 1) case, where the goal is to compute the integral I = ∫_a^b h(θ)dθ, where h(θ) = g(θ)K(θ; X) and we have suppressed the dependence on the data X (as once X is observed, it is fixed for the rest of the posterior analysis). The basic idea is to find suitable weights wj and ordered knot points θj such that

I = ∫_a^b h(θ)dθ ≈ I_N = ∑_{j=1}^N wj h(θj),   [12]
where N = 2, 3, . . . is chosen large enough to ensure that |I − I_N| → 0 as N → ∞. Different choices of the weights and the knot points lead to different integration rules. A majority of the numerical quadrature rules fall into two broad categories: (i) Newton–Cotes rules, which are generally based on equally spaced knot points (e.g., θj − θj−1 = (b − a)/N), and (ii) Gaussian quadrature rules, which are based on knot points obtained via the roots of orthogonal polynomials and hence are not necessarily equi-spaced. The Gaussian quadrature rules usually provide more accurate results, especially when the function h(·) can be approximated reasonably well by a sequence of polynomials. The simplest examples of Newton–Cotes rules are the trapezoidal rule (a two-point rule), the mid-point rule (a one-point rule), and Simpson's rule (a three-point rule). In general, if a (2k − 1)-point (or 2k-point) rule with N equally spaced knots and suitably chosen weights is used, it can be shown that |I − I_N| = Ck(b − a)^{2k+1} h^{(2k)}(θ*)/N^{2k} for some θ* ∈ (a, b) and a constant Ck > 0. The Gaussian quadratures go a step further by suitably choosing not only the weights but also the knots at which the function h(·) is evaluated. A Gaussian quadrature rule based on only N knots can produce exact results for polynomials of degree 2N − 1 or less, whereas Newton–Cotes rules usually produce exact results for polynomials of degree N (if N is odd) or degree N − 1 (if N is even). Moreover, the Gaussian quadrature rules can be specialized to use K(θ; X) as the weight function. The literature on different rules is huge and is beyond the scope of this chapter (see (31) and Chapter 12 (pp. 65–169) of (33) for more details). In the software R (see Section 6.1), such one-dimensional integration is usually carried out using a function known as integrate, which is based on the popular QUADPACK routines dqags and dqagi (see (33) for further details).
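The two simplest Newton–Cotes rules can be written in a few lines. The Python sketch below (the chapter's own illustrations use R's integrate; the test integrand is chosen for convenience) shows Simpson's three-point rule recovering a cubic exactly while the trapezoidal rule does not:

```python
def trapezoid(h, a, b, n):
    """Composite trapezoidal rule (a Newton-Cotes rule on equally spaced knots)."""
    w = (b - a) / n
    return w * (h(a) / 2 + sum(h(a + j * w) for j in range(1, n)) + h(b) / 2)

def simpson(h, a, b, n):
    """Composite Simpson's rule; n must be even."""
    w = (b - a) / n
    return (w / 3) * (h(a) + h(b)
                      + 4 * sum(h(a + j * w) for j in range(1, n, 2))
                      + 2 * sum(h(a + j * w) for j in range(2, n, 2)))

# Integrate the Beta(3, 2) kernel h(t) = t^2 (1 - t); exact answer is 1/12.
h = lambda t: t * t * (1 - t)
err_trap = abs(trapezoid(h, 0.0, 1.0, 32) - 1 / 12)
err_simp = abs(simpson(h, 0.0, 1.0, 32) - 1 / 12)
# Simpson's rule is exact for this cubic (up to rounding);
# the trapezoidal error decays only at the O(N^{-2}) rate.
```

This matches the exactness statement above: a three-point rule integrates polynomials up to degree three exactly, while the two-point trapezoidal rule is exact only up to degree one.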
Basics of Bayesian Methods

An illustration of using the R function integrate is given in Section 6.1. The above one-dimensional numerical integration rules can be generalized to the higher-dimensional case; in general, if the multidimensional trapezoidal rule is used, one can show that $I = I_N + O(N^{-2/m})$, where N denotes the number of knots at which we have to evaluate the function h(θ) and θ = (θ₁, . . . , θ_m) is now an m-dimensional vector. Thus it easily follows that with increasing dimension m the accuracy of the integration rule diminishes rapidly. One may try to replace the multidimensional trapezoidal rule by a multidimensional Simpson-type rule, but such a rule would still carry an error bound of $O(N^{-4/m})$. In the next section we will see that the error bound of the Monte Carlo methods is $O_p(N^{-1/2})$ irrespective of the dimension m; hence, for very large dimensions, Monte Carlo integration methods have become increasingly popular. Thus, in summary, although the deterministic numerical integration methods are very accurate (sometimes even exact for certain classes of functions), their usefulness declines rapidly with increasing dimension. Moreover, when the parameter space is a complicated region (e.g., embedded in a lower-dimensional space) or the function to be integrated is not sufficiently smooth, the above-mentioned theoretical error estimates are no longer valid. Despite such limitations, a lot of progress has been made in recent years, and an R package known as adapt offers generic tools to perform numerical integration in higher dimensions (e.g., for 2 ≤ m ≤ 20). Many innovative high-dimensional integration methods and related software are available online at http://www.math.wsu.edu/faculty/genz/software/software.html, maintained by Alan Genz at Washington State University.

5.2. Monte Carlo Methods
MC statistical methods are used in statistics and various other fields (physics, operational research, etc.) to solve problems by generating pseudo-random numbers, relying on the fact that sample analogues converge to their population counterparts. The method is useful for obtaining numerical solutions to problems that are too complicated to solve analytically or by deterministic methods. The most common application of the Monte Carlo method is Monte Carlo integration. The catchy name "Monte Carlo," derived from the famous casino in Monaco, was probably popularized by notable physicists such as Stanislaw Marcin Ulam, Enrico Fermi, John von Neumann, and Nicholas Metropolis. S. M. Ulam (whose uncle was a gambler) noticed a similarity between casino activities and the random, repetitive nature of the simulation process and coined the term Monte Carlo in honor of his uncle (see (34)). The success of MC methods of integration relies primarily on two celebrated results in statistics: (i) the (strong or weak) Law of Large Numbers (LLN) and (ii) the Central Limit Theorem (CLT). For the moment suppose that we can generate a sequence of iid observations, θ^(1), θ^(2), . . . , by using the kernel K(θ;X) as defined in [2]. In many cases it is possible to generate pseudo-random samples using only the kernel of a density (e.g., by accept–reject sampling), and several innovative algorithms are available (see (35) and (36)). Later we will relax this assumption when we describe an algorithm known as importance sampling.
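As a concrete illustration of drawing pseudo-random samples using only a kernel, here is a hedged accept–reject sketch (ours, not the chapter's; the kernel is the vitamin C posterior kernel used later in Section 6, under a uniform prior):

```python
import random

def accept_reject(kernel, envelope, n_draws, rng):
    # Sample from the density proportional to kernel() on [0, 1] using a
    # Uniform(0, 1) proposal; envelope must satisfy envelope >= max kernel.
    draws = []
    while len(draws) < n_draws:
        theta = rng.random()                     # propose theta ~ U(0, 1)
        if rng.random() * envelope <= kernel(theta):
            draws.append(theta)                  # accept w.p. kernel/envelope
    return draws

s, n = 15, 20
kernel = lambda t: t**s * (1.0 - t)**(n - s)     # posterior kernel, uniform prior
envelope = kernel(s / n)                         # kernel is maximized at theta = s/n
rng = random.Random(1)
draws = accept_reject(kernel, envelope, 20000, rng)
# LLN: the sample mean converges to the posterior mean (s+1)/(n+2) = 16/22.
post_mean = sum(draws) / len(draws)
```

Note that only the unnormalized kernel is ever evaluated — the normalizing constant is never needed, which is exactly the point made in the text.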
First assume that we can generate $\theta^{(l)} \stackrel{iid}{\sim} p(\theta|X)$ for $l = 1, 2, \ldots, N$ by using only $K(\theta;X)$. Then it follows from the (strong/weak) LLN that, as $N \to \infty$,

$$\bar{g}_N = \frac{1}{N}\sum_{l=1}^{N} g(\theta^{(l)}) \stackrel{p}{\longrightarrow} E[g(\theta)|X] = \int g(\theta)\,p(\theta|X)\,d\theta, \qquad [13]$$
where $p(\theta|X)$ denotes the posterior density and it is assumed that $E[|g(\theta)|\,|X] < \infty$, i.e., the posterior mean of $g(\theta)$ is finite. Notice that by using the above result we can avoid computing the numerator and denominator of [5] separately, provided we are able to generate random samples using only the kernel of the posterior density. In other words, with probability one the sample mean $\bar g_N$ defined in [13] converges to the population mean $E[g(\theta)|X]$ as $N$ becomes large. Also notice that almost no smoothness condition on the function $g(\cdot)$ is required to apply the MC method. Although $N$ can be chosen as large as we wish, in practice it would be nice to know how large is good enough for a given application. To determine an approximate lower bound on $N$, we can use the CLT, which states that as $N \to \infty$,

$$\sqrt{N}\,(\bar g_N - E[g(\theta)|X]) \stackrel{d}{\longrightarrow} N(0, \Sigma_g), \qquad [14]$$

where $\Sigma_g = E[(g(\theta) - E[g(\theta)|X])(g(\theta) - E[g(\theta)|X])'\,|X]$ is the posterior variance (matrix) of $g(\theta)$, provided $E[\|g(\theta)\|^2\,|X] < \infty$. Hence $\bar g_N = E[g(\theta)|X] + O_p(N^{-1/2})$, and this (stochastic) error bound does not depend on the dimension $m$ of $\theta$. For simplicity, we illustrate how to obtain an approximate value of $N$ using the CLT when $g(\cdot)$ is real-valued. Suppose for a given small number $\epsilon > 0$ we want to determine $N$ such that, with 95% probability, $\bar g_N$ is within $\pm\epsilon$ of the posterior mean $E[g(\theta)|X]$. By the CLT, an approximate 95% confidence interval for $E[g(\theta)|X]$ is $\bar g_N \pm 1.96\,\sigma_g/\sqrt{N}$. So we want $1.96\,\sigma_g/\sqrt{N} \le \epsilon$, or equivalently

$$N \ge \frac{(1.96)^2\,\sigma_g^2}{\epsilon^2}.$$

Although $\sigma_g^2$ is unknown, we can "estimate" it by the sample variance $\hat\sigma_g^2 = \sum_{l=1}^{N} (g(\theta^{(l)}) - \bar g_N)^2/(N-1)$. In summary, if we want an accuracy of $\epsilon > 0$ with about 95% confidence, we can find $N$ roughly satisfying $N \ge 4\hat\sigma_g^2/\epsilon^2$. For example, if we assume $\sigma_g^2 = 1$ and require first-decimal accuracy (i.e., $\epsilon \approx 0.1$), then we need to generate at least $N = 400$ random samples from the posterior distribution. However, if we require second-decimal accuracy (i.e., $\epsilon \approx 0.01$), then we need to generate at least $N = 40{,}000$ samples! In general, one may need to generate a very large number of samples
from the posterior distribution to obtain reasonable accuracy of the posterior means (or generalized moments). The success of the MC method of integration comes at a price: the resulting estimate $\bar g_N$ of the posterior mean is no longer a fixed value but depends on the particular random-number-generating mechanism (e.g., the "seed" used to generate the samples). In other words, two analysts using the same data and the same model (sampling density and prior) will not necessarily obtain exactly the same estimate of a posterior summary. MC integration provides only a probabilistic error bound, as opposed to the fixed error bound of the deterministic integration methods discussed in the previous section. Moreover, the stochastic error bound of the MC method can hardly be improved (though there exist methods to reduce the sampling variability) and will generally be inferior to deterministic integration when m ≤ 4 and the functions are sufficiently smooth. Hence, if the function g(θ) is smooth and the dimension m of θ is not very large, it is more efficient to use deterministic integration (e.g., the adapt function available in R). On the other hand, if the dimension is large, g(·) is nonsmooth, or the parameter space is complicated, the MC integration methods are more advantageous. The next question is what happens when it is not possible to generate samples from the posterior density just by using the kernel function K(θ;X). In such situations we can use importance sampling, which draws samples from another density and then uses a weighted sample mean to approximate the posterior mean. The idea is based on the following simple identity:

$$\int g(\theta)\,K(\theta;X)\,d\theta = \int g(\theta)\,\frac{K(\theta;X)}{q(\theta)}\,q(\theta)\,d\theta = \int g(\theta)\,w(\theta)\,q(\theta)\,d\theta, \qquad [15]$$

where $q(\theta)$ is a probability density function whose support contains that of $K(\theta;X)$, i.e., $\{\theta \in \Theta : K(\theta;X) > 0\} \subseteq \{\theta \in \Theta : q(\theta) > 0\}$. The function $w(\theta) = K(\theta;X)/q(\theta)$ defined in [15] is called the importance weight function, and the density $q(\theta)$ is called the importance proposal density. Thus, it follows from the identity in [15] that we can use the following algorithm to estimate $I = I(X)$ as defined in [11]:
1. Generate $\theta^{(l)} \stackrel{iid}{\sim} q(\theta)$ for $l = 1, 2, \ldots, N$.
2. Compute the importance weights $w^{(l)} = w(\theta^{(l)})$ for $l = 1, 2, \ldots, N$.
3. Compute $\bar I_N = \frac{1}{N}\sum_{l=1}^{N} g(\theta^{(l)})\,w^{(l)}$.

Again it follows from the LLN that $\bar I_N \stackrel{p}{\longrightarrow} I$ as $N \to \infty$. Notice that we can also estimate the posterior mean $E[g(\theta)|X]$ by the self-normalized quantity $\bar g_N = \sum_{l=1}^{N} g(\theta^{(l)})\,w^{(l)} \big/ \sum_{l=1}^{N} w^{(l)}$, since $\sum_{l=1}^{N} w^{(l)}/N$ converges in probability to the normalizing constant $\int K(\theta;X)\,d\theta$. Importance sampling as described above may sound like an all-purpose method for computing posterior summaries, since in theory the importance density $q(\cdot)$ can be any density whose support contains the support of the posterior density. In practice, however, the choice of $q(\theta)$ is often not easy, especially when $\theta$ is high-dimensional. In such cases MCMC methods are used to generate (dependent) samples from the posterior distribution, again making use only of the posterior kernel $K(\theta;X)$ (37). An MCMC method is based on sampling from the path of a (discrete-time) Markov chain $\{\theta^{(l)} : l = 1, 2, \ldots\}$ whose stationary distribution is the posterior distribution. In other words, given the posterior distribution as the target stationary distribution, an MCMC method constructs a suitable transition kernel of a Markov chain. One of the most popular approaches is the Metropolis algorithm, which was later generalized by Hastings and is now referred to as the Metropolis–Hastings algorithm (38). It is beyond the scope of this chapter to explain the details of the Markov chain theory underlying MCMC methods. One striking difference between the direct MC method described above and an MCMC method is that the samples generated by an MCMC algorithm are dependent, whereas the samples obtained via direct MC methods (e.g., rejection sampling or importance sampling) are independent. Further, the samples generated by an MCMC method are not drawn directly from the posterior distribution but only approach it in an asymptotic sense. This is the primary reason for discarding the first few thousand samples generated by an MCMC procedure as burn-in. Thus, if $\{\theta^{(l)} : l = 1, 2, \ldots\}$ denotes a sequence of samples generated by an MCMC procedure, then one uses only $\{\theta^{(l)} : l = B+1, B+2, \ldots\}$ to approximate the posterior summaries, for a suitably large integer $B > 1$. In other words, if we want to estimate $\eta = \eta(\theta)$ based on MCMC samples $\theta^{(l)}$, we can compute $\bar\eta = \sum_{l=B+1}^{N} \eta(\theta^{(l)})/(N-B)$ to approximate its posterior mean $E[\eta|X]$, for some large integers $N > B > 1$. A generic way to implement MCMC methods and obtain posterior estimates for a variety of problems is the freely available software WinBUGS, which can be downloaded from http://www.mrc-bsu.cam.ac.uk/bugs/.
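The importance-sampling steps above can be sketched in a few lines (our illustration, not the chapter's code: a Uniform(0,1) proposal for the vitamin C kernel of Section 6, with g taken — our choice — to be the log-odds):

```python
import math, random

rng = random.Random(7)
s, n = 15, 20
K = lambda t: t**s * (1.0 - t)**(n - s)   # posterior kernel K(theta; X)
g = lambda t: math.log(t / (1.0 - t))     # g(theta) = log-odds (our choice)

# Proposal q = Uniform(0, 1): its support [0, 1] contains that of K,
# so the importance weight is w(theta) = K(theta)/q(theta) = K(theta).
N = 50000
thetas = [rng.random() for _ in range(N)]
weights = [K(t) for t in thetas]

# Step 3 of the algorithm: unnormalized estimate of I = I(X).
I_bar = sum(g(t) * w for t, w in zip(thetas, weights)) / N
# Self-normalized estimator of the posterior mean E[g(theta)|X].
g_bar = sum(g(t) * w for t, w in zip(thetas, weights)) / sum(weights)
```

Here g_bar approximates the posterior mean of the log-odds (about 1.03 for these data under a uniform prior) without ever evaluating the normalizing constant.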
6. Examples

Let us consider again the vitamin C example, where we have $f(X|\theta) = L(\theta;X) = L(\theta|s) = \theta^s(1-\theta)^{n-s}$ and $K(\theta;X) = K(\theta|s) = \theta^s(1-\theta)^{n-s}\,k(\theta)$, where $k(\theta)$ is a kernel of a (prior) density on [0, 1]. If we are interested in estimating the odds ratio $\rho = \theta/(1-\theta)$ or the log-odds ratio $\eta = \log\rho$, we can compute the posterior estimator using [5] as follows:

$$\hat\eta = \frac{\int_0^1 \log\!\big(\tfrac{\theta}{1-\theta}\big)\,\theta^s(1-\theta)^{n-s}\,k(\theta)\,d\theta}{\int_0^1 \theta^s(1-\theta)^{n-s}\,k(\theta)\,d\theta}, \qquad [16]$$

for a given choice of the prior kernel k(θ). We now illustrate the computation of [16] using the following kernels: (a) Beta prior: k(θ) = θ^(a−1)(1−θ)^(b−1), where a, b > 0. In particular, a = b = 1 yields the uniform prior, a = b = 0.5 yields Jeffreys' prior, and a = b = 0 yields Haldane's prior (which is improper). When using the Haldane prior, one should check that s(n − s) > 0; otherwise the prior leads to an improper posterior distribution. (b) Zellner's prior: k(θ) = θ^θ(1−θ)^(1−θ). For illustration, suppose 15 out of 20 randomly selected patients reported that the common cold was cured with vitamin C within a week, giving n = 20 and s = 15. See Section 6.1 for R code to obtain posterior estimates of θ and log(θ/(1−θ)).

Now suppose that in addition to the responses x_i ∈ {0, 1} we also have patient-level predictor variables such as gender, race, age, etc., which we denote by the vector z_i. In that case we need a regression model (e.g., a logistic model) to account for subject-level heterogeneity. Let θ_i = Pr[x_i = 1|z_i] = θ(z_i) for i = 1, . . . , n. We may use several parametric models for the regression function θ(·), e.g., θ(z) = (1 + exp{−β′z})^(−1) (which corresponds to a logistic regression model), θ(z) = Φ(β′z) (which corresponds to a probit regression model), or more generally θ(z) = F₀(β′z), where F₀(·) is a given distribution function supported on the whole real line. In general we can express the framework as a BHM as follows:

x_i | z_i ∼ Ber(θ_i),
θ_i = F₀(β′z_i),
β ∼ N(0, c₀I),

where c₀ > 0 is a large constant and I denotes the identity matrix. In Section 6.2 we provide generic WinBUGS code to implement the model.

6.1. R
To compute the posterior mean estimate of the log-odds ratio η we can use the following code:

#Prior kernel functions:
#Beta prior: fix a and b
#a=1;b=1
#k=function(theta){
#theta^(a-1)*(1-theta)^(b-1)}
#Zellner's prior:
k=function(theta){
theta^theta*(1-theta)^(1-theta)}
#Set observed values of s and n:
s=15;n=20
#Likelihood function:
L=function(theta){
theta^s*(1-theta)^(n-s)}
#Posterior kernel function:
K=function(theta){
k(theta)*L(theta)}
#Set the parameter of interest:
#Odds ratio:
#num.fun=function(theta){
#theta/(1-theta)*K(theta)}
#Log-odds ratio:
num.fun=function(theta){
log(theta/(1-theta))*K(theta)}
#Obtain the numerator and
#denominator to compute posterior mean:
num=integrate(num.fun,lower=0,upper=1)
den=integrate(K,lower=0,upper=1)
post.mean=num$value/den$value
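For readers who want to check the R result independently, here is a rough Python translation of the same computation (our sketch; it substitutes a composite Simpson rule for R's adaptive integrate and trims the endpoints, where the log-odds is undefined):

```python
import math

def simpson(f, a, b, n=2000):
    # Composite Simpson's rule on [a, b] with n (even) subintervals.
    step = (b - a) / n
    total = f(a) + f(b)
    total += sum((4 if j % 2 else 2) * f(a + j * step) for j in range(1, n))
    return total * step / 3.0

s, n_obs = 15, 20
k = lambda t: t**t * (1.0 - t)**(1.0 - t)           # Zellner's prior kernel
L = lambda t: t**s * (1.0 - t)**(n_obs - s)         # likelihood
K = lambda t: k(t) * L(t)                           # posterior kernel
num_fun = lambda t: math.log(t / (1.0 - t)) * K(t)  # log-odds integrand

# Shrink the endpoints slightly: log(t/(1-t)) is undefined at 0 and 1,
# and the kernel vanishes there anyway.
eps = 1e-9
post_mean = simpson(num_fun, eps, 1.0 - eps) / simpson(K, eps, 1.0 - eps)
```

For these data (s = 15, n = 20) the posterior mean of the log-odds comes out near 1, in line with the observed log-odds log(15/5) ≈ 1.1.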
Notice that the above code can be easily modified to compute the posterior mean of a one-dimensional parameter of interest for any other likelihood and prior kernel.

6.2. WinBUGS
To compute the posterior mean estimate of the regression coefficient β we can use the following code:

model{
for(i in 1:n){
x[i]~dbern(theta[i])
theta[i]<-1/(1+exp(-inprod(beta[1:p],z[i,1:p])))
#theta[i]<-phi(inprod(beta[1:p],z[i,1:p]))
}
beta[1:p]~dmnorm(zero[], Tau[,])
}
Data:
list(n=20, p=2, zero=c(0,0),
Tau=structure(.Data=c(0.01, 0, 0, 0.01), .Dim=c(2,2)),
x=c(1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1),
z=structure(.Data=c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
2.03, -0.87, 0.21, -0.56, 2.31, -0.87, -0.11, -2.16, 0.56, 0.53, 1.01, -0.87,
-0.27, 0.42, 1.1, -0.83, 0.51, -1.07, -0.39, -0.71), .Dim = c(20, 2)))
Inits:
list(beta=c(0, 0))
The above code can be easily modified to accommodate other prior distributions for β and other link functions for θi ’s.
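WinBUGS fits this model by MCMC, drawing dependent samples from the posterior using only its kernel. As a hedged illustration of what such a sampler does (a minimal random-walk Metropolis sketch of the same logistic model, ours — not WinBUGS's actual, more sophisticated algorithm), including the burn-in step discussed in Section 5.2:

```python
import math, random

rng = random.Random(3)
# Data from the WinBUGS listing above: intercept column plus one covariate.
x = [1,0,1,1,1,1,1,1,1,0,0,1,0,0,1,1,1,1,1,1]
z2 = [2.03,-0.87,0.21,-0.56,2.31,-0.87,-0.11,-2.16,0.56,0.53,
      1.01,-0.87,-0.27,0.42,1.1,-0.83,0.51,-1.07,-0.39,-0.71]

def log_kernel(beta):
    # Log posterior kernel: logistic likelihood plus N(0, 100 I) prior,
    # matching the precision Tau = 0.01 * I in the WinBUGS data block.
    lp = -(beta[0] ** 2 + beta[1] ** 2) / (2.0 * 100.0)
    for xi, zi in zip(x, z2):
        eta = beta[0] + beta[1] * zi
        lp += xi * eta - math.log1p(math.exp(eta))  # log Bernoulli likelihood
    return lp

beta, lp_curr, chain = [0.0, 0.0], log_kernel([0.0, 0.0]), []
for _ in range(6000):
    prop = [b + rng.gauss(0.0, 0.5) for b in beta]  # random-walk proposal
    lp_prop = log_kernel(prop)
    if math.log(rng.random()) < lp_prop - lp_curr:  # Metropolis accept step
        beta, lp_curr = prop, lp_prop
    chain.append(beta)

B = 1000  # discard burn-in: early draws are not yet from the posterior
post_beta0 = sum(b[0] for b in chain[B:]) / (len(chain) - B)
```

Only the posterior kernel is evaluated; the retained draws after the burn-in B are averaged, exactly as in the formula for η̄ in Section 5.2.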
References

1. Press, J. S. and Tanur, J. M. (2001) The Subjectivity of Scientists and the Bayesian Approach. Wiley, New York. 2. Berry, D. A. (1996) Statistics: A Bayesian Perspective. Wiley, New York. 3. Winkler, R. L. (2003) Introduction to Bayesian Inference and Decision. 2nd Edition. 4. Bolstad, W. M. (2004) Introduction to Bayesian Statistics. John Wiley, New York. 5. Lee, P. M. (2004) Bayesian Statistics: An Introduction. Arnold, New York. 6. Sivia, D., and Skilling, J. (2006) Data Analysis: A Bayesian Tutorial. Oxford University Press, New York. 7. Berger, J. O. (1985) Statistical Decision Theory and Bayesian Analysis. 2nd Edition. Springer-Verlag, New York. 8. Bernardo, J. M., and Smith, A. F. M. (1994) Bayesian Theory. Wiley, Chichester. 9. Box, G. E. P., and Tiao, G. C. (1992) Bayesian Inference in Statistical Analysis. Wiley, New York. 10. Carlin, B. P., and Louis, T. A. (2008) Bayesian Methods for Data Analysis. 3rd Edition. Chapman & Hall/CRC, Boca Raton, Florida. 11. Congdon, P. (2007) Bayesian Statistical Modelling. 2nd Edition. Wiley, New York. 12. Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2003) Bayesian Data Analysis. 2nd Edition. CRC Press, New York. 13. Ghosh, J. K., Delampady, M., and Samanta, S. (2006) An Introduction to Bayesian Analysis. Springer, New York. 14. Robert, C. P. (2001) The Bayesian Choice. Springer-Verlag, New York. 15. Bayes, T. (1764) An Essay Toward Solving a Problem in the Doctrine of Chances. Philos. Trans. R. Soc. London 53, 370–418. 16. Laplace, P. S. (1774) Mémoire sur la probabilité des causes par les événements. Mémoires de mathématique et de physique présentés à l'Académie royale des sciences par divers savants & lus dans ses assemblées 6, 621–656.
17. Kolmogorov, A. N. (1930) Sur la loi forte des grands nombres. Comptes Rendus de l'Académie des Sciences 191, 910–912. 18. Pitman, E. (1936) Sufficient statistics and intrinsic accuracy. Proc. Camb. Phil. Soc. 32, 567–579. 19. Koopman, B. (1936) On distributions admitting a sufficient statistic. Trans. Amer. Math. Soc. 39, 399–409. 20. Diaconis, P. and Ylvisaker, D. (1979) Conjugate priors for exponential families. Ann. Stat. 7, 269–281. 21. Consonni, G., and Veronese, P. (1992) Conjugate priors for exponential families having quadratic variance functions. J. Amer. Stat. Assoc. 87, 1123–1127. 22. Consonni, G., and Veronese, P. (2001) Conditionally reducible natural exponential families and enriched conjugate priors. Scand. J. Stat. 28, 377–406. 23. Gutiérrez-Peña, E. (1997) Moments for the canonical parameter of an exponential family under a conjugate distribution. Biometrika 84, 727–732. 24. Jeffreys, H. (1946) An invariant form for the prior probability in estimation problems. Proc. R. Soc. London (Ser. A) 186, 453–461. 25. Hartigan, J. A. (1998) The maximum likelihood prior. Ann. Stat. 26, 2083–2103. 26. Bernardo, J. M. (1979) Reference posterior distributions for Bayesian inference. J. R. Stat. Soc. B 41, 113–147 (with discussion). 27. Zellner, A. (1996) Models, prior information, and Bayesian analysis. J. Econom. 75, 51–68. 28. Lhoste, E. (1923) Le calcul des probabilités appliqué à l'artillerie, lois de probabilité a priori. Revue d'artillerie, Mai–Août, Berger-Levrault, Paris. 29. Berger, J. O., and Strawderman, W. E. (1996) Choice of hierarchical priors: Admissibility in estimation of normal means. Ann. Stat. 24, 931–951.
30. Efron, B., and Morris, C. (1975) Data analysis using Stein’s estimator and its generalizations. J. Amer. Stat. Assoc. 70, 311–319. 31. Davis, P. J., and Rabinowitz, P. (1984) Methods of Numerical Integration. 2nd Edition. Academic Press, New York. 32. Ueberhuber, C. W. (1997) Numerical Computation 2: Methods, Software, and Analysis. Springer-Verlag, Berlin. 33. Piessens, R., de Doncker-Kapenga, E., Uberhuber, C., and Kahaner, D. (1983) QUADPACK, A Subroutine Package for Automatic Integration. Springer-Verlag, Berlin.
34. Metropolis, N., and Ulam, S. (1949) The Monte Carlo method. J. Amer. Stat. Assoc. 44, 335–341. 35. Devroye, L. (1986) Non-Uniform Random Variate Generation. Springer-Verlag, New York. 36. Gentle, J. E. (2003) Random Number Generation and Monte Carlo Methods. 2nd Edition. Springer-Verlag, New York. 37. Berg, B. A. (2004) Markov Chain Monte Carlo Simulations and Their Statistical Analysis. World Scientific, Singapore. 38. Chib, S. and Greenberg, E. (1995) Understanding the Metropolis–Hastings algorithm. Am. Stat. 49(4), 327–335.
Chapter 4

The Bayesian t-Test and Beyond

Mithat Gönen

Abstract

In this chapter we explore Bayesian alternatives to the t-test. We saw in Chapter 1 how the t-test can be used to test whether the expected outcomes of two groups are equal. In Chapter 3 we saw how, in principle, to make inferences from a Bayesian perspective. In this chapter we put these together to develop a Bayesian procedure for a t-test. This procedure depends on the data only through the t-statistic. It requires prior inputs, and we discuss how to assign them. We use an example from a microarray study to demonstrate the practical issues. The microarray study is an important application for the Bayesian t-test because it naturally raises the question of simultaneous t-tests. It turns out that the Bayesian procedure can easily be extended to carry out several t-tests on the same data set, provided some attention is paid to the correlation between tests.

Key words: Two-sample comparisons, prior correlations, multivariate testing, simultaneous inferences, multiple testing.
1. Introduction to the t-Test

The two-sample t-test, sometimes called Student's t-test after the pseudonym used by William Gosset, who first proposed the method, is arguably the most commonly used statistical test in biological research. Its practical importance is easy to appreciate. Most experiments have two or more groups to compare, and when there are more than two groups it is often of interest to compare pairs of groups. Hence a tool that assigns statistical significance to a comparison between two groups aptly receives attention. I will briefly review the critical aspects of the t-test, not only to establish notation but also to make some observations that will be useful when we develop the Bayesian counterpart.

H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_4, © Springer Science+Business Media, LLC 2010
Gönen
Let us call the two groups to be compared 1 and 2 and the variable of interest y. It is easiest to think of the data as two vectors of numbers, Y1 and Y2, where each vector represents the measurements taken in one of the groups. Comparing Y1 and Y2 in the most general sense involves comparing the probability distributions they are associated with. The two-sample t-test makes two critical assumptions which reduce the problem of comparing the distributions to comparing the means:
1. Both Y1 and Y2 come from normal distributions.
2. The distributions of Y1 and Y2 have the same nonzero variance.
Now recall that the mean and variance completely identify a normal distribution, so under these two assumptions the distributions that generated Y1 and Y2 differ through their means, if at all. If the means of Y1 and Y2, say μ1 and μ2, are the same, then the data vectors must have come from the same distribution. Therefore one can state the null hypothesis of interest as H0: μ1 = μ2 against the alternative H1: μ1 ≠ μ2. It is important to note parenthetically that the second assumption can be relaxed, leading to a version of the t-test known as the Welch test; but the version that assumes equal variances remains the most popular. Denote the observed group means by $\bar Y_1$ and $\bar Y_2$. The magnitude of the difference between the two groups is given by δ = μ1 − μ2 and estimated by $\hat\delta = \bar Y_1 - \bar Y_2$. To determine statistical significance we need to understand how variable $\hat\delta$ is: had we repeated the experiment under the same conditions, how far would $\hat\delta$ have been from what we observed? Note that $\hat\delta$ is a random variable (its value would change from one repetition of the experiment to another) and as such it has a variance. The variance of $\hat\delta$, to be denoted $\mathrm{Var}(\hat\delta)$, represents the variability we need to judge significance and is a function of $\mathrm{Var}(Y_2)$ (which is assumed to be equal to $\mathrm{Var}(Y_1)$).
Since the two variances are assumed to be equal, it makes sense to "pool" the data from the two groups to estimate this variance. If the sample (estimated) variance of each group is denoted by $s_1^2$ and $s_2^2$, then the pooled variance is given by

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}, \qquad [1]$$

where $n_1$ and $n_2$ are the sizes (numbers of observations) of the two groups. This is essentially a weighted average of the group variances. The weights are (almost) the sample sizes, and sample-size-weighted averages are commonly employed when pooling information from various sources to estimate a common parameter. There is no good intuitive explanation for using $n_1 - 1$ and $n_2 - 1$ as the weights instead of $n_1$ and $n_2$, but for purposes of understanding the principles it does not matter which weights are used. Using this pooled estimate of the variance leads to the following formula for the two-sample t-statistic (also see Chapter 1):

$$t = \frac{\bar y_1 - \bar y_2}{s_p / n_\delta^{1/2}}, \qquad [2]$$

where

$$n_\delta = \left(n_1^{-1} + n_2^{-1}\right)^{-1} \qquad [3]$$
can be thought of as the "effective sample size" because it penalizes an experiment where the groups are not of equal sizes (a smaller $n_\delta$ means a smaller t). The next step is to derive the distribution of t under the null hypothesis. Letting ν = n1 + n2 − 2, H0 is rejected in favor of H1 when |t| ≥ t{1 − α/2, ν}, where t{1 − α/2, ν} is the 1 − α/2 quantile of the T distribution with ν degrees of freedom (df), to be denoted Tν. The two-sided p-value is obtained as p = 2 × P(T ≥ |t|), where T has the Tν distribution, and is routinely reported by most statistical software packages.

Example. Suppose we have the expression of a particular gene from a microarray experiment in patients with a particular type of cancer, and we want to compare gene expression according to the site of the cancer. This example is motivated by a published study (1) in which the cancer is the gastrointestinal stromal tumor (GIST) and the two sites of interest are the stomach and the small bowel. We observe Y1 = (7970, 228, 12486, 6879, 8692, 7166, 6275, 4652, 5626, 7933, 9280) and Y2 = (273, 5836, 1276, 1794, 3, 89, 312, 216, 420, 472, 8772, 1163, 430, 266, 7443, 654, 369, 315, 947, 317, 284, 799, 544, 436, 275). The value of the t-statistic is 6.0935 with 34 df, which is highly significant (p < 0.001). This test can be performed in R using the t.test function. Without getting into details at this point, we note that a logarithmic transformation may be appropriate here, in which case one finds t = 4.5932, again highly significant with 34 df. The use of the log transform will be explored in more detail in the next section.
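The arithmetic of [1]–[3] for these data can be checked directly. The following Python sketch (ours — in practice one would simply call R's t.test) reproduces the pooled-variance t-statistic for the GIST data:

```python
import math

Y1 = [7970, 228, 12486, 6879, 8692, 7166, 6275, 4652, 5626, 7933, 9280]
Y2 = [273, 5836, 1276, 1794, 3, 89, 312, 216, 420, 472, 8772, 1163, 430,
      266, 7443, 654, 369, 315, 947, 317, 284, 799, 544, 436, 275]

def two_sample_t(y1, y2):
    n1, n2 = len(y1), len(y2)
    m1, m2 = sum(y1) / n1, sum(y2) / n2
    s1sq = sum((v - m1) ** 2 for v in y1) / (n1 - 1)  # sample variances
    s2sq = sum((v - m2) ** 2 for v in y2) / (n2 - 1)
    sp2 = ((n1 - 1) * s1sq + (n2 - 1) * s2sq) / (n1 + n2 - 2)  # eq [1]
    n_delta = 1.0 / (1.0 / n1 + 1.0 / n2)             # effective size, eq [3]
    t_stat = (m1 - m2) / math.sqrt(sp2 / n_delta)     # eq [2]
    return t_stat, n1 + n2 - 2

t_stat, df = two_sample_t(Y1, Y2)   # about 6.09 with 34 df, as in the text
```

Note how the unbalanced group sizes (11 versus 25) shrink the effective sample size $n_\delta$ to about 7.6.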
2. The Bayesian t-Test

2.1. General Principles
As mentioned in the previous section, the t-test is ubiquitous, and one might naturally think there is a uniquely defined Bayesian version that is equally popular. Alas, there are many different ways to analyze the same type of data, but no single well-accepted test. I surmise that one reason for this is the lack of a closed-form Bayesian test statistic that intuitively connects with the usual t-test. Therefore the goal of this section is to derive a Bayesian t-test that depends on the data only through the t-statistic and thus uses identical information from the sample. It is important to emphasize one fact, lest it be misinterpreted: the two-sample testing problem features prominently in Bayesian statistics (2–6). Most of these methods use μ1 and μ2 to parameterize the problem and work through the distribution of μ1 − μ2. Until the emergence of posterior simulation methods, the popular solution was to use analytic approximations for the posterior distributions; more recent work replaces the analytic approximations with numerical ones. This is potentially an alienating factor: most Bayesians quickly became used to the availability of general-purpose posterior simulation software such as BUGS/WinBUGS and came to accept numerical approximations as practically tenable, while non-statisticians, and even applied statisticians unfamiliar with Bayesian software, might feel uncomfortable with the need for numerical approximations and with the investment required to learn new software for this purpose. If indeed this is the case, then the test proposed in this chapter can be routinely used without these concerns, since it requires no approximation. Chapter 3 explained the principles and motivations behind a Bayesian philosophy of statistics. Familiarity with that material is essential to understand the rest of this chapter.
For those readers who may find themselves referring back to Chapter 3, it will be important to note the similarities and differences in notation. The parameter of interest, denoted θ in Chapter 3, is δ here. The two data vectors Y1 and Y2 collectively replace the X of Chapter 3. Chapter 3 discussed hypothesis testing in general terms using regions Θ0 and Θ1 for the null and alternative hypotheses. In this chapter, Θ0 is taken to be δ = 0 and Θ1 is taken to be δ ≠ 0. The region δ = 0 looks innocuous but needs to be handled with care, since any continuous prior on δ will imply P(δ = 0) = 0, which in turn implies P(δ = 0|data) = 0. The standard approach is to assign prior probabilities π0 and π1 (π0 + π1 = 1) to hypotheses H0 and H1, respectively, and then update these values
via Bayes' theorem to obtain the posterior probability of H0:

$$P(H_0|\text{data}) = \frac{\pi_0\,P(\text{data}|H_0)}{\pi_0\,P(\text{data}|H_0) + \pi_1\,P(\text{data}|H_1)}, \qquad [4]$$
where $P(\text{data}|H_j)$ denotes the marginal density of the data under hypothesis $H_j$. The importance of the term "marginal" will become clear later, if it is not already. In most cases π0 will be taken to be 0.5, but there will be cases where strong prior information dictates another choice. The decision to reject or retain the null hypothesis H0 can be based on the magnitude of this posterior probability. It is tempting to think that if P(H0|data) < 0.05 then the null hypothesis should be rejected, but protection of the type I error rate (where 0.05 comes from) has no role in this analysis. Although there is no standard threshold for Bayesian hypothesis testing, 0.5 is commonly used: a Bayesian procedure would retain the null if P(H0|data) > 0.5 and reject it otherwise. This amounts to retaining the more likely hypothesis and is a logically defensible choice. Since the posterior probabilities are sensitive to the priors π0 and π1, it is often suggested to use the Bayes factor (BF) instead (compare with Chapter 3):

$$BF = \frac{P(\text{data}\,|\,H_0)}{P(\text{data}\,|\,H_1)}. \qquad [5]$$
When BF > 1 the data provide evidence for H0, and when BF < 1 the data provide evidence for H1 (and against H0). Jeffreys (7) suggests that BF < 0.1 provides "strong" evidence against H0 and BF < 0.01 "decisive" evidence. The posterior probability is related to the BF simply as

$$P(H_0|\text{data}) = \left[1 + \frac{\pi_1}{\pi_0}\,BF^{-1}\right]^{-1}. \qquad [6]$$
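Relation [6] is a one-liner in code. The sketch below (function name ours) evaluates it at even prior odds and at Jeffreys' "strong evidence" threshold:

```python
def posterior_prob_h0(bf, pi0=0.5):
    # Eq [6]: P(H0|data) = [1 + (pi1/pi0) * BF^{-1}]^{-1}, with pi1 = 1 - pi0.
    pi1 = 1.0 - pi0
    return 1.0 / (1.0 + (pi1 / pi0) / bf)

# With equal prior odds, BF = 1 leaves the posterior probability at 0.5,
# while Jeffreys' "strong evidence" threshold BF = 0.1 gives 1/11.
p_even = posterior_prob_h0(1.0)
p_strong = posterior_prob_h0(0.1)
```

This makes concrete how a BF of 0.1 translates, under π0 = 0.5, into a posterior probability of H0 of about 0.09 — well below the common 0.5 rejection threshold.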
While [4] is simple and elegant, it is not immediately obvious how to compute the key quantity P(data|Hj), because the notation obscures the fact that each hypothesis posits a set of parameters that describe the distribution of the data. In the two-sample case in particular, these parameters are μ = μ1 = μ2 under H0 and (μ1, μ2) under H1. P(data|H0) can then be thought of in two parts, P(data|μ) and P(μ|H0), related by

$$P(\text{data}|H_0) = \int P(\text{data}|\mu)\,P(\mu|H_0)\,d\mu. \qquad [7]$$
Similarly,

$$P(\text{data}|H_1) = \iint P(\text{data}|\mu_1, \mu_2)\,P(\mu_1, \mu_2|H_1)\,d\mu_1\,d\mu_2. \qquad [8]$$

It is easy to see that [7] and [8] can be used to compute [4]. The important consideration here is that the elements of [7] and [8] are directly available: P(data|μ) and P(data|μ1, μ2) are simply normal distribution functions with the specified means, and P(μ1, μ2|H1) and P(μ|H0) are the two prior distributions. It turns out there is a simple, closed-form expression for the Bayesian t-test that can be derived from the material presented so far. Its simplicity depends heavily on a particular way of assigning the prior distributions. Section 2.2 deals with this issue and Section 2.3 presents the Bayesian t-test along with simple statistical software to calculate it.

2.2. Choice of Prior Distributions
By way of notation, let N(y|a, b) denote the normal (Gaussian) density with mean a and variance b, and let χ²_d denote the chi-square distribution with d df. We informally stated in Section 1 that we have two groups with normally distributed data. This can be made more precise by stating that the data are conditionally independent with $Y_r\,|\,\{\mu_r, \sigma^2\} \sim N(\mu_r, \sigma^2)$. The goal is to test the null hypothesis H0: δ = μ1 − μ2 = 0 against the two-sided alternative H1: δ ≠ 0. The first key step in thinking about the priors is to change the parameters of the problem from (μ1, μ2, σ²) to (μ, δ, σ²) by defining μ = (μ1 + μ2)/2. This is consistent with the fact that we have been using μ to denote the common mean under H0. It also has the advantage of defining a suitable μ under H1 as well, enabling us to use one set of parameters to represent the distribution of the data under either hypothesis. The second key step is modeling the prior information in terms of δ/σ, the standardized effect size, rather than δ itself. For Jeffreys (7), one of the founding fathers of Bayesian hypothesis testing, the dependence of the prior for δ on the value of σ is implicit in his assertion "from conditions of similarity, it [the mean] must depend on σ, since there is nothing in the problem except σ to give a scale for [the mean]." We will start by assigning a N(λ, σ²_δ) prior to δ/σ. Since the standardized effect size is a familiar dimensionless quantity, lending itself to easy prior modeling, choosing λ and σ²_δ should not be too difficult. For example, Cohen (8) reports that |δ/σ| values of 0.20, 0.50, and 0.80 are "small," "medium," and "large," respectively, based on a survey of studies in the social sciences literature. These benchmarks can be used to check whether the specifications of the prior parameters λ and σ²_δ are reasonable; a simple check based
The Bayesian t-Test and Beyond
185
on λ ± 3σδ can determine whether the prior allows unreasonably large effect sizes. The remaining parameters (μ, σ²) are assigned a standard noninformative prior, no matter whether δ = 0 or δ ≠ 0. As explained in Chapter 3, the standard noninformative prior for μ (which also happens to be improper) is constant, and for σ² it is log-constant. Since they require no prior input we will not dwell further on this choice. It suffices to say that they ensure that the BF depends on the data only through the two-sample t-statistic. To summarize, the prior is as follows:

    P(δ/σ | μ, σ², δ ≠ 0) = N(δ/σ | λ, σδ²),
[9]
with the nuisance parameters assigned the improper prior

    P(μ, σ²) ∝ 1/σ².
[10]
Finally, the prior is completed by specifying the probability that H0 is true: π0 = P(δ = 0),
[11]
where π0 is often taken to be 1/2 as an “objective” value (12). However, π0 can be simply assigned by the experimenter to reflect prior belief in the null; it can be assigned to differentially penalize more complex models (7); it can be assessed from multiple comparisons considerations (7, 9); and it can be estimated using empirical Bayes methods (10). The next section provides a case study for prior assessment. It should be mentioned prominently that Jeffreys, who pioneered the Bayesian testing paradigm, derived a Bayesian test for H0: μ1 = μ2 that is also a function of the two-sample t-statistic [2]. However, his test uses an unusually complex prior that partitions the simple alternative H1: μ1 ≠ μ2 into three disjoint events depending upon a hyperparameter μ: H11: μ2 = μ ≠ μ1, H12: μ1 = μ ≠ μ2, and H13: μ1 ≠ μ2 and neither equals μ. Jeffreys further suggests prior probabilities in the ratio 1:1/4:1/4:1/8 for H0, H11, H12, and H13, respectively, adding another level of avoidable complexity. An additional concern with Jeffreys’ two-sample t-test is that it does not accommodate prior information about the alternative hypothesis.

2.3. The Bayesian t-Test
For the two-sample problem with normally distributed, homoscedastic, and independent data, with prior distributions as specified in Section 2, the BF for testing H0: μ1 = μ2 = μ vs.
186
Gönen
H1: μ1 ≠ μ2 is

    BF = Tν(t | 0, 1) / Tν(t | nδ^{1/2}λ, 1 + nδσδ²).    [12]
Here t is the pooled-variance two-sample t-statistic [2], λ and σδ² denote the prior mean and variance of the standardized effect size (μ1 − μ2)/σ under H1, and Tν(·|a, b) denotes the noncentral t probability density function (pdf) having location a, scale b, and df ν (specifically, Tν(·|a, b) is the pdf of the random variable N(a, b)/√(χ²ν/ν), with numerator independent of denominator). The mathematical derivation of [12] is available in (11). The data enter the BF only through the pooled-variance two-sample t-statistic [2], providing a Bayesian motivation for its use. For the case where the prior mean λ of the effect size is assumed to be zero, the BF requires only the central T distribution and is calculated very simply, e.g., using a spreadsheet, as

    BF = [ (1 + t²/ν) / (1 + t²/{ν(1 + nδσδ²)}) ]^{−(ν+1)/2} (1 + nδσδ²)^{1/2}.    [13]
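The two forms of the BF can be cross-checked numerically. A Python sketch using scipy (the chapter's own software examples use SAS and R; the particular t, ν, nδ, and σδ values below are purely illustrative) computes [12] as a ratio of t densities and [13] in closed form:

```python
import numpy as np
from scipy.stats import t as tdist, nct

def bf_ratio(t, nu, n_delta, lam, s2):
    """BF as in [12]: ratio of (noncentral) t pdfs, using the scaling
    T_nu(t|a, b) = T_nu(t/sqrt(b) | a/sqrt(b), 1) / sqrt(b)."""
    b = 1.0 + n_delta * s2                   # scale under H1
    a = np.sqrt(n_delta) * lam               # location under H1
    num = tdist.pdf(t, nu)                   # T_nu(t | 0, 1)
    if lam == 0.0:
        den = tdist.pdf(t / np.sqrt(b), nu) / np.sqrt(b)
    else:
        den = nct.pdf(t / np.sqrt(b), nu, a / np.sqrt(b)) / np.sqrt(b)
    return num / den

def bf_closed_form(t, nu, n_delta, s2):
    """BF as in [13]: closed form when the prior mean lambda is zero."""
    b = 1.0 + n_delta * s2
    ratio = (1.0 + t**2 / nu) / (1.0 + t**2 / (nu * b))
    return ratio ** (-(nu + 1) / 2.0) * np.sqrt(b)
```

For λ = 0 the two functions agree to floating-point precision, which is a convenient check on either implementation.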
Gene expression example revisited: How would the conclusions differ if one applies the Bayesian t-test to the gene expression data from gastrointestinal stromal tumors? An easy starting point is the calculation of nδ, which turns out to be 7.639. We then need to specify λ, the prior mean for δ/σ. Not knowing anything about the context, a good first guess is λ = 0, giving equal footing to either mean. What would be a reasonable value for σδ? This is clearly the most difficult (and perhaps the least attended to) parameter. While gene expression scales are somewhat artificial, a difference of 2 or more on the log-scale is traditionally considered substantial. This suggests using log(Y1) and log(Y2) as the data; choosing σδ = 1 then gives us a small but nonzero chance, a priori, that the means can differ by at least two-fold on the log-scale (the probability that δ, with mean 0 and variance 1, will be greater than 2 or less than −2 is approximately 0.05). It is possible to use other values for σδ, without resorting to the two-fold difference. If the gene expression study was powered to detect a particular difference, the same calculation can be repeated using the detectable difference. Suppose the study was powered to detect a difference of 50% (1.5-fold) on the log-scale. Then σδ = 0.75 gives the same probability of means differing by at least 1.5-fold on the log-scale.
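The two tail-probability statements above are easy to verify directly (a Python sketch):

```python
from scipy.stats import norm

# P(|delta| > 2) when delta ~ N(0, 1): sigma_delta = 1, two-fold cutoff on the log-scale
p_twofold = 2 * norm.sf(2.0)           # sf(2) = P(Z > 2)

# P(|delta| > 1.5) when delta ~ N(0, 0.75^2): same standardized cutoff, 1.5/0.75 = 2
p_15fold = 2 * norm.sf(1.5 / 0.75)

print(round(p_twofold, 4), p_twofold == p_15fold)  # prints: 0.0455 True
```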
In this case the BF turns out to be 0.36 and the posterior probability of the null is given by

    {1 + (Prior Odds)⁻¹ / 0.36}⁻¹,

where the prior odds is π/(1 − π), the odds that the null is true a priori. If one is willing to grant that the two hypotheses are equally likely before the data are collected (i.e., prior odds = 1), then the posterior probability of the null hypothesis is roughly one-fourth. While the data favor the alternative hypothesis by lifting its chances from one-half to three-fourths, the resulting posterior probability is not smoking-gun evidence to discard the null. Let us recall the findings of the frequentist analysis. Using the log-transform, t was 4.59, corresponding to a p-value of 0.00006, a highly significant finding by any practical choice of type I error. In contrast, the Bayesian test offers a much more cautious finding: the data support the alternative more than they do the null, but the conclusion that the means are different is relatively weak, with the null retaining a probability of 0.26. It is tempting to compare 0.00006 with 0.26 directly, but doing so is misleading; the two probabilities refer to different events. Most Bayesians prefer to choose the hypothesis that has the higher posterior probability, so in this analysis the alternative hypothesis, with a probability of 0.74, will be preferred over the null. A common criticism of this Bayesian analysis would be the impact of the choice of prior on the results. The different results between the frequentist and Bayesian methods, not to mention the cavalier way we chose the value for σδ, call for a sensitivity analysis. Table 4.1 can serve as a template for analyses reporting Bayesian results; alternatively, a display like Fig. 4.1 can be used to convey the sensitivity analyses.
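Converting the BF and the prior odds into the posterior probability of the null is a one-line calculation; a Python sketch using the numbers quoted above:

```python
def posterior_null(bf, prior_odds=1.0):
    """Posterior P(H0 | data) from the Bayes factor (null vs. alternative)
    and the prior odds pi/(1 - pi) that the null is true."""
    return 1.0 / (1.0 + 1.0 / (prior_odds * bf))

print(round(posterior_null(0.36), 2))  # prints 0.26
```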
Table 4.1
Sensitivity analysis for the Bayesian t-test

σδ       0.5    1      1.5    2.0    2.5    3.0
BF       0.46   0.36   0.31   0.28   0.26   0.25
P(H0)    0.32   0.26   0.24   0.22   0.21   0.20
While there is some sensitivity to the specification of σδ, the conclusions are very similar and remain quite different from the frequentist analysis. It seems highly unlikely that the culprit is the choice of σδ. Of course one can vary λ or even π, but in the context of this analysis other values for these two prior parameters hardly make sense. A possible explanation for the differential results is the so-called irreconcilability of p-values and posterior probabilities (12). It turns out that disagreements between p-values and posterior probabilities are very common when testing point null hypotheses (when the null hypothesis is a single point, such as δ = 0). A justification of this requires technical arguments that are beyond the scope of this chapter, but it is important to bear in mind when one is trying to compare the findings of a Bayesian analysis with those of a frequentist one.

Fig. 4.1. Sensitivity analysis for the Bayesian t-test. Horizontal axis is σδ and the vertical axis is either the Bayes factor (circles) or the posterior probability of the null (triangles).
3. Simultaneous t-Tests

An important aspect of most genetic data is its multivariable nature. In a gene expression study we will often have more than one gene of interest to test. It is widely recognized in frequentist statistics that performing multiple tests simultaneously inflates the nominal type I error: if each test is performed at the 5% level, the combined type I error probability will be higher than 5%. There is a substantial literature on obtaining tests so that the combined error rate is controlled at 5%. Our goal is to provide the Bayesian counterpart for simultaneous evaluation of multiple t-tests; therefore we will not discuss the frequentist approaches further. While the literature on this topic has multiplied over the past few years (13–16), we will mention only one important concept from
this literature. Most of the frequentist discussions center on the type of error to protect. The most conservative route is the so-called family-wise error rate (FWER), which defines type I error as the probability of making at least one rejection if all the nulls are true; see Chapters 1 and 5 for more information on FWER. Clearly FWER should be used if it is plausible that all the nulls are true. As we noted in the previous section, the development of a Bayesian hypothesis test is not concerned with type I and type II errors. Indeed it is often said that Bayesians do not need any multiplicity adjustments. This is not entirely correct, and the technicalities behind these arguments require the use of statistical decision theory. A good starting point for this literature is Berger’s book (2), although the Bayesian section of Miller’s book (13) is also very helpful. There is also a practical side of performing multiple tests based on the assignment of prior probabilities, which is more relevant for our development. Suppose we have k genes and, continuing with the theme of the GIST example, we want to see if their expressions differ between tumors originating from the stomach and from the small bowel. In the previous section we chose the prior probability of the null hypothesis P(H) = 0.5 without much discussion. With several tests, to be called H1, . . . , Hk, can we find a similar “automatic” choice of prior? Suppose we set P(Hj) = 0.5 for all j, which seems to be a natural starting point. What is, then, the probability that all nulls are true, that is, P(all) = P(all of H1, . . . , Hk are true)? Supposing for the time being that the hypotheses are independent, the probability of all nulls being true is 0.5^k. As k increases this quantity quickly approaches 0. Table 4.2 gives P(all) for some selected values of k.
Table 4.2
Probability that all k nulls are true when they are independent a priori

k        1      2      3       4        5       10      100
P(all)   0.5    0.25   0.125   0.0625   0.031   0.001   <10^−30
Hence if we have five null hypotheses to test, independent of one another, and we assign a prior probability of 0.5 to each of them, we are practically implying that at least one of them is false, since the chance of all being true is 0.031. It is clear that keeping π = 0.5 for all the nulls is impractical if it is plausible that all nulls can be true. Thus, allowing the marginal prior probabilities for the null hypotheses to decrease as the number of nulls increases is the Bayesian equivalent of protecting the FWER.
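The entries of Table 4.2 are just powers of 0.5, so the erosion of P(all) with k is easy to tabulate (a Python sketch):

```python
for k in (1, 2, 3, 4, 5, 10, 100):
    # under independence, P(all k nulls true) = 0.5 ** k
    print(k, 0.5 ** k)
```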
3.1. Choosing Prior Probabilities for Simultaneous t-Tests
One can object to the numbers in Table 4.2 by questioning the assumption that all nulls are independent. It is indeed a questionable assumption: most multiple testing situations arise when each test captures a different aspect of a complex multidimensional response, such as a genetic profile. By the nature of the problem these tests will be correlated. It turns out that computing the probabilities in Table 4.2 under a given correlation schema is not as straightforward as it may sound. The reason for this is not computational but theoretical. Each null hypothesis has implicitly been treated so far as a Bernoulli variable with π as the probability of “success.” It turns out that there are several ways to generalize the Bernoulli distribution to the multivariate case and none of them is considered standard. This realization suggests that the problem resonates beyond Table 4.2 and goes to the heart of developing a good Bayesian procedure for simultaneous hypothesis testing. How could one choose and implement the prior probabilities of the nulls, allowing for a correlation between them? In this section we show that it is possible to use the multivariate normal distribution for this purpose. It has the advantage of being familiar to most users and accessible through standard software. It also turns out that the posterior calculations are greatly simplified by this choice. Before we proceed let us define precisely what the goal is. Our aim is to assign prior probabilities to the k null hypotheses Hj, j = 1, . . . , k, where P(Hj = 0) = πj = P(δj = 0). Based on the discussion of the choice of π in the univariate testing case we want to ensure that πj > 0 for all j. We also want a flexible way to specify nonzero Corr(Hi, Hj), the correlations between the null hypotheses. It will be helpful to consider two independent latent vectors u and v:

    u ∼ MVNk(0, Σ1)
[14]
    v ∼ MVNk(λ, Σ2)
[15]
and define the prior for δ as follows:

    δj = 0 if uj < cj,   δj = vj otherwise,

where cj is a constant selected in a way to get the desired prior probabilities πj. Hence the elements of u determine the a priori probability of the corresponding δj being zero (or Hj being true). In a particular realization of δ, some δj will be given the value of 0 by this procedure and others will take on the values supplied by v. The resulting prior is a mixture of two multivariate normals which has the following desirable properties:
(i) P(Hj) > 0 for all j.
(ii) Corr(Hi, Hj), i ≠ j, is given by the ijth element of Σ1.
(iii) Corr(δi, δj), i ≠ j, among the nonzero effects is given by the ijth element of Σ2.
It is not difficult to derive the posterior probabilities for this general setup (see the next section and also (17)), but we find it helpful to consider a model further simplified as follows. Assume λ = (λ, . . . , λ), where λ represents the “anticipated effect size.” We will also assume that Σ1 = ρ and Σ2 = σδ²ρ, where ρ is the compound symmetric correlation matrix with 1 on the diagonals and ρ on the off-diagonals, with −1/(k − 1) < ρ < 1 (this requirement is needed to ensure that the resulting matrix is positive-definite, a necessary condition for a covariance or correlation matrix). Under this simplified model, we have Corr(Hi, Hj) = Corr(δi, δj) = ρ. This is not unreasonable for most settings; many times there is no reason to expect that the correlation structure between the endpoints will be different depending on the existence of an effect. Before proceeding, an analyst needs to input values for λ, ρ, and σδ². One can think of u as representing the structure corresponding to the null hypothesis and v as representing the structure corresponding to the alternative hypothesis. Hence the choice of λ, ρ, and σδ² can be simplified and based on the parameters of the study design. We will give examples of this in Section 4. At this point it is instructive to revisit Table 4.2 by a simple implementation of this procedure. Using the simplified version where all the correlations between nulls are constant at ρ, and specifying σδ to be 1 as in the univariate gene expression example, we can compute the probability that all nulls are true if each null has a prior probability of 0.5, corresponding to cj = 0 for all j. All we need for this is the vector u and whether its elements are positive or not. Since u has mean 0, each null hypothesis has a prior probability of 0.5.
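The latent-variable construction in [14] and [15] is straightforward to simulate; the Python sketch below (parameter values are illustrative, and the chapter's Appendix gives an R simulation in the same spirit) draws δ from the mixture prior and checks the implied null probabilities empirically:

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_delta(n, k, rho, lam, sd, c=0.0):
    """Draw n realizations of delta from the mixture prior:
    u ~ MVN(0, Sigma1), v ~ MVN(lam, Sigma2), delta_j = 0 if u_j < c, else v_j.
    Sigma1 is compound symmetric with correlation rho; Sigma2 = sd^2 * Sigma1."""
    S1 = (1 - rho) * np.eye(k) + rho * np.ones((k, k))
    u = rng.multivariate_normal(np.zeros(k), S1, size=n)
    v = rng.multivariate_normal(np.full(k, lam), sd**2 * S1, size=n)
    return np.where(u < c, 0.0, v)

delta = draw_delta(200_000, 3, rho=0.0, lam=0.0, sd=1.0)
null = (delta == 0.0)
print(null[:, 0].mean())        # marginal P(delta_j = 0), about 0.5 when c = 0
print(null.all(axis=1).mean())  # P(all nulls true); 0.5^3 = 0.125 when rho = 0
```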
By choosing Σ1 to have a constant correlation of ρ in the off-diagonals we can compute the probability that all nulls are true. While this can be done analytically, it is easy and instructive to simulate. We first generate 10,000 replicates of u and then count the number of times all the elements are negative. A simple R code for this is given in the Appendix. Figure 4.2 displays P(all) as a function of ρ and k. As expected, increasing ρ increases P(all), and in the limit ρ = 1, P(all) = 0.5 = P(each). The effect of k becomes less pronounced as ρ increases. The results are worth considering seriously: when the hypotheses are moderately correlated (ρ ≤ 0.5) the probability that all nulls are true remains below one-sixth for k ≥ 4. Only when ρ is over 0.5 do the chances of all nulls being true start approaching 0.5; yet even when ρ = 0.9, a value that seems unlikely in practice, the probability that all nulls are true is about one-third for moderate to large values of k. I strongly recommend going through this exercise before choosing a value of ρ and cj. In fact it seems equally reasonable, if
Fig. 4.2. Prior probability that all the k nulls are true (vertical axis) as a function of ρ (horizontal axis) and k. Different values of k correspond to different lines with the highest line representing k= 2 and the lowest k= 10.
not more so, to strive for a probability of all nulls being true of 0.5 and let the value of ρ decide cj.

3.2. Computing the Posterior Probabilities for Simultaneous t-Tests
This section presents the technical details given in (17). It is not essential for a practical understanding and can be skipped. Suppose the data are k-dimensional random vectors, where k is the number of endpoints: Yir|(μi, Σ) ∼ N(μi, Σ), for i = 1, 2 (treatment regimens) and r = 1, . . . , ni (replications). Let

    D = Ȳ1 − Ȳ2,
    M = (n1Ȳ1 + n2Ȳ2)/n,
    C = Σ_{i} Σ_{r} (Yir − Ȳi)(Yir − Ȳi)′ = νS,
    δ = μ1 − μ2,  μ = (n1μ1 + n2μ2)/n.

Note that (D, M, C) are minimal sufficient complete statistics, distributed as

    P(D, M, C|μ1, μ2, Σ) = N(D|δ, nδ⁻¹Σ) × N(M|μ, n.⁻¹Σ) × Wν(C|Σ),

where Wν(·|·) denotes the Wishart distribution with ν df, and where nδ⁻¹ = n1⁻¹ + n2⁻¹, n. = n1 + n2, and ν = n. − 2. Define indicator variables Hj, for which Hj = 0 when δj = 0 and Hj = 1 when δj ≠ 0. Let H = diag(Hj) and σ = diag^{1/2}(Σ).
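The sufficient statistics (D, M, C) are simple to compute from raw data; a Python sketch with simulated inputs (the group sizes and dimension below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, k = 11, 23, 4
Y1 = rng.normal(size=(n1, k))   # group 1: n1 x k
Y2 = rng.normal(size=(n2, k))   # group 2: n2 x k

ybar1, ybar2 = Y1.mean(axis=0), Y2.mean(axis=0)
D = ybar1 - ybar2                                  # difference of group mean vectors
M = (n1 * ybar1 + n2 * ybar2) / (n1 + n2)          # pooled (weighted) mean vector
C = ((Y1 - ybar1).T @ (Y1 - ybar1)
     + (Y2 - ybar2).T @ (Y2 - ybar2))              # within-group cross-products, nu * S
nu = n1 + n2 - 2
S = C / nu                                         # pooled covariance matrix
```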
We consider the following hierarchical prior:

    P(δ|μ, Σ, H) = P(δ|Σ, H) = N(δ|Hσmξ, σHSξHσ),    [16]
    P(μ, Σ|H) ∝ |Σ|^{−(k+1)/2},    [17]
    P(H = h) = πh.    [18]
Equation [16] is best viewed as a N(Hmξ, HSξH) prior for the vector of effect sizes ξ = σ⁻¹δ. It requires a prior mean mξj and prior variance s²ξj for each effect size, but correlations between the effect sizes also must be specified to obtain the full prior covariance matrix Sξ. The prior [16] can be simplified considerably if one is willing to assume exchangeability of the nonzero ξj. At the second level [17] of the hierarchy we use a flat prior for μ and a standard noninformative prior for Σ. At the third stage [18] we specify the prior probability that model h is true. Since there are 2^k possible models, one might generically assume P(H = h) ≡ 2^{−k} (18), as would result from independent Hj with πj = P(Hj = 0) ≡ 0.5. However, this implies in particular that the probability that all nulls are true is 2^{−k}, which may be unreasonably low. Independence may be unreasonable as well. Our third-stage hierarchy [18] takes all these into account using the prior inputs πj, j = 1, . . . , k, and ρH, where ρH is a specified tetrachoric correlation matrix of H. Given these prior specifications, the values πh can be calculated from the multivariate normal distribution either analytically, in simple cases, or numerically using algorithms (19). As is the case for [16], determining [18] is simplified by assuming exchangeability; in this case only two parameters (the prior null probability and the tetrachoric correlation) must be specified. The parameters μ and δ can be integrated analytically, leaving

    P(D, M, C|H) ∝ ∫ N(D|σHmξ, nδ⁻¹Σ + σHSξHσ) |C|^{(ν−k−1)/2} |Σ|^{−(ν+k+1)/2} exp(−0.5 tr Σ⁻¹C) dΣ.    [19]

While analytic evaluation of this expression seems impossible, it can be simulated easily and stably by averaging N(D|σHmξ, nδ⁻¹Σ + σHSξHσ) over independent draws of Σ from the inverse Wishart distribution with parameter matrix C⁻¹ and df ν. An instructive alternative representation of the multivariate likelihood is N(D|σHmξ, nδ⁻¹Σ + σHSξHσ) = |σ|⁻¹ N(z|nδ^{1/2}Hmξ, ρz + nδHSξH), where z = nδ^{1/2}σ⁻¹D, the vector of two-sample z-statistics. Thus, for large ν, where the inverse Wishart distribution of Σ is concentrated around its mean
C/(ν − k − 1), P(D, M, C|H) becomes closer in proportionality to N(ẑ|nδ^{1/2}Hmξ, ρẑ + nδHSξH). This yields an approximate method for calculating posterior probabilities (justified in Theorem 2 of the Appendix of reference (17)) that has been previously described (20), and for which software is available (16). Letting H denote the set containing the 2^k k-tuples h, we now have

    πh|Y = P(H = h|D, M, C) = πh P(D, M, C|H = h) / Σ_{h′∈H} πh′ P(D, M, C|H = h′).

Thus P(Hj = 0|D, M, C) = Σ_{h∈H: hj=0} πh|Y, which is the sum of the 2^{k−1} posterior probabilities of models for which δj is zero. In the univariate case, all integrals can be evaluated analytically and one recovers the Bayesian t-test described in Section 2.
4. Simultaneous Testing in the GIST Microarray Example
We now return to the microarray example introduced in Section 1. Signal transduction pathways have long been recognized as key elements of various cellular processes such as cell proliferation, apoptosis, and gene regulation. As such they could play important roles in the formation of malignancies. Four genes from a certain signal transduction pathway are chosen for this analysis, primarily based on previously reported differential expression. These genes are KIAA0334, DDR2, RAMP2, and CIC, and the grouping variable of interest, once more, is the location of the primary tumor (small bowel vs. stomach). The relevant summary statistics for the analysis are calculated as

    D = (−0.2578, 0.1379, 1.1359, 1.5772)′,

    C = |  0.341   0.018   0.131   0.021 |
        |  0.018   0.271   0.068  −0.033 |
        |  0.131   0.068   0.544   0.038 |
        |  0.021  −0.033   0.038   0.124 |

[These summary statistics are calculated on the logarithm of the gene expression.] We also note that n1 = 11 and n2 = 23. The first order of business is to line up the H matrix. With four hypotheses, H has 16 rows and 4 columns, depicted in the first four columns of Table 4.3. Each row of H represents a possible combination of the true states of nature for the four hypotheses at hand. Since the likelihood expression [19] is conditional on H, our likelihood function takes on 16 values, one for each row of H. These values were computed using Monte Carlo integration on Σ and are presented in Table 4.3 (column L, for likelihood).
Table 4.3
All 16 possible states of nature for the four null hypotheses from Section 4. L is the likelihood for each possible combination (up to a proportionality constant), Prior is the prior probability assigned to that combination, and Posterior is the resulting posterior probability

h1  h2  h3  h4    L       Prior   Posterior
0   0   0   0     0.953   0.16    0.494
0   0   0   1     0.131   0.27    0.114
0   0   1   0     0.128   0.27    0.112
0   0   1   1     0.018   0.14    0.008
0   1   0   0     0.129   0.27    0.112
0   1   0   1     0.018   0.14    0.008
0   1   1   0     0.018   0.14    0.008
0   1   1   1     0.003   0.27    0.002
1   0   0   0     0.132   0.27    0.114
1   0   0   1     0.018   0.14    0.008
1   0   1   0     0.018   0.14    0.008
1   0   1   1     0.003   0.27    0.002
1   1   0   0     0.018   0.14    0.008
1   1   0   1     0.003   0.27    0.002
1   1   1   0     0.003   0.27    0.002
1   1   1   1     0.001   0.16    0.002
The code to perform this analysis is available at mskcc.org/gonenm. We also need to generate prior probabilities for each row of this table, which requires some care. We continue to center the prior at 0, as we did in Section 2 for testing a single gene, since there is no a priori expectation of the direction of the effect (whether the expression will be higher in the stomach or in the small bowel). Using similar considerations we also take σδ = 1. The prior correlation between effect sizes is somewhat harder to model. We assumed that if the effect sizes were correlated, then the hypotheses Hj also were correlated; and we assumed that the correlation between effect sizes was equal to the correlation between hypotheses. To model correlation among hypotheses, we used latent multivariate normal variables as follows: we let U ∼ N(0, ρ), where ρ is the compound symmetric correlation matrix with parameter ρ, and we defined c = Φ⁻¹(π). If Hj = I(Uj ≥ c), then the {Hj} are exchangeable, with P(Hj = 0) ≡ π and tetrachoric correlation ρ. Exchangeability implies
    P(H = h) = E[ {Φ((c − √ρ U)/√(1 − ρ))}^{k − Σj hj} {1 − Φ((c − √ρ U)/√(1 − ρ))}^{Σj hj} ],    [20]
where the expectation is taken with respect to a standard normal U. This expression is easily evaluated using one-dimensional integration. An example in R using the function integrate can be found at mskcc.org/gonenm. The correlation ρ is suggested by the discussion in Section 3.1. Two parameters of particular interest are π = P(Hj = 0) and π0 = P(H = 0). While these four genes are selected from a pathway which is known to affect malignant processes, there is no reason to think that these particular genes are differentially expressed in stomach tumors vs. small bowel tumors. For this reason we take π0 = P(H = 0) = 0.5. Using exchangeability once more we set π = P(Hj = 0) = 0.8 for the four genes. These selections imply a value of ρ = 0.31 (see Section 3.1 for computing ρ when π = P(Hj = 0) and π0 = P(H = 0) are given). Now we can use [20] to generate the prior probability for each row of Table 4.3. Once the likelihood and the prior are determined, the posterior is trivially calculated from the basics of Bayes’ theorem:

    P(H = h|D, M, C) = P(D, M, C|H = h) P(H = h) / Σ_{h′} P(D, M, C|H = h′) P(H = h′),

which is presented in the last column of Table 4.3, labeled as the posterior. It is clear that the most likely state of nature is that all nulls are true (posterior probability of 0.494). There is essentially no support for two or more nulls (any combination) being false; all such combinations have posterior probabilities less than 0.01. The probability that exactly one of the nulls is false is about 0.45 (add rows 2, 3, 5, and 9). The data, however, do not favor any one of the nulls over another in the case that one of them is false. Table 4.3 gives us the joint posterior probabilities of the nulls for all possible combinations. It can be more useful to look at the marginal probabilities, as they help us answer the question for a given gene without any reference to the others. This can also be deduced from Table 4.3.
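Expression [20] reduces to a one-dimensional integral over a standard normal U. The chapter points to an R implementation using integrate; a Python sketch with scipy.integrate.quad looks like this (π = 0.8 and ρ = 0.31 are the values chosen above):

```python
import numpy as np
from math import comb
from scipy.integrate import quad
from scipy.stats import norm

def prior_prob(n_alt, k, pi, rho):
    """P(H = h) from [20] for a state h with n_alt = sum(h_j) false nulls;
    by exchangeability the probability depends on h only through n_alt."""
    c = norm.ppf(pi)  # threshold giving marginal P(H_j = 0) = pi

    def integrand(u):
        p_null = norm.cdf((c - np.sqrt(rho) * u) / np.sqrt(1 - rho))
        return norm.pdf(u) * p_null ** (k - n_alt) * (1 - p_null) ** n_alt

    return quad(integrand, -8, 8)[0]

k, pi, rho = 4, 0.8, 0.31
probs = [prior_prob(s, k, pi, rho) for s in range(k + 1)]
total = sum(comb(k, s) * p for s, p in zip(range(k + 1), probs))
print(round(probs[0], 2), round(total, 3))  # P(all nulls true) near 0.5; probabilities sum to 1
```

With π = 0.8 and ρ = 0.31 the probability that all four nulls are true comes out close to 0.5, consistent with the calibration described in the text.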
For the first gene, for example, rows 1 through 8 represent the cases where the first null is true, namely that the gene KIAA0334 is not differentially expressed between the stomach GISTs and the small bowel GISTs. Adding the posterior probabilities of these rows gives us P(h1 = 0) = 0.858, so we can safely conclude that the expression of KIAA0334 is
not significantly different between the sites. Going through a similar exercise one gets P(h2 = 0) = 0.860, P(h3 = 0) = 0.860, and P(h4 = 0) = 0.858. Therefore we conclude that none of the genes are differentially expressed.
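The posterior column of Table 4.3, and the marginal probabilities just quoted, can be reproduced from the L and Prior columns with a few lines of Python (values transcribed from the table; the small discrepancies reflect the table's rounding):

```python
import itertools

# rows of Table 4.3, with h4 changing fastest, matching the table's row order
states = list(itertools.product([0, 1], repeat=4))
L =     [0.953, 0.131, 0.128, 0.018, 0.129, 0.018, 0.018, 0.003,
         0.132, 0.018, 0.018, 0.003, 0.018, 0.003, 0.003, 0.001]
prior = [0.16, 0.27, 0.27, 0.14, 0.27, 0.14, 0.14, 0.27,
         0.27, 0.14, 0.14, 0.27, 0.14, 0.27, 0.27, 0.16]

z = sum(l * p for l, p in zip(L, prior))
posterior = [l * p / z for l, p in zip(L, prior)]

p_all_null = posterior[0]                                             # about 0.49
p_h1_null = sum(po for h, po in zip(states, posterior) if h[0] == 0)  # about 0.85
print(round(p_all_null, 2), round(p_h1_null, 2))
```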
5. Conclusions

In this chapter we reviewed a Bayesian approach to a very common problem in applied statistics, namely the comparison of group means. Chapter 3 has already listed various reasons why a Bayesian approach can be beneficial over traditional approaches. This particular Bayesian t-test has all those benefits, in addition to being closely related to the traditional t-test. It also has the advantage of extending naturally to simultaneous t-tests. Simultaneous testing of hypotheses is a contentious and difficult problem. While this Bayesian approach does not alleviate the analytic difficulties, it clarifies the need for adjustment (a major source of confusion in frequentist multiple testing methods) and produces an output, such as Table 4.3, that is simple to communicate and explain. A particular disadvantage is that this is not an automatic procedure that can be extended to very high dimensions. It requires careful and thoughtful prior modeling of the anticipated effect sizes and directions, as well as of the correlations between the hypotheses. Although we chose our example from a microarray study, this approach would not be useful with thousands of genes.
Appendix

Calculation of the Bayesian t-Test. Calculation of [12] requires evaluation of the noncentral T pdf with general scale parameter. Since software packages typically provide the pdf of the noncentral t having scale parameter 1.0, a simple modification is needed for the general case: Tν(t|a, b) = Tν(t/b^{1/2} | a/b^{1/2}, 1)/b^{1/2}. Thus, for example, in SAS the BF can be computed using

    pdf('T', t, n1+n2-2) / (pdf('T', t/sqrt(postv), n1+n2-2, nc) / sqrt(postv));    [21]

or in R or S-Plus,
    dt(t, n1+n2-2) / (dt(t/sqrt(postv), n1+n2-2, nc) / sqrt(postv)),    [22]

where t is the value of the two-sample t-statistic, postv = 1 + nδσδ², and nc = nδ^{1/2}λ/(1 + nδσδ²)^{1/2}.

Calculation of the Probability that All Nulls Are True. The following R code computes the probability that all nulls are true. This example sets ρ = 0.1 and k = 5; other cases can be obtained by simply re-running the code with rho and k set to the desired values. The mvtnorm library is required to generate random replicates of multivariate normal vectors using the rmvnorm function. This example uses 10,000 replicates, which gives accurate estimates for all practical purposes. The output is a table of probabilities indexed by the number of nulls that are false, so the entry labeled 0 is the desired probability that all nulls are true.

library(mvtnorm)
rho <- 0.1
k <- 5
# compound symmetric correlation matrix: 1 on the diagonal, rho elsewhere
sigmat <- (1 - rho) * diag(k) + rho * matrix(1, k, k)
uu <- rmvnorm(n = 10000, mean = rep(0, k), sigma = sigmat)
# for each replicate, count how many elements exceed 0 (i.e., how many nulls are false)
table(rowSums(uu > 0)) / 10000
References

1. Antonescu, C.R., Viale, A., Sarran, L., Tschernyavsky, S.J., Gönen, M., Segal, N.H., Maki, R.G., Socci, N.D., DeMatteo, R.P., and Besmer, P. (2004) Gene expression in gastrointestinal stromal tumors is distinguished by KIT genotype and anatomic site. Clinical Cancer Research 10, 3282–3290.
2. Berger, J.O. (1985) Statistical Decision Theory and Bayesian Analysis. New York: Springer-Verlag.
3. Bernardo, J.M. and Smith, A.F.M. (1994) Bayesian Theory. New York: John Wiley and Sons.
4. Berry, D.A. (1996) Statistics: A Bayesian Perspective. Belmont, CA: Wadsworth.
5. Berry, D.A. (1997) Teaching elementary Bayesian statistics with real applications in science. The American Statistician 51, 241–246.
6. Carlin, B. and Louis, T. (2001) Bayes and Empirical Bayes Methods for Data Analysis. London: Chapman and Hall.
7. Jeffreys, H. (1961) Theory of Probability. Oxford: Oxford University Press.
8. Cohen, J. (1988) Statistical Power Analysis for the Behavioral Sciences. New York: Academic Press.
9. Westfall, P.H., Johnson, W.O., and Utts, J.M. (1997) A Bayesian perspective on the Bonferroni adjustment. Biometrika 84, 419–427.
10. Efron, B., Tibshirani, R., Storey, J.D., and Tusher, V. (2001) Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association 96, 1151–1160.
11. Gönen, M., Johnson, W.O., Lu, T., and Westfall, P.H. (2005) The Bayesian two-sample
t-test. The American Statistician 59, 252–257.
12. Berger, J.O. and Sellke, T. (1987) Testing a point null hypothesis: The irreconcilability of p values and evidence. Journal of the American Statistical Association 82, 112–122.
13. Miller, R. (1981) Simultaneous Statistical Inference. New York: Springer.
14. Hochberg, Y. and Tamhane, A. (1987) Multiple Comparison Procedures. New York: Wiley.
15. Westfall, P.H. and Young, S.S. (1993) Resampling-Based Multiple Testing. New York: Wiley.
16. Westfall, P.H., Tobias, R., Rom, D., Wolfinger, R., and Hochberg, Y. (1999) Multiple Comparisons and Multiple Tests Using the SAS(R) System. Cary: SAS Institute.
17. Gönen, M., Westfall, P.H., and Johnson, W.O. (2003) Bayesian multiple testing for two-sample multivariate endpoints. Biometrics 59, 76–82.
18. Kass, R.E. and Raftery, A.E. (1995) Bayes factors. Journal of the American Statistical Association 90, 773–795.
19. Genz, A. (1992) Numerical computation of the multivariate normal probabilities. Journal of Computational and Graphical Statistics 1, 141–150.
20. Gönen, M. and Westfall, P.H. (1998) Bayesian multiple testing of multiple endpoints in clinical trials. Proceedings of the American Statistical Association, Biopharmaceutical Subsection, 108–113.
Part II Designs and Methods for Molecular Biology
Chapter 5
Sample Size and Power Calculation for Molecular Biology Studies

Sin-Ho Jung

Abstract

Sample size calculation is a critical step in designing a new biological study. In this chapter, we consider molecular biology studies that generate high-dimensional data. Microarray studies are typical examples, so we present the methods in terms of gene microarray data, but they can be used for the design and analysis of any molecular biology study involving high-dimensional data. We discuss sample size calculation methods for molecular biology studies in which prognostic molecular markers are discovered while accurately controlling the false discovery rate (FDR) or the family-wise error rate (FWER) in the final data analysis. We limit our discussion to the two-sample case.

Key words: False discovery rate, family-wise error rate, prognostic gene, true rejection, two-sample t-test.
1. Sample Size for FDR Control

Controlling the FDR relaxes the multiple testing criterion relative to controlling the FWER and consequently increases the number of genes declared significant. The FDR was first proposed by Benjamini and Hochberg (1), and its operating and numerical characteristics are elucidated in recent publications (2, 3). In this section, we discuss a sample size estimation procedure for FDR control proposed by Jung (4). A sample size is derived for a specified number of true rejections (i.e., identifications of prognostic genes) while controlling the FDR at a desired level. As input parameters, we specify the allocation proportions between the two groups, the total number of candidate genes, the number of prognostic genes, and the effect sizes of the prognostic genes, in addition to the required number of true rejections and the FDR level. In general, this procedure requires solving an equation using a numerical method such as the bisection method. However, if the effect sizes are equal among all prognostic genes, the equation can be solved to give a closed-form formula. Pounds and Cheng (5) and Liu and Hwang (6) later proposed similar sample size calculation methods.

H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_5, © Springer Science+Business Media, LLC 2010

1.1. FDR-Based Multiple Testing Procedure
First, we briefly review a popular FDR-based multiple testing procedure. We denote the total number of genes under consideration by m, of which m0 genes are equally expressed between the two groups. Suppose that, in the jth test, we reject the null hypothesis Hj if the p-value pj is smaller than or equal to α ∈ (0, 1). Assuming independence of the m p-values, the total number of false rejections is

R0 = Σ_{j=1}^m I(Hj true, Hj rejected) = Σ_{j=1}^m Pr(Hj true) Pr(Hj rejected | Hj true) + op(m),

which equals m0 α, where m^{-1} op(m) → 0 in probability as m → ∞ (7). Ignoring the error term, we have

FDR(α) = m0 α / R(α),   [1]

where R(α) = Σ_{j=1}^m I(pj ≤ α) denotes the total number of rejections. Given α, estimation of FDR by [1] requires estimation of m0. For the estimation of m0, Storey (7) assumes that the histogram of the m p-values is a mixture of m0 p-values that correspond to the true null hypotheses and follow the U(0, 1) distribution, and m1 p-values that correspond to the alternative hypotheses and are expected to be close to 0. Consequently, for a chosen constant λ away from 0, none (or few, if any) of the latter m1 p-values will fall above λ, so the number of p-values above λ, Σ_{j=1}^m I(pj > λ), can be approximated by the expected frequency among the m0 null p-values above λ under the U(0, 1) distribution, i.e., m0(1 − λ). Hence, given λ, m0 is estimated by

m̂0(λ) = Σ_{j=1}^m I(pj > λ) / (1 − λ).

By combining this m0 estimator with [1], Storey (7) obtains

FDR̂(α) = α m̂0(λ) / R(α) = α Σ_{j=1}^m I(pj > λ) / {(1 − λ) Σ_{j=1}^m I(pj ≤ α)}.
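As a concrete illustration, this estimator can be computed in a few lines (a sketch with made-up p-values; the function and variable names are ours):

```python
def m0_hat(pvals, lam=0.5):
    """Storey's estimator m^_0(lambda) of the number of true null hypotheses."""
    return sum(p > lam for p in pvals) / (1.0 - lam)

def fdr_hat(pvals, alpha, lam=0.5):
    """Estimated FDR of the procedure 'reject H_j when p_j <= alpha'."""
    r = sum(p <= alpha for p in pvals)        # R(alpha), the number of rejections
    return min(alpha * m0_hat(pvals, lam) / max(r, 1), 1.0)

# Illustrative p-values: three small ones (candidate discoveries), five large.
pvals = [0.01, 0.02, 0.03, 0.55, 0.60, 0.70, 0.80, 0.90]
print(m0_hat(pvals))         # 5 p-values exceed lambda = 0.5, so m^_0 = 5/0.5 = 10
print(fdr_hat(pvals, 0.05))  # 0.05 * 10 / 3, about 0.167
```

Here λ = 0.5; five of the eight p-values exceed λ, giving m̂0 = 10 and an estimated FDR of about 0.17 for the cutoff α = 0.05.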
For an observed p-value pj, Storey (7) defines the q-value, the minimum FDR level at which we reject Hj, as

qj = inf_{α ≥ pj} FDR̂(α).

This formula reduces to

qj = FDR̂(pj)

if FDR̂(α) is strictly increasing in α; see Theorem 2 of Storey (8). The Appendix of Jung (4) shows that this assumption holds if the power function of the individual tests is concave in α, which is the case when the test statistics follow the standard normal distribution under the null hypotheses. We reject Hj (or, equivalently, discover gene j) if qj is smaller than or equal to the prespecified FDR level. The independence assumption among the m test statistics was loosened to independence only among the m0 test statistics corresponding to the null hypotheses by Storey and Tibshirani (9), and to weak dependence among all m test statistics by Storey (8) and Storey et al. (10).

1.2. Sample Size Calculation
In this section, we discuss a sample size estimation method for the FDR-based multiple testing procedure discussed in the previous section. Let M0 and M1 denote the sets of genes for which the null and alternative hypotheses are true, respectively. Note that the cardinalities of M0 and M1 are m0 and m1 (= m − m0), respectively. Since the estimated FDR is invariant to the order of the genes, we may rearrange the genes and set M1 = {1, ..., m1} and M0 = {m1 + 1, ..., m}. By Storey (7) and Storey and Tibshirani (9), for large m and under independence (or weak dependence) among the test statistics, we have

R(α) = E(R0(α)) + E(R1(α)) + op(m) = m0 α + Σ_{j∈M1} ξj(α) + op(m),

where Rh(α) = Σ_{j∈Mh} I(pj ≤ α) for h = 0, 1, and ξj(α) = P(pj ≤ α) is the marginal power of the single α-level test applied to gene j ∈ M1. So, from [1], we have

FDR(α) = m0 α / {m0 α + Σ_{j∈M1} ξj(α)}   [2]

by omitting the error term.
Let xkij denote the expression level of gene j for subject i in group k (= 1, 2), with mean μkj and common variance σj². We consider the two-sample t-tests

Tj = (x̄1j − x̄2j) / {σ̂j √(n1⁻¹ + n2⁻¹)}

for hypothesis j (= 1, ..., m), where nk is the number of subjects in group k (= 1, 2), x̄kj is the sample mean of {xkij, i = 1, ..., nk}, and σ̂j² is the pooled sample variance. We reject Hj: μ1j = μ2j in favor of H̄j: μ1j ≠ μ2j if |Tj| is large.

Let n = n1 + n2 denote the total sample size, and ak = nk/n the allocation proportion for group k (a1 + a2 = 1). Often, investigators use the effect size μ1j − μ2j (or the fold-change μ1j/μ2j before a log-transformation) to measure how differentially a gene is expressed between the two groups. However, without accounting for the variance of the distributions, these quantities do not translate directly into statistical significance. For this reason, we use the standardized effect size for gene j,

δj = (μ1j − μ2j) / σj.

For j ∈ M0, we have δj = 0. Note that, for large n, Tj ∼ N(δj √(n a1 a2), 1), so that, for j ∈ M1, we have

ξj(α) = Φ̄(zα/2 − |δj| √(n a1 a2)),

where Φ̄(·) denotes the survivor function of N(0, 1) and zα = Φ̄⁻¹(α) is the upper 100α-th percentile of N(0, 1). Hence, [2] is expressed as

FDR(α) = m0 α / {m0 α + Σ_{j∈M1} Φ̄(zα/2 − |δj| √(n a1 a2))}.   [3]
From [3], FDR is decreasing in |δj| and n. Further, FDR is increasing in |a1 − 1/2| and α; see the Appendix of Jung (4). If the effect sizes are equal among the prognostic genes, FDR is increasing in π0 = m0/m. It is easy to show that FDR increases from 0 to m0/m as α increases from 0 to 1. At the design stage of a study, m is determined by the microarray chips chosen for the experiment, and m1, {δj, j ∈ M1}, and a1 are projected based on experience or from pilot data, if any. The only undecided variables in [3] are α and n. With all other design parameters fixed, the FDR is controlled at a certain level by the chosen α level. So, we want to find the sample size n that will guarantee a certain number, say γ (≤ m1), of true rejections with the FDR controlled at a specified level q.
The expected number of true rejections is

E{R1(α)} = Σ_{j∈M1} Φ̄(zα/2 − |δj| √(n a1 a2)).   [4]

In multiple testing controlling the FDR, E(R1)/m1 plays the role of the power of a conventional test (11, 12). With E(R1) and the FDR level set at γ and q, respectively, [3] is expressed as

q = m0 α / (m0 α + γ).

By solving this equation with respect to α, we obtain

α* = γ q / {m0 (1 − q)}.

Given m0, α* is the marginal type I error level for γ true rejections with the FDR controlled at q. With α and E(R1) replaced by α* and γ, respectively, [4] yields an equation h(n) = 0, where

h(n) = Σ_{j∈M1} Φ̄(zα*/2 − |δj| √(n a1 a2)) − γ.   [5]

We obtain the sample size by solving this equation. In general, solving the equation h(n) = 0 requires a numerical approach, such as the bisection method. If we do not have prior information on the effect sizes, we may want to assume equal effect sizes δj = δ (> 0) for j ∈ M1. In this case, [5] reduces to

h(n) = m1 Φ̄(zα*/2 − |δ| √(n a1 a2)) − γ

and, by solving h(n) = 0, we obtain a closed-form formula:

n = ⌊(zα*/2 + zβ*)² / (a1 a2 δ²)⌋ + 1,   [6]
where α* = γ q / {m0 (1 − q)} and β* = 1 − γ/m1. Note that [6] is the conventional sample size formula for detecting an effect size of δ with power 1 − β* while controlling the type I error level at α*. In summary, our sample size calculation proceeds as follows:

(A) Specify the input parameters:
  – q = FDR level
  – γ = number of true rejections
  – ak = allocation proportion for group k (= 1, 2)
  – m = total number of genes for testing
  – m1 = number of prognostic genes (m0 = m − m1)
  – {δj, j ∈ M1} = effect sizes for the prognostic genes
(B) Obtain the required sample size:
  1. If the effect sizes are constant, δj = δ for j ∈ M1,

     n = ⌊(zα*/2 + zβ*)² / (a1 a2 δ²)⌋ + 1,

     where α* = γ q / {m0 (1 − q)} and β* = 1 − γ/m1.
  2. Otherwise, solve h(n) = 0 using the bisection method, where

     h(n) = Σ_{j∈M1} Φ̄(zα*/2 − |δj| √(n a1 a2)) − γ

     and α* = γ q / {m0 (1 − q)}.

Given sample sizes n1 and n2, one may want to check how many true rejections are expected, just as one checks the power of a conventional test. In this case, we solve the equations for γ. For example, when the effect sizes are constant for j ∈ M1, we solve the equation

zα*(γ)/2 + zβ*(γ) = δ / √(n1⁻¹ + n2⁻¹)
with respect to γ, where α*(γ) = γ q / {m0 (1 − q)} and β*(γ) = 1 − γ/m1.

Example 1 (Constant effect size case) Suppose that we want to design a microarray study on m = 4000 candidate genes, among which about m1 = 40 genes are expected to be differentially expressed between two patient groups. Note that m0 = m − m1 = 3960. Constant effect sizes, δj = δ = 1, are projected for the m1 prognostic genes. Approximately equal numbers of patients are expected to enter the study from each group, i.e., a1 = a2 = 0.5. We want to discover γ = 24 prognostic genes by two-sided tests with the FDR controlled at the q = 1% level. Then

α* = (24 × 0.01) / {3960 × (1 − 0.01)} = 0.612 × 10⁻⁴

and β* = 1 − 24/40 = 0.4, so that zα*/2 = 4.008 and zβ* = 0.253. Hence, from [6], the required sample size is given as

n = ⌊(4.008 + 0.253)² / (0.5 × 0.5 × 1²)⌋ + 1 = 73,

or n1 = n2 ≈ 37.
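Both cases of step (B) are easy to implement with the Python standard library. The sketch below (function names are ours) reproduces the n = 73 of Example 1 via the closed form [6] and also handles gene-specific effect sizes by an integer bisection on h(n):

```python
from math import floor, sqrt
from statistics import NormalDist

_nd = NormalDist()
surv = lambda x: 1.0 - _nd.cdf(x)         # survivor function Phi-bar
zq = lambda p: _nd.inv_cdf(1.0 - p)       # upper p-th quantile z_p

def n_constant(m, m1, delta, gamma, q, a1=0.5):
    """Closed form [6] for a common effect size delta (two-sided tests)."""
    m0, a2 = m - m1, 1.0 - a1
    alpha = gamma * q / (m0 * (1.0 - q))  # alpha*
    beta = 1.0 - gamma / m1               # beta*
    return floor((zq(alpha / 2) + zq(beta)) ** 2 / (a1 * a2 * delta ** 2)) + 1

def n_bisect(m, m1, deltas, gamma, q, a1=0.5):
    """Smallest integer n with h(n) >= 0, where h is given in [5]."""
    m0, a2 = m - m1, 1.0 - a1
    crit = zq(gamma * q / (m0 * (1.0 - q)) / 2)
    h = lambda n: sum(surv(crit - abs(d) * sqrt(n * a1 * a2)) for d in deltas) - gamma
    lo, hi = 2, 4
    while h(hi) < 0:                      # bracket the root (h is increasing in n)
        lo, hi = hi, hi * 2
    while hi - lo > 1:                    # integer bisection
        mid = (lo + hi) // 2
        lo, hi = (lo, mid) if h(mid) >= 0 else (mid, hi)
    return hi if h(lo) < 0 else lo

print(n_constant(4000, 40, 1.0, 24, 0.01))                # Example 1: 73
print(n_bisect(4000, 40, [1.5] * 20 + [0.5] * 20, 24, 0.01))
```

For the varying effect sizes of Example 2 below, the same search on h(n) = 0 gives n = 161.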
Example 2 (Varying effect size case) We assume (m, m1, a1, γ, q) = (4000, 40, 0.5, 24, 0.01), δj = 1.5 for 1 ≤ j ≤ 20, and δj = 0.5 for 21 ≤ j ≤ 40. Then

α* = (24 × 0.01) / {3960 × (1 − 0.01)} = 0.612 × 10⁻⁴

and zα*/2 = 4.008, so that we have

h(n) = 20 Φ̄(4.008 − 1.5 √(n/4)) + 20 Φ̄(4.008 − 0.5 √(n/4)) − 24.

By solving h(n) = 0, we obtain n = 161.

Table 5.1 lists the required sample size n under m = 10,000; a1 = 0.5 or 0.7; m1 = 50, 100, or 150; constant effect sizes δ = 0.5 or 1; γ = 0.8m1, 0.85m1, or 0.9m1; and q = 0.05 or 0.1.
Table 5.1
Sample size n under m = 10,000; m1 = 50, 100, or 150; constant effect sizes δ = 0.5 or 1; γ/m1 = 0.8, 0.85, or 0.9; a1 = 0.5 or 0.7; q = 0.05 or 0.1

                          a1 = 0.5            a1 = 0.7
m1     δ     γ/m1     FDR=5%     10%       5%     10%
50     0.5   0.80        331     304      394     361
             0.85        358     329      426     392
             0.90        394     363      468     432
       1     0.80         83      76       99      91
             0.85         90      83      107      98
             0.90         99      91      117     108
100    0.5   0.80        305     278      363     331
             0.85        331     302      394     359
             0.90        365     335      435     398
       1     0.80         77      70       91      83
             0.85         83      76       99      90
             0.90         92      84      109     100
150    0.5   0.80        290     262      345     312
             0.85        315     286      375     340
             0.90        348     318      415     378
       1     0.80         73      66       87      78
             0.85         79      72       94      85
             0.90         87      80      104      95
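As a spot check, the entries of Table 5.1 follow directly from the closed form [6] with γ = (γ/m1) × m1 (a sketch using the Python standard library; the function name is ours, and two-sided tests are assumed):

```python
from math import floor
from statistics import NormalDist

def table_entry(m, m1, delta, frac, a1, q):
    """n from [6] with gamma = frac * m1 true rejections (two-sided tests)."""
    m0, gamma = m - m1, frac * m1
    alpha = gamma * q / (m0 * (1.0 - q))          # alpha*
    z = lambda p: NormalDist().inv_cdf(1.0 - p)   # upper p-th quantile
    return floor((z(alpha / 2) + z(1.0 - gamma / m1)) ** 2
                 / (a1 * (1.0 - a1) * delta ** 2)) + 1

print(table_entry(10000, 50, 1.0, 0.8, 0.5, 0.05))   # Table 5.1 entry: 83
print(table_entry(10000, 50, 0.5, 0.8, 0.5, 0.05))   # Table 5.1 entry: 331
```

For instance, m1 = 50, δ = 1, γ/m1 = 0.8, a1 = 0.5, q = 0.05 gives n = 83, matching the table.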
2. Sample Size for FWER Control

In the previous section, we considered a sample size method for FDR-based multiple testing procedures. It is known that, in microarray data analysis, FDR-based methods tend to discover more prognostic genes than FWER-based methods. However, while the FWER can be accurately controlled by the permutation resampling method, the FDR cannot be accurately controlled by any existing method (13).

The FWER is the probability that one or more false rejections are committed. Despite its well-known conservatism, the Bonferroni test has been one of the most popular methods for analyzing microarray data while controlling the FWER. Although Holm (14) and Hochberg (15) improve upon this conservatism by devising multistep testing procedures, they do not exploit the dependency of the test statistics, and consequently the resulting improvement is often minor. Later, Westfall and Young (16, 17) proposed adjusting p-values in a step-down manner using simulation or resampling, by which the dependency among test statistics is effectively incorporated. Westfall and Wolfinger (18) derive exact adjusted p-values for a step-down method for discrete data. Recently, Westfall and Young's permutation-based test was introduced to microarray data analysis and strongly advocated by Dudoit and her colleagues (3, 19, 20). It has been shown that the power of an FWER-based test depends heavily on the complicated correlation structure of the gene expression data (21).

There have been several publications on sample size estimation for FWER-based multiple testing procedures that do not examine the accuracy of the estimated sample sizes. Furthermore, they focus on exploratory and approximate relationships among statistical power, sample size, and effect size (often in terms of fold-change), and use the most conservative Bonferroni adjustment without any attempt to incorporate the underlying correlation structure (10, 22–26).
Showing that an ostensibly similar but incorrect choice of sample size ascertainment can cause considerable underestimation of the required sample size, Jung et al. (21) propose a sample size calculation for the permutation method under a hypothetical correlation structure. Because of the high dimensionality and complicated correlation structure of microarray data, there have been no sample size methods reflecting the true correlation structure of gene expression data. Lin (27) develops a simulation-based multiple testing procedure. In this section, we discuss a sample size calculation method for multiple testing with FWER control that exploits Lin's procedure. When pilot data are available, this method approximates the true correlation structure using the one observed in the pilot
data. This method can also be used for sample size recalculation in the middle of a large microarray study.

2.1. Permutation-Based Multiple Testing Method to Control the FWER
We briefly discuss the popular permutation-based multiple testing method to control the FWER. We assume that, for group k (= 1, 2), {(xki1, ..., xkim), i = 1, ..., nk} are independent and identically distributed (iid) random vectors from an unknown distribution with means μkj, variances σj² = var(xkij), and correlation matrix R = (ρjj′)_{1≤j,j′≤m}. In order to discover genes that are differentially expressed between the two groups, we perform a statistical test of Hj: μ1j = μ2j vs. H̄j: μ1j ≠ μ2j for each gene. We consider rejecting Hj (or discovering gene j) if the absolute value of the two-sample t-test statistic

Tj = (x̄1j − x̄2j) / {σ̂j √(n1⁻¹ + n2⁻¹)}

is large. Let H0 = ∩_{j=1}^m Hj denote the complete null hypothesis and Ha = ∪_{j=1}^m H̄j the relevant alternative hypothesis. Multiple testing procedures controlling the FWER choose critical values for the Tj so that the probability of rejecting one or more Hj's is controlled below a specified level under H0. Westfall and Young (16, 17) proposed a step-down procedure controlling the FWER accurately using a permutation method. A step-down procedure sequentially rejects the null hypotheses using different critical values for different hypotheses, starting from the one with the smallest p-value and stopping when it fails to reject a null hypothesis. A single-step procedure uses a common critical value c to reject Hj in favor of H̄j when |Tj| > c. In this case, the FWER fixed at w is defined as

w = P(max_{1≤j≤m} |Tj| > c | H0).   [7]

In order to control the FWER below the prespecified level w, the Bonferroni method uses c = cw = t_{n−2, w/(2m)}, the upper w/(2m)-quantile of the t distribution with n − 2 degrees of freedom assuming normality of the gene expression data, or c = z_{w/(2m)}, the upper w/(2m)-quantile of the standard normal distribution, based on asymptotic normality for large n. It is well known that the Bonferroni test is very conservative, especially when the test statistics are highly correlated and m is large. Jung et al. (21) claim that the single-step procedure has exactly the same global power

1 − β0 = P(max_{1≤j≤m} |Tj| > c | Ha)
as Westfall and Young's step-down procedure. Since the distribution of (T1, ..., Tm) satisfies the condition of asymptotic subset pivotality (17), the single-step procedure controlling the FWER weakly by [7] also controls the FWER strongly.

Microarray data are collected from the same individuals and undergo co-regulation, so the expression levels among genes tend to be correlated in a complicated way. Motivated by these properties, together with the relationship in [7], we derive the distribution of W = max_{j=1,...,m} |Tj| under H0 using permutation. There are B = C(n, n1) different ways of partitioning the pooled sample of size n = n1 + n2 into two groups of sizes n1 and n2. In order to maintain the dependence structure and distributional characteristics of the gene expression measures within each subject, the sampling unit is the subject, not the gene. Recently, this type of resampling became popular in multiple testing to avoid specifying the true distribution of the gene expression data (3, 19, 20). The permutation-based single-step multiple testing procedure can be summarized as follows:

(A) Compute the test statistics t1, ..., tm from the original data.
(B) For the b-th permutation of the original data (b = 1, ..., B), compute the test statistics t1^(b), ..., tm^(b) and wb = max_{j=1,...,m} |tj^(b)|.
(C) Sort w1, ..., wB to obtain the order statistics w(1) ≤ ··· ≤ w(B) and compute the critical value cw = w([B(1−w)+1]), where [a] is the largest integer no greater than a. If there exist ties, cw = w(k), where k is the smallest integer such that w(k) ≥ w([B(1−w)+1]).
(D) Reject all hypotheses Hj (j = 1, ..., m) for which |tj| > cw.

Through simulations, Jung et al. (21) show that this permutation-based single-step procedure controls the FWER accurately.

2.2. Sample Size Calculation
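The sample size method of this section presupposes the single-step procedure just summarized in steps (A)–(D). As a toy illustration of that procedure, the following sketch enumerates all B = C(n, n1) relabelings of a made-up data set (the names are ours, and the tie adjustment of step (C) is omitted):

```python
from itertools import combinations
from math import sqrt

def tstat(g1, g2):
    """Pooled-variance two-sample t statistic for one gene."""
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    ss = sum((x - m1) ** 2 for x in g1) + sum((x - m2) ** 2 for x in g2)
    sp = sqrt(ss / (n1 + n2 - 2))                 # pooled SD
    return (m1 - m2) / (sp * sqrt(1 / n1 + 1 / n2))

def single_step_maxT(data1, data2, w=0.05):
    """data_k[i][j] = expression of gene j for subject i of group k.
    Returns (c_w, rejected gene indices); subjects, not genes, are resampled."""
    n1, m = len(data1), len(data1[0])
    pooled = data1 + data2
    n = len(pooled)
    col = lambda rows, j: [r[j] for r in rows]
    t_obs = [abs(tstat(col(data1, j), col(data2, j))) for j in range(m)]
    wb = []
    for idx in combinations(range(n), n1):        # all B = C(n, n1) relabelings
        g1 = [pooled[i] for i in idx]
        g2 = [pooled[i] for i in range(n) if i not in idx]
        wb.append(max(abs(tstat(col(g1, j), col(g2, j))) for j in range(m)))
    wb.sort()
    cw = wb[min(int(len(wb) * (1 - w)), len(wb) - 1)]  # upper w-quantile of max|T|
    return cw, {j for j in range(m) if t_obs[j] > cw}

# Toy data: 3 subjects per group, m = 2 genes.
d1 = [[1.0, 5.0], [2.0, 6.0], [3.0, 7.0]]
d2 = [[1.5, 5.5], [2.5, 6.5], [3.5, 7.5]]
cw, rejected = single_step_maxT(d1, d2, w=0.05)
```

With only C(6, 3) = 20 relabelings, the attainable significance is very limited and nothing is rejected here; in practice B is in the thousands.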
We want to calculate the sample size for a new study whose data will be analyzed by the FWER-based multiple testing method. Jung et al. (21) show that the power of FWER-based multiple testing methods depends on the standardized effect sizes under Ha and on the correlation coefficients of the expression data among genes. Unlike the effect sizes, the correlation coefficients are usually nuisance parameters in the multiple testing procedures. We may specify δj for some candidate prognostic genes, but it is difficult to specify many correlation coefficients close to their true values. To tackle this problem, we assume that pilot data are available to provide reliable estimates of the correlation coefficients. The required sample size for a future study is calculated based on the assumption that the distribution of the test statistics to
be calculated from the future study can be approximated by that from the pilot data set, {(xki1, ..., xkim), i = 1, ..., nk, k = 1, 2}. Let x̄kj and sj² be the sample means and pooled variances calculated from the pilot data. Let N (= N1 + N2) denote the sample size of the new study, and ak = Nk/N the allocation proportion for group k. Also, let X̄kj and Sj² denote the sample means and pooled variances, respectively, that will be calculated from the new study. For large N, the t-test statistics that will be obtained from the new study are

Tj = (X̄1j − X̄2j) / {Sj √(N1⁻¹ + N2⁻¹)} = δj √(N a1 a2) + Zj + op(1),

where

Zj = (X̄1j − X̄2j − δj σj) / {σj √(N1⁻¹ + N2⁻¹)}

and op(1) converges to 0 in probability as N → ∞. Here, (Z1, ..., Zm) is the limit of the test statistics (T1, ..., Tm) under H0 that will be calculated from the data of the future study. It is easy to show that (Z1, ..., Zm) is a random vector with means 0, variances 1, and covariance matrix R. Note that the asymptotic correlation structure of the test statistics is identical to that of the raw data, and δj = 0 under Hj. Given FWER = w, the critical value cw satisfies

w = P(max_{1≤j≤m} |Zj| > cw)   [8]

from [7]. Suppose that there are m1 prognostic genes with nonzero effect sizes and m0 (= m − m1) non-prognostic genes with zero effect sizes. Let M1 denote the set of prognostic genes. For an integer γ (∈ [1, m1]), we want to calculate the sample size N guaranteeing at least γ true rejections with probability 1 − βγ while controlling the FWER at w. Then, we need to solve

1 − βγ = P{Σ_{j∈M1} I(|δj √(N a1 a2) + Zj| > cw) ≥ γ}   [9]
with respect to N. Similarly, the N for a global power of 1 − β0 can be obtained from

1 − β0 = P(max_{1≤j≤m} |δj √(N a1 a2) + Zj| > cw).   [10]

In order to solve these equations, we need to approximate the probabilities in [8]–[10], which involve the high-dimensional random vector (Z1, ..., Zm). For large n with nk/n → ak ∈ (0, 1), by Lin (27), the marginal distribution of (Z1, ..., Zm) can be approximated by the conditional distribution of (Z̃1, ..., Z̃m) given the pilot data, where

Z̃j = (x̃1j − x̃2j) / √vj,

x̃kj = nk⁻¹ Σ_{i=1}^{nk} (xkij − x̄j) εki,

vj = s1j²/n1 + s2j²/n2, skj² = nk⁻¹ Σ_{i=1}^{nk} (xkij − x̄j)², and (εki, 1 ≤ i ≤ nk, k = 1, 2) are iid N(0, 1) random variables independent of the pilot data. The set of prognostic genes and their effect sizes may be prespecified based on prior biological knowledge or on the estimated effect sizes from the pilot data. The sample size calculation procedure can be summarized as follows:

(A) Specify the input variables:
  – Pilot data {(xki1, ..., xkim), i = 1, ..., nk, k = 1, 2}
  – Number of prognostic genes m1, their identifiers M1 = {j1, ..., jm1}, and effect sizes (δj1, ..., δjm1)
  – FWER = w
  – Minimum number of true rejections γ (≤ m1), and the probability of γ true rejections, 1 − βγ
  – Proportion of subjects in each group, a1 and a2

(B) Generate B copies of (Z̃1, ..., Z̃m), {(z̃b1, ..., z̃bm), b = 1, ..., B}.
(C) Given FWER = w, calculate cw as the upper 100w-th percentile of ũ1, ..., ũB, where ũb = max_{1≤j≤m} |z̃bj|.
(D) Let

h(N) = B⁻¹ Σ_{b=1}^B I{Σ_{j∈M1} I(|δj √(N a1 a2) + z̃bj| > cw) ≥ γ}.

Then, given 1 − βγ, the sample size N* is obtained by solving h(N) = 1 − βγ using the bisection method.
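Steps (A)–(D) can be sketched as follows. This is an illustrative implementation under simplifying assumptions (names are ours): the pilot data in the usage example are made up, and a simple incremental search over even N stands in for the bisection, since h(N) is a simulated proportion.

```python
import random
from math import sqrt

def fwer_sample_size(pilot1, pilot2, M1, deltas, w=0.05, gamma=1,
                     power=0.9, a1=0.5, B=2000, seed=0):
    """pilot_k[i][j] = pilot expression of gene j, subject i, group k.
    Returns the smallest (even) N with h(N) >= power."""
    rng = random.Random(seed)
    n1, n2 = len(pilot1), len(pilot2)
    m = len(pilot1[0])
    pooled = pilot1 + pilot2
    xbar = [sum(r[j] for r in pooled) / (n1 + n2) for j in range(m)]
    s2 = lambda rows, j: sum((r[j] - xbar[j]) ** 2 for r in rows) / len(rows)
    v = [s2(pilot1, j) / n1 + s2(pilot2, j) / n2 for j in range(m)]
    # (B) Generate B copies of (Z~_1, ..., Z~_m).
    ztil = []
    for _ in range(B):
        e1 = [rng.gauss(0, 1) for _ in range(n1)]
        e2 = [rng.gauss(0, 1) for _ in range(n2)]
        ztil.append([
            (sum((pilot1[i][j] - xbar[j]) * e1[i] for i in range(n1)) / n1
             - sum((pilot2[i][j] - xbar[j]) * e2[i] for i in range(n2)) / n2)
            / sqrt(v[j]) for j in range(m)])
    # (C) Critical value: upper 100w-th percentile of max_j |Z~_j|.
    u = sorted(max(abs(z) for z in row) for row in ztil)
    cw = u[min(int(B * (1 - w)), B - 1)]
    # (D) h(N) = proportion of copies with at least gamma true rejections.
    a2 = 1.0 - a1
    def h(N):
        hits = sum(
            sum(abs(d * sqrt(N * a1 * a2) + row[j]) > cw
                for j, d in zip(M1, deltas)) >= gamma
            for row in ztil)
        return hits / B
    N = 2
    while h(N) < power:
        N += 2
    return N

# Made-up pilot data: 4 subjects per group, m = 3 genes; gene 0 is prognostic.
p1 = [[1.2, 0.5, -0.3], [0.8, 1.5, 0.2], [-0.1, 0.9, 1.1], [0.6, -0.4, 0.7]]
p2 = [[0.3, 1.1, 0.4], [1.0, 0.2, -0.5], [0.5, 0.8, 1.3], [-0.2, 0.6, 0.1]]
N = fwer_sample_size(p1, p2, M1=[0], deltas=[1.0], gamma=1)
```

A real application would use a pilot microarray data set with thousands of genes; the structure of the computation is unchanged.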
The above sample size derivation assumes equal variances between the two groups, i.e., σ1j² = σ2j² = σj². If this assumption is questionable, we may use the test statistics

Tj = (x̄1j − x̄2j) / √(s1j²/n1 + s2j²/n2),

where skj² is the sample variance for gene j from the group k data. The same sample size calculation method can then be used with

δj = (μ1j − μ2j) / √(σ1j²/a1 + σ2j²/a2).

The above method can also be used for sample size recalculation in the middle of a study. At the design stage of a study, we calculate an approximate sample size N̂ based on pilot data, using the above algorithm or projected correlation coefficients as in Jung et al. (21). Often, the first-stage sample size n is chosen as half of N̂. We collect the first-stage data {(xki1, ..., xkim), 1 ≤ i ≤ nk, k = 1, 2} and calculate the final sample size N using them as pilot data. If N (= N1 + N2) is smaller than n (= n1 + n2), then we stop the study. Otherwise, we collect the stage 2 data, {(xki1, ..., xkim), nk + 1 ≤ i ≤ Nk, k = 1, 2}, and conduct the multiple testing procedure of the previous section using the cumulative data {(xki1, ..., xkim), 1 ≤ i ≤ Nk, k = 1, 2}.

An accurate sample size estimation requires pilot data of a reasonable size. Even if a pilot data set is not large, we may still use it in designing a new study, since it will give a better estimate than a complete projection made with no prior information on the complicated structure of the gene expression data. Through simulation studies, it was found that very small pilot data sets tend to slightly underestimate the required sample size N. If n is smaller than 50% of the calculated N, we recommend increasing N by 5% to 10%.

Example 3 Huang et al. (28) published DNA microarray data from n = 37 breast cancer patients (n1 = 19 LN− patients and n2 = 18 LN+ patients) to identify the genes that were differentially expressed by their lymph node (LN) status. The original data, available from //data.genome.duke.edu/lancet.php, include 12,625 probe sets, called genes in this section. Expression values were calculated using the robust multichip average (RMA) method (29). RMA estimates are based upon a robust average of background-corrected perfect-match intensities. Normalization was done using quantile normalization (30).
We filtered out all "AFFX" genes and the genes for which there were fewer than 8 present calls among the 37 present/marginal/absent calls. The filtering yielded m = 6,599 genes, which were then used in the subsequent analyses. Table 5.2 lists the top 20 genes in terms of the absolute value of the estimated standardized effect size.

Table 5.2
Top 20 genes (or probe sets) with their standardized effect sizes δ̂j

Gene           δ̂j       Gene           δ̂j
35428_g_at     1.869     35622_at       1.623
33406_at       1.517     32823_at       1.430
39266_at       1.409     34990_at       1.366
36227_at      −1.336     41389_s_at     1.316
35834_at       1.316     37149_s_at     1.305
34800_at       1.300     38922_at       1.292
35839_at      −1.270     1878_g_at      1.266
39425_at      −1.263     39798_at       1.244
38792_at      −1.231     36890_at       1.200
37874_at       1.199     39665_at       1.177

Suppose that we want to calculate the sample size of a new microarray study to discover the genes that are differentially expressed by the lymph node (LN) status of breast cancer patients, using the data of Huang et al. (28) as pilot data. We specify the set of prognostic genes M1 as the top 20 genes listed in Table 5.2. In order to reflect the variation in the estimated effect sizes, and for a slightly conservative sample size calculation, the true effect sizes of the specified m1 = 20 prognostic genes are set at 75% of the estimated standardized effect sizes given in Table 5.2. Figure 5.1 displays the estimated N for γ (∈ [0, 20]) true rejections by w = 0.05 multiple testing with 1 − βγ = 90% power. We assume a1 = a2 = 1/2, which is close to the group proportions nk/n in the pilot data, and B = 10,000 simulations are conducted for the sample size calculation. In this sample size calculation, we consider the m = 6,599 genes that remained in the pilot data after filtering, but the set of genes included in the final analysis may be slightly different depending on the results of data preprocessing. From Figure 5.1, the required sample size monotonically increases from N = 37 for γ = 0 or 1 to N = 192 for γ = 20. Note that the n = 37 of the Huang et al. (28) data is about the right size for at least one true rejection, i.e., γ = 1. The expected number of false rejections, defined as Σ_{j∈M0} P(|Tj| > cw) and estimated from the same simulations, is only about 0.1.
Fig. 5.1. Sample size required for γ true rejections for a breast cancer study estimated using the Huang et al. (2003) data as pilot data.
References 1. Benjamini, Y., Hochberg, Y. (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57(1), 289–300. 2. Genovese, C., Wasserman, L. (2002) Operating characteristics and extensions of the false discovery rate procedure. Journal of the Royal Statistical Society, Series B 64(3), 499–517. 3. Dudoit, S., Shaffer, J.P., Boldrick, J.C. (2003) Multiple hypothesis testing in microarray experiments. Statistical Science 18, 71–103. 4. Jung, S.H. (2005) Sample size for FDRcontrol in microarray data analysis. Bioinformatics 21, 3097–3103. 5. Pounds, S., Cheng, C. (2005) Sample size determination for the false discovery rate. Bioinformatics 21, 4263–4271. 6. Liu, P., Hwang, J.T.G. (2007) Quick calculation for sample size while controlling false discovery rate with application to microarray analysis. Bioinformatics 23, 739–746. 7. Storey, J.D. (2002) A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B 64(1), 479–498. 8. Storey, J.D. (2003) The positive false discovery rate: a Bayesian interpretation and the qvalue. Annals of Statistics 31(6), 2013–2035. 9. Storey, J.D., Tibshirani, R. (2001) Estimating false discovery rates under dependence, with applications to DNA microarrays. Technical Report 2001–2028, Department of Statistics, Stanford University. 10. Storey, J.D., Taylor, J.E., Siegmund, D. (2004) Strong control, conservative point estimation and simultaneous conservative
11.
12.
13. 14. 15. 16.
17.
18. 19.
consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society, Series B 66(1), 187–205. Lee, M.L.T., Whitmore, G.A. (2002) Power and sample size for DNA microarray studies. Statistics in Medicine 21, 3543–3570. van den Oord, E.J.C.G., Sullivan, P.F. (2003) A framework for controlling false discovery rates and minimizing the amount of genotyping in gene-finding studies. Human Heredity 56(4), 188–199. Jung, S.H., Jang, W. (2006) How accurately can we control the FDR in analyzing microarray data? Bioinformatics 22, 1730–1736. Holm, S. (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statististics 6, 65–70. Hochberg, Y. (1998) A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75, 800–802. Westfall, P.H., Young, S.S. (1989) P-value adjustments for multiple tests in multivariate binomial models. Journal of the American Statistical Association 84, 780–786. Westfall, P.H., Young, S.S. (1993) Resampling-based Multiple Testing: Examples and Methods for P-value Adjustment. Wiley: New York. Westfall, P.H., Wolfinger, R.D. (1997) Multiple tests with discrete distributions. American Statistician 51, 3–8. Dudoit, S., Yang, Y.H., Callow, M.J., Speed, T.P. (2000) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 12, 111–139.
218
Jung
20. Ge, Y., Dudoit, S., Speed, T.P. (2003) Resampling-based multiple testing for microarray data analysis. Test 12(1), 1–44. 21. Jung, S.H., Bang, H., Young, S.S. (2005) Sample size calculation for multiple testing in microarray data analysis. Biostatics 6(1), 157–169. 22. Witte, J.S., Elston, R.C., Cardon, L.R. (2000) On the relative sample size required for multiple comparisons. Statistics in Medicine 19, 369–372. 23. Wolfinger, R.D., Gibson, G., Wolfinger, E.D., Bennett, L., Hamadeh, H., Bushel, P., Afshari, C., Paules, R.S. (2001) Assessing gene significance from cDNA microarray expression data via mixed models. Journal of Computational Biology 8(6), 625–637. 24. Black, M.A., Doerge, R.W. (2002) Calculation of the minimum number of replicate spots required for detection of significant gene expression fold change in microarray experiments. Bioinformatics 18(12), 1609–1616. 25. Pan, W., Lin, J., Le, C.T. (2002) How many replicated of arrays are required to detect gene expression changes in microarray experiments? A mixture model approach. Genome Biology 3(5), 1–10.
26. Cui, X., Churchill, G.A. (2003) How many mice and how many arrays? Replication in mouse cDNA microarray experiments. In Methods of Microarray Data Analysis II. Kluwer Academic Publishers: Norwell, MA, 139–154.
27. Lin, D.Y. (2005) An efficient Monte Carlo approach to assessing statistical significance in genomic studies. Bioinformatics 21, 781–787.
28. Huang, E., Cheng, S.H., Dressman, H., Pittman, J., Tsou, M.H., Horng, C.F., Bild, A., Iversen, E.S., Liao, M., Chen, C.M., West, M., Nevins, J.R., Huang, A.T. (2003) Gene expression predictors of breast cancer outcomes. Lancet 361, 1590–1596.
29. Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U., Speed, T.P. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4(2), 249–264.
30. Bolstad, B.M., Irizarry, R.A., Astrand, M., Speed, T.P. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2), 185–193.
Chapter 6

Designs for Linkage Analysis and Association Studies of Complex Diseases

Yuehua Cui, Gengxin Li, Shaoyu Li, and Rongling Wu

Abstract

Genetic linkage analysis has been a traditional means for identifying regions of the genome with large genetic effects that contribute to a disease. Following linkage analysis, association studies are widely pursued to fine-map regions with significant linkage signals. For complex diseases, which often involve multiple genetic variants each with a small or moderate effect, linkage analysis has little power compared to association studies. In this chapter, we give a brief review of design issues related to linkage analysis and association studies with human genetic data. We introduce methods commonly used for linkage and association studies and compare the relative merits of family-based and population-based association studies. Compared to candidate gene studies, a genome-wide blind search for disease variants is proving to be a more powerful approach. We briefly review the commonly used two-stage designs in genome-wide association studies. As more and more biological evidence indicates a role for genomic imprinting in disease, identifying imprinted genes becomes critically important, and design and analysis for the genetic mapping of imprinted genes are introduced in this chapter. Recent efforts to integrate gene expression analysis and genetic mapping, termed expression quantitative trait loci (eQTL) mapping or genetical genomics analysis, offer new prospects for elucidating the genetic architecture of gene expression. Designs in genetical genomics analysis are also covered in this chapter.

Key words: Association studies, complex diseases, genetical genomics, genome-wide association studies, genomic imprinting, linkage analysis.
H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_6, © Springer Science+Business Media, LLC 2010

1. Introduction

Understanding the genetic etiology of complex traits has been a long-term effort in genetics. The genetic mechanisms that link genotype to phenotypic variation have been a central puzzle in modern genetic research. The development of
modern biotechnologies has made it possible to generate large-scale genetic and genomic data. Coupled with the development of efficient statistical inference tools and high-speed computing facilities, it is now possible to disentangle and understand the genetic basis of complex traits at the molecular level. The identification of novel genetic variants that account for the variation in the phenotypic distribution of complex traits will yield novel insights into the functional complexity of an organism and reveal new genetic targets for trait quality improvement and for disease prevention and treatment.

Quantitative trait locus (QTL) mapping has been a standard means of searching for genetic regions that harbor potential genes associated with a complex quantitative trait (1). The most important step in QTL mapping is to create a segregating population with linkage disequilibrium (LD). LD creates the marker–trait association through which one can target the genetic loci that underlie the trait variation. Experimental crosses between two inbred lines with divergent phenotypic distributions can create such a population with maximum disequilibrium. Examples of such populations include the backcross, F2, recombinant inbred lines (RILs), and many others. When creating inbred lines is not possible, data on outbred species arranged in a family-based structure may be collected for mapping purposes. For such data, within-family LD is expected, although linkage equilibrium may occur across families in the population. Statistical methods for QTL mapping have flourished in the literature (2–4).

Unlike in plants or animals, controlled crosses are impossible in humans. The design and analysis of human genetic data are thus different from those developed for plants and animals. Linkage analysis has been commonly used in human disease–gene mapping studies.
The foundation of a linkage study is to assess the recombination rate between a genetic marker and a disease-predisposing region using family-based pedigree data. However, linkage mapping cannot detect variants with fairly small or moderate effects. This limitation can be overcome by association studies, which assess not recombination but LD. With greatly reduced genotyping costs, large-scale genome-wide single nucleotide polymorphism (SNP) data are routinely generated in many labs. These high-dimensional SNP data bring both prospects and challenges in disease–gene identification. With more and more biological evidence indicating the importance of genomic imprinting in plant, animal, and human genetic studies, hunting for imprinted genes has become paramount in the post-genomic era. Moreover, recent integrated analyses that combine gene expression data with genotyping data offer promising prospects toward the goal of completely dissecting the genetic secrets underlying complex diseases.
Thus the purpose of this chapter is to give a comprehensive review of the design and analysis of human genetic data. QTL mapping in experimental crosses or outbred populations will not be covered. We will also cover topics related to design and analysis in recent developments of genome-wide association studies (GWAS), such as two-stage GWA analysis, genetic mapping of genomic imprinting, and genetical genomics analysis.
2. Designs in Genetic Linkage and Association Studies
2.1. Designs in Genetic Linkage Studies
The classical procedure for finding genes underlying traits of interest in humans generally starts with a segregation analysis of collected families to verify Mendelian segregation. Linkage analysis is then conducted, followed by an association analysis to assess allelic association due to LD in regions with strong linkage signals. The identification of a disease-predisposing locus is the first step toward the positional cloning of the gene itself. Linkage analysis has been a traditional means of identifying the location of genes that cause genetic diseases. It has proven powerful for simple (Mendelian) diseases with rare, severe, and high-risk mutations (5). Linkage analysis relies on genetic recombination information obtained from a family-based pedigree structure. The design of a linkage analysis largely depends on assumptions about the underlying disease model and the methods to be applied. When the mode of inheritance of a disease follows a known pattern in family members, parametric (model-based) linkage analysis, such as the logarithm of odds (LOD) score analysis (6), can be applied. In this case, typically one or a few large family pedigrees are collected to track the disease status. The recombination fraction between the disease locus and the known marker loci is then estimated, and the significance of linkage can be assessed using the LOD score method. In a typical linkage analysis, all information about affected and unaffected individuals is used. However, LOD score-based parametric linkage analysis is limited by incomplete disease penetrance, phenocopies, and genetic heterogeneity. Unlike Mendelian diseases, which follow a known pattern of inheritance, complex traits often involve multiple genes and environmental effects, as well as complicated interactions among them, and the disease inheritance pattern is generally unknown.
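As a minimal illustration of the LOD idea, the sketch below profiles the LOD score over the recombination fraction under the simplifying assumption of phase-known, fully informative meioses; the counts are hypothetical and not from this chapter:

```python
import numpy as np

def lod(theta, n_rec, n_nonrec):
    """LOD score for phase-known meioses: log10 likelihood ratio of
    recombination fraction theta against the no-linkage value 0.5."""
    like = theta**n_rec * (1 - theta)**n_nonrec
    null = 0.5 ** (n_rec + n_nonrec)
    return np.log10(like / null)

# hypothetical pedigree data: 2 recombinants out of 20 informative meioses
thetas = np.linspace(0.01, 0.5, 50)
scores = lod(thetas, 2, 18)
theta_hat = thetas[np.argmax(scores)]   # MLE is close to 2/20 = 0.1
```

The conventional evidence threshold for declaring linkage is a maximum LOD of 3, i.e., odds of 1000:1 in favor of linkage.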
Thus a nonparametric (model-free) method that does not require specification of the mode of inheritance is more robust than parametric
linkage analysis. The basic idea of nonparametric linkage analysis is allele sharing identical by descent (IBD): affected relatives in a pedigree are examined to see how often a particular copy of a chromosomal segment is shared IBD, based on the observed genotypes at the marker loci. When affected members of a pedigree share more marker alleles IBD than expected by chance, this may indicate the existence of a susceptibility gene close to the marker under study. Two commonly applied designs in nonparametric linkage analysis are the affected sib-pair (ASP) design and the affected relative design. The ASP design requires nuclear families with two or more affected children and assesses linkage with a likelihood-ratio method by comparing the observed IBD sharing of 0, 1, or 2 alleles with that expected under no linkage (0.25, 0.5, and 0.25, respectively) (5). The affected relative design requires extended families with two or more affected relatives (other than parent–child pairs), and nonparametric linkage scores can be calculated to assess linkage (7). In both designs, only affected relatives or sib-pairs are used, although unaffected individuals are often genotyped to help infer IBD status and improve statistical power. For some mating types, IBD cannot be uniquely determined, in which case the expected IBD sharing is estimated using all the available marker data. For a quantitative trait, the Haseman–Elston (HE) regression model can be used: the squared trait difference between sibs is regressed on the proportion of alleles the sib-pair shares IBD, and testing for linkage is equivalent to testing that the regression has a negative slope (8). The HE model has been extended to many situations, e.g., larger sibships (9) and extended pedigrees (10). Another commonly applied approach is the variance components (VC) approach (11), which has been shown to be more powerful than the HE method in a number of situations (12).
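As a rough illustration of the HE idea, the sketch below regresses stylized squared sib-pair trait differences on marker IBD sharing and performs a one-sided test for a negative slope; all data and effect sizes are hypothetical, and the squared differences are simulated directly rather than from individual trait values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_pairs = 400

# proportion of alleles shared IBD at the marker (0, 1, or 2 alleles -> 0, 0.5, 1)
pi_ibd = rng.choice([0.0, 0.5, 1.0], size=n_pairs, p=[0.25, 0.5, 0.25])

# stylized squared trait differences: at a linked QTL the mean decreases with IBD sharing
sq_diff = 2.0 - 1.0 * pi_ibd + rng.normal(0.0, 0.5, size=n_pairs)

res = stats.linregress(pi_ibd, sq_diff)
t_stat = res.slope / res.stderr
p_one_sided = stats.t.cdf(t_stat, df=n_pairs - 2)  # H1: slope < 0
```

A significantly negative slope is the HE signature of linkage: sibs who share more alleles IBD at a linked marker are phenotypically more alike.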
A simple one-QTL VC model for a quantitative phenotype y_{ik}, measured on individual i in family k (k = 1, \ldots, K), can be written as a linear function of the QTL and other genetic effects,

y_{ik} = \mu_k + Q_{ik} + G_{ik} + e_{ik}, \quad k = 1, \ldots, K; \; i = 1, \ldots, n_k,   [1]

where n_k is the kth family size, \mu_k denotes the mean for the kth family, Q_{ik} \sim N(0, \sigma_q^2) is the random effect of the major monogenic QTL, G_{ik} \sim N(0, \sigma_g^2) is the polygenic effect reflecting the effects of genes unlinked to the tested QTL, and e_{ik} \sim N(0, \sigma_e^2) is the random environmental error term, uncorrelated with the other terms. The phenotypic variance–covariance between individuals i and i' in the kth family for the model in [1] can be expressed as
cov(y_{ik}, y_{i'k}) =
\begin{cases}
\sigma_q^2 + \sigma_g^2 + \sigma_e^2 & \text{if } i = i', \\
\pi_{ii'|k}\,\sigma_q^2 + \phi_{ii'}\,\sigma_g^2 & \text{if } i \neq i',
\end{cases}

where \pi_{ii'|k} is the proportion of alleles shared IBD between individuals i and i' and \phi_{ii'} is the expected proportion of alleles shared IBD. In matrix notation, the variance–covariance matrix for the kth family, \Sigma_k, can be written as

\Sigma_k = \Pi_{q|k}\,\sigma_q^2 + \Phi_g\,\sigma_g^2 + I\,\sigma_e^2,   [2]

where \Pi_{q|k} is the matrix of proportions of marker alleles shared IBD in the kth family, \Phi_g is the matrix of expected proportions of alleles shared IBD, and I is the identity matrix. Linkage can be tested via H_0: \sigma_q^2 = 0. A major task in VC linkage analysis is to construct the IBD matrix; various methods have been proposed for this purpose (e.g., (13), (14)). Assuming multivariate normality for the trait vector, the density function of observing a particular vector of data in the kth family is given by

f(y_k \mid M_k) = \frac{1}{(2\pi)^{n_k/2} |\Sigma_k|^{1/2}} \exp\left\{ -\frac{1}{2} (y_k - \mu_k)^T \Sigma_k^{-1} (y_k - \mu_k) \right\},

where y_k = (y_{1k}, \ldots, y_{n_k k})^T is the n_k \times 1 vector of phenotypes for the kth family and M_k refers to the marker data. The overall log-likelihood function for K independent families is given by

\ell = \sum_{k=1}^{K} \log[f(y_k \mid M_k)].
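The per-family likelihood above can be evaluated directly once the IBD matrices are in hand. The sketch below, with a hypothetical sib trio and hypothetical variance components, computes the Gaussian log-likelihood under the alternative and under H0: sigma_q^2 = 0 (note that a formal likelihood-ratio test of a variance component sits on the boundary of the parameter space, so its null distribution is a chi-square mixture rather than a plain chi-square):

```python
import numpy as np
from scipy.stats import multivariate_normal

def family_loglik(y, mu, Pi, Phi, s2q, s2g, s2e):
    """Log-likelihood of one family's trait vector under the VC model:
    Sigma = Pi*s2q + Phi*s2g + I*s2e."""
    Sigma = Pi * s2q + Phi * s2g + np.eye(len(y)) * s2e
    return multivariate_normal(mean=np.full(len(y), mu), cov=Sigma).logpdf(y)

# hypothetical sib trio: marker IBD proportions Pi and expected sharing Phi
Pi = np.array([[1.0, 0.5, 0.0],
               [0.5, 1.0, 0.5],
               [0.0, 0.5, 1.0]])
Phi = np.full((3, 3), 0.5)
np.fill_diagonal(Phi, 1.0)
y = np.array([1.2, 0.8, -0.3])

ll_alt = family_loglik(y, 0.5, Pi, Phi, s2q=0.4, s2g=0.2, s2e=0.4)
ll_null = family_loglik(y, 0.5, Pi, Phi, s2q=0.0, s2g=0.2, s2e=0.4)
```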
The variance component parameters can be estimated by either the maximum likelihood (ML) method or the restricted maximum likelihood (REML) method (15). Since ML estimation of variance components does not account for the loss of degrees of freedom that results from estimating the fixed effects, ML estimators generally tend to be biased; REML estimation may therefore be preferred, especially with a small sample size. The family-based linkage study has been broadly applied in disease gene discovery, followed by a variety of fine-mapping techniques to pinpoint the disease-predisposing region. Fine resolution generally requires large pedigrees, which largely hinders its utility (16). Moreover, for common, multifactorial genetic diseases, linkage analysis is limited by the lack of clear segregation of DNA variants in multi-generational families and by the modest contributions of multiple genetic variants, each with a fairly small or moderate effect (17). Thus, following up
the linkage analysis, regions with strong linkage signals need to be further fine-mapped with association studies, which assess the marker–trait correlation by testing LD. More importantly, recent advances in biotechnology and the development of the human HapMap Project have produced large-scale SNP data, which provide dense coverage of genetic variants across the genome and make association studies more attractive.

2.2. Designs in Genetic Association Studies
The basic idea of association analysis is to assess correlations between genetic variants and a disease phenotype in a population. However, the genetic variants that cause the disease phenotype are generally unknown and cannot be observed; what we have is a collection of genetic markers. The disease–gene association can thus be assessed through the disease–marker association, where we expect markers associated with a disease phenotype to be in high LD with causal disease variants. The diagram in Fig. 6.1 displays the relationship among a disease variant, a tested marker, and a disease phenotype. Generally speaking, the causal disease variant is unknown; if there is an association between a marker and a disease, we expect a causal variant in high LD with this marker in its neighborhood. Thus, LD holds the key in genetic association studies. When a disease mutation occurs somewhere in the population, markers in high LD with the mutation are expected to be inherited together with it over generations. The disease–gene association can therefore be inferred through the disease–marker association, where the tested marker is in high LD with the causal disease variant. This is the theoretical foundation of association studies.
Fig. 6.1. The diagram shows the association among a disease phenotype, a causal variant (cv), and a tested marker. The true association between the phenotype and the cv is unknown and is shown with the dashed arrow. This true association can be assessed through the marker–phenotype association (the solid arrow) due to strong LD between the cv and the known marker (the dot-dashed arrow).
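Because LD is the bridge between the tested marker and the causal variant, it is worth recalling how pairwise LD is quantified. D' and r^2 are the standard two-locus measures (not notation from this chapter); the haplotype frequencies below are hypothetical:

```python
# hypothetical haplotype frequencies for alleles (A, a) at the causal variant
# and (B, b) at the marker; the four frequencies must sum to 1
p_AB, p_Ab, p_aB, p_ab = 0.45, 0.05, 0.05, 0.45

p_A = p_AB + p_Ab                       # marginal allele frequencies
p_B = p_AB + p_aB

D = p_AB - p_A * p_B                    # coefficient of linkage disequilibrium
r2 = D**2 / (p_A * (1 - p_A) * p_B * (1 - p_B))   # squared correlation of alleles

# D' normalizes D by its maximum attainable value given the allele frequencies
d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B) if D > 0 else min(p_A * p_B, (1 - p_A) * (1 - p_B))
d_prime = D / d_max
```

With these frequencies D = 0.2, r^2 = 0.64, and D' = 0.8: the marker is a good, though imperfect, surrogate for the causal variant.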
2.2.1. Case–Control Designs
The case–control study has been the most widely used approach for characterizing genetic variants associated with a disease. Case–control studies compare the distribution of affected subjects with that of unaffected controls and test whether there is a significant difference between the two study populations in terms
of genotype or allele frequencies at a putative locus. A difference in the frequency of an allele or genotype at the polymorphic locus between the two study groups indicates that the locus may increase disease risk or may be in LD with a causal polymorphism. For a population-based association study, unrelated cases and controls are typically sampled. Assume that cases are sampled from a study population and unaffected controls are sampled independently from the general population. Each individual is genotyped as one of the three genotypes AA, AB, and BB at a marker with two alleles A and B. The data obtained in case–control studies can be displayed as in Table 6.1 in a genotype-based or allele-based format. Define f_0 = Pr(case|AA), f_1 = Pr(case|AB), and f_2 = Pr(case|BB) as the penetrance functions for the three genotypes. The population prevalence of the disease is D = Pr(case). Let the genotype frequencies for cases and controls be p_j = Pr(j|case) and q_j = Pr(j|control), j = 0, 1, 2, corresponding to genotypes AA, AB, and BB, respectively, with \sum_{j=0}^{2} p_j = \sum_{j=0}^{2} q_j = 1. Define (g_0, g_1, g_2) as the genotype probabilities in the general population. Then

p_j = \frac{f_j g_j}{D} \quad \text{and} \quad q_j = \frac{(1 - f_j) g_j}{1 - D} \quad \text{for } j = 0, 1, 2.
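These identities follow from Bayes' rule and can be checked numerically; the penetrances and population genotype frequencies below are hypothetical:

```python
# hypothetical penetrances and general-population genotype frequencies
f = [0.05, 0.10, 0.20]   # Pr(case | AA), Pr(case | AB), Pr(case | BB)
g = [0.49, 0.42, 0.09]   # Pr(AA), Pr(AB), Pr(BB)

D = sum(fj * gj for fj, gj in zip(f, g))               # disease prevalence
p = [fj * gj / D for fj, gj in zip(f, g)]              # genotype freqs in cases
q = [(1 - fj) * gj / (1 - D) for fj, gj in zip(f, g)]  # genotype freqs in controls
```

Both p and q sum to one, as they must, and the case group is enriched for the high-penetrance genotypes relative to the general population.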
The mode of genetic inheritance can then be defined as recessive if f_0 = f_1, additive if f_1 = (f_0 + f_2)/2, and dominant if f_1 = f_2. If the mode of inheritance of the disease is known, the number of genotype columns in Table 6.1 can be reduced: for a recessive (or dominant) model, the columns AA and AB (or BB and AB) can be collapsed into one column to reflect the inheritance pattern. The null hypothesis of no association is H_0: p_j = q_j for each j. For the 2 × 3 genotype table, a chi-square test with 2 degrees of freedom (df) can be applied (18); this test is independent of the underlying genetic model. An alternative to
Table 6.1
Genotype and allele distribution for a case–control design

          Genotypes                           Alleles
          AA      AB      BB      Total       A            B            Total
Case      n11     n12     n13     n1.         2n11 + n12   2n13 + n12   2n1.
Control   n21     n22     n23     n2.         2n21 + n22   2n23 + n22   2n2.
Total     n.1     n.2     n.3     n           2n.1 + n.2   2n.3 + n.2   2n
the chi-square-based test is the Armitage trend test, which has 1 df and hence is more powerful than the 2-df chi-square test. The trend test has the form

Z_x = \frac{\sqrt{n} \sum_{j=0}^{2} x_j (n_{2.} n_{1j} - n_{1.} n_{2j})}{\left\{ n_{1.} n_{2.} \left[ n \sum_{j=0}^{2} x_j^2 n_{.j} - \left( \sum_{j=0}^{2} x_j n_{.j} \right)^2 \right] \right\}^{1/2}},

where x = (x_0, x_1, x_2) can take values x = (0, x, 1) with 0 ≤ x ≤ 1. Zheng et al. (19) showed that the optimal choices of x for recessive, additive, and dominant models are x = 0, x = 1/2, and x = 1, respectively. An alternative to the genotype-based test is the allele-based test. Two commonly used tests focusing on alleles are the allele association (AA) test (18) and the Hardy–Weinberg disequilibrium (HWD) test (20). The AA test for the 2 × 2 allele table in Table 6.1 is a chi-square test of the form

X_{AA}^2 = \frac{2n \left[ (2n_{11} + n_{12})(n_{22} + 2n_{23}) - (2n_{21} + n_{22})(n_{12} + 2n_{13}) \right]^2}{4 n_{1.} n_{2.} (2n_{.1} + n_{.2})(n_{.2} + 2n_{.3})} \sim \chi_1^2.
The allele-based test can also be performed by testing the difference in the risk allele frequency between cases and controls. Let \hat{p}_A and \hat{p}_U be the estimated allele frequencies in the affected (case) and unaffected (control) samples. The test statistic

Z = \frac{\hat{p}_A - \hat{p}_U}{\sqrt{[\hat{p}_A (1 - \hat{p}_A) + \hat{p}_U (1 - \hat{p}_U)]/(2n)}}

has an asymptotic standard normal distribution (mean 0 and variance 1) under the null hypothesis of no association. The HWD test assesses the deviation from Hardy–Weinberg equilibrium (HWE) in cases. Let \hat{p} = (2n_{11} + n_{12})/(2n_{1.}) be the estimated frequency of the disease allele A. Under the null of HWE, the expected counts for the three genotypes are E(AA) = n_{1.} \hat{p}^2, E(AB) = 2 n_{1.} \hat{p}(1 - \hat{p}), and E(BB) = n_{1.} (1 - \hat{p})^2. The chi-square test for HWE can be written as

X_{HWD}^2 = \frac{(n_{11} - E(AA))^2}{E(AA)} + \frac{(n_{12} - E(AB))^2}{E(AB)} + \frac{(n_{13} - E(BB))^2}{E(BB)}.
Hoh et al. (21) later proposed a product test that combines the two statistics, T_p = X_{AA}^2 \cdot X_{HWD}^2, to compensate for the power loss of using either single test alone. For a power comparison of different tests, the reader is referred to Zheng et al. (22). The commonly applied tests above can be considered detection tests, through which one can only detect an association
signal, but not the specific disease variant effect. Covariates can affect the disease risk and can bias an association test when they are not properly adjusted for. Covariates are variables that influence disease risk independently of, or interactively with, genetic variants; for example, age often influences disease risk, and smoking often causes lung disease. When covariates exist, it may be necessary to adjust for their effects when testing genetic association. A widely applied approach in case–control studies that can adjust for covariate effects is the logistic regression model. The analysis can focus on genotypes or alleles, or on even more complicated haplotype-based analyses. Let y_i (i = 1, \ldots, n) denote the disease status, with y_i = 1 indicating cases and y_i = 0 indicating controls. The logistic regression model given the genetic information G_i can be expressed as

\log \frac{Pr(y_i = 1 \mid G_i)}{Pr(y_i = 0 \mid G_i)} = \beta_0 + x_i^{AA} \beta_{AA} + x_i^{AB} \beta_{AB} + \sum_{l=1}^{L} x_{il} \gamma_l,   [3]

where \beta_{AA} and \beta_{AB} are the coefficients for the additive and dominance effects and \gamma_l (l = 1, \ldots, L) are the covariate effects. In this coding scheme, genotype BB is the baseline, so exp(\beta_{AA}) or exp(\beta_{AB}) can be interpreted as the odds ratio (approximating the relative risk for a rare disease) for an individual carrying genotype AA or AB relative to those carrying genotype BB. Model [3] is a two-parameter genetic model, and disease–gene association can be assessed by testing H_0: \beta_{AA} = \beta_{AB} = 0. A likelihood ratio or score test can be applied to assess significance; the test statistic asymptotically follows a chi-square distribution with 2 df. A 1-df test can also be conducted by assuming a particular disease model. The logistic regression is then expressed as

\log \frac{Pr(y_i = 1 \mid G_i)}{Pr(y_i = 0 \mid G_i)} = \beta_0 + x_i \beta_1 + \sum_{l=1}^{L} x_{il} \gamma_l,
where x_i denotes the number of copies of the risk allele A: x_i takes values (0, 1, 2), (0, 0, 1), or (0, 1, 1) corresponding to genotypes (BB, AB, AA) for an additive, recessive, or dominant disease model, respectively. The logistic regression model can also be extended to fit haplotypes, gene–gene interactions, and gene–environment interactions (see 23–27).

Sample size requirement: It is commonly recognized that increasing the sample size increases testing power; however, genotyping large samples can greatly increase costs. Within the available resources, calculating the optimal, or at least the minimal, sample size required to achieve a certain power is essential. Estimation of the required case–control size depends on a
variety of factors, such as the effect size at the tested locus, the frequency of the susceptibility allele, the mode of inheritance of the risk allele (i.e., recessive, dominant, or additive), and the correlation (i.e., LD) between the causal variant and the surrogate marker. Without prior knowledge of these contributing factors, certain assumptions are usually made in order to calculate the minimum required sample size (see 28–30). A software package named PGA is available for this purpose, applicable to association studies of candidate genes, fine-mapping studies, and whole-genome scans (31). With high-dimensional genome-wide SNP markers on the scale of millions, a sample of several hundred to a few thousand individuals may be required to detect multiple genetic factors with fairly small or moderate effects (28, 29).

Control selection and matching: As the genetic factors accounting for a complex disease are likely to vary from population to population, the case–control design is the design most prone to producing spurious association signals. Careful selection of individuals for inclusion is necessary to ensure a homogeneous genetic background and avoid possible stratification. As there is often little choice in selecting cases due to sample limitations, the selection of the control population becomes critical to avoid potential inflation or dilution of the disease allele frequency differences between cases and controls due to population stratification. The importance of control selection has been critically addressed in Cardon and Bell (17). One strategy is to match controls to cases on factors that may otherwise confound the analysis: matching refers to the selection of controls that are as similar as possible to the cases in terms of the distribution of confounding factors that are difficult to measure, and thus reduces the chance of spurious association. Methods for analyzing matched case–control data have been developed (e.g., 32).
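As a rough illustration of these sample-size considerations, the sketch below uses a standard two-proportion normal approximation to gauge the number of alleles needed per group to detect a given case–control allele-frequency difference. The frequencies, significance levels, and power are hypothetical, and this simplified formula is not the method implemented in PGA:

```python
import math
from scipy import stats

def alleles_per_group(p_case, p_control, alpha=0.05, power=0.8):
    """Approximate alleles per group needed to detect an allele-frequency
    difference with a two-sided two-proportion z-test."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    var = p_case * (1 - p_case) + p_control * (1 - p_control)
    return math.ceil((z_a + z_b) ** 2 * var / (p_case - p_control) ** 2)

n_nominal = alleles_per_group(0.30, 0.25)                 # single candidate SNP
n_genomewide = alleles_per_group(0.30, 0.25, alpha=5e-8)  # genome-wide threshold
```

Shrinking the effect or tightening the significance level to a genome-wide threshold inflates the requirement severalfold, which is why GWAS need samples in the thousands.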
However, even for well-matched case–control samples, the issue of population stratification still needs to be addressed to avoid spurious association. Pritchard and Donnelly (33) developed a method for testing genetic association in the presence of population stratification, using unlinked markers to infer population substructure and then testing for association within the identified subpopulations. STRUCTURE and STRAT are two commonly applied software packages for detecting stratification and testing for genetic association in its presence. Alternatively, one can use another approach, called genomic control, to correct for population stratification (34). However, one has to be cautious: correction for stratification cannot completely remove the possibility of increased false-positive results under all circumstances (17, 33, 35), and stratification should be avoided where possible
during the design stage by trying to sample from a homogeneous population.

2.2.2. Family-Based Designs
As mentioned in Section 2.2.1, population stratification is a critical issue that hinders the application of population-based case–control studies. The family-based design offers a powerful alternative that solves this problem: using parents as controls, family-based designs are robust against population substructure. Furthermore, studies that incorporate elements of family structure (e.g., discordant sibs or trios) offer a solution to the problems of model building and multiple-hypothesis testing, which are important issues in testing for association, especially in the era of GWAS (36). The most commonly used test in the family-based design is the transmission disequilibrium test (TDT) (37). The TDT measures association (and linkage) in nuclear families with observed transmissions of genetic markers from parents to offspring. Under no association with the disease, the alleles of the tested genetic marker have an equal chance of being transmitted from a heterozygous parent to the offspring. If, however, one allele increases the risk of disease, this allele will be transmitted to affected offspring more often than expected by chance. Specifically, the TDT compares the frequency with which a particular allele is transmitted from a heterozygous parent to an affected offspring with the frequency with which it is not transmitted (37); that is, it tests whether heterozygous parents transmit a particular marker allele to affected offspring more frequently than expected. Table 6.2 illustrates the 2 × 2 table for data collected at a diallelic SNP marker with alleles A and B. The underlying assumption is that marker alleles in LD with the disease alleles will be transmitted preferentially to affected offspring. The test statistic is defined as

X^2 = \frac{(b - c)^2}{b + c}.   [4]
Table 6.2
Parental transmission for a diallelic marker

                 Not transmitted
Transmitted      A         B         Total
A                a         b         a + b
B                c         d         c + d
Total            a + c     b + d     2n
When the marker is not linked or not associated with the disease variant, we do not expect differential transmission of marker alleles. Therefore, under the null hypothesis, X^2 has a chi-square distribution with 1 df. The TDT does not require specification of a disease model or assumptions about the distribution of the disease in the population, and hence is completely nonparametric. It is robust to misspecification of the phenotype distribution and insensitive to population stratification. It was originally used to test for linkage in the presence of association; now it is typically used as a test for association (38). Some efforts have been made to extend the TDT to more general settings. For example, the genotype-based TDT is suited to disease models that are close to dominant or recessive (39), and the pedigree disequilibrium test (PDT) accommodates multiple affected offspring (40). Considering the quantitative nature of a trait distribution, the quantitative TDT (QTDT) can be applied (41). The family-based association test (FBAT), a widely used extension of the TDT, can handle various types of trait distributions, large pedigrees, missing founders, and haplotypes. A comprehensive review of FBAT can be found in Laird and Lange (36) (see Chapter 17 for more details).
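Given the transmission counts of Table 6.2, the TDT statistic in [4] reduces to a McNemar-type test on the two discordant counts b and c; a minimal sketch with hypothetical counts:

```python
from scipy import stats

def tdt(b, c):
    """TDT (McNemar-type) statistic from heterozygous-parent transmissions:
    b = A transmitted (B untransmitted), c = B transmitted (A untransmitted)."""
    x2 = (b - c) ** 2 / (b + c)
    p = stats.chi2.sf(x2, df=1)
    return x2, p

# allele A transmitted 60 times and allele B 40 times from heterozygous parents
x2, p = tdt(60, 40)
```

The concordant counts a and d carry no information about preferential transmission, which is why only b and c enter the statistic.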
Population-based and family-based studies are the two commonly used designs in genetic association analysis. Population-based study designs by sampling cases and unrelated controls can readily collect sufficiently large populations to detect disease variants with small or moderate effects. However, this design suffers from potential power loss or spurious associations due to population stratification or admixture. On the other hand, familybased designs (e.g., sampling with case–sibling pairs or case– parent trios) have the advantage in which all individuals in a family pedigree share a common genetic background. Therefore, family members tend to be more homogeneous in exposure to environments possibly associated with the disease etiology. The problem of population stratification is greatly alleviated and bypassed. However, family-based designs suffer from small sample sizes as it is often difficult to accumulate large enough samples of wellcharacterized families. Moreover, family-based samples are typically more difficult to collect than case–control samples (especially for late onset diseases), and generally offer less statistical power than the equivalent-sized case–control study (28, 29). Another feature that makes family-based designs less attractive is that they are sensitive to genotyping errors (42–44). As we are approaching the era of whole-genome association studies, choosing the design becomes even more pressing. One can choose the family-based design to bypass the possible negative effect of population stratification which may create unacceptable
noise in the search for significant associations across the genome (45). On the other hand, large sample sizes can be collected in a population-based design for the purpose to robustly control type I error rate. Despite concerns and debates about the two designs, Evangelou et al. (46) recently conducted a meta-analysis to examine 93 associations and found no remarkable differences between the two designs. Therefore, there is no robust criteria for optimal selection of these two design strategies. The design choice really depends on sampling conditions as well as the underlying disease mode of inheritance. In an effort to maximize efficiency, a hybrid design that utilizes data from both types of study designs may be more appropriate (see 47–49). 2.3. GWAS: A Two-Stage Design
Because knowledge about the distribution of disease variants across the genome is limited, a genomewide blind search for association signals appears to be the most powerful approach to disease–gene hunting, provided that the genotyped SNP markers have sufficient genomewide coverage. With greatly reduced genotyping costs, genomewide genotyping has become a standard means in large-scale association studies with unrelated case–control sampling designs. A number of studies have shown the potential of GWAS for finding novel disease variants (e.g., 50–53). In a typical GWAS, a number of cases and unrelated controls are sampled for genotyping at thousands to millions of SNP loci. For example, the recently developed Affymetrix genomewide human SNP array 6.0 contains more than 906 K SNPs as well as more than 946 K probes for copy number variation detection. SNPs may be chosen randomly across the genome or specifically chosen for their coverage based on HapMap information. These high-dimensional SNP data bring daunting challenges in statistical modeling, computing, and testing. Whether the focus is on single SNP-based, haplotype-based (e.g., 24), or gene-based analysis (54), the large number of hypothesis tests inevitably requires large sample sizes to achieve reasonable power to detect true associations. Given the high-dimensional nature of SNP profiles across the genome, several thousand samples may still not be enough, despite the huge costs of genotyping and sample collection. Although genotyping costs have fallen sharply, most labs still cannot afford them at this scale. Efficient designs are needed to reduce genotyping costs without suffering too much power loss. The simplest design in GWAS is the one-stage design, in which all cases and controls are genotyped for all SNPs across the genome using an array platform; its drawback is the cost.
Concerns about genotyping costs motivate a cost-effective two-stage design. The two-stage design was first developed in a candidate gene study (55). The idea is to genotype a certain proportion of
Cui et al.
cases and controls in the first stage. Markers that are potentially associated with a disease and are identified during this stage are then genotyped in additional cases and controls in the second stage. For a fixed genotyping cost, the two-stage design allows one to genotype substantially more cases and controls and hence increase testing power. Since its inception, various improvements have been proposed and studied (e.g., 56–58). For a GWAS with more than 100 K SNPs, the same design ideas developed for candidate gene studies can be applied. Following Wang et al. (59), define πs as the fraction of samples genotyped in the first stage and πg as the fraction of SNPs selected in the first stage to be genotyped in the second stage. Let t1 and t2 be the per-sample costs of genotyping a SNP at the first and second stages, respectively; t1 and t2 differ because different numbers of SNPs are genotyped at the two stages. For a one-stage design, the per-sample genotyping cost is t1, whereas for the two-stage design it is t1 πs + t2 (1 − πs) πg. If t1 πs + t2 (1 − πs) πg < t1, the two-stage design reduces the genotyping cost. Wang et al. (59) developed optimization solutions that minimize the second-stage cost subject to maintaining proper testing power and type I error. Thus, the choice of the two-stage design can be viewed as an optimization problem that weighs genotyping cost, testing power, and type I error. Intense investigation has followed the main framework of the two-stage design (e.g., 60, 61). Comparing these design studies, a general rule that most designs follow is to genotype no more than 50% of the samples in the first stage and to carry fewer than 25% of SNPs over to second-stage genotyping. There are also more complicated multistage designs, such as three-stage designs (38, 62).
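The per-sample cost comparison above is straightforward to compute directly. The short sketch below illustrates it with hypothetical cost figures; the function name and the numbers are ours for illustration, not from Wang et al. (59):

```python
def per_sample_cost_two_stage(pi_s, pi_g, t1, t2):
    """Expected per-sample genotyping cost of a two-stage GWAS.

    pi_s: fraction of samples genotyped in stage 1
    pi_g: fraction of SNPs carried over to stage 2
    t1, t2: per-sample cost of genotyping a SNP at stages 1 and 2
    """
    return t1 * pi_s + t2 * (1 - pi_s) * pi_g

# Hypothetical costs: with half the samples in stage 1 and 10% of the
# SNPs carried over (t1 = t2 = 1 cost unit), the two-stage design costs
# roughly 55% of the one-stage design.
t1 = t2 = 1.0
two_stage = per_sample_cost_two_stage(0.5, 0.1, t1, t2)
one_stage = t1
print(two_stage < one_stage)  # True
```

Under these (illustrative) settings the two-stage design nearly halves the genotyping cost, which is consistent with the "up to 50% cost reduction" often quoted for well-chosen two-stage designs.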
However, it is hard to reach an optimal solution because of the large number of design parameters involved. For a comprehensive review of multistage sampling designs, see Elston et al. (63). In terms of statistical analysis for the two-stage design, Skol et al. (64) evaluated two analysis methods: the combined analysis and the replication-based analysis. Since the number of SNP markers genotyped in the second stage is generally much smaller than the total number of markers genotyped in the first stage, one can use the markers genotyped at the second stage to check the consistency of significant associations obtained at the first stage. The second-stage analysis is thus considered a replication of the first-stage analysis and is termed the replication-based analysis (64). The combined analysis pools the SNP markers genotyped at both stages for association analysis. Skol et al. (64) showed that the combined analysis is more powerful than the replication-based analysis. However, one should be
cautious in reaching this conclusion, because the markers selected at the first stage for carryover to the second stage are analyzed twice in the combined analysis. If an analysis focuses only on comparing differences in allele frequency, a DNA pooling strategy can be applied in a two-stage design, subject to correction for genotyping errors (65). One common issue for two-stage analysis is that it is difficult to devise a unified method to assess statistical significance. The traditional Bonferroni correction is certainly too conservative because it ignores the correlations among SNP markers. Lin (66) recently proposed an efficient Monte Carlo procedure; by accounting for the correlations among SNPs, the method shows merit in controlling the genomewide type I error. In summary, a two-stage GWAS that follows an optimal design strategy is cost-efficient, with power close to that of a one-stage design but with up to 50% cost reduction. With the development of modern biotechnology, large-scale genotyping costs have fallen sharply, and a one-stage design may eventually be affordable for most labs. Until then, an optimal two-stage design may still be preferred to reduce costs, possibly with pooling of case–control DNA samples at one stage to reduce costs further (63, 65).
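Lin's procedure simulates the joint distribution of the test statistics rather than resampling the data, but a simple permutation of case–control labels conveys the same idea: the maximum statistic over all SNPs is recomputed under each permutation, so the resulting genomewide p-value automatically respects the correlation (LD) among markers. The sketch below is a generic illustration of that idea, not Lin's algorithm; the function names and the crude mean-difference statistic are ours.

```python
import random
from statistics import mean

def max_statistic(phenotypes, genotypes):
    """Maximum over SNPs of |mean genotype in cases - mean in controls|."""
    stats = []
    for snp in genotypes:
        cases = [g for g, y in zip(snp, phenotypes) if y == 1]
        controls = [g for g, y in zip(snp, phenotypes) if y == 0]
        stats.append(abs(mean(cases) - mean(controls)))
    return max(stats)

def genomewide_pvalue(phenotypes, genotypes, n_perm=999, seed=1):
    """Permutation p-value for the most significant SNP.

    Correlation among SNPs is preserved because each permutation
    shuffles only the phenotype labels, leaving the genotype matrix
    (and hence its LD structure) intact.
    """
    rng = random.Random(seed)
    observed = max_statistic(phenotypes, genotypes)
    exceed = 0
    labels = list(phenotypes)
    for _ in range(n_perm):
        rng.shuffle(labels)
        if max_statistic(labels, genotypes) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

Because the null distribution is that of the maximum over correlated tests, this adjustment is less conservative than Bonferroni whenever the SNPs are in LD.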
3. Designs in Genetic Mapping of Genomic Imprinting
Genomic imprinting refers to a phenomenon in which the same genes are expressed differently depending on their parental origin (67). For example, a locus with two alleles A and a is said to be imprinted if the expression of genotype Aa, which inherits A from the maternal parent and a from the paternal parent, differs from the expression of genotype aA, which inherits the two alleles the other way around. The role of imprinting in shaping an organism's development has been observed ubiquitously in plants, animals, and humans (68–71). Identifying imprinted genes following a genomewide linkage scan has become a routine procedure in hunting for genetic variants underlying complex traits (72–80). Genomic imprinting, also called the parent-of-origin effect, has been shown to play an important role in controlling embryonic growth and development (81–83). Recent studies have shown that the expression of imprinted genes varies on a quantitative scale rather than being strictly mono-allelic (84–86), suggesting that a quantitative trait analysis framework can be applied to map imprinted genes. One key step in mapping imprinted genes is to construct a segregation population in which the parental origin of alleles can be traced and distinguished. Following that,
the parental origin-specific genetic effect can be estimated and tested to assess imprinting. Depending on the underlying study population, mapping of imprinted genes can be conducted in plants, animals, or humans. While the goal is the same, the design and analysis may differ substantially among populations because of the nature of the mating design. In plants and small animals such as the mouse, generating a segregation population by crossing two inbred lines with contrasting phenotypes has been a standard procedure in QTL mapping studies. With inbred line species, a natural choice is the reciprocal backcross design, in which the two reciprocal heterozygotes Aa and aA can be distinguished; testing imprinting then amounts to testing the mean difference of the two reciprocal heterozygotes, i.e., H0: μAa = μaA (see 79). In an F2 inbred line design, the two reciprocal heterozygotes cannot be distinguished in the segregation population, since both F1 parents have the same genotype. Cui et al. (78) recently proposed a new mapping strategy that incorporates sex-specific recombination information into the mapping framework in order to distinguish the distributions of the two reciprocal heterozygotes. The method uses the sex-specific recombination rates in male and female parents, which are known for some species. For example, averaged over the entire genome, the female-to-male recombination rate ratio is 1.4:1 in the dog (87) and the pig (88), and 1.25:1 in the mouse (89). For outbred plant and animal species, imprinting can be tested following traditional QTL mapping analysis in outbred populations, modified to include an imprinting effect (76, 77, 90). In humans, designs for mapping imprinted genes follow the same strategy as a regular family-based linkage study: the IBD-based variance components method can be modified to assess the imprinting effect.
To take parent-of-origin effects into account, the additive main QTL effect Q given in model [1] can be partitioned into two parts, the additive effects of the alleles inherited from the maternal and paternal parents, denoted $Q_m$ and $Q_f$, respectively, resulting in the modified imprinting model

$$y_{ik} = \mu_k + Q_{ikm} + Q_{ikf} + G_{ik} + e_{ik}, \quad k = 1, \ldots, K;\; i = 1, \ldots, n_k,$$

where $Q_{ikm} \sim N(0, \sigma_{qm}^2)$ and $Q_{ikf} \sim N(0, \sigma_{qf}^2)$ (72). All other terms are defined as in model [1]. Correspondingly, the variance–covariance matrix among sib-pairs given in [2] can be partitioned as

$$\Sigma_k = \Pi_{qm|k}\,\sigma_{qm}^2 + \Pi_{qf|k}\,\sigma_{qf}^2 + \Pi_g\,\sigma_g^2 + I\,\sigma_e^2,$$
where $\Pi_{qm|k}$ and $\Pi_{qf|k}$ are matrices of the proportions of marker alleles shared IBD that are derived from the mother and the father in the kth family, respectively (72). The QTL effect can be tested via $H_0: \sigma_{qm}^2 = \sigma_{qf}^2 = 0$, and testing imprinting amounts to testing $H_0^{imp}: \sigma_{qm}^2 = \sigma_{qf}^2$. The likelihood ratio test (LRT) can be applied; the test statistic asymptotically follows a mixture of chi-square distributions under the null of no QTL effect (72), and a $\chi_1^2$ distribution under the null of no imprinting when testing $H_0^{imp}$. With a family-based design, the TDT can also be applied, with modification, to test imprinting (91, 92).
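Given the maximized log-likelihoods of the full model and of the model constrained to equal maternal and paternal variance components, the imprinting LRT reduces to a one-line calculation. The sketch below (function name ours) uses the fact that the survival function of $\chi_1^2$ is $\mathrm{erfc}(\sqrt{x/2})$:

```python
import math

def imprinting_lrt(loglik_full, loglik_equal_var):
    """LRT for imprinting: H0 is sigma_qm^2 == sigma_qf^2.

    loglik_full:      maximized log-likelihood with separate maternal
                      and paternal QTL variance components
    loglik_equal_var: maximized log-likelihood with the two components
                      constrained to be equal
    Returns (LRT statistic, asymptotic chi^2_1 p-value).
    """
    lrt = 2.0 * (loglik_full - loglik_equal_var)
    # Survival function of chi-square with 1 df: P(X >= x) = erfc(sqrt(x/2))
    p = math.erfc(math.sqrt(max(lrt, 0.0) / 2.0))
    return lrt, p
```

For example, a log-likelihood improvement of about 1.92 yields an LRT statistic near 3.84, the 5% critical value of $\chi_1^2$. Note that this simple formula applies only to the imprinting test; the test of no QTL effect involves a boundary constraint and hence the mixture chi-square null mentioned above.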
4. Designs in eQTL Mapping

Genetic mapping and gene expression analysis have been two traditional means of identifying potential genetic regions or variants underlying various traits of interest. Statistical interval mapping dates back to the seminal work of Lander and Botstein (1), while gene expression analysis with microarray techniques started in the late 1990s. Radical breakthroughs in biotechnology have made it possible to survey genomewide gene expression profiles on the scale of thousands of genes. The two endeavors have traditionally been pursued separately. Recently, however, studies showed that mRNA levels for many genes are heritable and thus can be treated as quantitative traits for mapping purposes (93–95). By integrating genetic markers with transcriptional profiles, a procedure called expression QTL (eQTL) mapping can provide much more information about the genetic architecture of gene expression and eventually help us understand the genetic etiology of complex traits. The integrated analysis was termed eQTL mapping by Schadt et al. (95) and genetical genomics analysis by Jansen and Nap (96). One major advantage of eQTL mapping, which uses gene expression as the phenotype, is that it can help identify novel regulators, i.e., cis-regulators or trans-regulators. Cis-acting eQTLs are caused by genomic sequence variants that reside within or close to the gene being regulated and hence are attractive candidate genes for (patho)physiological QTLs mapped to the same location (97). Trans-regulated eQTLs are often located in regions remote from the tested genes and reflect remote gene regulation. If many genes map to the same trans-acting locus, this suggests coordinated regulation of many genes by a single "master regulator." It is probable that master regulators of gene expression are key control points in gene networks whose dysregulation leads to complex phenotypes such as disease
(98). Since its inception, genetical genomics analysis has been broadly applied, for instance, to identify novel regulators or regulatory hotspots (95, 97, 99–103) and to construct gene networks (104–107). Most eQTL mapping studies follow the traditional QTL mapping strategy of treating each gene expression profile as a trait and repeating a genomewide linkage scan for every expression profile. Figure 6.2 displays a basic design framework for a typical eQTL mapping study. The first design issue in an eQTL mapping study is to choose an appropriate mapping population, which can be a backcross (e.g., 108), F2 (e.g., 95), RIL (e.g., 109), or human population (e.g., 99), depending on the specific disease or trait under study. The second issue is the cost of gene expression profiling. Even though the cost of obtaining genomewide expression profiles has fallen considerably, most labs still cannot afford a large number of array samples. Efficient design is needed to minimize costs while maintaining reasonable power. Issues in gene expression microarray design have flourished in the literature; Rosa et al. (110) recently gave a comprehensive review of gene expression designs for eQTL mapping studies.
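The "each expression trait against each marker" strategy described above amounts to a nested loop of single-marker tests. The sketch below is a deliberately simplified illustration of one such scan for a single expression trait, using simple linear regression and a large-sample normal approximation for the slope test; the function name and data layout are ours, and a real analysis would use interval mapping or mixed models rather than raw marker regression.

```python
import math
from statistics import mean

def marker_regression_pvalues(expr, markers):
    """Single-marker regression scan for one expression trait.

    expr:    list of expression values, one per individual
    markers: dict mapping marker name -> list of genotype codes (0/1/2)
    Returns a dict mapping marker name -> two-sided p-value for the
    regression slope, via a large-sample normal approximation.
    """
    n = len(expr)
    my = mean(expr)
    pvalues = {}
    for name, g in markers.items():
        mx = mean(g)
        sxx = sum((x - mx) ** 2 for x in g)
        sxy = sum((x - mx) * (y - my) for x, y in zip(g, expr))
        beta = sxy / sxx  # least-squares slope of expression on genotype
        resid = [y - my - beta * (x - mx) for x, y in zip(g, expr)]
        s2 = sum(r * r for r in resid) / (n - 2)  # residual variance
        se = math.sqrt(s2 / sxx)
        z = beta / se if se > 0 else math.inf
        # Two-sided normal p-value: P(|Z| >= z) = erfc(|z| / sqrt(2))
        pvalues[name] = math.erfc(abs(z) / math.sqrt(2))
    return pvalues
```

In a full genetical genomics analysis this scan is repeated for thousands of expression traits, which is what makes the multiple-testing and cost issues discussed in this section so acute.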
Fig. 6.2. A typical eQTL mapping study starts with a cross of two inbred lines (a) to generate a segregation population (b), which could be a backcross, F2, or RIL. Individuals in the segregation population are then genotyped (d), and their expression profiles are obtained with microarray technology (c). The combination of expression profiling and molecular markers in a QTL mapping framework makes it possible to identify influential genes and gene products. Figure adapted from Jansen and Nap (96).
The analysis of an eQTL mapping study depends largely on the mapping population used. With greatly reduced genotyping costs, genomewide SNP genotyping will become a standard means of generating genomewide polymorphism profiles. Whether focused on single SNPs, haplotypes, or genes, the development of new analytical methods will be driven largely by knowledge of
the LD pattern across the genome in different species. For complex human diseases, eQTL mapping studies also face all the design and analysis issues that challenge traditional linkage or association studies. It should be noted that an eQTL mapping study often costs more, since it requires both genomewide genotyping and gene expression profiling. With limited financial resources, sample size is certainly an issue: no study of sample size calculation in eQTL mapping has been done so far, and more effort should be devoted to this question. We are optimistic that eQTL mapping will open a new framework for disease–gene identification and will help identify novel drug targets for personalized medicine, one of the goals of the Human Genome Project.
5. Summary

This chapter gives a brief review of the design and analysis of genetic data, emphasizing human genetic linkage and association studies, genetic mapping of genomic imprinting, and eQTL mapping analysis. With advanced genotyping technology, genomewide SNP markers are now routinely generated, and the design and analysis of genetic data have shifted from traditional linkage scans with low-resolution markers such as microsatellites to high-density, large-scale genomewide SNP markers. This shift brings prospects but also challenges for both design and analysis in disease–gene hunting. Even though GWAS have been widely pursued in recent years, family-based linkage analysis still has merit in many situations (36). There are no unified criteria for which design should be preferred. For an association study, the choice between a population-based case-control design and a family-based design depends largely on the nature of the disease inheritance pattern, financial resources, and the availability of a test population. As more and more biological evidence indicates a role for genomic imprinting in the development of an organism, identification of imprinted genes should also be considered at the design stage of linkage or association studies. The integration of gene expression analysis and genetic mapping, termed eQTL mapping or genetical genomics analysis, will enrich our understanding of the genetic etiology of complex traits and offers prospects for the development of personalized medicine.
Acknowledgments This work was supported in part by NSF grants DMS-0707031 and DMS-0540745.
References

1. Lander, E.S., and Botstein, D. (1989) Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121, 185–199.
2. Broman, K. (2001) Review of statistical methods for QTL mapping in experimental crosses. Lab Anim. 30, 44–52.
3. Burt, D.W. (2002) A comprehensive review on the analysis of QTL in animals. Trends Genet. 18, 488.
4. Wu, R.L., Casella, G., and Ma, C.-X. (2007) Statistical Genetics of Quantitative Traits: Linkage, Maps and QTL. Springer, New York.
5. Risch, N.J. (1990) Linkage strategies for genetically complex traits. I. Multilocus models. Am. J. Hum. Genet. 46, 222–228.
6. Ott, J. (1991) Analysis of Human Genetic Linkage. Johns Hopkins University Press, Baltimore.
7. Kruglyak, L., Daly, M.J., Reeve-Daly, M.P., and Lander, E.S. (1996) Parametric and nonparametric linkage analysis: a unified multipoint approach. Am. J. Hum. Genet. 58, 1347–1363.
8. Haseman, J.K., and Elston, R.C. (1972) The investigation of linkage between a quantitative trait and a marker locus. Behav. Genet. 2, 3–19.
9. Shete, S., Jacobs, K.B., and Elston, R.C. (2003) Adding further power to the Haseman and Elston method for detecting linkage in larger sibships: Weighting sums and differences. Hum. Hered. 55, 79–85.
10. Wang, T., and Elston, R.C. (2005) Two-level Haseman-Elston regression for general pedigree data analysis. Genet. Epidemiol. 29, 12–22.
11. Amos, C.I. (1994) Robust variance-components approach for assessing genetic linkage in pedigrees. Am. J. Hum. Genet. 54, 535–543.
12. Williams, J.T., and Blangero, J. (1999) Power of variance component linkage analysis to detect quantitative trait loci. Ann. Hum. Genet. 63, 545–563.
13. Pong-Wong, R., George, A.W., Woolliams, J.A., and Haley, C.S. (2001) A simple and rapid method for calculating identity-by-descent matrices using multiple markers. Genet. Sel. Evol. 33, 453–471.
14. Abecasis, G.R., Cherny, S.S., Cookson, W.O., and Cardon, L.R. (2002) Merlin: rapid analysis of dense genetic maps using sparse gene flow trees. Nat. Genet. 30, 97–101.
15. Almasy, L., and Blangero, J. (1998) Multipoint quantitative-trait linkage analysis in general pedigrees. Am. J. Hum. Genet. 62, 1198–1211.
16. Boehnke, M. (1994) Limits of resolution of genetic linkage studies: Implications for the positional cloning of human disease genes. Am. J. Hum. Genet. 55, 379–390.
17. Cardon, L.R., and Bell, J.I. (2001) Association study designs for complex diseases. Nat. Rev. Genet. 2, 91–99.
18. Gibson, G., and Muse, S. (2001) A Primer of Genome Science. Sinauer, Sunderland, MA.
19. Zheng, G., Freidlin, B., Li, Z., and Gastwirth, J.L. (2003) Choice of scores in trend tests for case-control studies of candidate-gene associations. Biometrical J. 45, 335–348.
20. Song, K., and Elston, R.C. (2006) A powerful method of combining measures of association and Hardy-Weinberg disequilibrium for fine-mapping in case-control studies. Stat. Med. 25, 105–126.
21. Hoh, J., Wille, A., and Ott, J. (2001) Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res. 11, 269–293.
22. Zheng, G., Freidlin, B., and Gastwirth, J.L. (2006) Comparison of robust tests for genetic association using case-control studies. IMS Lecture Notes–Monograph Series 49, 253–265.
23. Schaid, D.J., Rowland, C.M., Tines, D.E., Jacobson, R.M., and Poland, G.A. (2002) Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am. J. Hum. Genet. 70, 425–434.
24. Epstein, M.P., and Satten, G.A. (2003) Inference on haplotype effects in case-control studies using unphased genotype data. Am. J. Hum. Genet. 73, 1316–1329.
25. Lake, S.L., Lyon, H., Tantisira, K., Silverman, E.K., Weiss, S.T., Laird, N.M., and Schaid, D.J. (2003) Estimation and tests of haplotype-environment interaction when linkage phase is ambiguous. Hum. Hered. 55, 56–65.
26. Cordell, H.J., Barratt, B.J., and Clayton, D.G. (2004) Case/pseudocontrol analysis in genetic association studies: A unified framework for detection of genotype and haplotype associations, gene-gene and gene-environment interactions, and parent-of-origin effects. Genet. Epidemiol. 26, 167–185.
27. Spinka, C., Carroll, R.J., and Chatterjee, N. (2005) Analysis of case-control studies of genetic and environmental factors with missing genetic information and
haplotype-phase ambiguity. Genet. Epidemiol. 29, 649–659.
28. McGinnis, R. (2000) General equations for Pt, Ps, and the power of the TDT and the affected-sib-pair test. Am. J. Hum. Genet. 67, 1340–1347.
29. Risch, N.J. (2000) Searching for genetic determinants in the new millennium. Nature 405, 847–856.
30. Pfeiffer, R.M., and Gail, M.H. (2003) Sample size calculations for population- and family-based case-control association studies on marker genotypes. Genet. Epidemiol. 25, 136–148.
31. Menashe, I., Rosenberg, P.S., and Chen, B.E. (2008) PGA: Power calculator for case-control genetic association analyses. BMC Genet. 9, 36.
32. Zheng, G., Tian, X., and ACCESS Research Group (2006) Robust trend tests for genetic association using matched case-control design. Stat. Med. 25, 3160–3173.
33. Pritchard, J.K., and Donnelly, P. (2001) Case-control studies of association in structured or admixed populations. Theor. Popul. Biol. 60, 227–237.
34. Devlin, B., and Roeder, K. (1999) Genomic control for association studies. Biometrics 55, 997–1004.
35. Devlin, B., Roeder, K., and Wasserman, L. (2001) Genomic control, a new approach to genetic-based association studies. Theor. Popul. Biol. 60, 155–166.
36. Laird, N.M., and Lange, C. (2006) Family-based designs in the age of large-scale gene-association studies. Nat. Rev. Genet. 7, 385–394.
37. Spielman, R.S., McGinnis, R.E., and Ewens, W.J. (1993) Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am. J. Hum. Genet. 52, 506–516.
38. Hirschhorn, J.N., and Daly, M.J. (2005) Genome-wide association studies for common diseases and complex traits. Nat. Rev. Genet. 6, 95–108.
39. Schaid, D.J. (1999) Likelihoods and TDT for the case-parents design. Genet. Epidemiol. 16, 250–260.
40. Martin, E.R., Monks, S.A., Warren, L.L., and Kaplan, N.L. (2000) A test for linkage and association in general pedigrees: The pedigree disequilibrium test. Am. J. Hum. Genet. 67, 146–154.
41. Abecasis, G.R., Cardon, L.R., and Cookson, W.O. (2000) A general test of association for quantitative traits in nuclear families. Am. J. Hum. Genet. 66, 279–292.
42. Gordon, D., Heath, C., Liu, X., and Ott, J. (2001) A transmission/disequilibrium test that allows for genotyping errors in the analysis of single-nucleotide polymorphism data. Am. J. Hum. Genet. 69, 371–380.
43. Gordon, D., Haynes, C., Johnnidis, C., Patel, S.B., Bowcock, A.M., and Ott, J. (2004) A transmission disequilibrium test for general pedigrees that is robust to the presence of random genotyping errors and any number of untyped parents. Eur. J. Hum. Genet. 12, 752–761.
44. Yang, Y., Wise, C.A., Gordon, D., and Finch, S.J. (2008) A family-based likelihood ratio test for general pedigree structures that allows for genotyping error and missing data. Hum. Hered. 66, 99–110.
45. Ioannidis, J.P. (2003) Genetic associations: False or true? Trends Mol. Med. 9, 135–138.
46. Evangelou, E., Trikalinos, T.A., Salanti, G., and Ioannidis, J.P.A. (2006) Family-based versus unrelated case-control designs for genetic associations. PLoS Genet. 2, e123. doi:10.1371/journal.pgen.0020123.
47. Ackerman, H., Usen, S., Jallow, M., et al. (2005) A comparison of case-control and family-based association methods: The example of sickle-cell and malaria. Ann. Hum. Genet. 69, 559–565.
48. Epstein, M.P., Veal, C.D., Trembath, R.C., et al. (2005) Genetic association analysis using data from triads and unrelated subjects. Am. J. Hum. Genet. 76, 592–608.
49. Weinberg, C.R., and Umbach, D.M. (2005) A hybrid design for studying genetic influences on risk of diseases with onset early in life. Am. J. Hum. Genet. 77, 627–636.
50. Hunter, D.J., Kraft, P., Jacobs, K.B., et al. (2007) A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat. Genet. 39, 870–874.
51. Samani, N.J., Erdmann, J., Hall, A.S., et al., WTCCC and the Cardiogenics Consortium (2007) Genomewide association analysis of coronary artery disease. N. Engl. J. Med. 357, 443–453.
52. Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678.
53. Yeager, M., Orr, N., Hayes, R.B., et al. (2007) Genome-wide association study of prostate cancer identifies a second risk locus at 8q24. Nat. Genet. 39, 645–649.
54. Cui, Y.H., Kang, G.L., Sun, K.L., Romero, R., Qian, M., and Fu, W.J. (2008) Gene-centric genomewide association study via entropy. Genetics 179, 637–650.
55. Satagopan, J.M., Verbel, D.A., Venkatraman, E.S., Offit, K.E., and Begg, C.B. (2002) Two-stage designs for gene-disease association studies. Biometrics 58, 163–170.
56. Satagopan, J.M., and Elston, R.C. (2003) Optimal two-stage genotyping in population-based association studies. Genet. Epidemiol. 25, 149–157.
57. Satagopan, J.M., Venkatraman, E.S., and Begg, C.B. (2004) Two-stage designs for gene-disease association studies with sample size constraints. Biometrics 60, 589–597.
58. Thomas, D.C., Xie, R., and Gebregziabher, M. (2004) Two-stage sampling designs for gene association studies. Genet. Epidemiol. 27, 401–414.
59. Wang, H., Thomas, D.C., Pe'er, I., and Stram, D.O. (2006) Optimal two-stage genotyping designs for genome-wide association scans. Genet. Epidemiol. 30, 356–368.
60. Zuo, Y., Zou, G., and Zhao, H. (2006) Two-stage designs in case-control association analysis. Genetics 173, 1747–1760.
61. Skol, A.D., Scott, L.J., Abecasis, G.R., and Boehnke, M. (2007) Optimal designs for two-stage genome-wide association studies. Genet. Epidemiol. 31, 776–778.
62. Prentice, R.L., Pettinger, M., and Anderson, G.L. (2005) Statistical issues arising in the Women's Health Initiative. Biometrics 61, 899–941.
63. Elston, R., Lin, D.Y., and Geng, Z. (2007) Multistage sampling for genetic studies. Annu. Rev. Genomics Hum. Genet. 8, 327–342.
64. Skol, A.D., Scott, L.J., Abecasis, G.R., and Boehnke, M. (2006) Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat. Genet. 38, 209–213.
65. Zuo, Y., Zou, G., Wang, J., Zhao, H., and Liang, H. (2008) Optimal two-stage design for case-control association analysis incorporating genotyping error. Ann. Hum. Genet. 72, 375–387.
66. Lin, D.Y. (2006) Evaluating statistical significance in two-stage genomewide association studies. Am. J. Hum. Genet. 78, 505–509.
67. Pfeifer, K. (2000) Mechanisms of genomic imprinting. Am. J. Hum. Genet. 67, 777–787.
68. Alleman, M., and Doctor, J. (2000) Genomic imprinting in plants: Observations and evolutionary implications. Plant Mol. Biol. 43, 147–161.
69. Falls, J.G., Pulford, D.J., Wylie, A.A., and Jirtle, R.L. (1999) Genomic imprinting: Implications for human disease. Am. J. Pathol. 154, 635–647.
70. Jeon, J.-T., Carlborg, O., Tornsten, A., et al. (1999) A paternally expressed QTL affecting skeletal and cardiac muscle mass in pigs maps to the IGF2 locus. Nat. Genet. 21, 157–158.
71. Tuiskula-Haavisto, M., de Koning, D.J., Honkatukia, M., Schulman, N.F., Maki-Tanila, A., and Vilkki, J. (2004) Quantitative trait loci with parent-of-origin effects in chicken. Genet. Res. 84, 57–66.
72. Hanson, R.L., Kobes, S., Lindsay, R.S., and Knowler, W.C. (2001) Assessment of parent-of-origin effects in linkage analysis of quantitative traits. Am. J. Hum. Genet. 68, 951–962.
73. Knapp, M., and Strauch, K. (2004) Affected-sib-pair test for linkage based on constraints for identical-by-descent distributions corresponding to disease models with imprinting. Genet. Epidemiol. 26, 273–285.
74. Shete, S., and Amos, C.I. (2002) Testing for genetic linkage in families by a variance-components approach in the presence of genomic imprinting. Am. J. Hum. Genet. 70, 751–757.
75. Shete, S., Zhou, X., and Amos, C.I. (2003) Genomic imprinting and linkage test for quantitative trait loci in extended pedigrees. Am. J. Hum. Genet. 73, 933–938.
76. de Koning, D.-J., Rattink, A.P., Harlizius, et al. (2000) Genome-wide scan for body composition in pigs reveals important role of imprinting. Proc. Natl. Acad. Sci. USA 97, 7947–7950.
77. de Koning, D.-J., Bovenhuis, H., and van Arendonk, J.A.M. (2002) On the detection of imprinted quantitative trait loci in experimental crosses of outbred species. Genetics 161, 931–938.
78. Cui, Y.H., Lu, Q., Cheverud, J.M., Littell, R.C., and Wu, R.L. (2006) Model for mapping imprinted quantitative trait loci in an inbred F2 design. Genomics 87, 543–551.
79. Cui, Y.H., Cheverud, J.M., and Wu, R.L. (2007) A statistical model for dissecting genomic imprinting through genetic mapping. Genetica 130, 227–239.
80. Cui, Y.H., Li, S.Y., and Li, G.X. (2008) Functional mapping imprinted quantitative trait loci underlying developmental characteristics. Theor. Biol. Med. Model. 6, 5.
81. Tycko, B., and Morison, I.M. (2002) Physiological functions of imprinted genes. J. Cell Physiol. 192, 245–258.
82. Constancia, M., Kelsey, G., and Reik, W. (2004) Resourceful imprinting. Nature 432, 53–57.
83. Isles, A.R., and Holland, A.J. (2005) Imprinted genes and mother-offspring interactions. Early Hum. Dev. 81, 73–77.
84. Spencer, H.G. (2002) The correlation between relatives on the supposition of genomic imprinting. Genetics 161, 411–417.
85. Naumova, A.K., and Croteau, S. (2004) Mechanisms of epigenetic variation: polymorphic imprinting. Curr. Genomics 5, 417–429.
86. Sandovici, I., Kassovska-Bratinova, S., Loredo-Osti, J.C., et al. (2005) Interindividual variability and parent of origin DNA methylation differences at specific human Alu elements. Hum. Mol. Genet. 14, 2135–2143.
87. Neff, M.W., Broman, K.W., Mellersh, C.S., Ray, K., Acland, G.M., Aguirre, G.D., Ziegle, J.S., Ostrander, E.A., and Rine, J. (1999) A second-generation genetic linkage map of the domestic dog, Canis familiaris. Genetics 151, 803–820.
88. Marklund, L., Moller, M.J., Hoyheim, B., et al. (1996) A comprehensive linkage map of the pig based on a wild pig–Large White intercross. Anim. Genet. 27, 255–269.
89. Dietrich, W.F., Miller, J., Steen, R., et al. (1996) A comprehensive genetic map of the mouse genome. Nature 380, 149–152.
90. Knott, S.A., Marklund, L., Haley, C.S., et al. (1998) Multiple marker mapping of quantitative trait loci in a cross between outbred wild boar and Large White pigs. Genetics 149, 1069–1080.
91. Hu, Y.Q., Zhou, J.Y., and Fung, W.K. (2007) An extension of the transmission disequilibrium test incorporating imprinting. Genetics 175, 1489–1504.
92. Hu, Y.Q., Zhou, J.Y., Sun, F., and Fung, W.K. (2007) The transmission disequilibrium test and imprinting effects test based on case-parent pairs. Genet. Epidemiol. 31, 273–287.
93. Brem, R.B., Yvert, G., Clinton, R., and Kruglyak, L. (2002) Genetic dissection of transcriptional regulation in budding yeast. Science 296, 752–755.
94. Cheung, V.G., Conlin, L.K., Weber, T.M., et al. (2003) Natural variation in human gene expression assessed in lymphoblastoid cells. Nat. Genet. 33, 422–425.
95. Schadt, E.E., Monks, S.A., Drake, T.A., et al. (2003) Genetics of gene expression surveyed in maize, mouse and man. Nature 422, 297–302.
96. Jansen, R.C., and Nap, J.P. (2001) Genetical genomics: the added value from segregation. Trends Genet. 17, 388–391.
97. Hubner, N., Wallace, C.A., Zimdahl, H., et al. (2005) Integrated transcriptional profiling and linkage analysis for identification of genes underlying disease. Nat. Genet. 37, 243–253.
98. Yvert, G., Brem, R.B., Whittle, J., Akey, J.M., Foss, E., Smith, E.N., Mackelprang, R., and Kruglyak, L. (2003) Trans-acting regulatory variation in Saccharomyces cerevisiae and the role of transcription factors. Nat. Genet. 35, 57–64.
99. Morley, M., Molony, C.M., Weber, T.M., Devlin, J.L., Ewens, K.G., Spielman, R.S., and Cheung, V.G. (2004) Genetic analysis of genome-wide variation in human gene expression. Nature 430, 743–747.
100. Brem, R.B., and Kruglyak, L. (2005) The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proc. Natl. Acad. Sci. 102, 1572–1577.
101. Bystrykh, L., Weersing, E., Dontje, et al. (2005) Uncovering regulatory pathways that affect hematopoietic stem cell function using "genetical genomics". Nat. Genet. 37, 225–232.
102. Lan, H., Chen, M., Flowers, J.B., et al. (2006) Combined expression trait correlations and expression quantitative trait locus mapping. PLoS Genet. 2, e6.
103. Wu, C., Delano, D.L., Mitro, N., et al. (2008) Gene set enrichment in eQTL data identifies novel annotations and pathway regulators. PLoS Genet. 4, e1000070.
104. Bing, N., and Hoeschele, I. (2005) Genetical genomics analysis of a yeast segregant population for transcription network inference. Genetics 170, 533–542.
105. Chesler, E.J., Lu, L., Shou, S., et al. (2005) Complex trait analysis of gene expression uncovers polygenic and pleiotropic networks that modulate nervous system function. Nat. Genet. 37, 233–242.
106. Li, H., Lu, L., Manly, K.F., Chesler, E.J., Bao, L., Wang, J., Zhou, M., Williams, R.W., and Cui, Y. (2005) Inferring gene transcriptional modulatory relations: a genetical genomics approach. Hum. Mol. Genet.
14, 1119–1125. 107. Zhu, J., Wiener, M., Zhang, C., Fridman, A., Minch, E., Lum, P.Y., Sachs, J.R., and Schadt, E.E. (2007) Increasing the power to detect causal associtions by combing genotypic and expression data in segregating populations. PloS Comp. Biol. 3, e69.
242
Cui et al.
108. Derome, N., Bougas, B., Rogers, S.M., R.W., and St Clair, D.A. (2007) Global Whiteley, A., Labbe, A., Laroche, J., and eQTL mapping reveals the complex genetic Bernatchez, L. (2008) Pervasive sex-linked architecture of transcript-level variation in effects on transcription regulation as revealed Arabidopsis. Genetics 175, 1441–1450. by eQTL mapping in lake whitefish species 110. Rosa, G.J., de Leon, N., and Rosa, A.J. pairs (Coregonus sp, Salmonidae). Genetics (2006) Review of microarray experimental 179, 1903–1917. design strategies for genetical genomics stud109. West, M.A., Kim., K., Kliebenstein., D.J., van ies. Physiol. Genomics 28, 15–23. Leeuwen, H., Michelmore, R.W., Doerge,
Chapter 7

Introduction to Epigenomics and Epigenome-Wide Analysis

Melissa J. Fazzari and John M. Greally

Abstract

Epigenetics is the study of heritable changes other than those encoded in the DNA sequence. Cytosine methylation of DNA at CpG dinucleotides is the best-studied epigenetic phenomenon, although epigenetic changes also encompass mechanisms other than DNA methylation, such as covalent histone modifications, micro-RNA interactions, and chromatin remodeling complexes. Methylation changes, both global and gene specific, have been observed to be associated with disease, particularly cancer. This chapter begins with a general overview of epigenomics and then focuses on understanding and analyzing genome-wide cytosine methylation data. There are many microarray-based techniques available to measure cytosine methylation across the genome, as well as gold-standard techniques based on sequencing bisulfite-converted DNA, which are used to measure methylation in a smaller, more targeted set of loci. We provide an overview of many of the current technologies – their advantages, limitations, and recent improvements. Regardless of which technology is used, the goal is to produce a set of methylation measurements that are highly consistent with the true methylation levels of the corresponding set of CpG dinucleotides. Identifying all loci with aberrant hypermethylation or hypomethylation in disease, or in natural processes such as aging, requires the comparison of methylation levels across many samples. In such studies, the development of methylation-based diagnostic tools may be of interest, potentially to be used in early disease detection strategies based on a set of sentinel loci. In addition, the identification of loci with potentially reversible methylation events may lead to new therapeutic options.
Given the vast number of measurable sites, prioritization of candidate loci is an important and complex issue and rests on a foundation of appropriate statistical testing and summarization. Coupled with statistical estimates of importance, the genomic context of each measured locus may offer important information about the mechanisms by which epigenetic changes impact disease and allows further refinement of candidate loci. We conclude this chapter by identifying issues in building methylation-based models for prediction and potential directions for further statistical research in epigenetics. Key words: Epigenomics, methylation, statistical epigenetics, CpG islands, quantile–quantile plots, prioritization.
H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_7, © Springer Science+Business Media, LLC 2010
1. Introduction

The word “epigenetic” literally means “on genetic” sequence. The term has evolved to include any process that alters gene activity without changing the DNA sequence and leads to modifications that can be transmitted to daughter cells. Epigenetics typically refers to the study of a single locus or set of loci, while epigenomics refers to the global study of epigenetic changes across the entire genome. For many reasons, both technical and biological, the most commonly studied epigenetic event is cytosine methylation. In light of this focus, this chapter is primarily concerned with the treatment and analysis of methylation data in humans. However, this phenomenon represents only one epigenetic process; epigenetics also involves the study of events other than DNA methylation. One such important process is histone modification, which is thought to play a vital role in the enhancement or repression of transcription. Most of the time, genes are locked away within a complex packaging (chromatin) of proteins (histones). In general, tightly compacted chromatin tends to be shut down, or not expressed, while more open chromatin is functional, or expressed. The long tails protruding from histones are subject to chemical modification, such as acetylation or methylation, and are also affected by some forms of RNA, such as small interfering RNAs (1). These modifications alter chromatin structure, influencing gene expression. Acting on or interacting with a particular chromatin structure, DNA methylation involves the addition of a methyl group to a cytosine base, specifically a cytosine followed by a guanine (a CpG dinucleotide). Such additions to the DNA are central to genomic imprinting and X chromosome inactivation, as these methylation events are associated with the silencing of gene expression. The mechanism regulating the establishment of methylation is not fully known.
While the enzymes that add the methyl group to cytosines in the context of the CpG dinucleotide are recognized (2), how these DNA methyltransferases are directed to specific sequences in different cell states remains poorly understood. How the methylation mark is subsequently read appears to involve two mechanisms. First, the methylcytosine can act as a recognition site for methyl-binding domain (MBD) proteins (3). These proteins tend to exist in large macromolecular complexes that influence the local histones, recruiting other proteins that can modify them and leading to local condensation of chromatin structure, thus creating a silenced chromatin state and physically preventing the binding of
transcriptional proteins to the gene (4, 5). Evidence points to an interplay between DNA methylation and histone acetylation, an “epigenetic cross-talk,” in the process of gene transcription and aberrant gene silencing in tumors (6). The second mechanism appears to involve the hindrance of binding of transcription factors to their target sequence when that sequence becomes methylated (7). In both the chromatin-modification and transcription-factor-binding models the outcome is similar: the failure to target transcriptional regulators to a locus and the consequent loss of its activity. DNA methylation changes have been reported to occur early in carcinogenesis, causing events such as silencing of tumor-suppressor genes, and are potentially good indicators of early disease and progression (8). Therefore, many of the studies discussed and referenced in this chapter are based upon methylation profiling in cancer, such as the comparison of tumor versus normal samples. However, the study of DNA methylation is far-reaching, encompassing diverse areas such as neurological disorders (9), aging (10, 11), and the environment (12). Statistical analysis of data generated from the study of methylation utilizes many of the tools already developed for other high-dimensional data sets such as gene expression (4, 13). As with gene expression intensities, normalization and pre-processing of methylation measurements are paramount in interpreting methylation status, and properly separating signal from background noise is highly dependent upon the platform used as well as on DNA sequence characteristics. Given the sheer number of CpG sites in the genome (29 million (4)), there are potentially millions of methylation-based data points available for study. Prioritization, aggregation, and description of statistically significant loci are key to understanding genome-wide patterns, isolating biomarkers, and building predictive methylation profiles with respect to outcome.
New methods for prioritization of loci summarized by these statistical tests are based not only on the observed p-value, but also upon physical, functional, and sequence-based relationships between loci. Before we can properly analyze methylation data, it is important to understand the basics of epigenomics; we therefore begin the chapter with a review of key concepts and of platforms commonly used in the study of cytosine methylation. 1.1. DNA Methylation and CpG Islands
In humans, DNA methylation occurs at the C5 position of cytosines that belong to CpG dinucleotides (i.e., a cytosine directly followed by a guanine). In general, the genomic landscape in mammalian DNA appears to be such that the majority of CpG dinucleotides are methylated (14, 15). Given this methylated state, and the observation that deaminated methylcytosine
is prone to mutation and depletion at a much higher rate, CpG dinucleotides are distributed at much lower frequencies than would be expected from base composition across the genome. However, within a genome that is relatively methylated and depleted of CpGs, there are clusters of CpGs – called CpG islands – with particular biological and functional importance (16). CpG islands are stretches of sequence far more densely populated with CpGs than the rest of the genome. Originally, CpG islands were empirically defined from a sample of sequences generated from HpaII tiny fragments (see Table 7.1 for a glossary of common epigenetic terms). When these fragments were cloned and sequenced, they were observed to be rich in CpG dinucleotides and in G+C mononucleotides, giving rise to the original definition of a CpG island: (G+C) ≥ 0.50, an observed-to-expected CpG ratio ≥ 0.60, and both occurring within a window ≥ 200 bp.
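These original criteria can be checked directly on a sequence window. The following is a minimal sketch in Python; the function name and the plain-string input are illustrative rather than taken from any published tool, and the expected CpG count uses the standard independence approximation (#C × #G) / length.

```python
def is_cpg_island(seq):
    """Check the original CpG island criteria on a sequence window:
    length >= 200 bp, (G+C) fraction >= 0.50, and an
    observed-to-expected CpG ratio >= 0.60, where the expected
    CpG count is (#C * #G) / length."""
    seq = seq.upper()
    n = len(seq)
    if n < 200:
        return False
    c, g = seq.count("C"), seq.count("G")
    gc_fraction = (c + g) / n
    obs_cpg = seq.count("CG")          # observed CpG dinucleotides
    exp_cpg = c * g / n                # expected count under independence
    if exp_cpg == 0:
        return False
    return gc_fraction >= 0.50 and obs_cpg / exp_cpg >= 0.60
```

In practice this test would be slid along the genome in overlapping windows; annotation pipelines often use stricter thresholds to exclude repetitive elements.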
Table 7.1 Glossary of common epigenetic terms

Epigenetics: Heritable and potentially reversible modifications to DNA not involving changes (mutations, deletions) of the genomic sequence itself.

CpG island: A cluster of CpG dinucleotides. Originally defined as (G+C) ≥ 0.50, an observed-to-expected CpG ratio ≥ 0.60, and both occurring within a window ≥ 200 bp. Associated with promoter regions, especially those of housekeeping genes, and believed to be largely unmethylated in normal tissue.

HpaII tiny fragment (HTF): A fragment generated by digestion of genomic DNA at two closely located consecutive HpaII (CCGG) sites. Study of the sequence characteristics of HTFs generated the original CpG island definition.

Methylation: A chemical modification (the addition of a methyl group) of DNA that can be inherited and subsequently removed without changing the DNA sequence; methylation typically occurs at a CpG dinucleotide.

Transposable elements (SINE, LINE): Mobile segments of DNA in the genome capable of inserting themselves (or copies of themselves) into a second site. Methylation of transposable elements (TEs) is thought to protect the genome from transcription and transposition of the elements. Global loss of methylation is associated with genomic instability.

Histones: Proteins which tightly bind DNA. The long tails protruding from histones are subject to epigenetic modification.

CpG island methylator phenotype (CIMP): A set of tumor-suppressor genes and oncogenes whose coordinated methylation or hypomethylation is associated with poor outcome and distinct tumor subtypes.

Global hypomethylation: The loss of methylation genome wide. Associated with genomic instability and consistently observed in many cancers.

Differentially methylated sites: Tissue-specific and disease-related changes in methylation status at sites across the genome. Methylation changes are related to altered patterns of transcription.
In the original study of Gardiner-Garden and Frommer (17), CpG islands were found to be . . . associated with the 5′ ends of all housekeeping genes and many tissue-specific genes, and with the 3′ ends of some tissue-specific genes. A few genes contained both 5′ and 3′ CpG islands, separated by several thousand base-pairs of CpG-depleted DNA.
CpG islands have a strong association with promoters and housekeeping genes; it was therefore proposed that all CpG islands not located in transposable elements are unmethylated, with the exception of imprinted genes and X-inactivated loci (18). Under this hypothesis, CpG islands are largely unmethylated in normal cells, thereby allowing transcription of the associated genes through access of the transcription machinery to the DNA. In certain disease states or with aging, CpG islands have been found to become aberrantly methylated, dysregulating the normal transcriptional activity of the gene (Fig. 7.1). Identifying which loci tend to become methylated with high frequency across tumor samples, or which loci seem to be consistently altered across a
Fig. 7.1. CpG island promoter methylation as a cause of epigenetic dysregulation of gene transcription. Top: the unmethylated state of the promoter region in normal tissue, resulting in normal gene transcription. Bottom: one example of an epigenetic event in a tumor – CpG island hypermethylation at the promoter region of a gene. This hypermethylation may not be consistent across all loci. The silencing of gene expression due to promoter hypermethylation is implicated in cancer development and progression. Examination of promoter hypermethylation across multiple tumor and normal samples will identify epigenetic events implicated in cancer.
variety of related diseases, may provide important information regarding the development of disease, as well as clinically important outcomes such as recurrence, survival, or response to treatment. The focus on CpG island methylation has been central to many studies of cytosine methylation patterns and disease (5, 19–21). Although studies of promoter methylation have yielded many methylation-based profiles that are prognostic for clinical outcome and disease, such promoter-specific approaches may miss important variation occurring in regions such as intergenic sequence and transposable elements. It has been observed that along with hypermethylation of CpG island promoters (14, 21), repetitive sequences (such as LINEs and Alu SINEs) and non-CpG-dense regions that are normally methylated may lose their methylation marks in diseases such as cancer (22, 23). Although specific CpGs may acquire methylation, the overall trend appears to be a gradual loss of methylation genome wide. It has also been shown in some studies that the level of global hypomethylation is predictive of poor prognosis (24). The mechanism by which global DNA hypomethylation impacts prognosis is unknown; however, there are several hypotheses, including genomic instability in regions of aberrant hypomethylation and the influence of transposable elements such as SINEs (Fig. 7.2) causing transcriptional dysregulation of nearby genes (4). Studies based on epigenome-wide approaches to measuring methylation changes in disease will further explore the role of nonpromoter methylation changes. Whether hypomethylation is purely a global change associated with randomly occurring methylation loss, or whether specific regions are prone to losing methylation marks, is still largely unknown.
Fig. 7.2. Transposable element (SINE) hypomethylation as a cause of epigenetic dysregulation of gene transcription. The figure depicts aberrant hypomethylation of the SINE element closely linked to a promoter region. This event is postulated to cause dysregulation in gene transcription regardless of the methylation status of the promoter (21).
2. Measuring Cytosine Methylation in the Genome
Dysregulation of gene transcription leads to disease. Measuring genome-wide gene expression levels has produced many genomic “signatures” that are predictive of outcome, severity of disease, and subtype. As we have previously discussed, methylation of specific CpGs, not just those located within CpG island promoters, may impact transcription of nearby genes. Measuring genome-wide cytosine methylation therefore has a number of distinct advantages compared with gene expression. First, cytosine methylation is a relatively stable characteristic since it is based on DNA, and thus has less biological temporal variation and greater sample stability than RNA. Second, downstream gene expression may vary considerably under normal conditions across samples with equivalent methylation profiles, variation that provides no further discrimination with respect to the condition under study. Third, the identification of aberrantly methylated sites in the genome may provide potential for therapeutic advances. Fourth, recent data suggest that epigenetic changes are involved in the earliest phases of tumorigenesis, and that they may predispose stem/progenitor cells to subsequent genetic and epigenetic changes involved in tumor promotion (25). Methylation events may thus provide ideal biomarkers for molecular diagnostics and early detection of cancer, acting as sentinels that can be measured and are predictive of subsequent proliferative events. In addition, one can tailor microarray-based assays to genomic targets of interest, such as promoter regions, imprinted genes, and differentially methylated regions (DMRs), depending upon the question of interest. Currently, there are many platforms for studying DNA methylation (26–28). There are tens of millions of CpG sites in the genome; however, we cannot readily measure genome-wide methylation directly. DNA methylation marks are lost as a consequence of PCR amplification; therefore, several methods are used to preserve them, including sodium bisulfite conversion (29, 30), immunoprecipitation-based enrichment (31), and methylation-sensitive restriction digestion (32, 33). Broadly, we may divide these methods into locus-specific approaches and genome-wide approaches.
Bisulfite conversion (e.g., Illumina Infinium, Illumina GoldenGate, Sequenom MassARRAY) converts unmethylated cytosines to uracil (and eventually thymine following PCR amplification), while leaving methylated cytosines unconverted (34–36). Bisulfite conversion-based approaches offer single-CpG resolution and are considered the gold-standard technique for measuring methylation (37). Analysis of the PCR product with direct sequencing (38) or mass spectrometry (39) can be used to quantify the amount of methylation at each CpG; however, the cost of such an approach would be prohibitively high if applied to large-scale whole-genome analyses across many samples. While shotgun sequencing of bisulfite-converted DNA using massively parallel sequencing is now possible, the cost of this approach also remains extremely high, and it is subject to major potential sources of artifact (40). Bisulfite methods at
present are typically used to validate first-stage results obtained with genome-wide microarray-based platforms. Methods that measure bisulfite-treated DNA at single CpGs using oligonucleotide arrays (Illumina Infinium or GoldenGate) allow a more cost-effective examination of methylation with high levels of sensitivity and specificity. A potential issue with bisulfite analysis is that it depends on the complete conversion of unmethylated cytosines to uracil, as incomplete conversion will be interpreted as methylated cytosine. Spike-in internal controls of known unmethylated DNA allow an assessment of whether conversion is complete. Methylation-Sensitive Restriction Digestion Assays (e.g., HELP, McrBC) are more cost-effective whole-genome approaches, usually used in the first (screening) stage of analysis and subsequently validated using gold-standard approaches. The HELP assay (32) generates HpaII/MspI ratios that represent the amount of unmethylated HpaII fragments relative to the total population of HpaII fragments. This measurement, presented on the log2 scale, is continuous, with values that typically fall in the range of (–4, +4). McrBC, a robust methylation-dependent restriction enzyme, recognizes methylated recognition sites (33). A ratio of Cy5-labeled intensity relative to control DNA (Cy3) greater than 1.0 is indicative of methylation. However, both methods are limited to the analysis of CpG sites located within the enzyme recognition site(s), for example, HpaII sites in HELP. In Immunoprecipitation-Based Enrichment Assays (e.g., MeDIP), antibodies directed against methylated CpGs are used to enrich DNA in methylated sequences relative to control DNA. The resulting intensity ratio represents a ratio of methylated fragments over the total control, and positive values are interpreted as enrichment for methylation.
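As a toy illustration of the bisulfite-conversion principle underlying the gold-standard assays, the sketch below simulates conversion of a reference sequence under a hypothetical set of methylated CpG positions, then calls per-CpG methylation by comparing the converted read back to the reference. All names are illustrative; real conversion chemistry, PCR, and sequencing error are ignored.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite conversion followed by PCR: unmethylated
    cytosines read out as T, while methylated cytosines (given here
    as a set of positions) remain C. Toy model only."""
    out = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated_positions:
            out.append("T")   # unmethylated C is converted
        else:
            out.append(base)  # methylated C (or any other base) is preserved
    return "".join(out)

def call_methylation(reference, converted):
    """Infer per-CpG methylation: a reference CpG whose C survives
    conversion is called methylated; a C-to-T change is unmethylated."""
    calls = {}
    for i in range(len(reference) - 1):
        if reference[i:i + 2] == "CG":
            calls[i] = (converted[i] == "C")
    return calls
```

For example, converting "ACGTCGA" with only the first CpG methylated yields "ACGTTGA", from which the two CpGs are called methylated and unmethylated, respectively. Incomplete conversion in a real assay corresponds to an unmethylated C that fails to become T, which this caller would misread as methylated, exactly the artifact that spike-in controls are designed to detect.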
Although this method is not constrained to measuring methylation within recognition sites, the drawback is a lack of specificity in CpG-sparse regions due to noise (41). Because there are numerous techniques for measuring methylation, the measurements obtained may be binary, semi-quantitative, or quantitative, and may represent single-CpG methylation levels, an averaged methylation level across multiple CpG sites, or a fragment-level measurement. There is no one-to-one correspondence between many of these methods; therefore, mapping fragments to single CpGs (generated by the gold-standard technique) or to different captured fragments requires computational and biological input. The set of all loci that these methods have in common may be quite small (for example, McrBC will not recognize HpaII sites in which the internal cytosine is methylated); conclusions made using one approach may therefore represent a subset of loci not measured by other techniques, increasing the potential variability across methods. In addition, each technique has its
own sensitivity and specificity, as well as unique characteristics that may make it better suited to one type of study than another (42). Even though the methylation status of any particular CpG is inherently binary, the overall methylation level of a CpG in a sample tends to be more quantitative, for both assay-specific and biological reasons. Partial methylation can occur due to imprinting (one allele is methylated) or due to a mixture of methylated and nonmethylated CpGs across cells within the sample. If the unit of measurement is based on several CpG sites or a summary of an entire CpG island, the resulting measurement may be nonbinary due to varying methylation status across CpG sites, despite the regional correlation of neighboring CpG sites (39). For example, progressive spreading of methylation has been observed in specific CpG islands, with increasing methylation found gradually moving from the boundary of the CpG island toward the transcription start site of genes (43). Promoter regions in general may show highly heterogeneous methylation patterns and are rarely observed to be fully methylated (44). These regional differences in methylation patterns and in overall levels of methylation (for example, 50% vs. 70% methylation) may have large consequences for gene expression, and therefore for disease. There may also be a threshold effect of methylation on gene expression changes. Depending on the technology used, incomplete conversion, amplification issues, and other technology-dependent sources will also introduce variability into methylation measurements. Regardless of the technology used, the statistical methods for the analysis of methylation measurements must be carefully selected to best summarize the results of the experiment, taking into account the study design as well as the correlation structure of the samples and the type of measurement collected. 2.1. Considerations with Microarray-Based Methylation
Irizarry (42) was the first to methodically examine the performance of several commonly used whole-genome methylation assays. Using the Illumina GoldenGate assay as the gold standard, which measures methylation at the individual CpG level (see bisulfite conversion above), he mapped each CpG to one methylation value based on the microarray data. This mapping algorithm differed for each method (for details on each mapping procedure, see Irizarry (42)); however, the end result was that comparisons between the gold-standard measurement and each methylation intensity could be made at the CpG level, across the genome, for all methods. The main assumption of this comparison is that the values measured via Illumina were correct; the conclusions from this study are therefore conditional on the accuracy of the current gold-standard techniques. Several issues were identified in the comparison of HELP, MeDIP, and McrBC, including biases with fragment size and
CpG density. It had been assumed that the methylation intensity produced by each of these methods has a linear relationship to true methylation levels, allowing changes in methylation intensity to be interpreted as true changes in the amount of methylation at a particular site. However, both the strength and the true form of this relationship were observed to be tenuous for many of the methods. The operating characteristics of each assay were heavily impacted by both fragment size and (relatedly) CpG density, resulting in lower sensitivity for HELP in fragments located within CpG-dense regions (which tend to be smaller fragments) and lower sensitivity for MeDIP in CpG-sparse regions. McrBC appeared to be less impacted by CpG density than the other methods, but its correlation with the gold-standard measurement was still only moderate. McrBC fragments larger than 1,500 bp were not sensitive with respect to detecting methylation. Several algorithms have been proposed to improve upon current techniques for measuring methylation across the above platforms. The use of a novel quantile normalization algorithm for HELP alleviated much of the bias associated with fragments < 300 bp or > 1,500 bp (45). Combining neighboring probe-level data to create more stable methylation measurements improved both MeDIP and McrBC. This averaging algorithm was based on the observation that neighboring CpGs tend to be highly correlated (39); therefore, averaging should produce a more stable estimate of local methylation levels. CHARM incorporates an array design that covers more CpG sites and facilitates smoothing of closely neighboring CpGs via a genomic smoothing algorithm. When used in conjunction with McrBC, this design produced better correlations with the gold standard, thus improving the detection of methylated regions.
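The neighbor-averaging idea can be sketched as a simple mean over probes within a fixed genomic window. This is a toy stand-in, not the CHARM algorithm itself: CHARM and related pipelines use weighted genomic smoothing, and the 500 bp window below is an arbitrary choice.

```python
def smooth_methylation(positions, values, window=500):
    """Replace each probe's methylation value with the average of all
    probes within +/- window bp of its genomic position, exploiting
    the correlation of neighboring CpGs to stabilize the estimate."""
    smoothed = []
    for p in positions:
        neighbors = [v for q, v in zip(positions, values)
                     if abs(q - p) <= window]
        smoothed.append(sum(neighbors) / len(neighbors))
    return smoothed
```

Two probes 200 bp apart are averaged together, while an isolated probe keeps its own value; in noisy data this trades a little spatial resolution for a substantially more stable local methylation estimate.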
Another improvement in the processing of methylation data from these arrays takes into account the CpG density bias inherent in most techniques for measuring methylation. HELP indirectly corrects for CpG density by controlling fragment-length bias, since CpG density and fragment length are inversely correlated (45). Pelizzola et al. (46) proposed a statistical algorithm, used in conjunction with MeDIP, that combines neighboring probes for increased stability and integrates the observed sigmoidal relationship between antibody enrichment data and DNA methylation level into the estimation process, providing a model-based estimate of methylation level that correlates better with true methylation levels. Down et al. (41) used a Bayesian deconvolution strategy in conjunction with MeDIP methylation values to estimate absolute methylation levels (those with a linear relationship to, and thus a high level of correlation with, bisulfite sequencing methylation values) across a large range of CpG densities.
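A minimal sketch of such a model-based calibration, assuming a logistic (sigmoidal) enrichment-to-methylation curve. The parameters a and b are hypothetical and would in practice be fit against bisulfite validation data; this is not the actual Pelizzola et al. or Down et al. algorithm, only an illustration of mapping an enrichment score onto an absolute methylation scale.

```python
import math

def calibrate_enrichment(e, a=-2.0, b=4.0):
    """Map a MeDIP-style enrichment score e to an absolute methylation
    estimate in [0, 1] via a logistic curve; a (intercept) and b (slope)
    are hypothetical calibration parameters fit to validation data."""
    return 1.0 / (1.0 + math.exp(-(a + b * e)))
```

The mapping is monotone, so the ranking of loci is preserved, but the compressed tails of the sigmoid mean that differences in raw enrichment near the extremes translate into only small differences in estimated methylation.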
3. Statistical Issues in the Analysis of Methylation
3.1. Association of Differentially Methylated Regions with Sequence Composition
The statistical challenges in the primary analysis of cytosine methylation are similar to those encountered in most large-scale genomic analyses, including proper normalization and pre-processing. The CpG density and fragment-length biases are platform specific, and new methods have been proposed to improve the quantitative measurements obtained (see Section 2.1). Given a normalized and pre-processed data set, the additional analytic challenges stem from high dimensionality and the lack of focused hypotheses when analyzing epigenetic changes on a genome-wide level. Because the unit of interest is a CpG dinucleotide or a cluster of dinucleotides, the total number of representations in the genome can reach into the millions, rather than tens of thousands. The analysis of methylation data may focus on whether only specific types of loci are important, as well as on which sets of loci show consistent changes across samples. Examining specific sequence or positional features of the most informative loci gives us a greater understanding of the mechanisms by which methylation changes can impact gene expression and disease.
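With millions of loci tested, prioritization typically combines a per-locus test statistic with multiple-testing control. A minimal implementation of the Benjamini-Hochberg step-up procedure, the standard way an FDR threshold such as FDR < 0.05 is applied to a vector of p-values, is sketched below (illustrative code, not taken from any cited study).

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: return the (sorted)
    indices of hypotheses rejected at FDR level alpha. The i-th
    smallest p-value is compared against (i/m)*alpha, and all
    hypotheses up to the largest i passing the test are rejected."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            k = rank  # largest rank whose p-value clears its threshold
    return sorted(order[:k])
```

For p-values (0.01, 0.02, 0.03, 0.5) at alpha = 0.05, the first three loci are rejected even though 0.03 exceeds the Bonferroni cutoff of 0.0125, illustrating the extra power of FDR control in large-scale scans.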
Groups of contiguous and statistically significant probes (at the nominal p-value level of 0.001 rather than FDR < 0.05) were classified as tissue differentially methylated regions (T-DMRs). The statistical significance of each region (defined by Δ multiplied by the length of the region) was assessed via a permutation test. Interestingly, they found that 76% of the T-DMRs were located within 2 kb of CpG islands, regions termed CpG island “shores.” Thus differential methylation was not occurring within the areas of highest CpG density (the island itself; only 6% of T-DMRs occur there), but rather in regions of much lower CpG density (the shores) outside of the promoter region. This result is important in supporting the hypothesis that important methylation changes are not
Fazzari and Greally
solely occurring in promoter regions, thus shifting the field away from a CpG island-centric point of view and toward the view that important methylation changes also occur in repetitive sequences and intergenic regions, away from genes. In the same study, the paired samples representing tumor and normal tissue from the same set of patients were examined to find regions that are differentially methylated in cancer (C-DMRs). Using the fold change (Δ) only, the authors found a set of C-DMRs using an approach similar to the T-DMR procedure above. As with T-DMRs, hypomethylated regions were typically not associated with CpG islands (the majority occur at least 3,000 bp from CpG islands), and hypermethylated regions were more likely to be part of a CpG island, overlapping an island, or located within 500 bp of an island. In addition, C-DMRs colocalized with the T-DMR regions, which suggests that epigenetic changes in cancer may cause cells to lose their specificity and behave like other cell types. When integrated with gene expression changes, the authors found that non-CpG island hyper- and hypomethylation was associated with expression differences in the associated genes, providing additional support that non-CpG island methylation changes may be transcriptionally important. The low penetrance of hypermethylation events occurring at promoters, coupled with the potentially high information content existing outside of promoter or CpG island regions, moves us away from gene-centric approaches to studying methylation change, and toward an epigenome-wide analysis that is unbiased with respect to where relevant changes may occur.

3.2. Association of Epigenetic Events with Disease
DNA methylation plays an essential role in the regulation of gene expression during development and differentiation, and in diseases such as multiple sclerosis, diabetes, schizophrenia, and cancer. There have been many high-profile studies of methylation and histone modification, particularly in cancer. That tumor suppressors gain methylation marks in the progression to cancer is well established; however, the precise timing of this change, and what precipitates it, remains unknown. Brock (48) examined methylation markers for predicting early recurrence of stage I lung cancer. Tsou (49) identified a panel of DNA methylation markers for lung adenocarcinoma, and Ogino (24) demonstrated that LINE-1 hypomethylation was associated with higher colon cancer-specific and overall mortality, independent of other tumor characteristics. In other areas, methylation and histone modifications are also being studied. Siegmund (50) examined two psychiatric disease cohorts (Alzheimer’s and schizophrenia) with respect to the methylation status of CpGs located primarily in the promoters of 50 age-related genes. A robust and progressive rise in DNA methylation
levels across the lifespan was observed for 8/50 loci, typically in conjunction with declining levels of the corresponding mRNAs. In the above studies, as in many other studies of the predictive value of methylation markers, the statistical tools used are similar to those traditionally used in the analysis of gene expression studies. If we have outcome data such as survival or recurrence, or have collected multiple tissue or tumor types (for example, if we are comparing a set of normal samples with a set of tumors), we may apply traditional supervised statistical testing to determine which probes or fragments separate the samples with respect to survival time, outcome, or class. Initially, due to the sheer number of loci measured, most studies begin with a locus-by-locus evaluation of the data (the screening phase). The form of the outcome variable dictates the statistical test used. If we want to distinguish between two classes of tumor (for example, stage I vs. stage IV disease) with respect to a continuous methylation marker, we may use a two-sample t-test or its nonparametric analog (see Table 7.2). Similarly, we may model the class variable as a binary outcome, using methylation as a continuous or categorized predictor via logistic regression. Both approaches will identify those loci that can discriminate between the two known classes. For unsupervised problems, where we have multiple tumors of the same apparent type and wish to determine whether there are epigenomic subclasses, clustering methods (such as K-means or hierarchical clustering) that provide
Table 7.2. Statistical methods for the analysis of methylation data

Analysis                 | Example                                                           | Statistical methods                                                                              | Nonparametric methods
Two sample – paired      | Tumor vs. matched normal                                          | Paired t-test                                                                                    | Wilcoxon signed rank
Two sample – independent | Tumor vs. normal; stage I vs. stage IV                            | t-test, logistic regression                                                                      | Wilcoxon–Mann–Whitney
Clustered, > 2 groups    | Three tissues from the same patient                               | Random effects or marginal models with robust standard errors                                    | Friedman’s test
Independent, > 2 groups  | Three treatment groups                                            | ANOVA                                                                                            | Kruskal–Wallis
Survival                 | Overall survival based on CIMP phenotype                          | Kaplan–Meier, log-rank test, Cox proportional hazards model                                      |
Multivariable            | Prediction of cancer status (class 1) vs. normal status (class 0) | Support vector machines; regularized regression (LASSO, ridge, elastic net); DLDA; random forest |
some level of “epigenomic distance” (45) may also be applied. Table 7.2 summarizes some commonly used statistical methods. A complete overview of these methods may be found in most introductory statistics textbooks; a comprehensive overview of these techniques specifically for methylation studies is also available (51).

3.2.1. A Graphical Approach to Examine Whole-Genome Associations
Higher levels of organization, both spatial and by genomic context (such as whether a particular locus overlaps or is contained in a CpG island, gene body, or repetitive sequence), allow us to examine methylation information more globally, combining the observed statistical significance of each methylation marker with the characteristics of the marker itself. Such information may be summarized graphically using a quantile–quantile (Q–Q) plot. A Q–Q plot compares two distributions, yielding a highly descriptive summary of the data; in addition, any systematic biases present in the data are easily uncovered. In many methylation studies, we want to compare the observed quantiles of the test statistic values with the quantiles of the test statistic under the null hypothesis of no differentially methylated loci. This null distribution may be generated through the use of control probes for which no methylation changes are expected across samples, through permutation of the class labels, or by sampling directly from the known distribution of the test statistic under the null. For example, if we use a t-test to test methylation differences between n1 tumors and n2 independent normal tissues, the known distribution of the test statistic under the null is t with df = n1 + n2 − 2. Plotting the observed against the expected distribution of test statistics, we will observe significantly hypomethylated loci as departures from the 45° line in the negative direction of the test statistic. Similarly, hypermethylated loci will be reflected in large departures from the 45° line in the positive direction (see Fig. 7.3).
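A minimal sketch of such a Q–Q comparison on simulated data (the sample sizes, number of loci, and effect size are hypothetical; the null quantiles come directly from the t distribution with n1 + n2 − 2 degrees of freedom, as above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n1, n2 = 8, 8                      # tumors and independent normal tissues
n_loci = 5000

tumor = rng.normal(0.0, 1.0, size=(n_loci, n1))
normal = rng.normal(0.0, 1.0, size=(n_loci, n2))
tumor[:50] += 1.5                  # a few hypermethylated loci in tumor

# Observed two-sample t statistic at each locus
t_obs = stats.ttest_ind(tumor, normal, axis=1).statistic

# Expected quantiles under the null: t with n1 + n2 - 2 df
k = np.arange(1, n_loci + 1)
expected = stats.t.ppf((k - 0.5) / n_loci, df=n1 + n2 - 2)
observed = np.sort(t_obs)

# Departures of `observed` above `expected` at the upper tail flag
# hypermethylated loci; departures below at the lower tail, hypomethylated.
excess = observed - expected
```

Plotting `observed` against `expected` (for example with `matplotlib.pyplot.plot`) and overlaying the 45° line gives the display described in Fig. 7.3.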
Of interest is whether there is an abundance of marginal or significant test statistics in the hypomethylated direction, as this begins to address the question of global hypomethylation: whether such changes occur across the genome on a large scale, randomly at a subset of sites, or targeted to specific regions. By stratifying and subsetting loci based on certain genomic characteristics (such as promoters vs. intergenic regions, or CpG islands vs. CpG shores), we may assess the relative information content available in each type of region. Confidence intervals may be generated by bootstrapping (52).
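One way such a bootstrap might look, as an illustrative sketch rather than the procedure of ref. 52: resample subjects with replacement and recompute the summary of interest, here the mean tumor–normal methylation difference in a region (the data and sample size are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-subject tumor - normal methylation differences for one region
diff = rng.normal(0.3, 0.5, size=20)

# Percentile bootstrap confidence interval for the mean difference
n_boot = 2000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    resample = rng.choice(diff, size=diff.size, replace=True)
    boot_means[b] = resample.mean()

lo, hi = np.percentile(boot_means, [2.5, 97.5])
```

The same resampling loop works for any summary statistic (a stratified mean, a proportion of significant loci, and so on) by swapping out the quantity computed on each resample.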
3.2.2. The CIMP Phenotype
Issa (5) coined the term “CpG island methylator phenotype,” or CIMP. As methylation profiles across a variety of tumors were collected, it was observed that in some tumors, groups of genes (such as those involved in the cell cycle and DNA repair) had
Fig. 7.3. Q–Q plot for genome-wide analysis of methylation. The figure illustrates a quantile–quantile plot for the global evaluation of methylation changes in a sample of subjects with tumor and normal tissues. Deviations from the 45◦ line indicate the presence of loci with larger than expected methylation differences based on the magnitude of the test statistic and compared with the null distribution. Test statistics with large negative values are indicative of consistently hypomethylated loci in tumor, while large positive values are indicative of consistent hypermethylation, potentially in CpG island promoters.
consistently increased methylation at their promoters (14, 21, 53). Thus, broadly, the CIMP phenotype potentially discriminates between tumors with only a few “sporadic” methylated genes and tumors with systematic hypermethylation across many important genes. Logically, if the penetrance of a methylation event at any one particular locus is low, looking across a set of exchangeable loci for an accumulation of changes will often have more power to detect significant effects in small methylation studies. However, statistical testing of individual loci will also identify those loci that are consistently hypermethylated in tumor compared to normal tissue; therefore, the utility of this phenotype depends on the selection of, as well as the assumed relationships between, the CIMP loci. A positive CIMP status is often determined using a threshold based on the percent of genes in the set that are methylated; however, the phenotype may be expanded to include CIMP-low, CIMP-moderate, and CIMP-high patients. The specific set of genes making up a CIMP phenotype will likely vary across different types of cancers and diseases in general. The
CIMP phenotype has been evaluated in lung, gastric, and ovarian cancers, as well as in non-cancer-related processes such as aging (54–61); however, there is currently no standard for which genes define the CIMP phenotype or for the magnitude of methylation change required across this set of genes. Meta-analyses of CIMP phenotypes would be informative for determining whether a particular set or subset of loci is consistently methylated across a wide range of tumors. In addition, further epidemiologic studies of CIMP+ patients would be useful in determining whether CIMP+ represents a distinct subgroup of affected patients. To this end, further studies estimating the association of the CIMP phenotype with gender, diet, smoking status, race, mutations, and other clinically important patient features, along with clinical outcome, will be highly informative. For example, it has been observed in colorectal carcinomas that CIMP+ tumors have a high frequency of classical genetic changes, while CIMP– tumors have relatively few “classical” genetic changes, thus dividing tumors into two major molecular pathways (62). Currently, there is no consensus on what exactly defines the CIMP phenotype (62). Studies (positive and negative) of CIMP have used different gene sets, measurements of methylation, and definitions of what constitutes a CIMP+ profile. CIMP may only be applicable to a subset of cancers, and the utility of studying a CIMP profile, in comparison to single genes, will be realized as more studies are conducted in larger sets of patients.

3.3. Discussion and Future Directions
Epigenomics is a new and quickly evolving field of research. Below, we identify some remaining issues and questions in analyzing DNA methylation data, as well as some potentially important areas for further bioinformatics research.
3.3.1. Defining Differentially Methylated Regions or Hypomethylation Hotspots
Although methylation at the individual-locus level is important, defining contiguous regions of consistent methylation or hypomethylation is an important and relatively unexplored research direction. Hotspots of hypomethylated loci are associated with genomic instability; therefore, the identification of such regions may shed light on chromosomal breakage and instability. Graphically, we may examine whether loci or fragments are spatially clustered with respect to methylation changes. Plotting the observed test statistic by genomic position, along with other genomic information such as promoter and gene body locations, allows visual assessment of regional hotspots of methylation change. Irizarry (47) used a simple method to detect regions of similar change based on the p-values of the probes in a region and the total length of the region. Permutation-based tests based on randomly selecting regions from the genome may be used to summarize the likelihood of the observed region being a chance
finding. Other methods of identifying and assigning statistical significance to blocks of differentially methylated loci are important areas for future research. Currently, regions of similarly methylated loci are evaluated descriptively, through integrative graphs that allow visualization of methylation levels along with information about CpG density, promoter proximity, and conservation.

3.3.2. Integration of Epigenetic Outcomes with Gene Expression and Copy Number Variation
Analysis of methylation change is enhanced by the integration of gene expression changes. If methylation is involved in controlling gene expression, there should be differences between the methylation patterns of a gene in cells where the gene is expressed and in cells where it is silent. Methylated loci that are also downregulated in gene expression experiments are likely to be transcriptionally important; therefore, merging the two data sets to find genes with highly correlated changes in methylation and gene expression is important. In addition, gene expression changes may result from either methylation or copy number variation. The loss of a region in one copy of a gene, combined with methylation of the remaining copy, is likely to have a large impact on gene expression. However, copy number changes alone may be uncommon; therefore, a joint analysis of genes with methylation and/or copy number changes should be performed. Fragment- or CpG-level information, along with copy number, may be mapped to a gene and analyzed as one variable using the union of the two events.
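The union-of-events coding described above might be sketched as follows. The per-gene calls here are hypothetical placeholders; in a real pipeline they would come from the methylation and copy-number analyses themselves.

```python
# Hypothetical per-gene calls from separate methylation and copy-number analyses
hypermethylated = {"CDKN2A", "MLH1", "BRCA1"}
copy_loss = {"MLH1", "TP53"}

# One combined indicator per gene: the union of the two events
genes = sorted(hypermethylated | copy_loss)
silencing_event = {g: (g in hypermethylated) or (g in copy_loss) for g in genes}

# Genes hit by both mechanisms (e.g., loss of one copy plus methylation of
# the remaining copy) are the strongest silencing candidates
both = hypermethylated & copy_loss
```

Once collapsed to a single gene-level indicator, the combined event can be carried into any of the supervised analyses in Table 7.2 as an ordinary binary predictor.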
3.3.3. Prioritization of Candidate Loci from Screening Epigenome-Wide Studies
Due to the sheer number of CpGs measured in epigenome-wide studies, statisticians and investigators must prioritize loci with statistically significant methylation changes. Often, methylation studies are performed in two or three stages: epigenome-wide screening, technical validation of the methylation measurement, and external validation in a new set of independent samples. The technical validation stage may be based on the same samples (internal validation of the measured methylation levels) or on an independent set, using gold-standard techniques on a select set of loci from the screening phase. Determining which set of loci to move forward into the validation phase is difficult and is often based on investigator input as well as statistical significance. As in any study, p-values provide some evidence of information content, and loci are often ranked by the p-value of the appropriate test statistic. Interpreting the p-value of any particular locus in light of all the tests performed is problematic. Given the millions of possible fragment or individual CpG measurements, paired with relatively small sample sizes, the top of the ranked list may represent truly informative loci, or simply the extreme observations of the null distribution. It is easy to simulate, especially with small sample sizes, a set of uninformative markers that comprise the top-ranked loci based on p-value. Certainly, selecting the “best” set of loci with observed p-values of 0.001 in a study that measured tens of thousands of loci cannot be interpreted with the same confidence as a p-value of 0.001 in a study of a single pre-selected marker. Computing the FDR (63, 64) or adjusting p-values for multiple testing, for example with the Holm–Bonferroni adjustment (65), yields a discrete set of significant genes with an FDR or Holm-adjusted p-value that is acceptably low. While these approaches are useful if the goal is to obtain a discrete set of best candidates, often more biologically driven information is available by which to select a final set of loci. Prioritization of candidate loci may be enhanced through integrative analysis of methylation and gene expression. For example, evaluation of loci with highly methylated promoter regions and lower downstream expression levels may be useful in isolating functionally important loci that are associated with transcriptional effects. However, this approach will fail to identify methylation changes located away from the promoter region that impact transcription (see Section 3.1), or changes in methylation that are not consistent with noisier gene expression measurements. We propose an initial prioritization scheme for methylation studies, using locus-specific characteristics, giving higher weight to statistically significant changes in:
1. Loci associated with CpG island promoters, with or without associated gene expression changes;
2. Loci associated with conserved sequence in noncoding regions;
3. Constitutively hypomethylated loci, defined by studies of normal tissue, across multiple tissue types;
4. Loci with altered methylation patterns across tissue types.
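As one concrete piece of the multiple-testing adjustment mentioned above, the Holm step-down procedure (65) can be computed in a few lines; the p-values below are purely illustrative.

```python
import numpy as np

def holm_adjust(pvals):
    """Holm step-down adjusted p-values (monotone, capped at 1)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    adj = p[order] * (m - np.arange(m))   # p_(i) multiplied by m - i + 1
    adj = np.maximum.accumulate(adj)      # enforce monotonicity
    out = np.empty(m)
    out[order] = np.minimum(adj, 1.0)
    return out

pvals = [0.0001, 0.004, 0.019, 0.095, 0.201]
adjusted = holm_adjust(pvals)
```

Loci whose adjusted values fall below the chosen level form the discrete candidate set; ranking by raw p-value and by Holm-adjusted p-value is identical, so the adjustment changes the cutoff, not the ordering.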
We may informally select loci using these criteria; however, a simple transformation of the p-value proposed by Wacholder (66) (see also Wakefield (67) for another overview) allows us to re-rank loci by the false-positive reporting probability (FPRP), using this prioritization scheme (along with an initial set of weights) in the form of priors. The strength of this approach is that the priors used may be reported along with the results of the study, as may the sensitivity of the results to the exact priors used. This allows the results of the study, and subsequent validation of the identified loci, to be evaluated in the context of the assumptions made, where such investigator assumptions are clearly and quantitatively specified.

3.3.4. Building Methylation “Signatures”
Similar to gene expression studies, building predictive models and quantifying the prognostic value of a set of methylation markers is an important statistical problem. Examination of the impact of
global summaries (such as a methylation index or percent global hypomethylation) may be highly informative; however, which regions or loci to sum over (all loci measured on the array vs. a select subset such as the CIMP loci) is not obvious. Local methylation events, such as methylation of a tumor suppressor gene or hypomethylation of an oncogene, have obvious importance; however, the importance of regions not overlapping gene promoters, such as the gene body or nearby transposable elements, has yet to be fully determined. Many traditional statistical models, as well as newer machine-learning tools, are available for building multivariable models using methylation-based measurements as predictors (see Table 7.2 as well as (13, 68)). Given the large number of potential features, an important initial step is dimension reduction, which often has a huge impact on the performance of the model (69) (see Chapter 14 for dimension reduction techniques). Adequacy of the final model or algorithm may be assessed in-sample using bootstrap or cross-validated estimates (70); however, external validation on a new set of samples is necessary. In screening candidate loci and building prognostic models, we must consider:
1. Positional aspects in screening loci: do we give higher weight to statistically significant regions near or at promoters, or equal weight to every measurable CpG site across the genome? This is especially relevant given that CpG-sparse regions may not be measured as reliably as CpGs located in CpG-dense regions.
2. The integration of other sources of information into the screening process, including gene expression, copy number change, biological pathways, and genomic context.
3. The relationship between the screening method and the validation method eventually to be used for diagnostic tools.
Importantly, models built using microarray-based methylation measurements may not be fully validated by the gold standard; therefore, the set of loci selected for the final model, as well as the model weights, should be based on measurements derived from the technology to be used in clinical practice.
4. The overestimation of parameter estimates and predictive ability due to model selection bias, over-fitting, and estimation bias (71).
5. The need for validation of methylation signatures in independent samples.
Statistical and bioinformatic approaches are important in addressing the issues outlined above, as well as in further refining the estimates of methylation generated from genome-wide
microarray-based techniques. Quantification and summarization of epigenome-wide DNA methylation patterns will likely be a very powerful tool in risk assessment, the diagnosis of disease, and the prediction of clinically relevant outcomes such as recurrence and survival.

References

1. Martienssen, R.D., Zaratiegui, M., and Goto, D.B. (2005) RNA interference and heterochromatin in the fission yeast Schizosaccharomyces pombe, Trends Genet 21(8), 450–456.
2. Klose, R.J. and Bird, A.P. (2006) Genomic DNA methylation: the mark and its mediators, Trends Biochem Sci 31(2), 89–97.
3. Jørgensen, H.F. and Bird, A. (2002) MeCP2 and other methyl-CpG binding proteins, Ment Retard Dev Disabil Res Rev 8(2), 87–93.
4. Fazzari, M.J. and Greally, J.M. (2004) Epigenomics: beyond CpG islands, Nat Rev Genet 5, 446–455.
5. Issa, J.P. (2004) CpG island methylator phenotype in cancer, Nat Rev Cancer 4, 988–993.
6. Vaissière, T., Sawan, C., and Herceg, Z. (2008) Epigenetic interplay between histone modifications and DNA methylation in gene silencing, Mutat Res 659(1–2), 40–48.
7. Bell, A.C. and Felsenfeld, G. (2000) Methylation of a CTCF-dependent boundary controls imprinted expression of the Igf2 gene, Nature 405(6785), 482–485.
8. Laird, P.W. (1997) Oncogenic mechanisms mediated by DNA methylation, Mol Med Today 3, 223–229.
9. Ladd-Acosta, C.P., Sabunciyan, S., Yolken, R.H., Webster, M.J., Dinkins, T., Callinan, P.A., Fan, J.B., Potash, J.B., and Feinberg, A.P. (2007) DNA methylation signatures within the human brain, Am J Hum Genet 81(6), 1304–1315.
10. Richardson, B.C. (2002) Role of DNA methylation in the regulation of cell function: autoimmunity, aging and cancer, J Nutr 132, 2401S–2405S.
11. Richardson, B.C. (2003) Impact of aging on DNA methylation, Ageing Res Rev 2/3, 245–261.
12. Weinhold, R. (2006) Epigenetics: the science of change, Environ Health Persp 114, A160–A167.
13. Hastie, T., Tibshirani, R., and Friedman, J.H. (2001) The Elements of Statistical Learning, Springer-Verlag, New York.
14. Herman, J.G. and Baylin, S.B. (2003) Gene silencing in cancer in association with promoter hypermethylation, NEJM 349, 2042–2054.
15. Bird, A.P. (1980) DNA methylation and the frequency of CpG in animal DNA, Nucleic Acids Res 8, 1499–1504.
16. Bird, A.P. (1986) CpG rich islands and the function of DNA methylation, Nature 321, 209–213.
17. Gardiner-Garden, M. and Frommer, M. (1987) CpG islands in vertebrate genomes, J Mol Biol 196(2), 261–282.
18. Heard, E. (2004) Recent advances in X-chromosome inactivation, Curr Opin Cell Biol 16, 247–255.
19. Yan, P.S., Efferth, T., Chen, H.L., Lin, J., Rödel, F., Fuzesi, L., and Huang, T.H. (2002) Use of CpG island microarrays to identify colorectal tumors with a high degree of concurrent methylation, Methods 27, 162–169.
20. van Rijnsoever, M., Grieu, F., Elsaleh, H., Joseph, D., and Iacopetta, B. (2002) Characterisation of colorectal cancers showing hypermethylation at multiple CpG islands, Gut 51, 797–802.
21. Jones, P.A. and Baylin, S.B. (2002) The fundamental role of epigenetic events in cancer, Nat Rev Genet 3, 415–428.
22. Gama-Sosa, M.A., Slagel, V.A., Trewyn, R.W., Oxenhandler, R., Kuo, K.C., Gehrke, C.W., and Ehrlich, M. (1983) The 5-methylcytosine content of DNA from human tumors, Nucleic Acids Res 11, 6883–6894.
23. Feinberg, A.P., Gehrke, C.W., Kuo, K.C., and Ehrlich, M. (1988) Reduced genomic 5-methylcytosine content in human colonic neoplasia, Cancer Res 48, 1159–1161.
24. Ogino, S., Nosho, K., Kirkner, G.J., Kawasaki, T., Chan, A.T., Schernhammer, E.S., Giovannucci, E.L., and Fuchs, C.S. (2008) A cohort study of tumoral LINE-1 hypomethylation and prognosis in colon cancer, J Natl Cancer Inst 100, 1734–1738.
25. Feinberg, A.P., Ohlsson, R., and Henikoff, S. (2006) The epigenetic progenitor origin of human cancer, Nat Rev Genet 7, 21–33.
26. Schumacher, A., Kapranov, P., Kaminsky, Z., Flanagan, J., Assadzadeh, A., Yau, P., Virtanen, C., Winegarden, N., Cheng, J., Gingeras, T., and Petronis, A. (2006) Microarray based DNA methylation profiling: technology and applications, Nucleic Acids Res 34(2), 528–542.
27. Beck, S. and Rakyan, V.K. (2008) The methylome: approaches for global DNA methylation profiling, Trends Genet 24(5), 231–237.
28. Zilberman, D. and Henikoff, S. (2007) Genome-wide analysis of DNA methylation patterns, Development 134, 3959–3965.
29. Dupont, J.M., Tost, J., Jammes, H., and Gut, I.G. (2004) De novo quantitative bisulfite sequencing using the pyrosequencing technology, Anal Biochem 333, 119–127.
30. Eads, C.A., Danenberg, K.D., Kawakami, K., Saltz, L.B., Blake, C., Shibata, D., Danenberg, P.V., and Laird, P.W. (2000) MethyLight: a high-throughput assay to measure DNA methylation, Nucleic Acids Res 28, E32.
31. Weber, M., Davies, J.J., Wittig, D., Oakeley, E.J., Haase, M., Lam, W.L., and Schubeler, D. (2005) Chromosome-wide and promoter-specific analyses identify sites of differential DNA methylation in normal and transformed human cells, Nat Genet 37, 853–862.
32. Khulan, B., Thompson, R.F., Ye, K., Fazzari, M.J., Suzuki, M., Stasiek, E., Figueroa, M.E., Glass, J.L., Chen, Q., Montagna, C., Hatchwell, E., Selzer, R.R., Richmond, T.A., Green, R.D., Melnick, A., and Greally, J.M. (2006) Comparative isoschizomer profiling of cytosine methylation: the HELP assay, Genome Res 16, 1046–1055.
33. Sutherland, E., Coe, L., and Raleigh, E.A. (1992) McrBC: a multisubunit GTP-dependent restriction endonuclease, J Mol Biol 225, 327–348.
34. Clark, S.J., Harrison, J., Paul, C.L., and Frommer, M. (1994) High sensitivity mapping of methylated cytosines, Nucleic Acids Res 22, 2990–2997.
35. Clark, S.J., Statham, A., Stirzaker, C., Molloy, P.L., and Frommer, M. (2006) DNA methylation: bisulphite modification and analysis, Nat Protoc 1, 2353–2364.
36. Bibikova, M., Lin, Z., Zhou, L., Chudin, E., Garcia, E.W., Wu, B., Doucet, D., Thomas, N.J., Wang, Y., Vollmer, E., Goldmann, T., Seifart, C., Jiang, W., Barker, D.L., Chee, M.S., Floros, J., and Fan, J. (2006) High-throughput DNA methylation profiling using universal bead arrays, Genome Res 16, 383–393.
37. Frommer, M., McDonald, L.E., Millar, D.S., Collis, C.M., Watt, F., Grigg, G.W., Molloy, P.L., and Paul, C.L. (1992) A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands, Proc Natl Acad Sci USA 89(5), 1827–1831.
38. Eckhardt, F., Lewin, J., Cortese, R., Rakyan, V.K., Attwood, J., Burger, M., Burton, J., Cox, T.V., Davies, R., Down, T.A., Haefliger, C., Horton, R., Howe, K., Jackson, D.K., Kunde, J., Koenig, C., Liddle, J., Niblett, D., Otto, T., Pettett, R., Seemann, S., Thompson, C., West, T., Rogers, J., Olek, A., Berlin, K., and Beck, S. (2006) DNA methylation profiling of human chromosomes 6, 20 and 22, Nat Genet 38, 1378–1385.
39. Ehrich, M., Turner, J., Gibbs, P., Lipton, L., Giovanneti, M., Cantor, C., and van den Boom, D. (2008) Cytosine methylation profiling of cancer cell lines, Proc Natl Acad Sci USA 105(12), 4844–4849.
40. Jeddeloh, J.A., Greally, J.M., and Rando, O.J. (2008) Reduced-representation methylation mapping, Genome Biol 9(8), 231.
41. Down, T.A., Rakyan, V.K., Turner, D.J., Flicek, P., Li, H., Kulesha, E., Gräf, S., Johnson, N., Herrero, J., Tomazou, E.M., Thorne, N.P., Bäckdahl, L., Herberth, M., Howe, K.L., Jackson, D.K., Miretti, M.M., Marioni, J.C., Birney, E., Hubbard, T.J., Durbin, R., Tavaré, S., and Beck, S. (2008) A Bayesian deconvolution strategy for immunoprecipitation-based DNA methylome analysis, Nat Biotech 27(7), 779–785.
42. Irizarry, R., Ladd-Acosta, C., Carvalho, B., Wu, H., Brandenburg, S.A., Jeddeloh, J.A., Wen, B., and Feinberg, A.P. (2008) Comprehensive high-throughput arrays for relative methylation (CHARM), Genome Res 18, 780–790.
43. Taylor, K.H., Kramer, R.S., Davis, W.J., Guo, J., Du, D.J., Xu, D., Caldwell, C.W., and Shi, H. (2007) Ultradeep bisulphite sequencing analysis of DNA methylation patterns in multiple gene promoters by 454 sequencing, Cancer Res 67(18), 8511–8518.
44. Candiloro, I.L., Mikeska, T., Hokland, P., and Dobrovic, A. (2008) Rapid analysis of heterogeneously methylated DNA using digital methylation-sensitive high resolution melting: application to the CDKN2B (p15) gene, Epigenet Chromatin 31(1), 7–15.
45. Thompson, R.F., Reimers, M., Khulan, B., Gissot, M., Richmond, T.A., Chen, Q., Zheng, X., Kim, K., and Greally, J.M. (2008) An analytical pipeline for genomic representations used for cytosine methylation studies, Bioinformatics 24(9), 1161–1167.
46. Pelizzola, M., Koga, Y., Urban, A.E., Krauthammer, M., Weissman, S., Halaban, R., and Molinaro, A.M. (2008) MEDME: an experimental and analytical methodology for the estimation of DNA methylation levels based on microarray derived MeDIP-enrichment, Genome Res 18(10), 1652–1659.
47. Irizarry, R.A., Ladd-Acosta, C., Wen, B., Wu, Z., Montano, C., Onyango, P., Cui, H., Gabo, K., Rongione, M., Webster, M., Ji, H., Potash, J.B., Sabunciyan, S., and Feinberg, A.P. (2009) The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores, Nat Genet 41(2), 178–186.
48. Brock, M.V., Hooker, C.M., Ota-Machida, E., Han, Y., Guo, M., Ames, S., Glöckner, S., Piantadosi, S., Gabrielson, E., Pridham, G., Pelosky, K., Belinsky, S.A., Yang, S.C., Baylin, S.B., and Herman, J.G. (2008) DNA methylation markers and early recurrence in stage I lung cancer, N Engl J Med 358, 1118–1128.
49. Tsou, J.A., Hagen, J.A., Carpenter, C.L., and Laird-Offringa, I.A. (2002) DNA methylation analysis: a powerful new tool for lung cancer diagnosis, Oncogene 21, 5450–5461.
50. Siegmund, K.D., Connor, C.M., Campan, M., Tiffany, L.I., Weisenberger, D.J., Biniszkiewicz, D., Jaenisch, R., Laird, P.W., and Akbarian, S. (2007) DNA methylation in the human cerebral cortex is dynamically regulated throughout the life span and involves differentiated neurons, PLoS ONE 2(9), e895.
51. Siegmund, K.D. and Laird, P.W. (2002) Analysis of complex methylation data, Methods 27, 170–178.
52. Efron, B. and Tibshirani, R.J. (1994) An Introduction to the Bootstrap, Chapman & Hall, London.
53. Herman, J.G. (1999) Hypermethylation of tumor suppressor genes in cancer, Semin Cancer Biol 9, 359–367.
54. Toyota, M., Ahuja, N., Ohe-Toyota, M., Herman, J.G., Baylin, S.B., and Issa, J.P. (1999) CpG island methylator phenotype in colorectal cancer, Proc Natl Acad Sci USA 96, 8681–8686.
55. Li, Q., Jedlicka, A., Ahuja, N., Gibbons, M.C., Baylin, S.B., Burger, P.C., and Issa, J.P. (1998) Concordant methylation of the ER and N33 genes in glioblastoma multiforme, Oncogene 16, 3197–3202.
56. Toyota, M., Ahuja, N., Suzuki, H., Itoh, F., Ohe-Toyota, M., Imai, K., Baylin, S.B., and Issa, J.P. (1999) Aberrant methylation in gastric cancer associated with the CpG island methylator phenotype, Cancer Res 59, 5438–5442.
57. Kim, H., Kim, Y.H., Kim, S.E., Kim, N.G., Noh, S.H., and Kim, H. (2003) Concerted promoter hypermethylation of hMLH1, p16INK4A, and E-cadherin in gastric carcinomas with microsatellite instability, J Pathol 200, 23–31.
58. Ueki, T., Toyota, M., Sohn, T., Yeo, C.J., Issa, J.P., Hruban, R.H., and Goggins, M. (2000) Hypermethylation of multiple genes in pancreatic adenocarcinoma, Cancer Res 60, 1835–1839.
59. Strathdee, G., Appleton, K., Illand, M., Millan, D.W., Sargent, J., Paul, J., and Brown, R. (2001) Primary ovarian carcinomas display multiple methylator phenotypes involving known tumor suppressor genes, Am J Pathol 158, 1121–1127.
60. Garcia-Manero, G., Daniel, J., Smith, T.L., Kornblau, S.M., Lee, M., Kantarjian, H.M., and Issa, J.P. (2002) DNA methylation of multiple promoter-associated CpG islands in adult acute lymphocytic leukemia, Clin Cancer Res 8, 2217–2224.
61. Toyota, M. and Issa, J.P. (1999) CpG island methylator phenotypes in aging and cancer, Semin Cancer Biol 9(5), 349–357.
62. Toyota, M., Ohe-Toyota, M., Ahuja, N., and Issa, J.P. (2000) Distinct genetic profiles in colorectal tumors with or without the CpG island methylator phenotype, Proc Natl Acad Sci USA 97, 710–715.
63. Storey, J.D. and Tibshirani, R. (2003) Statistical significance for genomewide studies, Proc Natl Acad Sci USA 100, 9440–9445.
64. Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing, J R Stat Soc Ser B 57, 289–300.
65. Holm, S. (1979) A simple sequentially rejective multiple test procedure, Scand J Stat 6, 65–70.
66. Wacholder, S., Chanock, S., Garcia-Closas, M., El-Ghormli, L., and Rothman, N. (2004) Assessing the probability that a positive report is false: an approach for molecular epidemiology studies, J Natl Cancer Inst 96, 434–442.
67. Wakefield, J. (2008) Reporting and interpretation in genome-wide association studies, Int J Epidemiol 37, 641–653.
68.
Dudoit, S., Fridlyand, J., and Speed, T.P. (2000) Comparison of discrimination methods for the classification of tumors using gene expression data, Department of Statistics, University of California, Berkeley, CA, Tech Rep 576. 69. Hand, D.J. (2008) Breast cancer diagnosis from protein mass spectrometry data: a com-
Introduction to Epigenomics and Epigenome-Wide Analysis parative evaluation, Stat Appl Genet Mol Biol 7(2), 15, 1–21. 70. Molinaro, A.M., Simon, R., and Pfeiffer, R.M. (2005) Prediction error estimation: a
265
comparison of resampling methods, Bioinformatics 21(15), 3301–3307. 71. Miller, A. (2002) Subset Selection in Regression. Chapman & Hall/CRC, London.
Chapter 8

Exploration, Visualization, and Preprocessing of High-Dimensional Data

Zhijin Wu and Zhiqiang Wu

Abstract

The rapid advances in biotechnology have given rise to a variety of high-dimensional data. Many of these data, including DNA microarray data, mass spectrometry protein data, and high-throughput screening (HTS) assay data, are generated by complex experimental procedures that involve multiple steps such as sample extraction, purification and/or amplification, labeling, fragmentation, and detection. The quantity of interest is therefore not obtained directly, and a number of preprocessing procedures are necessary to convert the raw data into a format with biological relevance. This also makes exploratory data analysis and visualization essential steps to detect possible defects, anomalies, or distortions of the data, to test underlying assumptions, and thus to ensure data quality. The characteristics of the data structure revealed in exploratory analysis often motivate decisions in the preprocessing procedures that produce data suitable for downstream analysis. In this chapter we review common techniques for exploring and visualizing high-dimensional data and introduce the basic preprocessing procedures.

Key words: Preprocessing, exploratory data analysis, visualization, microarray, proteomics.
1. Introduction

Recent developments in high-throughput biotechnologies have led to a rapid growth in data accumulation. Genomic and proteomic data are probably the fastest growing examples. These data are high dimensional by nature. For example, microarrays are often used to interrogate a transcriptome or genome, generating thousands to millions of measurements for each biological sample. Mass spectrometry-based proteomics produces thousands of readings per spectrum. The raw data from these technologies, however, are not directly the quantity of interest. The experiments that generate these data often include complex procedures, and the targets are indirectly quantified. This makes visualization extremely important in evaluating data quality and validating assumptions. Exploratory data analysis (EDA) is often performed at the same time as data visualization: visualization is itself part of EDA, revealing data structure, systematic trends, anomalies, or distortions that may inform decisions in preprocessing. Preprocessing is an essential step in analyzing high-throughput data, since the raw data obtained from high-throughput technologies are often not in a form of direct biological interest. The biological samples often go through many steps of manipulation before a measurement associated with their quantity can be taken. The goal of this chapter is to review methods used in exploring, visualizing, and preprocessing high-dimensional data. The chapter is organized as follows: in Section 2 we review exploration and visualization of data, including visualizing raw data as well as summary statistics, together with common techniques for exploring and visualizing data structure. In Section 3 we use microarrays and mass spectrometry as examples to review preprocessing methods.

H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_8, © Springer Science+Business Media, LLC 2010
2. Exploration and Visualization

Visualizing raw data is the first step in quality assessment. It is not guaranteed to reveal all possible problems in an experiment, but it can often suggest experimental errors and detect sources of systematic variation, outliers, and anomalies in the data. In this section we review methods used to visualize raw data as well as methods that explore the stochastic properties and structures of data.

2.1. Image Plots
Several high-throughput biotechnologies generate images as raw data, and it is most natural to plot the raw data as an image in the first step of exploration. For example, microarrays are composed of up to millions of DNA probes of different sequences immobilized on a small solid surface (the microarray). Each probe is designed to represent a genomic target, for example, a gene, with its unique sequence complementary to the target. A target sample containing a complex mixture of RNA or DNA molecules to be quantified is hybridized to the array. The target molecules are labeled with fluorescent dyes. After hybridization, the mixture of target molecules is sorted by the probes on the array. A scanner reads the fluorescent intensity from each probe location. The intensity measure from each probe represents the amount of the molecule with complementary sequence bound to the probe, providing a measure of the relative abundance of the target in the sample. The raw data from a microarray is an image generated by the scanner. Image processing software extracts probe intensities and turns the image into probe-level intensity readings. A common first step of exploratory analysis is to regenerate a pseudo-image from the digitized probe intensities. This image retains the physical position of the probes and uses color or grey scale to represent probe intensities. It can reveal whether hybridization is done evenly over the entire array and whether there are defects such as bubbles or scratches. Figure 8.1a shows an example image with blemishes. Such direct visualization of raw data as an image in its original dimension is also helpful in other technologies that naturally generate images in data collection, such as HTS assays. In an HTS assay, a large number of molecules are deposited into 96-, 384-, or 1,536-well plates. The raw data are also generated by a scanner, giving a matrix of numbers, each representing a well at a certain row and column on the plate. Figure 8.1b shows an example of a plate with a strong edge effect in the assay reaction.
Fig. 8.1. (a). A portion of an image of probe intensities, showing several oval-shaped anomalies. (b). Image of a 384-well plate screening assay.
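The edge effect visible in a plate image like Fig. 8.1b can also be checked numerically by comparing edge wells with interior wells. A minimal sketch in Python, using simulated plate values (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated 16 x 24 (384-well) plate; all values are hypothetical.
plate = rng.normal(loc=100.0, scale=5.0, size=(16, 24))
plate[0, :] += 30.0   # exaggerate an edge effect on the top row
plate[-1, :] += 30.0  # ...and on the bottom row

# Boolean mask separating edge wells from interior wells.
edge = np.ones(plate.shape, dtype=bool)
edge[1:-1, 1:-1] = False

edge_median = np.median(plate[edge])
interior_median = np.median(plate[~edge])
print(f"edge median: {edge_median:.1f}, interior median: {interior_median:.1f}")
```

A large edge-versus-interior difference flags the same spatial artifact that the image plot reveals visually.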
2.2. Visualizing Empirical Distributions

Plotting empirical distributions or summary statistics of several samples together can also reveal outlier or problematic samples. Figure 8.2a shows the boxplot and histogram of four microarrays in an experiment. In many experiments, such as gene expression studies in which a treatment is expected to affect only a small subset of genes, or comparative genomic hybridizations in which only a small region of the genome is thought to show copy number variation, the overall distributions of intensities are expected to be similar in all samples. The different locations and scales of the distributions, demonstrated by the medians and interquartile ranges, indicate systematic differences between the samples, suggesting the necessity of normalization before further analysis. The histogram of sample a is bimodal, unlike the others, suggesting that simple linear normalization methods will not be sufficient to remove the array effect.

Fig. 8.2. Boxplot and empirical density of log intensities from four microarrays in one experiment. Strong array effects indicate the need for normalization.

2.3. Scatter Plots and Variants
Scatter plots are natural choices for visualizing the relationship between two variables when the limited number of observations does not lead to too much overlap among data points. However, high-throughput biotechnologies produce a large number of measurements in each experiment. This means that in simple displays like a scatter plot, an enormous number of data points overlap each other, making the plot harder to read, and important data characteristics are sometimes hidden. Figure 8.3a and b show two scatter plots of microarray probe intensities. The two scatter plots look similar: both suggest that the majority of the probe intensities do not vary much across arrays, with the data points falling near the diagonal. However, the smoothed scatter plots (1) tell a different story. A smoothed scatter plot computes a two-dimensional kernel density estimate of the point density and produces a color representation of it. Figure 8.3c plots the same data as in Fig. 8.3a, but clearly shows two populations of points. The difference between Fig. 8.3a and b is not visible because even a small proportion of the points from a high-throughput data set is enough to overlap and cover the two-dimensional space. In contrast, the difference is apparent in Figs. 8.3c and d.
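The kernel-density idea behind smoothed scatter plots can be approximated with a simple two-dimensional histogram. The sketch below (simulated data, all parameters hypothetical) bins two correlated log-intensity samples containing a second, off-diagonal population like the one in Fig. 8.3c:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two samples of log intensities with a second population off the diagonal.
n = 50_000
base = rng.normal(8.0, 1.5, size=n)
x = base
y = base + rng.normal(0.0, 0.2, size=n)
y[: n // 10] += 2.0  # 10% of points form a shifted second population

# Binned counts stand in for the 2D kernel density estimate; plotting
# `counts` as an image (darkness ~ density) reveals both populations even
# where individual points overplot each other completely.
counts, xedges, yedges = np.histogram2d(x, y, bins=50)
print(f"occupied bins: {int((counts > 0).sum())} of {counts.size}")
```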
Fig. 8.3. (a). Simple scatter plot comparing sample 1 versus 2 in a microarray experiment. (b). Like (a) but comparing sample 3 versus 2. (c). Same comparison as in (a) but with darkness indicating data density, revealing two populations of data points. (d). Same as (b) but with darkness indicating data density.

2.4. Principal Component Analysis

Smoothed scatter plots allow visualization of all data points generated in a pair of samples. When there are only a few samples, we can generate a matrix of scatter plots for all possible pairs of samples. When the number of samples increases, however, visualizing all features at once becomes difficult and dimension reduction becomes necessary. One of the most commonly used dimension reduction methods is principal component analysis (PCA) (2). PCA searches for directions along which the maximal variation in the data is explained. It is widely used for data reduction in various fields (3–7), including biomedical research (8, 9). For example, consider a data matrix X of p rows of genes (or proteins) and n columns of samples. There are two different kinds of variance we may be interested in: the variance of samples and the variance of genes. In original data exploration, we may be interested in finding out whether there are systematic artifacts that contribute to variation of gene expression across samples. To start, one computes the empirical covariance matrix of the samples,

S = \frac{1}{p-1} (X - \mathbf{1}\bar{X}^T)^T (X - \mathbf{1}\bar{X}^T),
where \bar{X} = \frac{1}{p} X^T \mathbf{1} is the vector of column means. The eigendecomposition of S gives

S = V D V^T,

where D is the diagonal matrix of eigenvalues, usually arranged in decreasing order. The columns of V are the eigenvectors, sometimes referred to as eigengenes (10, 2) in the context of gene expression analysis. It is not uncommon in high-throughput technologies that considerable variation in the data is explained by systematic experimental artifacts. Checking whether the dominant principal components correlate with non-biological factors can often help identify these artifacts. Figure 8.4 shows the first two principal components from 15 gene expression arrays. Three scanners were used to scan the microarrays, and the second principal component clearly separates the samples by scanner, showing a strong scanner effect. Variations due to systematic non-biological factors such as the scanners are considered obscure variations (11) and are reasons for normalization.
Fig. 8.4. First two principal components from 15 gene expression arrays in which three different scanners were used to generate the data. The second principal component (PC2) is clearly correlated with the scanner.
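The calculation behind Fig. 8.4 can be sketched directly from the formulas above. In the simulated example below (all effect sizes hypothetical), a per-gene scanner effect is planted in the data and recovered by the leading eigenvector of the sample covariance matrix; the SVD of the centered data yields the same eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(2)

# p genes x n samples; 15 arrays scanned on 3 scanners (5 each), with a
# hypothetical per-gene scanner sensitivity g.
p, n = 1000, 15
scanner = np.repeat([0, 1, 2], 5)
g = rng.normal(size=p)
X = rng.normal(size=(p, n)) + np.outer(g, scanner)

# Empirical covariance of samples: S = (X - 1 xbar^T)^T (X - 1 xbar^T)/(p-1)
Xc = X - X.mean(axis=0)          # subtract column means
S = Xc.T @ Xc / (p - 1)

# Eigendecomposition S = V D V^T, eigenvalues in decreasing order.
evals, V = np.linalg.eigh(S)
order = np.argsort(evals)[::-1]
evals, V = evals[order], V[:, order]

# The leading component should line up with the scanner assignment.
corr = np.corrcoef(V[:, 0], scanner)[0, 1]
print(f"|corr(PC1, scanner)| = {abs(corr):.2f}")

# SVD of the centered data recovers the same eigenvalues.
L = np.linalg.svd(Xc, compute_uv=False)
assert np.allclose(L**2 / (p - 1), evals)
```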
Another common use of PCA is to search for sample-like patterns that explain the most variance among biological samples. The principal components for this purpose are linear combinations of gene expressions, and these principal components are referred to as eigenarrays. In fact, singular value decomposition (SVD) can be used to estimate the eigengenes and eigenarrays at the same time (10). The original data matrix X, with p genes and n arrays (n << p in high-throughput biotechnologies), can be decomposed as X = U L V^T, where U is the p-gene × n-eigenarray matrix, L is an n × n diagonal matrix of "eigenexpression" levels (10), and V^T is the n-eigengene × n-array matrix. Starting from thousands of genes per sample, PCA can reduce the dimension to n or fewer without loss of information.

2.5. Heatmaps and Unsupervised Clustering

Heatmaps are color images used to visualize matrix-type data. Each entry in the data matrix is displayed as a small colored rectangle at the row and column position it occupies in the matrix. The color is determined by the value of the datum. Heatmaps have been widely used to display microarray data (12) since Eisen et al. (13), and a red-green color scale is the most common choice. In fact, heatmaps can be used to visualize any data set with a matrix structure, and the color scale can be defined by the user (14–17). Figure 8.5 gives an example of gene expression data from 25 arrays. The usual orientation of heatmaps displays genes/proteins in rows and biological samples in columns, although this is mostly
Fig. 8.5. A heatmap comparing expression in liver tissue, a central nervous system cell line (SN19), and RNA mixtures from the two sources.
for convenience, since the number of genes/proteins is usually greater than the number of samples. In practice, the rows and columns in heatmaps are often rearranged so that similar rows and/or columns appear together. The arrangement can be determined by prior knowledge (for example, functionally related genes or proteins may be grouped into adjacent rows) or by study design (for example, samples from the same treatment or genotype may be grouped together). The columns or rows can also be rearranged according to a hierarchical clustering result. A dendrogram is usually displayed on the side of a heatmap to show the clustering of genes and/or samples. In fact, not only is hierarchical clustering used to aid the display of a heatmap, the heatmap itself is an important tool in visualizing hierarchical clustering structure. Clustering is a process of grouping data elements into subsets (clusters) based on their similarities or distances. It is a widely applied tool in signal processing in various fields (18–22) and has been particularly useful in exploring high-throughput biomedical data, including those from neuroscience and genomics/proteomics studies (23–27). For the latter, the data elements can be genes/proteins or biological samples. Hierarchical clustering produces a tree structure showing the complete relationship among the elements, such that the finest clusters are the individual elements themselves and the top hierarchy contains all elements. Figure 8.5 shows a dendrogram of the gene clustering on the side. A number of algorithms have been developed to perform hierarchical clustering, and many of these are specifically tuned for genomics applications (28, 13, 29, 30). Most methods generate a hierarchical tree by one of two approaches. The agglomerative, or bottom-up, approach starts from singleton clusters (individual elements) and joins smaller clusters to form larger ones until one reaches the top cluster containing all elements. The key factors determining which clusters are joined are how the distance/dissimilarity between a pair of elements is calculated and how the distances among elements are summarized to represent distances between clusters containing more than one element. Typical dissimilarity measures between elements include Euclidean distance, Manhattan distance, and 1 − correlation (Pearson or Spearman). The distance between two clusters can be the maximum, minimum, or average distance between elements belonging to the two different clusters. The other approach works in the reverse direction. Divisive, or top-down, clustering starts from the entire set of elements and splits large clusters into smaller ones by maximizing the intercluster distance in each split. Clustering analysis has been widely applied to genomic data. However, it is often not emphasized enough that clustering results can be unstable and should be interpreted with caution. The hierarchical structure can change due to different choices of
dissimilarity measure, linkage, and gene filtering. It is less of a problem to use clustering analysis as an exploratory tool, but making inferences from a clustering analysis requires more scrutiny (31) (see Chapter 12 to learn more about clustering).
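A bare-bones agglomerative procedure makes the linkage choices concrete. The sketch below (pure Python, illustrative only; real analyses would use an optimized library implementation) repeatedly merges the closest pair of clusters under single or average linkage:

```python
import math

def agglomerate(points, linkage="single"):
    """Bottom-up hierarchical clustering; returns the merge history."""
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    while len(clusters) > 1:
        best = None
        for a in clusters:
            for b in clusters:
                if a >= b:
                    continue
                # All pairwise Euclidean distances between the two clusters.
                pair_d = [math.dist(points[i], points[j])
                          for i in clusters[a] for j in clusters[b]]
                d = min(pair_d) if linkage == "single" else sum(pair_d) / len(pair_d)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((a, b, d))                 # record which clusters merged
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges

# Two tight groups far apart: the last (largest-distance) merge joins them.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
history = agglomerate(pts)
print(history[-1])
```

The merge history is the information a dendrogram draws: merge order on one axis and merge distance on the other.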
3. Preprocessing

Recent developments in biotechnology have enabled us to rapidly obtain massive amounts of high-dimensional data, such as genomic and proteomic data. Quantification of DNA/RNA and protein/peptide molecules, however, is usually not direct. Biological samples under investigation usually have to go through a number of experimental procedures, such as extraction, amplification, and labeling, before measurements are taken. A number of factors can affect the measurements, including the handling of the biological samples as well as the settings of the measuring device. Thus the raw data we obtain are usually not directly in a form of biological interest, and data preprocessing is necessary to convert raw measurements into meaningful data that can be used in further analysis. There are some common features in preprocessing high-dimensional data shared by different technologies. For example, baseline adjustment/subtraction, normalization, and data reduction are the usual steps. However, the suitable algorithms often depend on the specifics of a technology. In the following sections we use DNA microarrays and mass spectrometry as two examples to illustrate data preprocessing.

3.1. Preprocessing Microarray Data
As introduced in Section 2.1, the raw data generated in a microarray experiment are images. The goals of microarray experiments can be diverse, from measuring gene expression and identifying single nucleotide polymorphisms (SNPs) to measuring DNA copy number variation, but the basic mechanism of measurement is common to all biological applications. The measuring unit is a probe, a DNA molecule immobilized on the array, and one microarray usually contains from thousands to millions of probes. A biological sample composed of a large number of species of DNA or RNA molecules to be quantified, each labeled with fluorescent dye, is hybridized to the array. The labeled nucleic acid molecules bind to their complementary probes and a scanner reads the fluorescent intensities of the probes. There are two basic types of microarrays. Single-channel arrays use short oligonucleotide probes, usually synthesized in situ on the arrays, and produce one image for each hybridization. Two-channel arrays often have the probes printed on the arrays, hybridize two samples labeled with different dyes, and generate a set of two images per hybridization. Sometimes more than one probe is used to query one target; for example, the Affymetrix GeneChips use 11–20 probes (a probe set) to measure the expression of a gene. Sometimes one probe may be printed several times at different locations to provide technical replicates within an array. Preprocessing for microarrays refers to the conversion of raw image data to data in the format of the biological targets, such as gene expression measures or SNP types. The image analysis involves addressing (locating the probes), segmentation (classifying the pixels as from the foreground/probe or background/surrounding area), and intensity extraction (summarizing the pixels to one intensity value per probe). One-channel in situ synthesized arrays generate one set of probe intensities per array, and two-channel printed arrays generate four: a foreground and a background intensity set for each channel. Strictly speaking, image analysis is a part of preprocessing. However, since most microarray users obtain their raw data in the form of probe intensities from the image analysis software supplied with the scanner, we focus the discussion of preprocessing on post-image analysis procedures. For both types of arrays, the observed probe intensities are a combination of background and signal and are influenced by systematic variations that are common to the entire array, such as sample preparation, hybridization conditions, and scanner settings. A general model may be used to describe the intensity of the jth probe for genomic target g on array i:

Y_{gij} = S_{gij} + B_{gij} = f_i \exp\{a_{gj} + \mu_{gi} + \varepsilon_{gij}\} + B_{gij} + e_{gij},   [1]

where B is background and S is signal, a_{gj} is a probe effect, and \mu_{gi} is the concentration of genomic target g on array i. f_i represents the systematic effect on array i. \varepsilon_{gij} and e_{gij} are the multiplicative and additive error terms, both found necessary to explain the stochastic noise in microarrays (32–35).
The index j may be omitted for arrays using one probe per genomic target. In two-color arrays, the relative abundance between the two channels is the final measure, and the probe effect a_{gj} is often considered to cancel out when the ratio of the two channels is formed after background correction. The background B, the systematic effect f, and the possibility of multiple probes explain the need for background correction, normalization, and sometimes summarization in preprocessing. Below we discuss how these steps are carried out for the two types of microarrays.

3.1.1. One-Channel Arrays
Preprocessing one-channel arrays usually includes three steps: background correction, normalization, and summarization. Here we use Affymetrix GeneChip arrays as an example platform.
Although the GeneChip arrays include mismatch probes (MMs) that are designed to measure background intensity, a number of investigators have noted that the assumption that MMs measure the unbiased background component for perfect match probes (PMs) is often violated in reality (36–38). The widely used alternatives to the manufacturer's default algorithm usually use some empirical Bayes approach to borrow strength across probes in order to estimate background parameters. One popular method, robust multi-array analysis (RMA) (38, 39), assumes that each array has a normally distributed background component B_{gij} ∼ N(\mu_i^B, \sigma_i^B) and uses the observations to the left of the empirical mode to estimate \mu^B and \sigma^B. Assuming an exchangeable global background often underestimates the background for probes with a stronger tendency for nonspecific binding. GCRMA (40) reduces the bias in background estimation by borrowing strength only across probes with similar affinity to the nonspecific binding background. This affinity measure is calculated from a model that predicts nonspecific binding from the probe sequence, and the model parameters are calibrated from background-only experiments. The same idea has also been applied to tiling arrays (41) and exon arrays (42). Many normalization procedures have been applied to one-channel array data, from simple scaling that equalizes the medians of all arrays to nonlinear normalization that gives all arrays the same empirical distribution. Bolstad et al. (11) compare a number of normalization methods and recommend quantile normalization as a fast and effective choice. Quantile normalization is used by both the RMA and GCRMA methods. One important issue often neglected by researchers is that all normalization methods make assumptions about what is expected to be constant across arrays.
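Quantile normalization itself can be written in a few lines. The sketch below (ignoring ties, with made-up intensities) replaces each value by the average, across arrays, of the values sharing its rank:

```python
import numpy as np

def quantile_normalize(X):
    """Force every column (array) to share the same empirical distribution.

    The reference distribution is the mean of the sorted columns: each
    entry is replaced by the average value at its rank across arrays
    (ties ignored for simplicity).
    """
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # rank of each entry per column
    mean_sorted = np.sort(X, axis=0).mean(axis=1)      # reference distribution
    return mean_sorted[ranks]

# Hypothetical log intensities: the second array is brighter overall.
X = np.array([[4.0, 6.0],
              [5.0, 7.0],
              [9.0, 11.0]])
Xn = quantile_normalize(X)
print(Xn)
```

After normalization both columns carry exactly the same set of values; only the ranking within each array is preserved.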
For example, quantile normalization assumes that either there is only a small proportion of genes with differential expression, or the up- and downregulation are approximately symmetric. When these assumptions are not met, normalization may introduce unwanted variation into the raw data rather than removing obscure systematic noise. Instead of making assumptions about what is not changing, Li and Wong (43) select probes that show little change of rank across arrays as an invariant set to serve as the normalization elements. Recently, Calza et al. (44) used a robust linear model to estimate array-to-array variation and define a least-variant set of genes, which appears to give better results when the symmetric up/down regulation assumption is violated. Background-adjusted and normalized data are usually log transformed before summarization. Log transformation seems a natural choice based on Equation [1], since the parameter of interest and the measurement error are additive in the exponential term. This transformation also gives the differences an easy
interpretation as log ratios. Other transformations that also stabilize the variance have been proposed (32, 33). For example, variance stabilization and normalization (VSN) combines background adjustment with a generalized log transformation. The summarization step usually involves some robust average. The Affymetrix default, Microarray Suite (MAS) 5.0, uses Tukey's biweight, a robust mean that down-weights extreme values, to summarize information from multiple probes. Summarizing multiple arrays at the same time appears to give even better results, since outlier probes that consistently behave differently from other probes are easier to identify this way. For a given gene, consider the background-corrected, normalized log signal of the jth probe on the ith array, s_{ij}. Equation [1] implies

s_{ij} = a_j + \mu_i + \varepsilon_{ij},

where the \mu_i are the gene expression values. One fast algorithm that provides a robust estimate of \mu_i is median polish, and it is used in both RMA and GCRMA. More than a dozen preprocessing methods have been developed for Affymetrix GeneChip arrays, and many of the ideas have been adapted to newer platforms for applications beyond gene expression. A convenient assessment tool, Affycomp (http://affycomp.jhsph.edu), has been developed to benchmark various expression measures (45, 46). Affycomp uses two Affymetrix spike-in data sets and provides summary statistics and graphical tools for the comparison and assessment of a collection of algorithms, so that users can make informed decisions about their choice of preprocessing method.

3.1.2. Two-Channel Arrays
As with single-channel arrays, the first step in two-channel array preprocessing is often background correction. For each channel, a set of background intensities is usually provided along with the foreground intensities. The background intensities are a summary of the pixels from the local area surrounding the probe (foreground) pixels. Background correction is usually done by simply subtracting the background intensity from the foreground intensity. One problem with simple subtraction is that the background intensity is sometimes higher than the foreground; the background-subtracted data are then essentially meaningless and become missing data upon log transformation. In addition, background subtraction often inflates the noise of the (logged) expression values, especially for low expression measures. Kooperberg et al. (47) provide a Bayesian background correction method, based on the summary statistics of the image data, that avoids both problems. Nonetheless, both types of background adjustment only deal with optical background and not cross-hybridization background, which presumably is a more important source of background intensity.
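The negative-value problem with simple subtraction is easy to demonstrate with made-up foreground/background summaries:

```python
import numpy as np

# Hypothetical foreground/background summaries for five probes in one channel.
fg = np.array([310.0, 220.0, 95.0, 4000.0, 150.0])
bg = np.array([100.0, 110.0, 120.0, 95.0, 140.0])

# Naive subtraction: the third probe goes negative and is lost on the log scale.
sub = fg - bg
with np.errstate(invalid="ignore"):
    log_sub = np.log2(sub)
print(log_sub)  # the negative entry becomes NaN

# A common ad hoc workaround: floor at a small positive value before logging.
# Model-based corrections (such as the Bayesian method cited above) avoid
# both the missing values and the noise inflation at low intensities.
log_floor = np.log2(np.maximum(sub, 0.5))
```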
Many researchers simply choose to omit the optical background correction step. Normalization is indisputably a necessary step in preprocessing two-channel arrays. In addition to array-to-array variation, the two channels within one array also have systematic dye effects that need to be removed. Therefore, for two-channel arrays, normalization includes within-array and between-array procedures. The easiest way to examine the relationship between the two channels is to plot the difference versus the average of the log intensities, typically referred to as an MA plot. Figure 8.6 shows an example. As with one-channel arrays, in many experiments it is expected that either only a small proportion of genes actually show differential expression between the two samples or there are similar extents of up- and down-regulation. This implies that the data points should scatter symmetrically around the M = 0 line. Figure 8.6a shows the raw data as well as loess smooth fits for each print-tip. Clearly the data are not symmetric around the M = 0 line. The different smooth lines also suggest that the dye bias is print-tip specific. Figure 8.6b shows the normalized data, in which the loess fitted values have been corrected for. After within-array normalization is done, between-array normalization can use methods similar to those for single-channel arrays; we omit this part to avoid repetition.
(b)
2 (1,1) (2,1)
(1,2) (2,2) log(R)−log(G)
log(R) − log(G)
2
1
0
1
0
−1
–1
6
8
10
12
14
[log(R) + log(G)]/2
16
6
8
10
12
14
16
[log(R) + log(G)]/2
Fig. 8.6. (a). MA plot of a two–channel array before normalization. The lines are loess fits for four different print-tip groups. (b). As (a) but after print-tip specific loess normalization using Bioconductor (59) package marray.
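The within-array step can be sketched in plain Python. This is not the marray code: the running-mean smoother below is a crude stand-in for the loess fits shown in Fig. 8.6, and the function name is ours. It assumes M and A have already been computed as the difference and average of the log intensities for each spot.

```python
from collections import defaultdict

def printtip_normalize(m, a, tips, window=2):
    """Remove intensity-dependent dye bias separately for each
    print-tip group by subtracting a crude running mean of M along
    A (a stand-in for the loess fits used in practice)."""
    m_norm = list(m)
    groups = defaultdict(list)
    for i, tip in enumerate(tips):
        groups[tip].append(i)
    for idx in groups.values():
        idx.sort(key=lambda i: a[i])          # order spots by average intensity A
        for pos, i in enumerate(idx):
            lo = max(0, pos - window)
            hi = min(len(idx), pos + window + 1)
            trend = sum(m[j] for j in idx[lo:hi]) / (hi - lo)
            m_norm[i] = m[i] - trend          # center M around the local trend
    return m_norm

# A constant 2-fold dye bias (M = 1 everywhere) is removed entirely.
print(printtip_normalize([1.0, 1.0], [10.0, 11.0], ["(1,1)", "(1,1)"]))
```

In a real analysis one would instead use the print-tip loess normalization implemented in Bioconductor packages such as marray, as the figure caption indicates.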
3.2. Preprocessing Mass Spectrometry Proteomics Data
Mass spectrometry is a technology used for detection and quantification of macromolecules. These include oligonucleotides and oligosaccharides, but the most common use is for the analysis of complex protein and peptide samples. The data generated in mass spectrometry are a spectrum of ion abundance at various mass-to-charge ratios (m/z), as shown in Fig. 8.7. The spectrum is produced by a mass spectrometer containing three main parts: an ionization source, an analyzer,
Wu and Wu

Fig. 8.7. (a) An example raw spectrum. (b) The baseline estimated from the spectrum in (a). (c) The baseline-subtracted spectrum. (d) The smoothed spectrum (dashed line), with peaks indicated by circles. All graphs were made with the Bioconductor package PROcess and its included test data.
and a detector. The most commonly used ionization sources are matrix-assisted laser desorption/ionization (MALDI) and electrospray ionization (ESI) (48). The biological sample is ionized and the ions are separated based on their m/z by, for example, a time-of-flight (TOF) spectrometer. Ions receive energy from an electric field proportional to the charge they carry, which transforms into kinetic energy: Ez = (1/2)mv². The ions fly through a fixed distance D before reaching the detector, so the TOF reflects m/z: t = D/v ∝ √(m/z).
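The square-root scaling can be checked numerically (a sketch with arbitrary units; the function and parameter values are ours, chosen only to illustrate the proportionality):

```python
import math

def time_of_flight(mz, distance=1.0, energy_per_charge=20000.0):
    """Idealized time of flight for an ion of mass-to-charge ratio m/z.
    From Ez = (1/2) m v^2: v = sqrt(2 E z / m), so
    t = D / v = D * sqrt((m/z) / (2 E)).
    Units are arbitrary; the point is the square-root scaling."""
    return distance * math.sqrt(mz / (2.0 * energy_per_charge))

# Quadrupling m/z doubles the flight time, since t is proportional to sqrt(m/z).
t1 = time_of_flight(5000.0)
t2 = time_of_flight(20000.0)
print(t2 / t1)  # → 2.0
```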
Theoretically, all ions start their flight simultaneously, so those with the same m/z reach the detector at the same time and form a peak. The TOF represents the m/z ratio, and the intensity at an m/z represents the quantity of ions. An ideal spectrum would have spikes at the m/z values associated with specific molecules, with intensity proportional to quantity, and 0 (baseline) otherwise. In practice, spectra are affected by noise due to sample preparation, contaminants, instrument setup, and electrical noise. Isotopes also give the same molecule different masses and thus different m/z values, broadening the peak. As a result, the baseline is elevated at various levels along the spectrum and the peaks for one molecule are not at a unique m/z location. Variations in sample preparation or degradation may cause systematic changes in the spectrum, making absolute peak values not comparable across spectra. All of these issues make preprocessing a crucial step in analyzing mass spectrometry data. Inappropriate preprocessing reduces reproducibility and can result in misleading biological conclusions (49, 50). Similar to microarray data, we can decompose the observed intensity of a mass spectrum into several components (51): fi(t) = Bi(t) + φi Si(t) + εi(t), where fi(t) is the intensity of spectrum i at time t, Bi(t) and Si(t) represent the baseline and signal, respectively, φi is the normalization factor for spectrum i, and εi(t) is the random noise. Preprocessing produces an estimate of Si(t) for each spectrum.

3.2.1. Baseline Subtraction
Baseline subtraction is a step similar to background adjustment in microarray data. Since the baseline is not constant along the spectrum, and the observed data are always a combination of baseline and real peaks, the baseline is usually estimated by a smoothing technique. A robust local regression can be used. An even simpler approach, using local minima, has also been observed to yield better results. Another advantage of using local minima is that it avoids fitting a model that may not be appropriate for some spectra (51). Figure 8.7b shows the estimated baseline and Fig. 8.7c shows the baseline-subtracted spectrum.
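The local-minima idea can be sketched as follows (our own toy implementation, not the PROcess code; real spectra would call for a much wider window and robust smoothing of the minima):

```python
def estimate_baseline(intensity, window=50):
    """Estimate a varying baseline as the minimum intensity within a
    sliding window around each point, a simple stand-in for the
    robust local-regression estimators used in practice."""
    n = len(intensity)
    baseline = []
    for i in range(n):
        lo = max(0, i - window)
        hi = min(n, i + window + 1)
        baseline.append(min(intensity[lo:hi]))
    return baseline

def subtract_baseline(intensity, window=50):
    """Subtract the estimated baseline, flooring at zero so the
    corrected spectrum stays non-negative."""
    base = estimate_baseline(intensity, window)
    return [max(0.0, y - b) for y, b in zip(intensity, base)]
```

No model is fitted here, which is the advantage the text mentions: local minima track the elevated baseline even for spectra whose baseline shape would break a parametric fit.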
3.2.2. Peak Identification
After the baseline is subtracted, the adjusted spectrum is still a huge data set with an intensity value for each possible m/z. The peaks representing real signal need to be identified and separated from noise. This can be done by simply identifying local maxima with a sliding window (52) on the smoothed spectrum, as shown in Fig. 8.7d. More recently, researchers seem to be converging on the use of wavelets (53–57).
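A minimal version of sliding-window peak picking (our own sketch; the window size and noise threshold are illustrative parameters, and wavelet-based methods would replace this in a modern pipeline):

```python
def find_peaks(intensity, half_window=2, min_height=5.0):
    """Flag index i as a peak if intensity[i] is the unique maximum
    within a window of +/- half_window points and exceeds a noise
    threshold. A toy version of window-based peak picking."""
    peaks = []
    n = len(intensity)
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        window = intensity[lo:hi]
        if (intensity[i] >= min_height
                and intensity[i] == max(window)
                and window.count(intensity[i]) == 1):
            peaks.append(i)
    return peaks
```

The `min_height` threshold is what separates signal from noise here; in practice it would be set relative to a local noise estimate rather than a fixed constant.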
3.2.3. Normalization
Normalization makes data across spectra more comparable. The most common method for normalizing spectra is global normalization, which scales each sample by a constant. One widely used
scaling factor is the total ion current, which can be estimated as the area under the curve of the baseline-subtracted spectrum, or of the denoised spectrum with only the peaks. Another choice is to use the total peak height of a group of control peaks. Most preprocessing pipelines for mass spectrometry data include all three components above, although not necessarily in the same order. This remains an active research area. A more complete review of this topic is given in the chapter "Pre-Processing Mass Spectrometry Data" in the book Fundamentals of Data Mining in Genomics and Proteomics (51). A number of software packages for mass spectrometry preprocessing have been made publicly available. Some examples are:
• BioConductor (http://bioconductor.org) provides an R package, PROcess.
• A MATLAB software package, Cromwell, is provided by MD Anderson (http://bioinformatics.mdanderson.org/software.html).
• MS-Analyzer (http://staff.icar.cnr.it/proteus/msanalyzer.html) provides preprocessing and data mining services for proteomics (58).

References

1. Gentleman, R. and Biocore. geneplotter: Graphics related functions for Bioconductor. R package version 1.20.0.
2. Ringnér, M. (2008) What is principal component analysis? Nat. Biotechnol., 26, 303–304.
3. Mutelo, R. M., Woo, W. L., and Dlay, S. S. (2008) Two dimensional principle component analysis of gabor features for face representation and recognition. Communication Systems, Networks and Digital Signal Processing, CNSDSP, 457–461.
4. Li, J., Tao, D., Hu, W., and Li, X. (2005) Kernel principle component analysis in pixels clustering. Web Intelligence, 2005. Proceedings, IEEE/WIC/ACM International Conference, 786–789.
5. Lee, J.-K., Kim, K.-H., Kim, T.-Y., and Choi, W.-H. (2003) Nonlinear principle component analysis using local probability. Science and Technology, Proceedings KORUS, 2, 103–107.
6. Shah, M. and Sorensen, D. C. (2005) Principle component analysis and model reduction for dynamical systems with symmetry constraints. Decision and Control on 2005 and 2005 European Control Conference, 2260–2264.
7. Yang, H., Zhang, J. Q., and Wang, B. (2007) Hypercomplex principle component weighted approach to multiple-spectral and panchromatic images fusions. Geoscience and Remote Sensing Symposium on IEEE, 3096–3099.
8. Chen, T., Hsu, Y. J., Liu, X., and Zhang, W. (2002) Principle component analysis and its variants for biometrics. Image Processing. Proceedings. 2002 International Conference, 1, 61–64.
9. Friston, K. J., Frith, C. D., Liddle, P. F., and Frackowiak, R. S. (1993) Functional connectivity: The principal-component analysis of large (PET) data sets. J. Cereb. Blood Flow Metab., 13, 5–14.
10. Alter, O., Brown, P. O., and Botstein, D. (2000) Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl. Acad. Sci. USA, 97, 10101–10106.
11. Bolstad, B. M., Irizarry, R. A., Åstrand, M., and Speed, T. P. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2), 185–193.
12. Verhaak, R. G., Sanders, M. A., Bijl, M. A., Delwel, R., Horsman, S., Moorhouse, M. J., van der Spek, P. J., Löwenberg, B., and Valk, P. J. (2006) HeatMapper: Powerful combined visualization of gene expression profile correlations, genotypes, phenotypes and sample characteristics. BMC Bioinformatics, 7, 337.
13. Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95, 14863–14868.
14. Kibbey, C. and Calvet, A. (2005) Molecular property explorer: A novel approach to visualizing SAR using tree-maps and heatmaps. J. Chem. Inf. Model., 45, 523–532.
15. Lee, P. S. and North, C. (2005) Visualization of graphs with associated timeseries data. Information Visualization, 2005. INFOVIS 2005. IEEE Symposium, 225–232.
16. Fisher, D. (2007) Hotmap: Looking at geographic attention. IEEE Trans. Vis. Comput. Graph., 13(6), 1184–1191.
17. Podowski, R. M., Miller, B., and Wasserman, W. W. (2006) Visualization of complementary systems biology data with parallel heatmaps. IBM J. Res. Dev., 50(6), 575–581.
18. Phattarsukol, S. and Muenchaisri, P. (2001) Identifying candidate objects using hierarchical clustering analysis. Software Engineering Conference on APSEC, 381–389.
19. Werle, P., Borsi, H., and Gockenbach, E. (1999) Hierarchical cluster analysis of broadband measured partial discharges as part of a modular structured monitoring system for transformers. High Voltage Engineering, 1999. Eleventh International Symposium, 5, 29–32.
20. Hooper, E. (2007) An intelligent intrusion detection and response system using hybrid Ward hierarchical clustering analysis. 2007 International Conference on Multimedia and Ubiquitous Engineering, 1187–1192.
21. Yanagida, R. and Takagi, N. (2005) Consideration on hierarchical cluster analysis based on connecting adjacent hyper-rectangles. 2005 IEEE International Conference on Systems, Man and Cybernetics, 3, 2795–2800.
22. Kobayasi, M. (1999) Classification of color combinations based on distance between color distributions. Image Processing, 3, 70–74.
23. Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., and Levine, A. J. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA, 96, 6745–6750.
24. Hodge, D., Karim, N., and Reardon, K. F. (2003) Hierarchical cluster analysis to detect coordinated protein expression in metabolically engineered Zymomonas mobilis. Proc. Am. Control Conf., 3, 2081–2082.
25. Muzinich, N. (2005) Discovery of prokaryotic relationships through latent structure of correlated nucleotide sequences. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 143.
26. Wang, Y. and Chen, H. (2008) Sex differences in hierarchical clustering of the spontaneous fluctuations in brain resting state. Bioinformatics and Biomedical Engineering on ICBBE, 2087–2090.
27. Liao, W., Chen, H., Yang, Q., and Lei, X. (2008) Analysis of fMRI data using improved self-organizing mapping and spatio-temporal metric hierarchical clustering. IEEE Trans. Med. Imaging, 27(10), 1472–1483.
28. Kaufman, L. and Rousseeuw, P. J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Mathematical Statistics. Wiley.
29. Liu, J.-L., Bai, Y., Kang, J., and An, N. (2006) A new approach to hierarchical clustering using partial least squares. 2006 International Conference on Machine Learning and Cybernetics, 1125–1131.
30. Getz, G., Levine, E., and Domany, E. (2000) Coupled two-way clustering analysis of gene microarray data. Proc. Natl. Acad. Sci. USA, 97, 12079–12084.
31. Dougherty, E. R., Barrera, J., Brun, M., Kim, S., Cesar, R. M., Chen, Y., Bittner, M., and Trent, J. M. (2002) Inference from clustering with application to gene-expression microarrays. J. Comput. Biol., 9, 105–126.
32. Durbin, B. P., Hardin, J. S., Hawkins, D. M., and Rocke, D. M. (2002) A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics, 18(Suppl. 1), S105–S110.
33. Huber, W., von Heydebreck, A., Sueltmann, H., Poustka, A., and Vingron, M. (2003) Parameter estimation for the calibration and variance stabilization of microarray data. Stat. Appl. Genet. Mol. Biol., 2(1), Article 3.
34. Rocke, D. M. and Durbin, B. (2001) A model for measurement error for gene expression arrays. J. Comput. Biol., 8(6), 557–569.
35. Wu, Z. and Irizarry, R. A. (2007) A statistical framework for the analysis of microarray probe-level data. Ann. Appl. Stat., 1(2), 333–357.
36. Naef, F., Hacker, C. R., Patil, N., and Magnasco, M. (2002) Empirical characterization of the expression ratio noise structure in high-density oligonucleotide arrays. Genome Biol., 3, RESEARCH0018.
37. Naef, F., Lim, D. A., Patil, N., and Magnasco, M. (2002) DNA hybridization to mismatched templates: A chip study. Phys. Rev. E, 65, 040902.
38. Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y., Antonellis, K., Scherf, U., and Speed, T. P. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, 249–264.
39. Irizarry, R. A., Bolstad, B. M., Collin, F., Cope, L. M., Hobbs, B., and Speed, T. P. (2003) Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res., 31(4), e15.
40. Wu, Z., Irizarry, R., Gentleman, R., Martinez-Murillo, F., and Spencer, F. (2004) A model-based background adjustment for oligonucleotide expression arrays. J. Am. Stat. Assoc., 99(468), 909–917.
41. Johnson, W. E., Li, W., Meyer, C. A., Gottardo, R., Carroll, J. S., Brown, M., and Liu, X. S. (2006) Model-based analysis of tiling-arrays for ChIP-chip. Proc. Natl. Acad. Sci. USA, 103, 12457–12462.
42. Kapur, K., Xing, Y., Ouyang, Z., and Wong, W. H. (2007) Exon arrays provide accurate assessments of gene expression. Genome Biol., 8, R82.
43. Li, C. and Wong, W. H. (2001) Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proc. Natl. Acad. Sci. USA, 98, 31–36.
44. Calza, S., Valentini, D., and Pawitan, Y. (2008) Normalization of oligonucleotide arrays based on the least-variant set of genes. BMC Bioinformatics, 9, 140.
45. Cope, L. M., Irizarry, R. A., Jaffee, H., Wu, Z., and Speed, T. P. (2004) A benchmark for Affymetrix GeneChip expression measures. Bioinformatics, 20, 323–331.
46. Irizarry, R. A., Wu, Z., and Jaffee, H. A. (2006) Comparison of Affymetrix GeneChip expression measures. Bioinformatics, 22, 789–794.
47. Kooperberg, C., Fazzio, T. G., Delrow, J. J., and Tsukiyama, T. (2002) Improved background correction for spotted DNA microarrays. J. Comput. Biol., 9(1), 55–66.
48. Glish, G. L. and Vachet, R. W. (2003) The basics of mass spectrometry in the twenty-first century. Nat. Rev. Drug Discov., 2, 140–150.
49. Baggerly, K. A., Edmonson, S. R., Morris, J. S., and Coombes, K. R. (2004) High-resolution serum proteomic patterns for ovarian cancer detection. Endocr. Relat. Cancer, 11, 583–584.
50. Baggerly, K. A., Morris, J. S., and Coombes, K. R. (2004) Reproducibility of SELDI-TOF protein patterns in serum: comparing datasets from different experiments. Bioinformatics, 20, 777–785.
51. Coombes, K. R., Baggerly, K. A., and Morris, J. S. (2007) Pre-processing mass spectrometry data. In: Fundamentals of Data Mining in Genomics and Proteomics. Springer US, 79–102.
52. Yasui, Y., Pepe, M., Thompson, M. L., Adam, B. L., Wright, G. L., Qu, Y., Potter, J. D., Winget, M., Thornquist, M., and Feng, Z. (2003) A data-analytic strategy for protein biomarker discovery: profiling of high-dimensional proteomic data for cancer detection. Biostatistics, 4, 449–463.
53. Kwon, D., Vannucci, M., Song, J. J., Jeong, J., and Pfeiffer, R. M. (2008) A novel wavelet-based thresholding method for the pre-processing of mass spectrometry data that accounts for heterogeneous noise. Proteomics, 8, 3019–3029.
54. Lange, E., Gropl, C., Reinert, K., Kohlbacher, O., and Hildebrandt, A. (2006) High-accuracy peak picking of proteomics data using wavelet techniques. Pac. Symp. Biocomput., 243–254.
55. Li, X., Li, J., and Yao, X. (2007) A wavelet-based data pre-processing analysis approach in mass spectrometry. Comput. Biol. Med., 37, 509–516.
56. Coombes, K. R., Tsavachidis, S., Morris, J. S., Baggerly, K. A., Hung, M. C., and Kuerer, H. M. (2005) Improved peak detection and quantification of mass spectrometry data acquired from surface-enhanced laser desorption and ionization by denoising spectra with the undecimated discrete wavelet transform. Proteomics, 5, 4107–4117.
57. Du, P., Kibbe, W. A., and Lin, S. M. (2006) Improved peak detection in mass spectrum by incorporating continuous wavelet transform-based pattern matching. Bioinformatics, 22, 2059–2065.
58. Cannataro, M. and Veltri, P. (2007) MS-Analyzer: preprocessing and data mining services for proteomics applications on the grid. Concurrency and Computation, 19(15), 2047–2066.
59. Gentleman, R. C., Carey, V. J., Bates, D. M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., Hornik, K., Hothorn, T., Huber, W., Iacus, S., Irizarry, R., Leisch, F., Li, C., Maechler, M., Rossini, A. J., Sawitzki, G., Smith, C., Smyth, G., Tierney, L., Yang, J. Y. H., and Zhang, J. (2004) Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol., 5, R80.
Part III Statistical Methods for Microarray Data
Chapter 9

Introduction to the Statistical Analysis of Two-Color Microarray Data

Martina Bremer, Edward Himelblau, and Andreas Madlung

Abstract

Microarray experiments have become routine in the past few years in many fields of biology. Analysis of array hybridizations is often performed with the help of commercial software programs, which produce gene lists and graphs, and sometimes provide values for the statistical significance of the results. Exactly what is computed by many of the available programs is often not easy to reconstruct or may even be impossible to know for the end user. It is therefore not surprising that many biology students and some researchers using microarray data do not fully understand the nature of the underlying statistics used to arrive at the results. We have developed a module that we have used successfully in undergraduate biology and statistics education that allows students to get a better understanding of both the basic biological and statistical theory needed to comprehend primary microarray data. The module is intended for the undergraduate level but may be useful to anyone who is new to the field of microarray biology. Additional course material that was developed for classroom use can be found at http://polyploidy.org/. In our undergraduate classrooms we encourage students to manipulate microarray data using Microsoft Excel to reinforce some of the concepts they learn. We have included instructions for some of these manipulations throughout this chapter (see the "Do this. . ." boxes). However, it should be noted that while Excel can effectively analyze our small sample data set, more specialized software would typically be used to analyze full microarray data sets. Nevertheless, we believe that manipulating a small data set with Excel can provide insights into the workings of more advanced analysis software.
Key words: Microarray, variation, variance, normalization, dye-swap, t distribution, t-test, ANOVA, Bonferroni Method, False Discovery Rate (FDR).
H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_9, © Springer Science+Business Media, LLC 2010

1. Introduction

Knowing the transcriptional activity of a gene can give valuable insight into the function of the protein it encodes and the role it plays in an organism. Gene activity in the same individual can vary
from tissue to tissue, between different developmental stages, or even from morning to nighttime. Gene activity is influenced by the activity of other genes and the proteins they encode. Gene expression can change in response to outside factors, such as the environment or exposure of the organism to chemical substances, competitors, or pathogens. The classical approach to measuring the activity of a gene has been to isolate messenger RNA (mRNA), design nucleic acid molecules complementary to the gene of interest, and use those to estimate the amount of mRNA of the gene of interest present at a given time in the organism. Traditionally, this has been done for one gene at a time. Whole genome sequencing projects of many species, including humans, have provided information that allows researchers to distinguish every gene in the organism. The development of microarray technology has made it possible to survey the gene expression activity of thousands of genes at the same time. This can be done in several ways. Techniques include so-called one-color or two-color microarrays, and there are several types of materials from which the gene chips can be produced. A commonly used one-color array method was first developed by Affymetrix and has become somewhat of an industry standard for single-color arrays for many model organisms. Here we focus on the traditional two-color spotted array, which is cheaper to produce and widely used for transcriptomic analysis of many species. Two-color microarrays are made by using short pieces of DNA, each uniquely representing one gene, and spotting them onto a solid support, such as a microscope glass slide. Extremely small capillaries are used to apply these pieces of DNA. Up to 100,000 genes can be represented on a single conventional 1.5 cm × 5 cm slide, but a more common number is somewhere around 25,000 DNA sequences.
Using these microscopic arrays of DNA spots, researchers can assess the relative amount of mRNA for all represented genes (called the target spots) in a sample in the time it takes to analyze the activity of a single gene. Such technological advances have revolutionized the way molecular bioscience is done and have sped up the rate of new discoveries. However, they have also led to the rapid acquisition of huge amounts of data that require the use of biostatistics for analysis and validation. In practice, gene activity is assessed by labeling mRNA extracted from an organism with fluorescent dyes. The labeled mRNA, known as the "probe," is applied to the glass slide and allowed to bind to its complementary spot on the array. This process is called hybridization. Subsequently, the unbound mRNA is washed off the slide. The slide is scanned, and the amount of fluorescently labeled mRNA bound to each spot is proportional to the activity of the gene it represents. In most cases, software analysis is then used to determine how
much of a signal is due to biologically relevant processes and how much is due to technical "noise." In this chapter we will present the elementary ideas behind the statistical analysis of microarray data acquired by the standard two-color hybridization approach. These types of analyses are similar to those that some commercial software packages would perform. Readers desiring more in-depth discussion of advanced statistical tools are referred to some excellent reviews published on this topic (1, 2) (see Chapter 1 for a comprehensive review of basic methods).

Why would a researcher want to do microarray experiments? In essence, a microarray experiment can give useful information for any question that asks whether or not two different populations of cells express different sets of genes. Suppose, for example, that a researcher wants to find out which genes become active if a plant is subjected to prolonged drought stress (Fig. 9.1). An appropriate experiment would be to have one set of plants growing in optimal conditions and a second set growing in the same conditions, except with limited water. After a few days under these conditions, tissue is harvested from both sets (treatment: no water; control: well watered) and mRNA is extracted. As described later in more detail, a common method used in microarray analysis is to label mRNA from the treatment group with one color dye and the control mRNA with another color. Equal amounts of mRNA are then used for the hybridization to the array. If a scanner with the capacity to detect two colors is used, relative amounts of mRNA of each gene can be compared between the control group and the treatment group. Genes up-regulated ("turned on") in response to drought stress will show a stronger signal of one color (treatment) than the other color (control). After statistical analysis of the data obtained for all of the 25,000 genes, a gene list is generated, allowing the researcher to know which genes are activated by the treatment.
In our example, these are genes that become active in response to drought stress. Experiments such as this one have been conducted using a variety of microarray platforms and plant species (3–7).
2. What Is Statistics?

Statistics is a collection of procedures and formulas that allow us to make decisions when faced with uncertainty. Where does the uncertainty come from? Many experiments that address the same problem or question can differ in their outcomes when conducted by different people or with different materials. Statisticians call this variation. In microarray experiments, the two main sources of variation that cause the uncertainty are as follows.
Fig. 9.1. Comparing gene expression using microarrays. mRNA is extracted from a plant that has undergone an experimental treatment (T, drought stress in this case) and an untreated control (C). The mRNA undergoes reverse transcription to generate cDNAs. A different fluorescent molecule is used to label each of the cDNA pools (represented by circles and squares at the end of each cDNA). The labeled cDNAs are then hybridized to a microarray. The microarray consists of a glass slide on which thousands of distinct DNA sequences have been affixed. Each dot (or "feature") on the slide represents the sequence of a different plant gene. After the unbound probe is washed away, a special slide scanner excites each feature on the array with a laser and measures the fluorescent signal emitted. The more cDNA is bound to a spot, the greater the signal will be. The magnified computer screen at the lower right shows the possible results for each feature. A red spot (A) represents a gene that is only expressed in the control. A green spot (B) represents a gene that is only expressed in the treated plant. A yellow spot (C) represents a gene that is expressed in both treated and control plants. A dark spot (D) indicates that the corresponding gene is not expressed in either the control or treated plants. (see this figure in color at: http://polyploidy.org/index.php/Microarray_analysis).
2.1. Biological Variation
Different organisms have different gene expression profiles; in other words, the activity of their genes varies. The measured expression levels hence vary from individual to individual in the study (Fig. 9.2).
Fig. 9.2. Sources of variation in gene expression studies. Two sources of variation influence microarray analysis. Biological variation refers to the differences in gene expression between individuals used in the study. Technical variation refers to the differences resulting from human and manufacturing error (differences in the manufacture of microarray slides in this case).
2.2. Technical Variation
Due to human error, there can be slight variation in microarray manufacturing and hybridization of the mRNA to the slide. Even if an experiment calls for applying the same amount of mRNA from the same organism to two identical slides, the measurements may be different (Fig. 9.2). Carefully conducted experiments can keep the technical variation to a minimum. The biological variation between individuals in the experimental groups, however, cannot be influenced. When designing the experiment it is important to keep in mind that statistics can only provide information if the data set provides replications from which to estimate the degree of technical or biological variation. Obviously, just as an average cannot be calculated from a single observation, variation, and any measure of certainty that the measured value is close to the "true" value, rely on multiple measurements. Three of the most important statistical concepts are the mean, median, and standard deviation of a set of measurements (8). While the mean and median are used to describe the center of the measurements, the standard deviation is used to describe the spread (Fig. 9.3).
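As a quick illustration, the three summaries can be computed with Python's standard statistics module (an alternative to the spreadsheet exercises in this chapter; the measurement values are made up):

```python
import statistics

measurements = [4.1, 3.8, 5.0, 4.4, 3.7]

mean = statistics.mean(measurements)      # center: the average
median = statistics.median(measurements)  # center: the middle observation
sd = statistics.stdev(measurements)       # spread: sample SD, divisor n - 1

print(mean, median, sd)
```

Note that `statistics.stdev` uses the n − 1 divisor, matching the sample standard deviation formula given below.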
Fig. 9.3. Graphic representation of the mean, median, and standard deviation for a set of points.
Mean: the average of the n measurements:

x̄ = (1/n) Σ_{i=1}^{n} x_i

Median: the middle observation. (If the number of observations n is odd, the middle number is used. If n is even, the average of the two middle observations is used.)

Standard deviation: measures the average distance of the n observations x_1, . . . , x_n from the mean x̄:

s = √[ (1/(n−1)) Σ_{i=1}^{n} (x_i − x̄)² ]

3. Overview of Microarray Experiments
It is the goal of many microarray experiments to compare the gene expression levels of a treatment group with those of a control group. For this purpose, mRNA is extracted from the cells of several individuals in each group. The samples are labeled with red and green fluorescent dyes and allowed to bind to the DNA on the array (target spots). The scanner divides the features on the array into pixels and for each pixel a computer records the scanned red and green intensity. Usually, the spots are not uniform. The intensity in both the red and the green channels may vary over these pixels. The spots with very low intensity are called the background (Fig. 9.4).
Fig. 9.4. Spot intensity. After labeled cDNA is hybridized to a microarray each fluorescent label is individually excited and detected by the slide scanner. The scanner divides the spots into pixels. The spots are not uniform and there can be variable intensity from pixel to pixel within each spot. Also, there are areas of low intensity, “background” fluorescence in the regions between spots. After scanning, the red and green signals are superimposed by the computer that generates a composite yellow spot (notice the distinct red and green pixels in the background of the superimposed image.) (see this figure in color at: http://polyploidy.org/index.php/Microarray_analysis).
4. Microarray Data Analysis

What do the many numbers in a microarray experiment output file actually mean? When a microarray experiment is conducted, the result is usually a large data file with many columns and as many rows as there were spots on the array. It may look something like Fig. 9.5. Usually, the output file contains much more information than shown here, but we will only use the pictured columns for our analysis.
Fig. 9.5. Example of the output file of a microarray experiment. After scanning the slide, the intensities of both the red and green signals are recorded separately (even if the spot looks yellow to the eye, it is comprised of green and red labeled probe). The data are stored in a so-called .gpr file that can be copied directly into a spreadsheet, such as, in this case, a Microsoft Excel file.
The first three columns (labeled "Block," "Column," and "Row") tell us the position of the scanned spot on the array. The column labeled "Name" contains the name of the gene that was spotted there. The column labeled "ID" contains information pertaining to the exact part of the gene that was used to originally produce the target spot on the glass slide. The red and green intensities recorded by the scanner are reported in the next four columns. "F" stands for foreground and "B" for background. The numbers 635 and 532 represent the wavelengths of the red and green laser light, respectively. For example, the numbers in the F635 Median column are the median values of the scanned foreground pixels when excited with the red laser light. In addition to measuring the spot itself, the scanner also measures so-called background intensity. This is, in essence, the probe that binds to the silicate of the glass slide, falsely increasing the signal of each spot. This background intensity needs to be subtracted to get an accurate reading of the spot intensity based on probe–target interaction only. The columns that will be most helpful in our analysis are therefore the ones labeled "F635 Median – B635" and "F532 Median – B532." They contain the background-corrected median red and green intensities for each spot on the array. For some genes the background-corrected intensity is negative. That means that for some spots the median spot intensity is actually lower than the median background intensity. This can happen if the overall intensity of a spot is very low or if the spot is hard to tell apart from the background. Since negative gene expression values do not make biological sense (a gene can have no expression but not negative expression!), these values are artificially adjusted. Usually, negative values are replaced by zeros or small positive constants.

4.1. Normalization
The results of a microarray experiment are obviously influenced by technical variation. This means that the measured intensities vary on the array(s) in a systematic manner. For the statistical analysis of microarray data it is important to understand the different sources of variation that influence the results. They usually include (but are not limited to) differences in arrays, differences
Introduction to the Statistical Analysis of Two-Color Microarray Data
in dyes, and biological differences between the subjects in the experiment (9). Often more than one slide is used to conduct the experiment. If different amounts of probe are applied to the slides, the intensities on some slides may be consistently higher than the intensities on other slides, even though they measure the same genes. Normalization means to manipulate the data mathematically so as to remove such systematic differences. There are several different ways in which normalization can be done. In many microarray experiments, the results are reported as log ratios of the two intensity measurements. If Gi represents the green intensity for gene i and Ri represents the red intensity for gene i, then two quantities commonly used are

Mi = log2(Ri/Gi)    [1]

Ai = (1/2)(log2 Ri + log2 Gi)
The quantity Mi (the log ratio) describes the relationship between the two groups. If the intensity in the red- and green-labeled groups is the same, then Mi will be zero. If the red intensity is twice as big as the green, then Mi will be equal to 1. If, on the other hand, the green intensity is twice as big as the red, then Mi will be equal to –1. The quantity Ai describes the overall intensity observed for gene i. The quantities Mi and Ai are very useful in the normalization of microarray data. Before conducting the analysis for a microarray experiment, researchers often take a look at what is called the “MA plot.” For every feature i on the array, the values Mi and Ai are computed and plotted in an xy plot (Fig. 9.6).
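For illustration, the two quantities can be computed directly; the intensity values below are hypothetical, not taken from the chapter's data:

```python
import math

# Hypothetical background-corrected intensities for three spots
# (these numbers are made up for illustration).
red_intensities = [1000.0, 400.0, 250.0]    # Ri
green_intensities = [500.0, 400.0, 500.0]   # Gi

for r, g in zip(red_intensities, green_intensities):
    m = math.log2(r / g)                      # Mi: log2 fold change
    a = 0.5 * (math.log2(r) + math.log2(g))   # Ai: overall spot intensity
    print(f"M = {m:+.2f}, A = {a:.2f}")

# Red twice as big as green gives M = +1; green twice as big gives M = -1;
# equal intensities give M = 0.
```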
Fig. 9.6. Using an MA plot to visualize normalization on one array. Highly up- or downregulated genes are above or below the x-axis. The diagonal lines to the left are called “fishtails.” Fishtails are an artifact of changing the value for spots that have higher background than foreground.
Bremer, Himelblau, and Madlung
It is known that the green and red dyes used to label the samples interact differently with certain genes. The dyes have different light stability and may vary in efficiency. That means that the dye molecules are more likely to bind to certain genes than to others. Experimenters are usually most interested in those genes that have a high fold change (large positive or negative M value) while at the same time having high intensity (large A value). For spots that have a high fold change but very low intensity, it is hard to distinguish whether the fold change is due to a biological effect or due to technical variation in the measurements. On average, we would like the intensities for both dyes to be about the same. That means that, on average, the log ratios Mi should be about zero. Normalization means that one computes the average M value for all genes spotted on the array and then shifts the values so that this average becomes zero. (Follow the directions in Box 9.1 to normalize microarray data using Microsoft Excel.)
Box 9.1. Normalization of Microarray Data Do this. . . To normalize microarray data. Usually normalization is conducted simultaneously for many genes spotted on the same array. In the small-scale example below, the six features shown all represent the same gene.
In the first step the log ratios of red (635) to green (532) signal are computed. In Excel, the appropriate command is “LOG(x,2)” to compute log2 (x). The results are written into a new column labeled “M”:
The goal of normalization is to assure that the average of the M values is set to zero. Computing the average of the six M values above yields –1.154. This average value (here
–1.154) is then subtracted from the M values for all genes in the example above to “correct” them.
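The same global correction can be sketched in Python; the six M values below are hypothetical, not the values from the box:

```python
# Global normalization: shift the M values so that their average becomes zero.
m_values = [-0.5, -1.8, -0.9, -1.2, -2.0, -0.5]   # hypothetical log ratios

mean_m = sum(m_values) / len(m_values)            # average M value
normalized = [m - mean_m for m in m_values]       # subtract it from each spot

# After the shift, the average of the corrected M values is (numerically) zero.
print(sum(normalized) / len(normalized))
```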
4.2. Normalization by Dye-Swap Design
A better way to deal with uneven binding of the dyes to certain genes is to carry out experiments as dye-swaps (Fig. 9.7). In a dye-swap, the probe from each group (treatment and control) is split into two portions and labeled with different dyes. The labeled samples are then hybridized crosswise (red “treatment”
Fig. 9.7. Experimental design for a dye-swap experiment. In this design, the treatment and control are first labeled with one set of dyes and hybridized to array 1. To account for different labeling efficiencies of the two dyes, the same probes are now labeled with the other dye (the dyes are swapped) and subsequently hybridized to array 2 (see this figure in color at: http://polyploidy.org/index.php/Microarray_analysis).
with green “control” and green “treatment” with red “control”) onto two arrays. To normalize the data, the M value for each feature on the slide is obtained (as demonstrated in the previous “Do this. . .” section). For consistency we will compute the M values for one array as the red/green log ratio and for the other slide (in which the dyes are swapped) as the green/red log ratio. Then average the M values from the two arrays for each feature:

Mi = (1/2)(Mi^(1) + Mi^(2))

Here Mi^(1) is the log ratio for feature i on array 1 and Mi^(2) is the log ratio for the same feature (with dyes swapped) on array 2. (Follow the directions in Box 9.2 to conduct dye-swap normalization using Microsoft Excel.)
Box 9.2. Normalization of Microarray Data with Dye-Swaps Do this. . . Suppose you have data from a dye-swap experiment. There are two (very small) arrays. Each array contains six spots for the same gene. For each spot, you have already computed the log ratio M in the previous example. First, enter the M values for array 2 as shown below:
To conduct dye-swap normalization, we average the M values for each spot on the two arrays. After normalization, the corrected M value for the spot in column 1/row 3 is (–0.16 – 0.42)/2 = –0.29, for example. This way, an M value can be obtained for every spot on the array. For gene At1g01000, the M values are:

Corrected M values: –0.29, –0.94, –0.26, 0.42, –2.47, –1.34
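The same averaging can be sketched in Python. Only the column 1/row 3 spot's two M values (–0.16 and –0.42) are given in the text, so the sketch uses just that spot; a full script would loop over all spots:

```python
# Dye-swap normalization: average each spot's M values from the two arrays.
# The two values below are the column 1 / row 3 spot from the example:
# -0.16 on array 1 and -0.42 on array 2 (dyes swapped).
m_array1 = [-0.16]
m_array2 = [-0.42]

corrected = [(m1 + m2) / 2 for m1, m2 in zip(m_array1, m_array2)]
print(round(corrected[0], 2))  # -0.29, matching the box
```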
4.3. Other Types of Normalization
The main purpose of normalization is to mathematically remove as much systematic variation not caused by biological effects as
possible. There are many ways this can be done in addition to the global normalization and dye-swap normalization that have been shown above. In some microarray experiments, the MA plot shows a dependence of the M values on the A values. This effect can be seen as a curved point cloud in the plot, i.e., the point cloud is not distributed symmetrically around the A-axis. A possible cause for this effect is that the response of the two-color laser scans depends on the intensity of the spots. The effect can be mathematically removed, and the data normalized, by computing the loess (locally weighted scatterplot smoothing) curve and correcting each observation so that the normalized point cloud is distributed around the A-axis. Another commonly occurring problem is a systematic difference between the spots applied to the array by different print tips. The print tips are used to transfer the material from well plates to the glass slides. For efficiency, an array of print tips is used simultaneously. In the process, it is possible that one (or more) print tips may become slightly deformed. This would lead to systematic differences in all the spots subsequently printed by this tip. To compare the observations among print tips and to remove systematic differences, a print-tip normalization may be conducted. While the techniques mentioned so far apply to normalization within a single chip, it is easy to imagine that in microarray experiments in which several chips are used, systematic differences can also occur across arrays in the experiment. Other normalization methods may be used to address across-chip normalization. More information about normalization methods and details about techniques can be found in several technical reviews (10, 11).
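The idea of intensity-dependent normalization can be sketched numerically. In the sketch below a simple quadratic trend fit stands in for the true loess curve (a deliberate simplification to keep the example dependency-free; real pipelines use a genuine local regression), and the (A, M) values are synthetic:

```python
import numpy as np

# Synthetic MA data with an intensity-dependent bias: M depends on A.
rng = np.random.default_rng(0)
a_values = rng.uniform(6.0, 14.0, size=200)
m_values = 0.05 * (a_values - 10.0) ** 2 + rng.normal(0.0, 0.1, size=200)

# Fit the M-vs-A trend (quadratic stand-in for loess) and subtract it.
coeffs = np.polyfit(a_values, m_values, deg=2)
trend = np.polyval(coeffs, a_values)
m_normalized = m_values - trend

# The normalized point cloud is now centered around the A-axis.
print(abs(m_normalized.mean()))
```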
5. Drawing Conclusions It is possible to store the mind with a million facts and still be entirely uneducated (Alec Bourne)
It is often the goal of microarray experiments to identify genes that become either more or less expressed in response to a treatment, by comparing two different states (e.g., drought-stressed and well-watered plants). To carry out the experiment, the gene expression levels of plants that were subjected to drought stress (treatment group) are compared with those of plants that were watered well (control group). For every gene on the microarray we want to decide whether the expression levels in the two experimental groups are (significantly) different or not. How can we make sense of the many thousands of measurements collected simultaneously in a microarray experiment? For now, we will analyze each gene separately. Our goal is to decide whether the expression level of the gene is different in the
treatment and the control group. If the green and red intensities are different, this means that their quotient R/G is not equal to one. If the quotient is not equal to one, then the M value for the spot, which is the logarithm of the quotient [see Equation 1], is not equal to zero. It will be positive, if the red intensity is greater than the green and negative if the green intensity is greater than the red (see Fig. 9.6). The researcher is trying to detect a difference between the treatment and control group. Very few spots on the array will have equal red/green intensities (M values equal to zero). The challenge in analysis is to determine whether the observed differences are due to biological or technical variation or whether they reflect true differences in gene expression between the samples. This is achieved by a statistical analysis of the M values. But how far away from zero do these M values have to be so that we are convinced that the result is due to a real difference in the experimental groups and not just due to variation? (Fig. 9.8)
Fig. 9.8. Is the average M equal to zero? Multiple repeats for the same gene will give different results. Statistical tests provide an answer whether or not the mean of the repeats is significantly different from zero, or, in other words, if the treatment resulted in differences in gene expression from the control. On the left different means (horizontal bars) for observations with the same variation pattern are shown. The variation in the measurements is important in deciding whether the mean of a number of observations is significantly different from zero (right). Only if the distance of the absolute value of the mean is large compared to the variation within the measurements will we declare the mean significantly different from zero. If the variation is small, we may be more inclined to assume a non-zero mean than if the variation is large for the same absolute value of mean.
To decide whether the M values for a gene are far enough away from zero so that we would call the gene differentially expressed in the treatment and control group, we also have to look at the variation of these measurements. Only if the distance from zero of the average of our measurements is large compared to the variation, will we decide that the gene has different expression in the two experimental groups.
5.1. Statistical Decision Making
Hypothesis tests are an important tool for statistical decision making. They are used to answer a “Yes/No” question about a population. But instead of being able to observe the whole population, we only get to see a small sample. For example, our experiment is intended to answer a general question about Arabidopsis plants under drought stress: which genes are involved in the plants’ reaction, and how does their expression differ from that in well-watered plants? In this case our population is “Arabidopsis” and our samples are the plants we grew in our lab and from which we extracted tissue to use in the microarray experiment. Suppose that we repeatedly observe one gene under both treatment conditions. If the average difference of expression levels is large, then the question that has to be answered is whether it is plausible that this large difference is explainable by biological and technical noise alone. If the answer is “Yes,” then we have no reason to be very excited, but if the answer is “No,” then the difference that we observed is likely due to a treatment effect. This means that the gene is likely reacting to the drought condition.
5.2. Hypothesis Testing
A statistical hypothesis test always follows the same scheme. In the beginning, the question that is to be answered needs to be formulated. This is done in the form of two opposing statements about a population parameter. The null hypothesis is always of the form “there is nothing unusual happening here.” In the case of a microarray experiment this translates into “the gene is not differentially expressed.” The alternative hypothesis is a contradiction of the null hypothesis – this is the statement that the scientist really suspects to be true. The conclusion will be based on the data that have been observed in the experiment. The scientist now takes on the role of skeptic. Assuming the null hypothesis were true, and there truly is no effect, we compute the probability of seeing an outcome as extreme as the one we observed purely through error variation. “Extreme” in this context is any observation (such as a large difference in gene expression) that supports the alternative hypothesis more strongly than the null hypothesis. If it is unlikely to see an effect as extreme as or more extreme than the one observed from error variation alone, then we conclude that there likely is a treatment effect, and we then reject the null hypothesis in favor of the alternative hypothesis. (This does not mean that the alternative has been proved to be true.) On the other hand, if an outcome such as the one observed is rather likely to have been caused by error variation even if there is no treatment effect (null hypothesis is true), then we cannot reject the null hypothesis. Failing to reject the null hypothesis does not mean that the null hypothesis has been proved to be true.
Statistics cannot be used to prove or disprove statements in a mathematical sense. Instead, evidence (in the form of data) is collected that may or may not support the null hypothesis. The probability to observe an apparent “effect” (i.e., a large difference between treatment and control measurements) if there is only nuisance variation is called the p-value of a hypothesis test. How small does this probability need to be before we are willing to abandon belief in the null hypothesis? Most investigators work with a value of 0.05 (5%).

p = probability to observe data as extreme as ours if the null hypothesis is true

p < 0.05: Reject the null hypothesis
p ≥ 0.05: Do not reject the null hypothesis
The smaller a p-value is, the less likely it is to observe data such as those seen in the experiment if the null hypothesis were true. To figure out the probability of observing extreme (or unusual, atypical) data, we have to have a quantity whose statistical distribution (behavior) we know and whose value we can compute from the sample data. Such a quantity is called a “test statistic.” To identify differentially expressed genes in a microarray experiment, separate hypothesis tests are performed for each gene spotted on the array. Table 9.1 contains an overview of terms used in a statistical hypothesis test and their equivalents in a microarray experiment testing for differential expression.
Table 9.1 Important terms used in a statistical hypothesis test are shown alongside their equivalents in a microarray experiment testing for differential expression

Hypothesis test: Gene expression experiment using microarray
Null hypothesis: No difference between average gene expression in control and treatment plants
Alternative hypothesis: Gene expression differs between the control and treated plants
Data: Gene expression measured by red/green fluorescence levels
p-value: Probability that very different expression levels result from only biological or technical variation
Rejecting the null hypothesis: Declare the gene differentially expressed
Accepting the null hypothesis: Declare that the gene is not differentially expressed
5.3. Hypothesis Test for Log Ratios
In microarray experiments, especially if the data have been normalized, the data will be in the form of log ratios of red and green intensities (M values). The data file will contain one M value for every spot on the array. Ideally, each gene is spotted several times on the same array, so that there are several M values for each gene. To conclude whether a gene is expressed differently in the two groups (treatment and control), we will decide whether the M values for that gene are close to zero (on average) or not. To make this decision, we will also have to take the variance of the observations into account. (If only one observation is available for each gene (no replication), then it is not possible to estimate the gene’s variation. In this case it is not possible to decide which genes are differentially expressed without making simplifying assumptions that may not be biologically justified.) A t-test will be used to decide whether a gene is differentially expressed. Suppose that a random characteristic with mean zero is measured repeatedly. For n measurements x1, . . ., xn with average x̄ and standard deviation s, the quantity

t = x̄ / sqrt(s²/n)

has a t distribution with df = n − 1. df stands for “degrees of freedom” and is a number that characterizes the distribution (behavior) of the test statistic. The shape of t distributions has been calculated and is shown in Fig. 9.9 for different degrees of freedom. Since we assumed the characteristic to have mean zero, the “normal” (or typical) values are those close to zero. The unusual values are the ones in the tails of the distribution, either large positive or large negative numbers. Large positive values of the test statistic mean that the log ratio is positive, which means that the red intensity is much higher than the green (meaning that the expression of the genes whose mRNA was labeled with red dye is higher than the expression of the genes whose mRNA was labeled with green dye). Large negative values of the test statistic mean that the log ratio is negative, which means that the green intensity is much higher than the red.

5.4. How Large Is “Large”?
How large (or small) will a test statistic value need to be so that we can call it unusual? Most researchers work with a significance level of 5%. They call an observation unusual, if its p-value is smaller than 5%. That means that the test statistic value falls into the outer 5% tail area of the distribution in the graph of the t distribution above. If that occurs, one may safely argue that the two values (for red and green) differ from each other in a “statistically significant” manner between the treatment and the control group. (Proceed
Fig. 9.9. t distribution. Density curves of the distribution are shown for several degrees of freedom. The total area under each distribution curve is equal to one. If the degrees of freedom become very large (df = ∞), the shape of the t distribution becomes the same as the shape of a normal distribution.
to Box 9.3 to conduct a t-test on microarray data using Microsoft Excel.)
Box 9.3. t-test for Microarray Data Do this. . . conduct a t-test for a microarray experiment. From Box 9.1, we have six M values for the gene At1g01000. We can compute the average of the six observations, x̄ = –1.15, and the standard deviation, s = 1.28. n = 6, since we have six observations. Now we can compute the value of the test statistic as

t = x̄ / sqrt(s²/n) = –1.15 / sqrt(1.28²/6) = –2.08

The degrees of freedom that describe the behavior of the test statistic in this case are df = 6 − 1 = 5. To find the p-value, we have to find the percentage of cases in which the t-test statistic with df = 5 would take on more extreme values than the –2.08 that we observed. Extreme values are the ones far away from zero. The area under the t distribution curve
corresponding to the extreme values (smaller than –2.08 or larger than 2.08) is shaded in the graph below.
In the past, these values had to be looked up in tables. Today, Excel and other software programs have them stored in their statistics packages. The p-value can be found with the Excel command “=TDIST(absolute value of your test statistic, df, 2)”. The “2” stands for two-sided, which means that you want the shaded area in both tail ends. In our case the absolute value (no minus sign) of the test statistic is 2.08 and the degrees of freedom are df = 6 − 1 = 5. Click on an empty cell and enter “=TDIST(2.08,5,2)”. In our example, the exact p-value (shaded tail area of the distribution) is 0.0921 or 9.21%. What conclusion can we draw? The p-value is the probability to observe data as extreme/unusual as the data we saw if the gene expression in the two groups were the same. Our p-value of 9.21% is quite large (bigger than 5%). That means that we would get observations such as these by random chance, and not due to a real difference in gene expression, almost 10% of the time. Hence, our data are nothing unusual and we cannot reject the null hypothesis (equal expression in both groups) for gene At1g01000.

Gene name: At1g01000    p-value: 0.0921    Differentially expressed at level 5%? No

Results of Experiment: To determine the p-values for the other genes spotted on the microarray, repeat the steps described above. This will provide us with a p-value for each gene on the array.
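Excel's TDIST lookup can be reproduced from the t density itself. The sketch below takes the box's test statistic (t = –2.08, df = 5) as given and computes the two-sided tail area by numerical integration (an illustration; statistical software computes this exactly):

```python
import math

def t_density(x, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1.0 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t_stat, df, upper=100.0, steps=20000):
    """Two-sided p-value: area in both tails beyond |t|, by the trapezoidal rule."""
    lo, hi = abs(t_stat), upper
    h = (hi - lo) / steps
    xs = [lo + i * h for i in range(steps + 1)]
    tail = h * (sum(t_density(x, df) for x in xs)
                - 0.5 * (t_density(lo, df) + t_density(hi, df)))
    return 2.0 * tail

# Test statistic and degrees of freedom from Box 9.3.
print(round(two_sided_p(-2.08, 5), 4))  # close to Excel's TDIST(2.08,5,2) = 0.0921
```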
5.5. ANOVA Model for Gene Expression
Normalizing gene expression data to remove nuisance effects and then performing gene-wise t-tests is one way to identify differentially expressed genes in a microarray experiment. Though labor intensive, this method illustrates the statistical reasoning that takes place behind the scenes in a software analysis of gene expression data. However, there exists an alternative approach that is implemented in most software packages designed for microarray data analysis. Instead of the normalization and t-test procedure, these programs are based on statistical ANOVA models. ANOVA stands for analysis of variance. The basic question to be addressed by the model remains the same as in the t-test case. For each gene on the array it is to be decided whether the observed differences between treatment and control
group are large enough compared to the variation in the experiment to declare the gene differentially expressed. Unlike the t-test, however, the ANOVA approach combines all observations (on potentially thousands of genes) in just one statistical model (below). Try not to be put off by the many letters and indices; they are really just a form of bookkeeping:

Yijkgr = μ + Ai + Dj + Tk + Gg + AGig + DGjg + TGkg + εijkgr

Y stands for the logarithm of the background-corrected intensity. The indices keep track of which spot is currently being considered. Specifically, Yijkgr is the log-intensity for the rth replication of gene g under treatment k labeled with dye j on array i. A stands for the array effect, D stands for the dye effect, T stands for the treatment effect, and G stands for the gene effect. AG, DG, and TG stand for the array–gene, the dye–gene, and the treatment–gene interaction effects, respectively. μ represents the overall log-intensity mean and the ε are called the errors. The errors represent the variation in the experiment that cannot be explained in a systematic manner (through different dyes, treatments, arrays, or genes). (The errors in an ANOVA model are assumed to be normally distributed with mean zero and constant variance σ². If repeated observations on each gene are available, it can also be assumed that different genes have different error variances.) The effects (array, dye, treatment, and gene) in the microarray ANOVA model describe the average contribution that the respective factors have on the log intensities. For example, consider once more the experiment in which the gene expression of drought-stressed plants (treatment) is compared to that of well-watered plants (control). In this case the factor “treatment” takes on two levels (k = 1 or k = 2) and the treatment effect Tk describes the average difference of log-intensities between the two groups.
The interaction effects allow us to consider that not all combinations of factors will influence log-intensity equally. For example, uneven binding of the two dyes (red and green) to specific genes is not captured by the dye effect, which only describes the overall average difference between the two dyes. But it is captured by the dye–gene interaction effect, which describes how the dyes interact with different genes. (Some possible interaction effects, for instance, the array–dye (AD) effect, are missing from the model. This is because the dyes are assumed to interact with all arrays in the same manner.) All parameters in an ANOVA model can be estimated by averaging over the original observations. Start out by computing the logarithms of the background-corrected intensities from the original microarray data files:

Y = log(Foreground median − Background)
Call the rth observation on gene g subject to treatment k labeled with dye j from array i Yijkgr. We will use the “dot” notation to indicate taking averages. For example,

Ȳijkg· = (1 / number of replicates) × Σr Yijkgr,

the average of the observations Yijkgr over all replicates r. Usually, statistical software is used to compute the parameter estimates, but they can be computed by hand using the formulas shown in Table 9.2.
Table 9.2 ANOVA parameters

Parameter: Estimate
μ: Ȳ·····
Ai: Ȳi···· − Ȳ·····
Dj: Ȳ·j··· − Ȳ·····
Tk: Ȳ··k·· − Ȳ·····
Gg: Ȳ···g· − Ȳ·····
AGig: Ȳi··g· − Ȳi···· − Ȳ···g· + Ȳ·····
DGjg: Ȳ·j·g· − Ȳ·j··· − Ȳ···g· + Ȳ·····
TGkg: Ȳ··kg· − Ȳ··k·· − Ȳ···g· + Ȳ·····

5.6. ANOVA Hypotheses
For each gene g on the microarray the null hypothesis corresponding to the ANOVA model for differential expression is

H0: T1 + TG1g = T2 + TG2g

This hypothesis can be tested using a t-test very similar to the one described above. In the test statistic, the estimate of the treatment effect, Ȳ··1g· − Ȳ··2g·, is compared to its standard deviation σ̂/√r, where σ̂ is the estimate of the residual standard deviation and r is the number of times the gene is spotted under the same conditions on the array. The test statistic used to test the hypothesis of differential expression for gene g has a t distribution with r − 1 degrees of freedom:

t = (Ȳ··1g· − Ȳ··2g·) / (σ̂/√r) ∼ t(df = r − 1)

For each gene spotted repeatedly on the array, the value of the test statistic is computed and a corresponding p-value obtained.
The resulting list of p-values (one for each gene) will be used to make the decision about differential expression of each gene.

5.7. Variance Estimation in ANOVA
There are two possible ways to estimate the standard deviation in the ANOVA model for differential expression. One can assume that the observed variation has the same magnitude for all genes on the array. In this case, the estimate of the residual standard deviation is based on all genes at once – this is known as the common gene variance model. Statistically, this method is powerful, since the standard deviation estimate is based on very many observations. However, biologically this method may not be very meaningful, because it is known that genes with very low expression across treatments vary less than genes with very high expression. Alternatively, the residual standard deviation can also be estimated separately for each gene. This is known as a per-gene variance model. In this case the estimate is statistically much less powerful, since it is based on fewer repeated observations of the same gene. But biologically this model is more appropriate, since it allows for the possibility of different genes having different magnitudes of standard deviation. (Proceed to Box 9.4 to apply an ANOVA model to microarray data.)
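The contrast between the two variance models can be sketched numerically; the residuals below are hypothetical:

```python
import numpy as np

# Common vs. per-gene variance, sketched on hypothetical residuals.
# Rows: genes; columns: repeated observations of the same gene.
residuals = np.array([
    [0.1, -0.2, 0.1, 0.0],    # a low-variability gene
    [1.5, -1.0, 0.5, -1.0],   # a high-variability gene
])

# Common gene variance model: one pooled estimate from all observations.
common_sd = residuals.std(ddof=1)

# Per-gene variance model: a separate estimate for each gene.
per_gene_sd = residuals.std(axis=1, ddof=1)

print(common_sd, per_gene_sd)
```

The pooled estimate sits between the two per-gene estimates: it understates the variability of the noisy gene and overstates that of the quiet one, which is exactly the biological objection to the common variance model.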
Box 9.4. Fit an ANOVA Model Do this. . . to fit an ANOVA model to gene expression data. Disclaimer: Usually, this part of the analysis is not done by hand, but rather by software designed for the analysis of gene expression data. However, to aid understanding of the procedure, we illustrate it here using a small-scale example. Suppose a microarray experiment has been conducted using two arrays with six target spots representing two genes. Both genes are spotted in triplicate on each array. The experiment has been conducted as a dye-swap. On array 1 the treatment is labeled with red dye and the control with green. On array 2 the dye colors are swapped. The data have been preprocessed to compute the logarithm of the background-corrected intensities for each spot and each laser setting (red and green).
To fit the ANOVA model, we have to enumerate all possible values of “Array,” “Dye,” “Treatment,” and “Gene.”
The order in which indices are assigned to the values is arbitrary, but it is important to be consistent throughout the analysis. When using software, the enumeration is usually conducted in the order in which data occur in a file. Array: The experiment uses two arrays (label them i = 1 for array 1, and i = 2 for array 2); Dye: The experiment uses two dyes (label them j = 1 for red, and j = 2 for green); Treatment: The experiment uses two treatments (label them k = 1 for treatment, and k = 2 for control); Gene: The experiment uses only two genes (label them g = 1 for At1g02000, and g = 2 for At1g03000). Since each gene is spotted in triplicate, we have replications r = 1, 2, 3 of each gene. Use the above labeling to identify each of the 24 log intensities in the data set. For example, Y11112 is the second replication of gene 1, under treatment 1, labeled with dye 1, on array 1. Y21221 is the first replication of gene 2, under treatment 2, labeled with dye 1, on array 2. Next, estimate the parameters of the ANOVA model (compare Table 9.2). We will demonstrate this process for the parameters μ, T1, and DG12. The estimate for the overall intensity is μ̂ = Ȳ····· = 6.667; this is obtained by averaging all 24 entries (across both arrays, treatments, dyes, and genes) in the data set. The estimate for the first treatment effect (drought-stressed plants) is T̂1 = Ȳ··1·· − Ȳ····· = 7.833 − 6.667 = 1.167. This means that the average log-expression of the genes of the treated plants is on average 1.167 higher than the average of all plants. Conversely, the average log-expression of the genes of the control plants will be 1.167 lower (T̂2 = −1.167) than the average of all plants. There are four possible dye–gene interaction effects in this experiment (two dyes, red and green, each paired with two genes). Consider the estimate of the dye–gene interaction DG12: Ȳ·1·2· − Ȳ·1··· − Ȳ···2· + Ȳ····· = 9 − 6.583 − 9 + 6.667 = 0.084.
On average, the log-expressions of gene At1g03000 when labeled with the red dye are 0.084 higher than the overall average. The residuals of the ANOVA model may now be computed by obtaining all other parameter estimates first and then solving the model equation for each residual term εijkgr.
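The averaging recipes of Table 9.2 are easy to verify on a small synthetic data set. The numbers below are made up and noise-free (not the Box 9.4 data, which appear in the accompanying figure), so the averages recover the built-in effects exactly:

```python
import numpy as np

# Axes of the data array: array i (2), dye j (2), treatment k (2),
# gene g (2), replicate r (3) -- a balanced layout for illustration.
# Build synthetic log intensities: overall mean 6.0, treatment effects
# +1.0 / -1.0, gene effects +0.5 / -0.5, no noise (all values hypothetical).
mu, t_eff, g_eff = 6.0, np.array([1.0, -1.0]), np.array([0.5, -0.5])
y = (mu
     + t_eff[None, None, :, None, None]
     + g_eff[None, None, None, :, None]
     + np.zeros((2, 2, 2, 2, 3)))

# Estimates by averaging, as in Table 9.2.
mu_hat = y.mean()                            # estimate of the overall mean
t_hat = y.mean(axis=(0, 1, 3, 4)) - mu_hat   # treatment effects T_k
g_hat = y.mean(axis=(0, 1, 2, 4)) - mu_hat   # gene effects G_g

print(mu_hat, t_hat, g_hat)
```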
5.8. Multiple Testing Issues
In real microarray experiments, many more than two or six genes are spotted on an array (often many thousands). Regardless of whether we use t-tests or an ANOVA model for the analysis, very many decisions will have to be made. If we use a significance level of 5% to determine whether a gene is differentially expressed or not, then for each gene there is up to a 5% chance that we falsely declare the gene differentially expressed; that is, we conclude that the gene is differentially expressed when, in fact, it is not.
Bremer, Himelblau, and Madlung
If there is a small (5%) chance of a mistake in every decision, then across the very many decisions we will have to make, we will likely make many mistakes. Different procedures exist to correct this problem. We will take a closer look at two of them.

5.9. Bonferroni Method
Suppose that a microarray has spots that represent 1000 genes. You want to test, for each of these 1000 genes, whether it is differentially expressed between a treatment and a control group. If you work with a level of 5% for each individual test, then the probability that you make at least one wrong decision is 1 − (1 − 0.05)^1000 = 0.9999…999 (twenty-three 9s). This means that it is virtually certain that your analysis will contain at least one error. To get the probability of making at least one mistake reasonably small (say, under 10%), you have to work with a much smaller significance level. The Bonferroni correction method says: if you would like the probability of making at least one mistake, the so-called family-wise error rate (FWER), to be less than α, then use a significance level of α/n for the test of each individual gene. In the above example, this would mean that if we want to keep the probability of making at least one mistake under 10%, then we should declare only those genes differentially expressed whose p-values are smaller than 0.10/1000 = 0.0001. This is a very conservative approach. After a Bonferroni correction, only very few genes will actually be declared differentially expressed (see Chapter 5 for less conservative alternatives).
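A quick numerical check of these figures, with no statistical libraries needed:

```python
# FWER for n independent tests at level alpha, and the Bonferroni
# per-test level needed to keep the FWER under 10%.
n = 1000
alpha = 0.05

fwer = 1 - (1 - alpha) ** n          # P(at least one false positive)
bonferroni_level = 0.10 / n          # per-gene significance level

print(fwer)               # ~1.0: at least one error is virtually certain
print(bonferroni_level)   # 0.0001
```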
5.10. False Discovery Rate
Another way to deal with multiple testing issues is to consider the false discovery rate (FDR) (13). This is the expected proportion of falsely rejected null hypotheses among all rejected null hypotheses:

FDR = E [ (number of falsely rejected null hypotheses) / (number of rejected null hypotheses) ].

In other words, among those genes for which we have rejected the null hypothesis, the FDR is the proportion of genes that actually are not differentially expressed: we have falsely declared those genes to be “special.” There is a method, called the “linear step-up procedure,” that keeps the false discovery rate below a given level q. That means that you can make sure that the proportion of “false discoveries” stays, for example, under 10% (if you set q = 0.10). To conduct a linear step-up test, one initially conducts the hypothesis test for differential expression for each of the n genes on the array separately and computes a p-value for each gene (Table 9.3).
Introduction to the Statistical Analysis of Two-Color Microarray Data
311
Table 9.3
p-values for a set of Arabidopsis genes analyzed by microarray

Gene name    p-value
At1g01000    0.0921
At1g01010    0.0016
At2g01000    0.0142
At3g01000    0.1272
At4g01000    0.0812
At4g01020    0.0724
Table 9.4
p-values for a set of Arabidopsis genes analyzed by microarray, sorted by size from smallest to largest

Gene name    p-value
At1g01010    0.0016   (smallest)
At2g01000    0.0142
At4g01020    0.0724
At4g01000    0.0812
At1g01000    0.0921
At3g01000    0.1272   (largest)
In Table 9.4 these p-values are shown sorted by size (smallest to largest). Pick a level q under which you want to control the false discovery rate. Common values for q are 0.10 (10%) and 0.05 (5%). Start at the top of the list and check, for each gene, whether its p-value is bigger than q · i/n. Here, q is the level under which you want to control the FDR, n is the total number of genes you have, and i is the rank of the test you conduct (the gene's position in the sorted list). When you find a gene for which the p-value is bigger than q · i/n, declare this gene and all other genes with larger p-values not differentially expressed. This way, only the genes at the top of the list will be declared differentially expressed. Which of the six genes in this example would you declare differentially expressed if you had to control the false discovery rate under q = 5%? For each of the genes, check whether the p-value is bigger or smaller than 0.05 · i/n, where n is the total number of genes we have (here, n = 6) and i is the rank of the test. In this example, the smallest i for which the p-value is larger
Table 9.5
Determining differential expression of six genes with a false discovery rate under q = 5%

i   Gene name   p-value   Is p(i) ≥ q · i/n?           Is the gene differentially expressed?
1   At1g01010   0.0016    No (0.0016 < 0.05 · 1/6)     Yes
2   At2g01000   0.0142    No (0.0142 < 0.05 · 2/6)     Yes
3   At4g01020   0.0724    Yes (0.0724 ≥ 0.05 · 3/6)    No
4   At4g01000   0.0812    Yes (0.0812 ≥ 0.05 · 4/6)    No
5   At1g01000   0.0921    Yes (0.0921 ≥ 0.05 · 5/6)    No
6   At3g01000   0.1272    Yes (0.1272 ≥ 0.05 · 6/6)    No
than 0.05 · i/n is i = 3. In this case, we declare only the first two genes differentially expressed (Table 9.5). Now suppose a microarray experiment is conducted for six genes and the t-tests for the individual genes yield the p-values listed below. For each gene, decide whether or not you would declare it differentially expressed if you did an individual hypothesis test at the 10% level. Also decide whether the gene would be declared significant if the Bonferroni method or the linear step-up procedure to control the FDR at the 10% level were employed (Table 9.6).
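The linear step-up procedure is easy to script. A minimal sketch (the function is our own; it applies the standard Benjamini-Hochberg rule of rejecting the k smallest p-values, where k is the largest rank i with p(i) ≤ q · i/n, which on this example agrees with the stop-at-first-exceedance description above):

```python
def bh_step_up(pvals, q):
    """Benjamini-Hochberg linear step-up: return indices (into the
    original list) of the hypotheses rejected at FDR level q."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])   # sort by p-value
    k = 0
    for rank, idx in enumerate(order, start=1):        # rank i = 1..n
        if pvals[idx] <= q * rank / n:                 # p(i) <= q * i/n
            k = rank                                   # largest such rank
    return [order[i] for i in range(k)]                # reject k smallest

genes = ["At1g01000", "At1g01010", "At2g01000",
         "At3g01000", "At4g01000", "At4g01020"]
pvals = [0.0921, 0.0016, 0.0142, 0.1272, 0.0812, 0.0724]

rejected = bh_step_up(pvals, q=0.05)
print([genes[i] for i in rejected])   # ['At1g01010', 'At2g01000']
```

This reproduces the worked example of Table 9.5: with q = 0.05, only At1g01010 and At2g01000 are declared differentially expressed.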
Table 9.6
Summary of hypothesis tests for differential gene expression at the 10% level for several significance tests. “Yes” indicates the gene is differentially expressed between the treatments; “No” indicates no differential expression

Gene name   p-value   10% without correction?   10% with Bonferroni?   10% with FDR?
At1g01000   0.0921    Yes                       No                     No
At1g01010   0.0016    Yes                       Yes                    Yes
At2g01000   0.0142    Yes                       Yes                    Yes
At3g01000   0.1272    No                        No                     No
At4g01000   0.0812    Yes                       No                     No
At4g01020   0.0724    Yes                       No                     No
Acknowledgments

We acknowledge the advice of our colleagues of the “Polyploidy Research Group.” Special thanks go to RW Doerge for advice, support, and critical reading of the manuscript. This work was supported by NSF Plant Genome grant DBI-0501712.

References

1. Allison, D.B., Cui, X., Page, G.P., and Sabripour, M. (2006) Microarray data analysis: From disarray to consolidation and consensus. Nat Rev Genet 7, 55–65.
2. Butte, A. (2002) The use and analysis of microarray data. Nat Rev Drug Discov 1, 951–60.
3. John, U.P., and Spangenberg, G.C. (2005) Xenogenomics: Genomic bioprospecting in indigenous and exotic plants through EST discovery, cDNA microarray-based expression profiling and functional genomics. Comp Funct Genomics 6(4), 230–5.
4. Heath, L.S., Ramakrishnan, N., Sederoff, R.R., Whetten, R.W., Chevone, B.I., Struble, C.A., Jouenne, V.Y., Chen, D., van Zyl, L., and Grene, R. (2002) Studying the functional genomics of stress responses in loblolly pine with the expresso microarray experiment management system. Comp Funct Genomics 3(3), 226–43.
5. Mohammadi, M., Kav, N.N., and Deyholos, M.K. (2008) Transcript expression profile of water-limited roots of hexaploid wheat (Triticum aestivum ‘Opata’). Genome 51(5), 357–67.
6. Xue, G.P., McIntyre, C.L., Glassop, D., and Shorter, R. (2008) Use of expression analysis to dissect alterations in carbohydrate metabolism in wheat leaves during drought stress. Plant Mol Biol 67(3), 197–214.
7. Mantri, N.L., Ford, R., Coram, T.E., and Pang, E.C. (2007) Transcriptional profiling of chickpea genes differentially regulated in response to high-salinity, cold and drought. BMC Genomics 8, 303.
8. Sokal, R.R., and Rohlf, F.J. (1994) Biometry: The Principles and Practices of Statistics in Biological Research. W.H. Freeman.
9. Yang, Y.H., and Speed, T. (2002) Design issues for cDNA microarray experiments. Nat Rev Genet 3, 579–88.
10. Owzar, K., Barry, W.T., Jung, S.H., Sohn, I., and George, S.L. (2008) Statistical challenges in preprocessing in microarray experiments in cancer. Clin Cancer Res 14(19), 5959–66.
11. Quackenbush, J. (2002) Microarray data normalization and transformation. Nat Genet 32(Suppl), 496–501.
12. Cui, X., and Churchill, G.A. (2003) Statistical tests for differential expression in cDNA microarray experiments. Genome Biol 4, 210.
13. Benjamini, Y., and Hochberg, Y. (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Stat Soc Series B 57(1), 289–300.
Chapter 10

Building Networks with Microarray Data

Bradley M. Broom, Waree Rinsurongkawong, Lajos Pusztai, and Kim-Anh Do

Abstract

This chapter describes methods for learning gene interaction networks from high-throughput gene expression data sets. Many genes have unknown or poorly understood functions and interactions, especially in diseases such as cancer where the genome is frequently mutated. The gene interactions inferred by learning a network model from the data can form the basis of hypotheses that can be verified by subsequent biological experiments. This chapter focuses specifically on Bayesian network models, which have a level of mathematical detail greater than purely conceptual models but less than detailed differential equation models. From a network learning perspective the most severe problem with microarray data is the limited sample size, since there are usually many plausible networks for modeling the system. Since these cannot be reliably distinguished using the number of samples found in current microarray data sets, we describe robust network learning strategies for reducing the number of false interactions detected. We perform preliminary clustering using co-expression network analysis and gene shaving. Subsequently we construct Bayesian networks to obtain a global perspective of the relationships between these gene clusters. Throughout this chapter, we illustrate the concepts being expounded by referring to an ongoing example of a publicly available breast cancer data set.

Key words: Bayesian network, co-expression network, microarray, cancer, scale-free topology, gene modules, gene shaving, bagging, Bayesian bootstrap.
H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_10, © Springer Science+Business Media, LLC 2010

1. Introduction

High-throughput gene expression data, such as from gene expression microarrays (1), provide a snapshot of the activity of tens of thousands of genes. Data sets consisting of gene expression data for many individuals are commonly used to search for genes differentially expressed between two (or a few) different states, such as between diseased and non-diseased individuals, or between individuals who respond well to a particular treatment and those who do not. High-throughput gene expression data sets are also used to cluster genes or samples or both into relatively homogeneous subgroups, which can then be studied independently to identify their particular properties (2). For example, clustering of gene expression data from human glioblastoma patients has identified three different subtypes of this disease (3), which may have prognostic and treatment implications. In this chapter, we are interested in another application: learning gene interaction networks from high-throughput gene expression data sets. For example, we might have a data set containing many genes of interest but unknown functions, and we would like to determine the most likely interactions between these genes. The inferred interactions could then be verified by subsequent experiments. These interactions are represented by a graphical model (4) in which the genes and covariates of interest are represented by nodes in the graph; gene influences, as well as influences involving covariates, are represented by edges connecting one node to another. Gene networks range in mathematical detail from purely conceptual networks, containing only edges between nodes with no associated mathematical model, to detailed differential equation models in which each edge is associated with a detailed reaction model described by differential equations (5). Models at all levels of mathematical detail are valuable in the right context. In the gene networks commonly found in introductory biological texts, edges are usually marked as activators or repressors. Although such networks contain only the slightest mathematical model, they are of considerable value for developing a conceptual understanding of the interactions between genes and a systems model of the network's overall behavior.
At the other end of the spectrum, a detailed differential equation model can be used to predict the detailed time evolution of the activity of the genes in the network in response to a stimulus and can discover system behaviors not obvious from a more conceptual model. For example, analysis of a differential equation model may reveal that a network exhibits modal behavior, in which changes from one state to another occur only after a significant stimulus. Creating accurate differential equation models requires extensive, detailed experiments to determine the rate constants associated with each reaction. In this chapter, we are interested in learning network models that have an underlying mathematical model with an intermediate level of detail, in between those of the conceptual and differential equation models. This class of model includes probabilistic Boolean networks, but we will concentrate on Bayesian networks (6). An edge in a Bayesian network is directional, from a parent
node to a child node, and indicates that the state of the child is conditionally dependent on the state of the parent. Each node in a Bayesian network is associated with a conditional probability distribution that summarizes the statistical dependence between the node and its parent(s). Building a network model from microarray data requires learning both the network structure of the model and the conditional probability distributions associated with the nodes in the model from the data, so high-quality data are vital. The most severe problem with microarray data, from a network learning perspective, is the limited sample size. Usually, there are many plausible networks for modeling the system, and these cannot be reliably distinguished using only a small amount of data. For instance, a parent with a minor influence on its child will require more data to detect than one with a major influence. Further, each parent node will reduce the amount of data effectively available in each of its cases for learning other parents of its child. Thus, more data are required to learn denser networks. Another perspective on the data requirements is to consider an analogy to the false discovery rate, in which it is the edges that are being “discovered.” Since the number of possible edges is n(n − 1)/2 for n nodes, the problem is much more severe than for, say, detecting differential expression. Ideally, a minimum of several thousand data points (microarrays) would be available from which to learn the network. Unfortunately, microarray data sets of such size are not available (at least at the present time), so our network learning methods must include robust strategies for minimizing the number of false edges detected. Other issues that arise in learning networks from microarray data are common to many problems in microarray analysis. Samples will contain cells of many types and in different states, and their proportions will vary among samples. 
Samples will be taken from regions of different disease status, ranging from the necrotic core of a tumor, to its proliferating edge, to the predominantly normal margins outside the tumor. Sample processing will vary, with perhaps significant impact on the reported results, between laboratories, reagent batches, and individual operators, and over time due to environmental fluctuations and drifts in equipment calibration (7). To the extent possible, the impact of such variations should be minimized, as for all microarray experiments, by using a good experimental design that does not confound these issues with covariates of interest, by using high standards of sample processing, and by using appropriate data preprocessing methods. Finally, data relevant to the problem of interest must be used for learning the Bayesian network. For instance, to learn about protein signaling networks, protein phosphorylation data
are required, whereas gene expression data are required for learning about gene regulation. Since Bayesian networks contain directional edges (from parent to child), there is a tendency to think of them as causal. Certainly they can be, and if you are creating one manually it would be natural to create a causal model. Many edges, however, can be reversed without changing the overall likelihood of the network. Thus, although there are some special edge patterns for which a Bayesian network learner is more likely to learn the causally correct edges, when a Bayesian network is learned from data it is likely that at least some of the learned edges will not be causal. Causal networks can be learned, but doing so requires data collected at different time points for the same sample. For instance, a causal signaling network can be learned by measuring protein phosphorylation levels in each sample at multiple times (such as 5, 10, and 20 minutes after stimulation). In such cases, the biology is modeled as a dynamic Bayesian network in which biological quantities measured at different times are treated as different variables, and the network learning algorithm is constrained to learn only edges to nodes at the same or a later time. Although the dynamic Bayesian network will be acyclic, merging nodes for the same biological quantity collected at different times may reveal biological cycles. Throughout this chapter, we will illustrate the concepts being expounded by referring to an ongoing example of a publicly available breast cancer data set, GSE6532, described in (8). In the following section we will describe the data set and its initial preprocessing. Subsequently, we will describe two methods for reducing the number of variables to a more appropriate number for Bayesian network analysis, and finally we will discuss Bayesian network analysis in more detail.
2. Example Data Set

The study by Loi et al. (8) focused on defining clinically distinct molecular subtypes in estrogen receptor-positive breast carcinomas. The data set they collected, GSE6532, consists of clinical covariates (see Table 10.1) and Affymetrix gene expression data for 414 samples, which were obtained from three different hospitals (John Radcliffe Hospital, Oxford, UK; Guys Hospital, London, UK; and Uppsala University Hospital, Uppsala, Sweden). (Note that the number of samples we obtained from the GSE6532 data set differs from the number of samples described in (8).)
Table 10.1
Clinical covariates

Covariate    Description
series       Hospital code from which the sample was obtained
age          Patient age
e.dmfs       Event to distant metastasis-free survival
e.rfs        Event to recurrence-free survival
t.dmfs       Time to distant metastasis-free survival
t.rfs        Time to recurrence-free survival
size         Tumor size
er           Patient ER status
pgr          Patient PGR status
node         Patient node status
grade        Tumor grade
treatment    Treatment used
Some samples were run on the HG-U133PLUS2 platform, whereas the other samples were run on both the HG-U133A and HG-U133B platforms, which taken together approximate the probe sets available on the HG-U133PLUS2. In this chapter, we have chosen to limit the size of the data sets and restrict our attention to just the 22,277 probe sets common to U133PLUS2 and U133A. Microarray data sets are notoriously sensitive to the conditions under which they are processed, for instance, minor differences in temperature and in chemical batch and age, which are in practice impossible to control. If such differences are not compensated for, the analysis will produce significant batch artifacts. To reduce any such effects for this data set, the data were split into five separate batches, each batch was preprocessed separately, and then the batches were recombined for subsequent processing. Five batches were used because the samples from two of the hospitals carried two distinct codes in the series covariate, making five series in total (OXFU and OXFT, KIU and KIT, and GUYT). We do not understand why two different labels were applied to samples from the same hospital, but out of an abundance of caution we treated them as different batches. Low-expressing probe sets are also subject to much greater random measurement error, so probe sets that are uniformly low expressing should be discarded. Bearing these considerations in mind, the data from the five distinct series in GSE6532 were preprocessed as follows:

1. The values for each sample were ranked (from 1 to 22,277).
2. Those probe sets whose rank exceeded the median (11,138) in at least 1/4 of the samples were selected. (A total of 12,933 probe sets were selected.)
3. Within each data set, independently:
   (a) the probe sets were ranked across the samples (from 1 to the number of samples in that data set);
   (b) the ranks were scaled by 1.0/(number of samples in the data set), resulting in positive values up to and including 1.0.
4. The resulting data sets were concatenated into a single data set containing 414 data samples with 12,933 probe set values per sample.
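The four steps above can be sketched as follows. This is a simplified illustration, not the authors' actual script: the helper name and interface are our own, and the probe filter uses n/2 in place of the exact median rank of 11,138.

```python
import numpy as np
from scipy.stats import rankdata

def preprocess(batches, keep_fraction=0.25):
    """Rank-based preprocessing sketch. `batches` is a list of
    (probes x samples) arrays, one per series. Returns a single
    (kept_probes x all_samples) array of within-batch ranks
    scaled to values in (0, 1]."""
    data = np.hstack(batches)                    # all samples together
    n_probes = data.shape[0]
    # 1. Rank the values within each sample (1 .. n_probes).
    ranks = np.apply_along_axis(rankdata, 0, data)
    # 2. Keep probes whose rank exceeds the median in >= 1/4 of samples.
    keep = (ranks > n_probes / 2).mean(axis=1) >= keep_fraction
    processed = []
    for batch in batches:
        kept = batch[keep]
        # 3. Rank each probe across the samples of this batch only,
        #    then scale by 1/(number of samples), giving values in (0, 1].
        r = np.apply_along_axis(rankdata, 1, kept)
        processed.append(r / kept.shape[1])
    # 4. Concatenate the batches into a single data set.
    return np.hstack(processed)
```

Because ranking is done within each batch separately (step 3), systematic level differences between series are removed before the batches are recombined.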
3. Variable Reduction and Selection
Although the preprocessing described in the previous section has reduced the number of variables by eliminating the noisiest and least reliable probe sets, the total number of variables (12,933 probe sets plus the clinical covariates of interest) is still far too large for Bayesian network analysis, which is computationally limited to at most a few hundred variables and statistically limited, given the number of samples, to even fewer than that. Another problem with the probe set expression data is that many probe sets are highly correlated. For instance, many genes are represented by multiple probe sets in the data, and in the majority of cases these probe sets are highly correlated. (Exceptions might arise from genes that have detectable alternative splicing or from the occasional bad probe set.) Edges between these probe sets would certainly be learnt, but the details would be highly dependent on noise in the data, of which there is plenty, and thus be of no interest. Moreover, for edges between two groups of such correlated probe sets, both the parent and the child nodes would be selected essentially randomly from their respective groups. Consider, therefore, a Bayesian network model of multiple highly correlated probe sets. It will be just one of a combinatorially large number of possible networks that are equally plausible given the data, but it will misleadingly distinguish specific edges. To avoid such problems and to increase the biological scope of the network, the probe set data must be reduced to a (much) smaller set of metagenes, or clusters of highly correlated probe sets (9). The variables selected for the Bayesian network analysis will be chosen from the clinical covariates and the generated metagenes. In the following sections we describe two alternative strategies for obtaining metagenes: topological analysis of weighted co-expression networks and bagged gene shaving.
3.1. Topological Analysis of Co-expression Networks
In this section we describe Zhang and Horvath's topological method for analyzing gene co-expression networks (10) and use it to extract relevant co-expression clusters from our example breast cancer data set. In gene co-expression networks, nodes represent genes, and two nodes are connected if the genes they represent are significantly co-expressed across samples. Gene co-expression networks have been shown to have the properties of scale-free topology networks, and specifically the power law distribution of network connections: the probability p(k) that a node in the network is connected to k other nodes is approximately proportional to k^(−γ) for some value of γ. A consequence of the scale-free topology is that the network consists of a few modules, and each module is centered on a highly connected node, or hub, which links the module to the rest of the network. These hubs represent genes that are likely to be vital for the host's survival. A network's connectivity can be represented by an adjacency matrix, where each entry in the matrix denotes the connectivity between the pair of nodes concerned. An adjacency matrix can use either unweighted, binary values, where 1 means a pair of nodes is connected and 0 means that they are not connected, or weighted, continuous values ranging from 1 (strongly connected) to 0 (not connected at all) to indicate the strength of interaction between two nodes. The major drawbacks of using an unweighted representation are loss of information and sensitivity to the choice of threshold used to determine connectivity. In fact, it is questionable whether it is biologically meaningful to measure connection strength using binary values. However, one potential drawback of a weighted adjacency representation is that it is not clear how to define the directly linked neighbors of a node. If a list of neighbors is required, one still needs to threshold the connection strengths.
In the remainder of this section, we generalize the above description of scale-free topologies to the case of weighted networks, compute the adjacency matrix from the gene expression levels, extract relevant clusters, and determine significant genes within clusters.
3.1.1. Generalized Scale-Free Topology
Zhang and Horvath's method uses topological properties of weighted networks to determine the parameter β, so we need to generalize the power law described above, which is the defining characteristic of the nodal connections in a scale-free topology network, to weighted networks. The power law is defined as p(k) ∝ k^(−γ). The square of the correlation between log10(p(k)) and log10(k), which is the model fitting index R² of the linear model that regresses log10(p(k)) on log10(k), is used to measure how well a network satisfies the scale-free topology criterion. If the R² of the network approaches 1, then the plot of log10(p(k)) versus log10(k) approaches a straight line.
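The model fitting index can be computed directly. A small self-contained check (using a synthetic exact power law rather than real connectivity data; the function name is ours):

```python
import numpy as np

def scale_free_fit(k, p_k):
    """Model fitting index R^2 for the scale-free topology criterion:
    the squared correlation between log10(p(k)) and log10(k),
    plus the slope of the regression line."""
    x, y = np.log10(k), np.log10(p_k)
    slope, intercept = np.polyfit(x, y, 1)    # linear regression
    r2 = np.corrcoef(x, y)[0, 1] ** 2         # squared correlation
    return r2, slope

# An exact power law p(k) = k^(-1.5) is a straight line in log-log
# coordinates, so R^2 = 1 and the slope recovers the exponent.
k = np.arange(1, 50)
r2, slope = scale_free_fit(k, k ** -1.5)
print(round(r2, 6), round(slope, 6))   # 1.0 -1.5
```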
3.1.2. Computing the Adjacency Matrix
A weighted adjacency matrix for the co-expression network is calculated from a similarity matrix based on Pearson correlations. A biologically motivated criterion, scale-free network topology, is used to estimate the best parameters to use for calculating adjacency. Later, gene clusters will be constructed from the adjacency matrix using average linkage hierarchical clustering coupled with the topological overlap dissimilarity measure (11). To avoid negative values, the similarity function is calculated as follows:

sij = (1 + cor(i, j)) / 2,   [1]

where sij is the similarity of nodes i and j and cor(i, j) is their Pearson correlation. The signum function is the most popular function for calculating adjacency for an unweighted network, for some value of the parameter τ:

aij = signum(sij, τ) ≡ 1 if sij ≥ τ, and 0 if sij < τ.   [2]

The function uses τ as a threshold to convert pairwise correlations into an adjacency matrix. The significance levels (p-values) of the correlation coefficients are used to estimate the adjacency function parameter τ. However, the network size is a monotonically decreasing function of the correlation threshold. Therefore, the parameter is selected so that it allows the network to have a high significance level of the correlation coefficients and a reasonable network size. Adjacency for a weighted network can be calculated by raising each similarity measure sij to the power β:

aij = power(sij, β) ≡ sij^β.   [3]
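Equations [1]-[3] translate directly into code. A sketch (the function name and interface are our own, not those of the WGCNA software):

```python
import numpy as np

def adjacency(expr, beta=9, tau=None):
    """Similarity and adjacency as in eqs. [1]-[3]. `expr` is a
    (genes x samples) expression matrix. Returns the weighted (power)
    adjacency, or the unweighted (signum) one if tau is given."""
    S = (1 + np.corrcoef(expr)) / 2         # similarity in [0, 1], eq. [1]
    if tau is not None:
        return (S >= tau).astype(float)     # signum adjacency, eq. [2]
    return S ** beta                        # power adjacency, eq. [3]
```

Both forms yield a symmetric matrix with entries in [0, 1]; raising similarities to the power β shrinks weak correlations toward zero while leaving strong ones nearly intact, which is what makes the soft threshold behave like a smooth version of the hard one.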
Metabolic networks in living organisms have been found to display approximate scale-free topology, so we can estimate the parameter β using the scale-free network topology criterion. To estimate how well the network satisfies scale-free topology, the linear regression model fitting index, R2 , is used. As R2 approaches 1, the model is closer to the scale-free topology, but a high value of R2 leads to networks with very few connections. That is, there is a trade-off between maximizing R2 and maintaining a high mean number of connections. Zhang and Horvath recommend that R2 be greater than 0.8 and that the minimum module size be at least 50. Further, since it is implausible that a network contains more hub genes than non-hub genes, a signed version of R2 should be used, and the slope of the regression line
between log10(p(k)) and log10(k) should be near −1. To summarize, the following criteria should be considered when selecting β:

• R² > 0.8,
• high mean connectivity, and
• a slope of the regression line between log10(p(k)) and log10(k) near −1.

For our example breast cancer data set, we computed the similarity matrix using [1] and used the scale-free topology criterion to determine an appropriate value of the β parameter for calculating the weighted network's adjacency function using [3]. Specifically, the scale-free topology fit index R², from the linear regression between log10(p(k)) and log10(k), was used. Recall that the value of R² should be maximized (at least above 0.8), subject to the constraint that the number of connections should be sufficiently large for module detection. The slope of the regression line must also be around −1. Table 10.2 tabulates these criteria for various values of β, while Fig. 10.1 plots the signed R² and the mean connectivity as functions of β. Based on these results, we selected the power parameter β = 9, although the number of connections is rather low. If we had later found that the number of connections was too low for module detection, the parameter could have been reduced.
Table 10.2
Determining the value of the exponent (β parameter) for the power adjacency function

Power   Scale-free R²   Slope     Truncated R²   Mean k   Median k    Max k
1       −0.0658         0.979     0.985          1480     1.47E+03    2420
2       0.151           −1.18     0.992          276      2.60E+02    709
3       0.626           −2.04     0.978          68.9     5.85E+01    280
4       0.771           −2.38     0.955          21.6     1.56E+01    143
5       0.784           −2.39     0.929          8.28     4.72E+00    88
6       0.837           −2.07     0.927          3.8      1.65E+00    59.2
7       0.895           −1.78     0.942          2.04     6.66E−01    42.1
8       0.935           −1.58     0.951          1.24     2.95E−01    31
9       0.995           −1.44     0.995          0.825    1.41E−01    23.8
10      0.988           −1.44     0.992          0.592    6.93E−02    21.2
12      0.977           −1.47     0.988          0.351    1.82E−02    17.6
14      0.975           −1.5      0.986          0.235    5.43E−03    14.9
16      0.976           −1.51     0.986          0.17     1.66E−03    12.9
18      0.98            −1.51     0.988          0.129    5.32E−04    11.2
20      0.971           −1.52     0.975          0.102    1.73E−04    9.86
Fig. 10.1. At β=9, the scale-free topology fit does not improve after increasing the power. Also, the slope of the plot is around −1. However, there is a natural trade-off between maximizing scale-free topology model fit R2 and maintaining a high mean number of connections. The parameter values that lead to an R2 value close to 1 may lead to networks with very few connections.
Fig. 10.2 shows how well the network satisfies the scale-free topology criteria for β = 9.

3.1.3. Detecting Gene Modules
Modules are groups of genes whose expression profiles are highly correlated across the samples; that is, genes with high topological overlap. To group genes with coherent expression profiles into modules, Zhang and Horvath use average linkage hierarchical clustering coupled with a topologically based dissimilarity function, the topological overlap dissimilarity measure, since it was found to result in biologically meaningful modules. The topological overlap of two nodes reflects their interconnectedness. The topological overlap matrix (TOM) = [wij] is non-negative and symmetric, hence is a similarity measure. For an unweighted network, aij ∈ {0, 1} and

wij = (lij + aij) / (min(ki, kj) + 1 − aij),   [4]

where lij = Σu aiu auj, ki = Σu aiu is the node connectivity, and lij is the number of nodes to which nodes i and j are both connected. Two nodes will be maximally similar (wij = 1) if they
Building Networks with Microarray Data
325
Fig. 10.2. The scatter plot between log10 (p(k)) and log10 (k) shows how well the network satisfies a scale-free topology at β = 9. The solid curve corresponds to scalefree topology and the dotted curve corresponds to truncated scale-free topology.
are connected and all neighbors of the node with fewer connections are also neighbors of the other node, whereas they will be maximally dissimilar (wij = 0) if they are not connected and do not share a common neighbor. Since [4] does not require the adjacency matrix [aij ] to contain only binary values, Zhang and Horvath generalize the TOM to weighted networks simply by allowing real numbers to be used. The dissimilarity measure for hierarchical clustering is then defined as dijw = 1 − wij .
[5]
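Equations [4] and [5], together with the average linkage clustering step just mentioned, can be sketched compactly in Python (NumPy/SciPy). The height threshold and minimum module size defaults below match the values used later in this section; the function names and everything else are our own scaffolding:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def topological_overlap(adjacency):
    """TOM of Eq. [4]; works for weighted networks because a_ij may be
    any real value in [0, 1]."""
    a = np.asarray(adjacency, dtype=float).copy()
    np.fill_diagonal(a, 0.0)               # no self-connections
    l = a @ a                              # l_ij = sum_u a_iu * a_uj
    k = a.sum(axis=1)                      # node connectivity k_i
    w = (l + a) / (np.minimum.outer(k, k) + 1.0 - a)
    np.fill_diagonal(w, 1.0)               # a node overlaps itself completely
    return w

def detect_modules(adjacency, height=0.94, min_size=20):
    """Average linkage clustering of d_ij = 1 - w_ij (Eq. [5]), cut at a
    fixed height; modules smaller than min_size become 'grey' (label 0)."""
    d = 1.0 - topological_overlap(adjacency)
    np.fill_diagonal(d, 0.0)               # squareform needs a zero diagonal
    tree = linkage(squareform(d, checks=False), method="average")
    labels = fcluster(tree, t=height, criterion="distance")
    for lab in np.unique(labels):
        if (labels == lab).sum() < min_size:
            labels[labels == lab] = 0
    return labels
```

For a real data set of thousands of genes, the chapter's approach of restricting attention to the most connected genes keeps this computation tractable.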
Hierarchical clustering using the TOM-based dissimilarity measure groups highly co-expressed genes into branches within a hierarchical cluster tree, also called a dendrogram. The clusters are selected using a height threshold; if the eigengenes (first principal components) of the resulting modules are found to be highly correlated, the modules should be merged. The clustering tree for our breast cancer example data, generated using hierarchical clustering with the TOM matrix, is shown in Fig. 10.3. To reduce computational requirements, the analysis was restricted to the 3,600 most connected genes. Gene modules correspond to branches of the tree and can be detected using a height threshold: a large threshold value leads to large modules, while a small value leads to small but tight modules.

Fig. 10.3. Hierarchical clustering tree using the topological overlap dissimilarity is used for the network module detection. The tree branches are labeled by module membership: green, black, red, brown, yellow, pink, turquoise, blue (left to right).

Here we selected a fixed height threshold of 0.94 and required each module to contain at least 20 nodes. Each branch below the threshold corresponds to a module, which is labeled by an arbitrary “color code”. Genes that are not part of any module are shown in grey. Figure 10.4 shows a cluster tree obtained by calculating the dissimilarity between the module eigengenes.

3.1.4. Detecting Significant Genes
Having obtained several clusters of co-expressed genes, we would now like to select the genes that are most correlated with clinical covariates, such as cancer survival, and most essential to the organism. The latter genes are often network hubs, so gene essentiality is correlated with high connectivity. For unweighted networks, the connectivity k_i of node i is the number of direct connections between node i and the other nodes:

$$ k_i = \sum_{j=1}^{n} a_{ij}. \qquad [6] $$

Zhang and Horvath extend this definition to weighted networks by allowing a_ij to take real values. They also propose a TOM-based connectivity measure:

$$ w_i = \sum_{j=1}^{n} w_{ij}, \qquad [7] $$

where w_ij is the topological overlap between nodes i and j, and show that, for their cancer network application, w_i is superior to k_i. Zhang and Horvath also show that intramodular node connectivities (the number or weight of connections to genes within the same module, denoted by k.in and w.in) are more meaningful biologically.

Fig. 10.4. Cluster tree based on the module eigengenes.

Another measure of gene connectivity is the clustering coefficient C_i, 0 ≤ C_i ≤ 1, which measures the modularity of the co-expression network near node i. C_i is the ratio of the number of connections between the neighbors of node i, n_i, to the maximum possible number of such connections, π_i:

$$ C_i = \frac{n_i}{\pi_i}. \qquad [8] $$

Maximal clustering (C_i = 1) occurs when all the neighbors are connected to each other, and minimal clustering (C_i = 0) occurs when none of the neighbors is connected to any other neighbor. For an unweighted network, the number of connections between the neighbors of node i is

$$ n_i = \frac{1}{2} \sum_{u \neq i} \; \sum_{v \neq i,\, v \neq u} a_{iu} a_{uv} a_{vi}. \qquad [9] $$

Generalizing this equation to weighted networks is simply a matter of allowing a_ij to take real values.
For an unweighted network, the maximum number of possible connections between the neighbors of node i is

$$ \pi_i = \frac{k_i (k_i - 1)}{2}, \qquad [10] $$

where k_i is the number of nodes connected to node i. Generalization of this equation to weighted networks is not trivial. Zhang and Horvath show that it is given by

$$ \pi_i = \frac{1}{2} \Bigl[ \Bigl( \sum_{u \neq i} a_{iu} \Bigr)^{2} - \sum_{u \neq i} a_{iu}^{2} \Bigr], \qquad [11] $$
which reduces to [10] for an unweighted network, where a_ij ∈ {0, 1}. Zhang and Horvath note that the clustering coefficient is approximately independent of connectivity in a weighted network but is anti-correlated with connectivity in an unweighted network. This is because the power adjacency function of the weighted network preserves the factorizability of the within-module similarity, which makes the clustering coefficient approximately constant, while the adjacency matrix of the unweighted network does not preserve this property.

Having calculated the adjacency matrix for our breast cancer data set using β = 9 and detected eight modules using average linkage hierarchical clustering with the topological overlap dissimilarity measure, we now relate the network concepts to the sample information, as captured by the gene significance variable. Gene significance is defined as minus log10 of the univariate Cox regression p-value for predicting survival. Figure 10.5 shows which modules are enriched with essential genes; the black module has the highest mean gene significance. Table 10.3 lists the genes in the black module that have both high gene significance (Cox p-value smaller than 0.05) and high intramodular connectivity (greater than 0.85). To save space, we do not show the results for the other modules, but we note that none of the genes in the blue or green modules satisfies the criteria.

3.2. Bagged Gene Shaving
In this section, we describe an alternative method for finding gene clusters or metagenes which is based on gene shaving (12). Briefly, gene shaving finds the first principal component of the data, ranks each gene according to the absolute value of the correlation between the gene and the principal component, then discards, or shaves off, a small percentage of the genes that correlate least well with the principal component. It then finds the principal component of the remaining genes, ranks them by their correlation with the principal component, and shaves off another small percentage of genes that correlate least well with this principal component. The process is repeated until only two genes remain.
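The shaving loop just described can be sketched in a few lines of Python (NumPy). This is a single unsupervised run for one cluster; the function name and defaults are ours, and a full gene shaving analysis would add the gap statistic and orthogonalization steps discussed below:

```python
import numpy as np

def shave(X, frac=0.05):
    """One run of gene shaving: repeatedly take the first principal
    component of the current gene set, rank genes by |correlation| with
    it, and discard the worst-ranked fraction.  X is genes x samples;
    returns the nested subsets of gene indices, down to two genes."""
    genes = np.arange(X.shape[0])
    path = [genes]
    while len(genes) > 2:
        sub = X[genes] - X[genes].mean(axis=1, keepdims=True)
        # first principal component = leading right-singular vector
        pc = np.linalg.svd(sub, full_matrices=False)[2][0]
        corr = np.abs([np.corrcoef(row, pc)[0, 1] for row in sub])
        # keep the best-correlated genes, always shrinking by at least one
        n_keep = max(2, min(len(genes) - 1, int(len(genes) * (1 - frac))))
        genes = genes[np.argsort(corr)[::-1][:n_keep]]
        path.append(genes)
    return path
```

Each element of the returned path is a candidate cluster; the gap statistic then picks the best size along the path.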
Fig. 10.5. Mean value of gene significance for each module. Note that the black module has the highest mean value of gene significance.
The two remaining genes and the genes that are discarded last are those most likely to be in the cluster. To determine the size of the cluster, the total variance of the cluster, V, is separated into two components: the variance within the genes in the cluster, Vw , and the variance between the genes in the cluster, Vb . The variance explained, D, is the proportion of the total variance that is within the genes: Vw /V . The cluster size is determined by choosing the number of genes, k, that maximizes the gap statistic, which is the difference between the variance explained by the best k genes in the actual data and the estimated variance that would be explained by the best k genes obtained from a random permutation of the data. To estimate the latter, the entire shaving process is repeated several times after randomly permuting the data each time. The gap statistic Gap(k) is computed at each shaving step, k, as the difference between the variance explained of the original data and the average of the variances explained of the permuted data sets. After the cluster is determined, the original data matrix is orthogonalized against the mean gene of the cluster, thus ensuring it is not detected again, and the whole process is repeated to extract additional clusters until no further clusters can be found or some predetermined maximum number of clusters is reached.
Table 10.3
Genes with high gene significance and high intramodular connectivity in the black module. Multiple entries for genes with the same symbol correspond to different probe sets.

Probe set   Gene       Coef       Exp. coef.   Se. coef.   z          pCox      Gene Sig
371         PSMB9      −0.74323   0.475574     0.300846    −2.47048   0.013     1.886057
520         HLA-F      −0.70219   0.495497     0.298385    −2.35332   0.019     1.721246
193         HLA-B/-C   −0.80743   0.446005     0.296068    −2.72716   0.0064    2.19382
99          HLA-B/-C   −0.8705    0.41874      0.296477    −2.93616   0.0033    2.481486
290         HLA-B/-C   −0.76611   0.464819     0.2968      −2.58123   0.0098    2.008774
45          HLA-G      −0.95659   0.3842       0.30069     −3.18133   0.0015    2.823909
103         HLA-G      −0.87481   0.41694      0.298846    −2.9273    0.0034    2.468521
46          HLA-G      −0.94972   0.386849     0.299151    −3.17472   0.0015    2.823909
12          HLA-G      −1.06741   0.343897     0.302351    −3.53038   0.00041   3.387216
100         HLA-B/-C   −0.87233   0.417977     0.297205    −2.93511   0.0033    2.481486
76          HLA-B/-C   −0.89824   0.407286     0.297903    −3.01521   0.0026    2.585027
260         HLA-B/-C   −0.77616   0.460168     0.295649    −2.62529   0.0087    2.060481
685         HLA-A      −0.66011   0.516797     0.296957    −2.2229    0.026     1.585027
638         HLA-B/-C   −0.68491   0.504135     0.304028    −2.25279   0.024     1.619789
113         HLA-G      −0.86298   0.421905     0.299372    −2.88262   0.0039    2.408935
213         HLA-E      −0.80953   0.445065     0.300429    −2.69459   0.007     2.154902
419         HLA-F      −0.73466   0.479669     0.299068    −2.4565    0.014     1.853872
Cluster selection is a kind of variable selection problem and is very sensitive to the specific data used. Small changes to the data, such as the inclusion or exclusion of a single sample, can result in clusters being found in a different order, or in specific genes being assigned to a different cluster or to no cluster at all. Occasionally, large clusters are not found at all. To ameliorate these problems, we bagged (13) the gene shaving process by generating 127 Bayesian bootstrap resamples of the original data, gene shaving each resampled data set using an optimized, high-performance implementation of GeneClust (14), itself an efficient implementation of gene shaving, and aggregating the resulting clusters.

Recall that the final ranking step described in the data preprocessing section (2) ranked each gene's expression level across the samples in each series and then divided the resulting ranks by the number of samples in the series, so that all series were ranked on the same zero-to-one scale. This is equivalent to assigning each sample a fractional weight of 1/n, where n is the number of samples in the series, before ranking, so that the total weight assigned to each series is 1. We
generated the 127 Bayesian bootstrap resamples of the data by assigning weights 1/B_x to each sample, where B_x is a random variable distributed according to the Bayesian bootstrap distribution (15), such that the weights for the samples in each series sum to 1. Unsupervised gene shaving was applied independently to the original data and to the 127 resampled data sets, and was terminated after 300 clusters had been generated from each data set. At each shaving step, 5% of the remaining genes were discarded, and 10 randomly permuted samples were used to calculate the gap statistic.

The 38,400 clusters thus obtained were aggregated by creating an N × N matrix K over all probe sets found in any cluster, where N is the number of such probe sets. The value of K[i, j] for two distinct probe sets i and j is the sum of the gap statistics of all clusters in which both probe sets occur; clearly, K is symmetric. Clusters were extracted from K by choosing a seed probe set i whose largest entry L = max_j K[i, j] is greatest, and then adding every probe set j for which K[i, j] ≥ L/4. The cluster so formed is removed from K and the process is repeated until K is empty. As a result, the largest, strongest clusters are selected first, smaller infrequent clusters are selected later, and most likely junk is selected toward the end. Figure 10.6 shows the submatrix of K containing the probe sets from the first 20 clusters. Figure 10.7 shows the first cluster thus obtained.

3.2.1. Cluster Selection
The first 20 (largest) clusters extracted from K were selected for the Bayesian network learning described below. Five of these are clusters of genes related solely by their genomic location; the rest arise from functional relationships between the genes. Since the eight clusters found using co-expression networks overlap with these 20 clusters, we will use them to facilitate comparisons between the two methods. Under other circumstances, it might be more appropriate to explicitly include clusters that correlate strongly with one or more covariates of interest (such as survival).
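Our reading of the co-occurrence aggregation and peeling procedure that produced these clusters can be sketched as follows (Python; the function name and the greedy seed-and-threshold rule are our interpretation of the description above):

```python
import numpy as np

def aggregate_clusters(clusters, gaps, n_probes, frac=0.25):
    """Aggregate bootstrap clusters via a co-occurrence matrix K:
    K[i, j] sums the gap statistic of every cluster containing both
    probe sets i and j.  Meta-clusters are then peeled off greedily,
    strongest seed first, until K is empty."""
    K = np.zeros((n_probes, n_probes))
    for members, gap in zip(clusters, gaps):
        for i in members:
            for j in members:
                if i != j:
                    K[i, j] += gap
    found = []
    while K.any():
        seed = int(K.sum(axis=1).argmax())     # strongest remaining probe set
        level = K[seed].max()
        members = [seed] + [j for j in range(n_probes)
                            if j != seed and K[seed, j] >= level * frac]
        found.append(sorted(members))
        K[members, :] = 0.0                    # remove the cluster from K
        K[:, members] = 0.0
    return found
```

With frac = 0.25 this mirrors the L/4 threshold in the text; the order in which clusters are returned reflects their aggregate strength.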
3.3. Cluster Comparison
The previous two sections presented two different methods, co-expression networks and bagged gene shaving, for reducing the number of variables (genes) we need to consider from many thousands to a relatively small number of highly correlated clusters, or metagenes, on which we can base further analysis. We would expect considerable similarity in the clusters produced by the two methods, although we would not expect two such different methods to generate identical clusters, and that is indeed what we observe. Even though the co-expression networks were partially supervised and the bagged gene shaving was completely unsupervised, 6 of the 8 clusters generated by co-expression networks are subsets of related clusters among the strongest 20 clusters generated by bagged gene shaving.

Fig. 10.6. Submatrix of K containing the probe sets from the first 20 clusters. Dark entries indicate that the two probe sets frequently belong to the same cluster. Cluster 3 displays a complex structure and interactions with clusters 6 and 8, with some probe sets appearing sometimes in cluster 3 and sometimes in the other cluster. Cluster 3 contains the red cluster and part of the brown cluster obtained using co-expression networks, while cluster 8 contains the remaining part of the brown cluster.

Table 10.4 lists the clusters found by both methods, together with a brief description of the biology represented by each cluster. The genes in the remaining two clusters found by co-expression networks are also found in the bagged gene shaving clusters, but the correspondence between the two is a little more involved. All the genes concerned are associated with immune response, and it appears that there are three very closely related groups: A, B, and C. Co-expression networks put group A in the red cluster and merge B and C in the brown cluster, while bagged gene shaving merges groups A and B in cluster 3 and puts group C in cluster 8. Red and brown are the two most similar clusters found by co-expression networks, and the co-occurrence matrix of bagged gene shaving clearly shows several genes that sometimes belong to one cluster and sometimes to the other. A more
Fig. 10.7. Heatmap of cluster 1.
Table 10.4
Clusters found by gene shaving (column 1) and co-expression networks (column 2). Column 3 briefly describes the most plausible reason for the genes concerned to form a cluster. The brown co-expression network cluster has two large subsets, one of which is a subset of gene shaving cluster 3 and the other a subset of gene shaving cluster 8.

Gene shaving   Co-expression cluster   Description
1              Blue                    DNA repair, ubiquitination, many genes with unknown function in breast cancer
2              Yellow                  Cell cycle
3              Red/brown               Immune response
4              –                       G-protein-coupled signaling, membrane transport, many genes with unknown functions
5              Turquoise               Extracellular matrix proteins
6              Green                   Immunoglobulins
7              –                       30 genes from chr16q21-24
8              Brown                   Immune response
9              –                       22 genes from chr16p11-13
10             Pink                    Lipid metabolism
11             –                       mRNA transcription regulation, protein translation, many genes with unknown functions
12             –                       23 genes from chr1q21-43
13             –                       Interferon induced
14             –                       Affymetrix control genes
15             –                       20 genes from chr8q22
16             –                       16 genes from chr17p13
17             Black                   Histocompatibility class I
18             –                       Cell adhesion, many genes with unknown function
19             –                       Histones
20             –                       RNA splicing, protein transport
sophisticated method for pulling clusters from the co-occurrence matrix might be able to include such genes in both clusters.

The results demonstrate that both methods can recognize functionally closely related gene clusters. For example, we observed several immune response-related clusters in which essentially all genes represent immunological functions, including a cluster of immunoglobulins and clusters of interferon-regulated genes and histocompatibility molecules. Other functionally homogeneous clusters contained genes involved in lipid metabolism and extracellular matrix formation. These biological functions correspond to known cellular components of breast cancer tissue. It is important to recognize that the gene expression data we analyzed were generated from total RNA extracted from surgically resected breast cancer specimens that include a variable mix of fat cells, fibroblasts, lymphocytes, and neoplastic cells. These clusters probably represent the core transcriptional machinery of these various cell types.

The gene clusters also reflect some of the known biology of breast cancer. A subset of breast cancers with aggressive histological features is also characterized by a high proliferation rate, and indeed we observed a strong cell cycle cluster. Several studies have examined whole-genome copy number alterations in breast cancer using comparative genomic hybridization (16, 17) and have shown that DNA copy number gains are common in chromosomes 1q, 8q, and 16q. Indeed, we observed several clusters of genes located close to each other in particular chromosome bands, which suggests that they reside within a common amplicon that accounts for their co-ordinated expression. These known functional and genomic correlations support the validity of these clustering methods.

However, the most interesting aspect of these results may be that they also identify several gene clusters that are dominated by genes with currently unknown function in breast cancer. Even functionally homogeneous gene clusters contain at least a few genes with no apparent functional relationship to the rest. Current biological models of breast cancer rely on the interaction of a few dozen to a few hundred genes; however, gene expression analysis reveals that several thousand genes are expressed in breast cancer. This suggests that many novel biological pathways are yet to be discovered and that existing molecular pathways will need to be expanded.
Clustering results like those obtained above are a starting point for laboratory investigators to select novel genes for in vitro experiments, and the co-expression environment of a gene may give clues about the biological function in which it participates. For example, gene shaving cluster 11 contains many genes involved in gene transcription and protein translation and an equal number of genes with unknown function or no apparent link to transcriptional regulation. It would be reasonable to examine whether those genes regulate or participate in these biological processes.
4. Bayesian Network Models

The previous section described two methods for reducing the very large number of individual probe sets on a gene expression microarray to a much smaller and more manageable number of gene clusters, or metagenes, of highly correlated genes. In this
section, we use Bayesian networks to obtain a global perspective of the relationships between these gene clusters. Conceptually, a Bayesian network model is a directed acyclic graph (DAG) consisting of nodes representing quantities of interest, such as gene clusters or clinical covariates, and directed edges, such that there are no cycles in the graph. Every directed edge from a parent node to a child node represents a conditional dependence of the child node on that parent. Mathematically, a Bayesian network model is a compact representation of the joint probability distribution of all the nodes in the model. The number of parameters required to fully specify the uncompacted model is exponential in the number of nodes in the network. The edges in Bayesian networks encode statements about conditional independences between nodes and hence enable massive reductions in the number of parameters required to describe the model. Specifically, a node in a Bayesian network is independent of its ancestors given its parents. Figure 10.8 shows a very simple Bayesian network consisting of four nodes (C1–C4) and four edges. The joint probability of the four nodes can be written as follows: P(C1, C2, C3, C4) = P(C1) ∗ P(C2|C1) ∗ P(C3|C2, C1) ∗ P(C4|C3, C2, C1).
Fig. 10.8. Simple Bayesian network.
Note that although C2 and C3 both depend on C1, C3 is conditionally independent of C2 given its parent C1, and C4 is conditionally independent of node C1 given C2 and C3, so we can simplify the above expression to P(C1, C2, C3, C4) = P(C1) ∗ P(C2|C1) ∗ P(C3|C1) ∗ P(C4|C3, C2).
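The parameter savings from this factorization can be counted directly. A quick sketch, assuming binary nodes (the function names are ours):

```python
def n_params_full(n_nodes, arity=2):
    """Free parameters of the unfactorized joint distribution."""
    return arity ** n_nodes - 1

def n_params_factored(parent_counts, arity=2):
    """Free parameters of a Bayesian network: each node needs
    (arity - 1) values per configuration of its parents."""
    return sum((arity - 1) * arity ** k for k in parent_counts)

# Fig. 10.8: C1 has no parents, C2 and C3 each have one (C1),
# and C4 has two (C2 and C3).
full = n_params_full(4)                      # 2**4 - 1 = 15
factored = n_params_factored([0, 1, 1, 2])   # 1 + 2 + 2 + 4 = 9
```

For four nodes the saving is modest (15 versus 9 parameters); for a 20-node network with at most two parents per node the gap is already enormous.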
For this small example, the reduction in number of parameters required to specify the network’s joint probability distribution is
minimal, but it quickly becomes significant for larger networks, since the number of parameters required to describe the unsimplified joint probability grows exponentially in the number of nodes in the network.

The conditional probability distributions at each node can be of any form, but multinomial tables, normal (Gaussian) distributions, and mixtures thereof are the only types commonly used. Bayesian networks based on normal distributions are beyond the scope of this chapter, so we will concern ourselves only with multinomial tables. The values of each node in a multinomial Bayesian network must belong to a small set of discrete values. Clinical covariates, such as estrogen receptor (ER) status (yes or no), are already of this form, while continuous covariates must first be discretized. For gene expression measurements, for example, these values could be high (HI) and low (LO) expression. An additional mid-level expression value could also be included, but it would increase the amount of data required to infer the network and complicate the biological interpretation of the interactions, so we will content ourselves with a two-level discretization.

In a multinomial Bayesian network, the conditional probability distribution associated with each node becomes a conditional probability table (CPT). For the network shown in Fig. 10.8, C1 has no parents, and if its two values (HI and LO) are equally likely, its CPT is simply

    P(C1 = LO)   P(C1 = HI)
    0.5          0.5
The probability distributions for nodes C2 and C3 both depend on the value of C1 and are slightly more complex:

               P(C2 = LO)   P(C2 = HI)
    C1 = LO    0.7          0.3
    C1 = HI    0.3          0.7
C2 is more likely to be low if C1 is low and more likely to be high if C1 is high, so this could be an example of C1 activating C2 (or vice versa, since edges in a Bayesian network need not be causal):

               P(C3 = LO)   P(C3 = HI)
    C1 = LO    0.5          0.5
    C1 = HI    0.8          0.2
C3 is more likely to be low if C1 is high, but equally likely to be low or high if C1 is low, so this could be an example of C1 inhibiting C3. C4 depends on both C2 and C3:

                        P(C4 = LO)   P(C4 = HI)
    C2 = LO, C3 = LO    0.4          0.6
    C2 = LO, C3 = HI    0.6          0.4
    C2 = HI, C3 = LO    0.9          0.1
    C2 = HI, C3 = HI    0.1          0.9
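These four CPTs fully specify the joint distribution. A short sketch (Python; the dictionary layout is our own) shows how the factorization is used to compute any joint or marginal probability:

```python
# CPTs for the network of Fig. 10.8 (values from the tables above).
p_c1 = {"LO": 0.5, "HI": 0.5}
p_c2 = {"LO": {"LO": 0.7, "HI": 0.3},          # p_c2[c1][c2]
        "HI": {"LO": 0.3, "HI": 0.7}}
p_c3 = {"LO": {"LO": 0.5, "HI": 0.5},          # p_c3[c1][c3]
        "HI": {"LO": 0.8, "HI": 0.2}}
p_c4 = {("LO", "LO"): {"LO": 0.4, "HI": 0.6},  # p_c4[(c2, c3)][c4]
        ("LO", "HI"): {"LO": 0.6, "HI": 0.4},
        ("HI", "LO"): {"LO": 0.9, "HI": 0.1},
        ("HI", "HI"): {"LO": 0.1, "HI": 0.9}}

def joint(c1, c2, c3, c4):
    """P(C1,C2,C3,C4) = P(C1) P(C2|C1) P(C3|C1) P(C4|C2,C3)."""
    return p_c1[c1] * p_c2[c1][c2] * p_c3[c1][c3] * p_c4[(c2, c3)][c4]

states = ("LO", "HI")
# A marginal is obtained by summing the factorized joint over the
# remaining nodes; the joint itself sums to 1 over all 16 states.
p_c4_hi = sum(joint(a, b, c, "HI")
              for a in states for b in states for c in states)
```

Inference engines do exactly this kind of summation, but organize it far more efficiently than the brute-force loop shown here.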
In this example, C3 weakly inhibits C4 if C2 is low, but strongly activates C4 if C2 is high.

Given data and a Bayesian network structure, computing the CPTs associated with each node is straightforward. The challenge we face is to infer the most likely network structure directly from the data. Specifically, we want to determine the graph G for which the probability of the graph given the data, P(G|D), is maximal. Applying Bayes' theorem (18, 19), we can rewrite P(G|D) as

P(G|D) = P(D|G) ∗ P(G) / P(D).

Given the data, P(D) is a constant independent of the graph, P(G) is a prior that can be used to penalize overly complex graphs, and P(D|G) can be evaluated. In practice, it is more convenient to work with logs, so the goal is to find the network that maximizes log P(D|G) + log P(G), which can be separated into independent contributions from each node and rewritten as
$$ \sum_{n \in G} \bigl[ \log P(D \mid G_n) + \log P(G_n) \bigr]. $$
The body of the summation, called the scoring function, evaluates the contribution to the network's overall fitness of the node and its parents given the data, together with any prior beliefs about network structures. Often a scoring function that penalizes nodes with too many parents is chosen to discourage networks that over-fit the data.

Unfortunately, it is impossible to find, deterministically, the network with the highest value of the scoring function for even moderately sized networks: it is known theoretically that no polynomial-time algorithm for finding the highest scoring network exists (20). Moreover, the number of possible networks grows super-exponentially in the number of nodes, so even for very modest-sized networks it is computationally infeasible to evaluate them all. Consequently, most methods for learning Bayesian networks from data resort to heuristic search for a network (or networks) that is very likely to be close to the highest scoring one. One method, for instance, starts with a random graph,
considers all possible small changes to the network (adding a single edge, removing a single edge, or reversing a single edge), selects the change that improves the network score the most, and repeats this process until no further small change improves the score. The resulting network may, however, be only a local maximum, with many much better networks possible but unreachable from the current network by small changes. To escape a local maximum, the search procedure restarts from a random position and again searches for a local maximum. After many restarts (say 100), the highest scoring network found during the entire process is likely to be very close to the best possible.

When there are very little data, the single best network is one of many with very similar scores. Rather than concentrating on the edges present in just this one network, we should be concerned with the edges that are common to all the networks whose scores are very close to the best possible: an edge that occurs in all such networks is much more likely to be significant than one that occurs in this network alone. To find many high-scoring networks, we take 10,000 bootstrap resamples of the data, use the method described above to find a high-scoring network for each resample, and then count how often each edge occurs in the resulting networks. A related method for finding such edges is a Bayesian Markov chain Monte Carlo process (21). In either case, edges that are found more often than a specified threshold are included in the consensus network, and multiple thresholds can be used to assign selected edges different levels of confidence.

To obtain discretized cluster values, each sample's position in the cluster heatmap is determined, and the sample is discretized to low if it is in the left half and to high otherwise.
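The bootstrap-consensus step can be sketched as follows. Note that `learn_edges` below is a deliberately naive stand-in for the scored hill-climbing search described above, and all names are ours; only the resample-count-threshold skeleton mirrors the chapter's procedure:

```python
import random
from collections import Counter

def learn_edges(data, nodes):
    """Toy stand-in for a structure learner: each node gets one incoming
    edge from the node it agrees with most often across the samples.
    A real analysis would use a scored DAG search with restarts."""
    edges = set()
    for child in nodes:
        parent = max((n for n in nodes if n != child),
                     key=lambda p: sum(row[p] == row[child] for row in data))
        edges.add((parent, child))
    return edges

def consensus_edges(data, nodes, n_boot=200, threshold=0.80, seed=0):
    """Learn a network on each bootstrap resample, count how often each
    edge occurs, and keep edges above the threshold (the chapter uses
    10,000 resamples and thresholds between 80% and 99%)."""
    rng = random.Random(seed)
    tally = Counter()
    for _ in range(n_boot):
        resample = [rng.choice(data) for _ in data]
        tally.update(learn_edges(resample, nodes))
    return {e for e, c in tally.items() if c / n_boot >= threshold}
```

Keeping the tally (rather than just the thresholded set) also allows several confidence bands to be drawn, as in Figs. 10.9 and 10.10.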
When scoring a node for which a sample had missing clinical covariates for the node or its parents, the sample concerned was ignored and the overall node score was rescaled so that it had weight equivalent to that of a node without missing data. Survival times were discretized into long and short survivors, while accounting for censoring, as follows. Observed survival times of less than 2,500 days were discretized as short, and otherwise as long. Censored survival times longer than 2,500 days were also discretized as long. For each censored survival time of less than 2,500 days, the probability, p_l, of surviving at least 2,500 days given the censored survival time was determined, and the sample was split into two weighted samples: one with short survival and weight 1 − p_l and one with long survival and weight p_l.

4.1. Results
Figures 10.9 and 10.10 show the consensus network models obtained for the clusters selected using co-expression networks and bagged gene shaving, respectively. In each case, edges that occur most frequently (in more than 99% of the 10,000 resamples) are drawn as triple lines, those in 96–99% as bold lines, in 91–95% as solid lines, in 86–90% as dashed lines, and in 80–85% as dotted lines. Edges that occur at least 80% of the time in the same direction are drawn with a solid arrow, and those that occur at least 55% of the time in the same direction with an open arrow. Recall that an arrow does not necessarily represent causality; it simply represents a statistical dependence between the nodes, not activation or repression. The latter can be determined by inspecting the conditional probability tables associated with each node.

Fig. 10.9. Consensus Bayesian network of co-expression network clusters. Nodes in the graph correspond to clinical variables or sample weights within the clusters found by the co-expression network analysis. For the latter, the node name includes in parentheses the number of the corresponding cluster in the gene shaving cluster analysis. Edges between nodes indicate those that occur frequently in the high-scoring networks learnt from 10,000 bootstrap resamples of the data. Dotted edges occur in 80–85% of the networks, dashed edges in 86–90%, solid edges in 91–95%, bold edges in 96–99%, and triple edges in more than 99% of the networks.

Fig. 10.10. Consensus Bayesian network of gene shaving clusters. Nodes in the graph correspond to clinical variables or sample weights within the clusters found by gene shaving. Shaded nodes are similar to a cluster included in the co-expression network analysis. Edges between nodes indicate those that occur frequently in the high-scoring networks learnt from 10,000 bootstrap resamples of the data. Dotted edges occur in 80–85% of the networks, dashed edges in 86–90%, solid edges in 91–95%, bold edges in 96–99%, and triple edges in more than 99% of the networks.
4.2. Discussion
The consensus network of co-expression clusters (Fig. 10.9) reveals several known biological associations in breast cancer and also suggests new directions for research. There is a well-known and strong correlation between ER status and tumor grade (most ER-negative cancers are high grade) that is recognized by our analysis as well. The data also suggest that immunological infiltration is related to ER status, which is a novel observation. Internal consistency of the results is indicated by the close connection between the various immunological clusters (green, red, brown). We also detected the known association between grade and proliferation (high-grade tumors show high proliferative activity). However, the strong connection between extracellular matrix components, lipid metabolism, and grade is novel and deserves further study.

The consensus network of gene shaving clusters (Fig. 10.10) reveals, as expected, many of the same relationships discussed above, but also suggests additional connections. For example, the link between histones and grade is novel but expected: histones organize three-dimensional DNA structure and play a major role in maintaining functional DNA structure between mitoses, and one of the histological hallmarks of high grade is a "loosened and disorganized" nuclear DNA appearance under light microscopic examination of the cells. Connections of various functionally homogeneous gene clusters to chromosomal regions are also fascinating and could suggest that DNA anomalies at particular amplicons cause particular functional abnormalities.
5. Summary

In this chapter we demonstrated how to apply Bayesian network analysis to human gene expression data. Gene expression profiling allows clinical investigators to take an almost complete inventory of the “molecular parts,” at the mRNA level, that make up a human cancer. Several thousand genes are expressed in each tumor specimen, and the vast majority of these have little known functional relationship to cancer biology. The sheer number of novel genes detected by these experiments makes it difficult to design laboratory strategies for systematic functional annotation of these novel genes. Investigators use various subjective criteria to select genes for further investigation; for example, genes with high expression in a particular disease subset or genes associated with a particular outcome are often considered candidates for functional studies. However, many genes usually meet these selection criteria, and the final pick is rather arbitrary. Furthermore, methods that define and rank differentially expressed genes between disease subsets do not provide any clues about the functional connections of the selected genes.
Broom et al.
Bayesian network analysis offers a different approach to defining gene clusters of interest that may represent functionally connected genes. This approach is unbiased by prior knowledge about the function of genes and identifies the most robust co-expression patterns that exist in a data set. Gene clusters themselves can then be organized into “metagene” networks, again unbiased by known function. Our results indicate that this approach can identify many of the known functionally connected gene networks. For example, the simplest gene co-expression networks represent gene expression signatures of different cellular components of a tissue. Our method defined several immune response gene clusters and also revealed that these “immune metagenes” themselves are coordinately expressed in the data. This is consistent with the fact that some tumors contain large amounts of lymphocyte and inflammatory cell infiltration, so the gene expression profiles of these tumors will include characteristic mRNA expression patterns of these cells. Similar patterns for fat cells and fibroblasts can also be discerned. Most importantly, however, all of these functionally tight gene clusters also contain numerous novel genes, and some clusters are dominated by such genes. We believe that studying the functional relationship between novel genes that are “mixed into” tight functional clusters could reveal new members of that particular molecular pathway. Functional examination of novel clusters could even lead to the discovery of entirely unknown biological pathways.
Acknowledgments

Kim-Anh Do was partially funded by the National Institutes of Health via the University of Texas SPORE in Breast Cancer (CA116199) and the Cancer Center Support Grant (CA016672).

References

1. Schena, M., Shalon, D., Davis, R., and Brown, P. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270(5235), 467–470.
2. Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences 95(25), 14863–14868.
3. Pelloski, C. E., Mahajan, A., Maor, M., Chang, E. L., Woo, S., Gilbert, M., Colman, H., Yang, H., Ledoux, A., Blair, H., Passe, S., Jenkins, R. B., and Aldape, K. D. (2005) YKL-40 expression is associated with poorer response to radiation and shorter overall survival in glioblastoma. Clinical Cancer Research 11(9), 3326–3334.
4. Airoldi, E. M. (2007) Getting started in probabilistic graphical models. PLoS Computational Biology 3(12), e252.
5. Ideker, T., and Lauffenburger, D. (2003) Building with a scaffold: emerging strategies for high- to low-level cellular modeling. Trends in Biotechnology 21(6).
6. Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Francisco, CA.
7. Baggerly, K. A., Coombes, K. R., and Neeley, E. S. (2008) Run batch effects potentially compromise the usefulness of genomic signatures for ovarian cancer. Journal of Clinical Oncology 26(7), 1186–1187.
8. Loi, S., Haibe-Kains, B., Desmedt, C., Lallemand, F., Tutt, A. M., Gillet, C., Ellis, P., Harris, A., Bergh, J., Foekens, J. A., Klijn, J. G., Larsimont, D., Buyse, M., Bontempi, G., Delorenzi, M., Piccart, M. J., and Sotiriou, C. (2007) Definition of clinically distinct molecular subtypes in estrogen receptor-positive breast carcinomas through genomic grade. Journal of Clinical Oncology 25(10), 1239–1246.
9. Xu, X., Wang, L., and Ding, D. (2004) Learning module networks from genome-wide location and expression data. FEBS Letters 578(3), 297–304.
10. Zhang, B., and Horvath, S. (2005) A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology 4(1), Article 17.
11. Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z. N., and Barabasi, A.-L. (2002) Hierarchical organization of modularity in metabolic networks. Science 297, 1551–1555.
12. Hastie, T., Tibshirani, R., Eisen, M. B., Alizadeh, A., Levy, R., Staudt, L., Chan, W. C., Botstein, D., and Brown, P. (2000) ‘Gene shaving’ as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology 1(2).
13. Breiman, L. (1996) Bagging predictors. Machine Learning 24(2), 123–140.
14. Do, K.-A., Broom, B. M., and Wen, S. (2003) Geneclust. In Parmigiani, G., Garrett, E. S., Irizarry, R. A., and Zeger, S. L. (eds.), The Analysis of Gene Expression Data: Methods and Software, chapter 15, p. 342–361. Springer, New York, NY.
15. Rubin, D. B. (1981) The Bayesian bootstrap. The Annals of Statistics 9(1), 130–134.
16. Beers, E. H. V., and Nederlof, P. M. (2006) Array-CGH and breast cancer. Breast Cancer Research 8(3), 210.
17. Bergamaschi, A., Kim, Y. H., Wang, P., Sørlie, T., Hernandez-Boussard, T., Lonning, P. E., Tibshirani, R., Børresen-Dale, A.-L., and Pollack, J. R. (2006) Distinct patterns of DNA copy number alteration are associated with different clinicopathological features and gene-expression subtypes of breast cancer. Genes Chromosomes Cancer 45(11), 1033–1040.
18. Bayes, T. (1763) An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, F. R. S.; communicated by Mr. Price, in a letter to John Canton, A. M. F. R. S. Philosophical Transactions 53, 370–418.
19. Bayes, T. (1763/1958) Studies in the history of probability and statistics: IX. Thomas Bayes’ essay towards solving a problem in the doctrine of chances. Biometrika 45, 296–315. (Bayes’ essay in modernized notation.)
20. Chickering, D. M. (1996) Learning Bayesian networks is NP-complete. In Fisher, D. H., and Lenz, H.-J. (eds.), Learning from Data: Artificial Intelligence and Statistics V, chapter 12, p. 121–130. Springer-Verlag.
21. Friedman, N., and Koller, D. (2003) Being Bayesian about network structure: a Bayesian approach to structure discovery in Bayesian networks. Machine Learning 50, 95–126.
Part IV Advanced or Specialized Methods for Molecular Biology
Chapter 11

Support Vector Machines for Classification: A Statistical Portrait

Yoonkyung Lee

Abstract

The support vector machine is a supervised learning technique for classification, increasingly used in many applications of data mining, engineering, and bioinformatics. This chapter aims to provide an introduction to the method, covering topics from the basic concept of the optimal separating hyperplane to its nonlinear generalization through kernels. A general framework of kernel methods that encompasses the support vector machine as a special case is outlined. In addition, statistical properties that illuminate both the advantages and the limitations of the method, which stem from its specific mechanism for classification, are briefly discussed. For illustration of the method and related practical issues, an application to real data with high-dimensional features is presented.

Key words: Classification, machine learning, kernel methods, regularization, support vector machine.
1. Introduction

Classification is a type of statistical problem in which we want to predict a predefined class membership accurately based on features of an individual. For instance, pathologists wish to diagnose a patient as either healthy or diseased based on some measurements from the patient’s tissue sample. In general, the foremost goal of classification is to learn the discrimination rule attaining the minimum error rate over novel cases. In the statistics literature, Fisher’s linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) are classical examples of a discriminant rule, and modern statistical tools include classification trees, logistic regression, neural networks, and kernel density-based methods. For reference to classification in general, see (1–3).

H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_11, © Springer Science+Business Media, LLC 2010

This chapter introduces the support vector machine (SVM), a classification method which has drawn tremendous attention in machine learning, a thriving area of computer science, over the last decade or so. It has been successfully used in many applications of data mining, engineering, and bioinformatics: for instance, handwritten digit recognition, text categorization, and tumor classification with genomic profiles. Motivated by the statistical learning theory (4) that Vapnik and Chervonenkis developed, Vapnik and his collaborators (5) proposed the optimal separating hyperplane and its nonlinear generalization for pattern recognition in the early 1990s, which is now known as the SVM. For a complete treatment of the subject, see (4, 6, 7) and references therein. The method has gained popularity in part due to its simple geometric interpretation, competitive classification accuracy in practice, and the elegant theory behind it. In addition, it has such operational characteristics as sparsity and duality, which render the method an appealing data-analytic tool. The sparsity of the SVM solution leads to efficient data reduction for massive data at the testing stage, and the mathematical duality allows coherent handling of high-dimensional data. The latter property, in particular, seems fitting for modern data analysis, as data with high-dimensional features are now quite prevalent due to technological advances in many areas of science and industry. In fact, the aforementioned successful applications all involve high-dimensional data. On the statistical side, a salient aspect of the SVM as a classification rule is its mechanism of directly focusing on the decision boundary.
One of the earlier references on the SVM (8) begins by noting how quickly the number of parameters to estimate increases in Fisher’s normal discriminant paradigm as the dimension of the feature space increases. Instead of estimating probability model parameters, the SVM aims at the classification boundary directly, by a hyperplane with maximum margin, amending the non-uniqueness of Rosenblatt’s perceptron (9) (an earlier attempt to find a hyperplane for discrimination). This ‘hard’ classification approach departs from the more traditional approach of ‘soft’ classification through estimation of the underlying probability model that generates the data. Whether the latter is more appropriate than the former depends largely on the context of applications, and their relative efficiency remains a subject of controversy. Contrasting hard classification with soft classification, this chapter provides an overview of the SVM with more focus on conceptual understanding than on technical details, for beginners in the field of statistical learning. Geometric formulation of the method, related computation, and the resulting operational
characteristics are outlined. Various aspects of the method are examined with more emphasis on its statistical properties and connections than other tutorials, for instance (10–12). By doing so, its advantages as well as its limitations are highlighted. For illustration of the method, a data example is provided with discussion of some practical issues arising in its applications.
2. Method

Consider a classification problem where multivariate attributes such as expression levels of genes are measured for each subject in a data set as potential molecular markers for a biological or clinical outcome of interest (e.g., the status of a disease or its progression). Let $X = (X_1, \dots, X_p) \in \mathcal{X} = \mathbb{R}^p$ denote the predictors and Y the variable for the categorical outcome, which takes one of, say, k nominal class labels, $\mathcal{Y} = \{1, \dots, k\}$. Then the so-called training data are given as a set of n observation pairs, $D_n = \{(x_i, y_i),\ i = 1, \dots, n\}$, where the $(x_i, y_i)$’s are viewed as independent and identically distributed random outcomes of (X, Y) from some unknown distribution $P_{X,Y}$. The ultimate goal of classification is to understand informative patterns that exist in the predictors in relation to their corresponding class labels; for that reason, classification is known as pattern recognition in the computer science literature. Formally, it aims to find a map (classification rule) $\phi: \mathcal{X} \to \mathcal{Y}$ based on the training data which can be generalized to future cases from the same distribution $P_{X,Y}$. For simplicity, this section focuses on classification with binary outcomes only (k = 2). Typically, construction of such a rule φ is done by finding a real-valued discriminant function f first and taking either the indicator $\phi(x) = I(f(x) \ge 0)$, if the two classes are labeled 0 or 1, or its sign $\phi(x) = \mathrm{sgn}(f(x))$, if they are symmetrically labeled ±1. In the latter, the classification boundary is determined by the zero level set of f, i.e., $\{x : f(x) = 0\}$.

2.1. Linearly Separable Case
With $\mathcal{Y} = \{-1, 1\}$, first consider a simple scenario as depicted in Fig. 11.1, where the two classes in the training data are linearly separable, so a linear discriminant function, $f(x) = \beta^\top x + \beta_0$, could be adequate for classification. Fisher’s LDA is a standard example of a linear classifier in statistics, which is proven to be optimal in minimizing the misclassification rate under normality and equal-covariance assumptions on the distributions of the predictors for the two classes. In contrast to the LDA, without such distributional assumptions, the discriminant function of the linear SVM is determined
[Figure 11.1 near here.]
Fig. 11.1. This toy example illustrates training data with two predictors ($x_1$, $x_2$) and binary class labels (open circle: 1; solid circle: −1). The solid line indicates the optimal separating hyperplane $\beta^\top x + \beta_0 = 0$; the dotted lines are the level sets $\beta^\top x + \beta_0 = \pm 1$, and the margin between them is $2/\|\beta\|$.
directly through the corresponding hyperplane, $\beta^\top x + \beta_0 = 0$, or the classification boundary itself. The perceptron algorithm (9) is a precursor of the SVM in the sense that both search for a hyperplane for discrimination. However, in situations like the one illustrated in Fig. 11.1, there are infinitely many separating hyperplanes; the former intends to find just one by sequentially updating $\beta$ and $\beta_0$, while the latter looks for the hyperplane with the maximum margin between the two classes, which is uniquely determined. The margin is defined as the distance between the two convex hulls formed by the $x_i$’s with class labels 1 and −1, respectively, and it can be mathematically characterized as follows. When the training data are linearly separable, there exist $\delta > 0$, $\beta_0$, and $\beta$ such that $\beta^\top x_i + \beta_0 \ge \delta$ for $y_i = 1$ and $\beta^\top x_i + \beta_0 \le -\delta$ for $y_i = -1$. Then, without loss of generality, δ can be set to 1 by normalizing $\beta_0$ and $\beta$. This leads to the following separability condition:

$$y_i(\beta^\top x_i + \beta_0) \ge 1 \quad \text{for all } i = 1, \dots, n. \tag{1}$$
So, the margin between the two classes is the same as the sum of the distances from the nearest $x_i$’s with $y_i = \pm 1$ to the hyperplane $\beta^\top x + \beta_0 = 0$. Since the distance of a point $x_0 \in \mathbb{R}^p$ from a hyperplane $\beta^\top x + \beta_0 = 0$ is given by $|\beta^\top x_0 + \beta_0| / \|\beta\|$, under the
specified normalization of the separating hyperplane, the margin is given as $2/\|\beta\|$. Maximizing the margin is mathematically equivalent to minimizing its reciprocal or, in general, a monotonically decreasing function of it, for example $\|\beta\|^2/2$. For the optimal hyperplane, the SVM finds $\beta_0$ and $\beta$ minimizing

$$\frac{1}{2}\|\beta\|^2 \quad \text{subject to } y_i(\beta^\top x_i + \beta_0) \ge 1 \text{ for all } i = 1, \dots, n. \tag{2}$$
Once the minimizer $(\hat\beta_0, \hat\beta)$ is obtained, the induced SVM classifier is given as $\phi_{SVM}(x) = \mathrm{sgn}(\hat\beta^\top x + \hat\beta_0)$. The solid line in Fig. 11.1 indicates the boundary of the linear SVM classifier with maximal margin for the toy example, and the dotted lines are the 1 and −1 level sets of the discriminant function. Although the formulation of the SVM discussed so far pertains to a linearly separable case only, which may be fairly restrictive, it serves as a prototype for its extension to more general cases with possible overlap between classes and a nonlinear boundary. The extension to the inseparable case follows in the next section. Two distinctive aspects of the formulation of the SVM are noted here. First, by targeting the classification boundary, it takes direct aim at prediction of labels given attributes, bypassing modeling or estimation of the probabilistic mechanism that generates the labels. In a decision-theoretic view, the SVM is categorized as a procedure that directly minimizes the error rate under the 0–1 loss. Differently from the data-modeling strategy in statistics, this approach of risk minimization is commonly employed in machine learning for supervised-learning problems, as encapsulated in the empirical risk minimization principle. Yet, as a result of error rate minimization, the SVM classifier is inevitably limited for inference on the underlying probability distribution, which is discussed later in detail. Second, although the margin may well be justified as a simple geometric notion to determine a unique separating hyperplane in the separable case, the rationale for a large margin is, in fact, deeply rooted in Vapnik’s statistical learning theory (4). The theory shows that a notion of the capacity of a family of linear classifiers is inversely related to the margin size, and large margin classifiers can be expected to give lower test error rates.
Clearly, maximizing the margin is a form of regularization akin to penalization of regression coefficients in ridge regression (13) in order to control model complexity for stability and accuracy in estimation.
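As a concrete illustration, the constrained problem [2] can be handed to a general-purpose solver. The sketch below (assuming NumPy and SciPy are available; the four toy points are ours, not from the chapter) minimizes $\|\beta\|^2/2$ subject to the separability constraints and recovers the margin $2/\|\beta\|$:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable toy data; the optimal hyperplane is x1 = 0.
X = np.array([[1.0, 0.0], [2.0, 1.0], [-1.0, 0.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(w):                    # w = (beta_1, beta_2, beta_0)
    return 0.5 * w[:2] @ w[:2]       # ||beta||^2 / 2

# One separability constraint y_i (beta' x_i + beta_0) - 1 >= 0 per training point.
cons = [{"type": "ineq", "fun": lambda w, i=i: y[i] * (X[i] @ w[:2] + w[2]) - 1.0}
        for i in range(len(y))]

res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]), method="SLSQP", constraints=cons)
beta, beta0 = res.x[:2], res.x[2]
print("beta:", np.round(beta, 3), "beta0:", round(float(beta0), 3),
      "margin:", round(2 / float(np.linalg.norm(beta)), 3))
```

For this symmetric toy set the solver recovers the hyperplane $x_1 = 0$ with margin 2; any quadratic-programming routine would serve equally well, since [2] is a convex QP with linear constraints.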
2.2. Case with Overlapping Classes
When the training data are not linearly separable, the separability condition [1] cannot be met. To relax the condition, a set of non-negative variables $\xi_i$ is introduced for the data points such that

$$\xi_i + y_i(\beta^\top x_i + \beta_0) \ge 1, \quad \xi_i \ge 0 \quad \text{for } i = 1, \dots, n. \tag{3}$$
These $\xi_i$’s are often called slack variables in the optimization literature, as they loosen the rigid constraints. However, if they are too large, many data points could be incorrectly classified. If the ith data point is misclassified by the hyperplane $\beta^\top x + \beta_0 = 0$, that is, $y_i(\beta^\top x_i + \beta_0) \le 0$, then $\xi_i \ge 1$. So, $\sum_{i=1}^n \xi_i$ provides an upper bound of the misclassification error of the classifier $\phi(x) = \mathrm{sgn}(\beta^\top x + \beta_0)$. To minimize the error bound and at the same time maximize the margin, the SVM formulation for the separable case is modified to seek $\beta_0$, $\beta$, and $\xi := (\xi_1, \dots, \xi_n)$ that minimize

$$\frac{1}{n}\sum_{i=1}^n \xi_i + \frac{\lambda}{2}\|\beta\|^2 \tag{4}$$

subject to [3]. Here λ is a positive tuning parameter that controls the trade-off between the error bound and the margin. By noting that, for a given constant a, $(\min \xi_i \text{ subject to } \xi_i \ge 0 \text{ and } \xi_i \ge a) = \max\{a, 0\} =: a_+$, it can be shown that the above modification is equivalent to finding $\beta_0$ and $\beta$ that minimize

$$\frac{1}{n}\sum_{i=1}^n \left(1 - y_i(\beta^\top x_i + \beta_0)\right)_+ + \frac{\lambda}{2}\|\beta\|^2. \tag{5}$$
This equivalent form brings a new loss function, known as the hinge loss, for measuring the ‘goodness of fit’ of a real-valued discriminant function. For a discriminant function $f(x) = \beta^\top x + \beta_0$, consider the loss criterion $L(f(x_i), y_i) = (1 - y_i f(x_i))_+ = (1 - y_i(\beta^\top x_i + \beta_0))_+$. The quantity $y_i(\beta^\top x_i + \beta_0)$ is called the functional margin of the individual point $(x_i, y_i)$, as distinguished from the geometric class margin in the separable case. The functional margin of $(x, y)$ is the product of $\|\beta\|$ and a signed distance from x to the hyperplane $\beta^\top x + \beta_0 = 0$: if $y(\beta^\top x + \beta_0) > 0$, then $yf(x) = |\beta^\top x + \beta_0| = \|\beta\| \times$ (distance of x from the hyperplane), and otherwise $yf(x) = -|\beta^\top x + \beta_0| = -\|\beta\| \times$ (distance of x from the hyperplane).
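The hinge loss and its relation to the 0–1 loss are easy to verify numerically; a small sketch (NumPy assumed):

```python
import numpy as np

def hinge(t):
    """Hinge loss (1 - t)_+ as a function of the functional margin t = y f(x)."""
    return np.maximum(1.0 - np.asarray(t), 0.0)

def zero_one(t):
    """0-1 loss I(t <= 0) in terms of the functional margin."""
    return (np.asarray(t) <= 0).astype(float)

t = np.linspace(-2, 2, 401)               # grid of functional margins
assert np.all(hinge(t) >= zero_one(t))    # hinge is an upper bound of the 0-1 loss

# Convexity along the grid: the chord midpoint lies above the function value.
assert np.all(0.5 * (hinge(t[:-1]) + hinge(t[1:]))
              >= hinge(0.5 * (t[:-1] + t[1:])) - 1e-12)

# A correctly classified point inside the margin (0 < yf(x) < 1) is still penalized.
print("hinge(0.5) =", float(hinge(0.5)))  # -> 0.5
```

The last line illustrates a point taken up in the figure that follows: unlike the 0–1 loss, the hinge loss keeps charging a correctly classified point until its functional margin reaches 1.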
[Figure 11.2 near here: the losses are plotted against $t = yf(x)$ over $[-2, 2]$.]
Fig. 11.2. The solid line is the 0–1 loss and the dashed line is the hinge loss $(1 - t)_+$ in terms of the functional margin $t = yf(x)$.
Fig. 11.2 shows the hinge loss together with the 0–1 loss (misclassification loss) in terms of the functional margin. For a discriminant function that induces a classifier through $\mathrm{sgn}(f(x))$, the misclassification loss is given by $L_{0\text{-}1}(f(x), y) := I(y \ne \mathrm{sgn}(f(x))) = I(yf(x) \le 0)$. Clearly, the hinge loss is a convex upper bound of the 0–1 loss and is monotonically decreasing in the functional margin $yf(x) = y(\beta^\top x + \beta_0)$. The convexity of the hinge loss makes the SVM computationally more attractive than direct minimization of the empirical error rate. In the case with overlapping classes, the geometric interpretation of $2/\|\beta\|$ as the separation margin between the two classes no longer holds, although $2/\|\beta\|$ may still be viewed as a ‘soft’ margin analogous to the ‘hard’ margin in the separable case. Rather, $\|\beta\|^2$ in [5] can be immediately regarded as a penalty imposed on the linear discriminant function f. From this perspective, the SVM procedure can be cast in the regularization framework, where a function estimation method is formulated as an optimization problem of finding f in a class of candidate functions $\mathcal{F}$ that minimizes

$$\frac{1}{n}\sum_{i=1}^n L(f(x_i), y_i) + \lambda J(f).$$
Here $L(f(x), y)$ is a loss function, $J(f)$ is a regularizer (a penalty imposed on f), and $\lambda > 0$ is a tuning parameter which controls the trade-off between data fit and the complexity of f. There are numerous examples of regularization procedures in statistics. For instance, consider multiple linear regression with $\mathcal{F} = \{f(x) = \beta^\top x + \beta_0 : \beta \in \mathbb{R}^p, \beta_0 \in \mathbb{R}\}$ and the squared error loss $(y - f(x))^2$ for L. $J(f) = \|\beta\|^2$ defines the ridge regression procedure in (13), while the least absolute shrinkage and selection operator (LASSO) in (14) takes $J(f) = \sum_{j=1}^p |\beta_j|$ as a penalty for a sparse linear model. In light of these, the SVM can be viewed as a procedure for penalized risk minimization with the hinge loss and a ridge-like penalty.

2.3. Computation: Constrained Optimization
To describe the operational properties of the SVM solution, the derivation of the dual optimization problem for the optimal hyperplane is sketched in this section. A thorough explanation of the underlying theory is omitted for ease of discussion; details and the relevant optimization theory can be found in (7, 10). To solve [5] through the equivalent problem in [4], we need to handle the inequality constraints in [3]. Using the standard machinery of primal–dual formulations in constrained optimization theory (15), two sets of Lagrange multipliers or dual variables are introduced for the constraints: $\alpha_i$ and $\gamma_i$ ($i = 1, \dots, n$) for $\xi_i \ge 1 - y_i(\beta^\top x_i + \beta_0)$ and $\xi_i \ge 0$, respectively. Define $h_i(\beta, \beta_0, \xi) = 1 - y_i(\beta^\top x_i + \beta_0) - \xi_i$ and $h_{n+i}(\beta, \beta_0, \xi) = -\xi_i$ so that the constraints are of the form $h_i(\beta, \beta_0, \xi) \le 0$ for $i = 1, \dots, 2n$. Then the Lagrangian primal function is given by

$$l_P(\beta, \beta_0, \xi, \alpha, \gamma) = \sum_{i=1}^n \xi_i + \frac{n\lambda}{2}\|\beta\|^2 + \sum_{i=1}^n \alpha_i\left(1 - y_i(\beta^\top x_i + \beta_0) - \xi_i\right) - \sum_{i=1}^n \gamma_i \xi_i,$$

with the following conditions:

$$\frac{\partial l_P}{\partial \beta} = n\lambda\beta - \sum_{i=1}^n \alpha_i y_i x_i = 0 \;\Leftrightarrow\; \beta = \frac{1}{n\lambda}\sum_{i=1}^n \alpha_i y_i x_i,$$

$$\frac{\partial l_P}{\partial \beta_0} = -\sum_{i=1}^n \alpha_i y_i = 0 \;\Leftrightarrow\; \sum_{i=1}^n \alpha_i y_i = 0,$$

$$\frac{\partial l_P}{\partial \xi_i} = 1 - \alpha_i - \gamma_i = 0 \;\Leftrightarrow\; \gamma_i = 1 - \alpha_i,$$

$$\alpha_i \ge 0 \text{ and } \gamma_i \ge 0 \text{ for } i = 1, \dots, n.$$
Simplifying $l_P$ by using the constraints, we have the dual problem of maximizing

$$\sum_{i=1}^n \alpha_i - \frac{1}{2n\lambda}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j \quad \text{with respect to } \alpha := (\alpha_1, \dots, \alpha_n) \tag{6}$$

subject to $0 \le \alpha_i \le 1$ for $i = 1, \dots, n$ and $\sum_{i=1}^n \alpha_i y_i = 0$. Note that the dual problem is a quadratic programming (QP) problem with the non-negative-definite matrix $[y_i y_j x_i^\top x_j]$. The dual problem [6] in itself reveals a few notable characteristics of the SVM. First, it involves n dual variables, so the sample size, and not the number of predictors, could be the main factor that determines the size of the problem. This implies a great computational advantage when n is relatively small while p is very large; for example, typical microarray data have such a ‘large p, small n’ structure. Second, the solution $\hat\alpha$ depends on the attributes in the training data only through their pairwise inner products $x_i^\top x_j$. This observation proves to be particularly useful for nonlinear extension of the linear SVM. Third, once $\hat\alpha$ satisfying the bound condition ($0 \le \hat\alpha_i \le 1$) and the equilibrium condition ($\sum_{i=1}^n \hat\alpha_i y_i = 0$) is found, the normal vector of the optimal hyperplane is determined by $\hat\beta = \frac{1}{n\lambda}\sum_{i=1}^n \hat\alpha_i y_i x_i$. As part of the necessary and sufficient conditions for the optimality of the solution, known as the Karush–Kuhn–Tucker (KKT) conditions, the following complementarity conditions have to be met: for $i = 1, \dots, n$,

$$\alpha_i h_i(\beta, \beta_0, \xi) = \alpha_i\{1 - y_i(\beta^\top x_i + \beta_0) - \xi_i\} = 0 \quad \text{and} \quad \gamma_i h_{n+i}(\beta, \beta_0, \xi) = -\gamma_i \xi_i = -(1 - \alpha_i)\xi_i = 0.$$

Since for any $0 < \alpha_{i^*} < 1$ the conditions give $\xi_{i^*} = 0$ and $1 - y_{i^*}(\beta^\top x_{i^*} + \beta_0) - \xi_{i^*} = 0$, we have $1 - y_{i^*}(\beta^\top x_{i^*} + \beta_0) = 0$. This gives an equation for $\hat\beta_0$ once $\hat\beta$ is determined:

$$\hat\beta_0 = y_{i^*} - \hat\beta^\top x_{i^*} = y_{i^*} - \frac{1}{n\lambda}\sum_{i=1}^n \hat\alpha_i y_i x_i^\top x_{i^*}.$$
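To make the derivation concrete, the dual [6] can be solved numerically on a small toy set, with $\hat\beta$ recovered from the stationarity condition and $\hat\beta_0$ from the complementarity conditions. A sketch assuming NumPy/SciPy (a general-purpose solver, not a production QP routine; the data are ours):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy data; the optimal hyperplane is x1 = 0.
X = np.array([[1.0, 0.0], [2.0, 1.0], [-1.0, 0.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, lam = len(y), 0.01                          # small lambda: nearly hard margin

Q = (y[:, None] * X) @ (y[:, None] * X).T      # the matrix [y_i y_j x_i' x_j]

def neg_dual(a):                               # maximizing [6] = minimizing its negative
    return -(a.sum() - a @ Q @ a / (2 * n * lam))

res = minimize(neg_dual, x0=np.full(n, 0.5), method="SLSQP",
               bounds=[(0.0, 1.0)] * n,                        # 0 <= alpha_i <= 1
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum alpha_i y_i = 0
alpha = res.x
beta = (alpha * y) @ X / (n * lam)             # stationarity: beta = (1/(n lambda)) sum alpha_i y_i x_i
support = np.where(alpha > 1e-3)[0]            # support vectors: alpha_i > 0
i_star = support[0]                            # any 0 < alpha_i < 1 serves for beta_0
beta0 = y[i_star] - X[i_star] @ beta           # from the complementarity conditions
print("support vectors:", support, "beta:", np.round(beta, 3))
```

On these four points only the two nearest to the boundary carry positive $\hat\alpha_i$, illustrating the sparsity discussed below.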
Also, by the complementarity conditions, the data points can be categorized into two kinds: those with a positive Lagrange multiplier ($\hat\alpha_i > 0$) and those with $\hat\alpha_i = 0$. If a data point falls outside the margin, $y_i(\hat\beta^\top x_i + \hat\beta_0) > 1$, then the corresponding Lagrange multiplier must be $\hat\alpha_i = 0$, and thus the point plays no role in determining $\hat\beta$. On the other hand, the data points with $\hat\alpha_i > 0$ expand $\hat\beta$, and the attribute vectors of such data points are called the support vectors. The proportion of support vectors depends on λ, but typically, for a range of values of λ, only a fraction of the data points are support vectors. In this sense, the SVM
solution admits a sparse expression in terms of the data points. This sparsity is due to the singularity of the hinge loss at 1; a simple analogy can be made to median regression with the absolute deviation loss, which has a singular point at 0. For classification of a new point x, the following linear discriminant function is used:

$$\hat f_\lambda(x) = \hat\beta^\top x + \hat\beta_0 = \frac{1}{n\lambda}\sum_{i:\hat\alpha_i > 0} \hat\alpha_i y_i x_i^\top x + \hat\beta_0. \tag{7}$$

Note that the final form of $\hat f_\lambda$ does not depend on the dimensionality of x explicitly but depends on the inner products of $x_i$ and x, as in the dual problem. This fact enables construction of hyperplanes even in infinite-dimensional Hilbert spaces (p. 406, (4)). In addition, [7] shows that all the information necessary for discrimination is contained in the support vectors. As a consequence, it affords efficient data reduction and fast evaluation at the testing phase.

2.4. Nonlinear Generalization
In general, hyperplanes in the input space may not be sufficiently flexible to attain the smallest possible error rate for a given problem. As noted earlier, the linear SVM solution and the prediction of a new case x depend on the $x_i$’s only through the inner products $x_i^\top x_j$ and $x_i^\top x$. This fact leads to a straightforward generalization of the linear SVM to the nonlinear case by taking a basis expansion. The main idea of the nonlinear extension is to map the data in the original input space to a feature space and to find the hyperplane with a large margin in the feature space. For an enlarged feature space, consider transformations of x, say, $\phi_m(x)$, $m = 1, \dots, M$. Let $\Phi(x) := (\phi_1(x), \dots, \phi_M(x))$ be the so-called feature mapping from $\mathbb{R}^p$ to a higher dimensional feature space, which can even be infinite dimensional. Then by replacing the inner product $x_i^\top x_j$ with $\Phi(x_i)^\top \Phi(x_j)$, the formulation of the linear SVM can be easily extended. For instance, suppose the input space is $\mathbb{R}^2$ and $x = (x_1, x_2)^\top$. Define $\Phi: \mathbb{R}^2 \to \mathbb{R}^3$ as $\Phi(x) = (x_1^2, x_2^2, \sqrt{2}x_1x_2)^\top$. Then the mapping gives a new dot product in the feature space:

$$\Phi(x)^\top \Phi(t) = (x_1^2, x_2^2, \sqrt{2}x_1x_2)(t_1^2, t_2^2, \sqrt{2}t_1t_2)^\top = (x_1t_1 + x_2t_2)^2 = (x^\top t)^2.$$

In fact, for this generalization to work, the feature mapping does not need to be explicit; specification of the bivariate function $K(x, t) := \Phi(x)^\top \Phi(t)$ suffices. With K, the nonlinear discriminant function is then given as $\hat f_\lambda(x) = \frac{1}{n\lambda}\sum_{i=1}^n \hat\alpha_i y_i K(x_i, x) + \hat\beta_0$, which is in the span of $K(x_i, \cdot)$, $i = 1, \dots, n$. So, the shape
of the classification boundary $\{x \in \mathbb{R}^p : \hat f_\lambda(x) = 0\}$ is determined by K. From the property of the dot product, it is clear that such a bivariate function is non-negative definite. Replacing the Euclidean inner product in a linear method with a non-negative-definite bivariate function $K(x, t)$, known as a kernel function, to obtain a nonlinear generalization is often referred to as the ‘kernel trick’ in machine learning. The only condition for a kernel to be valid is that it is a symmetric non-negative (positive semi-)definite function: for every $N \in \mathbb{N}$, $a_i \in \mathbb{R}$, and $z_i \in \mathbb{R}^p$ ($i = 1, \dots, N$), $\sum_{i,j}^N a_i a_j K(z_i, z_j) \ge 0$; in other words, $K_N := [K(z_i, z_j)]$ is a non-negative-definite matrix. Some kernels in common use are the polynomial kernels of degree d, $K(x, t) = (1 + x^\top t)^d$ or $(x^\top t)^d$ for some positive integer d, and the radial basis (or Gaussian) kernel, $K(x, t) = \exp(-\|x - t\|^2 / 2\sigma^2)$ for $\sigma > 0$. It turns out that this generalization of the linear SVM is closely linked to the function estimation procedure known as the reproducing kernel Hilbert space (RKHS) method in statistics (16, 17), and the theory behind RKHS methods, or kernel methods in short, provides a unified view of smoothing splines (a classical example of the RKHS method for nonparametric regression) and the kernelized SVM. The connection allows a more abstract treatment of the SVM, offering a different perspective on the methodology, in particular the nonlinear extension.

2.5. Kernel Methods
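Before developing this framework, the kernel computations of the previous section are easy to check numerically. The sketch below (NumPy assumed) verifies that the explicit feature map reproduces the second-degree polynomial kernel and that a Gaussian kernel matrix is indeed non-negative definite:

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, t) = (x't)^2 on R^2."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

rng = np.random.default_rng(0)
x, t = rng.standard_normal(2), rng.standard_normal(2)
assert np.isclose(phi(x) @ phi(t), (x @ t) ** 2)   # feature map reproduces the kernel

# Non-negative definiteness of the Gaussian (radial basis) kernel matrix.
Z = rng.standard_normal((10, 2))                   # 10 arbitrary points
sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * 1.0 ** 2))                   # sigma = 1
eigs = np.linalg.eigvalsh(K)
assert eigs.min() > -1e-10                         # all eigenvalues (numerically) >= 0
print("min eigenvalue:", eigs.min())
```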
Kernel methods can be viewed as a method of regularization in a function space characterized by a kernel. A brief description of the general framework for the regularization method is given here for advanced readers, in order to elucidate the connection, to show how seamlessly the SVM sits in the framework, and to broaden the scope of its applicability to a wide range of problems. Consider a Hilbert space (complete inner product space) $\mathcal{H}$ of real-valued functions defined on a domain $\mathcal{X}$ (not necessarily $\mathbb{R}^p$), with an inner product $\langle f, g\rangle_{\mathcal{H}}$ for $f, g \in \mathcal{H}$. A Hilbert space is an RKHS if there is a kernel function (called the reproducing kernel) $K(\cdot, \cdot): \mathcal{X}^2 \to \mathbb{R}$ such that (i) $K(x, \cdot) \in \mathcal{H}$ for every $x \in \mathcal{X}$, and (ii) $\langle K(x, \cdot), f(\cdot)\rangle_{\mathcal{H}} = f(x)$ for every $f \in \mathcal{H}$ and $x \in \mathcal{X}$. The second condition is called the reproducing property, for the obvious reason that K reproduces every f in $\mathcal{H}$. Let $K_x(t) := K(x, t)$ for fixed x. Then the reproducing property gives the useful identity $K(x, t) = \langle K_x(\cdot), K_t(\cdot)\rangle_{\mathcal{H}}$; consequently, reproducing kernels are non-negative definite. For a comprehensive treatment of the RKHS, see (18). Conversely, by the Moore–Aronszajn theorem, for every non-negative-definite function $K(x, t)$ on $\mathcal{X}$, there corresponds a unique RKHS $\mathcal{H}_K$ that
Lee
has K(x, t) as its reproducing kernel. So, non-negative definiteness is the defining property of kernels. Now, consider a regularization method in the RKHS HK with reproducing kernel K:

$$\min_{f \in \{1\} \oplus \mathcal{H}_K}\ \frac{1}{n}\sum_{i=1}^{n} L(f(x_i), y_i) + \lambda \|h\|_{\mathcal{H}_K}^2, \qquad [8]$$
where f(x) = β0 + h(x) with h ∈ HK, and the penalty J(f) is given by ‖h‖²_HK. In general, the null space can be extended to a larger linear space than {1}. As an example, take X = Rp and HK = {h(x) = β⊤x | β ∈ Rp} with K(x, t) = x⊤t. For h1(x) = β1⊤x and h2(x) = β2⊤x ∈ HK, ⟨h1, h2⟩_HK = β1⊤β2. Then for h(x) = β⊤x, ‖h‖²_HK = ‖β‖². Taking f(x) = β0 + β⊤x and the hinge loss L(f(x), y) = (1 − yf(x))+ gives the linear SVM as a regularization method in HK. So, encompassing the linear SVM as a special case, the SVM can be cast as a regularization method in an RKHS HK, which finds f(x) = β0 + h(x) ∈ {1} ⊕ HK minimizing

$$\frac{1}{n}\sum_{i=1}^{n} \bigl(1 - y_i f(x_i)\bigr)_+ + \lambda \|h\|_{\mathcal{H}_K}^2. \qquad [9]$$
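To make this objective concrete, here is a minimal numerical sketch (not the author's implementation) that minimizes the regularized hinge risk by subgradient descent, parameterizing h through kernel values at the data points; the Gaussian kernel, toy data, and step sizes are illustrative choices:

```python
import numpy as np

def rbf(X1, X2, sigma=1.0):
    # Gaussian kernel matrix: K[i, j] = exp(-||x1_i - x2_j||^2 / (2 sigma^2))
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_svm_fit(X, y, lam=0.1, sigma=1.0, steps=2000, lr=0.05):
    """Minimize (1/n) sum_i (1 - y_i f(x_i))_+ + lam * c'Kc over (b, c),
    where f(x) = b + sum_j c_j K(x_j, x), by subgradient descent."""
    n = len(y)
    K = rbf(X, X, sigma)
    c, b = np.zeros(n), 0.0
    for _ in range(steps):
        active = (y * (b + K @ c) < 1).astype(float)  # points with positive hinge
        c -= lr * (-(K @ (active * y)) / n + 2.0 * lam * (K @ c))
        b -= lr * (-(active * y).mean())
    return b, c, K

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 0.4, (30, 2)), rng.normal(1, 0.4, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)
b, c, K = kernel_svm_fit(X, y)
acc = (np.sign(b + K @ c) == y).mean()
print(acc)   # typically 1.0 on this well-separated toy problem
```

The finite expansion f(x) = b + Σ c_j K(x_j, x) used here is justified by the representer theorem discussed next.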
The representer theorem in (19) says that the minimizer of [8] has a representation of the form

$$\hat f_\lambda(x) = b + \sum_{i=1}^{n} c_i K(x_i, x), \qquad [10]$$
where b and the ci ∈ R, i = 1, . . . , n. As previously mentioned, the kernel trick leads to the expression of the SVM solution:

$$\hat f_\lambda(x) = \hat\beta_0 + \sum_{i=1}^{n} \hat\alpha_i y_i K(x_i, x).$$
It agrees with what the representer theorem generally implies for the SVM formulation. Finally, the solution in [10] can be determined by minimizing

$$\frac{1}{n}\sum_{i=1}^{n} \Bigl\{1 - y_i \Bigl(b + \sum_{j=1}^{n} c_j K(x_j, x_i)\Bigr)\Bigr\}_+ + \lambda \sum_{i,j=1}^{n} c_i c_j K(x_i, x_j)$$

over b and the ci's. For further discussion of this perspective, see (17). Notably, the abstract formulation of kernel methods places no restriction on input domains or on the form of kernel functions. Because of that, the SVM in combination with a variety of kernels
Support Vector Machines for Classification
is modular and flexible. For instance, kernels can be defined on non-numerical domains such as strings of DNA bases, text, and graphs, expanding the realm of applications well beyond Euclidean vector spaces. Many applications of the SVM in computational biology capitalize on this versatility of the kernel method; see (20) for examples.

2.6. Statistical Properties
Contrasting the SVM with more traditional approaches to classification, we discuss statistical properties of the SVM and their implications. Theoretically, the 0–1 loss criterion defines as optimal the rule that minimizes the error rate over the population. With the symmetric labeling of ±1 and conditional probability η(x) := P(Y = 1|X = x), the optimal rule, namely the Bayes decision rule, is given by φB(x) := sgn(η(x) − 1/2), predicting the label of the most likely class. In the absence of knowledge of η(x), there are two different approaches for building a classification rule that emulates the Bayes classifier. One is to construct a probability model for the data first and then use the estimate of η(x) from the model for classification. This yields model-based plug-in rules such as logistic regression, LDA, QDA, and other density-based classification methods. The other is to aim at direct minimization of the error rate without estimating η(x) explicitly. Large margin classifiers with a convex surrogate of the 0–1 loss fall into the second type, and the SVM with the hinge loss is a typical example. The discrepancy between the 0–1 loss and the surrogate loss actually used for training a classifier in the latter approach has generated an array of theoretical questions regarding the conditions necessary for the surrogate loss to guarantee the Bayes risk consistency of the resulting rules. References (21–23) delve into these issues and provide proper conditions for a convex surrogate loss. It turns out that only minimal conditions are necessary in the binary case to ensure risk consistency, while much care has to be taken in the multiclass case (24–26). In particular, it is shown that the hinge loss is class-calibrated, meaning that it satisfies a weak notion of consistency known as Fisher consistency. Furthermore, the Bayes risk consistency of the SVM has been established under the assumption that the space generated by a kernel is sufficiently rich (21, 27).
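The pointwise population risks make this calibration tangible: at a point with η = P(Y = 1|X = x), minimizing the expected hinge loss over f recovers sgn(η − 1/2), while minimizing the expected negative log-likelihood recovers the logit. A sketch by grid search (the value η = 0.8 is arbitrary):

```python
import numpy as np

# pointwise population risks at a point with eta = P(Y = 1 | X = x)
def hinge_risk(f, eta):
    return eta * np.maximum(1 - f, 0) + (1 - eta) * np.maximum(1 + f, 0)

def logistic_risk(f, eta):
    # negative log-likelihood with f playing the role of the logit
    return eta * np.log1p(np.exp(-f)) + (1 - eta) * np.log1p(np.exp(f))

fs = np.linspace(-4, 4, 8001)
eta = 0.8
f_hinge = fs[np.argmin(hinge_risk(fs, eta))]
f_logit = fs[np.argmin(logistic_risk(fs, eta))]
print(round(f_hinge, 3), round(f_logit, 3))  # sgn(0.8 - 1/2) = 1; log(0.8/0.2) ≈ 1.386
```

The hinge minimizer sits at the sign of η − 1/2, the logistic minimizer at log{η/(1 − η)}, previewing the contrast discussed next.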
A simple way to see the effect of each loss criterion on the resulting rule is to look at its population analog and identify the limiting discriminant function, which is defined as the population risk minimizer among measurable functions f. Interestingly, for the hinge loss, the population minimizer of $E(1 - Yf(X))_+$ is $f^*_{SVM}(x) := \mathrm{sgn}(\eta(x) - 1/2)$, the Bayes classifier itself, while that of the negative log-likelihood loss for logistic regression is $f^*_{LR}(x) := \log\{\eta(x)/(1 - \eta(x))\}$, the true logit, for comparison. This difference is illustrated in Fig. 11.3. For 300 equally spaced
Fig. 11.3. Comparison of the SVM and logistic regression. The solid line is the true function, 2η(x) − 1, the dotted line is 2ηˆ LR (x) − 1 from penalized logistic regression, and the dashed line is fˆSVM (x) of the SVM.
xi in (−2, 2), yi's were generated with the probability of class 1 equal to η(x) in Fig. 11.3 (the solid line is 2η(x) − 1). The dotted line is the estimate of 2η(x) − 1 by penalized logistic regression, and the dashed line is the SVM. The radial basis kernel was used for both methods. Note that the logistic regression estimate is very close to the true curve 2η(x) − 1, while the SVM estimate is close to sgn(η(x) − 1/2). Nonetheless, the resulting classifiers are almost identical. If prediction is of primary concern, then the SVM can be an effective choice. However, there are many applications where accurate estimation of the conditional probability η(x) is required for making better decisions than just prediction of a dichotomous outcome. In those cases, the SVM offers very limited information, as there is no principled way to recover the probability from the SVM output in general. This remark pertains only to the SVM with a flexible kernel, however, since it rests on the property that the asymptotic discriminant function is sgn(η(x) − 1/2). The SVM with simple kernels, the linear SVM for one, needs to be analyzed separately. A recent study (28) shows that under the normality and equal variance assumption on the distribution of attributes for the two classes, the linear SVM coincides with the LDA in the limit. Technically, the analysis exploits a close link between the SVM and median regression, albeit with categorical responses. At least in this case, the
probability information would not be masked and can be recovered from the linear discriminant function with additional computation. However, it is generally advised that the SVM be treated as a tool for prediction, not for modeling the probabilistic mechanism underlying the data.
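The linear SVM/LDA connection noted in (28) can be checked numerically on simulated Gaussian data with equal covariance: the fitted SVM weight vector should point in roughly the same direction as the LDA direction Σ⁻¹(μ₁ − μ₀). A sketch (subgradient descent stands in for an exact SVM solver; the simulation settings are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200
X = np.vstack([rng.normal(0, 1, (n, 2)), rng.normal(2, 1, (n, 2))])
y = np.array([-1] * n + [1] * n)

# LDA direction: pooled-covariance inverse times the mean difference
mu0, mu1 = X[y == -1].mean(0), X[y == 1].mean(0)
Xc = np.vstack([X[y == -1] - mu0, X[y == 1] - mu1])
w_lda = np.linalg.solve(Xc.T @ Xc / (2 * n - 2), mu1 - mu0)

# linear SVM fit by subgradient descent on the regularized hinge risk
w, b, lam, lr = np.zeros(2), 0.0, 0.1, 0.05
for _ in range(4000):
    act = (y * (X @ w + b) < 1).astype(float)
    w -= lr * (-(X.T @ (act * y)) / (2 * n) + 2 * lam * w)
    b -= lr * (-(act * y).mean())

cos = w @ w_lda / (np.linalg.norm(w) * np.linalg.norm(w_lda))
print(round(cos, 2))   # close to 1: the two directions nearly coincide
```

The high cosine similarity is the finite-sample shadow of the asymptotic coincidence shown in (28).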
3. Data Example

Taking the breast cancer data in (29) as an example, we illustrate the method and discuss various aspects of its application and some practical issues. The data consist of expression levels of 24,481 genes collected from patients with primary breast tumors who were lymph node negative at the time of diagnosis. The main goal of the study was to find a gene expression signature prognostic of distant metastases within 5 years, which could be used to select patients who would benefit from adjuvant therapy such as chemotherapy or hormone therapy. Out of 78 patients in the training data, 34 developed metastasis within 5 years (labeled as poor prognosis) and 44 remained metastasis free for at least 5 years (labeled as good prognosis). Following a preprocessing step similar to that in the paper, we filtered the genes that exhibited at least a twofold change in expression from the pooled reference sample with a p-value < 0.01 in five or more tumors in the training data and discarded two additional genes with missing values, yielding 4,008 genes. Sample 54, with more than 20% missing values, was removed before filtering. First, we applied the linear SVM to the training data (77 observations), varying the number of genes from large to relatively small (d = 4,008, 70, and 20) to see the effect of the input dimension on error rates and the number of support vectors. Whenever a subset of genes was used, we included in classification the genes top-ranked by the p-value of a t-test statistic for marginal association with the prognostic outcomes. The value 70 is the number of genes selected for the prediction algorithm in the original paper, although that selection procedure was not based on the p-values. λ affects classification accuracy and the number of support vectors as well. To elucidate the effect of λ, we obtained all the possible solutions indexed by the tuning parameter λ for each fixed set of genes (using the R package svmpath).
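The marginal t-test ranking used to pick the top genes can be sketched as follows (synthetic data with a handful of truly informative "genes"; Welch-type statistics stand in for whatever exact test the study used):

```python
import numpy as np

def top_genes_by_ttest(X, y, d):
    """Rank genes (columns of X) by a two-sample t-statistic for marginal
    association with the binary outcome y, and return the top-d columns.
    (Sorting by |t| mirrors sorting by the p-value of the test.)"""
    X0, X1 = X[y == 0], X[y == 1]
    se = np.sqrt(X0.var(axis=0, ddof=1) / len(X0) + X1.var(axis=0, ddof=1) / len(X1))
    t = np.abs(X1.mean(axis=0) - X0.mean(axis=0)) / se
    return np.argsort(-t)[:d]

rng = np.random.default_rng(2)
X = rng.normal(size=(77, 500))          # 77 patients, 500 synthetic "genes"
y = rng.integers(0, 2, size=77)         # synthetic binary prognosis labels
X[y == 1, :5] += 2.0                    # make the first 5 genes informative
top = top_genes_by_ttest(X, y, 20)
print(sorted(set(top.tolist()) & set(range(5))))  # the informative genes rank highly
```

With d = 20 or 70, the classifier then sees only these top-ranked columns.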
Figure 11.4 shows the error rate curves as a function of λ. The dotted lines are the apparent error rates of the linear SVM over the training data set itself and the solid lines are the test error rates evaluated over the 19 test patients, where 7 remained
Fig. 11.4. Error rate curves of the linear SVMs with three input dimensions (left: 4,008, center: 70, and right: 20). The dotted lines are the apparent error rates over 77 training patients and the solid lines are the test error rates over 19 test patients.
metastasis free for at least 5 years and 12 developed metastasis within 5 years. Clearly, when all 4,008 genes are included in the classifier, the training error rate can be driven to zero as λ decreases to zero, that is, as classifiers get less regularized. On the other hand, the corresponding test error rates in the same panel for small values of λ are considerably higher than the training error rates, exemplifying the well-known phenomenon of overfitting. Hence, to attain the best test error rate, the training error rate and the complexity of a classifier need to be properly balanced. However, for smaller input dimensions, the relationship between the apparent error rates and test error rates is quite different. In particular, when only 20 genes are used, in other words, when the feature space is small, regularization provides little benefit in minimizing the test error rate, and the two error rate curves are roughly parallel to each other. In contrast, for d = 70 and 4,008, penalization ('maximizing the margin') does help in reducing the test error rate. The overall minimum error rate of around 20% was achieved when d = 70. Like the error rates, the number of support vectors also depends on the tuning parameter, the degree of overlap between the two classes, and the input dimensionality, among other factors. Figure 11.5 depicts how it varies as a function of λ for the three cases of high to relatively low dimension. When d = 4,008, the number of support vectors is approximately constant, and all but a few observations are support vectors. A likely reason is that the dimension is so high compared to the sample size that nearly every observation is close to the classification boundary. However, for the lower dimensions, as λ approaches zero, a smaller fraction of observations turn out to be support vectors. Generally, changing the kernel from linear to nonlinear leads to a reduction in the overall training error rate, and it often
Fig. 11.5. Relationship between the input dimension (left: 4,008, center: 70, and right: 20) and the number of support vectors for the linear SVM.
translates into a lower test error rate. As an example, we obtained the training and test error rate curves for the Gaussian kernel, K(x, t) = exp(−‖x − t‖²/2σ²), with the 70 genes, as shown in Fig. 11.6. The bandwidth σ, which is another tuning parameter, was set to the median of the pairwise distances between the two classes in the left panel, half of that in the center panel, and nearly a third of the median in the right panel. Figure 11.6 illustrates that with a smaller bandwidth, the training error rates can be made very small over a range of λ. Moreover, for σ = 1.69 and 1.20, if λ is properly chosen, then fewer mistakes are made in prediction for the test cases by the nonlinear SVM than by the linear SVM. As emphasized before, the SVM output values generally cannot be mapped to class-conditional probabilities in a theoretically justifiable way, perhaps with the only exception of the linear SVM in a limited situation. For a comparison of logistic regression and the SVM, we applied penalized logistic regression to the breast cancer data with the expression levels of the 70 genes as linear predictors. For simplicity, the optimal penalty size for logistic regression was
[Panels of Fig. 11.6, left to right: σ = 3.38, σ = 1.69, σ = 1.20.]
Fig. 11.6. Error rate curves of the nonlinear SVM with 70 genes and the Gaussian kernel for three bandwidths. The dotted lines are the apparent error rates and the solid lines are the test error rates.
Fig. 11.7. Scatter plot of the estimated probabilities of good prognosis from penalized logistic regression versus the values of the discriminant function from the linear SVM for the training data. The grey dots indicate the patients with good prognosis and the black dots indicate those with poor prognosis.
again determined by the minimum test error rate. Figure 11.7 is a plot of the estimated probabilities of good prognosis from the logistic regression versus the values of the discriminant function from the linear SVM, evaluated for the observations in the training data. It shows a monotonic relationship between the output values of the two methods, which could be used to calibrate the results of the SVM to class-conditional probabilities. When each method was best tuned in terms of the test error rate, logistic regression gave a training error rate of 10% and a test error rate of 30%, while both error rates were around 20% for the SVM. For further comparison of the two approaches, see (30, 31). The statistical issue of finding an optimal choice of the tuning parameter has not been discussed adequately in this data example. Instead, by treating the test set as if it were a validation set, the size of the penalty was chosen to minimize the test error rate directly, for simple exposition. In practice, cross-validation is commonly used for tuning in the absence of a separate validation set. On a brief note, in the original paper, a correlation-based classifier was constructed on the basis of 70 genes that were selected sequentially, and its threshold was adjusted for increased sensitivity to poor prognosis. With the adjusted threshold, only 2 incorrect predictions out of the 19 test cases were reported. This low test error rate could be explained as a result of the threshold adjustment. Recall that
the good prognosis category is the majority for the training data set (good/poor = 44/33) while the opposite is true for the test set (good/poor = 7/12). As in this example, if the two types of error (misclassifying good prognosis as poor, or vice versa) are treated differentially, then the optimal decision boundary will differ from the region where the two classes are equally likely, that is, η(x) = 1/2. For estimation of a different level of probability, say, η0 ≠ 1/2, with the SVM method, the hinge loss has to be modified with weights that are determined according to the class labels. This modification leads to a weighted SVM, and more details can be found in (32).
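The weighting idea can be illustrated at a single point: with weight w₊ on the positive class and w₋ on the negative class, the pointwise minimizer of the weighted hinge risk flips sign at η = w₋/(w₊ + w₋) rather than at 1/2. A sketch (the particular weights and η are arbitrary, and this is an illustration rather than the exact scheme of (32)):

```python
import numpy as np

# pointwise weighted hinge risk at a point with P(Y = 1 | x) = eta:
#   R(f) = eta * w_pos * (1 - f)_+ + (1 - eta) * w_neg * (1 + f)_+
def weighted_hinge_risk(f, eta, w_pos, w_neg):
    return (eta * w_pos * np.maximum(1 - f, 0)
            + (1 - eta) * w_neg * np.maximum(1 + f, 0))

fs = np.linspace(-1, 1, 2001)
eta = 0.4
f_plain = fs[np.argmin(weighted_hinge_risk(fs, eta, 1.0, 1.0))]
f_weighted = fs[np.argmin(weighted_hinge_risk(fs, eta, 3.0, 1.0))]
# unweighted: sgn(0.4 - 1/2) = -1; weighted: 0.4 > w_neg/(w_pos + w_neg) = 0.25, so +1
print(f_plain, f_weighted)
```

Up-weighting the poor-prognosis class thus shifts the estimated boundary toward a different probability level, as the weighted SVM requires.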
4. Further Extensions

So far, the standard SVM for the binary case has mainly been introduced. Since its inception, various methodological extensions have been considered, expanding its utility to many different settings and applications. To provide appropriate pointers for further reading, some of the extensions are briefly mentioned here. First, consider situations that involve more than two classes. A proper extension of the binary SVM to the multiclass case is not as straightforward as for a probability model-based approach to classification, as is evident in the special nature of the discriminant function that minimizes the hinge loss in the binary case. References (24, 25) discuss some extensions of the hinge loss that carry the desired consistency of the binary SVM over to the multicategory case. Second, identification of the variables that discriminate between given class labels is often crucial in many applications. There have been a variety of proposals to combine or integrate variable or feature selection with the SVM for enhanced interpretability. For example, recursive feature elimination (33) combines the idea of backward elimination with the linear SVM. Similar to the ℓ1 penalization approach to variable selection in regression, such as the LASSO and the basis pursuit method (34), (35) and later (36) modified the linear SVM with the ℓ1 penalty for feature selection, and (37) considered the ℓ0 penalty further. For a nonlinear kernel function, (38, 39) introduced a scale factor for each variable and chose the scale factors by minimizing generalization error bounds. As an alternative, (40, 41) suggested a functional analysis of variance approach to feature selection for the nonlinear SVM, motivated by the nonparametric generalization of the LASSO in (42).
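Recursive feature elimination in the spirit of (33) can be sketched in a few lines; here a ridge regression fit stands in for the linear SVM weight vector (an assumption made for brevity), and the feature with the smallest absolute weight is dropped each round:

```python
import numpy as np

def rfe_linear(X, y, n_keep):
    """Sketch of recursive feature elimination: repeatedly fit a linear
    scorer (ridge regression here, standing in for the linear SVM) and
    drop the feature with the smallest absolute weight."""
    idx = list(range(X.shape[1]))
    while len(idx) > n_keep:
        Xs = X[:, idx]
        w = np.linalg.solve(Xs.T @ Xs + 1e-2 * np.eye(len(idx)), Xs.T @ y)
        del idx[int(np.argmin(np.abs(w)))]   # backward elimination step
    return idx

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 10))
y = np.sign(X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=100))
kept = rfe_linear(X, y, 2)
print(sorted(kept))   # the two informative features survive
```

Refitting after each elimination lets correlated features compensate for one another, which is what distinguishes RFE from one-shot ranking.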
On the computational front, numerous algorithms to solve the SVM optimization problem have been developed for fast computation, with enhanced algorithmic efficiency and the capacity to cope with massive data. Reference (43) provides a historical perspective on this development in terms of the computational issues relevant to SVM optimization. Some of the implementations are available at http://www.kernel-machines.org, including SVMlight in (44) and LIBSVM. The R package e1071 is an R interface to LIBSVM, and LIBLINEAR (45) is a related library for large-scale linear classification. Note that the aforementioned implementations are mostly for obtaining a solution at a given value of the tuning parameter λ. However, as seen in the data example, the classification error rate depends on λ, and thus, in practice, it is necessary to consider a range of λ values and obtain the corresponding solutions in pursuit of an optimal solution. It turns out that characterization of the entire solution path as a function of λ is possible, as demonstrated in (46) for the binary case and (47) for the multicategory case. The solution path algorithms in these references provide a computational shortcut to obtain the entire spectrum of solutions, facilitating the choice of the tuning parameter. The scope of extensions of kernel methods in current use is, in fact, far beyond classification. Details of other methodological developments with kernels for regression, novelty detection, clustering, and semi-supervised learning can be found in (7).

References

1. Hastie, T., Tibshirani, R., and Friedman, J. (2001) The Elements of Statistical Learning. Springer-Verlag, New York. 2. Duda, R. O., Hart, P. E., and Stork, D. G. (2000) Pattern Classification (2nd Edition). Wiley-Interscience, New York. 3. McLachlan, G. J. (2004) Discriminant Analysis and Statistical Pattern Recognition. Wiley-Interscience, New York. 4. Vapnik, V. (1998) Statistical Learning Theory. Wiley, New York. 5. Boser, B., Guyon, I., and Vapnik, V.
(1992) A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory 5, 144–152. 6. Cristianini, N. and Shawe-Taylor, J. (2000) An Introduction to Support Vector Machines. Cambridge University Press, Cambridge. 7. Schölkopf, B. and Smola, A. (2002) Learning with Kernels – Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge, MA. 8. Cortes, C. and Vapnik, V. (1995) Support-Vector Networks. Machine Learning 20(3), 273–297.
9. Rosenblatt, F. (1958) The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65, 386–408. 10. Burges, C. (1998) A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2), 121–167. 11. Bennett, K. P. and Campbell, C. (2000) Support vector machines: Hype or hallelujah? SIGKDD Explorations 2(2), 1–13. 12. Moguerza, J. M. and Munoz, A. (2006) Support vector machines with applications. Statistical Science 21(3), 322–336. 13. Hoerl, A. and Kennard, R. (1970) Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1), 55–67. 14. Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B 58(1), 267–288. 15. Mangasarian, O. (1994) Nonlinear Programming. Classics in Applied Mathematics, Vol. 10, SIAM, Philadelphia. 16. Wahba, G. (1990) Spline Models for Observational Data. Series in Applied Mathematics, Vol. 59, SIAM, Philadelphia.
17. Wahba, G. (1998) Support vector machines, reproducing kernel Hilbert spaces, and randomized GACV. In Schölkopf, B., Burges, C. J. C., and Smola, A. J. (ed.), Advances in Kernel Methods: Support Vector Learning, MIT Press, p. 69–87. 18. Aronszajn, N. (1950) Theory of reproducing kernels. Transactions of the American Mathematical Society 68, 337–404. 19. Kimeldorf, G. and Wahba, G. (1971) Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications 33(1), 82–95. 20. Schölkopf, B., Tsuda, K., and Vert, J. P. (ed.) (2004) Kernel Methods in Computational Biology. MIT Press, Cambridge, MA. 21. Zhang, T. (2004) Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics 32(1), 56–85. 22. Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. (2006) Convexity, classification, and risk bounds. Journal of the American Statistical Association 101, 138–156. 23. Lin, Y. (2002) A note on margin-based loss functions in classification. Statistics and Probability Letters 68, 73–82. 24. Lee, Y., Lin, Y., and Wahba, G. (2004) Multicategory Support Vector Machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association 99, 67–81. 25. Tewari, A. and Bartlett, P. L. (2007) On the consistency of multiclass classification methods. Journal of Machine Learning Research 8, 1007–1025. 26. Liu, Y. and Shen, X. (2006) Multicategory SVM and ψ-learning: methodology and theory. Journal of the American Statistical Association 101, 500–509. 27. Steinwart, I. (2005) Consistency of support vector machines and other regularized kernel machines. IEEE Transactions on Information Theory 51, 128–142. 28. Koo, J.-Y., Lee, Y., Kim, Y., and Park, C. (2008) A Bahadur representation of the linear Support Vector Machine. Journal of Machine Learning Research 9, 1343–1368. 29.
van’t Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., Schreiber, G. J., Kerkhoven, R. M., Roberts, C., Linsley, P. S., Bernards, R., and Friend, S. H. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature 415(6871), 530–536.
30. Zhu, J. and Hastie, T. (2004) Classification of gene microarrays by penalized logistic regression. Biostatistics 5(3), 427–443. 31. Wahba, G. (2002) Soft and hard classification by reproducing kernel Hilbert space methods. Proceedings of the National Academy of Sciences 99, 16524–16530. 32. Lin, Y., Lee, Y., and Wahba, G. (2002) Support vector machines for classification in nonstandard situations. Machine Learning 46, 191–202. 33. Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. (2002) Gene selection for cancer classification using support vector machines. Machine Learning 46(1–3), 389–422. 34. Chen, S. S., Donoho, D. L., and Saunders, M. A. (1999) Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing 20(1), 33–61. 35. Bradley, P. S., and Mangasarian, O. L. (1998) Feature selection via concave minimization and support vector machines. In Shavlik, J. (ed.), Machine Learning Proceedings of the Fifteenth International Conference Morgan Kaufmann, San Francisco, California, p. 82–90. 36. Zhu, J., Rosset, S., Hastie, T., and Tibshirani, R. (2004) 1-norm support vector machines. In Thrun, S., Saul, L., and Schölkopf, B. (ed.), Advances in Neural Information Processing Systems 16, MIT Press, Cambridge, MA. 37. Weston, J., Elisseff, A., Schölkopf, B., and Tipping, M. (2003) Use of the zero-norm with linear models and kernel methods. Journal of Machine Learning Research 3, 1439– 1461. 38. Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., and Vapnik, V. (2001) Feature selection for SVMs. In Solla, S. A., Leen, T. K., and Muller, K.-R. (ed.), Advances in Neural Information Processing Systems 13, MIT Press, Cambridge, MA, pp. 668–674. 39. Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. (2002) Choosing multiple parameters for support vector machines. Machine Learning 46 (1–3), 131–59. 40. Zhang, H. H. (2006) Variable selection for support vector machines via smoothing spline ANOVA. Statistica Sinica 16(2), 659–674. 41. 
Lee, Y., Kim, Y., Lee, S., and Koo, J.Y. (2006) Structured Multicategory Support Vector Machine with ANOVA decomposition. Biometrika 93(3), 555–571. 42. Lin, Y. and Zhang, H. H. (2006) Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics 34, 2272–2297.
43. Bottou, L., and Lin, C.-J. (2007) Support Vector Machine Solvers. In Bottou, L., Chapelle, O., DeCoste, D., and Weston, J. (ed.), Large Scale Kernel Machines, MIT Press, Cambridge, MA, pp. 301–320. 44. Joachims, T. (1998) Making large-scale support vector machine learning practical. In Schölkopf, C. B. (ed.), Advances in Kernel Methods: Support Vector Machines. MIT Press, Cambridge, MA. 45. Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. (2008) LIB-
LINEAR: A library for large linear classification. Journal of Machine Learning Research 9, 1871–1874. 46. Hastie, T., Rosset, S., Tibshirani, R., and Zhu, J. (2004) The entire regularization path for the support vector machine. Journal of Machine Learning Research 5, 1391–1415. 47. Lee, Y. and Cui, Z. (2006) Characterizing the solution path of Multicategory Support Vector Machines. Statistica Sinica 16(2), 391–409.
Chapter 12

An Overview of Clustering Applied to Molecular Biology

Rebecca Nugent and Marina Meila

Abstract

In molecular biology, we are often interested in determining the group structure in, e.g., a population of cells or microarray gene expression data. Clustering methods identify groups of similar observations, but the results can depend on the chosen method's assumptions and starting parameter values. In this chapter, we give a broad overview of both attribute- and similarity-based clustering, describing the methods and their performance. The parametric and nonparametric approaches presented vary in whether or not they require knowing the number of clusters in advance, as well as in the shapes of the estimated clusters. Additionally, we include a biclustering algorithm that incorporates variable selection into the clustering procedure. We finish with a discussion of some common methods for comparing two clustering solutions (possibly from different methods). The user is advised to devote time and attention to determining the appropriate clustering approach (and any corresponding parameter values) for the specific application prior to analysis.

Key words: Cluster analysis, K-means, model-based clustering, EM algorithm, similarity-based clustering, spectral clustering, nonparametric clustering, hierarchical clustering, biclustering, comparing partitions.
1. Introduction

In many molecular biology applications, we are interested in determining the presence of "similar" observations. For example, in flow cytometry, fluorescent tags are attached to mRNA molecules in a population of cells and passed in front of a single wavelength laser; the level of fluorescence in each cell (corresponding, for example, to the level of gene expression) is recorded. An observation comprises the measurements taken from the different tags or channels. We might be interested in discovering groups of cells that have high fluorescence levels for
H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_12, © Springer Science+Business Media, LLC 2010
multiple channels (e.g., gating) or groups of cells that have different levels across channels. We might also define groups of interest a priori and then try to classify cells according to those group definitions. In microarray analysis, gene expression levels can be measured across different samples (people, tissues, etc.) or different experimental conditions. An observation might be the different expression levels across all measured genes for one person; a group of interest might be a group of patients whose gene expression patterns are similar. We could also look for different patterns among a group of patients diagnosed with the same disease; subgroups of patients that display different gene expression might imply the presence of two different pathologies at the molecular level within patients showing the same symptoms (e.g., T-cell/B-cell acute lymphoblastic leukemia (1)). An observation could also be the expression levels for one gene across many experimental conditions. Similarly expressed genes are used to help identify coregulated genes for use in determining disease marker genes. In these and other applications, we might ask: How many groups are there? Where are they? How are the grouped observations similar or dissimilar? How can we describe them? How are the groups themselves similar or dissimilar? In statistics, clustering is used to answer these types of questions. The goal of clustering is to identify distinct groups in a data set and assign a group label to each observation. Observations are partitioned into subsets, or clusters, such that observations in one subset are more similar to each other than to observations in different subsets. Ideally, we would like to find these clusters with minimal input from the user. There is a wide array of clustering approaches, each with its strengths and weaknesses.
Generally, an approach can be characterized by (1) the type of data available (observation attributes or (dis)similarities between pairs of observations) and (2) the prior assumptions about the clusters (size, shape, etc.). In this chapter, we will give an overview of several common clustering methods and illustrate their performance on two running two-dimensional examples chosen for easy visualization (Fig. 12.1). Figure 12.1a contains simulated data generated from four groups, two spherical and two curvilinear. Figure 12.1b contains the flow cytometry measurements of two fluorescence markers applied to Rituximab, a therapeutic monoclonal antibody, in a drug-screening project designed to identify agents to enhance its antilymphoma activity (2, 3). Cells were stained, following culture, with the agents anti-BrdU and the DNA binding dye 7-AAD. In Section 2, we introduce attribute-based clustering and some commonly used methods: K-means and K-medoids, popular algorithms for partitioning data into spherical groups; model-based clustering; and some nonparametric approaches including cluster trees, mean shift methods, and Dirichlet mixture models. Section 3 begins with an overview of pairwise (dis)similarity-
An Overview of Clustering Applied to Molecular Biology
Fig. 12.1. (a) Data set with four apparent groups; (b) two fluorescence channels (7-AAD, anti-BrdU) from the Rituximab data set.
based clustering and then discusses hierarchical clustering, spectral clustering, and affinity propagation. Section 4 gives a brief introduction to biclustering, a method that clusters observations and measured variables simultaneously. In Section 5, we review several approaches to compare clustering results. We conclude the chapter with an overview of the methods' strengths and weaknesses and some recommendations. Prior to describing the clustering approaches (and their behaviors), we first introduce some basic notation that will be used throughout the chapter. Each observation xi is denoted by a vector of length p, xi = {xi1, xi2, xi3, ..., xip}, where p is the number of variables or different measurements on the observation. The observations are indexed over i = 1, 2, ..., n, where n is the total number of observations. Clusters are denoted by Ck and are indexed over k = 1, 2, ..., K, where K is the total number of clusters. The cardinality (or size) of cluster Ck is denoted by |Ck|. The indicator function I{xi ∈ Ck} equals 1 (or 0) for all observations xi that are currently (or not currently) assigned to cluster Ck. When summing over observations assigned to a cluster, the notations Σ_{i=1}^{n} I{xi ∈ Ck} and Σ_{xi ∈ Ck} can be used interchangeably. In addition, the methods presented in this chapter are for use with continuous real-valued data, xi ∈ Rp.
Nugent and Meila
2. Attribute-Based Clustering
In attribute-based clustering, the user has p different measurements on each observation. The values are stored in a matrix X of dimension n × p: X = {x1, x2, x3, ..., xn}, with each xi ∈ Rp. Row i of the matrix contains the measurements (or values) for all p variables for the ith observation; column j contains the measurements (or values) for the jth variable for all n observations. Attribute-based clustering methods take the matrix X as an input and then look for groups of observations with similar-valued attributes. The groups themselves could be of varying shapes and sizes. The user may have reason to believe that the observations clump together in spherical or oval shapes around centers of average attribute values; the observations could also be grouped in curved shapes. For example, in Fig. 12.1a, there are two spherical groups of observations and two curvilinear groups of observations. Clustering methods often are designed to look for clusters of specific shapes; the user should keep these designations in mind when choosing a method.
2.1. K-Means and K-Medoids
K-means is a popular method that uses the squared Euclidean distance between attributes to determine the clusters (4, 5). It tends to find spherical clusters and requires the user to specify the desired number of clusters, K, in advance. Each cluster is defined by its assigned observations (I{xi ∈ Ck} = 1) and their center x̄k = (1/|Ck|) Σ_{xi ∈ Ck} xi (the average attribute-value vector of the assigned observations). We measure the "quality" of the clustering solution by a within-cluster (WC) squared-error criterion:

WC = Σ_{k=1}^{K} Σ_{xi ∈ Ck} ||xi − x̄k||².   [1]
The term Σ_{xi ∈ Ck} ||xi − x̄k||² represents how close the observations are to the cluster center x̄k. We then sum over all K clusters to measure the overall cluster compactness. Tight, compact clusters correspond to low WC values. K-means searches for a clustering solution with a low WC criterion with the following algorithm:
K-MEANS Algorithm
Input: observations x1, ..., xn; the number of clusters K.
• Select K starting centers x̄1, ..., x̄K.
• Iterate until cluster assignments do not change:
  1. for i = 1:n, assign each observation xi to the closest center x̄k.
  2. for k = 1:K, re-compute each cluster center as x̄k = (1/|Ck|) Σ_{xi ∈ Ck} xi.
We give the algorithm a starting set of centers; the observations are assigned to their respective closest centers, the centers are then re-computed, and so on. Each cycle lowers the WC criterion until it converges to a minimum (or the observations’ assignments do not change from cycle to cycle). Usually, the first few cycles correspond to large drops in the criterion and big changes in the positions of the cluster centers. The following cycles make smaller changes as the solution becomes more refined. Figure 12.2 shows four- and eight-cluster K-means solutions. The four-cluster solution (WC = 25.20946) finds the lower left group, combines the two spheres, and splits the upper left curvilinear group. While it might be surprising that the two spherical groups are not separated, note that, given the fixed number of clusters (K = 4), if one group is split, two groups are forced to merge. The eight-cluster solution (WC = 4.999) separates the two spherical groups and then splits each curvilinear group into multiple groups. The criterion is much lower but at the expense of erroneously splitting the curved groups.
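The two steps above are easy to sketch in code. The following is a minimal Python illustration (not the chapter's own software; the function names `kmeans` and `wc` and the list-of-tuples data layout are our own choices), including the within-cluster criterion of Eq. [1]:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two observations (tuples)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def cluster_mean(cl):
    """Average attribute-value vector of the assigned observations."""
    return tuple(sum(x[l] for x in cl) / len(cl) for l in range(len(cl[0])))

def wc(centers, clusters):
    """Within-cluster squared-error criterion, Eq. [1]."""
    return sum(dist2(x, centers[k]) for k, cl in enumerate(clusters) for x in cl)

def kmeans(points, k, n_iter=100, seed=0):
    """K-means: assign to the closest center, re-compute centers, repeat."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)        # K observations as starting centers
    for _ in range(n_iter):
        # Step 1: assign each observation to its closest center.
        clusters = [[] for _ in range(k)]
        for x in points:
            clusters[min(range(k), key=lambda j: dist2(x, centers[j]))].append(x)
        # Step 2: re-compute each center as the mean of its assigned observations.
        new_centers = [cluster_mean(cl) if cl else centers[j]
                       for j, cl in enumerate(clusters)]
        if new_centers == centers:         # assignments no longer change
            break
        centers = new_centers
    return centers, clusters
```

Because the solution depends on the starting centers, a common strategy (also suggested below) is to run the algorithm from several random starts and keep the solution with the lowest WC.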
Fig. 12.2. (a) K-means: K = 4; (b) K-means: K = 8.
Although the number of clusters may be predefined as part of the problem, in practice, the number of clusters, K, may be unknown, particularly when the dimensionality of the data prohibits easy visualization. Moreover, the solution can always be improved by increasing K, as splitting an already tight cluster into two smaller clusters will correspond to a reduction in the criterion. Since overestimating the number of clusters is not a practical solution, we instead search for a solution with a reasonably low WC criterion (one that does not decrease substantially with an increase in the number of clusters). We plot the number of clusters against the WC criterion; the "elbow" in the graph corresponds to a reasonable solution. Figure 12.3a contains the "elbow" graph for the flow cytometry data (Section 1). The criterion drops as we increase the number of clusters; around K = 12, 13, the drops become relatively negligible. Figure 12.3b presents the 13-cluster solution. As expected, the observations are partitioned into spherical groups. In fact, the large population of cells in the lower left corner is split into several small clusters. (A solution with six or seven clusters – corresponding to the end of the steeper criterion drops – combines several of these smaller clusters.) One drawback of the K-means method is that the solution can change depending on the starting centers. There are several common choices. Probably the most common is to randomly pick K observations as the starting centers, x̄k. Figure 12.4 contrasts the original 13-cluster results and another 13-cluster solution given a different set of starting observations. The results are similar; some of the more densely populated areas are separated into spheres in slightly different positions. Users could also use a set of problem-defined centers; a set of centers generated
Fig. 12.3. (a) Elbow graph; (b) K-means: K = 13.
Fig. 12.4. (a) K = 13; original starting centers; (b) K = 13; different starting centers.
from another clustering algorithm (e.g., hierarchical clustering – Section 3.2) is another common choice. Recall though that each clustering solution corresponds to a WC criterion; if desired, the user could run K-means using several different sets of starting centers and choose the solution corresponding to the lowest criterion. A related method is K-medoids. If outlier observations are present, their large distance from the other observations influences the cluster centers, x̄k, by pulling them disproportionately toward the outliers. To reduce this influence, instead of re-estimating the cluster center as the average of the assigned observations in Step 2, we replace the center with the observation that would correspond to the lowest criterion value, i.e.,

x̄k = argmin_{xi ∈ Ck} Σ_{xj ∈ Ck} ||xj − xi||².   [2]
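The medoid update of Eq. [2] amounts to an argmin over the cluster's own observations. A minimal sketch (hypothetical helper names; squared Euclidean distance as in the text), with the K-means mean shown alongside for comparison:

```python
def dist2(a, b):
    """Squared Euclidean distance between two observations (tuples)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def medoid(cluster):
    """Step 2 of K-medoids: the new center is the assigned observation that
    minimizes the summed squared distance to the cluster (Eq. [2])."""
    return min(cluster, key=lambda xi: sum(dist2(xj, xi) for xj in cluster))

def cluster_mean(cluster):
    """The K-means center, for comparison: it is pulled toward outliers."""
    return tuple(sum(x[l] for x in cluster) / len(cluster)
                 for l in range(len(cluster[0])))
```

On a cluster containing an outlier, the mean moves toward the outlier while the medoid stays inside the main group of observations.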
This method is computationally more difficult since at each Step 2, the criterion for each cluster is optimized over all choices for the new center (i.e., the observations currently assigned to the cluster).

2.2. Model-Based Clustering
The statistical approach to clustering assumes that the observations are a sample from a population with some density f (x) and that the groups (clusters) in the population can be described by properties of this density. In model-based clustering (6, 7), we assume that each population group (i.e., cluster) is represented by a density in some parametric family fk (x) and that the population density f (x) is a
weighted combination (or mixture) of the group densities:

f(x) = Σ_{k=1}^{K} πk · fk(x; θk),   [3]

where πk ≥ 0 and Σ_{k=1}^{K} πk = 1 are called mixture weights, and the densities fk are called mixture components. The procedure needs to choose the number of components (or clusters) K and the weights πk and to estimate the parameters of the densities fk. Most often the component densities are assumed to be Gaussian with parameters θk = {μk, Σk} (i.e., elliptical shapes, symmetric, no heavy tails). The covariance matrices Σk give the "shape" (relative magnitudes of the eigenvalues), "volume" (absolute magnitude of the eigenvalues), and "orientation" (of the eigenvectors of Σk) of the clusters. Depending on what is known, one can choose each of these values to be the same or different among clusters. There are ten possible combinations, from which arise the following ten models [illustrated in Fig. 12.5 from (8)]:
Fig. 12.5. Illustrating the group density components in the candidate models (from (8)).
• EII: equal volume, round shape (spherical covariance)
• VII: varying volume, round shape (spherical covariance)
• EEI: equal volume, equal shape, axis parallel orientation (diagonal covariance)
• VEI: varying volume, equal shape, axis parallel orientation (diagonal covariance)
• EVI: equal volume, varying shape, axis parallel orientation (diagonal covariance)
• VVI: varying volume, varying shape, axis parallel orientation (diagonal covariance)
• EEE: equal volume, equal shape, equal orientation (ellipsoidal covariance)
• EEV: equal volume, equal shape, varying orientation (ellipsoidal covariance)
• VEV: varying volume, equal shape, varying orientation (ellipsoidal covariance)
• VVV: varying volume, varying shape, varying orientation (ellipsoidal covariance)
Each model is fit using an Expectation-Maximization (EM) algorithm (9).

EXPECTATION-MAXIMIZATION (EM) Algorithm
Input: data {xi}_{i=1:n}, the number of clusters K.
Initialize parameters π_{1:K} ∈ R, μ_{1:K} ∈ Rp, Σ_{1:K} at random; the Σk will be symmetric, positive definite matrices, parametrized according to the chosen covariance structure (e.g., EII, VII, etc.).
Iterate until convergence.
E step (estimate data assignments to clusters): for i = 1:n, k = 1:K,
  γki = πk fk(xi; θk) / f(xi).
M step (estimate mixture parameters): denote nk = Σ_{i=1}^{n} γki for k = 1:K (note that Σ_{k=1}^{K} nk = n); then for k = 1:K,
  πk = nk / n,
  μk = (Σ_{i=1}^{n} γki xi) / nk,
  Σk = (Σ_{i=1}^{n} γki (xi − μk)(xi − μk)^T) / nk,
or another update according to the covariance structure chosen (10).
The algorithm alternates between estimating the soft assignments γki of each observation i to cluster k and estimating the mixture parameters πk, μk, Σk. The values γki are always between 0 and 1, with 1 representing certainty that observation i is in cluster k. The value nk represents the total "number of observations" in cluster k. In the algorithm, the equation for estimating Σk corresponds to the VVV model, where there are no shared parameters between the covariance matrices; the estimation of the other covariance structures is detailed in (10).
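To make the E and M steps concrete, here is a deliberately minimal one-dimensional EM sketch with scalar variances (a 1-D analogue of the spherical models; the sorted-block initialization and the small variance floor are our own ad hoc choices, not part of the chapter's procedure):

```python
import math

def gauss(x, m, s2):
    """Univariate Gaussian density."""
    return math.exp(-(x - m) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

def em_1d(xs, k=2, n_iter=200):
    """EM for a 1-D Gaussian mixture: the E step computes the soft assignments
    gamma[j][i]; the M step re-estimates pi_j, mu_j, and the variances."""
    n = len(xs)
    # Crude initialization: split the sorted data into k equal blocks.
    srt = sorted(xs)
    blocks = [srt[j * n // k:(j + 1) * n // k] for j in range(k)]
    pi = [len(b) / n for b in blocks]
    mu = [sum(b) / len(b) for b in blocks]
    s2 = [max(sum((x - m) ** 2 for x in b) / len(b), 1e-6)
          for b, m in zip(blocks, mu)]
    for _ in range(n_iter):
        # E step: gamma[j][i] = pi_j f_j(x_i) / f(x_i).
        gamma = [[pi[j] * gauss(x, mu[j], s2[j]) for x in xs] for j in range(k)]
        for i in range(n):
            tot = sum(gamma[j][i] for j in range(k))
            for j in range(k):
                gamma[j][i] /= tot
        # M step: n_j = sum_i gamma[j][i]; then update pi_j, mu_j, variance.
        for j in range(k):
            nj = sum(gamma[j])
            pi[j] = nj / n
            mu[j] = sum(g * x for g, x in zip(gamma[j], xs)) / nj
            s2[j] = max(sum(g * (x - mu[j]) ** 2
                            for g, x in zip(gamma[j], xs)) / nj, 1e-6)
    return pi, mu, s2
```

On well-separated data the soft assignments quickly become nearly hard, and the component means settle on the group means.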
The number of clusters K and the clustering model (e.g., EII) are chosen to maximize the Bayesian Information Criterion (BIC) (10), a function of the likelihood of the observations given the chosen model, penalized by a complexity term (which aims to prevent over-fitting). For n observations, if the current candidate model M has p parameters,

BIC = 2 · log L(x|θ) − log(n) · p.   [4]
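Evaluating Eq. [4] for a handful of fitted candidates is a one-liner; the model labels, log-likelihoods, and parameter counts below are invented purely for illustration:

```python
import math

def bic(loglik, n_params, n_obs):
    """Eq. [4]: BIC = 2 log L - log(n) * p; larger is better in this convention."""
    return 2 * loglik - math.log(n_obs) * n_params

# Hypothetical fitted candidates: (model label, maximized log-likelihood, p).
candidates = [("EII, 8", -210.0, 25),
              ("EEV, 8", -180.0, 40),
              ("VVV, 8", -178.0, 55)]
best = max(candidates, key=lambda m: bic(m[1], m[2], n_obs=200))
```

With these made-up numbers, the richer models fit better (higher log-likelihood) but pay a larger complexity penalty, illustrating the trade-off BIC arbitrates.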
The BICs are calculated for all candidate models. Although only one final model is chosen, some model-based clustering software packages [e.g., mclust in R (6)] will produce graphs showing the change in BIC for an increasing number of clusters given a model, pairwise scatterplots of the variables with observations color-coded by cluster, and two-dimensional (projection) scatterplots of the component densities. When a model is chosen, parameter values (centers, covariances) are estimated, and observations are assigned to the mixture components by Bayes Rule. This assignment is a hard assignment (each x belongs to one cluster only); however, we could relax the hard assignment by instead looking at the vector γi (of length K) for each observation, representing the probabilities that the observation comes from each mixture component. If an observation has several "likely" clusters, the user could return the soft assignments γki. Figure 12.6 contains the model-based clustering results for the four-group example data. The procedure searches all ten candidate models over a range of 1–12 clusters (range chosen by user). Figure 12.6a shows the change in the BIC as the number
Fig. 12.6. (a) Number of clusters vs. BIC: EII (A), VII (B), EEI (C), VEI (D), EVI (E), VVI (F), EEE (G), EEV (H), VEV (I), VVV (J); (b) EEV, eight-cluster solution.
of clusters increases for each of the models (legend in figure caption). The top three models are EEV, 8 (BIC = 961.79); EEV, 9 (954.24); and EII, 10 (935.12). The observations are classified according to the best model (EEV, 8) in Fig. 12.6b. The EEV model fits ellipsoidal densities of equal volume, equal shape, and varying orientation. The spherical groups are appropriately modeled; the curvilinear groups are each separated into three clusters. This behavior highlights an oft-cited weakness of model-based clustering: its performance is dependent on adherence to the Gaussian assumptions. Skewed or non-Gaussian data are often over-fit by too many density components (clusters). The reader should keep in mind, however, that the mixture components fk do not need to be Gaussians, and that the EM algorithm can be modified easily (by changing the M step) to work with other density families. Also note in Fig. 12.6a that several models may have similar BIC values; users may want to look at a subgroup of models and choose based on practicality or interpretability. Figure 12.7 contains the results for the flow cytometry example. Model-based clustering chooses a VEV, 11 model (BIC = −37920.18) with ellipsoidal densities of varying volume, equal shape, and varying orientation. The two closest models are VEV, 12 (−37946.37) and VVV, 7 (−38003.07). Note in Fig. 12.7b that the density components can overlap, which may lead to overlapping or seemingly incongruous clusters (e.g., the "9" cluster).
Fig. 12.7. (a) Number of clusters vs. BIC: EII (A), VII (B), EEI (C), VEI (D), EVI (E), VVI (F), EEE (G), EEV (H), VEV (I), VVV (J); (b) VEV, eleven cluster solution.
2.3. Nonparametric Clustering
In contrast, nonparametric clustering assumes that groups in a population correspond to modes of the density f(x) (11, 12, 13, 14). The goal then is to find the modes and assign each observation to the "domain of attraction" of a mode.
Nonparametric clustering methods are akin to "mode-hunting" [e.g., (15)] and identify groups of all shapes and sizes. Finding modes of a density is often based on analyzing cross-sections of the density, or its level sets. The level set at height λ of a density f(x) is defined as all areas of feature space whose density exceeds λ, i.e.,

L(λ; f(x)) = {x | f(x) > λ}.   [5]
The connected components, or pieces, of the level set correspond to modes of the density, particularly modes whose peaks are above the height λ. For example, Fig. 12.8a below contains a grey-scale heat map of a binned kernel density estimate (BKDE) of the four-group example data (16, 17). The BKDE is a piecewise constant kernel density estimate found over a grid (here 20 × 20); the bandwidth h is chosen by least squares cross-validation (here 0.0244). High-density areas are indicated by black or grey; low-density areas are in white. We can clearly see the presence of four high-frequency areas, our four groups. Figure 12.8b shows the cross-section or level set at λ = 0.00016. Only the bins whose density estimate is greater than 0.00016 remain in the level set (black); the remaining bins drop out (white). At this height, we have evidence of at least two modes in the density estimate. Increasing the height to λ = 0.0029 gives us the level set in Fig. 12.8c. We now have four sections of feature space with density greater than 0.0029, giving us evidence of at least four modes.
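On a binned estimate, the level-set computation of Eq. [5] reduces to thresholding the grid and counting connected components, here by flood fill over 4-neighbor bins (the toy grid values and heights below are made up, and the function name is ours):

```python
def level_set_components(density, lam):
    """Number of 4-connected components of the level set
    {bins whose density estimate exceeds lam} (Eq. [5])."""
    rows, cols = len(density), len(density[0])
    seen = [[False] * cols for _ in range(rows)]
    comps = 0
    for r in range(rows):
        for c in range(cols):
            if density[r][c] > lam and not seen[r][c]:
                comps += 1                 # found a new connected component
                stack = [(r, c)]           # flood fill to mark all its bins
                seen[r][c] = True
                while stack:
                    i, j = stack.pop()
                    for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                        if (0 <= ni < rows and 0 <= nj < cols
                                and density[ni][nj] > lam and not seen[ni][nj]):
                            seen[ni][nj] = True
                            stack.append((ni, nj))
    return comps
```

Scanning increasing heights λ and watching the component count grow is exactly the evidence-of-modes argument made for Fig. 12.8b, c.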
Fig. 12.8. (a) BKDE, 20 x 20 grid; h = 0.0244; (b) L(0.00016; BKDE) (c) L(0.0029; BKDE).
Figure 12.8 also shows an inherent limitation of a one-level-set approach. One single cross-section of a density may not show all the density's modes. The height λ either may not be tall enough to separate all the modes (Fig. 12.8b) or might be too tall and so not identify modes whose peaks are lower than λ (not shown) (11–14). We can take advantage of the concentric nature of level sets (level sets for higher λ are contained in level sets for lower λ) to construct a cluster tree (18, 19). The cluster tree combines information from all the density's level sets by analyzing consecutive level sets for the presence of multiple modes. For example, the level set at λ = 0.00016 in Fig. 12.8b above is the first level set with two connected components. We would split the cluster tree into two branches at this height to indicate the presence of two modes and then examine the subsequent level sets for further splits. Figure 12.9 shows the cluster tree and final assignments of our example data. Each leaf of the tree (numbered node) corresponds to a mode of the density estimate. The first split corresponds to an artifact of the density estimate, a mode with no observations. The "1" cluster is the first split identified in Fig. 12.8b that contains observations and so corresponds to its own branch. The remaining observations and modes stay together in the tree until subsequent level sets indicate further splits.
Fig. 12.9. (a) Cluster tree of the BKDE; (b) cluster assignments based on final “leaves”.
Figure 12.10 illustrates the cluster tree and subsequent assignments for the flow cytometry example. There are 13 leaves in the cluster tree (five are modal artifacts of the binned kernel density estimate). In the cluster assignments, the observations have been labeled as either "core" (•) or "fluff" (x). Core observations are those that lie in bins belonging to the connected components (i.e., the black bins in Fig. 12.8b; the cluster centers); fluff observations are in bins that were dropped prior to finding a split in the level set (i.e., the white bins in Fig. 12.8b; the edges of the clusters). Here we would report one large cluster and several smaller clusters. The two most similar clusters are "5" and "6"; they are the last clusters/modes to be separated. This approach does not put any structure or size restrictions on the clusters, only the assumption that they correspond to modes of the density. The hierarchical structure of the cluster tree also identifies "similar" clusters by grouping them; modes that are very close together will be on the same branch of the cluster tree.
Fig. 12.10. (a) Cluster tree with 13 leaves (8 clusters and 5 artifacts); (b) cluster assignments: “fluff” = x; “core” = •.
Nonparametric clustering procedures in general are dependent on the density estimate. Although methods may find all of the modes of a density estimate, these modes may not correspond to modes of the density (i.e., groups in the population). Spurious modes of a poor density estimate will show up in the cluster tree. For example, the first split on the cluster tree in Fig. 12.9a (far right node) actually corresponds to a modal artifact in the density estimate (a very small mode with no observations). Also, multiple modes are found in the curvilinear groups due to inherent noise in the density estimate. Often, pruning or merging techniques are used to combine similar clusters on the same branch (18).

2.4. Mean Shift Methods
Another category of nonparametric clustering methods explicitly finds, for each observation xi, the corresponding mode. All the observations under one mode form a cluster. Again, one assumes that a kernel density estimate (KDE) is available. Typically the kernel used in this estimate is the Gaussian kernel with bandwidth h,

K(z) = (1 / (h^p (2π)^{p/2})) e^{−||z||² / (2h²)}.   [6]
Hence, the KDE has the form

f(x) = (1 / (n h^p (2π)^{p/2})) Σ_{i=1}^{n} e^{−||x − xi||² / (2h²)}.   [7]
If x is a mode of this density, the gradient at x is equal to 0, hence satisfying the relation

x = (Σ_{i=1}^{n} e^{−||x − xi||² / (2h²)} xi) / (Σ_{i=1}^{n} e^{−||x − xi||² / (2h²)}) ≡ m(x).   [8]
The quantity m(x) above is called the mean shift, yielding its name to this class of methods. The idea of the mean shift clustering algorithms is to start with an observation xi and to iteratively "shift" it to m(xi) until the trajectory converges to a fixed point; this point of convergence will be common to multiple observations, and all the observations xi that converge to the same point form a cluster. The Simple Mean Shift algorithm is described below.

SIMPLE MEAN SHIFT Algorithm
Input: observations x1, ..., xn; bandwidth h.
1. for i = 1:n do
   (a) x ← xi
   (b) iterate x ← m(x) until convergence to mi.
2. Group observations with the same mi in a cluster.
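The algorithm above transcribes almost directly into code. A minimal pure-Python sketch on tuples (the convergence tolerance and the mode-merging tolerance are our own choices, not prescribed by the text):

```python
import math

def dist2(a, b):
    """Squared Euclidean distance between two observations (tuples)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def shift(x, data, h):
    """One application of the mean shift m(x) of Eq. [8]."""
    w = [math.exp(-dist2(x, xi) / (2 * h * h)) for xi in data]
    tot = sum(w)
    return tuple(sum(wi * xi[l] for wi, xi in zip(w, data)) / tot
                 for l in range(len(x)))

def simple_mean_shift(data, h, tol=1e-6, merge_tol=1e-3, max_iter=500):
    """Run each observation to its mode via x <- m(x); observations whose
    trajectories end at the same point form a cluster."""
    modes, labels = [], []
    for x in data:
        for _ in range(max_iter):
            m = shift(x, data, h)
            done = dist2(m, x) < tol * tol    # trajectory has converged
            x = m
            if done:
                break
        for k, mk in enumerate(modes):        # same mode as a previous point?
            if dist2(x, mk) < merge_tol * merge_tol:
                labels.append(k)
                break
        else:
            modes.append(x)
            labels.append(len(modes) - 1)
    return labels, modes
```

Each observation's trajectory climbs the KDE, so two well-separated groups of points end at two distinct modes.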
This algorithm can be slow when the sample size n is large, because it has to evaluate the mean shift, which depends on all the n observations, many times for each individual observation. In (20) a faster variant is given, which is recommended for large sample sizes. The Gaussian Blurring Mean Shift (GBMS) Algorithm (21, 25) is a variant of Simple Mean Shift where, instead of following the trajectory x ← m(x) from an observation to a mode of the KDE, the observations themselves are "shifted" at each step.

GAUSSIAN BLURRING MEAN SHIFT Algorithm
Input: observations x1, ..., xn; bandwidth h; radius ε.
• Iterate until STOP:
  1. for i = 1:n compute m(xi).
  2. for i = 1:n set xi ← m(xi).
• Place all observations within ε of each other in the same cluster.
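A sketch of GBMS with a fixed iteration budget standing in for the STOP condition, and a grouping radius defaulting to ε = 0.1h (all parameter choices here are illustrative, and the function names are ours):

```python
import math

def dist2(a, b):
    """Squared Euclidean distance between two observations (tuples)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean_shift_of(x, pts, h):
    """m(x) of Eq. [8], computed against the current (already blurred) sample."""
    w = [math.exp(-dist2(x, xi) / (2 * h * h)) for xi in pts]
    tot = sum(w)
    return tuple(sum(wi * xi[l] for wi, xi in zip(w, pts)) / tot
                 for l in range(len(x)))

def gbms(data, h, n_steps=6, eps=None):
    """Gaussian Blurring Mean Shift: every observation is shifted at each step;
    a fixed step budget plays the role of STOP, then points within eps of each
    other are grouped into one cluster."""
    if eps is None:
        eps = 0.1 * h                         # the epsilon = 0.1 h rule of thumb
    pts = list(data)
    for _ in range(n_steps):
        pts = [mean_shift_of(x, pts, h) for x in pts]   # blur the whole sample
    labels, k = [-1] * len(pts), 0
    for i in range(len(pts)):
        if labels[i] == -1:
            labels[i] = k
            for j in range(i + 1, len(pts)):
                if labels[j] == -1 and dist2(pts[i], pts[j]) < eps * eps:
                    labels[j] = k
            k += 1
    return labels
```

Running the loop longer makes more clusters coalesce, so the step budget effectively controls the number of clusters, as discussed in the text.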
Typically, the parameter ε, which specifies how close two observations must be for us to assume they are "identical", is a fraction of the kernel bandwidth, e.g., ε = 0.1h. It is very important to note that, if the GBMS algorithm is run to convergence, all the observations converge to the same point! Therefore, the algorithm must be stopped before convergence, which is signified by the condition STOP in the algorithm's body. If the algorithm is stopped earlier, we obtain more clusters than if we run it for more iterations, because at each iteration
some clusters may coalesce. Hence, the number of clusters can be controlled by the number of iterations we let the algorithm run. The GBMS algorithm converges very fast; typically one can expect to find clusters after as few as six iterations (of course, if one waits longer, one will find fewer and larger clusters). A more detailed analysis of GBMS and of the stopping condition is in (23). Simple Mean Shift and GBMS do not produce the same clustering result, but both have been tested and found to give good results in practice. As with the level set methods, mean shift methods also depend, non-critically, on the kernel bandwidth h. We assume that this parameter was selected according to the standard methods for bandwidth selection for KDE. As is known from KDE theory, a small h leads to many peaks in the KDE, and therefore to many small clusters, while a large h has the opposite effect.

2.5. Dirichlet Process Mixture Models Clustering
Dirichlet process mixture models (DPMM) are similar to the mixture models in Section 2.2, with the difference that the number of clusters K is unspecified, and the algorithm returns a number of clusters that varies depending on (1) a model parameter α that controls the probability of creating a new cluster and (2) the number of observations n. The number of clusters grows slowly with n. So, this method is parametric, in the sense that the shapes of the clusters are given (e.g., Gaussian round or elliptic), but it is nonparametric because the number of clusters depends on the observed data. The resulting cluster sizes are unequal, typically with a few large clusters and a large number of small and very small clusters, including single-observation clusters. DPMM models have registered many successes in recent years, especially because they are very flexible and can model outliers elegantly (by assigning them to single-observation clusters). Clustering data by DPMM is done via Markov Chain Monte Carlo, and the reader is referred to (24) for more details.

3. Pairwise (Dis)Similarity Clustering
In similarity- or dissimilarity-based clustering, the user has a real-valued similarity or dissimilarity measure for each pair of observations xi, xj. There are (n choose 2) = n(n − 1)/2 pairs of observations, but it is more common to store the (dis)similarities in an n × n symmetric matrix (i.e., the row 1, col 2 element equals the row 2, col 1 element). Most pairwise clustering methods will accept this matrix (or some function of it) as an input. We first introduce some notation specific to Section 3. The similarity between two observations xi, xj can be written s(xi, xj), where s() is a function denoting proximity or closeness. A large
An Overview of Clustering Applied to Molecular Biology
value of s(xi, xj) indicates that two observations are very similar; a small value indicates very different observations. These similarities are then stored in a matrix S of dimension n by n. The element sij is the similarity between xi and xj; sii is the similarity between an observation xi and itself. However, because a natural inclination is to associate the term "distance" with measuring proximity, large values are often intuitively assigned to objects that are "far apart." For this reason, we often use dissimilarity between two observations instead as a more natural measure of closeness. The dissimilarity between two observations xi and xj is then d(xi, xj). Small values of d(xi, xj) indicate "close" or similar observations; large d(xi, xj) values correspond to observations that are "far apart" (or very different) from each other. We can store the dissimilarities in an n by n matrix D, where dij is the dissimilarity between xi and xj; dii is the dissimilarity between an observation xi and itself. The relationship between a similarity measure and a dissimilarity measure could be described as inverse: if one were to order all pairs of observations by their proximity, the most similar pair of observations would be the least dissimilar pair. We can correspondingly denote the similarity and dissimilarity between clusters Ck and Cl as s(Ck, Cl) and d(Ck, Cl), which allows us to order and/or merge clusters hierarchically. The corresponding cluster (dis)similarity matrices SC and DC are of dimension K by K.

3.1. Measuring Similarity/Dissimilarity
The most common measure of dissimilarity between two observations is Euclidean distance, i.e.,

d(x_i, x_j) = \sqrt{\sum_{l=1}^{p} (x_{il} - x_{jl})^2} = \|x_i - x_j\|.

Here the off-diagonal values of D are non-negative and the diagonal values of D are zero (since the Euclidean distance between an observation and itself is zero). This measure is commonly the default dissimilarity measure in clustering algorithms (e.g., K-means, hierarchical clustering) in statistical software packages. However, there are several other ways to indicate the dissimilarity between two observations. We motivate a few commonly used ones here (5, 25, 26). The Manhattan, or "city-block", distance is the sum of the absolute values of the differences over the p attributes, i.e.,

d(x_i, x_j) = \sum_{l=1}^{p} |x_{il} - x_{jl}|.

Again, observations that are further apart have higher values of d(); the diagonal values of the corresponding D matrix are zero. Note that both the Euclidean distance and the Manhattan distance are special cases of the Minkowski distance, where for r ≥ 1,

d_r(x_i, x_j) = \left( \sum_{l=1}^{p} |x_{il} - x_{jl}|^r \right)^{1/r}.   [9]
A more detailed discussion of different distances in this vein can be found in (5).
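These distances are straightforward to compute directly. The sketch below (illustrative Python, not from the original text) implements the Euclidean, Manhattan, and general Minkowski distances and checks that the first two are the r = 2 and r = 1 special cases.

```python
import math

def minkowski(xi, xj, r):
    """Minkowski distance of order r >= 1 between two equal-length vectors."""
    return sum(abs(a - b) ** r for a, b in zip(xi, xj)) ** (1.0 / r)

def euclidean(xi, xj):
    """Euclidean distance: the r = 2 special case of the Minkowski distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def manhattan(xi, xj):
    """Manhattan ("city-block") distance: the r = 1 special case."""
    return sum(abs(a - b) for a, b in zip(xi, xj))
```

For xi = (0, 0) and xj = (3, 4), for example, the Euclidean distance is 5 while the Manhattan distance is 7; filling an n × n matrix D with any of these functions gives the input expected by the pairwise clustering methods of this section.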
Nugent and Meila
In spectral clustering (11), it is common to transform a dissimilarity measure into a similarity measure using a formula akin to s(x_i, x_j) = e^{−g(d(x_i, x_j))}. The user needs to choose the particular function g() to transform the distance between observations; this choice can depend on the application and the algorithm. One can also measure the two observations' correlation r(x_i, x_j) over the p variables,

r(x_i, x_j) = \frac{\sum_{l=1}^{p} (x_{il} - \bar{x}_i)(x_{jl} - \bar{x}_j)}{\sqrt{\sum_{l=1}^{p} (x_{il} - \bar{x}_i)^2 \sum_{l=1}^{p} (x_{jl} - \bar{x}_j)^2}},   [10]

where \bar{x}_i = \frac{1}{p} \sum_{l=1}^{p} x_{il}, the average x_{il} value over the p variables. A correlation of 1 would indicate perfect agreement (or two observations that have zero dissimilarity). Here we are clustering based on similarity; the corresponding S matrix has the value 1 along the diagonal.

In Section 2.3, we described nonparametric methods that assign observations to a mode of a density. With this view, we might also associate closeness with the contours of the density between two observations. One choice might be

s(x_i, x_j) = \min_{t \in [0,1]} f(t \cdot x_i + (1 - t) \cdot x_j),   [11]

or the minimum of the density along the line segment connecting the two observations (19). Two high-density observations in the same mode then have a high similarity; two observations in different modes whose connecting line segment passes through a valley are dissimilar. Measuring the similarity as a function of a connecting path is also done in diffusion maps, where observations surrounded by many close observations correspond to a high number of short connecting paths between observations and so a high similarity (27).

Regardless of the choice of (dis)similarity measure, if we suspect that variables with very different scales may unduly influence the measure's value (and so affect the clustering results), we can weight the different attributes appropriately. For example, if the variance of one attribute is much larger than that of another, small differences in the first attribute may contribute a similar amount to the chosen measure as large differences in the second attribute. One choice would be to standardize using the Karl Pearson distance:

d(x_i, x_j) = \sqrt{\sum_{l=1}^{p} \frac{(x_{il} - x_{jl})^2}{s_l^2}},   [12]
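As an illustration, the correlation similarity of Equation [10] can be computed as in the Python sketch below (not part of the original chapter): two observations related by an exact linear trend get correlation 1, two with opposite trends get −1.

```python
import math

def correlation(xi, xj):
    """Pearson correlation r(xi, xj) of two observations across their p variables."""
    p = len(xi)
    mi = sum(xi) / p
    mj = sum(xj) / p
    num = sum((a - mi) * (b - mj) for a, b in zip(xi, xj))
    den = math.sqrt(sum((a - mi) ** 2 for a in xi) * sum((b - mj) ** 2 for b in xj))
    return num / den
```

For example, xj = 2·xi + 3 gives correlation exactly 1, so storing 1 − r(xi, xj) would be one way to turn this similarity into a dissimilarity.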
where s_l^2 is the variance of the lth variable. This distance is invariant under a change of scale of the variables. Another common weighted distance is the Mahalanobis distance, which effectively weights the variables by their covariance matrix Σ:

d(x_i, x_j) = \sqrt{(x_i - x_j)^{\top} \Sigma^{-1} (x_i - x_j)}.   [13]
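To see the scale invariance concretely, here is an illustrative Python sketch of the Karl Pearson distance of Equation [12] (not from the original text); multiplying a variable by a constant rescales its variance by the square of that constant, so the standardized distance is unchanged. (The Mahalanobis distance with a diagonal Σ reduces to this same formula.)

```python
import math
from statistics import pvariance

def pearson_distance(xi, xj, data):
    """Karl Pearson distance: squared differences weighted by 1/variance,
    with the variance of each variable estimated column-wise from `data`
    (a list of observations)."""
    p = len(xi)
    variances = [pvariance([row[l] for row in data]) for l in range(p)]
    return math.sqrt(sum((xi[l] - xj[l]) ** 2 / variances[l] for l in range(p)))
```

Rescaling the first variable by a factor of 100 leaves the distance between any two observations unchanged, which is exactly why this weighting protects the clustering from arbitrary measurement units.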
Prior to clustering, the user should identify the appropriate (dis)similarity measure and/or distance for the particular application. In addition, attention should be paid to whether or not to scale the variables prior to finding and clustering the (dis)similarity matrix.

3.2. Hierarchical Clustering
Given a pairwise dissimilarity measure, hierarchical linkage clustering algorithms (5, 12–14) "link up" groups in order of closeness to form a tree structure (or dendrogram) from which we extract a cluster solution. Euclidean distance is most commonly used as the dissimilarity measure but is not required. There are two types of hierarchical clustering algorithms: divisive and agglomerative. In divisive clustering, a "top-down" approach is taken: all observations start as one group (cluster); we break the one group into two groups according to a criterion and continue breaking apart groups until we have n groups, one for each observation. Agglomerative clustering, the "bottom-up" approach, is more common and the focus of this section. We start with all n observations as groups, merge two groups together according to the criterion, and continue merging until we have one group of n observations. The merge criterion is based on a continually updated inter-group (cluster) dissimilarity matrix DC, where d(Ck, Cl) is the current inter-group distance between clusters Ck, Cl. The exact formulation of d(Ck, Cl) depends on the choice of linkage method. The basic algorithm is as follows:

HIERARCHICAL AGGLOMERATIVE CLUSTERING Algorithm
Input: A dissimilarity matrix D of dimension n by n; the choice of linkage method.
• All observations start as their own group; D is the inter-group dissimilarity matrix.
• Iterate until just one group contains all observations:
1. Merge the closest two groups.
2. Update the inter-group dissimilarities.
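The loop above can be sketched in a few lines of Python (illustrative only; a naive O(n³) implementation, with the linkage choice handled by taking the minimum or maximum of the pairwise dissimilarities between two groups):

```python
def agglomerative(D, K, linkage="single"):
    """Naive agglomerative clustering on an n-by-n dissimilarity matrix D
    (list of lists). Merges the two closest groups until K groups remain;
    returns the groups as lists of observation indices."""
    groups = [[i] for i in range(len(D))]
    agg = min if linkage == "single" else max  # complete linkage uses max

    def group_dist(g1, g2):
        # inter-group dissimilarity under the chosen linkage
        return agg(D[i][j] for i in g1 for j in g2)

    while len(groups) > K:
        # find and merge the closest pair of groups
        a, b = min(
            ((a, b) for a in range(len(groups)) for b in range(a + 1, len(groups))),
            key=lambda ab: group_dist(groups[ab[0]], groups[ab[1]]),
        )
        groups[a] = groups[a] + groups[b]
        del groups[b]
    return groups
```

Recording the inter-group dissimilarity at each merge would give the heights of the dendrogram; stopping when K groups remain corresponds to cutting that tree at K branches.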
The algorithm requires specifying a priori how to define the dissimilarity between two groups, i.e., the linkage method. Three commonly
used linkage methods are single linkage, complete linkage, and average linkage. Single linkage defines the dissimilarity between two groups as the smallest dissimilarity between a pair of observations, one from each group, i.e., for Euclidean distance,

d(C_k, C_l) = \min_{i \in C_k, j \in C_l} d(x_i, x_j) = \min_{i \in C_k, j \in C_l} \|x_i - x_j\|.   [14]

It is characterized by a "chaining" effect and tends to walk through the data, linking up each observation to its nearest neighbor without regard to the overall cluster shape. Complete linkage defines the dissimilarity between two groups as the largest dissimilarity between a pair of observations, one from each group, i.e., for Euclidean distance,

d(C_k, C_l) = \max_{i \in C_k, j \in C_l} d(x_i, x_j) = \max_{i \in C_k, j \in C_l} \|x_i - x_j\|.   [15]

It tends to partition the data into spherical shapes. Average linkage then defines the dissimilarity between two groups as the average dissimilarity over all pairs of observations, one from each group. Other linkage methods include Ward's, median, and centroid (5).

The order in which the groups are linked up is represented in a tree structure called a dendrogram. Two groups are merged at the tree height that corresponds to the inter-group dissimilarity between them. Very similar groups are then connected at very low heights; dissimilar groups are connected toward the top of the tree. Once the tree is constructed, we extract K clusters by cutting it at the height corresponding to K branches; any cluster solution with K = 1, 2, ..., n is possible. Choosing K here is a subjective decision; usually we use the dendrogram as a guide and look for natural groupings within the data, i.e., groups of observations where the observations themselves are connected to each other at very low heights while their groups are connected at taller heights (which would correspond to tight, well-separated clusters).

If our groups are well-separated, linkage methods usually show nice separation. For example, Fig. 12.11 has the results for single linkage for the four well-separated groups example using Euclidean distance as the dissimilarity measure. The dendrogram in Fig. 12.11a shows evidence of four groups, since there are four groups of observations joined at low heights that are merged to each other at greater heights. The two spherical groups are on the left of the dendrogram; the two curvilinear groups are on the right. Note that the order in which the observations are plotted on the dendrogram is chosen simply for illustrative purposes (i.e., the two spherical groups could have been plotted on the right side of
Fig. 12.11. (a) Single linkage dendrogram; (b) extracted clusters: K = 4.

Fig. 12.12. (a) Complete linkage dendrogram; (b) extracted clusters: K = 4.
the dendrogram as well). Figure 12.11b contains the clustering results if we cut the tree to extract four branches (around height = 0.08). In this case, we recover the true clusters.

Figure 12.12 contains the corresponding results for complete linkage. Note that the four groups are not as well-separated in the dendrogram; that is, the differences between the group merge heights and the observation merge heights are not as large, a direct result of the curvilinear groups. (Recall that complete linkage tends to partition the observations into spheres.) Again, we recover the true clusters when we cut the tree into four branches.

The ease with which we extracted the true clusters from the example data may be misleading. When group separation is not obvious, it can be very difficult to determine the number of clusters in the observations. Moreover, groups of observations with high variance or the presence of noise in the data can increase this difficulty. Figure 12.13 contains the single and complete linkage dendrograms for the flow cytometry data. Figure 12.13a exhibits the chaining commonly associated with single linkage and no obvious cluster structure. The complete linkage results in Fig. 12.13b are more promising; however, it is not clear where to cut the tree. Figure 12.14 contrasts two possible complete linkage solutions for K = 4, 5. Note that the observations have been partitioned into "spheres" or symmetric shapes. When faced with an unclear dendrogram, one solution might be to try several different possible values of K and determine the best solution according to another criterion.

In similarity-based clustering, we have the analog of parametric and nonparametric clustering in spectral clustering and affinity propagation.

Fig. 12.13. (a) Single linkage dendrogram; (b) complete linkage dendrogram.

3.3. Spectral Clustering
Spectral clustering methods use the eigenvectors and eigenvalues of a matrix obtained from the similarity matrix S in order to cluster the data. They have an elegant mathematical formulation and have been found to work well when the similarities reflect the structure of the data well and when there are not too many clusters of roughly equal sizes. Spectral clustering can also tolerate a very small number of outliers.
Fig. 12.14. (a) Complete linkage K = 4; (b) Complete linkage K = 5. (Axes: Anti-BrdU FITC vs. 7-AAD.)
Another great advantage of spectral clustering, especially for data with continuous features, is that the cluster shapes can be arbitrary. We exemplify spectral clustering with a simple algorithm, based on (28) and (29).
SPECTRAL CLUSTERING Algorithm
Input: Similarity matrix S, number of clusters K.
1. Transform S: let D_i = \sum_{j=1}^{n} S_{ij} for i = 1:n, and set P_{ij} ← S_{ij}/D_i for i, j = 1:n. This forms the transition matrix P = [P_{ij}]_{i,j=1}^{n}.
2. Compute the largest K eigenvalues λ1 = 1 ≥ λ2 ≥ ... ≥ λK and eigenvectors v1, ..., vK of P.
3. Form the matrix V = [v2 v3 ... vK] with n rows and K − 1 columns. Let x_i denote the i-th row of V. The n vectors x_i are a new representation of the observations, and these will be clustered by K-means.
4. Find K initial centers in the following way:
(a) take c1 randomly from x1, ..., xn;
(b) for k = 2, ..., K, set c_k = \mathrm{argmin}_{x_i} \max_{k' < k} x_i^{\top} c_{k'}.
5. Run the K-means algorithm on the "observations" x_{1:n} starting from the centers c_{1:K}.
6. Return the indices i of the x_i in each cluster.
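Step 1 is simply a row normalization of the similarity matrix. A small illustrative Python sketch (not the chapter's code):

```python
def transition_matrix(S):
    """Row-normalize a similarity matrix S (list of lists) into the
    transition matrix P of step 1: P[i][j] = S[i][j] / sum_j S[i][j]."""
    P = []
    for row in S:
        d = sum(row)  # the degree D_i of observation i
        P.append([s / d for s in row])
    return P
```

Each row of P sums to 1, so P is the transition matrix of a Markov chain over the observations; steps 2–3 then pass its top eigenvectors (computed with any standard eigensolver) to K-means.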
Spectral clustering has connections with the theory of Markov chains (28, 30) and also with K-means (31). Essentially, the algorithm works in two stages: one in which the items i to be clustered are mapped, by the eigenvector calculation, to the (K − 1)-dimensional vectors x_i, and a second one, where the effective clustering takes place, via an attribute-based clustering algorithm, usually K-means. One important problem is defining the similarity function s(x_i, x_j), and some authors have constructed methods for estimating this function automatically from data (32–34).

3.4. Affinity Propagation
Affinity Propagation (AP) (35) is to spectral clustering what Mean Shift algorithms are to K-means. The idea is to let each item i find an exemplar item k to "represent" it. When the algorithm terminates, all the items represented by the same k form a cluster. The number of exemplars is not fixed in advance but depends on the data and on a parameter of the algorithm. Besides the similarities s_{ik}, the algorithm also maintains two other quantities for each pair of items i, k: the availability a_{ik} of item k as an exemplar for i (how much support there is from other items for k to be an exemplar) and the responsibility r_{ik}, which measures how fit k is to represent i, as compared to other possible candidates k'.

AFFINITY PROPAGATION Algorithm
Input: Similarity matrix S = [s_{ik}]_{i,k=1}^{n}, parameter λ = 0.5.
Initialize a_{ik} ← 0 for i, k = 1:n, then iterate the following steps until convergence (or until the assignments k(i) do not change):
1. For all i:
(a) find the best exemplar for i: s* ← max_k (s_{ik} + a_{ik}), A_i* ← argmax_k (s_{ik} + a_{ik}) (can be a set of items);
(b) for all k, update the responsibilities:
r_{ik} ← s_{ik} − s* if k ∉ A_i*; otherwise r_{ik} ← s_{ik} − max_{k' ∉ A_i*} (s_{ik'} + a_{ik'}).
2. For all k, update the availabilities:
(a) a_{kk} ← \sum_{i ≠ k} [r_{ik}]_+, where [r]_+ = r if r > 0 and 0 otherwise;
(b) for all i ≠ k, a_{ik} ← min{0, r_{kk} + \sum_{i' ∉ \{i,k\}} [r_{i'k}]_+}.
3. Assign an exemplar to each i by k(i) ← argmax_k (r_{ik} + a_{ik}).
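A compact Python sketch of these message updates follows (illustrative only, not the chapter's code; it applies the usual damping with the parameter λ to both message sets, and the preference values s_{ii} are supplied by the caller on the diagonal of S):

```python
def affinity_propagation(S, lam=0.5, iters=200):
    """Affinity propagation sketch on an n-by-n similarity matrix S
    (list of lists; the diagonal entries are the preferences s_ii).
    Returns the exemplar index k(i) chosen for each item i."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r_ik
    A = [[0.0] * n for _ in range(n)]  # availabilities  a_ik
    for _ in range(iters):
        # responsibility update: r_ik = s_ik - max_{k' != k}(s_ik' + a_ik')
        for i in range(n):
            vals = [S[i][k] + A[i][k] for k in range(n)]
            best = max(vals)
            kbest = vals.index(best)
            second = max(v for k, v in enumerate(vals) if k != kbest)
            for k in range(n):
                target = S[i][k] - (second if k == kbest else best)
                R[i][k] = lam * R[i][k] + (1 - lam) * target  # damped update
        # availability update
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            for i in range(n):
                if i == k:
                    target = sum(pos) - pos[k]
                else:
                    target = min(0.0, R[k][k] + sum(pos) - pos[i] - pos[k])
                A[i][k] = lam * A[i][k] + (1 - lam) * target
    # exemplar assignment: k(i) = argmax_k (r_ik + a_ik)
    return [max(range(n), key=lambda k: R[i][k] + A[i][k]) for i in range(n)]
```

With negative squared distances as similarities and a moderately negative common preference on the diagonal, two well-separated groups of points end up sharing two exemplars, i.e., two clusters.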
The diagonal elements of S, representing self-similarities, have a special meaning for AP. The larger s_{ii}, the more likely i is to become an exemplar. If all s_{ii} values are large, then there will be many small clusters; if the s_{ii} values are decreased, then items will tend to group into large clusters. Hence the diagonal values s_{ii} are a way of controlling the granularity of the clustering, or equivalently the number of clusters K, and can be set by the user accordingly.
4. Biclustering

A more recent topic of interest is incorporating variable selection into clustering. For example, if there are three variables and only the first two contain clustering information (and, say, the third variable is just random noise), the clustering may be improved if we only cluster on the first two variables. Moreover, it may be that different groups of observations cluster on different variables. That is, the variables that separate or characterize a subset of observations might change depending on the subset. In this section, we briefly describe a biclustering algorithm (36) that simultaneously selects variables and clusters observations. This particular approach is motivated by the analysis of microarray gene expression data, where we often have a number of genes n and a number of experimental conditions p. We would like to identify "coherent subsets" of co-regulated genes that show similarity under a subset of conditions. In addition, in contrast to other methods presented in this chapter, we allow overlapping clusters. Define a bicluster B_k as a subset of n_k observations X_k ⊂ X = {x1, x2, x3, ..., xn} and P_k, a subset of p_k variables from the original p variables. B_k is then represented by an attribute matrix of dimension n_k × p_k. Define the residue of an element b_ij in this bicluster matrix B_k as
res_ij = b_ij − (1/p_k) Σ_{j′=1}^{p_k} b_ij′ − (1/n_k) Σ_{i′=1}^{n_k} b_i′j + (1/(n_k · p_k)) Σ_{i′=1}^{n_k} Σ_{j′=1}^{p_k} b_i′j′.    [16]
This formula equals the element value minus its bicluster row mean, minus its bicluster column mean, plus the overall bicluster mean. In the gene expression example, b_ij might be the logarithm of the relative abundance of the mRNA of gene i under a specific condition j.
Nugent and Meila
We can measure a bicluster by its mean squared residue score H(X_k, P_k):

H(X_k, P_k) = (1/(n_k · p_k)) Σ_{i∈X_k, j∈P_k} (res_ij)².    [17]
We would like to find a bicluster of maximal size with a low mean squared residue. A bicluster B_k is called a δ-bicluster if H(X_k, P_k) < δ, where δ is some amount of variation that we are willing to see in the bicluster. Note that the lowest possible mean squared residue score of H(X_k, P_k) = 0 corresponds to a bicluster in which all the gene expression levels fluctuate in unison. Trivial biclusters of one element (one observation, one variable) also have a zero mean squared residue score and so are biclusters for all values of δ. In general, we do a greedy search over the observations and the variables to find the bicluster with the lowest H() score. Briefly, the algorithm is as follows:

BICLUSTERING Algorithm
Input: an attribute matrix X of dimension n by p; the number of biclusters K; δ
• for k in 1:K
1. H(X_k, P_k) = H(X, P), the mean squared residue score for all observations, all variables; if H(X_k, P_k) < δ, go to 4.
2. While H(X_k, P_k) ≥ δ:
   (a) for all i ∈ X_k, compute each row's contribution to H(X_k, P_k):
       d(i) = (1/p_k) Σ_{j∈P_k} (res_ij)²;
       for all j ∈ P_k, compute each column's contribution to H(X_k, P_k):
       d(j) = (1/n_k) Σ_{i∈X_k} (res_ij)².
   (b) remove the row or column that corresponds to the largest d(), i.e., the largest decrease in H(X_k, P_k).
   (c) update X_k, P_k, H(X_k, P_k).
3. While H(X_k, P_k) < δ:
   (a) for all i ∉ X_k, compute each row's possible contribution to H(X_k, P_k):
       d(i) = (1/p_k) Σ_{j∈P_k} (res_ij)²;
       for all j ∉ P_k, compute each column's possible contribution to H(X_k, P_k):
       d(j) = (1/n_k) Σ_{i∈X_k} (res_ij)².
   (b) add the row/column with the largest d() that satisfies H(X_k, P_k) + d() < δ. If nothing can be added, return the current X_k, P_k; else, update X_k, P_k, H(X_k, P_k).
4. "Mask" the final bicluster in the matrix X.
The algorithm essentially first removes observations and/or variables that are not part of a coherent subset, and then, once the mean squared residue score is below the threshold, cycles back through the discarded observations and variables to check if any could be added back to the bicluster (looking for maximal size). It is a greedy algorithm and can be computationally expensive. Alternative algorithms are presented in (36) to handle simultaneous deletion/addition of multiple observations and/or variables. Once a bicluster has been found, we “mask” it in the original matrix to prevent its rediscovery; one suggested technique is to replace the bicluster elements in the matrix with random numbers. There are several other clustering methods that simultaneously cluster observations and variables including Plaid Models (37), Clustering Objects on Subsets of Attributes (38), and Variable Selection in Model-Based Clustering (39). The user should determine a priori the appropriate algorithm for the application.
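As an illustration of the deletion phase (step 2), the Python sketch below (illustrative function names, not code from ref. (36); the chapter's own snippets are in R) computes the mean squared residue of Eqs. [16]–[17] and greedily drops the row or column with the largest contribution until the score falls below δ:

```python
import numpy as np

def msr(B):
    """Mean squared residue H (Eq. [17]); res_ij is the element minus its
    row mean, minus its column mean, plus the overall mean (Eq. [16])."""
    res = B - B.mean(axis=1, keepdims=True) - B.mean(axis=0, keepdims=True) + B.mean()
    return float(np.mean(res ** 2))

def delete_phase(X, delta):
    """Greedy single-node deletion: while H >= delta, drop the row or
    column with the largest contribution d() to the score."""
    rows, cols = list(range(X.shape[0])), list(range(X.shape[1]))
    while msr(X[np.ix_(rows, cols)]) >= delta and len(rows) > 1 and len(cols) > 1:
        B = X[np.ix_(rows, cols)]
        res = B - B.mean(axis=1, keepdims=True) - B.mean(axis=0, keepdims=True) + B.mean()
        d_rows = np.mean(res ** 2, axis=1)   # d(i): per-row contribution
        d_cols = np.mean(res ** 2, axis=0)   # d(j): per-column contribution
        if d_rows.max() >= d_cols.max():
            rows.pop(int(np.argmax(d_rows)))
        else:
            cols.pop(int(np.argmax(d_cols)))
    return rows, cols

# 5 x 5 additive matrix b_ij = i + j (residue exactly 0), then corrupt one row
X = np.add.outer(np.arange(5.0), np.arange(5.0))
X[4, :] = np.random.default_rng(0).normal(0, 5, size=5)
rows, cols = delete_phase(X, delta=1e-6)
print(rows)  # the corrupted row 4 has been removed
```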
5. Comparing Clusterings

To assess the performance of a clustering algorithm by comparing its output to a given "correct" clustering, one needs to define a "distance" on the space of partitions of a data set. Distances between clusterings are rarely used alone. For instance, one may use a "distance" d to compare clustering algorithms. A likely scenario is that one has a data set D with a given "correct" clustering. Algorithm A is used to cluster D, and the resulting clustering
is compared to the correct one via d. If the algorithm A is not completely deterministic (e.g., the result may depend on initial conditions, as in K-means), the operation may be repeated several times, and the resulting distances to the correct clustering may be averaged to yield the algorithm's average performance. Moreover, this average may be compared to another average distance obtained in the same way for another algorithm A′. Thus, in practice, distances between clusterings are subject to addition, subtraction, and even more complex operations. As such, we want a clustering comparison criterion that will license such operations, inasmuch as they make sense in the context of the application. Virtually all criteria for comparing clusterings can be described using the so-called confusion matrix, association matrix, or contingency table of the pair C, C′. The contingency table is a K × K′ matrix whose (k, k′)th element is the number of observations in the intersection of clusters C_k of C and C′_k′ of C′:

n_kk′ = |C_k ∩ C′_k′|.
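Computing the contingency table from two cluster-label vectors is direct; a small Python sketch (the toy labels are illustrative):

```python
import numpy as np

def contingency(labels_c, labels_cprime):
    """K x K' matrix with n_kk' = |C_k intersect C'_k'|, computed from
    two cluster-label vectors over the same n observations."""
    ks = sorted(set(labels_c))
    kps = sorted(set(labels_cprime))
    N = np.zeros((len(ks), len(kps)), dtype=int)
    for c, cp in zip(labels_c, labels_cprime):
        N[ks.index(c), kps.index(cp)] += 1
    return N

C = [0, 0, 0, 1, 1, 1]    # clustering C: two clusters of three
Cp = [0, 0, 1, 1, 1, 1]   # clustering C': one observation moved
print(contingency(C, Cp))
# [[2 1]
#  [0 3]]
```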
5.1. Comparing Clusterings by Counting Pairs
An important class of criteria for comparing clusterings is based on counting the pairs of observations on which two clusterings agree/disagree. A pair of observations from D can fall under one of the four cases described below:
N11: number of pairs that are in the same cluster under both C and C′
N00: number of pairs in different clusters under both C and C′
N10: number of pairs in the same cluster under C but not under C′
N01: number of pairs in the same cluster under C′ but not under C
The four counts always satisfy N11 + N00 + N10 + N01 = n(n − 1)/2. They can be obtained from the contingency table [n_kk′]. For example, 2N11 = Σ_{k,k′} n_kk′² − n. See (40) for details.
Fowlkes and Mallows (40) introduced a criterion which is symmetric and is the geometric mean of the probabilities that a pair of observations which are in the same cluster under C (respectively, C′) are also in the same cluster under the other clustering:
F(C, C′) = √[ N11 / (Σ_k n_k(n_k − 1)/2) · N11 / (Σ_k′ n′_k′(n′_k′ − 1)/2) ].    [18]
It can be shown that this index represents a scalar product (41).
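A brute-force Python sketch of the pair counts and of Eq. [18] (quadratic in n, so for illustration only; the chapter's own code is in R):

```python
import numpy as np
from itertools import combinations

def pair_counts(a, b):
    """N11, N00, N10, N01 over all n(n-1)/2 pairs, for label vectors a, b."""
    N11 = N00 = N10 = N01 = 0
    for i, j in combinations(range(len(a)), 2):
        same_a, same_b = a[i] == a[j], b[i] == b[j]
        if same_a and same_b:
            N11 += 1
        elif same_a:
            N10 += 1
        elif same_b:
            N01 += 1
        else:
            N00 += 1
    return N11, N00, N10, N01

def fowlkes_mallows(a, b):
    """F of Eq. [18]; the two denominators (pairs together in C and in C')
    equal N11 + N10 and N11 + N01, respectively."""
    N11, _, N10, N01 = pair_counts(a, b)
    return N11 / np.sqrt((N11 + N10) * (N11 + N01))

a = [0, 0, 0, 1, 1, 1]
b = [0, 0, 1, 1, 1, 1]
print(pair_counts(a, b))                # (4, 6, 2, 3)
print(round(fowlkes_mallows(a, b), 3))  # 0.617
```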
The Fowlkes–Mallows index F has a baseline that is the expected value of the criterion under a null hypothesis corresponding to “independent” clusterings (40). The index is used by subtracting the baseline and normalizing by the range, so that the expected value of the normalized index is 0 while the maximum (attained for identical clusterings) is 1. Note that some pairs of clusterings may theoretically result in negative indices under this normalization. A similar transformation was introduced by (42) for Rand’s index (43)
R(C, C′) = (N11 + N00) / (n(n − 1)/2).    [19]
The resulting adjusted Rand index has the expression
AR(C, C′) = (R(C, C′) − E[R]) / (1 − E[R])
= [ Σ_{k,k′} C(n_kk′, 2) − Σ_k C(n_k, 2) Σ_{k′} C(n′_k′, 2) / C(n, 2) ] / [ ½ ( Σ_k C(n_k, 2) + Σ_{k′} C(n′_k′, 2) ) − Σ_k C(n_k, 2) Σ_{k′} C(n′_k′, 2) / C(n, 2) ],    [20]

where C(m, 2) = m(m − 1)/2 denotes the binomial coefficient.
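Equation [20] can be evaluated directly from the contingency-table margins; an illustrative Python sketch (which can be checked against a library implementation such as scikit-learn's adjusted_rand_score):

```python
from collections import Counter
from math import comb

def adjusted_rand(a, b):
    """Adjusted Rand index, Eq. [20]: (R - E[R]) / (1 - E[R]) rewritten in
    terms of binomial coefficients of the contingency table and its margins."""
    n = len(a)
    nkk = Counter(zip(a, b))   # cells n_kk'
    nk = Counter(a)            # row sums n_k
    nkp = Counter(b)           # column sums n'_k'
    sum_cells = sum(comb(v, 2) for v in nkk.values())
    sum_rows = sum(comb(v, 2) for v in nk.values())
    sum_cols = sum(comb(v, 2) for v in nkp.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)

print(adjusted_rand([0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 1]))            # 1.0
print(round(adjusted_rand([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]), 3))  # 0.324
```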
The main motivation for adjusting indices like R and F is the observation that the unadjusted R, F do not range over the entire [0, 1] interval (i.e., min R > 0, min F > 0). In practice, the R index concentrates in a small interval near 1; this situation was well illustrated by (40). The use of adjusted indices is not without problems. First, some researchers (44) have expressed concerns about the plausibility of the null model. Second, the value of the baseline for F varies sharply, from near 0.6 to near 0, for n/K > 3. The useful range of the criterion thus varies from approximately [0, 1] to approximately [0.6, 1] (40). The adjusted Rand index baseline, as shown by the simulations of (40), varies even more: 0.5–0.95. This variation makes these criteria hard to interpret – one needs more information than the index value alone to decide whether two clusterings are close or far apart.

5.2. Comparing Clusterings by Set Matching
A second category of criteria is based on set cardinality alone and does not make any assumptions about how the clusterings may have been generated. The misclassification error criterion is widely used in the engineering and computer science literature. It represents the probability of the labels disagreeing on an observation, under the best possible label correspondence. Intuitively, one first finds a "best match" between the clusters of C and those of C′. If K = K′, then
this is a one-to-one mapping; otherwise, some clusters will be left unmatched. Then, H is computed as the total "unmatched" probability mass in the confusion matrix. More precisely,

H(C, C′) = 1 − (1/n) max_π Σ_{k=1}^{K} n_{k, π(k)}.    [21]
In the above, it is assumed without loss of generality that K ≤ K′, π is a mapping of {1, ..., K} into {1, ..., K′}, and the maximum is taken over all such mappings. In other words, for each π we have a (partial) correspondence between the cluster labels in C and C′; now, viewing clustering as a classification task with the fixed label correspondence, we compute the classification error of C′ with respect to C. The minimum possible classification error under all correspondences is H. The index is symmetric and takes value 0 for identical clusterings. Further properties of this index are discussed in (45, 46).

5.3. Information Theoretic Measure
Imagine the following game: if we were to pick an observation from D, how much uncertainty is there about which cluster it is going to be assigned to? Assuming that each observation has an equal probability of being picked, it is easy to see that the probability of the outcome being in cluster C_k equals

P(k) = n_k / n.    [22]
Thus we have defined a discrete random variable taking K values that is uniquely associated with the clustering C. The uncertainty in our game is equal to the entropy of this random variable

H(C) = − Σ_{k=1}^{K} P(k) log P(k).    [23]
We call H(C) the entropy associated with clustering C. For more details about the information theoretic concepts presented here, the reader is invited to consult (47). Entropy is always non-negative. It takes value 0 only when there is no uncertainty, namely when there is only one cluster. Entropy is measured in bits. An uncertainty of 1 bit corresponds to a clustering with K = 2 and P(1) = P(2) = 0.5. Note that the uncertainty does not depend on the number of observations in D but on the relative proportions of the clusters. We now define the mutual information between two clusterings, that is, the information that one clustering has about the other. Denote by P(k), k = 1, ..., K, and P′(k′), k′ = 1, ..., K′, the distributions of the random variables associated with the clusterings C, C′. Let P(k, k′)
represent the probability that an observation belongs to C_k in clustering C and to C′_k′ in C′, namely, the joint distribution of the random variables associated with the two clusterings:

P(k, k′) = |C_k ∩ C′_k′| / n.    [24]

We define the mutual information I(C, C′) between the clusterings C, C′ to be equal to the mutual information between the associated random variables
I(C, C′) = Σ_{k=1}^{K} Σ_{k′=1}^{K′} P(k, k′) log [ P(k, k′) / (P(k) P′(k′)) ].    [25]
Intuitively, we can think of I(C, C′) in the following way. We are given a random observation in D. The uncertainty about its cluster in C is measured by H(C). Suppose now that we are told which cluster the observation belongs to in C′. How much does this knowledge reduce the uncertainty about C? This reduction in uncertainty, averaged over all observations, is equal to I(C, C′). The mutual information between two random variables is always non-negative and symmetric. The quantity

VI(C, C′) = H(C) + H(C′) − 2 I(C, C′)
= Σ_{k=1}^{K} Σ_{k′=1}^{K′} P(k, k′) log [ P(k) P′(k′) / P(k, k′)² ]    [26]
is called the variation of information (48) between the two clusterings. On closer examination, this is the sum of two non-negative terms:

VI(C, C′) = [H(C) − I(C, C′)] + [H(C′) − I(C, C′)].    [27]

The two terms represent the conditional entropies H(C | C′), H(C′ | C). The first term measures the amount of information about C that we lose, while the second measures the amount of information about C′ that we have to gain, when going from clustering C to clustering C′. The VI satisfies the triangle inequality and has other naturally desirable properties. For instance, if between two clusterings C, C′ some clusters are kept identical and some other clusters are changed, then the VI does not depend on the partitioning of the data in the clusters that are identical between C and C′. Unlike all the previous distances, the VI is not bounded between 0 and 1. If C and C′ have at most K* clusters each, with K* ≤ √n, then VI(C, C′) ≤ 2 log K*.
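The entropies, mutual information, and VI of Eqs. [22]–[27] can all be computed from the joint distribution P(k, k′); an illustrative Python sketch using base-2 logarithms (so the values are in bits):

```python
import numpy as np

def vi(a, b):
    """Variation of information VI = H(C) + H(C') - 2 I(C, C'), computed
    from two cluster-label vectors via the joint distribution P(k, k')."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    ks, kps = np.unique(a), np.unique(b)
    P = np.array([[np.sum((a == k) & (b == kp)) / n for kp in kps] for k in ks])
    Pk, Pkp = P.sum(axis=1), P.sum(axis=0)   # marginals P(k), P'(k')
    H = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))   # entropy, Eq. [23]
    I = sum(P[i, j] * np.log2(P[i, j] / (Pk[i] * Pkp[j])) # mutual info, Eq. [25]
            for i in range(len(ks)) for j in range(len(kps)) if P[i, j] > 0)
    return H(Pk) + H(Pkp) - 2 * I

print(vi([0, 0, 1, 1], [0, 0, 1, 1]))  # 0.0 for identical clusterings
print(vi([0, 0, 1, 1], [0, 1, 0, 1]))  # 2.0: two independent balanced splits
```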
5.4. Comparison Between Criteria
The vast literature on comparing clusterings suggests that criteria like R, F, K, J need to be shifted and rescaled in order to allow their values to be compared. The existing rescaling methods are based on a null model which, although reasonable, is nevertheless artificial. By contrast, the variation of information and the misclassification error make no assumptions about how the clusterings may be generated and require no rescaling to compare values of VI(C, C′) for arbitrary pairs of clusterings of a data set. Moreover, the variation of information and misclassification error do not directly depend on the number of observations in the set. This feature gives much stronger grounds for comparisons across data sets, something we need to do if we want to compare clustering algorithms against each other. Just as one cannot define a "best" clustering method out of context, one cannot define a criterion for comparing clusterings that fits every problem optimally. Here we present a comprehensible picture of the properties of the various criteria, in order to allow a user to make informed decisions.
6. Concluding Remarks

In this chapter, we have given a brief overview of several common clustering methods, both attribute-based and similarity-based. Each clustering method can be characterized by what type of input is needed and what types or shapes of clusters it discovers. Table 12.1 summarizes some of the basic features of the different algorithms. Prior to clustering, the user should think carefully about the application and identify the type of data available (e.g., attributes, similarities), whether or not there are any preconceived notions of the number or shape of the hypothesized clusters, and the computational power required. Clustering is a powerful tool to discover underlying group structure in data; however, careful attention must be paid to the assumptions inherent in the choices of clustering method and any needed parameters. Users should not simply rely on suggested parameter values but determine (and validate) choices that fit their particular application problem.
Table 12.1
Overview of clustering methods

Method                         Type         Chooses K                       Cluster shapes
K-means                        Attribute    User                            Spherical
K-medoids                      Attribute    User                            Spherical
Model-based Clustering         Attribute    Method (user can define range)  Spherical, elliptical
Nonparametric Clustering       Attribute    Method                          No restrictions
Simple Mean Shift              Attribute    Method                          No restrictions
Gaussian Blurring Mean Shift   Attribute    Method                          No restrictions
Dirichlet Mixture Model        Attribute    Method                          Spherical, elliptical
Hierarchical Clustering        Similarity   User                            Linkage dependent
  Single                                                                    No restrictions
  Complete                                                                  Spherical
Spectral Clustering            Similarity   User                            No restrictions
Affinity Propagation           Similarity   Method                          No restrictions
Biclustering                   Attribute    User                            No restrictions (can be overlapping)

7. R Code

The statistical software package R has been used to find the clustering results shown in this chapter. R is a freely available, multiplatform (Windows, Linux, Mac OS) program that is widely used in statistics/quantitative courses. It provides a programming language that can be easily supplemented by the user. It also allows the production of publication-quality graphics. See http://www.r-project.org for more details.

#####R Code Used to Cluster Data and Create Pictures
#####Code included for a generic data set: data
#####For help in R using a function (fxn.name), type help(fxn.name) at the prompt

#####Initially reading the data into R and plotting the data
data<-read.table("data.txt")
plot(data,xlab="x1",ylab="x2",pch=16,cex=1.2,cex.lab=1.5,cex.axis=1.3)

#####Using K-Means to find 4 clusters
km4<-kmeans(data,4)
plot(data,type="n",xlab="x1",ylab="x2",cex.lab=1.5,cex.axis=1.3)
text(data,labels=km4$cluster,col=km4$cluster)
title("K-means: Four Clusters")
##Finding the total within sum-of-squares
sum(km4$withinss)

#####Creating the "Elbow Graph" comparing the total within-SS against the number of clusters
k.vec<-seq(2,21)
withinss.vec<-rep(NA,20)
for(j in 1:20){
  withinss.vec[j]<-sum(kmeans(data,j+1)$withinss)
}
plot(k.vec,withinss.vec,type="b",pch=16,cex=1.2,cex.lab=1.5,cex.axis=1.3,
  xlab="K = No. of Clusters",ylab="Within Cluster Criterion")
title("Choosing the Number of Clusters: The Elbow Graph")

#####Using Model-Based Clustering
#####Need to install/download the mclust library
library(mclust)
mbc1<-EMclust(data,1:12) ##searches over K = 1 to 12
mbc1.sum<-summary(mbc1,data)
##Plotting the Observations by their Clusters
plot(data,xlab="x1",ylab="x2",pch=16,cex=1.2,cex.lab=1.5,cex.axis=1.3,type="n")
text(data,col=mbc1.sum$class,labels=mbc1.sum$class,cex=1.2)
##Plotting the Changes in the BIC by Model Choice and Number of Clusters
plot(mbc1)

#####Nonparametric clustering
#####Here we only include code to find the level set of a 2-dim density
#####For cluster tree implementation, contact Rebecca Nugent
#####http://www.stat.cmu.edu/~rnugent
##To use the Binned Kernel Density Estimate, need to download/install the KernSmooth library
library(KernSmooth)
##Initially finding the density estimate with bandwidth h = 0.0244 over a 20 x 20 grid
##May need to normalize the density estimate if the bins are not of area 1.
densest<-bkde2D(data,c(0.0244,0.0244),c(20,20))
##Creating the Heat Map of the BKDE
image(densest$x1,densest$x2,densest$fhat,xlab="x1",ylab="x2",cex.lab=1.5,cex.axis=1.3)
title("Binned Kernel Density Estimate: 20 x 20 Grid",cex.main=1.5)
##Finding the Level Set at height lambda (set lambda to your height value first)
ls<-densest$fhat
ls[ls<lambda]<-0
ls[ls>=lambda]<-1
image(densest$x1,densest$x2,ls,xlab="x1",ylab="x2",cex.axis=1.3,cex.lab=1.5)
title(expression(paste("Level Set at ",lambda,"= Your Height Value")),cex.main=1.5)

#####Hierarchical Linkage Clustering
##Single Linkage Clustering
hc.1s<-hclust(dist(data),method="single")
##Plotting the Dendrogram
plclust(hc.1s,xlab="Observations",sub="",main="")
##Cutting the Tree to find 4 clusters
cl.1s<-cutree(hc.1s,4)
##Plotting the Observations by their Clusters
plot(data,type="n",xlab="x1",ylab="x2",cex.lab=1.5,cex.axis=1.3)
text(data,labels=cl.1s,col=cl.1s,cex=1.2)
##Complete Linkage Clustering
hc.1c<-hclust(dist(data),method="complete")

#####Spectral Clustering
#####For spectral clustering implementation, please contact Marina Meila
#####http://www.stat.washington.edu/spectral
References

1. Getz, G., Levine, E., and Domany, E. (2000). Coupled two-way clustering analysis of gene microarray data. Proceedings of the National Academy of Sciences, 97(22):12079–12084.
2. Lo, K., Brinkman, R., and Gottardo, R. (2008). Automated gating of flow cytometry data via robust model-based clustering. Cytometry, Part A, 73A:321–332.
3. Gottardo, R. and Lo, K. (2008). flowClust Bioconductor package.
4. Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:129–137.
5. Mardia, K., Kent, J. T., and Bibby, J. M. (1979). Multivariate Analysis. Academic Press, New York.
6. Fraley, C. and Raftery, A. (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal, 41:578–588.
7. McLachlan, G. J. and Basford, K. E. (1988). Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York.
8. Dean, N., Murphy, T. B., and Downey, G. (2006). Using unlabelled data to update classification rules with applications in food authenticity studies. Journal of the Royal Statistical Society: Series C, 55(1):1–14.
9. Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38.
10. Banfield, J. D. and Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49:803–821.
11. Wishart, D. (1969). Mode analysis: A generalization of nearest neighbor which reduces chaining effect. In Cole, A. J. (ed.), Numerical Taxonomy, Academic Press, New York, pp. 282–311.
12. Hartigan, J. A. (1975). Clustering Algorithms. Wiley, New York.
13. Hartigan, J. A. (1981). Consistency of single linkage for high-density clusters. Journal of the American Statistical Association, 76:388–394.
14. Hartigan, J. A. (1985). Statistical theory in clustering. Journal of Classification, 2:63–76.
15. Silverman, B. W. (1981). Using kernel density estimates to investigate multimodality. Journal of the Royal Statistical Society, Series B, 43:97–99.
16. Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman & Hall, New York.
17. Wand, M. P. (1994). Fast computation of multivariate kernel estimators. Journal of Computational and Graphical Statistics, 3:433–445.
18. Stuetzle, W. (2003). Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. Journal of Classification, 20:25–47.
19. Stuetzle, W. and Nugent, R. (2009). A generalized single linkage method for estimating the cluster tree of a density. Journal of Computational and Graphical Statistics, in press. The Fast Track version is at DOI:10.1198/jcgs.2009.070409.
20. Comăniciu, D. and Meer, P. (1999). Mean shift analysis and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17:790–799.
21. Fukunaga, K. and Hostetler, L. D. (1975). The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, IT-21:32–40.
22. Cheng, Y. (1995). Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790–799.
23. Carreira-Perpiñán, M. A. (2007). Gaussian mean shift is an EM algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):767–776.
24. Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249–265.
25. Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York (Springer Series in Statistics).
26. Gan, G., Ma, C., and Wu, J. (2007). Data Clustering: Theory, Algorithms, and Applications. ASA-SIAM Series on Statistics and Applied Probability, SIAM, Philadelphia; ASA, Alexandria, VA.
27. Lafon, S. and Lee, A. (2006). Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9):1393–1403.
28. Meilă, M. and Shi, J. (2001b). A random walks view of spectral segmentation. In Jaakkola, T. and Richardson, T. (eds.), Eighth International Workshop on Artificial Intelligence and Statistics (AISTATS), January 4–7, 2001, Key West, Florida.
29. Ng, A. Y., Jordan, M. I., and Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. In Dietterich, T. G., Becker, S., and Ghahramani, Z. (eds.), Advances in Neural Information Processing Systems 14, MIT Press, Cambridge, MA.
30. Meilă, M. and Shi, J. (2001a). Learning segmentation by random walks. In Leen, T. K., Dietterich, T. G., and Tresp, V. (eds.), Advances in Neural Information Processing Systems, vol. 13, MIT Press, Cambridge, MA, pp. 873–879.
31. Ding, C. and He, X. (2004). K-means clustering via principal component analysis. In Brodley, C. E. (ed.), Proceedings of the International Machine Learning Conference (ICML). Morgan Kaufmann.
32. Bach, F. and Jordan, M. I. (2006). Learning spectral clustering with applications to speech separation. Journal of Machine Learning Research, 7:1963–2001.
33. Meilă, M., Shortreed, S., and Xu, L. (2005). Regularized spectral learning. In Cowell, R. and Ghahramani, Z. (eds.), Proceedings of the Artificial Intelligence and Statistics Workshop (AISTATS 05).
34. Shortreed, S. and Meilă, M. (2005). Unsupervised spectral learning. In Jaakkola, T. and Bacchus, F. (eds.), Proceedings of the 21st Conference on Uncertainty in AI, AUAI Press, Arlington, VA, pp. 534–544.
35. Frey, B. J. and Dueck, D. (2007). Clustering by passing messages between data points. Science, 315:973–976.
36. Cheng, Y. and Church, G. M. (2000). Biclustering of expression data. Proceedings of the International Conference on Intelligent Systems in Molecular Biology, 8:93–103.
37. Lazzeroni, L. and Owen, A. (2000). Plaid models for gene expression data. Statistica Sinica, 12:61–86.
38. Friedman, J. and Meulman, J. (2004). Clustering objects on subsets of attributes. Journal of the Royal Statistical Society, 66:815–849.
39. Raftery, A. E. and Dean, N. (2006). Variable selection for model-based clustering. Journal of the American Statistical Association, 101(473):168–178.
40. Fowlkes, E. B. and Mallows, C. L. (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383):553–569.
41. Ben-Hur, A., Elisseeff, A., and Guyon, I. (2002). A stability based method for discovering structure in clustered data. In Pacific Symposium on Biocomputing, pp. 6–17.
42. Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2:193–218.
43. Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66:846–850.
44. Wallace, D. L. (1983). Comment. Journal of the American Statistical Association, 78(383):569–576.
45. Meilă, M. (2005). Comparing clusterings – an axiomatic view. In Wrobel, S. and De Raedt, L. (eds.), Proceedings of the International Machine Learning Conference (ICML). ACM Press, New York.
46. Steinley, D. L. (2004). Properties of the Hubert–Arabie adjusted Rand index. Psychological Methods, 9(3):386–396. Simulations of some adjusted indices and of misclassification error.
47. Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. Wiley, New York.
48. Meilă, M. (2007). Comparing clusterings – an information based distance. Journal of Multivariate Analysis, 98:873–895.
Chapter 13

Hidden Markov Model and Its Applications in Motif Findings

Jing Wu and Jun Xie

Abstract

Hidden Markov models have wide applications in pattern recognition. In genome sequence analysis, hidden Markov models (HMMs) have been applied to the identification of regions of the genome that contain regulatory information, i.e., binding sites. In higher eukaryotes, the regulatory information is organized into modular units called cis-regulatory modules. Each module contains multiple binding sites for a specific combination of several transcription factors. In this chapter, we give a brief review of hidden Markov models, the standard algorithms for HMMs, and their applications to motif finding. We then introduce the application of HMMs to a complex system in which an HMM is combined with Bayesian inference to identify transcription factor binding sites and cis-regulatory modules.

Key words: Binding site, cis-regulatory module, hidden Markov model, motif.
H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_13, © Springer Science+Business Media, LLC 2010

1. Introduction

Hidden Markov models (HMMs) are popular tools in pattern recognition. In genome sequence analysis, HMMs have been applied to identifying genome regions that contain regulatory information, for instance, transcription factor binding sites (TFBSs) and cis-regulatory modules (CRMs) (1–5). More specifically, Wu and Xie (6) developed a method that combines HMMs and Bayesian inference for identifying CRMs and individual TFBSs. Hidden Markov models were originally developed for speech recognition (7). The model is designed to find out what words were spoken from the sequence of category labels of speech signals, with variations in the sound uttered and in the time taken to say the various parts of a word. The motif finding problem
in genomic sequence analysis has the same structure: based on a sequence of symbols from A, C, G, T, find the motif patterns (words), which could occur anywhere in the sequence, with variable distances between motifs. The main aim of this chapter is to introduce the theory of the HMM and its application to identifying motifs in genomic sequences. We introduce the algorithms related to simple pattern recognition and extend the model to an algorithm that combines the HMM and Gibbs sampling to identify CRMs and TFBSs.
2. Basic Concepts of Hidden Markov Models
Let us start with a simple example. The E2F family of transcription factors (TFs) plays a crucial and well-established role in cell cycle progression. The E2F binding sites follow the pattern described by a probability matrix, which is called a positional weight matrix (PWM) as shown in Table 13.1. The columns define the probabilities of observing each nucleotide at each position.
Table 13.1
The PWM of E2F binding sites

     1      2      3      4      5      6   7   8      9      10     11     12
A    0.069  0.034  0      0.034  0      0   0   0      0.034  0.517  0.448  0.448
C    0.103  0.035  0.034  0.448  0.103  1   0   0.759  0.276  0.31   0.035  0.138
G    0.069  0.069  0.069  0.517  0.897  0   1   0.241  0.586  0.104  0.241  0.31
T    0.759  0.862  0.897  0.001  0      0   0   0      0.104  0.069  0.276  0.104
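Read column by column, the PWM assigns any candidate 12-mer a probability, namely the product of the per-position probabilities. The following Python sketch (for illustration) encodes Table 13.1 and scores a site:

```python
import numpy as np

# Rows A, C, G, T of the E2F PWM from Table 13.1 (entries = positions 1-12)
pwm = {
    'A': [0.069, 0.034, 0,     0.034, 0,     0, 0, 0,     0.034, 0.517, 0.448, 0.448],
    'C': [0.103, 0.035, 0.034, 0.448, 0.103, 1, 0, 0.759, 0.276, 0.31,  0.035, 0.138],
    'G': [0.069, 0.069, 0.069, 0.517, 0.897, 0, 1, 0.241, 0.586, 0.104, 0.241, 0.31],
    'T': [0.759, 0.862, 0.897, 0.001, 0,     0, 0, 0,     0.104, 0.069, 0.276, 0.104],
}

def pwm_prob(site):
    """Probability of a 12-mer under the PWM (positions are independent)."""
    return float(np.prod([pwm[base][pos] for pos, base in enumerate(site)]))

# The consensus site: the most probable base at each position
consensus = ''.join(max('ACGT', key=lambda b: pwm[b][pos]) for pos in range(12))
print(consensus)                 # TTTGGCGCGAAA
print(pwm_prob('TTTCGCGCTCTA'))  # the E2F site embedded in sequence [1]
```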
From the PWM of E2F binding sites, we can see that the first position of an E2F binding site is more likely to be a 'T' since the probability is 0.759. The second position is also likely to be a 'T' because its probability is 0.862, and so forth. Then, for a given sequence

GCCGCCCTTTCCTCTTTCTTTCGCGCTCTAGCCACCCGG,    [1]

where TTTCGCGCTCTA is an E2F binding site, we can use an HMM to describe it. That is, we can view the letters 'A,' 'C,' 'G,' and 'T' as coming from a Markov process of hidden states. For example, the first letter 'G' is from the background state, and the first letter of the binding site is from a motif state. In total, the hidden states are the background state and the motif states from the first position to the last position (12th position). The hidden states assume the Markov property. That is, the probability of the current state
Hidden Markov Model and Its Applications in Motif Findings
407
only depends on its previous state. These states are hidden since we observe only the letters, not their states. The probability of observing a letter from each state is called the emission probability. The formal definition of an HMM is the following. An HMM is a statistical model for systems of Markov processes. An HMM can be denoted as {O, X, Q, Π, E}, where O is the sequence of observations, O = {o_1, ..., o_L}, o_l ∈ {'A', 'C', 'G', 'T'}, l = 1, ..., L, with L the length of the DNA sequence; X is the sequence of hidden states, X = {x_1, ..., x_L}, x_l ∈ {0, ..., M}, with M + 1 the total number of states; Q is the transition matrix of the hidden states, Q = {q_ij}, i, j = 0, ..., M, q_ij = P(x_{l+1} = j | x_l = i); and Π = {π_0, ..., π_M} is the initial probability of the states, π_i = P(x_1 = i), i = 0, ..., M. In addition, the probability of observation o_l at state i, E = {P(o_l | x_l = i)}, is known as the emission probability. Example 1: Returning to the E2F binding sites example, suppose we know that a DNA sequence contains some E2F binding sites and nothing else. Then we have 13 hidden states, M = 12: one background state, denoted by 0, and one state for each position of the E2F binding site, denoted by 1–12. For sequence [1], the hidden states are 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,3,4,5,6,7,8,9,10,11,12,0,0,0,0,0,0,0,0,0, or as shown in the following: GCCGCCCTTTCCTCTTTCTTTCGCGCTCTAGCCACCCGG
000000000000000000EEEEEEEEEEEE000000000, where E represents the 12 motif states, states 1–12. We view the letters in the top line as observations from the hidden path in the bottom line. Suppose we know that an E2F binding site follows the background state with probability 0.01 and that no two E2F sites are adjacent. Then we have x_l ∈ {0, 1, ..., 12}, q_01 = 0.01, q_00 = 0.99, q_{i,i+1} = 1 for i = 1, ..., 11, q_{12,0} = 1, and q_ij = 0 for all other i, j. If the nucleotides outside the E2F binding sites are purely random, then the emission probabilities for observations from the background state are uniform, i.e., P(o_l = 'A' | x_l = 0) = P(o_l = 'C' | x_l = 0) = P(o_l = 'G' | x_l = 0) = P(o_l = 'T' | x_l = 0) = 0.25. The emission probabilities for observations from states 1–12 are given by the PWM; for example, P(o_l = 'A' | x_l = 1) = 0.069. Suppose we know that the first position of the given sequence is not a binding site; then the initial probabilities are π_0 = 1 and π_1 = ... = π_12 = 0. The joint probability of an observed sequence O and a state sequence x is then

P(O, x) = π_{x_1} P(o_1 | x_1) ∏_{l=2}^{L} P(o_l | x_l) q_{x_{l-1} x_l}.   [2]
408
Wu and Xie
3. The Viterbi Algorithm

Now we can find the E2F binding sites in a given segment of sequence, if there are any, when the parameters Q, Π, and the emission probabilities are known. The most common approach to solving this type of problem is a dynamic programming algorithm called the Viterbi algorithm. For a given sequence O, there are many underlying state sequences that could give rise to it. For example, the underlying state sequence could come entirely from the background model, or it could contain one E2F binding site at a certain location. However, the probabilities of observing O under different state sequences, according to [2], can differ greatly. Hence, if we are to choose just one path for our prediction, the path with the highest probability is most likely to be the true path. The most probable path x* = argmax_x P(O, x) can be found recursively by the Viterbi algorithm, as follows:
• Initialization: v_1(i) = π_i P(o_1 | x_1 = i), for i = 0, ..., M
• Iteration: v_l(i) = max_j (v_{l-1}(j) q_{ji}) P(o_l | x_l = i), ptr_l(i) = argmax_j (v_{l-1}(j) q_{ji}), l = 2, ..., L
• Termination: P(O, x*) = max_j v_L(j), x*_L = argmax_j v_L(j)
• Traceback: x*_l = ptr_{l+1}(x*_{l+1}), l = L − 1, ..., 1
In practice, the Viterbi algorithm is carried out in log space, i.e., calculating log(v_l(i)) instead of v_l(i). In this way, one avoids the underflow problem introduced by multiplying many probabilities in the iteration steps.
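The four steps above can be sketched in a few lines of Python. This is an illustrative log-space implementation, not the authors' code; the toy two-state HMM at the end (one uniform background state and one state that always emits 'G') is our own example, not one from the chapter.

```python
import math

NEG_INF = float("-inf")

def viterbi(obs, pi, Q, E):
    """Log-space Viterbi. pi[i]: initial probability; Q[i][j]: transition
    probability; E[i][o]: emission probability of symbol o in state i.
    Returns the most probable state path."""
    def lg(p):
        # log(0) is represented by -inf so impossible paths drop out.
        return math.log(p) if p > 0 else NEG_INF

    M, L = len(pi), len(obs)
    v = [[NEG_INF] * M for _ in range(L)]
    ptr = [[0] * M for _ in range(L)]
    for i in range(M):                                   # initialization
        v[0][i] = lg(pi[i]) + lg(E[i][obs[0]])
    for l in range(1, L):                                # iteration
        for i in range(M):
            best_j = max(range(M), key=lambda j: v[l - 1][j] + lg(Q[j][i]))
            v[l][i] = v[l - 1][best_j] + lg(Q[best_j][i]) + lg(E[i][obs[l]])
            ptr[l][i] = best_j
    x = [max(range(M), key=lambda i: v[L - 1][i])]       # termination
    for l in range(L - 1, 0, -1):                        # traceback
        x.append(ptr[l][x[-1]])
    return list(reversed(x))

# Toy HMM: state 0 = uniform background, state 1 = always emits 'G'.
pi = [1.0, 0.0]
Q = [[0.5, 0.5], [1.0, 0.0]]
E = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
     {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0}]
```

On the observation "AAGA" the most probable path visits state 1 exactly at the 'G'.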
4. The Forward and Backward Algorithms
Besides calculating the probability of a sequence along the best path, we may want to calculate the probability of a sequence over all possible paths,

P(O) = Σ_x P(O, x).
An algorithm similar to the Viterbi algorithm, called the forward algorithm (18), can efficiently calculate this probability. Defining f_l(i) = P(o_1, ..., o_l, x_l = i), the full algorithm is
• Initialization: f_1(i) = P(o_1 | x_1 = i) π_i, i = 0, ..., M
• Recursion: f_l(i) = P(o_l | x_l = i) Σ_j f_{l-1}(j) q_{ji}, l = 2, ..., L
• Termination: P(O) = Σ_i f_L(i)
The forward algorithm can also suffer from underflow. One approach to solving the problem is to use a scaling factor in each recursion: s_l = max_i f_l(i), f_l(i) ← f_l(i)/s_l. Then the full probability of the sequence is P(O) = (Σ_i f_L(i)) s_1 ··· s_L. More detailed pseudocode for the forward algorithm is the following (it assumes the initial probabilities of Example 1, π_0 = 1 and π_i = 0 for i > 0):

    f(1,0) = P(o(1),0); f(1,j) = 0, j = 1,...,M;
    s(1) = f(1,0); f(1,0) = f(1,0)/s(1);
    for (l = 2; l <= L; l++) {
        for (i = 0; i <= M; i++) {
            f(l,i) = 0;
            for (j = 0; j <= M; j++) {
                f(l,i) = f(l,i) + f(l-1,j)*q(j,i);
            }
            f(l,i) = f(l,i)*P(o(l),i);
        }
        s(l) = max{ f(l,i), i = 0,...,M };
        for (i = 0; i <= M; i++) { f(l,i) = f(l,i)/s(l); }
    }
    PO = 0;
    for (i = 0; i <= M; i++) { PO = PO + f(L,i); }
    for (l = 1; l <= L; l++) { PO = PO*s(l); }
An alternative way to calculate the full probability of the sequence is the backward algorithm. In contrast to the forward algorithm, the backward algorithm calculates the probability of a sequence from the tail. That is, denoting b_l(i) = P(o_{l+1} ... o_L | x_l = i), the backward algorithm is
• Initialization: b_L(i) = 1 for all i.
• Recursion: b_l(i) = Σ_j q_{ij} P(o_{l+1} | x_{l+1} = j) b_{l+1}(j), l = L − 1, ..., 1.
• Termination: P(O) = Σ_i π_i P(o_1 | x_1 = i) b_1(i).
Now we can calculate the posterior probability of a state at a given location, P(x_l = i | O), using the forward and backward algorithms. Specifically, we have P(x_l = i | O) = P(x_l = i, O)/P(O) and

P(O, x_l = i) = P(o_1 ... o_l, x_l = i) P(o_{l+1} ... o_L | o_1 ... o_l, x_l = i)
             = P(o_1 ... o_l, x_l = i) P(o_{l+1} ... o_L | x_l = i).

By the definitions of f_l and b_l, we have

P(x_l = i | O) = f_l(i) b_l(i)/P(O),   [3]
where P(O) is calculated by either the forward algorithm or the backward algorithm.
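The posterior decoding of equation [3] can be sketched as follows; this is an illustrative implementation (ours, not the authors' code) using the per-step rescaling described above. Any common rescaling of f_l(·) or b_l(·) at a fixed position l cancels in the final per-position normalization, so the scale factors need not be tracked explicitly for the posterior.

```python
def posterior(obs, pi, Q, E):
    """P(x_l = i | O) via scaled forward and backward recursions.
    pi[i]: initial prob; Q[i][j]: transition prob; E[i][o]: emission prob."""
    M, L = len(pi), len(obs)
    # Forward pass, rescaling each row by its maximum to avoid underflow.
    f = [[pi[i] * E[i][obs[0]] for i in range(M)]]
    f[0] = [x / max(f[0]) for x in f[0]]
    for l in range(1, L):
        row = [E[i][obs[l]] * sum(f[l - 1][j] * Q[j][i] for j in range(M))
               for i in range(M)]
        m = max(row)
        f.append([x / m for x in row])
    # Backward pass with the same kind of rescaling.
    b = [[1.0] * M for _ in range(L)]
    for l in range(L - 2, -1, -1):
        row = [sum(Q[i][j] * E[j][obs[l + 1]] * b[l + 1][j] for j in range(M))
               for i in range(M)]
        m = max(row)
        b[l] = [x / m for x in row]
    # Posterior by equation [3]: normalize f_l(i) b_l(i) at each position.
    post = []
    for l in range(L):
        w = [f[l][i] * b[l][i] for i in range(M)]
        z = sum(w)
        post.append([x / z for x in w])
    return post

# Same toy two-state HMM as before (our example, not the chapter's).
pi = [1.0, 0.0]
Q = [[0.5, 0.5], [1.0, 0.0]]
E = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
     {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0}]
```

For the observation "AAGA," the exact posterior of being in the 'G'-emitting state at the third position is 8/9, which the sketch reproduces.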
5. The Baum–Welch Algorithm
Let us return to sequence [1] and Example 1. Our goal is to find the motif in the sequence, if there is any. Under the HMM, this is equivalent to finding the underlying path of states, so we can use the forward and backward algorithms and the Viterbi algorithm to make such a prediction. Example 1 provides the parameters necessary for the HMM. Usually, however, we do not know the parameters of the model and need to estimate them from training data. The emission probabilities are estimated by observed frequencies. For example, the emission probabilities from the background state 0 can be calculated using the counts of 'A,' 'C,' 'G,' 'T' in non-coding regions of the genomic sequences. That is, denoting e_0('A') as the estimated value of P(o_l = 'A' | x_l = 0), e_0('A') = {No. of 'A'}/{No. of 'A,' 'C,' 'G,' 'T'}. The emission probabilities of the nucleotides in the motif can be estimated similarly from a set of aligned known E2F binding sites; the result is a positional weight matrix as in Table 13.1. It is possible that a nucleotide never appears in the training set, in which case its estimated emission probability becomes 0. When we do not want to eliminate such an emission probability, we add a predetermined pseudocount to the counts of nucleotides. If the training data provide the hidden path, the transition probabilities can be estimated using the same idea as for the emission probabilities. In the E2F example, the only transition probability that needs to be determined
is the transition from the background state to the first position of the motif, q_{0,1}. The remaining transition probabilities are then defined as q_{0,0} = 1 − q_{0,1}, q_{i,i+1} = 1, i = 1, ..., 11, q_{12,0} = 1, and q_ij = 0 for all other i, j, where we assume no two motifs are adjacent. The estimation of q_{0,1} is similar to that of the emission probabilities: q_{0,1} = {No. of transitions from 0 to 1}/{Total number of transitions}. Unfortunately, the paths are usually unknown for the training sequences, so some form of iterative procedure must be used. The Expectation–Maximization (EM) algorithm is an efficient approach to estimating the parameters. Specifically, under the HMM we have

P(x_l = i, x_{l+1} = j | O) = f_l(i) q_{ij} b_{l+1}(j) P(o_{l+1} | x_{l+1} = j) / P(O).

Then the expected count of transitions from i to j can be obtained by summing over all positions and over all training sequences:

A_{ij} = Σ_n (1/P(O^n)) Σ_l f_l^n(i) q_{ij} b_{l+1}^n(j) P(o_{l+1}^n | x_{l+1} = j),   [4]
where f_l^n(i) is the forward variable f_l(i) calculated for sequence n, and b_l^n(i) is the corresponding backward variable. Similarly, we obtain the expected number of times that letter o appears in state i,

E_i(o) = Σ_n (1/P(O^n)) Σ_{l: o_l^n = o} f_l^n(i) b_l^n(i).   [5]
The estimated parameters can be derived from the above two counts and then used as known model parameters to obtain new counts of transitions and emissions. This is the Baum–Welch algorithm (9):
• Initialization: pick arbitrary model parameters.
• Recursion: calculate A_ij and E_i(o) according to [4] and [5]. Let q_ij = A_ij / Σ_j' A_ij' and e_i(o) = E_i(o) / Σ_o' E_i(o'). Use these values as the new parameters.
• Termination: stop if the change in log likelihood is less than a predefined threshold or the number of iterations exceeds the predefined maximum number of iterations.
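One recursion step of the algorithm can be sketched as follows, for a single short sequence and a discrete HMM. This is our own illustrative sketch of equations [4] and [5], not the authors' code; it uses unscaled forward and backward variables, so it is suitable only for short sequences (a production version would use the scaling of Section 4), and all variable names are ours.

```python
from collections import defaultdict

def baum_welch_step(obs, pi, Q, E, symbols):
    """One EM (Baum-Welch) update for a discrete HMM on a single short
    sequence. Returns re-estimated (Q, E)."""
    M, L = len(pi), len(obs)
    # Unscaled forward and backward variables.
    f = [[pi[i] * E[i][obs[0]] for i in range(M)]]
    for l in range(1, L):
        f.append([E[i][obs[l]] * sum(f[l - 1][j] * Q[j][i] for j in range(M))
                  for i in range(M)])
    b = [[1.0] * M for _ in range(L)]
    for l in range(L - 2, -1, -1):
        b[l] = [sum(Q[i][j] * E[j][obs[l + 1]] * b[l + 1][j] for j in range(M))
                for i in range(M)]
    PO = sum(f[L - 1])
    # Expected transition counts A_ij (eq. [4]) and emission counts E_i(o) (eq. [5]).
    A = [[0.0] * M for _ in range(M)]
    C = [defaultdict(float) for _ in range(M)]
    for l in range(L - 1):
        for i in range(M):
            C[i][obs[l]] += f[l][i] * b[l][i] / PO
            for j in range(M):
                A[i][j] += f[l][i] * Q[i][j] * E[j][obs[l + 1]] * b[l + 1][j] / PO
    for i in range(M):
        C[i][obs[L - 1]] += f[L - 1][i] * b[L - 1][i] / PO
    # Normalize the counts into new transition and emission probabilities.
    newQ = [[A[i][j] / sum(A[i]) if sum(A[i]) > 0 else Q[i][j]
             for j in range(M)] for i in range(M)]
    newE = []
    for i in range(M):
        tot = sum(C[i].values())
        newE.append({o: (C[i][o] / tot if tot > 0 else E[i][o]) for o in symbols})
    return newQ, newE
```

As a sanity check, for a one-state HMM the update reduces to the observed symbol frequencies.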
6. Gibbs Sampling

An alternative to the Baum–Welch algorithm to estimate the model parameters is the Gibbs sampling method. For problems
of identifying a single type of motif, e.g., the E2F example, the only unknown parameter is the transition probability q_{0,1}. Assume the prior distribution q_{0,1} ~ Beta(a, b); the Beta distribution is the conjugate prior. An initial value of q_{0,1} is generated from the prior distribution. Then the following two steps alternately generate the hidden path and update the transition probability.
• Sampling the hidden path X. The forward algorithm is used to calculate f_l(i), l = 1, ..., L and i = 0, ..., M. Starting from l = L, we generate the hidden state x_L with probabilities proportional to f_L(i), i = 0, ..., M; for l < L, given the sampled state x_{l+1}, the state x_l is drawn with probability proportional to f_l(i) q_{i, x_{l+1}}. We move one position toward the front and repeat the sampling procedure until we reach the first position.
• Sampling q_{0,1} given the hidden path. The transition probability is updated as q_{0,1} ~ Beta(a + A, b + B), where A is the number of motifs in the sampled path and B = L − A.
The sampling procedure is repeated a large number of times, say 1,000 times. The transition probability q_{0,1} is expected to converge, and the motifs are identified at the positions where the frequency of being in a motif state exceeds a cutoff.
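The two steps above can be sketched as follows, for a generic M-state HMM whose forward matrix f has already been computed (any common per-row scaling of f cancels). This is our own illustrative sketch, not the authors' code; the conjugate Beta update follows the counts A and B = L − A exactly as defined in the text.

```python
import random

def sample_path(f, Q):
    """Backward sampling of a hidden path given forward variables f:
    x_L is drawn with probability proportional to f_L(i); for l < L,
    P(x_l = j | x_{l+1}) is proportional to f_l(j) * Q[j][x_{l+1}]."""
    L, M = len(f), len(f[0])

    def draw(weights):
        r = random.uniform(0, sum(weights))
        acc = 0.0
        for i, w in enumerate(weights):
            acc += w
            if w and r <= acc:       # skip zero-weight states
                return i
        return M - 1

    x = [draw(f[L - 1])]
    for l in range(L - 2, -1, -1):
        x.append(draw([f[l][j] * Q[j][x[-1]] for j in range(M)]))
    return list(reversed(x))

def update_q01(path, a, b):
    """Conjugate Beta update for the background-to-motif transition:
    q01 ~ Beta(a + A, b + B), with A the number of motif starts (entries
    into state 1) in the sampled path and B = L - A."""
    A = sum(1 for s in path if s == 1)
    B = len(path) - A
    return random.betavariate(a + A, b + B)
```

With a degenerate forward matrix the backward sampler is deterministic, which makes the logic easy to verify by hand.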
7. HMM for cis-Regulatory Modules

7.1. An Advanced HMM
We have shown that HMMs can be applied to identifying motifs, as in Example 1. Motif identification is useful in the study of gene transcriptional regulation. However, the HMM we have introduced so far is too simple to describe a cluster of binding site motifs, whereas recent studies suggest a modular organization of binding sites for TFs (10). More specifically, the expression level of a gene is determined by the cooperation of several TFs, whose sites are organized in a modular unit called a CRM. We now introduce a more advanced HMM to detect CRMs, which form a more complicated system. In detecting CRMs, a genomic sequence is considered an observation from an HMM whose hidden path indicates the modules and the within-module TFBSs. We introduce the hidden states, the transition probabilities, the emission probabilities, and the likelihood in the following. First, the module and the outside background are defined by two hidden states in the model, i.e., the two circles in Fig. 13.1. Suppose there are at most K possible different TFs in a given sequence. Then K additional hidden states are used to indicate the TFBSs. As a simple example, Fig. 13.1 illustrates an HMM with K = 3 possible TFBS states, denoted by PWMs, and states
Fig. 13.1. A hidden Markov model for CRMs with K = 3 possible TFBSs. State 0: background outside of modules; state K + 1: background inside a module. Transition probabilities are indicated at the arrows.
K + 1 and 0 indicate the background states inside and outside of the modules, respectively. The hidden path of a module consists of several TFBSs from the set of K possible PWMs, connected by background sequences of state K + 1. The modules are further connected by sequences of state 0 to obtain the hidden path of the whole genomic sequence. The transition probabilities between the hidden states are treated as unknown parameters. The notation is specified as follows. At each position of the observed sequence, given that the previous state is the background (state 0), there is a probability r of initiating a module and a probability 1 − r of staying in the background. If a module starts at a position, then its hidden state becomes K + 1, i.e., the background inside the module, and a series of transitions happens within the module. From state K + 1, there is a probability q_k of initiating the kth motif site, in which case the following w_k positions have the hidden state PWM_k, where w_k is the width of the kth motif. In addition, there is a probability q_{K+1} of staying in the module background state K + 1 and a probability q_0 of terminating the module, changing from the module to the outside background state 0. The transition probabilities satisfy Σ_{k=0}^{K+1} q_k = 1. Figure 13.1 shows the structure of the hidden Markov chain with the transition probabilities indicated at the arrows. The emission probabilities are assumed known. Each PWM_k inside a module is selected from a database (e.g., TRANSFAC (11)); therefore its position-specific probabilities Θ_k and motif width w_k are given. For the regions outside the modules and the background regions within the modules, i.e., state 0 and state K + 1, respectively, the emission probability is from a first-order Markov chain with parameter θ_0. This parameter can easily be estimated from the observed sequence data.
7.2. Bayesian Inference
Under the HMM, the complete sequence likelihood is defined as follows. For notational simplicity, we consider only a single observed sequence. For a group of sequences, we assume the sequences are independent; the joint probability is then the product of the probabilities of the individual sequences, and the estimation procedure is applied to the sequences one by one. Let X denote a single sequence. Let M be the module indicators and A = {A_1, ..., A_K, A_{K+1}} the TFBS indicators, where A_k, k = 1, ..., K, indicates the binding sites for the kth motif and A_{K+1} indicates the non-site background sequences inside the modules. We use X(M^c) to denote the background sequence outside of the modules. Denote Θ = {θ_0, Θ_1, ..., Θ_K}, where θ_0 is the background parameter of the first-order Markov model and Θ_k is the kth PWM. Then, given Θ, q = {q_0, q_1, ..., q_K, q_{K+1}}, and r, the complete sequence likelihood with M and A given is

P(X, M, A | Θ, q, r) = P(M, A | r, q) P(X(M^c) | θ_0, M) P(X(M) | M, A, Θ).   [6]
Note that the parameters q and r are unknown, but the emission probability parameters are given. The HMM with its unknown parameters r and q can be estimated by the standard Baum–Welch algorithm. However, the Baum–Welch algorithm does not work well when the parameter space is large, so we take the alternative approach of Gibbs sampling. Combining [6] with prior distributions for the parameters q and r gives the posterior distribution

P(M, A, q, r | X, Θ) ∝ P(X, M, A | Θ, q, r) π(q) π(r),   [7]
where π(q) is the conjugate prior distribution for q, a Dirichlet distribution with parameter α = {α_0, ..., α_K, α_{K+1}}, and π(r) is Beta(a, b). By simulating from the posterior distribution, we obtain the estimated parameters q̂, r̂ and consequently the hidden path M̂, Â. The most probable hidden path provides the optimal locations of the modules and the binding sites.

7.3. Gibbs Sampling
The Bayesian inferences of the unknown parameters and the hidden path are derived by iterative sampling. First, the initial values of q and r are generated from the prior distributions, the Dirichlet π(q) and the Beta π(r), respectively. Then the following two steps iteratively update the parameters:
1. Sampling M and A given q and r. To simulate the hidden path of M and A, the forward algorithm of the HMM is first used to calculate the marginal probability of X, summing over all possible hidden paths. Consider a sequence X = {x_1, ..., x_L}. For a partial sequence
{x_1, ..., x_m}, m ≤ L, let h_m(k) be the probability of the observed sequence, requiring that the last observation x_m is in state k: h_m(k) = P({x_1, ..., x_m}, state of x_m = k | Θ, q, r), k = 0, 1, ..., K, K + 1. If k is a motif state, then the last observation is at the last position of the motif. Note that P(X | Θ, q, r) = Σ_{k=0}^{K+1} h_L(k). With the HMM shown in Fig. 13.1, the recursion formulae are given by

h_m(0) = (1 − r) P(x_m | x_{m−1}, θ_0) h_{m−1}(0) + q_0 P(x_m | x_{m−1}, θ_0) h_{m−1}(K+1),
h_m(K+1) = P(x_m | x_{m−1}, θ_0) [ r h_{m−1}(0) + Σ_{k=1}^{K} h_{m−1}(k) + q_{K+1} h_{m−1}(K+1) ],
h_m(k) = q_k P({x_{m−w_k+1}, ..., x_m} | Θ_k) h_{m−w_k}(K+1), k = 1, ..., K,   [8]
where P({x_{m−w_k+1}, ..., x_m} | Θ_k) is the probability of generating the segment from the kth motif model PWM_k. The initial conditions of formula [8] are h_0(0) = 1 and h_m(k) = 0 for k = 1, ..., K + 1 and m ≤ 0. With all the values h_m(k) calculated, backward sampling is used to sample M and A as follows. Starting from m = L, at position m we generate the hidden state of x_m as either the background outside of modules, i.e., state 0, or the last position of a module, i.e., one of the states k = 1, ..., K, K + 1, with probabilities proportional to the corresponding h_m(0), h_m(k), and h_m(K+1) in formula [8]. Once the last position of a module has been generated, and supposing we are backward at position m within the current module, the sequence segment {x_{m−w_k+1}, ..., x_m}, k = 1, ..., K, K + 1, is drawn as a background letter (w_{K+1} = 1) or a site for one of the K motifs with probability proportional to h_m(k) and h_m(K+1) in formula [8]. Depending on the generated state, we move toward the front and repeat the sampling procedure until we reach the first position of the sequence.
2. Sampling q and r given M and A. Given the current samples M and A, denote the total number of modules by |M|, the length of each module by l_j, j = 1, ..., |M|, and the number of sites for the kth motif by |A_k|; then |A_{K+1}| = Σ_{j=1}^{|M|} l_j − Σ_{k=1}^{K} |A_k| w_k is the total length of
non-site background segments within the modules. Then [q | M, A] follows Dirichlet(|M| + α_0, |A_1| + α_1, ..., |A_K| + α_K, |A_{K+1}| + α_{K+1}). Similarly, [r | M] ~ Beta(|M| + a, L − Σ_{j=1}^{|M|} l_j + b), where L is the total length of X. We infer the modules from the marginal probability of M, i.e., the frequency with which each sequence position is sampled as being within a module. For example, if the sampling procedure is repeated 1,000 times, the positions where the frequency of being in a module is > 0.5 are predicted as modules. Similarly, the positions where the frequency of being PWM_k, k = 1, ..., K, exceeds a cutoff are predicted as TFBSs. Overlapping motifs are resolved by keeping the one with the larger frequency.
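One joint draw from the two conditional posteriors in step 2 can be sketched with standard-library samplers only, using a Dirichlet draw built from normalized Gamma variates. This is our own sketch; the function and argument names are ours, not the authors'.

```python
import random

def sample_dirichlet(alphas):
    """Dirichlet draw via normalized Gamma variates (stdlib only)."""
    g = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [x / s for x in g]

def sample_q_and_r(n_modules, site_counts, bg_len_in_modules,
                   total_len, module_len_sum, alpha, a, b):
    """One draw of [q | M, A] and [r | M] from their conditional posteriors:
    q ~ Dirichlet(|M| + a0, |A_1| + a1, ..., |A_K| + aK, |A_{K+1}| + a_{K+1}),
    r ~ Beta(|M| + a, L - sum_j l_j + b)."""
    # Counts in the same order as the Dirichlet parameter: module
    # terminations (|M|), motif site counts |A_k|, module background length.
    counts = [n_modules] + site_counts + [bg_len_in_modules]
    q = sample_dirichlet([c + al for c, al in zip(counts, alpha)])
    r = random.betavariate(n_modules + a, total_len - module_len_sum + b)
    return q, r
```

For example, with |M| = 2 modules of total length 30 in a sequence of length L = 100, K = 2 motifs with |A_1| = 3 and |A_2| = 1 sites, and flat priors, a single call returns one posterior draw of (q, r).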
8. Exercise

Calculate the posterior probabilities of the states of sequence [1] using the parameters given in Example 1 and Fig. 13.1. Solution: use the forward algorithm in Section 4 and substitute the values from Example 1 to obtain P(O). Then use the backward algorithm in Section 4 and obtain the posterior probabilities by [3].
References
1. Crowley, E.M., Roeder, K., and Bina, M. (1997) A statistical model for locating regulatory regions in genomic DNA. J Mol Biol 268, 8–14.
2. Frith, M.C., Hansen, U., and Weng, Z. (2001) Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics 17, 878–889.
3. Bailey, T.L. and Noble, W.S. (2003) Searching for statistically significant regulatory modules. Bioinformatics 19 (Suppl. 2), ii16–ii25.
4. Rajewsky, N., Vergassola, M., Gaul, U., and Siggia, E.D. (2002) Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo. BMC Bioinformatics 3, 30.
5. Sinha, S., van Nimwegen, E., and Siggia, E.D. (2003) A probabilistic method to detect regulatory modules. Bioinformatics 19 (Suppl. 1), i292–i301.
6. Wu, J. and Xie, J. (2008) Computation-based discovery of cis-regulatory modules by hidden Markov model. J Comput Biol 15(3), 279–290.
7. Rabiner, L. and Juang, H. (1993) Fundamentals of Speech Recognition. Prentice Hall, USA.
8. Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998) Biological Sequence Analysis. Cambridge University Press, Cambridge, UK.
9. Baum, L.E. (1972) An equality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities 3, 1–8.
10. Yuh, C.H., Bolouri, H., and Davidson, E.H. (1998) Genomic cis-regulatory logic: experimental and computational analysis of a sea urchin gene. Science 279, 1896–1902.
11. Wingender, E., Chen, X., Hehl, R., et al. (2000) TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res 28, 316–319.
Chapter 14
Dimension Reduction for High-Dimensional Data
Lexin Li

Abstract
With the advance of modern technologies, high-dimensional data have become prevalent in computational biology. The number of variables p is very large, and in many applications p is larger than the number of observational units n. Such high dimensionality and the unconventional small-n-large-p setting pose new challenges to statistical analysis methods. Dimension reduction, which aims to reduce the predictor dimension prior to any modeling efforts, offers a potentially useful avenue for tackling such high-dimensional regressions. In this chapter, we review a number of commonly used dimension reduction approaches, including principal component analysis, partial least squares, and sliced inverse regression. For each method, we review its background and its applications in computational biology, discuss both its advantages and limitations, and offer enough operational detail for implementation. A numerical example analyzing microarray survival data illustrates applications of the reviewed reduction methods.
Keywords: Dimension reduction, partial least squares, principal component analysis, sliced inverse regression.
1. Introduction

With the advance of experimental technologies and observational instruments, high-dimensional genetic and genomic data have become prevalent in computational biology. Here we consider some specific examples.
• In a microarray study of diffuse large-B-cell lymphoma (DLBCL), (1) modeled the survival time of the DLBCL patients after chemotherapy based on the patients' gene expression profiles. The data consist of 240 patients and measurements of 7399 genes.
H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_14, © Springer Science+Business Media, LLC 2010
417
418
Li
• In a study to predict regulatory elements, which are strings of nuclear DNA whose function is to determine the activation of genes, (2) discriminated regulatory elements from non-functional neutral DNA based on the alignment patterns of the genome sequences. The data consist of 257 regulatory elements and 257 neutral DNA strings, along with 1296 alignment patterns.
• In a study to discover transcription factor binding motifs, (3) regressed genes' mRNA expression levels against their promoter regions' matching scores for each of the candidate motifs. The data consist of 5970 genes and 414 motif candidates.
All the above examples can be cast as a statistical regression problem, with a univariate response variable Y, which can be categorical, continuous, or right censored, and a p-dimensional predictor vector X. A common feature of these examples is that the predictor dimension p is very large, often in the hundreds, thousands, or more, and in many cases the number of predictors p is much larger than the number of observational units n. Such high dimensionality renders many traditional data analytic tools untenable, and the unconventional small-n-large-p setting impedes straightforward application of classical regression techniques. As such, it has necessitated a new type of statistical analysis. Dimension reduction offers an appealing avenue for tackling such high-dimensional regressions. It is based on the belief that high-dimensional data can be effectively summarized as being concentrated on a low-dimensional space, and its goal is to reduce the dimensionality of the predictor vector as a first step, prior to any subsequent modeling and analysis efforts. Potential advantages accrue from working in this dimension reduction paradigm. It in effect transforms a high-dimensional problem into a low-dimensional one by providing a small set of composite predictors upon which prediction and interpretation can be based.
Consequently, many existing parametric or nonparametric approaches, which would otherwise be hindered by the curse of dimensionality, can be applied to the reduced low-dimensional data. In addition, effective reduction in dimension often makes informative visualization of the data possible, which in turn facilitates subsequent model development. Although many statistical methods have a built-in dimension reduction component, in this chapter we concentrate on a particular reduction paradigm, linear dimension reduction. More specifically, we seek to replace the original p-dimensional predictor vector X with d-dimensional linear combinations η^T X, where η is a p × d matrix with d ≤ p. In practice, d is often substantially smaller than p, and thus reduction in dimension is achieved. When d is as small as 1, 2, or 3, graphing the data in
Dimension Reduction for High–Dimensional Data
419
reduced dimension becomes feasible. During the phase of dimension reduction, we intend to impose few or no probabilistic models; meanwhile, we aim to retain as much regression information as possible. Subsequent model formulation can then be based upon the extracted linear combinations η^T X with much flexibility and little information loss. Our choice of the linear dimension reduction paradigm excludes nonlinear dimension reduction methods such as isomap (4) and locally linear embedding (5). However, linear reduction does not mean that only linear regression models are allowed. As we will see later, it encompasses a very rich and flexible class of regression models and can effectively account for nonlinear data structures. We choose the form of linear reduction because it offers a natural and potentially useful class for which a rigorous statistical theoretical foundation has been developed. Within this framework of linear dimension reduction, we will review some very commonly used dimension reduction methods, including principal component analysis, partial least squares (6), and sliced inverse regression (7). Depending on how the response information Y is used, we group the methods into unsupervised dimension reduction, where Y is not explicitly taken into account, and supervised dimension reduction, where Y is directly incorporated. We will review each method with its background and applications, discuss both its advantages and limitations, and offer enough operational detail that practitioners can implement the method. To facilitate the exposition, we adopt the following notation throughout the chapter. The notation R^{a×b} stands for the space of real matrices of dimension a × b, and R^a for the space of real vectors of length a; these two symbols are mainly used to clarify the dimension of a matrix or a vector. We denote the random predictor vector by X = (X_1, ..., X_p)^T ∈ R^p.
We denote the n iid sample observations of X as x1 , . . . , xn , and organize the data as an n × p matrix X with xi ∈ Rp , i = 1, . . . , n, as its rows. Without loss of generality we also assume that E(X) = 0 and the data matrix X is appropriately centered throughout this chapter.
2. Methods

2.1. Principal Component Analysis
Principal component analysis (PCA) is a well-established dimension reduction approach for multivariate data. At the population level, PCA seeks a set of orthogonal linear combinations
a1T X, . . . , apT X for the sequential maximization: max Cov(ajT X), aj
subject to aj aj = 1 and aj ak = 0, k = 1, . . . , j − 1. There exists a simple closed form solution to this sequential optimization problem, i.e., the maximizers are the eigenvectors, γ1 , . . . , γp , of the predictor covariance matrix = Cov(X) corresponding to the eigenvalues λ1 ≥ . . . ≥ λp in descending order. We call the γj ’s the principal component directions, and call the linear combinations γjT X’s the principal components. Given n iid sample observations, the sample version of the principal component directions are the eigenvectors γˆj ∈ Rp , j = 1, . . . , p, of the usual sample covariance ˆ = XT X/n, and the sample version of the principal commatrix ponents are Xγˆj ∈ Rn , j = 1, . . . , p. It is interesting to note that, in many bioinformatics applications, data are often arranged in a way that each row corresponds to a gene or SNP and each column to a subject or sample. So the data are of the form XT ∈ Rp×n in our notation. Consider the singular value decomposition of this data matrix, XT = UDV T , where U ∈ Rp×n , V ∈ Rn×n are both orthonormal matrices, and D ∈ Rn×n is a diagonal matrix consisting of the singular values. Since V is orthonormal, we have XT X/n = U (D 2 /n)U T . Then following the above definition, the sample principal components are Xuj ∈ Rn , j = 1, . . . , min (n, p), where uj ’s are the columns of the matrix U. For this reason, PCA is sometimes referred to as the singular value decomposition-based method. We also note that, when p > n or p >> n, it is often suggested to obtain the principal components directly from the columns of the matrix V, and this is because Xuj ∝ vj , for j = 1, . . . , min (n, p). As such one only needs to perform singular value decomposition on a p × n matrix XT instead of a p × p matrix XT X, which could save substantial computations when p is huge. PCA has been studied in the context of regression as well, where the first few principal components, say, γ1T X, . . . 
, γdT X, are taken in place of X, and one regresses Y on γ1T X, . . . , γdT X. To determine the number d of principal components to include, one can resort to the scree plot and look for the “elbow” in the graph of eigenvalues λj ’s. Alternatively, one can choose d such that the total explained variation exceeds a pre-specified threshold percentage, where the explained variation of the first d principal p components is measured by dk=1 λk / k=1 λk . In many applications, d is often small, and thus substantial dimension reduction is achieved by principal components regression. In general, PCA is very simple to use, since it only involves eigen decomposition of a covariance matrix, it places almost no probabilistic assumption on the distribution of X, and it can work for very large p including the p > n scenario. Moreover, PCA is expected to mitigate the effects of collinearity, which is one of the main motivations for T
Dimension Reduction for High–Dimensional Data
using it as a reductive method in regression. For these reasons, PCA has been very commonly used in a large variety of scientific applications. For a comprehensive account of PCA, see (8). In bioinformatics, PCA has been applied to microarray gene expression data to produce a set of "eigen-genes" or "supergenes" that capture most of the expression information and can be associated with clinical outcome variables; see, for instance, (9) and (10), among others. PCA was used as a dimension reduction pre-processing step in (11–13) to deal with the n < p problem so that a subsequent dimension reduction approach could be applied. PCA was also used for biomarker discovery in (14), and for adjustment for heterogeneity in gene expression in (15). In genome-wide association studies, PCA has been employed to capture population structure and to correct for stratification (16, 17). Despite its widespread applications, there has been a common criticism of principal components-based regression: the principal components are computed only from the marginal distribution of X, and the response information Y is not taken into account. For this reason, PCA is viewed as an unsupervised dimension reduction approach, and is often suspected to be inefficient for regression, since reduction is achieved without regard to the response, while in principle there is no reason that the response should not be closely tied to the least important principal components (18). On the other hand, (19) presented an intriguing argument for why the response tends to have a higher correlation with the first principal component than with any other component, which to some extent explains the popularity of reduction to a few leading principal components. Moreover, (20) offered an alternative view of principal components regression via maximum likelihood estimation of an inverse regression model.

2.2. A General Reductive Paradigm for Regression
Intuitively, it seems more effective to incorporate the response information during the phase of dimension reduction, and this leads to the family of supervised dimension reduction approaches. We start with an introduction of a general reductive paradigm for regression. The goal of regression is, in full generality, to develop inferences on the conditional distribution of Y | X. The goal of sufficient dimension reduction is then to seek a p × d matrix η, with d ≤ p, such that

Y | X =_D Y | η^T X, or equivalently, Y ⊥⊥ X | η^T X,   [1]

where =_D indicates identical distribution, and ⊥⊥ stands for statistical independence. Given [1], we can replace X with η^T X, and regress Y on η^T X without losing any regression information.
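As a toy numerical illustration of [1] (a minimal sketch in Python/numpy with hypothetical simulated data; the direction `eta` and all dimensions below are made up for the demo): when Y depends on X only through η^T X, regressing Y on the single index η^T X retains essentially all of the linear-regression information in X.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 5
eta = np.array([1.0, -1.0, 0.5, 0.0, 0.0])          # hypothetical true direction
X = rng.standard_normal((n, p))
Y = X @ eta + 0.5 * rng.standard_normal(n)          # Y | X depends on X only via eta^T X

def rss(design, y):
    """Residual sum of squares of a least squares fit."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    r = y - design @ beta
    return r @ r

rss_full = rss(X, Y)                    # regress Y on all of X
rss_index = rss((X @ eta)[:, None], Y)  # regress Y on the single index eta^T X
print(rss_index / rss_full)             # very close to 1: almost no information lost
```

The ratio exceeds 1 only by the sampling noise absorbed by the four redundant coordinates, which vanishes as n grows.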
Li
Note that η in [1] is not unique. To obtain a well-defined population parameter, we call any subspace of R^p whose basis matrix satisfies [1] a dimension reduction subspace, and call the intersection of all such dimension reduction subspaces, provided that this intersection itself satisfies [1], the central subspace for the regression of Y on X (21). The central subspace exists uniquely under very mild conditions (22), and is denoted S_{Y|X}. By this definition, S_{Y|X} is a parsimonious population parameter that captures the full regression information of Y|X, and it is the main object of interest in our dimension reduction inquiry. Depending on study-specific goals, some regression analyses may focus on the conditional mean E(Y|X) only, leaving other aspects of Y|X unspecified or filled in with plausible assumptions. Under such circumstances, dimension reduction seeks a matrix η such that E(Y|X) = E(Y|η^T X). Accordingly, one can develop the notion of the central mean subspace (23), denoted S_{E(Y|X)}, which retains the full regression information that is available through the mean E(Y|X). Throughout the chapter, we assume S_{Y|X} and S_{E(Y|X)} exist and that (X, Y) has a joint density.

2.3. Partial Least Squares
We next discuss another popular dimension reduction method, partial least squares (PLS), and connect it with the above reductive paradigm. Originally developed by (6, 24) for econometrics, PLS gained its popularity in chemometrics, and has since spread to research in education, marketing, the social sciences, and more recently bioinformatics. From an optimization point of view, PLS seeks a set of orthogonal linear combinations a_1^T X, . . . , a_p^T X through the sequential maximization

max_{a_j} Var(a_j^T X) Corr^2(Y, a_j^T X),

subject to a_j^T a_j = 1 and a_j^T a_k = 0, k = 1, . . . , j − 1. So now the response Y is explicitly incorporated into the objective function. The extracted linear components are then linked with the response through a linear model, yielding essentially a single linear combination of X in association with Y. In population, this PLS vector of linear combination is of the form (25, 26)

β_PLS = R_u (R_u^T Σ R_u)^{−1} R_u^T σ,

where σ = Cov(X, Y) ∈ R^p and R_u = (σ, Σσ, . . . , Σ^{u−1}σ) ∈ R^{p×u} is obtained by iteratively transforming σ by Σ. We call β_PLS the PLS direction.
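The population formula for β_PLS can be sketched directly (a hedged illustration in numpy, not the chapter's code; `pls_direction` is a hypothetical helper name). Note that only the small u × u matrix R_u^T Σ R_u is ever inverted.

```python
import numpy as np

def pls_direction(Sigma, sigma, u):
    """Population PLS direction beta_PLS = R_u (R_u^T Sigma R_u)^{-1} R_u^T sigma,
    where R_u = (sigma, Sigma sigma, ..., Sigma^{u-1} sigma) is a Krylov matrix."""
    cols = [sigma]
    for _ in range(u - 1):
        cols.append(Sigma @ cols[-1])
    R = np.column_stack(cols)          # p x u
    M = R.T @ Sigma @ R                # u x u -- the only matrix inverted
    return R @ np.linalg.solve(M, R.T @ sigma)

# Hypothetical sanity check: when u = p, Span(R_u) is (generically) the whole
# space, so beta_PLS coincides with beta_OLS = Sigma^{-1} sigma.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
Sigma = A @ A.T + 4 * np.eye(4)        # a well-conditioned covariance
sigma = rng.standard_normal(4)
beta_ols = np.linalg.solve(Sigma, sigma)
print(np.allclose(pls_direction(Sigma, sigma, 4), beta_ols))  # True
```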
To connect PLS with the reductive paradigm in [1], we first briefly discuss the ordinary least squares (OLS) estimator β_OLS = Σ^{−1}σ. As noted by (27), when the marginal distribution of X is elliptically symmetric, OLS estimates a direction in S_{Y|X}, or more precisely, in S_{E(Y|X)}. As an example, consider the single-index model, Y = f(β_1^T X) + ε, where f is an unknown link function and ε is a mean-zero error independent of X. Then β_OLS estimates β_1 up to a constant as long as X is elliptically symmetric. This observation implies that OLS can work beyond a linear model, and use of OLS does not necessarily mean that the linear model is true or provides an adequate fit of the data. On the other hand, OLS can only capture information in the conditional mean E(Y|X) but not in higher moments such as Var(Y|X), and it can identify at most one direction (23). Moreover, given a finite sample of observations of size n, if n < p, then the usual sample covariance estimator Σ̂ is singular and not invertible. As such, OLS cannot be applied directly in the small-n-large-p regression. Returning to PLS, it is straightforward to re-write β_PLS = R_u(R_u^T Σ R_u)^{−1} R_u^T Σ Σ^{−1} σ = P_{R_u(Σ)} β_OLS, where P_{R_u(Σ)} denotes the projection matrix onto the subspace Span(R_u) spanned by the columns of R_u, with respect to the Σ inner product. Here we always assume Σ is of full rank in population, even though its sample estimator Σ̂ is singular when n < p. Meanwhile, one can show that S_{Y|X} ⊆ Span(R_u) for some value of u (28, 29). Then, when X is elliptically symmetric, β_OLS ∈ S_{Y|X} ⊆ Span(R_u). Since the PLS vector β_PLS is the projection of β_OLS onto Span(R_u) and β_OLS ∈ Span(R_u), we have β_PLS = β_OLS ∈ S_{Y|X}. There are two immediate implications of this observation. First, the PLS vector is equal to the OLS vector in population. Second, PLS estimates a direction in the central subspace just as OLS does, and both methods work beyond the linear model.
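The claim that OLS recovers the single-index direction up to scale under elliptically symmetric X can be checked by a small simulation (a sketch with a hypothetical link function f = sinh and made-up dimensions; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20000, 6
beta1 = np.array([1.0, 2.0, 0.0, -1.0, 0.0, 0.5])
beta1 /= np.linalg.norm(beta1)
X = rng.standard_normal((n, p))        # multivariate normal: elliptically symmetric
Y = np.sinh(X @ beta1) + 0.3 * rng.standard_normal(n)   # nonlinear single-index model

# OLS slope vector (beta_OLS = Sigma^{-1} Cov(X, Y)) via centered least squares
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean()
beta_ols, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
beta_ols /= np.linalg.norm(beta_ols)
print(abs(beta_ols @ beta1))           # close to 1: OLS recovers beta1 up to scale
```

The cosine between β̂_OLS and β_1 is close to 1 even though the link f is far from linear, matching the observation of (27).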
On the other hand, for β_OLS one needs to invert the p × p matrix Σ, while for β_PLS one only needs to invert the u × u matrix R_u^T Σ R_u. Since u is often much smaller than p and also smaller than n, PLS works for n < p while OLS does not. Moreover, PLS can also improve estimation accuracy when the predictors in X are highly correlated. Computationally, u can be determined by examining the singular values of R_u or by cross-validation; see (29) for more details. In bioinformatics, in particular in microarray gene expression data analysis, PLS has been employed for tumor classification in (30–32) and for survival time prediction in (33–35), among others. In those applications, the first few PLS components, say, a_1^T X, . . . , a_d^T X, are taken in place of X, but the subsequent
regression does not necessarily take the form of a linear model. In that case, the population interpretation of this reductive strategy is unclear and remains to be further investigated. Although PLS incorporates the response information during dimension reduction, it can identify at most one direction and responds only to the conditional mean E(Y|X), just as ordinary least squares does. We next review a class of dimension reduction estimators that are designed to capture the full regression information of Y|X.

2.4. Sliced Inverse Regression
A number of estimation methods have been proposed to estimate the central subspace or the central mean subspace. Examples include sliced inverse regression (7), sliced average variance estimation (36), the Fourier estimator (37), and directional regression (38) for estimating S_{Y|X}, and principal Hessian directions (39) and minimum average variance estimation (40) for estimating S_{E(Y|X)}. See (41) for a recent review of the literature. Among those methods, sliced inverse regression (SIR) is perhaps the most widely used and has many elaborations, and we will focus our subsequent discussion on SIR. SIR aims to estimate the central subspace S_{Y|X} based on a key observation that, when X is elliptically symmetric, the vector Σ^{−1} E{X − E(X)|Y} belongs to S_{Y|X}. It is often further assumed that Span(Σ^{−1} E{X − E(X)|Y}), which is contained in S_{Y|X}, in fact equals S_{Y|X}. As a consequence of this observation, consider the eigen decomposition of the matrix Σ_{x|y} = Cov[E{X − E(X)|Y}] ∈ R^{p×p} with respect to Σ:

Σ_{x|y} γ_j = λ_j Σ γ_j, j = 1, . . . , p.   [2]
Then the first d eigenvectors (γ_1, . . . , γ_d), corresponding to the eigenvalues λ_1 ≥ . . . ≥ λ_d > 0, form a basis of S_{Y|X}. We call the γ_j's the SIR directions. Given n iid sample observations, (7) proposed to first slice the range of the response into h non-overlapping intervals, and then estimate E{X − E(X)|Y} by the slice averages x̄_s = Σ_{i ∈ slice s} (x_i − x̄)/n_s, s = 1, . . . , h, i.e., by averaging the centered x's that belong to the same slice, where x̄ is the usual sample average and n_s is the number of observations in the s-th slice. One then forms the sample estimate of Σ_{x|y} as Σ̂_{x|y} = Σ_{s=1}^h n_s x̄_s x̄_s^T / n and performs its spectral decomposition with respect to Σ̂. The number d of linear combinations needed to fully capture the central subspace is determined by the number of nonzero eigenvalues, which can in turn be determined by a sequential asymptotic test (7), a permutation test (42), or an information criterion (43). The number of slices h is a tuning parameter in SIR, but the estimation results are not overly sensitive to the choice of h, as long as h > d and the number of observations in each slice is large enough for the asymptotics to provide a useful approximation. In sum, SIR
is very simple to compute, and it works for a variety of flexible models, for example, the single-index model Y = f(β_1^T X) + ε, the heteroscedastic model Y = f_1(β_1^T X) + f_2(β_2^T X) × ε, the additive model Y = Σ_{j=1}^d f_j(β_j^T X) + ε, and the generalized linear model log{P(Y = 1|X)/P(Y = 0|X)} = β_1^T X. In the above cases, the f's are smooth link functions, ε is a random error independent of X, and the β's form a basis of the central subspace S_{Y|X}. There have been many applications of SIR in bioinformatics; see, for instance, (2, 3, 11, 12, 44). Moreover, (45) applied the sliced average variance estimation of (36), and (46) applied the minimum average variance estimation of (40), both to gene expression data for tumor classification. In bioinformatics applications, the number of predictors p is often larger than the number of samples n. When n < p, the usual sample covariance Σ̂ is not invertible, and thus SIR cannot be applied directly. There are several strategies to deal with the small-n-large-p problem. The first is to employ a pre-processing procedure to bring down the dimension of the predictors. For this purpose, one can employ a univariate screening approach, e.g., a univariate two-sample t test or univariate regression. This strategy was employed by (45), where a small number of genes were pre-selected and the dimension reduction method was applied to this subset of selected genes. Alternatively, one may use PCA to produce a number of leading principal components, and then base the subsequent SIR on those extracted principal components. This strategy has been employed in (11–13). The second strategy to tackle n < p is to borrow the idea of PLS reviewed in Section 2.3. More specifically, let ν denote the p × d matrix consisting of the first d eigenvectors of the covariance Σ_{x|y}. Then it satisfies Span(Σ^{−1}ν) ⊆ S_{Y|X}. We next iteratively transform ν by Σ, obtaining R_u = (ν, Σν, . . . , Σ^{u−1}ν), and then construct a new estimator ν_PLS = R_u(R_u^T Σ R_u)^{−1} R_u^T ν.
(2) showed that there exists an integer u such that S_{Y|X} ⊆ Span(R_u). Therefore, ν_PLS = R_u(R_u^T Σ R_u)^{−1} R_u^T Σ Σ^{−1} ν = P_{R_u(Σ)} Σ^{−1} ν = Σ^{−1} ν. As such, Span(ν_PLS) ⊆ S_{Y|X}. Similar to PLS, the new estimator ν_PLS only requires inversion of the ud × ud matrix R_u^T Σ R_u instead of the p × p matrix Σ, so it works for n < p as long as ud < n. Moreover, it also improves estimation accuracy for highly correlated predictors. An application of this PLS-based dimension reduction approach to the discrimination of regulatory elements was given in (2).
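This second strategy can be sketched end to end (a hypothetical implementation for illustration only, with made-up simulated data; `pls_sir` is not the authors' code, and numpy is assumed):

```python
import numpy as np

def pls_sir(X, Y, d=1, u=3, h=5):
    """Sketch of the PLS-based SIR estimator for n < p: slice Y, form the
    between-slice matrix Sigma_hat_{x|y}, take its top d eigenvectors nu, build
    the Krylov matrix R_u = (nu, S nu, ..., S^{u-1} nu), and project. Only the
    (u*d) x (u*d) matrix R_u^T S R_u is inverted, never the singular p x p S."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n                       # sample covariance; singular if n < p
    order = np.argsort(Y)                   # slice the response into h groups
    M = np.zeros((p, p))
    for idx in np.array_split(order, h):
        fs = len(idx) / n
        xs = Xc[idx].mean(axis=0)
        M += fs * np.outer(xs, xs)          # Sigma_hat_{x|y}
    evals, evecs = np.linalg.eigh(M)
    nu = evecs[:, -d:]                      # top d eigenvectors of Sigma_hat_{x|y}
    cols = [nu]
    for _ in range(u - 1):
        cols.append(S @ cols[-1])
    R = np.hstack(cols)                     # p x (u*d) Krylov matrix
    return R @ np.linalg.solve(R.T @ S @ R, R.T @ nu)

# hypothetical simulated check with p > n
rng = np.random.default_rng(3)
n, p = 150, 300
beta = np.zeros(p); beta[:3] = [1.0, -1.0, 1.0]
X = rng.standard_normal((n, p))
Y = X @ beta + 0.2 * rng.standard_normal(n)
est = pls_sir(X, Y, d=1, u=3)
cos = abs(est.ravel() @ beta) / (np.linalg.norm(est) * np.linalg.norm(beta))
print(est.shape, cos)                       # (300, 1) and, here, alignment with beta
```

In well-behaved simulations like this one, the estimated direction is reasonably aligned with the true β, despite p = 2n.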
The third option to enable SIR to work under n < p is to introduce a ridge type of regularization. For this purpose, we first re-write SIR in an equivalent least squares formulation (44). That is, we consider minimization of the objective function

G(η, Φ) = Σ_{s=1}^h f̂_s ‖(x̄_s − x̄) − Σ̂ η φ_s‖²,

over a p × d matrix η and a d × h matrix Φ = (φ_1, . . . , φ_h), where f̂_s = n_s/n, and x̄_s and x̄ are as defined before. Letting η̂ denote the minimizer of G(η, Φ), then η̂ is an equivalent solution of SIR as obtained from the spectral decomposition [2]. This new form allows a straightforward incorporation of a ridge-type regularization. For instance, (44) suggested adding a ridge penalty τ vec(η)^T vec(η) and minimizing

G_τ(η, Φ) = Σ_{s=1}^h f̂_s ‖(x̄_s − x̄) − Σ̂ η φ_s‖² + τ vec(η)^T vec(η),   [3]

over η and Φ for a nonnegative parameter τ, where vec(·) is the matrix operator that stacks all columns of a matrix into a vector. Letting (η̂, Φ̂) = arg min_{η,Φ} G_τ(η, Φ), then Span(η̂) is defined as a ridge SIR estimator of S_{Y|X}. Equation [3] shows a close analogy to ridge regression in the usual ordinary least squares setup. An alternating least squares optimization algorithm was developed to minimize [3], and a generalized cross-validation (GCV) criterion was derived to select the ridge parameter τ (44). One may also consider other forms of ridge penalty (47), e.g.,

G̃_τ(η, Φ) = Σ_{s=1}^h f̂_s ‖(x̄_s − x̄) − Σ̂ η φ_s‖² + τ vec(ηΦ)^T (D_f ⊗ Σ̂) vec(ηΦ),   [4]

where D_f = diag(f̂_1, . . . , f̂_h) and ⊗ denotes the Kronecker product. Interestingly, (47) showed that the minimizer of [4] consists of the first d eigenvectors from the spectral decomposition

Σ̂_{x|y} γ_j = λ_j (Σ̂ + τ I_p) γ_j, j = 1, . . . , p,   [5]

where I_p denotes the p-dimensional identity matrix. Comparing [5] with the original SIR solution [2], an identity matrix is added to Σ̂ to make it invertible, and this is exactly the ridge SIR proposal of (3). For this type of ridge SIR, (47) derived a GCV criterion to select the parameter τ. In applications, (44) employed ridge SIR to model and predict survival time given gene expression profiles,
and (3) applied ridge SIR for the discovery of transcription factor-binding motifs.

2.5. Simultaneous Variable Selection
In bioinformatics applications, there are usually a large number of genes or biomarkers, and it is important to sift through this large number of predictors to distinguish the variables that are relevant to the phenotypic response from those that are not. To achieve this goal, we next introduce simultaneous variable selection during the dimension reduction process. For PCA, (48) developed a sparse version of PCA. (49) extended that idea to create a sparse estimate of the basis matrix of the central subspace or central mean subspace, by imposing regularization on each individual coefficient of the basis matrix. Both solutions facilitate interpretation of the dimension reduction estimates. On the other hand, sparsity in the extracted principal components or the basis matrix does not correspond to variable selection unless an entire row of the matrix is set to zero simultaneously. To achieve variable selection along with dimension reduction, (50, 51) developed a family of shrinkage dimension reduction estimators. It is interesting to note that, thanks to the model-free nature of sufficient dimension reduction estimators, this family of variable selection approaches does not require any traditional parametric model, and thus can be particularly useful in the exploratory stage of data analysis, where no strong model assumptions have to be imposed. Next we use shrinkage ridge SIR as an example to illustrate the key ideas of shrinkage dimension reduction. The process is two-stage: one first obtains an initial estimator, from the usual SIR, ridge SIR, or another dimension reduction method, and then imposes a shrinkage structure onto the initial estimator. More specifically, let (η̂, Φ̂) denote the ridge SIR solution as the initial estimator. We next introduce a p × 1 shrinkage vector ω = (ω_1, . . . , ω_p)^T, and propose to minimize the objective function

G_λ(ω) = Σ_{s=1}^h f̂_s ‖(x̄_s − x̄) − Σ̂ diag(ω) η̂ φ̂_s‖²  subject to  Σ_{j=1}^p |ω_j| ≤ λ   [6]

with respect to ω, for some nonnegative parameter λ. Letting ω̂ = arg min_ω G_λ(ω), we call Span(diag(ω̂) η̂) the shrinkage ridge SIR estimator of the central subspace. When λ ≥ p, ω̂_j = 1 for j = 1, . . . , p, and we get back the ridge SIR estimator. As λ gradually decreases, some indices ω_j are shrunk to exactly zero, indicating that the corresponding predictors are not needed for the regression given the other predictors, and thus variable selection is achieved along with dimension reduction. Optimization of [6] can be turned into a
usual LASSO (52) problem by re-writing the objective G_λ(ω) as

G_λ(ω) = Σ_{s=1}^h ‖ f̂_s^{1/2}(x̄_s − x̄) − f̂_s^{1/2} Σ̂ diag(η̂ φ̂_s) ω ‖²,

which uses the identity diag(ω) η̂ φ̂_s = diag(η̂ φ̂_s) ω; stacking the h terms gives a single least squares criterion in ω under an L_1 constraint. One can then utilize any existing LASSO algorithm (52, 53) to solve the optimization. The penalty parameter λ can be selected using an information criterion as suggested in (44, 50). Moreover, (44) applied the shrinkage ridge SIR estimator to microarray survival prediction and gene selection.
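As a quick check of this LASSO reformulation (hypothetical random stand-ins for η̂, the φ̂_s, Σ̂, and the slice means; numpy assumed), the slice-wise objective of [6] coincides with a single least squares criterion in ω, because diag(ω) η̂ φ̂_s = diag(η̂ φ̂_s) ω:

```python
import numpy as np

rng = np.random.default_rng(4)
p, d, h = 8, 2, 5
S = rng.standard_normal((p, p)); S = S @ S.T   # stand-in for Sigma_hat
eta = rng.standard_normal((p, d))              # stand-in for eta_hat
phi = rng.standard_normal((d, h))              # stand-in for (phi_1, ..., phi_h)
xbars = rng.standard_normal((h, p))            # stand-in for the xbar_s - xbar
f = rng.dirichlet(np.ones(h))                  # slice proportions f_s, summing to 1
omega = rng.standard_normal(p)

# Objective of [6], computed slice by slice
G_slice = sum(f[s] * np.sum((xbars[s] - S @ (np.diag(omega) @ eta @ phi[:, s]))**2)
              for s in range(h))

# Same objective as one least squares criterion ||b - A omega||^2,
# using diag(omega) v = diag(v) omega with v = eta_hat phi_hat_s
b = np.concatenate([np.sqrt(f[s]) * xbars[s] for s in range(h)])
A = np.vstack([np.sqrt(f[s]) * S @ np.diag(eta @ phi[:, s]) for s in range(h)])
G_stack = np.sum((b - A @ omega)**2)
print(np.isclose(G_slice, G_stack))   # True
```

The stacked pair (b, A) is exactly what one would hand to an off-the-shelf LASSO solver.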
3. Numerical Study

In this section we analyze the diffuse large-B-cell lymphoma (DLBCL) microarray gene expression data of (1) to illustrate applications of the dimension reduction methods reviewed in Section 2. DLBCL is the most common type of lymphoma in adults, yet the survival rate under standard chemotherapy is only about 35–40% (1). It is thus important to predict survival after chemotherapy and to understand the factors that influence the survival outcome. Rosenwald et al. (1) reported the survival times of 240 DLBCL patients after chemotherapy, which range from about 0 to 21.8 years; 138 patients died during follow-up. Also measured were the expression values of 7399 genes from cDNA microarrays for each individual patient. The goal of our analysis is then to model and to predict the survival time given the gene expression profiles. The data were pre-divided by (1) into a training group of 160 patients and a testing group of 80 patients. Moreover, we employed an initial screening step to narrow down the number of candidate predictors for the subsequent dimension reduction analysis. That is, we fitted a univariate Cox model for each gene, and only kept those genes whose p-values were less than 0.01, which left 329 genes. Such an initial screening procedure is very commonly used in very high-dimensional data analysis, and in our opinion is usually a preferable first step. (54) also offered some theoretical justifications for univariate predictor screening. We applied four dimension reduction methods to the training data: PCA, PLS-based SIR, ridge SIR, and its shrinkage version. To make those methods comparable, we extracted only the
first leading linear combination of the predictors, v = γ̂_1^T X, from each dimension reduction method. More specifically, for PCA, γ̂_1 was obtained as the first eigenvector of the matrix Σ̂ = X^T X/160, where X is the 160 × 329 data matrix with each row corresponding to a patient and each column to a gene. For PLS, since the response is right censored, a direct application of PLS is not feasible. However, one can use the PLS-based SIR estimator for censored data. That is, γ̂_1 = R̂_u(R̂_u^T Σ̂ R̂_u)^{−1} R̂_u^T ν̂, where ν̂ was obtained as the first SIR direction from the eigen decomposition [2], and R̂_u = (ν̂, Σ̂ν̂, . . . , Σ̂^{u−1}ν̂). u often takes a small number; in this application, we tried u = 3, 4, and 5, and they all yielded very similar results, so we report the results for u = 5. For ridge SIR, γ̂_1 was obtained as the minimizer of the objective function in [3], where the parameter τ was selected using the GCV criterion (44). We also considered the shrinkage version of ridge SIR following [6], where the parameter λ was selected using the Akaike information criterion (44); the procedure selected 34 genes out of 329. Moreover, for PLS-based SIR and ridge SIR, since the response is bivariate, consisting of the censored survival time Ỹ and a binary censoring indicator δ, we employed the double slicing procedure (55). That is, we sliced the observed survival time Ỹ within each sub-sample where δ = 1 and δ = 0, respectively. The remaining steps are exactly the same as in the usual SIR. This double slicing procedure is very simple to use, and its justification is given in (12, 55). After the extracted covariate v = γ̂_1^T X was obtained, a Cox proportional hazards model was fit with v as the predictor. Three risk groups of patients, the low-risk, intermediate-risk, and high-risk patients, were defined according to the one-third and two-thirds quantiles of the estimated risk scores.
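The double slicing step can be sketched as follows (a hypothetical helper on made-up toy data, not the authors' code; it merely partitions the sample indices by slicing the observed time separately within the censored and uncensored groups):

```python
import numpy as np

def double_slices(Y_tilde, delta, h=5):
    """Sketch of double slicing (55): slice the observed time Y_tilde separately
    within the uncensored (delta = 1) and censored (delta = 0) sub-samples,
    returning a list of index arrays, one per slice."""
    slices = []
    for flag in (1, 0):
        idx = np.flatnonzero(delta == flag)
        if idx.size == 0:
            continue
        order = idx[np.argsort(Y_tilde[idx])]   # sort within the sub-sample
        slices.extend(np.array_split(order, h)) # h slices of roughly equal size
    return slices

# hypothetical toy data
rng = np.random.default_rng(5)
Y_tilde = rng.exponential(size=40)
delta = rng.integers(0, 2, size=40)
slices = double_slices(Y_tilde, delta, h=4)
# every observation lands in exactly one slice (a partition of the sample)
assert sorted(np.concatenate(slices).tolist()) == list(range(40))
```

The remaining SIR steps (slice means, Σ̂_{x|y}, spectral decomposition) then proceed on these slices exactly as before.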
Figure 14.1 shows the Kaplan–Meier estimates of the survival curves for the three risk groups, where panel (a) shows the result based on PCA, panel (b) on PLS-based SIR, panel (c) on ridge SIR, and panel (d) on shrinkage ridge SIR. It is seen that all four dimension reduction methods achieve good separation of the three risk groups, which indicates a good model fit to the training data. The log-rank test of the difference among the three survival curves yielded a p-value of essentially 0 in all cases, which confirms our visual examination. Comparing the four methods, the three supervised dimension reduction estimators perform better than the unsupervised PCA. To further evaluate the predictive performance of each method, the fitted Cox model obtained from the training data was then applied to the testing data, and the same cutoff values used in the training set were used to assign the test samples to the three risk groups. Figure 14.2 shows the corresponding Kaplan–Meier estimates of the survival curves. Visually, PCA achieves
Fig. 14.1. Model fitting: Kaplan–Meier estimate of survival curves for the three risk groups of patients in the training data. Panel (a) denotes the result for PCA, panel (b) for PLS-based SIR, panel (c) for ridge SIR, and panel (d) for shrinkage ridge SIR.
a reasonably good separation of the different risk groups, indicating a competent prediction performance. Meanwhile, ridge SIR performs better than PCA, and shrinkage ridge SIR further improves the prediction performance over ridge SIR. Finally, the PLS-based SIR shows the best separation for this data set. This visual conclusion is also confirmed by the p-values of the log-rank test, which are 0.0250, 0.0004, 0.0152, and 0.0033 for the four methods, respectively. Overall, all of the dimension reduction methods help build a good model to characterize the relation between the survival time and the gene expression profiles, and the supervised methods are observed to outperform the unsupervised method.
Fig. 14.2. Prediction: Kaplan–Meier estimate of survival curves for the three risk groups of patients in the testing data. Panel (a) denotes the result for PCA, panel (b) for PLS-based SIR, panel (c) for ridge SIR, and panel (d) for shrinkage ridge SIR.
4. Discussions

In this chapter we have concentrated on a genre of linear dimension reduction approaches that aim to reduce the predictor dimension prior to any modeling effort. We have reviewed a number of commonly used dimension reduction methods and their applications in computational biology. Accumulated experience suggests that this genre of dimension reduction can offer a potentially useful tool for high-dimensional data analysis. An unsupervised dimension reduction approach like PCA imposes virtually no probabilistic assumption on either X or Y|X,
but it does not utilize any response information during the reduction. By contrast, supervised dimension reduction methods such as those reviewed in this chapter can effectively take the response information into account, while imposing few assumptions on Y|X. On the other hand, most of those methods require the marginal distribution of X to be elliptically symmetric. This assumption is often viewed as a mild condition, since it is satisfied when X is multivariate normal, it holds to a good approximation when p is large relative to d (56), and it can be induced by predictor re-weighting (57) and clustering (58). Nevertheless, this assumption does restrict straightforward application to some data types, and future work to relax this condition is needed. When the predictor dimension p is ultra high and the sample size n is limited, it seems intuitively preferable to first screen the huge number of predictors using some fast and effective algorithm, bringing the dimension down to a much more manageable scale before applying any further refined dimension reduction and variable selection method. (54) proposed sure independence screening assuming the homoscedastic linear model. It is a very important first step along this line, though further work is warranted to relax the parametric model assumption for variable screening. Finally, many of the dimension reduction routines discussed in this chapter are available in the R library dr authored by Dr. Sanford Weisberg.
Acknowledgments

This work was supported in part by National Science Foundation grant DMS 0706919.

References

1. Rosenwald, A., Wright, G., Chan, W.C., Connors, J.M., Campo, E., Fisher, R.I., Gascoyne, R.D., Muller-Hermelink, H.K., Smeland, E.B., and Staudt, L.M. (2002) The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. The New England Journal of Medicine 346, 1937–1947.
2. Cook, R.D., Li, B., and Chiaromonte, F. (2007) Dimension reduction without matrix inversion. Biometrika 94, 569–584.
3. Zhong, W., Zeng, P., Ma, P., Liu, J.S., and Zhu, Y. (2005) RSIR: regularized sliced inverse regression for motif discovery. Bioinformatics 21, 4169–4175.
4. Tenenbaum, J.B., Silva, V.D., and Langford, J.C. (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290, 2319–2323.
5. Roweis, S.T., and Saul, L.K. (2000) Nonlinear dimensionality reduction by local linear embedding. Science 290, 2323–2326.
6. Wold, H. (1966) Estimation of principal components and related models by iterative least squares. In Multivariate Analysis, Ed. P.R. Krishnaiah, 391–420. New York: Academic Press.
7. Li, K.C. (1991) Sliced inverse regression for dimension reduction (with discussion). Journal of the American Statistical Association 86, 316–327.
8. Jolliffe, I.T. (2002) Principal Components Analysis. Second Edition. Springer, New York.
9. Alter, O., Brown, P.O., and Botstein, D. (2000) Singular value decomposition for genome-wide expression data processing and modeling. Proceedings of the National Academy of Sciences, USA 97, 10101–10106.
10. West, M., Blanchette, C., Dressman, H., Huang, E., Ishida, S., Spang, R., Zuzan, H., Olson, J.A. Jr., Marks, J.R., and Nevins, J.R. (2001) Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences, USA 98, 11462–11467.
11. Chiaromonte, F., and Martinelli, J. (2002) Dimension reduction strategies for analyzing global gene expression data with a response. Mathematical Biosciences 176, 123–144.
12. Li, L., and Li, H. (2004) Dimension reduction methods for microarrays with application to censored survival data. Bioinformatics 20, 3406–3412.
13. Li, L. (2006) Survival prediction of diffuse large-B-cell lymphoma based on both clinical and gene expression information. Bioinformatics 22, 466–471.
14. Wei, T., Liao, B.L., Ackermann, B.L., Jolly, R.A., Eckstein, J.A., Kulkarni, N.H., Helvering, L.M., Goldstein, K.M., Shou, J., Estrem, S.T., Ryan, T.P., Colet, J.-M., Thomas, C.E., Stevens, J.L., and Onyia, J.E. (2005) Data-driven analysis approach for biomarker discovery using molecular-profiling technologies. Biomarkers 10, 153–172.
15. Leek, J.T., and Storey, J.D. (2007) Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics 3, 1724–1735.
16. Patterson, N., Price, A.L., and Reich, D. (2006) Population structure and eigenanalysis. PLoS Genetics 2, 2074–2093.
17. Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E., Shadick, N.A., and Reich, D. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38, 904–909.
18. Cox, D.R. (1968) Notes on some aspects of regression analysis. Journal of the Royal Statistical Society, Series A 131, 265–279.
19. Artemiou, A., and Li, B. (2009) On principal components and regression: a statistical explanation of a natural phenomenon. Statistica Sinica 19, 1557–1565.
20. Cook, R.D. (2007) Fisher Lecture: Dimension reduction in regression (with discussion). Statistical Science 22, 1–26.
21. Cook, R.D. (1998) Regression Graphics: Ideas for Studying Regressions Through Graphics. New York: Wiley.
22. Cook, R.D. (1996) Graphics for regressions with a binary response. Journal of the American Statistical Association 91, 983–992.
23. Cook, R.D., and Li, B. (2002) Dimension reduction for the conditional mean in regression. Annals of Statistics 30, 455–474.
24. Wold, H. (1975) Soft modelling by latent variables: the nonlinear iterative partial least squares (NIPALS) approach. In Perspectives in Probability and Statistics, Papers in Honour of M.S. Bartlett, Ed. J. Gani, 117–142. London: Academic Press.
25. Helland, I.S. (1992) Maximum likelihood regression on relevant components. Journal of the Royal Statistical Society, Series B 54, 637–647.
26. Helland, I.S., and Almøy, T. (1994) Comparison of prediction methods when only a few components are relevant. Journal of the American Statistical Association 89, 583–591.
27. Li, K.C., and Duan, N. (1989) Regression analysis under link violation. Annals of Statistics 17, 1009–1052.
28. Naik, P., and Tsai, C.L. (2000) Partial least squares estimator for single-index models. Journal of the Royal Statistical Society, Series B 62, 763–771.
29. Li, L., Cook, R.D., and Tsai, C.L. (2007) Partial inverse regression method. Biometrika 94, 615–625.
30. Nguyen, D.V., and Rocke, D.M. (2002a) Tumor classification by partial least squares using microarray gene expression data. Bioinformatics 18, 39–50.
31. Pérez-Enciso, M., and Tenenhaus, M. (2003) Prediction of clinical outcome with microarray data: a partial least squares discriminant analysis approach. Human Genetics 112, 581–592.
32. Fort, G., and Lambert-Lacroix, S. (2005) Classification using partial least squares with penalized logistic regression. Bioinformatics 21, 1104–1111.
33. Nguyen, D.V., and Rocke, D.M. (2002b) Partial least squares proportional hazard regression for application to DNA microarray survival data. Bioinformatics 18, 1625–1632.
34. Park, P.J., Tian, L., and Kohane, I.S. (2002) Linking gene expression data with patient survival times using partial least squares. Bioinformatics 18, 120–127.
35. Li, H., and Gui, J. (2004) Partial Cox regression analysis for high-dimensional microarray gene expression data. Bioinformatics 20, 208–215.
36. Cook, R.D., and Weisberg, S. (1991) Discussion of Li (1991). Journal of the American Statistical Association 86, 328–332.
37. Zhu, Y., and Zeng, P. (2006) Fourier methods for estimating the central subspace and the central mean subspace in regression. Journal of the American Statistical Association 101, 1638–1651.
38. Li, B., and Wang, S. (2007) On directional regression for dimension reduction. Journal of the American Statistical Association 102, 997–1008.
39. Li, K.C. (1992) On principal Hessian directions for data visualization and dimension reduction: another application of Stein's Lemma. Journal of the American Statistical Association 87, 1025–1039.
40. Xia, Y., Tong, H., Li, W.K., and Zhu, L.X. (2002) An adaptive estimation of dimension reduction space (with discussion). Journal of the Royal Statistical Society, Series B 64, 363–410.
41. Cook, R.D., and Ni, L. (2005) Sufficient dimension reduction via inverse regression: a minimum discrepancy approach. Journal of the American Statistical Association 100, 410–428.
42. Cook, R.D., and Yin, X. (2001) Dimension reduction and visualization in discriminant analysis. Australian and New Zealand Journal of Statistics 43, 147–177.
43. Zhu, L.X., Miao, B., and Peng, H. (2006) On sliced inverse regression with large dimensional covariates. Journal of the American Statistical Association 101, 630–643.
44. Li, L., and Yin, X. (2008a) Sliced inverse regression with regularizations. Biometrics 64, 124–131.
45. Bura, E., and Pfeiffer, R.M. (2003) Graphical methods for class prediction using dimension reduction techniques on DNA microarray data. Bioinformatics 19, 1252–1258.
46. Antoniadis, A., Lambert-Lacroix, S., and Leblanc, F. (2003) Effective dimension
47. 48.
49. 50. 51.
52.
53. 54.
55. 56.
57.
58.
reduction methods for tumor classification using gene expression data. Bioinformatics 19, 563–570. Li, L., and Yin, X. (2008b) Rejoinder to “A note on sliced inverse regression with regularizations”. Biometrics 64, 982–986. Zou, H., Hastie, T., and Tibshirani, R. (2006) Sparse principal component analysis. Journal of Computational and Graphical Statistics 15, 265–286. Li, L. (2007) Sparse sufficient dimension reduction. Biometrika 94, 603–613. Ni, L., Cook, R.D., and Tsai, C.L. (2005) A note on shrinkage sliced inverse regression. Biometrika 92, 242–247. Bondell, H.D., and Li, L. (2009) Shrinkage inverse regression estimation for model free variable selection. Journal of the Royal Statistical Society, Series B 71, 287–299. Tibshirani, R. (1996) Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B 58, 267–288. Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004) Least Angle Regression. Annals of Statistics 32, 407–451. Fan, J., and Lv, J. (2008) Sure independence screening for ultra-high dimensional feature space (with discussion). Journal of the Royal Statistical Society, Series B 70, 849–911. Li, K.C., Wang, J.L., and Chen, C.H. (1999) Dimension reduction for censored regression data. The Annals of Statistics 27, 1–23. Hall, P., and Li, K.C. (1993) On almost linearity of low dimensional projections from high dimensional data. Annals of Statistics 21, 867–889. Cook, R.D., and Nachtsheim, C.J. (1994) Re-weighting to achieve elliptically contoured covariates in regression. Journal of the American Statistical Association 89, 592–600. Li, L., Cook, R.D., and Nachtsheim, C.J. (2004) Cluster-based estimation for sufficient dimension reduction. Computational Statistics and Data Analysis 47, 175–193.
Chapter 15

Introduction to the Development and Validation of Predictive Biomarker Models from High-Throughput Data Sets

Xutao Deng and Fabien Campagne

Abstract

High-throughput technologies can routinely assay biological or clinical samples and produce wide data sets in which each sample is associated with tens of thousands of measurements. Such data sets can be mined to discover biomarkers and to develop statistical models capable of predicting an endpoint of interest from data measured in the samples. The field of biomarker model development combines methods from statistics and machine learning to develop and evaluate predictive biomarker models. In this chapter, we discuss the computational steps involved in the development of biomarker models designed to predict information about individual samples and review approaches often used to implement each step. A practical example of biomarker model development in a large gene expression data set is presented. This example leverages BDVal, a suite of biomarker model development programs developed as an open-source project (see http://bdval.org/).

Key words: Biomarker model development, gene expression, high-throughput measurement, microarray, machine learning, cross-validation, performance estimates, feature selection, BDVal.
1. Introduction

A biomarker is a quantity that can be objectively measured in a sample and used to predict a specific biological status of the sample (the prediction endpoint). Classic examples of biomarkers are blood pressure (predictive of cardiovascular risk) and blood glucose (predictive of diabetes). These biomarkers can be measured easily and yet inform about the likelihood of disease for the group of patients who share the same biomarker profile. These examples illustrate that practically useful biomarkers are

H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_15, © Springer Science+Business Media, LLC 2010
(i) informative about a biological status that is not directly measurable (e.g., because the status is whether an event will occur in the future, or because the specific mechanism that leads to the event occurring in the sample/patient is not well understood) and (ii) much easier to measure than the outcome they are trying to predict. While a variety of biomarkers have been discovered and applied to various biological and medical problems, the recent development of high-throughput measurement platforms has triggered a renaissance of biomarker model development in the biomedical sciences. An NIH biomarker working group has produced the following definition of a biomarker (1):

Biological marker (biomarker): A characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention.
A biomarker is usually measured in distinct samples. In this context, sample usually refers to material obtained from cell lines, or from the tissue, organ, or tumor of an experimental animal or human subject (such as a patient).

Sample: The act of measuring a biomarker for an animal or human subject in the biomarker study. By extension, the biological material used to perform this measurement. By extension, the data for all features measured in the biological sample.
For example, if a biomarker study analyzes data about gene expression in mouse liver, the mRNA samples used for gene expression measurement are called samples. Distinct samples may be collected from the same animal liver for technical replication. To be useful for biomarker model development, a sample must always be associated with an endpoint (the sample label).

Prediction endpoint: A characteristic of interest about the past, present, or future of an individual or biological sample under investigation.
Examples of prediction endpoints include whether a metastasis sample developed from a colon cancer or a lung cancer (information about the past of the sample), whether an individual patient is responding to therapy (present), or whether an individual patient has died of the disease within 3 years after the sample was collected (future).

Sample label: For a specific endpoint, a symbol which encodes the past, present, or future condition of the sample.
For instance, if the biomarker study aims to predict whether an animal will develop liver tumors after treatment with a drug, the samples in the study may be labeled as “will-develop-tumor”
and “no-tumor,” among the many symbols that can be chosen to encode the association between a specific sample and the liver tumor endpoint. A sample is also often annotated with tissue type, specific treatment, disease type, and other clinical or experimental conditions recorded during sample collection. Such annotations can sometimes be considered as features in their own right and used to develop biomarker models in isolation or in combination with features obtained by high-throughput measurements. High-throughput technologies have been used in biological and clinical studies to measure gene expression (transcriptomics); protein abundance (proteomics); genotypes, haplotypes, and copy number variations (genetics); and DNA or histone methylation (epigenetics). This new abundance of data can be used for biomarker model development and suggests applications in fields as diverse as drug development, individualized medicine (2), disease detection and diagnosis (3–6), cancer treatment (7, 8), and environmental protection (9). In each of these fields, high-throughput technology makes it possible to measure many candidate markers simultaneously in each sample studied. Each measured candidate biomarker is often referred to as a feature.

Feature: A specific measurement performed for a given biomarker, which leads to obtaining one feature value for each sample under investigation.
Several features may be measured for a single biomarker. For instance, in gene expression studies, several features may measure the expression of a single gene: one probe set on a microarray corresponds to a single feature, multiple probe sets may measure a single gene, and the gene is the measured biomarker.

Dataset: A set of data where feature signal values are provided for a number of samples, generally presented as a table of dimension (#samples, #features), with metadata that associates each row/column with feature or sample identifiers.
The biomarker model development process is a multidisciplinary effort that calls for a wide range of expertise (e.g., clinical and pathology expertise, statistics for experimental design, and machine learning). In this chapter, we focus on the statistical and machine learning aspects that play an important role in developing and evaluating objective biomarker models.

Biomarker model: An equation that relates, for each specific sample, the feature values in the sample to a label prediction. By extension, the program and data that define and implement this equation and make it possible to process sample data to generate predictions.
Biomarker models can be derived from data in a number of ways. A popular and successful approach consists of applying supervised learning techniques (briefly reviewed in Section 2.4). Such approaches learn the parameters of a biomarker model from a training data set (a set of samples, with associated data, for which the sample labels are known). Model training is achieved by estimating the model parameters to fit the pattern of sample–endpoint relationships observed in the training data set. The trained model can then be used to predict the endpoint for new samples given their feature values. Because of the large number of features (p) and the limited number of samples (n), with p >> n, biomarker model development studies face a number of significant challenges (10). In the following sections, we discuss common pitfalls encountered when training biomarker models: over-fitting and artifacts due to batch effects. We also review approaches to avoid such difficulties, and to recognize when they occur. The task of biomarker model development is to identify which features can effectively and robustly predict the endpoints of new samples. Model validation is a crucial component of model development. Perhaps counterintuitively, biomarker validation must be done before a final biomarker model can be generated. We cover a number of common approaches to biomarker validation in Section 3. Recent applications of biomarker model development have resulted in significant advances for a variety of human diseases, including Parkinson disease (11), lymphoma (12), leukemia (13), ovarian cancer (14), prostate cancer (15, 16), breast cancer (17–20), diabetes (21), and HIV (22). However, it should also be noted that many computational and clinical challenges still hinder the development and widespread adoption of biomarker models derived from high-throughput data (23).
This chapter provides an introduction to the key concepts needed to understand studies that develop predictive biomarker models, describes the biomarker model development process, and ends with a practical example that illustrates how biomarker models can be developed and evaluated from data. The example section leverages BDVal, a biomarker model development program developed in our laboratory, which we distribute under an open-source license (http://bdval.org). To conclude this introduction, we note that biomarker models focus on predicting the label of an individual sample, one at a time, not on the average behavior over groups of samples. In this important way, biomarker model development studies differ from many microarray studies, which have traditionally focused on identifying genes differentially expressed across experimental conditions. We will highlight the differences when appropriate.
In the next section, we review concepts of importance to biomarker model development studies.
2. Key Concepts

2.1. High-Throughput Features
High-throughput features refer to measures generated by high-throughput technology such as DNA microarrays (24, 25), mass spectrometry (26, 27), and methylation arrays (28, 29). DNA microarrays measure the expression levels of thousands of genes in a specific sample. Features produced by DNA microarrays are used as an example throughout this chapter. Many data sets can be transformed into the sample–feature tabular format used as input to biomarker studies. In proteomic profiling using mass spectrometry (reviewed in (30, 31)), each sample is represented as a list of peaks, each peak having a specific mass-to-charge ratio (m/z) and a relative abundance. The m/z ratios of such peaks can be used as potential features. Alternatively, when peaks can be assigned to specific proteins (or fragments of proteins), the number of times a given protein is detected in a sample can be used as a signal value. Biomarker model development in these data sets seeks to identify the peaks or proteins that discriminate between disease statuses.
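As an illustration of this transformation, the following is a minimal sketch (sample names, peak values, and the `peaks_to_table` helper are hypothetical; binning m/z values into shared feature columns is one simple alignment choice among many):

```python
def peaks_to_table(samples, bin_width=1.0):
    """Turn per-sample peak lists into a (samples x features) table.

    samples: dict mapping sample_id -> list of (m/z, abundance) pairs.
    Peaks are binned by rounded m/z so that peaks from different samples
    align into shared feature columns (bin index ~ m/z for bin_width=1.0).
    """
    bins = sorted({round(mz / bin_width)
                   for peaks in samples.values() for mz, _ in peaks})
    col = {b: j for j, b in enumerate(bins)}
    table = {}
    for sid, peaks in samples.items():
        row = [0.0] * len(bins)
        for mz, abundance in peaks:
            row[col[round(mz / bin_width)]] += abundance  # sum peaks in a bin
        table[sid] = row
    return bins, table

samples = {
    "patient_1": [(500.2, 13.0), (742.8, 4.5)],
    "patient_2": [(500.4, 11.0), (901.1, 7.2)],
}
bins, table = peaks_to_table(samples)
# patient_1 and patient_2 now share the feature column for m/z ~500
```

Each row of the resulting table is one sample, each column one candidate feature, matching the Dataset definition given earlier.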
2.2. Batch Effect
Batch effect is a technical confounder that is problematic for effective biomarker model development. When samples are measured in batches, and when the technique used for feature quantification is susceptible to environmental or operator variation (e.g., ozone levels in the laboratory environment for microarray measurements), differences in signal value may be observed that correlate with the date the samples were processed, or with the technician and specific instrument used for the measurement (32, 33). In such cases, different batches of samples may harbor artifactual differences in signal values, which can be confused with true differences. A recommended procedure to minimize batch effect is to randomize the processing of samples, such that a similar proportion of sample labels is processed in each batch of measurement. Batch effect is essentially a reproducibility issue, but unfortunately its mechanisms are not completely understood. When analyzing a data set collected in batches, the batch effect may dwarf the treatment effects of interest. For this reason, it can be crucial to detect and account for batch effects during the biomarker model development process.
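The randomization procedure recommended above can be sketched as follows (the `assign_batches` helper and the labels are hypothetical; a real study would also balance other covariates across batches):

```python
import random
from collections import defaultdict

def assign_batches(labels, n_batches, seed=0):
    """Stratified randomization: spread each sample label evenly across
    processing batches so label proportions are similar in every batch."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, lab in enumerate(labels):
        by_label[lab].append(i)
    batches = [[] for _ in range(n_batches)]
    for lab, idx in by_label.items():
        rng.shuffle(idx)                      # randomize within each label
        for k, i in enumerate(idx):
            batches[k % n_batches].append(i)  # round-robin keeps proportions even
    return batches

labels = ["tumor"] * 12 + ["normal"] * 12
batches = assign_batches(labels, n_batches=3)
# each of the 3 batches now holds 4 tumor and 4 normal samples
```

With this assignment, a signal difference that tracks batch membership cannot masquerade as a tumor-versus-normal difference.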
2.3. Single Biomarker or Biomarker Panel
In many cases, it is beneficial to measure a panel of biomarkers and combine the signal from different markers to make a prediction. In some studies, no single biomarker can be found that clearly separates samples by endpoint label, yet a linear combination of several markers is informative. This is illustrated in Fig. 15.1. In cases where several strongly predictive biomarkers are found, considering several biomarkers in combination may also offer advantages, such as reducing the experimental error introduced by the measurement of each single biomarker, or reducing the potential future impact of variability in population samples not observed in the initial biomarker model development study.

Fig. 15.1. Illustration of normal and cancer samples classified using a linear combination of biomarkers X1 and X2. The figure shows that normal and cancer samples cannot be classified well by either X1 or X2 alone. However, they can be classified perfectly using a linear combination of X1 and X2.

2.4. Supervised Models
Supervised models play a central role in predictive modeling. A supervised model is a mathematical function that takes the observed biomarkers as input and outputs predicted endpoints. The model is trained (estimated) from training samples and used for predicting the endpoints of unseen samples. Such models come in various forms: the model is called a regression model for continuous endpoints and a classification model (classifier) for categorical endpoints. Simpler statistical models, such as linear regression, are most successful when the number of samples available to train the model is at least as large as the number of parameters in the model. The statistical and machine learning fields have also developed methods to produce models for problems where many more features are observed than training samples are available for learning. To illustrate some of our discussion, assume a data set comprising i = 1...n samples, where the features of sample i are given by the vector X_i and its endpoint is given by Y_i. Each vector X_i comprises j = 1...p features, and X_ij refers to the value of feature j in sample i. A linear regression model is formalized as

Y_i = α + Σ_{j=1..p} β_j X_ij + ε_i,

where Y_i is the dependent variable (endpoint), the X_ij are the independent variables (biomarkers), the ε_i are the random deviations from the model associated with each observation (also called the residuals), and the α and β_j coefficients
are the model parameters, called regression coefficients. For this model, the unknown parameters α and β_j can be estimated from the training set. In the multiple regression setting, the model parameters are obtained by minimizing the sum of the squared residuals, i.e., the estimated model parameters α̂ and β̂_j are chosen such that Σ_{i=1..n} ε_i² has the smallest value. While this traditional approach is powerful, it is prone to over-fitting, which we discuss in the next section. The field of machine learning has produced many alternative approaches to model parameter estimation. In machine learning, estimating α̂ and β̂_j given the equation of a model is called training the model. After the model has been trained, the endpoint of a new sample can be predicted by evaluating the trained model with the feature values of that sample. In the linear regression example, the predicted endpoint value Ŷ_i for sample i can be calculated as Ŷ_i = α̂ + Σ_{j=1..p} β̂_j X_ij. Binary classification is a common biomarker problem which occurs when an endpoint can take only two values (e.g., 0 or 1, or negative (−1) and positive (+1)). A binary endpoint is often obtained by discretization of a continuous endpoint. Various approaches have been developed for supervised learning and are reviewed in Section 3.5. The choice of a supervised approach to build a biomarker model is often guided by prior experience that the analyst may have with a specific approach. However, since a supervised learning approach may perform well on certain types of data sets but not on others, it is considered good practice to try a few competitive approaches and select the one that yields the best estimated performance on the specific data set at hand. The BDVal program (presented in the example section at the end of this chapter) makes such comparisons straightforward.
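The training and prediction steps of the linear regression example can be sketched as follows (toy simulated data with n >> p; the p >> n setting of typical biomarker studies requires the feature reduction or regularization approaches discussed elsewhere in this chapter):

```python
import numpy as np

# Least-squares fit of Y_i = alpha + sum_j beta_j * X_ij + eps_i on a
# training set, then prediction of the endpoint for a new sample.
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))                    # feature values (n samples, p features)
beta_true = np.array([1.5, -2.0, 0.5])         # "true" coefficients of the simulation
Y = 0.7 + X @ beta_true + rng.normal(scale=0.1, size=n)

# Add an intercept column and solve min ||Y - Xd @ theta||^2 (training).
Xd = np.column_stack([np.ones(n), X])
theta, *_ = np.linalg.lstsq(Xd, Y, rcond=None)
alpha_hat, beta_hat = theta[0], theta[1:]

# Predict the endpoint of a new, unseen sample from its feature values.
x_new = np.array([0.2, -0.1, 0.4])
y_hat = alpha_hat + x_new @ beta_hat
```

With ample training data and little noise, the estimated parameters recover the simulated coefficients closely, and `y_hat` is the model's prediction for the new sample.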
2.5. Generalization or Over-Fitting
The goal of biomarker model development studies is to develop biomarker models that can effectively predict the endpoint for samples other than those found in the training data set. The ability of a biomarker model to generalize to new samples is therefore an important property that must be estimated during biomarker model development. In practice, a biomarker model is said to exhibit good generalization when its parameters, once fixed in training, yield consistent performance on yet unseen samples. The converse of generalization is over-fitting: a model that fails to predict new samples in line with expectations is said to be over-fitted. Over-fitting is a serious concern when models have to be trained with fewer samples than features. A model that has been over-fit to the training set may be of little use for predicting future samples. In Section 3.1, we present methods commonly used to detect cases where a model is over-fitting to the data. Following these protocols helps avoid a common pitfall of biomarker model development. However, it is important to realize that factors other than over-fitting may cause a model to perform poorly in a validation data set. For instance, many, if not most, biomarker models are trained with the assumption that the training set is a representative sample of the larger population of samples to which the model will be applied. This assumption may be violated: consider the case where patient recruitment creates a bias such that the training samples are enriched in patients with a specific characteristic (e.g., more severe disease) relative to the general patient population. Detecting these types of problems calls for the use of independent validation data sets, in addition to following established validation protocols in the model development phase.

2.6. Feature Reduction
Traditional statistical models are suitable when a large number of samples is available to train a relatively small number of features. However, high-throughput technologies such as microarrays yield data sets with thousands of features (large p), while the number of samples (n) available in each study has not scaled with the number of features. Indeed, the number of samples is often limited by non-technological factors (e.g., patient recruitment, tissue procurement, or animal studies). Biomarker model development studies are therefore typical "large p, small n" problems, where estimation of model parameters can be very challenging. In a typical microarray study, searching a high-dimensional space for optimal feature parameters is not only computationally expensive but also leads to over-fitting and poor biomarker reproducibility. A solution to these parameter estimation difficulties is feature reduction: reducing the number of features prior to model development (34). A feature reduction step is therefore commonly used to bring the number of features down to a level suitable for robust model training and testing. Commonly used feature reduction techniques include feature aggregation and feature selection. For microarray studies, feature reduction strategies aim to reduce the number of features used for model building to 5–200 (a number amenable to measurement with the lower-throughput techniques often preferred for clinical diagnostic tests; see Chapter 14).
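As one simple sketch of feature selection (univariate ranking by a two-sample t-statistic on simulated data; the `top_k_features` helper and the data are hypothetical, and many other selection criteria are used in practice):

```python
import numpy as np

def top_k_features(X, y, k):
    """Rank features by absolute Welch t-statistic and keep the top k.

    X: (n_samples, n_features) signal matrix; y: binary labels (0/1)."""
    a, b = X[y == 0], X[y == 1]
    na, nb = len(a), len(b)
    # Per-feature two-sample t-statistic, vectorized over the columns.
    t = (a.mean(0) - b.mean(0)) / np.sqrt(
        a.var(0, ddof=1) / na + b.var(0, ddof=1) / nb)
    return np.argsort(-np.abs(t))[:k]

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 1000))     # "large p, small n": 1000 features, 40 samples
y = np.array([0] * 20 + [1] * 20)
X[y == 1, :5] += 3.0                # only the first 5 features carry signal
selected = top_k_features(X, y, k=5)
# the informative features dominate the selection
```

On this simulation, the five informative features are recovered; with weaker effects or fewer samples, such univariate rankings become unstable, which is one reason selection must be repeated inside each cross-validation fold (see Section 3).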
2.7. Evaluating the Performance of a Biomarker Model
Objective performance measures are needed at various stages of biomarker model development projects. Such measures are used to guide feature selection, model tuning, or final model selection. In this section, we briefly review the most common performance measures available to evaluate the performance of a biomarker model trained for a binary classification problem. We highlight the advantages and drawbacks of each performance measure.
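As a sketch, the four counts of Table 15.1 can be tallied from predicted and true labels as follows (the labels and the `contingency` helper are hypothetical):

```python
def contingency(y_true, y_pred):
    """Tally binary predictions against true labels into the four
    contingency-table counts (TP, FP, FN, TN) of Table 15.1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
tp, fp, fn, tn = contingency(y_true, y_pred)  # -> (2, 1, 1, 2)
```

All of the performance measures defined in this section are functions of these four counts.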
Table 15.1
Contingency table/confusion matrix

                        True label for sample endpoint
  Model prediction      Positive        Negative
  Positive              TP              FP
  Negative              FN              TN
A biomarker model trained for a binary endpoint can be applied to a set of test samples. When the endpoint is known for each sample of the test set, the prediction results on these samples can be arranged in a contingency table (also called a confusion matrix) (Table 15.1). The contingency table tallies prediction results that are true positive (TP), false positive (FP), false negative (FN), and true negative (TN). The contingency table is very informative, but since it contains four numbers, it is not easy to use as such to compare the performance of two models. Instead, many performance measures have been introduced that summarize the content of a given contingency table. The most well-known prediction performance measures are derived from the contingency table, as follows:
sensitivity = recall = power = TP / (TP + FN),
specificity = TN / (TN + FP),
precision = TP / (TP + FP),
positive predictive value (PPV) = precision,
negative predictive value (NPV) = TN / (TN + FN),
false positive rate (FPR, α) = 1 − specificity,
false negative rate (FNR, β) = 1 − sensitivity.

The above measures often come in pairs (sensitivity with specificity, precision with recall, PPV with NPV, FPR with FNR), with one member of the pair focusing on FP and the other on FN. In an effort to further summarize the contingency table using a single number, the following compound measures are commonly used
that take into account both FP and FN:

Accuracy = (TP + TN) / (TP + FP + TN + FN) = 1 − error rate,

F_β = (1 + β²) · precision · recall / (β² · precision + recall),

where F_{β=1} = 2 · TP / (2 · TP + FP + FN),

Matthews correlation coefficient (MCC) = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)).
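All of the measures in this section can be computed directly from the four contingency-table counts; a sketch with hypothetical counts (the `measures` helper is ours):

```python
import math

def measures(tp, fp, fn, tn):
    """Performance measures derived from the contingency table (Table 15.1)."""
    sensitivity = tp / (tp + fn)              # = recall = power
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)                # = positive predictive value
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * tp / (2 * tp + fp + fn)          # F_beta with beta = 1
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return dict(sensitivity=sensitivity, specificity=specificity,
                precision=precision, npv=npv, accuracy=accuracy,
                f1=f1, mcc=mcc)

# Hypothetical test-set counts:
m = measures(tp=40, fp=10, fn=20, tn=30)
# m["sensitivity"] = 40/60, m["accuracy"] = 0.7, m["f1"] = 80/110
```

Computing several measures side by side on the same counts makes their different emphases on FP versus FN easy to compare.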
Most of the above measures return values between 0 and +1. In contrast, MCC returns values between −1 and +1: an MCC of +1 represents a perfect prediction, 0 an average random prediction, and −1 an inverse prediction. While there is no perfect way of describing the confusion matrix of true and false positives and negatives by a single number, the MCC is a fair measure of classification performance. A common method to estimate performance is to plot sensitivity (true positive rate, TPR) against 1 − specificity (FPR) over the whole range of classifier decision values. The plot is historically known as a receiver operating characteristic (ROC) curve. The area under the ROC curve (AUC) is used as the summary value for the ROC curve. AUC is generally regarded as a balanced measure that can be used even if the classes are of very different sizes, and it is often used to evaluate diagnostic tests in the clinical setting (35, 36). It returns a value between 0 (anti-correlated prediction) and 1 (perfect prediction), with the value 0.5 representing a random prediction. AUC is equivalent to the probability that a classifier will rank a positive sample higher than a negative sample (37). As such, strictly speaking, AUC measures ranking rather than classification. Some measures can run into problems in certain circumstances. For example, PPV is directly related to the proportion of positive samples in a study or a population of interest. Consider a biomarker model with 99% sensitivity and 99% specificity, developed for a rare disease (assume a 0.5% disease incidence in the population). For population screening, the PPV will be only 33%. However, in a case–control study where 50% of the samples are true positives, the PPV is 99%. Therefore, caution should always be exercised when interpreting these measures for
extremely unbalanced data sets (the two classes are of very different sizes). The example also shows that the performance requirement for a biomarker may be very different for different purposes. For population screening of rare diseases, it is imperative to have very high performance biomarkers to justify the cost. The performance measures discussed above can be evaluated given a contingency table. Such a table can be produced when a classifier outputs a predicted endpoint symbol and the true endpoint is known for each sample. Most classifiers output a decision value, a numeric value which indicates confidence in the prediction. Some measures evaluate decision values directly and do not require an explicit prediction threshold. AUC belongs to this category. Other methods require that the decision threshold is known and incorporate the estimation of the error associated with threshold estimation. MCC belongs to this latter category. For a specific biomarker model, the optimal trade-off between sensitivity and specificity depends on the application and is governed by the choice of decision threshold. For instance, in cancer treatment, generating a false negative may mean not treating a patient who would have benefited from a more aggressive treatment. This error would have more serious consequences than predicting a false positive (i.e., predicting that a patient requires aggressive treatment when the patient does not). In other applications, false negatives will be less costly than false positives. ROC curves help visualize the trade-off between FP and FN and provide a way to choose “optimal,” application-dependent thresholds (38, 35, 39). Depending on the cost of FPs and FNs in an application, one can choose a suitable threshold.
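The rare-disease PPV example above can be checked directly with Bayes' rule; a sketch (the `ppv` helper is ours):

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from sensitivity, specificity, and the
    proportion of true positives in the target population (Bayes' rule)."""
    true_pos = sensitivity * prevalence            # P(test+, disease+)
    false_pos = (1 - specificity) * (1 - prevalence)  # P(test+, disease-)
    return true_pos / (true_pos + false_pos)

# 99% sensitivity and 99% specificity, as in the example from the text:
screening = ppv(0.99, 0.99, 0.005)   # population screening, 0.5% incidence
case_control = ppv(0.99, 0.99, 0.5)  # balanced case-control study
# screening is only ~0.33, while case_control is 0.99
```

The same model therefore yields a PPV of roughly 33% in screening but 99% in a balanced case-control design, which is why PPV must always be interpreted relative to the prevalence in the intended application.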
3. The Biomarker Model Development Process
3.1. Validation Protocols
Figure 15.2 presents an overview of the biomarker model development process. We introduce and discuss each step of this process in the remainder of this section.

Fig. 15.2. Overview of the biomarker model development process.

For any training set, we could build a naïve classifier that maps each training sample to the corresponding training endpoint. Since this classifier learns the training set by heart, its prediction performance will be perfect on the training set, but the classifier is very likely to perform poorly on yet unseen samples. The difference between performance on the training set and performance on new sets of samples is called the generalization error. The goal of biomarker studies is to develop biomarker models with low generalization error. Such models should have consistent predictive performance across the population of samples that the model is designed to predict. This objective is clearly distinct from that of selecting the model with the highest apparent training performance. Many validation protocols separate the data set into a training set for model training and a test set for model performance testing. Performance measured for a model on the test set is expected to estimate the performance that the same model would have on yet unseen samples. There are many ways to split the entire
data set into training and test sets, and consequently, a number of specific validation protocols have been developed. These protocols include cross-validation, stratified cross-validation, random hold-out, and bootstrap (40, 41). For such protocols to provide unbiased estimates of the generalization error, the test set must be completely hidden during the training phase(s). However, it is acceptable to use information obtained during the training phase (e.g., model parameters or feature scaling parameters) to obtain predictions on the test set. Figure 15.3 illustrates the K-fold cross-validation procedure. Briefly, a complete training set is randomly divided into K subsets of approximately equal size. For each iteration/split of cross-validation, one of the K subsets is used as the test set and the other K−1 subsets are combined as the training set. The process is repeated K times until every subset has been tested. During this process, each sample is tested exactly once. Prediction results are combined for all test samples and performance measures are estimated. Typical choices for the parameter K include K = 10, 5, and 3. The parameter K must sometimes be chosen depending on the number of samples available for cross-validation (e.g., K = 5 or 3 can be preferred when fewer samples are available and/or the classes are very imbalanced), but in any case, K should be determined before looking at evaluation results (41). N-fold cross-validation, also known as leave-one-out, is cross-validation where K = N, the total number of samples. All but one sample is used for training each validation model, so leave-one-out maximally utilizes the sample information available
Fig. 15.3. K-fold cross-validation.
448
Deng and Campagne
for training. The trained models are essentially the same as if they had been trained on all the samples. Leave-one-out cross-validation has some disadvantages, including a large computational cost (40, 41). Leave-one-out is almost unbiased, but it has been shown to have large variance (42), in the sense that the resulting models show a greater degree of variation across different samples from the same population of interest than when K-fold cross-validation is used. Therefore, it is commonly believed that leave-one-out results are generally susceptible to sampling error. It should be noted that many predictive models are sensitive to how the data set is initially divided. To evaluate this effect, the whole K-fold validation process can be repeated a number of times, each time with a different random subset splitting. The standard deviation of performance measures derived from these repetitions can be used as an indication of how the model is affected by this factor. One variation on regular cross-validation is stratified cross-validation, where each fold is constrained to contain approximately the same proportions of class labels as the original data set. It has been reported that stratified cross-validation yields slightly better performance than regular cross-validation in certain situations (43). Less common variants of cross-validation are the random re-sampling approach and the bootstrap. In the random re-sampling approach, a subset of samples is randomly assigned to the training set (without replacement) and the rest of the samples are used as the test set. The process can be repeated K times to achieve K-fold cross-validation. Since the test folds are not independent, the process is often considered less rigorous than K-fold cross-validation. In contrast to random re-sampling, bootstrap samples the training set with replacement.
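To make these splitting schemes concrete, the following Python sketch (a minimal illustration of our own, not code from any of the cited tools; the function names are ours) builds K-fold splits and a bootstrap split whose test set is the "out-of-bag" samples never drawn:

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Randomly partition sample indices into k test folds of near-equal
    size; each split uses one fold as the test set, the rest for training."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        splits.append((train, test))
    return splits

def bootstrap_split(n_samples, seed=0):
    """Draw n samples with replacement for training; samples never drawn
    form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n_samples) for _ in range(n_samples)]
    test = [i for i in range(n_samples) if i not in set(train)]
    return train, test
```

Note that `k_fold_splits` tests each sample exactly once across the K splits, whereas `bootstrap_split` leaves a random subset of samples untested.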
For instance, N samples are drawn with replacement to make up the training set; the test set consists of the remaining samples, which are never drawn. It has been reported that both random re-sampling and bootstrap produce performance estimates with large variance (41). Therefore, neither approach is generally recommended until further evidence suggests otherwise.

3.2. Feature Summarization
Microarray platforms typically require feature summarization to aggregate probe intensities into probe-set-level expression intensities. Summarization is a complex statistical modeling process whose goal is to minimize undesirable variations arising from confounding factors such as manufacturing fluctuations, probe variations, pipetting errors, or electrical fluctuations during scanning. Commonly used summarization algorithms for Affymetrix gene chips include MAS5 (44), RMA (45), GCRMA (46), and dChip (47), and can be categorized into single-array summarization (e.g., MAS5) or multi-array summarization (e.g.,
dChip, RMA). In the former case, the summarized intensities on each chip are independent of other chips, whereas in the latter case the intensities on each chip are cross-normalized with other chips. The multi-array summarization algorithms have shown some advantages, such as improved consistency and precision. Thus, they are the preferred methods for tasks such as clustering and identifying differentially expressed genes between classes of samples. However, multi-array algorithms are not especially suitable for biomarker model development. Indeed, multi-chip summarization methods make intensities on a chip dependent on intensities on other chips and do not allow one sample to be predicted independently of other samples. Importantly, the training set should not be normalized together with samples from the test set, since this leaks information from the test set into the training set and results in overestimated model performance. On the other hand, test samples can be normalized using information from the training set. To address these issues, a sequential summarization algorithm called refRMA (48) was developed to facilitate multi-chip normalization in cross-validation settings (see Chapter 8).

3.3. Feature Normalization
After summarization, the features can be further normalized in order to reduce confounding variations. As with summarization, normalization can be performed in two modes: single-array and multi-array. The main goal of multi-array normalization is to improve the comparability of all chips in a study. However, doing so reduces the independence of each chip and may undermine the efforts of predictive modeling. Therefore, while multi-array normalization methods are standard practice in many studies, they are not recommended for biomarker model development studies. The simplest type of normalization is single-array scaling, in which each array is scaled independently to an arbitrary target. For instance, the median intensity of each array is (independently) scaled to a pre-fixed value to improve comparability. The scale factor for each chip is calculated and all the signals are scaled accordingly. This method was originally provided in the Affymetrix MAS5 software (44). An example of a commonly used multi-array normalization procedure is quantile normalization (49). In this scheme, intensities from all the genes are sorted for each array. The highest intensity from each array is replaced by the average of all of the highest intensities, the second highest on each array is replaced by the average of all of the second highest, and so forth. As a result, all the arrays end up having exactly the same intensity distribution over all genes. Two-color arrays are commonly corrected by lowess (or loess) normalization (reviewed in (50)). It has been observed that the raw ratios observed from two-color arrays tend to be dependent
on the raw intensities of the two colors. The lowess normalization seeks to eliminate such intensity-dependent bias from the ratio values by fitting a curve (see Chapters 8 and 9).

3.4. Batch Effects Removal
The treatment of batch effects is one of the most difficult special cases of feature normalization. The best approach to reducing batch effects is to prevent them from occurring rather than leaving them to computational analysis. Prevention relies on the microarray manufacturers and lab technicians calibrating data generation instruments and standardizing experimental protocols to minimize the impact of non-biological differences. For example, it is good practice to randomize the order in which samples are processed so that samples of each class have an equal chance of being processed in each batch. Although batch effects can be easily detected using visualization tools such as principal component analysis (PCA) plots, it is still unclear how to correct for a batch effect once it is detected in a data set. When a batch effect is detected, a conservative approach is to remove the probe sets that show significant correlation with sample batches (e.g., the date a sample was processed) and therefore are more likely to be associated with batch effects. The Partek software implements an ANOVA-based method that requires a minimum number of samples in each batch to estimate the batch effect size and direction (51). A recently published method (52) uses an empirical Bayes estimator to estimate and correct for batch effects. Since both approaches require multiple samples in each batch, they may not perform well in the biomarker model development setting. Computationally adjusting batch effects for biomarker model development remains a difficult research problem.
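As a minimal illustration of the simplest conceivable correction — far cruder than the ANOVA-based or empirical Bayes methods cited above, and assuming batch effects are purely additive shifts — each batch can be centered at zero (one feature shown for brevity; the function name is ours):

```python
from collections import defaultdict

def center_batches(intensities, batches):
    """Remove an additive batch shift by subtracting each batch's mean
    intensity from its own samples (a single feature shown here)."""
    by_batch = defaultdict(list)
    for x, b in zip(intensities, batches):
        by_batch[b].append(x)
    means = {b: sum(v) / len(v) for b, v in by_batch.items()}
    return [x - means[b] for x, b in zip(intensities, batches)]
```

Like the published methods, even this toy version needs multiple samples per batch to estimate the batch means reliably.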
3.5. Classification Approaches
Various classification algorithms can be used to develop biomarker models. In this context, an ideal classification algorithm should produce models that generalize well when presented with few samples described in a high-dimensional feature space, and importantly should provide competitive predictive performance. In this section, we describe several supervised classification methods that are commonly used in biomarker model development studies.
3.5.1. Naïve Bayes Classifier (NBC)
A NBC is a simple probabilistic model based on Bayes' theorem. The term naïve refers to its strong assumption that the features are conditionally independent. Albeit simple and based on unrealistic assumptions, NBCs usually perform surprisingly well in many applications (53). The NBC is often used as a benchmark against other classifiers.
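A Gaussian NBC can be sketched in a few lines; this is a generic illustration of ours (function names included), assuming each feature is normally distributed within each class:

```python
import math
from collections import defaultdict

def train_nbc(X, y):
    """Estimate per-class, per-feature mean and variance; the 'naive'
    conditional-independence assumption lets each feature be modeled alone."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        prior = len(rows) / len(X)
        stats = []
        for j in range(len(rows[0])):
            col = [r[j] for r in rows]
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-9
            stats.append((mean, var))
        model[label] = (prior, stats)
    return model

def predict_nbc(model, row):
    """Pick the class with the highest log-posterior."""
    def log_post(prior, stats):
        lp = math.log(prior)
        for x, (mean, var) in zip(row, stats):
            lp += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        return lp
    return max(model, key=lambda c: log_post(*model[c]))
```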
3.5.2. K-Nearest Neighbor (KNN)
KNN is a simple classifier that has gained popularity in the microarray field. It is an instance-based, or lazy, classifier: the training step is skipped and all the computation is performed at prediction time by directly comparing the
test sample with training samples in feature space. Given a new sample s described by a vector in feature space, the algorithm calculates the distance from s to each of the training samples and registers the K nearest neighbors. If a majority of the K nearest neighbors belong to the positive class, sample s is predicted to belong to the positive class, and to the negative class otherwise. The value of K is arbitrary, but to avoid ties and to simplify the computation, K is usually chosen to be a small odd number such as 3 or 5. This simple method has various problems, such as its heavy dependence on feature selection and on the choice of distance measure. Several reports indicate that feature reduction is critical to developing predictive microarray models with KNN (54). Along with NBC, KNN is often used as a benchmark classifier.

3.5.3. Linear Discriminant Analysis (LDA)
LDA is a classical parametric method that attempts to find the best linear combination of features separating the classes of samples (the approach can also be used for multi-class data sets). Essentially, each sample is projected onto a point on a straight line, and LDA attempts to separate the two groups by minimizing the within-group variance and maximizing the between-group variance. The class of a new sample is likewise determined by its projection on the line. LDA is a relatively old and simple approach but performs remarkably well compared to modern classification algorithms on microarray-based classification problems (55). However, it may not perform well on groups with small sample sizes, and it performs best on linearly separable samples (see Chapter 11).
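The projection idea can be sketched for two classes; the toy version below (our own, with a simplifying assumption that the pooled covariance is diagonal, whereas full LDA inverts the complete pooled covariance matrix) computes a discriminant direction and classifies by projection relative to the midpoint:

```python
def fisher_direction(X_a, X_b):
    """Compute a discriminant direction w for two classes, approximating
    the pooled covariance as diagonal (a simplification of full LDA)."""
    n_feat = len(X_a[0])
    def mean(rows, j): return sum(r[j] for r in rows) / len(rows)
    def var(rows, j, m): return sum((r[j] - m) ** 2 for r in rows) / len(rows)
    w, mid = [], []
    for j in range(n_feat):
        ma, mb = mean(X_a, j), mean(X_b, j)
        pooled = (var(X_a, j, ma) + var(X_b, j, mb)) / 2 + 1e-9
        w.append((ma - mb) / pooled)       # direction separating the means
        mid.append((ma + mb) / 2)          # midpoint between the means
    threshold = sum(wj * mj for wj, mj in zip(w, mid))
    return w, threshold

def lda_predict(w, threshold, row):
    """Project a new sample on w; above the midpoint projection -> class A."""
    return 'A' if sum(wj * xj for wj, xj in zip(w, row)) > threshold else 'B'
```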
3.5.4. Logistic Regression
Logistic regression attempts to predict the probability of an event based on the projection of each sample onto a simple sigmoid curve, called the logistic curve (56). For a binary classification problem, logistic regression estimates the parameters of the model given by

Y_i(z_i) = 1 / (1 + e^(−z_i)),

where z_i = β_0 + β_1·X_i1 + β_2·X_i2 + · · · + β_n·X_in and Y_i(z_i) is the probability that sample i belongs to one group. Logistic regression and LDA models are closely related, and both are linear classification models. However, logistic regression makes fewer assumptions about the underlying data and is therefore more flexible and more robust when these assumptions are violated. Unlike LDA, logistic regression does not require that the predictors be normally distributed or have equal variance within each group (57). Issues with feature selection and high-dimensional feature spaces have been addressed by Liao et al. in a recent publication (58).
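The prediction step of a fitted model is just the formula above. The sketch below (function name ours) assumes the coefficients β have already been estimated, e.g., by maximum likelihood:

```python
import math

def logistic_probability(betas, features):
    """P(sample in the positive group): z = beta_0 + sum(beta_j * x_j),
    squashed through the logistic curve 1 / (1 + e^(-z))."""
    z = betas[0] + sum(b * x for b, x in zip(betas[1:], features))
    return 1.0 / (1.0 + math.exp(-z))
```

With all coefficients zero, the model is maximally uncertain and returns a probability of exactly 0.5.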
3.5.5. Classification Trees
Classification and regression trees are rule-based classifiers sometimes referred to by the acronym C&RT. Popular C&RT implementations include the programs C4.5 and C5.0 developed by Quinlan (59, 60). An advantage of tree-based classifiers is that their classification rules are intuitive and can be understood and modified manually by modelers. However, classification trees are unstable classifiers: a small change in input may result in large differences in the model and its predictions. Controlling the growth of the tree by pruning methods is essential to develop models that do not over-fit the training data. C&RT performance in microarray sample classification was evaluated by Dudoit et al. (55). Traditional classification trees have slowly been replaced by more sophisticated methods such as random forests (61).
3.5.6. Random Forest
A random forest (61) is a classifier that consists of many decision trees and predicts by consensus voting among the trees. Random forests combine the ideas of classification trees, bagging, and random subspace methods (62). The random forest algorithm trains many classification trees; each tree gives a prediction on an unseen sample, and the votes from all the trees decide the final prediction. Among its many advantages, the random forest is robust against over-fitting and handles high-dimensional feature spaces well. Random forests have been applied to microarray-based biomarker studies and are perceived as among the state-of-the-art classification algorithms for these problems (63).
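The two sources of randomness — bootstrap samples and random feature subsets — can be illustrated with depth-1 "trees" (decision stumps) standing in for full trees. This is a toy sketch of ours, not a faithful random forest implementation:

```python
import random

def best_stump(X, y, feature_ids):
    """Among the given features, find the single threshold rule that best
    separates the two classes (a depth-1 'tree')."""
    best = None
    for j in feature_ids:
        for t in sorted({row[j] for row in X}):
            for flip in (False, True):
                pred = [(row[j] > t) != flip for row in X]
                acc = sum(p == lab for p, lab in zip(pred, y)) / len(y)
                if best is None or acc > best[0]:
                    best = (acc, j, t, flip)
    _, j, t, flip = best
    return lambda row: (row[j] > t) != flip

def train_forest(X, y, n_trees=25, seed=0):
    """Each 'tree' sees a bootstrap sample and a random subset of features;
    prediction is by majority vote over all trees."""
    rng = random.Random(seed)
    n, n_feat = len(X), len(X[0])
    k = max(1, int(n_feat ** 0.5))   # common default: sqrt of feature count
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        feats = rng.sample(range(n_feat), k)
        forest.append(best_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def forest_predict(forest, row):
    votes = sum(tree(row) for tree in forest)
    return votes * 2 > len(forest)
```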
3.5.7. Support Vector Machines (SVMs)
SVMs are a group of modern classifiers that are used extensively in high-throughput numerical data modeling (64). An introduction to support vector machines is provided in (65). In contrast to probabilistic methods, SVMs do not attempt to estimate the conditional probabilities of features with respect to each endpoint (a challenging problem with many features and few training examples). Instead, SVMs are a class of discriminative classification algorithms that aim to find the few features (or combinations of features) that best separate the classes in a high-dimensional space derived from the features in the training set. SVMs can be trained with linear or nonlinear kernels. Numerous benchmark studies have shown that SVMs perform very competitively in many applications and across fields (66, 67) (see Chapter 12).
3.5.8. Bagging
Bagging (bootstrap aggregating) (68) is an ensemble method that generates and combines a diversity of classifiers to improve prediction performance. Bagging is a variance reduction technique that helps to reduce over-fitting. From the original data set, bagging generates many bootstrap samples and induces a classifier for each. For a given new sample, each classifier makes a prediction, and the bagging algorithm predicts the new sample by polling the classifiers. Since the prediction is from
many classifiers, the resulting prediction shows the desired variance reduction. It has been shown that bagging is a smoothing operation that can effectively improve classification accuracy (69).

3.5.9. Boosting
Rather than generating classifiers in parallel as bagging does, boosting iteratively generates a sequence of weak classifiers with moderate predictive performance, such that the combination of the weak classifiers yields a strong classifier. All the samples and all the weak classifiers are weighted in boosting. Boosting does not use bootstrap samples. Instead, the samples are re-weighted after a weak learner is added: samples that are misclassified gain weight and samples that are classified correctly lose weight. Thus, future weak learners focus more on the samples that previous weak learners misclassified. Weak classifiers with higher performance receive higher weights. The final strong classifier is based on weighted voting among the weak classifiers. AdaBoost (70) is a popular boosting algorithm developed for classification. RankBoost is a boosting algorithm recently developed for ranking problems (71). Some reports suggest that boosting does not perform competitively on microarray data (55), while others indicate otherwise (72). Detailed comparisons across multiple data sets will be needed to resolve these contradictory reports.
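The reweighting loop can be sketched on one-dimensional data with threshold rules ("stumps") as weak learners. This is an AdaBoost-style toy illustration of ours, not the complete algorithm of (70):

```python
import math

def weighted_stump(xs, ys, w):
    """Find the 1-D threshold rule with the lowest weighted error."""
    best = None
    for t in sorted(set(xs)):
        for flip in (False, True):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if ((xi > t) != flip) != yi)
            if best is None or err < best[0]:
                best = (err, t, flip)
    return best  # (weighted error, threshold, flip)

def adaboost(xs, ys, rounds=5):
    """Misclassified samples gain weight, correct ones lose weight;
    stronger weak learners receive a larger voting weight alpha."""
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, t, flip = weighted_stump(xs, ys, w)
        err = max(err, 1e-9)                       # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)    # learner's voting weight
        ensemble.append((alpha, t, flip))
        for i in range(n):
            correct = ((xs[i] > t) != flip) == ys[i]
            w[i] *= math.exp(-alpha if correct else alpha)
        total = sum(w)
        w = [wi / total for wi in w]               # renormalize weights
    return ensemble

def ada_predict(ensemble, x):
    score = sum(alpha if ((x > t) != flip) else -alpha
                for alpha, t, flip in ensemble)
    return score > 0
```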
3.6. Feature Aggregation or Transformation Approaches
3.6.1. Principal Component Analysis

PCA is a popular method for feature dimensionality reduction (73, 74). It transforms the original features into a set of components, each constructed as a linear combination of the original features. The components are sorted such that the first component represents the greatest share of the total variance of the data, the second component the second greatest, and so forth. Therefore, the first few components can be used to roughly capture the variance structure of the data set. Note that in PCA no class labels are used (PCA is an unsupervised approach). Therefore, there is no reason to assume that the components are useful for discriminating between data in different classes. PCA has been used extensively in microarray-based studies as a data reduction technique before visualization and data quality control (75). Few studies have performed PCA within the splits of cross-validation, which is essential to estimating generalization error (see Chapter 14).
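For intuition, PCA on just two features can be computed in closed form from the 2×2 covariance matrix (a toy sketch of ours; real microarray data has thousands of features and calls for a linear-algebra library):

```python
import math

def pca_2d(rows):
    """PCA on two features: the components are the eigenvectors of the
    2x2 covariance matrix, sorted by eigenvalue (variance captured)."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    centered = [(x - mx, y - my) for x, y in rows]
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    # Closed-form eigenvalues of [[cxx, cxy], [cxy, cyy]].
    halftrace = (cxx + cyy) / 2
    spread = math.hypot((cxx - cyy) / 2, cxy)
    eigvals = [halftrace + spread, halftrace - spread]  # largest first
    if abs(cxy) > 1e-12:
        v1 = (eigvals[0] - cyy, cxy)
    else:
        v1 = (1.0, 0.0) if cxx >= cyy else (0.0, 1.0)
    norm = math.hypot(*v1)
    return eigvals, (v1[0] / norm, v1[1] / norm)
```

Note that class labels never enter the computation, illustrating why the top components need not discriminate between classes.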
3.6.2. Leveraging Known Biology to Aggregate Features (Gene Ontology, Pathways)
An increasingly popular way of obtaining features is to leverage pathway or ontology knowledge to aggregate features. A recent report by Lesnick et al. shows that, using a genomic pathway approach, the authors were able to create a strong classification model for endpoints related to Parkinson's disease (76). Many approaches can be used for integrating pathways with high-throughput data. One is to use relevant pathways as a filter for feature reduction (focusing analysis on pathways that are expected
to be involved in the biological process under investigation). Another is to use pathway enrichment analysis to discover pathways that may correlate with the prediction endpoint and then summarize the features of those specific pathways into candidate biomarkers. Irrespective of the approach, it is essential that pathway discovery and feature aggregation be embedded in the cross-validation protocol.

3.7. Feature Selection Approaches
The feature selection process attempts to remove irrelevant and/or redundant features in order to enhance prediction performance and improve a model’s generalization capability. The selected features are essentially candidate biomarkers. Therefore feature selection is one of the key tasks in biomarker model development.
3.7.1. Univariate Feature Selection Methods
The simplest and most intuitive methods for feature selection are univariate methods. These methods consider one feature at a time and evaluate the relevance of the feature to the prediction endpoint. Feature–endpoint relevance is commonly defined in the form of signal-to-noise ratios. Features are processed independently of each other and ranked according to a relevance measure. Top-ranked genes are selected as candidate biomarkers and used to train a predictive model. The t-statistic provides an example of a feature relevance measure. Suppose we have an endpoint with two symbols, A and B; the t-statistic is defined as follows:

t = (X̄_A − X̄_B) / √(s_A²/n_A + s_B²/n_B),
where X̄_A is the average of the log intensities of feature X in the group of samples corresponding to endpoint symbol A; X̄_B is the average of the log intensities for the other samples (corresponding to endpoint symbol B); and s_A and s_B are the standard deviations of the two group populations. The t-statistic was traditionally used to determine whether to reject the null hypothesis that the means of two populations (represented by the samples) are equal. Log intensities are used here, not raw intensities: by taking the logarithm of a feature (commonly base 2), equal-magnitude up/down changes are represented by equal numerical differences. The numerator is the average difference of log intensities between the two groups, and the denominator is the standard deviation of the numerator; n_A and n_B are the numbers of samples in each group. The denominator measures the variability of the measurements within each class; however, it can be problematic when estimated from a small sample. Selecting features by t-test consists of
ranking features by the p-value of the t-statistic and keeping features whose p-value is below a defined significance level. Another commonly used feature ranking and selection criterion is the fold change, the ratio between the average intensities of the two groups. The logarithm of the fold change is X̄_A − X̄_B. Compared with the t-test, the fold change ignores the variability of the measurements in each class. Note that a twofold change equals 1 in terms of log ratio, and a log ratio of 0 indicates no change. Significance Analysis of Microarrays (SAM) (77) is another popular criterion that takes the middle ground between the t-test and fold change. Its ranking statistic is defined as

(X̄_A − X̄_B) / (a + √(s_A²/n_A + s_B²/n_B)),
where a is an arbitrary constant estimated across arrays. When a is relatively large, SAM is equivalent to fold change; when a is close to 0, SAM becomes the t-statistic. The use of the t-test entails strong assumptions about the data, such as normality, independence of samples, and equality of variances in the two groups. If these assumptions are not satisfied, the results can be unreliable. The Mann–Whitney U-test is a non-parametric analog of the unpaired two-sample t-test and is a preferred alternative when the normality of the data is in question. The test involves the calculation of the U statistic and an approximate p-value based on the Z statistic. It is equivalent to the Wilcoxon rank-sum test and to the Kendall tau correlation coefficient between log intensity and binary class labels. A comparison of these feature ranking statistics is reviewed in (78). In the MicroArray Quality Control (MAQC) project (79) and other reports (80, 81), the reproducibility of the t-test, U-test, and SAM was compared and questioned. The MAQC-I reports suggested that using fold change in combination with the t-test helps identify features that are more reproducible when results from multiple groups are compared. MAQC-I results were obtained in the context of group comparisons and need to be evaluated in the context of biomarker model development.

3.7.2. Multivariate Feature Selection Methods
Univariate methods do not consider the correlation between selected features. As a result, the selected features may be highly redundant, causing problems for some learning algorithms (e.g., naïve Bayes assumes feature independence). Conversely, genes with low correlation may be combined into a discriminative predictor (see Fig. 15.1). Multivariate feature selection methods take feature–feature correlations into account and perform feature selection on
subsets of features with the goal of improved prediction performance. One typical method is minimum redundancy–maximum relevance (mRMR) (82). The mRMR method explicitly quantifies the relevance between each feature and the endpoint in addition to the redundancy between pairs of features. For a given set of features S, redundancy and relevance are defined as

Redundancy = (1/|S|²) Σ_{i,j∈S} I(i, j)   and   Relevance = (1/|S|) Σ_{i∈S} I(i, e),
where I(·,·) is the mutual information between two vectors (83). The feature selection criterion seeks to maximize (relevance − redundancy) or relevance/redundancy. The proposed optimization uses a simple incremental algorithm that starts with the feature of maximum relevance and adds new features incrementally to satisfy the feature selection criterion. Support vector machines trained with a linear kernel can also be analyzed to determine the weight that each feature carries in the decision function. SVM feature weights can be used for ranking and selection of features. It can be shown that the feature weights are directly related to prediction performance (84). Therefore, top-ranked features can be selected based on their weights in the SVM prediction. This method is called SVM weight. A slightly different strategy is to process features one at a time, either by forward selection (step up) or backward elimination (step down). For SVMs, these are called recursive feature addition (RFA) (85) and recursive feature elimination (RFE) (84), respectively. Briefly, RFE starts with all features and eliminates the feature with the smallest weight. RFE then re-evaluates the weights of the remaining features and repeats the elimination process until a desired number of features or a desired performance has been reached. Removing one feature at a time is not practical for large data sets. The implementation of RFE in BDVal removes up to half of the features after each iteration of the algorithm (the ratio of features kept at each iteration is configurable). In the recursive feature addition method, the algorithm starts with an empty feature set and incrementally adds features. RFE and RFA are greedy optimization techniques that run reasonably fast but do not guarantee an optimal feature subset.

3.7.3. Subset Searching Methods
As a special type of multivariate method, subset searching methods aim to identify a subset of features that optimizes a specific prediction performance measure (e.g., AUC). Individual features are not evaluated in isolation (as in the t-test or SVM weight). Instead, features are drawn randomly as a subset from the feature pool. The goal
is to find a subset of features that yields favorable cross-validated model prediction performance. Subset searching approaches (also known as wrappers) are computationally intensive methods that are also highly dependent on the classification model used for prediction. However, since they evaluate the prediction performance of many subsets within cross-validation, the resulting performance is usually competitive, and a more compact biomarker set than filter methods yield can usually be obtained, at the cost of heavy computation (86). Several reports suggest that feature subset methods may obtain better subsets of predictive genes than other approaches in biomarker studies (87, 88). It should be noted that feature subset selection is part of model training, so it should be completely isolated from the test set. Therefore, subset searching should be done by optimizing a performance measure obtained by inner cross-validation within each split of the training set. These procedures are therefore limited to the largest data sets, and even for those may require choosing threefold cross-validation to estimate cross-validated performance. The inner cross-validation serves only for feature subset selection, whereas the outer cross-validation estimates the model generalization error. One challenge of subset search approaches is that the number of possible feature subsets grows exponentially with the number of features. For 10,000 features (small for high-throughput biological data), the number of subsets is 2^10,000 — far too many to enumerate and evaluate. While there is no efficient way to find the optimal feature subset, feature subset selection is usually done using heuristic optimization techniques such as Monte Carlo optimization, simulated annealing (89), genetic algorithms (90), ant colony optimization (91), and Tabu search (92).
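A minimal wrapper-style search can be sketched as random draws of candidate subsets scored by a user-supplied function, which in a real study would train a model and return its inner cross-validated performance (the names below are ours, for illustration only):

```python
import random

def subset_search(features, score_fn, subset_size=3, n_draws=200, seed=0):
    """Wrapper-style search: draw random feature subsets and keep the one
    with the best score. score_fn stands in for a full model-training-and-
    evaluation step (e.g., inner cross-validated AUC)."""
    rng = random.Random(seed)
    best_subset, best_score = None, float('-inf')
    for _ in range(n_draws):
        subset = tuple(sorted(rng.sample(features, subset_size)))
        score = score_fn(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```

Replacing the blind random draws with simulated annealing or a genetic algorithm changes only how candidate subsets are proposed, not the overall loop.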
These optimization algorithms use heuristics and cannot guarantee optimal solutions. Finally, feature selection strategies can be used in combination. For example, fold change in combination with the t-test can be used to select intermediate features (e.g., 400 features) that are at least marginally correlated with the endpoints. Then SVM weight or subset searching can be used to further reduce the number of features. In fact, combining feature selection strategies has become preferred practice in microarray-based biomarker analysis, and it can be easily accomplished with BDVal (see Worked Example).

3.8. Model Selection
A model refers to the specification of all the parameters required to predict the label of an unknown sample (recall the simple linear regression example). Model selection is the task of choosing the model that is expected to yield the most competitive and
consistent performance on samples from the population from which the samples are drawn. The optimal way to perform model selection in biomarker model development is still unknown. Model selection is often a complicated process: it is based not only on the objective validation performance evaluated by cross-validation but also on other factors such as model complexity, interpretability, or stability. For example, many investigators would favor a model with 5 features over one with 50 features if the two have comparable performance. Some investigators also prefer rule-based models (such as classification trees) because the decision rules are intuitive to understand. Beyond these subjective considerations, a practical approach to model selection is to rank all candidate models by decreasing performance (e.g., measured by MCC or AUC) and choose, among the top performers, a model that uses the simplest modeling methodology. The group of top performers can be defined by examining the standard deviation of the performance measure used to rank the models.
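The ranking heuristic just described can be sketched as follows (a hypothetical illustration; the dictionary keys and the one-standard-deviation window are our assumptions, not a prescribed procedure):

```python
def select_model(candidates, n_sd=1.0):
    """Rank candidate models by decreasing performance (e.g., AUC), then
    among those within n_sd standard deviations of the best score, pick
    the one with the fewest features (a proxy for the simplest model)."""
    ranked = sorted(candidates, key=lambda m: m['auc'], reverse=True)
    best = ranked[0]
    top = [m for m in ranked if m['auc'] >= best['auc'] - n_sd * best['auc_sd']]
    return min(top, key=lambda m: m['n_features'])
```

In the usage below, a 5-feature model within one standard deviation of the best performer is preferred over a 50-feature model with a marginally higher AUC.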
3.9. Generating Final Models
The validation protocol generates a number of validation models that must be integrated into a final model for predicting new samples. Figure 15.4 displays two popular schemes for generating the final model from cross-validation models. The options have subtle but important differences. A common approach (Fig. 15.4A) is to use bagging to combine the cross-validation models, one for each fold, into a consensus model. A new sample is predicted based on the votes from all the models. Another option (Fig. 15.4B) is to derive a final feature set from the features most often found in cross-validation models. The consensus features can be used to train a final model on the entire training set. A third option consists of applying the same feature selection and machine learning protocol evaluated by cross-validation to the entire training set (direct final model generation). All three options are used in models published in the literature. For example, Setlur and colleagues applied the consensus feature approach to their prostate cancer data set (16). Dutkowski et al. (93) obtained good results using the consensus feature approach on proteomic data sets for cancer biomarker model development. The consensus decision method for generating the final model is strategically in line with bagging and other ensemble-based models for variance reduction. Golub et al. (94) used direct final model generation for predicting sub-groups of leukemia patients. Other examples of the direct approach are the prediction of estrogen receptor status and ERBB2 status in breast cancer patients (95). However, the impact of these model generation choices on prediction performance with new samples remains to be studied.
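The consensus feature scheme reduces to counting how often each feature was selected across folds; a minimal sketch of ours, with hypothetical gene names and a threshold fraction that is our assumption:

```python
from collections import Counter

def consensus_features(cv_feature_sets, min_fraction=0.5):
    """Derive a final feature set from the features selected in at least
    a given fraction of the cross-validation folds."""
    counts = Counter(f for fold in cv_feature_sets for f in set(fold))
    cutoff = min_fraction * len(cv_feature_sets)
    return sorted(f for f, c in counts.items() if c >= cutoff)
```

The resulting feature set would then be used to train the final model on the entire training set.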
Fig. 15.4. Generating a final model by (A) consensus models and (B) consensus features.
3.10. Testing the Model on Independent Validation Data

3.10.1. Testing on a Different Patient Cohort

Because many high-throughput biomarker studies employ a small and selective set of samples, owing to high experimental cost and the difficulty of recruiting patients, they are not designed to represent the population of interest with optimal statistical power. Therefore, any promising biomarkers must be subjected to rigorous validation in additional cohorts of samples. The results of these validations can reveal discrepancies when the selected data set is characteristically different from the population of interest. For example, if the preliminary biomarkers were derived from samples that are mostly male, the biomarkers may not perform as well when female samples are tested. In practice, the confounding factors are often not obvious, and testing on independent validation data sets is the main practical approach to detecting sample bias.

3.10.2. Transferring a Model to Another Measurement Platform

In order to benefit sites that use alternative platforms, a classification model and its biomarkers developed on a specific platform may be transferred to other platforms. However, cross-platform performance has long been an issue for microarrays. The transfer process is also non-trivial owing to significant technological differences in probe design, chip manufacturing, sample extraction and processing, vendor-supplied data generation software, and other factors. The computational challenges in addressing these differences include probe mapping, feature re-scaling, and cross-normalization. It has been reported that different platforms and different laboratories can generate highly consistent features when appropriate filters are used (79, 96). While these results are promising, the transferability of biomarker models across platforms remains an open research question.
Since heavy model adaptations may be required before a model developed for one platform can be used with sample data measured on another, switching platforms (e.g., from microarray to quantitative PCR) is not recommended when a study aims to validate a biomarker model.
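The computational steps mentioned above (probe mapping, feature re-scaling, cross-normalization) admit many implementations. As one deliberately simplistic, hedged illustration (not part of BDVal; the dict-of-gene-symbols data layout and function names are our own assumptions), probes can be mapped by a shared gene identifier and each shared feature standardized per platform:

```python
from statistics import mean, stdev

def zscore_rescale(values):
    """Standardize one feature's measurements to zero mean, unit variance."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def cross_normalize(platform_a, platform_b):
    """Map probes by shared gene identifier, then rescale each shared
    feature per platform so that scores from the two platforms live on
    a comparable scale. Inputs: dicts of gene -> list of expression values."""
    shared = sorted(set(platform_a) & set(platform_b))  # naive probe mapping
    return (
        {g: zscore_rescale(platform_a[g]) for g in shared},
        {g: zscore_rescale(platform_b[g]) for g in shared},
    )
```

Real probe mapping is far harder than intersecting gene symbols (many-to-many probe relationships, differing probe coverage), which is part of why cross-platform transfer remains difficult.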
460
Deng and Campagne
3.11. Reporting Reproducible Biomarker Discoveries
With the multiplication of biomarker studies, it appears desirable to establish formal standards for reporting biomarker models and associated information. Formal standards would make it possible to compare approaches across data sets and to ensure that a published biomarker model can be evaluated fairly (and unambiguously) using independent data sets. Elements of information that should be included in publications reporting new biomarker discoveries are the data sets, the endpoint for each sample, the validation protocol, cross-validation results, the specific features used in the final model, and the specific parameters of the model (or a program and data that implement the model). Besides the discovered biomarkers themselves, it is often necessary to include modeling details such as the feature selection procedure, the classifier employed, and implementation details in order to make results fully reproducible. These considerations have motivated the development of BDVal, an open-source biomarker development and validation program that tracks biomarker information during the discovery process. Information produced by BDVal can be distributed as supplementary material of scientific publications reporting the discovery of new biomarkers (and specific biomarker models). The next section provides a step-by-step example of using BDVal to develop biomarker models.

4. Model Development Example
This section presents an example of a biomarker model development project. For this example, we have analyzed a large prostate cancer data set published recently and available through the Gene Expression Omnibus (GEO). We have conducted the analysis with BDVal, a suite of programs developed in our laboratory and distributed as an open-source project. BDVal implements many of the approaches that we described in this chapter and automates most of the steps of the biomarker model development process. As of Spring 2009, the BDVal program can be downloaded from http://bdval.org and should work on various computing platforms (we routinely use BDVal on Linux, Windows (XP, Vista), and Mac OS X). Note that BDVal has no graphical user interface; all operations are done through the command line in a terminal or console. The web site includes a user manual which provides a detailed tutorial and walks the user through analyzing a publicly available data set. Since both data set and analysis programs are freely available, we encourage readers to follow the online user manual walk-through on their own computer. This section
Introduction to the Development and Validation of Predictive Biomarker
461
summarizes the analysis steps and presents performance estimates for models developed from this data set.

4.1. The Prostate Cancer Fusion Data Set
As a prerequisite to a biomarker model development project, one must obtain a data set suitable for biomarker model development. Throughout this example, we use the prostate cancer fusion data set assembled by Setlur and colleagues and made publicly available in the GEO database (16). The data set was assembled to investigate which gene expression changes are associated with the fusion of the gene transmembrane protease serine 2 (TMPRSS2) with the gene v-ets erythroblastosis virus E26 oncogene homolog (avian) (ERG) (16). The so-called TMPRSS2-ERG gene fusion is a common molecular sub-type of prostate cancer. The data set can be downloaded directly from GEO under accession code GSE8402 (direct URL ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/by_series/GSE8402/GSE8402_family.soft.gz). The Setlur data set contains a total of 354 training samples: 292 samples with the negative class label (non-fusion) and 62 samples with the positive class label (fusion). A different cohort of 109 samples (68 negative and 41 positive) was used for biomarker validation. The expression of about 6,000 genes across all samples was interrogated using the Illumina Human 6k Transcriptionally Informative Gene Panel for DASL (16).
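Note the pronounced class imbalance (292 non-fusion vs. 62 fusion training samples): this is why the example below uses stratified cross-validation, which preserves class proportions within each fold. A minimal sketch of stratified fold assignment (illustrative only, not the BDVal implementation):

```python
import random

def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds while preserving class
    proportions. `labels` is a list of 0/1 endpoint labels, e.g.,
    non-fusion (0) and fusion (1)."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)  # deal class members round-robin across folds
    return folds

# The Setlur training set: 292 non-fusion (0) and 62 fusion (1) samples.
labels = [0] * 292 + [1] * 62
folds = stratified_folds(labels)
```

Each of the five folds then carries 12 or 13 of the 62 fusion samples, so every held-out fold contains enough positives for a meaningful performance estimate.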
4.2. Model Development and Validation with BDVal
BDVal supports many of the predictive model development methods described in this chapter. We briefly list here the methods that BDVal implements and indicate how each method maps to the mode command line argument of BDVal (see the User Manual and other online documentation for details about the various modes). The following list organizes BDVal modes into method categories.
• Feature selection: t-test, fold change, Kendall tau, min–max, svm weights, svm weights iterative, GA wrapper
• Validation protocols: cross-validation with or without stratification, leave-one-out
• Embedding feature selection steps within cross-validation: sequence, define splits, execute splits
• Generate a model: write model
• Predict with a model: predict
Researchers generally need to evaluate many alternative methods to construct models and estimate model performance before selecting a final biomarker model for a given endpoint. BDVal automates many steps of model development and performance evaluation. The online user manual provides a detailed tutorial that shows how to configure the tool to develop models with a
Table 15.2 Summary of the 308 biomarker model development approaches applied to the Setlur data set

Type of approaches varied: Approaches tested
Number of features in the final model: 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100
Learning algorithm: Support vector machine (SVM), random forest (RF), naïve Bayes (NB), LogitBoost
Cross-validation: Fivefold CV × 10 random repeats (CV), feature selection within each fold (FSF CV) yes/no, fivefold CV with consensus features × 10 random repeats (CF CV)
Feature selection and aggregation: t-test, fold change (FC), SVM weights, recursive feature elimination (RFE), genetic algorithm (GA wrapper), gene lists

large number of feature selection strategies, classifiers, numbers of features, and so on. For this example, we estimated performance by stratified fivefold cross-validation for 308 different modeling approaches, summarized in Table 15.2. We estimated various performance measures, but here we describe only the area under the ROC curve (we write AUC_CV to refer to the area under the ROC curve estimated by cross-validation on the Setlur training set). As we discussed in Section 3, each model configuration in cross-validation yields many split models (one for each split of cross-validation) that need to be summarized into a single "final model." For this example, we created final models by taking the consensus of the features identified in each split of cross-validation and by training the final model on the entire training set. Final models generated in this way can be used to predict class labels for new samples. We applied the 308 final models to the Setlur validation data set and recorded validation performance estimates (for instance, AUC_validation, the area under the ROC curve estimated on the Setlur validation set).

4.3. Model Performance Measurements
Since models developed with BDVal are stored with detailed modeling information (which is collected automatically as the models are built; see the user manual and the model conditions.txt file), the performance measures estimated for a model can be analyzed in the context of how the model was developed. Figure 15.5 presents a side-by-side comparison of performance as estimated on the training set (left panels) and as observed on the validation set (right panels). Figure 15.5A shows the AUC_CV performance for different numbers of features. In this case, we see a clear improvement in models with larger numbers of features.

Fig. 15.5. Biomarker model evaluation and comparison for the Setlur data set.

Interestingly, Fig. 15.5B shows a different pattern for AUC_validation: models built with smaller numbers of features seem to yield higher performance estimates. This apparent discrepancy could be due to differences between the patient cohorts used as training and validation sets, but it clearly highlights the importance of testing models on various independent data sets and patient cohorts. Figure 15.5C shows the performance distribution of models categorized by classifier. SVM-based models seem to outperform other models such as logistic or KStar classifiers in terms of performance average and dispersion (as indicated by the position and size of the quantile box). Similarly, in the validation results shown in Fig. 15.5D, SVM and random forest remained competitive while logistic and NBC showed below-average performance. Figure 15.5E shows the impact of feature selection methods on model performance. For instance, models derived from fold change and RFE appear to be better than those derived using t-test and GA wrappers. In the validation results shown in Fig. 15.5F, the same pattern was preserved. Although the above analyses are univariate, they can provide useful insights on how to derive the final biomarker model for the endpoint (assuming that a further independent validation set is available to test the selected model). For example, choosing a model with relatively high AUC_CV that uses fold change feature selection and an SVM with a moderate number of features would be a sensible choice for this specific data set. Figure 15.6 shows the scatter plot of AUC_CV against AUC_validation. Models that fall on the straight line have equal AUC_CV and AUC_validation. Interestingly, most models show better AUC_validation than AUC_CV in this data set (this is not the case in most data sets). However, some
Fig. 15.6. The model validation performance (AUC_validation) vs. model cross-validation performance (AUC_CV) across 308 models.
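Both axes of Fig. 15.6 are areas under the ROC curve. For readers who want to recompute such estimates from prediction scores, the AUC can be obtained via the rank-based (Mann-Whitney) formulation; a small sketch (ours, not BDVal code):

```python
def auc(scores, labels):
    """Area under the ROC curve as the Mann-Whitney statistic: the
    probability that a random positive sample is scored above a random
    negative one, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random scoring, so the values around 0.8 in Table 15.3 indicate substantially better-than-random ranking of fusion versus non-fusion samples.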
Table 15.3 Top 20 models (ranked by AUC_CV) for the prostate cancer data set

Model  AUC_CV  AUC_validation  Classifier  numFeatures  Feature selection
1      0.81    0.81            LibSVM      100          SVM weights
2      0.81    0.81            LibSVM      100          Fold change
3      0.81    0.82            LibSVM      80           Fold change
4      0.81    0.84            LibSVM      100          RFE
5      0.81    0.81            LibSVM      90           Fold change
6      0.81    0.82            LibSVM      80           Fold change
7      0.81    0.83            LibSVM      90           RFE
8      0.81    0.81            LibSVM      70           Fold change
9      0.81    0.81            LibSVM      60           Fold change
10     0.81    0.81            LibSVM      100          Fold change
11     0.80    0.81            LibSVM      60           SVM weights
12     0.80    0.82            LibSVM      90           Fold change
13     0.80    0.82            LibSVM      70           Fold change
14     0.80    0.82            LibSVM      80           SVM weights
15     0.80    0.81            LibSVM      50           Fold change
16     0.80    0.81            LibSVM      90           SVM weights
17     0.80    0.80            LibSVM      40           Fold change
18     0.80    0.83            LibSVM      80           RFE
19     0.80    0.80            LibSVM      90           t-Test
20     0.80    0.82            LibSVM      70           SVM weights
promising models in the cross-validation step end up with poor validation performance. For example, one model with an AUC_CV of 0.78 dropped to a nearly useless 0.55 on the validation set. Table 15.3 shows the top 20 models ranked by AUC_CV. Figure 15.6 suggests that fivefold stratified cross-validation with embedded feature selection, as implemented in BDVal, is a relatively reliable method for model selection in the Setlur data set. This concludes the model development example and this chapter. We encourage interested readers to download BDVal and reproduce these results with the Setlur data set in order to familiarize themselves with the techniques and software.

References

1. Biomarkers Definitions Working Group. (2001) Biomarkers and Surrogate Endpoints: Preferred Definitions and Conceptual Framework. Clin Pharmacol Ther 69, 89–95.
2. Evans, W. E., and Relling, M. V. (1999) Pharmacogenomics: Translating Functional Genomics into Rational Therapeutics. Science 286, 487–491.
3. He, Y. (2006) Genomic Approach to Biomarker Identification and its Recent Applications. Cancer Biomark 2, 103–133.
4. Yasui, Y., Pepe, M., Thompson, M., Adam, B., Wright, G., Qu, Y., Potter, J., Winget, M., Thornquist, M., and Feng, Z. (2003) A Data-Analytic Strategy for Protein Biomarker Discovery: Profiling of High-Dimensional Proteomic Data for Cancer Detection. Biostatistics 4, 449–463.
5. Baylin, S. B., and Ohm, J. E. (2006) Epigenetic Gene Silencing in Cancer: a Mechanism for Early Oncogenic Pathway Addiction? Nat Rev Cancer 6, 107–116.
6. Cho, W. C. (2007) Contribution of Oncoproteomics to Cancer Biomarker Discovery. Mol Cancer 6, 25.
7. Sawyers, C. (2005) Making Progress through Molecular Attacks on Cancer. Cold Spring Harb Symp Quant Biol 70, 479–482.
8. Riesterer, O., Milas, L., and Ang, K. (2007) Use of Molecular Biomarkers for Predicting the Response to Radiotherapy with or without Chemotherapy. J Clin Oncol 25, 4075–4083.
9. Lobdell, D. T., and Mendola, P. (2005) Development of a Biomarkers Database for the National Children's Study. Toxicol Appl Pharmacol 206, 269–273.
10. Simon, R. (2003) Supervised Analysis when the Number of Candidate Features Greatly Exceeds the Number of Cases. ACM SIGKDD Explorations 5(2), 31–36.
11. Scherzer, C. R., Eklund, A. C., Morse, L. J., Liao, Z., Locascio, J. J., Fefer, D., Schwarzschild, M. A., Schlossmacher, M. G., Hauser, M. A., Vance, J. M., Sudarsky, L. R., Standaert, D. G., Growdon, J. H., Jensen, R. V., and Gullans, S. R. (2007) Molecular Markers of Early Parkinson's Disease Based on Gene Expression in Blood. Proc Natl Acad Sci 104, 955–960.
12. Lenz, G., Wright, G., Dave, S. S., Xiao, W., Powell, J., Zhao, H., Xu, W., Tan, B., Goldschmidt, N., Iqbal, J., Vose, J., Bast, M., Fu, K., Weisenburger, D. D., Greiner, T. C., Armitage, J. O., Kyle, A., May, L., Gascoyne, R. D., Connors, J. M., Troen, G., Holte, H., Kvaloy, S., Dierickx, D., Verhoef, G., Delabie, J., Smeland, E. B., Jares, P., Martinez, A., Lopez-Guillermo, A., Montserrat, E., Campo, E., Braziel, R. M., Miller, T. P., Rimsza, L. M., Cook, J. R., Pohlman, B., Sweetenham, J., Tubbs, R. R., Fisher, R. I., Hartmann, E., Rosenwald, A., Ott, G., Muller-Hermelink, H. K., Wrench, D., Lister, T. A., Jaffe, E. S., Wilson, W. H., Chan, W. C., Staudt, L. M., and Lymphoma/Leukemia Molecular Profiling Project. (2008) Stromal Gene Signatures in Large-B-Cell Lymphomas. N Engl J Med 359, 2313–2323.
13. Metzeler, K. H., Hummel, M., Bloomfield, C. D., Spiekermann, K., Braess, J., Sauerland, M., Heinecke, A., Radmacher, M., Marcucci, G., Whitman, S. P., Maharry, K., Paschka, P., Larson, R. A., Berdel, W. E., Buchner, T., Wormann, B., Mansmann, U., Hiddemann, W., Bohlander, S. K., Buske, C., and for Cancer and Leukemia Group B and the German AML Cooperative Group. (2008) An 86-Probe-Set Gene-Expression Signature Predicts Survival in Cytogenetically Normal Acute Myeloid Leukemia. Blood 112, 4193–4201.
14. Mok, S. C., Chao, J., Skates, S., Wong, K., Yiu, G. K., Muto, M. G., Berkowitz, R. S., and Cramer, D. W. (2001) Prostasin, a Potential Serum Marker for Ovarian Cancer: Identification through Microarray Technology. J Natl Cancer Inst 93, 1458–1464.
15. Varambally, S., Yu, J., Laxman, B., Rhodes, D., Mehra, R., Tomlins, S., Shah, R., Chandran, U., Monzon, F., Becich, M., Wei, J., Pienta, K., Ghosh, D., Rubin, M., and Chinnaiyan, A. (2005) Integrative Genomic and Proteomic Analysis of Prostate Cancer Reveals Signatures of Metastatic Progression. Cancer Cell 8, 393–406.
16. Setlur, S. R., Mertz, K. D., Hoshida, Y., Demichelis, F., Lupien, M., Perner, S., Sboner, A., Pawitan, Y., Andren, O., Johnson, L. A., Tang, J., Adami, H. O., Calza, S., Chinnaiyan, A. M., Rhodes, D., Tomlins, S., Fall, K., Mucci, L. A., Kantoff, P. W., Stampfer, M. J., Andersson, S. O., Varenhorst, E., Johansson, J. E., Brown, M., Golub, T. R., and Rubin, M. A. (2008) Estrogen-Dependent Signaling in a Molecularly Distinct Subclass of Aggressive Prostate Cancer. J Natl Cancer Inst 100, 815–825.
17. van't Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A. M., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., Schreiber, G. J., Kerkhoven, R. M., Roberts, C., Linsley, P. S., Bernards, R., and Friend, S. H. (2002) Gene Expression Profiling Predicts Clinical Outcome of Breast Cancer. Nature 415, 530–536.
18. Gianni, L., Zambetti, M., Clark, K., Baker, J., Cronin, M., Wu, J., Mariani, G., Rodriguez, J., Carcangiu, M., Watson, D., Valagussa, P., Rouzier, R., Symmans, W. F., Ross, J. S., Hortobagyi, G. N., Pusztai, L., and Shak, S. (2005) Gene Expression Profiles in Paraffin-Embedded Core Biopsy Tissue Predict Response to Chemotherapy in Women with Locally Advanced Breast Cancer. J Clin Oncol 23, 7265–7277.
19. Bertucci, F., and Birnbaum, D. (2007) Breast Cancer Genomics: Real-Time Use. Lancet Oncol 8, 1045–1047.
20. Buyse, M., Loi, S., van't Veer, L., Viale, G., Delorenzi, M., Glas, A. M., d'Assignies, M. S., Bergh, J., Lidereau, R., Ellis, P., Harris, A., Bogaerts, J., Therasse, P., Floore, A., Amakrane, M., Piette, F., Rutgers, E., Sotiriou, C., Cardoso, F., Piccart, M. J., and TRANSBIG Consortium. (2006) Validation and Clinical Utility of a 70-Gene Prognostic Signature for Women with Node-Negative Breast Cancer. J Natl Cancer Inst 98, 1183–1192.
21. Sreekumar, R., Halvatsiotis, P., Schimke, J. C., and Nair, K. S. (2002) Gene Expression Profile in Skeletal Muscle of Type 2 Diabetes and the Effect of Insulin Treatment. Diabetes 51, 1913–1920.
22. Suzman, D. L., McLaughlin, M., Hu, Z., Kleiner, D. E., Wood, B., Lempicki, R. A., Mican, J. M., Suffredini, A., Masur, H., Polis, M. A., and Kottilil, S. (2008) Identification of Novel Markers for Liver Fibrosis in HIV/Hepatitis C Virus Coinfected Individuals using Genomics-Based Approach. AIDS 22, 1433–1439.
23. Pritzker, K. P. (2002) Cancer Biomarkers: Easier Said than Done. Clin Chem 48, 1147–1150.
24. Lashkari, D. A., DeRisi, J. L., McCusker, J. H., Namath, A. F., Gentile, C., Hwang, S. Y., Brown, P. O., and Davis, R. W. (1997) Yeast Microarrays for Genome Wide Parallel Genetic and Gene Expression Analysis. Proc Natl Acad Sci USA 94, 13057–13062.
25. Schena, M., Shalon, D., Davis, R. W., and Brown, P. O. (1995) Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray. Science 270, 467–470.
26. Karas, M., and Hillenkamp, F. (1988) Laser Desorption Ionization of Proteins with Molecular Masses Exceeding 10,000 Daltons. Anal Chem 60, 2299–2301.
27. Fenn, J. B., Mann, M., Meng, C. K., Wong, S. F., and Whitehouse, C. M. (1989) Electrospray Ionization for Mass Spectrometry of Large Biomolecules. Science 246, 64–71.
28. Hatada, I., Fukasawa, M., Kimura, M., Morita, S., Yamada, K., Yoshikawa, T., Yamanaka, S., Endo, C., Sakurada, A., Sato, M., Kondo, T., Horii, A., Ushijima, T., and Sasaki, H. (2006) Genome-Wide Profiling of Promoter Methylation in Human. Oncogene 25, 3059–3064.
29. Ching, T. T., Maunakea, A. K., Jun, P., Hong, C., Zardo, G., Pinkel, D., Albertson, D. G., Fridlyand, J., Mao, J. H., Shchors, K., Weiss, W. A., and Costello, J. F. (2005) Epigenome Analyses using BAC Microarrays Identify Evolutionary Conservation of Tissue-Specific Methylation of SHANK3. Nat Genet 37, 645–651.
30. Aebersold, R., and Mann, M. (2003) Mass Spectrometry-Based Proteomics. Nature 422, 198–207.
31. Aebersold, R., and Goodlett, D. R. (2001) Mass Spectrometry in Proteomics. Chem Rev 101, 269–295.
32. Branham, W. S., Melvin, C. D., Han, T., Desai, V. G., Moland, C. L., Scully, A. T., and Fuscoe, J. C. (2007) Elimination of Laboratory Ozone Leads to a Dramatic Improvement in the Reproducibility of Microarray Gene Expression Measurements. BMC Biotechnol 7, 8.
33. Fare, T. L., Coffey, E. M., Dai, H., He, Y. D., Kessler, D. A., Kilian, K. A., Koch, J. E., LeProust, E., Marton, M. J., Meyer, M. R., Stoughton, R. B., Tokiwa, G. Y., and Wang, Y. (2003) Effects of Atmospheric Ozone on Microarray Data Quality. Anal Chem 75, 4672–4675.
34. Baldi, P., Brunak, S., Chauvin, Y., Andersen, C. A., and Nielsen, H. (2000) Assessing the Accuracy of Prediction Algorithms for Classification: An Overview. Bioinformatics 16, 412–424.
35. Swets, J. A. (1988) Measuring the Accuracy of Diagnostic Systems. Science 240, 1285–1293.
36. Zweig, M. H., and Campbell, G. (1993) Receiver-Operating Characteristic (ROC) Plots: A Fundamental Evaluation Tool in Clinical Medicine. Clin Chem 39, 561–577.
37. Fawcett, T. (2006) An Introduction to ROC Analysis. Pattern Recognit Lett 27, 861–874.
38. Hanley, J. A., and McNeil, B. J. (1983) A Method of Comparing the Areas Under Receiver Operating Characteristic Curves Derived from the Same Cases. Radiology 148, 839–843.
39. Hanley, J. A., and McNeil, B. J. (1982) The Meaning and Use of the Area Under a Receiver Operating Characteristic (ROC) Curve. Radiology 143, 29–36.
40. Shao, J. (1993) Linear Model Selection by Cross-Validation. J Am Stat Assoc 88, 486–494.
41. Kohavi, R. (1995) A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. Morgan Kaufmann.
42. Efron, B. (1983) Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation. J Am Stat Assoc 78, 316–331.
43. Parker, B. J., Gunter, S., and Bedo, J. (2007) Stratification Bias in Low Signal Microarray Studies. BMC Bioinformatics 8, 326.
44. Affymetrix. http://www.affymetrix.com/support/developer/stat_sdk/index.affx
45. Irizarry, R. A., Bolstad, B. M., Collin, F., Cope, L. M., Hobbs, B., and Speed, T. P. (2003) Summaries of Affymetrix GeneChip Probe Level Data. Nucleic Acids Res 31, e15.
46. Wu, Z., Irizarry, R. A., Gentleman, R., Martinez-Murillo, F., and Spencer, F. (2004) A Model-Based Background Adjustment for Oligonucleotide Expression Arrays. J Am Stat Assoc 99, 909–917.
47. Li, C., and Wong, W. H. (2001) Model-Based Analysis of Oligonucleotide Arrays: Expression Index Computation and Outlier Detection. Proc Natl Acad Sci USA 98, 31–36.
48. Katz, S., Irizarry, R. A., Lin, X., Tripputi, M., and Porter, M. W. (2006) A Summarization Approach for Affymetrix GeneChip Data using a Reference Training Set from a Large, Biologically Diverse Database. BMC Bioinformatics 7, 464.
49. Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P. (2003) A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Variance and Bias. Bioinformatics 19, 185–193.
50. Quackenbush, J. (2002) Microarray Data Normalization and Transformation. Nat Genet 32(Suppl), 496–501.
51. Partek. http://www.partek.com/
52. Johnson, W. E., Li, C., and Rabinovic, A. (2007) Adjusting Batch Effects in Microarray Expression Data using Empirical Bayes Methods. Biostatistics 8, 118–127.
53. Hand, D. J., and Yu, K. (2001) Idiot's Bayes: Not so Stupid After All? Int Stat Rev 69, 385–398.
54. Deegalla, S., and Boström, H. (2007) Classification of Microarrays with kNN: Comparison of Dimensionality Reduction Methods, in Lecture Notes in Computer Science, Springer, Berlin/Heidelberg.
55. Dudoit, S., Fridlyand, J., and Speed, T. P. (2002) Comparison of Discrimination Methods for the Classification of Tumors using Gene Expression Data. J Am Stat Assoc 97, 77–87.
56. Hosmer, D. W., and Lemeshow, S. (2000) Applied Logistic Regression (Wiley Series in Probability and Statistics). Wiley-Interscience.
57. Tabachnick, B. G., and Fidell, L. S. (2006) Using Multivariate Statistics, 5th ed., Allyn & Bacon, Needham Heights, MA, USA.
58. Liao, J. G., and Chin, K. V. (2007) Logistic Regression for Disease Classification using Microarray Data: Model Selection in a Large p and Small n Case. Bioinformatics 23, 1945–1951.
59. Quinlan, J. R. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco, CA, USA.
60. Quinlan, J. R. (1996) Improved Use of Continuous Attributes in C4.5. J Artificial Intell Res 4, 77–90.
61. Breiman, L. (2001) Random Forests. Machine Learning 45, 5–32.
62. Ho, T. K. (1998) The Random Subspace Method for Constructing Decision Forests. IEEE Trans Pattern Anal Mach Intell 20, 832–844.
63. Diaz-Uriarte, R., and Alvarez de Andres, S. (2006) Gene Selection and Classification of Microarray Data using Random Forest. BMC Bioinformatics 7, 3.
64. Cortes, C., and Vapnik, V. (1995) Support Vector Networks. Springer, Netherlands.
65. Joachims, T. (2002) Learning to Classify Text Using Support Vector Machines. Kluwer/Springer, Norwell, MA, USA.
66. Brown, M. P., Grundy, W. N., Lin, D., Cristianini, N., Sugnet, C. W., Furey, T. S., Ares, M., Jr., and Haussler, D. (2000) Knowledge-Based Analysis of Microarray Gene Expression Data by using Support Vector Machines. Proc Natl Acad Sci USA 97, 262–267.
67. Meyer, D., Leisch, F., and Hornik, K. (2003) The Support Vector Machine Under Test. Neurocomputing 55, 169–186.
68. Breiman, L. (1996) Bagging Predictors. Machine Learning 24, 123–140.
69. Bühlmann, P., and Yu, B. (2002) Analyzing Bagging. Annals of Statistics 30, 927–961.
70. Freund, Y., and Schapire, R. E. (1997) A Decision-Theoretic Generalization of On-line Learning and an Application to Boosting. J Comp Sys Sci 55, 119–139.
71. Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y. (2003) An Efficient Boosting Algorithm for Combining Preferences. J Mach Learn Res 4, 933–969.
72. Dettling, M., and Buhlmann, P. (2003) Boosting for Tumor Classification with Gene Expression Data. Bioinformatics 19, 1061–1069.
73. Yeung, K. Y., and Ruzzo, W. L. (2001) Principal Component Analysis for Clustering Gene Expression Data. Bioinformatics 17, 763–774.
74. Jolliffe, I. T. (1980) Principal Component Analysis. Springer, New York.
75. Sanguinetti, G., Milo, M., Rattray, M., and Lawrence, N. (2005) Accounting for Probe-Level Noise in Principal Component Analysis of Microarray Data. Bioinformatics 21, 3748–3754.
76. Lesnick, T., Papapetropoulos, S., Mash, D., Ffrench-Mullen, J., Shehadeh, L., de Andrade, M., Henley, J., Rocca, W., Ahlskog, J., and Maraganore, D. (2007) A Genomic Pathway Approach to a Complex Disease: Axon Guidance and Parkinson Disease. PLoS Genet 3, e98.
77. Tusher, V. G., Tibshirani, R., and Chu, G. (2001) Significance Analysis of Microarrays Applied to the Ionizing Radiation Response. Proc Natl Acad Sci USA 98, 5116–5121.
78. Allison, D. B., Cui, X., Page, G. P., and Sabripour, M. (2006) Microarray Data Analysis: From Disarray to Consolidation and Consensus. Nat Rev Genet 7, 55–65.
79. MAQC Consortium, Shi, L., Reid, L. H., Jones, W. D., et al. (2006) The MicroArray Quality Control (MAQC) Project Shows Inter- and Intraplatform Reproducibility of Gene Expression Measurements. Nat Biotechnol 24, 1151–1161.
80. Shi, L., Tong, W., Fang, H., Scherf, U., Han, J., Puri, R., Frueh, F., Goodsaid, F., Guo, L., Su, Z., Han, T., Fuscoe, J., Xu, Z. A., Patterson, T., Hong, H., Xie, Q., Perkins, R., Chen, J., and Casciano, D. (2005) Cross-Platform Comparability of Microarray Technology: Intra-Platform Consistency and Appropriate Data Analysis Procedures are Essential. BMC Bioinformatics 6, S12.
81. Shi, L., Jones, W. D., Jensen, R. V., Harris, S. C., Perkins, R. G., Goodsaid, F. M., Guo, L., Croner, L. J., Boysen, C., Fang, H., Qian, F., Amur, S., Bao, W., Barbacioru, C. C., Bertholet, V., Cao, X. M., Chu, T. M., Collins, P. J., Fan, X. H., Frueh, F. W., Fuscoe, J. C., Guo, X., Han, J., Herman, D., Hong, H., Kawasaki, E. S., Li, Q. Z., Luo, Y., Ma, Y., Mei, N., Peterson, R. L., Puri, R. K., Shippy, R., Su, Z., Sun, Y. A., Sun, H., Thorn, B., Turpaz, Y., Wang, C., Wang, S. J., Warrington, J. A., Willey, J. C., Wu, J., Xie, Q., Zhang, L., Zhang, L., Zhong, S., Wolfinger, R. D., and Tong, W. (2008) The Balance of Reproducibility, Sensitivity, and Specificity of Lists of Differentially Expressed Genes in Microarray Studies. BMC Bioinformatics 9(Suppl 9), S10.
82. Ding, C., and Peng, H. (2005) Minimum Redundancy Feature Selection from Microarray Gene Expression Data. J Bioinform Comput Biol 3, 185–205.
83. Shannon, C., and Weaver, W. (1949) The Mathematical Theory of Communication. University of Illinois Press, Urbana, IL, USA.
84. Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. (2002) Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning 46, 389–422.
85. Liu, Q., and Sung, A. H. (2006) Recursive Feature Addition for Gene Selection. International Joint Conference on Neural Networks, Vancouver, BC, Canada, pp. 1360–1367.
86. Kohavi, R., and John, G. (1997) Wrappers for Feature Subset Selection. Artif Intell 97, 273–324.
87. Inza, I., Larranaga, P., Blanco, R., and Cerrolaza, A. J. (2004) Filter Versus Wrapper Gene Selection Approaches in DNA Microarray Domains. Artif Intell Med 31, 91–103.
88. Xiong, M., Fang, X., and Zhao, J. (2001) Biomarker Identification by Feature Wrappers. Genome Res 11, 1878–1887.
89. Kirkpatrick, S., Gelatt, C. D., Jr., and Vecchi, M. P. (1983) Optimization by Simulated Annealing. Science 220, 671–680.
90. Holland, J. H. (1975) Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor.
91. Carbonaro, A., and Maniezzo, V. (2003) The Ant Colony Optimization Paradigm for Combinatorial Optimization. Advances in Evolutionary Computing: Theory and Applications. Springer-Verlag, New York, NY, USA, pp. 539–557.
92. Glover, F., and Laguna, M. (1997) Tabu Search. Kluwer, Norwell, MA, USA.
93. Dutkowski, J., and Gambin, A. (2007) On Consensus Biomarker Selection. BMC Bioinformatics 8, S5.
94. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. S. (1999) Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science 286, 531–537.
95. Gong, Y., Yan, K., Lin, F., Anderson, K., Sotiriou, C., Andre, F., Holmes, F. A., Valero, V., Booser, D., Pippen, J. E., Vukelja, S., Gomez, H., Mejia, J., Barajas, L. J., Hess, K. R., Sneige, N., Hortobagyi, G. N., Pusztai, L., and Symmans, W. F. (2007) Determination of Oestrogen-Receptor Status and ERBB2 Status of Breast Carcinoma: A Gene-Expression Profiling Study. The Lancet Oncology 8, 203–211.
96. Guo, L., Lobenhofer, E. K., Wang, C., Shippy, R., Harris, S. C., Zhang, L., Mei, N., Chen, T., Herman, D., Goodsaid, F. M., Hurban, P., Phillips, K. L., Xu, J., Deng, X., Sun, Y. A., Tong, W., Dragan, Y. P., and Shi, L. (2006) Rat Toxicogenomic Study Reveals Analytical Consistency Across Microarray Platforms. Nat Biotechnol 24, 1162–1169.
Chapter 16

Multi-gene Expression-based Statistical Approaches to Predicting Patients' Clinical Outcomes and Responses

Feng Cheng, Sang-Hoon Cho, and Jae K. Lee

Abstract

Gene expression profiling now enables scientists to obtain a genome-wide picture of cellular function in various human disease mechanisms and has proven extremely valuable in forecasting patients' prognoses and therapeutic responses. A wide range of multivariate techniques have been employed in biomedical applications of such expression profiling data in order to identify expression biomarkers that are highly associated with patients' clinical outcomes and to train multi-gene prediction models that can forecast various human disease outcomes and drug toxicities. We provide here a brief overview of some of these approaches, succinctly summarizing relevant basic concepts, statistical algorithms, and several practical applications. We also introduce our recent in vitro molecular expression-based algorithm, the so-called COXEN technique, which uses specialized gene profile signatures as a Rosetta Stone for translating information between two different biological systems or populations.

Key words: Multivariate analysis, gene expression profiling, COXEN, classification, toxicogenomics.
1. Introduction

In a biological organism, DNA gene sequences are first transcribed into messenger RNAs (mRNAs), which are then translated into proteins, which in turn carry out the organism's diverse biological functions. That is, a gene's sequence tells what that gene could potentially do in a biological system, while its expression as RNA, and ultimately as protein, indicates what it is actually doing (1). An outgrowth of the Human Genome Project, the genomic expression microarray is a recent high-throughput technique for obtaining genome-wide measurements of the expression levels of
H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_16, © Springer Science+Business Media, LLC 2010
mRNAs (2), allowing biological researchers to monitor the functional (expression) activities of thousands of genes in a single analysis (1). These mRNA microarray profiles can simultaneously provide a global picture of the cellular functions underlying target biological and human disease mechanisms (1, 3–6). Quite a few high-throughput biotechnologies exist, including RNA microarray (2), SAGE (7), CGH (8), 2D-gel (9), mass spectrometry (10), protein array (11), and many others. Among these, the RNA microarray technique is currently one of the best approaches to comprehensively and accurately quantifying genome-wide transcription patterns (2). An RNA microarray is typically designed with known complementary DNA sequence segments (probes) attached to a solid surface such as glass or plastic. These probes can be made using either PCR-amplified complementary DNA (cDNA) or synthetic DNA oligonucleotides. The basic idea of a microarray is to use these fixed DNA probes to assess the presence and concentration of their complementary RNA targets in a biological sample of interest. First introduced in the mid-1990s (12, 13), microarray technology has revolutionized the way biological scientists examine gene (expression) functions and has been widely used in molecular biomedical research, accounting for more than 30,000 PubMed articles (14). Public microarray repositories such as the NCBI Gene Expression Omnibus (GEO) and the EBI ArrayExpress databases now contain over 400,000 microarrays from more than 15,000 independent studies (Table 16.1).
Table 16.1 Public microarray databases (February 2008)

Database        Datasets/studies   Arrays
NCBI GEO^a      11238              302646
ArrayExpress^b  7582               223813

a http://www.ncbi.nlm.nih.gov/geo/; b http://www.ebi.ac.uk/arrayexpress/.
Gene expression profiling has also proven quite useful in a variety of biomedical investigations (3–6). For example, expression profiling has been used to provide broad mechanistic information on the progression of various human diseases and the molecular response to toxin exposure (5, 6, 15, 16); to simultaneously survey the activities of many relevant genes in gene pathways and networks, which may reveal novel molecular targets and mechanisms for therapeutic intervention (17, 18); and to develop diagnostic assays for cancer and other diseases by identifying and training molecular gene signature-based prediction models of patients' therapeutic responses and risk (3, 19). In this chapter, we will
focus on these biomedical applications of multi-gene statistical modeling techniques for predicting clinical responses and outcomes of human patients.
2. General Statistical Background
The statistical procedure for developing a multi-gene prediction model can be divided into three steps: (1) identification of gene biomarkers, (2) construction of prediction models, and (3) evaluation of candidate prediction models.
2.1. Identification of Gene Biomarkers
The first step in multi-gene prediction modeling is to identify the subset of genes that are potential candidate predictors carrying the most relevant biological information for the clinical response or outcome of interest. Biomarker discovery usually entails selecting the genes that exhibit statistically significant differential expression between contrasting groups of drug response or disease outcome. When the clinical endpoint or drug activity data provide continuous outcome values, discovery can instead proceed by identifying the genes most strongly associated with the continuous spectrum of response or outcome values in the subject population. This biomarker identification step also serves as an initial filter, reducing the parameter space of gene variables from very high-dimensional microarray data. In addition to the classical two-sample t-test, several statistical approaches have been utilized for assessing differential gene expression, including variant t-tests (20–24), empirical Bayes methods (25–27), the linear mixed effect model (28), the generalized likelihood ratio test (29), and the local-pooled-error test (30). These approaches rely on different underlying assumptions, such as distributional specifications, exchangeability of a random effect distribution, a constant coefficient of variation, a mean–variance relationship, and others. They should therefore be applied with care in biomarker discovery, with particular attention to these statistical assumptions.
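As a minimal sketch of this screening step (using simulated data in place of a real microarray, and a manual Benjamini–Hochberg adjustment rather than any particular package's implementation), the classical gene-by-gene t-test with false discovery rate control might look like:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated expression matrix: 1,000 genes x 10 samples per group;
# gene 0 is given a strong, truly differential expression shift.
n_genes, n_per_group = 1000, 10
control = rng.normal(0.0, 1.0, size=(n_genes, n_per_group))
treated = rng.normal(0.0, 1.0, size=(n_genes, n_per_group))
treated[0] += 4.0

# Classical two-sample t-test applied gene by gene
t_stat, p_val = stats.ttest_ind(treated, control, axis=1)

# Benjamini-Hochberg adjustment: with 1,000 simultaneous tests, raw
# p-values alone would yield roughly 50 false positives at alpha = 0.05.
order = np.argsort(p_val)
ranks = np.arange(1, n_genes + 1)
bh = p_val[order] * n_genes / ranks
q_val = np.empty(n_genes)
q_val[order] = np.minimum(np.minimum.accumulate(bh[::-1])[::-1], 1.0)

selected = np.where(q_val < 0.05)[0]  # candidate biomarker genes
```

In practice, any of the variant tests cited above (e.g., a moderated t-statistic) would replace the per-gene t-test line, with the multiplicity adjustment left unchanged.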
2.2. Construction of a Prediction Model
Upon identifying relevant biomarkers, a statistical classification modeling technique can be used to construct a multivariate prediction model. Classification is a supervised learning method in which the known class labels of a training set play the role of a supervising "teacher" for model building (31–33). The objective of classification is to derive a prediction rule that assigns a new subject to one of the possible classes. Several statistical classification methods have been utilized in biomedical research, including a variant of linear discriminant analysis (LDA) (34),
support vector machines (SVM) (35–37), Bayesian regression (38), partial least squares (39), GA/KNN (40), and between-group analysis (41). We briefly introduce the LDA and SVM techniques here. LDA produces linear boundaries that classify subjects into a small number of predefined classes. The posterior distribution of a class after observing a subject is calculated as that subject's membership probability for the class. The LDA prediction rule assigns a new subject to the most probable class, i.e., the class whose membership probability is maximized. Under the 0–1 loss function, this rule minimizes the expected classification error. Assuming the conditional distribution of subjects in each class is multivariate Gaussian with a common covariance structure across classes, the LDA prediction rule reduces to a linear combination of the input variables. The popularity of LDA is due not only to its simplicity and computational efficiency but also to its consistently high performance in many practical applications (33). SVM is a non-parametric classification method for building a binary prediction rule (42). The linear SVM is geometrically motivated: it finds separating hyperplanes that maximize the margin between the two classes in the input space. The training vectors lying on the margin hyperplanes, whose removal would change the prediction rule, are called support vectors. As a generalization of the linear SVM, the nonlinear SVM enlarges the input feature space, often to infinite dimension, using basis expansions such as polynomials or splines; as an immediate advantage, it can produce much more flexible boundaries. The nonlinear SVM can be cast as a regularization problem in a reproducing kernel Hilbert space, which can be solved computationally through the application of a kernel.
As the sample size increases, the prediction rule of the nonlinear SVM approaches the Bayes classification rule under the expected misclassification rate (43). For an introduction to the SVM, see (44, 45); for reproducing kernel Hilbert spaces, see (46). Note that no single classification method, e.g., LDA or SVM, is likely to dominate all others. In practice, experimenters need to compare candidate methods and choose the one with higher accuracy according to rigorous evaluation criteria, as described in the next section (see Chapter 11 to learn more about LDA and SVM).

2.3. Evaluation of Prediction Models
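The kind of head-to-head comparison recommended above can be sketched with scikit-learn on synthetic two-class data; the sample sizes and model settings here are illustrative assumptions, not values from any particular study:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class "expression" data: 100 subjects x 30 selected genes
X, y = make_classification(n_samples=100, n_features=30, n_informative=10,
                           random_state=0)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "linear SVM": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    "RBF SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

# 5-fold cross-validated accuracy for each candidate classifier
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}
```

Cross-validation here stands in for the rigorous evaluation criteria of the next section; on a real study, a held-out test set (as in the applications below) would be preferred.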
The performance of a statistical prediction model is assessed by various statistical measures, such as the classification error rate, the area under the ROC curve (AUC), the product of posterior classification probabilities (47–49), and the recently proposed misclassification-penalized posterior (MiPP) index (50). In particular, a receiver operating characteristic (ROC) curve provides a graphical
representation of the probability of a true positive result, i.e., sensitivity, against the probability of a false positive result, i.e., 1 – specificity, over a range of cutoff points. The closer the curve lies to the upper left-hand corner of the graph, the more accurate the model. The overall power of a classification model is measured by the area under its ROC curve, compared with that of a random classifier. The optimal cutoff value on an ROC curve is determined by maximizing Youden's index, defined as sensitivity + specificity – 1. At the optimal cutoff value, one can calculate: sensitivity (= true positives/all actual positives), specificity (= true negatives/all actual negatives), positive predictive value (PPV, = true positives/all predicted positives), and negative predictive value (NPV, = true negatives/all predicted negatives). More precise definitions of sensitivity, specificity, PPV, and NPV are available in Chapter 15. We illustrate some practical applications of multi-gene expression-based prediction modeling in the following sections.
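These quantities can be computed directly; the labels and scores below are a made-up toy example, not data from any study discussed in this chapter:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels (1 = responder) and model scores for ten subjects
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.35, 0.8, 0.7, 0.9, 0.65])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

# Youden's index = sensitivity + specificity - 1 = tpr - fpr
best_cut = thresholds[np.argmax(tpr - fpr)]
y_pred = (y_score >= best_cut).astype(int)

tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

sensitivity = tp / (tp + fn)  # true positives / all actual positives
specificity = tn / (tn + fp)  # true negatives / all actual negatives
ppv = tp / (tp + fp)          # true positives / all predicted positives
npv = tn / (tn + fn)          # true negatives / all predicted negatives
```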
3. Multi-gene Expression-Based Prediction of Chemosensitivity
3.1. COXEN Algorithm and Applications
The prediction of tumor chemosensitivity remains challenging and plays only a limited role in current cancer treatment. While prediction modeling has traditionally been based on human patient data, as described in the previous section, a statistical technique for chemosensitivity prediction, COXEN (COeXpression ExtrapolatioN), was recently introduced based on multi-gene expression information and drug activities in in vitro cancer cell lines (3). Since this technique uses a novel prediction modeling strategy, we briefly detail its technical steps here. The study was based on in vitro drug activity data from a panel of 60 cancer cell lines used extensively by the US National Cancer Institute (NCI) to test chemical compounds for anticancer potency. The NCI-60 panel includes nine cancer subtypes: leukemia, melanoma, and cancers of the lung, colon, brain, ovary, breast, prostate, and kidney. To date, more than 100,000 chemical compounds have been screened, and full experimental information on more than 45,000 compounds is publicly available. However, not all important cancer types are included in this panel; bladder cancers, for example, are absent from the NCI-60. Additionally, these in vitro drug activities could not be directly linked to clinical effectiveness in patients (in vivo). The COXEN algorithm was conceptually designed to overcome these limitations. It is composed of six distinct steps. The end result is termed the "COXEN score," which reflects the predicted
sensitivity of a particular cell line or human tumor to the specific drug being evaluated by the algorithm. Generically, the steps for predicting a drug's activity in cells belonging to some Set 2, on the basis of its activity pattern in different cells of some Set 1, are as follows (Fig. 16.1):
Fig. 16.1. Schematic diagram of the prediction and validation of COXEN algorithm.
• Step 1: Experimentally determine the drug's pattern of activity in cells of Set 1.
• Step 2: Experimentally measure molecular characteristics of cells in Set 1.
• Step 3: Select the subset of those molecular characteristics that most accurately predicts the drug's activity in Set 1 ("chemosensitivity signature" selection).
• Step 4: Experimentally measure the same molecular characteristics of cells in Set 2.
• Step 5: Among the molecular characteristics selected in Step 3, identify the subset that shows a strong pattern of "coexpression extrapolation" between cell Sets 1 and 2.
• Step 6: Use a multivariate algorithm such as LDA to predict the drug's activity in Set 2 cells on the basis of the drug's activity pattern in Set 1 and the molecular characteristics of Set 2 selected in Step 5. The output of this multivariate analysis is the COXEN score.
The capability of COXEN for chemosensitivity prediction was first investigated for cancer subtypes not included in the NCI-60 panel by using BLA-40, a panel of 40 human urothelial bladder
carcinomas. For each of two compounds clinically used against bladder cancer, cisplatin and paclitaxel, genes highly associated with chemosensitivity were first identified by comparing the expression profiles of sensitive and resistant NCI-60 cells, and multivariate prediction models were then constructed. Several competing models were independently evaluated using the 10 most sensitive and 10 most resistant BLA-40 lines. For those cells, prediction accuracies were 85% for cisplatin and 78% for paclitaxel. These accuracies were highly statistically significant, demonstrating that the available NCI-60 chemosensitivity data could be used to infer the chemosensitivity of cell types beyond the NCI-60 panel. This observation is extremely valuable for predicting patients' chemosensitivity and for novel anticancer drug development; indeed, a new agent with high potency against bladder cancer was identified using this protocol. It was also found that COXEN can extrapolate in vitro drug screening data from the NCI-60 to predict clinical responses to chemotherapeutic drugs in breast cancer patients, given the gene expression profiles of the patients' tumors. The authors used two breast cancer clinical trials, DOC-24 (24 patients treated with docetaxel) (51) and TAM-60 (60 patients treated with tamoxifen) (52), as test sets. As in the BLA-40 case, they identified the gene signatures relevant to each compound's drug sensitivity on the NCI-60 and concordantly expressed between the NCI-60 set and the patients' gene expression profiles. They then derived the corresponding LDA classification models for each drug on the basis of these gene signature sets. The classification prediction accuracies for DOC-24 and TAM-60 were 75% and 71%, respectively. The accuracy of clinical response prediction was lower than that for BLA-40 but was nevertheless statistically significant.
COXEN scores were also compared with patients' residual tumor sizes in DOC-24; the rank-based Spearman correlation was statistically significant (P = 0.033). Kaplan–Meier survival analysis on TAM-60 showed that the predicted responder group (5-year disease-free survival (DFS) ∼89%) had a significantly longer DFS than the predicted non-responder group (5-year DFS 48%; P = 0.021). Such in silico prediction modeling is much faster and more efficient than patient data-based prediction modeling and may provide a promising way to select the best treatment option(s) for each patient's heterogeneous tumor in the future.
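The coexpression extrapolation idea behind Step 5 can be sketched as follows. This is our illustrative reading of the published description, not the authors' code: each signature gene is scored by how well its coexpression pattern with the other signature genes is preserved between the two panels, and the data here are simulated.

```python
import numpy as np

rng = np.random.default_rng(1)

def coxen_concordance(set1, set2):
    """Score each gene by how well its coexpression pattern with the
    other signature genes is preserved between two expression panels."""
    r1 = np.corrcoef(set1)  # gene-gene correlations within panel 1
    r2 = np.corrcoef(set2)  # gene-gene correlations within panel 2
    n = r1.shape[0]
    scores = np.empty(n)
    for g in range(n):
        mask = np.arange(n) != g  # gene g's coexpression with all others
        scores[g] = np.corrcoef(r1[g, mask], r2[g, mask])[0, 1]
    return scores

def simulate_panel(n_samples, rng, noise=0.3):
    # Two coregulated clusters (genes 0-4 and 5-9) driven by latent factors
    a = rng.normal(size=n_samples)
    b = rng.normal(size=n_samples)
    signal = np.array([a] * 5 + [b] * 5)
    return signal + noise * rng.normal(size=(10, n_samples))

set1 = simulate_panel(30, rng)  # e.g., an NCI-60-like cell line panel
set2 = simulate_panel(25, rng)  # e.g., a panel from another tissue
set2[9] = rng.normal(size=25)   # gene 9 loses its coexpression in panel 2

scores = coxen_concordance(set1, set2)
# Genes 0-8 keep their coexpression structure across panels (high scores)
# and would be retained for Step 6; gene 9 would typically be dropped.
```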
4. Toxicogenomics Prediction
The application of gene expression profiling to toxicity prediction is referred to as toxicogenomics (5, 6, 11, 16, 53–58), a method that supplements the current approach to
toxicological assessment. Current approaches generally evaluate the risk of toxic agents by measuring one or several important enzymes in biological systems, such as aminotransferases, alkaline phosphatase, and P450 (59–61). In contrast, gene expression profiling monitors cell-wide changes in gene expression following exposure to toxic agents (5). This comprehensive genome-wide investigation is usually more informative than a traditional single-factor test (5). Because gene expression changes occur before histopathologic changes, gene expression profiling has the potential to identify toxic compounds early in the drug discovery process. By identifying gene expression changes that strongly correlate with toxicity, it may be possible to flag potentially toxic compounds using classification methods. Recently, a few successful attempts in this field have been reported (Table 16.2). Some of these studies are detailed below.
Table 16.2 Statistical techniques used in recent toxicogenomics studies

Authors          Research topics                        Gene biomarker identification methods   Classification methods
Zidek et al.     Drug acute hepatotoxicity prediction   ANOVA^a                                 SVM^b
Spicker et al.   Drug acute hepatotoxicity prediction   ANOVA^a                                 KNN^c
Uehara et al.    Drug carcinogenicity evaluation        ANOVA^a                                 PAM^d
Fielden et al.   Drug carcinogenicity evaluation        t-test                                  A-SPLP^e
Rieger et al.    Radiation toxicology studies           SAM^f                                   NSC^g
Svensson et al.  Radiation toxicology studies           t-test                                  Linear regression

a ANOVA: analysis of variance; b SVM: support vector machine; c KNN: K-nearest neighbors; d PAM: prediction analysis of microarray; e A-SPLP: adjusted sparse linear programming; f SAM: significance analysis of microarrays; g NSC: nearest shrunken centroid.
4.1. Drug Hepatotoxicity Prediction
The liver is a major organ of detoxification and metabolism and is therefore also a common site of toxicity in the body (11, 15, 61–63). Indeed, hepatotoxicity (liver toxicity) causes the failure of many drug candidates during development (54, 61). Zidek and colleagues (64) established a predictive screening system for acute hepatotoxicity using a bead-based Illumina oligonucleotide microarray containing 550 liver-specific gene probes. Six of the 12 model compounds they tested were hepatotoxicants: tetracycline, carbon tetrachloride (CCl4), 1-naphthylisothiocyanate (ANIT),
erythromycin estolate, acetaminophen (AAP), and chloroform; the other six were non-hepatotoxic: clofibrate, theophylline, naloxone, estradiol, quinidine, and dexamethasone. These compounds were administered to in vivo rat models, and after 6, 24, and 72 hours, rat livers were taken for gene expression profiling. The authors found 64 toxicity-related genes by analysis of variance (ANOVA). Based on these toxicological molecular signatures, an SVM was used for the final discrimination between hepatotoxic and non-hepatotoxic compounds. All model compounds were accurately predicted from samples given high-dose treatment for 24 hours, demonstrating that hepatotoxicity can be adequately predicted by gene expression profiling. Spicker and colleagues (65) used the Affymetrix RAE230v2 microarray to study the responses of rat liver tissue (N = 39) after administration of three toxic compounds (ANIT, DMN, and NMF) and three non-toxic compounds (caerulein, dinitrophenol, and rosiglitazone). Hepatotoxicity-related genes were identified by ANOVA and ranked by their significance in differentiating toxic from non-toxic compounds. Three genes, glucokinase regulatory protein (GCKR), ornithine aminotransferase (OAT), and cytochrome P450, subfamily IIC (mephenytoin 4-hydroxylase) (Cyp2C29), were then selected for prediction modeling; these top genes were also validated by quantitative RT-PCR. Twenty new rat samples, treated with two new toxic compounds, aflatoxin (At) and dimethylformamide (DMF), and two non-toxic compounds, buthionine sulphoxide (BS) and acivicin (Ac), served as independent test sets. The K-nearest neighbors (KNN) method was used for the final classification of toxic versus non-toxic compounds. Based on the quantitative RT-PCR data and the three-gene model, 18 of the 20 samples were correctly predicted, corresponding to an accuracy (accuracy = correctly predicted samples/total number of samples) of 0.9.
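A final KNN classification step of this kind can be sketched as below; the simulated expression values, effect size, and gene roles are illustrative assumptions that only mirror the study's design:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)

# Simulated three-gene expression values (standing in for markers such
# as GCKR, OAT, and Cyp2C29) for 39 training rats; hepatotoxic exposure
# (label 1) shifts all three genes by an assumed two standard deviations.
n_train = 39
y_train = rng.integers(0, 2, size=n_train)
X_train = rng.normal(size=(n_train, 3)) + 2.0 * y_train[:, None]

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Independent test set of 20 new animals, mirroring the study design
y_test = np.repeat([0, 1], 10)
X_test = rng.normal(size=(20, 3)) + 2.0 * y_test[:, None]
accuracy = knn.score(X_test, y_test)  # fraction correctly predicted
```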
Using OAT alone as the predictor, all 20 test set samples were also correctly classified.

4.2. Drug Carcinogenicity Evaluation
Drug carcinogenicity evaluation is another major challenge in drug development. The traditional long-term (e.g., 2-year) rodent bioassay is expensive and time consuming (66), often costing millions of dollars per compound. Any rapid assay that predicts drug carcinogenicity would therefore be highly valuable, and several research groups have explored toxicogenomics approaches in this field. Uehara and colleagues (67) analyzed gene expression data from rat livers treated with two known hepatocarcinogens, thioacetamide (TAA) and methapyrilene (MP). Three hundred forty-nine commonly changed genes (276 up-regulated and 73 down-regulated probes) were identified by ANOVA with
multiple comparison tests. Prediction models were constructed from these genes using the classification method "prediction analysis of microarray" (PAM). Thirty hepatocarcinogens and non-hepatocarcinogens were used to validate the model. By 10-fold cross-validation, a molecular signature containing 112 genes was chosen. The prediction model gave an overall accuracy of 95%; notably, almost all of the non-carcinogenic samples were correctly predicted. The whole experiment took only 28 days. This study suggests that hepatocarcinogens can be identified with high precision at an early stage of administration. Fielden and collaborators (66) at Iconix Biosciences performed a large-scale toxicogenomics study in this field. They collected gene expression microarray data from rat livers treated for 5 days with one of 147 structurally and mechanistically diverse hepatocarcinogens and non-hepatocarcinogens. They randomly chose 25 hepatic tumorigens and 75 non-hepatocarcinogens to form the training set; the remaining 47 compounds were used as the test set. A new linear classification method, adjusted sparse linear programming (A-SPLP), was used to find the linear combination of genes giving the best separation between the two classes (68). Testing on the independent set showed a sensitivity of 86% and a specificity of 81%. The authors also compared these predictions with those from traditional biomarkers, such as liver weight, hepatocellular hypertrophy, hepatic necrosis, serum ALT activity, induction of cytochrome P450 genes, and repression of Tsc-22 or alpha-2-macroglobulin messenger RNA, and confirmed that the gene expression-based signature significantly outperformed these conventional predictors.

4.3. Radiation Toxicity Prediction
Radiotherapy is one of the standard treatments for cancer (69). Although it shows substantial benefit for some cancer types, toxicity from radiation therapy is a major concern (69); in particular, radiation can cause severe side effects in some patients, such as fibrosis (69, 70). The radiotherapy dose should be chosen based on the patient's tolerance (71). However, radiation response often varies considerably among patients (72), so an optimal decision on dose according to patient sensitivity is a key step in radiation treatment (69). Currently, sensitivity can only be determined from a patient's responses after treatment, by which time injury may already have occurred. A more convenient and accurate technique is therefore critically needed to identify, prior to actual treatment, the patients who can benefit from radiation therapy with tolerable side effects (71). Over the last several years, gene expression profiling has also been used to predict radiation toxicity (71, 73–76). Rieger and colleagues (75) collected lymphoblastoid cells
derived from the peripheral blood of 14 patients who had suffered severe acute radiation toxicity and of other patients with mild toxicity. Gene expression profiling was performed on these cell lines 4 hours after radiation exposure. Significance analysis of microarrays (SAM) was first used to compare the gene expression levels of the two groups and to identify radiation-responsive genes. From these genes, nearest shrunken centroid (NSC) analysis was then used to derive a small subset of highly predictive genes; 24 genes were finally selected to build a computational prediction model of radiation toxicity. This model correctly predicted radiation toxicity in 9 of the 14 patients (64.3%), with no false positives among the 43 controls. This prediction was highly significant by Fisher's two-tailed exact test (P = 2.2×10^-7). Svensson and colleagues (76) performed a similar investigation. Of more than 800 patients treated with radiotherapy for prostate cancer, 21 had severe complications after radiation treatment (over-responders, OR) and 17 had minor or no toxicity (non-responders, NR). Of these 38 patients, 26 (15 ORs and 11 NRs) were used to train the prediction model, and the remaining 12 (6 ORs and 6 NRs) formed an independent test set. Blood lymphocyte samples were collected from these patients and irradiated with X-rays, and the radiation response was analyzed by gene expression profiling. Two types of classifier, one based on individual genes and one based on gene sets defined by functional or cellular colocalization categories (from the Gene Ontology), were used to construct predictive models. The classifier based on radiation response-associated genes correctly classified 63% of patients, while the classifier based on functional gene sets improved correct classification to 86%.
Most of these discriminative genes and gene sets belonged to the ubiquitin, apoptosis, and stress signaling networks. In the independent set of 12 patients, the toxicity status of 8 was correctly predicted by the gene set classifier.
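A nearest shrunken centroid classifier of the kind used in these radiation studies can be sketched with scikit-learn's NearestCentroid, whose shrink_threshold option implements centroid shrinkage; the patient cohort and the 20-gene signal below are simulated assumptions:

```python
import numpy as np
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(3)

# 500 candidate radiation-response genes measured for 40 patients; only
# the first 20 genes truly separate severe- from mild-toxicity patients.
n_patients, n_genes = 40, 500
y = np.repeat([0, 1], n_patients // 2)      # 1 = severe toxicity
X = rng.normal(size=(n_patients, n_genes))
X[y == 1, :20] += 1.5

# Shrinking each class centroid toward the overall centroid zeroes out
# most uninformative genes, leaving a small predictive subset of genes.
nsc = NearestCentroid(shrink_threshold=0.5).fit(X, y)
accuracy = nsc.score(X, y)  # training accuracy on the simulated cohort
```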
5. Conclusion

Advances in molecular expression profiling techniques have provided unprecedented insight into the basic molecular dysfunctions underlying human biology and disease. Furthermore, various statistical bioinformatics methods for gene expression data have enabled us to identify expression signatures that are highly useful in forecasting many aspects of human disease outcome, including therapeutic responses and various drug toxicities. The successful incorporation of statistical bioinformatics tools that can simultaneously utilize information from many relevant genes will
thus continue to provide a better understanding of, and prediction capability for, various human disease outcomes and physiological responses.
Acknowledgment

This work was supported in part by National Institutes of Health grant R01HL081690 to JKL.

References

1. Jaluria P, Konstantopoulos K, Betenbaugh M, Shiloach J. A perspective on microarrays: Current applications, pitfalls, and potential uses. Microb Cell Fact 2007; 6:4.
2. King HC, Sinha AA. Gene expression profile analysis by DNA microarrays: Promise and pitfalls. JAMA 2001; 286:2280–8.
3. Lee JK, Havaleshko DM, Cho H, Weinstein JN, Kaldjian EP, Karpovich J, et al. A strategy for predicting the chemosensitivity of human cancers and its application to drug discovery. Proc Natl Acad Sci USA 2007; 104:13086–91.
4. Welsh JB, Zarrinkar PP, Sapinoso LM, Kern SG, Behling CA, Monk BJ, et al. Analysis of gene expression profiles in normal and neoplastic ovarian tissue samples identifies candidate molecular markers of epithelial ovarian cancer. Proc Natl Acad Sci USA 2001; 98:1176–81.
5. Gatzidou ET, Zira AN, Theocharis SE. Toxicogenomics: A pivotal piece in the puzzle of toxicological research. J Appl Toxicol 2007; 27:302–9.
6. Yang Y, Blomme EA, Waring JF. Toxicogenomics in drug discovery: From preclinical studies to clinical trials. Chem Biol Interact 2004; 150:71–85.
7. Yamashita T, Honda M, Kaneko S. Application of serial analysis of gene expression in cancer research. Curr Pharm Biotechnol 2008; 9:375–82.
8. van Beers EH, Nederlof PM. Array-CGH and breast cancer. Breast Cancer Res 2006; 8:210.
9. Vietor I, Huber LA. In search of differentially expressed genes and proteins. Biochim Biophys Acta 1997; 1359:187–99.
10. Feng X, Liu X, Luo Q, Liu BF. Mass spectrometry in systems biology: An overview. Mass Spectrom Rev 2008; 27:635–60.
11. Bandara LR, Kennedy S. Toxicoproteomics – a new preclinical tool. Drug Discov Today 2002; 7:411–8.
12. Lashkari DA, DeRisi JL, McCusker JH, Namath AF, Gentile C, Hwang SY, et al. Yeast microarrays for genome wide parallel genetic and gene expression analysis. Proc Natl Acad Sci USA 1997; 94:13057–62.
13. Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995; 270:467–70.
14. Hackett JL, Lesko LJ. Microarray data – the US FDA, industry and academia. Nat Biotechnol 2003; 21:742–3.
15. Baken KA, Vandebriel RJ, Pennings JL, Kleinjans JC, van Loveren H. Toxicogenomics in the assessment of immunotoxicity. Methods 2007; 41:132–41.
16. Battershill JM. Toxicogenomics: Regulatory perspective on current position. Hum Exp Toxicol 2005; 24:35–40.
17. Werner T. Bioinformatics applications for pathway analysis of microarray data. Curr Opin Biotechnol 2008; 19:50–4.
18. Curtis RK, Oresic M, Vidal-Puig A. Pathways to the analysis of microarray data. Trends Biotechnol 2005; 23:429–35.
19. Potti A, Dressman HK, Bild A, Riedel RF, Chan G, Sayer R, et al. Genomic signatures to guide the use of chemotherapeutics. Nat Med 2006; 12:1294–300.
20. Efron B, Tibshirani R, Storey JD, Tusher V. Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc 2001; 96:1151–60.
21. Dudoit S, Yang YH, Callow MJ, Speed TP. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat Sin 2002; 12:111–39.
22. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004; 3: Article 3.
23. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. PNAS 2001; 98:5116–21.
24. Hu J, Wright FA. Assessing differential gene expression with small sample sizes in oligonucleotide arrays using a mean-variance model. Biometrics 2007; 63:41–9.
25. Lonnstedt I, Speed T. Replicated microarray data. Stat Sin 2002; 12:31–46.
26. Newton MA, Kendziorski CM, Richmond CS, Blattner FR. On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. J Comput Biol 2001; 8:37.
27. Newton MA, Noueiry A, Sarkar D, Ahlquist P. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics 2004; 5:155–76.
28. Wolfinger RD, Gibson G, Wolfinger ED, Bennett L, Hamadeh H, Bushel P, et al. Assessing gene significance from cDNA microarray expression data via mixed models. J Comput Biol 2001; 8:625–37.
29. Wang S, Ethier S. A generalized likelihood ratio test to identify differentially expressed genes from microarray data. Bioinformatics 2004; 20:100–4.
30. Jain N, Thatte J, Braciale T, Ley K, O'Connell M, Lee JK. Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays. Bioinformatics 2003; 19:1945–51.
31. Bishop CM. Pattern Recognition and Machine Learning. New York: Springer, 2006.
32. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. New York: Springer-Verlag, 2001.
33. Michie D, Spiegelhalter D, Taylor C, eds. Machine Learning, Neural and Statistical Classification. New York: Ellis Horwood, 1994.
34. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 1999; 286:531–7.
35. Brown MPS, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, et al. Knowledge-based analysis of microarray gene expression data by using support vector machines. PNAS 2000; 97:262–7.
483
36. Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000; 16:906–14. 37. Mukherjee S, Tamayo P, Slonim D, Verri A, Golub T, Mesirov JP, Poggio T. Support vector machine classification of microarray data. MIT 1998. 38. West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, et al. Prediction the clinical status of human breast cancer by using gene expression profiles. PNAS 2001; 98. 39. Nguyen DV, Rocke DM. Tumor classification by partial least squares using microarray gene expression data. Bioinformatics 2002; 18:39–50. 40. Li L, Weinberg CR, Darden TA, Pedersen LG. Gene selection for sample classification based on gene expression data: Study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics 2001; 17:1131–42. 41. Culhane AC, Perriere G, Considine EC, Cotter TG, Higgins DG. Between-group analysis of microarray data. Bioinformatics 2002; 18:1600–8. 42. Vapnik V. The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995. 43. Lin Y. Support vector machines and the Bayes rule in classification. Data Min Knowl Discov 2002:259–75. 44. Burges CJC. A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 1998; 2:121–67. 45. Bennett KP, Campbell C. Support Vector Machines: Hype or Hallelujah? SIGKDD Explorations 2000; 2:1–13. 46. Wahba G. Spline Models for Observational Data. Vol. Philadelphia: SIAM, 1990; 2. 47. Bradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn 1997; 30:1145–59. 48. Hand DJ. Construction and Assessment of Classification Rules. 1st edn. Chichester: John Wiley and Sons, 1997. 49. Soukup M, Lee JK. Developing optimal prediction models for cancer classification using gene expression data. J Bioinform Comput Biol 2004; 1:681–94. 50. Soukup M, Cho H, Lee JK. 
Robust classification modeling on microarray data using misclassification penalized posterior. Bioinformatics 2005; 21:i423–i30. 51. Chang JC, Wooten EC, Tsimelzon A, Hilsenbeck SG, Gutierrez MC, Elledge R, et al.
484
52.
53. 54. 55.
56.
57.
58.
59.
60. 61. 62. 63. 64.
65.
Cheng, Cho, and Lee Gene expression profiling for the prediction of therapeutic response to docetaxel in patients with breast cancer. Lancet 2003; 362:362–9. Ma XJ, Wang Z, Ryan PD, Isakoff SJ, Barmettler A, Fuller A, et al. A two-gene expression ratio predicts clinical outcome in breast cancer patients treated with tamoxifen. Cancer Cell 2004; 5:607–16. Gant TW. Application of toxicogenomics in drug development. Drug News Perspect 2003; 16:217–21. Mendrick DL. Genomic and genetic biomarkers of toxicity. Toxicology 2008; 245:175–81. Lord PG, Nie A, McMillian M. Application of genomics in preclinical drug safety evaluation. Basic Clin Pharmacol Toxicol 2006; 98:537–46. Lettieri T. Recent applications of DNA microarray technology to toxicology and ecotoxicology. Environ Health Perspect 2006; 114:4–9. Ju Z, Wells MC, Walter RB. DNA microarray technology in toxicogenomics of aquatic models: methods and applications. Comp Biochem Physiol C Toxicol Pharmacol 2007; 145:5–14. Martin MT, Brennan RJ, Hu W, Ayanoglu E, Lau C, Ren H, et al. Toxicogenomic study of triazole fungicides and perfluoroalkyl acids in rat livers predicts toxicity and categorizes chemicals based on mechanisms of toxicity. Toxicol Sci 2007; 97:595–613. Lynch T, Price A. The effect of cytochrome P450 metabolism on drug response, interactions, and adverse effects. Am Fam Physician 2007; 76:391–6. Michalets EL. Update: Clinically significant cytochrome P-450 drug interactions. Pharmacotherapy 1998; 18:84–112. Navarro VJ, Senior JR. Drug-related hepatotoxicity. N Engl J Med 2006; 354:731–9. Zimmerman HJ. Drug-induced liver disease. Drugs 1978; 16:25–45. Mumoli N, Cei M, Cosimi A. Drug-related hepatotoxicity. N Engl J Med 2006; 354: 2191–3; author reply-3. Zidek N, Hellmann J, Kramer PJ, Hewitt PG. Acute hepatotoxicity: A predictive model based on focused illumina microarrays. Toxicol Sci 2007; 99:289–302. Spicker JS, Brunak S, Frederiksen KS, Toft H. 
Integration of clinical chemistry, expression, and metabolite data leads to better toxicological class separation. Toxicol Sci 2008; 102:444–54.
66. Fielden MR, Brennan R, Gollub J. A gene expression biomarker provides early prediction and mechanistic assessment of hepatic tumor induction by nongenotoxic chemicals. Toxicol Sci 2007; 99:90–100. 67. Uehara T, Hirode M, Ono A, Kiyosawa N, Omura K, Shimizu T, et al. A toxicogenomics approach for early assessment of potential non-genotoxic hepatocarcinogenicity of chemicals in rats. Toxicology 2008; 250:15–26. 68. Natsoulis G, El Ghaoui L, Lanckriet GR, Tolley AM, Leroy F, Dunlea S, et al. Classification of a large microarray data set: Algorithm comparison and analysis of drug signatures. Genome Res 2005; 15:724–36. 69. Bentzen SM. Preventing or reducing late side effects of radiation therapy: radiobiology meets molecular pathology. Nat Rev Cancer 2006; 6:702–13. 70. Peeters ST, Heemsbergen WD, van Putten WL, Slot A, Tabak H, Mens JW, et al. Acute and late complications after radiotherapy for prostate cancer: results of a multicenter randomized trial comparing 68 Gy to 78 Gy. Int J Radiat Oncol Biol Phys 2005; 61: 1019–34. 71. Kruse JJ, Stewart FA. Gene expression arrays as a tool to unravel mechanisms of normal tissue radiation injury and prediction of response. World J Gastroenterol 2007; 13:2669–74. 72. Chon BH, Loeffler JS. The effect of nonmalignant systemic disease on tolerance to radiation therapy. Oncologist 2002; 7: 136–43. 73. Rodningen OK, Overgaard J, Alsner J, Hastie T, Borresen-Dale AL. Microarray analysis of the transcriptional response to single or multiple doses of ionizing radiation in human subcutaneous fibroblasts. Radiother Oncol 2005; 77:231–40. 74. Andreassen CN. Can risk of radiotherapyinduced normal tissue complications be predicted from genetic profiles? Acta Oncol 2005; 44:801–15. 75. Rieger KE, Hong WJ, Tusher VG, Tang J, Tibshirani R, Chu G. Toxicity from radiation therapy associated with abnormal transcriptional responses to DNA damage. Proc Natl Acad Sci USA 2004; 101:6635–40. 76. 
Svensson JP, Stalpers LJ, Esveldt-van Lange RE, Franken NA, Haveman J, Klein B, et al. Analysis of gene expression using gene sets discriminates cancer patients with and without late radiation toxicity.PLoS Med 2006; 3:e422.
Chapter 17
Two-Stage Testing Strategies for Genome-Wide Association Studies in Family-Based Designs
Amy Murphy, Scott T. Weiss, and Christoph Lange
Abstract
The analysis of genome-wide association studies (GWAS) poses statistical hurdles that have to be handled efficiently in order for the study to be successful. The two largest impediments in the analysis phase of the study are the multiple comparisons problem and maintaining robustness against confounding due to population admixture and stratification. For quantitative traits in family-based designs, Van Steen et al. (1) proposed a two-stage testing strategy that can be considered a hybrid approach between family-based and population-based analysis. By including the population-based component in the family-based analysis, the Van Steen algorithm maximizes statistical power while, at the same time, maintaining the original robustness of family-based association tests (FBATs) (2–4). The Van Steen approach consists of two statistically independent steps: a screening step and a testing step. For all genotyped single nucleotide polymorphisms (SNPs), the screening step examines the evidence for association at a population-based level. Based on the support for a potential genetic association from the screening step, the SNPs are prioritized for testing in the next step, where they are analyzed with an FBAT (3). Because the screening step exploits only population-based information that is not utilized in the family-based testing step, the two steps are statistically independent. Therefore, the use of the population-based data for the purposes of screening does not bias the FBAT statistic calculated in the testing step. Depending on the trait type and the ascertainment conditions, Van Steen-type testing strategies can achieve statistical power levels that are comparable to those of population-based studies with the same number of probands.
In this chapter, we review the original Van Steen algorithm and its numerous extensions, and discuss its advantages and disadvantages.
Key words: Family-based association testing (FBAT), transmission disequilibrium test (TDT), genome-wide association studies (GWAS), multiple comparisons.
H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_17, © Springer Science+Business Media, LLC 2010

1. Introduction

Recent improvements in technology and the expanding content of single nucleotide polymorphism (SNP) repositories (5, 6) have
increased the capacity of high-throughput genotyping. As multiple platforms can now genotype more than one million SNPs (7–10), genome-wide association studies have become a commonly used gene-mapping tool for complex diseases and phenotypes. Genome-wide association studies (GWAS) have led to the discovery of novel, robust associations for numerous complex diseases and phenotypes (11–31). While the new findings may be replicated reliably in other studies, the amount of phenotypic variation that is explained by these association discoveries is small in comparison to the estimated total heritability of most diseases/traits. This suggests that current GWAS are not able to identify most of the disease loci. Potential reasons are study heterogeneity/confounding and a lack of sufficient statistical power to address the inherent multiple testing problem in GWAS. For studies with unrelated individuals, multi-stage study designs (32–36) or meta-analysis of several genome-wide association studies (13–40) have been suggested to minimize the impact of the multiple comparisons problem on the statistical power of the study. However, if multi-stage designs or meta-analyses fail to identify SNPs that reach significance at a genome-wide level, it is difficult to determine whether this failure is caused by a lack of statistical power or whether it is attributable to genetic/study heterogeneity. Therefore, an important analysis decision is how to weigh the relative importance of the two issues in order to find the right balance between maximizing statistical power and achieving a sufficient degree of robustness against confounding. For family-based designs, we can choose from a wide range of analysis options.
These options range from traditional family-based association analysis methods (e.g., TDT, PDT, FBAT (31–43)), which are fully robust against confounding but not optimal in terms of power, to the implementation of population-based analysis approaches (44, 45), which achieve higher power levels but require additional adjustments to guard against possible confounding. By design, the family-based approaches are completely protected against the influences of population admixture and model misspecification and can be applied to a large variety of phenotypes, as implemented in the FBAT approach (3). The second class of approaches, which so far is available only for quantitative traits, maximizes the statistical power of family-based designs by applying modified population-based analysis approaches. The quantitative trait is modeled as a function of the genotype, and the relatedness of family members is described by the variance/covariance matrix of the phenotypes. In order to guard the population-based analysis against confounding by population substructure, it has been proposed (46) to apply approaches such as genomic control (47–49) or Eigenstrat (50) to family-based data. However, it is not yet clear whether
the application of such approaches is valid here, since they have been designed to adjust for population substructure in samples of unrelated individuals. Family data constitute the strongest form of population substructure, whose presence could dilute the ability of population-based adjustment approaches to detect the more subtle substructure caused by population admixture. Van Steen-type testing strategies (1, 51–53) are two-stage testing strategies that have been suggested as hybrid approaches combining elements of both the family-based approach and the population-based approach. They retain the two key advantages of the FBAT methodology: 1) robustness against confounding due to population admixture and heterogeneity; and 2) the flexibility of the approach (3) with respect to the choice of the target phenotype. In the first step of the testing strategy, also known as the screening step, the population-based information that is not used by the family-based association test (FBAT) is utilized to assess the evidence for association at a population level. Based on the relative strength of the observed association, SNPs are prioritized (i.e., ranked) for testing in the second step, in which they are tested with a family-based association test (3). The hybrid approaches can achieve power levels that are similar to those of population-based analysis (51). Originally, Van Steen-type testing strategies were developed for the analysis of quantitative traits in unascertained samples (54, 55). Various extensions and generalizations of the approach have been proposed.
Adaptations of the approach have been developed for:
• Time-to-onset data and categorical variables (56)
• Dichotomous traits in ascertained samples (53)
• Arbitrary family structures (52)
• Improved utilization of the information from the screening step using a weighted Bonferroni approach (51)
• Expression data (57)
• Multivariate phenotypes (31)
In this chapter, we will briefly review the basic concepts of family-based association tests (TDTs/FBATs) (2, 3). An introduction to the fundamental ideas of Van Steen-type testing strategies will follow, including an outline of extensions of the approach. Furthermore, we discuss the advantages and disadvantages of such methods in the context of large-scale association studies.
1.1. Family-Based Tests for Genetic Association
The basic approach to family-based association tests is described in full detail elsewhere (3, 4, 58). The construction of the test consists of two separate steps. For the first step, a statistic that is a function of offspring genotypes and phenotypes is specified.
In principle, any function that measures association between phenotype and genotype can be used. A particularly convenient class of statistics is given by score tests using exponential family models (3, 59, 60). In this case the statistic can be defined as S = Σi,j Xij Tij, where i indexes families, j indexes the offspring within families, Tij is an appropriate function of the ijth offspring's phenotype Yij, and Xij is a coding function of his/her genotype. The standard transmission disequilibrium test (TDT)-type statistic sets Tij = 1 if the offspring is affected and 0 otherwise, while Xij counts the number of copies of a particular allele that the offspring carries. The coded phenotype Tij can be arbitrary (e.g., quantitative, dichotomous, time-to-onset). Codings have been suggested for samples including affected and unaffected probands (61), quantitative traits with and without covariates (60, 62, 63), time-to-onset phenotypes (64–66), multivariate phenotypes (69, 70), longitudinal data and repeated measurements (68), and missing values (69, 70). The coding of the genotype, Xij, must be a function that uniquely maps an individual's genotype under a hypothesized genetic model. In principle, any function can be used, either scalar (such as additive, recessive, or dominant) or vector valued (e.g., an indicator variable coding for each genotype). Multiallelic genes can also be handled using vector-valued coding (71). The second step is to compute the distribution of the test statistic, S, under the null. The distribution of S is evaluated by determining all possible outcomes for the offspring genotypes (and the probability of observing each outcome) under the null hypothesis of interest, conditioning on all of the sufficient statistics for any nuisance parameters. In practice, this means conditioning on all phenotypes, as well as on parental genotypes when observed, or on the sufficient statistics for parental genotypes when parental information is missing (58).
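The scalar and vector-valued genotype codings Xij mentioned above can be illustrated in a few lines of code. This is an illustrative sketch only; the function names are ours and not part of any FBAT implementation. Each coding maps the number of copies of the tested allele (0, 1, or 2) to a score under the assumed genetic model:

```python
def additive(g: int) -> int:
    """Additive coding: Xij = number of copies of the tested allele."""
    return g

def dominant(g: int) -> int:
    """Dominant coding: one or two copies score 1."""
    return 1 if g >= 1 else 0

def recessive(g: int) -> int:
    """Recessive coding: only homozygous carriers of the tested allele score 1."""
    return 1 if g == 2 else 0

def genotype_indicator(g: int) -> list:
    """Vector-valued coding: one indicator per genotype class (0, 1, 2 copies)."""
    return [int(g == k) for k in (0, 1, 2)]
```

Under the additive coding with Tij = 1 for affected offspring, the statistic S reduces to a count of tested alleles among affected children, recovering the TDT-type statistic.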
We consider two null hypotheses: (1) H0: no linkage and no association; and (2) H0: linkage but no association. When there is more than one offspring in a family, the two different null hypotheses lead to different distributions for the offspring genotypes. Rabinowitz and Laird (58) provide an exhaustive set of tables that allow one to calculate the genotype distribution of offspring for any type of nuclear family (with genotypes available for any number of siblings and any number of parents) for both types of null hypotheses. Horvath (65) showed how these distributions can be used to calculate the mean and variance of each family's contribution to the test statistic under the null hypothesis of no linkage and no association. This allows us to calculate a large-sample chi-square or standard normal (i.e., Z) statistic for either null hypothesis. In summary, FBAT methodology provides a general framework that can accommodate association testing in many settings, including the use of nuclear families with any number of children, families with or without missing parents, families within pedigrees,
any type of phenotype, and for hypothesis testing of both linkage and association, or association in the presence of linkage. Next, we review how the information used in an FBAT may be partitioned in order to conduct both screening and testing in the same data set.
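For the common special case of trios with both parents genotyped and an additive coding, the FBAT statistic described above can be sketched as follows. This is a minimal illustration, not the FBAT software; the function name is ours. Conditioning on parental genotypes, each parent transmits the tested allele with probability g/2 under the null, which yields E[Xij] and Var(Xij) per family:

```python
import math

def fbat_z(trios):
    """FBAT-type Z statistic for trios (mother_g, father_g, offspring_g, T).

    Genotypes are coded 0/1/2 as copies of the tested allele (additive
    coding); T is the coded phenotype (e.g., 1 = affected, 0 = unaffected).
    """
    u = v = 0.0
    for gm, gf, x, t in trios:
        pm, pf = gm / 2.0, gf / 2.0
        ex = pm + pf                          # E[X | parental genotypes] under H0
        vx = pm * (1 - pm) + pf * (1 - pf)    # Var(X | parental genotypes)
        u += t * (x - ex)                     # family contribution to the numerator
        v += t * t * vx                       # family contribution to the variance
    return u / math.sqrt(v)

# Four hypothetical affected-offspring trios (T = 1)
z = fbat_z([(1, 1, 2, 1), (1, 0, 1, 1), (1, 1, 1, 1), (2, 1, 2, 1)])  # Z ≈ 1.63
```

Only families with at least one heterozygous parent contribute variance; homozygous-parent trios are uninformative, as in the TDT.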
1.2. Background: The Decomposition of Family-Based Data into Independent Components
In family-based designs, testing strategies for genome-wide association studies can address the multiple testing problem within one study in a more powerful way than standard approaches. This is achieved by dividing the data set into two statistically independent partitions, utilizing the special structure of the family data. The two components are defined by two overlapping parts of the data. The first component assesses the evidence for a potential association at a population level, which is examined based on the proband phenotypes, Y, and the parental genotypes or sufficient statistic, S (58). The second component consists of the data that describe the association at the family level when the FBAT approach (3, 58) is used (i.e., the differences between the observed offspring genotypes, X, and the expected offspring genotypes, E(X), conditional on the parental genotypes and the offspring phenotype). The FBAT statistic is a conditional test that treats the offspring genotype, X, as a random variable and conditions on the phenotype, Y, and the sufficient statistic, S. Because of this property, the density of the joint distribution of X, Y, and S can be decomposed into two statistically independent components (4), i.e.,
p(X, Y, S) = p(X|Y, S) p(Y, S).    [1]
The density of the screening step is given by p(Y, S), and the density of the FBAT testing step is given by p(X|Y, S). In the screening step of the Van Steen algorithm, the evidence for a potential association is examined for all genotyped SNPs based on the parental genotypes/sufficient statistic, S, and the offspring phenotype, Y. From the screening step, the most promising SNPs, e.g., in terms of genetic effect size or statistical power, are then selected for the testing step. In this step, the SNPs are tested for association using the FBAT statistic. Because the only random component of the FBAT statistic (the offspring genotype, X) is not incorporated into the screening step, the data can be partitioned into two independent parts. Consequently, Equation [1] implies that the screening step and the FBAT testing step are independent.
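The independence in Equation [1] can be checked empirically with a small simulation (an illustrative sketch under hypothetical simulation settings; the function name and parameter choices are ours, not from the chapter). Trios are simulated under the null, a population-level screening statistic is computed from (Y, S), a family-level FBAT-type statistic from X − E[X|S], and the correlation of the two across replicates should be close to zero:

```python
import math
import numpy as np

def simulate_replicate(rng, n=50):
    """One null replicate of n trios: phenotype independent of genotype."""
    gm = rng.binomial(2, 0.5, n)                       # maternal genotypes (0/1/2)
    gf = rng.binomial(2, 0.5, n)                       # paternal genotypes
    ex = gm / 2.0 + gf / 2.0                           # E[X | parents] under H0
    vx = gm / 2 * (1 - gm / 2) + gf / 2 * (1 - gf / 2) # Var(X | parents)
    x = rng.binomial(1, gm / 2.0) + rng.binomial(1, gf / 2.0)  # offspring genotype
    y = rng.normal(size=n)                             # null phenotype
    yc = y - y.mean()
    # screening statistic: scaled regression of Y on E[X|S] (uses Y and S only)
    e = ex - ex.mean()
    screen = float(e @ yc) / math.sqrt(float(e @ e) * float(yc @ yc) / n + 1e-12)
    # FBAT-type statistic: uses only the random part X - E[X|S]
    fbat = float(yc @ (x - ex)) / math.sqrt(float((yc ** 2) @ vx))
    return screen, fbat

rng = np.random.default_rng(1)
stats = np.array([simulate_replicate(rng) for _ in range(2000)])
corr = np.corrcoef(stats[:, 0], stats[:, 1])[0, 1]     # close to 0
```

Because E[X − E[X|S] | Y, S] = 0, the FBAT-type statistic has mean zero conditional on everything the screening statistic depends on, so the empirical correlation across replicates hovers near zero.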
2. The Screening Technique for Continuous Phenotypes in a Family-Based Design
To address the multiple testing problem in genome-wide association studies using quantitative traits, Van Steen et al. (1) developed a novel testing strategy, which is also available for time-to-onset phenotypes with censoring (56) or affection status in ascertained samples (53). The strategy allows genomic screening and testing using the same data set. As previously noted, for each SNP, the data set is divided into two statistically independent parts, the between-family component and the within-family component (72). Like the general FBAT approach, the algorithm is very flexible. It can be applied to any type of nuclear family, extended pedigrees, and various phenotypes, such as time-to-onset or multivariate phenotypes, as well as repeated measurements or longitudinal data. The analysis strategy is implemented in the PBAT software package (73, 74). In the screening step of the strategy, the conditional power of the FBAT statistic (3) is estimated for each SNP, using the conditional mean model approach (54, 55, 73). A subset of SNPs with the highest power estimates (e.g., the top 10 SNPs by power rank) is selected for the testing step of the strategy. In the second step, the selected SNPs are tested for association using the FBAT statistic. The power estimates based on the conditional mean model do not bias the significance level of any subsequently computed FBAT statistic, because the screening and testing steps are statistically independent (1, 4, 54, 55). The FBAT results for the selected SNPs need to be adjusted only for the number of SNPs that were tested with FBATs in the second step (e.g., if 10 SNPs were selected, the Bonferroni correction for multiple comparisons is α/10). A SNP that reaches significance after this adjustment is also significant at a genome-wide level. Alternatively, one could also select SNPs based on the significance of their genetic effect size estimates in the conditional mean model (54).
For theoretical reasons and based on simulation studies, conditional power estimates are preferable (1). The rationale here is that although sufficiently high statistical power does not imply the existence of an association (since a significant finding at the population level could be due to confounding or population stratification), it is a prerequisite for the ability to test for an association successfully. The screening and testing steps of the algorithm are outlined below. The steps of the screening strategy are applied to all genotyped SNPs. Screening step 1: Specify a plausible linear regression model that functionally relates the offspring phenotype(s) of interest to genotypic information. The coding of the marker genotypes is a reflection of the underlying disease model. In linking trait(s) to
coded genotypes, different selections of covariates along with the test locus can be considered as well. For example, a regression model linking an offspring's phenotype Yij to his/her genotype Xij and a covariate vector Zij is

E(Yij) = a Xij + b Zij.    [2]
Screening step 2: At each marker, replace the observed offspring genotype by its conditional mean given the sufficient statistic Si, as defined in Rabinowitz and Laird (58). Hence, instead of the standard regression model we use the conditional mean model as defined in Lange et al. (54, 55):

E(Yij) = a E[Xij|Si] + b Zij,    [3]

where E[Xij|Si] replaces Xij.
Screening step 3: Estimate the genetic effect size a in the conditional mean model [3] and compute the conditional power of the FBAT statistic (54), given the observed data (i.e., the power of the test statistic is computed conditional on the offspring phenotypes and the parental genotypes, or the minimal sufficient statistics when parental genotypes are missing). Note that estimation of a requires variation in the Yij's. Testing step 1: Select a subset of SNPs with the highest conditional power estimates (e.g., the top 10 SNPs). Testing step 2: Using either Bonferroni correction or FDR (false discovery rate) adjustment, adjust the pre-specified significance level for the number of SNPs selected for testing (e.g., 10 comparisons) and calculate the FBAT statistic for the selected combinations of phenotypes and markers. Although population admixture and stratification may bias the estimate of a and thus affect the power of the proposed testing strategy, the testing step of the screening technique avoids confounding due to model misspecification as well as admixture or population stratification. The final decision on potential marker associations is based on the FBAT statistic, which guards against these confounding factors. One criticism of the Van Steen et al. method (1) is that only a small number of (high-powered) SNPs are selected for the testing stage and formally tested for association. Recently, novel methodology has been developed to address this shortcoming. The procedure developed by Ionita-Laza et al. (51) applies a Bonferroni weighting scheme to the testing stage, such that all genotyped SNPs are tested for association while still preserving the overall alpha level, with the power estimates from the initial screening stage informing the weights of the testing stage.
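The screening and testing steps above can be sketched end to end for a quantitative trait. This is an illustrative simplification, not the PBAT implementation: the conditional mean model is fit by ordinary least squares, the non-centrality-based power approximation is stylized rather than the exact PBAT formula, and all names and simulation settings are ours:

```python
import math
from statistics import NormalDist
import numpy as np

def van_steen_sketch(y, x, ex, vx, alpha=0.05, top_k=10):
    """Two-stage screen-then-test sketch for a quantitative trait.

    y  : (n,) offspring phenotypes
    x  : (n, m) observed offspring genotypes (additive coding)
    ex : (n, m) E[X|S], conditional mean given the sufficient statistic
    vx : (n, m) Var(X|S)
    """
    nd = NormalDist()
    n, m = x.shape
    top_k = min(top_k, m)
    zcrit = nd.inv_cdf(1 - alpha / (2 * top_k))  # two-sided test at alpha/top_k
    yc = y - y.mean()
    power = np.zeros(m)
    for j in range(m):                           # screening step: per-SNP power
        e = ex[:, j] - ex[:, j].mean()
        sxx = float(e @ e)
        if sxx <= 0:
            continue
        a = float(e @ yc) / sxx                  # effect estimate, conditional mean model
        resid = yc - a * e
        sigma2 = float(resid @ resid) / max(n - 2, 1)
        lam = abs(a) * math.sqrt(sxx / sigma2)   # stylized non-centrality
        power[j] = 1 - nd.cdf(zcrit - lam) + nd.cdf(-zcrit - lam)
    top = np.argsort(-power)[:top_k]             # testing step: only the top_k SNPs
    z = {}
    for j in top:
        d = x[:, j] - ex[:, j]                   # the only random part under H0
        z[int(j)] = float(yc @ d) / math.sqrt(float((yc ** 2) @ vx[:, j]))
    return list(map(int, top)), z, zcrit

# Hypothetical simulated trios: SNP 0 carries a true effect, SNPs 1-4 are null
rng = np.random.default_rng(0)
n, m = 300, 5
gm = rng.binomial(2, 0.5, (n, m))                # parental genotypes
gf = rng.binomial(2, 0.5, (n, m))
ex = gm / 2.0 + gf / 2.0                         # E[X|S] under Mendelian transmission
vx = gm / 2 * (1 - gm / 2) + gf / 2 * (1 - gf / 2)
x = rng.binomial(1, gm / 2.0) + rng.binomial(1, gf / 2.0)
y = 0.8 * x[:, 0] + rng.normal(size=n)
top, z, zcrit = van_steen_sketch(y, x, ex, vx, top_k=2)
```

Only the top_k FBAT statistics are compared against the α/top_k threshold; the screening power estimates themselves never enter the test, which is what keeps the overall significance level intact.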
3. Software Implementation

The Van Steen algorithm has been implemented in the PBAT software package (73, 74). PBAT is an integrated package that covers all aspects of a family-based study, including prospective power calculations during the design phase, tools for the analysis of complex phenotypes (quantitative traits, time-to-onset data, multivariate data, repeated measurements, longitudinal data, with or without covariate adjustment), and testing strategies for handling the multiple comparisons problem. The PBAT package has also been fully integrated into the R software environment as a library (75). All PBAT functions are available in R through a user-friendly GUI, P2BAT, and GWA analyses can be run in parallel on clusters. Further, Golden Helix, Inc. (www.goldenhelix.com) offers a commercial version of PBAT with full user support, including professional documentation and debugging.
4. Discussion

The Van Steen algorithm (1, 51, 53, 76) was originally developed to be applied to one of the first genome-wide association studies, a 100K scan of 1400 members of the family sample of the Framingham Heart Study. The approach detected a SNP that achieved genome-wide significance and identified a novel candidate gene for obesity (12), INSIG2. Subsequently, it was possible to replicate the association with the same SNP in four independent studies, including one African-American cohort (12). Additional replication attempts resulted in both clear replications and clear non-replications (77–83). Although a false-positive finding cannot be ruled out at this stage, the clear replications in large-scale population studies without an ascertainment condition (i.e., KORA, DeCode) provide strong support for this being a real finding. Since its inception, the Van Steen approach has been successfully applied to GWAS for Alzheimer's disease (31) and attention deficit hyperactivity disorder (27, 30). The fact that many non-replications stem from replication attempts in control populations in which the subjects have diseases other than the one of interest for the particular GWAS suggests heterogeneity between the different studies. Study heterogeneity is a common issue in GWAS and has been observed in other studies as well (29, 30). In the presence of study heterogeneity, testing strategies that establish genome-wide significance within one study have distinct advantages. Replication attempts in additional studies can serve to generalize the established associations and assess heterogeneity
between study populations. While it may be more time consuming and expensive to recruit families for genetic studies, the power of a family-based analysis with trios is similar to that of a case–control analysis (4), and the Van Steen method (1) allows studies with family-based designs to harness the power of a population-based analysis while preserving the best features of family-based studies.

References
1. Van Steen, K., McQueen, M., Herbert, A., Raby, B., Lyon, H., DeMeo, D., Murphy, A., Su, J., Datta, S., Rosenow, C., et al. (2005). Genomic screening and replication using the same data set in family-based association testing. Nature Genetics, 37, 683–691. 2. Spielman, R., McGinnis, R., and Ewens, W. (1993). Transmission test for linkage disequilibrium: The insulin gene region and insulin-dependent diabetes mellitus (IDDM). American Journal of Human Genetics, 52, 506–516. 3. Laird, N., Horvath, S., and Xu, X. (2000). Implementing a unified approach to family-based tests of association. Genetic Epidemiology, 19, S36. 4. Laird, N. and Lange, C. (2006). Family-based designs in the age of large-scale gene-association studies. Nature Reviews Genetics, 7(5), 385–394. 5. The International HapMap Consortium. (2005). A haplotype map of the human genome. Nature, 437, 1299–1320. 6. The International HapMap Consortium. (2007). A second generation human haplotype map of over 3.1 million SNPs. Nature, 449, 851–861. 7. Matsuzaki, H., Dong, S., Loi, H., Di, X., Liu, G., Hubbell, E., Law, J., Berntsen, T., Chadha, M., Hui, H., et al. (2004). Genotyping over 100,000 SNPs on a pair of oligonucleotide arrays. Nature Methods, 1, 109–111. 8. Di, X., Matsuzaki, H., Webster, T. A., Hubbell, E., Liu, G., Dong, S., Bartell, D., Huang, J., Chiles, R., Yang, G., et al. (2005). Dynamic model based algorithms for screening and genotyping over 100K SNPs on oligonucleotide microarrays. Bioinformatics, 21, 1958–1963. 9.
Gunderson, K., Kuhn, K., Steemers, F., Ng, P., Murray, S., and Shen, R. (2006). Whole-genome genotyping of haplotype tag single nucleotide polymorphisms. Pharmacogenomics, 7, 641–648. 10. Wadman, M. (2006). The chips are down. Nature, 444, 256–257.
11. Klein, R. J., Zeiss, C., Chew, E. Y., Tsai, J. Y., Sackler, R. S., Haynes, C., Henning, A. K., Sangiovanni, J. P., Mane, S. M., Mayne, S. T., et al. (2005). Complement factor H polymorphism in age-related macular degeneration. Science, 308, 385–389. 12. Herbert, A., Gerry, N., McQueen, M., Heid, I., Pfeufer, A., Illig, T., Wichmann, H.-E., Meitinger, T., Hunter, D., Hu, F., et al. (2006). Genetic variation near INSIG2 is a common determinant of obesity in Western Europeans and African Americans. Science, 312, 279–283. 13. Zeggini, E., Weedon, M. N., Lindgren, C. M., Frayling, T. M., Elliott, K. S., Lango, H., Timpson, N. J., Perry, J. R., Rayner, N. W., Freathy, R. M., et al. (2007). Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. Science, 316, 1336–1341. 14. Wellcome Trust Case Control Consortium. (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447, 661–678. 15. Easton, D. F., Pooley, K. A., Dunning, A. M., Pharoah, P. D., Thompson, D., Ballinger, D. G., Struewing, J. P., Morrison, J., Field, H., Luben, R., et al. (2007). Genome-wide association study identifies novel breast cancer susceptibility loci. Nature, 447, 1087–1093. 16. Buch, S., Schafmayer, C., Volzke, H., Becker, C., Franke, A., von Eller-Eberstein, H., Kluck, C., Bassmann, I., Brosch, M., Lammert, F., et al. (2007). A genome-wide association scan identifies the hepatic cholesterol transporter ABCG8 as a susceptibility factor for human gallstone disease. Nature Genetics, 39, 995–999. 17. Bierut, L. J., Madden, P. A., Breslau, N., Johnson, E. O., Hatsukami, D., Pomerleau, O. F., Swan, G. E., Rutter, J., Bertelsen, S., Fox, L., et al. (2007). Novel genes identified in a high-density genome wide association study for nicotine dependence. Human Molecular Genetics, 16, 24–35. 18. Zanke, B. W., Greenwood, C. M., Rangrej, J., Kustra, R., Tenesa, A., Farrington, S.
M., Prendergast, J., Olschwang, S., Chiang,
494
19.
20.
21.
22.
23.
24.
25.
26.
27.
Murphy, Weiss, and Lange T., Crowdy, E., et al. (2007). Genome-wide association scan identifies a colorectal cancer susceptibility locus on chromosome 8q24. Nature Genetics, 39, 989–994. Yeager, M., Orr, N., Hayes, R. B., Jacobs, K. B., Kraft, P., Wacholder, S., Minichiello, M. J., Fearnhead, P., Yu, K., Chatterjee, N., et al. (2007). Genome-wide association study of prostate cancer identifies a second risk locus at 8q24. Nature Genetics, 39, 645–649. Winkelmann, J., Schormair, B., Lichtner, P., Ripke, S., Xiong, L., Jalilzadeh, S., Fulda, S., Putz, B., Eckstein, G., Hauk, S., et al. (2007). Genome-wide association study of restless legs syndrome identifies common variants in three genomic regions. Nature Genetics, 39, 1000–1006. Sladek, R., Rocheleau, G., Rung, J., Dina, C., Shen, L., Serre, D., Boutin, P., Vincent, D., Belisle, A., Hadjadj, S., et al. (2007). A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature, 445, 881–885. Frayling, T., Timpson, N., Weedon, M., Zeggini, E., Freathy, R., Lindgren, C., Perry, J., Elliott, K., Lango, H., Rayner, N., et al. (2007). A Common Variant in the FTO Gene Is Associated with Body Mass Index and Predisposes to Childhood and Adult Obesity. Science, 316, 889. Saxena, R., Voight, B., Lyssenko, V., Burtt, N., de Bakker, P., Chen, H., Roix, J., Kathiresan, S., Hirschhorn, J., Daly, M., et al. (2007). Genome-Wide Association Analysis Identifies Loci for Type 2 Diabetes and Triglyceride Levels. Science, 316, 1331–1336. Scott, L., Mohlke, K., Bonnycastle, L., Willer, C., Li, Y., Duren, W., Erdos, M., Stringham, H., Chines, P., Jackson, A., et al. (2007). A Genome-Wide Association Study of Type 2 Diabetes in Finns Detects Multiple Susceptibility Variants. Science, 316, 1341. Lettre, G., Jackson, A., Gieger, C., Schumacher, F., Berndt, S., Sanna, S., Eyheramendy, S., Voight, B., Butler, J., Guiducci, C., et al. 
Identification of ten loci associated with height highlights new biological pathways in human growth. Nature, 200, 8. Neale, B., Lasky-Su, J., Anney, R., Franke, B., Zhou, K., Maller, J., Vasquez, A., Asherson, P., Chen, W., Banaschewski, T., et al. (2008). Genome-wide association scan of attention deficit hyperactivity disorder. American Journal Medical Genetics B Neuropsychiatric Genetics, 147, 1377–1344. Lasky-Su, J., Anney, R., Neale, B., Franke, B., Zhou, K., Maller, J., Vasquez, A., Chen, W., Asherson, P., Buitelaar, J., et al. (2008).
28.
29.
30.
31.
32.
33.
34.
35.
36.
37.
Genome-wide association scan of the time to onset of attention deficit hyperactivity disorder. American Journal Medical Genetics B Neuropsychiatric Genetics, 147, 1355–1358. Kathiresan, S., Willer, C., Peloso, G., Demissie, S., Musunuru, K., Schadt, E., Kaplan, L., Bennett, D., Li, Y., Tanaka, T., et al. (2009). Common variants at 30 loci contribute to polygenic dyslipidemia. Nature Genetics, 41, 56–65. Lasky-Su, J., Lyon, H., Emilsson, V., Heid, I., Molony, C., Raby, B., Lazarus, R., Klanderman, B., Soto-Quiros, M., Avila, L., et al. (2008). On the Replication of Genetic Associations: Timing Can Be Everything! The American Journal of Human Genetics, 82, 849–858. Lasky-Su, J., Neale, B., Franke, B., Anney, R., Zhou, K., Maller, J., Vasquez, A., Chen, W., Asherson, P., Buitelaar, J., et al. (2008). Genome-wide association scan of quantitative traits for attention deficit hyperactivity disorder identifies novel associations and confirms candidate gene associations. American Journal Medical Genetics B Neuropsychiatric Genetics, 147, 1345–1354. Bertram, L., Lange, C., Mullin, K., Parkinson, M., Hsiao, M., Hogan, M., Schjeide, B., Hooli, B., DiVito, J., Ionita, I., et al. (2008). Genome-wide Association Analysis Reveals Putative Alzheimer’s Disease Susceptibility Loci in Addition to APOE. American Journal of Human Genetics, 83, 623–632. Satagopan, J. and Elston, R. (2003). Optimal two-stage genotyping in population-based association studies. Genetic Epidemiology, 25, 149–157. Satagopan, J., Venkatraman, E., and Begg, C. (2004). Two-stage designs for gene-disease association studies with sample size contraints. Biometrics, 60, 589–597. Satagopan, J., Verbel, D., Venkatraman, E., Offit, K., and Begg, C. (2004). Two-stage designs for gene-disease association studies. Biometrics, 58, 163–170. Thomas, D., Xie, R., and Gebregziabher, M. (2004). Two-stage sampling designs for gene association studies. Genetic Epidemiology, 27, 401–414. Hirschhorn, J. and Daly, M. (2005). 
Genome-wide association studies for common diseases and complex traits. Nature Review Genetics, 6, 95–108. Evangelou, E., Maraganore, D., and Ioannidis, J. (2007). Meta-analysis in genome-wide association datasets: Strategies and application in parkinson disease. PLoS ONE, 2, e196.
Genome-Wide Association Studies in Family-Based Designs 38. Ioannidis, J. P., Patsopoulos, N. A., and Evangelou, E. (2007). Heterogeneity in meta-analyses of genome-wide association investigations. PLoS ONE, 2, e841. 39. Scott, L. J., Mohlke, K. L., Bonnycastle, L. L., Willer, C. J., Li, Y., Duren, W. L., Erdos, M. R., Stringham, H. M., Chines, P. S., Jackson, A. U., et al. (2007). A genome-wide association study of type 2 diabetes in finns detects multiple susceptibility variants. Science, 316, 1341–1345. 40. Saxena, R., Voight, B. F., Lyssenko, V., Burtt, N. P., de Bakker, P. I., Chen, H., Roix, J. J., Kathiresan, S., Hirschhorn, J. N., Daly, M. J., et al. (2007). Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science, 316, 1331–1336. 41. Spielman, R. and Ewens, W. (1998). A sibship test for linkage in the presence of association. American Journal of Human Genetics, 62, 450–458. 42. Martin, E., Bass, M., and Kaplan, N. (2001). Correcting for a potential bias in the pedigree disequilibrium test. American Journal of Human Genetics, 68, 1065–1067. 43. Monks, S. and Kaplan, N. (2000). Removing the sampling restrictions from family-based tests of association for a quantitative-trait locus. American Journal Human Genetics, 66, 576–592. 44. Chen, W. and Abecasis, G. (2007). Familybased association tests for genomewide association scans. American Journal of Human Genetics, 81, 913–926. 45. Aulchenko, Y., de Koning, D., and Haley, C. (2007). Genomewide rapid association using mixed model and regression: A fast and simple method for genomewide pedigreebased quantitative trait loci association analysis. Genetics, 177, 577. 46. Macgregor, S. (2008). Optimal two-stage testing for family-based genome-wide association studies. American Journal of Human Genetics, 82, 797–799. 47. Devlin, B. and Roeder, K. (1999). Genomic control for association studies. Biometrics, 55, 997–1004. 48. Bacanu, S., Devlin, B., and Roeder, K. (2000). 
The power of genomic control. American Journal of Human Genetics, 66, 1933–1944. 49. Devlin, B., Roeder, K., and Wasserman, L. (2001). Genomic control, a new approach to genetic-based association studies. Theoretical Population Biology, 60, 155–166. 50. Price, A., Patterson, N., Plenge, R., Weinblatt, M., Shadick, N., and Reich, D. (2006). Principal components analysis cor-
51.
52.
53.
54.
55.
56.
57.
58.
59.
60.
61.
495
rects for stratification in genome-wide association studies. Nature Genetics, 38, 904–909. Ionita-Laza, I., McQueen, M., Laird, N., and Lange, C. (2007). Genomewide weighted hypothesis testing in family-based association studies, with an application to a 100 k scan. American Journal of Human Genetics, 81, 607–14. Feng, T., Zhang, S., and Sha, Q. (2007). Two-stage association tests for genome-wide association studies based on family data with arbitrary family structure. European Journal of Human Genetics, 15, 1169–1175. Murphy, A., Weiss, S., and Lange, C. (2008). Screening and replication using the same data set: Testing strategies for family-based studies in which All probands are affected. PLoS Genetics, 41(9), e1000197 Lange, C., DeMeo, D., Silverman, E., Weiss, S., and Laird, N. (2003). Using the noninformative families in family-based association tests: A powerful new testing strategy. American Journal of Human Genetics, 79, 801– 811. Lange, C., Lyon, H., DeMeo, D., Raby, B., Silverman, E., and Weiss, S. (2003). A new powerful non-parametric two-stage approach for testing multiple phenotypes in familybased association studies. Human Heredity, 56, 10–17. Jiang, H., Harrington, D., Raby, B., Bertram, L., Blacker, D., Weiss, S., and C., L. (2006). Family-based association test for time-toonset data with time-dependent differences between the hazard functions. Genetic Epidemiology, 30(2), 124–132. Degnan, J., Lasky-Su, J., Raby, B., Xu, M., Molony, C., Schadt, E., and Lange, C. (2008). Genomics and genome-wide association studies: An integrative approach to expression QTL mapping. Genomics, 92, 129–133. Rabinowitz, D. and Laird, N. (2000). A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information. Humman Heredity, 50, 211–223. Clayton, D. and Jones, H. (1999). Transmission/disequilibrium tests for extended marker haplotypes. 
American Journal of Human Genetics, 65, 1161–1169. Lunetta, K., Faraone, S., Biederman, J., and Laird, N. (2000). Family-based tests of association and linkage that use unaffected sibs, covariates, and interactions. American Journal of Human Genetics, 66, 605–614. Whittaker, J. and Lewis, C. (1998). Power comparisons of the transmission/
496
62.
63.
64.
65.
66.
67.
68.
69.
70.
71.
Murphy, Weiss, and Lange disequilibrium test and sibtransmission/ disequilibrium-test statistics. American Journal of Human Genetics, 65, 578–580. Lange, C., DeMeo, D., and Laird, N. (2002). Power and design considerations for a general class of family-based association tests: Quantitative traits. American Journal of Human Genetics, 71, 1330–1341. Lange, C. and Laird, N. (2002). On a general class of conditional tests for family-based association studies in genetics: the asymptotic distribution, the conditional power and optimality considerations. Genetic Epidemiology, 23, 165–180. Mokliatchouk, O., Blacker, D., and Rabinowitz, D. (2001). Association tests for traits with variable age at onset. Human Heredity, 51, 46–53. Horvath, S., Xu, X., and Laird, N. (2001). The family based association test method: strategies for studying general genotypephenotype associations. European Journal of Human Genetics, 9, 301–306. Lange, C., Blacker, D., and Laird, N. (2004). Family-based association tests for survival and times-to-onset analysis. Statistics in Medicine, 23, 179–189. Lange, C., Silverman, E., Xu, X., Weiss, S., and Laird, N. (2003a). A multivariate familybased association test using generalized estimating equations: {FBAT-GEE}. Biostatistics, 4, 195–206. Lange, C., Van Steen, K., Andrew, T., Lyon, H., DeMeo, D., Murphy, A., Silverman, E., A, M., Weiss, S., and Laird, N. (2004). A family-based association test for repeatedly measured quantitative traits adjusting for unknown environmental and/or polygenic effects. Statistical Applications in Genetics and Molecular Biology: Vol. 3: No. 1, Article 17. http://www. bepress.com/sagmb/vol3/iss1/art17. Murphy, A., Blacker, D., and Lange, C. (2004). Imputing missing phenotypes: A new fbat-statistic. Statistical Modelling, 4, 96–100. Murphy, A., Van Steen, K., and Lange, C. (2004). 
On missing phenotype data in multivariate family based association tests: imputation strategies based on the em-algorithm, the da-algorithm and the conditional mean model. Far East Journal of Theoretical Statistics, 13, 175–188. Schaid, D. and Sommer, S. (1994). Comparison of statistics for candidate-gene association studies using cases and parents. American Journal of Human Genetics, 55, 402–409.
72. Fulker, D., Cherny, S., Sham, P., and Hewit, J. (1999). Combined linkage and association sib-pair analysis for quantitative traits. Encyclopedia of Human Genetics and Genetic Epidemiology, 64, 259–267. 73. Lange, C., DeMeo, D., Silverman, E., Weiss, S., and Laird, N. (2004). PBAT: tools for family-based association studies. American Journal of Human Genetics, 74, 367–369. 74. Van Steen, K. and Lange, C. (2005). PBAT: a comprehensive software package for genome-wide association analysis of complex family based studies. Human Genomics, 2, 67–69. 75. Hoffmann, T. and Lange, C. (2006). P2BAT: a massive parallel implementation of pbat for genome-wide association studies in R. Bioinformatics., 22(24), 3103–3105. 76. McQueen, M., Weiss, S., Laird, N., and Lange, C. (2007). On the parsing of statistical information in family-based association testing. Nature Genetics, 39, 281–282. 77. Rosskopf, D., Bornhorst, A., Rimmbach, C., Schwahn, C., Kayser, A., Kruger, A., Tessmann, G., Geissler, I., Kroemer, H., and Volzke, H. (2007). Comment on “a common genetic variant is associated with adult and childhood obesity”. Science, 315, 187. 78. Hall, D., Rahman, T., Avery, P., and Keavney, B. (2006). INSIG-2 promoter polymorphism and obesity related phenotypes: association study in 1428 members of 248 families. BMC Medical Genetics, 7, 83. 79. Dina, C., Meyre, D., Samson, C., Tichet, J., Marre, M., Jouret, B., Charles, M., Balkau, B., and Froguel, P. (2007). Comment on “a common genetic variant is associated with adult and childhood obesity”. Science, 315, 187. 80. Loos, R., Barroso, I., O’Rahilly, S., and Wareham, N. (2007). Comment on “a common genetic variant is associated with adult and childhood obesity”. Science, 315, 187. 81. Lyon, H., Emilsson, V., Hinney, A., Heid, I., Lasky-Su, J., Zhu, X., Thorleifsson, G., Gunnarsdottir, S., Walters, G., Thorsteinsdottir, U., et al. (2007). 
The association of a SNP upstream of INSIG2 with body mass index is reproduced in several but not all cohorts. PLoS Genetics, 3, e61. 82. Smith, A., Cooper, J., Li, L., and Humphries, S. (2007). INSIG2 gene polymorphism is not associated with obesity in caucasian, afrocaribbean and indian subjects. International Journal of Obesity, 31, 1753–1755. 83. Kumar, J., Sunkishala, R., Karthikeyan, G., and Sengupta, S. (2007). The common genetic variant upstream of INSIG2 gene is not associated with obesity in indian population. Clinical Genetics, 71, 415–418.
Chapter 18

Statistical Methods for Proteomics

Klaus Jung

Abstract

During the last decade, analytical methods for the detection and quantification of proteins and peptides in biological samples have been considerably improved. It is therefore now possible to compare simultaneously the expression levels of hundreds or thousands of proteins in different types of tissue, for example, normal and cancerous, or in different cell lines. In this chapter, we illustrate statistical designs for such proteomics experiments as well as methods for the analysis of resulting data. In particular, we focus on the preprocessing and analysis of protein expression levels recorded by the use of either two-dimensional gel electrophoresis or mass spectrometry.

Key words: Protein expression, data preprocessing, differential proteome analysis, disease classification, two-dimensional gel electrophoresis, mass spectrometry.
H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_18, © Springer Science+Business Media, LLC 2010

1. Introduction

The sequencing of the human genome and the genomes of other organisms, as well as the analysis of gene expression data, has already provided deeper insights into biological systems. However, direct inference from the RNA level to the protein level is generally not possible, because not every expressed gene is translated into only one single protein. In addition, a protein can have several isoforms. Hence the proteome (defined as the totality of expressed proteins in an organism, a tissue or a cell at a certain point in time) is considerably more complex than the genome: while the size of the human genome is estimated to lie between 20,000 and 25,000 genes, the size of the human proteome is estimated to lie between 500,000 and 1,000,000 proteins. Therefore, the analysis of protein regulation has gained more importance during the
last decade. In comparison to gene expression data recorded by DNA microarrays, data sets of protein expression are even more complex. On the one hand, there exist many different laboratory technologies for recording protein expression levels, e.g., different types of two-dimensional gel electrophoresis (2-DE) and of mass spectrometry (MS). On the other hand, there is a higher uncertainty with regard to what is detected by these analytical methods. While a probe set on a DNA microarray is usually associated with a certain gene, molecules detected by 2-DE and MS have to be identified by comparing their observed mass profiles with theoretical mass profiles listed in databases (1). Gene and protein expression data are quite similar, and the problems studied in the genomics and proteomics areas are nearly the same. In fact, many statistical techniques applied in the analysis of gene expression data, for example, multiple hypothesis testing, cluster analysis and classification methods, also play an important role in the analysis of protein expression levels (2). The major difference in data analysis between the two areas, however, lies in the preprocessing of the raw measurements. Therefore, the generation of protein expression levels from raw gel images and raw mass spectra is also covered in this chapter. The most frequently used technologies in proteomics are 2-DE (3), its improved version 2-D Difference Gel Electrophoresis (DIGE) (4), and MS (5). In this chapter we illustrate a selection of the most frequent designs of experiments with these technologies and the statistical analysis of the resulting data structures. The next section details two-group comparisons with 2-DE and 2-D DIGE. Afterwards, two-group comparisons as well as classification algorithms for MS data are presented. At the end of the chapter, we give a short outlook on other statistical methods for proteomics experiments.
2. Two-Dimensional Gel Electrophoresis
2-DE is an analytical method that separates the proteins in a biological sample by their isoelectric point (first dimension) and their size (second dimension). The isoelectric point is related to the pH value of a molecule and thus to its basicity and acidity. Labelled either by a silver staining or by a fluorescent dye before 2-DE, the proteins produce individual spots on the 2-D gel, where the spot size or fluorescence is taken as a measure of their abundance (6). In the following experimental designs, we distinguish between approaches where only one sample is applied to a gel and those where two or three samples are applied to the
same gel. The second approach is called the 2-D DIGE method; here, up to three samples, labelled with three different dyes, can be applied to the same gel, thus reducing both the number of necessary gels and the technical variance.

2.1. Experimental Designs
The most frequent problems studied by the use of 2-DE are two group comparisons, the unpaired case and the paired one. In the unpaired case, proteome samples from two distinct groups of patients or cell lines are compared, for example, healthy versus diseased or disease state one versus disease state two or treated versus untreated. We will consider this setup as design A. The common goal of such studies is to find proteins that are significantly up- or downregulated in the samples of either group. If one-sample gels are used, n = n1 + n2 gels are prepared, where the first n1 samples are independent biological replications from the one group and the last n2 samples are from the other group (Fig. 18.1a). Instead, if 2-D DIGE gels are used, an internal standard is additionally put onto each gel, which is a mixture of all n samples (Fig. 18.1b).

Fig. 18.1. Examples of experimental designs for proteome experiments with two-dimensional gel electrophoresis. (a) Comparison of two groups with one-sample gels. (b) Comparison of independent groups with 2-D DIGE gels. (c) Comparison of dependent groups with 2-D DIGE gels.
In the case of paired observations, the proteome samples from the two groups are dependent, i.e., two samples from each experimental unit are studied, for example, tumour and mucosa of the same animal. Hence, n1 = n2 in this experimental setup, which we consider as design B. In the approach with one-sample gels, again n gels are prepared, one for each sample. In the 2-D DIGE approach, one can put each pair of samples on the same gel, so that just n1 = n2 gels are necessary. Again, an internal standard composed of all n samples is put on each gel in that approach (Fig. 18.1c). Further aspects of experimental designs for 2-D DIGE experiments were studied by (7–9).
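As a bookkeeping sketch of designs A and B, the hypothetical function below (not from the chapter) enumerates the gels needed under each layout. The assumption that each unpaired DIGE gel carries one biological sample plus the pooled internal standard is ours for illustration; Fig. 18.1b does not fix the channel assignment.

```python
def gel_plan(samples1, samples2, dige=False, paired=False):
    """Enumerate gel layouts for designs A and B (cf. Fig. 18.1).

    samples1, samples2: sample labels of the two groups (for paired
    designs, samples1[i] and samples2[i] come from the same unit).
    Each returned tuple lists what is loaded onto one gel; for DIGE,
    'standard' denotes the pooled internal-standard channel.
    """
    if not dige:
        # one-sample gels: one gel per sample, n = n1 + n2 gels in total
        return [(s,) for s in samples1 + samples2]
    if paired:
        # design B with DIGE: both samples of a pair plus the standard
        # share one gel, so only n1 = n2 gels are needed
        return [(a, b, "standard") for a, b in zip(samples1, samples2)]
    # design A with DIGE: one sample plus the standard per gel
    # (an illustrative assumption; up to three channels are available)
    return [(s, "standard") for s in samples1 + samples2]
```

For two groups of two samples each, the unpaired one-sample design yields four gels, while the paired DIGE design needs only two.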
2.2. Preprocessing of Expression Levels
The readily prepared gels are scanned and each of them yields one, two or three digital images (depending on the 2-DE approach and the experimental design). Automated image analysis detects the protein spots on each gel and calculates associated expression levels, based on either the size or the fluorescence intensity of a spot. Usually, one gel is selected as master gel and the spot maps of all other gels are matched to the spot map of this gel. Before the actual statistical analysis, the expression levels have to be preprocessed in several steps.

In the 2-D DIGE approach, the first step is the removal of dye effects. Assume there have been m spots detected on a gel prepared with three samples, which were labelled by Cy2, Cy3 and Cy5, respectively. Image analysis then yields a triple (x_j^(1), x_j^(2), x_j^(3)) of expression levels for each spot j (j = 1, ..., m), where it is assumed that the different dyes produce different additional gains to the expression levels (Fig. 18.2a). Kreil et al. (10) proposed the normalization model

y_j^(k) = a^(k) x_j^(k) + b^(k),

where y_j^(k) is the normalized expression level of spot j in image k (k = 1, 2, 3), x_j^(k) the associated observed spot intensity, a^(k) an adjustment factor for the dye effect and b^(k) an offset for removing some background intensities. The parameters a^(k) and b^(k) can be estimated by a maximum likelihood approach (11), which is implemented in the package 'vsn' for the free software R (www.r-project.org). After this normalization, dye effects are removed (Fig. 18.2b).

Fig. 18.2. Removal of dye effects in 2-D DIGE gels. In this example, the Cy5 and Cy3 dyes have a stronger impact on the expression level than the Cy2 dye (a). After removing the dye effects, the remaining scatter is supposed to express just the biological variation (b).
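To make the affine model concrete, here is a simplified stand-in: the published approach estimates a^(k) and b^(k) by maximum likelihood within 'vsn' (and uses an arsinh transform), whereas the sketch below, purely for illustration, fits each channel to the spot-wise mean of all channels by ordinary least squares. The function names are ours, not the package's.

```python
def fit_affine(x, target):
    """Ordinary least squares fit of target ~ a*x + b for one dye channel."""
    n = len(x)
    mx, mt = sum(x) / n, sum(target) / n
    sxx = sum((v - mx) ** 2 for v in x)
    sxt = sum((vx - mx) * (vt - mt) for vx, vt in zip(x, target))
    a = sxt / sxx
    return a, mt - a * mx

def normalize_channels(channels):
    """channels[k][j]: observed intensity x_j^(k) of spot j in dye channel k.
    Returns y_j^(k) = a^(k) * x_j^(k) + b^(k), with each channel aligned
    to the spot-wise mean across channels as a common target."""
    m = len(channels[0])
    target = [sum(ch[j] for ch in channels) / len(channels) for j in range(m)]
    out = []
    for ch in channels:
        a, b = fit_affine(ch, target)
        out.append([a * v + b for v in ch])
    return out
```

After this step, systematic dye-specific gains and offsets are gone, and the remaining scatter between channels reflects biological and residual technical variation (cf. Fig. 18.2b).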
As a next step, gels have to be made comparable. In the 2-D DIGE approach, this is achieved by dividing all intensities from the two experimental groups spot-wise by those of the internal standard. In the approach with one-sample gels, all gels can be normalized by the 'quantile normalization' procedure (12), which is available in the R-package 'affy'. This method shifts the distributions of the expression levels from all gels to have the same quantiles.

It is known that the variance of the expression levels from a spot depends on the related mean. Therefore, as a last preprocessing step, a variance-stabilizing function, e.g., the logarithm (usually log2), is applied to the expression levels. (In the 'vsn' algorithm, the arsinh function is used instead.)

A major drawback of 2-DE is that not all spots detected on the master gel are also detected on each of the other gels. Hence, the final number of available expression values differs between the individual spots. It is recommended to predefine (before the experiment is performed) a minimum number of available expression values a spot should have. After preprocessing, all spots detected in fewer gels than this threshold should be excluded from further analysis. If there is only a small proportion of 'missing values' (e.g., less than 5%), an imputation of these missing values can be considered (13, 14).

2.3. Detection of Up- and Downregulated Proteins
The aims of the experiments detailed above are first to find significantly up- or downregulated proteins in either of the compared groups and second to quantify the strength of up- or downregulation. For analysing the first problem, a statistical test is carried out for each of the m detected spots in order to compare the expression levels of the associated protein in the two groups. Because there is no sufficient evidence that expression levels recorded by 2-DE are normally distributed, nonparametric tests are recommendable, i.e., one should use the Mann–Whitney U test in the case of independent groups and the Wilcoxon matched pairs test for paired observations. If the number of missing values is very low and if further clinical covariables are to be studied, other multivariate approaches can be considered, for example, linear models developed for the analysis of gene expression data (15). Typically, the number m of detected spots in a 2-DE experiment lies somewhere between 1 and 4000. Performing such a high number of statistical tests simultaneously increases the probability for false-positive test decisions. In order to reduce the number of false-positive findings, p-values should be adjusted either in terms of the family-wise error rate (FWER) or in terms of the false discovery rate (FDR). The FWER is defined as the probability of at least one false-positive decision. The FDR is defined as the expected portion of false positives among all
positive test decisions. Algorithms for p-value adjustments are detailed by Dudoit et al. (16).

Because statistical significance does not necessarily mean biological relevance, the so-called fold change is usually derived for assessing the strength of up- or downregulation. With log2-transformed expression levels, the fold change is derived as the difference of the means of the two groups, if groups are independent. In the case of dependent observations, the fold change is defined as the mean of the pair-wise differences. If the variance of expression levels is high, the relevance of proteins with large fold changes is often overstated. Therefore, confidence intervals for the fold change can help biologists to correctly interpret the strength of expression change (17).

2.4. Multivariate Procedures for Quality Control
Before comparing groups by statistical tests it is helpful to visualize the data by multivariate techniques. Such methods can serve as quality control and can identify individual gels as outliers. One such method is principal component analysis (PCA), which can be employed to visualize high-dimensional data in a two- or three-dimensional plot. Ideally, the two experimental groups should form two distinct clusters in this plot, and gels outside these clusters should possibly be regarded as outliers. Another multivariate method is cluster analysis, which can be used to visualize high-dimensional data in the shape of a tree (see Chapter 12). Gels with similar expression profiles appear on branches close together, while those with different expression profiles appear on branches far away from each other.
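As a minimal illustration of PCA-based quality control, the sketch below projects gels (rows) described by their spot expression levels (columns) onto the first two principal components, using power iteration with deflation. A real analysis would rely on a linear-algebra routine (e.g., prcomp in R); this stdlib-only version is a didactic stand-in under those assumptions.

```python
import math

def pca_scores(data, k=2, iters=200):
    """Project samples (rows = gels, columns = spot expression levels)
    onto the first k principal components via power iteration."""
    n, p = len(data), len(data[0])
    means = [sum(col) / n for col in zip(*data)]
    X = [[row[j] - means[j] for j in range(p)] for row in data]
    # sample covariance matrix of the columns (p x p)
    C = [[sum(X[i][a] * X[i][b] for i in range(n)) / (n - 1)
          for b in range(p)] for a in range(p)]
    comps = []
    for _ in range(k):
        v = [1.0 / (j + 1) for j in range(p)]  # deterministic start vector
        for _ in range(iters):
            w = [sum(C[a][b] * v[b] for b in range(p)) for a in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(sum(C[a][b] * v[b] for b in range(p)) * v[a] for a in range(p))
        comps.append(v)
        # deflation: remove the found component from C before the next pass
        C = [[C[a][b] - lam * v[a] * v[b] for b in range(p)] for a in range(p)]
    # scores: coordinates of each gel on the retained components
    return [[sum(X[i][j] * c[j] for j in range(p)) for c in comps] for i in range(n)]
```

On well-behaved data, gels from the same experimental group should receive similar scores on the first component, and an isolated gel far from both clusters is a candidate outlier.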
3. Mass Spectrometry

In comparison to gel-based approaches, samples in quantitative MS experiments are not labelled with fluorescent dyes or silver staining but with additional mass tags of different weights, e.g., isotope-coded affinity tags (ICAT) (18) or isobaric tags for relative and absolute quantitation (iTRAQ) (19). In a mass spectrometer, a biological sample is ionized and the ions are then separated by mass. This allows one to determine the existence and abundance of ions with specified mass-to-charge ratios in the sample. A mass spectrum thus consists of a large list of pairs (x, y), where x is a point on the mass-to-charge (m/z) axis and y the detection intensity at that point. The detection intensity is taken as a measure of abundance. In proteomics, proteins or peptides are associated with certain intensity peaks (or combinations of peaks) of a mass spectrum.
3.1. Preprocessing
When recording a mass spectrum, the signal intensity is not detected consistently; in many cases, the intensity is overestimated at the beginning of the spectrum. Therefore, the first step of preprocessing is the estimation and subtraction of a baseline (Fig. 18.3). Second, a peak-finding algorithm searches for the start, mid- and end points of intensity peaks. The m/z value at each mid-point is taken as the reference for the detected peak. All intensity values between the start and end points of a peak are usually summarized (e.g., by the sum or the area under the curve) as the final measure of abundance of that peak (20). As a last step, peaks are aligned over all spectra of the experiment to obtain a common set of pairs (m/z, abundance). For this purpose, common alignment algorithms stretch and compress spectra so that the mid-points of peaks appear at the same positions in all spectra. A set of preprocessing functions for mass spectra is implemented in the R-package 'msProcess'. Further details on the preprocessing of mass spectra are given by Jeffries (21).
Fig. 18.3. Preprocessing of a mass spectrum. Raw mass spectrum with estimated baseline (a) and processed spectrum after baseline subtraction and application of a peak finding algorithm (b).
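The baseline-subtraction and peak-finding steps above (cf. Fig. 18.3) can be sketched as follows. The running-minimum baseline and the simple local-maximum peak detector are deliberately crude stand-ins for the smoother estimators implemented in packages such as 'msProcess', and the function names are ours.

```python
def subtract_baseline(intensities, window=25):
    """Crude baseline: running minimum in a sliding window, then subtract."""
    n = len(intensities)
    corrected = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        corrected.append(max(0.0, intensities[i] - min(intensities[lo:hi])))
    return corrected

def find_peaks(mz, intensities, min_height):
    """Return (start m/z, mid m/z, end m/z, abundance) for each local
    maximum above min_height; the abundance is the summed intensity
    between the start and end points, as described in the text."""
    peaks = []
    for i in range(1, len(intensities) - 1):
        y = intensities[i]
        if y >= min_height and y > intensities[i - 1] and y >= intensities[i + 1]:
            s = i
            while s > 0 and intensities[s - 1] < intensities[s]:
                s -= 1  # walk left while the signal keeps falling
            e = i
            while e < len(intensities) - 1 and intensities[e + 1] < intensities[e]:
                e += 1  # walk right while the signal keeps falling
            peaks.append((mz[s], mz[i], mz[e], sum(intensities[s:e + 1])))
    return peaks
```

Applied to a baseline-corrected spectrum, this yields the per-spectrum (m/z, abundance) pairs that the alignment step then matches across spectra.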
3.2. Design of Two Group Comparisons
The most frequent problem in MS-based proteomics experiments is similar to that of the gel-based approaches, namely the comparison of proteome samples from two groups. In the case of two independent groups, all biological samples are labelled with a mass tag of a certain weight, while a common reference sample is labelled with a mass tag of a different weight. From the mixture of each biological sample with the common reference, one mass spectrum is generated. The peaks of the biological sample and those of the reference sample appear side by side, where the distance between the two on the m/z axis is equal to the difference of the
two mass tags. The ratio of the abundances of these two peaks is taken as a measure for comparing the different samples. In the case of dependent observations, a common reference is not needed. The two samples from the same biological unit are then labelled with two different mass tags, and the ratios between the two samples can be compared directly. For the statistical comparison of the two groups, procedures for multiple hypothesis testing can be applied as in the 2-DE approaches (see also Chapter 9).

3.3. Classification Problems

Another frequent problem in MS-based proteomics is to generate mass spectra from body fluids of patients, for example, serum, saliva and urine (22), and to compare these spectra with the known spectral patterns of previously studied diseased and healthy individuals. The goal is the correct diagnosis and classification of new patients. Often, more than two groups are compared, e.g., healthy individuals with those at different disease states. Classification rules are established by comparing the samples of a training set of individuals. The prognostic power of a rule is evaluated afterwards on a test set. Because the studied mass spectra usually have more features (peaks) than there are individuals in the training set, dimension reduction is a necessary first step in building a classification rule, i.e., peaks which do not contribute considerably to the distinction of the classes have to be removed. Jeffries (23) studied a genetic algorithm in which a collection of mass peaks is regarded as a chromosome and the peaks themselves are regarded as genes. As in population genetics, chromosomes are recombined, mutated and rejected. In each evolutionary step, a classification rule is established with the current set of chromosomes, and those chromosomes which produce high classification errors are rejected. The algorithm stops when a certain stopping criterion is reached (e.g., a predefined number of iteration steps). The peaks from the most promising chromosome are taken for distinguishing between the classes. Another approach to dimension reduction is to transform the spectral space into the space of principal components (24). Afterwards, discriminant analysis can be applied to those principal components which contribute strongly to explaining the variance of the data. Zhang et al. (25) proposed a robust method using support vector machines with regard to outlier-affected MS data.

4. Further Proteomics Approaches
In the above sections we illustrated the statistical aspects of group comparisons in proteomics experiments, which are the most frequent problems in this research area. The range of
statistical methods for this area is, however, much larger and is still expanding. A basic question when planning an experiment for group comparisons concerns the number of samples needed in each group. If one compares the expression of only a single protein between two groups, this number depends on (a) the variance between the studied individuals, (b) the size of the difference one would like to detect between the groups, (c) the tolerable probabilities of false-positive and false-negative test decisions and (d) the type of statistical test used. Thus, the variance between individuals has to be known from earlier studies or from a small pilot collective. In the case of proteomics, where hundreds or thousands of tests are performed simultaneously, the necessary sample size is usually determined in terms of a desired average power, that is, the average probability of detecting a true positive (16). Other approaches seek a sample size such that a predefined FDR is maintained (26). When planning a study for classification problems, the sample size is determined with regard to minimizing the misclassification error. Fu et al. (27) proposed a sequential approach for this problem, in which the sample size is increased until a certain stopping criterion is reached. The two-group comparison can be extended by including a time component as a second experimental factor. Sitek et al. (28), for example, studied protein expression levels in stimulated and non-stimulated proteome samples of neuroblastoma cell lines at 0, 0.5, 1, 6 and 24 hours after stimulation. In the 2-D DIGE approach one then needs n · T gels if the groups are independent and (n/2) · T gels if the groups are dependent, where T is the number of time levels. A difficulty that arises in such an experiment is the construction of the internal standard, because it is usually not possible to merge the samples from all points in time.
Therefore, a global normalization (e.g., quantile normalization) may be a better choice than an internal standard. As in two-group comparisons without a time factor, nonparametric approaches are preferable for the analysis of time-dependent protein expression levels; a detailed description of nonparametric models for such longitudinal data is given in Brunner et al. (29). Besides finding proteins which are differentially regulated between groups, knowledge about the relations between proteins is also important for understanding biological systems and the architecture of biological pathways. Statistical approaches for reconstructing pathways on the basis of protein expression levels are usually based on graphical models, in which proteins are defined as nodes and their interactions as edges. Three such approaches are relevance networks (RN), graphical Gaussian models (GGM) and Bayesian networks (BN) (30). The simplest approach is the RN, where the interaction between each pair of
proteins is estimated by their correlation coefficient. Edges between uncorrelated proteins are not displayed in the graphical model. GGMs instead use partial correlations; the partial correlation between two proteins is defined as their correlation conditional on all other proteins. A more complex and computationally intensive method is the BN, which additionally determines the direction of each edge. Proteomics is a steadily growing field of research, and new statistical challenges continue to arise from enhancements of diverse laboratory techniques, such as liquid chromatography–MS, tandem MS and protein microarrays.

References

1. Nesvizhskii, A. I., Keller, A., Kolker, E., and Aebersold, R. (2002) A statistical model for identifying proteins by tandem mass spectrometry. Anal Chem 75, 4646–4658.
2. Urfer, W., Grzegorczyk, M., and Jung, K. (2006) Statistics for proteomics: a review of tools for analyzing experimental data. Pract Proteomics 1, 48–55.
3. Klose, J., and Kobalz, U. (1995) Two-dimensional electrophoresis of proteins: an updated protocol and implications for functional analysis of the genome. Electrophoresis 16, 1034–1059.
4. Ünlü, M., Morgan, M. E., and Minden, J. S. (1997) Difference gel electrophoresis: A single gel method for detecting changes in protein extracts. Electrophoresis 18, 2071–2077.
5. Aebersold, R., and Goodlett, D. R. (2001) Mass spectrometry in proteomics. Chem Rev 101, 269–295.
6. Stühler, K., Pfeiffer, K., Joppich, C., Stephan, C., Jung, K., Müller, M., Schmidt, O., van Hall, A., Hamacher, M., Urfer, W., Meyer, H. E., and Marcus, K. (2006) Pilot study of the Human Proteome Organisation Brain Proteome Project: Applying different 2-DE techniques to monitor proteomic changes during murine brain development. Proteomics 6, 4899–4913.
7. Karp, N. A., McCormick, P. S., Russell, M. R., and Lilley, K. S. (2007) Experimental and statistical considerations to avoid false conclusions in proteomic studies using differential in-gel electrophoresis. Mol Cell Proteomics 6, 1354–1364.
8. Fodor, I. K., Nelson, D. O., Alegria-Hartman, M., Robbins, K., Langlois, R. G., Turteltaub, K. W., Corzett, T. H., and McCutchen-Maloney, S. L. (2005) Statistical challenges in analysis of two-dimensional difference gel electrophoresis experiments using DeCyder. Bioinformatics 21, 3733–3740.
9. Chich, J.-F., David, O., Villers, F., Schaeffer, B., Lutomski, D., and Huet, S. (2007) Statistics for proteomics: Experimental design and 2-DE differential analysis. J Chromatogr B 849, 261–272.
10. Kreil, D. P., Karp, N. A., and Lilley, K. S. (2004) DNA microarray normalization methods can remove bias from differential protein expression analysis of 2D difference gel electrophoresis results. Bioinformatics 20, 2026–2034.
11. Huber, W., von Heydebreck, A., Sültmann, H., Poustka, A., and Vingron, M. (2002) Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18, S96–S104.
12. Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185–193.
13. Jung, K., Gannoun, A., Sitek, B., Meyer, H. E., Stühler, K., and Urfer, W. (2005) Analysis of dynamic protein expression data. REVSTAT Stat J 3, 99–111.
14. Jung, K., Gannoun, A., Sitek, B., Apostolov, O., Schramm, A., Meyer, H. E., Stühler, K., and Urfer, W. (2006) Statistical evaluation of methods for the analysis of dynamic protein expression data from a tumor study. REVSTAT Stat J 4, 67–80.
15. Smyth, G. K. (2004) Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 3, Article 3.
16. Dudoit, S., Shaffer, J. P., and Boldrick, J. C. (2003) Multiple hypothesis testing in microarray experiments. Stat Sci 18, 71–103.
17. Jung, K., Poschmann, G., Podwojski, K., Eisenacher, M., Kohl, M., Pfeiffer, K., Meyer, H. E., Stühler, K., and Stephan, C. (2009) Adjusted confidence intervals for the expression change of proteins observed in 2-dimensional difference gel electrophoresis. J Proteomics Bioinform 2, 78–87.
18. Gygi, S. P., Rist, B., Gerber, S. A., Turecek, F., Gelb, M. H., and Aebersold, R. (1999) Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol 17, 994–999.
19. Ross, P. L., Huang, Y. N., Marchese, J. N., et al. (2004) Multiplexed protein quantitation in Saccharomyces cerevisiae using amine-reactive isobaric tagging reagents. Mol Cell Proteomics 3, 1154–1169.
20. Boehm, A. M., Pütz, S., Altenhöfer, D., Sickmann, A., and Falk, M. (2007) Precise protein quantification based on peptide quantification using iTRAQ™. BMC Bioinformatics 8, 214.
21. Jeffries, N. (2005) Algorithms for alignment of mass spectrometry proteomic data. Bioinformatics 21, 3066–3073.
22. Pusch, W., Flocco, M. T., Leung, S.-M., Thiele, H., and Kostrzewa, M. (2003) Mass spectrometry-based clinical proteomics. Pharmacogenomics 4, 463–476.
23. Jeffries, N. O. (2004) Performance of a genetic algorithm for mass spectrometry proteomics. BMC Bioinformatics 5, 180.
24. Lilien, R. H., Farid, H., and Donald, B. R. (2003) Probabilistic disease classification of expression-dependent proteomic data from mass spectrometry of human serum. J Comput Biol 10, 925–946.
25. Zhang, X., Lu, X., Shi, Q., Xu, X., Leung, H., Harris, L. N., Iglehart, J. D., Miron, A., Liu, J. S., and Wong, W. H. (2006) Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data. BMC Bioinformatics 7, 197.
26. Cairns, D. A., Barrett, J. H., Billingham, L. J., Stanley, A. J., Xinarianos, G., Field, J. K., Johnson, P. J., Selby, P. J., and Banks, R. E. (2009) Sample size determination in clinical proteomic profiling experiments using mass spectrometry for class comparison. Proteomics 9, 74–86.
27. Fu, W. J., Dougherty, E. R., Mallick, B., and Carroll, R. J. (2005) How many samples are needed to build a classifier: A general sequential approach. Bioinformatics 21, 63–70.
28. Sitek, B., Apostolov, O., Stühler, K., Pfeiffer, K., Meyer, H. E., Eggert, A., and Schramm, A. (2005) Identification of dynamic proteome changes upon ligand activation of Trk-receptors using two-dimensional fluorescence difference gel electrophoresis and mass spectrometry. Mol Cell Proteomics 4, 291–299.
29. Brunner, E., Domhof, S., and Langer, F. (2002) Nonparametric Analysis of Longitudinal Data in Factorial Experiments. John Wiley & Sons, New York.
30. Grzegorczyk, M. (2007) Extracting protein regulatory networks with graphical models. Proteomics 7(S1), 51–59.
Part V Meta-analysis for High-Dimensional Data
Chapter 19

Statistical Methods for Integrating Multiple Types of High-Throughput Data

Yang Xie and Chul Ahn

Abstract

Large-scale sequencing, copy number, mRNA, and protein data hold great promise for biomedical research, while posing great challenges for data management and data analysis. Integrating different types of high-throughput data from diverse sources can increase the statistical power of data analysis and provide deeper biological understanding. This chapter uses two biomedical research examples to illustrate why there is an urgent need to develop reliable and robust methods for integrating heterogeneous data. We then introduce and review some recently developed statistical methods for integrative analysis, for both statistical inference and classification purposes. Finally, we present some useful public-access databases and program code to facilitate integrative analysis in practice.

Key words: Integrative analysis, high-throughput data analysis, microarray.
1. Introduction

With the unprecedented amount of information from high-throughput experiments, such as gene expression microarrays, protein–protein interactions, large-scale sequencing, genome-wide copy number information, and genome-wide DNA–protein binding maps, there is an urgent need to develop reliable and robust methods for integrating these heterogeneous data to generate systematic biological insights into states of cells, mechanisms of disease, and treatments. Integrating diverse sources of data can not only increase the statistical power of data analysis, but also provide deeper biological understanding. Concrete efforts have been made to study the best ways to collect, store, and distribute these

H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_19, © Springer Science+Business Media, LLC 2010
data. This chapter will focus on the statistical methods to improve the power of identifying meaningful biological findings by integrating different types of data efficiently. We will introduce the problem by giving two examples and then discuss several statistical methods.
2. Examples

2.1. Gene Expression Regulation
Gene expression is the process of "the full use of the information in a gene via transcription and translation leading to production of a protein and hence the appearance of the phenotype determined by that gene" (1). The gene expression process determines the intracellular concentration of proteins, which play an important role in many biological systems. On the other hand, the gene expression process is itself controlled by certain proteins (regulators) in an organized way. Transcriptional control is a critical step in the regulation of gene expression. Understanding such control on a genomic level involves deciphering the mechanisms and structures of regulatory programs and networks, which will facilitate understanding how organisms function and respond to environmental signals. Answers to these questions will facilitate basic biology and medical research, leading to applications in clinical diagnosis and to new treatments for diseases. Gene regulation is a complicated biological process; here we describe some of its basic steps. First, cells receive input signals from their environment. Second, through many signaling pathways, some transcription factors (TFs) are activated. Third, the TFs bind to the target genes' cis-regulatory DNA sequences. Finally, the binding of the TFs to their cis-regulatory DNA sequences controls the expression level of the target genes. A critical element of understanding gene regulation is to identify which genes are targets of a specific TF. There are two ways to answer this question: (1) A direct way is to investigate which DNA sequences a TF binds to. A chromatin immunoprecipitation microarray experiment (ChIP-chip or genome-wide location experiment) (2, 3) can be used to detect the genome-wide target DNA sequences that are bound by specific proteins. (2) An indirect way is to investigate which genes are expressed differently in the presence and absence of the TF.
Global expression profiles from microarray experiments can be used for this purpose. DNA sequence information is also very important for identifying the target genes. Global expression profiles (the indirect way), genome-wide location data (the direct way), and DNA sequence data all play important roles in constructing regulatory networks, but none of them
can accurately capture the whole picture alone. Expression profiles alone cannot discriminate direct targets of a transcription factor from indirect downstream effects, all of which are observed when expression profiles alone are analyzed (4). On the other hand, using genome-wide location data we can identify the binding sites of a TF, suggesting that the transcription factor may have regulatory effects on a gene; but it is possible that the TF does not fully, or even partially, regulate the gene at the time (5). Also, DNA sequence data can provide information about the potential binding affinity of each gene for the TF, but potential binding does not necessarily mean that the sequence will be bound and regulated by the TF in vivo. More importantly, because of the high noise-to-signal ratio of high-throughput data, there is limited statistical power to identify the TF binding targets using one source of data alone. Thus, integrating these heterogeneous and independently obtained data can improve the detection power and is a key step toward understanding the complete mechanism of transcriptional regulation and the formation of regulatory networks (2–6). On the other hand, how to integrate diverse types of genomic data efficiently is still a challenging problem in current bioinformatics research. In this chapter, we review some existing methods to address this question.

2.2. Cancer Prognosis
Cancer prognosis predicts the course and outcome of cancer, that is, the chance that a patient will recover or have a recurrence (return of the cancer). Traditionally, cancer prognosis has depended largely on clinical information such as the type and location of the cancer, the stage and grade of the disease, and the patient's age and general health. Recently, patients' molecular profiles have been increasingly used to predict cancer prognosis. Shedden et al. (7) showed that genome-wide expression profiles can improve the prediction of lung cancer recurrence compared to clinical prognostic factors. The molecular profiles used for cancer prognosis will be extended to protein expression profiling, miRNA profiling, DNA copy number profiling, and potentially large-scale tumor genome sequencing, such as for specific oncogene mutations. This information will be coupled with germline DNA polymorphism analysis, which can now evaluate about 10^6 polymorphisms at a time, some of which identify inter-individual variation that could provide prognostic information as well as prediction of response to therapy and of toxicity. Also, some groups are using serum proteomic and antibody profiles for early cancer detection, prognosis, and predicting response to specific therapies. Xie and Minna (8) described how to use molecular profiles to facilitate prediction for lung cancer patients. With such large amounts of high-throughput data, there is an urgent need for statistical methods that integrate this information for cancer prognosis.
3. Integrative Approach for Statistical Inference

3.1. Using a Shrinkage Approach to Incorporate Prior Information
Shrinkage methods have been widely used in classification and prediction (9–11). The motivation comes largely from achieving a better bias–variance trade-off: shrinkage serves as a "de-noising" procedure that reduces the variance (with a possible increase in bias) and therefore improves prediction performance. Besides their good empirical performance, shrinkage methods can also be justified from a Bayesian point of view (12, 13). Xie et al. (14) considered the use of shrinkage in the context of hypothesis testing, such as detecting differential gene expression in expression data or DNA–protein binding sites in genome-wide location data. An advantage of the method is that prior biological knowledge can be incorporated. For example, in detecting differential gene expression for thousands of genes, in many applications we know a priori that most of the genes will be equally expressed (EE); this prior knowledge can be used by shrinking the test statistics of those genes believed a priori to be EE toward their null values (i.e., their expected values under the null hypothesis of EE). Furthermore, we can also take advantage of other sources of data. To illustrate the shrinkage method, Xie et al. (14) used a SAM-t (statistical analysis of microarrays) statistic to analyze microarray data (15, 16), but other, more elaborate statistical methods, such as the nonparametric empirical Bayes method (17), SAM (18), and the mixture model method (16), can also be implemented. The same statistical analysis can be applied to either genome-wide location data or gene expression data. One difference is that only the enriched spots are of interest in location data, whereas both up- and down-regulated genes are of interest in gene expression data. Hence, we use a one-sided test for genome-wide location data, rather than the two-sided test commonly taken for gene expression data.
Specifically, for any given d > 0, we claim any gene i satisfying Z_i > d to be significant, and we estimate the total positive (TP) number as $\widehat{\mathrm{TP}}(d) = \#\{i : Z_i > d\}$. If there are in total G_1 truly positive genes, the sensitivity and specificity are

$$\mathrm{sens}(d) = \mathrm{TP}(d)/G_1, \qquad \mathrm{spec}(d) = 1 - \mathrm{FP}(d)/(G - G_1),$$
where TP and FP are the numbers of total positives and false positives, respectively. The true false discovery rate (FDR) (19, 20) and its estimate are

$$\mathrm{FDR}(d) = \mathrm{FP}(d)/\mathrm{TP}(d), \qquad \widehat{\mathrm{FDR}}(d) = \widehat{\mathrm{FP}}(d)/\widehat{\mathrm{TP}}(d),$$

where $\widehat{\mathrm{FP}}(d)$ is the estimated number of false positives. A standard way to estimate FP is

$$\widehat{\mathrm{FP}}(d) = \sum_{b=1}^{B} \#\{i : z_i^{(b)} > d\}/B,$$

where z_i^(b) is the test statistic calculated from the bth permuted data set and B is the total number of permutations. This standard method may overestimate FP (16), and furthermore, the magnitude of the induced bias may depend on the test statistic being used (21). Hence, it is not appropriate to use the resulting FDR estimates to evaluate different statistics. In this chapter, we use a modified method proposed by Xie et al. (21) to estimate FP. The idea is quite simple: the overestimation of the standard method is mainly caused by the presence of target genes; if we use only the predicted non-target genes to estimate the null distribution, the impact of the target genes is reduced and the estimation of FP improves. Specifically, we use only the non-significant genes to estimate FP:

$$\widehat{\mathrm{FP}}(d) = \sum_{b=1}^{B} \#\{i : z_i^{(b)} > d \text{ and } Z_i \le d\}/B.$$
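As a concrete sketch, the two permutation-based FP estimates can be compared on synthetic data (a minimal illustration in Python; the data sizes, the sign-flip permutation scheme, and the threshold d = 2.0 are our own illustrative assumptions, not taken from the chapter):

```python
import numpy as np

rng = np.random.default_rng(0)
G, G1, B, n = 1000, 100, 50, 6          # genes, true targets, permutations, replicates
data = rng.normal(0.0, 1.0, size=(G, n))
data[:G1] += 1.5                        # the first G1 genes are true targets

def t_stat(x):
    # one-sample t-type statistic per gene (row)
    return x.mean(axis=1) / (x.std(axis=1, ddof=1) / np.sqrt(x.shape[1]))

Z = t_stat(data)

# Null statistics z_i^(b) from B sign-flip permutations of the replicates.
z_null = np.empty((B, G))
for b in range(B):
    z_null[b] = t_stat(data * rng.choice([-1.0, 1.0], size=data.shape))

def fp_hat(d):
    fp_standard = np.sum(z_null > d) / B
    # Modified estimator: count exceedances only among genes that are
    # non-significant in the observed data (Z_i <= d).
    fp_modified = np.sum(z_null[:, Z <= d] > d) / B
    return fp_standard, fp_modified

d = 2.0
tp = np.sum(Z > d)
fp_std, fp_mod = fp_hat(d)
fdr_std, fdr_mod = fp_std / tp, fp_mod / tp
```

Because the modified estimator counts null exceedances only among the predicted non-targets, it is never larger than the standard estimate here, illustrating the bias reduction described above.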
Xie et al. (21) gave more detailed descriptions and justifications for this method. In testing a null hypothesis, a false rejection occurs because, just by chance, the test statistic value (or its absolute value) is too large. Hence, if we know a priori that the null hypothesis is likely to be true, we can shrink the test statistic (or its absolute value) toward zero, which reduces the chance of making a false positive. In the current context, suppose that based on some prior information we can specify a list of genes for which the null hypothesis is likely to hold; the test and null statistics of these genes are then to be shrunken. If gene i is in the list, then for a given threshold value s, its test and null statistics are shrunken as

$$Z_i(s) = \mathrm{sign}(Z_i)(|Z_i| - s)_+, \qquad z_i^{(b)}(s) = \mathrm{sign}(z_i^{(b)})(|z_i^{(b)}| - s)_+,$$

where $f_+ = f$ if $f > 0$ and $f_+ = 0$ if $f \le 0$. On the other hand, if gene i is not in the list, then $Z_i(s) = Z_i$ and $z_i^{(b)}(s) = z_i^{(b)}$.
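The two thresholding operators can be sketched in a few lines of Python (the function names and example values are ours, for illustration only):

```python
import numpy as np

def soft_threshold(z, s):
    # Z_i(s) = sign(Z_i) * (|Z_i| - s)_+ : shrink every statistic toward zero by s
    return np.sign(z) * np.maximum(np.abs(z) - s, 0.0)

def hard_threshold(z, s):
    # leave z unchanged if |z| > s, otherwise set it to zero
    return np.where(np.abs(z) > s, z, 0.0)

z = np.array([-3.0, -0.4, 0.1, 0.5, 2.2])
print(soft_threshold(z, 0.6))
print(hard_threshold(z, 0.6))
```

Soft thresholding moves every surviving statistic continuously toward zero, whereas hard thresholding leaves survivors untouched; this discontinuity is what makes hard-thresholded statistics "jumpy" as s varies.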
We proceed as before to draw statistical inference, using Z_i(s) and z_i^(b)(s). The shrinkage method we use is called soft thresholding (22, 23), in contrast to the usual hard thresholding. In hard thresholding, when |Z_i| is larger than s the new statistic remains unchanged, rather than being shrunken toward zero by the amount s as in soft thresholding. This property makes the statistics generated by hard thresholding "jumpy": as the threshold s increases, the statistics of some genes may suddenly jump from their original values to zero. How many genes to shrink and how much to shrink them are important parameters to be determined in the data analysis. Xie et al. (14) proposed taking multiple trials with various parameter values and then using the estimated FDR as a criterion to choose the optimal values; this has been shown to work well. In practice, we suggest using the area under the curve (AUC) as a measure for comparing estimated FDRs and tuning the parameter. Specifically, we try different s values, for example, s = 0, 0.2, 0.4, and 0.6. For each s value, we estimate the FDRs for different numbers of total positive genes and then plot the estimated FDR against the number of total positive genes. This gives one curve for each s value, and we calculate the AUC of each curve. We choose the s value with the lowest AUC as the optimal parameter value. The idea of using shrinkage to combine two data sets is simple: in the example with gene expression data and DNA–protein binding data, we first use the expression data to generate a prior list of genes that are more likely to be "non-targets" of the protein, and we then shrink the statistics of these genes toward their null values in the analysis of the binding data. We treat all the genes in the list equally; this simplicity makes the shrinkage method flexible and thus applicable to combining various types of data or prior knowledge.
For example, if we can generate a list of genes for which the null hypothesis is more likely to hold, based on a literature review or relevant databases such as Gene Ontology (GO), then we can incorporate this prior information into the subsequent analysis using the proposed shrinkage method. An alternative for a combined analysis is to make the amount of shrinkage for each gene depend on the probability of that gene being in the gene list (or, equivalently, on the amount of statistical evidence for rejecting the null hypothesis for that gene based on the prior information): the higher the probability, the more we shrink the statistic toward zero. This method can possibly use the prior information more efficiently, but it also requires a stronger link and association between the two sources of data; otherwise, it may not perform well. Xie et al. (14) illustrated that the quality of the prior gene list influences the amount of shrinkage that should be applied and the final performance of the shrinkage method. The more the gene
list agrees with the truth, the larger the amount of shrinkage that should be taken, and the better the shrinkage method performs. On the other hand, when the gene list is very unreliable, shrinking does not help. This phenomenon is consistent with general Bayesian logic: how much information we should use from the prior knowledge (i.e., the gene list here) depends on the quality of that knowledge; if the prior knowledge is very vague, then we should use a flat prior (here s = 0) so that the posterior information comes largely or only from the data itself.

3.2. Incorporating Gene Pathway and Network Information for Statistical Inference
Gene functional groups and pathways, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways (24), play critical roles in biomedical research. Recently, genome-wide gene networks, represented by undirected graphs with genes as nodes and gene–gene interactions as edges, have been constructed from high-throughput data. Lee et al. (25) constructed functional networks for the yeast genome using a probabilistic approach, and Franke et al. (26) constructed a human protein–protein interaction network. It is reasonable to assume that neighboring genes in a network are more likely to share biological functions and thus to participate in the same biological processes; therefore, their expression levels are more likely to be similar to each other. Some recent work has attempted to incorporate genome-wide gene network information into the statistical analysis of microarray data to increase the analysis power. Wei and Li (27) proposed integrating KEGG pathways or other gene networks into the analysis of differential gene expression via a Markov random field (MRF) model, in which the state of each gene is modeled directly via an MRF. Spatial statistical models have been increasingly used to incorporate other information into microarray data analysis. Xiao et al. (28) applied a hidden Markov model to incorporate gene chromosome location information into gene expression data analysis. Broet et al. (29) applied a spatial mixture model to introduce gene-specific prior probabilities for analyzing comparative genomic hybridization (CGH) data. Wei and Pan (30) extended the work of Broet et al. from incorporating one-dimensional chromosome locations to two-dimensional gene networks. They utilized existing biological knowledge databases, such as KEGG pathways, or computationally predicted gene networks from integrated analysis (25), to construct gene functional neighborhoods and incorporated them into a spatially correlated normal mixture model. The basic rationale for their model is that functionally linked genes tend to be co-regulated and co-expressed, and this is incorporated into the analysis. This is an efficient way to incorporate network information into statistical inference, and we introduce their method in this section.
3.2.1. Standard Normal Mixture Model
We want to identify binding targets or differentially expressed genes in this analysis. We use a latent variable T_i to indicate whether gene i is a true binding target (or differentially expressed) gene. Suppose that the distribution functions of the data (e.g., the Z-score) for the genes with T_i = 1 and T_i = 0 are f_1 and f_0, respectively. Assuming that a priori all the genes are independently and identically distributed (iid), the marginal distribution of Z_i is a standard mixture model:

$$f(z_i) = \pi_0 f_0(z_i) + (1 - \pi_0) f_1(z_i), \qquad [1]$$

where π_0 is the prior probability that H_{0,i} (the null hypothesis) holds. It is worth noting that the prior probabilities are the same for all genes. The standard mixture model has been widely used in microarray data analysis (17, 31–33). The null and non-null distributions f_0 and f_1 can be approximated by finite normal mixtures:

$$f_0 = \sum_{k_0=1}^{K_0} \pi_{0k_0}\,\phi(\mu_{k_0}, \sigma_{k_0}^2), \qquad f_1 = \sum_{k_1=1}^{K_1} \pi_{1k_1}\,\phi(\mu_{k_1}, \sigma_{k_1}^2),$$

where φ(μ, σ²) is the density function of a normal distribution with mean μ and variance σ². For Z-scores, using K_j = 1 often suffices (33); Wei and Pan (30) demonstrated that K_0 = 2 and K_1 = 1 worked well in most cases. The standard mixture model can be fitted via maximum likelihood with the Expectation–Maximization (EM) algorithm (34). Once the parameter estimates are obtained, statistical inference is based on the posterior probability that H_{1,i} (the alternative hypothesis) holds:

$$\Pr(T_i = 1 \mid z_i) = \pi_1 f_1(z_i)/f(z_i).$$

3.2.2. Spatial Normal Mixture Model
In a spatial normal mixture model, Wei and Pan (30) introduced gene-specific prior probabilities π_{i,j} = Pr(T_i = j) for i = 1, ..., G and j = 0, 1. The marginal distribution of z_i is

$$f(z_i) = \pi_{i,0} f_0(z_i) + \pi_{i,1} f_1(z_i), \qquad [2]$$

where f_0(z_i) and f_1(z_i) represent the density functions under the null and alternative hypotheses, respectively. Note that the prior probability specification in a stratified mixture model (35, 36) is a special case of Equation [2]: a group of genes with the same function share a common prior probability, while different groups may have different prior probabilities; in fact, a partition of the genes by their functions can be regarded as a special gene network. Based on a gene network, the prior probabilities π_{i,j} are related to two latent Markov random fields x_j = {x_{i,j}; i = 1, ..., G} by a logistic transformation:

$$\pi_{i,j} = \exp(x_{i,j})/[\exp(x_{i,0}) + \exp(x_{i,1})].$$
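To make this concrete, the following Python sketch (ours, not the authors' implementation) fits the standard mixture model [1] with K_0 = K_1 = 1 by EM, computes the posterior probabilities Pr(T_i = 1 | z_i), and applies the logistic transformation to two latent fields; for simplicity the fields here are plain random numbers standing in for the spatially smoothed x_j, and all data and starting values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic Z-scores: ~90% null N(0, 1), ~10% non-null N(3, 1) (illustrative).
G = 2000
is_target = rng.random(G) < 0.1
z = np.where(is_target, rng.normal(3.0, 1.0, G), rng.normal(0.0, 1.0, G))

def phi(x, mu, var):
    # normal density with mean mu and variance var
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# EM for f(z) = pi0 * phi(mu0, v0) + (1 - pi0) * phi(mu1, v1)
pi0, mu0, v0, mu1, v1 = 0.5, 0.0, 1.0, 2.0, 1.0
for _ in range(200):
    # E-step: posterior probability that each gene is null
    num0 = pi0 * phi(z, mu0, v0)
    num1 = (1 - pi0) * phi(z, mu1, v1)
    w0 = num0 / (num0 + num1)
    # M-step: update mixing proportion, component means and variances
    pi0 = w0.mean()
    mu0 = np.sum(w0 * z) / np.sum(w0)
    v0 = np.sum(w0 * (z - mu0) ** 2) / np.sum(w0)
    mu1 = np.sum((1 - w0) * z) / np.sum(1 - w0)
    v1 = np.sum((1 - w0) * (z - mu1) ** 2) / np.sum(1 - w0)

post_alt = 1.0 - w0   # Pr(T_i = 1 | z_i), the quantity used for inference

# Gene-specific prior probabilities via the logistic transformation; in the
# spatial model x0, x1 would follow the ICAR, here they are arbitrary fields.
x0, x1 = rng.normal(size=G), rng.normal(size=G)
pi_i1 = np.exp(x1) / (np.exp(x0) + np.exp(x1))
```

Replacing the global π_0 in the E-step by the gene-specific π_{i,0} is exactly the step that turns the standard model [1] into the gene-specific model [2].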
Each of the G-dimensional latent vectors x_j is distributed according to an intrinsic Gaussian conditional autoregression (ICAR) model (37). One key feature of the ICAR is the Markovian interpretation of the latent variables' conditional distributions: the distribution of each spatial latent variable x_{i,j}, conditional on x_{(−i),j} = {x_{k,j}; k ≠ i}, depends only on its direct neighbors. More specifically, we have

$$x_{i,j} \mid x_{(-i),j} \sim N\!\left(\frac{1}{m_i}\sum_{l \in \delta_i} x_{l,j},\; \frac{\sigma_{Cj}^2}{m_i}\right),$$

where δ_i is the set of indices of the neighbors of gene i, and m_i is the corresponding number of neighbors. To ensure identifiability, they imposed $\sum_i x_{i,j} = 0$ for j = 0, 1. In this model, the parameter $\sigma_{Cj}^2$ acts as a smoothing prior for the spatial field and consequently controls the degree of dependency among the prior probabilities of the genes across the genome: the smaller $\sigma_{Cj}^2$ is, the more similar the π_{i,j}'s of genes that are neighbors in the network.

3.3. Joint Modeling Approaches for Statistical Inference
We now discuss a joint modeling approach to integrating different sources of data. The benefit of joint modeling is that it can potentially improve the statistical power to detect target genes. For example, a gene may be supported in each source of data with some, but not overwhelming, evidence of being a target gene; in other words, the gene would not be identified as statistically significant based on either source of data alone. By integrating the different sources of data in a joint model, however, the gene may be found to be significant. Pan et al. (38) proposed a nonparametric empirical Bayes approach to joint modeling of DNA–protein binding data and gene expression data. Simulated data showed the improved performance of the proposed joint modeling approach over that of other approaches, including using binding data or expression data alone, taking the intersection of the results of the two separate analyses, and a sequential Bayesian method that uses the results of analyzing the expression data as priors for the subsequent analysis of the binding data. Application to a real data example also showed the benefits of joint modeling. The nonparametric empirical Bayes approach is attractive due to its flexibility. Xie (39) proposed a parametric Bayesian approach to jointly modeling DNA–protein binding data (ChIP-chip data), gene expression data and DNA sequence data to identify the binding target genes of a transcription factor. We focus on this method here.
3.3.1. Analyzing Binding Data Alone
Xie and Ahn

We use a Bayesian mixture model (40) to analyze the binding data. Specifically, suppose X_ij is the log ratio of the intensities of the test and control samples in a ChIP-chip experiment for gene i (i = 1, ..., G) and replicate j (j = 1, ..., K). We specify the model as follows:
X_ij | μ_ix ∼ N(μ_ix, σ²_ix),
μ_ix | I_ix = 0 ∼ N(0, τ²_0x),
μ_ix | I_ix = 1 ∼ N(λ_x, τ²_1x),
I_ix | p_x ∼ Ber(p_x),

where μ_ix is the mean log ratio for gene i, and I_ix is an indicator variable: I_ix = 0 means gene i is a non-binding target gene and I_ix = 1 means gene i is a binding target gene. We assume that the mean log ratios of non-target genes concentrate around 0 with a small variance (τ²_0x), while the expected mean log ratios of target genes follow a normal distribution with a positive mean. The prior distribution for the indicator I_ix is a Bernoulli distribution with probability p_x. The advantage of this hierarchical mixture model is that we can borrow information across genes to estimate the expected mean intensities, and we can use the posterior probability of being a binding target gene directly for inference.

3.3.2. Joint Modeling
Similar to the binding data, we use mixture models to fit the expression data Y_ij and the sequence data z_i. For the expression data,

Y_ij | μ_iy ∼ N(μ_iy, σ²_iy),
μ_iy | I_iy = 0 ∼ N(0, τ²_0y),
μ_iy | I_iy = 1 ∼ N(λ_y, τ²_1y),
μ_iy | I_iy = 2 ∼ N(−λ_y, τ²_1y),
I_iy | I_ix = 0 ∼ Multinomial(p_y00, p_y10, p_y20),
I_iy | I_ix = 1 ∼ Multinomial(p_y01, p_y11, p_y21),

where I_iy is a three-level categorical variable: I_iy = 0 indicates an equally expressed gene, I_iy = 1 an up-regulated gene, and I_iy = 2 a down-regulated gene. Here we use conditional probabilities to connect the binding data and the expression data. Intuitively, the probability of being equally expressed for a non-binding target gene, p_y00, should be higher than the probability of being equally expressed for a binding target gene, p_y01. The difference between the conditional probabilities measures the correlation between the binding data and the expression data; if the two data sets are independent, the two sets of conditional probabilities will be the same. The model is therefore flexible enough to accommodate correlations between the data.
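To make the role of these conditional probabilities concrete, the following minimal Python sketch (all probability values are hypothetical, not taken from the chapter) checks the independence condition and forms the marginal distribution of I_iy by mixing the two conditionals over I_ix:

```python
# Hypothetical conditional probabilities P(I_y = 0, 1, 2 | I_x)
p_y0 = [0.80, 0.10, 0.10]  # given I_x = 0: non-targets are mostly equally expressed
p_y1 = [0.30, 0.50, 0.20]  # given I_x = 1: targets are often up-regulated
p_x = 0.1                  # prior probability of being a binding target

# If the expression data carried no information about binding, p_y0 would equal p_y1
independent = p_y0 == p_y1

# Marginal distribution of I_y mixes the two conditionals over I_x
p_y_marginal = [(1 - p_x) * a + p_x * b for a, b in zip(p_y0, p_y1)]
```

Here the two conditional vectors differ, so the expression labels carry information about binding status.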
Statistical Methods for Integrating Multiple Types of High-Throughput Data
Similarly, we model the sequence data as

z_i | I_iz = 0 ∼ N(λ_z1, τ²_1z),
z_i | I_iz = 1 ∼ N(λ_z2, τ²_2z),
I_iz | I_ix = 0 ∼ Ber(p_z0),
I_iz | I_ix = 1 ∼ Ber(p_z1),

where I_iz = 1 indicates that gene i is a potential target gene based on the sequence data (we call it a potential gene), and I_iz = 0 means that gene i is a non-potential gene.
Fig. 19.1. A graphical overview of the hierarchical structure of the joint model.
Figure 19.1 gives a graphical overview of this model. In summary, the model combines the expression data and the sequence data with the binding data through the indicator variables. This model can automatically account for heterogeneity and the different specificities of multiple sources of data. The posterior distribution of being a binding target can be used to explain how this model integrates the different data sources. For example, if we combine binding and expression data, the posterior distribution of being a binding target gene I_ix is

I_ix | · ∼ Ber(p_ix),  p_ix = A/(A + B),

A = p_x (τ²_1x)^{−1/2} exp( −(μ_ix − λ_x)² / (2τ²_1x) ) · p_y01^{1(I_iY=0)} p_y11^{1(I_iY=1)} p_y21^{1(I_iY=2)},

B = (1 − p_x) (τ²_0x)^{−1/2} exp( −μ²_ix / (2τ²_0x) ) · p_y00^{1(I_iY=0)} p_y10^{1(I_iY=1)} p_y20^{1(I_iY=2)},
where I_ix | · represents the posterior distribution of I_ix conditional on all other parameters in the model and the data. We define p_y0 = (p_y00, p_y10, p_y20) and p_y1 = (p_y01, p_y11, p_y21). If the expression data do not contain information about binding, then p_y0 = p_y1, which makes all the terms containing Y in the formula cancel out. In
this case, only the information contained in the binding data X is used for inference. On the other hand, the difference between p_y0 and p_y1 will be large when the expression data contain information about binding. In this case, the information in the expression data I_iY will also be used for inference.

3.3.3. Statistical Inference
Assuming that the binding, expression, and sequence data are conditionally independent (conditional on the indicator I_ix), we can obtain the joint likelihood for the model. Based on the joint likelihood, we can derive closed-form full conditional posterior distributions for most of the parameters (except λ_x, λ_y, and λ_z). A Gibbs sampler is used to run the Markov chain Monte Carlo simulation for the parameters having closed-form conditionals. For λ_x, λ_y, and λ_z, the Metropolis–Hastings algorithm is applied to draw the simulation samples. The iterations after burn-in are used as posterior samples for statistical inference.
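As an illustrative sketch of one inference step (not the authors' code; all parameter values below are hypothetical), the posterior probability p_ix = A/(A + B) for the binding-plus-expression model can be evaluated directly once the other parameters are fixed, as happens within one sweep of the Gibbs sampler:

```python
import math

# Hypothetical parameter values for the binding + expression joint model
p_x, lam_x = 0.1, 1.5               # prior target probability; target mean of mu_ix
tau2_0x, tau2_1x = 0.05, 0.5        # variances of the non-target / target components
p_y0 = {0: 0.80, 1: 0.10, 2: 0.10}  # P(I_iy | I_ix = 0)
p_y1 = {0: 0.30, 1: 0.50, 2: 0.20}  # P(I_iy | I_ix = 1)

def posterior_target_prob(mu_ix, i_y):
    """Full conditional p_ix = A/(A+B) that gene i is a binding target."""
    A = p_x * tau2_1x ** -0.5 * math.exp(-(mu_ix - lam_x) ** 2 / (2 * tau2_1x)) * p_y1[i_y]
    B = (1 - p_x) * tau2_0x ** -0.5 * math.exp(-mu_ix ** 2 / (2 * tau2_0x)) * p_y0[i_y]
    return A / (A + B)

# High mean log ratio plus up-regulation -> near-certain target;
# zero mean log ratio plus equal expression -> near-certain non-target
p_strong = posterior_target_prob(1.2, 1)
p_weak = posterior_target_prob(0.0, 0)
```

A gene that is supported by both data sources is pushed toward posterior probability one, which is the mechanism by which joint modeling gains power.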
3.3.4. The Effects of Joint Modeling
Xie (39) illustrated that, when using the binding data alone, the estimated posterior probabilities were positively associated with the mean binding intensities; in that case the posterior probability does not depend on the expression data or the sequence data. After joint modeling, in contrast, the posterior probabilities of the genes with high expression values were increased compared to using the binding data alone, while the sequence score did not have much influence on the inference for these data. In summary, this model can automatically account for the heterogeneity and different specificities of multiple sources of data. Even if an additional data type does not contain any information about binding, the model approximately reduces to the one using the binding data alone.
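That reduction can be checked numerically. In the sketch below (hypothetical parameter values, not the authors' code), setting the two conditional probability vectors equal makes the joint posterior coincide with the binding-only posterior, because the common factor cancels in A/(A + B):

```python
import math

# Hypothetical binding-model parameters
p_x, lam_x, tau2_0x, tau2_1x = 0.1, 1.5, 0.05, 0.5

def binding_only(mu):
    """Posterior target probability using the binding data alone."""
    a = p_x * tau2_1x ** -0.5 * math.exp(-(mu - lam_x) ** 2 / (2 * tau2_1x))
    b = (1 - p_x) * tau2_0x ** -0.5 * math.exp(-mu ** 2 / (2 * tau2_0x))
    return a / (a + b)

def joint(mu, i_y, p_y0, p_y1):
    """Joint posterior: the binding terms are multiplied by the expression conditionals."""
    a = p_x * tau2_1x ** -0.5 * math.exp(-(mu - lam_x) ** 2 / (2 * tau2_1x)) * p_y1[i_y]
    b = (1 - p_x) * tau2_0x ** -0.5 * math.exp(-mu ** 2 / (2 * tau2_0x)) * p_y0[i_y]
    return a / (a + b)

same = [0.6, 0.2, 0.2]  # uninformative expression data: identical conditionals
reduced = joint(0.8, 1, same, same)
base = binding_only(0.8)
```

When the expression conditionals differ, the two posteriors separate, reflecting the extra information.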
4. Integrative Analysis for Classification Problem
Statistical classification methods, such as the support vector machine (SVM) (41), random forest (42), and Prediction Analysis for Microarrays (PAM) (9), have been widely used for the diagnosis and prognosis of breast cancer (10, 43), prostate cancer (44, 45), lung cancer (7, 46), and leukemia (47). Meanwhile, the biological functions and relationships of genes have been explored intensively by the biological research community, and this information has been stored in databases such as those with the Gene Ontology annotations (48) and the Kyoto Encyclopedia of Genes and Genomes (24). In addition, as mentioned before, prior experiments with similar biological objectives may have generated data that are relevant to the current study. Hence, integrating information from prior data or biological knowledge has the potential to improve classification and prediction performance. Standard classification methods treat all genes equally a priori in the process of model building, ignoring biological knowledge of gene functions, which may result in a loss of effectiveness. For example, some genes have been identified or hypothesized to be related to cancer by previous studies; others may be known to share a function with, or be involved in a pathway with, known or putative cancer-related genes. Hence we may want to treat these genes differently from other genes a priori when choosing genes to predict cancer-related outcomes. Some recent research has taken advantage of such prior information for classification problems. Lottaz and Spang (49) proposed a structured analysis of microarray data (StAM), which utilizes the GO hierarchical structure. Biological functions of genes in the GO hierarchy are organized as a directed acyclic graph: each node in the graph represents a biological function, and a child node has a more specific function while its parent node has a more general one. StAM first builds classifiers for every leaf node based on an existing method, such as PAM, and then propagates their classification results by a weighted sum to their parent nodes. The weights in StAM are related to the performance of the classifiers, and a shrinkage scheme is used to shrink the weights toward zero so that a sparse representation is possible. This process is repeated until the results are propagated to the root node. Because the final classifier is built based on the GO tree, StAM greatly facilitates the interpretation of the final result in terms of identifying biological processes that are related to the outcome.
StAM uses only genes that are annotated in the leaf nodes (i.e., with the most detailed biological functions) as predictors, so it may miss some important predictive genes. Wei and Li (27) proposed a modified boosting method, nonparametric pathway-based regression (NPR), to incorporate gene pathway information into the classification model. NPR assumes that the genes can first be partitioned into several groups or pathways, and only new pathway-specific classifiers (i.e., using only the genes in each of the pathways) are built in the boosting procedure. More recently, Tai and Pan (50) proposed a flexible statistical method to incorporate prior knowledge of genes into prediction models. They adopted group-specific penalty terms in a penalized method, allowing genes from different groups to have different prior distributions (e.g., different prior probabilities of being related to the cancer). Their model is similar to NPR with regard to grouping genes, but their approach applies to any penalized method through the use of group-specific penalty terms, while NPR applies only to boosting. Garrett-Mayer et al. (51) proposed using meta-analytic
approaches to combine several studies using pooled estimates of effect sizes.
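As a minimal illustration of pooling effect sizes across studies, here is a generic fixed-effect, inverse-variance combination with hypothetical numbers (a common meta-analytic scheme, not the specific method of (51)):

```python
# Minimal fixed-effect (inverse-variance) pooling of per-study effect sizes;
# all numbers are hypothetical
effects = [0.40, 0.55, 0.30]    # estimated effect size in each study
variances = [0.04, 0.09, 0.02]  # squared standard errors of those estimates

weights = [1.0 / v for v in variances]          # precise studies get larger weights
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
pooled_se = (1.0 / sum(weights)) ** 0.5          # pooled estimate is more precise
```

The pooled standard error is smaller than any single study's standard error, which is the basic payoff of combining studies.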
5. Useful Databases and Program Code

5.1. Databases
To meet the pressing need for integrated analysis of high-throughput data, the National Institutes of Health (NIH) and the European Molecular Biology Laboratory (EMBL) have made concerted efforts to build and maintain several large-scale, publicly accessible databases. These databases are very valuable for both biomedical research and methodology development. We introduce several of them briefly.
5.1.1. ArrayExpress
Founded by the EMBL, ArrayExpress is a public repository for transcriptomics data. It stores gene-indexed expression profiles from a curated subset of experiments in the repository. Public data in ArrayExpress are available for browsing and querying on experiment properties, submitter, species, etc. Queries return summaries of experiments, and complete data or subsets can be retrieved. A subset of the public data is re-annotated to update the array design annotation and curated for consistency. These data are stored in the ArrayExpress Warehouse and can be queried on gene, sample, and experiment attributes. Results return graphed gene expression profiles, one graph per experiment. Complete information about ArrayExpress can be found at http://www.ebi.ac.uk/Databases/.
5.1.2. Gene Expression Omnibus (GEO)
Founded by the National Center for Biotechnology Information (NCBI), GEO is a gene expression/molecular abundance repository supporting Minimum Information About a Microarray Experiment (MIAME)-compliant data submissions, and a curated online resource for gene expression data browsing, query, and retrieval. Currently it holds 261,425 records, including 4,931 platforms, 247,016 samples, and 9,478 series. All data are accessible through the web site http://www.ncbi.nlm.nih.gov/geo/.
5.1.3. Oncomine
Founded by Drs. Arul Chinnaiyan and Dan Rhodes at the University of Michigan and currently maintained by Compendia Bioscience, the Oncomine Research Platform is a suite of products for online cancer gene expression analysis dedicated to the academic and non-profit research community. Oncomine combines a rapidly growing compendium of 20,000+ cancer transcriptome profiles with an analysis engine and a web application for data mining and visualization. It currently includes over 687 million data points, 25,447 microarrays, 360 studies, and 40 cancer types. Oncomine access is available through the Oncomine Research Edition and the Oncomine Research Premium Edition. Information is available via the web site http://www.oncomine.org/main/index.jsp.

5.1.4. The Cancer Genome Atlas (TCGA)
A joint effort of the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) of the NIH, TCGA is a comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing. The TCGA Data Portal provides a platform for researchers to search, download, and analyze data sets generated by TCGA. This portal contains all TCGA data pertaining to clinical information associated with cancer tumors and human subjects, genomic characterization, and high-throughput sequencing analysis of the tumor genomes. In addition, the Cancer Molecular Analysis Portal provides analytical tools designed to integrate, visualize, and explore genome characterization from TCGA data. TCGA data can be downloaded from http://tcgadata.nci.nih.gov/tcga/.
5.2. WinBUGS Codes for Incorporating Gene Network Information into Statistical Inference
5.2.1. For a Three-Component Standard Normal Mixture Model

model {
  for( i in 1:N ) {
    Z[i] ~ dnorm(muR[i], tauR[i])    # z-scores
    muR[i] <- mu[T[i]]
    tauR[i] <- tau[T[i]]
    T[i] ~ dcat(pi[])                # latent variable (zero/negative/positive components)
    T1[i] <- equals(T[i],1); T2[i] <- equals(T[i],2); T3[i] <- equals(T[i],3)
  }
  # prior for mixing proportions
  pi[1:3] ~ ddirch(alpha[])
  # priors (means of normal mixture components)
  mu[1] <- 0                         # zero component
  mu[2] ~ dnorm(0, 1.0E-6)I(a,0.0)   # negative component
  mu[3] ~ dnorm(0, 1.0E-6)I(0.0,b)   # positive component
  # priors (precision/variance of normal mixture components)
  tau[1] ~ dgamma(0.1, 0.1)
  tau[2] ~ dgamma(0.1, 0.1)
  tau[3] ~ dgamma(0.1, 0.1)
  sigma2[1] <- 1/tau[1]
  sigma2[2] <- 1/tau[2]
  sigma2[3] <- 1/tau[3]
}
5.2.2. For a Three-Component Spatial Normal Mixture Model
model {
  for( i in 1:N ) {
    Z[i] ~ dnorm(muR[i], tauR[i])    # z-scores
    muR[i] <- mu[T[i]]
    tauR[i] <- tau[T[i]]
    # logistic transformation
    pi[i,1] <- 1/(1+exp(x2[i]-x1[i])+exp(x3[i]-x1[i]))
    pi[i,2] <- 1/(1+exp(x1[i]-x2[i])+exp(x3[i]-x2[i]))
    pi[i,3] <- 1/(1+exp(x1[i]-x3[i])+exp(x2[i]-x3[i]))
    T[i] ~ dcat(pi[i,1:3])           # latent variable (zero/negative/positive components)
    T1[i] <- equals(T[i],1); T2[i] <- equals(T[i],2); T3[i] <- equals(T[i],3)
  }
  # Random Fields specification
  x1[1:N] ~ car.normal(adj[], weights[], num[], tauC[1])
  x2[1:N] ~ car.normal(adj[], weights[], num[], tauC[2])
  x3[1:N] ~ car.normal(adj[], weights[], num[], tauC[3])
  # weights specification
  for(k in 1:sumNumNeigh) { weights[k] <- 1 }
  # priors (precision/variance for MRF)
  tauC[1] ~ dgamma(0.01, 0.01)I(0.0001,)
  tauC[2] ~ dgamma(0.01, 0.01)I(0.0001,)
  tauC[3] ~ dgamma(0.01, 0.01)I(0.0001,)
  sigma2C[1] <- 1/tauC[1]
  sigma2C[2] <- 1/tauC[2]
  sigma2C[3] <- 1/tauC[3]
  # priors (means of normal mixture components)
  mu[1] <- 0                         # zero component
  mu[2] ~ dnorm(0, 1.0E-6)I(a,0.0)   # negative component
  mu[3] ~ dnorm(0, 1.0E-6)I(0.0,b)   # positive component
  # priors (precision/variance of normal mixture components)
  tau[1] ~ dgamma(0.1, 0.1)
  tau[2] ~ dgamma(0.1, 0.1)
  tau[3] ~ dgamma(0.1, 0.1)
  sigma2[1] <- 1/tau[1]
  sigma2[2] <- 1/tau[2]
  sigma2[3] <- 1/tau[3]
}
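The car.normal prior in the code above is the ICAR model described earlier. The following Python sketch (hypothetical five-gene chain network, not part of the WinBUGS program) shows the corresponding full conditional of one latent value given its neighbors:

```python
# Hypothetical 5-gene chain network: adjacency list of each gene's neighbors
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
x = [0.5, -0.2, 0.8, 0.1, -0.4]  # current values of one latent field x_{.,j}
sigma2_C = 0.3                   # smoothing variance; corresponds to 1/tauC[j] above

def icar_conditional(i):
    """ICAR full conditional: x_i | x_{-i} ~ N(average of neighbors, sigma2_C / m_i)."""
    m_i = len(neighbors[i])
    mean = sum(x[l] for l in neighbors[i]) / m_i
    return mean, sigma2_C / m_i

mean_2, var_2 = icar_conditional(2)  # gene 2 borrows from genes 1 and 3
```

Genes with more neighbors have smaller conditional variance, i.e., they are smoothed more strongly toward their neighborhood average.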
Acknowledgments The authors thank Drs. Wei Pan, Peng Wei, Feng Tai, and Guanghua Xiao for discussions and suggestions, and thank
Dr. Peng Wei for providing the WinBUGS programs. This work was partially supported by NIH UL1 RR024982, 1R21 DA027592, and SPORE P50 CA70907.
References

1. Lackie J, Dow J. The Dictionary of Cell and Molecular Biology. Academic Press: London, 1999. 2. Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, et al. Genome-wide location and function of DNA binding proteins. Science 2000; 290(5500): 2306–9. 3. Iyer VR, Horak CE, Scafe CS, Botstein D, Snyder M, Brown PO. Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature 2001; 409(6819):533–8. 4. Shannon MF, Rao S. Transcription. Of chips and ChIPs. Science 2002; 296(5568):666–9. 5. Simon I, Barnett J, Hannett N, Harbison CT, Rinaldi NJ, Volkert TL, Wyrick JJ, Zeitlinger J, Gifford DK, Jaakkola TS, et al. Serial regulation of transcriptional regulators in the yeast cell cycle. Cell 2001; 106(6):697–708. 6. Buck MJ, Lieb JD. ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics 2004; 83(3):349–60. 7. Shedden K, Taylor JMG, Enkemann SA, Tsao MS, Yeatman TJ, Gerald WL, Eschrich S, Jurisica I, Giordano TJ, Misek DE, et al. Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nat Med 2008; 14(8): 822–7. 8. Xie Y, Minna JD. Predicting the future for people with lung cancer. Nat Med 2008; 14(8):812–3. 9. Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA 2002; 99(10): 6567–72. 10. Huang X, Pan W. Linear regression and two-class classification with gene expression data. Bioinformatics 2003; 19(16): 2072–8. 11. Wu B. Differential gene expression detection and sample classification using penalized linear regression models. Bioinformatics 2006; 22(4):472–6.
12. Carlin B, Louis T. Bayes and Empirical Bayes Methods for Data Analysis. Chapman and Hall/CRC Press: Boca Raton, FL, 2000. 13. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. Springer: New York, NY, 2001. 14. Xie Y, Pan W, Jeong KS, Khodursky A. Incorporating prior information via shrinkage: a combined analysis of genome-wide location data and gene expression data. Stat Med 2007; 26(10): 2258–75. 15. Guo X, Qi H, Verfaillie CM, Pan W. Statistical significance analysis of longitudinal gene expression data. Bioinformatics 2003; 19(13):1628–35. 16. Pan W. On the use of permutation in and the performance of a class of nonparametric methods to detect differential gene expression. Bioinformatics 2003; 19(11): 1333–40. 17. Efron B, Tibshirani R, Storey JD, Tusher V. Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc 2001; 96(456):1151–60. 18. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 2001; 98(9):5116–21. 19. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc, Series B 1995; 57: 289–300. 20. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci USA 2003; 100(16):9440–45. 21. Xie Y, Pan W, Khodursky AB. A note on using permutation-based false discovery rate estimates to compare different analysis methods for microarray data. Bioinformatics 2005; 21(23): 4280–8. 22. Donoho DL, Johnstone IM. Adapting to unknown smoothness via wavelet shrinkage. J Am Stat Assoc 1995; 90(432):1200–24. 23. Donoho D. De-noising by soft-thresholding. IEEE Trans Inf Theory 1995; 41(3):613–27.
24. Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000; 28(1):27–30. 25. Lee I, Date SV, Adai AT, Marcotte EM. A probabilistic functional network of yeast genes. Science 2004; 306(5701): 1555–8. 26. Franke L, van Bakel H, Fokkens L, de Jong ED, Egmont-Petersen M, Wijmenga C. Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am J Hum Genet 2006; 78(6):1011–25. 27. Wei Z, Li H. A Markov random field model for network-based analysis of genomic data. Bioinformatics 2007; 23(12): 1537–44. 28. Xiao G, Cavan R, Khodursky A. Improved detection of differentially expressed genes via incorporation of gene location. Biometrics 2009; in press. 29. Broet P, Richardson S. Detection of gene copy number changes in CGH microarrays using a spatially correlated mixture model. Bioinformatics 2006; 22(8): 911–8. 30. Wei P, Pan W. Incorporating gene networks into statistical tests for genomic data via a spatially correlated mixture model. Bioinformatics 2008; 24(3):404–11. 31. Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW. On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J Comput Biol 2001; 8(1):37–52. 32. Pan W. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics 2002; 18(4):546–54. 33. McLachlan GJ, Bean RW, Jones LBT. A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics 2006; 22(13):1608–15. 34. McLachlan G, Peel D. Finite Mixture Models. Wiley: New York, 2000. 35. Pan W. Incorporating gene functions as priors in model-based clustering of microarray gene expression data. Bioinformatics 2006; 22(7):795–801. 36. Lee Y, Nelder JA. Double hierarchical generalized linear models (with discussion). J R Stat Soc: Series C (Applied Statistics) 2006; 55(2):139–85. 37. Besag J, Kooperberg C. On conditional and intrinsic autoregression. Biometrika 1995; 82(4):733–46.
38. Pan W. Incorporating biological information as a prior in an empirical Bayes approach to analyzing microarray data. Stat Appl Genet Mol Biol 2005; 4: Article 12. 39. Xie Y, Jeong KS, Pan W, Xiao G, Khodursky A. A Bayesian approach to joint modeling of protein–DNA binding, gene expression and sequence data. Stat Med 2009; in press. 40. Lonnstedt I, Britton T. Hierarchical Bayes models for cDNA microarray gene expression. Biostatistics 2005; 6:279–91. 41. Vapnik V. Statistical Learning Theory. Wiley: New York, 1998. 42. Breiman L. Random forests. Machine Learning 2001; 45(1):5–32. 43. Wang Y, Klijn JGM, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, van Gelder MEM, Yu J, et al. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 2005; 365(9460): 671–9. 44. Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D’Amico AV, Richie JP, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 2002; 1(2): 203–9. 45. Welsh JB, Sapinoso LM, Su AI, Kern SG, Wang-Rodriguez J, Moskaluk CA, Frierson HFJ, Hampton GM. Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer. Cancer Res 2001; 61(16): 5974–8. 46. Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA 2001; 98(24):13790–5. 47. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999; 286(5439):531–7. 48. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: tool for the unification of biology.
The Gene Ontology Consortium. Nat Genet 2000; 25(1):25–9. 49. Lottaz C, Spang R. Molecular decomposition of complex clinical phenotypes using biologically structured analysis of
microarray data. Bioinformatics 2005; 21(9): 1971–8. 50. Tai F, Pan W. Incorporating prior knowledge of predictors into penalized classifiers with multiple penalty terms. Bioinformatics 2007; 23(14):1775–82.
51. Garrett-Mayer E, Parmigiani G, Zhong X, Cope L, Gabrielson E. Cross-study validation and combined analysis of gene expression microarray data. Biostatistics 2008; 9(2): 333–54.
Chapter 20

A Bayesian Hierarchical Model for High-Dimensional Meta-analysis

Fei Liu

Abstract

Many biomedical applications are concerned with the problem of selecting important predictors from a high-dimensional set of candidates, with gene expression data as one example. Because the sample size in any single study is usually small, it is important to combine information from multiple studies. In this chapter, we introduce a Bayesian hierarchical modeling approach that models study-to-study heterogeneity explicitly to borrow strength across studies. Using a carefully formulated prior specification, we develop a fast approach to predictor selection and shrinkage estimation for high-dimensional predictors. The proposed approach, which is related to the relevance vector machine (RVM), relies on maximum a posteriori (MAP) estimation to rapidly obtain a sparse estimate. As in the typical RVM, there is an intrinsic thresholding property in which unimportant predictors tend to have their coefficients shrunk to zero. The method is illustrated with an application of selecting genes as predictors of time to an event.
Key words: Bayesian hierarchical model, MAP estimation, meta-analysis, relevance vector machine, shrinkage.
1. Introduction

In modern biomedical research, it has become routine to encounter problems involving a massive number of predictors, with gene expression data as one example. Because the sample size from any one study is typically insufficient to allow accurate selection of important predictors, there has been increased emphasis in recent years on borrowing strength across data from multiple studies. Different studies are typically conducted by different

H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_20, © Springer Science+Business Media, LLC 2010
Liu
labs and may involve varying platforms and event definitions. These differences lead to study-to-study heterogeneity, which should be accommodated in the analysis. In this chapter, we focus on the problem of flexibly borrowing strength across studies in selecting predictors of an event time from a massive number of candidates. Following an approach related to (1) for signal reconstruction in the machine learning context, we propose to include the same gene-specific precision parameters for the different studies in order to borrow information. The model specification allows the gene-specific coefficients to vary across studies, while including dependence in the degree of shrinkage toward zero. To allow censoring, we apply the stochastic Expectation-Maximization (SEM) algorithm (2–5). The proposed approach, which we refer to as the hierarchical relevance vector machine with censoring (HRVM-C), produces sparse estimates of the gene-specific coefficients, with many of the coefficients set to zero. We illustrate the use of HRVM-C with an application of selecting genes as predictors of time to an event.
2. Hierarchical Relevance Vector Machine with Censoring

2.1. Multi-task Relevance Vector Machine
Let S be the total number of related studies and n_i the number of samples in study i, i = 1, ..., S. Denote the value of the response variable of the jth subject in study i by t_ij, j = 1, ..., n_i, and the value of its kth predictor variable by x_ijk, k = 1, ..., p. Set t_i = (t_i1, ..., t_in_i)', t = (t_1', ..., t_S')', x_ij = (x_ij1, ..., x_ijp)', and x_i = (x_i1, ..., x_in_i)'. The Multi-task Relevance Vector Machine (MT-RVM) in (1) can be represented as

t_i = x_i β_i + ε_i,  ε_i ∼ N(0, α_0^{−1} I_{n_i}),  β_ik ∼ N(0, α_0^{−1} α_k^{−1}),  [1]

where β_i = (β_i1, ..., β_ip)' is the coefficient vector for study i. To encourage sparsity and information sharing across studies, the MT-RVM further places independent Gamma priors on α = (α_1, ..., α_p),

p(α | c, d) = Π_{k=1}^p Ga(α_k | c, d) = Π_{k=1}^p (d^c / Γ(c)) α_k^{c−1} exp(−d α_k),  [2]

with Γ(c) = ∫_0^∞ z^{c−1} exp(−z) dz. Similarly, a Gamma prior is specified for α_0, p(α_0 | a, b) = Ga(α_0 | a, b).
A Bayesian Hierarchical Model for High-Dimensional Meta-analysis
The values of the hyperparameters a, b, c, and d control the shape of the priors, with small values indicating distributions having a large spike at 0 and heavy right tails. A recommended default choice, given in (1), is a = b = c = d = 0. Although this choice results in an improper posterior distribution, the MAP estimates exist and have a sparseness-favoring property in which many of the elements will be exactly zero, with such elements shared across the different subjects. Hence, one obtains simultaneous variable selection across subjects, while allowing differences. The value of α_k^{−1} reflects the importance of predictor k. In particular, for certain k, one obtains α̂_k = ∞, which implies β̂_ik = 0 for all i. At the other extreme, for values of α̂_k close to zero, substantial heterogeneity is allowed, with β_ik and β_{i'k} being potentially very different. The MAP estimate for α is defined as

α̂^MAP = arg max_α { Σ_k log p(α_k | c, d) + log L(t; α) },

where log L(t; α) is the log-likelihood in [1] after integrating out β_i and α_0,

log L(t; α) = −(1/2) Σ_{i=1}^S { (n_i + 2a) log( t_i' B_i^{−1} t_i + 2b ) + log |B_i| },  [3]

with B_i = I + x_i A^{−1} x_i' and A = diag(α_1, ..., α_p). Under the default choice of the hyperparameters, the MAP estimate for α is equal to the maximum likelihood estimate (MLE). The key to borrowing information is the use of common hyperprior variances, which occur in the second hierarchy of the MT-RVM. To further elaborate on this point, we consider the following simplified case,

t_ij ∼ N(μ_i, 1),  μ_i ∼ N(0, α^{−1}),  i = 1, ..., S; j = 1, ..., n.

We can write the log-likelihood for α in terms of sufficient statistics,

ℓ(α) = −(1/2) Σ_{i=1}^S { log(α^{−1} + 1/n) + t̄²_i· / (α^{−1} + 1/n) },

where t̄_i· = (1/n) Σ_{j=1}^n t_ij. Differentiating ℓ(α) with respect to α and setting the result to 0, we get the MLE for α as
α̂ = S / Σ_{i=1}^S ( t̄²_i· − 1/n ),  if Σ_{i=1}^S ( t̄²_i· − 1/n ) > 0,
α̂ = ∞,  otherwise.
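As a quick numerical sanity check of this closed form (with hypothetical study means, S = 3 and n = 10), the closed-form α̂ should beat nearby values of α on the simplified log-likelihood ℓ(α):

```python
import math

# Hypothetical study-level means in the simplified model t_ij ~ N(mu_i, 1), mu_i ~ N(0, 1/alpha)
S, n = 3, 10
tbar = [0.9, -0.6, 0.4]

diff = sum(t * t - 1.0 / n for t in tbar)
alpha_hat = S / diff if diff > 0 else float("inf")

def loglik(alpha):
    """Simplified profile log-likelihood l(alpha) from the text."""
    v = 1.0 / alpha + 1.0 / n
    return -0.5 * sum(math.log(v) + t * t / v for t in tbar)

# The closed-form estimate maximizes l(alpha) over a small grid around it
candidates = [alpha_hat * f for f in (0.5, 0.9, 1.0, 1.1, 2.0)]
best = max(candidates, key=loglik)
```

With these hypothetical means the selection condition holds, so α̂ is finite and the predictor would be retained.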
Note that the estimating equation for α involves the sufficient statistics t̄_i· from all studies. Further, if α̂ = ∞, it allows simultaneous variable selection by setting all μ_i to 0. Borrowing information also occurs in the estimation of the coefficients: in our simple example, the posterior mean of μ_i is (α^{−1}/(α^{−1} + 1/n)) t̄_i·, which has been shrunk toward the prior mean of zero. In many applications, interest lies not only in predictors that occur consistently in all the studies, but also in those that are highly significant in some of the studies. The example suggests that the MT-RVM is sensitive to both types of signals. A predictor will be selected whenever Σ_{i=1}^S ( t̄²_i· − 1/n ) > 0, which includes the following two cases: (a) t̄²_i· > 1/n for all i. This corresponds to the case in which μ_i is included in all the studies; it follows that α̂ < ∞, and thus μ_i will be selected. (b) t̄²_i· >> 1/n for some i. This corresponds to the case in which the signal is very strong in some study; again, α̂ < ∞, and thus μ_i will be selected.

2.2. HRVM-C
Building on the MT-RVM approach, we propose the HRVM-C method for high-dimensional variable selection in meta-analysis of survival data. We first extend the accelerated failure time (AFT) model (6) to a multi-study AFT model as follows. Denoting the log failure time (log survival time) for subject j in study i by t_ij, we model the log failure time for each individual study by the AFT model as in [1], and then combine data from multiple studies by placing multivariate Student-t distributions as priors for the study-specific coefficients,
p(β_i) ∝ ( 2a + a b^{−1} β_i' A β_i )^{−(p+2a)/2},

where A = diag(α_1, ..., α_p). For the precision parameters α, we specify Gamma priors as in [2]. Following (7), we can express the multivariate Student-t distribution as a scale mixture of normals, which leads to the MT-RVM model as in [1] and [2]. The hyperparameters are set to the default values of the MT-RVM. The log-likelihood for α in [3] does not hold in the presence of censoring. Here, we focus on the case of right censoring, although it is straightforward to allow for interval censoring using the same type of strategy. Let y_ij be the censored observation and δ_ij the censoring indicator: y_ij = t_ij if δ_ij = 1, and t_ij > y_ij if δ_ij = 0. Instead of observing t, we observe (y_i, δ_i), for i = 1, ..., S, where y_i = (y_i1, ..., y_in_i)' and δ_i = (δ_i1, ..., δ_in_i)'.
A Bayesian Hierarchical Model for High-Dimensional Meta-analysis
535
For study i, we further denote the non-censored observations by y_i(nc) and the censored observations by y_i(c). Correspondingly, the predictor variables are denoted by x_i(nc) and x_i(c), respectively. Set y_(nc) = (y_1(nc), . . . , y_S(nc)), y_(c) = (y_1(c), . . . , y_S(c)), and y = (y_(nc), y_(c)). Therefore, the log-likelihood is

log L(y, δ; α) = log L(y_(nc); α) + Σ_{i,j: δ_ij=0} log ∫_{t_ij > y_ij} p(t_ij | x_ij, α) dt_ij,

where log L(y_(nc); α) is defined in [3] and p(t_ij | x_ij, α) is defined according to [1], (t_ij | x_ij, α) ∼ N(x_ij′ β_i, α_0^{−1}). The MAP estimate for α under censoring is thus defined as

α̂^MAP = arg max_α { Σ_k log p(α_k | c, d) + log L(y, δ; α) }.  [4]
Equation [4] has no analytic solution. The Monte Carlo EM algorithm could be used, but it is very computationally intensive for large data sets. Hence, in the next section, we describe an alternative stochastic EM (SEM) approach.
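To illustrate the kind of imputation such a stochastic approach requires, a right-censored log-failure time, known only to exceed y_ij, can be drawn from the truncated normal implied by a normal AFT model via inverse-CDF sampling. This is a minimal sketch with hypothetical names (mu and sigma stand in for the study-specific mean x_ij′β_i and standard deviation α_0^{−1/2}), not the chapter's own code:

```python
import random
from statistics import NormalDist

def impute_censored(y, mu, sigma, rng):
    """Draw t from N(mu, sigma^2) truncated to t > y (right censoring).

    Inverse-CDF trick: if F is the normal CDF, sample u uniformly on
    (F(y), 1) and return F^{-1}(u); the draw always exceeds y.
    """
    nd = NormalDist(mu, sigma)
    lo = nd.cdf(y)                      # probability mass below the censoring time
    u = lo + (1.0 - lo) * rng.random()  # uniform on (F(y), 1)
    return nd.inv_cdf(u)
```

Repeated over all censored observations, such draws produce one pseudo-complete data set per iteration.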
2.3. Computation Details
For complete data (no censoring), (1) developed a fast algorithm to obtain α̂^MAP. The key of their algorithm is to locally maximize the log-likelihood by considering the dependence of the log-likelihood on a single parameter α_k, k = 1, . . . , p. Letting x_ik be the kth column of x_i and B_{i,−k} be the matrix after removing the contribution of x_ik from B_i, the algorithm increases the log-likelihood at each iteration by setting

α̂_k ≈ S / Σ_{i=1}^{S} [ ((n_i + 2a) q_ik²/g_ik − s_ik) / (s_ik (s_ik − q_ik²/g_ik)) ]   if Σ_{i=1}^{S} [ ((n_i + 2a) q_ik²/g_ik − s_ik) / (s_ik (s_ik − q_ik²/g_ik)) ] > 0,

α̂_k = ∞   otherwise,

where

s_ik = x_ik′ B_{i,−k}^{−1} x_ik,   q_ik = x_ik′ B_{i,−k}^{−1} t_i,   and   g_ik = t_i′ B_{i,−k}^{−1} t_i + 2b.
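Assuming the piecewise update reconstructed above, a direct transcription might look like the following. The function and argument names are hypothetical, and the per-study quantities s_ik, q_ik, g_ik are taken as inputs (computing B_{i,−k}^{−1} from the data is omitted):

```python
import math

def update_alpha_k(s, q, g, n, a=1e-6):
    """One coordinate-wise update for alpha_k in the complete-data algorithm.

    s, q, g, n are sequences over studies i = 1..S holding s_ik, q_ik, g_ik
    and n_i; `a` is the Gamma hyperparameter. Returns math.inf when the
    predictor should be pruned from all studies.
    """
    total = 0.0
    for s_ik, q_ik, g_ik, n_i in zip(s, q, g, n):
        ratio = q_ik ** 2 / g_ik
        total += ((n_i + 2 * a) * ratio - s_ik) / (s_ik * (s_ik - ratio))
    if total > 0:
        return len(s) / total  # finite precision: predictor k stays in the model
    return math.inf            # alpha_k = inf removes predictor k everywhere
```

Iterating this update over k = 1, . . . , p adds and deletes predictors exactly as described in the text.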
Setting α_k = ∞ implies that β_ik = 0 for all i, removing the kth predictor from the model. Therefore, we can add or delete predictors (genes) by iteratively performing the above estimation for different k. In the presence of censoring, we can obtain α̂^MAP in [4] by the SEM algorithm as follows. At each iteration, the S-step of
536
Liu
the SEM imputes a single complete data set given the current parameter values, with the M-step then performed on the completed data set as discussed above. The SEM algorithm generates a Markov chain, {α^(1), . . . , α^(N)}, which can be averaged to produce the point estimate of α. For each k, k = 1, . . . , p, we calculate the frequency of inclusion for a given predictor, defined as

p_k = (1/N) Σ_{h=1}^{N} I(α_k^(h) < ∞),

where p_k is the proportion of iterations for which the kth predictor is included in the model. We keep those predictors with p_k bigger than a certain threshold (e.g., p_k > 0.5), and then estimate the corresponding hyperparameters by α̂_k^{−1} = (1/N) Σ_{h=1}^{N} 1/α_k^(h).
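The two chain summaries above are simple averages over SEM draws. A minimal sketch (hypothetical names; each draw stores α_k = ∞ for an excluded predictor, so 1/α_k contributes zero):

```python
import math

def summarize_sem_chain(chain, threshold=0.5):
    """Compute inclusion frequencies p_k and averaged alpha_k^{-1} from SEM draws.

    `chain` is a list of N iterations; each iteration is a list of p alpha
    values, with math.inf meaning predictor k was excluded at that iteration.
    Returns (p, inv_alpha, selected), where selected holds the indices k
    with p_k above the threshold.
    """
    n_iter = len(chain)
    n_pred = len(chain[0])
    p = [sum(alpha[k] < math.inf for alpha in chain) / n_iter
         for k in range(n_pred)]
    inv_alpha = [sum(1.0 / alpha[k] for alpha in chain) / n_iter
                 for k in range(n_pred)]
    selected = [k for k in range(n_pred) if p[k] > threshold]
    return p, inv_alpha, selected
```

In practice the burn-in draws would be dropped before calling such a summary, as done in the analysis of Section 3.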
3. Analysis of the Gene Expression Barcode Data
We demonstrate our method with the gene expression barcode data in (8). The data consist of three breast cancer studies (Affymetrix HGU133A array) in (9–11) that include patient survival data. There are 243 subjects in the first study, 156 in the second study, and 101 in the third study. Additionally, in the first study, 52 observations are censored and 15 are missing; all observations in the second study are censored, and 61 observations in the third study are censored. Here, we focus on gene selection by HRVM-C, and hence remove the missing data from consideration. The gene expression profile consists of 22,215 genes. To remove between-study variability in the gene expression profiles, we use the gene barcode of the microarray data as our predictor variables. Many genes have the same status (barcode 1 or 0) for all subjects. To avoid identifiability issues arising from including those genes in the model, we remove them from consideration. This reduces the total number of genes to 11,851. Our goal is to select the genes that are influential for patient survival. We perform gene selection by HRVM-C on the gene barcode data. We run the SEM for 10,000 iterations. The log-likelihood of the pseudo-complete data, Q(α^(h) | α^(h−1)), shown in Fig. 20.1, indicates that the Markov chain is mixing well after the first 2000 iterations. We thus treat the first 2000 iterations as the burn-in period and use the remaining 8000 iterations to estimate p_k and α̂_k^{−1} as discussed in Section 2.3. The left panel of Fig. 20.2 shows the histogram of the nonzero p_k's. To choose the threshold, we note that the value 0.1 seems to separate the p_k's into two groups, with the group to the right of the value corresponding to the signal, and to the left corresponding
Fig. 20.1. Q(α^(h) | α^(h−1)) given by the SEM algorithm.
Fig. 20.2. Left: Histogram of the nonzero p_k; the vertical bar indicates the threshold at 0.1. Right: α̂_k^{−1} versus gene label k given by the SEM algorithm.
to the noise resulting from SEM iterations. We thus use the threshold 0.1, i.e., genes with p_k > 0.1 are kept. This leads to a final selection of 151 genes. The value of α̂_k^{−1} indicates the importance of the kth gene (a bigger α̂_k^{−1} suggests that gene k is more important). In the right panel of Fig. 20.2, we show the estimates of α̂_k^{−1} as a function of the gene label k. With many α̂_k^{−1} being zero, the algorithm produces a sparse model. Finally, we check the biological meaning of the selected genes. Most of the selected genes are known to be cancer related. In Table 20.1, we list the top five selected genes.
Table 20.1
Information about the top five selected genes. The importance of a gene (fourth column) is measured by α̂_k^{−1}, and the frequency (fifth column), p_k, is the proportion of iterations in which the variable is selected.

AffyID     | Chr(a) | Location(b)  | Importance | Frequency | GAN(c) | MIM(d) | Gene name
1007_s_at  | 6      | 2098794      | 50.31091   | 0.7980    | U48705 | 60040  | discoidin domain receptor family, member 1
1053_at    | 7      | −73283769    | 34.68177   | 0.2194    | M87338 | 600404 | replication factor C (activator 1) 2, 40 kDa
117_at     | 1      | 159760659    | 32.13420   | 0.5356    | X51757 | 140555 | heat shock 70 kDa protein 6 (HSP70B′)
121_at     | 2      | −113690044   | 25.43365   | 0.2649    | X69699 | 167415 | paired box 8
1255_g_at  | 6      | 42231151     | 15.67909   | 0.4431    | L36861 | 600364 | guanylate cyclase activator 1A (retina)

a: Chromosome number. b: Chromosomal location measured as the number of base pairs from the p to the q arm; ± signs indicate whether a probe is located on the sense or antisense strand. c: GenBank Accession Number (GAN). d: Mendelian Inheritance in Man (MIM) identifier.
4. Summary

In this chapter, we have considered high-dimensional variable selection across multiple studies by the hierarchical relevance vector machine with censoring (HRVM-C). The method is attractive because it explicitly borrows information across different studies. Censored observations are handled by the SEM algorithm, an efficient alternative to the EM algorithm when the amount of missing data is large. We have also demonstrated the effectiveness of the method on three existing breast cancer studies. We believe HRVM-C provides a useful tool for borrowing information across multiple studies in high-dimensional variable selection problems.
5. Supplementary Material

Further details and complete information needed to reproduce the analysis are available at www.stat.missouri.edu/∼liufei/HRVMCWeb.zip. This includes the Matlab code to conduct the analysis and a brief readme on use of the code.
A Bayesian Hierarchical Model for High-Dimensional Meta-analysis
539
Acknowledgments

This research was supported in part by the Statistical and Applied Mathematical Sciences Institute (SAMSI) Summer 2008 research program on Meta-analysis: Synthesis and Appraisal of Multiple Sources of Empirical Evidence. The gene barcode data used in this paper were kindly provided by Dr. Rafael Irizarry and Dr. Michael Zilliox. Research of Fei Liu was partially supported by the University of Missouri-Columbia research board award.

References
1. Ji, S., Dunson, D., and Carin, L. (2009) Multi-task compressive sensing. IEEE Transactions on Signal Processing 57, 92–106.
2. Chauveau, D. (1995) A stochastic EM algorithm for mixtures with censored data. Journal of Statistical Planning and Inference 46, 1–25.
3. Ip, E. H. S. (1994) A stochastic EM estimator in the presence of missing data: theory and applications. Ph.D. thesis, Department of Statistics, Stanford University.
4. Marschner, I. C. (2001) On stochastic versions of the EM algorithm. Biometrika 88, 281–286.
5. Tregouet, D. A., Escolano, S., Tiret, L., Mallet, A., and Golmard, J. L. (2004) A new algorithm for haplotype-based association analysis: the stochastic-EM algorithm. Annals of Human Genetics 68, 165–177.
6. Datta, S., Le-Rademacher, J., and Datta, S. (2007) Predicting patient survival from microarray data by accelerated failure time modeling using partial least squares and lasso. Biometrics 63, 259–271.
7. West, M. (1987) On scale mixtures of normal distributions. Biometrika 74, 646–648.
8. Zilliox, M. and Irizarry, R. (2007) A gene expression bar code for microarray data. Nature Methods 4, 911–913.
9. Miller, M., Wang, C., Parisini, E., Coletta, R., Goto, R., Lee, S., Barral, D., Townes, M., Roura-Mir, C., Ford, H., Brenner, M., and Dascher, C. C. (2005) Characterization of two avian MHC-like genes reveals an ancient origin of the CD1 family. Proceedings of the National Academy of Sciences, USA 102, 8674–8679.
10. Pawitan, Y., Bjohle, J., Amler, L., Borg, A., Egyhazi, S., Hall, P., Han, X., Holmberg, L., Huang, F., Klaar, S., Liu, E., Miller, L., Nordgren, H., Ploner, A., Sandelin, K., Shaw, P., Smeds, J., Skoog, L., Wedren, S., and Bergh, J. (2005) Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Research 7, R953–R964.
11. Sotiriou, C., Wirapati, P., Loi, S. A. H., Fox, S., Smeds, J., Nordgren, H., Farmer, P., Praz, V., Haibe-Kains, B. C. D., Larsimont, D., Cardoso, F., Peterse, H., Nuyten, D. M. B., Van de Vijver, M. J. B., Piccart, M., and Delorenzi, M. (2006) Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. Journal of the National Cancer Institute 98, 262–272.
Chapter 21 Methods for Combining Multiple Genome-Wide Linkage Studies Trecia A. Kippola and Stephanie A. Santorico Abstract Cardiovascular disease, metabolic syndrome, schizophrenia, diabetes, bipolar disorder, and autism are a few of the numerous complex diseases for which researchers are trying to decipher the genetic composition. One interest of geneticists is to determine the quantitative trait loci (QTLs) that underlie the genetic portion of these diseases and their risk factors. The difficulty for researchers is that the QTLs underlying these diseases are likely to have small to medium effects which will necessitate having large studies in order to have adequate power. Combining information across multiple studies provides a way for researchers to potentially increase power while making the most of existing studies. Here, we will explore some of the methods that are currently being used by geneticists to combine information across multiple genome-wide linkage studies. There are two main types of meta-analyses: (1) those that yield a measure of significance, such as Fisher’s p-value method along with its extensions/modifications and the genome search meta-analysis (GSMA) method, and (2) those that yield a measure of a common effect size and the corresponding standard error, such as model-based methods and Bayesian methods. Some of these methods allow for the assessment of heterogeneity. This chapter will conclude with a recommendation for usage. Key words: Meta-analysis, genome search meta-analysis, combining multiple studies, heterogeneity, Fisher’s p-value method.
H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_21, © Springer Science+Business Media, LLC 2010

1. Introduction

Cardiovascular disease, metabolic syndrome, schizophrenia, diabetes, bipolar disorder, and autism are a few of the numerous complex diseases for which researchers are trying to decipher the genetic composition. The complexities of these types of diseases are myriad. In addition to being oligogenic or polygenic, the
possibility exists for multiple interactions, including gene–gene interactions and gene–environment interactions. One interest of geneticists is to determine the quantitative trait loci (QTLs) that underlie the genetic portion of these diseases and their risk factors. However, in order to succeed, the various types of interactions must also be considered. The difficulty for researchers is that the QTLs underlying these diseases are likely to have small-to-medium effects, necessitating large samples, which may be prohibitively expensive. Combining information across multiple studies provides a way for researchers to potentially increase power while making the most of existing studies. Meta-analysis is a term first coined by Glass in 1976 (1) and refers to the combining of information across multiple studies; however, as early as 1925, Fisher provided a method of meta-analysis (2). As an example of where a meta-analysis would be appropriate, consider a recent study by Edwards et al. (3) that included conducting four primary genome-wide linkage scans searching for QTLs that predispose a person to metabolic syndrome. Metabolic syndrome is characterized by hypertension, dyslipidemia, glucose intolerance, and central obesity. They used trait and genotype information that was collected as part of the 1993 American Diabetes Association’s Genetics of Non-Insulin Dependent Diabetes Mellitus (GENNID) study. The data were independently collected from families in four different ethnic groups: African Americans (n = 252), Caucasian Americans (n = 443), Japanese Americans (n = 123), and Mexican Americans (n = 389). The results of their studies included identifying 12 regions of the genome that were suggestive for linkage to factors of metabolic syndrome; however, these results were not consistent across the different ethnic samples. Also, they found no regions of the genome that were considered significant for linkage to metabolic syndrome.
This may be due to small sample sizes in some of the ethnic groups, especially the Japanese American sample. The inconsistent results across the ethnic groups could also be the result of genetic heterogeneity. A meta-analysis, using a method that could also test for significant heterogeneity, would be appropriate to identify genomic regions that are suggestive and/or significant for linkage by combining the four primary studies. There are several factors that must be considered in order for a researcher to successfully conduct a meta-analysis. Some of these factors include the following: choosing which studies to include/exclude from the analysis, which method to use to conduct the analysis, what effect could be combined across multiple studies, whether or not the primary studies are relatively homogeneous, and the possible presence of publication bias (4, 5). Several articles have been written that detail the steps that a researcher should follow in conducting a meta-analysis (6, 7).
The lack of homogeneity among primary studies adds another level of difficulty to the quest for identifying QTLs underlying complex diseases. Among different populations, heterogeneity could be due to different genes, different alleles, different environments, and possibly different interactions. Additionally, differences in study designs, marker panels, and phenotypic definitions can contribute to heterogeneity between studies. In this chapter we explore some of the methods that are currently being used by geneticists to combine information across multiple genome-wide linkage studies. Although the context of these methods is genome-wide linkage scans, the methods presented here are not limited to linkage studies. Their principles can also be applied to genome-wide association scans, scans that test for both linkage and association, and studies that assess some characteristic of location across the genome, e.g., gene expression studies. There are two main types of meta-analyses: (1) those that yield a measure of significance, such as Fisher’s p-value method along with its extensions/modifications and the genome search meta-analysis (GSMA) method, and (2) those that yield a measure of a common effect size and the corresponding standard error, such as model-based methods and Bayesian methods. Some of these methods allow for the assessment of heterogeneity. After discussing the aforementioned methods, this chapter will conclude with a recommendation for usage.
2. Methods

2.1. Combining Raw Data
Pooling the original data from multiple studies could be extremely beneficial. If there was minimal heterogeneity between the studies being combined, then the increased sample size would yield greater statistical power to detect small effect sizes. Pooling the raw data requires extensive collaboration between multiple centers and/or researchers. This may or may not be feasible due to budget constraints and geographic considerations. McQueen et al. (8) successfully pooled raw data from 11 prior genome-wide linkage scans for bipolar disorder. In combining the data, they attempted to control for possible heterogeneity by using a common marker map that incorporated each individual marker map from the 11 studies and by using a uniform classification of bipolar disorder. They were successful in identifying two regions of the genome, on chromosomes 6q and 8q, with significant linkage (9) and two regions, on chromosomes 9p and 20p, with suggestive linkage to bipolar disorder. McQueen et al. (8) also tested for heterogeneity using the Q-statistic as suggested by Laird and Mosteller (7), which is discussed in Section 2.4.2.
There was no significant heterogeneity at the identified linkage regions. Greenwood et al. (10) combined data across 13 field centers associated with the Family Blood Pressure Program. The combined data included over 4000 affected sibpairs and encompassed five ethnic populations. Common markers, phenotypic definitions, and genotype measurements were used at all 13 centers. In addition to the combined analysis, each field center–ethnic group combination was separately analyzed for linkage to hypertension. They also analyzed data for combined ethnic groups across field centers. Comparing linkage results from the subgroup analyses with the linkage results from the pooled analysis allowed the researchers to look for consistency across the analyses. Carter et al. (11) pooled genotype data from three studies, conducted by the International Genetics of Ankylosing Spondylitis Consortium, investigating linkage to ankylosing spondylitis. Highly significant linkage was detected on chromosome 6 in the region of the major histocompatibility complex (MHC), which was a replication of several previous studies.

2.2. Ad Hoc Comparison and Vote Counting Methods
The ad hoc comparison method is probably the simplest method to use when combining/comparing linkage results across multiple studies. The basic premise of this method is to identify areas of the genome across multiple studies that have been shown to be suggestive of linkage. This might be accomplished by setting a maximum p-value threshold or a minimum logarithm of the odds (LOD) score to identify the regions for comparison. The LOD score is defined as the base 10 logarithm of the likelihood ratio. The next step in carrying out this method is to compare across the multiple studies to identify areas of the genome where the suggestive linkage threshold has been met in some or all of the studies. This method was used by Johanneson et al. (12) in an analysis that compared genome scans in multicase families sampled from three different ethnic populations. Their objective was to identify regions of the genome that might be linked to systemic lupus erythematosus. Using three independent samples from Iceland, Sweden, and Mexico and three studies from the United States that included families of mixed ethnicities, regions of putative linkage, defined as having LOD ≥ 1, were identified in each study. These regions were subsequently compared across the studies resulting in the identification of five overlapping regions. Combining the Icelandic and Swedish studies and stratifying families for the presence of a C4AQ0 allele, they identified a significant linkage region with a LOD score above 4, which they named hSLE1 (12). In Caucasians, the C4AQ0 allele had previously been shown to be associated with systemic lupus erythematosus (13). It is important to note that this region was not significant in the studies from
the United States. The authors state that this may be a result of genetic heterogeneity due to the mixed ethnicities in the United States studies; that is, using more families from a mixed population may increase genetic heterogeneity and therefore reduce power. Similar to the ad hoc method, vote counting (14) is another method for combining information from multiple studies. Vote counting methods sort the studies based on the results of the hypothesis tests to determine whether more studies “vote” for a positive treatment effect, a negative treatment effect, or no treatment effect. That is, vote counting determines the number of studies that have exceeded a common critical value and hence rejected the null hypothesis. The category with the most counts is taken as the overall conclusion. In genome-wide studies this process is carried out for each genomic region, so that the presence or absence of a treatment effect is determined by which category (positive effect, negative effect, or no effect) has the most votes. Hedges and Olkin (15) provide some statistical methods for analyzing vote counting data when the studies provide estimates of effect magnitude parameters.
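The per-region vote-counting tally can be sketched as follows (a minimal illustration with a hypothetical function name and critical value, applied to one genomic region):

```python
from collections import Counter

def vote_count(z_scores, critical=1.96):
    """Classify each study's test statistic for one region and take the modal vote.

    A study votes 'positive' or 'negative' if its z-score exceeds the critical
    value in either direction, and 'none' otherwise.
    """
    votes = Counter()
    for z in z_scores:
        if z > critical:
            votes["positive"] += 1
        elif z < -critical:
            votes["negative"] += 1
        else:
            votes["none"] += 1
    return votes.most_common(1)[0][0]
```

Repeating this over every genomic region yields the region-by-region conclusions described in the text.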
2.3. Fisher’s p-Value Method and Extensions
Sir Ronald A. Fisher (2) provided a method to combine multiple statistical studies by combining p-values. For k tests, the test statistic for testing the aggregate hypothesis is computed from the product of the k p-values: under the null hypothesis, the quantity −2 ln(∏_{i=1}^{k} p_i) = −2 Σ_{i=1}^{k} ln p_i is approximately distributed as a chi-square with 2k degrees of freedom. For genome-wide linkage studies this would entail computing the product of the k p-values at each genomic region tested, necessitating the use of identical marker maps between studies or that information at regions not tested be imputed. Several extensions/modifications of Fisher’s method have been proposed (16–21). Edgington (16) proposed using the sum of the k p-values. This test statistic has an approximate normal distribution with a mean of k/2 and a variance of k/12. Laird and Mosteller (7) point out that this test works best for k ≥ 3. Finally, one could also determine for each individual p-value the corresponding standard normal variate z_i = Φ^{−1}(p_i), and then compute the test statistic as the sum of the k z_i's (22). This test statistic is normally distributed with a mean of zero and a variance of k. Hence, the ratio of the sum of the z_i's to the square root of k follows a standard normal distribution. The test statistic can then be compared with the appropriate two-tailed critical value from the standard normal distribution. This test statistic was first proposed by Stouffer et al. (22).
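Both combination rules can be sketched with the Python standard library alone: for an even number of degrees of freedom 2k, the chi-square survival function has a closed form, and the Stouffer statistic needs only the normal quantile and CDF. This is an illustrative sketch, not the software used in the studies cited in this chapter:

```python
import math
from statistics import NormalDist

def fisher_combined_p(pvalues):
    """Fisher's method: X = -2 * sum(ln p_i) ~ chi-square with 2k df under H0.

    For df = 2k the chi-square survival function is
    exp(-x/2) * sum_{j<k} (x/2)^j / j!, so no external library is needed.
    """
    k = len(pvalues)
    x = -2.0 * sum(math.log(p) for p in pvalues)
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(k))

def stouffer_combined_p(pvalues):
    """Stouffer's method: Z = sum(z_i)/sqrt(k), compared two-tailed to N(0, 1)."""
    nd = NormalDist()
    k = len(pvalues)
    z = sum(nd.inv_cdf(p) for p in pvalues) / math.sqrt(k)
    return 2.0 * (1.0 - nd.cdf(abs(z)))
```

For example, three moderately small p-values such as 0.01, 0.02, and 0.03 combine under Fisher's method to roughly 5e-4, far smaller than any individual p-value.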
Iyengar et al. (17) recently used Fisher’s p-value method to conduct a meta-analysis of 11 centers to investigate linkage to nephropathy and albuminuria in multiethnic populations. They used a single protocol to collect the data at each of the 11 centers. Individual analyses, using the Haseman–Elston regression method (23), were conducted on four separate ethnic groups. Results were then combined across the four ethnic groups using Fisher’s p-value method. As a result of their meta-analysis, they detected evidence for linkage on chromosomes 7q21.3, 10p15.3, 14q23.1, and 18q22.3, the first three of which were regions of linkage to diabetic nephropathy previously identified by other studies (17). Additionally, they detected evidence of linkage to albuminuria on 2q14.1, 7q21.1, and 15q26.3. To correct the bias of Fisher’s p-value method when applied to nonparametric linkage studies, Province (18) recommended assigning a p-value of 1/(2 ln 2) ≈ 0.72 to every instance where LOD = 0. This correction factor eliminates almost all of the bias that arises because Fisher’s p-value method is derived for two-tail hypothesis testing whereas nonparametric linkage testing is a one-tail procedure. This correction factor was utilized in two meta-analyses involving data from the National Heart, Lung and Blood Institute’s Family Blood Pressure Program. Province et al. (24) conducted a meta-analysis combining four studies to search for hypertension and blood pressure genes. No significant evidence for linkage was found, but there were several suggestive peaks. One peak in particular, on chromosome 2, was previously seen in the individual studies and two additional studies (25, 26). In the second meta-analysis using the bias correction factor, Rice et al. (27) identified two regions with enhanced linkage for hypertension after applying this method every 1 cM throughout the genome for the combined analysis of three studies from African-American and Nigerian families.
One region, which was located on chromosome 2p (meta-LOD = 2.91), was a replication of previous findings; however, the other region, located on chromosome 7p (meta-LOD = 3.54), was a novel finding. Badner and Gershon (28) propose combining the individual p-values after correcting each for the size of the region containing a pre-specified minimum p-value. They applied this method to meta-analyses of whole genome scans for autism, bipolar disorder, and schizophrenia (19, 28) and found evidence for an autism susceptibility locus at 7q, bipolar susceptibility loci at 13q and 22q, and a schizophrenia locus at 22q. Zaykin et al. (20) proposed using the product of the p-values that do not exceed a pre-specified threshold, such as α = 0.05, as a test statistic when combining multiple studies. This is commonly referred to as the truncated product method (TPM). The hypotheses tested via TPM are slightly different from those
being tested via Fisher’s method and its extensions, in that TPM can address the question of whether any of the tests below the pre-specified threshold were significant. Specifically, for other methods of combining studies, the alternative hypothesis is that linkage exists for at least one of the studies being combined, whereas for the TPM method, the alternative hypothesis is that linkage exists for at least one of the studies that were significant at the pre-specified threshold. A C++ program for implementing the truncated product method is available at ftp://statgen.ncsu.edu/pub/zaykin/tpm. Loesgen et al. (21) have proposed several weighting methods for computing a weighted Z-score to combine studies. These weighting methods can incorporate sample size differences, marker set differences, and differences in marker informativeness between studies. While they were able to demonstrate via simulation that incorporating weights yields higher statistical power to detect linkage, clear guidance is not given on which weighting method is recommended. In addition to pooling data from the three International Genetics of Ankylosing Spondylitis Consortium studies, Carter et al. (11) also conducted a weighted Z-score meta-analysis to investigate linkage for ankylosing spondylitis. They incorporated weights to allow for study size differences and information content. In addition to the previous results on chromosome 6, in the region of the MHC, the weighted Z-score analysis yielded four highly significant markers on chromosome 6 that were not previously replicated across the studies.

2.4. Genome Search Meta-analysis (GSMA) Method

2.4.1. GSMA
The genome search meta-analysis (GSMA) method developed by Wise et al. (29) is a nonparametric method for combining across multiple genome scans. Each individual genome scan is divided into equal-sized bins. For each bin in an individual study, the most significant result is identified. The bins are then ranked, with the lowest rank assigned to the least significant bin and the highest rank to the most significant bin. Ties are assigned equal ranks, and if a mixture of parametric and nonparametric studies is being analyzed, then LOD = 0 is assigned to every marker having LOD ≤ 0. The same bins must be used in each study, and the ranks of the bins are then summed across the multiple studies. The average bin rank or the total sum of the ranks may be reported. Permutation methods can be used to determine the significance of the summed or averaged ranks for each bin. The null hypothesis for each bin is that no susceptibility loci exist within the bin and the rank is therefore randomly distributed over bins. For a given bin, let X_i, i = 1, . . ., m, represent the rank assigned for the ith of m studies, each having n bins. For each bin, under the null hypothesis, the probability that the sum
of ranks of the bins across the studies equals a specified value R is given by:
P( Σ_{i=1}^{m} X_i = R ) =
  { 0                                                              for R < m,
  { (1/n^m) Σ_{k=0}^{d} (−1)^k C(m, k) C(R − kn − 1, m − 1)        for m ≤ R ≤ mn,
  { 0                                                              for R > mn,

where C(a, b) denotes the binomial coefficient "a choose b",
and d is the integer part of (R − m)/n. Interested readers can find the derivation of this distribution in the article by Wise et al. (29). The GSMA relies on two major distributional assumptions. First, for each individual study, the maximum LOD scores or lowest p-values identified within each bin are assumed to be independent and identically distributed, which is approximately achieved if the markers are evenly spaced and of equal informativeness and if the width of the bins is adequate to compensate for correlation between nearby markers (29). For genome-wide scans, if each bin contains at least one marker, then this assumption is well approximated (29). Second, each study makes an equal contribution to the GSMA; however, studies may be weighted to account for factors such as differing sample sizes. In light of these assumptions, candidate gene studies and second-stage analyses of two-stage genome scans should not be included in the studies being combined using the GSMA method (29). Doing so violates the assumption of equally spaced markers. Also, researchers are cautioned that GSMA loses power if the entire genome is not used (30). The bin size must be selected prior to conducting the GSMA. Bins should be large enough to counteract correlation between nearby markers and should ensure that the smallest chromosome contains at least two bins (31). Also, the bin width should be appropriate for all chromosomes; thus the minimum recommended bin width is 20 cM. To accommodate these suggestions in human genome-wide studies, a bin size of 30 cM is recommended and frequently used (31). GSMA has been used in several recent meta-analyses of complex diseases. Wise et al. (31) combined four genome-wide searches for multiple sclerosis from four different countries. Six significant regions were found, with the strongest evidence on chromosomes 6p and 19q. Both of these regions were identified in all four individual scans.
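The null distribution given above is that of a sum of m independent ranks, each uniform on 1, . . . , n, and can be transcribed directly. This is a sketch for checking individual bins, not the GSMA software (which reports p-values obtained by simulation):

```python
from math import comb

def rank_sum_pmf(R, m, n):
    """P(sum of m independent uniform ranks on 1..n equals R): the GSMA null pmf."""
    if R < m or R > m * n:
        return 0.0
    d = (R - m) // n  # integer part of (R - m)/n
    total = sum((-1) ** k * comb(m, k) * comb(R - k * n - 1, m - 1)
                for k in range(d + 1))
    return total / n ** m

def rank_sum_pvalue(R_obs, m, n):
    """P(rank sum >= R_obs) under the null, for assessing one bin's summed rank."""
    return sum(rank_sum_pmf(r, m, n) for r in range(R_obs, m * n + 1))
```

A familiar sanity check: with m = 2 studies and n = 6 bins, the rank sum behaves like the total of two dice, so rank_sum_pmf(7, 2, 6) equals 1/6.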
The remaining regions on 2p, 5p, 14q, and 17q were not consistent across all four studies and some regions that were identified by individual scans were not
Methods for Combining Multiple Genome-Wide Linkage Studies
replicated. This may be due to heterogeneity, since each individual scan was performed on a different ethnic population. GSMA was used to conduct a meta-analysis of four previous genome searches for susceptibility loci for rheumatoid arthritis, resulting in confirmation of previous evidence of HLA loci as the greatest susceptibility factor for rheumatoid arthritis and evidence of linkage on chromosomes 1, 6, 8, 12, 16, and 18 (32). Thirty-seven genome-wide linkage studies for body mass index (BMI) and obesity (defined by BMI) were combined in a meta-analysis using the GSMA method (33). The researchers conducted both weighted (study size) and unweighted analyses, neither of which found conclusive evidence for linkage to BMI or obesity.

Software to conduct a meta-analysis using the GSMA method is available from http://www.kcl.ac.uk/schools/medicine/depts/memoge/research/epidemiology/gsma/. This program allows for weighted and unweighted analyses. Output includes the summed rank per bin and p-values obtained by simulation. A test data set is also provided.

2.4.2. Heterogeneity Testing Within GSMA
A major disadvantage of the original GSMA method is that it does not recognize or allow for genetic heterogeneity that is likely to exist between studies. It has been suggested that, for each chromosomal bin, the amount of variability of the ranks across the studies included in a GSMA can be used as a measure of heterogeneity (34). Testing for heterogeneity is done by computing one of three statistics (Q, B, and Ha) which measure the variability in the ranks. For any single bin, let X_i represent the rank for the ith study, X̄ the mean of the ranks across the m studies for that bin, and w_i the weighting factor for the ith study. If there is no weighting, then w_i = 1 for all studies. The Q statistic, which is a generalization of Cochran's Q-statistic (35), is the weighted sum of the squared deviations of the individual ranks from the average rank for that bin,

Q = Σ_{i=1}^{m} w_i (X_i − X̄)².

The Ha statistic uses the absolute deviations from the mean,

Ha = Σ_{i=1}^{m} w_i |X_i − X̄|,

and the B statistic uses all pairwise absolute differences between the studies,

B = Σ_{i,j=1; i≠j}^{m} w_i w_j |X_i − X_j|.
When testing for the presence of heterogeneity among the studies, researchers are interested in which bins have significantly high heterogeneity as well as which bins have significantly low heterogeneity. High heterogeneity is seen as evidence of differences among the studies, and low heterogeneity is seen as evidence of consistency among the studies. Statistical significance can be determined via Monte Carlo methods. This is done by
Kippola and Santorico
randomly permuting the ranks of the bins within each study, calculating the desired statistic, and repeating this process a large number of times. The level of significance for high heterogeneity is the proportion of simulated statistics that exceed the observed statistic, and the level of significance for low heterogeneity is the proportion of simulated statistics that fall below the observed statistic.

It has been shown that for both weighted and unweighted analyses, concordance is high between all three heterogeneity metrics (34). Researchers are cautioned that some of the metrics for measuring heterogeneity may also depend on the average rank of each bin (34). For bins with significantly low heterogeneity, it is therefore recommended that one also test whether bins of similar average rank have significantly low heterogeneity. This is commonly referred to as the rank-restricted approach to heterogeneity testing and may also be accomplished via a Monte Carlo test.

Testing for heterogeneity while conducting a genome search meta-analysis allows the researcher to make more informed conclusions. Areas of the genome found to have significant linkage via GSMA and also significantly low heterogeneity are indicative of consistent results across the individual studies. This suggests that there is more support for evidence of linkage for the highly significant bins that also have significantly low heterogeneity (34). However, a simulation study has shown that the power to detect high and low heterogeneity is quite low (36). The same study also showed that the rank-unrestricted test is conservative for high heterogeneity and the rank-restricted test is liberal for low heterogeneity. It is recommended that the rank-restricted test always be used; however, this test requires extensive computing (36). Several meta-analyses for various complex diseases have tested for heterogeneity within GSMA.
These include meta-analyses for rheumatoid arthritis (34), schizophrenia (34), quantitative lipid traits (37), bone mineral density (38), age-related macular degeneration (39), myocardial infarction (40), and autism and autism-spectrum disorder (41). Of special interest are the meta-analyses for autism and autism-spectrum disorder: in these analyses, different results were obtained when heterogeneity was tested with and without the rank-restricted method. Also, in the myocardial infarction meta-analysis, since all populations were Caucasian, the significant heterogeneity should reflect differences in study design and implementation rather than ethnic differences (40).

HEGESMA (heterogeneity-based genome search meta-analysis) is a comprehensive software package that performs GSMA with and without testing for heterogeneity. The output includes all three heterogeneity statistics with corresponding significance levels and the average rank per bin. This software may be downloaded from http://biomath.med.uth.gr/.
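As a minimal illustration of the rank-based heterogeneity statistics and the Monte Carlo significance procedure described above (this is a sketch, not the HEGESMA implementation; the function names and any example data are invented for this illustration, and unweighted ranks are assumed unless weights are supplied):

```python
import numpy as np

def heterogeneity_stats(ranks, weights=None):
    """Q, Ha, and B heterogeneity statistics for a single bin.

    ranks   : within-study ranks of this bin, one value per study (X_i).
    weights : optional per-study weights w_i; w_i = 1 when unweighted.
    """
    x = np.asarray(ranks, dtype=float)
    w = np.ones_like(x) if weights is None else np.asarray(weights, dtype=float)
    xbar = x.mean()
    q = np.sum(w * (x - xbar) ** 2)        # weighted squared deviations
    ha = np.sum(w * np.abs(x - xbar))      # weighted absolute deviations
    # all pairwise absolute differences; the diagonal is zero, so i != j holds
    diff = np.abs(x[:, None] - x[None, :])
    b = np.sum(w[:, None] * w[None, :] * diff)
    return q, ha, b

def permutation_pvalues(rank_matrix, bin_idx, stat="Q", n_perm=2000, seed=0):
    """Monte Carlo significance for high and low heterogeneity of one bin.

    rank_matrix : (studies x bins) array of within-study bin ranks; the
    bin ranks are permuted independently within each study, the chosen
    statistic is recomputed, and this is repeated n_perm times.
    """
    rng = np.random.default_rng(seed)
    which = {"Q": 0, "Ha": 1, "B": 2}[stat]
    observed = heterogeneity_stats(rank_matrix[:, bin_idx])[which]
    sims = np.empty(n_perm)
    for k in range(n_perm):
        permuted = np.array([rng.permutation(row) for row in rank_matrix])
        sims[k] = heterogeneity_stats(permuted[:, bin_idx])[which]
    p_high = np.mean(sims >= observed)  # evidence of high heterogeneity
    p_low = np.mean(sims <= observed)   # evidence of low heterogeneity
    return p_high, p_low
```

A small `p_high` flags unusually high heterogeneity for the bin, while a small `p_low` flags unusually strong consistency across studies; HEGESMA reports analogous simulation-based significance levels.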
2.5. Model-Based Methods
Model-based methods provide a way of combining multiple studies to determine an estimate of a common effect size and the corresponding standard error. The choice must be made between a fixed effects model and a random effects model. For a fixed effects model, the primary studies are assumed to have come from the same population with a common mean, and the goal is to estimate this common mean effect; the corresponding standard error for the estimated common mean effect is the sampling error. For a random effects model, the primary studies are each assumed to have a different true effect, all of which come from a distribution of effects, and the goal is to estimate the amount of variability in the effects between the primary studies. A random effects model therefore has two sources of variation that must be estimated: the within-study variation and the between-study variation. Using a random effects model as opposed to a fixed effects model accounts for heterogeneity that may exist between the studies (15, 42). Hedges and Olkin (15) provide theoretical details on using fixed effects and random effects models for combining multiple studies.

Gu and Province (42) have detailed a random effects model for combining effects from multiple linkage studies utilizing sibpair analysis methods. They propose using π, the proportion of genes shared identical by descent (IBD) at the marker locus, as the common effect extracted from each study. A random effects model is then used to determine the between-study variation in the proportion of genes shared IBD between the different studies. Weighted least-squares regression is used to estimate an overall effect and the various variance components. As presented by Gu and Province (42), suppose there are n studies to be combined; for the ith study, i = 1, . . ., n, let π_i equal the observed IBD proportion at the marker locus, which can be partitioned into two additive components: τ_i, the actual proportion of genes shared IBD at the marker locus, and the sampling error, ε_i. Note that the variance of ε_i is the sample variance from the ith study, S_i². Since this is a random effects model, τ_i is considered a realization of a random variable from a distribution having mean τ and a variance that is the between-study variance (42). That is, τ_i = τ + δ_i, where δ_i is a realization from a distribution having mean 0 and variance σ_δ². The goal is to estimate σ_δ². If σ_δ² is significantly greater than zero, then there is significant variation among the studies. The estimated between-study variance is computed as follows:

σ̂_δ² = [1/(n − 1)] Σ_{i=1}^{n} (π_i − π̄)² − (1/n) Σ_{i=1}^{n} S_i²,
where π̄ is the unweighted average of the IBD proportions for the n studies. The estimated between-study variance is then used to compute weights as follows:

ŵ_i = [1 / (σ̂_δ² + S_i²)] / Σ_{j=1}^{n} [1 / (σ̂_δ² + S_j²)].
These weights are used to estimate the overall proportion of genes shared IBD at the marker locus,

τ̂ = Σ_{i=1}^{n} ŵ_i π_i.

To test for possible heterogeneity, a test statistic, Q, may be computed as

Q = Σ_{i=1}^{n} (π_i − π̃)² / S_i²,

where

π̃ = [Σ_{i=1}^{n} π_i / S_i²] / [Σ_{j=1}^{n} 1 / S_j²].
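The estimators above can be sketched in a few lines (a hedged illustration, not Gu and Province's software; the function name is invented, and truncating a negative moment estimate of the between-study variance at zero is an added assumption not stated in the text):

```python
import numpy as np

def gu_province_pool(pi_hat, s2):
    """Random-effects pooling of IBD proportions across n linkage studies.

    pi_hat : observed IBD proportions pi_i, one per study.
    s2     : within-study sampling variances S_i^2.
    Returns the between-study variance estimate, the weighted overall
    effect tau_hat, and the heterogeneity statistic Q.
    """
    pi_hat = np.asarray(pi_hat, dtype=float)
    s2 = np.asarray(s2, dtype=float)
    n = len(pi_hat)

    # sigma_delta^2 = (1/(n-1)) sum (pi_i - pibar)^2 - (1/n) sum S_i^2
    sigma_delta2 = np.sum((pi_hat - pi_hat.mean()) ** 2) / (n - 1) - s2.mean()
    sigma_delta2 = max(sigma_delta2, 0.0)  # added assumption: truncate at zero

    # weights proportional to 1/(sigma_delta^2 + S_i^2), normalized to sum to 1
    w = 1.0 / (sigma_delta2 + s2)
    w /= w.sum()
    tau_hat = np.sum(w * pi_hat)  # overall IBD proportion

    # Q uses the inverse-variance weighted mean pi_tilde
    pi_tilde = np.sum(pi_hat / s2) / np.sum(1.0 / s2)
    q = np.sum((pi_hat - pi_tilde) ** 2 / s2)
    return sigma_delta2, tau_hat, q
```

The returned Q would then be compared with a chi-square distribution with (n − 1) degrees of freedom under the null hypothesis of no heterogeneity.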
Under the null hypothesis of no heterogeneity, Q is approximately distributed as a chi-square random variable with (n–1) degrees of freedom. Gu and Province point out that determining whether or not significant heterogeneity exists does not tell the researcher whether or not to pool the studies (42). In addition to detailing the above method based on weighted least squares, Gu and Province provide specific guidance on how π, the proportion of genes shared IBD at the marker locus, may be extracted from primary studies based on the Haseman–Elston regression method (23), the Risch–Zhang extremely discordant sibpair method (43), and the affected sibpair method (44). Heijmans et al. (45) used a random effects model to conduct a meta-analysis of four genome-wide linkage scans to identify QTLs that may affect plasma lipid levels. Since the primary studies were twin studies from Sweden, the Netherlands, and Australia, they used a random effects model to acknowledge possible heterogeneity. Suggestive linkage was determined for several locations, six of which were replications of previous linkage results. Rice et al. (46) developed a logistic regression method for analyzing sibpair linkage data where explanatory variables and covariates can be used to model how many alleles are shared IBD at a given marker. Waldman and Robinson (47) used this method to combine information from four studies that were part of the Genetics Analysis Workshop 12 data. In addition to using information about the transmitting parent and sex of the siblings as both covariates and effect modifiers, they also incorporated a
covariate representing each of the different studies. Their analysis yielded very little evidence for linkage. Heijmans et al. (45) used R (http://www.r-project.org) to complete their analysis. Interested readers may obtain a copy of the program by contacting the authors.

2.6. Bayesian Methods
While combining information from multiple studies was not their primary intention, Biswas and Lin (48) have developed an empirical Bayesian method that does allow researchers to combine studies. They begin by using a method, first suggested by Smith (49), that incorporates a mixing parameter, usually called α, into the likelihood equation. The mixing parameter represents the proportion of families in a study that may be linked. Details for this linkage method are thoroughly discussed by Ott (50). Biswas and Lin suggest that the incorporation of covariates into the above model can improve power if the chosen covariates can actually distinguish between linked and unlinked families. Suppose that the data can be grouped based on the values of a covariate into G (>1) groups, each having size n_i, i = 1, . . ., G. Letting x_ij, i = 1, . . ., G; j = 1, . . ., n_i, represent the observed genotype data for the ijth family, then, for a specified location in the genome, d, the likelihood of the data may be written as

L(α, d | x) = Π_{i=1}^{G} Π_{j=1}^{n_i} {α_ij L_ij(d | x_ij) + (1 − α_ij) L_ij(∞ | x_ij)},
where α = {α_ij, i = 1, . . ., G; j = 1, . . ., n_i}, x = {x_ij, i = 1, . . ., G; j = 1, . . ., n_i}, and L_ij is the standard homogeneity likelihood (49) of the ijth family at position d. In this context, ∞ is used to represent that the disease gene is not on the chromosome under study (50). As this is a Bayesian method, a prior distribution, Beta(t_i a_i, t_i(n_i − a_i)), is specified for the α_ij's. According to Biswas and Lin, the prior distribution is chosen so as to model the dependence of the α_ij's on the covariates. The t_i's are tuning parameters, included for computational efficiency to ensure that the Beta parameters do not become too large, and the a_i's are nuisance parameters. Biswas and Lin set t_i = 0.1 for all of their analyses. The details for this method are beyond the scope of this chapter and may be found elsewhere (48, 51); in what follows, a very brief outline of the method is presented.

The first step is to estimate the nuisance parameters, the a_i's. This is accomplished using a stochastic Expectation-Maximization (SEM) algorithm (52). After framing the problem as a missing data problem, which is an empirical Bayesian technique, the SEM algorithm iterates between estimating the a_i's and maximizing the likelihood. After estimates of
the a_i's, denoted â_i, are obtained, the second step is to build a hierarchical Bayes model in which the â_i's are incorporated into the Beta prior for the mixing parameters, the α_ij's. The third step is to determine the posterior distribution. This is accomplished using a reversible jump Markov Chain Monte Carlo algorithm. The posterior distribution is summarized by two values, p̂ and m̂, the posterior probability of linkage and the location of the disease gene under linkage, respectively. Finally, an estimated Bayes factor, BF, is computed. Linkage is concluded if BF exceeds a pre-specified threshold, BF_0; BF_0 = 25 corresponds to a strong linkage signal (53). Credible sets (the Bayesian version of confidence intervals) may be calculated for the posterior probability of linkage.

Incorporating a covariate to model study differences allows a researcher to use this method to combine studies. Biswas and Lin accomplished this using a covariate for ethnicity to combine asthma data from three different populations (Caucasians, African-Americans, and Germans) from the Genetic Analysis Workshop 12 data. Using a procedure that they developed, analogous to model building, they determined that a model with two groups (combining the Caucasian and German samples) was better than a model with one or three groups. After conducting their analysis, they obtained a location with extremely strong linkage (BF = 196).
3. Discussion

Perhaps the most important steps involved in conducting a meta-analysis occur before the actual analysis. Choosing which studies to include in or exclude from a meta-analysis is no trivial matter. Several authors have addressed issues related to this and provide guidelines for a researcher to follow when undertaking a meta-analysis (6, 7). Additionally, publication bias, which refers to the practice of publishing significant results over nonsignificant results, is an issue that should be addressed. There are several methods that may be used to detect the presence of publication bias (5), but there is no general consensus on what to do if it is determined that publication bias exists for the collection of primary studies being considered for selection in a meta-analysis.

Assuming that the appropriate preliminary steps have been taken and the primary studies to be included in a meta-analysis have been chosen, the next step is to determine which method is most appropriate for the analysis. With the exception of pooling the raw data, all of the other methods discussed in this chapter require summary information from each of the primary studies. Thus, the choice of method is somewhat dictated
by the information that may be extracted from each of the primary studies, and may also be determined by time and financial considerations. Regardless of the method chosen, it is important to remember that not accounting for genetic heterogeneity between, and also within, studies will reduce the power to detect linkage (29, 37, 54). Power to detect linkage may also be reduced when the individual studies use different sets of genetic markers and have different sample sizes (21), and it may be more difficult to detect linkage if each individual study is not reasonably homogeneous (12). See Table 21.1 for a summary of the discussed methods.

If there is little time or money available, then the ad hoc comparative analysis method or a vote counting procedure is quick and easy to implement. An advantage is that the primary studies may use different markers and analysis methods. There are, however, several disadvantages. Limitations of vote counting procedures include the requirement of a large number of studies, owing to the applied methodology being based on asymptotic distributions; the assumption of equal sample sizes; and the inability of the method to handle the case in which all (or none) of the studies exceed the critical value (15, 55). Hedges and Olkin state unequivocally that "conventional vote counting procedures are inherently flawed and likely to be misleading" (15, 55). Also, vote counting methods can have decreased power to detect small to moderate effect sizes as the number of included studies increases (15, 55); in other words, including more studies increases the probability of reaching a wrong conclusion.

If the researcher is able to extract p-values or LOD scores from each of the primary studies, then using Fisher's p-value method or one of the extensions may be appropriate.
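For example, Fisher's method combines k independent p-values through X² = −2 Σ ln p_i, which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. A minimal standard-library sketch (the function name is invented; in practice one might call, e.g., scipy.stats.chi2.sf), using the closed-form chi-square survival function available for even degrees of freedom:

```python
import math

def fisher_combined(pvalues):
    """Combined p-value via Fisher's method.

    X^2 = -2 * sum(ln p_i) ~ chi-square with 2k df under the null.
    """
    k = len(pvalues)
    x2 = -2.0 * sum(math.log(p) for p in pvalues)
    # survival function of chi-square with 2k df; for even df = 2k this has
    # the closed form exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    half = x2 / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(k))
```

Note the behavior criticized in the text: a single large p-value inflates the sum of logs only weakly, so one unconvincing study can dilute several strong ones, which is the deficiency the truncated product method (20) is designed to address.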
According to Laird and Mosteller (7), an advantage of Fisher's p-value method is that the test has good power for densities under the research hypothesis where the majority of values are near zero. However, they also point out several disadvantages, including that the test cannot tell us which of the k centers involved in the meta-analysis has a greater effect; when using this method, we cannot determine the size of the difference between studies, only that a difference among the k studies exists (7). Fisher's p-value method is inherently biased when applied to significance results from nonparametric (model-free) linkage studies (18). Additionally, Fisher's p-value method and its extensions cannot measure between-study heterogeneity. It is also disconcerting that one large p-value can overwhelm several small ones, resulting in decreased power (20). The truncated product method attempts to overcome this deficiency and actually has increased power when subsets of the combined hypotheses are false (20).

The GSMA method provides several advantages. The method allows for different sampling schemes, different statistical
Table 21.1 Summary of methods to combine genome-wide studies

Ad hoc and vote counting (Section 2.2)
  Data required: p-values, Z-scores, or LOD scores for each location in the genome to be tested
  Pros: • Quick and easy to implement • Allows for different marker sets and analysis methods
  Cons: • Require a large number of studies • Vote counting is ". . . inherently flawed and likely to be misleading" (15, 55) • Decreased power to detect small-to-moderate effect sizes as the number of included studies increases (15, 55)

Fisher's p-value method and extensions (Section 2.3)
  Data required: p-values, Z-scores, or LOD scores for each location in the genome to be tested
  Pros: • Allows for different marker sets and analysis methods • Power may be increased by weighting
  Cons: • Does not use all genetic information • Only a test of significance, i.e., does not yield an estimate of genetic effect • Does not allow for assessment of heterogeneity

GSMA (Section 2.4)
  Data required: p-values, Z-scores, or LOD scores for each location in the genome to be tested
  Pros: • Allows for different sampling schemes and marker analyses • Method extends to assess heterogeneity (HEGESMA) • Robust across bin definitions and methods of data acquisition • Power may be increased by weighting
  Cons: • Only a test of significance, i.e., does not yield an estimate of genetic effect

Model-based Methods (Section 2.5)
  Data required: Common effect estimates and corresponding standard errors extracted from each study for each location in the genome to be tested, or raw data
  Pros: • Provides estimates of effect sizes along with a measure of standard error • Heterogeneity may be accounted for by using a random effects model • Power may be increased by weighting
  Cons: • Requires a common effect to be extracted from each primary study • Requires a common marker map

Bayesian Methods (Section 2.6)
  Data required: Original genotype and phenotype data
  Pros: • Yields credible sets for posterior probability of linkage and location of disease gene under linkage • Heterogeneity may be accounted for by incorporating covariates
  Cons: • Computationally intensive • Requires specification of a prior distribution
analysis methods (e.g., LOD scores, Z-scores, p-values), and different marker analysis methods (e.g., single-point and multi-point marker analysis) (31). Studies have shown that GSMA is robust across different bin definitions and methods of data acquisition, such as obtaining data directly from primary investigators or extracting data from published genome-wide graphs or tables of most significant results (30, 31); that is, regardless of the bin size or the method used to obtain the data, the GSMA yields consistent results. It is possible to increase power by weighting the studies to incorporate information regarding different study sample sizes, different numbers of markers, or different numbers of pedigrees (31, 56). The basic idea behind weighting is that studies with larger sample sizes or more pedigrees should contribute more to the overall combined analysis. Weighting schemes are commonly used with GSMA (32–34, 37–40), but again, as with the weighted Z-score approach, there is no general consensus on the optimum weighting scheme.

The GSMA method does not detect effects that are present in only a subset of the studies and does not provide an estimate of effect size; rather, GSMA simply tells the researcher which bin(s) may contain significant loci. Another major disadvantage of this method is that the individual studies may or may not be testing the same hypotheses, and it is possible that different phenotypic definitions may have been utilized across the studies. As with Fisher's p-value method and extensions, GSMA does not use all available genetic information and does not provide any parameter estimates. If the researcher suspects that heterogeneity may be an issue between the primary studies, then the HEGESMA method would be appropriate.

If, upon selection of the primary studies to be included in a meta-analysis, a researcher has identified a common effect that can be extracted from each of the studies, then a model-based method would be appropriate.
A major advantage of model-based methods is that they provide estimates of effect sizes along with a measure of the standard error. Another advantage is that heterogeneity may be accounted for by using a random effects model as opposed to a fixed effects model. A disadvantage of model-based methods is that they require estimates of a common effect, along with corresponding standard errors, from each of the primary studies.

Finally, if in addition to being able to extract a common effect from the primary studies a researcher also has adequate computational resources, then using a Bayesian approach would be appropriate. Certainly an advantage of the Bayesian method is that it yields credible sets for the posterior probability of linkage and the location of the disease gene under linkage. Another advantage is that heterogeneity may be accounted for by introducing appropriate covariates into the model. In addition to being computationally intensive, a disadvantage of the Bayesian
method is that it requires the specification of a prior distribution, which may be subjective. As is likely with most other methods, this Bayesian method loses power to detect a QTL if the QTL is in epistasis with another QTL (48).

Combining information from previous studies can be an efficient use of resources, but care needs to be taken to select studies appropriately and to account for heterogeneity. The choice of method for analysis depends on the available data and resources and the goal of the study (i.e., testing or estimation).

References

1. Glass, G. (1976) Primary, secondary and meta-analysis of research, Educational Researcher 5, 3–8.
2. Fisher, R. (1925) Statistical Methods for Research Workers, 13th ed., Oliver & Boyd, London.
3. Edwards, K., Hutter, C., Wan, J., Kim, H., and Monks, S. (2008) Genome-wide linkage scan for metabolic syndrome: the GENNID study, Obesity (Silver Spring) 16, 1596–1601.
4. Rosenthal, R. (1979) The file drawer problem and tolerance for null results, Psychological Bulletin 86, 638–641.
5. Munafo, M. R., Clark, T. G., and Flint, J. (2004) Assessing publication bias in genetic association studies: evidence from a recent meta-analysis, Psychiatry Research 129, 39–44.
6. Attia, J., Thakkinstian, A., and D'Este, C. (2003) Meta-analyses of molecular association studies: methodologic lessons for genetic epidemiology, Journal of Clinical Epidemiology 56, 297–303.
7. Laird, N. M., and Mosteller, F. (1990) Some statistical methods for combining experimental results, International Journal of Technology Assessment in Health Care 6, 5–30.
8. McQueen, M. B., Devlin, B., Faraone, S. V., Nimgaonkar, V. L., Sklar, P., Smoller, J. W., Jamra, R. A., Albus, M., Bacanu, S. A., Baron, M., Barrett, T. B., Berrettini, W., Blacker, D., Byerley, W., Cichon, S., Coryell, W., Craddock, N., Daly, M. J., DePaulo, J. R., Edenberg, H. J., Foroud, T., Gill, M., Gilliam, T. C., Hamshere, M., Jones, I., Jones, L., Juo, S. H., Kelsoe, J. R., Lambert, D., Lange, C., Lerer, B., Liu, J. J., Maier, W., MacKinnon, J. D., McInnis, M. G., McMahon, F. J., Murphy, D. L., Nothen, M. M., Nurnberger, J. I., Pato, C. N., Pato, M. T., Potash, J. B., Propping, P., Pulver, A. E., Rice, J. P., Rietschel, M., Scheftner, W., Schumacher, J., Segurado, R., Van Steen, K., Xie, W. T., Zandi, P. P., and Laird, N. M. (2005) Combined analysis from eleven linkage studies of bipolar disorder provides strong evidence of susceptibility loci on chromosomes 6q and 8q, American Journal of Human Genetics 77, 582–595.
9. Lander, E., and Kruglyak, L. (1995) Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results, Nature Genetics 11, 241–247.
10. Greenwood, T. A., Libiger, O., Kardia, S., Hanis, C., Morrison, A. C., Gu, C. C., Rice, T., Miller, M., Turner, S. T., Myers, R. H., Grove, J., Hsiao, C.-F., Weder, A. B., and Schork, N. J. (2007) Comprehensive linkage and linkage heterogeneity analysis of 4344 sibling pairs affected with hypertension from the Family Blood Pressure Program, Genetic Epidemiology 31, 195–210.
11. Carter, K. W., Pluzhnikov, A., Timms, A. E., Miceli-Richard, C., Bourgain, C., Wordsworth, B. P., Jean-Pierre, H., Cox, N. J., Palmer, L. J., Breban, M., Reveille, J. D., and Brown, M. A. (2007) Combined analysis of three whole genome linkage scans for ankylosing spondylitis, Rheumatology 46, 763–771.
12. Johanneson, N., Steinsson, K., Lindqvist, A. K., Kristjansdottir, H., Grondal, G., Sandino, S., Tjernstrom, F., Sturfelt, G., Granados-Arriola, J., Alcocer-Varela, J., Lundberg, I., Jonasson, I., Truedsson, L., Svenungssons, E., Klareskog, L., Alarcon-Segovia, D., Gyllensten, U. B., and Alarcon-Riquelme, M. E. (1999) A comparison of genome-scans performed in multicase families with systemic lupus erythematosus from different population groups, Journal of Autoimmunity 13, 137–141.
13. Hartung, K., Fontana, A., Klar, M., Krippner, H., Jorgens, K., Lang, B., Peter, H. H., Pichler, W. J., Schendel, D., Robin-Winn, M., and Deicher, H. (1989) Associations of class I, II, and III MHC-gene products with systemic lupus erythematosus: results of a central European multicenter study, Rheumatology International 9, 13–18.
14. Light, R. J., and Smith, P. V. (1971) Accumulating evidence: procedures for resolving contradictions among different research studies, Harvard Educational Review 41, 429–471.
15. Hedges, L. V., and Olkin, I. (1985) Statistical Methods for Meta-Analysis, Academic Press, New York.
16. Edgington, E. (1972) An additive model for combining probability values from independent experiments, Journal of Psychology 80, 351–363.
17. Iyengar, S. K., Abboud, H. E., Goddard, K. A. B., Saad, M. F., Adler, S. G., Arar, N. H., Bowden, D. W., Duggirala, R., Elston, R. C., Hanson, R. L., Ipp, E., Kao, W. H. L., Kimmel, P. L., Klag, M. J., Knowler, W. C., Meoni, L. A., Nelson, R. G., Nicholas, S. B., Pahl, M. V., Parekh, R. S., Quade, S. R. E., Rich, S. S., Rotter, J. I., Scavini, M., Schelling, J. R., Sedor, J. R., Sehgal, A. R., Shah, V. O., Smith, M. W., Taylor, K. D., Winkler, C. A., Zager, P. G., and Freedman, B. I., on behalf of the Family Investigation of Nephropathy and Diabetes Research Group (2007) Genome-wide scans for diabetic nephropathy and albuminuria in multiethnic populations: The Family Investigation of Nephropathy and Diabetes (FIND), Diabetes 56, 1577–1585.
18. Province, M. A. (2001) The significance of not finding a gene, The American Journal of Human Genetics 69, 660–663.
19. Badner, J. A., and Gershon, E. S. (2002) Meta-analysis of whole-genome linkage scans of bipolar disorder and schizophrenia, Molecular Psychiatry 7, 405–411.
20. Zaykin, D. V., Zhivotovsky, L. A., Westfall, P. H., and Weir, B. S. (2002) Truncated product method for combining P-values, Genetic Epidemiology 22, 170–185.
21. Loesgen, S., Dempfle, A., Golla, A., and Bickeboller, H. (2001) Weighting schemes in pooled linkage analysis, Genetic Epidemiology 21, S142–S147.
22. Stouffer, S., Suchman, E., DeVinney, L., Star, S., and Williams, R. J. (1949) Adjustment during army life, in The American Soldier, Princeton University Press, Princeton.
23. Haseman, J., and Elston, R. (1972) The investigation of linkage between a quantitative trait and a marker locus, Behavior Genetics 2, 3–19.
24. Province, M. A., Kardia, S. L. R., Ranade, K., Rao, D. C., Thiel, B. A., Cooper, R. S., Risch, N., Turner, S. T., Cox, D. R., Hunt, S. C., Weder, A. B., and Boerwinkle, E. (2003) A meta-analysis of genome-wide linkage scans for hypertension: The National Heart, Lung and Blood Institute Family Blood Pressure Program, American Journal of Hypertension 16, 144–147.
25. Rice, T., Rankinen, T., Province, M. A., Chagnon, Y. C., Perusse, L., Borecki, I. B., Bouchard, C., and Rao, D. C. (2000) Genome-wide linkage analysis of systolic and diastolic blood pressure: The Quebec Family Study, Circulation 102, 1956–1963.
26. Atwood, L. D., Samollow, P. B., Hixson, J. E., Stern, M. P., and MacCluer, J. W. (2001) Genome-wide linkage analysis of blood pressure in Mexican Americans, Genetic Epidemiology 20, 373–382.
27. Rice, T., Cooper, R. S., Wu, X. D., Bouchard, C., Rankinen, T., Rao, D. C., Jaquish, C. E., Fabsitz, R. R., and Province, M. A. (2006) Meta-analysis of genome-wide scans for blood pressure in African American and Nigerian samples, American Journal of Hypertension 19, 270–274.
28. Badner, J. A., and Gershon, E. (2002) Regional meta-analysis of published data supports linkage of autism with markers on chromosome 7, Molecular Psychiatry 7, 56–66.
29. Wise, L. H., and Lewis, C. M. (1999) A method for meta-analysis of genome searches: application to simulated data, Genetic Epidemiology 17, S767–S771.
30. Forabosco, P., Ng, M. Y. M., Bouzigon, E., Fisher, S. A., Levinson, D. F., and Lewis, C. M. (2007) Data acquisition for meta-analysis of genome-wide linkage studies using the genome search meta-analysis method, Human Heredity 64, 74–81.
31. Wise, L. H., Lanchbury, J. S., and Lewis, C. M. (1999) Meta-analysis of genome searches, Annals of Human Genetics 63, 263–272.
32. Choi, S. J., Rho, Y. H., Ji, J. D., Song, G. G., and Lee, Y. H. (2006) Genome scan meta-analysis of rheumatoid arthritis, Rheumatology 45, 166–170.
33. Saunders, C. L., Chiodini, B. D., Sham, P., Lewis, C. M., Abkevich, V., Adeyemo, A. A., de Andrade, M., Arya, R., Berenson, G. S., Blangero, J., Boehnke, M., Borecki, I. B., Chagnon, Y. C., Chen, W., Comuzzie, A. G., Deng, H. W., Duggirala, R., Feitosa, M. F., Froguel, P., Hanson, R. L., Hebebrand, J., Huezo-Dias, P., Kissebah, A. H., Li, W. D., Luke, A., Martin, L. J., Nash, M., Ohman, M., Palmer, L. J., Peltonen, L., Perola, M., Price, R. A., Redline, S., Srinivasan, S. R., Stern, M. P., Stone, S., Stringham, H., Turner, S., Wijmenga, C., and Collier, D. A. (2007) Meta-analysis of genome-wide linkage studies in BMI and obesity, Obesity 15, 2263–2275.
34. Zintzaras, E., and Ioannidis, J. P. A. (2005) Heterogeneity testing in meta-analysis of genome searches, Genetic Epidemiology 28, 123–137.
35. Cochran, W. (1954) The combination of estimates from different experiments, Biometrics 10, 101–129.
36. Lewis, C. M., and Levinson, D. F. (2006) Testing for genetic heterogeneity in the genome search meta-analysis method, Genetic Epidemiology 30, 348–355.
37. Malhotra, A., Coon, H., Feitosa, M. F., Li, W. D., North, K. E., Price, R. A., Bouchard, C., Hunt, S. C., Wolford, J. K., and American Diabetes Association, G. S. G. (2005) Meta-analysis of genome-wide linkage studies for quantitative lipid traits in African Americans, Human Molecular Genetics 14, 3955–3962.
38. Lee, Y. H., Rho, Y. H., Choi, S. J., Ji, J. D., and Song, G. G. (2006) Meta-analysis of genome-wide linkage studies for bone mineral density, Journal of Human Genetics 51, 480–486.
39. Fisher, S. A., Abecasis, G. R., Yashar, B. M., Zareparsi, S., Swaroop, A., Iyengar, S. K., Klein, B. E. K., Klein, R., Lee, K. E., Majewski, J., Schultz, D. W., Klein, M. L., Seddon, J. M., Santangelo, S. L., Weeks, D. E., Conley, Y. P., Mah, T. S., Schmidt, S., Haines, J. L., Pericak-Vance, M. A., Gorin, M. B., Schulz, H. L., Pardi, F., Lewis, C. M., and Weber, B. H. F. (2005) Meta-analysis of genome scans of age-related macular degeneration, Human Molecular Genetics 14, 2257–2264.
40. Zintzaras, E., and Kitsios, G. (2006) Identification of chromosomal regions linked to premature myocardial infarction: a meta-analysis of whole-genome searches, Journal of Human Genetics 51, 1015–1021.
41. Trikalinos, T. A., Karvouni, A., Zintzaras, E., Ylisaukko-oja, T., Peltonen, L., Jarvela, I., and Ioannidis, J. P. A. (2006) A heterogeneity-based genome search meta-analysis for autism-spectrum disorders, Molecular Psychiatry 11, 29–36.
42. Gu, C., Province, M., Todorov, A., and Rao, D. C.
(1998) Meta-analysis methodology for combining non-parametric sibpair linkage results: genetic homogeneity and identical markers, Genetic Epidemiology 15, 609–626. Risch, N., and Zhang, H. (1995) Extreme discordant sib pairs for mapping quantitative trait loci in humans, Science 268 (5217), 1584–1589.
44. Blackwelder, W., and Elston, R. (1985) A comparison of sib-pair linkage tests for disease susceptibility loci, Genetic Epidemiology 2, 85–97. 45. Heijmans, B. T., Beekman, M., Putter, H., Lakenberg, N., van der Wijk, H. J., Whitfield, J. B., Posthuma, D., Pedersen, N. L., Martin, N. G., Boomsma, D. I., and Slagboom, P. E. (2005) Meta-analysis of four new genome scans for lipid parameters and analysis of positional candidates in positive linkage regions, European Journal of Human Genetics 13, 1143–1153. 46. Rice, J. P., Rochberg, N., Neuman, R. J., Saccone, N. L., and Liu, K. Y., Zhang, X., and Culverhouse, R. (1999) Covariates in linkage analysis, Genetic Epidemiology 17 (Suppl 1), S691–S695. 47. Waldman, I. D., and Robinson, B. F. (2001) Meta-analysis of sib pair linkage studies of asthma and the interleukin-9 gene (IL9), Genetic Epidemiology 21, S109–S114. 48. Biswas, S., and Lin, S. (2007) Incorporating covariates in mapping heterogeneous traits: a hierarchical model using empirical Bayes estimation, Genetic Epidemiology 31, 684–696. 49. Smith, C. A. B. (1963) Testing for heterogeneity of recombination fraction values in human genetics, Annals of Human Genetics 27, 175–182. 50. Ott, J. (1999) Analysis of Human Genetic Linkage, The Johns Hopkins University Press, Baltimore, MD. 51. Lin, S. L., and Biswas, S. (2004) On modeling locus heterogeneity using mixture distributions, Bmc Genetics 5, 29. 52. Diebolt, J., and Ip, E. (1996) Stochastic EM: method and application, in Markov Chain Monte Carlo in Practice (Gilks, W., Richardson, S., and Spiegelhalter, D., Eds.), pp. 259–273, Chapman & Hall, London, UK. 53. Raftery, A. (1996) Hypothesis testing and model selection, in Markov Chain Monte Carlo in Practice (Gilks, W., Richardson, S., and Spiegelhalter, D., Eds.), pp. 163–187, Chapman & Hall, London, UK. 54. Dempfle, A., and Loesgen, S. 
(2004) Metaanalysis of linkage studies for complex diseases: an overview of methods and a simulation study, Annals of Human Genetics 68, 69–83. 55. Hedges, L. V., and Olkin, I. (1980) Votecounting methods in research synthesis, Psychological Bulletin 88, 359–369. 56. Levinson, D. F., Levinson, M. D., Segurado, R., and Lewis, C. M. (2003) Genome scan meta-analysis of schizophrenia and bipolar disorder, part I: Methods and power analysis, American Journal of Human Genetics 73, 17–33.
Part VI Other Practical Information
Chapter 22 Improved Reporting of Statistical Design and Analysis: Guidelines, Education, and Editorial Policies Madhu Mazumdar, Samprit Banerjee, and Heather L. Van Epps Abstract A majority of original articles published in biomedical journals include some form of statistical analysis. Unfortunately, many of the articles contain errors in statistical design and/or analysis. These errors are worrisome, as the misuse of statistics jeopardizes the process of scientific discovery and the accumulation of scientific knowledge. To help avoid these errors and improve statistical reporting, four approaches are suggested: (1) development of guidelines for statistical reporting that could be adopted by all journals, (2) improvement in statistics curricula in biomedical research programs with an emphasis on hands-on teaching by biostatisticians, (3) expansion and enhancement of biomedical science curricula in statistics programs, and (4) increased participation of biostatisticians in the peer review process along with the adoption of more rigorous journal editorial policies regarding statistics. In this chapter, we provide an overview of these issues with emphasis on the field of molecular biology and highlight the need for continuing efforts on all fronts. Key words: Biomedical sciences, molecular biology, statistical design and analysis, reporting guideline, statistical education, statistical errors, randomized clinical trials, observational study, meta-analysis, class discovery, class prediction.
1. Introduction Publication of manuscripts in peer-reviewed scientific journals is essential for the dissemination of scientific knowledge. The purpose of peer review is to provide a critical evaluation of the scientific data in a submitted paper, but the accuracy of this evaluation is often complicated by improperly analyzed data. Inappropriate statistical analyses can lead to misinterpretation of data and guide researchers to incorrect conclusions, which may then reach H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_22, © Springer Science+Business Media, LLC 2010
the public, including patients and health-care providers. Inaccurate published data can also result in the waste of scarce health-care resources when other researchers base future research projects on the published literature. Unfortunately, there are numerous examples of research studies in which data were inaccurately reported (1–5). Molecular biology involves the study of macromolecules, including nucleic acids and proteins, that govern all functions in cells, organs, and tissues. Many of the tools used in molecular biology research (such as real-time polymerase chain reaction (PCR), Western blots, complementary DNA/oligonucleotide arrays, and mass spectrometry) are quantitative in nature and generate copious amounts of data. Extracting accurate and useful conclusions from these large data sets requires appropriate statistical analysis, whether simple or complex. Many data sets can be analyzed using simple analyses, which include basic data plotting (e.g., point plots, bar graphs, and stem-and-leaf plots), data summarization (e.g., mean, median, trimmed mean), and regression analyses (e.g., linear, logistic, and proportional hazards regression). Other types of data sets require more complex statistical methods, including hierarchical clustering, mixed modeling, cross-validation, permutation test-based approaches, artificial neural networks, and support vector machines. Evaluation of diagnostic tests requires receiver operating characteristic (ROC) analysis, whereas discovery of predictive markers requires model building, sensitivity analysis, and validation. The appropriate statistical design of experiments also varies widely and includes methods such as randomized controlled trials, observational studies (with cross-sectional, case–control, and cohort designs, and their variants), and factorial designs, to name a few (6).
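As a concrete illustration of the simple summaries listed above, the sketch below computes the mean, median, and a trimmed mean for a small set of hypothetical measurements. The data and the `trimmed_mean` helper are invented for this example (a trimmed mean is also available as `scipy.stats.trim_mean`); only the Python standard library is used.

```python
import statistics

def trimmed_mean(values, proportion=0.1):
    """Mean after discarding the lowest and highest `proportion` of values."""
    data = sorted(values)
    k = int(len(data) * proportion)
    return statistics.mean(data[k:len(data) - k] if k else data)

# Hypothetical expression measurements; one outlier inflates the mean.
expression = [2.1, 2.3, 2.2, 2.4, 2.0, 2.2, 9.8]

print(statistics.mean(expression))     # pulled upward by the outlier
print(statistics.median(expression))   # robust to the outlier
print(trimmed_mean(expression, 0.15))  # drops one value from each tail
```

The contrast between the mean and the robust summaries is exactly the kind of information a reader needs to judge whether a reported "average" is driven by a few extreme values.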
Meta-analysis is also commonly utilized for reviewing and summarizing large amounts of previous quantitative research on a common topic (7). The proportion of original manuscripts published in biomedical journals that contain some form of statistical analysis is estimated at 60–90%, with molecular biology journals at the high end of the range (8–13). The complexity of published statistical analyses has increased steadily in recent decades, and statistical flaws have been revealed in many published reports (14, 15). For example, in 2002, patterns of mass spectrometry peaks were reported to diagnose ovarian, prostate, bladder, and colon cancer with nearly 100% sensitivity and specificity (16–21). Not surprisingly, these reports generated tremendous interest among clinical investigators, funding agencies, pharmaceutical companies, and the public (22), and plans were made to market a commercial blood test for ovarian cancer based on these results. Enthusiasm was heavily tempered when several subsequent articles expressed skepticism about the data in the original report (22, 23). The
concerns about the original paper were based both on the biological implausibility of the reported accuracy of the test (24–28) and, more importantly, on the inability of other investigators to reproduce the results in independent studies (29–34). Furthermore, when the data from the original cancer studies were reanalyzed by other investigators, the success in predicting cancer incidence was no better than chance (29). Analysis of cancerous and noncancerous samples on separate days was also suggested to have introduced bias, as the calibration of the mass spectrometer used to analyze the samples varied slightly over time (29). The original investigators eventually agreed that reproducibility had not been demonstrated, and all related developments stemming from that study were halted. Although the above example is an extreme case, it illustrates the importance of appropriate experimental design, data analysis, and reporting. To help avoid these errors and improve statistical reporting, four approaches are suggested: (1) development of guidelines for statistical reporting that could be adopted by all journals, (2) improvement in statistics curricula in biomedical research programs with an emphasis on hands-on teaching by biostatisticians, (3) expansion and enhancement of biomedical science curricula in statistics programs, and (4) increased participation of biostatisticians in the peer review process along with the adoption of more rigorous journal editorial policies regarding statistics. Although general guidelines regarding the proper use of statistics in biomedical research have been available for the last two decades, more specific checklists and flowcharts were developed between 1995 and 2005 for designing and analyzing studies such as randomized clinical trials, observational studies, meta-analyses, tumor marker prognostic studies, and accuracy studies for diagnostic tests.
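The reanalysis described above asked whether a classifier's apparent success could be distinguished from chance — a question naturally posed with a permutation test, one of the complex methods mentioned earlier. The sketch below is a minimal illustration; the case/control labels and classifier calls are invented for this example.

```python
import random

def permutation_p_value(labels, predictions, n_perm=10000, seed=0):
    """Fraction of label shufflings whose accuracy >= the observed accuracy."""
    rng = random.Random(seed)
    observed = sum(l == p for l, p in zip(labels, predictions))
    hits = 0
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if sum(s == p for s, p in zip(shuffled, predictions)) >= observed:
            hits += 1
    # Add-one correction keeps the Monte Carlo p-value away from zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical case (1) / control (0) labels and classifier calls.
labels      = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0]
predictions = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0]
print(permutation_p_value(labels, predictions))
```

If shuffled labels match the predictions about as often as the real labels do, the classifier's accuracy carries no evidence of signal — which is what the reanalysis of the cancer studies reported.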
Specific guidelines for designing and analyzing studies involving animal experimentation, microarray and proteomics studies, and genome-wide associations are also under development. These guidelines may be beneficial to researchers, but consistent adoption of these standards by scientific journals and their application to the manuscript review process will be required to ensure that all studies have been rigorously designed and analyzed. A team approach to improving statistical design, analysis, and reporting is most likely the best way to proceed, as it is not currently practical to expect every research group to employ a full-time statistician. Efforts might instead focus on increasing the understanding of statistics among researchers and increasing the availability of statisticians for consultation. A fruitful collaboration between the two requires that the researcher understand basic statistical principles and that the statistician understand basic principles of experimental design and procedure. Efforts aimed at improving statistician training, including education in biology,
communication, and personal effectiveness have begun in various universities, encouraged by workshops conducted by the National Institutes of Health (NIH) (35). The NIH has also contributed to this process by mandating that recipients of NIH biomedical training grants complete courses in biostatistical design and analysis. Although the impact of this requirement on the quality of statistical reporting in published research articles will be difficult to assess, preliminary information on the improvement of knowledge suggests that additional efforts are needed (36). Statistical consultation provides a practical avenue for the mutual education of statisticians and biomedical researchers. Statisticians have the opportunity to learn about the nuances of experimental design and setup, and biomedical researchers can gain knowledge of appropriate statistical design and analysis (37). Peer-reviewed journals have increasingly recruited statisticians into the review process to avoid publishing statistically flawed studies. However, the inclusion and enforcement of specific guidelines for reporting statistics in journals' "Instructions to Authors" vary from journal to journal. It is also unclear what percentage of reviewed manuscripts are evaluated by a statistician and to what extent statisticians' recommendations are heeded by editors and authors. In this chapter, we discuss the issues raised above with the goal of increasing awareness of potential pitfalls in statistical reporting and the tools available for its improvement. In Table 22.1, we provide a list of premier guidelines developed in the last three decades and, in Section 2, briefly describe ten of the statistical guidelines most relevant to research in molecular biology. Ongoing changes to the curricula of biomedical scientists and biostatisticians are discussed in Section 3. Challenges regarding the optimal utilization of statistical reviewers in the peer review process are then surveyed in Section 4.
Discussion and practical steps for utilizing the available tools are offered in Section 5.
2. Guidelines for Reporting Statistical Analysis
One suggestion for improvement of statistical reporting in published articles is for journals to provide specific guidelines. The first major step in this direction was taken in 1978 when a small group of medical journal editors met informally in Vancouver, British Columbia, to establish guidelines for the formatting of manuscripts submitted to their journals. The group, now known as the Vancouver Group, published these guidelines in 1979 (38). The Vancouver Group expanded and evolved into the International Committee of Medical Journal Editors (ICMJE), which
Table 22.1 Guidelines for reporting statistics in biomedical journals (websites connected Feb 2010)

General guidelines:

1. Uniform requirements for manuscripts submitted to biomedical journals. Reference: International Committee of Medical Journal Editors (ICMJE): 1979, British Medical Journal, 1(6162): 532–535; 1982, Ann Intern Med, 96: 766–771; 1988, Ann Intern Med, 108: 258–265; 1997, Ann Intern Med, 126: 36–47. Content: two paragraphs with general guidance, modified over time. Website: http://www.icmje.org/

2. Statistical guidelines for contributors to medical journals. Reference: Altman DG, Gore SM, Gardner MJ, Pocock SJ. 1983, British Medical Journal, 286: 1489–1493. Content: tutorial on all basic and important elements.

3. Methodologic guidelines for reports of clinical trials. Reference: Simon R, Wittes R. 1985, Cancer Treatment Reports, 69: 1–3. Content: 9-point guideline (developed for guiding manuscript submission to Cancer Treatment Reports).

4. Guidelines for statistical reporting in articles for medical journals: amplifications and explanations. Reference: Bailar JC 3rd, Mosteller F. 1988, Ann Intern Med, 108: 266–273. Content: 15-point guideline.

5. Statistical guidelines for The British Journal of Surgery. Reference: Murray GD. 1991, British Journal of Surgery, 78: 782–784. Content: general guideline divided by relevant sections of manuscripts.

6. Guidelines for reporting statistics in journals published by the American Physiological Society. Reference: Curran-Everett D, Benos DJ. 2004, Journal of Neurophysiology, 92: 669–671. Content: 10-point guideline.

Study type-specific guidelines:

7. The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomized trials. Reference: Moher D, Schulz KF, Altman DG, for the CONSORT Group. 2001, Lancet, 357: 1191–1194. Content: 22-item checklist and flowchart. Website: http://www.consort-statement.org/

8. Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Reference: von Elm E, Altman D, Egger M, Pocock S, Gøtzsche P, Vandenbroucke J, for the STROBE Initiative. 2007, Lancet, 370: 1453–1457. Content: 22-item checklist (with 4 items each specific to cohort, cross-sectional, and case–control studies). Website: http://www.strobe-statement.org/

9. Improving the quality of reports of meta-analyses of randomized controlled trials: the QUOROM statement (Quality of Reporting of Meta-analyses). Reference: Moher D, Cook DJ, Eastwood S, Olkin I, Rennie D, Stroup DF, for the QUOROM Group. 1999, Lancet, 354: 1896–1900. Content: 21 headings and subheadings.

10. Meta-analysis of Observational Studies in Epidemiology (MOOSE): a proposal for reporting. Reference: Stroup DF, Berlin JA, Morton SC, Olkin I, Williamson JD, Rennie D, Moher D, Becker BJ, Sipe TA, Thacker SB, for the MOOSE Group. 2000, JAMA, 283: 2008–2012. Content: 35-item checklist.

11. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Reference: Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM, Lijmer JG, Moher D, Rennie D, de Vet HCW, for the STARD Group. 2003, American Journal of Roentgenology, 181: 51–55. Content: 25-item checklist. Website: http://www.stard-statement.org/

12. REporting recommendations for tumour MARKer prognostic studies (REMARK). Reference: McShane L, Altman D, Sauerbrei W, Taube S, Gion M, Clark GM, for the Statistics Subcommittee of the NCI-EORTC Working Group on Cancer Diagnostics. 2005, British Journal of Cancer, 93: 387–391. Content: 20-item checklist.
now meets annually and has gradually broadened its concerns. The general guidelines of the ICMJE for the reporting of statistics are the following:
- Describe statistical methods with enough detail to enable a knowledgeable reader with access to the original data to verify the reported results.
- When possible, quantify findings and present them with appropriate indicators of measurement error or uncertainty (such as confidence intervals).
- Avoid relying solely on statistical hypothesis testing, such as the use of p-values, which fails to convey important quantitative information.
- Discuss the eligibility of experimental subjects.
- Give details about randomization.
- Describe the methods for and success of any blinding of observations.
- Report complications of treatment.
- Give numbers of observations.
- Report losses to observation (such as dropouts from a clinical trial).
- References for the design of the study and statistical methods should be to standard works when possible (with pages stated) rather than to papers in which the designs or methods were originally reported.
- Specify any general-use computer programs used.
- Put a general description of methods in the Methods section. When data are summarized in the Results section, specify the statistical methods used to analyze them.
- Restrict tables and figures to those needed to explain the argument of the paper and to assess its support. Use graphs as an alternative to tables with many entries; do not duplicate data in graphs and tables.
- Avoid nontechnical uses of technical terms in statistics, such as "random" (which implies a randomizing device), "normal," "significant," "correlations," and "sample." Define statistical terms, abbreviations, and most symbols.
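The ICMJE item on indicators of uncertainty (confidence intervals rather than bare p-values) can be sketched briefly. The replicate measurements below are invented for illustration, and the large-sample normal approximation is used for simplicity; a t-based interval would be slightly wider for n = 8.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_with_ci(values, confidence=0.95):
    """Point estimate plus a large-sample (normal-approximation) CI."""
    m = mean(values)
    se = stdev(values) / sqrt(len(values))        # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return m, m - z * se, m + z * se

# Hypothetical fold-change measurements from replicate experiments.
fold_changes = [1.8, 2.1, 1.9, 2.4, 2.0, 2.2, 1.7, 2.3]
m, lo, hi = mean_with_ci(fold_changes)
print(f"mean = {m:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

Reporting the interval alongside the point estimate tells the reader both the size of the effect and the precision with which it was measured, which a p-value alone cannot.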
Although these requirements have been widely adopted by biomedical journals, many submitted manuscripts continue to suffer from major statistical flaws, and a significant proportion of these flaws evade referee and editorial scrutiny prior to publication (39). Many tutorial-type guidelines were written starting in the late 1980s (40, 41). Renewed efforts toward developing guidelines with checklists and flowcharts have emerged more recently (42, 43) (see Table 22.1). These new initiatives are more focused in their scope than their predecessors. For example, guidelines for reporting randomized clinical trials are separated from those for reporting epidemiologic studies. Typically, each guideline is motivated by a published survey of the reporting errors detected in a particular type of study. The process starts with compilation of a checklist of items important for the statistical design, reporting of methodologies, and data pertinent to that field. Specific suggestions are made for what to report in each part of a paper, classified by "title," "abstract," "introduction," "methods," "results," and "discussion." A flowchart tracking patient accrual and loss, documenting both the sample size with which the study was initiated and the sample size utilized in the final analysis, is recommended. Experts in the field begin the process, but a huge effort is then put into consensus building and inclusion of opinions from all stakeholders through face-to-face meetings and electronic avenues (e-mails, video conferences, and websites). Comments are gathered for modification of each aspect of a guideline. After repeated revisions, the final version is published in many medical journals simultaneously, and endorsement from a large number of medical journals and editorial groups, including the ICMJE, is obtained. Websites with comprehensive information about related efforts are maintained to provide a space for continued discussion. Reports related to guideline effectiveness or problems arising from their use can be documented and updated on the website, allowing for modification as needed. Guidelines may also be translated into other languages for further use. These guidelines are expected to succeed in improving reporting because of this changed approach to their development, adoption, evaluation, and monitoring. Below, we briefly describe ten guidelines chosen for their relevance to the field of molecular biology; the first six, developed in the rigorous manner described above, are already in use by a large number of journals (Sections 2.1, 2.2, 2.3, 2.4, 2.5, and 2.6); the last four are under development (Sections 2.7, 2.8, 2.9, and 2.10) and have been adopted by only a select few journals or groups. 2.1. CONSORT (Consolidated Standards of Reporting Trials) (http://www.consort-statement.org)
A randomized controlled trial (RCT) is a type of study commonly utilized for testing the comparative efficacy of interventions such as treatment combinations, health-care services, or health technologies. The distinguishing feature of RCTs is the random allocation of different interventions (treatments or conditions) to subjects. As long as the number of subjects is sufficient (i.e., determined by a power calculation), both known and unknown confounding factors are expected to be evenly distributed between the interventions, and an unbiased estimate of the comparative efficacy can be computed. Many publications have documented inadequate reporting of RCTs (44–46). For example, a review of 122 published RCTs analyzing the effectiveness of selective serotonin reuptake inhibitors as a first-line treatment for depression found that only one paper adequately described randomization (45). Empirical evidence reveals that inadequate reporting of randomization is associated with biased estimates of treatment effects on the order of 20% (47). A survey of the content and quality of 2,000 controlled trials in schizophrenia over 50 years found that the studies were unacceptably small (mean number of patients, 65) and poorly reported (64% had a quality score of ≤2, out of a maximum of 5) (46). Similarly, a review of 279 published RCTs in head injury found that no trial was large enough
to reliably detect a 5% absolute reduction in risk of death and disability, a clinically meaningful difference. In response to repeated findings of these weaknesses, two independent initiatives culminated in the publication of CONSORT (CONsolidated Standards Of Reporting Trials) (48). The CONSORT statement includes checklist items focused on reporting trial design, analysis, and interpretation, and a flow diagram charting the progress of all participants throughout a given trial. These guidelines are primarily intended for use in writing, reviewing, or assessing reports of simple two-group parallel RCTs. Studies indicated that the use of CONSORT improves the quality of RCT reports (49, 50). In an assessment of 71 RCTs published in three journals in 1994, allocation concealment (separating the process of randomization from the recruitment of participants) was not clearly reported in 61% (n = 43) of the trials. Four years later, after these journals adopted the CONSORT guidelines, the proportion had dropped to 39% (30 of 77), demonstrating significant improvement (51). However, comments about the need for reducing the complexity of some items and the structure of the flowchart were also documented. Based on this feedback from CONSORT users, a revised version of CONSORT was published, consisting of a 22-item checklist and a flow diagram, along with some brief descriptive text (52). Areas of revision included guidelines for reporting ethics approval, outcome definition, and sample size. The revised flow diagram depicts information from the four stages of a trial (enrollment, intervention allocation, follow-up, and analysis) separately (49).
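For context on the sample sizes involved, the standard two-proportion formula shows why detecting a 5% absolute risk reduction requires large trials. The sketch below uses assumed event rates (40% versus 35%) chosen purely for illustration; it is not taken from the head-injury review itself.

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided comparison of two proportions
    (standard normal-approximation formula)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

# Assumed control event rate of 40% versus a 5% absolute reduction:
# roughly 1,500 patients per arm at 80% power.
print(n_per_arm(0.40, 0.35))
```

With the mean trial in the schizophrenia survey enrolling 65 patients in total, the gap between typical and required sample sizes is stark.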
Surveys of the quality of RCT reporting in leading medical journals since the revised CONSORT statement reveal that although reporting of some recommendations is high (sequence generation: 80%, sample size justification: 83%, methods of analysis: 87%, etc.), reporting of several essential recommendations remains suboptimal (allocation concealment: 48%, blinding status of participants: 40%, blinding status of health-care providers: 15%, etc.) (53). The website continues to maintain all related reports and information for easy dissemination and collects information for further refinement. The CONSORT guidelines, which have now been adopted by approximately 300 journals and translated into Dutch, English, French, Japanese, Spanish, and German, provide a good model for all groups interested in developing statistical reporting guidelines. 2.2. STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) (http://www.strobe-statement.org/)
Observational studies are useful for drawing inferences about the effect of an intervention when it is not possible to randomly assign subjects to interventions for either ethical or practical reasons. This is in contrast to controlled experiments, such as the RCTs discussed earlier. Many questions in medical research are investigated through three common types
of observational studies: (1) cohort, (2) case–control, and (3) cross-sectional studies (54). In the absence of randomization, it becomes extremely important to be aware of all potential confounding variables that may affect the results. Adequate reporting of all aspects related to the covariates (methods of measurement, reasons for missingness, etc.) and coverage of the statistical methods utilized for assessing the effects of these covariates on the main outcome also become crucial. In published observational research, important information about covariates is often missing or unclear. An analysis of epidemiological studies published in general medical and specialist journals found that the rationale behind the choice of potential confounding variables was often not reported (5). Only a few reports of case–control studies in psychiatry, for example, explained the methods used to identify cases and controls (55). In a survey of longitudinal studies in stroke research, 17 of 49 articles (35%) did not specify eligibility criteria (56). It has been argued that without sufficient clarity of reporting, the benefits of research are realized more slowly (57). Thus, there is a need for guidance in reporting observational studies (58). In response to these problematic observations, the STROBE statement was developed, following the rigorous development process used for CONSORT. The STROBE checklist consists of 22 items considered essential for quality reporting of observational studies, with 18 common items and 4 items each specific to cohort, case–control, and cross-sectional studies (59). Comments from users are being collected through a forum on the STROBE website for further refinement and evaluation. 2.3. QUOROM (Quality of Reporting of Meta-analyses)
Meta-analysis is a statistical technique for reviewing and summarizing large amounts of previously generated quantitative research on a specific topic. The field emerged with studies that combined RCTs through systematic literature reviews. These reviews attempt to reduce bias and increase power via systematic identification, appraisal, synthesis, and optimal statistical aggregation of all relevant studies on a specific topic according to predetermined criteria. The number of published meta-analyses has increased substantially in the past decade. These integrative articles can help guide clinical decisions, can provide a foundation for the development of evidence-based practice guidelines, and may provide a launching point for future research. Meta-analysis research has the potential to be flawed, however, and the process by which meta-analyses are carried out and reported has come under scrutiny. A 1987 survey of 86 meta-analyses of randomized trials showed that only 24 (28%) of the 86 meta-analyses addressed 6 important content areas: study design, combinability, control of bias, statistical analysis, sensitivity, and applicability (60). An update that included more recently published meta-analyses revealed little improvement in the rigor of the reports surveyed (61).
574
Mazumdar, Banerjee, and Van Epps
The increased number of published meta-analyses highlights such issues as discordant meta-analyses on the same topic and discordance between meta-analyses and randomized trial results on the same question. Issues of "publication bias," in which studies with statistically significant results have a higher chance of being published, were repeatedly observed but were not consistently reported in meta-analytic studies. These issues motivated the development of the Quality of Reporting of Meta-analyses (QUOROM) statement. QUOROM consists of a checklist and a flow diagram (62). The checklist describes appropriate presentation for the abstract, introduction, methods, results, and discussion sections of meta-analysis reports. It is organized into 21 headings and subheadings, including searches, selection, validity assessment, data abstraction, study characteristics, and quantitative data synthesis in the methods, and "trial flow," study characteristics, and quantitative data synthesis in the results. The flow diagram provides information about the numbers of RCTs identified, included, and excluded, and the reasons for exclusion of trials. This systematic way of selecting the studies to be included in a meta-analysis and documenting the details of each study is expected to improve reporting of the important content areas discussed above. Data regarding the effectiveness of the QUOROM guidelines have shown improvement. In 2005, 6 years after publication of the QUOROM guidelines, Delaney et al. (63) showed a statistically significant improvement in the estimated mean quality score as assessed by the "Overall Quality Assessment Questionnaire" (64), based on 139 reports of meta-analysis. Another study, based on a review of 161 articles, concluded that fulfillment of most QUOROM items improved with time (65).
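The "optimal statistical aggregation" at the heart of a meta-analysis is, in its simplest fixed-effect form, an inverse-variance weighted average of the study effect estimates, usually accompanied by Cochran's Q statistic as a check on heterogeneity. A minimal sketch (the effect sizes and variances below are hypothetical):

```python
import math

def fixed_effect_pool(effects, variances):
    """Inverse-variance (fixed-effect) pooled estimate, its standard error,
    and Cochran's Q statistic for between-study heterogeneity."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    return pooled, se, q

# Hypothetical log odds ratios and their variances from three trials
pooled, se, q = fixed_effect_pool([0.30, 0.10, 0.25], [0.05, 0.02, 0.08])
print(f"pooled = {pooled:.3f}, 95% CI = ({pooled - 1.96 * se:.3f}, "
      f"{pooled + 1.96 * se:.3f}), Q = {q:.3f}")
```

Q is referred to a chi-square distribution with (number of studies − 1) degrees of freedom; a large Q signals heterogeneity that a fixed-effect model cannot absorb, one reason checklists such as QUOROM require the synthesis method to be reported.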
Of note, however, a recent study based on 87 systematic reviews found that compliance with QUOROM recommendations is not widespread in systematic reviews or meta-analyses (66), indicating the need for more universal adoption of the QUOROM guidelines and perhaps a website for collecting feedback. 2.4. MOOSE (Meta-analysis of Observational Studies in Epidemiology)
The merit of applying formal meta-analytic methods to observational studies has been hotly debated (67, 68). Although meta-analyses of RCTs are usually preferred to meta-analyses of observational studies, the number of published meta-analyses of observational studies in the biomedical field has increased substantially during the past four decades (68). Reasons for the difficulty in combining results from various studies include the variety of designs used in the original studies, differences in data collection methods, and the numerous possible ways of defining the same exposure or confounding variables. Because the studies often adjust for different confounding factors, it can be impossible to place their outcomes on the same footing.
Improved Reporting of Statistical Design and Analysis
575
Heterogeneity arising from these issues and the higher expected occurrence of publication bias make the meta-analysis of observational studies daunting (69, 70). Yet when carefully approached and well reported, such meta-analyses provide a useful tool for quantifying sources of variability in results across studies and for understanding the overall knowledge gained about a research question (71, 72). Acknowledgment of this gain despite the complications led to the development of MOOSE, a checklist of 35 reporting items that builds on similar activities for combining RCTs (the QUOROM guideline). The checklist includes recommendations for reporting background, search strategy, methods, results, discussion, and conclusions. No assessment of its effectiveness has been performed yet. Efforts toward adoption of MOOSE by more journals and creation of a website for collecting comments are needed. 2.5. STARD (Standards for Reporting of Diagnostic Accuracy Studies) (http://www.stardstatement.org/)
A diagnostic test is any kind of medical test performed to aid in the diagnosis or detection of disease. New diagnostic tests are being developed rapidly, and the technology underlying existing tests is continuously being improved. Exaggerated and/or biased results stemming from poorly designed and reported diagnostic studies could result in tests becoming available to the public earlier than warranted. Use of such premature tests could lead to incorrect diagnoses and therapeutic interventions. Studies to determine the accuracy of such diagnostic tests are a vital part of the evaluation process (73–75). In accuracy studies, the outcomes from one or more tests under evaluation are compared with outcomes from a reference standard test, both measured in subjects who are suspected of having the condition of interest. In this context, the reference standard is considered to be the current "gold standard" for establishing the presence or absence of the condition of interest. The term accuracy refers to the level of agreement between the information from the test under evaluation and the reference standard, and it can be measured in many formats, including but not limited to sensitivity and specificity, likelihood ratios, the diagnostic odds ratio, and the area under a receiver operating characteristic (ROC) curve (76–78). Several potential threats to the internal and external validity of studies of diagnostic accuracy have been described in the published literature (79, 80). Several such surveys revealed that the quality of studies of diagnostic methods was usually mediocre at best, although evaluations were often hampered by the lack of information on key elements of design, conduct, and analysis in many of the reports (81–83). The STARD initiative was undertaken to improve the accuracy and completeness of reporting of studies of diagnostic accuracy, to allow readers to assess the potential for bias in a study (internal validity), and to evaluate its generalizability (external validity).
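The accuracy measures listed above can all be derived from the 2×2 table that cross-classifies the index test against the reference standard. A minimal sketch with hypothetical counts:

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Common accuracy measures from a 2x2 table of index test vs. reference standard."""
    sens = tp / (tp + fn)          # sensitivity: P(test+ | disease present)
    spec = tn / (tn + fp)          # specificity: P(test- | disease absent)
    lr_pos = sens / (1 - spec)     # positive likelihood ratio
    lr_neg = (1 - sens) / spec     # negative likelihood ratio
    dor = lr_pos / lr_neg          # diagnostic odds ratio = (tp * tn) / (fp * fn)
    return {"sensitivity": sens, "specificity": spec,
            "LR+": lr_pos, "LR-": lr_neg, "DOR": dor}

# Hypothetical counts: 90 true positives, 20 false positives,
# 10 false negatives, 80 true negatives
print(diagnostic_accuracy(tp=90, fp=20, fn=10, tn=80))
```

The same four counts, evaluated over a range of test thresholds, trace out the ROC curve whose area summarizes overall discrimination.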
The STARD statement consists of a checklist of 25 items and recommends the use of a flow diagram describing the design of the study and the flow of patients (42). Although preliminary, mixed results have been reported regarding improvement in the quality of reporting since adoption of the STARD statement. Two studies reported no improvement (84, 85) and one study reported slight improvement (86). A slow adoption rate by journals was thought to be one potential reason for the mixed results. A second potential reason was the large variability in the degree to which these reporting guidelines are incorporated into journals' instructions to authors. Smidt and colleagues identified some intriguing differences in wording ("require," "encourage," "consult") regarding the extent to which the STARD statement was to be followed (87). There has been conjecture that this variability is further increased by the discretion given to peer reviewers in applying the STARD statement. Although reporting is slowly improving, it is generally accepted that there is still ample room for improvement (88). 2.6. REMARK (REporting recommendations for tumor MARKer prognostic studies)
Prognostic markers are those that have an association with some clinical outcome, typically a time-to-event outcome such as overall survival or recurrence-free survival, and are often utilized in the clinical management of a patient. For example, markers may be used as decision aids for determining whether a patient should receive adjuvant chemotherapy or how aggressive that therapy should be. Despite years of research and hundreds of reports on prognostic tumor markers in oncology, the number of markers that have emerged as clinically useful is quite small. Often, initially reported studies of a marker show great promise, but subsequent studies on the same or related markers yield inconsistent conclusions or directly contradict other results. A variety of methodological problems have been cited to explain these discrepancies, including underpowered studies or overly optimistic reporting of effect sizes and significance levels due to multiple testing, subset analyses, and cutpoint optimization (89). Unfortunately, many tumor marker studies have not been reported in a rigorous fashion, and published articles often lack sufficient information to allow adequate assessment of the quality or generalizability of the results. Such reporting deficiencies are increasingly being highlighted by systematic reviews of the published literature on particular markers or cancers (90–95). To improve on this situation, the REMARK guideline was developed. REMARK is a 20-item guideline with specific items grouped under headings of “Introduction,” “Materials and Methods,” “Results,” and “Discussion,” reflecting the relevant sections of a published scientific article. Under one item, it is indicated that a diagram may be helpful to specify numbers of individuals included
at different stages of a study (e.g., number in original sample, number remaining after exclusions, and number incorporated into univariate and multivariable analyses). At present, there has been no assessment of the effectiveness of the REMARK guidelines, and a website has not yet been developed. 2.7. Reporting Studies with Animal Experiments (http://www.nationalacademies.org/ilar)
Animal experiments are usually performed to make discoveries about the biology of the species being studied and to infer something about humans or other target species. Thus, laboratory animals are usually used as "models" of some other species. As with trials involving human participants, experiments using laboratory animals should be well designed, correctly analyzed, clearly presented, and correctly interpreted if they are to be ethically acceptable. Unfortunately, surveys of published papers reveal that many fall short of this ideal, and in some cases the conclusions are not sufficiently supported by the data (96–98). McCance et al. reviewed 133 papers in 23 consecutive issues of the Australian Veterinary Journal for their statistical content and concluded that only 38 (29%) of the papers would have been acceptable to a statistical referee without revision. They found that revisions would have been required in 88 (66%) and that the remaining 7 (5%) had major flaws. Weaknesses in design were found in 40 (30%) of the studies, mainly regarding randomization and the numbers of animals included in the experiment. Deficiencies in analysis were identified in 60 (45%) of the studies, primarily failure to use appropriate methods for multiple comparisons and repeated measures. Problems in presentation were detected in 44 (33%) of the papers, largely involving insufficient information about the data or their statistical analysis and presentation. Conclusions were considered inconsistent with the analysis in 35 (26%) of the papers, due mainly to misinterpretation of significance tests. It was suggested that statistical refereeing, the publication of statistical guidelines for authors, and statistical advice to Animal Experimentation Ethics Committees could all play a part in addressing these problems (99).
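One of the deficiencies noted above, the failure to use appropriate methods for multiple comparisons, can be addressed with a family-wise correction such as Holm's step-down procedure. A minimal sketch with hypothetical p-values from four pairwise treatment comparisons:

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values (controls the family-wise error rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices from smallest p upward
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        adj = min(1.0, (m - rank) * pvals[i])
        running_max = max(running_max, adj)  # enforce monotonicity of adjusted values
        adjusted[i] = running_max
    return adjusted

# Hypothetical p-values from four pairwise treatment comparisons
print(holm_adjust([0.001, 0.04, 0.03, 0.2]))
```

Holm's procedure is uniformly more powerful than the plain Bonferroni correction while making no additional assumptions, which makes it a reasonable default for the small families of comparisons typical of animal experiments.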
In response to many of the problematic observations outlined above, the editorial board of the Institute for Laboratory Animal Research (ILAR) journal dedicated a full volume to the topic and published eight peer-reviewed papers on the design, analysis, and interpretation of experiments using laboratory animals. This effort was aimed at improving the quality of research and its reporting in the context of animal experiments. The topics covered are (1) Introduction to the Design and Statistical Analysis of Animal Experiments, (2) Control of Variability, (3) Practical Aspects of Experimental Design in Animal Research, (4) Sample Size Determination, (5) Role of Ancillary Variables in the Design, Analysis, and Interpretation of Animal Experiments, (6) Use of Factorial Designs to Optimize Animal Experiments and Reduce Animal Use, (7) Alternative Methods for the Median Lethal Dose Test: The Up-and-Down Procedure for Acute Oral Toxicity, and (8) Guidelines for the Design and Statistical Analysis of Experiments Using Laboratory Animals. These materials are freely available electronically, but the level of their use has not been measured and their effectiveness has not been evaluated. Creating a corresponding checklist/flowchart and a conduit for collecting feedback would be valuable. 2.8. Reporting Studies with Microarray Data
DNA microarray technology has had a profound impact on the field of biomedical research, with an exponentially growing number of publications. However, there is some skepticism about the reproducibility and validity of these findings (100, 101). The objective of a microarray study is to find a set of genes that are differentially expressed with respect to an outcome (e.g., condition, treatment, or experimental perturbation). Since typical microarray experiments involve thousands of genes, only a fraction of which are expected to be differentially expressed, it is critical to design the study to have sufficient power for detection and to guard against false positives. The final data obtained in a microarray experiment result from numerous technological steps (e.g., hybridization, image scanning, preprocessing), and it is important to separate technological variation from biological variation, a process commonly known as normalization. Previous research suggested that microarray data analysis is relatively robust to the choice of normalization method; however, emerging evidence suggests that different normalization techniques can have a significant effect on prediction outcomes (102). Description of the normalization methods also fosters reproducibility of results. Gene expression data often have numerous missing values. Different procedures for estimating missing values could lead to different conclusions from the same data set, so better reporting of the extent of missing values and of the methods used to analyze data with missing values is crucial. Two recent studies reviewed published manuscripts on gene expression data and provided a comprehensive list of common errors or flaws (103, 104). One study (by Jafari and Azuaje) reviewed 293 articles published during the period 2003–2005, and the other (by Dupuy and Simon) reviewed 90 manuscripts published between 2000 and 2004.
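One common way to separate technological from biological variation is quantile normalization, which forces every array onto a common empirical distribution. A minimal pure-Python sketch (the matrix is hypothetical, and production work would use an established package; ties are ignored for simplicity):

```python
def quantile_normalize(matrix):
    """Quantile-normalize the columns (arrays) of a genes-x-arrays matrix:
    each array is mapped onto the same average empirical distribution."""
    n_genes = len(matrix)
    n_arrays = len(matrix[0])
    cols = [[matrix[g][a] for g in range(n_genes)] for a in range(n_arrays)]
    # mean of the k-th smallest value across arrays defines the target distribution
    sorted_cols = [sorted(c) for c in cols]
    mean_q = [sum(sc[k] for sc in sorted_cols) / n_arrays for k in range(n_genes)]
    # replace each value by the target quantile at its within-array rank
    out = [[0.0] * n_arrays for _ in range(n_genes)]
    for a, col in enumerate(cols):
        order = sorted(range(n_genes), key=col.__getitem__)
        for rank, g in enumerate(order):
            out[g][a] = mean_q[rank]
    return out

# Hypothetical expression matrix: 4 genes x 3 arrays
X = [[5.0, 4.0, 3.0],
     [2.0, 1.0, 4.0],
     [3.0, 3.0, 6.0],
     [4.0, 2.0, 8.0]]
Xn = quantile_normalize(X)
```

After normalization, every array has exactly the same sorted values, so remaining differences between arrays reflect only the ordering of genes, not array-wide technical shifts.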
The first report found that none of the reviewed articles specified the approaches used to calculate statistical power and sample size, and only 8% (n = 23) of the studies discussed the limitations of sample size. They also found that 67% (n = 195) reported that the data were normalized, but only 36% (n = 106) described the normalization process in detail. Only 15% (n = 44) of the manuscripts described the methods used to estimate missing values. Although ANOVA and t-tests are widely used in the analysis of differential expression, the fundamental assumptions (e.g., homogeneity of variance and normality) underlying these statistical tests are rarely verified.
Dupuy and Simon found that fewer than 5% of the studies that used ANOVA and t-tests discussed the homogeneity of variance assumption, and only around 5% of the studies provided justification for using these tests. Another serious flaw found in most studies was inadequate reporting of, or correction for, false positives. Jafari and Azuaje found that 9 out of 23 articles had an inadequate, unclear, or unstated method for controlling the number of false positives. Class discovery pertains to the process of dividing samples or specimens into reproducible classes (based on their gene expression) that have similar properties or behavior. These classes are generally referred to as "unsupervised" because they are based on similarities of gene expression profiles and not on any external variable. A serious and common flaw in class discovery is to deem clusters meaningful for distinguishing different outcomes when the genes used to form the clusters were selected based on their correlation with the outcome variable. Jafari and Azuaje found such spurious claims in 46.4% (13 out of 28) of the studies reviewed. Supervised prediction involves building and validating a classifier that can be used to predict the outcome of individuals based on their gene expression profiles. The fundamental rule of supervised prediction is not to validate the classifier on samples used to build it. The most common and serious flaw found by Jafari and Azuaje was biased estimation of the prediction accuracy for binary outcomes. Reporting the p-value of a chi-square test, a Fisher's exact test, or an odds ratio is not suitable for the purpose of prediction accuracy, as these are all tests of association.
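Control of false positives across thousands of simultaneous tests, noted above as inadequately handled in many studies, is commonly achieved with the Benjamini–Hochberg step-up procedure, which bounds the false discovery rate. A minimal sketch with hypothetical p-values:

```python
def benjamini_hochberg(pvals, q=0.05):
    """Return indices of hypotheses rejected at FDR level q (BH step-up)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices from smallest p upward
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:   # step-up: remember the largest rank passing
            k_max = rank
    return sorted(order[:k_max])       # reject the k_max smallest p-values

# Hypothetical p-values for five genes
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.60], q=0.05))
# genes 0 and 1 are declared significant at FDR 5%
```

Unlike a family-wise correction such as Bonferroni, the BH procedure controls the expected proportion of false discoveries among the rejected hypotheses, a trade-off that is usually more appropriate when screening thousands of genes.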
Prediction accuracy with its statistical significance provides an incomplete picture of a classifier's predictive ability, and it should be accompanied by a ROC curve depicting its sensitivity and specificity. Another commonly used method of validation is cross-validation, in which a portion of the data set is removed prior to building the classifier and acts as a test data set; this is repeated for all disjoint partitions. The most common form of error reported by Jafari and Azuaje was using the outcome data to select genes from the full data set instead of performing the gene selection from scratch at each iteration. Several competing software programs are available to perform statistical analysis, but the programs differ somewhat in their utility for calculating or approximating certain results. The software used in a particular study should therefore be noted to help assure reproducibility of results. Dupuy and Simon found that 71% (30 out of 42) of the manuscripts reviewed had not adequately reported the software tools used to perform their data analysis.
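The cross-validation rule described above, redoing gene selection inside every training fold rather than once on the full data set, can be sketched as follows (a nearest-centroid classifier and a two-class setup are illustrative assumptions, as are the function names):

```python
import random

def select_genes(X, y, k):
    """Rank genes by absolute mean difference between the two classes; keep top k."""
    def score(g):
        a = [X[i][g] for i in range(len(y)) if y[i] == 0]
        b = [X[i][g] for i in range(len(y)) if y[i] == 1]
        return abs(sum(a) / len(a) - sum(b) / len(b))
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]

def cross_validated_accuracy(X, y, n_folds=5, k=2, seed=0):
    """Fold-out accuracy with gene selection redone inside each training fold,
    so the held-out samples never influence which genes are used."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    rng.shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    correct = 0
    for fold in folds:
        train = [i for i in idx if i not in fold]
        # selection happens INSIDE the fold -- the step most often skipped
        genes = select_genes([X[i] for i in train], [y[i] for i in train], k)
        centroids = {}
        for c in (0, 1):
            members = [i for i in train if y[i] == c]
            centroids[c] = [sum(X[i][g] for i in members) / len(members)
                            for g in genes]
        for i in fold:
            dist = {c: sum((X[i][g] - centroids[c][j]) ** 2
                           for j, g in enumerate(genes)) for c in (0, 1)}
            correct += min(dist, key=dist.get) == y[i]
    return correct / len(y)
```

Selecting genes once on the full data set and then cross-validating only the classifier leaks outcome information into every fold and yields optimistically biased accuracy, which is exactly the flaw the reviews above report.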
In light of these issues in statistical analysis, there seems to be a need for greater discussion involving researchers, editors, publishers, and decision makers. A set of guidelines would increase awareness of documented biases and errors among researchers and reviewers. Unlike some of the more established biomedical research areas (e.g., treatment evaluation through RCTs and diagnostic test evaluation), guidelines are lacking for microarray studies. This is unanimously accepted by the related scientific community. Dupuy and Simon's paper published in 2007 provides a succinct list of "dos and don'ts" that should be followed in microarray studies until a set of guidelines is developed through consensus building. Allison et al. have provided a list of issues on which the scientific community has reached some consensus and a guide for choosing methods (105). This study also identified unresolved issues in four statistical components of microarray experiments: design, preprocessing, inference and/or classification, and validation of findings. Community-driven efforts such as MIAME (Minimum Information About a Microarray Experiment) (http://www.mged.org/Workgroups/MIAME/miame.html) may be useful for motivating or guiding the definition of a well-defined set of requirements for reporting fundamental data analysis and experimental statistical design factors (106). Another community-driven effort, MAQC (MicroArray Quality Control), was developed by the US Food and Drug Administration with the goal of experimentally addressing the key issues surrounding the reliability of microarray data and developing guidelines for data analysis by providing the scientific community with large reference data sets along with readily accessible reference RNA samples (107).
At the current time, repeatability of published microarray gene expression analyses is reported to be limited and stricter publication rules enforcing public data availability and explicit description of data processing and analysis have been called for (108). 2.9. Reporting Studies with Proteomics Data (http://www.mcponline.org/, http://www.proteomicsjournal.com)
Most proteomics experiments make use of "high-throughput" technologies such as two-dimensional gel electrophoresis (2-DE), mass spectrometry (MS), or protein arrays to simultaneously measure the expression levels of thousands of proteins. Such experiments yield large, multi-dimensional data sets that reflect the biological, technical, and experimental factors of each experiment. Statistical tools are essential for evaluating these data and avoiding false conclusions (109). The field of proteomics has expanded rapidly in the last decade, but the degree of stringency required for the generation and analysis of these data seems to have lagged behind. As a result, many published findings require confirmation and/or validation (110). One recurring problem is that the reproducibility of proteomic techniques is not demonstrated. Statistical methods for assessing reproducibility have been reviewed with the aim of increasing their use (111).
Power analyses to assess the number of samples that should be analyzed to discover a statistically significant result are also rarely undertaken (112, 113). Another common pitfall is concluding that proteins are differentially expressed based on univariate statistical tests (e.g., Student's t-test) in which a normal distribution of the data has been assumed but not tested. This is a particular concern, as proteomic expression data are typically not normally distributed but rather skewed, requiring transformation before many statistical tests can be applied (111, 114). Two journals, Molecular and Cellular Proteomics (115) and Proteomics (110), have provided reporting guidelines as part of their "Instructions to Authors." Both guidelines, developed through a consensus-building effort, are intended to provide sufficient information from which readers can evaluate, interpret, compare, and, if necessary, reproduce the reported study. They contain both mandatory and recommended information in the classes of (a) Ethics Approvals, (b) Study Goals and Design, (c) Subject Source and Description, (d) Biospecimen Qualification, (e) Statistical Considerations, and (f) Technical Considerations. Both guidelines remain too general and will need to be refined and expanded, using comments collected at their websites and suggestions from their advisory committees, to be broadly useful. Community-driven efforts such as MIAPE (Minimum Information About a Proteomics Experiment) may be useful for motivating and guiding the definition of a well-defined set of requirements for data representation in proteomics (to facilitate data comparison, exchange, and verification) and for reporting fundamental data analysis and statistical design factors (116) (http://www.psidev.info). 2.10. Reporting Genome-Wide Association Studies
A major milestone in biomedical research was the completion of the first draft of the human genome sequence. This opened the door to tremendous opportunities for epidemiological studies to evaluate the role of genetic variants in human diseases. Genetic variants may be single nucleotide polymorphisms (SNPs) or structural variants including insertions, deletions, and copy number variations (CNVs). Although the spectrum of gene–disease association can be very broad, ranging from monogenic diseases to polygenic disorders, there is some commonality in the methodological issues. Currently, over 6000 original articles reporting genetic epidemiologic results are published annually (117). The literature is rife with associations of genetic variants, most of which have not been reproduced in independent studies and are thus considered to be false positives (118). With current genomic searches among ∼500,000 variants, and with next-generation sequencing of millions of genetic variants, the problem of finding
associations of genetic variant(s) with a disease can be likened to the proverbial search for a needle in a haystack. Some of the issues that are particularly important in the appraisal of such studies include the analytic validity of genotyping, selection of subjects, confounding (especially through population stratification), gene–environment and gene–gene interactions, statistical power, and multiple statistical comparisons (119). Genotype calls for genetic variants (made either by PCR-based methods or by SNP arrays using DNA hybridization) are subject to technological variability that causes genotyping errors, which can potentially inflate the type I error rate. It is, therefore, important to report the type of calling algorithm used. Another widely recognized source of bias in genetic association studies is population stratification, which can lead to substantial inflation of the test statistic; reporting of diagnostic analyses for stratification increases the credibility of detected associations (120). When population substructure information is available, a stratified analysis can be performed; in its absence, analogous adjustments can be made using genomic control (121), structured association (122), or EIGENSTRAT (123). A clear definition of the goals of the study largely influences the selection of genotype and phenotype measures, the statistical methods used, the kind of inference that may be drawn, and hence the integration of evidence through meta-analysis of similar studies. The credibility of any positive finding depends largely on the power to detect such an association and on the probability of a false positive. The power of a study is largely driven by the sample size of the smallest genetic group among those contrasted. A succinct presentation of power calculations for typical study situations is given in Ioannidis et al. (124).
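Power for a single variant can be approximated with a two-proportion z-test comparing allele frequencies between cases and controls. The sketch below (hypothetical frequencies and allele counts) uses a genome-wide significance level to show how quickly power erodes under stringent multiple-testing thresholds:

```python
from statistics import NormalDist

def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test, e.g. comparing
    risk-allele frequency between cases and controls (normal approximation,
    ignoring the negligible far tail)."""
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha / 2)                 # critical value for two-sided alpha
    p_bar = (p1 + p2) / 2
    se0 = (2 * p_bar * (1 - p_bar) / n_per_group) ** 0.5           # SE under H0
    se1 = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group) ** 0.5   # SE under H1
    delta = abs(p1 - p2)
    return nd.cdf((delta - z_a * se0) / se1)

# Hypothetical: allele frequency 0.30 in cases vs. 0.25 in controls,
# 2000 alleles per group, genome-wide alpha of 5e-8
print(round(power_two_proportions(0.30, 0.25, 2000, alpha=5e-8), 3))
```

At a conventional alpha of 0.05 the same contrast would be well powered; at the genome-wide threshold the power collapses, which is why the sample size of the smallest contrasted group dominates study credibility.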
Although there is no consensus on the "best" method of multiple comparison correction, there is considerable agreement on the need for some strategy to control the type I error (Bonferroni, Benjamini–Hochberg FDR, FPRP, etc.). The credibility of an association is enhanced when it is found in different studies (replication). Lack of replication in a different study population, however, does not necessarily refute the original reported association and may be due to population-specific gene–gene and gene–environment interactions (124). Hence there is an increasing need to consider gene–gene and gene–environment interactions, although deciphering and quantifying them may pose a formidable challenge. Unlike randomized clinical trials or even microarray analysis, genetic association studies are a relatively new research area, and more effort needs to be expended on developing guidelines and/or checklists. However, there are some papers that present guidelines or checklists that may be very useful for biomedical researchers as well as biostatisticians
to appropriately design and analyze genetic association studies (119, 120, 125–127). Ioannidis et al. (124) used a consensus process to develop interim guidance criteria for assessing cumulative epidemiological evidence in genetic associations, proposing a semi-quantitative index that assigns three levels for amount of evidence, extent of replication, and protection from bias. A group effort worthy of mention is the Human Genome Epidemiology Network (HuGENet, http://www.cdc.gov/genomics/hugenet/about.htm), a collaboration of individuals and institutions from diverse backgrounds devoted to the development and dissemination of population-based human genome epidemiologic information.
3. Efforts Toward Education Guidelines may help investigators report details of methods and results in a transparent way, but they cannot prescribe exactly which details need to be reported. Education in experimental and statistical design, in analytic methods and their proper interpretation, and in scientific writing is a critical part of improved statistical reporting. 3.1. State of Statistical Education for Biomedical Researchers
Familiarity with experimental design and statistical analysis techniques is necessary for investigators to conduct state-of-the-art research in molecular biology as well as in other areas of biomedical study. In setting standards for the PhD degree in the molecular biosciences, the Committee on Education of the International Union of Biochemistry and Molecular Biology recommended courses or workshops in statistics (128). This organization also encourages inclusion of a statistician as a member of the PhD candidate's supervisory committee. The NIH has awarded various training grants that require training in statistics. One example is the "K-30 Clinical Research Curriculum Award (CRCA)," designed to attract talented individuals to the challenges of clinical research and to provide them with the critical skills needed to translate basic discoveries into clinical treatments, with emphasis on training in "hypothesis development" and "biostatistical skills." Each of the 51 awards to date is expected to train 10–20 researchers per year (http://grants1.nih.gov/training/K30.htm). Statistical educators have responded to this challenge by providing enhanced training in critical appraisal and biostatistics throughout the continuum of medical education. A wide range of educational methods including online courses (e.g., http://www.statistics.com/, http://onlinestatbook.com/, http://www.graphpad.com/quickcalcs/index.cfm), self-paced textbooks or
computer learning (129), formal academic courses for credit, one-on-one education offered in the context of consultation, intensive short courses (130), and, as continuing education, seminars or lectures covering selected statistical topics followed by a journal club (using a published manuscript that applies the statistical method just presented) (131) have been attempted. These courses typically target medical students, residents, postdoctoral fellows, research faculty, and other interested research staff at medical research institutions. The goals of these efforts are to teach basic and intermediate biostatistical principles and techniques relevant to the design, analysis, and reporting of research studies; to improve skills in the proper interpretation of medical journal articles with regard to statistical issues; to improve the ability to communicate effectively with collaborating biostatisticians; and to enable researchers to conduct simple analyses on their own in order to generate hypotheses from their preliminary data. The goal is NOT to make them biostatisticians or even statistical analysts. Often this goal conflicts with the expectations and/or desires of the student. The effectiveness of the intensive short course has been assessed using a "familiarity survey" of the topics before and after the course and an "overall evaluation" of all aspects of the course, and the results were favorable (130). However, Windish et al. (36) recently performed a multiprogram cross-sectional survey of internal medicine residents' biostatistics knowledge and their ability to interpret study results from published manuscripts. The main outcome measure was the percentage of questions answered correctly on a multiple-choice knowledge test. The survey was completed by 277 of 367 residents (75.5%) in 11 residency programs. In addition to the MD degree, 14% of residents had additional degrees (4% PhD, 10% MPH/MHS/MSc).
Seventy-five percent indicated that they did not understand all of the statistics they encountered in journal articles, but 95% felt it was important to understand these concepts to be an intelligent reader of the literature. More than 68% of respondents had some training in biostatistics, with approximately 70% of this training occurring during medical school. The overall mean percentage correct on statistical knowledge and interpretation of results was 41.4% (95% confidence interval [CI], 39.7–43.3%). Higher scores were associated with additional advanced degrees, prior biostatistics training, enrollment in a university-based training program, and male sex. On individual knowledge questions, 81.6% correctly interpreted a relative risk, but respondents were less likely to know how to interpret an adjusted odds ratio from a multivariate regression analysis (37.4%) or the results of a Kaplan–Meier analysis (10.5%). The authors recommended that residency programs incorporate more effective education in biostatistics, but it is apparent that such training has proven difficult, with systematic
Improved Reporting of Statistical Design and Analysis
reviews showing only limited effectiveness of many journal clubs and evidence-based medicine (EBM) curricula (132–136). Interactive, self-directed, and clinically driven instructional strategies seem to stand the best chance of success (132, 137). Involvement in hypothesis-driven research during training that requires comprehensive reading of the literature may also enhance residents’ knowledge and understanding (138). Tools for retaining this knowledge need to be taught, and booster courses in statistics need to be designed. Another track to improving statistical reporting involves making qualified biostatisticians available to all researchers. Currently, this is not possible for two reasons: (1) inadequate funding for biostatisticians on a research team to develop a good working knowledge of the subject matter (e.g., the specific molecular biology principles needed for an individual study) and (2) an inadequate number of statistics graduates, especially those with additional training in subject areas such as molecular biology. The first issue is being addressed by the addition of statistical reviewers to various study sections of granting agencies, where they have an opportunity to comment on the budget needed for the biostatistical work so that time dedicated to learning the subject matter is planned for. The second issue, further described in Section 3.2, is under consideration by the curriculum committees of many universities.
3.2. Education in Biology for Statisticians
The role of biostatisticians and biostatistics in the design, conduct, and analysis of biomedical research has grown in the last 20 years, especially in the areas of basic laboratory research, biomedical imaging, genetics, genomics, computational biology, neuroscience, epidemiological studies, clinical trials, and health services assessments. Two workshops held between 2001 and 2003 by the NIH examined the need to train more biostatisticians in the United States to meet the increasing opportunities in the biomedical research enterprise (35). There is general agreement that the supply of new PhD graduates in biostatistics in the United States has been relatively steady for the past two decades while demand has increased dramatically. These workshops therefore concluded that a renewed effort, led at least in part by the NIH, is necessary to add to and expand the existing training programs and thereby increase the supply of biostatisticians. Key elements of the recommended training programs include (1) basic theory and methods in biostatistics; (2) knowledge of a biology specialty area; (3) involvement as collaborator and researcher; (4) communication and leadership skills; and (5) training in research ethics and information privacy. Recommendations stemming from these workshops include a call to develop methods ensuring that graduates have the verbal and written skills to be effective in a team research environment. Preparation might include oral presentations at laboratory meetings, journal clubs, and/or seminars; drafting review papers on selected topics; preparing mock grant applications; or attending formal institutional workshops or courses on effective communication. Biostatistical trainees are expected to acquire a broad range of research survival skills to enhance their chances of being effective and successful in collaborative settings and to be able to assume leadership roles in research teams. Ethics training should include topics such as accurate reporting, the ethics of authorship, and appropriate attribution of previous results. Universities granting masters and PhD degrees in biostatistics are adjusting their curricula according to these recommendations, but finding time for this additional training, particularly training in other subject areas, has proven challenging. A key approach to improved statistical design, analysis, and reporting would be to embed well-trained biostatisticians in all research teams, with sufficient time allowed for manuscript peer review for the journals in which their research teams will eventually publish.
3.3. Potential of Two-Way Education During Biostatistical Consulting
Medical students, residents, postdoctoral fellows, and faculty commonly consult with biostatistical experts about study design and data analysis when conducting clinical research. An excellent chapter by Berman et al. (139) offers guidelines for working with statisticians and outlines the type of statistical education that can be expected during a consulting session. A curriculum has also been proposed for consulting biostatisticians to receive further training in understanding their role as teachers (140). It is generally assumed that statistical consultations also serve as occasions for learning, but the burden of using this opportunity is left to the researchers involved. A unique study recently quantified the role of biostatistical training during these consultations and reviewed the connections between biostatistical consultation and education (37). The presence and nature of teaching efforts during biostatistical consults at four academic research institutions over various periods between 1999 and 2005 (237 consultations in total) were recorded and described. A total of 67, 70, 78, and 100 percent of the consulting sessions at the four respective institutions included biostatistical training, with an overall 78% (95% CI: 73–83%) of consultations including an educational component when all consultations were combined. Training covered a wide range of biostatistical topics. The results show that both the need and the opportunity exist for specialized biostatistical instruction during one-on-one sessions between a consulting biostatistician and researchers. Academic researchers are ideally positioned to absorb this kind of training when they initiate a request for assistance with their own research project. Chances of success in understanding a difficult statistical concept are higher when it is explained in the setting
of a researcher’s own work. It is important for biostatisticians to be aware of this educational opportunity, to have the time to deliver such training, and to gain enhanced knowledge of the subject matter in the process. This approach will prove even more effective when combined with the efforts outlined in Sections 3.1 and 3.2.
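The 78% overall rate above is reported with a 95% CI of 73–83%. As a hedged illustration (the study does not state which interval method was used, so the normal-approximation Wald interval below is an assumption), roughly this interval can be reproduced from the 237 recorded consultations:

```python
import math

def wald_ci(p_hat, n, z=1.96):
    """Normal-approximation (Wald) 95% confidence interval for a proportion."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of the estimated proportion
    return p_hat - z * se, p_hat + z * se

# 78% of the 237 recorded consultations included an educational component
low, high = wald_ci(0.78, 237)
print(f"95% CI: {low:.1%} to {high:.1%}")  # approximately 73% to 83%
```

For small samples or proportions near 0 or 1, an exact (Clopper–Pearson) or Wilson interval would be preferred over this approximation.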
4. Efforts Toward Improving Editorial Practice
Although peer review is widely considered the most credible method for selecting manuscripts for publication and improving the quality of accepted papers in scientific journals, the evidence in support of its use is mixed (141). A few studies of the effect of statistical review on the quality of published manuscripts failed to demonstrate a benefit of peer review (142–144). It has been noted that these studies sampled from lower-impact medical journals such as Medicina Clínica, Annals of Emergency Medicine, and the Croatian Medical Journal. Other studies reported improved quality of published manuscripts owing to the inclusion of statistical reviewers in the peer review process (145, 146). However, these studies focused on large, prestigious medical journals (e.g., the Journal of the American Medical Association and the British Medical Journal), which are known for being highly selective for studies with superior designs that are, as a result, easier to report correctly. Potential confounding factors between the type of journal and the quality of statistical reporting include the general review process, editorial practice, and the practice of statistical refereeing. Even high-impact journals are not immune to poor editorial practice in methodological peer review (147). There is evidence, however, that the situation in journals with large readerships is improving. For example, the proportion of journals with a mandatory biostatistical review policy for all published papers more than doubled (from 15 to 37%) over the last two decades (148, 149). Since it has been shown that the appointment of a statistical editor at a small journal does not guarantee improved statistical rigor in manuscripts (143), it seems logical that these problems arise from inadequate editorial practice.
Additional measures are needed, including a stringent editorial policy that includes statistical review for all manuscripts with numerical data and rigorous monitoring of revised manuscripts. Another key issue is the quality of statistical reviews. To obtain relevant statistical reviews, it is essential to have reviewers with a background not only in statistics but preferably in the particular field of biomedicine as well. This issue again loops back to not having enough statistics graduates with a solid background in
Mazumdar, Banerjee, and Van Epps
biology. The availability of time for performing a proper review is also problematic, as these efforts are primarily voluntary. Many journals are aware of this issue, and efforts are underway to improve this area of scientific publishing. A unique study was recently published that aimed to estimate the effects on manuscript quality of either adding a statistical peer reviewer or suggesting that clinical reviewers use checklists such as CONSORT, QUOROM, or STARD (150). The control arm of the study consisted of manuscripts reviewed with “no statistical expert” and “no checklist”. One hundred fifteen manuscripts were randomized, and the increment in the quality of the papers was measured using the modified Goodman scale, a 34-item scale with each item scored from 1 to 5 (147). Two blinded evaluators rated the quality of the manuscripts at initial submission and at the final post-peer-review stage. The estimated effect of adding a statistical reviewer was 5.5 (95% CI: 4.3–6.7), demonstrating a small but significant improvement in quality. Use of a checklist was not found to have an effect on overall quality on the Goodman scale (0.9, 95% CI: −0.3 to +2.1). However, the authors provided a note of caution in interpreting this nonsignificant effect due to a known weakness in their study planning: they did not ask the referees to return the completed checklists and therefore had no guarantee that the checklists were used. A new trial with improved checklist delivery and feedback is therefore needed to draw stronger conclusions. More studies like this one would prove useful in better understanding the roles of statistical reviewers, reporting guidelines, and their joint effect.
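The significance reasoning in this paragraph rests on whether each 95% confidence interval contains zero: the statistical-reviewer effect (4.3–6.7) excludes zero, while the checklist effect (−0.3 to +2.1) does not. A minimal sketch of that check (the interval endpoints are those reported by the trial cited above; the helper function itself is ours):

```python
def ci_excludes_zero(lower, upper):
    """True when a confidence interval excludes 0, i.e., the effect is significant."""
    return not (lower <= 0.0 <= upper)

# Effects on the modified Goodman quality scale reported by the trial (150)
reviewer_effect = (4.3, 6.7)    # adding a statistical peer reviewer
checklist_effect = (-0.3, 2.1)  # suggesting a reporting checklist to clinical reviewers

print(ci_excludes_zero(*reviewer_effect))   # True: a small but significant improvement
print(ci_excludes_zero(*checklist_effect))  # False: no demonstrable effect
```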
5. Discussion

The design, analysis, interpretation, and reporting of biomedical experiments are best performed with the aid of a biomedical research team trained in the principles of statistics, advice from a statistician with a knowledge base in the research subject matter, a journal review system with clear instructions to authors, and statistical editorial consultation. Good statistical reporting depends on correct statistical design, proper data collection and analysis, and appropriate interpretation and writing. Reporting guidelines have been developed to aid investigators in thinking about all the important aspects of their study and its limitations. Guidelines also provide an indirect tutorial for scientific writing by listing the items to be captured in a particular section. Using a guideline is never detrimental (except for the additional time needed to go through a checklist or to create a flowchart) and has the potential to add clarity and improve reporting.
It is important to keep in mind that although guidelines may help investigators report details of methods and results in a transparent way, they cannot prescribe exactly which details need to be reported. It has been best described by Dr. Ransohoff (151): Guidelines for reporting cannot replace the thoughtful reflection and insight of an investigator who explicitly considers: what are all the possible systematic differences between compared groups that could explain the results; what measurements could be checked to see if those biases occurred; and, based on those measurements, can the direction and magnitude of possible biases be estimated, along with their impact on results and interpretation? This kind of detailed consideration should be provided by authors and expected by reviewers and editors, and it needs to be reflected in every step of research, from design and methods to results, analysis and interpretation. In other words, every possible explanation for experimental results should be considered and addressed by the investigator.
The last decade has witnessed considerable activity in developing reporting guidelines: more than 80 of them now exist. They cover a broad spectrum, and six of them (CONSORT, STROBE, QUOROM, MOOSE, STARD, and REMARK) have been described briefly in this chapter. Although a core set of steps is required to optimally develop any reporting guideline, a recent survey of 30 developers of reporting guidelines suggests that most guidelines are created idiosyncratically (152). Contributing factors include the lack of a central repository of available guidelines and a paucity of literature to inform developers on how to develop a reporting guideline (153). To help improve the quality of reporting of health research, a network called EQUATOR (Enhancing the QUAlity and Transparency Of health Research) has been established with initial funding from the UK National Health Service National Knowledge Service and the National Institute for Health Research (http://www.equator-network.org). This new initiative seeks to improve the quality of scientific publications by promoting transparent and accurate reporting. The network is developing resources and training materials related to the reporting of health research and will assist in the development, dissemination, and implementation of robust reporting guidelines. A free newsletter is available electronically from the network’s website. The field of molecular biology is vast in its types of studies and data. Although the six guidelines named above could provide the core elements for many of these studies, they do not cover the expanse of studies involving animal experiments, microarray data, proteomics data, and other genetic marker-based studies. Individual journals and authors have started some efforts in these areas, but the process has not been as rigorous as for CONSORT and the others. It
would be useful to contact the EQUATOR network and use its expertise in developing guidelines in this field. A more effective angle on improving statistical reporting would be to provide education on statistical thinking to biological researchers, to train mathematical statisticians to truly become biostatisticians, and to embed biostatisticians in research teams and in journal review processes. Progress has been made through the creation of training grants with an increased focus on producing more well-trained biostatisticians poised for collaboration, by mandating that courses in statistical thinking be part of education in biomedical research, and by creating free consulting services in academic institutions, with statistical education as an integral part, through the Clinical and Translational Science Awards (http://www.ctsaweb.org/). These efforts need to be continually supported and evaluated. There are many excellent tutorials on various statistical methods written for the clinical audience. Improving PubMed links so that tutorial and guideline papers are displayed as related papers might also help (http://www.ncbi.nlm.nih.gov/pubmed/). A Health Insurance Portability and Accountability Act (HIPAA)-like training for biostatistics, with a quiz at the end for knowledge evaluation, should also be developed. This could serve as basic training for all individuals participating in the research process and would raise awareness that statistics is an integral part of research and that any error in data collection, analysis, or reporting can have a huge impact on scientific progress. Provision for sharing data as supplementary material in an analysis-ready form is another important message that needs further promotion.
This effort, although somewhat time consuming, satisfies healthy scientific skepticism and makes each study more valuable by providing the opportunity for analysis from various angles and by potentially improving meta-analyses based on individual patient data. The ultimate step at which statistical reporting can be improved is the review and editorial process. Increased participation of biostatisticians and of biomedical researchers with statistical backgrounds as reviewers, together with editorial policies requiring statistical review for most manuscripts, are the ways to prevent scandals of misreporting. The primary obstacles in this pursuit are the inadequate availability of time for review and the lack of qualified reviewers. Some journals have started paying fees for statistical review. It might be worthwhile to think of “Statistical Reviewer” as a profession rather than as the voluntary work that statistical researchers provide. This would have the advantage that the reviewer could be dedicated full time to review work and would therefore gain expertise in this endeavor. The disadvantage might be that a professional reviewer might not be able to keep up with newer methodologies due to competing pressures on his or her time.
Studies of the effectiveness of guideline adoption, changes in education curricula, and the incorporation of statistical review have shown mixed results. This underscores the importance of making evaluation an integral part of each change in process: mandatory use of guidelines in the submission and review process, the addition of qualified statisticians to the review team, changes in education systems, modifications of editorial policies to request statistical review for almost all papers, and follow-up on statistical comments from reviewers. We call for the following steps, classified by stakeholder, to be followed to avoid reporting errors.
5.1. For the Biomedical Researchers
(a) Identify the type of study you are performing.
(b) Find one or more reporting guidelines/checklists to assess whether you have the knowledge base to create the “statistical design” and “data analysis plan” for your study; if not, find a statistical collaborator with the needed background.
(c) Use the reporting guidelines/checklists at the planning stage; this will ensure that you capture everything that will be needed for good reporting.
(d) Return to the reporting guidelines/checklists while preparing the manuscript and add the information to your methods section (e.g., that the data are MIAME-compliant and that the study is reported in a STROBE-compliant way).
5.2. For the Research Biostatisticians
(a) Identify the type of study in which you are collaborating.
(b) Find one or more reporting guidelines/checklists that fit the objectives and design of the study. Present the guidelines and the rationale for their requirements (e.g., in a journal club setting) to make the connection to the study being planned. Engage your collaborators in a discussion of data sharing and how it improves the visibility and usefulness of the study.
(c) Consult other statisticians working in the field if you are unsure about some of the requirements.
(d) If you disagree with some of the guidelines, record your comments on the guideline website.
(e) Use the reporting guidelines/checklists at the planning stage to ensure that you capture all the information that will be needed for good reporting.
(f) Return to the reporting guidelines/checklists while preparing the manuscript and add the information to the statistical methods section if it is not reported in the general methods section.
5.3. For the Journal Editors and the Reviewers of the Statistical Section
(a) Identify the types of studies generally expected to be submitted to your journal.
(b) Provide links to all appropriate guidelines, including the EQUATOR network, on the journal website under “Instructions to Authors”.
(c) Require a simple checklist, submitted with each manuscript, marking whether the author followed the items in the guideline. The categories could be “Checked (reported on page ##),” “Not Checked (reasons, if any),” and “Not Applicable.”
(d) Make the completed checklist available to each reviewer and ask them to mark their satisfaction level as “Satisfactory” or “Not Satisfactory.” If a category is not satisfactory, the reviewer should add comments suggesting ways to improve it.
Acknowledgments

Drs. Mazumdar and Banerjee were partially supported by the Clinical Translational Science Center (CTSC) (UL1-RR024996). Dr. Mazumdar was additionally supported by AHRQ RFA-HS-0514, NIGMS R25CA105012, and NCI HHSN261200622004C. Ms. Alison M. Edwards is thanked for assisting in the literature search and for providing comments on the write-up.

References

1. Chan, A., and Altman, D. (2005) Epidemiology and reporting of randomised trials published in PubMed journals, Lancet 365, 1159–1162.
2. Hayden, J. A., Cote, P., and Bombardier, C. (2006) Evaluation of the quality of prognosis studies in systematic reviews, Ann Intern Med 144, 427–437.
3. Lee, C., and Chi, K. (2000) The standard of reporting of health-related quality of life in clinical cancer trials, J Clin Epidemiol 53, 451–458.
4. Mills, E., Loke, Y. K., Wu, P., Montori, V. M., Perri, D., Moher, D., and Guyatt, G. (2004) Determining the reporting quality of RCTs in clinical pharmacology, Br J Clin Pharmacol 58, 61–65.
5. Pocock, S. J., Collier, T. J., Dandreo, K. J., de Stavola, B. L., Goldman, M. B., Kalish, L. A., Kasten, L. E., and McCormack, V. A. (2004) Issues in the reporting of epidemiological studies: A survey of recent practice, BMJ 329, 883.
6. Yoo, K., Shin, H., Chang, S., Choi, B., Hong, Y., Kim, D., Kang, D., Cho, N., Shin, C., and Jin, Y. (2005) Genomic epidemiology cohorts in Korea: Present and the future, Asian Pac J Cancer Prev 6, 238–243.
7. Steels, E., Paesmans, M., Berghmans, T., Branle, F., Lemaitre, F., Mascaux, C., Meert, A., Vallot, F., Lafitte, J., and Sculier, J. (2001) Role of p53 as a prognostic factor for survival in lung cancer: A systematic review of the literature with a meta-analysis, Eur Respir J 18, 705–719.
8. Emerson, J., and Colditz, G. (1983) Use of statistical analysis in the New England Journal of Medicine, N Engl J Med 309, 709–713.
9. Juzych, M., Shin, D., Seyedsadr, M., Siegner, S., and Juzych, L. (1992) Statistical techniques in ophthalmic journals, Arch Ophthalmol 110, 1225–1229.
10. Kanter, M., and Taylor, J. (1994) Accuracy of statistical methods in transfusion: A review of articles from July/August 1992 through June 1993, Transfusion 34, 697–701.
11. Rosenfeld, R., and Rockette, H. (1991) Biostatistics in otolaryngology journals, Arch Otolaryngol Head Neck Surg 117, 1172–1176.
12. Seldrup, J. (1997) Whatever happened to the t-test?, Drug Inf J 31, 745–750.
13. Wang, Q., and Zhang, B. (1998) Research design and statistical methods in Chinese medical journals, JAMA 280, 283–285.
14. Holmes, T. H. (2004) Ten categories of statistical errors: A guide for research in endocrinology and metabolism, Am J Physiol – Endocrinol Metab 286, E495–501.
15. Strasak, A., Zaman, Q., Pfeiffer, K., Gobel, G., and Ulmer, H. (2007) Statistical errors in medical research – a review of common pitfalls, Swiss Med Wkly 137, 44–49.
16. Adam, B., Qu, Y., Davis, J., Ward, M., Clements, M., and Cazares, L. (2002) Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men, Cancer Res 62, 3609–3614.
17. Drake, R., Manne, U., Bao-Ling, A., Ahn, C., Cazares, L., and Semmes, O. (2003) SELDI-TOF-MS profiling of serum for early detection of colorectal cancer, Gastroenterology 124 (Suppl 1), A650.
18. Petricoin, E., Ardekani, A., Hitt, B., Levine, P., Fusaro, V., and Steinberg, S. (2002) Use of proteomic patterns in serum to identify ovarian cancer, Lancet 359, 572–577.
19. Petricoin, E., Ornstein, D., Paweletz, C., Ardekani, A., Hackett, P., and Hitt, B. (2002) Serum proteomic patterns for detection of prostate cancer, J Natl Cancer Inst 94, 1576–1578.
20. Vlahou, A., Schellhammer, P., Mendrinos, S., Patel, K., Kondylis, F., and Gong, L. (2001) Development of a novel proteomic approach for the detection of transitional cell carcinoma of the bladder in urine, Am J Pathol 158, 1491–1502.
21. Zhu, W., Wang, X., Ma, Y., Rao, M., Glimm, J., and Kovach, J. (2003) Detection of cancer-specific markers amid massive mass spectral data, Proc Natl Acad Sci USA 100, 14666–14671.
22. Pollack, A. (2004) New cancer test stirs hope and concern, New York Times Feb 3rd [D1, D6].
23. Marcus, A. (2002) Testing for ovarian cancer is on the way, Wall St J [D1, D2].
24. Diamandis, E. (2003) Re: Serum proteomic patterns for detection of prostate cancer [author reply 90–91], J Natl Cancer Inst 95, 489–490.
25. Diamandis, E. (2004) Mass spectrometry as a diagnostic and a cancer biomarker discovery tool: Opportunities and potential limitations, Mol Cell Proteomics 3, 367–378.
26. Diamandis, E. (2004) OvaCheck: Doubts voiced soon after publication, Nature 430.
27. Diamandis, E. (2003) Point: Proteomic patterns in biological fluids: Do they represent the future of cancer diagnostics?, Clin Chem 49, 1272–1275.
28. Garber, K. (2004) Debate rages over proteomic patterns, J Natl Cancer Inst 96, 816–818.
29. Baggerly, K., Coombes, K., and Morris, J. (2005) Bias, randomization, and ovarian proteomic data: A reply to “Producers and consumers”, Cancer Inform 1, 9–14.
30. Baggerly, K., Edmonson, S., Morris, J., and Coombes, K. (2004) High-resolution serum proteomic patterns for ovarian cancer detection, Endocr Relat Cancer 11, 583–584.
31. Baggerly, K., Morris, J., and Coombes, K. (2004) Reproducibility of SELDI-TOF protein patterns in serum: Comparing datasets from different experiments, Bioinformatics 20, 777–785.
32. Baggerly, K., Morris, J., Edmonson, S., and Coombes, K. (2005) Signal in noise: Evaluating reported reproducibility of serum proteomic tests for ovarian cancer, J Natl Cancer Inst 97, 307–309.
33. Ransohoff, D. (2005) Lessons from controversy: Ovarian cancer screening and serum proteomics, J Natl Cancer Inst 97, 315–319.
34. Ransohoff, D. (2007) How to improve reliability and efficiency of research about molecular markers: Roles of phases, guidelines, and study design, J Clin Epidemiol 60, 1205–1219.
35. DeMets, D., Stormo, G., Boehnke, M., Louis, T., Taylor, J., and Dixon, D. (2006) Training of the next generation of biostatisticians: A call to action in the U.S., Statist Med 25, 3415–3429.
36. Windish, D. M., Huot, S. J., and Green, M. L. (2007) Medicine residents’ understanding of the biostatistics and results in the medical literature, JAMA 298, 1010–1022.
37. Deutsch, R., Hurwitz, S., Janosky, J., and Oster, R. (2007) The role of education in biostatistical consulting, Statist Med 26, 709–720.
38. Anonymous. (1979) Uniform requirements for manuscripts submitted to biomedical journals. International steering committee of medical editors, Br Med J 1, 532–535.
39. Evans, M. (1989) Presentation of manuscripts for publication in the British Journal of Surgery, Br J Surg 76, 1311–1315.
40. Altman, D. G., Gore, S. M., Gardner, M. J., and Pocock, S. J. (1983) Statistical guidelines for contributors to medical journals, BMJ 286, 1489–1493.
41. Curran-Everett, D., Benos, D. J., and the American Physiological Society (2004) Guidelines for reporting statistics in journals published by the American Physiological Society, Am J Physiol – Endocrinol Metab 287, E189–191.
42. Bossuyt, P. M., Reitsma, J. B., Bruns, D. E., Gatsonis, C. A., Glasziou, P. P., Irwig, L. M., Lijmer, J. G., Moher, D., Rennie, D., and De Vet, H. C. W., for the Standards for Reporting of Diagnostic Accuracy (STARD) group (2003) Towards complete and accurate reporting of studies of diagnostic accuracy: The STARD initiative, Am J Roentgenol 181, 51–55.
43. McShane, L. M., Altman, D. G., Sauerbrei, W., Taube, S. E., Gion, M., and Clark, G. M., for the Statistics Subcommittee of the NCI-EORTC Working Group on Cancer Diagnostics (2005) REporting recommendations for tumour MARKer prognostic studies (REMARK) [see comment], Br J Cancer 93, 387–391.
44. Dickinson, K., Bunn, F., Wentz, R., Edwards, P., and Roberts, I. (2000) Size and quality of randomized controlled trials in head injury: Review of published studies, BMJ 320, 1308–1311.
45. Hotopf, M., Lewis, G., and Normand, C. (1997) Putting trials on trial – the costs and consequences of small trials in depression: A systematic review of methodology, J Epidemiol Community Health 51, 354–358.
46. Thornley, B., and Adams, C. (1998) Content and quality of 2000 controlled trials in schizophrenia over 50 years, BMJ 317, 1181–1184.
47. Pildal, J., Hróbjartsson, A., Jørgensen, K., Hilden, J., Altman, D., and Gotzsche, P. (2007) Impact of allocation concealment on conclusions drawn from meta-analyses of randomized trials, Int J Epidemiol 36, 847–857.
48. Begg, C., Cho, M., Eastwood, S., Horton, R., Moher, D., Olkin, I., Pitkin, R., Rennie, D., Schulz, K., Simel, D., and Stroup, D. (1996) Improving the quality of reporting of randomized controlled trials. The CONSORT statement, JAMA 268, 637–639.
49. Egger, M., Jüni, P., and Bartlett, C. (2001) The value of patient flow charts in reports of randomized controlled trials: Bibliographic study. The CONSORT group, JAMA 285, 1996–1999.
50. Devereaux, P. J., Manns, B. J., Ghali, W. A., Quan, H., and Guyatt, G. H. (2002) The reporting of methodological factors in randomized controlled trials and the association with a journal policy to promote adherence to the consolidated standards of reporting trials (CONSORT) checklist, Control Clin Trials 23, 380–388.
51. Moher, D., Jones, A., and Lepage, L. (2001) Use of the CONSORT statement and quality of reports of randomized trials: A comparative before and after evaluation? The CONSORT group, JAMA 285, 1992–1995.
52. Moher, D., Schulz, K., and Altman, D., for the CONSORT Group (2001) The CONSORT statement: Revised recommendations for improving the quality of reports of parallel-group randomized trials, Lancet 357, 1191–1194.
53. Mills, E. J., Wu, P., Gagnier, J., and Devereaux, P. J. (2005) The quality of randomized trial reporting in leading medical journals since the revised CONSORT statement, Contemp Clin Trials 26, 480–487.
54. Glasziou, P., Vandenbroucke, J., and Chalmers, I. (2004) Assessing the quality of research, BMJ 328, 39–41.
55. Lee, W., Bindman, J., Ford, T., Glozier, N., and Moran, P. (2007) Bias in psychiatric case-control studies: Literature survey, Br J Psychiatry 190, 204–209.
56. Tooth, L., Ware, R., Bain, C., Purdie, D., and Dobson, A. (2005) Quality of reporting of observational longitudinal research, Am J Epidemiol 161, 280–288.
57. Bogardus, S., Concato, J., and Feinstein, A. (1999) Clinical epidemiological quality in molecular genetic research: The need for methodological standards, JAMA 281, 1919–1926.
58. Anonymous. (1981) Guidelines for documentation of epidemiologic studies. Epidemiology work group of the interagency regulatory liaison group, Am J Epidemiol 114, 609–613.
59. von Elm, E., Altman, D., Egger, M., Pocock, S., Gøtzsche, P., and Vandenbroucke, J., for the STROBE Initiative (2007) The strengthening the reporting of observational studies in epidemiology (STROBE) statement: Guidelines for reporting observational studies, Lancet 370, 1453–1457.
60. Sacks, H., Berrier, J., Reitman, D., Ancona-Berk, V., and Chalmers, T. (1987) Meta-analyses of randomized controlled trials, N Engl J Med 316, 450–455.
61. Sacks, H., Reitman, D., Pagano, D., and Kupelnick, B. (1996) Meta-analysis: An update, Mt Sinai J Med 63, 216–224.
Improved Reporting of Statistical Design and Analysis 62. Moher, D., Cook, D. J., Eastwood, S., Olkin, I., Rennie, D., and Stroup, D. F. (1999) Improving the quality of reports of metaanalyses of randomised controlled trials: The QUOROM statement. Quality of reporting of meta-analyses [see comment], Lancet 354, 1896–1900. 63. Delaney, A., Bagshaw, S., Ferland, A., Manns, B., Laupland, K., and Doig, C. (2005) A systematic evaluation of the quality of metaanalyses in the critical care literature, Crit Care 9, R575–R582. 64. Oxman, A., and Guyatt, G. (1991) Validation of an index of the quality of review articles, J Clin Epidemiol 44, 1271–1278. 65. Wen, J., Ren, Y., Wang, L., Li, Y., Liu, Y., Zhou, M., Liu, P., Ye, L., Li, Y., and Tian, W. (2008) The reporting quality of metaanalyses improves: A random sampling study, J Clin Epidemiol 6, 770–775. 66. Hind, D., and Booth, A. (2007) Do health technology assessments comply with QUOROM diagram guidance? An empirical study, BMC Med Res Methodol 7, 1–9. 67. Shapiro, S. (1994) Meta-analysis/shmetaanalysis, Am J Epidemiol 140, 771–778. 68. Stroup, D., Thacker, S., Olson, C., Glass, R., and Hutwagner, L. (2001) Characteristics of meta-analyses related to acceptance for publication in a medical journal, J Clin Epidemiol 54, 655–660. 69. Blettner, M., Sauerbrei, W., and Schlehofer, B. (1999) Traditional reviews, meta-analyses and pooled analyses in epidemiology, Int J Epidemiol 28, 1–9. 70. Easterbrook, P., Berlin, J., Gopalan, R., and Matthews, D. (1991) Publication bias in clinical research, Lancet 337, 867–872. 71. Schlesselman, J. (1997) Risk of endometrial cancer in relation to use of combined oral contraceptives: A practitioner s guide to meta-analysis, Hum Reprod 12, 1851–1863. 72. Corvol, J. C., Anzouan-Kacou, J. B., Fauveau, E., Bonnet, A. 
M., Lebrun-Vignes, B., Girault, C., Agid, Y., Lechat, P., Isnard, R., Lacomblez, L., Corvol, J.-C., AnzouanKacou, J.-B., Fauveau, E., Bonnet, A.-M., Lebrun-Vignes, B., Girault, C., Agid, Y., Lechat, P., Isnard, R., and Lacomblez, L. (2007) Heart valve regurgitation, pergolide use, and parkinson disease: An observational study and meta-analysis, Arch Neurol 64, 1721–1726. 73. Fryback, D., and Thornbury, J. (1991) The efficacy of diagnostic imaging, Med Decis Making 11, 88–94. 74. Guyatt, G., Tugwell, P., Feeny, D., Haynes, R., and Drummond, M. (1986) A framework for clinical evaluation of diagnos-
75.
76.
77. 78.
79. 80.
81.
82.
83.
84.
85.
86.
87.
595
tic technologies, Can Med Assoc J 134, 587–594. Kent, D., and Larson, E. (1992) Disease, level of impact, and quality of research methods. Three dimensions of clinical efficacy assessment applied to magnetic resonance imaging, Invest Radiol 27, 245–254. Griner, P., Mayewski, R., Mushlin, A., and Greenland, P. (1981) Selection and interpretation of diagnostic tests and procedures. Principles and applications, Ann Intern Med 94, 557–592. Metz, C. (1978) Basic principles of ROC analysis, Semin Nucl Med 8, 283–298. Sackett, D., Haynes, R., Guyatt, G., and Tugwell, P. (1991) The selection of diagnostic tests, Little, Brown and Company, Boston, Toronto, and London. Begg, C. B. (1987) Biases in the assessment of diagnostic tests, Stat Med 6, 411–423. Lijmer, J., Mol, B., Heisterkamp, S., Bonsel, G., Prins, M., and van der Meulen, J. (1999) Empirical evidence of design related bias in studies of diagnostic tests JAMA 282, 1061–1066. Reid, M., Lachs, M., and Feinstein, A. (1995) Use of methodological standards in diagnostic test research. Getting better but still not good, JAMA 274, 645–651. Devries, S., Hunink, M., and Polak, J. (1996) Summary receiver operating characteristic curves as a technique for meta-analysis of the diagnostic performance of duplex ultrasonography in peripheral arterial disease, Acad Radiol 3, 361–369. Nelemans, P., Leiner, T., de Vet, H., and van Engelshoven, J. (2000) Peripheral arterial disease: Meta-analysis of the diagnostic performance of MR angiography, Radiology 217, 105–114. Coppus, S., van der Veen, F., Bossuyt, P., and Mol, B. (2006) Quality of reporting of test accuracy studies in reproductive medicine: Impact of the standards for reporting of diagnostic accuracy (STARD) initiative, Fertil Steril 86, 1321–1329. Wilczynski, N. (2008) Quality of reporting of diagnostic accuracy studies: No change since STARD statement publication – beforeand-after study, Radiology 248, 817–823. Smidt, N., Rutjes, A. W. S., van der Windt, D. A. W. 
M., Ostelo, R. W. J. G., Reitsma, J. B., Bossuyt, P. M., Bouter, L. M., and de Vet, H. C. W. (2005) Quality of reporting of diagnostic accuracy studies, Radiology 235, 347–353. Smidt, N., Overbeke, J., de Vet, H., and Bossuyt, P. (2007 ) Endorsement of the STARD statement by biomedical journals:
596
88.
89.
90.
91.
92.
93.
94.
95.
96. 97.
98.
99. 100.
Mazumdar, Banerjee, and Van Epps Survey of instructions for authors, Clin Chem 53, 1983–1985. Bossuyt, P. (2008) STARD statement: Still room for improvement in the reporting of diagnostic accuracy, Radiology 248, 713– 714. Altman, D., Lausen, B., Sauerbrei, W., and Schumacher, M. (1994) Dangers of using “optimal” cutpoints in the evaluation of prognostic factors, J Natl Cancer Inst 86, 829–835. Brundage, M., Davies, D., and Mackillop, W. (2002) Prognostic factors in non-small cell lung cancer: a decade of progress, Chest 122, 1037–1057. Burton, A., and Altman, D. (2004) Missing covariate data within cancer prognostic studies: A review of current reporting and proposed guidelines, Br J Cancer 91, 4–8. Mirza, A., Mirza, N., Vlastos, G., and Singletary, S. (2002) Prognostic factors in nodenegative breast cancer: A review of studies with sample size more than 200 and follow-up more than 5 years, Ann Surg 235, 10–26. Popat, S., Matakidou, A., and Houlston, R. (2004) Thymidylate synthase expression and prognosis in colorectal cancer: A systematic review and meta-analysis, J Clin Oncol 22, 529–536. Riley, R., Abrams, K., Sutton, A., Lambert, P., Jones, D., Heney, D., and Burchill, S. (2003) Reporting of prognostic markers: Current problems and development of guidelines for evidence-based practice in the future, Br J Cancer 88, 1191–1198. Riley, R., Burchill, S., Abrams, K., Heney, D., Sutton, A., Jones, D., Lambert, P., Young, B., Wailoo, A., and Lewis, I. (2003) A systematic review of molecular and biological markers in tumours of the Ewing’s sarcoma family, Eur J Cancer 39, 19–30. Festing, M. (1994) Reduction of animal use: Experimental design and quality of experiments, Lab Anim 28, 212–221. Festing, M., and Lovell, D. (1995) The need for statistical analysis of rodent micronucleus test data: Comment on the paper by Ashby and Tinwell, Mutat Res 329, 221–224. Festing, M., and Lovell, D. 
(1996) Reducing the use of laboratory animals in toxicological research and testing by better experimentaldesign, J R Stat Soc 58 (B-Methodol), 127–140. McCance, I. (1995) Assessment of statistical procedures used in papers in the Australian veterinary journal, Aust Vet J 72, 322–328. Ntzani, E., and Ioannidis, J. (2003) Predictive ability of DNA microarrays for cancer
101.
102.
103.
104.
105.
106.
107.
108.
109.
110.
outcomes and correlates: An empirical assessment, Lancet 362, 1439–1444. Michiels, S., Koscielny, S., and Hill, C. (2005) Prediction of cancer outcome with microarrays: A multiple random validation strategy, Lancet 365, 488–492. Yang, M., Ruan, Q., Yang, J., Eckenrode, S., Wu, S., McIndoe, R., and She, J. (2001) A statistical method for flagging weak spots improves normalization and ratio estimates in microarrays, Physiol Genomics 7, 45–53. Dupuy, A., and Simon, R. M. (2007) Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting, J Natl Cancer Inst 99, 147–157. Jafari, P., and Azuaje, F. (2006) An assessment of recently published gene expression data analyses: Reporting experimental design and statistical factors BMC Med Inform and Decis Mak 6, 1–8. Allison, D., Cui, X., Page, G., and Sabripour, M. (2006) Microarray data analysis: From disarray to consolidation and consensus, Nat Rev Genet 7, 55–65. Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C. A., Causton, H. C., Gaasterland, T., Glenisson, P., Holstege, F. C., Kim, I. F., Markowitz, V., Matese, J. C., Parkinson, H., Robinson, A., Sarkans, U., Schulze-Kremer, S., Stewart, J., Taylor, R., Vilo, J., and Vingron, M. (2001) Minimum information about a microarray experiment (MIAME)-toward standards for microarray data, Nat Genet 29, 365–371. Shi, L., Reid, L. H., Jones, W. D., Shippy, R., Warrington, J. A., Baker, S. C., Collins, P. J., de Longueville, F., Kawasaki, E. S., Lee, K. Y., et al. (2006) The MicroArray Quality Control (MAQC) project shows interand intraplatform reproducibility of gene expression measurements, Nat Biotechnol 24, 1151–1161. Ioannidis, J., Allison, D., Ball, C., Coulibaly, I., Cui, X., Culhane, A., Falchi, M., Furlanello, C., Game, L., Jurman, G., Mangion, J., Mehta, T., Nitzberg, M., Page, G., Petretto, E., and van Noort, V. 
(2009) Repeatability of published microarray gene expression analyses, Nat Genet 41, 149–155. Urfer, W., Grzegorczyk, M., and Jung, K. (2006) Statistics for proteomics: A review of tools for analyzing experimental data, Proteomics 1–2, 48–55. Wilkins, M., Appel, R., Van Eyk, J., Chung, C., Görg, A., Hecker, M., Huber, L., Langen, H., Link, A., Paik, Y., Patterson, S.,
Improved Reporting of Statistical Design and Analysis
111.
112.
113.
114.
115.
116.
117.
118. 119.
120.
Pennington, S., Rabilloud, T., Simpson, R., Weiss, E., and Dunn, M. (2006) Guidelines for the next 10 years of proteomics, Proteomics 6, 4–8. Hunt, S., Thomas, M., Sebastian, L., Pedersen, S., Harcourt, R., Sloane, A., and Wilkins, M. (2005) Optimal replication and the importance of experimental design for gel-based quantitative proteomics, J Proteome Res 4, 809–819. Molloy, M., Brzezinski, E., Hang, J., McDowell, M., and VanBogelen, R. (2003) Overcoming technical variation and biological variation in quantitative proteomics, Proteomics 3, 1912–1919. Karp, N., and Lilley, K. (2005) Maximising sensitivity for detecting changes in protein expression: Experimental design using minimal CyDyes, Proteomics 5, 3105–3115. Karp, N., Kreil, D., and Lilley, K. (2004) Determining a significant change in protein expression with DeCyder during a pairwise comparison using two-dimensional difference gel electrophoresis, Proteomics 4, 1421–1432. Celis, J., Carr, S., and Bradshaw, R. (2008) New guidelines for clinical proteomics manuscripts, Mol Cell Proteomics 7, 2071–2072. Taylor, C., Paton, N., Lilley, K., Binz, P., Julian, R. J., Jones, A., Zhu, W., Apweiler, R., Aebersold, R., Deutsch, E., Dunn, M., Heck, A., Leitner, A., Macht, M., Mann, M., Martens, L., Neubert, T., Patterson, S., Ping, P., Seymour, S., Souda, P., Tsugita, A., Vandekerckhove, J., Vondriska, T., Whitelegge, J., Wilkins, M., Xenarios, I., Yates, J. r., and Hermjakob, H. (2007) The minimum information about a proteomics experiment (MIAPE), Nat Biotechnol 25, 887–893. Lin, B., Clyne, M., and Walsh, M. (2006) Tracking the epidemiology of human genes in the literature: The HuGE published literature database, Am J Epidemiol 164, 1–4. Ioannidis, J., Ntzani, E., and Trikalinos, T. (2001) Replication validity of genetic association studies, Nat Genet 29, 306–309. 
Little, J., Bradley, L., Bray, M., Clyne, M., Dorman, J., Ellsworth, D., Hanson, J., Khoury, M., Lau, J., O’Brien, T., Rothman, N., Stroup, D., Taioli, E., Thomas, D., Vainio, H., Wacholder, S., and Weinberg, C. (2002) Reporting, appraising, and integrating data on genotype prevalence and genedisease associations, Am J Epidemiol 156, 300–310. Ziegler, A., Ewhida, A., Brendel, M., and Kleensang, A. (2009) More powerful haplotype sharing by accounting for the
121. 122. 123.
124.
125.
126. 127.
128.
129. 130.
131.
132.
133.
134.
597
mode of inheritance, Genet Epidemiol 33(3): 228–236. Devlin, B., and Roeder, K. (1999) Genomic control for association studies, Biometrics 55, 997–1004. Pritchard, J., Stephens, M., and Rosenberg, N. (2000) Association mapping in structured populations, Am J Hum Genet 67, 170–181. Price, A., Patterson, N., and Plenge, R. (2006) Principal components analysis corrects for stratification in genomewide association studies, Nature Genet 38, 904–909. Ioannidis, J., Boffetta, P., Little, J., O’Brien, T., Uitterlinden, A., Vineis, P., Balding, D., Chokkalingam, A., Dolan, S., and Flanders, W. (2007) Assessment of cumulative evidence on genetic associations: Interim guidelines, Int J Epidemiol 37, 120–132. Ehm, M., Nelson, M., and Spurr, N. (2005) Guidelines for conducting and reporting whole genome/large-scale association studies, Hum Mol Genet 14, 2485–2488. Weiss, S. (2001) Association studies in asthma genetics, Am J Respir Crit Care Med 164, 2014–2015. Freimer, N., and Sabatti, C. (2005) Guidelines for association studies in human molecular genetics, Hum Mol Genet 14, 2481– 2483. IUBMB. (2000) Standards for the PhD degree in the molecular biosciences: Recommendation of the committee on education of the international union of biochemistry and molecular biology, BioFactors 11, 201–215. Kleinbaum, D., and Klein, M. (2005) Survival analysis: A self-learning text (Statistics for Biology and Health), Springer, New York. Ambrosius, W., and Manatunga, A. (2002) Intensive short courses in biostatistics for fellows and physicians, Stat Med 21, 2739–2756. Deutsch, R. (2002) A seminar series in applied biostatistics for clinical research fellows, faculty and staff, Stat Med 21, 801–810. Coomarasamy, A., and KS, K. (2004) What is the evidence that postgraduate teaching in evidence based medicine changes anything? A systematic review, BMJ 329, 1017. Ebbert, J., Montori, V., and Schultz, H. 
(2001) The journal club in postgraduate medical education: A systematic review, Med Teach 23, 455–461. Norman, G., and Shannon, S. (1998) Effectiveness of instruction in critical appraisal (evidence-based medicine) skills: A critical appraisal, CMAJ 158, 177–181.
598
Mazumdar, Banerjee, and Van Epps
135. Parkes, J., Hyde, C., Deeks, J., and Milne, R. (2001) Teaching critical appraisal skills in health care settings, Cochrane Database Syst Rev 3, CD001270. 136. Taylor, R., Reeves, B., Ewings, P., Binns, S., Keast, J., and Mears, R. (2000) A systematic review of the effectiveness of critical appraisal skills training for clinicians, Med Educ 34, 120–125. 137. Khan, K., and Coomarasamy, A. (2006) A hierarchy of effective teaching and learning to acquire competence in evidenced-based medicine, BMC Med Educ 6, 59. 138. Rogers, L. (1999) The “win-win” of research, Am J Roentgenol 172, 877. 139. Berman, N., and Gullion, C. (2007) Working with a Statistician, Methods in Molecular Biology: Topics in Biostatistics 404, 489–503. 140. Tobi, H., Kuik, D., Bezemer, P., and Ket, P. (2001) Towards a curriculum for the consultant biostatistician: identification of central disciplines, Stat in Med 20, 3921–3929. 141. Altman, D. (1998) Statistical reviewing for medical journals, Stat Med 17, 2661–2674. 142. Arnau, C., Cobo, E., Cardellach, F., Ribera, J., Selva, A., and Urrutia, A. (2001) Effect of statistical review on manuscript quality in Medicina Clínica, in International Congress on Peer Review in Biomedical Publication, Barcelona, Spain. 143. Lukic, I., and Marušic, M. (2001) Appointment of statistical editor and quality of statistics in a small medical journal, Croat Med J 42, 500–503. 144. Schriger, D., Cooper, R., Wears, R., and Waeckerle, J. (2001) The effect of dedicated methodology/statistical review on published manuscript quality, in Fourth International
145.
146. 147.
148.
149.
150.
151. 152.
153.
Congress on Peer Review in Biomedical Publication, Barcelona, Spain. Gardner, M., and Bond, J. (1990) An exploratory study of statistical assessment of papers published in the British Medical Journal, JAMA 263, 1355–1357. Schor, S., and Karten, I. (1966) Statistical evaluation of medical journal manuscripts, JAMA 195, 1123–1128. Goodman, S., Berlin, J., Fletcher, S., and Fletcher, R. (1994) Manuscript quality before and after peer review and editing at annals of internal medicine, Ann Intern Med 121, 11–21. George, S. (1985) Statistics in medical journals: A survey of current policies and proposal for editors, Med Pediatr Oncol 13, 109–112. Goodman, S., Altman, D., and George, S. (1998) Statistical reviewing policies of medical journals: Caveat lector?, J Gen Intern Med 13, 753–756. Cobo, E., Selva-O’Callagham, A., Ribera, J. M., Cardellach, F., and Dominguez, R. (2007) Statistical reviewers improve reporting in biomedical articles: A randomized trial, PLoS ONE 2, e332. Ransohoff, D. (2005) Bias as a threat to the validity of cancer molecular-marker research, Nat Rev Cancer 5, 142–149. Simera, I., Altman, D., Moher, D., Schulz, K., and Hoey, J. (2008) Guidelines for reporting health research: The EQUATOR network’s survey of guideline, PLoS Med 5, 869–874. Altman, D. G. (2005) Endorsement of the CONSORT statement by high impact medical journals: Survey of instructions for authors, BMJ 330, 1056–1057.
Chapter 23

Stata Companion

Jennifer Sousa Brennan

Abstract

This chapter is an introductory reference guide highlighting some of the most common statistical topics, broken down into both command-line syntax and graphical-interface point-and-click commands. The chapter serves to supplement more formal statistics lessons and to expedite using Stata to compute basic analyses.

Key words: Stata, statistics, frequency tables, variable transformations, hypothesis testing, ANOVA/ANCOVA, nonparametric methods, correlation, linear regression, logistic regression, survival analysis.

H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_23, © Springer Science+Business Media, LLC 2010

1. Introduction

Learning a statistical software package is essential to modern data analysis. Although the software will not do the hard work for you, avoiding computation by hand drastically reduces the analysis time and allows you to work with larger data sets. There are a number of excellent statistical software packages available, ranging from the more programming-based packages such as SAS and S+ (and its free counterpart R) to the arguably more beginner-friendly "point-and-click" variety such as Stata, SPSS, and Minitab. While every analyst has their own preference, each package conducts basic analyses similarly. Stata is especially helpful to programming-shy scientists because it incorporates easy-to-follow menus with point-and-click options. For more programming-inclined practitioners,
Stata allows for use of programming syntax. Many experienced analytic software users agree that Stata is relatively easy to learn, simple to use, and allows for a great amount of growth as statistical knowledge and programming skills improve. Many superb books and technical web sites show how to use and explore Stata, and these instructional tools highlight a variety of learning methods. If you are interested in an introductory course or step-by-step examples, seek out and study those resources. This chapter is not designed to provide statistical rules of thumb, statistical guidance, or other general statistical instruction. In fact, you will find that as you gain experience with Stata, there is more than one way to successfully analyze data; many roads can be built to the same end. This chapter is intended to serve as a handy reference while conducting your analyses in Stata. The material is based on several introductory biostatistics courses taught at Cornell Medical School. The chapter is broken down into major categories.

1.1. Menus and Commands in This Chapter
Each topic provides both the graphical user interface's point-and-click commands and the formulaic command-line syntax to be used in the Stata command window. Additionally, we have provided an example of syntax use for each topic. Variables used in the example commands are written in italics, so that you can readily identify the variables needed to input into the syntax. The italicized variable names suggest their variable type. For example: catvar is a categorical variable, depvar is a dependent variable, and var1 var2 are two independent variables to input into the syntax. In some cases, when additional comments seemed helpful, we included them after the code and a double forward slash (//). For example,
browse // opens all variables in a read-only worksheet
browse var1 var2 var3 // opens only variables var1, var2, and var3 in a read-only worksheet
The two comments illustrate the differences between the two browse command lines.

1.2. Graphical User Interface
Each topic is followed by sequences of menu and dialog box selections, with arrows showing how the selections progress. In order to emphasize additional menu options, the sequences also include dialog box information. (Note: the syntax listings in this chapter do not include all possible commands. To see the unabridged command information, including all available options, refer to the Stata help menu or via help topic in the Stata command window.) For example, selecting to summarize variables and display additional statistics would be shown as follows:
Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Summary statistics > Display additional statistics.
Note: Instead of using the point-and-click commands, command code can be used. The command syntax for each point-and-click command is listed for each topic below the graphical user interface sequence.

1.3. Command-Line Interface
All of the command-line and option information in this chapter comes directly from Stata. To see the complete, unabridged list of command information, including all available options, please refer to the Stata help menu (or via help topic in the Stata command window). According to Stata, the basic Stata language syntax, with very few exceptions, is

[prefix :] command [varlist] [=exp] [if] [in] [, options]

language element   description          see
----------------------------------------------------------
prefix :           prefix command       help prefix
command            Stata command        help command
varlist            variable list        help varlist
=exp               expression           help exp
if                 if exp qualifier     help if
in                 in range qualifier   help in
options            options              help options
----------------------------------------------------------
It is useful to note that the items in brackets are optional and are used to modify the scope of the command.

command is the syntax placeholder for the keyword for the topic command.

varlist is the syntax placeholder for the list of variables to be used by the command.

prefix_cmd is the syntax placeholder modifying the command's application. The most common prefix is by. The by command
specifies the groups of observations that the command is separately applied to. Another important prefix command is xi, an interaction expansion command: xi expands terms containing categorical variables into indicator (also called dummy) variables.

if is a syntax placeholder controlling which observations are used in the commands. The if element defines the conditions on the data and controls which observations are to be used; it takes the form of a Boolean expression, for example, if weight > 130. Only observations for which the if expression is true will be used by the command. In this example, only the observations with a weight greater than 130 will be used.

in is also a syntax placeholder controlling which observations are used in the commands. The in element specifies the observation numbers to be used by the command, usually by listing a range of observation numbers, for example, in 1/100. In this example, only observations 1-100 (inclusive) will be used by the command.

options is the syntax placeholder for additional options that modify the command operation. The options must be separated from the rest of the command line by a single comma. Only one comma is needed; do not separate multiple options by multiple commas. For example, the Stata syntax to summarize variables and display additional statistics is

syntax: summarize [varlist] [if][in] [, options]
options     description
------------------------------------------------------
detail      display additional statistics
meanonly    suppress the display; calculate only the mean

summarize var1 var2 var3, detail
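Combining the elements above, the prefixes and qualifiers attach to an ordinary command line. A few hedged examples, assuming hypothetical variables weight, sex, and catvar (by requires sorted data, so the sort option is added to the prefix):

```stata
by sex, sort: summarize weight    // summary statistics of weight, separately for each level of sex
summarize weight if weight > 130  // restrict to observations with weight greater than 130
summarize weight in 1/100         // restrict to observations 1-100
xi: regress weight i.catvar       // xi expands catvar into indicator (dummy) variables before regression
```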
2. Reference Guide

2.1. Some Stata Basics
Open data in Stata if data is in Stata format
File > Open > File name > "c:\data_name.dta" > File of Type > Stata data (*.dta)
syntax: use filename
use "c:\data_name.dta", clear
Open data if data in tab-delimited, text format
File > Import > ASCII data created by a spreadsheet > File of type: All (*.*) > ASCII dataset file name > "c:\data_name.txt" > Delimiter > Automatically determine delimiter // data is in tab-delimited, text format2
syntax: insheet [varlist] using filename [, options]
options    description
--------------------------------------------------
tab        tab-delimited data
clear      replace data in memory

insheet using "c:\data_name.txt"
Edit data
Data > Data editor // opens up variable worksheet to edit
syntax: edit [varlist] [if][in] [, nolabel]

The nolabel option causes the underlying numeric values, rather than the label values (equivalent strings), to be displayed for variables with value labels.
edit // opens all variables in a worksheet to edit
edit var1 var2 var3 // opens only variables var1, var2, and var3 in a worksheet to edit
Reshape structure of data from “long” to “wide” or “wide” to “long”
syntax: reshape wide stubnames, i(varlist) j(varname)
options       description
--------------------------------------------------
Main
i(varlist)    i(varlist) specifies the variables whose unique values denote a logical observation. i() is required.
j(varname)    j(varname [values]) specifies the variable whose unique values denote a subobservation.

2 To save a spreadsheet as a tab-delimited text file: in Excel, click File > Save as > Save as type > Text (tab delimited).
reshape wide outcomevar, i(var1) j(var2) // Transforms outcomevar variable from long to wide; var1 uniquely identifies the outcome variable in the wide form. The transposed outcome variable will take its suffix from var2

syntax: reshape long stubnames, i(varlist) j(varname)
options       description
--------------------------------------------------
Main
i(varlist)    i(varlist) specifies the variables whose unique values denote a logical observation. i() is required.
j(varname)    j(varname [values]) specifies the variable whose unique values denote a subobservation.

reshape long outcomevar, i(var1) j(var2) // Transforms outcomevar variable from wide to long; var1 uniquely identifies the outcome variable in the long form. The transposed outcome variable will take its suffix from var2
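To make the direction of reshape concrete, here is a sketch with hypothetical variables id, week, and bp (a measurement recorded repeatedly per subject):

```stata
* long form: one row per (id, week) pair, with a single variable bp
reshape wide bp, i(id) j(week)   // creates bp1, bp2, ..., with one row per id
reshape long bp, i(id) j(week)   // returns to one row per (id, week) pair
```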
Keyword search in Stata
Help > Search > Search documentation and FAQs
syntax: search word [word ...] [, search_options]
findit word [word ...]
search options  description
----------------------------------------------------
local           search using Stata's keyword database
net             search across materials available via Stata's net command
all             search across both the local keyword database and the net material
manual          search the entries in the Stata Documentation

search topic
Display online help
Help > Contents
syntax for displaying help information in Viewer:
help [command_or_topic_name] [, nonew name(viewername) marker(markername)]
syntax for displaying help information in Results window:
chelp [command_or_topic_name]

help tab
2.2. Summary Statistics
Summarize variables
Statistics > Summaries, tables, and tests > Summary statistics // Displays mean, std dev, min, max
syntax: summarize [varlist] [if] [in] [, options]
options     description
---------------------------------------------------
detail      display additional statistics
meanonly    suppress the display; calculate only the mean

summarize var1 var2 var3
Save summary statistics as a variable
Step 1: tabstat var1 var2, statistics(min mean p50 max) by(catvar1) columns(variables) save // saves the summary statistics in a matrix
Step 2: return list
matrix list r(StatTotal)
matrix stats = r(StatTotal) // converts the special matrix into typical matrix
Step 3: svmat stats, name(varname) // saves the summary statistics as a variable
Confidence intervals
Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Confidence intervals
syntax: ci [varlist] [if][in] [, options]
options     description
------------------------------------------------
exact       calculate exact confidence intervals
level(#)    set confidence level; default is level(95)

ci var1
2.3. Frequency Tables
One-way tables of frequencies
Statistics > Summaries, tables, and tests > Tables > One-way tables
syntax: tabulate varname [if][in] [, tabulate1_options]
tabulate options  description
-----------------------------------------------------
plot              produce a bar chart of the relative frequencies

tabulate var1
One-way tables of frequencies for each variable
syntax: tab1 varlist [if][in] [, tab1_options]
See options as in One-way tables of frequencies
tab1 var1 var2 var3 // Creates one-way tables for each variable
Two-way tables of frequencies
Statistics > Summaries, tables, and tests > Tables > Two-way tables with measures of association
syntax: tabulate varname1 varname2 [if][in] [, options]
options    description
--------------------------------------------------
chi2       report Pearson's chi-squared
exact      report Fisher's exact test
cchi2      report Pearson's chi-squared in each cell
column     report relative frequency within its column of each cell
row        report relative frequency within its row of each cell
cell       report the relative frequency of each cell
expected   report expected frequency in each cell

tabulate var1 var2
tab var1 var2
Two-way tables for all possible combinations
Statistics > Summaries, tables, and tests > Tables > All possible two-way tabulations
syntax: tab2 varlist [if][in] [, options]
See options as in Two-way tables of frequencies
tab2 var1 var2 var3
Input a table
Statistics > Summaries, tables, and tests > Tables > Table calculator
syntax: tabi #11 #12 [...] \ #21 #22 [...] [\ ...] [, options]
See options as in Two-way tables of frequencies
tabi row1col1value row1col2value \ row2col1value row2col2value, chi2 expected row exact // chi2 = Chi-square test, expected = gives the expected counts for each cell, row = gives within-row relative frequencies, exact = Fisher's exact
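As an illustration, an observed 2 × 2 table can be typed in directly; the counts below are made up for the example:

```stata
tabi 30 18 \ 38 14, chi2 expected row exact   // 2 x 2 table entered row by row
```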
Table with summary statistics
Statistics > Summaries, tables, and tests > Tables > Table of summary statistics (tabstat)
syntax: tabstat varlist [if][in] [, options]
options    description
--------------------------------------------------
save       save summary statistics in r()

tabstat var1 var2
tabstat var1 var2, statistics(mean sd)
tabstat var1 var2, statistics(min mean p50 max) by(catvar1) columns(variables) save // saves the summary statistics in a matrix
return list
matrix list r(StatTotal)
matrix stats = r(StatTotal) // converts the special matrix into typical matrix
svmat stats, name(varname) // saves the summary statistics as a variable
Chi-Square test
Statistics > Summaries, tables, and tests > Tables > Two-way measures of association > Pearson's chi square > Expected frequencies > Within row relative frequencies
syntax: tabulate varname1 varname2 [if][in] [, options]
See options as in Two-way tables of frequencies
tabulate catvar1 catvar2, chi2 expected row
Fisher's Exact test
Statistics > Summaries, tables, and tests > Tables > Two-way measures of association > Fisher's exact > Expected frequencies > Within row relative frequencies
syntax: tabulate varname1 varname2 [if] [in] [, options]
See options as in Two-way tables of frequencies
tabulate catvar1 catvar2, expected row exact
Brennan
Cohort study relative risk
Statistics > Epidemiology and related > Tables for epidemiologists > Cohort study risk ratio, etc.
syntax: cs var_case var_exposed [if] [in] [, cs_options]
options description
-------------------------------------------------
or        report odds ratio
woolf     use Woolf approximation for calculating SE of the odds ratio
exact     calculate Fisher's exact p
level(#)  set confidence level; default is level(95)
cs casevar exposurevar
Case–control odds ratio
Statistics > Epidemiology and related > Tables for epidemiologists > Case–control odds ratio
syntax: cc var_case var_exposed [if] [in] [, cc_options]
options description
-------------------------------------------------
bd                perform Breslow-Day homogeneity test
binomial(varname) number of subjects variable
cornfield         use Cornfield approximation for calculating SE of the odds ratio
woolf             use Woolf approximation for calculating SE of the odds ratio
exact             calculate Fisher's exact p
cc casevar exposurevar
Tab odds
Statistics > Epidemiology and related > Tables for epidemiologists > Tabulate odds of failure by category
syntax: tabodds var_case [expvar] [if] [in] [, tabodds_options]
options description
-------------------------------------------------
or        report odds ratio
base(#)   reference group of control variable for odds ratio
cornfield use Cornfield approximation for calculating SE of the odds ratio
graph     graph odds against categories
tabodds casevar exposurevar
2.4. Generate New Variables, Variable Transformations, and Other Variable Operations
Create new variable Data > Create or change variable > Create new variable syntax: generate [type] newvar [:lblname] = exp [if][in] generate newvar = var1
Change contents of a variable
Data > Create or change variable > Change contents of a variable
syntax: replace oldvar = exp [if] [in]
replace newvar = 1 if var1 == 0
generate newcatvar = 1
replace newcatvar = 2 if var1 > 12 & var1 <= 30
replace newcatvar = 3 if var1 > 30
Create new string variable with the names separated by a space generate var3 = var1 + " " + var2
Change numeric values to missing
Data > Create or change variables > Other variable transformation commands > Change numeric values to missing
syntax: mvdecode varlist [if] [in], mv(numlist)
mvdecode var, mv(value1 value2 value3)
Change missing values to numeric
Data > Create or change variables > Other variable transformation commands > Change missing values to numeric
syntax: mvencode varlist [if] [in], mv(#)
mvencode var, mv(value1)
Transforming Variables
generate logvar1 = ln(var1)   // Natural log
generate var1Squared = var1^2 // Square
2.5. Variable List Commands
Alphabetize specified variables and move to front of data set Data > Variable utilities > Alphabetize variables syntax: aorder [varlist] aorder // Alphabetizes variable list
Move specified variables to front of data set Data > Variable utilities > Reorder variables in data set syntax: order varlist order var1 var4
Move one variable to specified position
Data > Variable utilities > Relocate variable
syntax: move varname1 varname2
move var1 var4 // Moves var1 to var4's position in the variable list, shifting the others
Drop variables
Data > Variable utilities > Keep or drop variables
syntax: drop varlist
drop var1
Drop observations
syntax: drop if exp
Drop a range of observations
syntax: drop in range [if exp]
Keep variables
Data > Variable utilities > Keep or drop variables
syntax: keep varlist
keep var1
Keep observations
syntax: keep if exp
Keep a range of observations
syntax: keep in range [if exp]
Labeling variables Data > Labels > Label variables syntax: label variable varname ["label"] label var var1 "Variable 1"
Alternatively, variables can be labeled in Stata’s data editor Data > Data editor > [Click on desired variable to highlight the column] > [Right click] > Variable > Properties > Label > Label name > LabelName > OK
Label variable values
Step 1: Create a named mapping between each value and its label
Data > Labels > Label Values > Define or modify value labels > Define > Label name > LabelName > OK > Value > SomeValue > Text > ValueLabel > OK > Value > NextValue > Text > NextValueLabel > OK > Cancel
Step 2: Assign the labels to the variable
Data > Labels > Label Values > Assign value labels to variable
Step 1 syntax: label define lblname # "label" [# "label" ...]
Step 2 syntax: label values varname [lblname]
label define labelname SomeValue "ValueLabel" NextValue "NextValueLabel"
label values varname labelname
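A concrete sketch of the two steps, using a hypothetical variable sex coded 1/2 (the variable and label names are illustrative, not from the text):

```stata
* Step 1: define the value-label mapping
label define sexlbl 1 "Male" 2 "Female"

* Step 2: attach the mapping to the variable
label values sex sexlbl

* Verify: tabulate now shows "Male"/"Female" instead of 1/2
tabulate sex
```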
Alternatively, values can be labeled in Stata's data editor
Data > Data editor > [Click on desired variable to highlight the column] > [Right click] > Variable > Properties > Label > label name > Define or modify. . . > Define > Name > OK > Value > SomeValue > Text > ValueLabel > OK > Value > NextValue > Text > NextValueLabel > OK > Cancel > Close > Value Label > Name > OK
Create indicator variables
Data > Create or change variables > Other variable commands > Create indicator variables
quietly tabulate catvar1, generate(new_catvar1)
Renaming variables Data > Variable utilities > Rename variable rename oldname newname
2.6. Stata as a Calculator
Display computations
Data > Other utilities > Hand calculator > Create
syntax: display [display_directive [display_directive [...]]]
display 2^3
Computing critical values
Data > Other utilities > Hand calculator > Expression builder > Probability > invttail(n,p): Returns the inverse reverse cumulative (upper-tail) Student's t distribution. [Double click and enter values for n and p]
syntax: display invttail(n,p)
display invttail(n,p)
2.7. Graphics
Box plots
Graphics > Box plot
syntax: graph box varlist [if] [in] [, options]
graph box var1
graph box var1, by(catvar1)               // Boxplots by category
graph box var1, over(catvar1) by(catvar2) // Boxplots for each category within a category
Scatterplots
Graphics > Two-way graph (scatter, line, etc.) > Create > Scatter
syntax: [graph] twoway plot [if] [in] [, twoway_options]
where the syntax of plot is [(] plottype varlist ..., options [)]
plottype description
---------------------------------------------------
scatter   scatterplot
line      line plot
lfit      linear prediction plot
lfitci    linear prediction plot with CIs
histogram histogram plot
twoway (scatter var1 var2)
Graphics > Two-way graph (scatter, line, etc.) > Create > Scatter > Define > Scatter //Overlays scatter plots twoway (scatter var1 var2) (scatter var1 var3)
Graphics > Two-way graph (scatter, line, etc.) > Create > Scatter > Define > Line //Overlays line plot on top of scatterplot twoway (scatter var1 var2) (line var1 var2)
Graphics > Scatter plot matrix graph matrix var1 var2
Pie Chart
Graphics > Pie chart
syntax: graph pie [varlist] [if] [in] [, options]
graph pie, over(catvar1)
Histograms
Graphics > Histogram
syntax: histogram varname [if] [in] [, options]
options description
---------------------------------------------------
density   draw as density; the default
fraction  draw as fractions
frequency draw as frequencies
percent   draw as percentages
addlabels add height labels to bars
histogram var1
Graphics > Histogram > Density plots > Add normal-density plot // adds normal curve
histogram var1, normal
Graphics > Histogram > Density plots > Add normal-density plot > By
histogram var1, normal by(catvar1) // histograms by category, adds normal curve
Drawing a Normal Distribution in Stata
Graphics > Two-way graph (scatter, line, etc.) > Create > Advanced > Plots > Function > Create > Probability > normalden(x,m,s): Returns the normal density with mean m and standard deviation s [double click] > [enter values for m and s only] > range
twoway (function normalden(x,i,j), range(k,l)) // enter i = mean value, j = standard deviation; k and l are the lower and upper range values
Cumulative Frequency Plot Statistics > Summaries, tables, and tests > Distributional plots and tests > Cumulative distribution graph > Generate equal cumulative for tied values syntax: cumul varname [if] [in], generate (newvar) cumul var1, generate (var1b) equal
Plot quantiles of a variable against quantiles of the normal distribution (Q–Q plot)
Graphics > Distributional graphics > Normal quantile plot
syntax: qnorm varname [if] [in]
qnorm var1
Standardized normal probability plot Graphics > Distributional graphics > Normal probability plot syntax: pnorm varname [if] [in] pnorm var1
Shapiro–Wilk normality test
Statistics > Summaries, tables, and tests > Distributional plots and tests > Shapiro–Wilk normality test
syntax: swilk varlist [if] [in]
swilk var1
by catvar1, sort: swilk var1
2.8. Hypothesis Testing
One-sample mean-comparison test Statistics > Summary, tables, and tests > Classical tests of hypotheses > One sample mean-comparison test syntax: ttest varname == # [if] [in] [, level(#)] ttest var1 == HypothMean
Mean-comparison test, paired data Statistics > Summary, tables, and tests > Classical tests of hypotheses > Mean-comparison test, paired data syntax: ttest varname1 == varname2 [if] [in] ttest var1 == var2
Two-sample mean-comparison test Statistics > Summary, tables, and tests > Classical tests of hypotheses > Two-sample mean-comparison test > Unequal variances syntax: ttest varname1 == varname2 [if] [in], unpaired [unequal level(#)] ttest var1 == var2, unpaired unequal //unequal variances ttest var1 == var2, unpaired // equal variances
Two-group mean-comparison test Statistics > Summary, tables, and tests > Classical tests of hypotheses > Two-group mean-comparison test syntax: ttest varname [if] [in] , by(groupvar) [options1] ttest var1, by (catvar)
One-sample mean-comparison calculator Statistics > Summary, tables, and tests > Classical tests of hypotheses > One-sample mean-comparison calculator syntax: ttesti #obs #mean #sd #val [, level(#)] ttesti samplesize samplemean samplestdev hypothmean, level (95.5)
Two-sample mean-comparison calculator Statistics > Summary, tables, and tests > Classical tests of hypotheses > Two-sample mean-comparison calculator syntax: ttesti #obs1 #mean1 #sd1 #obs2 #mean2 #sd2 ttesti samplesize1 samplemean1 samplestdev1 samplesize2 samplemean2 samplestdev2, level (99.9)
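As a concrete illustration of the immediate (calculator) form, the numbers below are hypothetical summary statistics (n, mean, SD for each group), not values from the text:

```stata
* Two-sample t test from published summary statistics only
* (hypothetical: n=100, mean=132.8, sd=15.3 vs. n=110, mean=127.4, sd=18.2)
ttesti 100 132.8 15.3 110 127.4 18.2

* Same comparison without assuming equal variances
ttesti 100 132.8 15.3 110 127.4 18.2, unequal
```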
One-sample proportion test Statistics > Summary, tables, and tests > Classical tests of hypotheses > One-sample proportion test syntax: prtest varname == #p [if] [in] [, level(#)] prtest catvar == proportion
Two-sample proportion test Statistics > Summary, tables, and tests > Classical tests of hypotheses > Two-sample proportion test syntax: prtest varname1 == varname2 [if] [in] [, level(#)] prtest catvar1 == catvar2
One-sample proportion calculator
Statistics > Summary, tables, and tests > Classical tests of hypotheses > One-sample proportion calculator
syntax: prtesti #obs1 #p1 #p2 [, level(#) count]
prtesti samplesize sampleproportion hypothesizedproportion, level(95)
Two-sample proportion calculator Statistics > Summary, tables, and tests > Classical tests of hypotheses > Two-sample proportion calculator syntax: prtesti #obs1 #p1 #obs2 #p2 [, level (#) count] prtesti samplesize1 sampleproportion1 samplesize2 sampleproportion2, level (95)
One-sample variance-comparison test
Statistics > Summary, tables, and tests > Classical tests of hypotheses > One-sample variance-comparison test
syntax: sdtest varname == # [if] [in] [, level(#)]
sdtest var1 == hypothstddev
Two-sample variance-comparison test Statistics > Summary, tables, and tests > Classical tests of hypotheses > Two-sample variance-comparison test syntax: sdtest varname1 == varname2 [if] [in] [, level(#)] sdtest var1 == var2
Two-group variance-comparison test Statistics > Summary, tables, and tests > Classical tests of hypotheses > Two-group variance-comparison test syntax: sdtest varname [if] [in], by(groupvar) sdtest var1, by (catvar1) level (95)
One-sample variance-comparison calculator
Statistics > Summary, tables, and tests > Classical tests of hypotheses > One-sample variance-comparison calculator
sdtesti samplesize . stddev hypothesizedstddev, level(95) // "." holds the place of the (unneeded) sample mean
Robust equal-variance test
Statistics > Summary, tables, and tests > Classical tests of hypotheses > Robust equal-variance test
syntax: robvar varname [if] [in], by(groupvar)
robvar var1, by(catvar1)
Two-sample comparison of means
Statistics > Summary, tables, and tests > Classical tests of hypotheses > Sample size and power determination > Two-sample comparison of means
syntax: sampsi #1 #2 [, options]
options description
---------------------------------------------------
onesample   one-sample test; default is two-sample
sd1(#)      standard deviation of sample 1
sd2(#)      standard deviation of sample 2
power(#)    power of test; default is power(0.90)
n1(#)       size of sample 1
n2(#)       size of sample 2
ratio(#)    ratio of sample sizes; default is ratio(1)
onesided    one-sided test; default is two-sided
sampsi mean1 mean2, sd1(stddev1) sd2(stddev2)
One-sample comparison of mean to hypothesized value
Statistics > Summary, tables, and tests > Classical tests of hypotheses > Sample size and power determination > One-sample comparison of mean to hypothesized value
syntax: sampsi #1 #2 [, options]
See options as in Two-sample comparison of means
sampsi hypothvalue postulatedvalue, sd1(stddev) onesample
Two-sample comparison of proportions
Statistics > Summary, tables, and tests > Classical tests of hypotheses > Sample size and power determination > Two-sample comparison of proportions
syntax: sampsi #1 #2 [, options]
See options as in Two-sample comparison of means
sampsi prop1 prop2
One-sample comparison of proportions to hypothesized value
Statistics > Summary, tables, and tests > Classical tests of hypotheses > Sample size and power determination > One-sample comparison of proportions to hypothesized value
syntax: sampsi # 1 # 2 [, options] See options as in Two-sample comparison of means sampsi hypothvalue postulatedvalue, onesample
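A worked planning sketch for sampsi, using hypothetical design numbers (means, SDs, and power target are illustrative, not from the text):

```stata
* Two-sample design: detect a difference between means of 132 and 127
* (common SD 15) with 80% power and a two-sided 5% alpha;
* sampsi reports the required n per group.
sampsi 132 127, sd1(15) sd2(15) power(0.80)

* One-sample version: test whether the mean differs from the
* hypothesized value 132 when the postulated true mean is 127.
sampsi 132 127, sd1(15) onesample
```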
2.9. ANOVA/ANCOVA
ANOVA
Statistics > Linear models and related > ANOVA/MANOVA > One-way ANOVA > Produce summary statistics > Bonferroni > Scheffe
syntax: oneway response_var factor_var [if] [in] [, options]
options description
---------------------------------------------------
bonferroni Bonferroni multiple-comparison test
scheffe    Scheffe multiple-comparison test
tabulate   produce summary table
oneway var1 catvar1, bonferroni scheffe tabulate
Statistics > Linear models and related > ANOVA/MANOVA > Analysis of variance/covariance > Reporting > Display anova and regression table // Two-way ANOVA
syntax: anova varname [term [/] [term [/] ...]] [if] [in] [, options]
options description
-------------------------------------------------
Model
category(varlist)   variables in terms that are categorical or class
continuous(varlist) variables in terms that are continuous
repeated(varlist)   variables in terms that are repeated-measures variables
partial             use partial (or marginal) sums of squares
sequential          use sequential sums of squares
Reporting
regress             display the regression table
[no]anova           display or suppress the ANOVA table
anova var1 catvar1 catvar2, partial regress anova
tabulate catvar1 catvar2, summarize(var1)
ANCOVA
Statistics > Linear models and related > ANOVA/MANOVA > Analysis of variance/covariance > Reporting > Display anova and regression table > Continuous except the following categorical variables > catvar1
anova var1 catvar1 var2, category(catvar1) partial regress anova
2.10. Nonparametric Methods
Wilcoxon rank-sum test
Statistics > Summaries, tables, and tests > Nonparametric tests of hypotheses > Wilcoxon rank-sum test
syntax: ranksum varname [if] [in], by(groupvar)
ranksum var1, by(catvar1)
Wilcoxon (matched pairs) signed-rank test
Statistics > Summaries, tables, and tests > Nonparametric tests of hypotheses > Wilcoxon matched-pairs signed-rank test
syntax: signrank varname = exp [if] [in]
signrank var1 = var2
Kruskal–Wallis test
Statistics > Summaries, tables, and tests > Nonparametric tests of hypotheses > Kruskal–Wallis test
syntax: kwallis varname [if] [in], by(groupvar)
kwallis var1, by(catvar1)
Fisher’s exact test See Fisher’s Exact test
McNemar’s test (matched case–control) Statistics > Epidemiology and related > Tables for epidemiologists > Matched case–control studies syntax: mcc var_exposed_case var_exposed_control [if] [in] mcc cases controls
2.11. Correlation
Pearson’s correlation coefficient Statistics > Summaries, tables, and tests > Summaries and descriptive statistics > Pairwise correlation > Print significance level for each entry
syntax:
Display correlation matrix or covariance matrix
correlate [varlist] [if] [in] [, correlate_options]
Display all pairwise correlation coefficients
pwcorr [varlist] [if] [in] [, pwcorr_options]
correlate_options description
---------------------------------------------------
_coef      display correlation or covariance matrix of the coefficients
means      display means, standard deviations, minimums, and maximums with matrix
covariance display covariances
pwcorr_options description
---------------------------------------------------
sig        print significance level for each entry
bonferroni use Bonferroni-adjusted significance level
pwcorr var1 var2, sig
correlate var1 var2
Spearman's rank correlation coefficient
Statistics > Summaries, tables, and tests > Nonparametric tests of hypotheses > Spearman's rank correlation > Display correlation coefficient > Display number of observations > Display significance level
syntax: spearman [varlist] [if] [in] [, spearman_options]
options description
---------------------------------------------------
stats(statlist) statistics to display: rho, obs, p
bonferroni      use Bonferroni-adjusted significance level
matrix          display output in matrix form
spearman var1 var2, stats(rho obs p)
Cohen’s kappa agreement coefficient Statistics > Epidemiology and related > Interrater agreement, two unique raters > Display table of assessments
syntax: kap varname1 varname2 [if] [in] kap var1 var2, tab
2.12. Linear Regression
Linear Regression
Statistics > Linear models and related > Linear regression
syntax: regress depvar [indepvars] [if] [in]
regress outcomevar predictorvar1 predictorvar2
regress outcomevar predictorvar1
xi: regress outcomevar predictorvar1 i.catvar1 i.catvar2 // Creates indicator variables for categorical variables
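The xi: prefix expands each i.varname term into 0/1 indicator variables before fitting. A sketch with a hypothetical three-level variable (the generated names follow xi's _I naming convention):

```stata
* Suppose catvar1 takes the values 1, 2, and 3. Then
xi: regress outcomevar predictorvar1 i.catvar1

* creates _Icatvar1_2 and _Icatvar1_3 (level 1 is the omitted
* reference category); their coefficients are mean differences
* from level 1, holding predictorvar1 fixed.
describe _I*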
Scatter plot with fitted line Graphics > Two-way graph (scatter, line, etc.) > Create > Basic > Scatter > Accept > Create > Fit plots > Linear prediction with CI > Accept twoway (scatter outcomevar predictorvar1) (lfit outcomevar predictorvar1)
Obtain residuals
Statistics > Postestimation > Predictions, residuals, etc. > Residuals (equation-level scores)
syntax: predict [type] newvar [if] [in] [, residuals]
predict ResidualsName, residuals
Residuals versus fitted values
Statistics > Linear models and related > Regression diagnostics > Residual versus fitted plot
rvfplot, yline(0)
Residuals versus predictor plots Statistics > Linear models and related > Regression diagnostics > Residual versus predictor plot rvpplot predictorvar1
Breusch–Pagan/Cook–Weisberg test for heteroskedasticity
Statistics > Linear models and related > Regression diagnostics > Specification tests, etc. > Tests for heteroskedasticity (hettest) > Breusch–Pagan/Cook–Weisberg > Use the following variables > ResidualVariable
estat hettest ResidualVariable
Create indicator variables
See options as in Create indicator variables

2.13. Logistic Regression/ROC Curve Analysis
Logistic Regression
Statistics > Binary outcomes > Logistic regression (reporting odds ratios) > Reporting > Odds ratios (default)
syntax: logit depvar [indepvars] [if] [in] [, options]
options description
---------------------------------------------------
offset(varname) include varname in model with coefficient constrained to 1
asis            retain perfect predictor variables
or              report odds ratios
logistic depvar var1 var2
logit depvar var1 var2, or
xi: logistic outcomevar predictorvar1 i.catvar1 i.catvar2
ROC Curve
xi: logistic outcomevar predictorvar1 i.catvar1 i.catvar2
lroc  // graphs the ROC curve and calculates the area under the curve
lsens // graphs sensitivity and specificity versus probability cutoff
2.14. Dates in Stata
A quick note: Stata stores dates as numeric variables. To make working with dates less cumbersome, Stata uses a reference day – January 1, 1960 [Day 0] – and assigns each day elapsed since Day 0 a number. For example, January 2, 1960, is 1. Since every day since January 1, 1960, maps to a unique number, the difference between days is easy to calculate. The difference, for example, between January 1, 1960 (day number 0) and January 1, 2008 (day number 17532) is 17532. In other words, 17,532 days elapsed since January 1, 1960.
Subtracting dates in Stata if date is in mm/dd/yyyy format, e.g., 12/31/1979
Step 1: Data > Create or change variables > Create new variable
Step 2: Data > Create or change variables > Create new variable
Step 1: generate timelapsedInDays = newdatevar1 - newdatevar
Step 2: generate timelapsedInMonths = timelapsedInDays/30 // approximate: assumes 30-day months
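The day-numbering scheme can be checked directly with the mdy() function and the %td display format; a short sketch (the variable name eventdate is illustrative):

```stata
display mdy(1,1,1960)   // 0  (Day 0)
display mdy(1,2,1960)   // 1

* A date variable is just this day number; the %td format
* makes it display as a readable date (e.g., 01jan2008):
generate eventdate = mdy(1,1,2008)
format eventdate %td
list eventdate in 1
```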
Subtracting dates in Stata if date is in month, date, year format, e.g., December 31, 1979
Step 1: Data > Variable utilities > Set variables' output format
Step 2: Data > Create or change variables > Create new variable
Step 3: Data > Create or change variables > Create new variable
syntax: generate newvar = date(strvar, "MDY")
Step 1: generate newdatevar = date(datevar1, "MDY")
        generate newdatevar1 = date(datevar2, "MDY")
Step 2: generate timelapsedInDays = newdatevar1 - newdatevar
Step 3: generate timelapsedInMonths = timelapsedInDays/30
Change format from MMDDYY (e.g., 123179) to a date format in Stata
Step 1: Data > Create or change variables > Create new variable > if/in > Create > String > substr()
Step 2: Data > Create or change variables > Create new variable > if/in > Create > String > real()
Step 3: Data > Create or change variables > Change contents of existing variable
Step 4: Data > Create or change variables > Create new variable
syntax: substr(strvar, StartPosition, Length)
Step 1: generate str2 month = substr(datevar, 1, 2)
        generate str2 day = substr(datevar, 3, 2)
        generate str2 year = substr(datevar, 5, 2)
Step 2: generate mo = real(month)
        generate dy = real(day)
        generate yr = real(year)
Step 3: replace yr = yr + 1900
Step 4: generate newdatevar1 = mdy(mo,dy,yr)
2.15. Survival Analysis
Declare data to be survival-time data
Statistics > Survival analysis > Setup and utilities > Declare data to be survival-time data
Syntax for single-record-per-subject survival data: stset timevar [if] [, options]
stset timevariable, failure(failurevariable==ValueCorrespondingToFailure)
Kaplan–Meier Survival Curves
Graphics > Survival analysis graphs > Kaplan–Meier survival function > Graph Kaplan–Meier survivor function > survival settings [make sure data has been declared survival-time data] > At-risk table > Show at-risk beneath // Kaplan–Meier curve with risk table
syntax: sts graph [if] [in] [, options]
options description
-------------------------------------------------------
survival         graph Kaplan-Meier survivor function; the default
cumhaz           graph Nelson-Aalen cumulative hazard function
hazard           graph smoothed hazard estimate
by(varlist)      calculate separately on different groups of varlist
strata(varlist)  stratify on different groups of varlist
separate         show curves on separate graphs
ci               show pointwise confidence bands
risktable        show table of number at risk beneath graph
censored(single) show one hash mark at each censoring time, no matter what number censored
censored(number) show one hash mark at each censoring time and number censored above hash mark
sts graph, risktable
Graphics > Survival analysis graphs > Kaplan–Meier survival function > Graph Kaplan–Meier survivor function > survival settings [make sure data has been declared survival-time data] > Options > Plot censoring, entries, etc. > Number censored (one hash mark at each censoring time) // Kaplan–Meier curve with hash marks at censoring times
sts graph, censored(single)
sts graph, by(catvar) censored(single) // Survival curves by categorical groups

List survivor and cumulative hazard functions
Statistics > Survival analysis > Summary statistics, tests, and tables > List survivor and cumulative hazard functions
syntax: sts list [if] [in] [, options]
options description
---------------------------------------------------
survival report Kaplan-Meier survivor function
failure  report Kaplan-Meier failure function
cumhaz   report Nelson-Aalen cumulative hazard function
sts list
Log-rank test
Statistics > Survival analysis > Summary statistics, tests, and tables > Test equality of survivor functions > Perform test: log rank
syntax: sts test varlist [if] [in] [, options]
options description
---------------------------------------------
logrank  perform log-rank test of equality; the default
cox      perform Cox test of equality
wilcoxon perform Wilcoxon-Breslow-Gehan test of equality
peto     perform Peto-Peto-Prentice test of equality
sts test catvar, logrank
Cox proportional hazards regression model
Statistics > Survival analysis > Regression models > Cox proportional hazards model > Survival settings
syntax: stcox [varlist] [if] [in] [, options]
options description
---------------------------------------------
offset(varname) include varname in model with coefficient constrained to 1
breslow         use Breslow method to handle tied failures; the default
exactm          use exact marginal-likelihood method to handle tied failures
stcox var1
Acknowledgments
This research is supported in part by grant T32 MH014235 from the National Institute of Mental Health.
SUBJECT INDEX
A
study/studies . . . . . . . 112–115, 124, 140–143, 219–237, 421, 485–493, 581–583 Asymmetric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80, 165 Asymptotic . . . . . . . . . . . 107, 110, 119, 133, 156–157, 174, 211–213, 226, 360, 424–425, 555 Autonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Accuracy . . . . . . . 8, 108, 169–170, 172–173, 210, 251, 348, 351, 361, 423, 425, 444, 453, 474, 477, 479, 563–565, 569, 575, 579 Additive . . . . . 53–57, 73–84, 106–107, 133, 140, 225–228, 234, 276–278, 425, 488, 551 Adjacency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321–325, 328 Admixture . . . . . . . . . . . . . . . . . . . . . . . . . . 230, 486–487, 491 Affinity propagation . . . . . . . . . . . . . . . . . 371, 392–393, 401 Affymetrix . . . . . . . 116, 119, 231, 276–278, 288, 318, 334, 448–449, 479, 536 Aggregation/Aggregating/Aggregate bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452–453 feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442, 453–454 Algorithm backward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408–410 Baum–Welch . . . . . . . . . . . . . . . . . . . . . . . . . 410–411, 414 classification . . . . . . . . . . . . . . . . . . . . . . . . . . 450–453, 498 co-expression extrapolation (COXEN) . . . . . . . 475–477 dynamic programming . . . . . . . . . . . . . . . . . . . . . . . . . . 408 Expectation-Maximization (EM) . . . . . . 377, 379, 411, 518, 532, 535, 538, 553 stochastic (SEM) . . . . . . . . . . . . . 532, 535–538, 553 forward . . . . . . . . . . . . . . . . . . . . . 408–410, 412, 414–415 genetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457, 462, 504 Metropolis–Hastings . . . . . . . . . . . . . . . . . . . . . . 174, 522 optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426, 457 Van Steen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489, 492 Viterbi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408–410 Allele/Allelic . . . . . 112–116, 123, 140, 221–230, 233–235, 251, 488, 543–545, 552 Ambiguity . . . . . . . . . . . . . . . . . . . . . . . . . . 131, 132, 133–134 Amplification . . . . . . . . . . . . . . . . . . . . . . . 
249, 251, 275, 567 Analysis of covariance (ANCOVA) . . . . . . . . . . . . . . . . . . 618 Analysis of variance (ANOVA) one way . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47–61, 618 two way . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75–80, 618 Array channel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275–279 dye . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306 effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270, 306 multi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277, 449 oligonucleotide . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250, 564 single . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448–449 two color . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276, 449–450 Assay . . . . . . . . . . . . . . . . . . . . . . . . . . 249–252, 269, 472–473, 479–480 Association . . . . . . . . . 8, 86–87, 93–95, 101–104, 112, 115, 123, 140, 143, 219–237, 247, 253–256, 258, 341, 361, 396, 421–422, 437, 485–493, 516, 543, 565, 576, 579, 581–583, 606–607
B Bagging/Bagged . . . . . . . 320, 328–332, 339, 452–453, 458 Bandwidth . . . . . . . . . . . . 111–112, 363, 380, 382–384, 402 Bayesian/Bayes analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159, 187–188 Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330–331 classifier . . . . . . . . . . . . . . . . . . . . . . . . . 359–360, 450, 464 empirical (EB) . . . . . . . . . 163–164, 185, 277, 450, 473, 514, 519, 553 estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164–165, 450 factor (BF) . . . . . . . . . 166, 182–183, 185–188, 197, 554 hierarchical model . . . . . . . . . . . . . . . . 163, 193, 531–538 hypothesis test . . . . . . . . . . . . . . . . . . . . . . . . 183–184, 189 inference . . . . . . . . . . . . . . . . . . . . 162–163, 165, 405, 414 model . . . . . . . . . . . . . . . . . . . . . . 158–159, 161, 163, 554 na¨ıve . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450, 455, 462, 464 network . . . . . . . 316–318, 320, 331, 335–341, 505–506 regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474 risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359 rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158–163, 378, 474 test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166, 179–198 theorem . . . . . . . . . . . . . . . . . . . . . . . . . 183, 196, 338, 450 BDVal . . . . . . . . . . . . . . . . 438, 441, 456–457, 460–463, 465 Benchmark . . . . . . . . . . . . . . . . . . . . . 184–185, 278, 450–452 Bias publication . . . . . . . . . . . . . . . . . . . . . . . . 52, 542, 574–575 selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261 Bimodal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168, 270 Bin . . . . . . . . . . . . . . . . . . . . . . . . 161–162, 547–550, 556–557 Binding . . . . . . . . . . 
116, 244–245, 277, 297, 306, 327, 370, 405–408, 410, 412, 414, 418, 511–514, 516, 518–522 site . . . . . . . . . . . . . . . . 405–408, 410–412, 414, 513–514 Bioinformatics/Informatics . . . . . . . . . . . 107, 119, 258, 348, 420–425, 427, 481–482, 513 (Bio)marker discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421, 460, 473 genetic . . . . . . . . . . . . . . . . . 220, 224, 229, 235, 555, 589 informative . . . . . . . . . . . . . . . . . . 146, 260, 439–440, 547 molecular . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236, 249, 349 Bisection method . . . . . . . . . . . . . . . 204, 207–208, 214–215 Bivariate . . . . . . . .86–87, 102, 104, 138–139, 356–357, 429 Block/Blocking . . . . . . . . . . 72–81, 108, 114, 122–124, 127, 259, 294
H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, © Springer Science+Business Media, LLC 2010, DOI 10.1007/978-1-60761-580-4
627
STATISTICAL METHODS IN MOLECULAR BIOLOGY
628 Subject Index
Bonferroni . . . . . . . . . . 69–70, 210–212, 233, 310, 312, 487, 490–491, 582, 618, 620 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453, 523–524 Bootstrap/Bootstrapping . . . . . . . . 133, 256, 261, 330–331, 339–340, 447–448, 452–453
C Calibrate/Calibration . . . . . . . . . 84, 98–100, 277, 317, 359, 364, 450, 565 Case–Control . . . . . . . . . 224–231, 233, 237, 444, 493, 564, 568, 573, 608, 619 matched . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228–229, 619 Cause/Causal/Causation/Causality . . . . . . . . . . 4, 8, 30, 85, 105, 120, 157, 210, 221, 224–225, 228, 247–248, 254, 281, 289, 299, 318, 341, 442, 498–499 Censoring/Censored interval . . . . . . . . . . . . 128–129, 134–135, 147, 534–535 doubly . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134–135 left . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128, 134 right . . . . . . . . . . . . . . . . . . . 128, 134, 418, 429, 534–535 Central limit theorem (CLT) . . . . . . . . . 106–107, 109, 147, 171–173 Chromosome/Chromosomal . . . . . 110, 135–136, 143–144, 335, 504, 517, 538, 543–544, 546–549, 553 Classification accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348, 361, 453 error . . . . . . . . . . . . . . . . . . . . . . . . 366, 398, 474–475, 504 model . . . . 440, 451, 453–454, 457, 459, 473–474, 477, 523–524 multi-way . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71–80 one-way . . . . . . . . . . . . . . . . . . . 47–61, 73–76, 81, 84–85 and regression tree (C&RT) . . . . . . . . . . . . . . . . . . . . . 452 three-way . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347–348, 452, 458 two-way . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72–80 Classifier . . . . . . . . . 349, 351–353, 359–360, 362, 364, 440, 444–445, 450–453, 460, 462, 464–465, 475, 481, 523–524, 579 Clustering/Cluster agglomerative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274, 387 analysis . . . 126, 143, 255–256, 274–275, 340, 498, 502 attribute-based . . . . . . . . . . . . . . . . . . . . . . . 370–384, 392 bi . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . 371, 393–395, 401 dissimilarity-based/similarity-based . . . . . . . . . 384, 390 gene . . . . . . . . . . 140, 274, 316, 322, 328–331, 334–336, 341–342 hierarchical . . . 255–256, 274, 322, 324–326, 328, 371, 375, 385, 387–390, 401, 402, 564 K-means . . . . . . . . . . . . . . . . . . . . 370–375, 385, 392, 401 K-medoids . . . . . . . . . . . . . . . . . . . . . . . . . . . 370–375, 401 linkage . . . . . . . . . . . . . . . . . 322, 324, 328, 387, 390, 402 model-based . . . . . . . . . . . . . . . . .375–379, 395, 401–402 nonparametric . 370–371, 379–382, 384, 390, 400, 402 spectral . . . . . . . . . . . . . . . . . . . . . . . . . 371, 386, 390–392, 401, 402 tree . . . . . . . . . . . . . . . . . . . . . . . . 274, 325–327, 370–371, 380–382, 402 Coefficient of determination . . . . . . . . . . . . . . . . . . . . 92, 102 See also Multiple correlation Cohort . . . . . . . . . . . 254, 459, 461, 463–464, 492, 564, 568, 573, 608 Collaboration/Collaborating . . . . . 138–140, 543, 565–566, 583–584, 590–591 Collinearity/Collinear/Multicollinearity . . . . . . . . . 420–421
Combination . . . . . . 68, 122, 126, 130, 137–139, 194–196, 236, 276, 281, 358–359, 376, 422, 429, 437, 440, 451–453, 455, 457, 474, 480, 544 linear . . . . . . . . . 126, 130, 422, 429, 440–441, 451, 453, 474, 480 Comparison/Comparisons family of/families of . . . . . . . . . . . . . . . . . . . 61–67, 69–71 multiple, see Multiple testing paired . . . . . 40–43, 46, 67, 71, 112, 119–120, 129–132, 134–135, 137 pairwise . . . . . . . . . . . . . . . . . . . . . . . . . . 63, 65–66, 68–71 planned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61–66, 69 two-sample . . . . . . . . . . . . . . . . . . . . . . . . . . . 180, 614–617 Complex disease . . . . . . . . 143, 146–147, 219–237, 486, 541–543, 548–550 statistic . . . . . . . . . . . . . . . . . . . . . 185, 448–449, 564–565 Compound symmetry/Compound symmetric . . . . . . . . 195 Concordance/Concordant . . . . . . . . . . . . . . . . . . . . . 477, 550 Conditional probability table . . . . . . . . . . . . . . . . . . . 337, 340 Confidence bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 interval . . . . . . . . . 24–25, 172–173, 256, 502, 554, 570, 584, 605 one-sided . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167–168, 605, 608 Confound/Confounding/Confounder . . . . . . . . . . 5–9, 108, 113, 120, 149, 228, 317, 439, 448–449, 459, 486–487, 490–491, 571–575, 582, 587 Conjugate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162–163, 412, 414 Connectivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321, 323–328, 330 Consistent/Consistency . . . . . . 95, 168, 184, 232–233, 247, 253, 257–258, 260, 298, 309, 341–342, 359, 365, 441–442, 448, 449, 458–459, 517, 524, 542, 544, 548–550, 557, 565 Contingency table . . . . . . . . . . . . . . . . . . . . . . . . 396, 443–445 Contrast . . . . . . . . . . 
68–70, 81, 85–86, 106, 111, 114, 123, 139, 146, 149, 187, 234, 270, 348–349, 362, 379, 393, 400, 409, 432, 444, 448, 452, 516, 572–573 Coregulate/Coregulation . . . . . . . . . . . . . 138–140, 144, 370 Correlation/Correlation coefficient Kendall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138, 455 Matthews . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444 multiple . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 Pearson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274, 322 prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192, 195 Spearman/Spearman’s rank . . . . . . . . 138, 274, 477, 620 tetrachoric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193, 195 Cost-effective . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231–232, 250 Covariance diagonal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376 ellipsoidal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376–377 spherical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376 Covariate . . . . . 83, 101, 227, 316–320, 326, 331, 336–337, 339, 429, 488, 491–492, 552–554, 556–558, 573 CpG dinucleotides . . . . . . . . . . . . . . . . . . . . . . . . . 244–246, 253 island . . . . . . . . . . . . . . . . . . . . . . 245–249, 251, 253–254, 256–257, 260 Credible set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167–168, 554, 556–558 Cytosine . . . . . . . . . . . . . . . . . . . . . . . 244–246, 248–251, 253
D Data mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282, 348, 525 Degrees of freedom (df ) . . . 11, 22, 60–61, 78, 93–94, 124, 181, 184, 211, 223, 225, 227, 230, 303–304, 307, 545, 552 Dendrogram . . . . . . . . . . . . . . . . . . . . . . . . 274, 325, 387–390 Density conditional . . . . . . . . . . . . . . . . . . 157–158, 161, 163–164 ellipsoidal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379 highest probability (HPD) . . . . . . . . . . . . . . . . . . . . . . 168 marginal . . . . . . . . . . . . . . . . . . . . . . . . . 157–159, 163, 183 posterior . . . . . . . . . . . 158–161, 163–164, 168, 172–174 prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157–159, 161–164 probability . . . 19, 21, 86, 157, 160, 163, 168, 173, 186 sampling . . . . . . . . . . . . . . . . . . . . 157–159, 161–162, 173 Deoxyribonucleic acid (DNA) . . . 215–216, 223–224, 233, 244–250, 252–257, 258, 262, 268–269, 275, 288, 290, 292, 334–335, 341, 359, 370, 407, 418, 437, 439, 471–472, 498, 511–514, 516, 519, 564, 578, 582 complementary DNA (cDNA) . . . . . . . . 290, 293, 428, 472, 564 Dependence/Dependency . . . 157, 169, 184, 205, 210, 212, 299, 317, 336, 340, 451, 519, 532, 535, 553 Depletion/Deplete . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246–247 Design affected relative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222 affected sib pair (ASP) . . . . . . . . . . . . . . . . . . . . . . . . . . 222 balanced . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122–123 complete . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121–122 cross-over . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 experimental . . . 6–8, 40, 106, 121, 149, 297, 317, 437, 498–500, 565–566, 577, 583 factorial . . . . . . . . . . . . . . . . . . . . . . . . . 149, 564, 577–578 family-based . . . . 
. . . . . . . . . . 229–230, 235, 485–493 incomplete . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122–123 multi-stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486 one-stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231–233 paired . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47–48, 72, 74 population-based . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231 random . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71, 73 randomized complete block . . . . . . . . . . . . . . . . . . . 72–80 stratified . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121, 127, 129 two-stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231–233 unbalanced . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122–123 Diagnostic/Diagnostics accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569, 575–576 analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582 assays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472–473 markers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621 study/studies . . . . . . . . . . . . . . . . . . . . . 565, 569, 575–576 tests . . . . . . . . . . . . . . . . . . . . 442, 444, 564–565, 575, 580 Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . 143, 224, 476, 572, 574–577 Diffusion map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .386 Dimensionality . . . . 169, 210, 253, 356, 362, 374, 418, 453 curse of . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169, 418 Dimension reduction . . . . . . . . 261, 271, 417–432, 453, 504 Diplotype . . . . . . . . . . . 
. . . . . . . . . . . . . . . . 135–137, 142–143 Directed acyclic graph (DAG) . . . . . . . . . . . . . . . . . . 336, 523 Discordance/Discordant . . . . . . . . . 131, 134, 229, 552, 574 Discriminant analysis
linear (LDA) . . . 347–350, 359, 360, 451, 473–474, 476–477 quadratic (QDA) . . . . . . . . . . . . . . . . . . 347–348, 359 function . . . . . . . 349–353, 356–357, 359–361, 364–365 Discrimination/Discriminate/Discriminatory . . . . . . . . 139, 144–145, 249, 255, 257, 347–350, 356, 365, 418, 425, 452–453, 455, 481 Disequilibrium 112, 118, 135, 220, 224–226, 229–230, 488 Dissimilarity, see Similarity Distance Euclidean . . . . . . . . . . . . . . . . . . . . . . . .274, 385, 387–388 Karl Pearson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386 Mahalanobis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387 Manhattan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27, 385 Minkowski . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385 Distribution F . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40, 53, 57, 59, 75, 94 Bernoulli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161, 190, 520 beta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159, 161, 412 binomial . . . . . . . . . . . . . . . . . . 17, 19, 108–109, 113, 161 chi-square . . . . . . . . . . . . . . . . . . . . . . . . 21, 184, 227, 235 conditional/conditional probability . . . . . . . . . 157, 214, 317, 337, 421, 474, 519 cumulative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .114 Dirichlet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414 empirical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148, 269, 277 gamma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 genotype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488 inverse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 marginal . . . . . . . . . . . . . . . . . . . . 214, 421, 423, 432, 518 normal/Gaussian bivariate . . . . . . . . . . . . . . . . . . . 
. . . . . 86–87, 102, 104 multivariate . . . . . . . . . 190, 193, 195, 198, 223, 432 standard . . . . 19–20, 22, 44, 46, 205, 211, 226, 545 null . . . . . . . . . . . . . . . . . . . . . . . . . 256–257, 259, 515, 518 posterior . . . . . . 161, 163–164, 172–175, 182, 414, 474, 521–522, 533, 554 prior . . . . . . . . . 156, 163, 166, 175, 184–185, 412, 414, 520, 523, 553, 556, 558 probability . . . . . . . 15–24, 86, 157, 160–161, 180, 317, 336–337, 351 sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 23–24, 40, 57, 75 t (or student t) . . . . . . . . . . 21–22, 28, 31, 33, 39, 44, 95, 102, 114, 181, 192–193, 211, 230, 274, 299, 303–304, 385, 398, 402, 473 noncentral . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197 triangular . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .111 Wishart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192–193 Dominant/Dominance . . . . . . . . . . . . . . . 113–115, 118, 142, 225–228, 272, 317, 488 co- . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113–115 Dye-swap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297–299, 308 Dysregulation . . . . . . . . . . . . . . . . . . . . . . . 235–236, 247–248
E Effect batch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438–439, 450 carry-over . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 dye . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279, 306, 500 fixed . . . . . . . . . . . . . . . . . . . 55–57, 60, 78, 223, 551, 557 mixed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473 random . . 54–56, 74, 222, 255, 473, 551–552, 556–557 size . . . . . . 184–186, 188, 191, 193, 195, 197, 203–204, 206–210, 212–214, 216, 228, 450, 489–491, 524, 543, 551, 555–557, 576
systematic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54, 78, 276 Efficiency . . . . 106, 126, 231, 296, 299, 348, 366, 474, 553 Eigen array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272–273 decomposition . . . . . . . . . . . . . . . . . . . 272, 420, 424, 429 gene . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421 value . . . . . . . . . . . . . . . . . . . 272, 376, 390–391, 420, 424 vector . . . . . . . . . 272, 376, 390–392, 420, 424–426, 429 Elimination backward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365, 456 recursive feature . . . . . . . . . . . . . . . . . . . . . . .365, 456, 462 Embedding . . . . . . . . . . . . . . . . . . . . . . . . . 419, 461, 586, 590 locally linear . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398 Epi demiology/demiologic . . . . . . . 126, 129, 258, 568–570, 572–575, 581–583, 585, 608, 619–620 genetics/genetic . . . . . . . . . . . . . . 244–249, 254–256, 258 genomics/genomic . . . . . . . . . . . . . . . . . . . . . . . . . 243–262 Epistasis . . . . . . . . . . . . . . . . . . . . . . . .106, 135–137, 143, 558 Error additive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80–81, 276 experimental . . . . 47–49, 67, 71–72, 76–77, 79, 85, 92, 268, 440 generalization . . . . . . . . . . . . . . . . . . . . 445, 447, 453, 457 measurement . . . . . . . . . . . . . . . . . 95, 157, 277, 319, 570 minimum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347, 362 multiplicative . . . . . . . . . . . . . . . . . . . . . . . . . . . 57, 80, 276 sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448, 551 type I . . . . . . . . . 32, 34–35, 43, 62–66, 70–71, 166–167, 187–189, 207, 231, 232–233, 582 family-wise . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . 70 type II . . . . . . . . . . . . . . . 35, 43–44, 166–167, 189 Estimate/Estimation interval . . . . . . . . . . . . . . . . . . . . . 24, 96, 99, 111, 424 maximum likelihood (MLE) . . . . . . . . . . . . . . . . 421, 533 maximum a posteriori (MAP) . . . . . . . . . . . . . . 533, 535 point . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53, 164–165, 167 Estimator . . . . . . . . . . 12–14, 21–25, 27–28, 38, 52–53, 55, 57–58, 69, 77, 96, 100, 102, 164–165, 175, 423–429, 450 Exchangeable/Exchangeability . . . . . . . . 193, 195–196, 257, 277, 473 Expectation/Expected value, see Mean Exploration/Exploratory analysis/Exploratory data analysis . . . . . . . . . . . . . . . . . . . . . . . . 267–282 Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275–276, 459
F Factor analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 106, 137, 144 False discovery rate (FDR) . . . . 203–210, 217, 260, 310–312, 317, 491, 501–502, 505, 515–516, 582 negative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443, 445, 505 positive . . . . . . . . . . . 140, 142, 228–229, 260, 443–445, 474–475, 481, 492, 501–502, 505, 514–515, 578–579, 581–582 Family based association test (FBAT) . . . . . . . . . . . . . . . . 230, 486–491 Family-wise error rate (FWER) 66, 70, 189, 203, 210–217, 501–502 Filtering/Filter . . . . 215–216, 275, 361, 453–454, 457, 473 Fisher/Fisher’s contingency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
discriminant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347–348 exact test, see Test –Neyman factorization . . . . . . . . . . . . . . . . . . . . . . . . . 161 p-value method . . . . . . . . . . . . . . 543, 545–547, 555–557 Fold change . . . . . . . . . . . 206, 210, 254, 296, 361, 455, 457, 461–462, 464–465, 502 Fragment/Fragmentation . . . . . . . . . . . . 246, 250–253, 255, 258–260, 439 Frequency . . . . . . . . . 17, 113, 204, 224–226, 228–229, 233, 258, 380, 412, 416, 536, 538, 605–609, 613 Frequentist . . . . . . . . . . . . 156, 163–164, 166–168, 187–189 Fubini’s theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
G Gaussian quadrature rule . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 Gene expression 181, 186–189, 191, 194, 210–212, 215, 220, 235–237, 244–245, 247–249, 251, 253–255, 260–261, 269–273, 275–276, 278, 288, 290–292, 294, 299–302, 305–307, 312, 315–316, 318, 321, 335–337, 341–342, 361, 369–370, 393–394, 417, 421, 423–428, 430, 436–437, 460–461, 471–482, 497–498, 501, 511–514, 516–517, 519, 524, 531–532, 536–538, 543, 578–580 meta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331 module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324–326 shaving . . . . . . . . . . . . . . . . . . . . . 320, 328–335, 339–341 signature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472–473 Generalizability/Generalizable . . . . . . . . . . . . . . . . . 575–576 Genetic(s) . . . 105–106, 108, 111–113, 116, 120, 124–126, 128, 131, 133–137, 140, 143, 157, 188–190, 219–235, 244, 249, 258, 417, 437, 457, 462, 478–479, 486–491, 493, 504, 541–542, 544–545, 547, 549, 552–554, 555–557, 581–583, 585, 589 Genome/Genomic(s) genetical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221, 235–236 wide association (GWA) . . . . . 123–124, 142–143, 221, 231–232, 421, 485–493, 543, 581–583 Genotype/Genotypic/Genotyping . . . . 112, 115, 123, 137, 219–220, 222, 224–227, 230–234, 236–237, 274, 486–491, 542, 544, 553, 556, 582 Gold-standard . . . . . . . . . . . . . . 126, 249–252, 259, 261, 575
H Haplotype . . . . . . . . . . . . . . . . . 227, 230–231, 236–237, 437 HapMap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224, 231 Hardy–Weinberg disequilibrium (HWD) . . . . . . . . . . . . . . . . . . . . . . . . . 226 equilibrium (HWE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226 Heatmap . . . . . . . . . . . . . . . . . . . . . . . 143, 273–275, 333, 339 Heritability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486 Hessian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .424 Heterogeneity . . . . . . . . . 175, 221, 421, 486–487, 492–493, 521–522, 532–533, 542–545, 549–552, 555–558, 575 Heteroscedastic/Heteroskedastic . . . . . . . . . . . 425, 621–622 Heterozygous/Heterozygote . . . . . . 110, 112–113, 229, 234 Hierarchical model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163, 531–539 non . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132, 144, 146, 147 prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164, 193 structure . . . . . . 130, 133–134, 136, 143–144, 274, 381, 523–524
Hierarchy . . . . . . . . . 133–134, 144–145, 147, 193, 274, 533 High dimension . . . . . . . . . . . . . 165, 168–169, 171, 174, 210, 214, 220, 228, 231, 253, 267–282, 348, 417–432, 442, 450–452, 473, 502, 531–538 High throughput . . 267–270, 272–273, 315–316, 435–465, 471–472, 486, 511–526, 580 screening (HTS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269 Histogram . . . . . . . . 17, 19, 80, 204, 269–270, 536–537, 612–613 Homogeneity/Homogeneous . . . . 124, 147, 229–230, 316, 334–335, 341, 542–543, 553, 555, 579, 608 Homoscedastic/Homoskedastic . . . . . . . . . . . . 185–186, 432 Homozygous/Homozygote . . . . . . . . . . . . . . . . . . . . 112–115 Human genome project . . . . . . . . . . . . . . . . . . . 237, 471–472 Hybridization . . . . . . . . 268–270, 275–276, 278, 288–289, 335, 517, 578, 582 Hypothesis aggregate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545 alternative . . 30, 33, 36, 50, 60, 62, 182–183, 185, 187, 191, 211, 301–302, 547 null . . . .30, 37, 44, 51, 61, 63–64, 68, 70–71, 103–104, 108, 166, 180–181, 183–184, 187–191, 204, 211, 225–226, 230, 256, 301–302, 305, 307, 310, 397, 454–455, 488–489, 514–516, 545, 547–548, 552 one-sided . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33, 37 test/testing . . . 28–47, 53, 57, 62–63, 71, 103–104, 106, 165–167, 182–185, 189, 229, 231, 301–303, 310, 312, 489, 498, 504, 514, 546, 570, 614–618 two-sided . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33, 36–37
I Identifiability/Identifiable . . . . . . . . . . . . . . . . . . . . . . 519, 536 Identity-by-descent/Identical by descent (IBD) . . . . . . . . . . 222–223, 234–235, 551–552 Image analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500 Importance proposal density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172–174 weight function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 Imprint/Imprinting/Imprinted . . . . . . . 220–221, 233–235, 244–245, 247–249, 251 Improper posterior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160, 175, 533 prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160, 162, 185 Inbreed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234, 236 Independence . . . . .106–113, 124, 156, 193, 204–205, 336, 421–422, 432, 449–450, 455 conditional . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336, 450 Index Fowlkes–Mallows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397 Rand . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397 Youden . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475 Inference . . . 4–8, 12–15, 20–47, 53, 57, 61–62, 73, 80–81, 87–88, 98, 100, 157, 162–165, 168–169, 220, 228, 275, 351, 405, 414, 421, 497–498, 514–522, 525–526, 572–573, 580, 582 Information content . . 130–134, 137–138, 144–146, 254, 256, 259, 547, 567–569 criterion . . . . . . . . . . . . . . . . . . . . 378, 424–425, 428–429 Akaike (AIC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429 Bayesian (BIC) . . . . . . . . . . . . . . . . . . . . . . . . 378–379 Inheritance . . . . . . . . . . . . . . . . . 221–222, 225, 228, 231, 538 Integral/Integration/Integrate/Integrating
high-dimensional . . . . . . . . . . . . . . . . . 165, 168–169, 171 numerical . . . . . . . . . . . . . . . . . . . . . . . . 162–163, 168–174 Integrative analysis . . . . . . . . . . . . . . . . . . . . . . . 260, 522–524 Intensity/intensities background . . . . . . . . . . . . . . . . . . . . . . 276–278, 294, 500 spot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293–294, 500 Interaction/Interacting gene–environment . . . . . . . . . . . . . . . . . . . . 227, 542, 582 gene–gene . . . . . . . . . . . . . . . . . . . . . . . 140, 227, 517, 542 Interquartile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106, 269–270 Isomap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
K Kaplan–Meier . . . . . 255, 429–431, 477, 584–585, 624–625 Kernel bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383–384 density . . . . . . . . . . . . . . . . . . . . . . 270, 348, 380–382, 402 function . . . . . . . . . . . 159–160, 166, 168, 173, 175–176, 357–359, 365 Gaussian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357, 363, 382 linear . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456 method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357–359, 366 K-nearest neighbor (KNN) . . . . . . . 450–451, 474, 478–479
L Lagrange multiplier . . . . . . . . . . . . . . . . . . . . . . . . . . . 354–356 Latin square . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 Law of Large Numbers (LLN) . . . . . . . . . . . . 108, 171–174 Least significant difference (LSD) . . . . . . . . . . . . . 66–67, 69 Least square ordinary (OLS) . . . . . . . . . . . . . . . . . . . . . . . . . . . 423, 426 partial (PLS) . . . . . . . . . . . . 419, 422–425, 428–431, 474 Leave-one-out . . . . . . . . . . . . . . . . . . . . . . . . . . . 447–448, 461 Likelihood function . . . . . . . . . . . . . . . . . . . . 159–160, 163–164, 176, 194, 223 log . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223, 359, 411, 533–535 loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359 marginal . . . . . . . . . . . . . . . . . . . . 122, 126–127, 163–164 maximum (ML) . . . . 162, 223, 421, 500, 518, 533–534 restricted . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 multivariate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193–194 ratio . . . . . . . . . . . . . . . . . . . 222, 227, 235, 473, 544, 575 Linkage analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219–237 association study . . . . . . . . . . . . . . . . . 112, 219–237, 543 average . . . . . . . . . . . . . . . . . . . . . . . . . . 322, 324, 328, 388 complete . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388–391, 402 disequilibrium (LD) . . . . . . . . . . . . . . . . . . 135–136, 220, 224–225 scan . . . . . . . . . . . . . . . . . . . . . . . . 233, 236, 542–543, 552 score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222 single . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388–390, 402 Link function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177, 423, 425 Locus/Loci multiallelic . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . 116 Loess/Lowess . . . . . . . . . . . . . . . . . . . . . . . 279, 299, 449–450 Logarithm of the odds (LOD) . . . . 221, 544–548, 555–557 Longitudinal . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488, 490, 492, 505, 573 Loss function . . . . . . . . . . . . . . . . . . . . . . . . . 165, 352, 354, 474 hinge . . . . . . . . . . . . . . . . . . .352–354, 356, 358–359, 365 surrogate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
M
Machine learning . . . . . . 261, 348, 351, 357, 437, 440–441, 458, 532 Mapping . . . . 220–221, 223–224, 228, 233–237, 250–251, 356–357, 398, 459, 486, 488, 611 feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356–357 Marginal . . . . . . . . . 122, 126, 157–160, 163–164, 183, 189, 196, 205, 207, 214, 216, 255–256, 361, 414–416, 421, 423, 432, 457, 518, 618, 626 Markov chain . . . . . . . . . . 163, 174, 384, 392, 413, 522, 536–537 Monte Carlo (MCMC) . . . . . . . 163, 174, 339, 384, 522, 554 model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405–416, 517 hidden Markov model (HMM) . . . . . 405–416, 517 process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406–407 random field (MRF) . . . . . . . . . . . . . . . . . . 517–518, 526 Mass-to-charge . . . . . . . . . . . . . . . . . . . . . . . . . . 279, 439, 502 Mass spectrometry (MS) . . . . . . . . 249–250, 267–268, 275, 279–282, 439, 472, 498, 502–504, 564, 580 Mass spectrum . . . . . . . . . . . . . . . . . . . . . . . . . . . 281, 502–504 Matrix association . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .396 confusion . . . . . . . . . . . . . . . . . . . . . . . . 396, 398, 443–444 positional weight . . . . . . . . . . . . . . . . . . . . . . 406, 410–411 Mean conditional . . . . . . . . . . . . . . . . . . . . . . 422–424, 490–491 posterior . . . . . . . . . . . 164–165, 168–169, 172–176, 534 shift . . . . . . . . . . . . . . . . . . . .370–371, 382–384, 392, 401 trimmed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564 Median . . . . . . . 11, 106, 111, 116, 164, 269–270, 277–278, 291–292, 294, 306, 319, 323, 356, 360, 363, 388, 449, 564, 578 Mendel/Mendelian . . . . . . . . . . . . . . . . . . . . . . . 221–222, 538 Meta analysis/Meta analyses/Meta analytic (MA) . . . . . . . . 
258, 486, 524, 531–538, 542–543, 546–550, 552–555, 557, 564–565, 569, 573–575, 582, 590 genome search (GSMA) . . . . . . 543, 547–550, 555–557 Methyl/Methylation binding domain (MBD) . . . . . . . . . . . . . . . . . . . . 244–245 cytosine . . . . . . . . . . . . . . . . . . . . . . . . . 244–246, 248–253 Microarray one-color/single-color . . . . . . . . . . . . . . . . . . . . . . 288–289 prediction analysis of (PAM) . . . . . . 478, 480, 522–524 significance analysis of (SAM) . . . . . . . . . 107, 478, 481 statistical analysis of (StAM) . . . . . . . . . . 289, 294–295, 514, 517, 523–524 two-color . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287–312 Misclassification . . . . . . . . . . . 144, 349, 352–353, 397, 400, 474–475, 505 Missing data/Missing value . . . . . . . . . . 122–123, 126–127, 131, 278, 339, 361, 488, 501, 536, 553–554, 578, 609 Misspecification . . . . . . . . . . . . . . . . . 106, 230, 486–487, 491 Mixture . . . 48, 54, 156, 190, 204, 235, 251, 268–269, 273, 337, 370–371, 376–378, 384, 401, 486, 499, 503–504, 514, 517–520, 525–526, 534, 547 Mobility relative desired (RDM) . . . . . . . . . . . . . . . . . . . . .144, 146 subject (SM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 Mode . . . 164, 168, 221–222, 225, 228, 231, 277, 379–383, 386, 449, 461, 517 Model accelerated failure time (AFT) . . . . . . . . . . . . . . . . . . .534
additive . . . . . . . . . . . . 53–57, 73–75, 77–79, 81, 84, 425 autoregression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518–519 consensus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458–459 development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441–442 Gaussian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505–506 graphical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316, 505–506 hierarchical . . . . . . . . . . . . . . . . . . . . . . . . . . . 163, 531–538 joint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519–522 linear additive . . . . . . . . . . . . . 53–57, 73–75, 77–79, 81, 84 generalized (GLM) . . . . . . . . . . . . . . . . . . . . 126, 425 mixed effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473 mixture normal . . . . . . . . . . . . . . . . . . . . . . . 517–519, 525–526 stratified . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518 multiplicative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 probability . . . . . . . . . . . . . . . . . . . . . . . . 87, 348, 359, 365 Module . . . . . . . . . . . . . . . . . . . . . . . . 321–330, 405, 412–416 cis-regulatory (CRM) . . . . . . . . . . . . . 405–406, 412–416 Monte Carlo integration . . . . . . . . . . . . . . . . . . . . . . . . . . . 169, 171, 194 method . . . . . . . . . . . . . . . . . . . . . 163–164, 168, 171–174 optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457 Motif . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405–416, 418, 427 Multiple testing/Multiple comparison(s)/ Multiplicity . . . . 36, 61–71, 185, 189–190, 203, 204–205, 207, 210–212, 215–216, 260, 309–310, 479–480, 486, 489–490, 492, 576–577, 582, 618 Multivariate/Multivariable . . . . . . . . . . . 
125–148, 188–190, 193–195, 198, 223, 255, 261, 349, 419–420, 432, 455–456, 473–474, 476–477, 487–488, 490, 492, 501–502, 534, 576, 584–585 MuStat . . . . . . . . . . . . . . . 118, 124–125, 132, 136, 140, 147 Mutation/Mutate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96–97, 221, 224, 246, 258, 504, 513, 515
N Network co-expression . . . . . . . . . . . 320–328, 332–334, 339–342 gene . . . . . . . . . . . . . . . 235–236, 316, 517–518, 525–526 neural . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126, 348, 564 relevance (RN) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505–506 Newton–Cotes rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 Nonparametric . . . . . . . . 105–148, 156–158, 221–222, 230, 255–256, 357, 365, 370–371, 379–382, 384, 386, 390, 401, 402, 418, 501, 505, 514, 519, 523–524, 546–547, 555, 619–620 Normality . . . . . . . . . . 29, 36, 40, 57, 80, 88, 101, 211–212, 223, 349, 360, 455, 579, 614 Normalization/Normalize . . . . . . . . . . . . 107, 215–216, 245, 252–253, 270, 272, 275–279, 281–282, 294–299, 303, 305–306, 351, 397, 402, 449–450, 459, 500–501, 505, 578–579
O Observation/Observational . . . 4–5, 7–8, 10–14, 21, 29–30, 52–54, 56–60, 77, 81–88, 98, 108, 110–113, 124, 129, 138, 157, 171, 277, 292, 300, 370–375, 381, 389–391, 440, 501–502, 565, 572–575 Odds ratio (OR) . . . 162, 175, 579, 608, 622 Oligogenic . . . 541–542 Optimization genetic . . . 330, 457
STATISTICAL METHODS IN MOLECULAR BIOLOGY Subject Index 633 greedy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456 Ordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41, 127–135, 147 partial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127–134, 149 Orthogonal . . . . . . . . . . . . . . . . . . . . . . 59, 170, 329, 419, 422 Outlier . . . . . . . . . . . . . . . . . . . . . . . . . 101, 269, 278, 375, 504 Over-fitting . . . . . . . . . . . . . . . . . . . . . . . . . 261, 362, 378, 441 Overparameterize . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55, 74
P Paradigm . . . . . . . . . . . . . . 138, 185, 348, 418–419, 421–423 Parameter hyper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 nuisance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158, 185, 212, 488, 553 tuning . . . . . . . . . . . . . . . . . . . . . . 352, 354, 361–364, 366, 424, 553 Parametric . . . 106, 109, 148–149, 158, 175, 221, 375, 384, 390, 418, 427, 451, 455, 519, 547 Partition(s) . . . . . . . . . . . . . 59, 75–77, 91, 93, 388, 390, 518 comparing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369 Pathway . . . . . 106, 138–140, 146–147, 194, 196, 335, 342, 453–454, 505, 512, 517–519, 523 Pedigree . . . . . . . . . . . . . . . . . . . 220–223, 230, 488, 490, 557 disequilibrium test (PDT) . . . . . . . . . . . . . . . . . . 230, 486 Penalize/Penalization . . 181, 185, 338, 351, 354, 360, 362, 364–365, 378, 523–524 Penalty . . . . . . . 353–354, 358, 364–365, 426, 428, 523–524 Performance estimate . . . . . . . . . . . . . . . . . . . . . 448, 461–462 Permutation/Permute . . 116–117, 210–212, 253, 258–259, 329, 331, 424, 515, 547, 550, 564 Personalized medicine . . . . . . . . . . . . . . . . 144–147, 149, 237 Phenotype/Phenotypic . . . . . 123, 128, 136–137, 143–145, 220, 222–223, 235, 255–258, 486–492 Pie chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613 Pilot . . . . . . . . . . . . . . . . . . . . . . . . . . . 206, 210, 212–217, 505 Pivotal/Pivotality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212 subset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212 Platform . . . . . 231, 245, 249–250, 252–253, 278, 289, 319, 436, 459–460, 524–525, 532 Plot box . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269–270, 612 normal probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614 quantile–quantile (Q–Q) . . . . . . . . . . . . . . 
.256–257, 614 scatter . . . . . . . . . . . . . 270–271, 325, 364, 464, 612, 621 stem and leaf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564 Polymerase chain reaction (PCR) . . . . . . . . . . 249–250, 472, 564, 582 Polymorphism . . . . . . . . . . . . . . . . . . 123, 220, 225, 236–237, 275, 485, 513, 581 Pooling/Pooled data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543–544 information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .180 variance . . . . . . . . . . . . 28, 38–39, 51, 53, 180–181, 186, 206, 213 Population based . . . . . 4, 23, 225, 229–231, 237, 486–487, 493, 583 Positive definite . . . . . . . . . . . . . . . . . . . . . . . . . . 191, 357, 377 Posterior . . . . . . . . . 158–169, 172–176, 182–183, 187–188, 190–196, 410, 414, 474, 520–522, 554, 557 Power calculation . . . . . . . . . . . . . . . . . . 203–217, 492, 571, 582 conditional . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490–491 global . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211, 214 marginal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
statistical . . . . . . . . . . 210, 222, 230, 459, 486, 489–490, 513, 519, 543, 547, 578, 582 Precision . . . 8, 48–49, 73, 78–79, 82, 91, 95, 97, 100, 443, 480, 526 Prediction/Predictive/Predicting . . . . . . . . . 87–88, 98–100, 147, 245, 248–249, 254–255, 261, 356, 360–361, 428, 444, 447, 461, 471–481, 513–514, 522–523, 564–565, 578–579 model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443, 457 Predictive value negative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443, 475 positive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443 Preprocessing . . . . . . . . . 216, 253, 267–282, 318, 361, 498, 500–501, 503, 578, 580 Principal component(s) . . . . . . . . . . 270–273, 328, 419–421, 427–428, 431–432, 453, 504 analysis (PCA) . . . . . 270–273, 419–421, 425, 427–432, 450, 453, 502 Prior Gamma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532, 534 Haldane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162, 175 informative . . . . . . . . . . . . . . . . . . . . . . . . . . . 162, 185, 193 Jeffreys . . . . . . . . . . . . . . . . . . . . . . . . . . 162, 175, 183–185 joint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259, 414 non-informative . . . . . . . . . . . . . . . . . . . . . . 162, 185, 193 parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184–185, 187 subjective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 uniform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162, 175 Zellner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162, 175 Prioritization/Prioritize . . . . . . . . . . . . . . . . . . . 245, 259–260 Probability density function (pdf ) . . . . . . . 19, 21, 86, 160, 173, 186 posterior . . . . . . . . . . 167, 183, 187–188, 195–196, 410, 518, 520, 522, 554, 557 prior . . . . . 
182, 185, 189–193, 195–196, 517–519, 523 Probe . . . . . . . 215–216, 252, 268–270, 275–278, 288, 294, 319–320, 331–332, 437, 448, 459 effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216, 276 Profile/Profiling . . . . . . . 105, 112, 131, 133, 139, 190, 231, 235–237, 245, 248–249, 256, 291, 324, 348, 428, 435–436, 477–481, 502, 512–513, 524, 536, 579 Prognostic/Prognosis . . . . . . . . . . . . 203–204, 206–208, 210, 212–214, 216, 248, 261, 361, 364–365, 513, 565, 576 Programming . . . . . 355, 400–401, 408, 478, 480, 599–600 linear . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478 Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . 215, 378, 423, 451 Promoter . . . . . . . . . 246–249, 251, 253–254, 256–261, 418 Proportionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193–195 Protein . . . . . . . . . . . 140, 244–246, 273–275, 288, 317–318, 334, 439, 471–472, 497–502, 505–506, 511–514, 516–517, 519, 580 Proteome/Proteomic/Proteomics . . . . . . . . . . 267, 279, 458, 497–505, 580–581, 589 Proximity . . . . . . . . . . . . . . . . . . . . . . . . . . . 135, 259, 384–385 p-value adjusting/adjusted . . . . . . . . . . . . . . . . . . . . . . . . . 210, 260 minimum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 546 uni/bi/tri/multi-variate . . . . . . . . . . . . . . . . . . . . . 140, 328
Q Quality control . . . . . . . . . . . . . . . . . . 116–117, 453, 502, 580 Quantification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262, 275, 279, 439
Quantile . . . . . . . . . . . . . . 181, 211, 215, 252, 256–257, 277, 449, 501, 505, 614 Quantitative trait loci (QTL) . . . . . 220–222, 234–237, 542 expression (eQTL) . . . . . . . . . . . . . . . . . . . . . . . . 235–237
R Random forest . . . 255, 452, 462, 522 Randomization . . . 6–9, 48, 120–123, 570–572, 577 complete . . . 48 Randomized controlled trial/Randomized clinical trial (RCT) . . . 564–565, 568, 571–572, 582 Ranking . . . 330–331, 444, 453–456 Rank test . . . 120, 138, 255, 429–430, 619, 625 Recall . . . 19, 21–23, 30, 41, 43–44, 50–54, 57, 77, 86, 91, 96, 102, 323, 340, 364, 444 Receiver operating characteristic (ROC) curve . . . 444–445, 474, 579 area under the (AUC) . . . 444, 462, 474–475 Recessive . . . 115, 118, 142, 225–228, 230, 488 Recombinant/Recombination . . . 140, 220–221, 234 inbred line (RIL) . . . 220 Recurrence . . . 128–129, 248, 254–255, 319, 513, 576 Regression directional . . . 424 Haseman–Elston (HE) . . . 222, 546 inverse . . . 419, 421, 424–428 sliced (SIR) . . . 419, 424–428 linear . . . 81–104 logistic . . . 227, 255, 359–360, 364, 451, 622 median . . . 356, 360 proportional hazards . . . 564 ridge . . . 351, 354 stepwise . . . 106 Regularization/Regularize . . . 255, 351, 353–354, 357–358, 362, 426–427, 474 Regulate/Regulator/Regulation . . . 194, 212, 235, 254, 334–335, 412, 512–513 down . . . 
. . . . . . . . . . . . . . . . . . . . . . . 279, 479–480, 520 up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289, 479, 520 Relevance minimum redundancy–maximum (mRMR) . . . . . . . 456 vector machine (RVM) hierarchical (HRVM) . . . . . . . . . . . . . . 532–536, 538 Multi-task (MT-RVM) . . . . . . . . . . . . . . . . . 532, 534 Repair . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140, 256, 334 Repeated measurements/Repeated measures . . . . 488, 490, 492, 577 Replicate/Replication equal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54, 57–58, 60–61 unequal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54, 60–61, 70 Repression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244, 340, 480 Reproducible/Reproducibility . . . . 281, 439, 442, 455, 460, 578–580 Resampling/Resample . . . . . . . 210, 212, 330–331, 339–340 Residual . . . . . . . . . . . . 92–95, 148, 307–309, 440, 477, 621 Response . . . . . . . . . . 4–5, 8, 26–27, 91, 100, 135, 161, 332, 360, 421, 429, 471–481 Ribonucleic acid (RNA) messenger (mRNA) . . . . . 235, 255, 288–290, 292, 303, 334, 341–342, 369–370, 393, 418, 436, 471–472 Ridge parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426–427, 429 penalty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426 Risk
assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262, 472 minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .351, 354 relative (RR) . . . . . . . . . . . . . . . . . . . . . . . . . 227, 584, 608 score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429 Robust . . 106–107, 111, 215, 230–231, 277–278, 342, 438, 451–452, 486, 557, 616
S Sample size . . . 40, 43–47, 52, 71, 107, 113, 142–143, 156, 161, 167, 180–181, 203–217, 223, 227–228, 230–231, 237, 259–260, 317, 355, 362, 383, 432, 451, 454–455, 474, 505, 531–532, 542–543, 547, 555, 557, 570–572, 577–579, 582, 617 Sampling/Sampler error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95, 448, 551 Gibbs . . . . . . . . . . . . . . . . . . . . . . . 406, 411–412, 414–416 Scale/Scaling . . 11, 14, 20, 57, 80, 101, 106, 126–127, 138, 140, 148, 184–186, 220, 224, 228, 231, 233–235, 249–250, 253, 256, 269–270, 273, 277, 281–282, 296, 308, 321–325, 330, 365, 387, 409, 432, 447, 449, 459, 480, 487, 492, 511–513, 524–525, 534, 588 Score mu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 u- . . . . . . . . . . . . . . . . . 118–124, 127–129, 134, 136, 142 Z- . . . . . . . . . . . . . . . . . 253, 518, 525–526, 547, 556–557 Screening . . . . 115, 124, 250, 255, 259–261, 269, 370, 425, 428, 432, 444, 477–479, 487, 489–491 Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276 Segregation/Segregate . . 220–221, 223–224, 233–234, 236 Selection backward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456 feature . . . . 365, 442, 451, 454–458, 460–462, 464–465 forward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456 model . . . . . . . . . . . . . . . . . . 146, 261, 442, 457–458, 465 predictor . . . . . . . . . . . . . . . . . . . . 427–428, 531–532, 534 variable . . 146, 320–335, 365, 393, 395, 427–428, 432, 533–534, 538 Sensitivity . . . 115, 143, 157, 187–188, 250–252, 260, 321, 364, 443–445, 474–475, 477, 480, 514, 564–565, 573–575, 579, 622 analysis . . . . . . . . . . . . . . . . . 157, 187–188, 564, 573–575 Separable/Separability . . . . . . . . . . . . . . . . . . . . 349–353, 451 Sequence/Sequencing . . 
105–106, 135, 137, 149, 170–172, 174, 235, 244–246, 248–250, 252–254, 256, 260, 268–269, 277, 288, 290, 405–416, 418, 453, 461, 471–472, 497–498, 511–513, 519–522, 525, 572, 581–582, 600–601 Sequential . . . 211, 350, 364, 420, 422, 424–425, 449, 505, 519, 618 Shrinkage . . . . . . . . . . . . . 354, 427–431, 514–517, 523, 532 Sib -pair . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222, 234–235 -ships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222 Signature . . . . . . . . . 146–147, 361, 472–473, 476–477, 480 Significance familywise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69–70 level/level of . . 32, 35–36, 65, 303–304, 309–310, 322, 490–491, 550, 576, 619–620 test of . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30, 556 Similarity matrix . . . . . . . . . . . . . . . . . . 322–323, 386, 387, 390–392 measure . . . . . . . . . . . . . . . . . . . . . 322, 324, 384, 385–387 Simpson’s rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170–171
Simulated annealing . . . 457 Simulation . . . 171, 182, 210–212, 215–216, 233, 397, 490, 522, 547, 549–550 Single nucleotide polymorphism (SNP) distal . . . 136 proximal . . . 136 Single-step . . . 211–212 Singular value decomposition (SVD) . . . 272–273, 420 Skewed/Skewness . . . 11, 379, 581 Smooth/Smoothed/Smoothing . . . 19, 169, 171–173, 252, 270–271, 279–281, 299, 357, 425, 453, 519, 624 Space Hilbert . . . 356–357, 474 parameter . . . 157–159, 165, 167–171, 173, 414, 473 Specificity . . . 250–251, 254, 443–445, 474–475, 480, 514, 522, 564, 575, 579, 622 Spectrum/Spectral . . . 267, 279–282, 316, 366, 371, 386, 390–392, 401, 473, 502–504, 550, 581, 589 decomposition . . . 424–426 Stabilization/Stabilize/Stabilizing . . . 278, 501 Standard deviation (SD) . . . 11–12, 106, 109–110, 114, 172, 291–292, 303–304, 307–308, 448, 454, 458, 613, 617, 620 Standard error (SE) . . . 14, 43, 67, 69, 78, 91, 96, 99–100, 253, 255, 543, 551, 556–557 Standardization/Standardize . . . 107, 184–186, 206, 212, 216, 386–387, 450, 614 Statistical learning . . . 348–349, 351 Statistic/Statistics descriptive . . . 10–12, 601, 605, 619–620 gap . . . 329–331 Q . . . 543–544, 549, 552 standard normal . . . 
22 sufficient . . . . . . 161–162, 486, 488–489, 491, 533–534 summary . . . . . . . . . . . . . . . . . . . 194, 268–269, 278, 601, 605, 607, 617, 625 t- . . . . . . . . . . 47, 53, 181–182, 185–186, 198, 454–455 test . . . . 40, 79, 112–114, 126, 182, 205, 210–213, 215, 226, 256–257, 514, 607, 614–616, 618–619, 625 u- . . . . . . . . . . . . . . . . . . . . . . 116, 127, 136–137, 148, 455 z- . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193–194, 455, 488 Step down . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210–212, 456 Step up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310–312, 456 Stochastic . . . . 169, 172–173, 268, 276, 532, 535, 538, 553 Stratification . . . . . 108, 112, 114, 120–123, 228–230, 421, 461, 490–491, 582 Structure genetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 hierarchical . . . . 130, 134–136, 144–145, 274, 381, 523 Subsampling/Subsample . . . . . . . . . . . . . . . . . . . . 75–80, 108 Subspace . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422–425, 427, 452 Summarization . . . . . . . . . . . . . 262, 276–278, 448–449, 564 Supervised . . . . . . . 138–140, 255, 332, 351, 366, 419, 421, 429–432, 438, 440–441, 450, 473, 579 Support vector machine (SVM) . . 255, 347–366, 452, 456, 462, 473–474, 478, 504, 522–523, 564 Suppressor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246, 254, 261 Survival/survivor analysis . . . . . . . . . . . . . . . . . . . . . 428, 477, 534, 624–626 function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206, 624–625 Susceptibility/Susceptible . . . . . . . .222, 228, 439, 448, 480, 546–549 Symmetry/Symmetric . . . . 22, 80, 111, 119, 131, 165, 195, 277, 279, 299, 324, 331, 349, 357, 359,
376–377, 384, 390, 396, 398–399, 423–424, 432
T Test asymptotic . . . 107, 123–124, 227, 235, 424 binomial . . . 113 chi-square . . . 225–226, 607 Cochran–Mantel–Haenszel (CMH) . . . 124 conditional . . . 112, 118, 125, 489 exact . . . 481, 579, 606–607, 619 Fisher/Fisher's . . . 124, 481, 579, 606–607, 619 Friedman . . . 122–123, 125, 255 Jonckheere–Terpstra (JT) . . . 138 Kruskal–Wallis . . . 107, 122, 619 Lam–Longnecker . . . 124, 138 likelihood ratio . . . 227, 235, 473 generalized . . . 473 log-rank . . . 255, 430, 625 Mann–Whitney (WMW) . . . 119, 120, 124, 127, 138 McNemar . . . 110, 113–114, 118, 619 stratified (SMN) . . . 114–116, 118 multiple, see Multiple comparison multivariate . . . 130 normality . . . 614 permutation . . . 210–212, 253, 258–259, 424, 564 rank . . . 121, 124, 126, 619 sign conditional . . . 112 unconditional . . . 112, 118, 121 t- Bayesian . . . 179–198 paired . . . 255 two-sample . . . 179–180, 206, 455, 473 trend . . . 
. . . . . . . . . . . . . . . . . . . 139–141, 225–226 two-stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485–493 u- . . . . . . . . . . . . . . . . . . . . . . . . . . 122, 125, 142, 144, 455 unconditional . . . . . . . . . . . . . . . . . . . . . . . . . 112, 118, 121 Welch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .180 Wilcoxon Mann–Whitney (WMW) . . . . . 120, 124, 127, 138 rank sum . . . . . . . . . . . . . . . . 107, 120, 124, 127, 455 signed-rank . . . . . . . . . . . . . . . . . . 107, 124, 138, 619 Threshold . . . 112, 167, 183, 251, 257, 321–322, 325–326, 339, 364, 395, 411, 420, 445, 501, 515–516, 536–537, 544, 546–547, 554 Topology/Topologies/Topological . . . . . . . . . . 137, 320–328 Training . . . . . 349–350, 352, 355, 359, 361–365, 410–411, 428–430, 438, 440–442, 447–452, 457–458, 461–463, 472–474, 480, 504, 565–566, 583–586, 589–590 Trait . . . . . . 8, 219–224, 229–230, 233, 235–237, 486–488, 490, 492, 542, 550 Transcription . . . . . 140, 235, 244–249, 251, 260, 287, 290, 335, 405–406, 412, 418, 427, 461, 472, 512–513, 519 factor (TF) . . . . . . . . . . . . . 245, 405, 418, 512–513, 519 Transduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140, 194 Transform/Transformation . . . . . . . . . . 21, 57, 80, 101–103, 106–107, 126–128, 130, 133, 138, 146–148, 162, 181, 187, 206, 260, 277–278, 280, 356, 386, 391, 397, 418, 422, 425, 439, 453, 502, 504, 518, 581, 609 Transition . . . . . . . . . . . . . . . . . . 174, 391, 407, 410–411, 413 probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410–413
Translation . . . 334–335, 512, 583 Transmission/Transmit . . . 112–113, 229–230, 244, 552 disequilibrium test (TDT) . . . 112–115, 118, 140, 229–230, 235, 486, 488 True negative . . . 443–444, 475 positive . . . 443–444, 474–475, 505, 514–515 rejection . . . 203–204, 206–208, 213–214, 216–217 Truncated product method (TPM) . . . 546–547, 555 Two-dimensional gel electrophoresis (2-DE) . . . 498–502, 504, 580
U
Unbiased . . . 23–24, 69, 254, 277, 342, 447–448, 571 Unit experimental . . . 26, 37, 40, 46–49, 52–55, 58–59, 71–74, 82–83, 85, 95, 499 sampling . . . 49, 73–74, 212 Univariate . . . 107–125, 128, 131–133, 136–137, 139–140, 143, 147, 190–191, 194, 328, 418, 425, 428, 454–455, 464, 577, 581 Unsupervised . . . 144, 273–275, 331–332, 419, 421, 429–432, 453, 579
V
Validate/Validation cross . . . 364, 380, 423, 447–449, 453, 457–458, 460–462, 464–465, 480, 579 K-fold . . . 447–448 external . . . 259, 261, 575 model . . . 438, 464 Validity
external . . . 575 internal . . . 575 Variability . . . 4–6, 8, 13–14, 27, 38, 41, 43, 55, 57, 79, 91–92, 99, 101, 173, 180, 250–251, 440, 454–455, 536, 549, 551, 575–577, 582 Variance average . . . 424–425 common . . . 38, 46, 206 component . . . 48, 72, 75–76, 91–92, 222–223, 234 –covariance matrix . . . 121, 124–125, 223, 234, 486 equal . . . 38–39, 180, 215, 360, 451, 616 equality of . . . 39–40, 60, 455 pooled . . . 180, 186, 213 sliced . . . 424–425 total . . . 58–59, 75, 329, 453 unequal . . . 39, 614 Variant . . . 224, 227, 230, 383, 473–474, 581 Variation coefficient of (CV) . . . 14–15, 473 copy number (CNV) . . . 231, 259, 269, 275, 437, 581 source of . . . 7, 41, 60–61, 72, 78–79 systematic . . . 71, 268, 276, 298–299 Visualization/Visualize . . . 140, 259, 267–282, 295, 370, 374, 418, 445, 450, 453, 502, 525 Voting/Vote . . . 122–123, 452–453, 458, 544–545, 555–556
W
Weight/Weighted . . . 18, 26, 28–29, 31, 33, 38, 82–83, 101, 113, 126, 133, 165, 167, 170, 173, 180, 260–261, 320–323, 325–328, 330–331, 339, 365, 376, 386–387, 406, 410, 453, 456–457, 480, 487, 502–503, 523, 547–552, 557, 602 Western blot . . . 564 WinBUGS/BUGS . . . 174–176, 182, 525–527