METHODS IN MOLECULAR BIOLOGY™

Series Editor
John M. Walker
School of Life Sciences, University of Hertfordshire, Hatfield, Hertfordshire, AL10 9AB, UK

For further volumes: http://www.springer.com/series/7651
Statistical Human Genetics
Methods and Protocols
Edited by
Robert C. Elston Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA
Jaya M. Satagopan Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY, USA
Shuying Sun Department of Epidemiology and Biostatistics, Case Comprehensive Cancer Center, Case Western Reserve University, Cleveland, OH, USA
Editors Robert C. Elston Department of Epidemiology and Biostatistics Case Western Reserve University Cleveland, OH, USA
[email protected]
Jaya M. Satagopan Department of Epidemiology and Biostatistics Memorial Sloan-Kettering Cancer Center New York, NY, USA
[email protected]
Shuying Sun Department of Epidemiology and Biostatistics Case Comprehensive Cancer Center Case Western Reserve University Cleveland, OH, USA
[email protected]
ISSN 1064-3745        e-ISSN 1940-6029
ISBN 978-1-61779-554-1        e-ISBN 978-1-61779-555-8
DOI 10.1007/978-1-61779-555-8
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2011945439

© Springer Science+Business Media, LLC 2012

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Humana Press, c/o Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper

Humana Press is part of Springer Science+Business Media (www.springer.com)
Preface

The recent advances in genetics, especially in the molecular techniques that have over the last quarter of a century spectacularly reduced the cost of determining genetic markers, open up a field of research that is becoming of increasing help in detecting, preventing, and/or curing many diseases that afflict us. This has brought with it the need for novel methods of statistical analysis and the implementation of these methods in a wide variety of computer programs. It is our aim in this book to make these methods and programs more easily accessible to the beginner who has data to analyze, whether a student or a senior investigator.

Apart from the first chapter, which defines some of the genetic terms we shall use, and the last chapter, which compares three major multipurpose programs/software packages, each chapter of this book takes up a particular analytical topic and illustrates the use of at least one piece of software that the authors have found helpful for the relevant statistical analysis of their own human genetic data. There is often more than one program that performs a particular type of analysis and, once you have used one program for a particular analysis, you may find you prefer another program—and there is a good chance you will find that the same basic analysis is described in more than one chapter of this book. You may therefore wish to browse over several chapters, in the first place restricting your reading to only the introductory sections, which describe the underlying theory. The chapters are ordered in the approximate logical order in which human genetic studies are often conducted; so, if you are new to research in human genetics, this initial reading could serve as an introduction to the subject. Our main purpose, however, is to serve the needs of those who have already performed their study and now need to analyze their data.
The second sections of the chapters give you step-by-step instructions for running the programs and interpreting the program outputs, with extra notes in the third sections. However, although our aim is very much to offer a “do it yourself” manual, there will be times when you will need to consult a statistical geneticist, especially for the interpretation of computer output. We have tried to be fairly comprehensive in covering statistical human genetics, but we do not cover here any of the bioinformatic software for gene sequencing, which is still very much in its infancy.

Robert C. Elston, Cleveland, OH, USA
Jaya M. Satagopan, New York, NY, USA
Shuying Sun, Cleveland, OH, USA
Contents

Preface
Contributors

1 Genetic Terminology
   Robert C. Elston, Jaya M. Satagopan, and Shuying Sun
2 Identification of Genotype Errors
   Yin Y. Shugart and Ying Wang
3 Detecting Pedigree Relationship Errors
   Lei Sun
4 Identifying Cryptic Relationships
   Lei Sun and Apostolos Dimitromanolakis
5 Estimating Allele Frequencies
   Indra Adrianto and Courtney Montgomery
6 Testing Departure from Hardy–Weinberg Proportions
   Jian Wang and Sanjay Shete
7 Estimating Disequilibrium Coefficients
   Maren Vens and Andreas Ziegler
8 Detecting Familial Aggregation
   Adam C. Naj, Yo Son Park, and Terri H. Beaty
9 Estimating Heritability from Twin Studies
   Karin J.H. Verweij, Miriam A. Mosing, Brendan P. Zietsch, and Sarah E. Medland
10 Estimating Heritability from Nuclear Family and Pedigree Data
   Murielle Bochud
11 Correcting for Ascertainment
   Warren Ewens and Robert C. Elston
12 Segregation Analysis Using the Unified Model
   Xiangqing Sun
13 Design Considerations for Genetic Linkage and Association Studies
   Jérémie Nsengimana and D. Timothy Bishop
14 Model-Based Linkage Analysis of a Quantitative Trait
   Audrey H. Schnell and Xiangqing Sun
15 Model-Based Linkage Analysis of a Binary Trait
   Rita M. Cantor
16 Model-Free Linkage Analysis of a Quantitative Trait
   Nathan J. Morris and Catherine M. Stein
17 Model-Free Linkage Analysis of a Binary Trait
   Wei Xu, Shelley B. Bull, Lucia Mirea, and Celia M.T. Greenwood
18 Single Marker Association Analysis for Unrelated Samples
   Gang Zheng, Jinfeng Xu, Ao Yuan, and Joseph L. Gastwirth
19 Single-Marker Family-Based Association Analysis Conditional on Parental Information
   Ren-Hua Chung and Eden R. Martin
20 Single Marker Family-Based Association Analysis Not Conditional on Parental Information
   Junghyun Namkung
21 Allowing for Population Stratification in Association Analysis
   Huaizhen Qin and Xiaofeng Zhu
22 Haplotype Inference
   Xin Li and Jing Li
23 Multi-SNP Haplotype Analysis Methods for Association Analysis
   Daniel O. Stram and Venkatraman E. Seshan
24 Detecting Rare Variants
   Tao Feng and Xiaofeng Zhu
25 The Analysis of Ethnic Mixtures
   Xiaofeng Zhu
26 Identifying Gene Interaction Networks
   Gurkan Bebek
27 Structural Equation Modeling
   Catherine M. Stein, Nathan J. Morris, and Nora L. Nock
28 Genotype Calling for the Affymetrix Platform
   Arne Schillert and Andreas Ziegler
29 Genotype Calling for the Illumina Platform
   Yik Ying Teo
30 Comparison of Requirements and Capabilities of Major Multipurpose Software Packages
   Robert P. Igo Jr. and Audrey H. Schnell

Index
Contributors

INDRA ADRIANTO • Arthritis and Clinical Immunology Research Program, Oklahoma Medical Research Foundation, Oklahoma City, OK, USA
TERRI H. BEATY • Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
GURKAN BEBEK • Center for Proteomics and Bioinformatics, Case Comprehensive Cancer Center, Case Western Reserve University School of Medicine, Cleveland, OH, USA
D. TIMOTHY BISHOP • Section of Epidemiology and Biostatistics, Leeds Institute of Molecular Medicine, University of Leeds, Cancer Genetics Building, Leeds, UK
MURIELLE BOCHUD • Institute of Social and Preventive Medicine, University of Lausanne, Lausanne, Switzerland
SHELLEY B. BULL • Samuel Lunenfeld Research Institute of Mount Sinai Hospital and Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
RITA M. CANTOR • Department of Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA, USA; Center for Neurobehavioral Genetics, Department of Psychiatry, David Geffen School of Medicine at UCLA, Los Angeles, CA, USA
REN-HUA CHUNG • John P. Hussman Institute for Human Genomics, Leonard M. Miller School of Medicine, University of Miami, Miami, FL, USA
APOSTOLOS DIMITROMANOLAKIS • Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
ROBERT C. ELSTON • Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA
WARREN EWENS • Department of Biology, University of Pennsylvania, Philadelphia, PA, USA
TAO FENG • Department of Epidemiology and Biostatistics, Case Western Reserve University School of Medicine, Cleveland, OH, USA
CELIA M.T. GREENWOOD • Centre for Clinical Epidemiology, Lady Davis Research Institute, Jewish General Hospital, Montreal, QC, Canada; Cancer Research Society Division of Epidemiology, Department of Oncology, McGill University, Montreal, QC, Canada
JOSEPH L. GASTWIRTH • Department of Statistics, George Washington University, Washington, DC, USA
ROBERT P. IGO JR. • Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA
JING LI • Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH, USA
XIN LI • Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH, USA
EDEN R. MARTIN • John P. Hussman Institute for Human Genomics, Leonard M. Miller School of Medicine, University of Miami, Miami, FL, USA
SARAH E. MEDLAND • Genetic Epidemiology Unit, Queensland Institute of Medical Research, Brisbane, QLD, Australia
LUCIA MIREA • Maternal-Infant Care Research Centre, Mount Sinai Hospital, Toronto, ON, Canada
COURTNEY MONTGOMERY • Arthritis and Clinical Immunology Research Program, Oklahoma Medical Research Foundation, Oklahoma City, OK, USA
NATHAN J. MORRIS • Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA
MIRIAM A. MOSING • Genetic Epidemiology Unit, Queensland Institute of Medical Research, Brisbane, QLD, Australia; School of Psychology, University of Queensland, Brisbane, QLD, Australia
ADAM C. NAJ • John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL, USA
JUNGHYUN NAMKUNG • Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA
JÉRÉMIE NSENGIMANA • Section of Epidemiology and Biostatistics, Leeds Institute of Molecular Medicine, University of Leeds, Cancer Genetics Building, Leeds, UK
NORA L. NOCK • Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA
YO SON PARK • John P. Hussman Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL, USA
HUAIZHEN QIN • Department of Epidemiology and Biostatistics, Case Western Reserve University School of Medicine, Cleveland, OH, USA
JAYA M. SATAGOPAN • Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY, USA
ARNE SCHILLERT • Institut für Medizinische Biometrie und Statistik, Universitätsklinikum Schleswig-Holstein, Universität zu Lübeck, Lübeck, Germany
AUDREY SCHNELL • Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA
VENKATRAMAN E. SESHAN • Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY, USA
SANJAY SHETE • Department of Epidemiology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
YIN Y. SHUGART • Unit of Statistical Genomics, Intramural Research Program, National Institute of Mental Health, National Institutes of Health, Bethesda, MD, USA
CATHERINE M. STEIN • Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA
DANIEL O. STRAM • Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
LEI SUN • Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
SHUYING SUN • Department of Epidemiology and Biostatistics, Case Comprehensive Cancer Center, Case Western Reserve University, Cleveland, OH, USA
XIANGQING SUN • Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA
YIK YING TEO • Department of Statistics and Applied Probability, National University of Singapore, Singapore, Singapore; Department of Epidemiology and Public Health, National University of Singapore, Singapore, Singapore; Genome Institute of Singapore, Agency for Science, Technology and Research, Singapore, Singapore
MAREN VENS • Institut für Medizinische Biometrie und Statistik, Universitätsklinikum Schleswig-Holstein, Universität zu Lübeck, Lübeck, Germany
KARIN J.H. VERWEIJ • Genetic Epidemiology Unit, Queensland Institute of Medical Research, Brisbane, QLD, Australia; School of Psychology, University of Queensland, Brisbane, QLD, Australia
JIAN WANG • Department of Epidemiology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
YING WANG • Department of Psychiatry and Behavioral Sciences, The Johns Hopkins University School of Medicine, Baltimore, MD, USA
JINFENG XU • Department of Statistics and Applied Probability, National University of Singapore, Singapore, Singapore
WEI XU • Department of Biostatistics, Princess Margaret Hospital, Toronto, ON, Canada; Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
AO YUAN • National Human Genome Center, Howard University, Washington, DC, USA
GANG ZHENG • Office of Biostatistics Research, National Heart, Lung and Blood Institute, Bethesda, MD, USA
XIAOFENG ZHU • Department of Epidemiology and Biostatistics, Case Western Reserve University School of Medicine, Cleveland, OH, USA
ANDREAS ZIEGLER • Institut für Medizinische Biometrie und Statistik, Universitätsklinikum Schleswig-Holstein, Universität zu Lübeck, Lübeck, Germany
BRENDAN P. ZIETSCH • Genetic Epidemiology Unit, Queensland Institute of Medical Research, Brisbane, QLD, Australia; School of Psychology, University of Queensland, Brisbane, QLD, Australia
Chapter 1

Genetic Terminology

Robert C. Elston, Jaya M. Satagopan, and Shuying Sun

Abstract

Common terms used in genetics with multiple meanings are explained and the terminology used in subsequent chapters is defined. Statistical Human Genetics has existed as a discipline for over a century, and during that time the meanings of many of the terms used have evolved, largely driven by molecular discoveries, to the point that molecular and statistical geneticists often have difficulty understanding each other. It is, therefore, imperative, now that so much of molecular genetics is becoming an in silico statistical science, that we have well-defined, common terminology.

Key words: Gene, Allele, Locus, Site, Genotype, Phenotype, Dominant, Recessive, Codominant, Additive, Phenoset, Diallelic, Multiallelic, Polyallelic, Monomorphic, Monoallelic, Polymorphism, Mutation, Complex trait, Multifactorial, Polygenic, Monogenic, Mixed model, Transmission probability, Transition probability, Epistasis, Interaction, Pleiotropy, Quantitative trait locus, Probit, Logit, Penetrance, Transformation, Scale of measurement, Identity by descent, Identity in state, Haplotype, Phase, Multilocus genotype, Allelic association, Linkage disequilibrium, Gametic phase disequilibrium
In this introductory chapter, we give the original meanings of various genetic terms (which will be found in the older literature), together with some of the various meanings that are sometimes ascribed to them today, and how the terms will be defined in the following chapters. For simplicity, exceptions are ignored and what is stated is usually, but not invariably, true.
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_1, © Springer Science+Business Media, LLC 2012

1. Gene, Allele, Locus, Site

The concept of a gene (the word itself was introduced by Johannsen) is due to Mendel, who used the German word “Factor.” Mendel used the word in the same way that we might call “hot” and “cold” factors, not in the way that we call “temperature” a factor. In other words, his Factor was the level of what statisticians now call a factor. In the original terminology, still used by some population geneticists, genes occur in pairs on homologous chromosomes. In this
terminology, the four blood groups A, B, O, and AB (defined in terms of agglutination reactions) are determined by three (allelic) genes: A, B, and O. Nowadays molecular geneticists do not call these three factors genes, but rather “alleles,” defined as “alternative forms” of a gene that can occur at the same locus, or place, in the genome. Whereas Drosophila geneticists used to talk of two loci for a gene, and human geneticists used to talk of two genes at a locus, modern geneticists talk of “two alleles of a gene” or “two alleles at a locus”; this last, which is nowadays so common, is the terminology that will be used in this book. It then follows (rather awkwardly) that two alleles at the same locus are allelic to each other, whereas two alleles that are at different loci are nonallelic to each other.

A gene is commonly defined as a DNA sequence that has a function, meaning a class of similar DNA sequences all involved in the same particular molecular function, such as the formation of the ABO red cell antigens. (Note the common illogical use of the phrase “cloning genes” by molecular geneticists when, by their own terminology, “cloning alleles” is meant.) Some restrict the word gene to protein-coding genes, but there are many more sequences of DNA that have function by virtue of being transcribed to RNA without ever being translated into protein, so this restricted definition of a gene would appear to be unwarranted. See ref. 1 for a more detailed explanation of the evolution of, and a modern definition of, the word “gene.”

A locus is the location on the genome of a gene, such as the “ABO gene.” By any definition a gene must involve more than one nucleotide base pair. Single nucleotide polymorphisms (SNPs) thus do not occur at loci, but rather in and around loci and, in this book, we shall not write SNP markers as being “at” loci.
Because of the confusion that occurs when SNPs are described as occurring at loci, some use the term “gene-locus,” but we shall always use the term locus for the location of a functional gene. We shall, however, allow SNP markers to have alleles and use the original term for their locations: “sites” within loci or, more generally, sites within the region of a locus or anywhere in the genome.

If in the population only one allele occurs at a site or locus, we shall say that it is monomorphic, or monoallelic, in that population. If two alleles occur, as is common for SNPs, we shall use the original term diallelic which, apart from having precedence, is etymologically sounder than the now commonly used term biallelic. If many alleles occur, we shall describe the polymorphism as polyallelic or multiallelic (the former term is arguably more logical, the latter more common).

When there are just two alleles at a locus, the one with the smaller population frequency is called the minor allele. In genetics, the term allele “frequency”—which is strictly speaking a count—is used to mean relative frequency, i.e., the proportion of all such alleles at that locus among the members of a population; thus the term minor allele frequency is often used for diallelic markers.
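As a concrete illustration of relative allele frequency and the minor allele, the sketch below computes allele frequencies for a diallelic marker from genotype counts. The counts and allele labels are hypothetical, chosen only for the example; each homozygote contributes two copies of its allele and each heterozygote one copy of each.

```python
# Hypothetical genotype counts for a diallelic marker with alleles A and a.
genotype_counts = {"AA": 640, "Aa": 320, "aa": 40}

# Each of the 1,000 individuals carries two alleles at this locus.
n_alleles = 2 * sum(genotype_counts.values())
freq_A = (2 * genotype_counts["AA"] + genotype_counts["Aa"]) / n_alleles
freq_a = (2 * genotype_counts["aa"] + genotype_counts["Aa"]) / n_alleles

# The minor allele is the one with the smaller relative frequency.
maf = min(freq_A, freq_a)
print(freq_A, freq_a, maf)  # 0.8 0.2 0.2
```

Note that these are relative frequencies (proportions), matching the genetic usage of “frequency” rather than the strict statistical meaning of a count.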
2. Genotype, Phenotype, Dominant, Recessive, Codominant, Additive
An individual’s genotype is the totality of that individual’s hereditary material, whereas an individual’s phenotype is the individual’s appearance. However, the terms genotype and phenotype are usually used in reference to a particular locus or set of loci, and to a particular trait or set of traits. Genotypes are not observed directly, but rather inferred from particular phenotypes. Thus, with respect to the ABO locus, the four blood types A, B, O, and AB are (discrete) phenotypes; and the possible genotypes, formed by pairs of alleles, are AA, AO, BB, BO, AB, and OO.

With respect to the ABO blood group phenotypes, the A allele is dominant to the O allele and the O allele is recessive to the A allele. Similarly, the B allele is dominant to the O allele and the O allele is recessive to the B allele. The A and B alleles are codominant. Note that for the words “dominant” and “recessive” to have any meaning, at least two alleles and two phenotypes must be specified. If a particular allele at a locus is dominant with respect to the presence of a disease, there must be at least one other allele at that locus that is recessive with respect to the absence of that disease.

Geneticists loosely talk about a disease being dominant, meaning that, with respect to the phenotype “disease”, the underlying disease allele is dominant, i.e., the disease is present when either one or two copies of the allele are present. Similarly, they talk of a disease being recessive, meaning that, with respect to the same phenotype, the underlying disease allele is recessive, i.e., the disease is present only when two disease alleles are present. Alternatively, they may talk of an allele being dominant or recessive, the particular phenotype (often disease) being understood. The important thing to realize is that “dominance” and “recessivity” describe a relationship between one or more genotypes and a particular phenotype.
This leads to the concept of phenosets: the genotypes AA and AO form the phenoset corresponding to the A blood type, and the genotypes BB and BO form the phenoset corresponding to the B blood type. In the case of the ABO blood group, a person who has one A allele and one B allele has the blood type AB, which is a different phenotype from that of either of the corresponding homozygotes, AA and BB; this relationship is called codominance. In general, a locus is codominant with respect to the set of phenotypes it controls if the phenotypes of each heterozygote at that locus differ from that of each of the corresponding homozygotes. We make a distinction between codominant and additive; the latter implies that the phenotype (or phenotypic distribution, see below under quantitative phenotypes) corresponding to the heterozygote is in some sense half-way between those of the two corresponding homozygotes. Whereas the term additive is only meaningful when a
scale of measurement has been defined, codominance is a more general concept that does not require the definition of a scale of measurement.
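The ABO dominance relationships and phenosets described in this section can be sketched as a genotype-to-phenotype map; the function name and tuple encoding below are our own illustrative choices, not notation from the text.

```python
def abo_phenotype(genotype):
    """Map an unordered ABO genotype, e.g. ("A", "O"), to its blood type."""
    alleles = set(genotype)
    if alleles == {"A", "B"}:
        return "AB"   # codominance: differs from both homozygotes AA and BB
    if "A" in alleles:
        return "A"    # phenoset {AA, AO}: A is dominant to O
    if "B" in alleles:
        return "B"    # phenoset {BB, BO}: B is dominant to O
    return "O"        # O is recessive: blood type O requires genotype OO

for g in [("A", "A"), ("A", "O"), ("B", "B"), ("B", "O"), ("A", "B"), ("O", "O")]:
    print(g, "->", abo_phenotype(g))
```

The six genotypes collapse to four phenotypes precisely because the phenosets {AA, AO} and {BB, BO} are each indistinguishable by the agglutination reactions that define the blood types.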
3. Polymorphism, Mutation

The A, B, O, and AB blood types comprise a polymorphism, in the sense that they are alternative phenotypes that commonly occur in the population. A polymorphic locus was originally defined operationally as a polymorphism-determining locus at which the least common allele occurs with a “frequency” of at least 1% (2); but a more appropriate definition would be a locus at which the most common allele occurs with a “frequency” of at most 99%.

Different alleles arise at a locus as a result of mutation, or sudden change in the genetic material. Mutation is a relatively rare event, caused, for example, by an error in replication. Thus all alleles are by origin mutant alleles, and a genetic polymorphism was conceived of as a locus at which the least common allele has a frequency too large to be maintained in the population solely by recurrent mutation. However, what is important at a locus is the degree of polymorphism, and a locus at which there are 1,000 equifrequent alleles would be considered much more polymorphic than a locus at which there are two alleles with frequencies 0.01 and 0.99. Many authors now use the term mutation for any rare allele, and the term polymorphism for any common allele.
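The degree of polymorphism mentioned above can be quantified in several ways; one standard population-genetic summary (our illustration, not a definition from this chapter) is the expected heterozygosity 1 - sum(p_i^2), the probability that two alleles drawn at random from the population differ. It sharply separates the two loci contrasted in the text.

```python
def heterozygosity(freqs):
    """Expected heterozygosity 1 - sum(p_i^2) for allele frequencies p_i."""
    assert abs(sum(freqs) - 1.0) < 1e-9, "frequencies must sum to 1"
    return 1.0 - sum(p * p for p in freqs)

# Two alleles with frequencies 0.01 and 0.99: barely polymorphic.
print(heterozygosity([0.01, 0.99]))      # ~0.0198

# 1,000 equifrequent alleles: almost maximally polymorphic.
print(heterozygosity([1 / 1000] * 1000))  # ~0.999
```

On this measure the 1,000-allele locus is roughly fifty times more polymorphic than the diallelic one, in line with the comparison made in the text.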
4. Complex Trait, Multifactorial, Polygenic, Monogenic
The term “complex trait” was introduced about two decades ago without a clear definition. It appears to be used for traits that do not exhibit clear one-locus (“Mendelian”) segregation, usually because segregation at more than one locus is involved. Whereas multifactorial and complex are ill-defined and often used interchangeably, a clear distinction should be made between multifactorial and polygenic. Multifactorial implies that more than one factor is involved in the etiology of the phenotype, whether genetic, environmental, or both. Polygenic, on the other hand, implies that only genetic factors are involved, usually in an additive fashion, with the original definition that the number of factors (loci) is so large that they cannot be individually characterized. Thus, strictly speaking, the term polygenic should not be used to include any environmental factors—though in practice it is often used that way.
Monogenic inheritance implies segregation at a single locus, and the term “mixed model” is used by geneticists to denote an additive combination of monogenic and polygenic inheritance. When both components are present in a segregation model in which both components are latent variables (the former discrete and the latter continuous), the underlying statistical model is random, not mixed, because there are two random components other than any error term.

Statistical geneticists often use the term “transmission probability” in two quite different senses. In this book, we carefully distinguish transmission probabilities, probabilities that a parent having a particular genotype transmits particular alleles to offspring, from transition probabilities, probabilities that offspring receive particular genotypes from their parents. This distinction was introduced in ref. 3.
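The distinction between transmission and transition probabilities can be made concrete for Mendelian segregation at a single diallelic autosomal locus; the dictionary encoding below is our own illustrative sketch. A transmission probability conditions on one parent's genotype and gives the chance of passing a particular allele; a transition probability combines one transmission from each parent to give the offspring genotype distribution.

```python
from itertools import product

# Transmission probabilities under Mendelian segregation:
# P(allele transmitted | parent genotype).
transmission = {
    "AA": {"A": 1.0, "a": 0.0},
    "Aa": {"A": 0.5, "a": 0.5},
    "aa": {"A": 0.0, "a": 1.0},
}

def transition(father, mother):
    """Transition probabilities P(offspring genotype | parental genotypes),
    obtained by combining one transmission from each parent."""
    probs = {}
    for pa, ma in product("Aa", "Aa"):
        g = "".join(sorted(pa + ma))  # unordered genotype label: AA, Aa, or aa
        p = transmission[father][pa] * transmission[mother][ma]
        probs[g] = probs.get(g, 0.0) + p
    return {g: p for g, p in probs.items() if p > 0}

print(transition("Aa", "Aa"))  # {'AA': 0.25, 'Aa': 0.5, 'aa': 0.25}
print(transition("AA", "aa"))  # {'Aa': 1.0}
```

The 1:2:1 ratio from the heterozygote-by-heterozygote mating is the familiar Mendelian result; the transition table is derived from, but not identical to, the transmission table.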
5. Haplotype, Phase, Multilocus Genotype
Let A, B be two alleles at one locus, and D, d be two alleles at another locus. If one parent transmits A and D to an offspring, while the other transmits B and d, the offspring genotype is denoted AD/Bd (or Bd/AD), in which the parental origins are separated by “/”. The two alleles transmitted by one parent constitute a two-locus haplotype; with respect to two alleles at each of two loci there are four possible haplotypes—AD, Ad, BD, and Bd in this case, with AD/Bd and Bd/AD being the two possible phases. If n1 alleles can occur at one of the loci and n2 at the other, n1n2 two-locus haplotypes are possible. At the first locus n1(n1 + 1)/2 genotypes are possible (n1 homozygotes and n1(n1 − 1)/2 heterozygotes), while at the second locus n2(n2 + 1)/2 genotypes are possible. If we pair these genotypes, one from each locus, the total number of pairs possible is

n1(n1 + 1)/2 × n2(n2 + 1)/2 = n1n2(n1n2 + 1)/2 − [n1(n1 − 1)/2][n2(n2 − 1)/2].

On the other hand, at the two loci together, there are n1n2 haplotypes; and pairing these we have n1n2(n1n2 + 1)/2 possible pairs of two-locus haplotypes, or diplotypes. In this book, we shall define “two-locus genotypes” this way, i.e., without differentiating the two phases, so that for the same number of alleles at each locus there is a smaller number of two-locus genotypes than there is of two-locus diplotypes. Thus we shall consider the two phases of the double heterozygote, Ad/BD and BD/Ad, as being the same two-locus genotype. Usually, the term “multilocus genotype”
refers to genotypes when the phases are not distinguished, and the term diplotype is useful for the case when they are distinguished (though this term is not yet in common usage). More generally, a haplotype is the multilocus analog of an allele at a single locus. It consists of one allele from each of multiple loci that are transmitted together from a parent to an offspring. When haplotypes made up of multiple alleles (one from each locus) are paired, a pair in which the genotype at each of n loci is heterozygous corresponds to 2^(n − 1) different diplotypes or phases. It is usual nowadays to restrict the word haplotype to the case where all the loci involved are on the same chromosome pair, so that all the alleles involved are on the same chromosome. Typically, but not always, it is assumed that all the different phases of a particular multiple heterozygote have the same phenotype.
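The counting relationships in this section can be checked numerically; the helper names below are our own. The number of unphased two-locus genotypes falls short of the number of diplotypes by exactly one extra phase per double heterozygote, and a heterozygote at each of n loci has 2^(n − 1) phases.

```python
def n_genotypes(n):
    """Number of unordered genotypes at a locus with n alleles: n(n + 1)/2."""
    return n * (n + 1) // 2

def two_locus_counts(n1, n2):
    # Unphased two-locus genotypes: pair one genotype from each locus.
    genotypes = n_genotypes(n1) * n_genotypes(n2)
    # Diplotypes: unordered pairs of the n1*n2 possible two-locus haplotypes.
    diplotypes = n_genotypes(n1 * n2)
    # The excess is one extra phase for each double heterozygote.
    double_hets = (n1 * (n1 - 1) // 2) * (n2 * (n2 - 1) // 2)
    assert diplotypes == genotypes + double_hets
    return genotypes, diplotypes

# Two alleles at each locus: 9 genotypes but 10 diplotypes, because
# Ad/BD and AD/Bd collapse to a single two-locus genotype.
print(two_locus_counts(2, 2))  # (9, 10)

# A heterozygote at each of n loci corresponds to 2**(n - 1) phases:
for n in (1, 2, 3, 4):
    print(n, "loci:", 2 ** (n - 1), "phases")
```

The assertion inside `two_locus_counts` verifies the identity in the displayed equation for any n1 and n2.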
6. Epistasis, Interaction, Pleiotropy

When two loci are segregating, each typically influences a separate phenotype. For example, A and B may be alleles at the ABO locus, determining the ABO blood types, while D and d are alleles at a disease locus, determining disease status. But if alleles at a single locus influence two different phenotypes, we say there is pleiotropy. It is known that a person’s ABO genotype influences the risk of gastric cancer as well as determining blood type. Thus the ABO locus is pleiotropic. Alternatively, alleles at two different loci may determine the same phenotype, such as the presence or absence of a disease; and if the phenotype associated with the genotypes at one locus depends on the genotypes at another locus, we say there is epistasis. Thus gastric cancer may perhaps be caused by the epistatic effect of alleles at two (or more) loci. Epistasis and pleiotropy are sometimes confused in statistical genetics.

7. Allelic Association, Linkage Disequilibrium, Gametic Phase Disequilibrium

If the alleles at one locus are not distributed in the population independently of the alleles at another locus, the two loci exhibit allelic association. If this association is a result of a mixture of subpopulations (such as ethnicities or religious groups) within each of which there is random mating, the association is often denoted as “spurious.” In such a case, there is true association, but the cause is not of primary genetic interest. If the association is not due to this kind of population structure, it is either due to linkage disequilibrium (LD) or, more generally, to gametic phase disequilibrium (GPD); in the former case the loci are linked, i.e., they
1 Genetic Terminology
cosegregate in families, in the latter case they need not be linked, i.e., they may segregate independently in families. Owing to an unintended original definition, loci that are not linked have often been mistakenly described as being in LD (4, 5).
8. Identity

The concept of allelic identity is an important one. Alleles are identical by descent (IBD) if they are copies of the same ancestral allele, and must be differentiated from alleles that are physically identical but not (at least within the previous dozen or so generations) ancestrally identical. Such alleles, when not IBD, are identical in state (IIS). It is well understood that molecules, atoms, etc., can be in different states (not “by” different states), and the same is true of alleles, though here the states are ancestrally, not physically, different. Whereas in the animal and plant genetics literature the phrases “identity in state” and “identical in state” are commonly used, for no good reason the phrases “identity by state” and “identical by state” are now commonly used in the human genetics literature. In this book, to stress the difference and to be consistent with both the earlier common usage and the usage in the animal and plant genetics literature, we shall use the terminology IIS, not IBS.
9. Quantitative Traits

A locus at which alleles determine the level of a quantitative phenotype is called a QTL (quantitative trait locus). Typically, the word “quantitative” is used interchangeably with “continuous” when describing a phenotype. However, quantitative traits can be discrete. Care should be taken to distinguish between those methods of analysis of quantitative traits for which distributional assumptions, such as conditional normality, are critical, and those for which they are not. Transforming the phenotype of a QTL corresponds to changing its units if the transformation is linear, or more generally to changing the scale of measurement (e.g., square root or logarithmic) if the transformation is nonlinear. On the scale of measurement used, alleles at a QTL have an additive effect if the phenotypic distribution of the heterozygote is the average of the corresponding two homozygote phenotypic distributions. With respect to that phenotype, allele A is dominant to the allele B, and allele B is recessive to allele A, if the whole phenotypic distribution of the heterozygote AB is the same as that of the homozygote AA. Any variance among the phenotypic means of the genotypes at a locus over and above that
due to additive allele action is called dominance, or dominant genetic, variance. Thus dominance variance can arise as a result of one allele being dominant to another, but such simple allele action is not necessarily implied by dominance variance. The presence of dominance variance depends on the scale of measurement; dominant allele action (complete dominance, as described above for discrete traits such as the ABO blood group) does not. If the phenotypic distribution of a heterozygote is not the average of the corresponding two homozygote phenotypic distributions, we shall say there is codominance. Thus in this book we shall not restrict the word codominance to the case of additivity (with the result that codominance is scale independent). Just as dominance has a different meaning when applied to quantitative traits, so does epistasis. From a statistical point of view, dominance can be considered as intralocus interaction, or nonadditivity of the allelic contributions to the phenotype. Epistasis is a genetic term, now generalized when applied to quantitative traits to indicate nonadditivity of the effects on the phenotype of the genotypes at two (or more) loci in a population. It is thus from a statistical viewpoint interlocus interaction, and so dependent on how the phenotype is measured. Statistical interaction is a term with a similar limitation, but is not restricted to genetic factors. Statistical interaction should be carefully distinguished from biological interaction (5, 6). Whereas biological interaction does not require the presence of statistical interaction, the presence of the latter implies the existence of the former. Indeed, statistical interaction is removable if a monotonic transformation can make the effects of the two factors involved (e.g., segregation at two loci, or segregation at one locus and levels of an environmental factor) additive. 
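The removability of statistical interaction under a monotonic transformation can be illustrated numerically. In the sketch below (our own illustration, with hypothetical genotypic means chosen to follow a multiplicative model), the two loci interact on the raw scale, but a log transformation makes their effects additive:

```python
import math

# Hypothetical genotypic means at two loci under a multiplicative model:
# each genotype multiplies the phenotype by a locus-specific factor.
locus1 = {"AA": 1.0, "AB": 2.0, "BB": 4.0}
locus2 = {"DD": 1.0, "Dd": 3.0, "dd": 9.0}

def interaction_contrast(mean):
    """Corner-cell contrast; zero iff the two loci act additively."""
    return (mean("AA", "DD") - mean("AA", "dd")
            - mean("BB", "DD") + mean("BB", "dd"))

raw = lambda g1, g2: locus1[g1] * locus2[g2]
logged = lambda g1, g2: math.log(locus1[g1] * locus2[g2])

print(interaction_contrast(raw))                 # 24.0: interaction on the raw scale
print(abs(interaction_contrast(logged)) < 1e-9)  # True: removed by the log transform
```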
Furthermore, the magnitude of any interaction effects can depend critically on how the individual factor effects (single locus genotypes in the case of genetic factors) are defined (5). There is usually no loss of generality in assuming that disease status, unaffected or affected, is a quantitative trait that takes on the values 0 or 1, respectively, so that its mean value is the population prevalence of the disease. Then everything that has been written here with regard to dominant allele action, dominance variance, and epistasis also holds in the case of a binary disease phenotype, except that now the scale of measurement (in the sense of a nonlinear monotonic transformation) is irrelevant in the absence of a quantitative measure. However, if there is a quantitative measure, such as a relative risk or odds ratio, then the scale of measurement will determine whether or not there is interaction. Also, in the case of a binary disease phenotype, the penetrance, or probability of being affected, is often transformed to a probit (or logit), giving rise to what is called the “liability” to disease, and this liability is treated as a continuous phenotype. Dominance and epistatic variance can be
quite different on this liability scale from that measured on the original “penetrance” scale. For a QTL, dominance variance is present when there is intralocus nonadditivity. By the same token, epistatic variance is present when there is interlocus nonadditivity. Each locus gives rise to its own components of additive genetic and dominant genetic variance. If multiple loci affect a QTL, there are multiple components of epistatic variance. Except in the case of a binary phenotype with no associated quantitative measure, the relative magnitudes of all such components are scale (i.e., transformation) dependent, just as corresponding components of genotype (or allele) × environment interaction are scale dependent. Finally, for those who wish to have a better theoretical understanding of statistical human genetics, ref. 7 provides an exceptionally good introduction.
Acknowledgments

This work was supported in part by the following grants from the National Institutes of Health, USA: P41RR003655 (RCE) and R01CA137420 (JMS).

References
1. Gerstein MB et al (2007) What is a gene, post-ENCODE? History and updated definition. Genome Res 17:669–681
2. Ford EB (1940) Polymorphism and taxonomy. In: Huxley J (ed) The new systematics, Oxford
3. Elston RC, Stewart J (1971) A general model for the genetic analysis of pedigree data. Hum Hered 21:523–542
4. Lewontin RC (1964) The interaction of selection and linkage. I. General considerations; heterotic models. Genetics 49:49–67
5. Wang X, Elston RC, Zhu X (2010) The meaning of interaction. Hum Hered 70:269–277
6. Wang X, Elston R, Zhu X (2010) Statistical interaction in human genetics: how should we model it if we are looking for biological interaction? Nat Rev Genet doi:10.1038/nrg2579-c2
7. Ziegler A, König IR (2010) A statistical approach to genetic epidemiology: concepts and applications, 2nd edn. Wiley-VCH, Weinheim
Chapter 2
Identification of Genotype Errors
Yin Y. Shugart and Ying Wang

Abstract
It has been documented that there exist some errors in most large genotype datasets and that an error rate of 1–2% is sufficient to lead to distortion of map distances as well as false conclusions of linkage (Abecasis et al. Eur J Hum Genet 9(2):130–134, 2001); therefore one needs to ensure that the data are as clean as possible. On the other hand, the process of data cleaning is tedious and demands effort and experience. O’Connell and Weeks implemented four error-checking algorithms in computer software called PedCheck. In this chapter, the four algorithms implemented in PedCheck are discussed, with a focus on the genotype-elimination method. Furthermore, an example of the four levels of error checking permitted by PedCheck is provided with the required input files. In addition, alternative algorithms implemented in other statistical computing programs are also briefly discussed.

Key words: Genotype, Genotype error, Parametric linkage analysis, LOD score, Computational efficiency, Automatic genotype elimination, Nuclear-pedigree method, Genotype-elimination method, Critical-genotype method, Odds-ratio method
1. Introduction

When gene hunters had limited access to computational resources, they had to rely on visual inspection to check for genotypic errors occurring in a human pedigree. The errors come from two major sources: (1) pedigree errors and (2) true genotyping errors. Pedigree errors include, but are not limited to, nonpaternity, unreported adoption status, and errors in data entry, as well as sample mix-ups. In this chapter, we focus only on true genotyping errors. One can imagine that if automated approaches for error checking were not available, detection of erroneous genotypes by visual inspection could be a tedious task, particularly in extended pedigrees with multiple generations. Therefore, there is a demand for computational algorithms that can efficiently identify erroneous
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_2, # Springer Science+Business Media, LLC 2012
genotypes regardless of the size of the pedigrees. Needless to say, the elimination of genotype errors will benefit both linkage analysis and family-based association analysis. In this chapter, we use linkage analysis as an example for the sake of illustration. Linkage analysis is a statistical method that has been widely applied for mapping genes related to human disorders with relatively high penetrance and rare disease allele frequencies. Because DNA segments located near each other on a chromosome tend to be passed together from one generation to another, genetic markers are often used as tools for tracking the inheritance pattern of a gene that has not yet been identified but whose approximate location is known. The main use of linkage analysis is to test for evidence of linkage between a disease locus of interest and an arbitrary marker locus. The LOD score was defined as a test statistic by Morton (2). A single-point LOD score test is typically performed by maximizing the LOD score over a grid of values of θ in the interval 0–0.5, where θ is the recombination fraction, viewed as a measure of the extent of linkage. The recombination fraction ranges from θ = 0 for loci in close proximity to each other through θ = 0.5 for loci that are far apart or on different chromosomes. The formula for a LOD score is

Z(θ) = log10[L(θ)] − log10[L(θ = 0.50)].

In practice, LOD scores can be calculated using linkage analysis programs such as LIPED (3) and LINKAGE (4–6). A well-known algorithm for parametric linkage analysis is the Elston–Stewart algorithm (7), in which the genetic mode of inheritance is assumed; this algorithm was implemented in the LINKAGE software in an elegant manner. It is common knowledge that LINKAGE is capable of handling relatively large pedigrees, but the computation time of LINKAGE is exponential in the number of genetic markers (8).
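As an illustration of the single-point LOD computation, here is a minimal sketch for the simplest case of phase-known meioses, using a toy binomial likelihood (this is our own illustration, not the algorithm used by LIPED or LINKAGE):

```python
import math

def lod(theta, r, n):
    """Single-point LOD for r recombinants observed in n phase-known meioses:
    L(theta) = theta**r * (1 - theta)**(n - r)."""
    log_like = r * math.log10(theta) + (n - r) * math.log10(1 - theta)
    log_like_null = n * math.log10(0.5)   # theta = 0.5: free recombination
    return log_like - log_like_null

# Maximize over a grid of theta values in (0, 0.5], as described in the text.
grid = [t / 1000 for t in range(1, 501)]
theta_hat = max(grid, key=lambda t: lod(t, r=2, n=20))
print(theta_hat, round(lod(theta_hat, 2, 20), 2))   # theta_hat = r/n = 0.1
```

With 2 recombinants in 20 meioses the grid maximum sits at the binomial estimate r/n, and the resulting LOD of about 3.2 exceeds the conventional threshold of 3 for declaring linkage.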
In the presence of genotypic errors, the increase in computational time can be tremendous. Therefore, eliminating Mendelian inconsistencies in pedigree data is an important task, if only for the sake of improving the efficiency of linkage analysis. Lange and Goradia (9) described an algorithm for automatic genotype elimination which enables a great reduction in the computational time required for pedigree analysis. Their genotype elimination program was an extension of the one given by Lange and Boehnke (10), which is discussed in this chapter. Later, Stringham and Boehnke (11) developed two methods that were implemented in the Mendel computer software (12) to calculate the posterior probability of erroneous genotypes for each pedigree member. Stringham and Boehnke developed two novel approaches to compute an individual posterior probability of
genotype error using a weighted sum of all possible genotypes, where the weight is the probability of the genotype being an error. Their first approach allows each individual to have every possible genotype weighted by the probability that the genotype is erroneous and then computes the posterior probability of this individual’s genotype, whereas their second approach allows one genotyped pedigree member at a time to have every possible genotype while the other pedigree members are held at their originally assigned genotypes. They then compute the posterior probability for each genotyped individual, based on the assumption that all other pedigree members have been correctly genotyped. It has been recognized that both methods have the weakness of not being able to automatically deal with more than one marker at a time; therefore, the computation with markers with multiple alleles (e.g., microsatellites) is very time consuming. These two algorithms were implemented in MENDEL (12). Of note, Sobel and Lange (13) developed important extensions to the Lander–Green–Kruglyak algorithm for small pedigrees and to the Markov-chain Monte Carlo stochastic algorithm for large pedigrees to conduct linkage analysis while accommodating genotype errors. These developments have been implemented in the program MENDEL, version 4. In recognition of the need for developing efficient computer software for automatic error checking, O’Connell and Weeks (14) implemented four error-checking algorithms in a new computer program called PedCheck. The main goal of this chapter is to illustrate the algorithms used in PedCheck and, for the purpose of demonstration, offer a few examples for different levels of error checking permitted by PedCheck, as well as provide detailed instructions for users to follow using the simulated pedigree data presented in Figs. 1 and 2.

1.1. The Nuclear-Pedigree Algorithm
Lange and Goradia (9) described the steps involved in using a nuclear pedigree as follows:
(A) For each pedigree member, save only those genotypes compatible with his or her phenotype.
(B) For each nuclear family:
    1. Examine each mother–father genotype pair.
       (a) Determine which zygote genotypes can arise.
       (b) If each child in the nuclear family has one or more of these zygote genotypes among his or her list of genotypes, then save the parental genotypes as well as any child genotype that is not in conflict with one of the created zygote genotypes.
Fig. 1. Example of a level 1 error. [Pedigree diagram: seven individuals; the typed genotypes shown are 2/1, 4/3, 4/6, 5/1, and 3/2.]
       (c) If any child is incompatible with the current parental pair, then take no action to save any genotypes.
    2. For each person in the nuclear family, exclude any genotypes not saved during step 1.
(C) Repeat step B until no more genotypes can be excluded.

1.2. The Genotype-Elimination Algorithm
The genotype-elimination algorithm used by O’Connell and Weeks was an extension of the Lange–Goradia algorithm (9, 15) for set-recoded genotypes (14). Because the genotype-elimination algorithm is able to detect subtle levels of inconsistency, due to the elimination of certain genotypes in pedigrees with more complex structure, it is more powerful than the nuclear-family algorithm originally described by Lange and Goradia (9). For example, Fig. 1 shows that individual 7 cannot possibly carry allele 3 given that the genotype of his father is 4/3 and that of his mother is 4/6. As described by O’Connell and Weeks (14), for each pedigree and locus, the genotype-elimination algorithm can pick the first component nuclear family with an error that has been missed by the nuclear-family algorithm, and then provide the inferred-genotype lists for each member of that nuclear family. To illustrate the
Fig. 2. Example of a Mendelian error in a pedigree with three generations. [Pedigree diagram: the typed genotypes shown are 2/4, 1/2, 1/2, 2/2, and 1/3.]
situation where the nuclear-family algorithm is not able to find any errors, we present an example in Fig. 2. As shown in Fig. 2, the genotypes in each nuclear family appear to be consistent at level 1 checking, but individual 3 determines the phase (“linkage phase” is defined as the haplotype of the gamete transmitted from parent to offspring) of individual 6, which forces individual 3 to carry a “2” allele, which also means that a parent of individual 4 is a carrier of a “3” allele. However, we know the genotypes of the parents of individual 4, which are 4/2 and 1/2; therefore, he cannot possibly carry a “3” allele. This type of error can easily be detected by the genotype-elimination algorithm as a level 2 error.

1.3. The Critical-Genotype Method
In Subheadings 1.1 and 1.2, we described the nuclear-family method and the genotype-elimination method. Below we discuss two additional algorithms, termed the “critical-genotype” and “odds-ratio” methods, respectively, both implemented in PedCheck. The genotype of an individual that resolves the issue of inconsistency within a pedigree when removed from the data is defined as a “critical genotype” by O’Connell and Weeks (14). They pointed
out that a “critical genotype” does not always provide a perfect solution. The critical genotypes may not be independent, for example, when a parent and offspring are homozygous for different alleles. Under this circumstance, both will be critical genotypes, but blanking either resolves the inconsistency. Therefore, O’Connell and Weeks concluded that “the set of erroneous genotypes is a subset of the critical genotypes.” The critical-genotype algorithm implemented in PedCheck enables users to identify the critical genotypes in a particular pedigree by “untyping” one typed individual at a time (meaning that this individual is called unknown in the input file), and then applying the genotype-elimination algorithm to determine whether the inconsistency has been resolved. If one critical genotype is found, this genotype represents the error; otherwise, the set of critical genotypes blanked may depend on the order of the individuals who were untyped. Below we give one example to demonstrate how PedCheck detects errors at different levels. The genotype data were simulated. PedCheck can be freely downloaded from the following Web site: http://watson.hgen.pitt.edu/register/docs/pedcheck.html. As documented in the PedCheck manual, one is able to specify the algorithms at four levels (from basic to comprehensive): level 1 uses the above-described nuclear-family algorithm; level 2 checking uses the genotype-elimination algorithm; level 3 checking uses the critical-genotype algorithm; and level 4 checking uses an odds-ratio algorithm. Typically, users start with level 1 checking, resolve the identified problems, and then move to levels 2, 3, and 4. If a complete genotype-elimination algorithm finds no errors, the genotypes are consistent with the Mendelian laws of inheritance, and downstream analyses can be performed.
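The genotype elimination that these checking levels build on can be sketched as follows. This is a simplified single-locus rendering of the Lange–Goradia idea (9), with data structures of our own choosing rather than PedCheck's:

```python
from itertools import product

def zygotes(gf, gm):
    """Child genotypes (sorted allele pairs) possible from parents gf x gm."""
    return {tuple(sorted((a, b))) for a in gf for b in gm}

def eliminate(families, possible):
    """Iteratively prune genotype lists over a set of nuclear families.
    families: list of (father_id, mother_id, [child_ids]);
    possible: dict id -> set of genotypes compatible with the observed data.
    An empty set on return flags a Mendelian inconsistency."""
    changed = True
    while changed:
        changed = False
        for father, mother, kids in families:
            saved = {p: set() for p in [father, mother] + kids}
            for gf, gm in product(possible[father], possible[mother]):
                z = zygotes(gf, gm)
                if all(possible[k] & z for k in kids):  # every child compatible
                    saved[father].add(gf)
                    saved[mother].add(gm)
                    for k in kids:
                        saved[k] |= possible[k] & z
            for member, kept in saved.items():
                if kept != possible[member]:
                    possible[member] = kept
                    changed = True
    return possible

# The inconsistency of Fig. 1: father 4/3, mother 4/6, child 3/2.
out = eliminate([("f", "m", ["c"])],
                {"f": {(3, 4)}, "m": {(4, 6)}, "c": {(2, 3)}})
print(out["c"])   # set() -> no genotype survives: Mendelian error detected
```

Since the genotype sets only ever shrink, the outer loop is guaranteed to terminate.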
As clearly stated by the authors of PedCheck, although the genotype-elimination algorithm is guaranteed to find subtle errors, the inferred-genotype lists provided by PedCheck do not always help to detect the source of the inconsistency, simply because the genotype lists for untyped individuals may be extensive. Even if there is only one error, the individual involved may be difficult to detect by examination of only the genotype lists, since either more than one individual may be identified as the possible source of that error or the error may not be in the particular nuclear-family data appearing in the output. Therefore, the authors attempted to find an appropriate statistic to distinguish between several critical genotypes.

1.4. The Odds-Ratio Algorithm
It was thoroughly discussed by O’Connell and Weeks that, in the presence of several critical genotypes at a given locus, one is not able to decide a priori which critical genotype is most likely to be a mistake. To help distinguish between various critical genotypes, O’Connell and Weeks implemented an odds-ratio statistic based on single-locus likelihoods for the pedigree. We would first like to bring to the users’ attention that the authors restrict the genotypes to contain only alleles appearing in the pedigree under consideration. Because “untyping” an individual with a critical genotype results in a consistent pedigree, one recognizes that after genotype elimination the individual must have at least one alternative genotype that is equally valid in a statistical sense. Second, for the particular locus, the authors compute and store the likelihood of the pedigree data for each alternative valid genotype at each critical genotype, while keeping all other critical genotypes at their original values. To explain how PedCheck works at each level of assessment, we present a test data set using one simulated multigeneration pedigree (Fig. 3). Note that, in this particular example, the level 2 option automatically runs level 1 first. Level 3 checking identified a total of three critical genotypes. It is very fast to compute, since it involves only single-locus likelihoods. Finally, level 4 checking calculates a full-pedigree likelihood.

Fig. 3. A simulated pedigree that was used for the purpose of demonstration. The plus sign indicates the individual with the actual genotype error; and the triangle points to the individual identified by PedCheck as most likely to have an erroneous genotype. [Pedigree diagram: individuals 1–20 plus 31 and 32, with typed genotypes as listed in myfile23.ped; the plus sign appears next to individual 14 (7/9).]
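The level 3 critical-genotype search can be sketched as a loop that untypes one typed member at a time and re-checks consistency. In this illustration (our own, not PedCheck's code) a toy trio-only Mendelian check stands in for full genotype elimination:

```python
def trio_consistent(ped):
    """Toy Mendelian check for a father/mother/child trio; None = untyped."""
    father, mother, child = ped["father"], ped["mother"], ped["child"]
    if child is None:
        return True
    def can_give(parent, allele):
        return parent is None or allele in parent
    a, b = child
    return ((can_give(father, a) and can_give(mother, b)) or
            (can_give(father, b) and can_give(mother, a)))

def critical_genotypes(ped, consistent):
    """Individuals whose removal ('untyping') restores consistency."""
    criticals = []
    for person, geno in ped.items():
        if geno is None:
            continue
        trial = dict(ped)
        trial[person] = None        # untype this individual only
        if consistent(trial):
            criticals.append(person)
    return criticals

# Father 4/3 and mother 4/6 cannot have produced a 3/2 child; untyping the
# mother or the child (but not the father) resolves the inconsistency here.
ped = {"father": (4, 3), "mother": (4, 6), "child": (2, 3)}
print(critical_genotypes(ped, trio_consistent))   # ['mother', 'child']
```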
2. Method

The procedure for running PedCheck is fairly straightforward. Level 1 is a first screening using the nuclear-pedigree algorithm. If no level 1 errors were detected, or once the errors have been removed, level 2 should be run to detect relatively subtle errors using the genotype-elimination algorithm. Level 3 should be run only if the researchers are interested in knowing the “critical genotypes,” and level 4 should be run to detect the most likely source of error. After level 4 checking finishes, level 2 should be run again if there has been any change in the data, until PedCheck indicates that no more pedigree errors can be found and that you are ready for analysis. PedCheck requires two input files: a ped file and a data file. They can be either pre-MAKEPED files or LINKAGE format files. The files below are prepared in the pre-MAKEPED format. The first column is the pedigree ID, followed by the individual’s ID, father’s ID, mother’s ID, gender (1 = male, 2 = female), affection status (1 = normal, 2 = affected, and 0 = unknown), and the marker alleles. When entering the ID number of a father or mother who is a founder (i.e., his or her parents are not known in this particular pedigree), one should use zero. The number of spaces between fields should not matter. The file below is called myfile1.ped for pedigree 1 (see Fig. 1).

Fam  ID  Father  Mother  Gender  Aff  Allele1  Allele2
1    1   0       0       1       0    0        0
1    2   0       0       2       0    0        0
1    3   1       2       2       0    2        1
1    4   1       2       2       0    4        3
1    5   1       2       2       0    4        6
1    6   0       0       1       0    5        1
1    7   6       3       1       0    3        2
Below is myfile23.ped for pedigrees 2 and 3 (see Figs. 2 and 3).

Fam  ID  Father  Mother  Gender  Aff  Allele1  Allele2
2    1   0       0       1       0    2        4
2    2   0       0       2       0    1        2
2    3   1       2       2       0    1        2
2    4   0       0       1       0    0        0
2    5   4       3       1       0    2        2
2    6   4       3       2       0    1        3
3    1   0       0       1       0    0        0
3    2   0       0       2       0    0        0
3    3   1       2       2       0    0        0
3    4   1       2       2       0    0        0
3    5   0       0       1       0    4        7
3    6   1       2       2       0    8        9
3    7   0       0       1       0    0        0
3    8   1       2       2       0    0        0
3    9   1       2       1       0    0        0
3    10  0       0       2       0    0        0
3    11  1       2       2       0    6        9
3    12  1       2       2       0    4        6
3    13  1       2       1       0    0        0
3    14  5       6       2       0    7        9
3    15  5       6       2       0    4        9
3    16  7       8       2       0    4        8
3    17  7       8       1       0    5        7
3    18  9       10      2       0    4        9
3    19  0       0       1       0    0        0
3    20  19      18      2       0    6        9
3    31  0       0       2       0    6        4
3    32  17      31      2       0    4        7
Now we introduce a file called “myfile1.dat” to go with pedigree file one. The meaning of each parameter is provided in each row. This format was described by Terwilliger and Ott (16) in their well-known textbook “Handbook of Human Genetic Linkage.”
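For readers who want to manipulate such files programmatically, a pre-MAKEPED row with a single marker can be parsed as follows (a sketch based on the column layout described above; the field names are our own):

```python
def read_pre_makeped(lines):
    """Parse pre-MAKEPED pedigree rows: famID, indID, fatherID, motherID,
    gender (1 = male, 2 = female), affection (0/1/2), allele1, allele2.
    A parent ID of 0 marks a founder; allele 0 means untyped."""
    people = []
    for line in lines:
        fields = line.split()
        if len(fields) != 8:
            continue                     # skip blank or malformed lines
        fam, pid, fa, mo, sex, aff, a1, a2 = map(int, fields)
        people.append({"fam": fam, "id": pid, "father": fa, "mother": mo,
                       "gender": sex, "aff": aff, "genotype": (a1, a2)})
    return people

rows = ["1 1 0 0 1 0 0 0",
        "1 7 6 3 1 0 3 2"]     # two rows from myfile1.ped
people = read_pre_makeped(rows)
print(people[1]["genotype"])   # (3, 2)
```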
We now provide the details for each step involved in error checking using PedCheck:
Step 1: Run PedCheck to check for level 0 and level 1 errors in pedigrees 1, 2, and 3. The results will be saved in a file called pedcheck.err.
Command: pedcheck -m -p myfile1.ped -d myfile1.dat
Command: pedcheck -m -p myfile23.ped -d myfile23.dat
PedCheck identified three level 1 errors in pedigree 1. First, the children have more than 4 alleles. Second, mother 5’s allele is out of range. Third, child 7 and the mother are inconsistent. On the other hand, there were no level 0 or level 1 errors in pedigree 2 (see Fig. 2) or pedigree 3 (see Fig. 3).
Step 2: Run PedCheck to check for level 2 errors in pedigrees 2 and 3.
Command: pedcheck -2 -m -p myfile23.ped -d myfile23.dat
After this run, PedCheck finds one inconsistency in each pedigree.
Step 3: Run PedCheck to check for level 3 and level 4 errors.
Command: pedcheck -4 -m -p myfile23.ped -d myfile23.dat
PedCheck’s level 3 output indicated that there are four critical genotypes, in individuals 1, 2, 3, and 6 of pedigree 2, and three critical genotypes, in individuals 6, 12, and 17 of pedigree 3. Untyping any person listed will result in a consistent pedigree at the given locus. According to the results of level 4 (see Notes 1 and 2), one can conclude that individual 6 in pedigree 2 and individual 17 in pedigree 3 most likely have the erroneous genotypes. We reran PedCheck after their genotypes were removed, and no inconsistencies were found. The following shows the diagnostic output from PedCheck for the pedigree in Fig. 3:
3. Notes

1. Although PedCheck has been frequently used by linkage analysts, we are aware of one limitation: it cannot be applied to pedigrees with loops. Therefore, readers may wish to consider the algorithms implemented in SimWalk2 version 2.82 (17), which can handle pedigrees with multiple loops. Further, we would also like to highlight a few advantages of SimWalk2, which are not discussed in depth in this chapter. Similar to the idea of level 4 checking in PedCheck, SimWalk2 version 2.8 reports the overall probability of mistyping at each observed genotype. Sobel et al. (13) indicated that the construction of these so-called posterior mistyping probabilities is based on the marker map, a prior error model, and the error model that is used to define the penetrance function at the marker loci. Their algorithm can also accommodate alternative error models. The authors of SimWalk2 pointed out that false homozygosity is often the most common genotyping error. Therefore, SimWalk2 version 2.8 includes an empirical error model that incorporates this information and recognizes that misreading one allele occurs more frequently than misreading two. In addition, SimWalk2 version 2.8 imputes at each genotype the expected number of each allele appearing in that genotype while allowing for mistyping.
2. Other computer programs can also be used for error checking for various types of pedigree structure, including UNKNOWN of the LINKAGE package (4–6), MERLIN (18), RELPAIR (19), SIMWALK2 (17), and MENDEL version 4 (20). Owing to limited space, we do not discuss the statistical details behind each of these programs; the detailed algorithms are all freely available online. However, most of these programs are not general enough to catch all genotyping errors. For instance, UNKNOWN takes a very long time to run and does not always provide diagnostic errors in the output file.
Finally, it has been shown that most large genotype datasets contain a substantial number of errors and that an error rate of 1–2% is sufficient to distort map distances and lead to false conclusions of linkage (1), so analysts may wish to consider an analytical method that incorporates error models when analyzing large pedigree datasets (13).
Acknowledgments

The views expressed in this chapter do not necessarily represent the views of the NIMH, NIH, HHS, or the US Government.
References
1. Abecasis GR, Cherny SS, Cardon LR (2001) The impact of genotyping error on family-based analysis of quantitative traits. Eur J Hum Genet 9(2):130–134
2. Morton NE (1955) Sequential tests for the detection of linkage. Am J Hum Genet 7(3):277–318
3. Ott J (1974) Estimation of the recombination fraction in human pedigrees: efficient computation of the likelihood for human linkage studies. Am J Hum Genet 26(5):588–597
4. Lathrop GM, Lalouel JM (1984) Easy calculations of LOD scores and genetic risks on small computers. Am J Hum Genet 36(2):460–465
5. Lathrop GM, Lalouel JM, Julier C, Ott J (1984) Strategies for multilocus linkage analysis in humans. Proc Natl Acad Sci USA 81(11):3443–3446
6. Lathrop GM, Lalouel JM, White RL (1986) Construction of human linkage maps: likelihood calculations for multilocus linkage analysis. Genet Epidemiol 3(1):39–52
7. Elston RC, Stewart J (1971) A general model for the genetic analysis of pedigree data. Hum Hered 21(6):523–542
8. Ott J (1999) Analysis of human genetic linkage. Johns Hopkins University Press, Baltimore
9. Lange K, Goradia TM (1987) An algorithm for automatic genotype elimination. Am J Hum Genet 40(3):250–256
10. Lange K, Boehnke M (1983) Extensions to pedigree analysis. V. Optimal calculation of Mendelian likelihoods. Hum Hered 33(5):291–301
11. Stringham HM, Boehnke M (1996) Identifying marker typing incompatibilities in linkage analysis. Am J Hum Genet 59(4):946–950
12. Lange K, Weeks D, Boehnke M (1988) Programs for pedigree analysis: MENDEL, FISHER, and dGENE. Genet Epidemiol 5(6):471–472
13. Sobel E, Papp JC, Lange K (2002) Detection and integration of genotyping errors in statistical genetics. Am J Hum Genet 70(2):496–508
14. O’Connell JR, Weeks DE (1998) PedCheck: a program for identification of genotype incompatibilities in linkage analysis. Am J Hum Genet 63(1):259–266
15. Lange K, Weeks DE (1989) Efficient computation of LOD scores: genotype elimination, genotype redefinition, and hybrid maximum likelihood algorithms. Ann Hum Genet 53(Pt 1):67–83
16. Terwilliger JD, Ott J (1994) Handbook of human genetic linkage, 1st edn. The Johns Hopkins University Press, Baltimore
17. Sobel E, Lange K (1996) Descent graphs in pedigree analysis: applications to haplotyping, location scores, and marker-sharing statistics. Am J Hum Genet 58(6):1323–1337
18. Abecasis GR, Cherny SS, Cookson WO, Cardon LR (2002) Merlin: rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet 30(1):97–101
19. Broman KW, Weber JL (1998) Estimation of pairwise relationships in the presence of genotyping errors. Am J Hum Genet 63(5):1563–1564
20. Sobel E, Sengul H, Weeks DE (2001) Multipoint estimation of identity-by-descent probabilities at arbitrary positions among marker loci on general pedigrees. Hum Hered 52(3):121–131
Chapter 3

Detecting Pedigree Relationship Errors

Lei Sun

Abstract

Pedigree relationship errors often occur in family data collected for genetic studies, and unidentified errors can lead to either increased false positives or decreased power in both linkage and association analyses. Here we review several allele-sharing and likelihood-based statistics that were proposed to efficiently extract genealogical information from available genome-wide marker data, as well as the software package PREST that implements these methods. We provide the detailed analytical steps involved, using two application examples, and we discuss various practical issues, including the interpretation of results.

Key words: Pedigree error, Relationship estimation, IBD, IBS, IIS, Allele sharing, Likelihood, Hidden Markov model, EM algorithm, Software, PREST, PREST-plus, Multiple hypothesis testing, Simulation, Linkage, Association, Robustness
1. Introduction

Pedigree errors, or misspecified relationships among individuals, often occur in data collected for genetic studies using family data. The potential causes of pedigree errors are numerous, including undocumented non-paternity, non-maternity, adoption, sample duplication, sample swap, or mating between relatives. Unidentified pedigree relationship errors can lead to either increased false negatives or false positives in both linkage (1) and association analyses (2). For example, linkage analysis looks for regions of the genome that are shared by affected relatives in excess of what is expected under the null hypothesis of no linkage (see Chapter 17 for details on the concept of "identical by descent," IBD, and model-free linkage analysis based on IBD sharing). The null expected IBD sharing by a pair of relatives, however, is determined by the assumed pedigree structure. If two putative full sibs (with expected IBD sharing of 1 under the null hypothesis of no linkage, S_null,full-sib = 1) are in fact half-sibs (S_null,half-sib = 1/2), the
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_3, # Springer Science+Business Media, LLC 2012
linkage statistic calculated under the erroneous full-sib relationship, S_obs − S_null,full-sib, will be smaller than that calculated under the true half-sib relationship, S_obs − S_null,half-sib, thus resulting in a loss of power. This is true whenever the true relationship is more distant than the hypothesized relationship. Conversely, if two putative full sibs are in fact monozygotic (MZ) twins, then an increase in the type 1 error rate is expected. Thornton and McPeek (3), among others, showed that misspecified relationships have similar consequences in association analyses, and Chapter 4 investigates cryptic relatedness and its effects on case–control association studies.

Some pedigree relationship errors can be detected using additional genealogical information (e.g., surname, age, or gender); however, such information might not be available or sufficiently informative. Genome-wide marker data, collected for linkage or association scans, can provide accurate information on relatedness among individuals. Consequently, a number of statistical methods have been proposed and corresponding analysis software developed, including, for example, RELATIVE (4), RELCHECK (1), PEDCHECK (5), SIBERROR (6), PREST (7), GRR (8), and ECLIPSE (9). All these computational tools are listed and available at http://linkage.rockefeller.edu/soft/. Mendelian inconsistencies can be used to identify relationship errors (i.e., inconsistencies occurring at the genome-wide level within a specific parent–child trio), in addition to genotyping errors (i.e., inconsistencies occurring at a specific marker across multiple parent–child trios). However, most existing methods use the more powerful allele sharing- or likelihood-based statistics. Here we focus on the methods implemented in PREST (7, 10), because the proposed statistics are suitable for a range of relative pairs broader than sib pairs. For clarity we start with data set-up and notation.
We then review four test statistics, IBS, AIBS, EIBD, and MLRT, and a simple relationship estimation method proposed by McPeek and Sun (7). We leave the evaluation of these methods (see Note 1) and the discussion of alternative methods (see Notes 2 and 3) to Subheading 3. Although most of the information can be found in the original work and the PREST documentation (http://www.utstat.toronto.edu/sun/Software/Prest), we try to make the material in this chapter comprehensive and self-contained, particularly when it comes to practical details.

1.1. Notation
Let R denote the relationship type between a pair of individuals. Figure 1 shows pedigree structures for a set of 11 relationship types, R, common in human pedigrees, and Table 1 provides the corresponding IBD distributions and kinship coefficients. The collected family data typically contain information on family ID, individual ID, father's ID, mother's ID, gender, and disease status, as well as on the genotypes of either microsatellites or SNPs for each individual, G_m: genotype for marker m, m = 1, ..., M (Table 2).
Fig. 1. Eleven relationships common in human pedigrees: MZ-twin, parent–offspring, full-sib, half-sib + first-cousin, half-sib, grandparent–grandchild, avuncular, first-cousin, half-avuncular, half-first-cousin, and unrelated. The shaded pair of individuals in each pedigree has the specified relationship. See Table 1 for the corresponding IBD distributions and kinship coefficients.
The first five columns, C1–C5, determine the pedigree structure and the presumed or hypothesized relationship type, R0, between a pair of individuals (e.g., in family 1, individuals with IDs 1 and 2 have the full-sib relationship; in family 4, IDs 1 and 5 have the grandparent–grandchild relationship). The goal of detecting relationship errors can be formulated as a hypothesis-testing problem with H0: R = R0,
Table 1
IBD marginal distribution and kinship coefficient

Relationship type (notation)      Reltype in PREST   p0      p1      p2      Kinship coefficient, f
MZ-twin (MZ)                      11                 0       0       1       0.5
Parent–offspring (PO)             10                 0       1       0       0.25
Full-sib (FS)                     1                  0.25    0.5     0.25    0.25
Half-sib + first-cousin (HSFC)    9                  0.375   0.5     0.125   0.1875
Half-sib (HS)                     2                  0.5     0.5     0       0.125
Grandparent–grandchild (GPC)      3                  0.5     0.5     0       0.125
Avuncular (AV)                    4                  0.5     0.5     0       0.125
First-cousin (FC)                 5                  0.75    0.25    0       0.0625
Half-avuncular (HAV)              7                  0.75    0.25    0       0.0625
Half-first-cousin (HFC)           8                  0.875   0.125   0       0.03125
Unrelated (UN)                    6                  1       0       0       0

The IBD distribution p = (p0, p1, p2) and the kinship coefficient f are shown for each of the 11 relationship types shown in Fig. 1. Note that p1 + 2p2 = 4f. The PREST reltype numerical coding is also provided
where R0 is the relationship implied by the presumed pedigree structure. Evidence supporting or rejecting the null hypothesis can be sought in G_m, m = 1, ..., M, by assessing whether the observed genotype data are compatible with what is expected for a pair of individuals with the hypothesized relationship type. For example, if R0 is MZ-twin but the observed genotypes for the two individuals are not identical across the genome (allowing for a small genotyping error rate), then we have strong evidence for a relationship error. To formally assess the statistical evidence, we review four test statistics. In the following, we use G_m = ((i, j), (k, l)) to denote the observed genotypes for a pair of individuals at marker m, where (i, j) is the genotype for one individual and (k, l) for the other. We also introduce D_m = i, i = 0, 1, or 2, to denote the number of alleles shared IBD by the pair at marker m, which is typically unknown. The underlying {D_m}, m = 1, ..., M, along the genome is also called the IBD process.
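The quantities of Table 1 can be checked programmatically. The following is a small illustrative sketch (dictionary and function names are my own, not PREST's) that encodes the table and verifies the identity p1 + 2p2 = 4f noted there:

```python
# IBD distributions (p0, p1, p2) for the 11 relationships of Table 1.
IBD_DIST = {
    "MZ":   (0.0,   0.0,   1.0),
    "PO":   (0.0,   1.0,   0.0),
    "FS":   (0.25,  0.5,   0.25),
    "HSFC": (0.375, 0.5,   0.125),
    "HS":   (0.5,   0.5,   0.0),
    "GPC":  (0.5,   0.5,   0.0),
    "AV":   (0.5,   0.5,   0.0),
    "FC":   (0.75,  0.25,  0.0),
    "HAV":  (0.75,  0.25,  0.0),
    "HFC":  (0.875, 0.125, 0.0),
    "UN":   (1.0,   0.0,   0.0),
}

def kinship(p):
    """Kinship coefficient implied by an IBD distribution: f = (p1 + 2*p2)/4."""
    p0, p1, p2 = p
    return (p1 + 2.0 * p2) / 4.0

# Each row is a probability distribution.
for p in IBD_DIST.values():
    assert abs(sum(p) - 1.0) < 1e-12
```

Note that, as stated below Table 1, distinct relationships (e.g., HS, GPC, and AV) can share the same IBD distribution, so p and f alone do not always identify the relationship.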
Table 2
Typical information contained in collected family data

C1          C2             C3         C4         C5       C6        Genotype data
Family ID   Individual ID  Father ID  Mother ID  Gender   Disease   G1    G2    G3    G4    ...
                                                          status
1           1              7          6          1        2         5 5   4 4   3 4   2 5   ...
1           2              7          6          2        2         5 5   0 0   0 0   5 6   ...
1           3              7          6          1        2         5 5   0 0   3 3   2 5   ...
1           7              0          0          1        1         0 0   0 0   3 3   5 5   ...
1           6              0          0          2        0         5 7   0 0   0 0   2 6   ...
4           1              8          9          2        2         5 7   4 5   0 0   7 8   ...
4           2              8          9          1        1         5 7   0 0   6 6   5 7   ...
4           3              8          9          2        1         5 5   0 0   6 6   7 8   ...
4           8              0          0          1        0         5 5   0 0   3 6   7 7   ...
4           9              5          6          2        1         5 7   0 0   0 0   5 8   ...
4           5              0          0          1        0         0 0   0 0   0 0   0 0   ...
4           6              0          0          2        0         0 0   0 0   0 0   0 0   ...

Sex: 1 = male, 2 = female; disease status: 1 = unaffected, 2 = affected (or quantitative phenotype measures); 0 = unknown in all cases
1.2. Likelihood-Based Statistic: MLRT
The likelihood for a relationship type R is

L(R) = P_R(G1, G2, ..., Gm, ..., GM).

Due to the linkage correlation between markers, this probability calculation is nontrivial and is typically carried out via the hidden Markov model (HMM) (11) assumed for the underlying IBD process, {Dm} (see Chapters 14–17 for more details). For a pair of individuals, the likelihood calculation boils down to three key components:

P_R(Dm = i), the IBD marginal or stationary distribution;
P_R(Dm+1 = j | Dm = i), the IBD transition probability;
P(Gm | Dm = i), the genotype conditional probability.

Table 1 lists the IBD marginal distribution, p = (p0, p1, p2) = (P_R(Dm = 0), P_R(Dm = 1), P_R(Dm = 2)), for each of the 11 relationships shown in Fig. 1. Note that the IBD marginal distribution is independent of the marker m but dependent on the relationship type R. We did not include the subscript R in p = (p0, p1, p2) for notational simplicity. Table 1 also includes the kinship coefficient, f, for each
Table 3
IBD transition probability for a full-sib pair, P_R=full-sib(Dm+1 = j | Dm = i)

                          Next IBD status, Dm+1
Current IBD status, Dm    0         1           2
0                         φ²        2φψ         ψ²
1                         φψ        φ² + ψ²     φψ
2                         ψ²        2φψ         φ²

φ = θ² + (1 − θ)² and ψ = 2θ(1 − θ), where θ is the recombination fraction between the two markers. Under a no-interference model, θ = (1 − e^(−2t))/2, where t is the genetic distance (in units of Morgans) between the two markers
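Table 3 can be reproduced in a few lines. A minimal sketch (function names are my own), which also checks that the full-sib IBD marginal distribution (0.25, 0.5, 0.25) from Table 1 is the stationary distribution of these transitions:

```python
import math

def fullsib_transition(theta):
    """3x3 IBD transition matrix for a full-sib pair (Table 3).

    theta is the recombination fraction between the two markers,
    phi = theta^2 + (1 - theta)^2, psi = 2 * theta * (1 - theta).
    """
    phi = theta ** 2 + (1.0 - theta) ** 2
    psi = 2.0 * theta * (1.0 - theta)
    return [
        [phi ** 2,   2 * phi * psi,      psi ** 2],   # from Dm = 0
        [phi * psi,  phi ** 2 + psi ** 2, phi * psi], # from Dm = 1
        [psi ** 2,   2 * phi * psi,      phi ** 2],   # from Dm = 2
    ]

def theta_from_distance(t):
    """No-interference (Haldane) map function: theta = (1 - exp(-2t))/2, t in Morgans."""
    return (1.0 - math.exp(-2.0 * t)) / 2.0

# Example: two markers 10 cM (0.1 Morgans) apart.
T = fullsib_transition(theta_from_distance(0.1))
```

At θ = 0.5 (unlinked markers) every row reduces to the marginal distribution (0.25, 0.5, 0.25), i.e., consecutive IBD states become independent, as expected.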
Table 4
Genotype conditional probability, conditional on the IBD status, P(Gm | Dm = i)

                           IBD status, Dm
Unordered genotype, Gm     0            1               2
((i, i), (i, i))           fi⁴          fi³             fi²
((i, i), (i, j))           4fi³fj       2fi²fj          0
((i, i), (j, j))           2fi²fj²      0               0
((i, i), (j, k))           4fi²fjfk     0               0
((i, j), (i, j))           4fi²fj²      fifj(fi + fj)   2fifj
((i, j), (i, k))           8fi²fjfk     2fifjfk         0
((i, j), (k, l))           8fifjfkfl    0               0

Note that genotypes in this table are unordered within an individual and between the two individuals, e.g., ((i, j), (k, l)) = ((j, i), (l, k)) = ((k, l), (i, j)). fi, fj, fk, and fl are the population frequencies of alleles i, j, k, and l, and i ≠ j ≠ k ≠ l. For a diallelic marker such as a SNP, the table could be simplified further
relationship. Note that p = (p0, p1, p2) and f are simplified summaries of R and do not always uniquely determine a relationship type. Table 3 shows the IBD transition probability for a full-sib pair, which depends on R and the map distance between the two markers (see ref. 12 for other relationships). Table 4 provides the
conditional probability of the observed genotype, conditional on the underlying IBD status at that marker. Note that this conditional probability does not depend on R (13, 14).

To gain computational efficiency, the actual calculation utilizes a recursive formula instead of considering the unknown IBD status at all markers simultaneously (see refs. 7 and 11 for details of the algorithm). As a result, the computational time grows only linearly with the number of markers, making the likelihood calculation feasible for genome-wide scans with thousands or more markers and for analyzing thousands or more different relative pairs. The Markov property of the IBD process requires the assumptions of linkage equilibrium and no interference. In addition, McPeek and Sun (7) emphasized that it is valid only for certain relative pairs (e.g., full-sib or half-sib) and proposed an alternative approach for general relationships (e.g., avuncular or first-cousin). However, they also showed that the likelihood calculated under the Markov assumption is a good approximation of the true likelihood in most cases.

To perform the likelihood ratio test, the likelihood is first calculated for the hypothesized relationship (e.g., full-sib), L(R0), then for an alternative (e.g., half-sib), L(R1), and the difference between the log likelihoods serves as the test statistic. In practice, the possible alternative is often not unique. For example, a putative full-sib pair could be MZ-twin, half-sib due to non-paternity, or unrelated due to adoption. In that case, we can consider a composite alternative hypothesis, calculate the likelihood for each plausible alternative, L(R1), R1 ∈ A, and use the maximum, L̂(A).
Formally, to test

H0: R = R0 vs. H1: R = R1, R1 ∈ A \ R0,

the maximum likelihood ratio test (MLRT) statistic is

MLRT = log(L̂(A)) − log(L(R0)).

The distribution of 2 × MLRT, however, cannot be approximated by the standard χ² distribution, due to the discreteness of the parameter space (i.e., each parameter value here is a pedigree relationship type). Therefore, statistical significance must be assessed empirically via simulations. Briefly, for each replicate k, genotype data for the pair are first simulated assuming R = R0, and the corresponding MLRT(k) is then calculated. The number of replicates with MLRT(k) more extreme than the observed MLRT, divided by the total number of replicates, is the empirical P-value. Although the MLRT is powerful, it is computationally expensive. In the following, we review three allele sharing-based statistics that are simpler to calculate and whose statistical significance is easier to assess via a normal approximation.
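The linear-in-M likelihood recursion and the simulation-based P-value can be sketched as follows. This is an illustrative forward-algorithm sketch, not PREST's implementation (see refs. 7 and 11 for the actual algorithm); the emission probabilities emit[m][i] = P(Gm | Dm = i) (Table 4) and the per-interval transition matrices (Table 3) are assumed to be precomputed:

```python
import math

def hmm_log_likelihood(pi, trans, emit):
    """log L(R) = log P_R(G_1, ..., G_M) via the forward recursion.

    pi:    IBD marginal distribution (p0, p1, p2) for R (Table 1).
    trans: trans[m][i][j] = P_R(D at marker m+2 = j | D at marker m+1 = i)
           for marker interval m = 0..M-2 (Table 3 for full sibs).
    emit:  emit[m][i] = P(G_m | D_m = i) (Table 4).
    Cost is linear in the number of markers M.
    """
    alpha = [pi[i] * emit[0][i] for i in range(3)]
    loglik = 0.0
    for m in range(1, len(emit)):
        s = sum(alpha)                 # rescale to avoid numerical underflow
        loglik += math.log(s)
        alpha = [a / s for a in alpha]
        alpha = [emit[m][j] * sum(alpha[i] * trans[m - 1][i][j] for i in range(3))
                 for j in range(3)]
    return loglik + math.log(sum(alpha))

def empirical_pvalue(observed, simulated):
    """Fraction of null replicates whose MLRT is at least as extreme as observed."""
    return sum(1 for s in simulated if s >= observed) / len(simulated)
```

As a sanity check, when all transition rows equal the marginal distribution (unlinked markers), the likelihood factorizes over markers.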
1.3. Allele-Sharing-Based Statistics: IBS, AIBS, and EIBD

The idea behind these three statistics stems from nonparametric linkage analysis (see Chapter 17 for details), though with some key differences. For example, linkage analysis with the affected-sib-pair design investigates localized allele sharing at a given marker, averaged over all affected sib-pairs, and over-sharing indicates linkage between the marker and the susceptibility locus. In contrast, pedigree relationship analysis investigates global allele sharing for a given relative pair, averaged over all markers, and over-sharing (or under-sharing) indicates that the true relationship has a higher (or lower) kinship coefficient than the hypothesized one. Also note that, unlike linkage or association analyses, phenotype information is not needed for pedigree error detection.

To detect pedigree relationship errors, the sharing statistics proposed by McPeek and Sun (7) have the general form

S = (1/M) Σ_{m=1,...,M} X_m,

where X_m is the marker-specific allele-sharing statistic for a pair of individuals at marker m, and S is the average of X_m across all available markers. In the genome-wide setting where M is typically large (e.g., ~500 or more for linkage scans and ~100 K or more for association scans), S is approximately normally distributed,

S ~ N(μ_R, σ²_R),

where μ_R is the expected value and σ²_R is the variance of S for a pair of individuals with relationship type R. To test the null hypothesis, H0: R = R0, the statistical significance of S is assessed using the null distribution, S ~ N(μ_R0, σ²_R0).

Specifically, the IBS statistic counts the number of alleles shared "identical in state," IIS (also known as "identical by state," IBS; see Chapter 1), i.e.,

X_m = the number of alleles shared IIS.

The AIBS statistic calculates the probability that two alleles are IBD conditional on being shared IIS, i.e.,

X_m = 0, if 0 alleles are IIS, G_m = ((i, j), (k, l));
X_m = f_R0 / (f_R0 + (1 − f_R0) f_i), if 1 allele is IIS, G_m = ((i, j), (i, l));
X_m = f_R0 / (f_R0 + (1 − f_R0) f_i) + f_R0 / (f_R0 + (1 − f_R0) f_j), if 2 alleles are IIS, G_m = ((i, j), (i, j)),

where f_R0 is the kinship coefficient for the null relationship R0, and f_i and f_j are the allele frequencies. The EIBD statistic is the expected number of alleles shared IBD conditional on the observed genotype, i.e.,

X_m = E_R0[D_m | G_m] = (P(G_m | D_m = 1) p1 + 2 P(G_m | D_m = 2) p2) / Σ_{i=0,1,2} P(G_m | D_m = i) p_i,

where p = (p0, p1, p2) is the IBD marginal distribution for R0 as in Table 1, and P(G_m | D_m = i) is the genotype conditional probability as in Table 4.

These statistics are simple to calculate and require only the IBD marginal distribution, the genotype conditional distribution, and the population allele frequencies. However, the calculations of μ_R0 and σ²_R0 for assessing the statistical significance are more complicated. Briefly, μ_R0 is the average of E_R0[X_m], whose calculation requires information similar to that for X_m. However, σ²_R0 involves E_R0[X_m²] and E_R0[X_m1 X_m2], whose calculation also needs the IBD transition probability between marker m1 and marker m2, i.e., the second order of the IBD process. We refer readers to (7) and (12) for technical details of the calculation. Interestingly, μ_R0 for the EIBD statistic depends only on the IBD marginal distribution for R0 and has a close relationship with the kinship coefficient:

E_R0[EIBD] = (1/M) Σ_m E_R0[X_m] = (1/M) Σ_m E_R0[E_R0[D_m | G_m]] = (1/M) Σ_m E_R0[D_m] = p1 + 2p2 = 4 f_R0.

1.4. Relationship Estimation
When strong evidence against the hypothesized null relationship, H0: R = R0, has been found, it is very helpful to know the alternative relationship(s) that are compatible with the genotype data observed for the pair of individuals. McPeek and Sun (7) proposed a simple strategy that estimates the most likely IBD distribution, p = (p0, p1, p2), by obtaining p̂ = (p̂0, p̂1, p̂2) that maximizes the following quantity:

L(p) = Σ_{m=1,...,M} log(P(G_m; p)) = Σ_{m=1,...,M} log( Σ_{i=0,1,2} P(G_m | D_m = i) p_i ).

The maximization can be efficiently achieved by an application of the EM algorithm. Specifically,

p_i^(k+1) = p_i^(k) (1/M) Σ_{m=1,...,M} [ P(G_m | D_m = i) / Σ_{j=0,1,2} P(G_m | D_m = j) p_j^(k) ], with p_0^(k) + p_1^(k) + p_2^(k) = 1,

where p^(k) = (p_0^(k), p_1^(k), p_2^(k)) is the current estimate and p^(k+1) = (p_0^(k+1), p_1^(k+1), p_2^(k+1)) is the next updated estimate. When the markers are unlinked, L(p) is the correct log likelihood for the parameter p, and p̂ is the MLE (13, 14). When markers are linked, this procedure
still provides good estimates of p (7), and Chapter 4 shows that the conclusion holds even for high-throughput SNP data collected for genome-wide association studies (GWAS) (15). Depending on the values of p̂ = (p̂0, p̂1, p̂2), one can propose plausible relationship(s) and formally test whether a proposed relationship is compatible with the observed genotype data (10). This procedure can be practically important because it makes it possible to retain individuals whose relationships were in error, by reconstructing alternative pedigrees.

Although the IBS statistic is less informative than IBD-based statistics, IBS is more robust to misspecified allele frequencies. Therefore, it is also useful to obtain the proportion of markers that have 0, 1, or 2 alleles shared IIS. Such an IIS distribution can quickly pinpoint, for example, MZ twins, for which we expect the proportion of markers with 2 alleles IIS to be 1, or close to 1 when allowing for some genotyping errors.
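The EM update above fits in a few lines. This is an illustrative re-implementation, not PREST's code, assuming the genotype conditional probabilities emit[m][i] = P(G_m | D_m = i) have been precomputed from Table 4 for each marker:

```python
def estimate_ibd_distribution(emit, iters=500):
    """EM for p-hat = (p0, p1, p2) maximizing
    L(p) = sum_m log(sum_i P(G_m | D_m = i) p_i)."""
    p = [1.0 / 3.0] * 3          # uninformative starting value
    M = len(emit)
    for _ in range(iters):
        acc = [0.0, 0.0, 0.0]
        for e in emit:
            denom = sum(p[j] * e[j] for j in range(3))
            for i in range(3):
                acc[i] += p[i] * e[i] / denom   # posterior P(D_m = i | G_m; p)
        p = [a / M for a in acc]                # M-step; components sum to 1
    return p
```

Each iteration averages the per-marker posterior IBD probabilities, so the estimate automatically stays on the probability simplex.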
2. Methods

The test statistics and estimation method reviewed in Subheadings 1.2–1.4 have been implemented in a program called PREST, which stands for Pedigree RElationship Statistical Test (10). To meet the computational challenges introduced by high-throughput genotype data, PREST-plus was later developed to improve both the computational efficiency and the interface of PREST (15). We will use PREST and PREST-plus interchangeably unless specified otherwise. In the following, we provide details on the essential steps and on result interpretation when using PREST to detect pedigree relationship errors, as well as applications to two datasets with distinct features.

2.1. Essential Steps in Detecting Pedigree Relationship Errors via PREST
Step 0: Download a PREST version suitable for your computer platform at http://www.utstat.toronto.edu/sun/Software/Prest/.

Step 1: Prepare two input files, both identical in format to those used by PLINK (16):

- "prest.ped": pedigree and genotype information. Each row is an individual, and the total number of columns is 6 + 2M, where the first six columns are C1–C6 as shown in Table 2 and M is the total number of markers genotyped. C6 is not used but is a required column, to be consistent with linkage or association input files. Markers can be microsatellites or SNPs or both, with the corresponding alleles coded as integers (0 for missing data). (The alleles of SNPs should be coded as 1/2 and not with A/C/G/T characters. To accomplish this, the PLINK option --recode12 can be used to recode the data.)

- "prest.map": map and allele frequency information. Each row is a marker, and the columns provide, in this order: chromosome, marker name, map position in cM (optional), bp position (not used), number of alleles, followed by the corresponding frequency of each allele (optional). If a cM map is not provided, relationship estimation is performed but not relationship testing. If allele frequencies are not provided, they are estimated from the genotype data. (A note to PREST users: convert-geno pedigree chromfiles will convert PREST-ready input files to PREST-plus-ready input files. The script convert-geno is included in the PREST-plus package.)
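For concreteness, here is a hypothetical fragment of the two input files, mirroring the family-1 rows of Table 2 (the marker names, positions, and frequencies are made up for illustration). First, prest.ped, with columns C1–C6 followed by two alleles per marker (0 = missing):

```
1 1 7 6 1 2  5 5  4 4  3 4  2 5
1 2 7 6 2 2  5 5  0 0  0 0  5 6
1 3 7 6 1 2  5 5  0 0  3 3  2 5
```

And prest.map, with chromosome, marker name, cM position, bp position, number of alleles, and allele frequencies:

```
1 M1  0.0  1500000  7  0.05 0.10 0.15 0.20 0.30 0.10 0.10
1 M2 12.5  9800000  5  0.20 0.20 0.20 0.20 0.20
```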
Step 2: Run the program with different command options, depending on user-specific needs. For example:

- Option 0 (least computing time): relationship estimation only

prest --file prest.ped --map prest.map --wped --out prest.wped.option0

- Option 1 (more computing time): both relationship estimation and testing, but using the fast EIBD, AIBS, and IBS statistics only

prest --file prest.ped --map prest.map --wped --cm --out prest.wped.option1

- Option 2 (substantially more computing time): testing using the more powerful MLRT statistic as well

prest --file prest.ped --map prest.map --wped --cm --mlrt --out prest.wped.option2
Step 3: Interpret the results. PREST produces several output files:

- "prest_mendel": family trios with Mendelian inconsistencies, if there are any.

- "prest_summary": summary statistics such as the total numbers of families, individuals, and markers, and the number of relative pairs tested.

- "prest_results": the main results file. Each row provides the various relationship testing and estimation results for a pair of individuals, as detailed in Table 5. An R program is provided to help users navigate and digest the results. For details, see the two application examples in Subheadings 2.4 and 2.5.

2.2. Essential Steps in Interpreting the Results
Step 1: Checking the estimated IBD distribution (e.g., Fig. 2) and the IIS distribution (e.g., Fig. 3), stratified by relationship type, can quickly pinpoint obvious outliers.

Step 2: The histogram of P-values (e.g., the empirical P-values of the MLRT statistic in Fig. 4) can also provide a sense of the pedigree relationship error rate. Note that if all hypotheses are truly null
Table 5
PREST results interpretation

Columns present under option 0 (basic output, point estimates only):
- FID1, IID1: family ID and individual ID of individual 1.
- FID2, IID2: family ID and individual ID of individual 2.
- reltype: the hypothesized null relationship type for the pair of individuals, determined by the given family data (see Table 1 for relationship types and their reltype codings).
- commark: the number of markers with genotype data for both individuals.
- EIBD, AIBS, IBS: the EIBD, AIBS, and IBS statistics as reviewed in Subheading 1.3.
- p.IBS0, p.IBS1, p.IBS2: the proportion of markers with 0, 1, or 2 alleles IIS.
- p.IBD0, p.IBD1, p.IBD2: the estimated probability of 0, 1, or 2 alleles IBD as reviewed in Subheading 1.4, i.e., the estimated IBD distribution.

Additional columns under option 1 (statistical significance provided for EIBD, AIBS, IBS):
- p.normal.EIBD, p.normal.AIBS, p.normal.IBS: the P-value of EIBD, AIBS, or IBS based on the normal approximation (the P-value of MLRT requires simulations, as discussed in Subheading 1.2).

Additional columns under option 2 (statistical significance provided for MLRT as well, but simulations needed):
- p.emprical.EIBD, p.emprical.AIBS, p.emprical.IBS, p.emprical.MLRT: the P-value of EIBD, AIBS, IBS, or MLRT based on simulations.
Fig. 2. Application 1: Relationship estimation based on IBD. Each plot shows the estimated IBD distribution, p̂1 vs. p̂0, for relative pairs having the hypothesized relationship R0 (one panel per relationship: FS, 1,207 pairs; HS, 66 pairs; GPC, 165 pairs; AV, 900 pairs; FC, 309 pairs; UN, 1,293 pairs; HAV, 62 pairs; HFC, 40 pairs; HSFC, 0 pairs; PO, 993 pairs; MZ, 0 pairs). The cross marks the expected IBD distribution for R0. The bottom right plot shows the results for all 5,035 relative pairs analyzed. The full names of the 11 relationships are given in Table 1.
Fig. 3. Application 1: Relationship estimation based on IIS (also known as IBS). Each plot shows the proportion of markers with 0 IIS vs. 1 IIS for relative pairs having the hypothesized relationship R0. The expected IIS distribution for R0 depends on the allele frequencies but is generally at the center of the cluster, assuming that the majority of the pairs have no relationship error and have the same markers genotyped. The bottom right plot shows the results for all 5,035 relative pairs analyzed. The full names of the 11 relationships are given in Table 1.
Fig. 4. Application 1: Relationship testing. Histogram of MLRT P-values.
(i.e., there is no relationship error), the P-values follow the Uniform(0, 1) distribution, assuming the tests are independent of each other. Although pairs from the same pedigree have some dependency, this generally does not noticeably affect the Uniform assumption. Therefore, an abundance of small P-values indicates that a proportion of the relative pairs might have relationship errors.

Step 3: To formally assess the statistical significance of the P-value for a relative pair, one must adjust for multiple hypothesis testing because of the thousands or more pairs investigated. This can be achieved by using the conservative Bonferroni correction, i.e., rejecting the null hypothesis if the P-value is less than α/(total number of pairs), where α is the overall type 1 error rate for all the pairs tested. Alternatively, one can use the less stringent false discovery rate (FDR) control (17) to balance the trade-off between the type 1 error rate and power (18), and adopt a stratified FDR approach to further improve power (19).

Step 4: A number of other factors influence the conclusion of pedigree relationship error, including the number of markers used and the pattern of results among multiple pairs from the same pedigree. For pairs with strong evidence of relationship error, it is also useful to use the companion program, ALTERTEST (10), to identify potential alternative relationship(s) that are compatible with the observed genotype data. In some cases, an alternative pedigree structure can be proposed. See the PREST documentation for additional details.
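The two adjustments of Step 3 can be sketched as follows (illustrative helpers, not part of PREST; the P-values would come from the prest_results file):

```python
def bonferroni_reject(pvals, alpha=0.05):
    """Bonferroni: reject pairs with P < alpha / (total number of pairs tested)."""
    cutoff = alpha / len(pvals)
    return [p < cutoff for p in pvals]

def bh_fdr_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at level q:
    find the largest k with P_(k) <= q*k/m and reject the k smallest P-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject
```

As the text notes, Bonferroni controls the family-wise type 1 error conservatively, whereas the FDR procedure typically rejects more hypotheses at the same nominal level.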
2.3. Pedigree Relationship Errors and Cryptic Relationships
Cryptic relationships can also occur in family data, including relatedness between founders (which often causes inbreeding) and relatedness across pedigrees. The methods discussed in this chapter were designed for general relationships, including the unrelated relationship type, and are therefore suitable for identifying cryptic relatedness as well. Essentially, the null hypothesis, H0: R = R0, hypothesizes the relationship as unrelated, and the corresponding MLRT and IBS statistics can be calculated. Note that EIBD and AIBS are not applicable to unrelated pairs (7). In addition, relationship estimation can be performed in exactly the same manner, by obtaining p̂ = (p̂0, p̂1, p̂2) that maximizes the probability of the observed genotype data.

PREST can detect cryptic relatedness between families by using the --aped option combined with the other options listed above, e.g.,

prest --file prest.ped --map prest.map --aped --out prest.aped.option0
prest --file prest.ped --map prest.map --aped --cm --out prest.aped.option1
prest --file prest.ped --map prest.map --aped --cm --mlrt --out prest.aped.option2

Note that the computing time increases considerably when using the --aped option, because the number of pairs to be analyzed is on the order of n²/2, where n is the total number of individuals in the dataset.

In the context of GWAS, PLINK (16) implemented a method-of-moments estimator of p = (p0, p1, p2), tailored for identifying cryptic relationships among putatively unrelated case–control samples. We refer readers to Chapter 4 for issues that are specific to detecting cryptic relationships, particularly when high-throughput SNP data are available.
2.4. Application 1: Small to Moderate-Sized Pedigrees with Genome-Wide Low-Density Microsatellite Data
The COGA data (20) were provided by the Collaborative Study on the Genetics of Alcoholism (U10AA008401) and distributed as part of the biennial Genetic Analysis Workshops (GAW11). (We thank the COGA investigators and the GAW advisory committee for permission to use the data.) The data consist of 105 families (mostly three- or four-generation pedigrees) with 1,214 individuals in total, among which 992 were genotyped at 285 autosomal microsatellite markers. Missing genotype rates vary across individuals, but most of the 992 have genotype data on >200 markers. Potential pedigree errors in the data were previously analyzed and reported (7, 12). Here we provide some of the key results and practical advice, after running PREST with the command line:

prest --file coga.ped --map coga.map --wped --cm --mlrt --out coga.wped.option2
3 Detecting Pedigree Relationship Errors
In total, 5,381 relative pairs (within-pedigree pairs, for which the two individuals of a pair must have the same family ID; see Chapter 4 for results of the across-pedigree analysis) were analyzed, but some pairs have as few as 25 markers genotyped in common. Deleting pairs with insufficient genotype data (<200 markers) resulted in 5,035 pairs, among which 1,207 are full-sib, 66 half-sib, 165 grandparent–grandchild, 900 avuncular, 309 first-cousin, 1,293 unrelated, 62 half-avuncular, 40 half-first-cousin, 0 half-sib + first-cousin, 993 parent–offspring, and 0 MZ-twin pairs, as determined by the provided family data. Figure 2 shows the estimated IBD distribution stratified by the 11 relationships. There are some clear outliers indicating misspecified relationships. For example, a number of full-sib pairs have p̂ = (p̂0, p̂1, p̂2) close to (0, 1, 0), the IBD distribution expected for parent–offspring; one parent–offspring pair has p̂ close to (0, 0, 1), expected for MZ-twin; and a couple of first-cousin pairs have p̂ close to (0.375, 0.5, 0.125), expected for half-sib + first-cousin. Figure 3 provides the distribution of the proportion of markers with 0, 1, or 2 alleles IIS. The IIS-based statistic clearly provides less information, but it unambiguously identifies the MZ-twin pair that is misspecified as a parent–offspring pair. The histogram of the empirical P-values of the MLRT statistic (Fig. 4) also indicates that a nonnegligible proportion of the 5,035 pairs analyzed has relationship error. There are 68 pairs with MLRT P-values less than 0.05/5035 ≈ 10⁻⁵. A detailed investigation of these pairs is provided in ref. 12.

2.5. Application 2: Moderate-Sized to Extended Pedigrees with Genome-Wide High-Throughput SNP Data
The VTE data (21) were collected for an oligogene study of venous thromboembolism. (We thank Dr. France Gagnon for permission to use the data.) The data consist of five multigenerational French-Canadian pedigrees with a total of 369 individuals. The numbers of individuals per family are 18, 39, 42, 63, and 207. Among the 369 individuals, 255 (10, 25, 27, 40, and 153, respectively) were genotyped using the Illumina 660 chip. Performing hypothesis testing using all available SNP data in this case requires modifying the variance and likelihood calculations to allow for linkage disequilibrium (LD) between markers. However, the uncertainty and variability in haplotype inference could overshadow the potential gain from modeling LD. In practice, the estimates of the IBD distribution obtained from high-throughput SNP data can provide sufficient information for identifying relationship errors (15). Therefore, we analyze these data using PREST option 0 only:

prest –file vte.ped –map vte.map –wped –out vte.wped.option0

In total, 5,550 relative pairs (within-pedigree pairs; see Chapter 4 for the across-pedigree analysis) were analyzed, among which 413 are full-sib, 16 half-sib, 119 grandparent–grandchild, 1,468 avuncular, 2,192 first-cousin, 1,110 unrelated, 6 half-avuncular, 0 half-first-cousin, 0 half-sib + first-cousin, 226 parent–offspring, and 0 MZ-twin pairs, as determined by the provided family data.
Figure 5 shows the estimated IBD distribution and Fig. 6 the IIS distribution, stratified by the 11 relationships. (These results were obtained using 300 K SNPs randomly selected across the genome; using all 610 K SNPs did not improve the estimation noticeably; see Chapter 4 for more discussion.) Some putatively unrelated pairs appear to be first-degree relatives. To further investigate these pairs, one can prune the available SNPs to a smaller set of markers that are in linkage equilibrium, so that formal hypothesis testing can be performed.
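Given per-pair estimates p̂ = (p̂0, p̂1, p̂2), a quick informal screen for such misspecifications is to label each pair by the nearest expected IBD distribution. A minimal Python sketch (expected values as quoted in the text; this distance rule is only a screen, not the formal MLRT):

```python
import math

# Expected IBD distributions (p0, p1, p2) for a few relationships,
# as quoted in the text (see Table 1 of the chapter for the full list).
EXPECTED = {
    "MZ": (0.0, 0.0, 1.0),
    "PO": (0.0, 1.0, 0.0),
    "FS": (0.25, 0.5, 0.25),
    "HS+FC": (0.375, 0.5, 0.125),
    "UN": (1.0, 0.0, 0.0),
}

def nearest_relationship(p_hat):
    """Label a pair by the expected IBD distribution closest
    (in Euclidean distance) to its estimate."""
    return min(EXPECTED, key=lambda r: math.dist(EXPECTED[r], p_hat))

# A pair recorded as full-sib but with estimates near (0, 1, 0)
# is flagged as a likely parent-offspring pair:
print(nearest_relationship((0.02, 0.95, 0.03)))  # -> PO
```

Pairs whose nearest distribution disagrees with the recorded relationship are the natural candidates for follow-up hypothesis testing.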
3. Notes

1. There are pros and cons associated with each of the four statistics, MLRT, EIBD, AIBS, and IBS (Table 6). Generally speaking, there is always a trade-off between power and robustness. For example, the MLRT statistic is the most powerful of the four, but it is also the most sensitive to violations of assumptions, for example, the assumption that the map and allele frequencies were accurately estimated (7, 12). The choice of method also depends on other factors such as computing facilities, as illustrated in the application examples above (see also Chapter 4 for discussion of genome-wide linkage vs. association scan data).

2. The statistics discussed here were designed for analyzing pairwise relationships without inbreeding, using autosomal markers only. Epstein et al. (22) proposed a method that incorporates sex-chromosome markers to further improve power; Sieberts et al. (9) and McPeek (23) discussed ways to investigate >2 individuals simultaneously; Sun et al. (24) extended the EIBD statistic to the analysis of large inbred pedigrees such as the Hutterites. (Note that the animal genetics literature offers extensive references on inbred pedigrees.)

3. Pedigree errors and genotyping errors are often confounded. The common practice is to first detect genotyping errors by evaluating marker-specific statistics such as the Mendelian inconsistency rate, averaged across all pedigrees or samples. After removing markers with a high genotyping error rate, the relative-pair-specific statistics discussed above are then used to identify pedigree errors. Broman and Weber (25) proposed a method that incorporates a pre-specified genotyping error rate in the likelihood calculation, which is implemented in the MLRT statistic (10). However, jointly modeling the two types of errors in a single framework remains an open question.
In addition, little work has been done to incorporate pedigree errors or relationship uncertainties into linkage or association analyses, with only a few exceptions (3, 26).
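The marker-specific Mendelian inconsistency rate mentioned in Note 3 can be computed directly from trio genotypes. A minimal sketch for a single biallelic marker (the 0/1/2 minor-allele-count coding and the trio layout are illustrative assumptions, not a PREST format):

```python
def transmittable(g):
    """Alleles (0 or 1) a parent with genotype g (0/1/2 copies of
    the '1' allele) can pass on to a child."""
    return {0, 1} if g == 1 else {g // 2}

def mendel_error_rate(trios):
    """Fraction of (father, mother, child) genotype trios at one marker
    that are Mendelian-inconsistent."""
    errors = sum(
        gc not in {a + b for a in transmittable(gf) for b in transmittable(gm)}
        for gf, gm, gc in trios
    )
    return errors / len(trios)

# father=0, mother=0, child=1 is impossible; the other two trios are fine:
print(mendel_error_rate([(0, 0, 1), (1, 1, 2), (2, 0, 1)]))  # -> 0.3333333333333333
```

In practice this rate would be computed per marker across all pedigrees, and markers with high rates removed before applying the relative-pair statistics.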
[Figure 5: a grid of eleven scatter plots of p̂1 = Prob(IBD=1) vs. p̂0 = Prob(IBD=0), one per hypothesized relationship (FS: 413 pairs; HS: 16; GPC: 119; AV: 1,468; FC: 2,192; UN: 1,110; HAV: 6; HFC: 0; HSFC: 0; PO: 226; MZ: 0), plus one plot for all 5,550 pairs.]

Fig. 5. Application 2: Relationship estimation based on IBD. Each plot shows the estimated IBD distribution, p̂1 vs. p̂0, for relative pairs having the hypothesized relationship, R0. The cross marks the expected IBD distribution for R0. The bottom right plot shows the results for all 5,550 relative pairs analyzed. The full names of the 11 relationships are given in Table 1.
[Figure 6: a grid of eleven scatter plots of the proportion of markers with IBS=1 vs. the proportion with IBS=0, one per hypothesized relationship, plus one plot for all 5,550 pairs.]

Fig. 6. Application 2: Relationship estimation based on IIS (also known as IBS). Each plot shows the proportion of markers with 0 IIS vs. 1 IIS for relative pairs having the hypothesized relationship, R0. The expected IIS distribution for R0 depends on the allele frequencies but is generally at the center of the cluster, assuming that the majority of the pairs do not have relationship errors and have the same markers genotyped. The bottom right plot shows the results for all 5,550 relative pairs analyzed. The full names of the 11 relationships are given in Table 1.
Table 6
Methods comparison in terms of power, robustness, and computational efficiency

Statistic   Power   Robustness   Computation   Others
MLRT        1       3            3
EIBD        2       2            2             na for unrelated and parent–offspring
AIBS        2       2            2             na for unrelated
IBS         3       1            1

Ranks are from 1 (best) to 3 (worst).
References

1. Boehnke M, Cox NJ (1997) Accurate inference of relationships in sib-pair linkage studies. Am J Hum Genet 61(2):423–429
2. Voight BF, Pritchard JK (2005) Confounding from cryptic relatedness in case-control association studies. PLoS Genet 1(3):e32
3. Thornton T, McPeek MS (2010) ROADTRIPS: case-control association testing with partially or completely unknown population and pedigree structure. Am J Hum Genet 86(2):172–184
4. Goring HH, Ott J (1997) Relationship estimation in affected sib pair analysis of late-onset diseases. Eur J Hum Genet 5(2):69–77
5. O'Connell JR, Weeks DE (1998) PedCheck: a program for identification of genotype incompatibilities in linkage analysis. Am J Hum Genet 63(1):259–266
6. Ehm M, Wagner M (1998) A test statistic to detect errors in sib-pair relationships. Am J Hum Genet 62(1):181–188
7. McPeek MS, Sun L (2000) Statistical tests for detection of misspecified relationships by use of genome-screen data. Am J Hum Genet 66(3):1076–1094
8. Abecasis GR, Cherny SS, Cookson WO, Cardon LR (2001) GRR: graphical representation of relationship errors. Bioinformatics 17(8):742–743
9. Sieberts SK, Wijsman EM, Thompson EA (2002) Relationship inference from trios of individuals, in the presence of typing error. Am J Hum Genet 70(1):170–180
10. Sun L, Wilder K, McPeek MS (2002) Enhanced pedigree error detection. Hum Hered 54(2):99–110
11. Rabiner L (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2):257–286
12. Sun L (2001) Two statistical problems in human genetics. PhD thesis, University of Chicago
13. Thompson EA (1975) The estimation of pairwise relationships. Ann Hum Genet 39(2):173–188
14. Thompson EA (1986) Pedigree analysis in human genetics. The Johns Hopkins University Press, Baltimore
15. Dimitromanolakis A, Paterson AD, Sun L (2009) Accurate IBD inference identifies cryptic relatedness in 9 HapMap populations. Abstract no. 1768
16. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81(3):559–575
17. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Statist Soc Ser B 57:289–300
18. Craiu RV, Sun L (2008) Choosing the lesser evil: trade-off between false discovery rate and non-discovery rate. Statistica Sinica 18:861–879
19. Sun L, Craiu RV, Paterson AD, Bull SB (2006) Stratified false discovery control for large-scale hypothesis testing with application to genome-wide association studies. Genet Epidemiol 30:519–530
20. Begleiter H, Reich T, Nurnberger JJ, Li TK, Conneally PM, Edenberg H, Crowe R, Kuperman S, Schuckit M, Bloom F, Hesselbrock V, Porjesz B, Cloninger CR, Rice J, Goate A (1999) Description of the Genetic Analysis Workshop 11 Collaborative Study on the Genetics of Alcoholism. Genet Epidemiol 17 Suppl 1:S25–S30
21. Antoni G, Morange P, Luo Y, Saut N, Burgos G, Heath S, Germain M, Biron-Andreani C, Schved J, Pernod G, Galan P, Zelenika D, Alessi M, Drouet L, Visvikis-Siest S, Wells P, Lathrop M, Emmerich J, Tregouet D, Gagnon F (2010) A multi-stage multi-design strategy provides strong evidence that the BAI3 locus is associated with early-onset venous thromboembolism. J Thromb Haemost. doi: 10.1111/j.1538-7836.2010.04092.x
22. Epstein MP, Duren WL, Boehnke M (2000) Improved inference of relationship for pairs of individuals. Am J Hum Genet 67(5):1219–1231
23. McPeek MS (2002) Inference on pedigree structure from genome screen data. Statistica Sinica 12:311–335
24. Sun L, Abney M, McPeek MS (2001) Detection of mis-specified relationships in inbred and outbred pedigrees. Genet Epidemiol 21 Suppl 1:S36–S41
25. Broman KW, Weber JL (1998) Estimation of pairwise relationships in the presence of genotyping errors. Am J Hum Genet 63(5):1563–1564
26. Ray A, Weeks DE (2008) Relationship uncertainty linkage statistics (RULS): affected relative pair statistics that model relationship uncertainty. Genet Epidemiol 32(4):313–324
Chapter 4

Identifying Cryptic Relationships

Lei Sun and Apostolos Dimitromanolakis

Abstract

Cryptic relationships, such as those between first-degree relatives, often appear in studies that collect population samples, such as case–control genome-wide association studies (GWAS). Cryptic relatedness not only inflates the type 1 error rate but also affects other aspects of GWAS, such as the assessment of population stratification via principal component analysis. Here we discuss two effective methods, as implemented in PREST and PLINK, to detect and correct for cryptic relatedness using high-throughput SNP data collected from GWAS or next-generation sequencing (NGS) experiments. We provide the analytical and practical details involved, using three application examples.

Key words: Cryptic relatedness, Pedigree error, Relationship estimation, IBD, IBS, IIS, Likelihood, EM algorithm, Method-of-moments, Software, PREST, PREST-plus, PLINK, GWAS, Sequencing
1. Introduction

Data quality control and error detection are paramount in ensuring valid and powerful gene discoveries and the success of replication studies. Apart from quality control of the genotypes, it is important to specify correct genealogical relationships for all individuals involved in a study. In the context of high-throughput genome scans such as genome-wide association studies (GWAS), it is common to collect putatively unrelated case–control samples. However, cryptic relationships, such as duplicated samples or first-degree relatives, often appear in a study. In this chapter, we discuss effective methods to detect and correct for the problem of cryptic relatedness using high-throughput genome-wide scan data collected from GWAS or next-generation sequencing (NGS) experiments. GWAS form a new paradigm that provides an agnostic search of the whole human genome, looking for association between a trait and a vast number of SNPs that tag a large proportion of the
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_4, # Springer Science+Business Media, LLC 2012
genome (see Chapter 18 for more details on GWAS). Modern GWAS collect thousands of case–control samples genotyped at half a million or more SNPs. Prior to the actual association analysis, data quality control must be performed on both the SNPs and the samples, covering, among other things, genotype quality and cryptic relatedness. The latter turns out to be a crucial element for a number of reasons. First and foremost, related individuals within a GWAS can lead researchers to false association results, because the genotypes of different samples are no longer independent of each other, as assumed by the standard case–control association analysis. Cryptic relatedness inflates the type 1 error rate and the overall significance of P-values (1), unless a correct correlation structure among individuals is specified in the association model (2). Cryptic relatedness also affects other aspects of data quality control, such as the assessment of population stratification via principal component analysis (see Chapter 21 and (3) for more details on the topic of population stratification and association analysis). In this case, cryptic first-degree relatives, for example, could completely alter the population stratification results, in which case the most important principal component partitions samples based on different degrees of relatedness rather than different populations. Unfortunately, this type of error is not academic and occurs frequently in practice: the GWAS literature to date suggests that few GWAS have been entirely free of cryptic relatedness. It is even more important to pay attention to this type of error as GWAS designs move from thousands of samples to tens of thousands of individuals, since the probability of sampling related individuals is then likely to increase dramatically, particularly among the cases.
Another source of cryptic relatedness is duplication of samples, which can occur in the lab during the genotyping process. This is an extreme case in which the two samples in effect have a relationship equivalent to that of monozygotic twins. Some GWAS and the emerging high-throughput sequencing studies also collect family data, for example, to improve power by enriching for the occurrence of rare variants. Because cryptic relatedness can also occur between putatively independent families, identifying cryptic relationships is not limited to GWAS with case–control samples, just as detecting pedigree errors is not limited to data collected for linkage analysis. The methods described in Chapter 3 and implemented in PREST (i.e., the three allele-sharing statistics, the likelihood-based statistic, and relationship estimation) were developed for general relationship types, including putatively unrelated individuals, and so they are suitable for identifying cryptic relationships (4). Application of these methods to high-throughput GWAS or sequencing
data, however, requires additional care. To perform formal relationship hypothesis testing using such genotype data, linkage disequilibrium (LD) between SNPs must be modeled in the variance and likelihood calculations. This approach utilizes all available data, but it considerably increases the sensitivity of the methods to the assumption that map and allele frequencies were accurately estimated, and it also significantly increases the computational burden. Alternatively, the most informative SNPs that are also in linkage equilibrium can be selected, so that existing methods are directly applicable. In practice, relationship estimation alone, based on estimates of the identity-by-descent (IBD) distribution using all or part of the GWAS or NGS SNP data, can often provide sufficient information to identify cryptic relationships such as first-degree relatives (5). In the following, we first briefly review the relationship estimation methods implemented in PREST (5, 6) and PLINK (7). In Subheading 2, we compare the two methods using the HapMap (8) phase III GWAS data (May 2010 release) and provide implementation and application details (see also Notes 1 and 2).

1.1. Measures of Relatedness
The relatedness between a pair of individuals can be summarized by their mean IBD probability distribution, p = (p0, p1, p2), as defined in Chapter 3. It gives the probability that, at a randomly sampled marker, the two individuals share 0, 1, or 2 alleles inherited from a common ancestor. The kinship coefficient f is an alternative, also commonly employed, measure of relatedness. It gives the probability that two alleles, one sampled at random from each individual at a randomly sampled marker, descend from a common ancestor. The following relationship therefore holds between the IBD probabilities and the kinship coefficient:

f = 0.25 p1 + 0.5 p2.

Table 1 in Chapter 3 summarizes the IBD distributions and kinship coefficients for the most commonly encountered human relationships. Although these summary statistics do not fully determine a relationship type (e.g., half-sib, grandparent–grandchild, and avuncular share the same IBD distribution), the estimates of these statistics can be used to identify cryptic relatedness, particularly when the true relationship is first- or second-degree (e.g., the first seven relationships in Table 1 of Chapter 3, which have f ≥ 0.125).
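The conversion from the IBD distribution to the kinship coefficient is a one-liner; for instance (IBD values as in Table 1 of Chapter 3):

```python
def kinship(p):
    """Kinship coefficient f = 0.25*p1 + 0.5*p2 computed from the
    IBD distribution p = (p0, p1, p2)."""
    p0, p1, p2 = p
    return 0.25 * p1 + 0.5 * p2

print(kinship((0.25, 0.5, 0.25)))  # full-sib -> 0.25
print(kinship((0.0, 1.0, 0.0)))    # parent-offspring -> 0.25
print(kinship((0.5, 0.5, 0.0)))    # half-sib -> 0.125
print(kinship((1.0, 0.0, 0.0)))    # unrelated -> 0.0
```

Note that full-sib and parent–offspring pairs share the same f (0.25) despite different IBD distributions, which illustrates why neither summary alone fully determines the relationship type.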
1.2. Relationship Estimation Methods Implemented in PREST and PLINK
There are two commonly employed methods that use genome-wide scan data to estimate the IBD distribution, p = (p0, p1, p2). One is the maximum likelihood method (see Chapter 3), implemented in PREST (5, 6). The other is the method-of-moments approach implemented in PLINK (7). The likelihood-based method is more powerful than the method-of-moments approach, but there is a trade-off between statistical power and computational efficiency. With the increasing availability of raw
Table 1
Advantages and disadvantages of PREST and PLINK, the two computer programs commonly used for IBD estimation

PREST vs. PLINK for identifying cryptic relationships

Computation: PREST is more computationally intensive but can be easily parallelized on a computing cluster
Interface: Both are user friendly and use the same input files, but PLINK is more familiar to many users
Relationship estimation: PREST provides considerably more accurate IBD estimates, particularly when the number of markers is limited (e.g., <50 K)
Genotype type: PREST takes both microsatellites and SNPs, while PLINK handles only SNPs
Genotype density: PREST is designed for both low- and high-density maps, while the PLINK IBD estimation method assumes that a very dense map is available
Family data: PREST is designed for detecting both pedigree errors and cryptic relationships in both family and population data, while PLINK is designed for cryptic relationships only
Sex chromosomes: Neither uses sex-chromosome SNPs
Sensitivity to allele frequency: Both are sensitive to misspecified allele frequencies
Relationship testing: PREST performs both relationship estimation and hypothesis testing
computing power, the issue of computational efficiency becomes less important, and the likelihood-based method can feasibly be applied to large GWAS. When high-throughput SNP data are available, both computer programs perform well, and each has its own advantages and disadvantages (Table 1). Generally speaking, PLINK is sufficient for GWAS with case–control samples only and for the detection of first-degree relationships. However, the higher variability of the PLINK estimates can lead the investigator to exclude a large proportion of truly unrelated pairs from the study (see the application examples below). Therefore, it is recommended that, when distant relationships are discovered in a dataset via PLINK, the results be verified with PREST. PREST is also more suitable for other study designs, for example, GWAS with families (with or without additional case–control samples), as in Subheading 2.5, or when only a limited number of markers is genotyped (e.g., a linkage marker panel, as in Subheading 2.4, or targeted sequencing data).
2. Methods

2.1. Methods Evaluation Using HapMap Phase III Data
To compare the two estimation methods, we applied PREST (http://www.utstat.toronto.edu/sun/Software/Prest, version 4.07) and PLINK (http://pngu.mgh.harvard.edu/~purcell/plink/, version 1.06) to the HapMap phase III GWAS data (May 2010 release). This dataset is freely available from the HapMap home page (http://hapmap.ncbi.nlm.nih.gov/). A study of cryptic relatedness in all eleven HapMap populations was previously reported via the use of PREST (5). Here we use the European-descent CEPH (CEU) samples (8) as proof of principle. The HapMap phase III CEU data consist of 113 founders (i.e., putatively unrelated individuals) genotyped at ~1.4 M SNPs. After filtering out SNPs with low minor allele frequency (MAF <15%), low call rate (<97%), or high LD (pair-wise r² > 0.5), ~180 K autosomal SNPs remain and are used for relationship estimation (see Subheading 2.3 for details on SNP selection). Figure 1 shows the estimated IBD distribution for the 6,328 putatively unrelated pairs. There is one outlier pair with IBD estimates close to what is expected for a full-sib pair, p = (p0, p1, p2) = (0.25, 0.5, 0.25). Most pairs have estimates close to what is expected for unrelated individuals, but there are nine putatively unrelated pairs with p̂0 < 0.95 based on PREST and 159 based on PLINK. These results show that (a) cryptic first-degree relatives can be easily identified based on relationship estimation alone, via either PREST or PLINK, when high-throughput genotype data are available, and
Fig. 1. Relationship estimation of CEU founders using HapMap III GWAS data via PREST (left) and PLINK (right). Each plot shows the estimated IBD distribution, p̂1 vs. p̂0, for the 6,328 pairs having the hypothesized relationship, R0, of UNrelated. The cross marks the expected IBD distribution for R0, i.e., p = (p0, p1, p2) = (1, 0, 0).
(b) estimates based on the method-of-moments have more variability than those from the likelihood-based approach, even when ~180 K SNPs are used. Therefore, when only a limited number of markers is available, or when the cryptic relatedness is more distant than a first-degree relationship, it is more advantageous to use PREST. In addition, PREST can formally assess the statistical significance of the variation observed by performing relationship hypothesis testing, as demonstrated in Subheading 2.4. In the following, we provide practical details on using PREST (and PLINK where applicable) to identify cryptic relationships, including applications to two additional datasets.

2.2. Essential Steps in Using PREST to Identify Cryptic Relationships
The essential steps here are identical to those outlined in Chapter 3, Subheading 2.1 (on how to run PREST) and Subheading 2.2 (on how to interpret PREST results) for detecting pedigree relationship errors, with one exception: the –aped option must be used, so that the two individuals of a pair must have different family IDs (see Subheading 2.3 in Chapter 3). Because the number of pairs to be analyzed is approximately of the order of n²/2, where n is the total number of individuals in the dataset, a significant amount of computing time is necessary, even for IBD estimation alone. Therefore, it is recommended that the –cm and –mlrt options not be used for all pairs in the data but be invoked only for suspected pairs based on the relationship estimation results (see the example in Subheading 2.5). In the context of GWAS, selection of SNPs prior to the PREST analysis is also recommended, for a number of reasons detailed in Subheading 2.3. In summary, the following steps are used to identify cryptic relationships via PREST (see also Subheading 2 of Chapter 3):

Step 0: Download a PREST version suitable for your computer platform at http://www.utstat.toronto.edu/sun/Software/Prest/.

Step 1: Prepare two input files, both identical to those used by PLINK (7), except that the alleles should be encoded as 1/2 rather than as A/C/G/T characters. To accomplish this easily, the PLINK option –recode12 can be used to recode the dataset with numerically coded alleles.

Step 2: Run the program with different command options depending on user-specific needs; typically only relationship estimation is invoked for all possible pairs in the dataset:

prest –file prest.ped –map prest.map –aped –out prest.aped.option0

Step 3: Interpret the results and perform possible further analysis. For example, based on the estimated IBD distribution, perform relationship hypothesis testing for suspected pairs using the –pair option, e.g.,
prest –file prest.ped –map prest.map –cm –mlrt –pair Fam.ID1 ID1 Fam.ID2 ID2

For GWAS, additional SNP selection procedures, as described in Subheading 2.3 below, are typically needed between Step 1 and Step 2, and are necessary for conducting hypothesis testing in Step 3.

2.3. Selection of GWAS SNPs
There are three levels of GWAS SNP selection. Levels 1 and 2 are recommended but not necessary for relationship estimation; Level 3 is required for relationship testing.

Level 1 involves quality control of SNPs, similar to what is typically done prior to association analysis. Detecting cryptic relationships and pedigree errors uses mean statistics that are averaged over all SNPs in the genome, so a few low-quality SNPs are unlikely to affect the relationship analysis; nevertheless, it is sensible to filter out such SNPs. It is also important to use only autosomal markers for the application of PREST (or PLINK).

Level 2 SNP selection is for computational reasons. Our experience with PREST shows that there is little improvement in relationship estimation accuracy once more than ~50 K SNPs are used (typically with MAF >5%). (Substantially more SNPs are needed for PLINK to achieve similar estimation efficiency.) Therefore, it is logical not to use all available SNPs, so that substantial computational efficiency can be gained with little compromise in statistical efficiency. One can perform LD-based pruning, and in practice we have seen that even random selection of SNPs provides accurate results in PREST. Both levels of filtering can be easily achieved using PLINK and are also typically done for PLINK-based IBD estimation. For example, the PLINK options –extract, –maf 0.15, and –geno 0.03 select SNPs with MAF >0.15 and genotype call rate >0.97, and –indep-pairwise 50 5 0.2 performs a pair-wise LD pruning of the SNPs so that r² values between the remaining SNPs are less than 0.2.

Level 3 is used only if formal relationship hypothesis testing is desired. In that case, only SNPs that are in linkage equilibrium are used, and a bp map must be converted to a cM map.
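The pair-wise LD pruning performed by –indep-pairwise can be illustrated with a much-simplified greedy sketch (no sliding window, unlike PLINK; r² computed as the squared genotype correlation; the SNP names and genotype vectors are toy data):

```python
def r2(x, y):
    """Squared Pearson correlation between two 0/1/2 genotype vectors,
    the usual LD measure on unphased data."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov * cov / (vx * vy) if vx and vy else 0.0

def greedy_prune(snps, threshold=0.2):
    """Keep each SNP only if its r^2 with every SNP already kept
    is below the threshold."""
    kept = []
    for name, geno in snps:
        if all(r2(geno, g) < threshold for _, g in kept):
            kept.append((name, geno))
    return [name for name, _ in kept]

snps = [
    ("rs1", [0, 1, 2, 1, 0, 2]),
    ("rs2", [0, 1, 2, 1, 0, 2]),  # perfect copy of rs1 (r^2 = 1): pruned
    ("rs3", [0, 2, 0, 2, 1, 1]),  # uncorrelated with rs1: kept
]
print(greedy_prune(snps))  # -> ['rs1', 'rs3']
```

PLINK's actual implementation prunes within a sliding window of SNPs for efficiency, but the acceptance rule is the same idea.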
2.4. Application 1: Identifying Cryptic Relatedness Across Pedigrees with Genome-Wide Low-Density Microsatellite Data
In this application, we revisit the COGA data (9) discussed in Subheading 2.4 of Chapter 3. Briefly, the data consist of 105 families and 1,214 individuals in total, among which 992 were genotyped at 285 autosomal microsatellite markers. To identify cryptic relatedness between families, we can first obtain estimates of the IBD distribution with the PREST command line:

prest –file coga.ped –map coga.map –aped –out coga.aped.option0
Fig. 2. Relationship estimation for COGA between-family pairs via PREST, as in Subheading 2.4. The plot shows the estimated IBD distribution, p̂1 vs. p̂0, for the 483,571 pairs having the hypothesized relationship, R0, of UNrelated. The estimates are based on genotypes of 100–285 microsatellite markers. The cross marks the expected IBD distribution for R0, i.e., p = (p0, p1, p2) = (1, 0, 0).
In total, 486,036 putatively unrelated pairs (across-pedigree pairs, for which the two individuals of a pair must have different family IDs; see Chapter 3 for results of the within-pedigree analysis) were analyzed, of which 483,571 had at least 100 markers genotyped in common (Fig. 2). Note that PLINK cannot be applied to microsatellite markers, and the IBD estimates via the method-of-moments are not reliable when only a few thousand markers are available. There are some clear outliers, indicating cryptic relationships. For example, there are seven pairs with p̂0 < 0.65. However, we also note that the variation associated with the IBD estimates is large in this case because of the small number of markers. To distinguish genuine outliers due to cryptic relatedness from outliers due to sampling variation, we can perform relationship hypothesis testing using the PREST command line:

prest –file coga.ped –map coga.map –cm –mlrt –pair Fam.ID1 ID1 Fam.ID2 ID2

for each of the suspected pairs. For most of the seven pairs, the null relationship of unrelated is rejected based on the likelihood ratio test (LRT) (at the α = 0.05/7 level), and the alternative relationship(s) of half-sib, grandparent–grandchild, avuncular, or half-sib + first-cousin are most compatible with the observed genotype data. In practice, the plausible alternative relationship(s) inferred from
4 Identifying Cryptic Relationships
Fig. 3. Relationship estimation for VTE between-family pairs via PREST as in Subheading 2.5. The plot shows the estimated IBD distribution, $\hat{p}_1$ vs. $\hat{p}_0$, for the 19,281 pairs having the hypothesized relationship, R0, of unrelated. The estimates are based on genotypes of ~300 K GWAS SNPs. The cross marks the expected IBD distribution for R0, i.e., $p = (p_0, p_1, p_2) = (1, 0, 0)$.
this analysis, in conjunction with other information such as the results for other pairs, could be used to reconstruct pedigrees.

2.5. Application 2: Identifying Cryptic Relatedness Across Pedigrees with Genome-Wide High-Throughput SNP Data
In this application, we revisit the VTE data (10) discussed in Subheading 2.5 of Chapter 3. Briefly, the data consist of five multi-generational French-Canadian pedigrees with a total of 369 individuals, among which 255 were genotyped using the Illumina 660 chip. To identify cryptic relatedness between families, we first randomly selected ~300 K SNPs and then applied PREST to obtain the relationship estimation (Fig. 3; results using fewer SNPs (e.g., ~50 K) or all SNPs are similar) with the command line:

prest --file vte.ped --map vte.map --aped --out vte.aped.option0

In total, 19,281 putatively unrelated pairs (across-pedigree pairs; see Chapter 3 for results of the within-pedigree analysis) were analyzed. When compared with the COGA data (Fig. 2), it is clear that the IBD estimates are less variable when high-throughput SNP data are available. However, compared with the HapMap CEU PREST results (Fig. 1, left plot), for which only 9 of the 6,328 putatively unrelated pairs have $\hat{p}_0 < 0.95$ estimated from ~180 K SNPs, 12,290 of the 19,281 pairs in the VTE data have $\hat{p}_0 < 0.95$, estimated from ~300 K SNPs. Clearly, the high proportion of potential outliers is not due to variation inherent in the IBD estimation. Rather, the VTE samples are on average more related than the CEU samples, which
is not entirely surprising because all the VTE samples are from five French-Canadian families ascertained through familial thrombophilia with a Factor V Leiden variant. Although the VTE samples are on average somewhat related, the smallest $\hat{p}_0$ is 0.818 (the largest kinship coefficient estimate is 0.047), indicating that the relatedness is perhaps no more than half-first cousin. This raises two questions. The first is whether we have good power to distinguish between unrelated individuals and distantly related relatives such as half-first cousins or second cousins. We can address this question by performing an LRT, although SNPs must be selected so that they are not in LD, as discussed before. The second question is a practical one. When some putatively unrelated pairs are in fact distantly related, technically cryptic relationships exist in the data. However, whether such cryptic relatedness has a consequential effect on association analysis is an open question.
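The kinship coefficient quoted above (0.047) and the smallest $\hat{p}_0$ (0.818) are linked by the standard relation $\phi = p_1/4 + p_2/2$. The short sketch below is an illustration supplied here, not code from the chapter; the relation itself and the assumption of negligible $p_2$ for a distant relationship are the stated assumptions:

```python
def kinship_from_ibd(p0, p1, p2):
    """Kinship coefficient from an IBD-sharing distribution (p0, p1, p2).

    Standard relation: phi = p1/4 + p2/2 (assumption supplied here,
    not stated explicitly in the text).
    """
    assert abs(p0 + p1 + p2 - 1.0) < 1e-9
    return p1 / 4 + p2 / 2

# Smallest p0 reported for the VTE data is 0.818; assuming p2 = 0 for a
# distant relationship, the implied kinship is close to the reported 0.047.
phi = kinship_from_ibd(0.818, 0.182, 0.0)
print(round(phi, 4))  # 0.0455

# Reference points: first cousins have p = (3/4, 1/4, 0) and
# half-first cousins have p = (7/8, 1/8, 0).
print(kinship_from_ibd(0.75, 0.25, 0.0))    # 0.0625
print(kinship_from_ibd(0.875, 0.125, 0.0))  # 0.03125
```

The estimated kinship of 0.047 falls between the first-cousin and half-first-cousin values, consistent with the text's reading of the relationship.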
3. Notes

1. Both PLINK and PREST use autosomal SNPs only to obtain estimates of the IBD distribution (see Subheading 3 of Chapter 3 for a discussion of the use of sex-chromosome markers). Both programs also use the available genotype data to estimate allele frequencies, which, in turn, are used for the IBD estimation. However, when the study population is not homogeneous, allele frequencies estimated from the whole sample can lead to systematic bias in the IBD estimation. In that case, PREST can use population-specific allele frequencies specified in an external file.

2. Recently, methods to detect local IBD sharing in population samples (i.e., among putatively unrelated samples) have been proposed for population-based linkage mapping (11). Although there is some connection, the focus here is on estimating global IBD sharing averaged across the whole genome. From an evolutionary point of view, all population samples are related and share common ancestry in parts of the genome, albeit extremely rarely across the genome and short in length (e.g., twentieth cousins). For the purpose of identifying cryptic relationships, however, such distantly related samples are categorized as unrelated. In this context, it is also important to note that, besides the concern about the number of markers available, the genotype data used for relationship analysis must not come from specific region(s) selected for the phenotype of interest. For example, if only markers from linkage peaks are used, an upward bias in IBD estimation is expected for affected relative pairs.
References

1. Voight BF, Pritchard JK (2005) Confounding from cryptic relatedness in case-control association studies. PLoS Genet 1: e32
2. Thornton T, McPeek MS (2010) ROADTRIPS: case-control association testing with partially or completely unknown population and pedigree structure. Am J Hum Genet 86: 172–18
3. Price AL, Zaitlen NA, Reich D, Patterson N (2010) New approaches to population stratification in genome-wide association studies. Nat Rev Genet 11: 459–46
4. McPeek MS, Sun L (2000) Statistical tests for detection of misspecified relationships by use of genome-screen data. Am J Hum Genet 66: 1076–109
5. Dimitromanolakis A, Paterson AD, Sun L (2009) Accurate IBD inference identifies cryptic relatedness in 9 HapMap populations. Abstract no. 1768 presented at the annual meeting of the American Society of Human Genetics
6. Sun L, Wilder K, McPeek MS (2002) Enhanced pedigree error detection. Hum Hered 54: 99–11
7. Purcell S, et al (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81: 559–57
8. The International HapMap Consortium (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature 449: 851–861
9. Begleiter H, et al (1999) Description of the Genetic Analysis Workshop 11 Collaborative Study on the Genetics of Alcoholism. Genet Epidemiol 17 Suppl 1: S25–3
10. Antoni G, et al (2010) A multi-stage multi-design strategy provides strong evidence that the BAI3 locus is associated with early-onset venous thromboembolism. J Thromb Haemost DOI 10.1111/j.1538-7836.2010.04092.x
11. Browning SR, Browning BL (2010) High-resolution detection of identity by descent in unrelated individuals. Am J Hum Genet 86: 526–539
Chapter 5

Estimating Allele Frequencies

Indra Adrianto and Courtney Montgomery

Abstract

Methods of estimating allele frequencies from data on unrelated and related individuals are described in this chapter. For samples of unrelated individuals with simple codominant markers, the natural estimator of allele frequencies can be used. For genetic data on related individuals, maximum likelihood estimation (MLE) can be applied to compute allele frequencies. Factors that influence allele frequencies in populations are also explained.

Key words: Allele, Allele frequency, Genotype, Phenotype, Natural estimator, Unrelated individuals, Related individuals, Relatives, Families, Pedigree, Founder, Nonfounder, Population genetics, Disease research, Hardy–Weinberg equilibrium, Maximum likelihood estimation, Log-likelihood, Expectation–maximization algorithm, ABO blood group, Natural selection, Mutation, Migration, Genetic drift, Nonrandom mating
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_5, © Springer Science+Business Media, LLC 2012

1. Introduction

Allele frequencies are key characteristics for understanding both population genetics and disease research. These frequencies are determined by calculating the relative proportion of an allele at a locus in a population. Many methods in genetic studies, such as population history studies, linkage and association analysis, calculation of linkage disequilibrium, admixture mapping, and even copy number variant detection, require accurate estimates of allele frequencies. Inaccurate estimation of allele frequencies can cause false positives or reduce power in linkage analysis (1–3) and lead to spurious or missed effects in association analysis and linkage disequilibrium calculation. Admixture mapping, an affected-only study design, also needs strong prior information on the ancestral allele frequencies (4, 5). Suppose there are m different alleles at a particular locus; the allele-frequency distribution in a population is given by $a = (a_1, \ldots, a_m)^{T}$, with $\sum_{i=1}^{m} a_i = 1$ and $a_i > 0$, where $a_i$ is the (relative) allele
frequency. Consider a situation where we have genotype data for N randomly sampled individuals from a large population with two alleles (A1 and A2) at a locus of a codominant system. The natural (naïve) estimator of the A1 allele frequency, p, is given by p = (the number of A1 alleles)/(2N). Similarly, the frequency of the A2 allele, q, is given by q = (the number of A2 alleles)/(2N), where q = 1 − p. If the frequencies of all genotypes (A1A1, A1A2, and A2A2) are known, then p = (the frequency of A1A1) + ½(the frequency of A1A2) and q = (the frequency of A2A2) + ½(the frequency of A1A2). When a population meets the Hardy–Weinberg equilibrium assumptions (i.e., allele and genotype frequencies remain constant from generation to generation in a large, random-mating population with no genetic drift, no mutation, no migration, and no natural selection), the genotype frequencies can be calculated from the allele frequencies. For example, at a locus with two alleles, the frequency of A1A1 = p^2, the frequency of A1A2 = 2pq, and the frequency of A2A2 = q^2. In the case where we collect data from related individuals, the estimator of allele frequencies described above can be inaccurate. Previous studies have suggested that maximum likelihood estimation (MLE) of allele frequencies is the best approach for this problem (6–9). The basic idea of MLE is to determine the parameter values that maximize the probability of the observed data. In the following section, we describe the application of MLE to estimate allele frequencies based on data from unrelated and related individuals; we then advance our discussion to explain factors that can alter allele frequencies in populations.

1.1. Allele Frequency Estimation
In this section, we first introduce an application of MLE to compute allele frequencies from genotype frequencies on unrelated individuals. Then we describe how to estimate allele frequencies for a dominant system such as the ABO blood group system. Methods for estimating allele frequencies from data on related individuals or families are also discussed. Furthermore, we explain the importance of understanding factors that influence allele frequencies in population and disease research genetics.
1.1.1. Allele Frequency Estimation from Data on Unrelated Individuals
Suppose we are given genotype data with two alleles at a locus from unrelated individuals that satisfy the Hardy–Weinberg equilibrium assumptions as follows:

Genotype          A1A1   A1A2   A2A2   Total
Genotype counts   n11    n12    n22    N
Frequency         p11    p12    p22    1
The likelihood as a function of the allele frequency is given by:

$$L(p; N) = \binom{N}{n_{11}, n_{12}, n_{22}} p_{11}^{n_{11}} p_{12}^{n_{12}} p_{22}^{n_{22}} = \binom{N}{n_{11}, n_{12}, n_{22}} \left(p^2\right)^{n_{11}} (2pq)^{n_{12}} \left(q^2\right)^{n_{22}}.$$

Then, we calculate the log-likelihood and its derivative as follows:

$$\ell(p; N) = \ln L(p; N) = 2n_{11}\ln p + n_{12}\ln p + n_{12}\ln q + 2n_{22}\ln q + C$$
$$= (2n_{11} + n_{12})\ln p + (2n_{22} + n_{12})\ln q + C$$
$$= (2n_{11} + n_{12})\ln p + (2n_{22} + n_{12})\ln(1 - p) + C,$$

$$\frac{d\ell(p; N)}{dp} = \frac{2n_{11} + n_{12}}{p} - \frac{2n_{22} + n_{12}}{1 - p} = 0,$$

$$\hat{p} = \frac{2n_{11} + n_{12}}{2(n_{11} + n_{12} + n_{22})} = \frac{2n_{11} + n_{12}}{2N} = \frac{n_{11}}{N} + \frac{n_{12}}{2N}.$$
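The closed-form estimator is easy to check numerically. The short sketch below uses hypothetical genotype counts (not data from the chapter) and also computes the Hardy–Weinberg genotype frequencies implied by the estimate:

```python
def allele_freq(n11, n12, n22):
    """MLE (equal to the natural estimator) of the A1 allele frequency p
    from genotype counts (n11, n12, n22) at a biallelic codominant locus."""
    n = n11 + n12 + n22
    return (2 * n11 + n12) / (2 * n)

def hwe_genotype_freqs(p):
    """Expected genotype frequencies (A1A1, A1A2, A2A2) under
    Hardy-Weinberg equilibrium."""
    q = 1 - p
    return p * p, 2 * p * q, q * q

# Hypothetical counts: 30 A1A1, 40 A1A2, 30 A2A2 among N = 100 individuals.
p = allele_freq(30, 40, 30)
print(p)                      # 0.5
print(hwe_genotype_freqs(p))  # (0.25, 0.5, 0.25)
```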
The natural estimator and the maximum likelihood estimator of p give the same solution, which means that the allele frequencies for genotype data with two alleles at a locus from unrelated individuals can be calculated straightforwardly using the natural estimator. For genotype data with more than two alleles at a locus, the frequency of one allele can be calculated by adding its homozygote frequency and half of the sum of all heterozygote frequencies for that particular allele. However, if there is dominance in the genetic data, for example as in the human ABO blood group system, the natural estimator cannot be applied. Consider the ABO system in a population that satisfies the Hardy–Weinberg equilibrium assumptions as follows:

Genotype                  AA    AO    BB    BO    AB    OO    Total
Phenotype (blood group)   A     A     B     B     AB    O     –
Genotype counts           nAA   nAO   nBB   nBO   nAB   nOO   N
Expected frequency        p^2   2pr   q^2   2qr   2pq   r^2   1
The likelihood as a function of the allele frequencies is defined by:

$$L(p, q, r; N) = \binom{N}{n_{AA}, n_{AO}, n_{BB}, n_{BO}, n_{AB}, n_{OO}} \left(p^2\right)^{n_{AA}} (2pr)^{n_{AO}} \left(q^2\right)^{n_{BB}} (2qr)^{n_{BO}} (2pq)^{n_{AB}} \left(r^2\right)^{n_{OO}},$$

where p, q, and r are the frequencies of alleles A, B, and O, respectively. The log-likelihood can be written as:

$$\ell(p, q, r; N) = \ln L = n_{AA}\ln p^2 + n_{AO}\ln(2pr) + n_{BB}\ln q^2 + n_{BO}\ln(2qr) + n_{AB}\ln(2pq) + n_{OO}\ln r^2 + C.$$
By applying the EM (expectation–maximization) algorithm (7, 10) to find the p, q, and r that maximize $\ell(p, q, r; N)$, the expectation step with initial guesses p = p₁, q = q₁, and r = r₁ calculates the expected numbers of genotypes from the observed phenotype counts $n_A = n_{AA} + n_{AO}$ and $n_B = n_{BB} + n_{BO}$:

$$n_{AA} = n_A \frac{p^2}{p^2 + 2pr}, \quad n_{AO} = n_A \frac{2pr}{p^2 + 2pr},$$
$$n_{BB} = n_B \frac{q^2}{q^2 + 2qr}, \quad n_{BO} = n_B \frac{2qr}{q^2 + 2qr}.$$

New guesses for p, q, and r are then calculated from the following estimates (maximization step):

$$p_2 = \frac{2n_{AA} + n_{AO} + n_{AB}}{2N}, \quad q_2 = \frac{2n_{BB} + n_{BO} + n_{AB}}{2N}, \quad r_2 = \frac{n_{AO} + n_{BO} + 2n_{OO}}{2N}.$$

These estimates are used for another iteration of the expectation and maximization steps. The iteration continues until pₙ, qₙ, and rₙ are stationary, or at least change by less than some threshold from one iteration to the next.
1.1.2. Allele Frequency Estimation from Data on Related Individuals
Given genotype data from related individuals or families, allele frequencies can be estimated using a subset of unrelated individuals (founders and singletons; see Note 1) from each pedigree (8). However, this approach is not optimal, since it excludes nonfounder information, and in many cases the founders are not available. Another approach, which assumes all individuals are unrelated, also has disadvantages: it can overestimate the allele frequencies and yield standard errors that are too small (8). Boehnke (8) proposed a method for allele frequency estimation by maximizing the likelihood of the family data in the framework of pedigree analysis (11, 12), taking into account the dependence among individuals within families and assuming Hardy–Weinberg equilibrium. The likelihood for a pedigree is given by:

$$L(Q) = \sum_{g} \prod_{i} P(X_i \mid G_i) \prod_{j} P(G_j) \prod_{k} P(G_k \mid G_{k_f}, G_{k_m}),$$

where g is a marker genotype configuration, i indexes pedigree members with known marker phenotype, j indexes pedigree members whose parents are not present, k indexes pedigree members whose parents $k_f$ and $k_m$ are present in the pedigree, $Q = (Q_1, \ldots, Q_m)$ corresponds to the frequencies of alleles $A_1, \ldots, A_m$, and $X = (X_1, \ldots, X_n)$ and $G = (G_1, \ldots, G_n)$ are the vectors of marker phenotypes and genotypes of a pedigree
with n individuals, respectively (8). The MLE of the allele frequencies can be computed by maximizing the likelihood function L(Q). Standard errors of the allele frequency estimates are obtained from the second partial derivatives of the log-likelihood function. Boehnke (8) showed that this method can improve estimates of allele frequencies. The method has been implemented in several software packages, including Mendel (http://www.genetics.ucla.edu/software/mendel) (13) and the Statistical Analysis for Genetic Epidemiology (S.A.G.E.) FREQ program (http://darwin.cwru.edu/sage/) (14). For the case when parental genotypes are not available, Broman (15) compared four different methods for estimating allele frequencies from data on sibships. He found that, although it requires an increase in computational time, taking into account the relationships between individuals provides an improved estimator over one that ignores them. However, the estimator calculated by assuming all individuals are unrelated is unbiased and provides an improvement over the one using only one individual from each sibship. Another approach, proposed by McPeek et al. (9), uses the best linear unbiased estimator (BLUE) of allele frequencies for data from large and complex pedigrees. This estimator is similar to the quasi-likelihood estimator and performs very similarly to the MLE, but with faster computation. This method has been implemented in the CCQLS program (http://galton.uchicago.edu/~mcpeek/software/).

1.2. Factors That Can Alter Population Allele Frequencies
In population and disease research genetics, it is important to note the factors that can influence allele frequencies; they are often treated as nuisance parameters in studies of human genetic disease. These factors include natural selection, mutation, migration or gene flow, genetic drift, and nonrandom mating. While these natural phenomena are interesting in and of themselves, it is a common assumption in the study of disease that they are infrequent enough to be ignored, which makes it necessary to discuss them here.
1.2.1. Natural Selection
Natural selection is a process of differential survival and reproduction of individuals with genetic characteristics or phenotypes that are more favorable or well adapted in a specific environment. The ability of an individual to survive and to reproduce successfully is called fitness. The differences in the relative fitness of the phenotypes result in a change of allele frequency after a generation of selection. Phenotypes that have a higher probability of survival and reproduction will increase in frequency in the next generation. If a phenotype has a high heritability, natural selection can change its frequency more rapidly compared to one with a low heritability. Natural selection tends to decrease genetic variation in a population over time.
1.2.2. Mutation
A mutation is a change of the DNA sequence in the genome. Mutations may involve substitution, insertion, or deletion of a single base or a section of the DNA, or inversion of a section of
the DNA. They can be spontaneous or induced by mutagens, such as chemicals, radiation, and viral infections. Genetic variation can increase over time because of mutations. However, the effect of mutation on changing allele frequencies is relatively weak, since the rate of mutation is typically low. Furthermore, most mutations are not helpful to the organism.

1.2.3. Migration or Gene Flow
Migration or gene flow refers to the movement of alleles from one population to another population that may have different allele frequencies. If individuals move to a new location or population and mate with individuals there, they can change the existing allele frequencies in that population by contributing to a new gene pool, and they can introduce new genetic variation. The effect of migration on the allele frequencies depends on the number of migrating individuals and the differences in allele frequencies between the two populations. If allele frequencies are similar in the two populations, migration will have little impact on the new allele frequencies.
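This dependence on the migrant fraction and on the frequency difference can be made concrete with the standard one-way migration model, p′ = (1 − m)·p + m·p_migrant; the model and the numbers below are illustrative, not from the chapter:

```python
def after_migration(p_resident, p_migrant, m):
    """Allele frequency after one generation in which a fraction m of the
    gene pool comes from migrants (standard one-way migration model)."""
    return (1 - m) * p_resident + m * p_migrant

# 5% migrants from a population with a very different frequency
# shift the allele frequency noticeably...
print(after_migration(0.10, 0.60, 0.05))  # 0.125

# ...but migration between populations with similar frequencies
# barely changes it, as the text notes.
print(after_migration(0.10, 0.12, 0.05))  # 0.101
```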
1.2.4. Genetic Drift
Genetic drift is a random change in allele frequencies in a population. In other words, allele frequencies can increase or decrease by chance. The effect of genetic drift on altering allele frequencies is higher in small populations than in large populations. Two types of genetic drift that are generally observed are founder effect and bottleneck. A founder effect occurs when a random group of individuals with different allele frequencies compared to the original population creates a new community. The founders of this community will have a strong effect on the allele frequencies of future generations. The corresponding phenotypes in this community can have a higher frequency compared with the original or other populations. A bottleneck occurs when the size of a population decreases significantly because of random events such as natural disasters (e.g., earthquakes, volcano eruptions, floods, and tsunamis) or diseases. It may change the allele frequencies in the affected population drastically. Genetic variation can be reduced by a bottleneck.
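The statement that drift is stronger in small populations can be illustrated with a toy Wright–Fisher simulation (an illustration supplied here, not part of the chapter): each generation, 2N allele copies are drawn at random from the current frequency, so small populations wander farther from the starting value:

```python
import random

def wright_fisher(p0, n_individuals, generations, rng):
    """Simulate allele frequency under pure genetic drift: each generation,
    2N allele copies are drawn binomially from the current frequency."""
    two_n = 2 * n_individuals
    p = p0
    for _ in range(generations):
        count = sum(rng.random() < p for _ in range(two_n))
        p = count / two_n
    return p

rng = random.Random(0)  # fixed seed so the sketch is reproducible
small = [wright_fisher(0.5, 20, 50, rng) for _ in range(100)]
large = [wright_fisher(0.5, 200, 50, rng) for _ in range(100)]

def spread(freqs):
    """Mean squared deviation from the starting frequency of 0.5."""
    return sum((f - 0.5) ** 2 for f in freqs) / len(freqs)

print(spread(small) > spread(large))  # drift is stronger when N is small
```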
1.2.5. Nonrandom Mating
Nonrandom mating occurs when individuals select their mates based on particular characteristics or phenotypes or when individuals mate with close relatives (inbreeding). If individuals with certain phenotypes mate with individuals with similar phenotypes, they are likely to have offspring with similar phenotypes too. This can alter the allele frequencies because only the alleles associated with those phenotypes are passed on to the next generation. On the other hand, inbreeding will increase homozygosity in the offspring because two close relatives have a higher probability of having similar alleles compared to two unrelated individuals. It can also increase the probability of offspring being affected by recessive phenotypes or genetic disorders. It is an assumption of virtually all disease genetic analyses that each of the factors that influence population allele frequencies
(e.g., those that cause populations not to be in Hardy–Weinberg equilibrium) is either absent or present at such low frequency in human populations that it need not be considered. However, in the presence of missing data, particularly parental data in a trio design where the assumption of Hardy–Weinberg equilibrium is critical (16–19), or in the study of isolated populations or highly selected traits (20–22), where Hardy–Weinberg equilibrium is not likely to hold, caution must be taken to ensure accurate estimation.
2. Methods

In this section, we demonstrate how to use software packages to calculate allele frequencies from data on unrelated and related individuals, using PLINK (23) (see Note 2) and the S.A.G.E. FREQ program (14, 24) (see Note 3), respectively.

2.1. Allele Frequency Estimation Using PLINK
We start by estimating allele frequencies with PLINK, using as an example ten parent–offspring trios with five single-nucleotide polymorphism (SNP) markers. PLINK requires two input files of genotype data: the PED file (pedigree information and alleles for each SNP) and the MAP file (SNP information). The PED file contains columns for family identifier (ID), individual ID, father ID, mother ID, sex/gender, affection status, and the two alleles of each SNP, without a header (see example.ped in Table 1). The MAP file has four columns, for chromosome number, SNP ID, genetic distance (in morgans), and base-pair position (bp), without a header (see example.map in Table 2). Currently, PLINK only supports calculating allele frequencies from data on unrelated individuals. The default --freq function calculates the allele frequencies based on all founders. The following shows how to use the PLINK --freq function on a UNIX/Linux terminal by calling the plink command followed by --file inputfilename without the file extensions:

$ plink --file example --freq --out example_freq
$ cat example_freq.frq
 CHR  SNP  A1  A2    MAF  NCHROBS
   1  rs1   A   G  0.350       40
   1  rs2   C   G  0.125       40
   1  rs3   T   A  0.275       40
   1  rs4   A   G  0.175       40
   1  rs5   G   A  0.075       40
The --freq function generates an output file, example_freq.frq (named by specifying the --out option followed by outputfilename), with
Table 1 The PED file for PLINK (example.ped)

F4  I500 I501 I502 1 1  A G  G G  T T  G G  A A
F4  I501 0    0    1 1  A G  G G  T A  G G  A A
F4  I502 0    0    2 1  G G  G G  T A  G G  A A
F5  I503 I504 I505 1 1  A G  C G  A A  A A  G A
F5  I504 0    0    1 1  A G  C G  A A  A A  G A
F5  I505 0    0    2 1  A G  G G  T A  A G  A A
F9  I506 I507 I508 1 1  G G  C G  A A  G G  A A
F9  I507 0    0    1 1  G G  C G  A A  A G  A A
F9  I508 0    0    2 1  A G  G G  A A  G G  G A
F13 I515 I516 I517 1 1  G G  C G  A A  A G  A A
F13 I516 0    0    1 1  G G  G G  A A  A G  A A
F13 I517 0    0    2 1  A G  C G  A A  G G  A A
F16 I521 I522 I523 1 1  G G  G G  A A  G G  A A
F16 I522 0    0    1 1  G G  G G  A A  G G  A A
F16 I523 0    0    2 1  G G  G G  A A  A A  A A
F18 I852 0    0    2 1  A A  G G  T A  G G  A A
F18 I853 0    0    1 1  A G  G G  T A  G G  G A
F18 I854 I853 I852 1 1  A G  G G  A A  G G  G A
F23 I855 0    0    2 1  A G  G G  T A  G G  A A
F23 I856 0    0    1 1  A G  G G  T T  G G  A A
F23 I857 I856 I855 1 1  G G  G G  T T  G G  A A
F12 I858 0    0    2 1  G G  G G  A A  G G  A A
F12 I859 0    0    1 1  A G  C C  A A  G G  A A
F12 I860 I859 I858 1 1  G G  C G  A A  G G  A A
F24 I861 0    0    2 1  A G  G G  T A  G G  A A
F24 I862 0    0    1 1  A G  G G  T A  G G  A A
F24 I863 I862 I861 1 1  A G  G G  T A  G G  A A
F17 I870 0    0    2 1  G G  G G  A A  G G  A A
F17 I871 0    0    1 1  A G  G G  T A  G G  A A
F17 I872 I871 I870 1 1  G G  G G  A A  G G  A A
Table 2 The MAP file for PLINK (example.map)

1  rs1  0  400000
1  rs2  0  500000
1  rs3  0  600000
1  rs4  0  700000
1  rs5  0  800000
six columns: chromosome number (CHR), SNP ID (SNP), allele 1 or minor allele (A1), allele 2 or major allele (A2), minor allele frequency (MAF), and non-missing allele count (NCHROBS). The cat command is a UNIX/Linux command to concatenate and display files. Since we have ten trios in our genotype data, only the 20 founders (fathers and mothers) are used to calculate the allele frequencies. However, PLINK allows users to calculate allele frequencies using all individuals, including non-founders (the offspring in this example), with the --nonfounders function as follows:

$ plink --file example --freq --nonfounders --out example_freq_all
$ cat example_freq_all.frq
 CHR  SNP  A1  A2     MAF  NCHROBS
   1  rs1   A   G   0.300       60
   1  rs2   C   G   0.150       60
   1  rs3   T   A  0.2667       60
   1  rs4   A   G  0.1667       60
   1  rs5   G   A  0.0833       60
We can see the differences when estimating allele frequencies based on the genotype data of all individuals compared with founders only.

2.2. Allele Frequency Estimation Using the S.A.G.E. FREQ Program
The S.A.G.E. FREQ program estimates allele frequencies from genotype data among unrelated and related individuals with known pedigree information (24). It can calculate allele frequencies for both codominant and dominant markers using founders only, incorporating non-founders assuming they are unrelated but assigning founder_weight in the formulation (see Note 4), or taking into account the dependence between individuals within pedigrees by applying maximum likelihood allele frequency estimation (see Note 5). The input files for the S.A.G.E. FREQ Program consist of a parameter file, a data file (including pedigree information and genotype data), and a marker locus description file (only for
noncodominant markers). The parameter file specifies the parameters and options used to run a S.A.G.E. program. The data file is similar to the PED file for PLINK but typically includes a header. We use the same example as for PLINK, but with some modifications in the input files. The parameter file (SAGE_param.par) and the data file (SAGE_data.ped) can be found in Tables 3 and 4, respectively.

Table 3 The parameter file for the S.A.G.E. package (SAGE_param.par)
Table 4 The data file for the S.A.G.E. package (SAGE_data.ped)

FID IID  FA   MO   Sex Status rs1a rs1b rs2a rs2b rs3a rs3b rs4a rs4b rs5a rs5b
F4  I500 I501 I502 1   1      A G  G G  T T  G G  A A
F4  I501 0    0    1   1      A G  G G  T A  G G  A A
F4  I502 0    0    2   1      G G  G G  T A  G G  A A
F5  I503 I504 I505 1   1      A G  C G  A A  A A  G A
F5  I504 0    0    1   1      A G  C G  A A  A A  G A
F5  I505 0    0    2   1      A G  G G  T A  A G  A A
F9  I506 I507 I508 1   1      G G  C G  A A  G G  A A
F9  I507 0    0    1   1      G G  C G  A A  A G  A A
F9  I508 0    0    2   1      A G  G G  A A  G G  G A
F13 I515 I516 I517 1   1      G G  C G  A A  A G  A A
F13 I516 0    0    1   1      G G  G G  A A  A G  A A
F13 I517 0    0    2   1      A G  C G  A A  G G  A A
F16 I521 I522 I523 1   1      G G  G G  A A  G G  A A
F16 I522 0    0    1   1      G G  G G  A A  G G  A A
F16 I523 0    0    2   1      G G  G G  A A  A A  A A
F18 I852 0    0    2   1      A A  G G  T A  G G  A A
F18 I853 0    0    1   1      A G  G G  T A  G G  G A
F18 I854 I853 I852 1   1      A G  G G  A A  G G  G A
F23 I855 0    0    2   1      A G  G G  T A  G G  A A
F23 I856 0    0    1   1      A G  G G  T T  G G  A A
F23 I857 I856 I855 1   1      G G  G G  T T  G G  A A
F12 I858 0    0    2   1      G G  G G  A A  G G  A A
F12 I859 0    0    1   1      A G  C C  A A  G G  A A
F12 I860 I859 I858 1   1      G G  C G  A A  G G  A A
F24 I861 0    0    2   1      A G  G G  T A  G G  A A
F24 I862 0    0    1   1      A G  G G  T A  G G  A A
F24 I863 I862 I861 1   1      A G  G G  T A  G G  A A
F17 I870 0    0    2   1      G G  G G  A A  G G  A A
F17 I871 0    0    1   1      A G  G G  T A  G G  A A
F17 I872 I871 I870 1   1      G G  G G  A A  G G  A A
The following shows how to run the S.A.G.E. freq command on a UNIX/Linux terminal, followed by -p parameterfile and -d datafile:

$ freq -p SAGE_param.par -d SAGE_data.ped

The program then runs; its screen output is not reproduced here.
We specified the output file name as SAGE_FREQ_output in the parameter file (SAGE_param.par). We obtain the following output files:

1. freq.inf, an information output file that contains diagnostic messages, warnings, and program errors. Users should check this file first before viewing the other output files.

2. SAGE_FREQ_output.sum, a summary output file that contains summary information on the analysis results (Table 5). For each marker, allele frequencies are calculated using founders only, using the entire dataset, and by MLE.

3. SAGE_FREQ_output.det, a detailed output file that contains detailed information on the analysis results, including the MLE results (Table 6).

4. SAGE_FREQ_output.loc, a locus description file that contains the allele frequencies for each marker (Table 7).

2.3. Summary
We have described statistical methods for estimating allele frequencies from data on both unrelated and related individuals. We have also demonstrated how to use two software packages (PLINK and the S.A.G.E. FREQ program) to calculate allele frequencies. Accurate estimation of allele frequencies is important for population and disease genetic studies. For codominant genetic data on unrelated individuals, allele frequencies can be easily calculated using the natural estimator. For a system with dominant alleles, such as the ABO blood group system, MLE with the EM algorithm can be applied to estimate allele frequencies in samples of both related and unrelated individuals. MLE has several desirable properties: for example, as the sample size increases, the estimate converges to the true value, its variance shrinks, and its distribution approaches a normal distribution. However, MLE may not be practical for large and complex pedigrees, since the calculation can be infeasible or computationally slow. For these types of pedigrees, the BLUE method (9) can be used, since it provides performance similar to the MLE with much faster computation. We have also described the factors that can alter allele frequencies in populations and why they are important for the study of human disease.
Table 5 Summary output file (SAGE_FREQ_output.sum)
Table 6 Detailed output file (SAGE_FREQ_output.det)
Table 7 Locus description file (SAGE_FREQ_output.loc)

rs1
A = 0.350000
G = 0.650000
;
;
rs2
G = 0.875000
C = 0.125000
;
;
rs3
T = 0.275000
A = 0.725000
;
;
rs4
G = 0.861111
A = 0.138889
;
;
rs5
A = 0.925000
G = 0.075000
;
;
3. Notes

1. Founders are individuals who do not have parental information in the pedigree (i.e., fathers and mothers). Singletons are unrelated and unconnected individuals in the pedigree. Non-founders are individuals who have both parents in the pedigree (e.g., offspring, siblings, and cousins).

2. PLINK is a free genetic association analysis toolset that can handle large-scale data sets, including genome-wide association study (GWAS) data (23). The PLINK executable file should be placed either in the current working directory or in the command path and run from the command line of a Linux, MS-DOS, or Apple Mac terminal. The latest version of PLINK (v1.07) and the user manual/documentation can be found at its website (http://pngu.mgh.harvard.edu/~purcell/plink/). The PLINK --freq function uses a naïve approach to calculate allele frequencies; therefore, it is only suitable for large datasets of unrelated individuals. The advantages of PLINK are that it is computationally fast and easy to use. However, PLINK only works on genotype data with two alleles at each locus.
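The founders-only counting that Note 2 describes can be mimicked with a short script. The function below is an illustrative sketch, not PLINK's implementation; the rows are the first trio (family F4) from Table 1:

```python
from collections import Counter

def founder_allele_counts(ped_rows, snp_index):
    """Count alleles at one SNP over founders only (father ID and mother ID
    both '0'), mimicking the default founders-only behaviour of --freq."""
    counts = Counter()
    for row in ped_rows:
        fid, iid, fa, mo, sex, status, *alleles = row.split()
        if fa == "0" and mo == "0":  # founder
            counts[alleles[2 * snp_index]] += 1
            counts[alleles[2 * snp_index + 1]] += 1
    return counts

# First trio from the example PED file (family F4 in Table 1).
ped = [
    "F4 I500 I501 I502 1 1 A G G G T T G G A A",
    "F4 I501 0    0    1 1 A G G G T A G G A A",
    "F4 I502 0    0    2 1 G G G G T A G G A A",
]
c = founder_allele_counts(ped, 0)  # rs1
print(c["A"] / sum(c.values()))    # 0.25
```

Over this single trio the founder frequency of A at rs1 is 1/4; repeating the count over all ten trios in Table 1 gives the 0.350 reported by PLINK.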
3. The S.A.G.E. software package is a free toolset of compiled C++ programs that performs a wide variety of genetic analyses, such as data summary statistics, estimation of allele frequencies, heritability, familial correlations, and identity-by-descent (IBD) allele sharing, as well as linkage and association analyses (14). The S.A.G.E. programs can be run either from a command line or from a graphical user interface (GUI) on the Linux, Windows, Solaris, and Mac OS X platforms. The version of the S.A.G.E. package used in this chapter is v6.1.0 (24). The S.A.G.E. FREQ program has an option to use maximum likelihood to estimate allele frequencies from genotype data of related individuals. The computational time to calculate maximum likelihood frequency estimates increases significantly as the number of alleles at any locus increases. A limitation of this program is its inability to calculate maximum likelihood estimates on pedigrees with loops.

4. The value of the founder_weight parameter (specified in the freq sub-block of the parameter file) is the weight w, between 0 and 1, assigned to the founder allele frequencies; the weight 1 − w is assigned to the non-founder allele frequencies, and the two are combined into a weighted average (24). If founder_weight is not set, the founder and non-founder frequencies are combined as though they were independent. Setting founder_weight to 1 results in founder-only allele frequencies, while setting founder_weight to 0 gives non-founder-only allele frequencies. This approach provides consistent but statistically inefficient allele frequency estimates.

5. The MLE of the S.A.G.E. FREQ program assumes random ascertainment of the pedigrees with respect to the marker loci.
Hardy–Weinberg equilibrium is not a requirement of the method; there is an option to estimate one parameter, the marker-specific inbreeding coefficient, to partially allow for departure from Hardy–Weinberg equilibrium (in the case of diallelic markers this parameter fully allows for departure from Hardy–Weinberg equilibrium) (24). The likelihood for the data at each marker is maximized over the possible allele frequencies (and the inbreeding coefficient) to obtain maximum likelihood estimates using the Elston–Stewart algorithm (8, 11). This program supports non-codominant markers if the genotype-to-phenotype mapping is provided.
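The founder_weight averaging described in Note 4 amounts to a single weighted mean. A minimal sketch, with a hypothetical function name, assuming the founder and non-founder frequencies have already been estimated separately:

```python
def combined_allele_freq(founder_freq, nonfounder_freq, founder_weight):
    """Weighted average of separately estimated founder and non-founder
    allele frequencies: w * founder + (1 - w) * non-founder, mirroring
    the role of the S.A.G.E. FREQ founder_weight parameter."""
    if not 0.0 <= founder_weight <= 1.0:
        raise ValueError("founder_weight must be between 0 and 1")
    return (founder_weight * founder_freq
            + (1.0 - founder_weight) * nonfounder_freq)
```

Setting the weight to 1 or 0 reproduces the founder-only and non-founder-only estimates described in Note 4.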
Acknowledgment Some of the results of this chapter were obtained by using the program package S.A.G.E., which is supported by a US Public Health Service Resource Grant (RR03655) from the National Center for Research Resources.
References

1. Ott, J. (1992) Strategies for characterizing highly polymorphic markers in human gene mapping, Am J Hum Genet 51:283–290.
2. Lockwood, J. R., Roeder, K., and Devlin, B. (2001) A Bayesian hierarchical model for allele frequencies, Genet Epidemiol 20:17–33.
3. Mandal, D. M., Sorant, A. J., Atwood, L. D., Wilson, A. F., and Bailey-Wilson, J. E. (2006) Allele frequency misspecification: effect on power and Type I error of model-dependent linkage analysis of quantitative traits under random ascertainment, BMC Genet 7:21.
4. Hoggart, C. J., Shriver, M. D., Kittles, R. A., Clayton, D. G., and McKeigue, P. M. (2004) Design and analysis of admixture mapping studies, Am J Hum Genet 74:965–978.
5. Montana, G., and Pritchard, J. K. (2004) Statistical tests for admixture mapping with case–control and cases-only data, Am J Hum Genet 75:771–789.
6. Ceppellini, R., Siniscalco, M., and Smith, C. A. (1955) The estimation of gene frequencies in a random-mating population, Ann Hum Genet 20:97–115.
7. Smith, C. A. (1957) Counting methods in genetical statistics, Ann Hum Genet 21:254–276.
8. Boehnke, M. (1991) Allele frequency estimation from data on relatives, Am J Hum Genet 48:22–25.
9. McPeek, M. S., Wu, X., and Ober, C. (2004) Best linear unbiased allele-frequency estimation in complex pedigrees, Biometrics 60:359–367.
10. Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977) Maximum likelihood from incomplete data via the EM algorithm, J R Stat Soc Ser B 39:1–38.
11. Elston, R. C., and Stewart, J. (1971) A general model for the genetic analysis of pedigree data, Hum Hered 21:523–542.
12. Lange, K., and Boehnke, M. (1983) Extensions to pedigree analysis. V. Optimal calculation of Mendelian likelihoods, Hum Hered 33:291–301.
13. Lange, K., Weeks, D., and Boehnke, M. (1988) Programs for pedigree analysis: MENDEL, FISHER, and dGENE, Genet Epidemiol 5:471–472.
14. Elston, R. C., and Gray-McGuire, C. (2004) A review of the 'Statistical Analysis for Genetic Epidemiology' (S.A.G.E.) software package, Hum Genomics 1:456–459.
15. Broman, K. W. (2001) Estimation of allele frequencies with data on sibships, Genet Epidemiol 20:307–315.
16. Guo, C. Y., DeStefano, A. L., Lunetta, K. L., Dupuis, J., and Cupples, L. A. (2005) Expectation maximization algorithm based haplotype relative risk (EM-HRR): test of linkage disequilibrium using incomplete case-parents trios, Hum Hered 59:125–135.
17. Allen, A. S., and Satten, G. A. (2007) Inference on haplotype/disease association using parent-affected-child data: the projection conditional on parental haplotypes method, Genet Epidemiol 31:211–223.
18. Boyles, A. L., Scott, W. K., Martin, E. R., Schmidt, S., Li, Y. J., Ashley-Koch, A., Bass, M. P., Schmidt, M., Pericak-Vance, M. A., Speer, M. C., and Hauser, E. R. (2005) Linkage disequilibrium inflates type I error rates in multipoint linkage analysis when parental genotypes are missing, Hum Hered 59:220–227.
19. Bergemann, T. L., and Huang, Z. (2009) A new method to account for missing data in case-parent triad studies, Hum Hered 68:268–277.
20. Burrell, A. S., and Disotell, T. R. (2009) Panmixia postponed: ancestry-related assortative mating in contemporary human populations, Genome Biol 10:245.
21. Torche, F. (2010) Educational assortative mating and economic inequality: a comparative analysis of three Latin American countries, Demography 47:481–502.
22. Sebro, R., Hoffman, T. J., Lange, C., Rogus, J. J., and Risch, N. J. (2010) Testing for nonrandom mating: evidence for ancestry-related assortative mating in the Framingham heart study, Genet Epidemiol 34:674–679.
23. Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., Maller, J., Sklar, P., de Bakker, P. I., Daly, M. J., and Sham, P. C. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses, Am J Hum Genet 81:559–575.
24. S.A.G.E. 6.1 (2010) Statistical Analysis for Genetic Epidemiology, http://darwin.cwru.edu/sage/.
Chapter 6

Testing Departure from Hardy–Weinberg Proportions

Jian Wang and Sanjay Shete

Abstract

The Hardy–Weinberg principle, one of the most important principles in population genetics, was originally developed for the study of allele-frequency changes in a population over generations. It is now, however, widely used in studies of human diseases to detect inbreeding, population stratification, and genotyping errors. For assessing deviation from Hardy–Weinberg proportions in data, the most popular approaches are the asymptotic Pearson chi-square goodness-of-fit test and the exact test. The Pearson chi-square goodness-of-fit test is simple and straightforward, but it is very sensitive to small sample sizes and rare alleles; the exact test of Hardy–Weinberg proportions is preferable in these situations. The exact test can be performed through complete enumeration of heterozygote genotypes or on the basis of a Markov chain Monte Carlo procedure. In this chapter, we describe the Hardy–Weinberg principle, the commonly used tests of Hardy–Weinberg proportions, and their applications, and we demonstrate step-by-step, with numerical examples, how the chi-square test and the exact test of Hardy–Weinberg proportions can be performed using the popular software programs SAS, R, and PLINK, which have been widely used in genetic association studies. We also discuss recent approaches for testing Hardy–Weinberg proportions in case–control study designs that improve on the traditional approach of testing Hardy–Weinberg proportions in controls only. Finally, we note that deviation from Hardy–Weinberg proportions in affected individuals can provide evidence for an association between genetic variants and diseases.
Key words: Hardy–Weinberg proportion, Exact test, Pearson’s chi-square goodness-of-fit test, Genetic association study, Quality control, Genotyping error, R, SAS/Genetics, PLINK, Case–control genetic association study, Population stratification
1. Introduction

1.1. What Is the Hardy–Weinberg Proportion?
The Hardy–Weinberg principle, derived independently by Castle (1), Hardy (2), and Weinberg (3), is one of the most important principles in population genetics (4). The Hardy–Weinberg principle states that, in the absence of natural selection, mutation, migration, nonrandom mating, random genetic drift, gene flow, and meiotic drive, the genotypic frequencies and the allele frequencies of a population remain constant from one generation to the next, and furthermore, the genotypic frequencies can be expressed as a
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_6, # Springer Science+Business Media, LLC 2012
Table 1 Punnett square for inferring genotypic frequencies from allele frequencies under the assumption of the Hardy–Weinberg principle

             A (p)            a (1 − p)
A (p)        AA (p²)          Aa [p(1 − p)]
a (1 − p)    Aa [p(1 − p)]    aa [(1 − p)²]
simple function of allele frequencies (5). The Hardy–Weinberg principle is now more commonly used in human studies to detect inbreeding, population stratification, and genotyping errors. Consider a simple case of two alleles, A and a, at a single locus. If the allele frequency of A is denoted as p, then the allele frequency of a is (1 − p). If the Hardy–Weinberg principle holds, the expected frequencies of the three possible genotypes, AA homozygotes, Aa heterozygotes, and aa homozygotes, are the products of the allele frequencies: p², 2p(1 − p), and (1 − p)², respectively (Table 1). These expected genotypic frequencies are called the Hardy–Weinberg proportions. Whether the observed genotypic frequencies conform to the expected frequencies in a study sample is one of the first questions asked in population genetics. Departure from the Hardy–Weinberg proportions is tested by comparing the observed and expected genotypic frequencies. This test is commonly referred to as the Hardy–Weinberg equilibrium test, but it is more accurate to call it the Hardy–Weinberg proportion test, because Hardy–Weinberg equilibrium refers to a state of equilibrium in which allele frequencies and genotypic frequencies remain unchanged over generations, whereas the Hardy–Weinberg proportions are the genotypic frequencies achieved in a single generation. Therefore, we consider the terminology "departure from Hardy–Weinberg proportions" in a sample the most appropriate for genetic association studies and use it throughout this chapter.

1.2. Why Test for Deviation from the Hardy–Weinberg Proportion?
Deviations from Hardy–Weinberg proportions can result from evolutionary forces such as inbreeding, assortative mating, and small population size. Inbreeding is mating between close relatives, which can decrease heterozygosity across the genome in the population, that is, increase the number of homozygous genotypes carried by individuals (5). In a simple two-allele situation with inbreeding, the inbreeding coefficient F (6, 7) can be calculated as one minus the ratio of the observed number of heterozygotes to the number expected under the assumption of Hardy–Weinberg proportions. If the observed and expected numbers of heterozygotes in the population are the same, F equals zero. Therefore, in this case, the tests for the
deviation from Hardy–Weinberg proportions and for the inbreeding coefficient F = 0 are equivalent, and deviation from Hardy–Weinberg proportions can indicate inbreeding in the population, while a nonzero F statistic can indicate either an excess of heterozygotes (negative F) or an excess of homozygotes (positive F) compared to the expected Hardy–Weinberg proportions. Assortative mating, mating with a partner who has a similar (positive assortative mating) or dissimilar (negative assortative mating) phenotype, can also increase homozygosity for the genes associated with the phenotype. The relationship between the degree of assortative mating in parents, measured by a weighted covariance, and the degree of deviation from Hardy–Weinberg proportions in offspring has been presented in the studies of Price (8) and Shockley (9). Small population size can also increase homozygosity in the population (10). When a population is small, the allele frequencies can drift from generation to generation, a process known as genetic drift. Therefore, the Hardy–Weinberg principle can be violated owing to random changes in genotypic frequencies resulting from genetic drift. In addition to serving as an indicator of evolutionary forces, such as inbreeding, the test for deviation from Hardy–Weinberg proportions can also be applied in population genetic studies as an indicator of population stratification, admixture, or cryptic relatedness. It has been shown that unrecognized population structure and cryptic relatedness (unknown to the investigators) can inflate the false-positive rates in genetic association studies (11); therefore, Hardy–Weinberg proportions need to be carefully investigated before undertaking genetic association studies. Cryptic relatedness occurs when apparently unrelated individuals in a sample actually have a close kinship relationship.
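The inbreeding coefficient F described above can be computed directly from genotype counts. The following is an illustrative Python sketch (not from the chapter's software), with the expected heterozygote count taken under Hardy–Weinberg proportions at the allele frequency estimated from the sample itself:

```python
def inbreeding_coefficient(n_AA, n_Aa, n_aa):
    """F = 1 - observed heterozygotes / expected heterozygotes, where the
    expectation assumes Hardy-Weinberg proportions at the allele frequency
    estimated from the sample."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)          # estimated frequency of A
    expected_het = 2 * n * p * (1 - p)
    return 1.0 - n_Aa / expected_het

print(inbreeding_coefficient(250, 500, 250))   # exact HW proportions: F = 0
print(inbreeding_coefficient(300, 400, 300))   # positive F: homozygote excess
print(inbreeding_coefficient(200, 600, 200))   # negative F: heterozygote excess
```

The sign convention matches the text: a positive F indicates an excess of homozygotes and a negative F an excess of heterozygotes relative to Hardy–Weinberg proportions.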
The related individuals will increase the homozygosity in the sample, which can lead to deviations from Hardy–Weinberg proportions across the entire genome (12). If a population is formed from multiple subpopulations, deviation from Hardy–Weinberg proportions can be observed in the admixed population even if every subpopulation is itself in Hardy–Weinberg proportion (13–17). For example, consider two subpopulations, each with 1,000 individuals, and assume that the counts of the three genotypes AA, Aa, and aa are 160, 480, and 360, respectively, in the first subpopulation and 10, 180, and 810, respectively, in the second. The frequency of allele A is then 0.4 and 0.1 in the two subpopulations, respectively, and both subpopulations are in perfect Hardy–Weinberg proportion (P-value = 1.0 in both). However, when the two populations are combined, the observed counts of the three genotypes AA, Aa, and aa are 170, 660, and 1,170, respectively, and the frequency of allele A is now 0.25. The expected counts of the three genotypes are then 125, 750, and 1,125, respectively. The chi-square test of departure
from Hardy–Weinberg proportions gives a highly significant P-value of 8 × 10⁻⁸, which implies that the admixed population deviates from Hardy–Weinberg proportions. The combined population can deviate from Hardy–Weinberg proportions even when the allele frequencies in the subpopulations are not far apart. For example, if the genotypic counts in the first subpopulation are 190, 480, and 330, respectively, giving a minor allele frequency of 0.43, and the genotypic counts in the second subpopulation are 120, 420, and 460, respectively, giving a minor allele frequency of 0.33, both subpopulations are in Hardy–Weinberg proportion, with chi-square-based P-values of 0.5105 and 0.1124, respectively. However, the chi-square test of the combined population gives a significant P-value of 0.0442; therefore, the combined population is not in Hardy–Weinberg proportion. Most commonly, the Hardy–Weinberg proportion test is used as a quality control tool for identifying genotyping errors before analysis (5, 18–28). Many genotyping errors can cause deviation from Hardy–Weinberg proportions. For example, a mistaken allele call due to DNA contamination, or allelic dropout due to low quantity or quality of DNA (29), might increase the number of homozygotes and therefore cause deviations from Hardy–Weinberg proportions. Genotyping errors result in inflated type I and type II error rates in genetic association studies (30). The Hardy–Weinberg proportion test is considered an essential procedure in genetic case–control association studies (19, 21, 31–33). However, the Hardy–Weinberg proportion test has very low power for detecting genotyping errors, especially when the genotyping error rate is low and the minor allele frequency is not rare.
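The two-subpopulation admixture example above can be checked with a short script. This is an illustrative sketch of Pearson's chi-square test of Hardy–Weinberg proportions (treated formally in Subheading 1.3.1), not the chapter's SAS/R/PLINK code; it uses erfc to obtain the one-degree-of-freedom P-value without requiring scipy.

```python
import math

def hwe_chisq(n_AA, n_Aa, n_aa):
    """Pearson chi-square test of Hardy-Weinberg proportions for a
    diallelic locus; returns the statistic and its 1-df P-value
    (for 1 df, P = erfc(sqrt(x2 / 2)))."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)          # estimated frequency of A
    expected = (n * p * p, 2 * n * p * (1 - p), n * (1 - p) ** 2)
    observed = (n_AA, n_Aa, n_aa)
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return x2, math.erfc(math.sqrt(x2 / 2))

# Each subpopulation is in perfect Hardy-Weinberg proportion...
print(hwe_chisq(160, 480, 360))    # chi-square = 0, P-value = 1
print(hwe_chisq(10, 180, 810))     # chi-square = 0, P-value = 1
# ...but the combined population is not:
print(hwe_chisq(170, 660, 1170))   # chi-square = 28.8, P-value ~ 8e-8
```

The combined table reproduces the chi-square statistic of 28.8 and the P-value of roughly 8 × 10⁻⁸ quoted in the text.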
This is because when the genotyping error rates are small, the observed genotype counts will not differ significantly from the genotype counts expected under Hardy–Weinberg proportions, and therefore any test that attempts to detect such errors through proportion testing will have very little power (26, 34). For example, suppose that in a sample of 1,000 individuals without genotyping error the observed counts for the three genotypes AA, Aa, and aa are 85, 418, and 497, respectively (Fig. 1). Without genotyping error (panel A), the genetic variant is in Hardy–Weinberg proportion (P-value of the Hardy–Weinberg proportion exact test = 0.8790). For the purpose of demonstration, we assumed three genotyping error models. In the first error model (panel B) (27), both homozygous genotypes (AA and aa) are miscoded as the heterozygous genotype (Aa) with equal probability (i.e., AA → Aa and aa → Aa). In genotyping error models two and three (panels C and D), we considered miscoding only from rare homozygotes to heterozygotes and from heterozygotes to rare homozygotes, respectively (i.e., AA → Aa or Aa → AA). In Fig. 1, we demonstrate the variation in P-values of the Hardy–Weinberg proportion test with respect to increasing miscoding probabilities of 1, 2.5, and 5%. The P-values were obtained with the use of the exact test of
Hardy–Weinberg proportions (5).

Panel (error model)          Rate    AA    Aa    aa    P-value
(A) no genotyping error      —       85    418   497   0.8790
(B) AA → Aa and aa → Aa      1%      84    424   492   0.6488
                             2.5%    83    432   485   0.3655
                             5%      81    447   472   0.0862
(C) AA → Aa                  1%      84    419   497   0.8190
                             2.5%    83    420   497   0.7030
                             5%      81    422   497   0.5413
(D) Aa → AA                  1%      89    414   497   0.8203
                             2.5%    95    408   497   0.4067
                             5%      106   397   497   0.0520

Fig. 1. The Hardy–Weinberg proportion test has poor power to detect genotyping errors when the genotyping error rates are low. P-values were obtained from Hardy–Weinberg proportion exact tests. (A) Model without genotyping error; (B) genotyping error model with some homozygous individuals miscoded as heterozygotes; (C) genotyping error model with some rare homozygous individuals miscoded as heterozygotes; (D) genotyping error model with some heterozygous individuals miscoded as rare homozygotes.

Given a significance level of 5%, all the P-values of the Hardy–Weinberg proportion tests are nonsignificant, implying that the sample is in Hardy–Weinberg proportion in all scenarios of all the error models. Even when the overall probability of miscoding is 2.9% in the first error model, with observed genotypic counts of 81, 447, and 472 for AA, Aa, and aa, respectively, the Hardy–Weinberg proportion exact test gives a P-value of 0.0862, and thus the test cannot identify the genotyping errors.
Table 2 Observed and expected genotypic counts for a diallelic locus in a sample of n individuals

Genotype          AA        Aa                  aa
Observed counts   nAA       nAa                 naa
Expected counts   n·p̂A²     2n·p̂A(1 − p̂A)       n·(1 − p̂A)²

p̂A: the A allele frequency estimated from the data
nAA, nAa, and naa: the observed counts of the three genotypes
With recent advancements in genotyping techniques, genotyping error rates are quite small (e.g., on the order of 0.01% for Illumina human GWAS SNP arrays). Therefore, the Hardy–Weinberg proportion test will not be a powerful tool for detecting genotyping errors. However, current sequencing technologies have high error rates, which lead to a higher probability of errors in calling individual genotypes, particularly for rare and novel variants. The relationship between genotyping error and the Hardy–Weinberg proportion test has been studied in the literature (26, 34–39). In genetic association studies, genetic variants that deviate from Hardy–Weinberg proportions are usually attributed to genotyping errors and removed from further analysis. However, such conclusions should be reached with great caution, because a departure from Hardy–Weinberg proportions can also be evidence of an association between genetic variants and the disease of interest (12, 16, 17, 27, 28, 33, 40–46).

1.3. How to Test for Deviation from the Hardy–Weinberg Proportion?
To test for deviation from Hardy–Weinberg proportions in a population, the null hypothesis, H0, is that there is no difference between the observed genotypic counts and those expected under Hardy–Weinberg proportions; the alternative hypothesis, Ha, is that there is a difference between the observed and expected genotype counts. The commonly used approaches for testing Hardy–Weinberg proportions are the asymptotic Pearson chi-square goodness-of-fit test and the exact test.
1.3.1. Chi-Square Goodness-of-Fit Test
Pearson's chi-square goodness-of-fit test is the most commonly used approach for testing the departure from Hardy–Weinberg proportions (5, 16). If we consider a sample of n individuals and denote the observed genotypic counts of AA, Aa, and aa at a single locus as nAA, nAa, and naa, respectively (see Table 2), the test statistic of Pearson's chi-square goodness-of-fit test is given as (5):

χ² = Σ_genotypes (Observed counts − Expected counts)² / Expected counts
   = (nAA − n·p̂A²)² / (n·p̂A²) + [nAa − 2n·p̂A(1 − p̂A)]² / [2n·p̂A(1 − p̂A)] + [naa − n·(1 − p̂A)²]² / [n·(1 − p̂A)²],

where p̂A is the A allele frequency estimated from the sample data, p̂A = (2nAA + nAa)/(2n). The χ² test statistic asymptotically follows a chi-square distribution with one degree of freedom. For a multi-allele locus with m alleles, the degrees of freedom are m(m − 1)/2, i.e., m choose 2 (the number of independent parameters under the alternative hypothesis minus the number of independent parameters under the null hypothesis). In Fig. 1, panel A, where there are no genotyping errors, nAA = 85, nAa = 418, and naa = 497, giving a total sample size of n = 1,000. The estimated allele frequency of A is p̂A = 0.294. Therefore, the expected counts are 1,000·p̂A² = 86.44, 1,000·2·p̂A(1 − p̂A) = 415.13, and 1,000·(1 − p̂A)² = 498.44 for genotypes AA, Aa, and aa, respectively. Using the formula above, we obtain the following value of the χ² statistic:

χ² = (85 − 86.44)²/86.44 + (418 − 415.13)²/415.13 + (497 − 498.44)²/498.44 = 0.048.

Compared to the chi-square distribution with one degree of freedom, the P-value is 0.8268, which is not statistically significant at a significance level of 5%. Therefore, we do not reject the null hypothesis and can assume that this locus is in Hardy–Weinberg proportion. Also, since the genotypic counts are discrete, the Yates continuity correction of 0.5 can be used (5, 47):

χ² = Σ_genotypes (|Observed counts − Expected counts| − 0.5)² / Expected counts.
In this scenario, the P-value obtained using the asymptotic chi-square test with the Yates correction is 0.8733. It should be noted that the test statistic χ² follows a chi-square distribution only asymptotically, as the sample size becomes large. The asymptotic chi-square approximation can fail when the sample size is too small or when there are not enough genotype counts per cell. A locus with a rare minor allele can also impair the performance of Pearson's chi-square test, even if the total sample size is large, because the expected counts of some genotypes can still be low or close to zero owing to the rare allele frequency, and can therefore greatly inflate the test statistic. It has been suggested that the asymptotic Pearson chi-square test for Hardy–Weinberg proportions should not be used if the expected count of a particular genotype is less than some specified number, typically five (5, 16). In this situation, the exact test is preferable (5).

1.3.2. Hardy–Weinberg Exact Test
In the exact approach, the test is performed by computing, under the null hypothesis, the probabilities of all possible genotype combinations that have the same allele frequencies and total sample size as the
observed sample. Then, the sum of the probabilities of all outcomes that are no more probable than the observed outcome is the exact P-value, and the null hypothesis is rejected if this P-value is smaller than a prespecified significance level (5, 48). Consider the notation from Table 2 with n diploid individuals. For the genotypes at a single locus, the conditional probability of the observed genotypic counts nAA, nAa, and naa, given the observed allele frequencies, can be expressed in terms of the probability of the heterozygote count nAa conditional on the observed count of the A allele and the sample size, under the assumption of Hardy–Weinberg proportions. The conditional probability is given as follows (5, 49, 50):

Pr(nAa | n, nA) = [n! · nA! · na! · 2^nAa] / {[(nA − nAa)/2]! · nAa! · [n − (nA + nAa)/2]! · (2n)!},
where nA = 2nAA + nAa is the observed count of allele A, n is the sample size, and na = 2n − nA. One can evaluate the conditional probabilities for all possible genotype configurations consistent with the observed data and order the heterozygote counts nAa according to these probabilities. The sum of the conditional probabilities that are less than or equal to the conditional probability of the observed genotypes is then the P-value of the exact test (49, 51, 52). The exact test is more desirable than Pearson's chi-square test because it is valid for any sample size and minor allele frequency. The exact test provides an exact P-value for the test of Hardy–Weinberg proportions if one completely enumerates all possible genotype configurations, as described in the study by Louis and Dempster (53). However, the number of possible configurations with a given sample size and allele frequencies increases exponentially with the number of alleles (54). Therefore, in practice, it might not be feasible to perform complete enumeration for large samples involving multiple alleles. Efficient algorithms have been proposed to improve the efficiency of the full enumeration algorithm (17, 51, 55, 56). In a recent study (51), Engels presented a new algorithm for full enumeration using recursion and improved the efficiency by about two orders of magnitude. However, even with the recursion algorithm, complete enumeration is still not practical in some situations. Engels showed that the total number of possible genotype configurations is about 2 × 10⁵⁶ for data from the human Rh locus (51). In such situations, complete enumeration would certainly be computationally infeasible. Alternative approaches to full enumeration, based on permutation or resampling, have been extensively developed for testing Hardy–Weinberg proportions (17, 54, 57, 58). The conventional Monte Carlo test of Hardy–Weinberg proportions was first proposed by Guo and Thompson (54).
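For a diallelic locus, the complete-enumeration exact test described above is straightforward, because each possible table is determined by its heterozygote count. The following illustrative Python sketch (not the chapter's SAS/R/PLINK code) applies the conditional probability formula above, using log-factorials for numerical stability:

```python
import math

def hwe_exact_test(n_AA, n_Aa, n_aa):
    """Exact test of Hardy-Weinberg proportions for a diallelic locus by
    complete enumeration of the possible heterozygote counts, using
    log-factorials (lgamma) to keep the probabilities numerically stable."""
    n = n_AA + n_Aa + n_aa
    n_A = 2 * n_AA + n_Aa                    # observed count of allele A
    n_a = 2 * n - n_A

    def log_prob(het):                       # log Pr(n_Aa = het | n, n_A)
        hom_A = (n_A - het) // 2
        hom_a = (n_a - het) // 2
        return (math.lgamma(n + 1) + math.lgamma(n_A + 1)
                + math.lgamma(n_a + 1) + het * math.log(2)
                - math.lgamma(hom_A + 1) - math.lgamma(het + 1)
                - math.lgamma(hom_a + 1) - math.lgamma(2 * n + 1))

    # the heterozygote count must share the parity of n_A and fit the data
    hets = range(n_A % 2, min(n_A, n_a) + 1, 2)
    probs = {h: math.exp(log_prob(h)) for h in hets}
    p_obs = probs[n_Aa]
    # exact P-value: total probability of tables no more probable
    # than the observed one (small tolerance guards float ties)
    return sum(pr for pr in probs.values() if pr <= p_obs * (1 + 1e-12))

print(hwe_exact_test(85, 418, 497))   # the Fig. 1, panel A counts
```

For the panel A counts of Fig. 1, this should reproduce the exact-test P-value of 0.8790 reported in the text.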
For the Monte Carlo test, one can randomly generate a large number of independent possible genotypes based on the observed allele counts and sample size. Guo and Thompson (54) also adapted the Markov chain
algorithm to the Monte Carlo test, using the Markov chain to approximate the distribution of the test statistic. It has been shown that, when the sample size is relatively large, the Markov chain Monte Carlo (MCMC) algorithm is faster than the direct Monte Carlo algorithm (54). Other improvements to Monte Carlo- or MCMC-based tests of Hardy–Weinberg proportions have been proposed (57–59). The MCMC-based tests are usually referred to as "exact tests" in the literature and software. It should be noted, however, that these approaches are not strictly exact, because they do not enumerate the entire space of possible genotypes. Compared to complete enumeration, the MCMC-based tests perform favorably and offer an enormous improvement in computational time; they have therefore been applied extensively when complete enumeration is not feasible. Other approaches for testing Hardy–Weinberg proportions have also been proposed, including unconditional exact tests, likelihood ratio tests, a confidence-limit-based approach, and Bayesian approaches (5, 60–65). Some authors have considered the Hardy–Weinberg proportion test from a different point of view and proposed an equivalence test (66). In practice, however, the most popular tests of Hardy–Weinberg proportions remain Pearson's chi-square goodness-of-fit test and the exact tests (complete enumeration or MCMC-based). Although the derivations and examples in this chapter are based on two alleles, both tests can be extended to multiple alleles (5). Many commonly used programs and software packages (some available at no cost) can perform these two approaches. In Subheading 2, we demonstrate step-by-step, along with numerical examples, how these two tests can be performed using popular software programs that have been widely used in genetic association studies: SAS (67), R (68), and PLINK (69, 70). The chi-square test is used more widely than the exact test because it is simpler and more straightforward.
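The idea behind the Monte Carlo test described earlier in this subheading can be sketched by shuffling the observed alleles into random diploid genotypes, which yields a draw from the null distribution conditional on the allele counts. The following illustration ranks tables by the Pearson chi-square statistic, which differs in detail from Guo and Thompson's probability ordering; it is a hedged sketch of the idea, not their algorithm:

```python
import random

def hwe_monte_carlo(n_AA, n_Aa, n_aa, reps=2000, seed=1):
    """Monte Carlo test of Hardy-Weinberg proportions: repeatedly shuffle
    the observed alleles into random diploid genotypes (a draw from the
    null distribution conditional on the allele counts) and count how
    often a table at least as extreme as the observed one arises, with
    extremeness measured by the Pearson chi-square statistic."""
    rng = random.Random(seed)
    n = n_AA + n_Aa + n_aa
    n_A = 2 * n_AA + n_Aa
    p = n_A / (2 * n)
    expected = (n * p * p, 2 * n * p * (1 - p), n * (1 - p) ** 2)

    def chisq(het):                  # a table is determined by its het count
        counts = ((n_A - het) // 2, het, n - (n_A + het) // 2)
        return sum((o - e) ** 2 / e for o, e in zip(counts, expected))

    alleles = ['A'] * n_A + ['a'] * (2 * n - n_A)
    x2_obs, hits = chisq(n_Aa), 0
    for _ in range(reps):
        rng.shuffle(alleles)
        het = sum(alleles[2 * i] != alleles[2 * i + 1] for i in range(n))
        if chisq(het) >= x2_obs:
            hits += 1
    return hits / reps
```

A table far from its expected counts yields a small estimated P-value, while a table exactly at its expected counts yields a P-value of 1.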
However, as we show in Subheading 2, the chi-square test is very sensitive to small expected counts in a cell and therefore yields liberal P-values when the allele is rare or the sample size is small. We therefore recommend using the exact test when assessing Hardy–Weinberg proportions. In the next section, we discuss recent approaches for testing Hardy–Weinberg proportions in case–control study designs that improve on the traditional approach of testing Hardy–Weinberg proportions in controls only. We also note that deviation from Hardy–Weinberg proportions in affected individuals can provide evidence for an association between genetic variants and diseases. Some investigators have used the Hardy–Weinberg proportion test in cases to identify disease-susceptibility loci, while others have combined this information with standard genetic association tests. We discuss these recent approaches briefly in this chapter, but we do not provide practical guidelines for performing them. For readers who are interested, please
refer to the related papers for the details. In this chapter, we focus on demonstrating how to use the software programs SAS, R, and PLINK to perform the traditional tests of Hardy–Weinberg proportions (i.e., the chi-square and exact tests).

1.4. Hardy–Weinberg Proportion in Case–Control Genetic Association Studies
Case–control genetic association studies with unrelated individuals, such as genome-wide association studies, have become a popular and powerful approach for identifying genetic variants associated with complex diseases. The test for departure from Hardy–Weinberg proportions plays an important role in case–control genetic association studies. Its most common use is to assess Hardy–Weinberg proportions in control subjects as a quality control measure for identifying genotyping errors. The relationship between genotyping errors and the Hardy–Weinberg proportion test has been examined in previous studies (26, 34–39), which suggest that the Hardy–Weinberg proportion test in controls has very low power for detecting genotyping errors, especially when the genotyping error rate is low and the minor allele frequency is not rare. Nevertheless, the Hardy–Weinberg proportion test is still considered an essential, routine quality control tool in genetic association studies (19, 21, 31–33). In general, the Hardy–Weinberg proportion test assumes that the genotypes are sampled from the general population, and therefore the expected genotype counts in the test should be evaluated from the general population. In a case–control genetic association study, when the Hardy–Weinberg proportion test is performed in control subjects, the observed genotypic counts in controls are compared against the expected genotypic counts in controls. This strategy might work if the disease under consideration is rare, since the controls then represent the general population well. However, when the disease is common in the population, it can be problematic to use only controls when evaluating the expected genotypic counts from the general population, because cases account for a relatively large portion of the general population.
This might lead to artificial departures from Hardy–Weinberg proportions, especially for markers associated with the disease, and to discarding important SNPs that could potentially be causal SNPs associated with the disease. It has been shown that the type I errors can be inflated dramatically for Hardy–Weinberg proportion tests on disease-associated markers (23, 26). Moreover, if the genotyping is problematic, it is likely to affect both case and control subjects (71). Therefore, the Hardy–Weinberg proportions should be tested in the entire study population rather than only in control subjects. Recently, several new approaches have been proposed for assessing Hardy–Weinberg proportions using both cases and controls in case–control genetic association studies (23, 26, 71).
6 Testing Departure from Hardy–Weinberg Proportions

1.4.1. Hardy–Weinberg Proportion Test for Case–Control Study

Likelihood-Based Approach
Li and Li (23) proposed an approach for assessing Hardy–Weinberg proportions based on a general likelihood ratio framework and applied the approach to both case–control and family-based study designs (see Note 1). They considered a di-allelic locus with the three genotypes aa, Aa, and AA coded as g = 0, 1, 2. The genotype frequencies were denoted P0, P1, and P2, respectively, where P0 = 1 − P1 − P2. The disease status D was defined as a binary variable, with 1 representing cases and 0 representing controls. Let the penetrance of the disease conditional on genotype be fg = Pr(D = 1 | g); then the prevalence of the disease can be written as K = f0P0 + f1P1 + f2P2. Therefore, given n unrelated cases and m unrelated controls, the likelihood of the sample is given as:

L = ∏_{g=0}^{2} (fg / K)^{ng} [(1 − fg) / (1 − K)]^{mg} Pg^{ng + mg},
where ng and mg are the numbers of individuals with genotype g in cases and controls, respectively. Under the null hypothesis of Hardy–Weinberg proportions, P0 = (1 − p)², P1 = 2p(1 − p), and P2 = p², where p is the frequency of allele A. The likelihood ratio test compares the likelihood maximized under the alternative hypothesis (departure from Hardy–Weinberg proportions) with the likelihood maximized under the null hypothesis (Hardy–Weinberg proportions hold). The likelihood ratio statistic follows an asymptotic chi-square distribution with one degree of freedom under the null hypothesis. The likelihood ratio test of population Hardy–Weinberg equilibrium proposed by Yu et al. (71) is similar to the one proposed by Li and Li. In that approach, models are fit by minimizing a deviation function that compares the observed and expected numbers of genotypes in cases and controls.

Mixture Hardy–Weinberg Proportion Exact Test
Wang and Shete (26) proposed a mixture Hardy–Weinberg proportion (mHWP) exact test, in which a mixture sample that mimics the general population is created and employed. The individuals in the mixture sample are randomly selected from the original cases and controls, and the number of cases in the mixture sample is proportional to the prevalence of the disease. Consider a case–control study with n0 controls and n1 cases. Let nm be the sample size of the mixture sample, and let K be the estimated prevalence of disease. One could choose nm = min(⌊n1/K⌋, ⌊n0/(1 − K)⌋) to achieve the largest possible mixture sample size, and then randomly select ⌊nm·K⌋ individuals from the cases and ⌊nm·(1 − K)⌋ individuals from the controls. The exact P-value of the Hardy–Weinberg proportion test can be evaluated using the mixture sample. The procedure is repeated L times to allow for variability in the mixture sampling, and L exact P-values are obtained (see Note 2).
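The mixture-sample sizes just described are easy to compute directly. A minimal Python sketch, where the study sizes and prevalence are hypothetical numbers chosen only for illustration:

```python
import math

def mixture_sample_sizes(n_cases, n_controls, prevalence):
    """Sizes for one mHWP mixture sample: the largest sample whose
    case fraction matches the estimated disease prevalence K."""
    K = prevalence
    n_m = min(math.floor(n_cases / K), math.floor(n_controls / (1 - K)))
    cases_in_mix = math.floor(n_m * K)
    controls_in_mix = math.floor(n_m * (1 - K))
    return n_m, cases_in_mix, controls_in_mix

# Hypothetical study: 200 cases, 800 controls, estimated prevalence 0.1;
# here the controls are the limiting group.
print(mixture_sample_sizes(200, 800, 0.1))
```

Repeating the random selection L times, as described above, then yields the L exact P-values.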
J. Wang and S. Shete
The empirical distribution-based nonparametric density can be constructed from the L mixture-sample exact P-values. The maximum likelihood estimate from this empirical distribution is taken as the final P-value for mHWP in the general population. When the marker is not associated with the disease, both the likelihood-based and mHWP approaches perform similarly to the traditional test using controls only. When the marker is associated with the disease, the traditional test using only controls inflates the type I errors dramatically as the minor allele frequency and the prevalence of disease increase (23, 26). However, the likelihood-based approaches and the mHWP exact test still control type I errors well for disease-associated markers and, therefore, significantly outperform the traditional approach. If genotyping errors are absent, the mHWP exact test provides a conservative approach for assessing Hardy–Weinberg proportions relative to the likelihood-based approaches. Therefore, the mHWP exact test is more likely to retain causal SNPs for further analyses after the Hardy–Weinberg testing. When the genotyping error rate is high, genotyping errors can generate extreme deviations from Hardy–Weinberg proportions, and all approaches can therefore have high power to detect them. However, when the genotyping error rates are low, the likelihood-based approaches and the mHWP exact test are not very sensitive for detecting genotyping errors. Therefore, one may also consider the strategy of keeping all SNPs for the association study and performing the Hardy–Weinberg test only among the significant markers.

1.4.2. Hardy–Weinberg Proportion Test for Genetic Association Studies
Researchers have also suggested that deviation from Hardy–Weinberg proportions among case subjects can provide additional evidence for an association between genetic markers and the disease of interest (28, 33, 40–42). There is increasing interest in using deviation from Hardy–Weinberg proportions in patients as a tool for identifying disease-susceptibility loci in genetic association studies. Feder et al. (40) proposed investigating the deviation from Hardy–Weinberg proportions among cases to fine-map disease-susceptibility loci for an autosomal recessive disorder. In their paper, the degree of deviation from Hardy–Weinberg proportions was measured using the F parameter (known as the inbreeding coefficient), which compares the observed homozygosity with the homozygosity expected under Hardy–Weinberg proportions. Markers with higher F values are considered to be closer to the disease-susceptibility locus. Their method has been reviewed and extended by subsequent researchers (13, 41–43). Wittke-Thompson et al. (28) examined the directions of the differences between the observed and expected genotypic frequencies in cases and controls, respectively, and developed a chi-square test for determining whether the observed data in a case–control study are consistent with a genetic disease model.
Other researchers have proposed combining the information on departure from Hardy–Weinberg proportions with commonly used association tests (i.e., logistic regression, the allelic association test, and the Cochran-Armitage trend test) to create new statistical tests of genetic association (33, 44, 46). The set-association method proposed by Hoh et al. (46) employs the information on departure from Hardy–Weinberg proportions in both cases and controls. They first use the departure from Hardy–Weinberg proportions in controls as a trimming tool to eliminate markers with unusually high statistic values, which might indicate genotyping errors. Then, the departure from Hardy–Weinberg proportions in cases is combined with the allelic association test, through the product of the two test statistics, to form the new test statistic. The significance of the test is obtained by permutation of the disease status. Song and Elston (44) addressed the weaknesses of the approach proposed by Hoh et al. and developed a related approach, called the weighted average statistic, for fine-mapping disease-susceptibility loci. This approach combines the Cochran-Armitage trend test statistic and the Hardy–Weinberg disequilibrium (HWD) trend test statistic, the latter proposed in that study to examine the difference between the HWD coefficients in cases and controls. A linear combination of the two test statistics with appropriate weights forms the new test statistic. The weighted average statistic for identifying disease-susceptibility loci has better power than the adjusted Cochran-Armitage trend test, the HWD trend test, and the product of these two tests, for all the genetic disease models investigated in that study. Both approaches discussed above use the Hardy–Weinberg proportion information from both cases and controls, and no covariates are considered.
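For intuition, the HWD coefficient that such tests contrast can be taken, for a diallelic marker, as the excess homozygosity D = P_AA − p_A² (a common parameterization). The sketch below, with hypothetical genotype counts, computes only this coefficient in cases and controls, not the full weighted statistic:

```python
def hwd_coefficient(n_AA, n_Aa, n_aa):
    """Hardy-Weinberg disequilibrium coefficient D = P_AA - p_A^2
    (a common parameterization of excess homozygosity) for one
    diallelic marker."""
    n = n_AA + n_Aa + n_aa
    p_AA = n_AA / n
    p_A = (2 * n_AA + n_Aa) / (2 * n)
    return p_AA - p_A ** 2

# Hypothetical counts: homozygote excess among cases but not controls
d_cases = hwd_coefficient(60, 100, 140)
d_controls = hwd_coefficient(40, 160, 100)
print(d_cases, d_controls, d_cases - d_controls)
```

A marker exactly in Hardy–Weinberg proportions (e.g., counts 250/500/250) gives D = 0, so a nonzero case-control difference in D is the signal such trend tests exploit.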
Alternatively, Wang and Shete (33) developed a new test statistic for genetic association studies that incorporates evidence of deviation from Hardy–Weinberg proportions in cases only into regression-based models. With the use of regression models, this approach can easily include covariates in the analysis. The mean-based and median-based tail-strength measures (72) were proposed to combine P-values from two different hypothesis tests: the likelihood ratio test for association and the Hardy–Weinberg proportion exact test in cases. The significance of the new test can be assessed through analytic formulas as well as through a resampling procedure. This approach showed a significant increase in power for genetic association studies and good control of type I errors under the additive genetic model. Wang and Shete (73) further pointed out that the analytic formulas for evaluating P-values might cause inflated type I errors for recessive and dominant genetic models, owing to the assumptions underlying the development of the asymptotic null distribution; therefore, they recommended using the resampling-based approach to assess the significance of the new statistics. The computer program "CSig" performs the proposed association test through analytic formulas and is available at http://www.epigenetic.org/software.php.
2. Methods

To demonstrate the difference in P-values obtained using Pearson's chi-square goodness-of-fit test and the exact test for testing the departure from Hardy–Weinberg proportions, we considered diallelic SNPs at 18 different genetic loci in a sample of 1,000 individuals. The three genotypic counts for all the SNPs are listed in Table 3. The minor allele frequencies of the SNPs vary from ~1% (rare variants) to ~50% (common variants). We utilized three software programs (SAS, R, and PLINK) to evaluate the P-values of Hardy–Weinberg proportion tests for each SNP based on asymptotic and exact approaches.

2.1. SAS/Genetics Software
The tests for the departure from Hardy–Weinberg proportions can be performed by using SAS/Genetics software (67). Although the Pearson's chi-square and Fisher's exact tests might be conducted using standard statistical procedures in SAS (e.g., PROC FREQ), the ALLELE procedure in the SAS/Genetics software is specially developed for analyzing genetic data, and it provides statistical tests for Hardy–Weinberg proportions based on these two commonly used approaches. To examine the departure from Hardy–Weinberg proportions of the markers listed in Table 3, we first create the input data file in SAS format as below, based on the genotypic counts of the 18 markers.

......
aaaaaaAaAAaaAaaaAaAaAaAaAaAAAaaaAaAA
aaAaaaaaAAaaAaaaaaAaAaAaAaAaAaAaAAAA
AAaaAAAaaaaaAaaaAaAaAaaaAaAaAAAAAaAa
......

The input data include 36 columns, with the first two columns representing the pair of alleles for the first SNP, the third and fourth columns representing the pair of alleles for the second SNP, and so on. There are 1,000 rows of data, each representing one individual. The following code reads the input data and conducts Hardy–Weinberg proportion tests for the 18 markers:
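A minimal sketch of such a program is given below; the data set name, input file name, and variable names a1-a36 are illustrative, and the PERMS= and SEED= options (discussed later in this section) request the permutation-based exact test in addition to the default chi-square test:

```sas
/* Sketch only: the names of the data set, input file, and variables
   are illustrative. Reads 36 one-character allele columns (two per
   SNP) for 1,000 individuals, then tests Hardy-Weinberg proportions. */
data markers;
   infile 'genotypes.dat';
   input (a1-a36) ($1.);
run;

proc allele data=markers perms=100000 seed=12345;
   var a1-a36;
run;
```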
Table 3
Genotypic counts, estimated allele frequencies, and P-values obtained based on Pearson's chi-square goodness-of-fit test and exact test of Hardy–Weinberg proportions by using three different software programs: SAS, R, and PLINK

                       SAS                          R                            PLINK
AA   Aa   aa   p̂A     p_chisq*      p_exact†       p_chisq*      p_exact‡       p_chisq*  p_exact‡
3    15   982  0.0105  1.425501E-18  1.400000E-04   1.425501E-18  9.822870E-05   1.43E-18  9.82E-05
1    19   980  0.0105  6.767192E-03  1.006500E-01   6.767192E-03  1.006557E-01   0.006767  0.1007
0    20   980  0.0100  7.494065E-01  1.000000E+00   7.494065E-01  1.000000E+00   0.7494    1
20   50   930  0.0450  6.149249E-40  0.000000E+00   6.149249E-40  1.214089E-17   6.15E-40  1.21E-17
3    97   900  0.0515  8.218825E-01  7.437800E-01   8.218825E-01  7.425484E-01   0.8219    0.7425
3    95   902  0.0505  7.667648E-01  7.340400E-01   7.667648E-01  7.358119E-01   0.7668    0.7358
5    185  810  0.0975  1.053538E-01  1.468900E-01   1.053538E-01  1.471012E-01   0.1054    0.1471
15   155  830  0.0925  1.520538E-02  2.115000E-02   1.520538E-02  2.191775E-02   0.01521   0.02192
10   180  810  0.1000  1.000000E+00  1.000000E+00   1.000000E+00  1.000000E+00   1         1
30   360  610  0.2100  7.195676E-03  7.350000E-03   7.195676E-03  7.435283E-03   0.007196  0.007435
50   250  700  0.1750  2.198160E-05  1.000000E-04   2.198160E-05  6.528187E-05   2.20E-05  6.53E-05
40   320  640  0.2000  1.000000E+00  1.000000E+00   1.000000E+00  1.000000E+00   1         1
60   500  440  0.3100  9.450207E-08  0.000000E+00   9.450207E-08  5.864779E-08   9.45E-08  5.87E-08
100  420  480  0.3100  5.642284E-01  5.584000E-01   5.642284E-01  5.549472E-01   0.5642    0.5549
90   420  490  0.3000  1.000000E+00  1.000000E+00   1.000000E+00  1.000000E+00   1         1
300  400  300  0.5000  2.539629E-10  0.000000E+00   2.539629E-10  2.299897E-10   2.54E-10  2.30E-10
200  500  300  0.4500  7.494065E-01  8.003600E-01   7.494065E-01  7.982999E-01   0.7494    0.7983
250  500  250  0.5000  1.000000E+00  1.000000E+00   1.000000E+00  1.000000E+00   1         1

Note: the P-values obtained from R and PLINK are in their default format
p̂A: the estimated A allele frequency from the data
*P-values obtained using Pearson's chi-square goodness-of-fit test without continuity correction
†P-values obtained using the permutation-based Hardy–Weinberg proportion exact test
‡P-values obtained using the exact Hardy–Weinberg proportion test
Alternatively, the input data can be read as columns of genotypes instead of columns of alleles. In this format, there is only one column for each marker. One can use different characters or strings as delimiters (e.g., "/" or "-"), or even no delimiter, to separate the two alleles for each marker (see below).
......
a/a a/a a/a A/a A/A a/a A/a a/a A/a A/a A/a A/a A/a A/A A/a a/a A/a A/A
a/a A/a a/a a/a A/A a/a A/a a/a a/a A/a A/a A/a A/a A/a A/a A/a A/A A/A
A/A a/a A/A A/a a/a a/a A/a a/a A/a A/a A/a a/a A/a A/a A/A A/A A/a A/a
......
When this alternative format is used, there will be only 18 variables, and the options GENOCOL and DELIMITER= should be included. The DELIMITER= option can be omitted in the following example, since "/" is the default.
The ALLELE Procedure

Marker Summary

                                                                            Test for HWE
Locus  Number of    Number of  PIC     Heterozygosity  Allelic    Chi-      DF  Pr > ChiSq    Prob
       individuals  alleles                            diversity  square                      exact
M1     1,000        2          0.0206  0.0150          0.0208     77.3589   1   1.425501E-18  1.400000E-04
M2     1,000        2          0.0206  0.0190          0.0208     7.3337    1   6.767192E-03  1.006500E-01
M3     1,000        2          0.0196  0.0200          0.0198     0.1020    1   7.494065E-01  1.000000E+00
M4     1,000        2          0.0823  0.0500          0.0859     174.9468  1   6.149249E-40  0.000000E+00
The SAS code above generates a marker summary table (part of the table is shown). It provides the chi-square test statistic without continuity correction, the degrees of freedom, P-values based on the chi-square test, and P-values based on the exact test obtained using a permutation approach. In addition, it provides population genetic measures such as the polymorphism information content (PIC), heterozygosity, and allelic diversity. By default, the ALLELE procedure performs a chi-square goodness-of-fit test for Hardy–Weinberg proportions and reports the asymptotic P-values. When the PERMS=number option is included in the procedure, the Monte Carlo permutation test of Hardy–Weinberg proportions based on "number" permutations is performed, and the P-value thus
obtained is provided. One can also use the EXACT=number option instead of PERMS=number to perform the same permutation-based exact test. The exact test conducted here is based on the approach proposed by Guo and Thompson (54). In this permutation procedure, the alleles are randomly permuted to form new genotypes. For each permutation, the conditional probability of the genotypic counts given the allele frequencies and sample size is evaluated. The P-value is obtained as the proportion of permutations in which the conditional probabilities are less than or equal to the observed probability. It is recommended that 10,000 or more permutations be used for accuracy. Increasing the number of permutations will provide more accurate P-values, but the execution time will be longer (see Note 3). The SEED= option is used to define the random seed for the random number generator for permuting the alleles. It should be a nonnegative integer. If this option is omitted, the computer clock is used (see Note 3). The exact P-values reported in Table 3 were based on 100,000 permutations. The ALLELE procedure can also deal with markers with multiple alleles. If the option NOFREQ is omitted, two more tables, of allele frequencies and genotype frequencies, will be generated. All the analyses performed in this section were conducted using SAS/Genetics 9.2 (67).

2.2. R Software
R is a free software environment for statistical computing and graphics (68). Several functions or packages have been developed for the purpose of testing Hardy–Weinberg proportions. We first focus on the functions available in the population genetics package "genetics" (74) and then introduce several other packages developed specifically for the Hardy–Weinberg proportion exact test. The "genetics" package has two functions, HWE.chisq and HWE.exact, for testing the departure from Hardy–Weinberg proportions based on the chi-square and exact tests, respectively. The syntaxes for the two tests are as follows:

HWE.chisq(x)
HWE.exact(x)

where x is the genotype data in object class "genotype," which can be obtained by using the genotype function also available in this package. By default, HWE.chisq provides the chi-square test statistic without continuity correction and a simulated P-value based on 10,000 iterations. If one wants the asymptotic P-value based on the continuity-corrected chi-square statistic, then one has to use the option simulate.p.value=FALSE. The function HWE.exact provides exact Hardy–Weinberg P-values. The algorithm for the exact test used by this function is based on the approach proposed by
Emigh (49). This function only works for genotypes with two alleles. The following code performs the asymptotic and exact Hardy–Weinberg proportion tests within the "genetics" package.

1. Install and load the "genetics" package:

> install.packages("genetics")
> library("genetics")

2. Read the data for the 18 genetic markers (see Note 4), and create a genotype object using the function genotype. Since the genotype function only works for a single marker, we use a marker in Table 3 as an example, with genotypic counts for AA, Aa, and aa of 5, 185, and 810, respectively.

> allmarker<-read.table('markers')
> onemarker<-allmarker[,7]
> genodata<-genotype(onemarker, sep="/")

3. Conduct the asymptotic chi-square test for Hardy–Weinberg proportions; the P-values can be obtained in three ways (see Note 5):

> # chi-square test statistic without continuity correction and a
> # simulation-based P-value (the default settings of the function)
> t_chisq<-HWE.chisq(genodata)
> # asymptotic chi-square test statistic with continuity correction
> # and the associated P-value
> t_chisq<-HWE.chisq(genodata, simulate.p.value=FALSE)
> # asymptotic chi-square test and associated P-value without
> # continuity correction
> t_chisq<-HWE.chisq(genodata, simulate.p.value=FALSE, correct=FALSE)

4. Conduct the exact test for Hardy–Weinberg proportions:

> t_exact<-HWE.exact(genodata)

For this marker, the simulation-based chi-square P-value obtained is 0.1107. Alternatively, if one chooses not to use simulation to compute P-values (simulate.p.value=FALSE), the function computes the test statistic using Yates' continuity correction and uses the asymptotic chi-square distribution to evaluate the P-value. The P-value obtained in this way is 0.1499. We can further choose not to use Yates' continuity correction (correct=FALSE), which results in a P-value of 0.1054.
The exact P-value obtained from the HWE.exact function is 0.1471 in this example.
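The uncorrected chi-square value can be checked by hand. For one degree of freedom, the upper tail of the chi-square distribution equals erfc(sqrt(x/2)), so a short Python sketch needs only the standard library (the counts are those of the example marker above):

```python
import math

def hwe_chisq(n_AA, n_Aa, n_aa):
    """Pearson chi-square test of Hardy-Weinberg proportions for a
    diallelic marker, without continuity correction."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)             # frequency of allele A
    expected = [n * p * p, 2 * n * p * (1 - p), n * (1 - p) ** 2]
    observed = [n_AA, n_Aa, n_aa]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(stat / 2))    # chi-square tail, 1 df
    return stat, p_value

stat, p_value = hwe_chisq(5, 185, 810)
print(round(p_value, 4))  # agrees with the uncorrected value of 0.1054 above
```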
Several other R functions are available for the exact test of Hardy–Weinberg proportions, such as the function HWExact in the "GWASExactHW" package (75) and the function hwexact in the "hwde" package (76). Compared to the HWE.exact function in the "genetics" package, these two functions deal directly with the genotypic counts. Both functions were adapted from code by Wigginton et al. (17). The approach proposed by Wigginton et al. (17) uses the recurrence relationships from the previous study of Guo and Thompson (54) and performs the exact test for SNPs in a computationally efficient manner for diallelic loci. Again considering genotypic counts for AA, Aa, and aa of 5, 185, and 810, respectively, the following code can be used to obtain the exact P-values based on those counts.

1. Use the function HWExact in the "GWASExactHW" package:

> genocounts<-data.frame(nAA=5,nAa=185,naa=810)
> p_exact<-HWExact(genocounts)

2. Use the function hwexact in the "hwde" package:

> p_exact<-hwexact(5,185,810)

Both functions provide, in this example, an exact P-value of 0.1471, which is the same as that obtained using the HWE.exact function in the "genetics" package. The function hwexact is simpler to call than the function HWExact. However, by using the data.frame function, the function HWExact can deal with a large number of markers simultaneously (i.e., nAA, nAa, and naa can be defined as arrays) and, therefore, is more favorable for large-scale genome-wide association studies. It should be noted that all the R packages/functions discussed so far for the exact test of Hardy–Weinberg proportions only work for markers with two alleles. The Hardy–Weinberg proportion exact test for markers with more than two alleles can be conducted using the function hwe.hardy in the genetic analysis package "gap" (77). This function was adapted from the code of Guo and Thompson (54). Note, however, that this function cannot be applied to markers with only two alleles.
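The enumeration behind these exact-test functions can be sketched compactly. The Python version below is an independent illustration of the algorithm, not the packages' code: it enumerates every heterozygote count compatible with the observed allele counts and sums the probabilities of the genotype tables that are no more probable than the observed one:

```python
from math import exp, lgamma, log

def hwe_exact(n_AA, n_Aa, n_aa):
    """Exact test of Hardy-Weinberg proportions for a diallelic
    marker, in the spirit of refs. 17 and 54."""
    n = n_AA + n_Aa + n_aa
    n_A = 2 * n_AA + n_Aa                 # number of A alleles
    n_a = 2 * n - n_A

    def log_weight(het):
        # log Pr(het heterozygotes | n, n_A), up to an additive constant
        hom_A = (n_A - het) // 2
        hom_a = (n_a - het) // 2
        return (het * log(2) - lgamma(hom_A + 1)
                - lgamma(het + 1) - lgamma(hom_a + 1))

    hets = range(n_Aa % 2, min(n_A, n_a) + 1, 2)  # parity must match n_A
    lw = {h: log_weight(h) for h in hets}
    top = max(lw.values())                        # guard against underflow
    w = {h: exp(v - top) for h, v in lw.items()}
    total = sum(w.values())
    p = sum(v for v in w.values() if v <= w[n_Aa] * (1 + 1e-12)) / total
    return min(p, 1.0)

print(round(hwe_exact(5, 185, 810), 4))  # agrees with the 0.1471 reported above
```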
All the analyses performed in this section were conducted using R version 2.10.1 (68).

2.3. PLINK Software
PLINK (69, 70) is a free software program providing a computationally efficient way of performing statistical analyses for large-scale genome-wide association studies. As a basic summary statistic, the Hardy–Weinberg proportion test can be conducted using a single command line with the option --hardy in PLINK, using the pedigree and map files (see Note 6). Both pedigree (PED) and map (MAP) files are required as the standard input files for PLINK. We first briefly introduce the formats of the PED and MAP files and then describe how the Hardy–Weinberg test is conducted.
PLINK has detailed guidelines for the formats of the PED and MAP files. These files are in the standard "linkage format," and all the formats and coding must conform to these guidelines. The PED file stores the data for all the variables of all the individuals. The columns refer to variables, and the rows refer to individuals. The first six columns are mandatory: Family ID, Individual ID, Father ID, Mother ID, Sex, and Phenotype. The combination of Family ID and Individual ID must be unique to identify an individual. Sex is coded as 1 = male, 2 = female, and other = unknown. The Phenotype can be a quantitative trait or a case–control status. If the Phenotype is a case–control status, it is coded as 1 = controls, 2 = cases, and 0/-9 = missing. Starting from column seven, genotype data are defined with two columns representing one marker. The genotypes can be coded using any numbers (e.g., 1, 2, 3, and 4) or characters (e.g., A, B, C, and D) except 0, as 0 represents a missing genotype by default. The MAP file stores additional information for the markers, with each row describing one marker. By default, the MAP file has exactly four columns: Chromosome (1–22, X, Y, or 0 if unplaced), rs# or SNP identifier, Genetic distance (morgans), and Base-pair position (bp units). For the detailed guidelines, the reader can refer to the online PLINK manual (http://pngu.mgh.harvard.edu/~purcell/plink/). Once the example.ped and example.map files are created, the Hardy–Weinberg proportion tests for all markers can be performed using the following command line:

plink --ped example.ped --map example.map --hardy

By default, this command conducts the exact test of Hardy–Weinberg proportions, described and implemented by Wigginton et al. (17).
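For example, the first lines of a PED file and its matching MAP file might look as follows (the family and individual identifiers and base-pair positions are invented; the lines beginning with # are annotations, not part of the files):

```text
# example.ped (first two of 1,000 rows; columns 7+ hold allele pairs)
FAM1 IND1 0 0 1 1 a a a a
FAM1 IND2 0 0 2 1 a A a a
# example.map (one row per marker)
1 rs123456 0 100000
1 rs234567 0 200000
```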
To perform the asymptotic chi-square test, one can use the option --hardy2 instead:

plink --ped example.ped --map example.map --hardy2

Two files are created by this command: (1) the plink.log file captures all the information that would appear on the console, including the commands used, as well as information about the markers included, the individuals used, cases and controls, males and females, missing genotypes and individuals, etc.; (2) the plink.hwe file provides the Hardy–Weinberg proportion test P-values. The first line in the plink.hwe file contains the headers. The last column gives the Hardy–Weinberg P-value, asymptotic or exact. For each marker, there are three rows, corresponding to Hardy–Weinberg tests in three different samples (i.e., all data [ALL], cases only [AFF], and controls only [UNAFF]) (see Note 7).
Part of the resulting plink.hwe file based on the option --hardy is shown below.

CHR  SNP       TEST   A1  A2  GENO       O(HET)  E(HET)   P
1    rs123456  ALL    2   1   3/15/982   0.015   0.02078  9.82E-05
1    rs123456  AFF    2   1   0/0/0      nan     nan      NA
1    rs123456  UNAFF  2   1   3/15/982   0.015   0.02078  9.82E-05
1    rs234567  ALL    2   1   1/19/980   0.019   0.02078  0.1007
1    rs234567  AFF    2   1   0/0/0      nan     nan      NA
1    rs234567  UNAFF  2   1   1/19/980   0.019   0.02078  0.1007
Since we assumed all the individuals were controls, no Hardy–Weinberg test P-value was calculated for the cases (label AFF) for any marker, and the results obtained using all data (ALL) and controls only (UNAFF) are exactly the same. For example, for the first SNP, with genotype counts 3/15/982, the exact P-value is 9.82 × 10⁻⁵, which is the same as the one obtained in R using the different exact test functions (see Table 3). The asymptotic P-value based on the chi-square test can also be calculated, although PLINK does not provide the values of the chi-square test statistics. Both the asymptotic and exact P-values from PLINK are listed in Table 3. The Hardy–Weinberg proportion tests available in PLINK can only deal with markers with two alleles. All the analyses performed in this section were conducted using PLINK version 1.07 (69, 70). In Table 3, R and PLINK provide similar P-values for both the asymptotic and exact tests of Hardy–Weinberg proportions. By default, SAS provides results with only four decimals; for the purpose of comparison, we used scientific notation for the P-values obtained from SAS. The exact P-values from SAS are slightly different from those obtained from R or PLINK because the SAS P-values are permutation based. Even when the allele is common, asymptotic and exact P-values can differ. For example, when the genotypic counts are 15, 155, and 830 for AA, Aa, and aa, the minor allele frequency is 0.0925. The asymptotic and exact P-values are 0.0152 and 0.0219, respectively. In this situation, the asymptotic P-value is liberal. Furthermore, when the allele is rare, the asymptotic chi-square test is very sensitive and provides more liberal P-values than the exact test. For example, when the genotypic counts are 1, 19, and 980 for AA, Aa, and aa, the minor allele frequency is 0.0105.
The asymptotic P-value is 0.0068, which is statistically significant at the 5% level and implies that this marker deviates from Hardy–Weinberg proportions, but the exact P-value is 0.1007, which is not significant at the 5% level and suggests that this marker is in Hardy–Weinberg proportions. With 18 markers in the data, the computation time for the analysis is SAS > R > PLINK when the exact test is performed (the asymptotic
chi-square test is always quick). Using 100,000 permutations, SAS needs approximately 4–5 min to complete the analysis, the R package "genetics" needs about 2–3 s, and PLINK needs only about 0.03 s. Therefore, the time it takes to perform the exact Hardy–Weinberg proportion test for SNPs in a candidate region or at the genome-wide level is less than a day. Given the liberal nature of the asymptotic chi-square test, we recommend that the exact test be performed routinely.

2.4. Other Software
Many other software programs are also useful for testing the departure from Hardy–Weinberg proportions:

SNP-HWE (http://www.sph.umich.edu/csg/abecasis/Exact/) (17)
HWtest (http://www.mathworks.com/matlabcentral/fileexchange/14425-hwtest) (78)
Haploview (http://www.broadinstitute.org/mpg/haploview/) (17, 79)
TFPGA (http://www.marksgeneticsoftware.net/)
3. Notes

1. Li and Leal (80) studied the departure from Hardy–Weinberg equilibrium in a family-based study (i.e., parental and unaffected sibling genotype data). They found that the pattern of departure from Hardy–Weinberg equilibrium differs among groups of individuals, such as the parent group, the affected proband group, and the unaffected sibling group.

2. The number of mixture samples L can be decided by conducting simulations. For example, given a data set, one can use different values of L to evaluate the empirical distribution of P-values and the maximum likelihood estimate. If the empirical distribution and the value of the maximum likelihood estimate approach stability when L is greater than some number, one can use this number or a greater number in the analysis.

3. We tried two different numbers of permutations, PERMS=10,000 and PERMS=100,000. In both cases, the exact P-values show some variation if the ALLELE procedure is run multiple times without the SEED= option. If multiple tests are conducted using a fixed random seed, exactly the same results can be replicated. The variations of the exact P-values are larger with PERMS=10,000 than with PERMS=100,000. These variations might not have a significant impact on the conclusions from the exact test (whether the Hardy–Weinberg proportion test is significant or nonsignificant), but we still recommend more permutations for accurate results when feasible.
4. The input data are in a format of columns of genotype pairs. One can use different delimiters to separate the two alleles, such as "A/a" and "A-a," or use no delimiter between the two alleles, such as "Aa." When creating genotypes using the genotype function, the delimiter needs to be specified in the function with the option sep="" (by default, sep="/"). One can also use 0, 1, or 2 to represent the genotypes aa, Aa, or AA, and then use the function as.genotype.allele.count to convert them to the genotype pairs a/a, A/a, and A/A. If only the genotypic counts are available, one can also create the genotype data and then apply the genotype function:

> genocounts<-c(5, 185, 810)
> data<-c(rep("A/A",genocounts[1]), rep("a/A",genocounts[2]), rep("a/a",genocounts[3]))
> genodata<-genotype(data)

5. When using the HWE.chisq function based on the asymptotic chi-square distribution (simulate.p.value=FALSE), a warning message might appear regarding the validity of the chi-square approximation:

Warning messages:
1: In chisq.test(tab, ...) :
  Chi-squared approximation may be incorrect.

This is probably due to a small expected count in one cell. To check the expected counts under the null, one can use results$expected, where "results" is the variable saving all the outcomes.
Instead, the option --nonfounders can be used to indicate that all individuals will be included to perform an approximate test. References 1. Castle WE (1903) The laws of Galton and Mendel and some laws governing race improvement by selection. Proc Amer Acad Arts Sci 35:233–242 2. Hardy GH (1908) Mendelian proportions in a mixed population. Science 28:49–50
3. Weinberg W (1908) On the demonstration of heredity in man. In: Boyer SH (ed) Papers on human genetics. Prentice Hall, Englewood Cliffs, NJ 4. Crow JF (1988) Eighty years ago: the beginnings of population genetics. Genetics 119:473–476
J. Wang and S. Shete
5. Weir BS (1996) Genetic data analysis II: methods for discrete population genetic data. Sinauer Associates, Sunderland, MA
6. Cockerham CC (1969) Variance of gene frequencies. Evolution 23:72–84
7. Wright S (1951) The genetical structure of populations. Ann Eugen 15:323–354
8. Price GR (1971) Extension of the Hardy-Weinberg law to assortative mating. Ann Hum Genet 34:455–458
9. Shockley W (1973) Deviations from Hardy-Weinberg frequencies caused by assortative mating in hybrid populations. Proc Natl Acad Sci USA 70:732–736
10. Templeton A (2006) Population genetics and microevolutionary theory. John Wiley & Sons, Hoboken, NJ
11. Voight BF, Pritchard JK (2005) Confounding from cryptic relatedness in case–control association studies. PLoS Genet 1:e32
12. Weinberg CR, Morris RW (2003) Invited commentary: testing for Hardy-Weinberg disequilibrium using a genome single-nucleotide polymorphism scan based on cases only. Am J Epidemiol 158:401–403
13. Deng HW, Chen WM, Recker RR (2000) QTL fine mapping by measuring and testing for Hardy-Weinberg and linkage disequilibrium at a series of linked marker loci in extreme samples of populations. Am J Hum Genet 66:1027–1045
14. Deng HW, Chen WM, Recker RR (2001) Population admixture: detection by Hardy-Weinberg test and its quantitative effects on linkage-disequilibrium methods for localizing genes underlying complex traits. Genetics 157:885–897
15. Grover VK, Cole DE, Hamilton DC (2010) Attributing Hardy-Weinberg disequilibrium to population stratification and genetic association in case–control studies. Ann Hum Genet 74:77–87
16. Ryckman K, Williams SM (2008) Calculation and use of the Hardy-Weinberg model in association studies. Curr Protoc Hum Genet Chapter 1:Unit 1.18
17. Wigginton JE, Cutler DJ, Abecasis GR (2005) A note on exact tests of Hardy-Weinberg equilibrium. Am J Hum Genet 76:887–893
18. Attia J, Thakkinstian A, McElduff P et al (2010) Detecting genotyping error using measures of degree of Hardy-Weinberg disequilibrium. Stat Appl Genet Mol Biol 9(1):Article 5
19. Gomes I, Collins A, Lonjou C et al (1999) Hardy-Weinberg quality control. Ann Hum Genet 63:535–538
20. Graffelman J, Camarena JM (2008) Graphical tests for Hardy-Weinberg equilibrium based on the ternary plot. Hum Hered 65:77–84
21. Hosking L, Lumsden S, Lewis K et al (2004) Detection of genotyping errors by Hardy-Weinberg equilibrium testing. Eur J Hum Genet 12:395–399
22. Laurie CC, Doheny KF, Mirel DB et al (2010) Quality control and quality assurance in genotypic data for genome-wide association studies. Genet Epidemiol 34(6):591–602
23. Li M, Li C (2008) Assessing departure from Hardy-Weinberg equilibrium in the presence of disease association. Genet Epidemiol 32:589–599
24. Schaid DJ, Batzler AJ, Jenkins GD et al (2006) Exact tests of Hardy-Weinberg equilibrium and homogeneity of disequilibrium across strata. Am J Hum Genet 79:1071–1080
25. Tapper W, Collins A, Gibson J et al (2005) A map of the human genome in linkage disequilibrium units. Proc Natl Acad Sci USA 102:11835–11839
26. Wang J, Shete S (2010) Using both cases and controls for testing Hardy-Weinberg proportions in a genetic association study. Hum Hered 69:212–218
27. Weale ME (2010) Quality control for genome-wide association studies. Methods Mol Biol 628:341–372
28. Wittke-Thompson JK, Pluzhnikov A, Cox NJ (2005) Rational inferences about departures from Hardy-Weinberg equilibrium. Am J Hum Genet 76:967–986
29. Pompanon F, Bonin A, Bellemain E et al (2005) Genotyping errors: causes, consequences and solutions. Nat Rev Genet 6:847–859
30. Akey JM, Zhang K, Xiong M et al (2001) The effect that genotyping errors have on the robustness of common linkage-disequilibrium measures. Am J Hum Genet 68:1447–1456
31. Weiss ST, Silverman EK, Palmer LJ (2001) Case–control association studies in pharmacogenetics. Pharmacogenomics J 1:157–158
32. Xu J, Turner A, Little J et al (2002) Positive results in association studies are associated with departure from Hardy-Weinberg equilibrium: hint for genotyping error? Hum Genet 111:573–574
33. Wang J, Shete S (2008) A test for genetic association that incorporates information about deviation from Hardy-Weinberg proportions in cases. Am J Hum Genet 83:53–63
34. Cox DG, Kraft P (2006) Quantification of the power of Hardy-Weinberg equilibrium testing to detect genotyping error. Hum Hered 61:10–14
35. Fardo DW, Becker KD, Bertram L et al (2009) Recovering unused information in genome-wide association studies: the benefit of analyzing SNPs out of Hardy-Weinberg equilibrium. Eur J Hum Genet. doi:10.1038/ejhg.2009.85
36. Leal SM (2005) Detection of genotyping errors and pseudo-SNPs via deviations from Hardy-Weinberg equilibrium. Genet Epidemiol 29:204–214
37. Teo YY, Fry AE, Clark TG et al (2007) On the usage of HWE for identifying genotyping errors. Ann Hum Genet 71:701–703
38. Zou GY, Donner A (2006) The merits of testing Hardy-Weinberg equilibrium in the analysis of unmatched case–control data: a cautionary note. Ann Hum Genet 70:923–933
39. Salanti G, Amountza G, Ntzani EE et al (2005) Hardy-Weinberg equilibrium in genetic association studies: an empirical evaluation of reporting, deviations, and power. Eur J Hum Genet 13:840–848
40. Feder JN, Gnirke A, Thomas W et al (1996) A novel MHC class I-like gene is mutated in patients with hereditary haemochromatosis. Nat Genet 13:399–408
41. Jiang R, Dong J, Wang D et al (2001) Fine-scale mapping using Hardy-Weinberg disequilibrium. Ann Hum Genet 65:207–219
42. Nielsen DM, Ehm MG, Weir BS (1998) Detecting marker-disease association by testing for Hardy-Weinberg disequilibrium at a marker locus. Am J Hum Genet 63:1531–1540
43. Lee WC (2003) Searching for disease-susceptibility loci by testing for Hardy-Weinberg disequilibrium in a gene bank of affected individuals. Am J Epidemiol 158:397–400
44. Song K, Elston RC (2006) A powerful method of combining measures of association and Hardy-Weinberg disequilibrium for fine-mapping in case–control studies. Stat Med 25:105–126
45. Won S, Elston RC (2008) The power of independent types of genetic information to detect association in a case–control study design. Genet Epidemiol 32:731–756
46. Hoh J, Wille A, Ott J (2001) Trimming, weighting, and grouping SNPs in human case–control association studies. Genome Res 11:2115–2119
47. Yates F (1934) Contingency tables involving small numbers and the χ2 test. J Roy Stat Soc Suppl 1:217–235
48. Fisher RA (1935) The logic of inductive inference. J Roy Stat Soc 98:39–54
49. Emigh TH (1980) A comparison of tests for Hardy-Weinberg equilibrium. Biometrics 36:627–642
50. Haldane JBS (1954) An exact test for randomness of mating. J Genet 52:631–635
51. Engels WR (2009) Exact tests for Hardy-Weinberg proportions. Genetics 183:1431–1441
52. Levene H (1949) On a matching problem arising in genetics. Ann Math Stat 20:91–94
53. Louis EJ, Dempster ER (1987) An exact test for Hardy-Weinberg and multiple alleles. Biometrics 43:805–811
54. Guo SW, Thompson EA (1992) Performing the exact test of Hardy-Weinberg proportion for multiple alleles. Biometrics 48:361–372
55. Aoki S (2003) Network algorithm for the exact test of Hardy-Weinberg proportion for multiple alleles. Biometrical J 45:471–490
56. Maurer HP, Melchinger AE, Frisch M (2007) An incomplete enumeration algorithm for an exact test of Hardy-Weinberg proportions with multiple alleles. Theor Appl Genet 115:393–398
57. Huber M, Chen Y, Dinwoodie I et al (2006) Monte Carlo algorithms for Hardy-Weinberg proportions. Biometrics 62:49–53
58. Yuan A, Bonney GE (2003) Exact test of Hardy-Weinberg equilibrium by Markov chain Monte Carlo. Math Med Biol 20:327–340
59. Lazzeroni LC, Lange K (1997) Markov chains for Monte Carlo tests of genetic equilibrium in multidimensional contingency tables. Ann Stat 25:138–168
60. Hernandez JL, Weir BS (1989) A disequilibrium coefficient approach to Hardy-Weinberg testing. Biometrics 45:53–70
61. Maiste PJ, Weir BS (2004) Optimal testing strategies for large, sparse multinomial models. Comput Stat Data An 46:605–620
62. Montoya-Delgado LE, Irony TZ, de B Pereira CA et al (2001) An unconditional exact test for the Hardy-Weinberg equilibrium law: sample-space ordering using the Bayes factor. Genetics 158:875–883
63. Shoemaker J, Painter I, Weir BS (1998) A Bayesian characterization of Hardy-Weinberg disequilibrium. Genetics 149:2079–2088
64. Wakefield J (2010) Bayesian methods for examining Hardy-Weinberg equilibrium. Biometrics 66:257–265
65. Wellek S, Goddard KA, Ziegler A (2010) A confidence-limit-based approach to the assessment of Hardy-Weinberg equilibrium. Biom J 52:253–270
66. Goddard KA, Ziegler A, Wellek S (2009) Adapting the logical basis of tests for Hardy-Weinberg equilibrium to the real needs of association studies in human and medical genetics. Genet Epidemiol 33:569–580
67. SAS Institute Inc. (2008) SAS/Genetics 9.2 user's guide. SAS Institute Inc., Cary, NC
68. R Development Core Team (2009) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria
69. Purcell S, Neale B, Todd-Brown K et al (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81:559–575
70. Purcell S (2009) PLINK (v1.07)
71. Yu C, Zhang S, Zhou C et al (2009) A likelihood ratio test of population Hardy-Weinberg equilibrium for case–control studies. Genet Epidemiol 33:275–280
72. Taylor J, Tibshirani R (2006) A tail strength measure for assessing the overall univariate significance in a dataset. Biostatistics 7:167–181
73. Wang J, Shete S (2009) Is the tail-strength measure more powerful in tests of genetic association? Response. Am J Hum Genet 84:298–300
74. Warnes G, Gorjanc G, Leisch F et al (2008) genetics: population genetics. R package
75. Painter I (2010) GWASExactHW: exact Hardy-Weinburg testing for genome wide association studies. R package
76. Maindonald JH, Johnson R (2009) hwde: models and tests for departure from Hardy-Weinberg equilibrium and independence between loci. R package
77. Zhao JH (2007) gap: genetic analysis package. J Stat Softw 23(8):1–18
78. Cardillo G (2007) HWtest: a routine to test if a locus is in Hardy-Weinberg equilibrium (exact test)
79. Barrett JC, Fry B, Maller J et al (2005) Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 21:263–265
80. Li B, Leal SM (2009) Deviations from Hardy-Weinberg equilibrium in parental and unaffected sibling genotype data. Hum Hered 67:104–115
Chapter 7

Estimating Disequilibrium Coefficients

Maren Vens and Andreas Ziegler

Abstract

Gametic phase disequilibrium (GPD) is the nonrandom association of alleles within gametes. Linkage disequilibrium (LD) describes the special case of deviation from independence between alleles at two linked genetic loci. Estimation of allelic LD requires knowledge of haplotypes. Genotype-based LD measures dispense with the haplotype estimation step and avoid bias in LD estimation. In this chapter, the most important measures of allelic and genotypic LD are introduced, and the use of software packages for LD estimation is illustrated.

Key words: Allelic linkage disequilibrium, Coefficient of determination, Composite linkage disequilibrium, Gametic phase disequilibrium, Genotypic linkage disequilibrium, Haploview, Haplotype, Hardy–Weinberg equilibrium, Lewontin's D′, Linkage disequilibrium, PLINK
1. Introduction

The third Mendelian law, the law of independent assortment, states that two genetic factors are transmitted independently of each other. Gametic phase disequilibrium (GPD) is the nonrandom association of alleles within gametes (1). The term linkage disequilibrium (LD) describes the special case of deviation from independence between alleles at two linked genetic loci. Thus, allelic LD, also termed gametic LD, refers to the association between two alleles at two loci that lie in close vicinity on the same chromosome. There are several causes of this deviation from independence. One possibility is a lack of independent segregation or recombination; however, it is also possible that other evolutionary forces have driven LD. LD is used as a tool for the genetic mapping of trait or disease loci in humans and model organisms (2). When gametic frequencies, i.e., haplotype frequencies, are not directly observable, they are inferred from genotype frequencies
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_7, # Springer Science+Business Media, LLC 2012
M. Vens and A. Ziegler
Table 1
Genotype frequencies at markers M1 and M2

                        Marker M2
Marker M1    bb      Bb      BB      Total
aa           n00     n01     n02     n0+
Aa           n10     n11     n12     n1+
AA           n20     n21     n22     n2+
Total        n+0     n+1     n+2     n
under the assumption of random union of gametes (3). For illustration, consider two diallelic markers, i.e., single nucleotide polymorphisms (SNPs), M1 and M2, with alleles A and a and B and b, respectively. Both SNPs have been genotyped in a sample of n subjects (see Table 1). All n22 probands are homozygous at both SNPs, with alleles A and B, respectively; thus, it is clear that all n22 probands carry two AB haplotypes. All individuals who are homozygous at both SNPs can be assigned similarly. In persons who are homozygous at one SNP and heterozygous at the other, haplotypes can also be determined unambiguously. However, haplotypes in doubly heterozygous subjects are uncertain: they carry either haplotypes AB and ab or haplotypes Ab and aB, and the frequencies of these haplotypes need to be estimated. To avoid this phasing step and the possible bias in estimating LD that may be introduced if the assumption of Hardy–Weinberg equilibrium (see Chapter 6) is not met, the definition of allelic LD can be extended to nongametic LD. This nongametic LD is based on genotypes rather than on alleles, and it is therefore termed genotypic LD (4). First, we describe allelic LD measures; for simplicity, we assume that phase is known in all individuals. Next, we consider some genotype-based LD measures. Subheading 2 gives a short introduction to two software tools.

1.1. Allelic Linkage Disequilibrium Measures
Table 2 displays the observed haplotype and marker allele frequencies at two diallelic markers. The basis of LD is the difference between the observed and the (under the assumption of independence) expected proportion of haplotypes bearing the A allele at marker M1 and the B allele at marker M2 (5):

$$D = D_{AB} = p_{22} - p_A\, p_B. \qquad (1)$$
Table 2
Summary of estimated haplotype and marker allele frequencies at two diallelic marker loci

                       Marker M2
Marker M1    b              B        Total
a            p11            p12      qA = 1 − pA
A            p21            p22      pA
Total        qB = 1 − pB    pB       1
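A quick numeric illustration of Eq. 1, using hypothetical haplotype frequencies (not data from the chapter): with the four cells of Table 2 set to

```latex
p_{11} = 0.4, \quad p_{12} = 0.1, \quad p_{21} = 0.1, \quad p_{22} = 0.4,
```

the margins give

```latex
p_A = p_{21} + p_{22} = 0.5, \qquad p_B = p_{12} + p_{22} = 0.5,
\qquad D_{AB} = p_{22} - p_A p_B = 0.4 - 0.25 = 0.15 .
```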
D_AB can also be represented as D_AB = p11 p22 − p12 p21, which is the covariance between the two diallelic marker loci (see Note 1). D_AB is the basic component of many LD measures: it serves as the numerator of a number of the LD measures described below, and it can be estimated directly from the observed haplotype frequencies. One of the most frequently used measures of LD (6) is the square of the standardized measure

$$\Delta = \frac{D_{AB}}{\sqrt{p_A q_A p_B q_B}} = \frac{p_{11} p_{22} - p_{12} p_{21}}{\sqrt{p_A q_A p_B q_B}}, \qquad (2)$$

i.e., Δ² (see Note 2). Another frequently used measure, introduced by Lewontin (7), is

$$D' = \begin{cases} \dfrac{p_{11} p_{22} - p_{12} p_{21}}{\min(p_A q_B,\, q_A p_B)} & \text{if } D_{AB} > 0, \\[2ex] \dfrac{p_{11} p_{22} - p_{12} p_{21}}{\min(p_A p_B,\, q_A q_B)} & \text{if } D_{AB} < 0. \end{cases} \qquad (3)$$

The denominator of Lewontin's D′ is the absolute maximum D = D_max that could be achieved given the table margins. See Note 3 for characteristics of Lewontin's D′ and Note 4 for a short comparison of Δ and D′. Levin and Bartell (8) introduced

$$\delta = \frac{p_{11} p_{22} - p_{12} p_{21}}{p_B\, p_{11}} = \frac{D_{AB}}{p_B\, p_{11}}, \qquad (4)$$

which is based on Levin's population attributable risk (see Note 5)

$$\delta = \frac{p_A (f - 1)}{1 + p_A (f - 1)},$$

where f = (p22/pA)/(p12/qA) denotes the relative risk (9). The frequency difference, or difference in proportion,

$$d = \frac{p_{22}}{p_B} - \frac{p_{21}}{q_B} = \frac{p_{11} p_{22} - p_{12} p_{21}}{p_B q_B} = \frac{D_{AB}}{p_B q_B} \qquad (5)$$

is a further epidemiologic measure used for estimating LD between two diallelic marker loci (10). Other epidemiologic measures are also used in population genetics: for example, the odds ratio λ and Yule's (11) Q are used for estimating the LD between two diallelic markers. The odds ratio is calculated as

$$\lambda = \frac{p_{11} p_{22}}{p_{12} p_{21}} \qquad (6)$$

and Yule's Q as

$$Q = \frac{\lambda - 1}{\lambda + 1} = \frac{p_{11} p_{22} - p_{12} p_{21}}{p_{11} p_{22} + p_{12} p_{21}} = \frac{D_{AB}}{p_{11} p_{22} + p_{12} p_{21}}. \qquad (7)$$
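As an illustrative sketch (plain Python, not part of the chapter's software), Eqs. 1–7 can be evaluated directly from a table of haplotype frequencies; the frequencies below are hypothetical.

```python
# Allelic LD measures for two diallelic markers, computed from the four
# haplotype frequencies p11 (ab), p12 (aB), p21 (Ab), p22 (AB).
# Hypothetical example values -- any frequencies summing to 1 work.
from math import sqrt

p11, p12, p21, p22 = 0.4, 0.1, 0.1, 0.4

pA = p21 + p22          # frequency of allele A at marker M1
pB = p12 + p22          # frequency of allele B at marker M2
qA, qB = 1 - pA, 1 - pB

D = p22 - pA * pB                          # Eq. 1, basic LD coefficient
delta_std = D / sqrt(pA * qA * pB * qB)    # Eq. 2, standardized measure (Delta)
r2 = delta_std ** 2                        # Delta squared
# Eq. 3, Lewontin's D': normalize by the maximal |D| given the margins
Dmax = min(pA * qB, qA * pB) if D > 0 else min(pA * pB, qA * qB)
D_prime = D / Dmax
delta_pop = D / (pB * p11)                 # Eq. 4, Levin and Bartell's delta
d_freq = D / (pB * qB)                     # Eq. 5, frequency difference
odds = (p11 * p22) / (p12 * p21)           # Eq. 6, odds ratio lambda
Q = (odds - 1) / (odds + 1)                # Eq. 7, Yule's Q

print(D, delta_std, D_prime, Q)
```

Note that every measure shares the numerator D_AB and differs only in the normalization, which is the point made in Note 1.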
Q is sometimes referred to as γ (see Note 6). A comparison of Δ, D′, δ, d, and Q with regard to LD mapping is given in ref. 12.

1.2. Genotypic Linkage Disequilibrium Measures
Often SNPs are genotyped but the haplotypes are unknown. To avoid the haplotype estimation step when determining the LD, genotypic LD measures can be used instead. The reader should note that the concept of allelic LD has a simple interpretation via the formation of haplotypes during meiosis. In contrast, genotypic LD measures explicitly ignore the haplotypic configurations formed during meiosis and instead consider genotype configurations in diploid individuals. We first consider the composite LD approach of Weir and Cockerham (3, 13, 14). Let pij = P(X = i, Y = j), i, j = 0, 1, 2, be the probability that a subject carries i A alleles at marker M1 and j B alleles at marker M2 (see Tables 3 and 4). Analogously, let nij
Table 3
Joint genotype probabilities at two diallelic markers, with X and Y denoting the number of A alleles at M1 and the number of B alleles at M2, respectively

                  Y
X        0      1      2      Total
0        p00    p01    p02    p0+
1        p10    p11    p12    p1+
2        p20    p21    p22    p2+
Total    p+0    p+1    p+2    1
Table 4
Joint genotype distribution obtained from the haplotype frequencies shown in Table 2 under Hardy–Weinberg equilibrium, with X and Y denoting the number of A alleles at marker locus M1 and the number of B alleles at marker locus M2, respectively

             Y
X    0             1                        2
0    p11²          2·p11·p12                p12²
1    2·p11·p21     2(p11·p22 + p12·p21)     2·p12·p22
2    p21²          2·p21·p22                p22²
denote the corresponding observed frequencies of the genotype combinations. As explained in the previous section, the two phases of doubly heterozygous subjects cannot be distinguished when two different SNPs are considered, so the allelic LD cannot be inferred directly. Weir and Cockerham (3, 13) therefore proposed the composite LD

$$\Delta_{XY} = 2 P(AB/AB) + P(AB/Ab) + P(AB/aB) + \tfrac{1}{2}\left[P(AB/ab) + P(Ab/aB)\right] - 2\, p_A p_B, \qquad (8)$$

where P(h1/h2) denotes the probability of the genotype formed by the haplotype pair h1/h2; this measure is also termed digenic LD (see Note 7). An alternative to the composite LD is to estimate LD from diploid data using the correlation between the numbers of alleles at each of the two loci. Under the assumption of random mating, this correlation has the same expectation Δ as the correlation between haplotypes (4). For estimation, the Pearson product–moment correlation coefficient is used. For details on calculating the variance and the confidence interval of this estimator, see Note 8.

1.3. Further Approaches
Several extensions of allelic LD measures to multiallelic markers or multiple loci have been proposed. An overview of pair-wise, multilocus, haplotype-specific, and model-based LD measures is provided in ref. 15. Nielsen et al. (16) compared two- and three-locus LD measures with regard to the power to detect marker–phenotype associations; only diallelic loci were involved in their analysis. Nothnagel et al. (17) introduced the normalized entropy difference as a multilocus measure of LD. This measure allows for arbitrary numbers of loci, describes the LD with regard to the locus sequence, and can be interpreted as an extension of Δ² to the multilocus case.
In the same direction, Zhang et al. introduced a multilocus LD measure based on mutual information, also known as relative entropy or Kullback–Leibler distance (18). Their measure accounts for LD heterogeneity across the region under investigation and handles distant regions where long-range LD patterns may exist. Another measure, based on generalized mutual information, was introduced in ref. 19; it quantifies the distance between the observed haplotype distribution and the distribution expected under linkage equilibrium and is approximately equal to Δ² if only two loci are considered. Based on this LD measure and an entropy measure, a class of stepwise tag-SNP selection algorithms is proposed. A mathematically precise formulation of LD between multiple loci as deviation from probabilistic independence is presented in ref. 20: Gorelick and Laubichler provide explicit formulae for all higher-order LD terms, where higher-order LD terms are recursively decomposed into lower-order terms. The LD measures D and Δ can be expressed as covariances and correlations across diploid genotypes provided that mating is random (4); this result holds approximately even if mating is nonrandom (21). A multiorder Markov chain model was developed to quantify the complexity of LD patterns among SNPs (22), and Feng and Wang (23) derived mathematical relationships between LD measures to interpret the Markov chain model parameters in terms of conventional LD measures.
2. Methods

Different implementations are available for estimating LD measures from a genotyped sample. A summary of web addresses for the estimation and testing of LD is provided in ref. 15. Two commonly used packages for the estimation and visualization of LD are briefly described below. To illustrate the use of these packages, we used the phase 2 HapMap data as provided by PLINK (http://pngu.mgh.harvard.edu/~purcell/plink/res.shtml), and we focused on the 232 quality-controlled SNPs on chromosome 9 between 21.9 and 22.2 Mb, a prominent region in cardiovascular genetics (24).

2.1. PLINK
PLINK is a whole-genome association analysis toolset (25). There are three options in PLINK for calculating LD: --ld, --r, and --r2. The command --ld is followed by two SNP identifiers. For a single pair of SNPs, Δ², D′, the estimated haplotype frequencies,
and those expected under linkage equilibrium are estimated. In addition, it is indicated which haplotypes are in phase. The LD statistics are based on haplotype frequencies, which are estimated by the EM algorithm. In pedigrees, only founders are used to calculate the statistics. We have chosen the two SNPs rs1679014 and rs6475619 for illustration. The command

   plink --file hapmapChr9p21QC --ld rs1679014 rs6475619

yields the output

   LD information for SNP pair [rs1679014 rs6475619]

        R-sq = 0.082       D' = 1.000

        Haplotype   Frequency   Expectation under LE
        ---------   ---------   --------------------
        TT          0.000       0.042
        CT          0.440       0.398
        TA          0.095       0.053
        CA          0.466       0.507

        In phase alleles are TA/CT
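As a consistency check (an illustration in plain Python, not part of PLINK), the reported R-sq and D' can be reproduced from the printed haplotype frequencies:

```python
# Recompute r^2 and D' from the haplotype frequencies printed by PLINK
# for rs1679014 (alleles T/C) and rs6475619 (alleles T/A).
freq = {"TT": 0.000, "CT": 0.440, "TA": 0.095, "CA": 0.466}

p_T1 = freq["TT"] + freq["TA"]   # frequency of T at rs1679014
p_T2 = freq["TT"] + freq["CT"]   # frequency of T at rs6475619
q_T1, q_T2 = 1 - p_T1, 1 - p_T2

D = freq["TT"] - p_T1 * p_T2                 # LD coefficient for (T, T)
r2 = D**2 / (p_T1 * q_T1 * p_T2 * q_T2)      # matches R-sq up to rounding
Dmax = min(p_T1 * p_T2, q_T1 * q_T2) if D < 0 else min(p_T1 * q_T2, q_T1 * p_T2)
D_prime = abs(D) / Dmax                      # matches D'

print(round(r2, 3), round(D_prime, 3))
```

Because the TT haplotype is unobserved, a single cell of the haplotype table is zero, which forces D' = 1 even though r^2 is small (see Note 3).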
It is possible to obtain correlations based on genotype allele counts using the PLINK options --r and --r2, which calculate the genotypic LD measures Δ and Δ², respectively. Thus, no phasing is required for estimation. In pedigrees, the measures are estimated in founders only. To estimate Δ, we use

   plink --file hapmapChr9p21QC --r --ld-window 10 --ld-window-kb 1000 --out hapmapChr9p21QC_LDr

to produce the output file hapmapChr9p21QC_LDr.ld, which includes the entry

   CHR_A  BP_A      SNP_A      CHR_B  BP_B      SNP_B      r
   9      22197037  rs1679014  9      22197255  rs6475619  0.345984
In the example code above, we have also used several filtering options that PLINK always applies. The option --ld-window is used to analyze only SNPs that are not too far apart from each other; the default is 10, i.e., only the ten neighboring SNPs are analyzed. In addition, --ld-window-kb specifies a kb window; the default is 1 Mb. To estimate Δ², we use the default options as described above. Additionally, we use the option --ld-window-r2: as a consequence, only LD estimates that exceed a specific threshold are reported, and the default threshold is 0.2. This reduces the size of the output file when many comparisons are made. To obtain all pair-wise LD estimates, we use the option --ld-window-r2 0. The command

   plink --file hapmapChr9p21QC --r2 --ld-window-r2 0 --out hapmapChr9p21QC_LDr2

produces an output file hapmapChr9p21QC_LDr2.ld containing the following entry for the two SNPs:

   CHR_A  BP_A      SNP_A      CHR_B  BP_B      SNP_B      r2
   9      22197037  rs1679014  9      22197255  rs6475619  0.119705
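The genotype-based measure behind --r is simply the Pearson correlation of allele counts (Subheading 1.2), and --r2 reports its square (0.345984² ≈ 0.119705 above). A minimal sketch on a hypothetical mini-dataset (not the HapMap data):

```python
# Pearson correlation of allele counts at two SNPs -- the genotypic LD
# measure reported by PLINK's --r (and squared by --r2).
# x[i], y[i] = number of A alleles and B alleles carried by subject i.
# Hypothetical genotype data for eight subjects.
from math import sqrt

x = [0, 1, 1, 2, 2, 2, 0, 1]
y = [0, 1, 0, 2, 2, 1, 0, 1]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
sx = sqrt(sum((xi - mx) ** 2 for xi in x) / n)
sy = sqrt(sum((yi - my) ** 2 for yi in y) / n)

r = cov / (sx * sy)    # analogue of the --r output
r2 = r ** 2            # analogue of the --r2 output
print(r, r2)
```

No phase information enters this computation, which is exactly why the genotypic measures avoid the haplotype estimation step.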
To obtain all LD values for a set of SNPs versus one specific SNP, use the --ld-snp command in conjunction with --r2. To obtain all LD values for a group of SNPs versus other SNPs, use the command --ld-snp-list followed by a .txt file containing a list of SNPs. With the --matrix option, a matrix of LD values is created rather than a list; in this case, GPD is calculated for all SNP pairs, i.e., for SNPs on the same and on different chromosomes. It is also possible to force all SNP-by-SNP cross-chromosome comparisons in the standard output format by adding the option --inter-chr. The reader should note that runs using this option can take excessively long because all pair-wise SNP comparisons are made over all chromosomes.

2.2. Haploview
Haploview is a tool designed to support the analysis of haplotypes by providing a graphical interface (26, 27). Haploview offers several functionalities; here, we focus on the analysis of LD and haplotype blocks using Haploview 4.2. Several input file formats are accepted, such as linkage, Haps, HapMap, HapMap PHASE, and PLINK output. In this illustration, we use the linkage format; to this end, we assume that a ped-file and a map-file are present. After starting Haploview, we are asked to input a Data File, i.e., the ped-file, and a Locus Information File, i.e., the map-file. Afterwards, it may be specified whether the data are X-chromosomal or whether an association test is to be performed. In our illustration, we neither have X-chromosomal data nor want to test for association. Additionally, we choose to ignore pair-wise comparisons of SNPs >1,000 kb apart, and we ignore the option that excludes individuals. After clicking "OK", four tabs may be chosen: "LD Plot", "Haplotypes", "Check Markers", and "Tagger". If "LD Plot" is chosen, the LD plot can be saved, e.g., as a PNG file: choose "File", then "Export Data"; as Output Format we choose "PNG Image" and restrict the PNG image to Markers "1–41". Afterwards, we click "OK" and save the image.
Fig. 1. Illustrative linkage disequilibrium plot generated with Haploview.
The result is displayed in Fig. 1. It is possible to use different color schemes in Haploview. To change the color scheme, choose "Display" and then "LD color scheme"; we change the color scheme to "R-squared". Next, we want to show other LD values: choose "Display" again, then "Show LD values", and change to "R-squared". We saved this image analogously to the procedure described above; the result is shown in Fig. 2. In Haploview, detailed information about the relationship of two SNPs can be obtained by right-clicking on the square belonging to the SNPs. This yields the following information: the distance in kb, D′, LOD, Δ², confidence bounds for D′, and the haplotype frequencies.

2.3. Further Software Implementations
Various other packages for the estimation or visualization of LD are available. The package GENASSOC is based on STATA/SE and provides a variety of allelic LD measures (http://www-gene.cimr.cam.ac.uk/clayton/software/stata/). Several R libraries are available for the estimation of LD; a commonly used one is GenABEL (28). A further possibility for plotting the estimated LD structure in heat maps is GOLD (29). LocusZoom is a plotting tool that displays, e.g., regional information relative to LD (30).
Fig. 2. Illustrative linkage disequilibrium plot generated with Haploview using the "R-squared" "LD color scheme".

3. Notes

1. The measure D_AB is the most straightforward measure of LD. However, it depends on the allele frequencies. Its maximum range, −0.25 to 0.25, is attained when both allele frequencies are 0.5; the range of D_AB becomes even more restricted as the allele frequencies approach 0 or 1 (31). The value of D_AB depends on the allele frequencies at both marker loci (2); it can be shown that

$$\max(-p_A p_B,\, -q_A q_B) \le D_{AB} \le \min(p_A q_B,\, q_A p_B).$$

D_AB is therefore comparable between two pairs of SNPs only if the allele frequencies of the SNPs are similar, which limits the practical use of this measure. To overcome this limitation, several standardizations of D_AB have been suggested (32) (see above). All of them have in common that D_AB is in the numerator (see Eqs. 2–5 and 7). The estimator D̂_AB of D_AB has expected value and large-sample variance (14)

$$E(\hat{D}_{AB}) = \frac{2n - 1}{2n}\, D_{AB},$$

$$\mathrm{Var}(\hat{D}_{AB}) = \frac{1}{2n}\left[p_A q_A p_B q_B + (1 - 2p_A)(1 - 2p_B)\, D_{AB} - D_{AB}^2\right],$$

and it is asymptotically unbiased. An estimator of Var(D̂_AB) is obtained by replacing the allele frequencies and D_AB by their estimators. It is possible to use D_AB to test for LD using the null hypothesis H0: D_AB = 0 versus H1: D_AB ≠ 0. Statistical tests can be constructed as score tests (2):

$$T_S = \frac{\hat{D}_{AB} - E_{H_0}(\hat{D}_{AB})}{\sqrt{\widehat{\mathrm{Var}}_{H_0}(\hat{D}_{AB})}} = \frac{\hat{D}_{AB}}{\sqrt{\frac{1}{2n}\, \hat{p}_A \hat{q}_A \hat{p}_B \hat{q}_B}}$$

or Wald tests:

$$T_W = \frac{\hat{D}_{AB} - E_{H_0}(\hat{D}_{AB})}{\sqrt{\widehat{\mathrm{Var}}(\hat{D}_{AB})}} = \frac{\hat{D}_{AB}}{\sqrt{\widehat{\mathrm{Var}}(\hat{D}_{AB})}}.$$
Both test statistics are asymptotically standard normally distributed under H0. Asymptotically valid (1 a)-confidence intervals for LD can thus be constructed as rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi b d b DAB z1a=2 Var DAB : 2. Sometimes D is also termed r or r. D is usually squared to avoid the arbitrary sign which is introduced when the alleles are labeled. D is also known as the Pearson’s product moment correlation coefficient, and D2 is often termed the coefficient of determination. The definition of D2 can be understood by considering the alleles as realizations of two binary random variables with values 0 and 1 among which the correlation coefficient is estimated. D2 ranges between 0 and 1. It is equal to one only if two entries of Table 2 are equal to 0, and the LD is said to be total (or perfect) in this case. This measure is preferred whenever the focus is on the predictability of one polymorphism given the other. Thus, it is often used in power studies for association designs (33); for properties of D, see VanLiere and Rosenberg (34). 3. D0 ranges between 1 and 1. If the denominator of D0 equals 1, the LD is said to be complete. To achieve this property, a single cell in Table 2 needs to equal 0. D0 is preferred to assess recombination patterns. Haplotypes have often been defined using D0 (33). If an SNP has only a low minor allele frequency, it is possible that the corresponding rare haplotype is not observed in a small study population. This leads to a Lewontin’s D0 equal
7 Estimating Disequilibrium Coefficients
to 1, independent of the true level of LD (35, 36). Thus, the values of D′ are inflated in the presence of rare alleles. To avoid these spurious results, empirical confidence intervals for D′ should be estimated using resampling schemes (37). Lewontin's D′ is sometimes rewritten as

D' = \frac{D_{AB}}{D_{max}}, \quad \text{with } D_{max} = \min(p_A q_B,\, q_A p_B) \text{ if } D_{AB} > 0 \text{ and } D_{max} = \max(-p_A p_B,\, -q_A q_B) \text{ if } D_{AB} < 0.

Thus, D′ ranges between 0 and 1 (2, 15).
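The pairwise measures discussed in these notes (the coefficient D_AB, Lewontin's D′ with its sign-dependent D_max, and the squared correlation) are simple to compute from allele and haplotype frequencies. The sketch below is only an illustration of the definitions above, written in Python with function and variable names of my own; the chapter itself works with R and the software listed in its earlier subheadings.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return D_AB, Lewontin's D', and the squared correlation r^2.

    p_ab: frequency of the AB haplotype; p_a, p_b: allele frequencies.
    """
    q_a, q_b = 1.0 - p_a, 1.0 - p_b
    d = p_ab - p_a * p_b                      # LD coefficient D_AB
    if d >= 0:                                # sign-dependent normalizer D_max
        d_max = min(p_a * q_b, q_a * p_b)
    else:
        d_max = max(-p_a * p_b, -q_a * q_b)
    d_prime = d / d_max if d_max != 0 else 0.0
    r2 = d * d / (p_a * q_a * p_b * q_b)      # r^2 = D^2 / (pA qA pB qB)
    return d, d_prime, r2

# Hypothetical frequencies: p_AB = 0.35, p_A = 0.6, p_B = 0.5
print(ld_measures(0.35, 0.6, 0.5))  # D = 0.05, D' = 0.25, r^2 ~ 0.042
```

With these hypothetical inputs, D is positive, so D_max is the smaller of p_A q_B and q_A p_B, illustrating why D′ can reach 1 while r² stays small.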
4. Several papers have argued for and against the use of ρ² in favor of D′. For details, the reader may refer to refs. 4, 12, 31, 38, 39.

5. The denominator of δ is often written as p_B(1 − p_A − p_B + p_A p_B + D_AB). In the literature, δ is also termed P_excess (40) or λ (41).

6. The odds ratio ranges from 0 to ∞, while Q ∈ [−1, 1].

7. D_XY can be estimated by

\hat{D}_{XY} = \frac{1}{n}\left( 2 n_{00} + n_{01} + n_{10} + \frac{1}{2} n_{11} \right) - 2 \hat{p}_A \hat{p}_B.

A series of other higher-order genotypic LD measures, also termed composite LD measures, have been proposed by Weir and Cockerham (3, 13, 14). The variance of the composite LD, which is not of a standard form, is also provided there. In practice, the use of these LD measures is limited because hardly any relation to other LD measures (see Subheading 3.1) has been established.

8. The asymptotic variance of the estimated correlation coefficient \hat{\Delta}_{XY} can be estimated by

\widehat{Var}(\hat{\Delta}_{XY}) = \frac{1}{n}\left[ \sum_{i=0}^{2}\sum_{j=0}^{2} \hat{p}_{ij} \hat{z}_{ij}^2 - \left( \sum_{i=0}^{2}\sum_{j=0}^{2} \hat{p}_{ij} \hat{z}_{ij} \right)^2 \right].   (9)

The terms z_{ij} can be estimated by

\hat{z}_{ij} = \frac{ij - i\bar{y} - j\bar{x}}{s_x s_y} - \frac{\hat{\Delta}_{XY}}{2}\left( \frac{i(i - 2\bar{x})}{s_x^2} + \frac{j(j - 2\bar{y})}{s_y^2} \right),   (10)
where \bar{x} and \bar{y} denote the means of the two involved SNPs, respectively, and s_x^2 and s_y^2 are the corresponding variances (2). It is possible to derive an asymptotically valid confidence interval, preferably using Fisher's z transformation. Thus, let \hat{g} = \frac{1}{2}\ln\frac{1+\hat{\Delta}}{1-\hat{\Delta}} be the estimated Fisher z-transformed correlation coefficient. The variance of \hat{g} can be estimated as \widehat{Var}(\hat{g}) = \widehat{Var}(\hat{\Delta}) / (1 - \hat{\Delta}^2)^2. With these estimators, an asymptotically valid confidence interval for g at a confidence level 1 − α is given by
M. Vens and A. Ziegler
\underline{g}_{(1-\alpha)} = \hat{g} - z_{1-\alpha/2} \sqrt{\widehat{Var}(\hat{g})},
\qquad
\overline{g}_{(1-\alpha)} = \hat{g} + z_{1-\alpha/2} \sqrt{\widehat{Var}(\hat{g})},

where z_{1-\alpha/2} denotes the (1 − α/2) quantile of the standard normal distribution. The inverse Fisher z transformation is used to calculate an asymptotically valid 1 − α confidence interval for Δ_XY:

\underline{\Delta}_{XY,(1-\alpha)} = \tanh\left(\underline{g}_{(1-\alpha)}\right),
\qquad
\overline{\Delta}_{XY,(1-\alpha)} = \tanh\left(\overline{g}_{(1-\alpha)}\right).
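A sketch of this Fisher z interval in Python, assuming an estimated coefficient and its estimated variance are already in hand (the function name is my own; the chapter's worked examples use R):

```python
import math
from statistics import NormalDist

def fisher_z_ci(delta_hat, var_delta_hat, alpha=0.05):
    """CI for a correlation-type coefficient via Fisher's z and its inverse (tanh)."""
    g = math.atanh(delta_hat)                          # g = 0.5*ln((1+d)/(1-d))
    var_g = var_delta_hat / (1 - delta_hat ** 2) ** 2  # delta-method variance of g
    half = NormalDist().inv_cdf(1 - alpha / 2) * math.sqrt(var_g)
    return math.tanh(g - half), math.tanh(g + half)

lo, hi = fisher_z_ci(0.3, 0.01)  # hypothetical estimate and variance
print(lo, hi)
```

Because the bounds are mapped back through tanh, the interval is guaranteed to stay inside (−1, 1), which a naive Wald interval on the coefficient itself is not.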
There is a series of pathological cases that need to be considered separately; for solutions, see ref. 4. It is also possible to use an exact distribution instead of the described asymptotic approach (4, 14).

References

1. Wang X, Elston RC, Zhu X (2010) The meaning of interaction. Hum Hered 70: 269–277
2. Ziegler A, König IR (2010) A statistical approach to genetic epidemiology: concepts and applications, 2nd edn. Wiley-VCH, Weinheim
3. Weir BS (1979) Inferences about linkage disequilibrium. Biometrics 35: 235–254
4. Wellek S, Ziegler A (2009) A genotype-based approach to assessing the association between single nucleotide polymorphisms. Hum Hered 67: 128–139
5. Robbins RB (1918) Some applications of mathematics to breeding problems III. Genetics 3: 375–389
6. Hill WG, Weir BS (1994) Maximum-likelihood estimation of gene location by linkage disequilibrium. Amer J Hum Genet 54: 705–714
7. Lewontin RC (1964) The interaction of selection and linkage. I. General considerations; heterotic models. Genetics 49: 49–67
8. Levin ML, Bertell R (1978) Re: simple estimation of population attributable risk from case–control studies. Amer J Epidemiol 108: 78–79
9. Levin ML (1953) The occurrence of lung cancer in man. Acta Unio Int Contra Cancrum 9: 531–541
10. Kaplan N, Weir BS (1992) Expected behavior of conditional linkage disequilibrium. Amer J Hum Genet 51: 333–343
11. Yule GU (1900) On the association of attributes in statistics. Phil Transact Roy Soc London A 194: 257–319
12. Devlin B, Risch N (1995) A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 29: 311–322
13. Weir BS, Cockerham CC (1979) Estimation of linkage disequilibrium in randomly mating populations. Heredity 42: 105–111
14. Weir BS (1996) Genetic data analysis II: methods for discrete population genetic data, 2nd edn. Sinauer Associates, Sunderland, MA
15. Mueller JC (2004) Linkage disequilibrium for different scales and applications. Brief Bioinformatics 5: 355–364
16. Nielsen et al (2004) Effect of two- and three-locus linkage disequilibrium on the power to detect marker/phenotype associations. Genetics 168: 1029–1040
17. Nothnagel M, Furst R, Rohde K (2002) Entropy as a measure for linkage disequilibrium over multilocus haplotype blocks. Hum Hered 54: 186–198
18. Zhang L, Liu JF, Deng HW (2009) A multilocus linkage disequilibrium measure based on mutual information theory and its applications. Genetica 137: 355–364
19. Liu Z, Lin S (2005) Multilocus LD measure and tagging SNP selection with generalized mutual information. Genet Epidemiol 29: 353–364
20. Gorelick R, Laubichler MD (2004) Decomposing multilocus linkage disequilibrium. Genetics 166: 1581–1583
21. Rogers AR, Huff C (2009) Linkage disequilibrium between loci with unknown phase. Genetics 182: 839–844
22. Kim Y, Feng S, Zeng ZB (2008) Measuring and partitioning the high-order linkage disequilibrium by multiple order Markov chains. Genet Epidemiol 32: 301–312
23. Feng S, Wang SC (2010) Summarizing and quantifying multilocus linkage disequilibrium
patterns with multi-order Markov chain models. J Biopharm Stat 20: 441–453
24. Coronary Artery Disease Consortium (2009) Large scale association analysis of novel genetic loci for coronary artery disease. Arterioscler Thromb Vasc Biol 29: 774–780
25. Purcell S, et al (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Amer J Hum Genet 81: 559–575
26. Barrett JC (2009) Haploview: visualization and analysis of SNP genotype data. Cold Spring Harb Protoc 2009: pdb ip71
27. Barrett JC, et al (2005) Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 21: 263–265
28. Aulchenko YS, et al (2007) GenABEL: an R library for genome-wide association analysis. Bioinformatics 23: 1294–1296
29. Abecasis GR, Cookson WOC (2000) GOLD: graphical overview of linkage disequilibrium. Bioinformatics 16: 182–183
30. Pruim RJ, et al (2010) LocusZoom: regional visualization of genome-wide association scan results. Bioinformatics 26: 2336–2337
31. Sved J (2009) Linkage disequilibrium and its expectation in human populations. Twin Res Hum Genet 12: 35–43
32. Morton NE, Collins A (1998) Tests and estimates of allelic association in complex inheritance. Proc Natl Acad Sci USA 95: 11389–11393
33. Chen YG, Lin CH, Sabatti C (2006) Volume measures for linkage disequilibrium. BMC Genet 7: 54–62
34. VanLiere JM, Rosenberg NA (2008) Mathematical properties of the r² measure of linkage disequilibrium. Theor Popul Biol 74: 130–137
35. Teare MD, et al (2002) Sampling distribution of summary linkage disequilibrium measures. Ann Hum Genet 66: 223–233
36. Tenesa A, et al (2004) Extent of linkage disequilibrium in a Sardinian sub-isolate: sampling and methodological considerations. Hum Mol Genet 13: 25–33
37. Gabriel SB, et al (2002) The structure of haplotype blocks in the human genome. Science 296: 2225–2229
38. Hedrick PW (1987) Gametic disequilibrium measures: proceed with caution. Genetics 117: 331–341
39. Wray NR (2005) Allele frequencies and the r² measure of linkage disequilibrium: impact on design and interpretation of association studies. Twin Res Hum Genet 8: 87–94
40. Lehesjoki AE, et al (1993) Linkage disequilibrium mapping in progressive myoclonus epilepsy of Unverricht-Lundborg type. Amer J Hum Genet 5: 1029–1029
41. Terwilliger JD (1995) A powerful likelihood method for the analysis of linkage disequilibrium between trait loci and one or more polymorphic marker loci. Amer J Hum Genet 56: 777–787
Chapter 8
Detecting Familial Aggregation
Adam C. Naj, Yo Son Park, and Terri H. Beaty

Abstract

Beyond calculating parameter estimates to characterize the distribution of genetic features of populations (frequencies of mutations in various regions of the genome, allele frequencies, measures of Hardy–Weinberg disequilibrium), genetic epidemiology aims to identify correlations between genetic variants and phenotypic traits, with considerable emphasis placed on finding genetic variants that increase susceptibility to disease and disease-related traits. However, determining correlation alone does not suffice: genetic variants common in an isolated ethnic group with a high burden of a given disease may show relatively high correlation with disease but, as markers of ethnicity, these may not necessarily have any functional role in disease. To establish a causal relationship between genetic variants and disease (or disease-related traits), proper statistical analyses of human data must incorporate epidemiologic approaches to examining sets of families or unrelated individuals with information available on individuals' disease status or related traits. Through different analytical approaches, statistical analysis of human data can answer several important questions about the relationship between genes and disease:

1. Does the disease tend to cluster in families more than expected by chance alone?
2. Does the disease appear to follow a particular genetic model of transmission in families?
3. Do variants at a particular genetic marker tend to cosegregate with disease in families?
4. Do specific genetic markers tend to be carried more frequently by those with disease than by those without, in a given population (or across families)?
The first question can be examined using studies of familial aggregation or correlation. An ancillary question: “how much of the susceptibility to disease (or variation in disease-related traits) might be accounted for by genetic factors?” is typically answered by estimating heritability, the proportion of disease susceptibility or trait variation attributable to genetics. The second question can be formally tested using pedigrees for which disease affection status or trait values are available through a modeling approach known as segregation analysis. The third question can be answered with data on pedigrees with affected members and genotype information at markers of interest, using linkage analysis. The fourth question is answerable using genotype information at markers on unrelated affected and unaffected individuals and/or families with affected and unaffected members. All of these questions can also be explored for quantitative (or continuously distributed) traits by examining variation in trait values between family members or between unrelated individuals. While each of these questions and the analytical approaches for answering them is explored extensively in subsequent chapters (heritability in Chapters 9 and 10, segregation in
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_8, © Springer Science+Business Media, LLC 2012
Chapter 12, linkage in Chapters 13–17, and association in Chapters 18–21 and 23), this chapter focuses on statistical methods to answer questions of familial aggregation.

Key words: Family history, Family history score, Familial case–control, Cochran–Mantel–Haenszel, Odds ratio, Logistic regression, Conditional logistic regression, Familial gene–environment interaction, Generalized estimating equations, Familial relative risk, Familial recurrence risk, Standardized incidence ratio, Standardized mortality ratio
1. Introduction

Familial aggregation studies represent a "first step" in epidemiologic investigations of genetic risk of disease, as they determine whether a disease of interest is observed in families more often than would be expected by chance alone. By testing the null hypothesis (H0) that the frequency of disease among individuals related to an affected case or proband is no greater than the frequency of disease in the general population, these studies establish whether the disease clusters in families, be it the result of genetic factors or shared environmental exposures. Related follow-up studies, such as twin studies and adoption studies, try to disentangle whether risk of disease is due solely to genetic factors, due solely to environmental factors, or the result of some combination of effects from both environmental and genetic risk factors. Studies of familial aggregation initially ascertain unrelated affected individuals (index cases) with disease from specific groups or from the population at large. In collecting data on disease and exposure patterns, these studies will often attempt to determine from the index case whether any additional family members have the disease. Data collected on patterns of disease in the index case's family may range from simple questions about whether the case's parents had the disease of interest or related conditions to more comprehensive questions about affection status in siblings, cousins, and other members of the case's extended family, including information such as age at disease onset and final outcome (e.g., death). To better characterize patterns of disease in families, some studies may actively try to recruit relatives of the index case and collect firsthand information on disease features and patterns of exposures in these relatives, to gather more insight into disease pathogenesis and whether certain disease features or comorbidities also cluster in families.
These more detailed studies will be able to examine broader questions of whether pathogenesis of the disease is heterogeneous, with possibly different mechanisms or risk factors clustering in different families at high risk. A broad range of diseases, conditions, and continuously distributed disease-related traits have been examined for familial aggregation (Table 1), including rheumatoid arthritis (1), obsessive-compulsive disorder (2), and even “tone deafness” (congenital amusia) (3).
Table 1 Some examples of studies (adapted and updated from Khoury et al. (12))

Author (year) (reference) | Disease/trait in index case | Disease/trait in relatives | Comments
Tokuhata and Lilienfeld (1963) (46) | Lung cancer | Lung cancer | Interaction between family history and smoking
Feinleib et al. (1977) (47) | Cardiovascular risk factors | Cardiovascular risk factors | Risk factors for a disease
Cohen (1980) (15) | Chronic obstructive pulmonary disease and lung cancer | Impaired pulmonary function | Etiologic similarities for different diseases
Khoury et al. (1982) (48, 49) | Neural tube defects | Neural tube defects | Etiologic heterogeneity within a malformation
ten Kate et al. (1982) (50) | Myocardial infarction | Myocardial infarction | Familial aggregation of a disease
del Junco et al. (1984) (1) | Rheumatoid arthritis | Rheumatoid arthritis | Familial aggregation of a disease
Sattin et al. (1985) (51) | Breast cancer | Breast cancer | Familial aggregation of a disease
Nielsen et al. (1987) (52) | Peak bilirubin in newborn | Peak bilirubin in newborn | Physiologic trait associated with neonatal jaundice
Beaty et al. (1988) (53) | Birth weight | Birth weight | Continuous trait in sibs
Maestri et al. (1988) (39) | Congenital cardiovascular malformations | Congenital cardiovascular malformations | Familial aggregation of a class of diseases
Linet et al. (1989) (54) | Chronic lymphocytic leukemia | Lymphoproliferative cancers | Etiologic and pathogenic similarities
Ponz de Leon et al. (1989) (8) | Colorectal cancer | Tumor formation | Pathologic heterogeneity with familial aggregation of a disease
Lin et al. (1998) (55) | Rheumatoid arthritis | Other autoimmune diseases | Familial aggregation of a class of diseases
Nestadt et al. (2000) (2) | Obsessive compulsive disorder | Obsessive compulsive disorder | Familial aggregation of a psychological disorder
Naldi et al. (2001) (6) | Acute guttate psoriasis | Psoriasis (any) |
Criswell et al. (2005) (56) | Autoimmune disease | Autoimmune disease | Etiologic and pathogenic similarities
Beaty et al. (2006) (57) | Oral cleft | Oral cleft | Familial aggregation of a disease
Peretz et al. (2007) (3) | Congenital amusia (tone deafness) | Congenital amusia | Pathologic heterogeneity with familial aggregation of a disease
Raynor et al. (2009) (58) | Age-related hearing loss | Age-related hearing loss | Etiologic heterogeneity with familial aggregation of a condition in some cases
Krogh et al. (2010) (59) | Pyloric stenosis | Pyloric stenosis | Familial aggregation of a disease
Xiong et al. (2010) (60) | Restless legs syndrome | Restless legs syndrome | Familial aggregation of a disease
Saito et al. (2010) (61) | Irritable bowel syndrome | Irritable bowel syndrome | Familial aggregation of a disease
Lichtenstein et al. (2010) (62) | Autism-spectrum disorders (ASD) | ASD and related neuropsychiatric disorders | Twin concordance and familial aggregation with pathogenic heterogeneities
Petersen et al. (2010) (63) | Infectious disease | Infectious disease | Familial aggregation for disease susceptibility and case fatality
Initially, we focus on two epidemiologic study designs for examining familial aggregation: conventional case–control and family case–control designs. Other designs exist and are discussed in more detail in Chapters 9 and 10 and in other reviews (4, 5).

1.1. Conventional Case–Control Approaches to Familial Aggregation of Binary Traits
The simplest approach to identify familial aggregation of disease is by comparing the presence of a family history (FH) of the disease among cases and controls. This approach extends traditional case–control analysis to treat familial clustering of disease as a binary trait [FH present/FH absent]. Positive family history (FH+) is usually defined as the presence of disease in one or more first-degree relatives of either cases or controls, whereas negative family history (FH−) would
Fig. 1. Traditional epidemiologic 2 × 2 contingency tables for family history exposure and disease data, where FH+ indicates the presence of a family history of disease and FH− its absence, and D indicates cases and D̄ indicates controls. A general form of the table is depicted in (a), whereas stratum-specific tables are demonstrated in (b).
be total absence (or very low frequency) of the disease among first-degree relatives. As implemented by epidemiologists in traditional case–control designs, this can help identify whether exposures shared among family members may contribute to increased odds of disease, and this information can sometimes be used to develop preventive or intervention strategies. These data can be obtained by direct interviews of cases and controls or by corresponding with family informants.

1.1.1. Unadjusted Measures of FH-Disease Association
If effects of potential biases on the association being examined, such as ascertainment issues, disease or exposure misclassification, and/or the presence of confounders, are minimal (or deliberately minimized through appropriate study design), then the relationship between FH and disease can be evaluated in a relatively simple manner. FH exposure and disease data can be presented in a traditional epidemiologic 2 × 2 contingency table (Fig. 1). Under these circumstances, the odds ratio (OR) comparing positive family history (FH+) exposure among cases (D) and controls (D̄) would be:

OR = \frac{\Pr(FH+ \mid D) / \Pr(FH- \mid D)}{\Pr(FH+ \mid \bar{D}) / \Pr(FH- \mid \bar{D})}.   (1)
An OR > 1 would indicate an association of positive family history with disease, while an OR = 1 would indicate no association. An OR < 1 would indicate an excess of positive family history among those without disease (i.e., controls). As it is expected that (1) the frequency of FH+ among controls should approximate the prevalence of disease in the general population and (2) the frequency of disease among family members of cases should be no less than the population prevalence, observing OR < 1 would suggest the need to reevaluate the study design for the presence of factors that may bias this association.
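The unadjusted OR of Eq. 1 and a Wald-type 95% CI can be sketched in a few lines of Python (the chapter's own step-by-step instructions use R; the counts below are those from the Naldi et al. psoriasis example in Table 2, and the function name is my own):

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted OR (Eq. 1) and Wald 95% CI from a 2x2 table.

    a, b: FH+ and FH- counts among cases; c, d: FH+ and FH- counts among controls.
    """
    or_ = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # Woolf SE of ln(OR)
    lo = math.exp(math.log(or_) - z * se_log_or)
    hi = math.exp(math.log(or_) + z * se_log_or)
    return or_, lo, hi

# Counts from the Naldi et al. example (Table 2)
or_, lo, hi = odds_ratio_ci(24, 49, 33, 397)
print(round(or_, 2), round(lo, 2), round(hi, 1))  # 5.89 3.22 10.8
```

The CI is formed on the log-odds scale, where the sampling distribution is closer to normal, and then exponentiated back.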
Table 2 Distribution of family history of psoriasis among acute guttate psoriasis cases and unrelated controls (adapted from Naldi et al. (6))

Trait | Cases, No. (%) | Controls, No. (%) | Chi-square test for homogeneity(a) (P value)
Family history of psoriasis (parents or siblings) | | |
No | 49 (67.1) | 397 (92.3) |
Yes | 24 (32.9) | 33 (7.7) | 36.98 (0.000)

(a) The number of degrees of freedom is equal to the number of categories minus 1
An example estimating association of positive family history with disease is explored in a study by Naldi et al. (6), which examined several risk factors for acute guttate psoriasis in 73 cases and 430 controls. As demonstrated in Table 2, 24 of 73 cases (32.9%) had a family history of psoriasis, whereas only 33 of 430 controls (7.7%) had a family history of this condition. Assuming there are no other sources of bias that need to be accounted for, the unadjusted OR and 95% confidence interval (CI) for the odds of psoriasis given a history of psoriasis in parents or siblings were estimated as 5.89 (3.22, 10.8). In Subheadings 2.1 and 2.2, we provide a step-by-step description of how to calculate the odds ratio and the confidence interval using the R statistical software package (http://cran.r-project.org). In this study, the investigators identified several potential confounding variables (age, sex, body mass index, smoking, and alcohol consumption) to adjust for in estimating this OR, and thus chose more complex approaches to report the adjusted OR estimates, some of which are described in more detail subsequently.

1.1.2. Measures of FH-Disease Association with Cochran–Mantel–Haenszel Adjustment for Potential Confounders
If any variables are recognized as potential confounders of the observed relationship between FH and disease, these may be accounted for in the design stage of the study, using approaches such as frequency matching or individually matching cases and controls, or restricting ascertainment to individuals meeting certain criteria (though this may reduce the case–control sample's overall representativeness of the general population). If these potentially biasing factors are frequent enough in the population from which the cases and controls are sampled, stratifying cases and controls into groups with similar values for these variables (Fig. 1b) may allow their effects to be adjusted out using various procedures, such as the Cochran–Mantel–Haenszel (CMH) approach (7). Stratification may also permit a cursory
examination of whether these potential confounders modify the effects of FH on risk of disease. If a stratifying trait has K states, the within-stratum CMH OR comparing positive family history with affection status would be defined for each stratum i = 1, ..., K as:

OR_i = \frac{\Pr(FH+ \mid D, i\text{th stratum}) / \Pr(FH- \mid D, i\text{th stratum})}{\Pr(FH+ \mid \bar{D}, i\text{th stratum}) / \Pr(FH- \mid \bar{D}, i\text{th stratum})}.   (2)
Using the cell counts in Fig. 1b, we note that for stratum i each of these probabilities is defined as:

\Pr(FH+ \mid D, i\text{th stratum}) = \frac{a_i}{a_i + b_i}, \qquad \Pr(FH- \mid D, i\text{th stratum}) = \frac{b_i}{a_i + b_i},
\Pr(FH+ \mid \bar{D}, i\text{th stratum}) = \frac{c_i}{c_i + d_i}, \qquad \Pr(FH- \mid \bar{D}, i\text{th stratum}) = \frac{d_i}{c_i + d_i}.   (3)
Restating OR_i in these terms, we have

OR_i = \frac{\left( a_i/(a_i + b_i) \right) / \left( b_i/(a_i + b_i) \right)}{\left( c_i/(c_i + d_i) \right) / \left( d_i/(c_i + d_i) \right)} = \frac{a_i d_i}{b_i c_i}.   (4)
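The per-stratum OR of Eq. 4, together with the sample-size-weighted pooling across strata that defines the CMH estimator, can be sketched in Python (the counts below are invented for illustration; the chapter's worked examples use R, SAS, and Stata):

```python
def cmh_odds_ratio(strata):
    """Cochran-Mantel-Haenszel pooled OR over 2x2 strata.

    Each stratum is a tuple (a_i, b_i, c_i, d_i) laid out as in Fig. 1b.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Two hypothetical strata (all counts invented for illustration)
print(round(cmh_odds_ratio([(10, 20, 5, 40), (8, 12, 6, 30)]), 2))  # 3.67
```

With a single stratum the stratum size cancels and the estimator reduces to a·d / (b·c), the unadjusted OR of Eq. 4.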
To estimate an average effect across all K strata, weighted by the number of individuals in each stratum, the across-stratum CMH OR would be defined as:

OR_{CMH} = \frac{\sum_{i=1}^{K} a_i d_i / n_i}{\sum_{i=1}^{K} b_i c_i / n_i},   (5)

where n_i is the total sample size of the ith stratum. CMH estimation of ORs and other measures of association is widely implemented in most statistical software packages (see Subheading 2.4). A study by Ponz de Leon and colleagues (8) examining familial aggregation of tumor development among individuals in a colon cancer registry used a modified version of this approach to estimate the OR for tumor development among first-degree relatives of cases and age- and sex-matched controls. The investigators parsed first-degree relatives into 389 strata (K = 389), each stratum containing the relatives of one case and the relatives of their matched control, to adjust the measured association for the effects of age and sex. Within stratum i, for example, the probability of a positive family history for case i, Pr(FH+ | D, ith stratum), would be estimated as the fraction of person i's first-degree relatives with tumors out of the total number of their first-degree relatives at risk for tumor development; all other within-stratum
probabilities would be estimated similarly. Using this approach, these authors observed a CMH OR = 7.5 for tumor development (P < 0.001), indicating a strong excess of tumors among first-degree relatives of colon cancer patients. The functions available in the software packages R, SAS, and Stata to calculate the CMH odds ratio are outlined in Subheading 2.5.

1.1.3. Logistic Regression Approaches to Measuring FH-Disease Association with Adjustment for Potential Confounders
Several approaches for implementing general linear models to measure association of disease among unrelated cases and controls with family history of disease in their relatives have been explored. Hopper et al. (9) proposed examining familial binary trait data using a restricted class of log-linear models, within which the modeling of a first-order interaction term can be interpreted as the conditional log odds ratio measuring familial aggregation. An approach similarly using log-linear models was developed and described by Connolly and Liang (10). Subsequently, Hopper and Derrick (11) demonstrated how data from families of different sizes and structures can be used to estimate within-class and between-class "correlations" for binary data as functions of their odds ratios, under the assumption that no second-order or higher-level interactions would be included in their log-linear model. Unlike conventional epidemiologic risk factors for disease, family history is not a physical characteristic particular to individuals being studied, but depends on factors that may be entirely unrelated to disease, such as number of family members, types of relationship to the proband, age distribution of the family members, and population prevalence of the disease (12). Because of this, logistic regression approaches modified to account for both within-family and among-family effects, such as the approach described by Liang and Beaty (13), are warranted. Under this approach, for a pedigree with n members, with Y_j denoting the binary outcome of the jth individual, we have:

\log \frac{\Pr(Y_j = 1)}{\Pr(Y_j = 0)} = \beta_0 + \beta_1 x_{1j} + \cdots + \beta_p x_{pj} = \beta^t x_j,   (6)
where x_j is a p × 1 column vector of covariates thought to be associated with risk of disease through β^t, a 1 × p row vector. These covariates may include demographic variables (such as sex and age), environmental factors (e.g., smoking habits), as well as variables specific to the entire family (such as race/ethnicity and markers of socioeconomic status, like family income). Under this logistic regression model, a common risk among all family members need not be assumed. Following on this, we can define the odds ratio between the jth and kth members of the family as:

OR_{jk} = \frac{\Pr(Y_j = 1, Y_k = 1)\,\Pr(Y_j = 0, Y_k = 0)}{\Pr(Y_j = 1, Y_k = 0)\,\Pr(Y_j = 0, Y_k = 1)}.   (7)
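Given the four joint probabilities in Eq. 7, the pairwise OR is a one-line computation; a small sketch with an invented joint distribution for one relative pair:

```python
def pairwise_or(p11, p10, p01, p00):
    """OR between the jth and kth family members (Eq. 7) from joint probabilities.

    p11 = Pr(Yj=1, Yk=1), p10 = Pr(Yj=1, Yk=0), and so on; the four must sum to 1.
    """
    return (p11 * p00) / (p10 * p01)

# Hypothetical joint distribution for one relative pair
print(round(pairwise_or(0.10, 0.15, 0.15, 0.60), 2))  # 2.67
```

An OR above 1, as here, would indicate that the pair's outcomes are positively associated, i.e., familial aggregation of risk.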
The odds ratio is the most common measure of association between categorical variables and ranges from zero to infinity (14). If this odds ratio is truly equal to one, there is no association between pairs of relatives. In our context, OR = 1 for each of the n(n − 1)/2 pairs of relatives indicates no genetic contribution to disease risk. On the other hand, OR > 1 would reflect familial aggregation of risk. To complete the modeling, we assume

\log OR_{jk} = \gamma^t z,   (8)
where z, a q × 1 vector, includes variables specific to the family and variables indicating the status of the (j, k) pair in the family. Thus, in families including all first-degree relatives, we may assume that log OR_{jk} = γ_ss, γ_pp, or γ_ps depending on whether the (j, k) pair are siblings, parents, or a parent–sibling pair. This equation introduces two intraclass odds ratios for familial aggregation, namely e^{γ_ss} and e^{γ_pp}, and one interclass parameter, e^{γ_ps}. The model characterized in Eq. 8 is extensible to examine additional hypotheses, for instance, differences in familial aggregation by race/ethnic group. A familial aggregation study including both white and black families might model log OR_{jk} = γ_0 + γ_1(Race), where Race = 1 for white families and Race = 0 for black families, and a significant estimate of γ_1 would lead to the conclusion that the degree of familial aggregation (as measured by this odds ratio) may differ between black and white families.

1.1.4. Extending Modeling of Family History as a Covariate in Regression Approaches
An alternative approach to addressing concerns about variable family sizes, age distribution, and biologic relationships across cases and controls is to create for each family a family history score (FHS) (12). This is accomplished by first deriving for each case or control the expected number of affected relatives based on person-time at risk, i.e.,

E = \sum_{j=1}^{n} E_j,   (9)

where E_j is the expected risk to the jth relative based on risks in the general population considering age, gender, and other demographic variables, such as birth year. A FHS is then defined as

FHS = \frac{O - E}{E^{1/2}},   (10)

the "standardized" version of O, the observed number of affected among the n relatives. Consider a 1980 genetic epidemiologic study by Cohen examining chronic obstructive pulmonary disease (COPD) (15), which sampled 105 cases and 79 controls and their families. First-degree relatives of case and control probands were directly examined for
Table 3 Affection status for chronic obstructive pulmonary disease (COPD) among relatives of COPD case and control probands (adapted from Liang and Beaty (64))

Disease status | Affected | Unaffected | Total
Case relative | 71 | 173 | 244
Control relative | 29 | 134 | 163
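The FHS of Eq. 10 is straightforward to compute once per-relative expected risks are available; a hedged Python sketch (the expected risks below are invented for illustration, standing in for age-, sex-, and birth-year-adjusted population risks):

```python
import math

def family_history_score(observed_affected, expected_risks):
    """FHS = (O - E) / sqrt(E), Eq. 10, with E the summed per-relative expected risks."""
    e = sum(expected_risks)
    return (observed_affected - e) / math.sqrt(e)

# One proband's family: four first-degree relatives with hypothetical
# adjusted expected risks, one of whom is affected
print(family_history_score(1, [0.05, 0.02, 0.10, 0.03]))
```

A positive FHS means more affected relatives were observed than expected from population risks; averaging FHS over cases and over controls, as in the Cohen study, gives the basis for a comparison between the two groups.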
pulmonary function in this study, and Table 3 shows the number of affected relatives by case–control status. In this study, FHSs were estimated for cases and controls, adjusting the expected number of affected first-degree relatives for age, sex, and birth year. The average FHS among cases was 0.383 [standard error (s.e.) = 0.117], while the average FHS among controls was 0.006 (s.e. = 0.107) (data not shown here). The number of affected relatives is treated as a Poisson variable, for which the mean and variance are the same, and the standard error is estimated accordingly (16). A simple t-test revealed a significant difference in FHS between cases and controls, suggesting that case relatives were at excess risk compared to control relatives, as measured by FHS.

1.1.5. Accounting for Nongenetic Contributors to Excess Family History of Disease
Family history (FH) as a variable can be subject to misclassification. For example, even when the disease has no genetic etiology, the probability of a case having a positive family history is a function of his/her total number of first-degree relatives, n:

1 - (1 - p)^n,   (11)

where p is the disease prevalence. With p = 0.05, this proportion would be 0.19 when n = 4 and increases to 0.34 when n = 8. Meanwhile, a proportion of 0.34 is expected if the disease prevalence is 0.10 instead of 0.05, even for a constant family size of n = 4. Thus, if the distribution of family sizes differs substantially between cases and controls, an odds ratio based on FH can be quite misleading. Another concern about the use of this simple family history variable is the potential for information or recall bias. Depending on biologic relationship, number of relatives, and other factors, the true disease status of relatives may be misreported, leading to further misclassification of FH. Unlike the previous concerns about family size, however, the degree of recall bias may often differ between cases and controls. Consequently, the estimated odds ratio could be either attenuated or inflated, and the magnitude of this discrepancy can be substantial (12). This concern
8 Detecting Familial Aggregation
about potential recall bias may be alleviated, to some extent, by carefully choosing informants from whom to gather FH information. For example, rather than interviewing only cases (or controls), one may instead consider interviewing parents or spouses as informants (multiple informants should also help). Indeed, when the cases (or controls) are deceased, such an alternative becomes a necessity.

1.1.6. Modifications of Logistic Regression Approaches
Testing gene–environment interactions. Case–control designs can also be utilized to test for interactions between genes and environmental risk factors. In the absence of knowledge regarding specific susceptibility genes, one may use either FH or FHS as a surrogate measure of “genetic loading,” or one can use markers in candidate genes. To test the hypothesis of interaction between environmental factors and family history (FH or FHS), one can consider the following logistic regression model (17):

logit Pr(D | FH(S), ENV) = α + β1 FH(S) + β2 ENV + β3 [FH(S) × ENV],     (12)
where ENV stands for an observed environmental variable such as maternal smoking. One can test the hypothesis of no interaction by examining the magnitude and sign of β3, the coefficient for the interaction term. It is worth noting that this interaction, if it exists, is modeled in a multiplicative fashion. Specifically, when comparing two individuals who differ by one unit in ENV (E + 1 versus E, say), the odds ratio relating disease risk to ENV for those with positive family history (FH+) is e^β3 times that of those without family history (FH−), i.e.,

OR_FH+(E + 1, E) = e^(β2 + β3) and OR_FH−(E + 1, E) = e^β2.     (13)
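The arithmetic of Eq. 13 can be sketched directly. Although the chapter's own software examples (Subheading 2) use R, the following short Python illustration shows how the two exposure odds ratios and their ratio e^β3 would be recovered from hypothetical fitted coefficients (the values of β2 and β3 below are illustrative only, not from any study in the text):

```python
import math

# Hypothetical fitted coefficients from a model of the form in Eq. 12
# (illustrative values only, not from the chapter's data).
b2 = 0.3   # main effect of a one-unit increase in ENV
b3 = 0.5   # FH-by-ENV interaction

or_fh_pos = math.exp(b2 + b3)  # OR for ENV among those with FH+ (Eq. 13)
or_fh_neg = math.exp(b2)       # OR for ENV among those with FH-

# Under the multiplicative model, the ratio of the two ORs is exp(b3).
ratio = or_fh_pos / or_fh_neg
print(round(or_fh_pos, 3), round(or_fh_neg, 3), round(ratio, 3))
```

A test of H0: β3 = 0 is therefore a test that this ratio of odds ratios equals 1.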
In the situation where the risk factor is a marker in a candidate gene, one can test for interaction between marker genotype and exposure to the environmental risk factor by applying Eq. 12 with FH(S) replaced by a dichotomous variable, GEN, which is 1 if the case (or control) carries the targeted allele(s) and 0 otherwise. As an illustration, consider a case–control study on oral clefts in which 333 children born with oral clefts and 166 healthy infants were sampled (18). A main objective of the study was to test for association between risk of being a case and genotype at a candidate gene, the transforming growth factor alpha locus (TGFA), and the possible gene–environment interaction between TGFA and maternal smoking (MS). These data are summarized in Table 4 in two 2 × 2 tables stratified by maternal smoking status; here G (Ḡ) denotes the presence (absence) of a C2 allele at the TaqI polymorphism in TGFA. There was little evidence of association between oral cleft and TGFA whether the mother smoked or not (OR = 1.05 and 1.07, respectively). The logistic regression model was fitted including an
A.C. Naj et al.
Table 4 Distribution of presence (G) or absence (Ḡ) of the C2 allele at the transforming growth factor alpha locus (TGFA) by oral cleft case (D)/control (D̄) status, stratified by maternal smoking status (yes/no) (adapted from Liang and Beaty (64))

            Maternal smoking        No maternal smoking
            D     D̄     Total       D     D̄     Total
G           28    5     33          31    19    50
Ḡ           80    15    95          194   127   321
Total       108   20    128         225   146   371
interaction term (as in Eq. 12 above) with the addition of maternal age (MA) as another covariate. Results were very similar to those shown above: the odds ratio relating TGFA to the risk of oral cleft was estimated as 1.00 (= e^0.0014) for children whose mothers did not smoke and as 1.09 (= e^(0.0014 + 0.088)) for children whose mothers did smoke. Thus, these data provided little evidence of interaction between TGFA and maternal smoking regarding risk of oral cleft. Subheading 2.3 describes how to use the R software package to fit the model given by Eq. 12 that involves an interaction term. Lung disease provides an example of a complex disease in which genetic effects can be understood only when both genetic and environmental contributions are considered (reviewed in detail elsewhere (19)). Environmental factors such as cigarette smoking may modify genetic effects in the pathogenesis of diseases such as COPD, when genetically susceptible individuals are exposed chronically or to high concentrations of these factors (20). Studies have shown that even for a form of COPD under monogenic control, α1-antitrypsin (AAT) deficiency, there appears to be significant variation among family members in pulmonary function (21), suggesting the importance of environmental factors as modifiers of risk (19). Paré and colleagues (22) have recently proposed a new method called “variance prioritization,” which prioritizes the genetic markers examined to facilitate subsequent tests of gene–gene and gene–environment interactions. Variance prioritization applies Levene’s test of variance equality to select SNPs under a predefined threshold and uses standard regression models to study them further for interaction effects (22). Similar efforts have been undertaken by others,
who have noted the importance of evaluating gene–environment interaction effects in addition to composite genetic effects (23). Finally, sample size calculations have been developed to detect gene–environment interactions in unmatched case–control studies (24–26). Sturmer and Brenner (27) pointed out that a considerable gain in power for testing interactions may be achieved by matching (frequency or individual) on the environmental factor at the design stage, especially if the environmental factor is rare in the population. Such a gain in statistical power may be offset by the extra difficulty of identifying matched controls, for the very reason of the low prevalence of the environmental factor. Sturmer and Brenner (27) suggest that the balance between power gain and the extra cost of matching must take into account the specific research questions and surrounding circumstances. Statistical software programs such as QUANTO (28) allow for the consideration of the prevalence of environmental factors, varying allelic and genotypic frequencies, and modeling of different statistical tests to determine how much power is available for tests of gene–environment interaction under a varying set of conditions. Identifying homogeneous subgroups. Under the general framework of case–control designs, one can test for homogeneity among different subgroups of cases. Assuming a hypothesized subtyping variable (such as early versus late onset of the disease) is available, the following polytomous logistic regression model can be fitted:

log [Pr(Y = j | FH(S)) / Pr(Y = 0 | FH(S))] = αj + βj FH(S),   j = 1, . . ., C,     (14)
where Y is a categorical variable with C + 1 categories, Y = 0 for controls and Y = j for cases of the particular subtype j, j = 1, . . ., C. For the onset age example, C would be 2, with Y = 2 (1) if the case is diagnosed with late (early) onset. If the regression coefficients (the βj's) are significantly different from one another, then more homogeneous subgroups can be identified and may be targeted for further investigation. Returning to the oral cleft study, it is also of interest to examine whether the two anatomical subtypes of oral cleft, cleft palate only (CP) and cleft lip with/without palate (CLP), are etiologically heterogeneous. The 2 × 3 table in Table 5 shows the prevalence of carrying the C2 allele is highest among cases with CP (21.8%), followed by CLP cases (15.3%) and controls (14.5%). To formally answer this question, consider the following polytomous logistic regression model:

log [Pr(Y = j | TGFA, MS, MA) / Pr(Y = 0 | TGFA, MS, MA)] = αj + βj1 TGFA + βj2 MS + βj3 (TGFA × MS) + βj4 MA,
Table 5 Distribution of presence (G) or absence (Ḡ) of the C2 allele at the transforming growth factor alpha locus (TGFA) by oral cleft case (CP/CLP)/control (D̄) status (adapted from Liang and Beaty (64))

           Oral cleft
TGFA       CP      CLP     D̄       Total
G          27      32      24      83
Ḡ          97      177     142     416
Total      124     209     166     499
where Y = 1 (2) if the case had CP (CLP) and Y = 0 for controls. Children of nonsmoking mothers showed no association between TGFA and the risk of either type of oral cleft (e^β̂11 = 1.05 and e^β̂21 = 0.98 for CP and CLP, respectively), and hence there was very little evidence of heterogeneity in genotypic effect between these two types of cleft. For children whose mothers smoked, however, there is a stronger positive association between TGFA and the risk of being born with CP (e^(β̂11 + β̂13) = 1.87) compared to the association between TGFA and the risk of being born with CLP (e^(β̂21 + β̂23) = 0.74). Although the difference between these two odds ratios is not statistically significant (at the conventional 0.05 level), these data raise the possibility that these two subtypes of oral cleft may have different genetic etiologies with respect to TGFA, and perhaps this is modified by maternal smoking. In Subheading 2.4, we provide a step-by-step description of fitting a polytomous logistic regression model using the R software package.

1.2. Family-Based Approaches to Familial Aggregation of Binary Traits

1.2.1. Description

During the past decade, family-based designs have drawn a good deal of attention among researchers in genetic epidemiology addressing the issues considered here; see, for example, Claus et al. (29) and Mettlin et al. (30) for breast cancer, Pulver and Liang (31) for schizophrenia and, more recently, Nestadt et al. (2) for obsessive compulsive disorder. Specifically, with the consent of cases and controls, their relatives were recruited and interviewed directly for detailed evaluations of their disease status, laboratory assessments, and demographic and risk factor information relevant to
Fig. 2. Epidemiologic 2 × 2 contingency table for disease status among relatives of cases and controls.
disease. For this design, the phrase “case (control) proband” has been coined for cases (controls), as they represent probands, the individuals through whom the family is ascertained. Just as in Subheading 2, the data can be summarized in a 2 × 2 contingency table (Fig. 2). The primary response in this design is the risk among relatives, known as the familial risk. Thus, familial aggregation of disease may be claimed if the risk among case relatives is substantially higher than that among control relatives. It is important to point out that a primary difference between the conventional case–control design and this family case–control design lies in the sampling unit (here the family) and in the quantity used for comparison. In the standard case–control design, the unit of sampling and analysis is an individual (i.e., the case or control), and characteristics (including family history) are compared between cases and controls. In the family design, however, the unit becomes the family, and the characteristics of individual relatives and disease status within families are compared between case families and control families. This distinction has profound implications for the validity of statistical inferences and for practical implementation, as addressed below.

1.2.2. Matching Cases and Controls
Just as in conventional case–control studies, one has the choice of matching each case with a control or not. The matching criteria for family case–control designs should be subject to some modification, however. Recall that a major principle behind matching is to assure, to the extent possible, that the “units” to be compared are indeed comparable. Given that the comparison is made between case relatives and control relatives, sampling a control comparable to a case may not be adequate for this purpose. Thus, for family case–control designs, matching case and control probands at the design stage may be warranted if the primary confounding variables are themselves highly familial. This would increase the likelihood that case relatives and matched control relatives are comparable, at least for the matching variables. Confounding variables that are not necessarily familial—for instance, sex—can easily be adjusted for through regression. Whether matched or not, statistical methods that have been fully developed for the conventional case–control design, including conditional logistic regression analysis for matched designs, are
not adequate here. This is because the unit is a family, and the response variables (i.e., the affected status of relatives) are not statistically independent of each other. This additional analytical complication of within-family correlation in risk can be adjusted for by using the generalized estimating equation (GEE) method, which was specifically developed to accommodate dependence within clusters (2, 32). In a matched design, the method developed by Liang (33) can be used to account for correlations among related individuals. These two methods have been successfully applied to genetic studies addressing the three issues discussed below. For a more detailed discussion of the utility of this design and of the two analytical methods mentioned above for other questions of interest, including the issue of sample size calculations, see Liang and Pulver (34). Finally, for age-of-onset outcomes, methods for detecting familial aggregation have also been developed that take into account the within-family correlation (35–37). In Subheading 5, we outline the functions that are available in R, SAS, and Stata programs for analyzing data using the GEE method.

1.2.3. Detecting Familial Aggregation and Testing Gene–Environment Interactions
To see how the family case–control design may be used to more formally address the issue of familial aggregation, let Y = (Y1, . . ., Yn) be the disease status of n relatives from a family ascertained through a proband (either a case or control). Consider for each relative j, j = 1, . . ., n,

logit Pr(Yj = 1) = α + β^t xj + γz,     (15)
where z = 1 (0) if the proband for this family was a case (control) and the xj covariates are specific to the jth relative or to the proband. The key parameter of interest is obviously γ, which corresponds to the log odds ratio from the 2 × 2 table displayed in Fig. 2 (ignoring covariates). In general, this parameter characterizes the overall difference in familial risk, i.e., Pr(Y = 1), between case families and control families, while the logistic regression model allows for differential risk among individuals with different observed risk factor values, xj. Ignoring correlations in Y among related individuals would lead to incorrect estimates of the variance of the regression estimators in Eq. 15 above (38). This concern can be alleviated by adopting the GEE method for unmatched designs and the extended Mantel–Haenszel method of Liang (33) for matched designs. Returning to the COPD study, we recall that first-degree relatives of case and control probands were directly examined for pulmonary function in this study, and Table 3 shows the number of affected relatives by case/control status. Here 71 of 244 case relatives were diagnosed with impaired pulmonary function (29%), whereas 29 of 163 control relatives experienced the same condition
(18%). This led to an estimated log odds ratio of 0.64 (s.e. = 0.25), suggesting that the familial risk among case families is twice (e^0.64) that of control families. This simple approach may be criticized on the grounds that the estimated standard errors may be too small, owing to their failure to account for within-family correlations in Y, and that important risk factors such as smoking status were not properly considered. To alleviate these concerns, a logistic regression model with and without the GEE adjustment for dependence within families was applied to these data, in models with covariate adjustment only for race (Model I). Compared with the non-GEE logistic regression results, the GEE results in Model I gave corrected s.e.'s for γ̂ that were larger (s.e. = 0.28), although the discrepancy (0.25 versus 0.28) was modest in this example. Model II adjusted for individual risk factors, including smoking, age, etc. Comparing results from the GEE and non-GEE analyses revealed that Model II provided stronger evidence of familial aggregation after adjustment. Specifically, the familial risk in case families was now estimated to be 2.23 (= e^0.80) times that in control families. To test for gene–environment interaction under this family case–control framework, one could simply add to Eq. 15 interaction terms between the covariates (x) and z. A third model (Model III) tested for possible interaction between z and smoking status. While not statistically significant, the results suggest the effect of smoking is considerably higher among case families (OR = e^(0.91 + 0.14) = 2.86; 95% CI: 0.67, 12.13) than among control families (OR = e^0.14 = 1.15; 95% CI: 0.46, 2.89).

1.2.4. Searching for Homogeneous Subgroups
Family case–control designs are particularly useful for identifying subgroups that may be etiologically distinct, especially if the observed variables for subtyping are clinical characteristics associated with disease. Examples of subgrouping in studies include age at onset (early versus late) for breast cancer (29, 30) and schizophrenia (31). A more detailed example of subgrouping is a matched family case–control study of patients diagnosed with congenital cardiovascular malformation (CCVM), where probands were categorized by the presence or absence of “flow lesion” defects (39). To account for the difference between probands with or without flow lesions when analyzing this dataset, we can expand the logistic regression model in Eq. 15 by allowing multiple z's, i.e.,

logit Pr(Yj = 1) = α + β^t xj + γ1 z1 + · · · + γC zC,     (16)
where C represents the number of subgroups among all case probands and controls serve as the reference group. For the CCVM example, there would be two contrast variables,
z1 = 1 if the case proband has a flow lesion, 0 otherwise;
z2 = 1 if the case proband does not have a flow lesion, 0 otherwise;
to characterize three groups: case probands with flow lesions (z1 = 1, z2 = 0), case probands without flow lesions (z1 = 0, z2 = 1), and controls (z1 = 0, z2 = 0). This expanded model, in which the z's identify subgroups having different patterns of familial risk, could be quite useful, and presumably families with higher familial aggregation could be targeted first for further investigation. We now illustrate the use of the model in Eq. 16 and the statistical method developed by Liang (33) on data from the matched family case–control study of CCVM (39) mentioned above. Here each of 570 cases having one or more full sibs (363 with flow lesion defects and 207 without) was matched with a control born within the same 30-day period and also having at least one full sib. Among 1,963 case relatives (1,140 parents and 823 siblings), 41 were diagnosed with CCVM, whereas only 10 of 1,946 control relatives (1,140 parents and 806 siblings) were affected. While these data revealed strong evidence of familial aggregation overall (2.1% versus 0.5%), this ad hoc comparison ignores the matching aspect of the study design, and individual risk factors such as gender were not considered. After adjusting for race, gender, and relationship to the proband (parent versus sibling) and using the GEE method proposed by Liang (33), Maestri et al. (39) found an estimated γ1 = 1.405 (s.e. = 0.424), corresponding to a familial risk among case families about 4.1 (= e^1.405) times that among relatives of controls, again showing strong evidence of familial aggregation. To further test the hypothesis that familial risks in families of flow lesion and nonflow lesion case probands differ, Maestri et al. (39) considered the model contrasting these two subtypes of CCVM and found γ̂1 = 1.698 (s.e. = 0.498) and γ̂2 = 0.330 (s.e. = 0.765).
These estimates suggest that the familial risk among relatives of flow lesion cases is stronger than among relatives of cases with other types of CCVM, and that there may be etiologic heterogeneity between these two types of CCVM, because

(γ̂1 − γ̂2)^2 / Var(γ̂1 − γ̂2) = 4.94,

corresponding to a statistically significant difference at the 0.05 level.
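The Wald-type arithmetic behind such conclusions is simple to sketch. Using the reported overall estimate γ̂1 = 1.405 (s.e. = 0.424) from the single-subgroup model, a Python illustration of the test statistic, the implied familial risk ratio, and a 95% confidence interval is given below (this is a sketch of standard Wald arithmetic applied to the published numbers, not a reanalysis of the data; note that the heterogeneity statistic above additionally requires Cov(γ̂1, γ̂2), which cannot be recovered from the two standard errors alone):

```python
import math

# Reported GEE estimates from the CCVM example (Maestri et al. (39)).
g1, se1 = 1.405, 0.424   # overall familial-aggregation model

# Wald test of H0: gamma1 = 0, and the implied familial risk ratio.
z = g1 / se1
risk_ratio = math.exp(g1)
ci = (math.exp(g1 - 1.96 * se1), math.exp(g1 + 1.96 * se1))
print(round(z, 2), round(risk_ratio, 2), [round(x, 2) for x in ci])
```

The statistic z ≈ 3.3 comfortably exceeds 1.96, consistent with the chapter's conclusion of strong familial aggregation.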
1.2.5. Estimating Risk of Recurrence Among First-Degree Family Members
Using family case–control designs, it is possible to quantify familial aggregation of disease and to estimate how much higher the burden of disease may be in families of affected persons compared to that expected (1) among the family members of individuals without disease [controls] (familial relative risks) or (2) among individuals sampled from the population at random (familial recurrence risks) (40). How much of this excess disease occurs among the families of affected persons compared to the families of unaffected persons can be determined from a traditional Poisson regression approach, estimating the relative risk of disease, where the presence or absence of disease among index individuals (cases and controls) is taken to be the presence/absence of an exposure of interest. This expected relative risk (RR) can be written as:

Sibling RR = Pr(sib2 D | sib1 D) / Pr(sib2 D | sib1 D̄).     (17)

If the disease of interest is observed with frequency p in the population, it is possible to compute the numerator and denominator in Eq. 17 above using conditional probabilities defined as follows:

Pr(sib2 D | sib1 D) = Pr(sib2 D and sib1 D) / p,     (18)

Pr(sib2 D | sib1 D̄) = Pr(sib2 D and sib1 D̄) / (1 − p).     (19)

The joint probabilities of disease in Eqs. 18 and 19 can be computed using the Uij matrix shown in Figs. 3 and 4, where Pr(D)i and Pr(D)j are the exposure-specific risks for siblings with exposure categories i and j:

Pr(sib2 D and sib1 D) = Σi Σj Uij Pr(D)j Pr(D)i,     (20)

Pr(sib2 D and sib1 D̄) = Σi Σj Uij Pr(D)j [1 − Pr(D)i],     (21)

where the sums run over the exposure categories i = 1, 2 and j = 1, 2. With this information, we can estimate the most common measure of familial recurrence risk, the sibling recurrence risk (λs):

Sibling recurrence risk (λs) = [Σi Σj Uij Pr(D)i Pr(D)j] / p^2.     (22)
Likewise, a more thorough characterization of the sibling relative risk is derived as:

Sibling RR = [Σi Σj Uij Pr(D)j Pr(D)i / p] / [Σi Σj Uij Pr(D)j (1 − Pr(D)i) / (1 − p)].     (23)
                        Sibling 1
Sibling 2               Exposed (i = 1)       Nonexposed (i = 2)       Total
Exposed (j = 1)         f[(1 − c)f + c]       f(1 − c)(1 − f)          f
Nonexposed (j = 2)      f(1 − c)(1 − f)       (1 − f)[1 − (1 − c)f]    1 − f
Total                   f                     1 − f                    1

* Uij is the relative frequency with which sibling 1 and sibling 2 have exposure status i and j, respectively.
** f is the probability of exposure to the factor in the population (in this instance, the population prevalence of the disease of interest); c is the correlation in exposure (in this case, disease status) between two siblings.

Fig. 3. Sibling–sibling probability matrix (Uij)* for exposure** to a risk factor of interest, in this case the presence/absence of disease among index individuals (cases and controls) (adapted from Khoury et al. (40)).
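Equations 20–23 and the Fig. 3 matrix can be combined in a few lines of code. The sketch below (in Python; the chapter's own software examples use R) builds the Uij matrix from illustrative values of f and c, and, for the special case where the "exposure" is the disease itself (so Pr(D) is 1 for exposed and 0 for nonexposed siblings), evaluates λs and the sibling RR. The values f = 0.04 and c = 0.40 are hypothetical:

```python
# Sketch of the Fig. 3 matrix and Eqs. 20-23, with illustrative values
# of f (population prevalence of the "exposure", here disease itself)
# and c (sibling-sibling correlation in exposure). Values hypothetical.
f, c = 0.04, 0.40

# U[(i, j)]: relative frequency of sibling 1 in exposure category i and
# sibling 2 in category j (1 = exposed, 2 = nonexposed), per Fig. 3.
U = {
    (1, 1): f * ((1 - c) * f + c),
    (1, 2): f * (1 - c) * (1 - f),
    (2, 1): f * (1 - c) * (1 - f),
    (2, 2): (1 - f) * (1 - (1 - c) * f),
}
assert abs(sum(U.values()) - 1.0) < 1e-12  # cells sum to 1

# Exposure-specific disease risks Pr(D)_i: when the "exposure" is the
# disease itself, these are 1 (exposed) and 0 (nonexposed).
prD = {1: 1.0, 2: 0.0}

p = sum(U[i, j] * prD[i] for i in (1, 2) for j in (1, 2))  # equals f here
joint_DD = sum(U[i, j] * prD[j] * prD[i]
               for i in (1, 2) for j in (1, 2))            # Eq. 20
joint_DnotD = sum(U[i, j] * prD[j] * (1 - prD[i])
                  for i in (1, 2) for j in (1, 2))         # Eq. 21

lam_s = joint_DD / p**2                                    # Eq. 22
sib_rr = (joint_DD / p) / (joint_DnotD / (1 - p))          # Eq. 23
print(round(lam_s, 2), round(sib_rr, 2))
```

With these hypothetical inputs the marginal probability recovered from the matrix equals f, as the row and column totals in Fig. 3 require.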
Fig. 4. Various probabilities of disease in two siblings according to their exposure status (adapted from Khoury et al. (40)).
We consider an application of estimating the sibling RR and λs in a study of congenital amusia (commonly known as “tone deafness”) (3), conducted on 71 members of 9 large families of amusic probands and 75 members of 10 control families. As shown in Table 6, using data reported by probands and controls under the assumption of single ascertainment, 9 of 21 tested siblings of case probands were amusic, whereas 2 of 22 control siblings were amusic. The investigators cite a population prevalence (p) of amusia of 4% (in Fig. 3, this corresponds to f = 0.04). Therefore, we can estimate the probability that a sibling is affected given an affected proband as Pr(sib2 D | sib1 D) = 9/21 = 0.43 (in Fig. 3, this corresponds approximately to c = 0.43), which with p = 0.04 gives λs = 10.8. Likewise, Pr(sib2 D | sib1 D̄) = 2/22 = 0.09, which is more than twice the population prevalence of p = 0.04. With this information, the sibling relative risk is estimated as RR = 4.73, a more conservative measure of familial aggregation than the sibling recurrence risk.

1.2.6. Assessing the Standardized Incidence Ratio
Alternative approaches are often necessary for studying disease incidence and relative risks among families when familial case–control studies are not feasible. One such method is the standardized incidence ratio (SIR; for mortality data, the standardized mortality ratio, SMR). SIRs can be used to analyze events and person-time at risk in a longitudinal cohort study in which individuals may be followed over a long and variable length of time and a range of ages (41). SIRs can be calculated as the ratio of the observed to the
Table 6 Amusia in siblings of amusic probands and in siblings of controls (excluding probands) (adapted from Peretz et al. (3))

Determined                 n      Affected    Unaffected    Unknown*
Siblings of probands
  By report                57     39          15            3
  By test                  21     9           12            ...
Siblings of controls
  By report                52     22          41            9
  By test                  22     2           20            ...

*When the proband was unsure whether a relative was amusic, the relative was classified as “unknown”
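The λs and sibling RR estimates for the amusia example follow directly from the tested-sibling counts in Table 6. A short Python sketch of the arithmetic (the chapter's software examples use R; exact fractions give 10.7 and 4.71, versus the chapter's 10.8 and 4.73, which round the proportions to 0.43 and 0.09 first):

```python
# Sibling recurrence risk and sibling RR from the tested-sibling
# counts in Table 6 (Peretz et al. (3)).
p = 0.04                     # cited population prevalence of amusia
pr_d_given_aff = 9 / 21      # Pr(sib2 D | sib1 D): case siblings
pr_d_given_unaff = 2 / 22    # Pr(sib2 D | sib1 not D): control siblings

lam_s = pr_d_given_aff / p                   # sibling recurrence risk
sib_rr = pr_d_given_aff / pr_d_given_unaff   # sibling relative risk
print(round(lam_s, 1), round(sib_rr, 2))
```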
Fig. 5. Application of the standardized incidence ratio (SIR) methods in a cohort study (adapted from Thomas (41)).
expected number of cases, and inferences can be made based on the assumption of an underlying Poisson distribution (for the estimation of SIRs in statistical programs, see Subheading 2.4). The standardized RR is defined as the ratio of SIRs between exposure categories. Thomas (41) explains the calculation of the SIR by tabulating each individual's time at risk over a two-dimensional array of ages and calendar years, to obtain the total person-time Tzs in each age-year stratum s and exposure category z. The number of expected cases in exposure category z, Ez, can be estimated by summing the product of the age-year-specific incidence rate λs and the total person-time Tzs across all age-year strata. Comparing the number of observed cases, Yz, to the number expected, Ez, yields the SIR within each exposure category (SIRz). These estimations are illustrated in Fig. 5. A major advantage of estimating SIRs within cohorts is that population incidence rates can be used instead of examining controls as a proxy for the larger population (41).
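The observed-over-expected calculation just described can be sketched in a few lines. The following Python illustration (the chapter's software examples use R; all rates, person-time values, and the observed count below are made up for illustration) computes Ez by summing stratum-specific rate × person-time products and then forms the SIR:

```python
# Sketch of the SIR calculation described by Thomas (41), using
# made-up person-time and reference rates (all numbers hypothetical).
# Strata s are age(-year) cells; z indexes a single exposure category.

# Reference (population) incidence rates per person-year, by stratum s.
rates = {"40-49": 0.001, "50-59": 0.004, "60-69": 0.010}

# Person-time T_zs accumulated by the cohort in each stratum.
person_time = {"40-49": 1200.0, "50-59": 800.0, "60-69": 500.0}

observed = 18  # observed number of cases, Y_z (hypothetical)

# Expected cases: E_z = sum over strata of lambda_s * T_zs.
expected = sum(rates[s] * person_time[s] for s in rates)
sir = observed / expected
print(round(expected, 1), round(sir, 2))
```

An SIR above 1 indicates more cases than the population rates predict for this amount of person-time at these ages.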
Bratt and colleagues (42) estimated SIRs in their study of the nationwide population-based Prostate Cancer Database Sweden (PCBaSe Sweden), which includes multiple Register cohorts and the Census database. The expected numbers of cases were derived from the corresponding age- and time-specific annual incidence among all patients registered in the National Prostate Cancer Register, comprising 98% of all diagnosed prostate cancer patients in Sweden during the studied time period (42). SIRs and their 95% confidence intervals (CIs) were calculated by dividing the total number of observed cases by the number of expected cases, as explained in the previous paragraph. The observed counts were assumed to follow a Poisson distribution, and Byar's normal approximation was used (42). The investigators found higher incidences of prostate cancer among brothers of probands (FH+) than among men of the same age in the general Swedish population (SIR = 3.1, 95% CI: 2.9, 3.3), and the incidences were higher still for men with both a father and a brother with prostate cancer (SIR = 5.3, 95% CI: 4.6, 6.0). Incidence was highest among men with two brothers diagnosed with prostate cancer (SIR = 11, 95% CI: 8.7, 14). In Subheading 2.5, we outline the functions available in SAS and Stata programs for calculating SIRs.

1.3. Familial Aggregation of Quantitative Traits: Measuring Correlations of Trait Levels in Families
An important advantage of the family case–control design is its ability to measure familial correlations, in terms of both their magnitude and their pattern. Intuitively, the larger the magnitude of familial correlation, the stronger the evidence for some genetic basis of the disease. This information about the magnitude of familial risk has a significant bearing on the statistical power available to locate or map potential causal genes. Furthermore, examining the patterns of familial aggregation closely allows investigators to further tease apart the roles played by genetic and environmental factors. For example, with phenotype data from first-degree relatives of probands, higher correlations among siblings than between parents and offspring could suggest a dominance effect of unobserved gene(s). On the other hand, a higher correlation between mothers and offspring than between fathers and offspring provides a clue that the underlying genetic mechanism is not simply autosomal. Thus, while examining familial correlations remains an exploratory exercise in which assumptions about genetic mechanisms are typically held to a minimum, it still provides invaluable information and clues concerning the underlying genetic mechanisms and their possible interaction with environmental factors. For quantitative traits observed on pairs of relatives (e.g., twins), such as cholesterol level or birth weight, Y1 and Y2 say, the most commonly used measure of familial correlation is
Pearson’s product moment correlation coefficient. This interclass correlation is defined as:

r = Cov(Y1, Y2) / [Var(Y1) Var(Y2)]^(1/2).     (24)
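Both the interclass (pairwise) correlation of Eq. 24 and the intraclass correlation discussed next can be computed from first principles. The Python sketch below implements Pearson's coefficient for relative pairs and the standard one-way ANOVA estimator of the intraclass correlation for sibships of equal size (the data are made-up illustrative values; the chapter's software examples use R):

```python
# Interclass (Pearson) correlation for relative pairs, and intraclass
# correlation for sibships of equal size, on a small made-up dataset.

def pearson(pairs):
    # Pearson product-moment correlation (Eq. 24) for (Y1, Y2) pairs.
    n = len(pairs)
    m1 = sum(a for a, _ in pairs) / n
    m2 = sum(b for _, b in pairs) / n
    cov = sum((a - m1) * (b - m2) for a, b in pairs) / n
    v1 = sum((a - m1) ** 2 for a, _ in pairs) / n
    v2 = sum((b - m2) ** 2 for _, b in pairs) / n
    return cov / (v1 * v2) ** 0.5

def intraclass(sibships):
    # One-way ANOVA estimator built from among- and within-sibship
    # mean squares, for sibships of common size k.
    g = len(sibships)
    k = len(sibships[0])
    grand = sum(sum(s) for s in sibships) / (g * k)
    msb = k * sum((sum(s) / k - grand) ** 2 for s in sibships) / (g - 1)
    msw = sum(sum((y - sum(s) / k) ** 2 for y in s) for s in sibships) / (g * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

twin_pairs = [(5.1, 5.0), (6.2, 6.0), (4.8, 5.2), (7.0, 6.7)]
sibs = [[1.0, 2.0], [3.0, 5.0], [8.0, 9.0]]
print(round(pearson(twin_pairs), 3), round(intraclass(sibs), 3))
```

The intraclass estimator is the among-sibship share of the total variance, matching the definition in the text.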
For designs involving twins only, this correlation also provides a direct estimate of heritability, the proportion of the total variation attributable to genetic variance at one or more autosomal loci. For sets of three or more relatives, such as sibships, the intraclass correlation (ri) can be used. This is defined as the ratio of the among-sibship variance to the total variance (i.e., the sum of the within-sibship and among-sibship variances), and may be viewed as the average correlation over all possible pairs of sibs. Clearly, evidence of familial resemblance is stronger for higher ri, but observed correlations may or may not reflect the actions of genes. Careful design and modeling of r is still necessary to provide clues about the roles of genetic factors. For a family of arbitrary size n, a general model can be constructed for rjk, the correlation coefficient for a quantitative trait Y in the jth and kth relatives, j < k = 1, . . ., n:

logit [(1 + rjk)/2] = α0 + α^t zjk,     (25)

where zjk is a set of q covariates which could be specific to the (j, k) pair, specific to the family, or some combination of both (43). The transformation on the left-hand side of Eq. 25, logit [(1 + r)/2], ensures that this measure ranges over the whole set of real numbers.

1.3.1. Measuring Correlations in Nuclear Families
For designs including nuclear families, which consist of parents and offspring, several different correlations should be considered:

rjk = rSS, rPS, or rPP,     (26)
For designs involving siblings only, one can readily test the hypothesis that a constant within-family correlation is associated with observable family-specific covariates, for example, race. For example, let zjk = 1 if the family is white and 0 if the family is black, and then fit separate familial correlation coefficients. This approach would also allow
8 Detecting Familial Aggregation
investigators to identify variables reflecting heterogeneity among subgroups of families.
1.3.3. Adjusting for Additional Risk Factors When Measuring Correlations with Sibling Pairs
While a constant sibling correlation does provide one measure of heritability, one can further examine risk factors that may influence the sibship correlation using Eq. 24. For traits such as birth weight, one can, for example, model rjk as a function of the time between the two pregnancies of sibs j and k. To make inference on the rs, one can take the likelihood approach by assuming Y = (Y1, . . ., Yn) from a sibship of size n follows a multivariate normal distribution with covariance matrix consistent with the rs that are specified. An alternative is to take the estimating function approach (38), which may be viewed as a multivariate analog of the quasi-likelihood method of Wedderburn (44). In either case, it is imperative that important risk factors for the phenotype Ys be properly considered. This can be achieved by modeling
E(Yj | xj) = xjᵀb, (27)
where xj is the set of observed covariates for the jth relative, j = 1, . . ., n. For applications of the estimating function approach and use of Eq. 27 to measure familial correlations of birth weight, see Beaty et al. (43).
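As a quick illustration, the transformation on the left-hand side of Eq. 25 can be evaluated with base R's qlogis() (the logit function); the correlations used here are arbitrary illustrative values, not estimates from any study.

```r
# logit[(1 + r)/2] maps a correlation r in (-1, 1) onto the whole real line
r_to_z <- function(r) qlogis((1 + r) / 2)

r_to_z(0)     # a zero correlation maps to 0
r_to_z(0.5)   # positive correlations map to positive values
r_to_z(-0.5)  # negative correlations map to negative values
```

The transform is symmetric about zero and diverges as r approaches plus or minus 1, which is why the linear model in Eq. 25 is specified on this scale rather than on r itself.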
2. Methods
Here we illustrate how to conduct some of the analyses described in Subheading 1 using the statistical software package R (http://cran.r-project.org). These calculations may also be performed using alternative software packages, such as SAS or Stata.
2.1. Calculating the Odds Ratio Described in Subheading 1
Consider the data given in Table 2. The calculations described in Subheading 1 can be done in R as follows. We can use the “oddsratio” function available in the “epitools” library, which can be implemented as follows:
1. Download and install the “epitools” library in R.
install.packages("epitools")
library(epitools)
2. Input the data into a 2 × 2 matrix for analysis.
disease.fh.data <- cbind(c(24,49), c(33,397))
3. Calculate the odds ratio.
A.C. Naj et al.
my.result <- oddsratio(disease.fh.data)
4. Results are organized as attributes of the object “my.result.” To output the results to the screen, print the desired attributes, in this case “$measure.”
print(my.result$measure)
The output of the print command is given below:
This output can be interpreted as follows. By default, R designates the second row of disease.fh.data (corresponding to “no family history”) as the baseline group for calculating the odds ratio. This category is also designated “Exposed1” by default and, in the output, its lower and upper confidence limits are given as “NA” (i.e., not applicable). The estimated odds ratio is 5.86 with 95% confidence interval (3.18, 10.75). These are slightly (but not much) different from those reported in Subheading 1, possibly due to the computational method used in the “oddsratio” function. In SAS, the PROC FREQ procedure can be used to estimate the odds ratio (OR) from data in a 2 × 2 table. Similarly, in Stata, the “cc” command estimates odds ratios and confidence intervals if exposure and outcome data are encoded for each individual. However, if only the numbers of exposed and unexposed among cases and controls are available, then an “immediate” form of the “cc” command, “cci,” can be used.
2.2. Fitting a Logistic Regression Model in R
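For reference, the Wald (asymptotic) odds ratio and its confidence interval can also be computed by hand from the 2 × 2 counts using the standard cross-product formula; this is a base-R sketch, not part of the “epitools” output, and the logistic regression fit described in this subheading reproduces the same values.

```r
# Counts from Table 2: cases and controls with/without family history
n.case.fh <- 24; n.case.nofh <- 49
n.ctrl.fh <- 33; n.ctrl.nofh <- 397

# Cross-product odds ratio and Wald 95% confidence interval
or <- (n.case.fh * n.ctrl.nofh) / (n.case.nofh * n.ctrl.fh)
se.logor <- sqrt(1/n.case.fh + 1/n.case.nofh + 1/n.ctrl.fh + 1/n.ctrl.nofh)
ci <- exp(log(or) + c(-1, 1) * 1.96 * se.logor)

round(or, 2)  # 5.89
round(ci, 2)  # 3.22 10.78
```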
The logistic regression model can be fitted using the “glm()” function with the family = "binomial" argument. We use the data given in Table 2 for illustration. First, we create the disease and family history data in R as follows. The 73 (= 24 + 49) cases are given outcome status “1” and the 430 (= 33 + 397) controls are given outcome status “0.” Similarly, family history is coded as “1” if FH+ or “0” if FH−.
1. Input the individual data into two 503-length vectors for analysis. The first vector, “disease,” encodes 73 people with disease (value “1”) and 430 individuals without disease (value “0”). The second vector, “fh,” encodes the first 24 people as exposed (value “1”), the next 49 as unexposed (value “0”), the next 33 as exposed (“1”), and the last 397 as unexposed (“0”).
disease <- c(rep(1, 24 + 49), rep(0, 33 + 397))
fh <- c(rep(1,24), rep(0,49), rep(1,33), rep(0,397))
2. Using the “glm” function, we fit a logistic regression model.
fit.model <- glm(disease ~ fh, family = "binomial")
As results have been put into the object “fit.model,” we can examine the results by using the “summary()” function, and examining the attribute “$coeff” in the resulting object.
summary(fit.model)$coeff
The summary of the above fit is shown here:
The parameter estimates are given on the log scale. To get the odds ratio, we exponentiate these estimates. 3. Calculate the odds ratios and the confidence intervals.
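The commands for this step are not reproduced above; one way to carry it out is a sketch using base R's coef() and confint.default() (the latter gives Wald intervals), shown here as a self-contained example that recreates the data and model from steps 1–2.

```r
# Recreate the data and model from steps 1-2
disease <- c(rep(1, 24 + 49), rep(0, 33 + 397))
fh <- c(rep(1, 24), rep(0, 49), rep(1, 33), rep(0, 397))
fit.model <- glm(disease ~ fh, family = "binomial")

# Exponentiate the log-scale estimate and its Wald confidence limits
exp(coef(fit.model)["fh"])               # odds ratio, about 5.89
exp(confint.default(fit.model)["fh", ])  # 95% CI, about (3.22, 10.78)
```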
These commands give an odds ratio of 5.89 with 95% confidence interval (3.22, 10.78), which are equal to the results reported in Subheading 1. Suppose we have a set of confounders, denoted “conf,” which is a matrix having the number of rows equal to the number of cases and controls, and the number of columns equal to the number of confounders. Then, we can pursue the same calculations with the following command in R:
fit.model <- glm(disease ~ fh + conf, family = "binomial")
The odds ratio for family history can be calculated using the commands given above. These odds ratios are now adjusted for the confounders. These computations can be performed in SAS using the PROC LOGISTIC command. In Stata, the “logistic” or “logit” commands can be used.
2.3. Including Interaction Terms in the R Model
Equation 12 shows a logistic regression model involving an interaction term. This model can be fitted in R by extending the “glm()” function used in Subheading 2.2. Consider the data given in Table 4. Let “disease,” “gene,” and “smoking” denote the disease status, genetic data, and maternal smoking data, respectively. Furthermore, let “age” denote the maternal age data (not shown in Table 4). The interaction term, adjusting for maternal age, can be incorporated as follows:
fit.model <- glm(disease ~ age + gene*smoking, family = "binomial")
The object “summary(fit.model)$coeff” contains the parameter estimates (on the log scale) in the first column, which can be used to calculate the odds ratios as given by Eq. 13. The standard errors of these log-scale parameters are given in the second column.
2.4. Fitting a Polytomous Logistic Regression Model Using R
The “multinom()” function in the “nnet” package in R can be used to fit a polytomous logistic regression model (see Note 1). We illustrate how to analyze the data given in Table 5 using this function. The results of our analysis presented here will be different from the results shown at the end of Subheading 1. This is because the results given in Subheading 1 incorporate maternal smoking, maternal age, and interaction terms. Here we provide an illustration using the disease outcome and genotype data alone.
1. Call the “nnet” library. This library is generally available by default in R. However, if it is not available, it can be installed using the “install.packages()” command.
library(nnet)
2. Set up the disease data. We have three outcomes, and we will code these as “1” for no disease (166 individuals), “2” for CLP (209 individuals), and “3” for CP (124 individuals).
disease <- c(rep(1,166), rep(2,209), rep(3,124))
3. Set up the genotype data.
gene <- c(rep(1,24), rep(0,142), rep(1,32), rep(0,177), rep(1,27), rep(0,97))
4. Call the “multinom()” function.
fit.model <- multinom(disease ~ gene)
5. The results can be obtained using the “summary” command.
summary(fit.model)
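As a quick cross-check on the fit: because the model has a single binary predictor, each multinomial odds ratio equals a cross-product ratio computed directly from the raw counts in Table 5 against the baseline outcome. A base-R sketch:

```r
# gene = 1 and gene = 0 counts within each outcome group (Table 5)
no.disease <- c(24, 142)  # baseline outcome
clp <- c(32, 177)
cp <- c(27, 97)

# Odds of carrying the genotype in each group, relative to the baseline odds
or.clp <- (clp[1] / clp[2]) / (no.disease[1] / no.disease[2])
or.cp <- (cp[1] / cp[2]) / (no.disease[1] / no.disease[2])

round(or.clp, 2)  # 1.07, matching the exponentiated CLP coefficient
round(or.cp, 2)   # 1.65, matching the exponentiated CP coefficient
```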
The output of this command is as follows:
The odds ratios (on the log scale) can be found under “Coefficients,” and the standard errors of the parameters, also on the log scale, are under “Std. Errors.” By default, R assigns disease value “1” as the baseline outcome, and gene = 0 as the baseline genetic exposure variable. The odds ratio corresponding to CLP is exp(0.06735737) = 1.07. The 95% confidence interval is exp(0.06735737 ± 1.96 × 0.2925894) = (0.60, 1.90). The odds ratio corresponding to CP is exp(0.49889519) = 1.65, and the 95% confidence interval is (0.90, 3.02). In SAS, we can use the PROC CATMOD command to conduct a polytomous logistic regression analysis. In Stata, the “mlogit” command can be used.
2.5. Other Computations
Here we briefly summarize the software commands that are available for various other computations explained in Subheading 1. Most commonly used statistical software packages have one or more approaches for applying the CMH procedure. In SAS, CMH statistics are available as an option of the PROC FREQ procedure. Stata has several different implementations, including the “mhodds” command for examination of data if exposure and outcome data are encoded for individuals. If survival time data are being examined, and rate ratios are available for estimation, Stata can apply the CMH procedure via the command “stmh” as long as the nature of the survival time data has been prespecified using other “st” commands. R implements the CMH procedure through the “cmh.test()” function. GEE approaches have been thoroughly implemented in SAS, Stata, and R. In SAS, the analysis of data with correlated observations can be performed using PROC GENMOD, which enables GEE analysis to be performed in conjunction with specifications of clustering information and a working correlation matrix through the use of a REPEATED statement. In Stata, the “xtgee” command, part of the subset of “xt” commands for panel data, fits population-averaged models to data on correlated observations using GEE. The “gee” (Generalized Estimating Equation solver) package can be used in R to fit models using the “gee()” function. To use this function, the “gee” package can be installed using the “install.packages()” and “library()” commands illustrated in Subheading 2.1. This package extends the flexibility of the existing “glm” (Generalized Linear Models) function in R. SIRs (or, for mortality, SMRs) can be easily estimated using the SAS system with SMRFIT, for which extensive documentation and a design paper (45) are readily available.
In Stata, SMR/SIR estimation uses the “st” set of commands for examining person-time/survival-time data; more specifically, the command “stptime” can be used to produce SMRs and SIRs with proper specification of rate
variables and grouping variables (in the case of familial SIRs/SMRs, the specification of families or clusters of related individuals).
2.6. Concluding Remarks
While factors such as pedigree structure and shared environment introduce technical complexity into statistical analyses of family datasets, familial studies are essential in genetic epidemiology and statistical genetics to demonstrate and confirm the transmission of genetic risk factors, especially in investigations of less common diseases. Fundamentally, these analyses can demonstrate whether diseases of interest aggregate within families, and whether that aggregation may arise from shared environmental factors or from interactions between genetic and environmental factors. Given the complexity of family structures and of exposure to environmental factors within families, caution must be taken in familial aggregation studies, especially when matching case families and control families on the features of the probands alone. While familial aggregation answers the question “do diseases or disease-related traits tend to cluster in families?” another related question is: “how much of the susceptibility to disease (or variation in disease-related traits) might be accounted for by genetic factors?” This question is answered through studies that estimate heritability, the proportion of disease susceptibility or trait variation attributable to genetics. The complexities of heritability studies are explored in the next two chapters.
3. Note
1. Another approach for conducting a polytomous logistic regression analysis in R is to use the “mlogit” package. At the time of writing this chapter, installing the “mlogit” package using the “install.packages” command was not feasible when running R on Windows, since this gave an error message that the mlogit package was not available for installation using this command. Alternatively, one may download the package from the R web page (http://cran.r-project.org). We found the mlogit package somewhat tedious to use, and hence we have proposed using the “multinom” function in the “nnet” package in R.
References 1. del Junco D, et al (1984) The familial aggregation of rheumatoid arthritis and its relationship to the HLA-DR4 association. Am J Epidemiol 119: 813–829 2. Nestadt G, et al (2000) A family study of obsessive-compulsive disorder. Arch Gen Psychiatry 57: 358–363 3. Peretz I, Cummings S, Dube MP (2007) The genetics of congenital amusia (tone deafness):
a family-aggregation study. Am J Hum Genet 81: 582–588 4. Rice TK (2008) Familial resemblance and heritability. Adv Genet 60: 35–49 5. Visscher PM, Hill WG, Wray NR (2008) Heritability in the genomics era–concepts and misconceptions. Nat Rev Genet 9: 255–266 6. Naldi L, et al (2001) Family history of psoriasis, stressful life events, and recent infectious
disease are risk factors for a first episode of acute guttate psoriasis: results of a case-control study. J Am Acad Dermatol 44: 433–438 7. Mantel N, Haenszel W (1959) Statistical aspects of the analysis of data from retrospective studies of disease. J Natl Cancer Inst 22: 719–748 8. Ponz de Leon M, et al (1989) Familial aggregation of tumors in the three-year experience of a population-based colorectal cancer registry. Cancer Res 49: 4344–4348 9. Hopper JL, Hannah MC, Mathews JD (1984) Genetic Analysis Workshop II: pedigree analysis of a binary trait without assuming an underlying liability. Genet Epidemiol 1: 183–188 10. Connolly MA, Liang KY (1988) Conditional logistic regression models for correlated binary data. Biometrika 75: 501–506 11. Hopper JL, Derrick PL (1986) A log-linear model for binary pedigree data. Genet Epidemiol Suppl 1: 73–82 12. Khoury MJ, Beaty TH, Cohen BH (1993) Fundamentals of genetic epidemiology. Monographs in epidemiology and biostatistics v. 19. Oxford University Press: New York 13. Liang KY, Beaty TH (1991) Measuring familial aggregation by using odds-ratio regression models. Genet Epidemiol 8: 361–370 14. Bishop YMM, Fienberg SE, Holland PW (1975) Discrete multivariate analysis: theory and practice. MIT Press: Cambridge, MA 15. Cohen BH (1980) Chronic obstructive pulmonary disease: a challenge in genetic epidemiology. Am J Epidemiol 112: 274–288 16. Schwartz AG, Boehnke M, Moll PP (1988) Family risk index as a measure of familial heterogeneity of cancer risk. A population-based study in metropolitan Detroit. Am J Epidemiol 128: 524–535 17. Breslow NE, Day NE (1980) Statistical methods in cancer research. IARC scientific publications. International Agency for Research on Cancer: Lyon 18. Beaty TH, et al (1997) Testing for interaction between maternal smoking and TGFA genotype among oral cleft cases born in Maryland 1992–1996. Cleft Palate Craniofac J 34: 447–454 19.
Seibold MA, Schwartz DA (2010) The Lung: The Natural Boundary Between Nature And Nurture. Annu Rev Physiol 73: 457–478 20. Garantziotis S, Schwartz DA (2010) Ecogenomics of respiratory diseases of public health significance. Annu Rev Public Health. 31: 37–51.
21. Demeo DL, et al (2007) Determinants of airflow obstruction in severe alpha-1-antitrypsin deficiency. Thorax 62: 806–813 22. Pare G, et al (2010) On the use of variance per genotype as a tool to identify quantitative trait interaction effects: a report from the Women’s Genome Health Study. PLoS Genet 6: e1000981 23. Murcray CE, Lewinger JP, Gauderman WJ (2009) Gene-environment interaction in genome-wide association studies. Am J Epidemiol 169: 219–226. 24. Hwang SJ, et al (1994) Minimum sample size estimation to detect gene-environment interaction in case-control designs. Am J Epidemiol 140: 1029–1037 25. Foppa I, Spiegelman D (1997) Power and sample size calculations for case-control studies of gene-environment interactions with a polytomous exposure variable. Am J Epidemiol 146: 596–604 26. Garcia-Closas M, Lubin JH (1999) Power and sample size calculations in case-control studies of gene-environment interactions: comments on different approaches. Am J Epidemiol 149: 689–692 27. Sturmer T, Brenner H (2000) Potential gain in efficiency and power to detect gene-environment interactions by matching in case-control studies. Genet Epidemiol 18: 63–80 28. Gauderman, WJ (2002) Sample size requirements for association studies of gene-gene interaction. Am J Epidemiol 155: 478–484 29. Claus EB, Risch NJ, Thompson WD (1990) Age at onset as an indicator of familial risk of breast cancer. Am J Epidemiol 131: 961–972 30. Mettlin C, et al (1990) The association of age and familial risk in a case-control study of breast cancer. Am J Epidemiol 131: 973–983 31. Pulver AE, Liang KY (1991) Estimating effects of proband characteristics on familial risk: II. The association between age at onset and familial risk in the Maryland schizophrenia sample. Genet Epidemiol 8: 339–350 32. Liang KY, Zeger SL (1986) Longitudinal data analysis using generalized linear models. Biometrika 73: 13–22 33. 
Liang KY (1987) Extended Mantel-Haenszel estimating procedure for multivariate logistic regression models. Biometrics 43: 289–299 34. Liang KY, Pulver AE (1996) Analysis of casecontrol/family sampling design. Genet Epidemiol 13: 253–270
35. Hsu L, Zhao LP (1996) Assessing familial aggregation of age at onset, by using estimating equations, with application to breast cancer. Am J Hum Genet 58: 1057–1071 36. Li H, Yang P, Schwartz AG (1998) Analysis of age of onset data from case-control family studies. Biometrics 54: 1030–1039 37. Liang KY (1991) Estimating effects of probands’ characteristics on familial risk: I. Adjustment for censoring and correlated ages at onset. Genet Epidemiol 8: 329–338 38. Liang KY, Zeger SL (1993) Regression analysis for correlated data. Annu Rev Public Health 14: 43–68 39. Maestri NE, et al (1988) Assessing familial aggregation of congenital cardiovascular malformations in case-control studies. Genet Epidemiol 5: 343–354 40. Khoury MJ, Beaty TH, Liang KY (1988) Can familial aggregation of disease be explained by familial aggregation of environmental risk factors? Am J Epidemiol 127: 674–683 41. Thomas DC (2004) Statistical methods in genetic epidemiology. Oxford University Press: New York 42. Bratt O, et al (2010) Effects of prostate-specific antigen testing on familial prostate cancer risk estimates. J Natl Cancer Inst 102: 1336–1343 43. Beaty TH, et al (1997) Analyzing sibship correlations in birth weight using large sibships from Norway. Genet Epidemiol 14: 423–433 44. Wedderburn RWM (1974) Quasi-likelihood functions, generalized linear models, and the Gauss–Newton method. Biometrika 61: 439–447 45. Hanrahan LP, et al (1990) SMRFIT: a Statistical Analysis System (SAS) program for standardized mortality ratio analyses and Poisson regression model fits in community disease cluster investigations. Am J Epidemiol 132(Suppl): S116–122 46. Tokuhata GK, Lilienfeld AM (1963) Familial aggregation of lung cancer in humans. J Natl Cancer Inst 30: 289–312 47. Feinleib M, et al (1977) The NHLBI twin study of cardiovascular disease risk factors: methodology and summary of results. Am J Epidemiol 106: 284–285 48.
Khoury MJ, Erickson JD, James LM (1982) Etiologic heterogeneity of neural tube defects: clues from epidemiology. Am J Epidemiol 115: 538–548 49. Khoury MJ, Erickson JD, James LM (1982) Etiologic heterogeneity of neural tube defects.
II. Clues from family studies. Am J Hum Genet 34: 980–987 50. ten Kate LP, et al (1982) Familial aggregation of coronary heart disease and its relation to known genetic risk factors. Am J Cardiol 50: 945–953 51. Sattin RW, et al (1985) Family history and the risk of breast cancer. JAMA 253: 1908–1913 52. Nielsen HE, et al (1987) Risk factors and sib correlation in physiological neonatal jaundice. Acta Paediatr Scand 76: 504–511 53. Beaty TH, et al (1988) Effect of maternal and infant covariates on sibship correlation in birth weight. Genet Epidemiol 5: 241–253 54. Linet MS, et al (1989) Familial cancer history and chronic lymphocytic leukemia. A case-control study. Am J Epidemiol 130: 655–664 55. Lin JP, et al.(1998) Familial clustering of rheumatoid arthritis with other autoimmune diseases. Hum Genet 103: 475–482 56. Criswell LA, et al (2005) Analysis of families in the multiple autoimmune disease genetics consortium (MADGC) collection: the PTPN22 620W allele associates with multiple autoimmune phenotypes. Am J Hum Genet 76: 561–571 57. Beaty TH, et al (2006) Analysis of candidate genes on chromosome 2 in oral cleft case-parent trios from three populations. Hum Genet 120: 501–518 58. Raynor LA, et al (2009) Familial aggregation of age-related hearing loss in an epidemiological study of older adults. Am J Audiol 18: 114–118 59. Krogh C, et al (2010) Familial aggregation and heritability of pyloric stenosis. JAMA 303: 2393–2399 60. Xiong L, et al (2010) Family study of restless legs syndrome in Quebec, Canada: clinical characterization of 671 familial cases. Arch Neurol 67: 617–622 61. Saito YA, et al (2010) Familial aggregation of irritable bowel syndrome: a family case-control study. Am J Gastroenterol 105: 833–841 62. Lichtenstein P, et al (2010) The genetics of autism spectrum disorders and related neuropsychiatric disorders in childhood. Am J Psychiatry 167: 1357–1363 63. 
Petersen L, Andersen PK, Sorensen TI (2010) Genetic influences on incidence and case-fatality of infectious disease. PLoS One 5: e10603 64. Liang KY, Beaty TH. (2000) Statistical designs for familial aggregation. Stat Methods Med Res 9: 543–562
Chapter 9
Estimating Heritability from Twin Studies
Karin J.H. Verweij*, Miriam A. Mosing*, Brendan P. Zietsch, and Sarah E. Medland
Abstract
This chapter describes how the heritability of a trait can be estimated using data collected from pairs of twins. The principles of the classical twin design are described, followed by the assumptions and possible extensions of the design. In the second part of this chapter, two example scripts are presented and described, explaining the basic steps for estimating heritability using the statistical program OpenMx. OpenMx and the scripts used for this chapter can be downloaded so that readers can adapt and use the scripts for their own purposes.
Key words: Heritability, Behavior genetics, Twin, Mx, OpenMx, Quantitative genetics, Twin modeling, Classical twin design, Genetics, Environment
1. Introduction
Individuals vary considerably on physical, cognitive, and behavioral traits, such as height, intelligence, and personality. These individual differences may arise from variation in genes, environmental experiences, or a combination of both, and it is generally accepted that individual differences in most traits are due to both genetic and environmental factors. To the extent that a trait is influenced by genetic (heritable) effects, phenotypic similarity is correlated with genetic relatedness. In this chapter, we describe how this principle can be used to estimate the relative magnitude of genetic and environmental influences on trait variation using identical (monozygotic; MZ) and nonidentical (dizygotic; DZ) twin pairs. We first describe the principles of the classical twin design, followed by assumptions and extensions of the design. In the second half of the chapter, we provide a practical guide describing the basic steps for estimating heritability using the open-access program OpenMx.
*These authors contributed equally (equal first authorship)
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_9, © Springer Science+Business Media, LLC 2012
K.J.H. Verweij et al.
1.1. The Classical Twin Design
Twin studies make it possible to partition the observed variance of a trait into genetic, shared environmental, and residual (also known as unshared or unique environmental) variation. Additive genetic variance (A) denotes the variance resulting from the sum of allelic effects across multiple genes (nonadditive genetic influences will be discussed later in the chapter). Shared environmental variance (C) results from environmental influences shared by family members, such as prenatal environment, home environment, socioeconomic status, and residential area. Residual variance (E) results from environmental influences that are not shared by family members, such as idiosyncratic experiences (like illness and injury) and stochastic biological effects; it also includes differences in the perception and salience of events, as well as measurement error. These different variance components may be estimated using twin data because identical twins share all their genes, while nonidentical twins share on average half of their segregating genes. Accordingly, if MZ twins resemble each other more than DZ twins on a particular trait, this indicates the trait is partly influenced by genetic effects. If all the variance of a trait were due to genetic variance, we would expect a twin pair correlation of 1.0 for MZ twins and 0.5 for DZ twins. Both MZ and DZ twin pairs share environmental influences. If the shared environment (C) were the only source of variance, we would expect a twin pair correlation of 1.0 for both MZ and DZ pairs. E factors are unique to every individual, so if E were the only source of variance, we would expect a twin pair correlation of zero for both MZ and DZ twin pairs. In reality, the variance of a trait is generally due to a combination of A, C, and E influences, and this is reflected in the MZ and DZ correlations. The correlation between MZ twins (rMZ) can be summarized as A + C and the correlation between DZ twins (rDZ) as 0.5A + C.
Using the observed MZ and DZ twin pair correlations, it is possible to estimate the proportion of variance accounted for by A, C, and E using the formulas below (1):
rMZ = A + C
rDZ = 0.5A + C
Solving for A and C:
A = 2(rMZ − rDZ)
C = 2rDZ − rMZ
Since A + C + E = 1,
E = 1 − rMZ
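These formulas can be applied directly; the sketch below uses hypothetical twin correlations (rMZ = 0.8, rDZ = 0.5), not data from any real study.

```r
# Falconer-style decomposition from observed twin correlations (hypothetical)
rMZ <- 0.8
rDZ <- 0.5

A <- 2 * (rMZ - rDZ)  # additive genetic variance
C <- 2 * rDZ - rMZ    # shared environmental variance
E <- 1 - rMZ          # residual (unique environmental) variance

c(A = A, C = C, E = E)  # A = 0.6, C = 0.2, E = 0.2; the components sum to 1
```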
Fig. 1. Classical twin design; path diagram representing the resemblance between MZ or DZ twins for additive genetic influences (A), shared environmental influences (C), and non-shared environmental influences (E), for a given trait.
Rather than simply calculating the genetic and environmental variance components, behavioral geneticists typically employ structural equation modeling to more precisely estimate the combination of A, C, and E influences that best explains the observed data. This modeling can take into account covariate effects such as age and sex, can compare the fit of various types of models, and can provide confidence intervals for the estimates. The basic model is presented in Fig. 1. The boxes represent the observed variables (for twin 1 and twin 2), and the circles represent the latent variables (A, C, and E) that influence the observed variables. The double-headed arrows show the correlations between the corresponding latent factors. As explained above, for A this correlation is 1.0 for MZ twins and 0.5 for DZ twins, and for both MZ and DZ twins the correlation for C is 1 and for E is zero. Structural equation modeling of twin data is most commonly performed in the flexible matrix algebra program Mx (2). Recently, the program has been redeveloped as OpenMx to run within the R programming environment (see http://openmx.psyc.virginia.edu/). Mx employs maximum likelihood modeling procedures to determine what combination of A, C, and E best explains the observed data. Parameter estimates are derived through optimization—effectively, the optimizer searches the parameter space, comparing the observed and expected variance–covariance matrices under different parameter estimates until it reaches the optimum solution. The goodness-of-fit of a model to the observed data is summarized by a statistic distributed as χ2. The difference in fit of two models is also summarized by a statistic distributed as χ2 (i.e., Δχ2), and the appropriate degrees of freedom (i.e., Δdf) is equal to the difference in the number of estimated parameters in the two models. By testing Δχ2
with Δdf, we can determine whether dropping model parameters (e.g., reducing an ACE model to a CE model), or constraining parameters to be equal (e.g., equating the male and female A estimates), significantly worsens the model fit. This allows us to make inferences about the significance of parameters.
1.2. Assumptions of the Classical Twin Model
One of the key assumptions of the classical twin design is that trait-relevant environments are similar to the same extent in MZ and DZ twin pairs. If this were not the case—for example, if MZ twins were treated more similarly by their parents than DZ twins—this would falsely inflate the A estimate and deflate the C estimate. The “equal environment assumption” can be tested by comparing the similarity of trait-relevant environments between MZ and DZ twin pairs or by comparing the similarity of twin pairs who were mistaken or misinformed about their zygosity. These tests have shown that, in general, the assumption is valid (see, e.g., ref. 3). Another assumption is that DZ twins share on average 50% of their genes. This assumption is only valid if mating occurs randomly in the population and is violated if assortative mating (the tendency of individuals to select mates who are similar to themselves) or inbreeding is present. Note that, in the context of a complex trait, the assumption is not that individuals will share exactly the same genetic variants across loci, but rather that their overall genetic predisposition will be more similar. Assortative mating can increase the genetic similarity of DZ twins, which, if not taken into account, results in a higher C and lower A estimate. The effect of assortative mating can be accounted for by adding parents or spouses of twins to the model. A third assumption is that genetic influences are purely additive: nonadditive genetic influences are not taken into account in the standard classical twin design, although the model can be modified to estimate nonadditive genetic influences as well (see below). Nonadditive genetic effects include dominance (allelic interactions within genes) and epistasis (interaction between multiple genes).
Dominant genetic effects (D) predict a twin pair correlation of 1.0 for MZ twins and 0.25 for DZ twins, while epistasis predicts a MZ correlation of 1.0 and a DZ correlation between 0 and 0.25 (4, 5). Dominance and epistatic effects cannot be resolved in a classical twin study, so it is conventional to simply estimate dominance variance (D) specifying an expected MZ correlation of 1.0 and a DZ correlation of 0.25, and accept that this component also includes epistatic variance. It is also not possible to estimate C and D simultaneously when only using data from twins who were reared together, because C and D are confounded: C influences increase the DZ correlation relative to the MZ correlation, whereas D influences decrease the DZ correlation relative to the MZ correlation. The choice of an ACE or ADE model (i.e., a model that includes the components A, C, and E or A, D, and E ) depends on the pattern of MZ and DZ correlations; C is estimated
9 Estimating Heritability from Twin Studies
if the DZ twin correlation is more than half the MZ twin correlation, and D is estimated if the DZ twin correlation is less than half the MZ correlation. By extending the twin design with parents or children of twins, it is possible to estimate both C and D in the same model. A fourth assumption is that there is no interaction or correlation between genes and environment. Gene–environment correlation is present if individuals actively or passively expose themselves to different environments depending on their genotype, or when individuals' genotypes affect their social interactions or influence the responses they elicit from other individuals. For example, children who are genetically predisposed to be talented at sports are more likely to join a sports club than genetically untalented children, and genetically predisposed extroverted children are likely to have different social experiences than genetically introverted children. Environmental influences are then correlated with genetic predisposition. If not explicitly modeled, gene–environment correlations (rGE) between the latent A and E variables behave like additive genetic effects, whereas rGE between the latent A and C variables acts like C. Gene–environment interaction (GE) occurs when the expression of an individual's genotype depends on the environment. For example, Boomsma et al. (6) found that a religious upbringing reduced the expression of genetic factors on disinhibition. If gene–environment interaction is present, the relative contributions of genes and environment to the trait variance differ among individuals. If not explicitly modeled, GE will inflate the estimate of E when the environmental influence is unshared, or inflate the estimate of C when the environmental influence is shared between twins.

1.3. Extensions of the Classical Twin Model
Above we described the most basic form of the classical twin design—this model can be extended in various ways. Covariates can be included in the model, which estimates and accounts for the effects of possible confounders such as sex and age. With data for male and female MZ twin pairs, and male, female, and opposite-sex DZ twin pairs, it is possible to test for qualitative and quantitative sex differences in genetic and environmental effects. Quantitative differences can be investigated by estimating separate A, C, and E parameters for males and females. To model qualitative differences in genetic effects between sexes, the genetic correlation between DZ opposite-sex twins is estimated in the model, instead of being fixed at 0.5 (which is the genetic correlation of same-sex DZ twin pairs). If completely different genetic factors were influencing males and females, the genetic correlation between opposite-sex twins would be zero. In the same way, sex differences in the source of shared environmental influences can be investigated by estimating the shared environmental correlation instead of fixing it at 1.0. However, with only twins raised together in the sample, it is not possible to estimate both A and C oppositesex correlations in the same model.
K.J.H. Verweij et al.
The correlations between identical and nonidentical twin pairs are central to the twin design. These statistics assume normally distributed traits, but for ordinal or dichotomous variables a liability threshold model can be used to estimate the correlations (7). Threshold models assume that there is an underlying continuum of liability (e.g., to depression) that is normally distributed in the population, and that our measurement categories (e.g., depressed/not depressed) arise from one or more artificial divisions (thresholds) overlaying this normal distribution. Analyses are then effectively performed on the underlying liability to the trait, resulting in estimates of the heritability of the liability. As mentioned above, the classical twin design can be extended by including additional family members (siblings, parents, children, and spouses). Including extra family members increases statistical power and makes it possible to estimate more parameters and relax assumptions regarding mating and cultural transmission. For example, adding parents to the model makes it possible to simultaneously estimate C and D influences as well as effects of assortative mating, familial transmission, sibling environment, and the correlation between additive genetic effects and family environment (8). Adding data from nontwin siblings makes it possible to test for twin-specific environmental influences. Until now we have only discussed univariate analyses, in which the variance of a single trait is disentangled into genetic and environmental components. By including more than one dependent variable in a model, we can additionally partition the covariance between traits into that due to A, C (or D), and E, in the same way as we do for the variance of a single trait. Multivariate models can be used to test the extent to which the same genetic or environmental factors influence multiple traits.
For example, they can be used to answer questions such as: is there a correlation between novelty seeking and drug use and, if so, to what extent is this correlation due to overlapping genetic and/or environmental influences between the two variables? Figure 2 shows a bivariate Cholesky decomposition, the base model used for bivariate analyses (9). From this base, parameters can be dropped or equated to test specific hypotheses regarding those parameters. Multivariate twin modeling can be used to analyze numerous variables and to conduct exploratory and confirmatory factor analyses, as well as longitudinal and causal analyses.
Fig. 2. Bivariate Cholesky decomposition. Traits 1 and 2 are the two measured variables for each individual. A1 is a genetic factor that influences both traits 1 and 2 and A2 is a genetic factor that influences only trait 2. The same structure applies for the C and E factors.
2. Methods

In this section, two example scripts (preliminary analyses and ACE model fitting) are presented and described. To reduce the complexity of the scripts, the practicals below focus on univariate modeling, using height as the example variable. For the present examples, we use an example dataset included with the distribution of OpenMx. The dataset includes a variable labeled "zygosity," which classifies the twin pairs into one of five zygosity groups and further divides the sample into younger and older twins. For the younger subsample, the five zygosity groups are MZ females (1), MZ males (2), DZ females (3), DZ males (4), and DZ opposite-sex pairs (5); for the older subsample, the same zygosity groups are labeled 6–10 in the same order. For the present examples, we have only used male twin pairs from the young cohort (zygosity 2 and 4). In the first script (Subheading 2.1), we test assumptions regarding the data and gather important information about the dataset, which we use for the final ACE modeling. These preliminary analyses generally involve testing whether the means, variances, and covariances differ between subgroups; for example, are the means and variances of the MZ and DZ groups equivalent? Subsequently, in the second script, we estimate the relative magnitudes of the A, C, and E components of
the variance in height in males by fitting a univariate ACE model. The programs used for the following examples (OpenMx and R) can be downloaded from http://openmx.psyc.virginia.edu/installing-openmx and http://www.r-project.org/. OpenMx is not a standalone program but rather a package of functions that runs within the R environment. The scripts and dataset used in the following examples can be found at http://www.genepi.qimr.edu.au/staff/sarahMe/twinstudies.html. Below, we explain the individual steps within the scripts.

2.1. Preliminary Analyses

2.1.1. Setting Up the R Session
Once R is installed, open the program. OpenMx can be easily installed from within R by pasting the following code:

source('http://openmx.psyc.virginia.edu/getOpenMx.R')

Once OpenMx is installed, the library can be loaded by pasting the following code into the R window:

require(OpenMx)

In this practical, we will also be using the "psych" package to summarize the data. This can be loaded by typing:

require(psych)
2.1.2. Data Preparation
In this example, we will be using the "twinData" dataset that is supplied with the OpenMx program; this is loaded using the data(twinData) command. For details on how to prepare a new dataset for analysis in OpenMx, see Note 1. The describe command (output not shown) provides some useful information about the dataset, such as descriptive statistics and frequencies. In general, it is a good idea to check your data prior to analysis to ensure the data are complete and missing codes have been read correctly. Subsequently, the trait we are interested in (height (ht) in the present example) and the number of variables (nv) are specified. We create a list (called selVars) that contains the names of the two variables we will be analyzing in the current example: height of twin1 and height of twin2. This list is reported back to us by typing selVars. Then we rescale the data and increase the variance (to improve the optimization) by converting from meters to centimeters. Using the describe command again, we can check the new mean and find that height is now reported in centimeters. Subsequently, the number of twins per pair (two) is multiplied by the number of variables (nv) to create a new variable (ntv) that will be used in the analysis. Finally, two new datasets are created, one including all male MZ twin pairs (mzData; zygosity 2) and one including all male DZ twin pairs (dzData; zygosity 4).
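The steps just described can be sketched in R as follows. The object names (selVars, nv, ntv, mzData, dzData) follow the chapter's text; the zygosity column name and its coding follow the chapter's description and may differ from the dataset shipped with your OpenMx version, so treat this as an illustrative reconstruction rather than the exact online script.

```r
require(OpenMx)   # supplies the example twinData dataset
require(psych)    # supplies describe()

data(twinData)
describe(twinData)            # check completeness and missing codes

Vars    <- "ht"               # trait of interest: height
nv      <- 1                  # number of variables per twin
ntv     <- nv * 2             # number of variables times number of twins
selVars <- paste(Vars, c(rep(1, nv), rep(2, nv)), sep = "")
selVars                       # "ht1" "ht2"

# rescale from meters to centimeters to increase the variance
twinData[, selVars] <- twinData[, selVars] * 100
describe(twinData[, selVars]) # means now reported in centimeters

# young male MZ pairs (zygosity 2) and DZ pairs (zygosity 4)
mzData <- subset(twinData, zygosity == 2, selVars)
dzData <- subset(twinData, zygosity == 4, selVars)
```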
2.1.3. Saturated Model Fitting
After data preparation and specifying starting values for means, variances, and covariances (see Note 2), a univariate saturated model (in which all possible parameters are estimated) is fitted to estimate the means, variances, and covariances that best fit the observed data. The goodness-of-fit statistic of this saturated model is later compared to those of more reduced models in which certain parameters are dropped (i.e., fixed to equal zero) or equated. In this way, we can test whether the means, variances, and covariances of different subsamples are significantly different, or whether covariates (e.g., age) significantly influence the trait. The model (univTwinSatModel) used in the present example consists of two submodels, one for the MZ twin data (shown below) and one for the DZ twin data (omitted). To include more zygosity groups (e.g., female MZ and female DZ), we would increase the number of submodels. Using the mxMatrix command, we create a matrix that is free (i.e., parameters will be estimated as opposed to being fixed at specific values), symmetric (the upper triangle of the matrix is constrained to equal the lower triangle, which is estimated), and has two rows and two columns. This matrix is called expCovMZ and contains the expected variance–covariance matrix for height. We also create a full matrix (i.e., all elements are estimated) with one row (for one variable) and two columns (for two twins), containing the expected means for the height of twin1 and twin2 (expMeansMZ). The dataset is specified using the mxData command. With the mxFIMLObjective function, full-information maximum likelihood (FIML) is used to estimate all free parameters in the expected means (expMeansMZ) and covariance (expCovMZ) matrices, defined by the means and covariance arguments. To ease the testing of more restricted models at a later stage, we create some additional matrices.
We take the means for twin1 and twin2 (row one, column one and row one, column two) from the expMeansMZ matrix and name them expMeansMZt1 and expMeansMZt2, respectively. Subsequently, we take the values on the diagonal of the expCovMZ matrix (the variances for twin1
and twin2) and name them expVarMZt1 and expVarMZt2, respectively. Finally, we take the covariance (row two, column one) from the expCovMZ matrix and name it CovMZ. As mentioned above, the same is repeated for the DZ twins (not shown here). Please note that the punctuation in the scripts is very important (see Note 3).
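A sketch of the MZ submodel along the lines described above, in OpenMx 1.x syntax. The starting values are illustrative and the details may differ from the online script; the DZ submodel is constructed in the same way.

```r
univTwinSatMZ <- mxModel("MZ",
  # free symmetric 2x2 matrix of expected variances and covariances
  mxMatrix(type = "Symm", nrow = ntv, ncol = ntv, free = TRUE,
           values = c(90, 45, 90), name = "expCovMZ"),
  # free 1x2 matrix of expected means for twin1 and twin2
  mxMatrix(type = "Full", nrow = 1, ncol = ntv, free = TRUE,
           values = 175, name = "expMeansMZ"),
  mxData(observed = mzData, type = "raw"),
  mxFIMLObjective(covariance = "expCovMZ", means = "expMeansMZ",
                  dimnames = selVars),
  # extra algebra to ease later restrictions on means and (co)variances
  mxAlgebra(expMeansMZ[1, 1], name = "expMeansMZt1"),
  mxAlgebra(expMeansMZ[1, 2], name = "expMeansMZt2"),
  mxAlgebra(expCovMZ[1, 1],   name = "expVarMZt1"),
  mxAlgebra(expCovMZ[2, 2],   name = "expVarMZt2"),
  mxAlgebra(expCovMZ[2, 1],   name = "CovMZ")
)
```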
2.1.4. Running the Models and Generating Output
After specifying the saturated model for MZ and DZ twins as above, the mxAlgebra command is used to add the two objectives (estimating the fit of the submodels for MZ and DZ twins) together, naming the result min2sumll. min2sumll is then specified as the object for optimization using the mxAlgebraObjective command. Then we run the model and request summary statistics (univTwinSatSumm). Finally, the model output is generated, showing the parameter specifications, the expected means, variances, and covariances, and the model fit (see Note 4 for a screenshot of the output and a more detailed explanation). For example, the results show that the expected means for height were about 177 cm for both twins in the MZ and DZ groups.
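These combining and running steps could be sketched as follows, assuming univTwinSatModel already contains the MZ and DZ submodels; object names follow the chapter but are otherwise an assumption.

```r
univTwinSatModel <- mxModel(univTwinSatModel,
  # sum the -2 log-likelihoods of the two submodels
  mxAlgebra(MZ.objective + DZ.objective, name = "min2sumll"),
  # optimize the summed objective
  mxAlgebraObjective("min2sumll")
)

univTwinSatFit  <- mxRun(univTwinSatModel)
univTwinSatSumm <- summary(univTwinSatFit)
univTwinSatSumm   # parameter estimates, expected statistics, and model fit
```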
2.1.5. Assumption Testing (Mean Differences)
The fewer parameters we estimate in a model, the more power we have to estimate them. We expect that our data are representative of the general population and that the trait (height) is normally distributed in the general population. Therefore, the most basic twin model assumes that there are no significant differences in means and variances between the different groups (e.g., twin1, twin2, MZ, DZ, males, females, etc.), which means that fewer parameters need to be estimated. However, before running the most basic model, we need to test whether these assumptions are valid in our data; violations of these assumptions are not problematic, but they need to be accounted for with more complex twin models that allow separate parameter estimates for different subsamples and/or include covariates (e.g., sex differences). To start with, we test whether the means of twin1 and twin2 (within the MZ and DZ groups) are significantly different. To do this, the means are equated and the model fit is compared to that of the original saturated model. If the fit of the restricted model is not significantly worse, the means are not significantly different, and we keep them equated in subsequent modeling. To equate the means, a new mxModel object is created in which the expected means of twin1 and twin2 in the MZ group (and again in the DZ group) are equated and named MeanMZt1t2 and MeanDZt1t2, respectively. Then the new reduced model is run and the summary statistics and output are requested.
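One way to impose this equality is to replace the means matrices with label-equated versions, since OpenMx equates any free parameters that share a label; the labels MeanMZt1t2 and MeanDZt1t2 follow the chapter, while the rest of the sketch is an assumption about how the online script is organized.

```r
univTwinSatModelSub1 <- mxModel(univTwinSatModel, name = "univTwinSatSub1",
  mxModel(univTwinSatModel$MZ,
    # one shared label across both cells equates the two MZ means
    mxMatrix(type = "Full", nrow = 1, ncol = ntv, free = TRUE, values = 175,
             labels = "MeanMZt1t2", name = "expMeansMZ")),
  mxModel(univTwinSatModel$DZ,
    mxMatrix(type = "Full", nrow = 1, ncol = ntv, free = TRUE, values = 175,
             labels = "MeanDZt1t2", name = "expMeansDZ"))
)

univTwinSatSub1Fit  <- mxRun(univTwinSatModelSub1)
univTwinSatSub1Summ <- summary(univTwinSatSub1Fit)
```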
With the tableFitStatistics command, the fit of the reduced model is compared to the fit of the saturated model [the difference in model fit is shown in the output (see Note 4)]. As explained in the introduction, the change in fit between two models (assuming that they are nested) is asymptotically distributed as χ². By testing the change in the χ² statistic (Δχ²) against the change in degrees of freedom (Δdf), it is possible to test whether constraining the means to be equal significantly worsens the model fit. For further information regarding output and model fit/comparison, see Note 4. In the
present example, the output shows no significant mean differences between twin1 and twin2 (within both MZ and DZ groups). Other assumptions of the most basic twin model that are tested (not shown here, but in the script online) are whether we can equate:

– The means of the MZ and DZ groups
– The variances of twin1 and twin2 in the MZ and DZ groups
– The variances of the MZ and DZ groups
– The covariances of the MZ and DZ groups (this is not an assumption, but a preliminary test for genetic effects)
In a model including data from both sexes, we would also check for differences between the male MZ and DZ groups, the female MZ and DZ groups, and the DZ opposite-sex twins. If the means, variances, or covariances are significantly different between groups, we estimate the parameter separately for each group. If we can equate two parameters, we do so, thereby reducing the final number of parameters to be estimated and increasing the statistical power to estimate them. For the present example, assumption testing showed no mean or variance differences between twin1 and twin2 or between the MZ and DZ groups. Therefore, in the ACE model below, we estimate only one mean and one variance for the whole sample (i.e., we equate the means and the variances over the different groups). The covariance between MZ twins was significantly higher than that between DZ twins, indicating a genetic influence on height variation. This is formally tested in the ACE model below.

2.2. Univariate ACE Modeling

2.2.1. Saturated Model Fitting
After the data are prepared and starting values are specified for the standardized A, C, and E estimates (see Note 2), a saturated ACE model (univACEModel) is fitted to estimate the relative contributions of A, C, and E to variation in height (i.e., the squares of the a, c, and e pathways presented in Fig. 1). Three free lower matrices (i.e., each element in the upper triangle of the matrix is fixed to zero and the other elements are estimated) are created using the mxMatrix command, one for each of the a, c, and e estimates. Note that in a univariate model the lower matrix is simply a one-by-one matrix; however, the scripts are set up so that they can be easily expanded to a bivariate or multivariate model. We compute the variance components for A, C, and E using matrix multiplication (%*% in R), multiplying a by the transpose of a, and likewise for c and e. This is followed by algebra to compute the total variance V by summing the three variance components A, C, and E. To standardize the variance components, we first create an identity matrix (I) (containing ones on the diagonal and zeros in all off-diagonal entries); second, we create a matrix of inverse standard deviations (iSD) by taking the inverse (solve) of the square root of the total variance multiplied by the identity matrix.
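The path matrices and variance-component algebra described here can be sketched as follows for the univariate case; the matrix names follow the chapter, while the labels and starting values are illustrative assumptions.

```r
pathA <- mxMatrix(type = "Lower", nrow = nv, ncol = nv, free = TRUE,
                  values = 6, labels = "a11", name = "a")
pathC <- mxMatrix(type = "Lower", nrow = nv, ncol = nv, free = TRUE,
                  values = 3, labels = "c11", name = "c")
pathE <- mxMatrix(type = "Lower", nrow = nv, ncol = nv, free = TRUE,
                  values = 6, labels = "e11", name = "e")

covA <- mxAlgebra(a %*% t(a), name = "A")   # additive genetic variance
covC <- mxAlgebra(c %*% t(c), name = "C")   # shared environmental variance
covE <- mxAlgebra(e %*% t(e), name = "E")   # residual variance
covP <- mxAlgebra(A + C + E,  name = "V")   # total variance

matI  <- mxMatrix(type = "Iden", nrow = nv, ncol = nv, name = "I")
invSD <- mxAlgebra(solve(sqrt(I * V)), name = "iSD")  # inverse standard deviations
```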
Using the mxAlgebra command, three variables are created containing the standardized path coefficients for the a, c, and e effects. This is achieved by multiplying each path coefficient by the ratio of the standard deviations, e.g., (a * SD(A))/SD(Trait), where SD(Trait) is the standard deviation of height and SD(A) is the standard deviation of the predictor, the latent factor "A." Next, a vector is created containing the genetic variance (A) divided by the total variance (V), which gives the heritability. The heritability (h2) is the proportion of the total variance due to A (additive genetic effects). This is repeated to estimate the proportions of variance due to shared environmental (C) and residual (E) influences. Please note that this algebra does not change when the script is converted to multivariate modeling. A means vector is created for the expected means of twin1 and twin2. Note that this mean has been equated between twin1 and twin2 for both MZ and DZ twins, based on the results of the preliminary analyses. The algebra for the expected variance–covariance matrices for MZ and DZ twins is then specified. The command rbind joins the expected variances and covariances together vertically, while cbind joins the matrices horizontally. For the DZ twins, A is multiplied (Kronecker product, %x% in R) by 0.5, as DZ twins on average share only half of their genes.
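In OpenMx 1.x algebra, these standardized components and expected statistics might look as follows; the names are patterned on the chapter's description and the label on the means matrix is an assumption.

```r
stdPathA <- mxAlgebra(iSD %*% a, name = "sta")   # standardized a path
stdPathC <- mxAlgebra(iSD %*% c, name = "stc")
stdPathE <- mxAlgebra(iSD %*% e, name = "ste")

estH2 <- mxAlgebra(A / V, name = "h2")           # heritability
estC2 <- mxAlgebra(C / V, name = "c2")           # shared environment proportion
estE2 <- mxAlgebra(E / V, name = "e2")           # residual proportion

# one mean, equated over twins and zygosity groups via a shared label
expMean <- mxMatrix(type = "Full", nrow = 1, ncol = ntv, free = TRUE,
                    values = 175, labels = "mean", name = "expMean")

# expected covariance structure: A + C off-diagonal for MZ, 0.5A + C for DZ
expCovMZ <- mxAlgebra(rbind(cbind(V, A + C),
                            cbind(A + C, V)), name = "expCovMZ")
expCovDZ <- mxAlgebra(rbind(cbind(V, 0.5 %x% A + C),
                            cbind(0.5 %x% A + C, V)), name = "expCovDZ")
```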
Next, the MZ and DZ data and objectives are specified. Based on the results of the preliminary analyses, the same mean and variance are used for both submodels, while the covariance is estimated separately for each submodel (ACE.expCovMZ and ACE.expCovDZ). The mxAlgebra command is used to add the two objectives together, and the result is named m2ACEsumll. Finally, m2ACEsumll is specified as the object used for optimization using the mxAlgebraObjective command. We also request 95% confidence intervals for the standardized variance components (h2, c2, and e2). Subsequently, we run the model and request the summary as well as the model output (script omitted).
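These final steps could be sketched as follows, assuming the shared algebra lives in a submodel named "ACE" (consistent with the ACE.expCovMZ and ACE.expCovDZ references above); the algebra reference names inside mxCI are assumptions.

```r
univACEModel <- mxModel(univACEModel,
  # sum the MZ and DZ -2 log-likelihoods and optimize the total
  mxAlgebra(MZ.objective + DZ.objective, name = "m2ACEsumll"),
  mxAlgebraObjective("m2ACEsumll"),
  # 95% confidence intervals on the standardized variance components
  mxCI(c("ACE.h2", "ACE.c2", "ACE.e2"))
)

univACEFit  <- mxRun(univACEModel, intervals = TRUE)
univACESumm <- summary(univACEFit)
univACESumm
```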
2.2.2. Fitting Submodels
After the general ACE model is run, submodels can be fitted to test the significance of parameters specified in the saturated model. To do this, we simply drop the parameter of interest and compare the model fit of the reduced and the full (saturated) model. Based on the parameter estimates in our example, we dropped the C component first (shown below), by setting the c matrix to zero. Then we ran the AE model (univAEModel). If C can be dropped we can test dropping A from the AE model (i.e., test the E model) or from the ACE model (i.e., test the CE model). Note that the E parameters can never be dropped as E includes measurement error.
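Dropping C can be sketched with omxSetParameters, assuming the c path carries the label "c11" and that the fitted ACE model is stored in an object named univACEFit; mxCompare is the standard OpenMx comparison function, shown here as an alternative to the tableFitStatistics helper used in the chapter scripts.

```r
# start from the fitted ACE model and fix the c path to zero
univAEModel <- mxModel(univACEFit, name = "univAE")
univAEModel <- omxSetParameters(univAEModel, labels = "c11",
                                free = FALSE, values = 0)
univAEFit   <- mxRun(univAEModel)

mxCompare(univTwinSatFit, univAEFit)  # AE model against the saturated model
mxCompare(univACEFit, univAEFit)      # AE model against the full ACE model
```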
Note that in this case we wish to compare each model to the saturated model and not to the previous nested model (as was the case for the assumption testing script). Here we show two
ways of generating the output. First, we create a table listing the goodness-of-fit of all the submodels compared to the fully saturated model (explained in Subheading 2.1.5). Second, a table is created listing the goodness-of-fit of all the submodels compared to the ACE model. For information on model output and model fit, please see Note 4.
3. Notes

1. Data set preparation
OpenMx uses a very flexible file format. The only real restriction is that the data for each unit of analysis need to be on a separate line. Thus, if we were running a multivariate analysis on a sample of unrelated individuals (where the unit of analysis is an individual), each line in the data file would contain the data for a different person, and all the variables for an individual would be on one line. In the analysis of twin data, the data for each twin pair are listed on a separate line. The only other restriction is that missing values are correctly identified within the data, either using the default missing code "NA" or via the missing-data argument, e.g., na.strings="-99". Data can be prepared using a standard data management or statistics package (e.g., SPSS or SAS). Alternatively, the data can be prepared within R. Numerous web-based tutorials describe importing and managing data in R; we recommend the Quick-R pages, e.g., http://www.statmethods.net/input/index.html.

2. Starting values
OpenMx optimizes a likelihood function under the provided model and tries to find the combination of parameter estimates that best fits the data. It does this via iteration: OpenMx chooses a set of parameters and determines the difference between the observed and the expected variance–covariance matrix; it then
chooses another set of parameters and refits the model, and this process continues until no further improvement in model fit can be found. For the first iteration, it is necessary to provide a reasonable starting point for each of the parameters to be estimated, to speed up the optimization process. Starting values should ideally be close to the actual estimates, but starting the optimizer at the solution itself is problematic and the analyses then generally fail. For the expected means and variance–covariance matrix, it is possible to use the observed means, variances, and covariances. You can manually enter starting values in the argument of the mxMatrix command (with the values argument). Alternatively, in supplying starting values for OpenMx, users can take advantage of inbuilt R functions to calculate the means and variances of the data. For example, in the ACE analysis, we use the following code to set the starting values for the means:

StMZmean <- vech(mean(mzData, na.rm=T))

This code obtains the mean of each variable in the mzData dataframe and places it in a vector called StMZmean. Similarly, starting values for the covariances (StMZcov) are applied to the matrix that will hold the MZ variance–covariance estimates using the following code:

mxMatrix(type="Symm", nrow=ntv, ncol=ntv, free=TRUE, values=StMZcov, name="expCovMZ")

Determining starting values for the a, c, and e parameters is less straightforward. Based on the observed data, you know the total variance of the trait. You can use the observed twin pair correlations and Falconer's formulas (7) (see Subheading 1) to determine the approximate relative contributions of A, C, and E to the trait variance. You can then estimate the unstandardized a, c, and e pathways by taking the square root of each contribution multiplied by the total variance.
In our script, we set a starting value for the a path coefficient by using inbuilt R functions to calculate the standard deviation of height in our data and multiplying it by a guesstimate of the magnitude of the effect:

St_a <- (vech(sd(twinData$ht1, na.rm=T)))*.3

In general, it is a good idea to provide higher starting values for the e estimate than for a or c, as this avoids the possibility of non-positive-definite matrices (where the estimates of the covariances become larger than the estimates of the variances). Providing starting values gets more complicated when the model includes more than one dependent variable, in which case you have to use the cross-twin cross-trait variance–covariance matrix to estimate starting values for the pathways.
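As an illustration of the Falconer-based approach, with hypothetical observed correlations and variance:

```r
rMZ  <- 0.90               # hypothetical observed MZ twin correlation
rDZ  <- 0.50               # hypothetical observed DZ twin correlation
varP <- 45                 # hypothetical total phenotypic variance

a2 <- 2 * (rMZ - rDZ)      # Falconer: additive genetic proportion (0.8)
c2 <- 2 * rDZ - rMZ        # shared environmental proportion (0.1)
e2 <- 1 - rMZ              # residual proportion (0.1)

st_a <- sqrt(a2 * varP)    # starting value for the a path
st_c <- sqrt(c2 * varP)    # starting value for the c path
st_e <- sqrt(e2 * varP)    # starting value for the e path
```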
3. Explanation of errors when commas and brackets are omitted or in excess, and how to find and remove them
There are two main types of errors encountered when running OpenMx: errors relating to the code and errors relating to the optimization. The three most common code errors are typos, missing commas/brackets, and extra commas/brackets. A typo typically results in an error message that looks something like this:

Error: The reference 'minsumll' does not exist. It is used by the named entity 'univTwinSat.objective'.

A missing comma typically results in an error message that looks something like this:

Error: unexpected symbol in: " mxAlgebra( expCovDZ[2,1] name"

whereas an extra comma typically results in an error message that looks something like this:

Error in mxModel("univTwinSat", mxModel("MZ", mxMatrix(type = "Symm", : argument is missing, with no default

There are also a number of important optimizer error messages. Any serious errors will be reported in the output; however, it is worth checking for errors on each analysis run. For the ACE example, these can be viewed by typing:

univTwinSatFit@output$status[1]

which yields the following output:

[[1]]
[1] 0

The status code, which in this case is 0, summarizes the status of the NPSOL optimizer run and can take several possible values. A value of 0 means a successful optimization: no error returned. A value of 1 means that it is highly likely that an optimal solution was found, but that there may have been some issues with the optimizer; these estimates can generally be considered correct solutions, so this code is labeled Mx status GREEN. Values of −1, 2, 3, 4, and 6 reflect critical optimizer failures, and the results should not be used. A value of −1 means that the optimizer became stuck in a location where the objective function could not be calculated and could not find a way out. This most often happens if the starting values make the calculation impossible.
A value of 2 or 3 means that the bounds or constraints, respectively, could not be satisfied.
A value of 4 means that the iteration limit was reached with no solution found; while it is possible to increase the number of iterations, it is better to improve the starting values and rerun the analyses. A value of 6 means that the optimality conditions could not be reached and the optimizer could find no way to improve the estimate; it often implies either a mistake in the model specification or starting values in an unallowable range. This code is labeled Mx status RED.

4. Output
If we request the parameter specifications (below), we can see which parameters we decided to estimate.
The expectedMeansCovariance command displays the expected variance–covariance matrix as well as the expected means for MZ and DZ twins, respectively (below).
The summary of the ACE model is shown below. First, the parameter estimates for a, c, and e are shown, each with its standard error. Subsequently, a2 (heritability), c2, and e2 are presented with confidence intervals. Finally, we get information about the model fit, such as the −2 log-likelihood (−2LL) and Akaike's Information Criterion (AIC); note that the smaller the AIC, the better the model fit. Importantly, the −2LL of a submodel is always higher in magnitude, indicating a worse fit; but as long as this fit is not significantly worse, the more reduced submodel is the more parsimonious one. With the χ² test we can determine whether the difference in −2LL (χ² distributed) is significant. A nonsignificant P-value means that the model is consistent with the data, meaning that the reduced submodel fits the data more parsimoniously (i.e., the dropped parameter(s) is/are nonsignificant). Note that OpenMx by default always compares the submodels to the saturated model (as done below) and not to the previous nested model; the latter comparison has to be specified separately, as explained in Subheading 2.
In the Nested.fit output (below), we compare the reduced models to the ACE model. For example, the difference in degrees of freedom (diffdf) is two when we fit an E model: we drop A as well as C from the ACE model and therefore estimate two fewer parameters. As explained above, if we wanted to compare two submodels with each other (e.g., the CE and the E model), we would have to use the table fit statistics command (tableFitStatistics(univCEFit, univEFit)).
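The likelihood-ratio logic behind these tables can be reproduced by hand in R; the −2LL values below are hypothetical, used only to show the calculation.

```r
m2LL_ACE <- 4067.66        # -2LL of the full ACE model (hypothetical)
m2LL_E   <- 4088.21        # -2LL of the E model (hypothetical)

diffLL <- m2LL_E - m2LL_ACE   # change in -2LL, asymptotically chi-square
diffdf <- 2                   # A and C both dropped, so two fewer parameters

pchisq(diffLL, df = diffdf, lower.tail = FALSE)   # P-value of the test
```

A small P-value here would indicate that dropping both A and C significantly worsens the fit, so the E model would be rejected in favor of the ACE model.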
References

1. Holzinger KJ (1929) The relative effect of nature and nurture influences on twin differences. J Educ Psychol 20: 241–24
2. Neale MC, Boker SM, Xie G et al (2006) Mx: statistical modeling, 7th edn. Department of Psychiatry, VCU Box 900126, Richmond, VA 23298, USA
3. Kendler KS et al (1993) A test of the equal-environment assumption in twin studies of psychiatric illness. Behav Genet 23: 21–2
4. Keller MC, Coventry WL (2005) Quantifying and addressing parameter indeterminacy in the classical twin design. Twin Res Hum Genet 8: 201–21
5. Mather K (1974) Non-allelic interaction in continuous variation of randomly breeding populations. Heredity 32: 414–41
6. Boomsma DI et al (1999) A religious upbringing reduces the influence of genetic factors on disinhibition: evidence for interaction between genotype and environment on personality. Twin Res 2: 115–12
7. Falconer DS (1989) Introduction to quantitative genetics. Longman Scientific and Technical, Harlow, Essex, UK
8. Keller MC et al (2009) Modeling extended twin family data I: description of the Cascade model. Twin Res Hum Genet 12: 8–1
9. Neale MC, Cardon LR (1992) Methodology for genetic studies of twins and families. Kluwer, Dordrecht, The Netherlands
Chapter 10

Estimating Heritability from Nuclear Family and Pedigree Data

Murielle Bochud

Abstract

Heritability is a measure of familial resemblance. Estimating the heritability of a trait represents one of the first steps in the gene mapping process. This chapter describes how to estimate heritability for quantitative traits from nuclear family and pedigree data using the ASSOC program in the Statistical Analysis for Genetic Epidemiology (S.A.G.E.) software package. Estimating heritability rests on the assumption that the total phenotypic variance of a quantitative trait can be partitioned into independent genetic and environmental components. In turn, the genetic variance can be divided into an additive (polygenic) genetic variance, a dominance variance (nonlinear interaction effects between alleles at the same locus), and an epistatic variance (interaction effects between alleles at different loci); the last two are often assumed to be zero. The additive genetic variance represents the average effects of individual alleles on the phenotype and reflects transmissible resemblance between relatives. Heritability in the narrow sense (h2) refers to the ratio of the additive genetic variance to the total phenotypic variance. Heritability is a dimensionless, population-specific parameter. ASSOC estimates association parameters (regression coefficients) and variance components from family data. ASSOC uses a linear regression model in which the total residual variance is partitioned, after regressing on covariates, into the sum of a random additive polygenic component, a random sibship component, random nuclear family components, a random marital component, and an individual-specific random component. Assortative mating, nonrandom ascertainment of families, and failure to account for key confounding factors may bias heritability estimates.
Key words: Heritability, Additive genetic variance, Polygenic variance, Total phenotypic variance, Narrow sense heritability, Broad sense heritability, Familial aggregation, Variance components, Environmental variance, Genetic variance, Pedigrees, Family data, Nuclear families
1. Introduction

Heritability is a measure of familial resemblance. Estimating the heritability of a trait represents one of the first steps in the gene mapping process. A nonzero heritability is a necessary, although not sufficient, condition for the detection of genes underlying a phenotype of interest. Heritability provides information on the statistical power to discover genes related to a specific phenotype in family-based studies (1). The higher the heritability, the stronger is the correlation between phenotype and genotype. When multiple related phenotypes are available, heritability estimates can help choose the best phenotype for gene mapping. Heritability estimates also provide insight into the pathophysiology of diseases by indicating how strongly genetic factors contribute to disease-related phenotypes compared to environmental factors. From a population genetics perspective, the heritability of a phenotype also provides information on the ability of a population to respond to selection and on the potential of that population to evolve (2). Yet, by itself, heritability provides information neither on the number of genes influencing a phenotype nor on their modes of action. Hence, heritability does not help in understanding the genetic structure of a phenotype.

Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_10, © Springer Science+Business Media, LLC 2012

1.1. Definition of Concepts
This chapter describes how to estimate heritability for quantitative traits from nuclear family and pedigree data in humans. Although it is now possible to easily genotype thousands of genetic markers that allow inferring relatedness without pedigree information and, hence, to estimate heritability by combining this inferred relatedness with phenotypic resemblance (3–5), this chapter focuses on estimating heritability without genetic markers because, in human studies, pedigree relationships are usually known. Calculating heritability rests on the assumption that the total phenotypic variance of a quantitative trait (σ_T^2) can be partitioned into independent genetic (σ_G^2) and environmental (σ_E^2) components (Eq. 1). In other words, Eq. 1 assumes that there is no gene by environment interaction. Note that the environmental variance includes known and unknown environmental factors as well as any noise (e.g., measurement error) that may affect the phenotypic data. In turn, the genetic variance can be divided into an additive (polygenic) genetic variance σ_P^2, a dominance variance σ_D^2, and an epistatic variance σ_I^2 (Eq. 2). Heritability in the broad sense (H^2) refers to the ratio of the total genetic variance σ_G^2 to the total phenotypic variance σ_T^2 (Eq. 3). Heritability in the narrow sense (h^2) refers to the ratio of the additive genetic variance to the total phenotypic variance (6) (Eq. 4).

σ_T^2 = σ_G^2 + σ_E^2    (1)

σ_G^2 = σ_P^2 + σ_D^2 + σ_I^2    (2)

H^2 = σ_G^2 / σ_T^2    (3)

h^2 = σ_P^2 / σ_T^2    (4)
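As a quick worked example of Eqs. 1–4, the two heritabilities can be computed directly from the variance components; the numerical values below are made up for illustration and are not data from this chapter.

```python
# Worked example of Eqs. 1-4 with illustrative (made-up) variance components.
var_P = 0.40  # additive (polygenic) genetic variance
var_D = 0.05  # dominance variance
var_I = 0.05  # epistatic variance
var_E = 0.50  # environmental variance

var_G = var_P + var_D + var_I  # Eq. 2: total genetic variance
var_T = var_G + var_E          # Eq. 1: total phenotypic variance

H2 = var_G / var_T  # Eq. 3: broad-sense heritability
h2 = var_P / var_T  # Eq. 4: narrow-sense heritability
print(H2, h2)
```

Note that if the dominance and epistatic variances are assumed to be zero, as is often done, H^2 and h^2 coincide.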
The additive genetic variance is also referred to as the variance of the breeding value, which represents the average effects of individual alleles on the phenotype and reflects transmissible resemblance among relatives. The additive genetic variance is of main interest because it captures the genetic information that parents transmit to their children (2). In the rest of this chapter, heritability will, therefore, refer to h^2. Because the unit of genetic transmission is the allele and not the genotype, parents only share one allele identical by descent (IBD) with their children (unilineal relative pairs). In contrast, siblings can share zero, one, or two alleles IBD. The dominance variance reflects the nonlinear interaction effects between alleles at the same locus. Bilineal relative pairs, such as sib pairs, are needed to estimate a dominance variance component. The epistatic variance represents interaction effects between alleles at different loci.

1.2. Further Assumptions and Limitations
An important assumption underlying heritability as usually estimated is that of random mating. For some traits, such as human height or body size, this may not always be true. The fact that heritability is the ratio of two variances has important consequences. Heritability is always greater than or equal to zero and is dimensionless. Because the numerator can never exceed the denominator, heritability estimates range from 0 to 1, zero meaning no heritability and 1 meaning that the trait is fully heritable, which in practice of course never occurs in humans. Also, heritability is population-specific. Both the total phenotypic variance and the additive genetic variance may vary substantially across populations (e.g., if environmental factors and/or allele frequencies, respectively, vary across populations). Furthermore, heritability estimates also refer to a specific point in time (1). If the environmental variance of a trait were to vary considerably with time in a specific population, heritability estimates in this population could change over time even without corresponding changes in the additive genetic variance.
1.3. Estimating Heritability from Nuclear Families or Pedigrees Using ASSOC: Theoretical Aspects
The ASSOC program in the Statistical Analysis for Genetic Epidemiology (S.A.G.E.) software package estimates association parameters (regression coefficients) and variance components from family data (7–9). For any individual i, with continuous phenotype y_i and jth covariate value c_ji, ASSOC uses a linear regression model in which the total residual variance is partitioned, after regressing on covariates, into the sum of a random additive polygenic component (p_i), a random sibship component (s_i), a variable number of random nuclear family components (f_i), a variable number of random marital components (m_i), and an individual-specific random component (e_i) (9):

h(y_i) = h(a + b_1 c_1i + b_2 c_2i + ... + b_n c_ni) + p_i + s_i + f_i + f_i' + m_i + e_i,    (5)
where in this equation there are two random nuclear family components (f_i and f_i') for the typical situation that a person can be a member of two different nuclear families, one including the parents and siblings and the other including the spouse and children, and just one marital component. However, when a person has children by multiple mates, there will be more than two nuclear family components and more than one marital component. The sibling component could be due to either a dominance genetic variance component or a common sibship environmental component. Note that one cannot separate dominance genetic variance and common environmental variance from data on full sibs. Heritability is estimated as the additive polygenic component divided by the total residual variance. ASSOC implements simultaneous maximum likelihood estimation of both the components of variance and the covariate coefficients on the assumption of multivariate normality of the residuals. ASSOC also allows for simultaneously estimating a power transformation (h) of both sides of the equation (10). Allowing for a transformation relaxes the usual strict assumption of normality, and estimating the variance components makes the model robust to nonindependence (e.g., in the case of large pedigrees). Transforming both sides of the regression equation results in median-unbiased estimates of the covariate coefficients on the original scale of measurement. As detailed in ref. 9, all the random effects in the model are assumed to be mutually independent and, after transformation, normally distributed with zero means and variances σ_p^2, σ_s^2, σ_f^2 = σ_f'^2, σ_m^2, and σ_e^2, such that V[h(y_i)] = σ_p^2 + σ_s^2 + σ_f^2 + σ_f'^2 + σ_m^2 + σ_e^2. In ASSOC, standard errors are determined by numerical double differentiation of the log likelihood, and P values are based either on a Wald test or on a likelihood ratio test. P values are two-sided for the regression coefficients and one-sided for the variance components (9).
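To make the variance-components machinery concrete, the sketch below simulates nuclear family data and estimates the polygenic and residual variances by direct maximization of the multivariate normal likelihood. This is not ASSOC: it drops the sibship, family, and marital components and the transformation, keeping only p_i and e_i, and all names and numbers are illustrative.

```python
# Toy maximum-likelihood estimation of h2 from family data (NOT ASSOC).
# Model: y ~ N(mu, s2p * A + s2e * I), where A is the expected additive
# relationship matrix implied by the pedigree.
import numpy as np
from scipy.linalg import block_diag
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Additive relationship matrix for one nuclear family:
# father, mother, and two full siblings.
A_fam = np.array([
    [1.0, 0.0, 0.5, 0.5],
    [0.0, 1.0, 0.5, 0.5],
    [0.5, 0.5, 1.0, 0.5],
    [0.5, 0.5, 0.5, 1.0],
])

n_fam = 150
A = block_diag(*[A_fam] * n_fam)   # families mutually unrelated
n = A.shape[0]

# Simulate phenotypes with true s2p = 0.6 and s2e = 0.4 (true h2 = 0.6).
L = np.linalg.cholesky(A)
y = np.sqrt(0.6) * (L @ rng.normal(size=n)) + np.sqrt(0.4) * rng.normal(size=n)

def neg_loglik(theta):
    s2p, s2e = np.exp(theta)                  # keep variances positive
    V = s2p * A + s2e * np.eye(n)
    _, logdet = np.linalg.slogdet(V)
    r = y - y.mean()
    return 0.5 * (logdet + r @ np.linalg.solve(V, r))

fit = minimize(neg_loglik, x0=np.log([0.5, 0.5]), method="Nelder-Mead")
s2p_hat, s2e_hat = np.exp(fit.x)
h2_hat = s2p_hat / (s2p_hat + s2e_hat)
print(f"estimated h2 = {h2_hat:.2f} (simulated truth 0.6)")
```

Real implementations exploit the block structure of the pedigree instead of forming a dense covariance matrix, and estimate the covariate coefficients jointly, but the likelihood being maximized is of this form.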
1.4. Major Procedures When Estimating Heritability from Nuclear Family and Pedigree Data Using ASSOC
The Methods section will describe in a step-by-step manner (1) key aspects of data collection and study design, (2) prerequisite knowledge, (3) how to install the S.A.G.E. software package from the internet, (4) how to organize the data, (5) how to clean the data, (6) how to prepare the input files, describe the pedigree structure, and estimate heritability, (7) how to interpret the ASSOC output, and (8) how to interpret the heritability estimates.
2. Methods

2.1. What Are Key Aspects of Data Collection and Study Design with Respect to Estimating Heritability?
The availability of high quality, reproducible phenotypic data is essential to obtain valid heritability estimates. Ascertainment of the families should also be taken into account in the analyses (see Note 1). Covariate information relevant to the quantitative trait of interest should be available and accurately measured as well. The presence of assortative mating may bias heritability estimates
(see Note 2). The use of nuclear family and pedigree data is not equivalent to using twin data (see Note 3). It is important to realize that heritability estimates are influenced by the way the phenotype is measured (see Note 4).

2.2. What Prior Knowledge Is Required?
The level of computer skills required is minimal in that the software package has a friendly graphical user interface. Nevertheless, familiarity with the Windows operating system and spreadsheet-like software (e.g., Excel) is needed. Basic knowledge of genetics, statistics, and the biology related to the phenotype of interest is required, as well as the documentation provided with the software, to understand and properly use the method described.
2.3. How to Install the S.A.G.E. Software Package?
S.A.G.E. is an open-source resource supported by the US National Center for Research Resources of the National Institutes of Health. The S.A.G.E. software package can be downloaded from the following website: http://darwin.cwru.edu. Users need to register before they can download the software. The user reference manual can be downloaded directly from the website and is also available in the Help section of the software (see Note 5). This manual provides an in-depth description of each program available in S.A.G.E. and explains all the options available. To estimate heritability, only the PEDINFO and ASSOC programs are needed.
2.4. How to Organize the Data?
Pedigree relationships should be correctly specified and coded (see Note 6). Founders should have missing parent identifiers (ids). Dummy parents may be needed to link all members of the pedigree (see Note 6).
2.5. How to Clean the Data?
Data cleaning is often a time-consuming, but necessary, process. Pedigrees should not have loops, as this is a prerequisite for most S.A.G.E. programs (see Note 7). Missing data should be properly coded and their distribution evaluated outside of S.A.G.E. The missing code is provided to S.A.G.E. when importing the data. Do not use as a missing code a value that lies within the range of possible phenotypic values. The distributions of the quantitative trait(s) of interest and their corresponding relevant covariates should be evaluated, and great care should be taken to identify potential outliers (see Note 8).
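A minimal data-audit sketch of these checks, using pandas on a few rows in the layout of Table 1 (see Note 6); the file content is inlined here and "." is declared as the missing-value code:

```python
# Hedged data-cleaning sketch: read a tab-delimited pedigree file, declare
# "." as the missing code, then inspect missing counts and the trait range.
import io
import pandas as pd

raw = (
    "FAMILY\tUID\tFATHER\tMOTHER\tSEX\tAGE\tTRAIT_1\n"
    "001\t001\t.\t.\t0\t44\t130\n"
    "001\t002\t.\t.\t1\t42\t146\n"
    "001\t003\t001\t002\t1\t21\t118\n"
    "002\t008\t.\t.\t0\t.\t.\n"
)
ped = pd.read_csv(io.StringIO(raw), sep="\t", dtype=str, na_values=".")

print(ped.isna().sum())            # distribution of missing values per column
trait = pd.to_numeric(ped["TRAIT_1"])
print(trait.min(), trait.max())    # confirm the range excludes the missing code
```

The same two checks (missing counts per column, trait range) scale unchanged to a full pedigree file read from disk.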
2.6. How to Prepare the ASSOC Input Files and Run the Analyses?
Two files are needed to estimate heritability of quantitative traits in ASSOC: a data file and a parameter file. The data file contains the pedigree data (as well as traits and covariates) as described in Subheading 2.4 and is created by importing the data into S.A.G.E. The parameter file is created within the S.A.G.E. software. The data can be in a spreadsheet-like format (e.g., Excel) or in a tab-delimited (or other delimiter) text file (see Note 9).
1. Open the S.A.G.E. software by clicking on the desktop icon.
2. Create a new project.
3. Name the project and provide the path showing where the project files should be stored.
4. Three options are now possible:
(a) Create a new project from scratch.
(b) Import a pedigree data file that needs to be formatted by S.A.G.E.
(c) Work with S.A.G.E.-ready format files.
For this example, the second situation (b) will be illustrated.
5. Specify the path to the pedigree data file. Select the format of the file [text file with multiple delimiter choices or Excel file (with or without header)]. It is recommended to use a tab-delimited text file (see Note 10).
6. Set up the individual missing value code (dot "."; blank; space " "; tab; other) for pedigree and individual ids as well as parental fields.
7. Define the sex code (i.e., specify how sex is defined in your data file).
8. Attribute to each column a S.A.G.E. identifier (i.e., pedigree ID, individual ID, parent 1, parent 2, sex, trait, or covariate). You cannot move forward before this is completed.
9. For all covariates and traits, specify the missing value code and the type of variable (quantitative or binary).
10. Click on and fill in the general specifications tab.
11. Name the S.A.G.E.-formatted pedigree data file (e.g., "pedigree0.dat").
12. Name the parameter file (e.g., "parameter.par").
13. Click on the "Analysis" tab and select summary statistics (this will run the PEDINFO program).
14. Click on "Analysis definition" and select the traits and covariates of interest.
15. Check the PEDINFO parameter file generated by S.A.G.E. (see Note 11).
16. Check the summary statistics in the output folder of the PEDINFO1 analysis. As a general rule in S.A.G.E., always check the information output file (*.inf) to ensure that the pedigree data file does not contain errors and that the data have been properly read by the software. This file contains informational diagnostic messages, warnings, and program errors.
No analysis results are stored in this file. The information
output file will list the pedigree structure and phenotypes for the first individuals in the dataset. This is a good way to check that the data are being properly handled.
17. Check the analysis output file. General statistics on all pedigrees are provided, with the number and mean size of pedigrees in the dataset and the number of generations and sibships (including their size). Information on whether there are loops or rings in the pedigrees is provided. Finally, the number of relative pairs of each type, the number of men and women, and the number of founders are provided. This information is very important to gauge the power of the dataset to estimate heritability and to get a feeling for how precise the heritability estimates will be.
18. Run the ASSOC program to obtain heritability estimates. To do this, click on the "Analysis" tab, then select "Familial aggregation" and "Heritability estimation."
(a) Select the appropriate pedigree data file and name the output file (e.g., ASSOC1).
(b) Provide the title of the analysis (e.g., SBP).
(c) Select the quantitative trait of interest (e.g., SBP) (see Note 12).
(d) Define the covariates to include in the model (e.g., age and sex) (see Note 13).
(e) Choose a transformation. You may choose (1) no transformation, (2) a Box and Cox transformation, or (3) a George and Elston transformation (see Subheading 2). It is advised to use a George and Elston transformation (see Note 14).
(f) Choose a summary display.
(g) Select the variance components that ASSOC should try to estimate (see Note 15).
(h) Run the ASSOC analysis.
(i) Check the assoc.inf file. If no warning message is provided, look at the ASSOC detailed output result (assoc.det), which provides the variance components, the total phenotypic variance, and a heritability estimate with a standard error and a P value (see Note 16).

2.7. How to Interpret ASSOC Output?
Below is an example of an ASSOC analysis output file (the output has been trimmed to ease the understanding of the results) providing a heritability estimate for systolic blood pressure (sbp) using a simulated dataset. Heritability is estimated to be 0.59 with a large standard error (0.68) so the estimate is not significantly greater than zero. Increasing the sample size threefold, while keeping the same data structure and phenotypic values, and letting
ASSOC estimate the power transformation (lambda 1) leads to a similar heritability estimate (0.63), but with a smaller standard error (0.32), and a significant P value (0.02). This illustrates the importance of sample size in determining the precision of the heritability estimate.
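The Wald reasoning behind these numbers can be checked by hand: the estimate divided by its standard error gives a z statistic, and variance components are tested one-sided (see Subheading 1.3).

```python
# Reproducing the significance statements in the text from the quoted
# heritability estimates and standard errors (one-sided Wald test).
from scipy.stats import norm

pvals = []
for h2, se in [(0.59, 0.68), (0.63, 0.32)]:
    z = h2 / se
    pvals.append(norm.sf(z))   # one-sided P, as used for variance components
    print(f"h2 = {h2}, SE = {se}, z = {z:.2f}, one-sided P = {pvals[-1]:.3f}")
```

The first estimate is clearly nonsignificant (P around 0.19), while the tripled sample gives P of roughly 0.02, matching the output described above.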
2.8. How to Interpret the Heritability Estimates?
This is an informal and somewhat arbitrary guide for the reader. A table with selected examples of heritability estimates for quantitative traits estimated using ASSOC is provided in Note 17. One should differentiate the statistical significance of a heritability estimate from its strength. A trait may have a statistically significant (i.e., significantly higher than zero) but small heritability, such as, say, 0.1. In contrast, a trait may have a higher heritability estimate (say, 0.4) that is nevertheless not significantly different from zero if there is not enough information in the data to provide sufficient precision. In general, many people would agree that, in humans, heritability estimates <0.2 are low, heritability estimates between 0.2 and 0.5 are moderate, and estimates higher than 0.5 reflect a high heritability. This being said, it is important to realize that a low heritability estimate does not automatically imply that the additive genetic variance is small, but only that a small proportion of the total phenotypic variance is caused by genetic factors transmitted from generation to generation in the population of interest (1). Conversely, a high heritability estimate does not at all imply that the genes involved in the control of the phenotype have large effects (1). A nice example is given by blood pressure, which, when measured with high accuracy (e.g., longitudinal measurements or 24-h blood pressure monitoring), may achieve high heritability estimates (0.5–0.6) (11), whereas current findings show that the vast majority of genetic variants influencing blood pressure in humans have a tiny effect size (12, 13). For blood pressure, it seems that numerous genetic variants, each with a small effect, play a role, compatible with a polygenic model. A similar example is human height, with heritability estimates around 0.8 across populations, and yet most genetic variants associated with human height have only very small effects (14).
The accuracy (precision) of a heritability estimate depends on the sample size of the data available, both in terms of the number and the structure of the pedigrees. The sampling variance of the heritability estimate decreases with the number of families and the number of individuals in each family, and depends on the type of relatedness found in the pedigrees (i.e., first-degree relatives provide more information than second- and third-degree relatives) (2).
2.9. What Type of Confounding Factors May Bias Heritability Estimates?
If the resemblance of parents and their children is due, in part, to a shared environment, this may bias heritability estimates upwards (1). One may correct the total phenotypic variance for the effects of covariates such as age, sex, and other potential confounders (e.g., treatment effects for blood pressure or blood lipids). If information on these covariates is not available, the total phenotypic variance used to estimate heritability will be larger and heritability might be underestimated. However, adjustment should be made with caution for covariates that may share genetic determinants with the phenotype of interest or lie in the causal pathway between genes and this phenotype (e.g., adjusting for body mass index or renal function when estimating heritability for blood pressure). Multigenerational
pedigree data include individuals who may have lived under different environmental conditions. Most of the time, one assumes that the environmental variance does not vary across generations within the population of interest. Hence, if a cohort effect strongly influences the environmental variance, such an effect is usually not taken into account. A cohort effect exists for human height, for instance, which has been shown to increase substantially with time, in different ways across populations (15). A similar observation can be made for socioeconomic conditions, which have also changed considerably over time and are known key determinants of health. However, it is not clear to what extent cohort effects may bias heritability estimates in humans.
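The effect of covariate adjustment on the denominator of h^2 can be sketched numerically: here age and sex effects are removed by ordinary least squares before the variance is computed. The data and effect sizes are simulated and purely illustrative.

```python
# Sketch of covariate adjustment: regressing the phenotype on age and sex
# shrinks the "total" variance that enters the heritability denominator.
import numpy as np

rng = np.random.default_rng(42)
n = 1000
age = rng.uniform(20, 70, size=n)
sex = rng.integers(0, 2, size=n).astype(float)
sbp = 100 + 0.5 * age + 5.0 * sex + rng.normal(0, 10, size=n)  # simulated

X = np.column_stack([np.ones(n), age, sex])     # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, sbp, rcond=None)  # OLS fit
resid = sbp - X @ beta                          # covariate-adjusted phenotype

print(round(float(np.var(sbp)), 1), round(float(np.var(resid)), 1))
```

With the covariates left out, the larger unadjusted variance would inflate the denominator and so deflate the heritability estimate, as discussed above.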
3. Notes

1. The validity of heritability estimates rests on the assumption that a random sample of families from the population of interest is available. In practice, such truly random samples are difficult to obtain for family-based studies in humans, particularly for phenotypes that are complex and time-consuming to measure, and ascertainment based on a specific disease or phenotypic value is often present. For instance, families enriched in diseased individuals might be preferentially collected. In such a case, estimates of the variance components may be biased and hence not representative of the true underlying population values. It is therefore advisable, whenever possible, to correct for ascertainment. Correction for ascertainment aims to determine what the results would have been had the investigators not ascertained this way and should, therefore, use a variable that appropriately reflects the event that triggered the family to be included in the study (11). For instance, in a family-based study in which families were included if two siblings were hypertensive, blood pressure was corrected for ascertainment by including as a covariate an indicator variable whose value was 1 if the participant was hypertensive and 0 otherwise (11). A more general description of how to correct for ascertainment in family-based studies can be found in ref. 9.
2. Assortative mating, i.e., a correlation between the phenotypic values of spouses, can bias heritability estimates. This might be an issue for anthropometric traits such as obesity and height (16).
3. As described in another chapter of this book (Chapter 9), heritability can also be, and often is, estimated from twin data. Monozygotic twins share 100% of their genetic
material, whereas dizygotic twins share only 50%, similar to non-twin siblings. Therefore, why bother estimating heritability from non-twin data? The twin method is based on the assumption that monozygotic and dizygotic twin pairs differ in their similarities as a result of only genetic causes (17). This assumption often does not hold. For instance, monozygotic twins are more similar than dizygotic twins for several demographic and lifestyle factors (18). Another assumption is that both types of twins and singletons have the same genotypic and environmental population variances (19). An important fact that may play a role for cardiovascular phenotypes is that twins have lower birth weights and hence suffer more from intrauterine growth retardation than singletons (20). These observations raise the question of how heritability estimates calculated in twins can be generalized to singletons (17). Estimating heritability from non-twin family data is, therefore, useful and may be more valid for inferences regarding singletons, who represent the majority of human populations.
4. Heritability depends not only on a phenotype in a specific population with a particular environment, but also on how the phenotype has been measured (2). For instance, the number of blood pressure measurements that are averaged to constitute the quantitative trait of interest was shown to strongly influence heritability estimates, which varied from 0.06 ± 0.09 for a single daytime systolic blood pressure measurement to 0.37 ± 0.12 for an average of 20 measurements (11). Whenever repeated measurements are available, one may partition the variance into within-individual and among-individual components. The within-individual component is purely environmental, whereas the among-individual component is partly environmental and partly genetic (2).
5. For a detailed introduction, it may be possible to arrange a short course for S.A.G.E.
For more information, see under Training at http://darwin.cwru.edu/sage/. Users can join a mailing list for information about S.A.G.E. bug lists, web site updates, new software releases, patches, and fixes.
6. The dataset should have one line per subject (Table 1), with the first line containing variable names, and one column per variable. Keep variable names short. Avoid using spaces in variable names; use underscores instead (e.g., TRAIT_1). The format of the dataset and the column delimiter (tab, comma, space, semicolon, etc.) are of utmost importance. S.A.G.E. can import data from Excel spreadsheets or from tab-delimited or comma-delimited text files. Although they can, with care, be handled, it is best to avoid datasets with multiple delimiters between variables. Each participant should have a unique identifier (UID), a father identifier (FATHER), a mother identifier (MOTHER),
Table 1
Example of data structure required by S.A.G.E.

Family  UID  Father  Mother  Sex  Age  TRAIT_1
001     001  .       .       0    44   130
001     002  .       .       1    42   146
001     003  001     002     1    21   118
001     004  001     002     1    20   122
002     006  .       .       0    58   156
002     007  008     009     1    62   144
002     008  .       .       0    .    .
002     009  .       .       1    .    .
002     010  006     007     0    36   136
002     011  012     013     0    34   128
002     012  .       .       0    34   156
002     013  008     009     1    65   138
002     014  .       .       0    70   142
002     015  014     013     0    41   130

"." represents a missing value
sex (e.g., 1 = woman, 0 = man), age, and other phenotypic values (e.g., TRAIT_1). Other covariate information may be added, as needed. A family identification number is not necessary but often useful. Pedigree relations are determined by the father and mother identifiers. For example, full sibs have the same mother and father (individuals 003 and 004 from family 001 in Table 1 are full sisters). Half-sibs share only one parent. Note that the pedigree information should allow reconstructing all relative pairs within the family. For instance, in family 002 in Table 1, 007 and 013 are full sisters, and 010 and 015 are cousins. Individuals 008 and 009 did not participate in the study and no phenotypic data are available for them. Nevertheless, these individuals need to be in the pedigree file to link individuals 007 and 013 and also individuals 010 and 015. Individuals 001 and 002 are founders, i.e., individuals with at least one descendant who have neither parent in the pedigree. In ASSOC, founders are assumed to be unrelated by ancestry to any other founder.
7. A loop is present whenever a group of individuals is linked as illustrated in Fig. 1. The loop can be broken by
Fig. 1. Example of a loop in a pedigree.
removing individual 11, for instance. Other possibilities exist, and the choice should depend on the available phenotypic data.
8. Severe outliers, i.e., values lying more than 4 standard deviations away from the mean, for instance, may have a large influence on the results. It is important to check whether this is the case by running a model including all data points and a model that takes the outliers into account (i.e., removing them or using winsorization, such as replacing extreme values by the 1st and 99th percentiles of the data, for instance). Outlier values should always be checked for potential mistakes (e.g., data entry errors).
9. If you encounter problems when importing data from an Excel spreadsheet, save the data as a tab-delimited text file and try again.
10. Such a tab-delimited text file can easily be produced from an Excel spreadsheet by saving it using the appropriate format. One may look at the content of this file using Wordpad or Notepad.
11. This is an example of a PEDINFO parameter file generated by S.A.G.E. The first line defines the program and the name of the output file ("PEDINFO1"). The lines between curly brackets define the traits and covariates to be taken into account when summarizing the pedigree data. There is an option to perform summary statistics for each pedigree separately and an option to output only trait-relevant statistics.
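The winsorization mentioned in Note 8 can be sketched in a few lines; the phenotype here is simulated and the thresholds are the 1st and 99th percentiles, as in the note.

```python
# Winsorization sketch for Note 8: extreme values are replaced by the 1st
# and 99th percentiles instead of being discarded.
import numpy as np

rng = np.random.default_rng(0)
trait = rng.normal(120, 15, size=500)   # simulated phenotype
trait[0] = 400.0                        # severe outlier, e.g., a data-entry error

lo, hi = np.percentile(trait, [1, 99])
trait_w = np.clip(trait, lo, hi)        # winsorized phenotype

print(round(float(trait.max()), 1), round(float(trait_w.max()), 1))
```

After winsorization the outlier no longer dominates the phenotypic variance, while the observation itself (and its pedigree links) is retained in the analysis.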
12. For studies including multicenter data, it is advisable to ensure that the variance components are similar before merging data across centers. If significant differences exist, center-specific standardized phenotypes can be used, as illustrated in ref. 21.
13. If one is interested in accounting for the potential confounding effect of covariates, it is advisable to compare models with and without these covariates, in an incremental manner, to evaluate how each covariate influences the heritability estimate (17). Fixed covariates such as sex and age are usually taken into account because of their major role in most biological processes in humans.
14. For the George and Elston transformation, lambda 1 refers to the power parameter and lambda 2 refers to the shift parameter. Hence, if all the trait values are >0, lambda 1 equal to one means no transformation, lambda 1 equal to zero means a log transformation, and lambda 1 equal to 0.5 means a square root transformation. Theoretically, lambda 1 < 0 can never result in the trait being normally distributed, but in practice it may result in an approximately normal distribution if lambda 1 is not too small. Lambda 2 can usually be fixed to zero. Fixing lambda 1 to one and lambda 2 to zero in the George and Elston transformation is equivalent to no transformation.
15. Datasets with pedigrees having more than two generations are needed to estimate a family effect σ_f^2. Also, large datasets (many families) are usually needed to reliably estimate the marital and family effects σ_m^2 and σ_f^2. In practice, these latter variances are often constrained to be zero and the following simplified model is estimated from the data:

h(y_i) = h(a + b_1 c_1i + b_2 c_2i + ... + b_n c_ni) + p_i + s_i + e_i.    (6)

If the sibship component is estimated to be not significantly different from zero, the model can be further simplified to:

h(y_i) = h(a + b_1 c_1i + b_2 c_2i + ... + b_n c_ni) + p_i + e_i.    (7)
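Whether the sibship component can be dropped (Eq. 6 versus Eq. 7) is the kind of one-sided likelihood ratio comparison mentioned in Subheading 1.3. A hedged numerical sketch, with made-up log-likelihood values and the usual boundary correction:

```python
# Hedged sketch: one-sided LRT for a variance component on the boundary.
# Because sigma^2_s = 0 lies on the edge of the parameter space, the LRT
# statistic is referred to a 50:50 mixture of chi2(0) and chi2(1).
from scipy.stats import chi2

loglik_eq6 = -1234.6   # illustrative maximized log-likelihood, model of Eq. 6
loglik_eq7 = -1235.1   # illustrative value for the reduced model of Eq. 7

lrt = 2.0 * (loglik_eq6 - loglik_eq7)
p_one_sided = 0.5 * chi2.sf(lrt, df=1)
print(round(lrt, 2), round(p_one_sided, 3))
```

With these illustrative values the sibship component would not be retained, and Eq. 7 would be fitted.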
16. The column named “Deriv” provides the numerical values of the partial derivatives of the log likelihood with respect to the parameters. If these partial derivatives are close to zero (e.g., <0.001), this suggests that the maximization process was successful. If these values are much larger than zero, the maximization process was not successful and the results should be interpreted with caution. One can sometimes solve this problem by using a slightly different set of covariates or by winzorising the phenotype. 17. Table 2 presents examples of heritability estimates for selected human quantitative traits estimated from pedigree data using the current ASSOC. ASSOC is being continually improved and
10  Estimating Heritability from Nuclear Family and Pedigree Data

Table 2 Examples of heritability estimates for selected human quantitative traits estimated using ASSOC

Population (reference) | Sample size (individuals) | Trait | h2 (se)
Seychelles, East African descent (11) | 314 | 24-h systolic blood pressure | 0.40 (0.13)
 | | Daytime systolic blood pressure | 0.37 (0.12)
 | | Nighttime systolic blood pressure | 0.34 (0.13)
Seychelles, East African descent (17) | 348 | Serum creatinine | 0.33 (0.34)
 | | Creatinine clearance | 0.52 (0.13)
 | | Inulin clearance | 0.41 (0.10)
South Africa, African descent (22) | 240 | Endogenous lithium clearance | 0.76 (0.10)
 | | Serum sodium | 0.62 (0.14)
 | | Proximal sodium reabsorption | 0.82 (0.08)
Belgium, Caucasians (22) | 737 | Endogenous lithium clearance | 0.45 (0.07)
 | | Serum sodium | 0.38 (0.09)
 | | Proximal sodium reabsorption | 0.56 (0.07)
Belgium, Poland, Czech Republic, Caucasians (21) | 494 | Height | 0.85 (0.07)
 | | Weight | 0.54 (0.08)
 | | Body mass index | 0.43 (0.09)
 | | Peripheral augmentation index (measured with Sphygmocor) | 0.39 (0.10)
 | | Pulse wave velocity (measured with Sphygmocor) | 0.19 (0.11)
one such change that will be available in S.A.G.E. 6.2 is the option to transform the difference, i.e., h(y_i - E(y_i|X_i)), rather than to transform both sides, so check that transformation of both sides is still the default for the version you use.

References

1. Visscher PM, Hill WG, Wray NR (2008) Heritability in the genomics era--concepts and misconceptions. Nat Rev Genet 9: 255-266
2. Falconer DS, Mackay TFC (1996) Introduction to quantitative genetics. Longman, Harlow, Essex
186
M. Bochud
3. Ritland K (1996) A marker-based method for inferences about quantitative inheritance in natural populations. Evolution 50: 1062-1073

4. Thomas SC, Pemberton JM, Hill WG (2000) Estimating variance components in natural populations using inferred relationships. Heredity 84: 427-436

5. Thomas SC (2005) The estimation of genetic relationships using molecular markers and their efficiency in estimating heritability in natural populations. Philos Trans R Soc Lond B Biol Sci 360: 1457-1467

6. Vogel F, Motulsky AG (1997) Human genetics. Problems and approaches. Springer-Verlag, Berlin

7. George VT, Elston RC (1987) Testing the association between polymorphic markers and quantitative traits in pedigrees. Genet Epidemiol 4: 193-201

8. Elston RC, George VT, Severtson F (1992) The Elston-Stewart algorithm for continuous genotypes and environmental factors. Hum Hered 42: 16-27

9. Gray-McGuire C, et al (2009) Genetic association tests: a method for the joint analysis of family and case-control data. Hum Genomics 4: 2-20

10. George V, Elston RC (1988) Generalized modulus power transformation. Communications in Statistics - Theory and Methods 17: 2933-2952

11. Bochud M, et al (2005) High heritability of ambulatory blood pressure in families of East African descent. Hypertension 45: 445-450
12. Levy D, et al (2009) Genome-wide association study of blood pressure and hypertension. Nat Genet 41: 677-687

13. Newton-Cheh C, et al (2009) Genome-wide association study identifies eight loci associated with blood pressure. Nat Genet 41: 666-676

14. Yang J, et al (2010) Common SNPs explain a large proportion of the heritability for human height. Nat Genet 42: 565-569

15. Komlos J, Lauderdale BE (2007) The mysterious trend in American heights in the 20th century. Ann Hum Biol 34: 206-215

16. Magnusson PK, Rasmussen F (2002) Familial resemblance of body mass index and familial risk of high and low body mass index. A study of young men in Sweden. Int J Obes Relat Metab Disord 26: 1225-1231

17. Bochud M, et al (2005) Heritability of renal function in hypertensive families of African descent in the Seychelles (Indian Ocean). Kidney Int 67: 61-69

18. Heller RF, et al (1988) Lifestyle factors in monozygotic and dizygotic twins. Genet Epidemiol 5: 311-321

19. Elston RC, Boklage CE (1978) An examination of fundamental assumptions of the twin method. Prog Clin Biol Res 24A: 189-199

20. Hall JG (2003) Twinning. Lancet 362: 735-743

21. Seidlerova J, et al (2008) Heritability and intrafamilial aggregation of arterial characteristics. J Hypertens 26: 721-728

22. Bochud M, et al (2009) Ethnic differences in proximal and distal tubular sodium reabsorption are heritable in black and white populations. J Hypertens 27: 606-612
Chapter 11

Correcting for Ascertainment

Warren Ewens and Robert C. Elston

Abstract

Data used to study human genetics are often not obtained by simple random sampling, which is assumed by many statistical methods, especially those that are based on likelihoods for making inferences. There is a well-developed theory to correct likelihoods based on sibship data whether or not the exact mode of ascertainment is known. In the case of larger pedigrees, however, the problem is much more difficult unless they are recruited into the sample by single ascertainment. There is no one piece of software that analyzes ascertainment in general, so most of this chapter is devoted to theory. A general method by which one general genetic analysis software package corrects pedigree data for ascertainment is briefly described.

Key words: Proband, Catchment area, Proband sampling frame, Complete ascertainment, Single ascertainment, Multiple ascertainment, Sequential sampling, Proband-dependent sampling, Pseudolikelihood
1. Introduction

It is a basic tenet of statistical theory that the analysis of any body of data by statistical methods must take into account the method by which the data were obtained. By far the most frequently made assumption is that the data were obtained by simple random sampling from some population. If this assumption is correct, it is easy to make valid inferences from the sample to this population. However, the data used in the study of the genetic basis of diseases are often not obtained by simple random sampling. The reason for this is that, since (fortunately) genetic diseases are rare, random sampling is often not a practicable approach to obtaining a sample of sufficient size to draw reliable conclusions. For example, if a disease affects one person in ten thousand and a sample of five hundred affected individuals is needed for a sufficiently precise analysis, a simple random sample of five million is needed, usually beyond the bounds of practicability.
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_11, # Springer Science+Business Media, LLC 2012
187
188
W. Ewens and R.C. Elston
In the simplest case of nonrandom sampling an entire family is “ascertained,” that is, it comes to the attention of the investigator through one or more affected children in that family. The affectedness status and aspects of the genetic composition of each child in the family are then examined. Clearly, if the disease does in fact have a genetic basis, the members of any ascertained family are more likely to carry the disease variant (or variants), and thus to be affected, than individuals taken at random from the population of interest. Thus statistical procedures based on the assumption of simple random sampling are not appropriate when applied to data from any ascertained family, and in particular will lead to a possibly substantial overestimation of the population frequency of the disease variant(s). The statistical procedure that should be employed depends on the properties of the ascertainment process. For example, if a family with three affected children is three times as likely to be ascertained as is a family with one affected child, the estimation procedure that is appropriate differs from that which is appropriate when the probability of ascertainment of a family is independent of the number of affected children in it. The need for a sampling theory suitable for data from ascertained families was recognized a century ago (1–3), and the theory has developed steadily since this time. If the ascertainment procedure is known, the correct sampling theory can be derived. Problems arise if it is unknown, and some of our discussion concerns the problems facing an analyst who is given data from a collection of ascertained families and who is unaware of the properties of the ascertainment sampling procedure that led to these families. 
Our aims in this chapter are, first, to survey various methods that have been used for analyzing genetic segregation in data arrived at by an ascertainment procedure, and second, to describe some of the problems that arise when the ascertainment procedure is unknown to the analyst, together with some approaches suggested for overcoming these problems. Subheadings 2.1-2.4 discuss in some detail the theory as it developed up to the mid-1980s for a binary phenotype observed on members of nuclear families. Subheadings 2.5 and 2.6 introduce the concept of the ascertainment-assumption-free (AAF) method, the latter comparing standard errors of the model-based and model-free methods of allowing for ascertainment, again for a binary phenotype observed on members of nuclear families. Subheading 2.7 extends everything to continuous phenotypes and Subheadings 2.8-2.9 consider what can be done for larger pedigree structures. There is no one piece of software that analyzes ascertainment in general, so most of this chapter is devoted to theory. We do, however, discuss in Subheading 2.9 the general approach taken in the program package S.A.G.E. (see Chapters 12 and 30).
2. Methods

2.1. Definitions and Statistical Theory
We start by considering family (sibship) data. A proband is defined as any individual of extreme phenotype (usually affected with a disease) who comes to the attention of the investigator and who would thus, on his own, be sufficient to lead to ascertaining the family of that individual. A family might have several probands. Most of the existing theory requires that the event that one child in a family is a proband be independent of the event that another child in the family is a proband, this being taken to be part of the definition of a proband; an exception to this is discussed in ref. 4. The set of potential probands in a family is called the proband sampling frame (PSF) of that family (5). This set might consist in practice of all children in the family who could become probands, for example by living in the catchment area of the sampling process. The proband combination of any family is the set of actual probands, independent or not, in that family, and is a subset (maybe the entire set) of the PSF of that family.

In the examples given below, we often refer to data derived from a collection of n families. All these families are assumed to have been ascertained through some ascertainment process, so that from now on we refer to families, rather than "ascertained sibships," in our discussion. Parameter estimation in ascertainment sampling analysis is almost invariably carried out by the method of maximum likelihood, a procedure that for large n has optimality properties. The likelihood used in ascertainment sampling is found by using conditional probabilities, where the condition is the event that the family is ascertained. Thus the relevant likelihood L is

    L = \prod_{i=1}^{n} \frac{\Pr(D_i, A_i)}{\Pr(A_i)},   (1)

where D_i denotes the data in family i, including genetic, phenotypic, and proband status data, A_i denotes the event of ascertainment of family i (i = 1, 2, ..., n), and Pr(X) denotes the probability (mass or density) of X. The ascertainment problem derives from the facts that this likelihood depends on the nature of the ascertainment scheme, that in practice this scheme might be unknown to the data analyst, and that any misspecification of the ascertainment scheme will almost always lead to biased estimators of the genetic parameters.

There are three further points about the likelihood to be used in an ascertainment analysis. First, in practice, this likelihood will usually contain multiplicative combinatorial constants which are irrelevant to the estimation procedure and which are, therefore, ignored in the calculations given below. These constants are
generically written below as "const.". Second, it has been argued by some authors (6) that when ascertained families can be thought of as coming from some well-defined population or group, unobserved families should be thought of as part of the data, and that the likelihood should thus contain a contribution from those unobserved families. Very few other authors use unobserved families, and it can be shown that using them leads to essentially the same estimates as those obtained from the likelihood involving only observed families. This is discussed further below. Thus in practice unobserved families are not considered in current methods. Third, it has been claimed that the family size distribution (FSD) should be used in the likelihood for parameter estimation. In practice, this is rarely if ever done. There are four reasons for this. First, even if the FSD is known, only a minor increase in precision arises by using it. Second, if it is unknown but some erroneous form for it is assumed, biases will arise in the estimation of genetic parameters. Third, if a completely general form for the FSD is assumed, then the estimates of genetic parameters are the same as those obtained by using the likelihood Eq. 1. Finally, the mathematical form of the likelihood which does not use the FSD is simpler than the form that incorporates the FSD.

With these background remarks in place, we now turn to some specific examples of calculations involving the likelihood L given above, emphasizing the way in which this likelihood depends on the nature of the ascertainment process and the problems that can arise if an incorrect ascertainment process is assumed in the analysis of the data obtained by the ascertainment procedure.

2.2. A Simple Example
In this section, we discuss a simple example that is sufficient to illustrate the points raised above. Define a family to be "at risk" if the two parents can produce children (the sibship) who are affected by the disease of interest, and suppose that in any "at risk" family the sibs are independently affected, the probability of any child being affected being p. Our aim is to estimate p. We consider a sample of n families in which there are s_i sibs in family i, r_i of whom are affected by the disease of interest. We consider two possible ascertainment schemes.

The first of these is complete ascertainment. Under this scheme, a family is ascertained if at least one sib in the family is affected. Such a situation might arise in a nation having a registry with a listing of all individuals in the country, together with the disease status of each individual. The probability of ascertainment of family i, the denominator in this family's contribution to the conditional likelihood in Eq. 1, is 1 - (1 - p)^{s_i}. The numerator in this family's contribution to the conditional likelihood (Eq. 1) is proportional to p^{r_i}(1 - p)^{s_i - r_i}. The conditional likelihood L in Eq. 1 then becomes
    const. \prod_{i=1}^{n} \frac{p^{r_i} (1-p)^{s_i - r_i}}{1 - (1-p)^{s_i}}.   (2)
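To see why the correction built into Eq. 2 matters, one can simulate complete ascertainment. The simulation below is an illustration, not part of the chapter: families are retained only if at least one of s sibs is affected, and the naive estimate R/S, which ignores the ascertainment, then overshoots p.

```python
import random

def affected_in_ascertained_family(s, p, rng):
    """Number of affected sibs in a family of size s, conditional on
    at least one affected sib (complete ascertainment)."""
    while True:
        r = sum(rng.random() < p for _ in range(s))
        if r > 0:
            return r

rng = random.Random(1)
s, p_true, n = 3, 0.2, 20000
rs = [affected_in_ascertained_family(s, p_true, rng) for _ in range(n)]
naive = sum(rs) / (n * s)   # R/S, ignoring ascertainment
# E[R/S] = p / (1 - (1-p)^s), here about 0.41, roughly double the true 0.2
print(round(naive, 2))
```

The rejection loop draws a Binomial(s, p) count and keeps it only when it is positive, which is exactly the zero-truncated distribution underlying Eq. 2.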
The maximum likelihood estimator \hat{p} of p derived from this likelihood is given implicitly by the solution of the equation

    \frac{R}{\hat{p}} = \sum_{i=1}^{n} \frac{s_i}{1 - (1-\hat{p})^{s_i}},   (3)
where R is the total number of affected children in the sample of n families. Although no explicit solution of this equation is in general possible, numerical methods rapidly lead to a numerical solution of Eq. 3. It may be shown that this (implicit) estimator is usually very close to the explicit but approximate (7) estimator

    \hat{p} = \frac{R - T}{S - T},   (4)
where T is the number of families in the sample with exactly one affected sib and S is the total number of sibs in the sample. Both the estimator derived from Eq. 3 and the explicit estimator (Eq. 4) are biased, but the bias is usually small and both are asymptotically (n -> infinity) unbiased. Thus in practice, if the data are derived from a complete ascertainment process and this is known to the analyst, satisfactory parameter estimation is obtained, at least for large n.

Under single ascertainment, the probability that any family is ascertained is assumed to be proportional to the number of affected sibs in the family (a more precise definition is given later). The probability that a family with s sibs is ascertained is then

    const. \sum_{r=1}^{s} r \binom{s}{r} p^r (1-p)^{s-r} = const. \, sp.

From this it follows that, conditional on the event of the ascertainment of this family, the probability that it contains exactly r affected sibs (r = 1, 2, ..., s) is

    \binom{s-1}{r-1} p^{r-1} (1-p)^{s-r}.   (5)

Given again a sample of n families, with r_i affected sibs and s_i sibs altogether in family i, the likelihood (Eq. 1) is found from Eq. 5 to be

    L = const. \prod_{i=1}^{n} p^{r_i - 1} (1-p)^{s_i - r_i} = const. \, p^{R-n} (1-p)^{S-R},   (6)
with R and S being as defined above. From this the (explicit) maximum likelihood estimator of p is

    \hat{p} = \frac{R - n}{S - n}.   (7)
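The estimators in Eqs. 3, 4, and 7 are easy to compute. The following sketch (an illustration, not the chapter's software) solves Eq. 3 by bisection for simulated complete-ascertainment data, applies the explicit approximation of Eq. 4, and applies Eq. 7 to simulated single-ascertainment data, for which Eq. 5 says that r - 1 is Binomial(s - 1, p):

```python
import random

def p_hat_complete(rs, sizes):
    """Solve Eq. 3, R/p = sum_i s_i / (1 - (1-p)^{s_i}), by bisection."""
    R = sum(rs)
    def score(p):
        return R / p - sum(s / (1.0 - (1.0 - p) ** s) for s in sizes)
    lo, hi = 1e-9, 1.0 - 1e-9   # score is positive at lo, negative at hi
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

rng = random.Random(7)
p_true, s, n = 0.2, 3, 20000

# complete ascertainment: keep families with at least one affected sib
rs = []
while len(rs) < n:
    r = sum(rng.random() < p_true for _ in range(s))
    if r > 0:
        rs.append(r)
R, S = sum(rs), n * s
T = sum(1 for r in rs if r == 1)        # families with exactly one affected sib
p3 = p_hat_complete(rs, [s] * n)        # implicit estimator, Eq. 3
p4 = (R - T) / (S - T)                  # explicit approximation, Eq. 4

# single ascertainment: Eq. 5 implies r = 1 + Binomial(s-1, p)
rs1 = [1 + sum(rng.random() < p_true for _ in range(s - 1)) for _ in range(n)]
p7 = (sum(rs1) - n) / (n * s - n)       # unbiased estimator, Eq. 7
```

With 20,000 families all three estimates land close to the true value 0.2, in line with the asymptotic unbiasedness claims above, provided each estimator is matched to its own ascertainment scheme.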
If single ascertainment is indeed the case, Eq. 5 shows that the mean number of affected sibs in a singly ascertained family having s sibs is 1 + (s - 1)p, and from this the mean of R is n + (S - n)p. This implies that \hat{p} is an unbiased estimator of p. Thus if single ascertainment is the case and this is known to the analyst, unbiased estimation of p is possible.

We now consider the situation when the ascertainment process is unknown to the analyst, and the data are analyzed using an incorrect assumption about that process. If the true ascertainment process is single ascertainment, but the data were analyzed assuming complete ascertainment, both the estimator derived from Eq. 3 and the explicit estimator (Eq. 4) are substantially biased estimators of p. Similarly, if the true ascertainment scheme were complete ascertainment, the estimator (Eq. 7) derived by assuming single ascertainment is a substantially biased estimator of p.

We illustrate the level of these biases by considering the case where all families sampled have three children, so that S = 3n. Suppose first that complete ascertainment is the case but single ascertainment is incorrectly assumed and the estimator (Eq. 7) used. Since under complete ascertainment the mean of R is 3np/[1 - (1 - p)^3], the mean of the estimator (Eq. 7) is

    \frac{3np/[1 - (1-p)^3] - n}{3n - n} = p \cdot \frac{3 - p}{6 - 6p + 2p^2},

confirming that this estimator is biased, with a bias between 0 and 50% depending on the value of p. Conversely, if single ascertainment had been the case, the mean of R is n(1 + 2p) and the mean of T is n(1 - p)^2. The mean of the estimator (Eq. 4) used assuming complete ascertainment is then asymptotically p(4 - p)/(2 + 2p - p^2), so that this estimator is asymptotically biased, with an asymptotic bias between 0 and 100%, again depending on the value of p.

2.3. Multiple Ascertainment
The multiple ascertainment model (3, 8) embraces both the single and complete ascertainment schemes as particular cases. It is assumed in this model that each sib in an "at risk" family is independently affected by the disease of interest with probability p, and that any affected sib in a family independently becomes a proband with a fixed (but unknown) probability \pi. Given that an "at risk" family has r affected sibs, the probability that it has at least one proband is 1 - (1 - \pi)^r. The probability that such a family having s sibs is ascertained is thus
    \sum_{r=1}^{s} \binom{s}{r} p^r (1-p)^{s-r} \{1 - (1-\pi)^r\} = 1 - (1 - \pi p)^s.   (8)
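The identity in Eq. 8 follows from the binomial theorem, since the sum telescopes into 1 - [(1-p) + p(1-\pi)]^s. A quick numerical check (illustration only):

```python
from math import comb

def lhs(s, p, pi):
    # sum over r of P(r affected) * P(at least one of the r is a proband)
    return sum(comb(s, r) * p**r * (1 - p)**(s - r) * (1 - (1 - pi)**r)
               for r in range(1, s + 1))

def rhs(s, p, pi):
    return 1 - (1 - pi * p)**s

checks = [(s, p, pi) for s in (1, 2, 5) for p in (0.1, 0.5) for pi in (0.2, 0.9)]
assert all(abs(lhs(*c) - rhs(*c)) < 1e-12 for c in checks)
```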
Since the probability that this family has q probands is proportional to \pi^q (1 - \pi)^{r-q}, its contribution to the likelihood L is

    const. \frac{\pi^q (1-\pi)^{r-q} \, p^r (1-p)^{s-r}}{1 - (1 - \pi p)^s}.

Thus

    L = const. \frac{\pi^Q (1-\pi)^{R-Q} \, p^R (1-p)^{S-R}}{\prod_{i=1}^{n} [1 - (1 - \pi p)^{s_i}]},   (9)
where Q is the total number of probands among the n families, R the total number of affected sibs, S the total number of sibs, and s_i the number of sibs in family i. When \pi = 1, it is necessary that Q = R, and Eq. 9 then reduces to Eq. 2, so that complete ascertainment is a limiting case of multiple ascertainment. When \pi -> 0, Eq. 9 reduces (apart from multiplicative factors) to Eq. 6, so that single ascertainment is also a limiting case of multiple ascertainment. Further, these two limiting cases bound the range of possible values of \pi. It might then be thought that, in cases where the nature of the ascertainment procedure is unknown to the analyst, use of the multiple ascertainment model will overcome the ascertainment problems mentioned above. We discuss this view in the next section.

Estimation of p requires the joint estimation of p and \pi, and we demonstrate the calculations by a simple example. Suppose for simplicity that there are s sibs in each of the n families in the data. The likelihood (Eq. 9) then becomes

    const. \pi^Q (1-\pi)^{R-Q} \, p^R (1-p)^{ns-R} \, [1 - (1 - \pi p)^s]^{-n},   (10)
where R is the total number of affected children and Q the total number of probands among the n families. Joint estimation of p and \pi implies solving the following simultaneous equations for \hat{p} and \hat{\pi}:

    \frac{R}{\hat{p}} - \frac{ns - R}{1 - \hat{p}} - \frac{ns \hat{\pi} (1 - \hat{\pi}\hat{p})^{s-1}}{1 - (1 - \hat{\pi}\hat{p})^s} = 0,   (11)

    \frac{Q}{\hat{\pi}} - \frac{R - Q}{1 - \hat{\pi}} - \frac{ns \hat{p} (1 - \hat{\pi}\hat{p})^{s-1}}{1 - (1 - \hat{\pi}\hat{p})^s} = 0.   (12)
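Equations 11 and 12 have no closed-form solution, but nested bisection works well in practice. The sketch below is an illustration, not the chapter's software; as a check, feeding in the expected values of R and Q under known (p, \pi) should return those same values:

```python
from math import comb

def scores(p, pi, R, Q, n, s):
    """Left-hand sides of Eqs. 11 and 12 (equal sibship size s)."""
    t = 1.0 - (1.0 - pi * p) ** s
    g = n * s * (1.0 - pi * p) ** (s - 1) / t
    f1 = R / p - (n * s - R) / (1.0 - p) - pi * g
    f2 = Q / pi - (R - Q) / (1.0 - pi) - p * g
    return f1, f2

def solve(R, Q, n, s):
    def pi_root(p):          # solve Eq. 12 in pi for fixed p
        lo, hi = 1e-9, 1.0 - 1e-9
        for _ in range(100):
            mid = (lo + hi) / 2.0
            if scores(p, mid, R, Q, n, s)[1] > 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0
    lo, hi = 1e-9, 1.0 - 1e-9   # then solve Eq. 11 along that profile
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if scores(mid, pi_root(mid), R, Q, n, s)[0] > 0:
            lo = mid
        else:
            hi = mid
    p = (lo + hi) / 2.0
    return p, pi_root(p)

# expected totals in n ascertained families of size s at p = 0.3, pi = 0.5
p0, pi0, s, n = 0.3, 0.5, 3, 1000
P_asc = 1 - (1 - pi0 * p0) ** s
R = n * sum(r * comb(s, r) * p0**r * (1 - p0)**(s - r) * (1 - (1 - pi0)**r)
            for r in range(1, s + 1)) / P_asc
Q = n * pi0 * s * p0 / P_asc
p_hat, pi_hat = solve(R, Q, n, s)   # recovers approximately (0.3, 0.5)
```

Because the log likelihood of Eq. 10 is linear in Q and R, the score equations are satisfied exactly at the true parameter values when the expected totals are supplied, which makes this a convenient self-test of any implementation.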
These equations must be solved numerically (see Note 1).

2.4. Comments on the Multiple Ascertainment Procedure
Since complete ascertainment is the special case of multiple ascertainment with \pi = 1 and single ascertainment the limiting case with \pi -> 0, it might be thought that bounds for \hat{p} can be found by assuming, respectively,
complete and single ascertainment, estimating p under both assumptions, thus bracketing the correct value. However, the limits \pi = 1 and \pi -> 0 do not, in general, form bounds for all possible ascertainment schemes. From a different point of view, one can think of complete ascertainment as the case where the probability of ascertainment of an "at risk" family is independent of the number of affected sibs in the family, provided that this number exceeds zero. That is, from a mathematical point of view, one can think of this probability as being proportional to r^0, where r is the number of affected sibs. Similarly, under single ascertainment the probability of ascertainment of an "at risk" family is a linear function of the number of affected sibs, so that it is proportional to r^1. From this point of view, one can envision a situation where the probability of ascertainment of an "at risk" family is proportional to r^2, so that it is a quadratic function of the number of affected sibs. Such a quadratic ascertainment procedure would arise if a family came to the attention of a physician with probability proportional to r, that is, via single ascertainment, and any family so obtained were then passed on to an investigator, again with probability proportional to r, leading to a quadratic ascertainment process for that investigator. This quadratic process is outside the bounds of the multiple ascertainment model. It was noted early (9) that in practice the probability of ascertaining a family might increase more than proportionately with r, and models have been suggested (10) where this probability is of the form r^a, the possibility that a > 1 being allowed.

A further problem with the multiple ascertainment model is that it embodies two assumptions which must, in practice, often be unrealistic. The first of these is that sibs within the same family become probands independently.
It is difficult to imagine that this is a realistic assumption in the case of young children in the care of their parents. Second, it assumes that all affected children become probands with the same probability \pi, an assumption that again seems unrealistic. It is possible to build up a general statistical framework that does not assume independence of proband status and at the same time allows different affected sibs in the same family to have different probabilities of being a proband (4). (We indicate the main features of this approach in Subheading 2.9, where we discuss ascertainment in the context of larger pedigrees.) Unfortunately, from a mathematical point of view, any progress under these more general assumptions must be limited if our sample comprises only independent sibships, since allowing for dependence of proband status and allowing for different within- and between-family probandship probabilities introduces more parameters than can be estimated from the data. This can be seen in the following way. Suppose that all families in a sample of n families have two sibs. We denote the probability of ascertainment of a family with i affected sibs by some unknown
parameter a_i. It turns out that only the ratio of a_1 and a_2 is important, so we write a_1 as 1 - a and a_2 as a. Suppose that in the sample there are n_1 families with one affected sib and n_2 with two affected sibs (so that n_1 + n_2 = n). Then the likelihood L in Eq. 1 is proportional to

    \frac{(1-a)^{n_1} a^{n_2} \, p^{n_1 + 2n_2} (1-p)^{n_1}}{[2p(1-p)(1-a) + p^2 a]^n}.

This expression may be written as (1-q)^{n_1} q^{n_2}, where q = pa/[2(1-p)(1-a) + pa]. This expression is maximized when q = n_2/n, but there are infinitely many (p, a) combinations satisfying this equation. Thus unique estimation of p is not possible even in this simple case, and the same is true in more complicated and realistic cases.

Next we consider the suggestion, referred to above, that unobserved families should be thought of as being part of the data. If the total number of families in the relevant catchment area is N, there will be N - n unobserved families. If N is treated as an unknown, a set of three simultaneous equations is obtained which must be solved for \hat{p}, \hat{\pi}, and \hat{N}. It is found that the solutions for \hat{p} and \hat{\pi} are identical to those deriving directly from Eqs. 11 and 12. Thus there is no need to introduce the concept of unobserved families.

The main conclusions from the discussion so far are clear. On the one hand, an ascertainment correction is needed for parameter estimation when data are obtained through some ascertainment process. If the properties of the ascertainment process are completely known, this correction can be made. On the other hand, if these properties are unknown, significant biases can arise when an incorrect ascertainment scheme is assumed by the data analyst. Further, the assumptions made in the multiple ascertainment process must, in many cases, be unrealistic. Finally, complete and single ascertainment do not necessarily provide the "limits" of ascertainment, so that parameter estimates found by respectively assuming these two ascertainment processes do not necessarily bound the true values.

2.5. The Ascertainment-Assumption-Free Method
In the ascertainment models considered in the previous section, it is assumed that the probability of ascertainment of a family depends, in some specified way, on the number r of affected sibs in the family. For example, under single ascertainment this probability is assumed to be proportional to r, under complete ascertainment it is assumed to be independent of r (so long as r > 0), while under multiple ascertainment (see Eq. 8) this probability is assumed to be 1 - (1 - \pi)^r, where r is the number of affected sibs in the family. As noted above, these three cases lead to different ascertainment
corrections, and assuming one ascertainment scheme when another is appropriate leads to biased estimation.

Suppose then that it is assumed that the probability of ascertaining a family having s sibs, r of whom are affected, is some unspecified parameter a(s, r). Write d for the data for this family and P_s(d) for the probability of these data. Finally, write the probability that this family has j affected sibs as P_s(j). Then, given that this family is ascertained, the contribution to the likelihood provided by this family is P_s(d) a(s, r) / [\sum_j P_s(j) a(s, j)]. More generally, we may suppose that the data d in any family can be divided into two mutually exclusive parts, namely d_1 and d_2. Here d_1 is that part of d that is "relevant to ascertainment," so that in the example above d_1 = r. We correspondingly write P(d) as P(d_1, d_2). Suppose that in the data analyzed there are n(s, a, b) families having s sibs with d_1 = a, d_2 = b. Then the entire likelihood of the data is

    \prod_s \prod_a \prod_b \left[ \frac{P_s(a, b) \, a(s, a)}{\sum_a P_s(a) \, a(s, a)} \right]^{n(s,a,b)}.   (13)
It is now necessary to maximize this likelihood with respect to both the ascertainment parameters a(s, r) and the genetic parameters involved in P(d_1) and P(d_1, d_2), respectively. We do not provide the details of this process here (for these see ref. 11) and note only that estimation of the ascertainment parameters separates out from estimation of the genetic parameters, and that the latter are estimated directly by maximizing the likelihood

    \prod_s \prod_a \prod_b \left[ \frac{P_s(a, b)}{P_s(a)} \right]^{n(s,a,b)}.   (14)
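The structure of Eq. 14 can be made concrete with a toy model that is not from the chapter: suppose d_1 = r drives ascertainment, while d_2 records how many of the r affected sibs carry a hypothetical marker, each independently with probability m. Then P_s(a, b)/P_s(a) is simply a binomial probability, every ascertainment parameter a(s, r) has cancelled, and m can be estimated without any assumption about the ascertainment scheme:

```python
from math import comb, log

# hypothetical data: (r affected, k marker carriers among them) per ascertained family
families = [(1, 1), (2, 1), (2, 2), (3, 1), (1, 0), (3, 2)]

def aaf_loglik(m):
    """log of Eq. 14 for this toy model: P(d1, d2)/P(d1) = Binomial(r, m) at k."""
    return sum(log(comb(r, k)) + k * log(m) + (r - k) * log(1 - m)
               for r, k in families)

m_grid = max(range(1, 1000), key=lambda i: aaf_loglik(i / 1000)) / 1000
m_closed = sum(k for _, k in families) / sum(r for r, _ in families)  # 7/12
```

The grid maximizer agrees with the closed-form answer, total carriers over total affected, illustrating that the AAF genetic-parameter estimates come from conditioning on the part of the data relevant to ascertainment.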
We call this the "AAF" likelihood, and the estimator(s) derived from it the AAF estimator(s). Note 2 gives a simple example that illustrates the main points.

2.6. Standard Errors
So far we have focused on the means of estimators of genetic parameters, and from this have also considered the conditions under which various estimators are asymptotically unbiased. However, an unbiased estimator is of comparatively little value unless an indication of its variance, and thus of its standard error, is provided. Because of the asymptotic optimality properties of maximum likelihood estimators, all the estimators used in ascertainment sampling procedures are maximum likelihood estimators. These optimality properties hold not only for unbiased maximum likelihood estimators, but also for biased estimators provided that the bias is of order n^{-1}, where n is the number of families in the data. In all the examples considered the bias is of this order provided that the correct ascertainment scheme is assumed. Biased estimators of
this type are possibly important since a biased estimator might have a smaller asymptotic variance than that enjoyed by an unbiased estimator. If various reasonable regularity conditions are satisfied, and only one parameter p is estimated by the maximum likelihood estimator \hat{p}, then as n approaches infinity \hat{p} is asymptotically normally distributed, with mean p and variance equal to the inverse of the expected value of -\partial^2 (\log L)/\partial p^2, where L is the likelihood for the data. As an example, the logarithm of the likelihood (Eq. 6) is log L = (R - n) log p + (S - R) log(1 - p), and from this

    -\frac{\partial^2 (\log L)}{\partial p^2} = \frac{R - n}{p^2} + \frac{S - R}{(1-p)^2}.

The only random variable in this expression is R, the number of affected sibs, and the discussion below Eq. 7 shows that under single ascertainment the expected value of R is n + (S - n)p. From this, the asymptotic variance of \hat{p} is p(1 - p)/(S - n). (This is in fact the correct variance for all n.) The estimate \hat{p} should then be reported in conjunction with its estimated standard deviation \sqrt{\hat{p}(1 - \hat{p})/(S - n)}. Similar, although more complicated, calculations arise for the case of complete ascertainment. Generalizations of this procedure to the case where several parameters are estimated are available but are not discussed here.

Although the AAF method might overcome various ascertainment problems as described above, this property comes at a price. If, for example, the true ascertainment scheme had been single ascertainment, then the standard errors of the estimates of genetic parameters for an analyst who correctly assumed single ascertainment would be smaller than those for an analyst who used the AAF technique.

So far the only phenotype that has been considered is the binary phenotype "affected or not affected," and a review of ascertainment in this situation is also given in ref. 12. In some cases, however, the phenotype might be some continuous measurement, to which we now briefly turn.

2.7. Continuous Data
Consider blood pressure, for example. In such a case, a family might be ascertained only if at least one sib in the family has a blood pressure exceeding some threshold T. In this case, complete ascertainment applies if the probability of ascertainment of a family is independent of the number of sibs having blood pressure exceeding T, provided that at least one sib does. Single ascertainment applies if the probability of ascertaining the family is proportional to the number of sibs having blood pressure exceeding T. Other ascertainment possibilities of course also exist, and (as above) the true nature of the ascertainment process might not be known to the analyst, in which case an AAF procedure might be thought desirable.
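For instance, with s standard normal measurements and threshold T, the two ascertainment probabilities are simple to compute. Independence of the sibs' measurements is assumed here purely for illustration; the chapter's joint density f need not factorize:

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def pr_A_complete(T, s):
    """P(at least one of s independent N(0,1) measurements exceeds T)."""
    return 1.0 - Phi(T) ** s

def pr_A_single(T, s):
    """Under single ascertainment Pr(A) is proportional to the expected
    number of measurements above T (constant of proportionality omitted)."""
    return s * (1.0 - Phi(T))

T, s = 1.645, 4                        # T near the upper 5% point
print(round(pr_A_complete(T, s), 3))   # about 0.185
print(round(pr_A_single(T, s), 3))     # about 0.200
```

The complete-ascertainment probability is always at most the single-ascertainment sum, by the union bound, and the two coincide only when at most one sib can exceed T.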
W. Ewens and R.C. Elston
Consider some ascertained family having s sibs, with respective measurements x1, x2, . . ., xs for the measurement of interest. These measurements would reasonably be assumed to be identically but not necessarily independently distributed, and thus might be assumed to have an arbitrary joint density function f(x1, x2, . . ., xs). The density function for any one measurement is denoted by f(x). The likelihood contribution from this family is thus of the form

f(x1, x2, . . ., xs, A)/Pr(A),     (15)
where A is the event that the family is ascertained and Pr(A) is the probability of this event. The expression for Pr(A) depends on the nature of the ascertainment scheme. In the case of complete ascertainment, it is the probability that at least one measurement exceeds T and is thus found by a multiple integration involving f(x1, x2, . . ., xs). For single ascertainment, Pr(A) is proportional to the probability that the measurement for any one sib exceeds T and is thus found by an integration involving the marginal density f(x). Under the AAF approach, if in any family exactly r of the s sibs have measurements exceeding T, then Pr(A) is the probability that exactly r measurements do exceed T. This probability is found by a multiple integration. In the case of continuous data another form of conditioning is possible. Thus under single ascertainment, if sib i in the family described above is the proband and xi is the measurement value for this sib, use of the conditional likelihood

f(x1, x2, . . ., xs, A)/f(xi)     (16)
can be shown to lead to asymptotically unbiased estimation. This is a “conditioning on measured values” likelihood. It is only for single ascertainment that asymptotically unbiased estimation arises from conditioning on measured values. Under single ascertainment, the asymptotic variance of the estimator using Eq. 16 exceeds that of the estimator using Eq. 15. This result is discussed in detail in ref. 13, where many further results concerning ascertainment sampling of continuous characters can be found.

2.8. Pedigrees, Sequential and Proband-Dependent Sampling
All of the discussion so far relates to sibships. This implies that a comparatively simple analysis is possible, provided that either the ascertainment scheme is known or the AAF method is used and that all sibs in any ascertained family are examined. Pedigrees raise problems more formidable than those for families. First, calculations for pedigrees are usually far more complicated than those for families. But more important, conceptual problems arise for pedigrees that do not arise for families. Examining all the sibs in a
11 Correcting for Ascertainment
nuclear family is straightforward, since a sibship is a well-defined concept and every person belongs to one and only one sibship. As noted below, under so-called “proband-dependent” sampling, it is necessary to know a true family or pedigree structure defined in such a way that everyone belongs to one and only one pedigree. The first major advance on the correct method for analyzing pedigree data obtained from an ascertainment process considered a sequential sampling scheme (14). Under the assumption of a single proband, the appropriate likelihood is the conditional probability of the data eventually gathered given that the proband is affected, provided the following two sequential sampling rules are followed: (1) the choice of individuals to be examined next at any stage of the sequential sampling process depends only on the affectedness status of the individuals already sampled and (2) the data from all individuals examined, both affected and unaffected, are entered into the likelihood used for parameter estimation. Subsequent work (15) showed that this claim is correct when both the following requirements hold: (1) there is only one proband in the pedigree and (2) the ascertainment process is single in a more precise manner than that given above and in a way (16) that generalizes previous definitions found in the literature. This is the requirement that the probability that the pedigree be ascertained, given the true structure of the pedigree, is independent of that structure. This in effect implies that the probability that there is more than one proband for a pedigree be zero. (It is possible to generalize the proband concept to allow for a “proband configuration,” for example an affected sib pair. In this case, the requirement is in effect that the probability that there is more than one proband configuration for a pedigree be zero). 
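Under the classical simplifying assumption that, given the parental mating type, each sib is affected independently with probability p, conditioning on the (single) proband being affected gives a per-family likelihood p^(r−1)(1 − p)^(s−r) for a sibship with s sibs of whom r are affected; the product over families is the likelihood whose logarithm appears in Eq. 5. A sketch with hypothetical family data:

```python
from math import log

# Hypothetical ascertained sibships: (number of sibs s, number affected r),
# each with a single proband (so r >= 1).
families = [(4, 2), (3, 1), (5, 3), (4, 1), (3, 2)]

def conditional_loglik(p, families):
    """Sum over families of log[p^(r-1) (1-p)^(s-r)]: the probability of
    the sibship data conditional on the proband being affected."""
    return sum((r - 1) * log(p) + (s - r) * log(1 - p) for s, r in families)

n = len(families)                 # number of sibships
S = sum(s for s, r in families)   # total sibs
R = sum(r for s, r in families)   # total affected
p_closed = (R - n) / (S - n)      # closed-form maximizer of Eq. 5

# Grid search agrees with the closed form.
grid = [i / 10000 for i in range(1, 10000)]
p_grid = max(grid, key=lambda p: conditional_loglik(p, families))
print(p_closed, p_grid)
```

This illustrates the conditioning principle only for the simplest single-proband sibship case; the sequential-sampling rules in the text extend it to pedigrees.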
If the ascertainment scheme is something other than single ascertainment, more than one member of a pedigree could be a proband. This raises the possibility of proband-dependent (PD) sampling, a situation that arises when the members of any pedigree who are examined depend on who the proband is (probands are) for that pedigree. In Note 3, we illustrate the general framework for calculating the correct likelihood in this situation when the details of the initial ascertainment scheme are known. In Note 4, we discuss a pseudo-likelihood approach (17) appropriate for the situation when the details of the initial ascertainment scheme are unknown.

2.9. Pedigrees, A General Approach Without the Concept of Proband-Dependent Sampling
We now describe a very general likelihood approach (4) to allow for ascertainment in segregation, linkage or association analysis, or any combination of these three types of analysis, without using the concept of PD sampling. This approach makes the assumption that there exists a population of distinct true pedigrees from which we could in principle obtain a random sample. In actual fact we can do no better than assume the existence of such a
population that represents the population we are sampling from, because the definition of a “true pedigree” depends on the depth to which the relatives of any probands are investigated. A well-designed pedigree study begins with the definition of the sampling design. We usually start collecting a pedigree through a single proband, or more generally a proband combination, and continue sampling the pedigree up to a predetermined depth. Provided we keep a record of the existence of all those in the true pedigree—defined by the predetermined depth—the true pedigree structure becomes available and the ascertainment correction problem becomes potentially tractable. Without knowledge of the true underlying pedigree structure (including who the unobserved members of the pedigree are) it is not possible to write down a correct likelihood and the ascertainment correction problem becomes intractable (18). This conclusion holds even for linkage analysis, where the parameter to be estimated is a recombination fraction between a disease and a marker locus, though the errors incurred by ignoring unobserved parts of a pedigree may in this case be negligible (19). We now distinguish between the pedigree that is sampled and the true pedigree, whose structure from now on we assume is correctly known. The sampled pedigree comprises all those pedigree members on whom we have data; the rest comprise the nonsampled pedigree. The true pedigree members can also be divided into two mutually exclusive subsets according to whether they are or are not in the PSF (proband sampling frame); we call the members of the latter the pedigree extension. Without assuming any model for the way in which members of the pedigree extension enter the sampled pedigree, we can obtain a valid likelihood from which to make inferences provided the sampled pedigree contains all the members of the PSF who are in the true pedigree.
Let L(θ, S) be the pedigree likelihood appropriate under random sampling for all the data S on the members of the sampled pedigree, where θ is the (possibly vector) parameter about which we wish to make inferences, and let L(θ, PSF) be the corresponding likelihood for all the data on members of the PSF. Then L(θ, S)/L(θ, PSF) forms that valid likelihood. If we take the product of this ratio over n independent pedigrees, we have an expression that is analogous to Eq. 1. If we have single ascertainment and each pedigree contains exactly one proband, that proband is the sole member of the pedigree’s PSF. A problem arises when there are PSF members of the true pedigree who are not sampled. However, because the pedigree structure relating these members to the sampled pedigree members is known, we can assign them data and include them in both the numerator and denominator of the likelihood L(θ, S)/L(θ, PSF). A practical solution is to replace each missing phenotype on the PSF members by the mean of the phenotypes of all the sampled pedigree members. This has been shown, in one example, to lead to a bias in
the maximum likelihood parameter estimate that is substantially less than its standard deviation (4, p. 125). Another potential way of correcting a likelihood for ascertainment when the ascertainment procedure is unknown might be to constrain the parameter estimates using knowledge about the population that is being sampled. The fact that the pedigree is ascertained in no way changes the underlying genetic model nor any penetrance functions. What it does change is the distribution of founder genotypes which, because of the ascertainment, no longer reflects the distribution of genotypes in the population. If we could modify the likelihood to force the distribution of founder genotypes to be the same as in the population, we might expect no further ascertainment correction to be necessary. The S.A.G.E. (20) program SEGREG allows one to impose a prevalence constraint, assuming we have independent information about the prevalence of the disease. This is done by multiplying the likelihood—formed assuming random sampling—by the factor P^R(1 − P)^(N − R), where P is the population prevalence of the disease (expressed as a function of all the model parameters, in particular the genotypic distribution and the penetrance functions), and we know that a random sample of size N drawn from the population contains R persons with the disease. The user inputs R and N and, if desired, the values of a covariate, such as age, for which R and N are appropriate (see Note 5). Several sets of R, N, and covariate values can be input. Note that for a diallelic locus two parameters are needed to specify the genotypic distribution; whereas it may be reasonable to assume Hardy–Weinberg equilibrium proportions in the population, this is unlikely to hold for the founders of an ascertained pedigree. Thus constraining the prevalence of the disease is not sufficient to constrain the whole genotypic distribution.
Nevertheless, until more work has been done to determine how we might quantify the departure from Hardy–Weinberg equilibrium proportions in the founders of the pedigree caused by the ascertainment, this could be considered as a “poor man’s” ascertainment correction. In SEGREG, there are also options, for a continuous trait, to condition the likelihood on the actual phenotypic values of members in the PSF or on their values being below or above a particular threshold— either given or simultaneously estimated together with all the other parameters (see Note 6).
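The prevalence-constraint factor, and the values of N and R suggested in Note 5 below, can be computed directly. A minimal sketch; the prevalence estimate and its standard error here are hypothetical:

```python
import math

def prevalence_pseudo_counts(prev, se):
    """N and R reproducing a prevalence estimate and its standard error.

    Treats the estimate as if it came from a binomial sample:
    var = P(1 - P)/N, so N = P(1 - P)/se^2 and R = N * P.
    N and R need not be integers.
    """
    N = prev * (1.0 - prev) / se ** 2
    R = N * prev
    return N, R

def log_prevalence_factor(P, N, R):
    # log of the factor P^R (1 - P)^(N - R) multiplying the likelihood.
    return R * math.log(P) + (N - R) * math.log(1.0 - P)

# Hypothetical independent prevalence information: 5% with s.e. 1%.
N, R = prevalence_pseudo_counts(prev=0.05, se=0.01)
print(N, R)  # about 475 and 23.75
```

The log factor is largest when the model-implied prevalence P equals R/N, which is how the constraint pulls the parameter estimates toward the known prevalence.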
3. Notes

1. In an example given by Fisher (8) and taken up by Bailey (6), the numerical values are s = 5, n = 340, Q = 432, and R = 623, so that Eqs. 11 and 12 become
623/p̂ − 1,077/(1 − p̂) − 1,700π̂(1 − p̂π̂)⁴/[1 − (1 − p̂π̂)⁵] = 0,

432/π̂ − 191/(1 − π̂) − 1,700p̂(1 − p̂π̂)⁴/[1 − (1 − p̂π̂)⁵] = 0.
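These two equations (with sn = 1,700, sn − R = 1,077, and R − Q = 191) can be solved numerically. A sketch using Newton's method with a finite-difference Jacobian; the starting values are arbitrary and `q` stands for π:

```python
def score(p, q):
    """Left-hand sides of Eqs. 11 and 12; q stands for pi."""
    u = 1.0 - p * q
    c = 1700.0 * u ** 4 / (1.0 - u ** 5)  # shared ascertainment term
    return (623.0 / p - 1077.0 / (1.0 - p) - q * c,
            432.0 / q - 191.0 / (1.0 - q) - p * c)

def solve(p=0.3, q=0.5, h=1e-7, tol=1e-9):
    """Newton's method with a numerically approximated Jacobian."""
    for _ in range(100):
        f1, f2 = score(p, q)
        a = (score(p + h, q)[0] - f1) / h
        b = (score(p, q + h)[0] - f1) / h
        c = (score(p + h, q)[1] - f2) / h
        d = (score(p, q + h)[1] - f2) / h
        det = a * d - b * c
        dp, dq = (f1 * d - f2 * b) / det, (a * f2 - c * f1) / det
        # keep the iterates inside the unit square
        p = min(max(p - dp, 0.01), 0.99)
        q = min(max(q - dq, 0.01), 0.99)
        if abs(dp) + abs(dq) < tol:
            break
    return p, q

p_hat, pi_hat = solve()
print(f"p_hat = {p_hat:.4f}, pi_hat = {pi_hat:.4f}")
```

From these starting values the iteration converges to p̂ ≈ 0.2526 and π̂ ≈ 0.4753, the values quoted in the text.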
The (numerically derived) solutions of these equations are p̂ = 0.2526, π̂ = 0.4753. From this value of π̂, it appears likely that neither complete nor single ascertainment applies. If complete ascertainment had been assumed and p estimated from Eq. 3, the solution found would have been 0.3086, significantly greater. Similarly, if single ascertainment had been assumed and p estimated from Eq. 7, the solution found would have been 0.2021, significantly less than the solution found from these equations.

2. We illustrate the procedure by a simple example. Consider some disease susceptibility locus A admitting a susceptibility allele A and a normal allele a. The probability that an AA individual is affected is 0.7 and that an Aa individual is affected is 0.1. Individuals of type aa are never affected. All families in the sample have two children (s = 2), so that the notation s is suppressed in the following. Ascertainment of a family depends in some unknown way on the number r of affected children (r = 1 or 2) in the family. The data in any family are of the form (r, g), where g is the number of affected parents. The aim is to estimate the frequency p of the allele A from the set of (r, g) values provided by the ascertained families. There are six possible (r, g) combinations for each family, namely (1,0), (1,1), (1,2), (2,0), (2,1), and (2,2). The probability P(r, g) of each combination (as a function of p) is easily calculated. As an example, we consider the calculation of P(2,2), the probability that both parents and both sibs are affected. There are three parental mating types for which this can happen: AA × AA (probability p⁴), AA × Aa (probability 4p³(1 − p)), and Aa × Aa (probability 4p²(1 − p)²). For the first mating type, both sibs must be AA, and the probability that all four family members are affected is (0.7)⁴ = 0.2401. Thus this mating type contributes a term 0.2401p⁴ to P(2,2).
Consideration of the other two mating types leads eventually to the equation

P(2,2) = 0.2401p⁴ + 0.0448p³(1 − p) + 0.002025p²(1 − p)².     (17)
Similar calculations may be made for the other five P(r, g) values. The six P(r, g) probabilities do not add to 1, since
there are many families with no affected sibs, and the sum P of all six P(r, g) probabilities is given by

P = 0.91p⁴ + 2.56p³(1 − p) + 1.9775p²(1 − p)² + 0.39p(1 − p)³.     (18)
The probability P(r) that there are r affected sibs in a family (r = 1, 2) is found as P(r) = P(r,0) + P(r,1) + P(r,2). This gives

P(1) = 0.42p⁴ + 1.92p³(1 − p) + 1.755p²(1 − p)² + 0.38p(1 − p)³,     (19)

P(2) = 0.49p⁴ + 0.64p³(1 − p) + 0.2225p²(1 − p)² + 0.01p(1 − p)³.     (20)
Suppose now that the data came from a complete ascertainment scheme, although again this is unknown to the investigator. In this case, the probability that all family members are affected in an ascertained family is the ratio of the right-hand sides of Eqs. 17 and 18. Suppose that p = 0.1 (a value of course unknown to the investigator); then to eight decimal place accuracy this ratio is 0.00172344. Parallel ratios may be calculated for the other five cases. We now imagine “perfect complete ascertainment” sample data, for which in a total sample of n families there are n(2,2) = 0.00172344n families with all four family members affected, with parallel numbers of the other five possible family configurations. With these data, the estimates of p derived from the appropriate “complete ascertainment” likelihood

∏_{r=1}^{2} ∏_{g=0}^{2} [P(r, g)/P]^n(r,g)     (21)
and from the “AAF” formula (Eq. 15) are both 0.1, and are thus both unbiased. Under the assumption of single ascertainment, the probability that a family has r affected children and g affected parents is rP(r, g)/[P(1) + 2P(2)], and this leads to the “single ascertainment” likelihood

∏_{r=1}^{2} ∏_{g=0}^{2} {rP(r, g)/[P(1) + 2P(2)]}^n(r,g).     (22)
Under “perfect complete ascertainment” data, the estimate of p derived from this formula is 0.0641 and is thus biased. Suppose next that the data came from a single ascertainment scheme, although again this is unknown to the investigator. In “perfect single ascertainment” data, the proportion of
families having r affected children and g affected parents is rP(r, g)/[P(1) + 2P(2)]. For these data, the estimates of p derived from Eq. 22 and from the “AAF” formula (Eq. 15) are both 0.1 and are thus both unbiased, whereas the estimate derived from Eq. 21 is 0.1407 and is thus biased. Finally, if the probability of ascertainment of a family with r affected children is proportional to r² (the quadratic case), then under perfect quadratic data the estimate of p using the “complete ascertainment” likelihood (Eq. 21) is 0.2009, the estimate using the “single ascertainment” likelihood (Eq. 22) is 0.1534, and the estimate using the AAF method (Eq. 15) is 0.1. Thus only the AAF estimate is unbiased, and (as expected) the “complete” and “single” ascertainment estimates do not even bracket the correct value.

3. For simplicity, we assume in this example that sibships, rather than pedigree data, are involved. Families are obtained from a multiple ascertainment sampling process and the disease properties (and parameters) are as described for “at risk” families in Section 2.4. Consider a sample of n such families with exactly three sibs in each family. Once a family comes to the attention of the investigator via a proband (or several probands), the sequential sampling rule is as follows. If the only proband in a family is either the oldest or the youngest sib, then only that sib and the middle sib are examined. All three sibs are examined either when both the oldest and youngest sibs are probands or when the middle sib (together with none, one, or both of the other sibs) is a proband. This sequential procedure follows the first Cannings–Thompson (14) sequential sampling prescription. Define nsr as the number of families in which s sibs are examined and for which there were r probands (so that s = 2 or 3; r = 1, 2, or 3; and s ≥ r). The likelihood from which p and π are estimated is of the form

∏_{s=2}^{3} ∏_{r=1}^{3} [Psr]^nsr.     (23)
As shown by Vieland and Hodge (15), the correct likelihood calculations are not obtained by considering only the members in any family who are actually observed. One has to take account of the entire family structure and thus to take into account also those sibs in the family who are not observed. Consider first the case s = 2, r = 1. This case arises if and only if either the oldest or the youngest sib was the only proband and the middle sib was found not to be affected. Suppose first that the proband was the oldest sib. The “observed data” ascertainment conditional probability P21 is then p(1 − p)π/D2, where D2 = 1 − (1 − pπ)². Taking also into account the possibility
that the youngest child was the proband, the “observed data” probability for the “s = 2, r = 1” case is then P21 = 2p(1 − p)π/D2. However, this is not the correct form for P21. One has to consider the entire sibship, including the (unobserved) third sib. The correct denominator in P21 is the probability that the family is ascertained, namely D3, given by D3 = 1 − (1 − pπ)³. The numerator (in the case where the oldest sib was the proband) takes into account not only the observed data but also the fact that the youngest sib was not a proband (probability 1 − pπ) and is thus p(1 − p)π(1 − pπ). Consideration also of the case where it was the youngest sib who was the proband leads to the correct probability P21 = 2p(1 − p)π(1 − pπ)/D3. Similar calculations for the remaining correct and “observed data” Psr values show that all “observed data” Psr calculations are incorrect. This is so even in cases where the entire sibship happens to be observed: the different ways in which the entire sibship could have been observed must be calculated. Thus, for example, the correct form of the numerator of P33 contains two terms of the form p³π²(1 − π) (for the two cases where either the oldest or the youngest sib—but not both—together with the middle sib, were probands), one term of the form p³π²(1 − π) (for the case where the oldest and the youngest sib were both probands but the middle sib was not), one term of the form p³π(1 − π)² (for the case where only the middle sib was a proband), and finally one term of the form p³π³ (for the case where all three sibs were probands). This leads to a numerator of p³π[1 + π − π²] and to a value for P33 of p³π[1 + π − π²]/D3. The incorrect “observed data” value for P33 is p³[1 − (1 − π)³]/D3, the term p³ in the numerator deriving from the fact that all three observed sibs were affected and the term 1 − (1 − π)³ coming from the fact that at least one was a proband.

4.
Rabinowitz (17) introduced a pseudo-likelihood approach to address the problem of parameter estimation when the details of the ascertainment scheme are not available in the case where the analyst knows who the proband (probands) in any family is (are) and also knows the nature of the PD sequential sampling procedure. (Although this is an unlikely circumstance for the example given below, it might be realistic in the case of pedigree data.) His procedure is as follows. Consider any specific proband and the data that are found from the PD sampling procedure starting with this proband. There will be some likelihood for these data under the assumption of single ascertainment of this proband that is conditional on the event of the proband being affected. Now consider some other proband and the data that are
found from the PD sampling procedure starting with this (second) proband. There will be some likelihood for the observed data under the assumption of single ascertainment, that is conditional on the event that this (second) proband is affected. All probands are considered in this way, and the pseudo-likelihood is defined as the product of these various “separate proband” likelihoods. Some individuals might have been discovered by two or more of these PD sampling procedures; the likelihoods that these individuals contribute are entered into the pseudo-likelihood as many times as these individuals are reached under the various PD sampling procedures. Rabinowitz (17) shows that maximizing the pseudo-likelihood leads to asymptotically unbiased parameter estimates provided that the Cannings–Thompson sampling rules from each proband are followed. We now give calculations for the “three sib family” example discussed in Note 3 that confirm these claims, assuming (to be concrete) that π = 1. A listing of the correct probabilities for the case π = 1 (complete ascertainment) in the Vieland and Hodge example is as follows, where (as above) s is the number of sibs observed in any family and r is the number of affected sibs observed:

(s, r):        (2, 1)           (3, 1)          (3, 2)           (3, 3)
Probability:   2p(1 − p)²/D3    p(1 − p)²/D3    3p²(1 − p)/D3    p³/D3     (24)

where D3 = 1 − (1 − p)³. Suppose that p = 0.2. Then these (s, r) probabilities are proportional to 0.256, 0.128, 0.096, and 0.008. Assuming then a sample of n = 488 families and this value of p, under “perfect” data there would be 256 families of (s, r) type (2,1), 128 families of type (3,1), 96 families of type (3,2), and 8 families of type (3,3). For an analyst who is aware that the ascertainment scheme was one of complete ascertainment, the correct likelihood for these data is
const. × p^600(1 − p)^864/[1 − (1 − p)³]^488.     (25)
This expression is maximized at p̂ = 0.2, the true parameter value. This is in agreement with the claim that the Vieland and Hodge procedure gives asymptotically unbiased estimators. (Vieland and Hodge provide a similar example in which again p = 0.2 and π = 1, and for which the estimate of p from the incorrect “observed data” likelihood is biased by about 35%. They also show that the AAF method leads to biased estimation under PD sampling.)
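That Eq. 25 is maximized at p̂ = 0.2 is easy to check numerically; a minimal sketch using a grid search:

```python
from math import log

def log_lik(p):
    # log of Eq. 25: const. x p^600 (1 - p)^864 / [1 - (1 - p)^3]^488
    return 600 * log(p) + 864 * log(1 - p) - 488 * log(1 - (1 - p) ** 3)

grid = [i / 10000 for i in range(1, 10000)]
p_hat = max(grid, key=log_lik)
print(p_hat)  # 0.2, the true parameter value
```

The grid contains p = 0.2 exactly, and the derivative of the log likelihood vanishes there, so the grid search recovers the true value.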
The Rabinowitz (17) calculation for the case π = 1, so that all affected individuals are probands, proceeds as follows. Consider first the 256 families of (s, r) type (2,1). These families can only have arisen when the oldest or youngest sib is the only proband and the middle sib is unaffected. The single ascertainment probability contribution from these families is (1 − p)^256. Consider next the 128 families of type (3,1). These can only have arisen when the middle sib was the only proband and the other two sibs were unaffected. The single ascertainment probability contribution from these families is then also (1 − p)^256. The 96 families of type (3,2) require more careful consideration. Under perfect data 64 of these families are such that the middle sib is a proband and exactly one of the other two sibs is also a proband. Taking the middle sib first, the single ascertainment probability contribution coming from this proband (who under the Vieland–Hodge PD sampling rule leads to one affected and one unaffected sib) is p^64(1 − p)^64. The single ascertainment probability contribution coming from the non-middle sib proband (who under the Vieland–Hodge PD sampling rule leads to one affected sib) is p^64. The remaining 32 “perfect data” families of type (3,2) are those for which both the oldest and youngest sibs are probands and the middle sib is unaffected. Each proband (who under the Vieland–Hodge PD sampling rule leads only to the unaffected middle sib) then provides a single ascertainment probability contribution of (1 − p)^32. Altogether families of type (3,2) provide a single ascertainment probability contribution of p^128(1 − p)^128. Finally, all three sibs in the eight families of type (3,3) are probands. Both the oldest and youngest sib under the Vieland–Hodge PD sampling rule lead to one affected sib (the middle sib) and thus each provides a single ascertainment probability contribution of p^8.
Finally, the middle sib (who under the Vieland–Hodge PD sampling rule leads to both the other sibs) provides a single ascertainment probability contribution of p^16. The total pseudo-likelihood provided by all the data is thus proportional to

[(1 − p)^256][(1 − p)^256][p^128(1 − p)^128][p^32] = p^160(1 − p)^640.     (26)

This is maximized at p̂ = 0.2, the true parameter value, showing that the Rabinowitz procedure leads to asymptotically unbiased estimators. (Note that, in accordance with the pseudo-likelihood procedure, some individuals are counted more than once in the calculations leading to Eq. 26.) Even though the likelihood Eq. 25 and the
pseudo-likelihood Eq. 26 differ, they both lead to the correct parameter estimate. The standard deviations of parameter estimates found from a pseudo-likelihood are not correctly derived by using second derivatives of the pseudo-likelihood in the manner shown for “real” likelihoods in Section 2.7. Rabinowitz (17) provides a method for finding close approximations to these standard errors. It might well be that the mean square errors of the estimates derived from a pseudo-likelihood are larger than those for a procedure derived when some specific, although incorrect, ascertainment scheme is assumed.

5. These values are entered in the parameter file for SEGREG under “prevalence constraints.” Suppose we have (for a specified covariate value) an estimate of the prevalence, p, and its standard error, s.e. Then reasonable values for N and R are given by N = p(1 − p)/(s.e.)² and R = Np (R and N need not be integers). Making N very large, with R = Np, is equivalent to assuming that the prevalence is known.

6. This is specified in the SEGREG parameter file under “ascertainment.”

References

1. Weinberg W (1912) Further contributions to the theory of heredity. Part 4. On methods and sources of error in studies of Mendelian ratios in man. Arch für Rassen- und Gesellschaftsbiologie 9: 165–174
2. Weinberg W (1912) Further contributions to the theory of heredity. Part 5. On the inheritance of the predisposition to blood disease with methodological supplements to my sibship method. Arch für Rassen- und Gesellschaftsbiologie 9: 694–709
3. Weinberg W (1927) Mathematical foundations of the proband method. Z für Indukt Abstamm- und Vererbungslehre 48: 179–228
4. Ginsburg E, Malkin I, Elston RC (2006) Theoretical Aspects of Pedigree Analysis. Ramot, Tel Aviv
5. Elston RC, Sobel E (1979) Sampling considerations in the gathering and analysis of pedigree data. Am J Hum Genet 31: 62–69
6.
Bailey NTJ (1951) The estimation of the frequencies of recessives with incomplete multiple ascertainment. Ann Eugen 16: 215–222
7. Li CC, Mantel N (1968) A simple method for estimating the segregation ratio under complete ascertainment. Am J Hum Genet 20: 61–81
8. Fisher RA (1934) The effect of methods of ascertainment upon the estimation of frequencies. Ann Hum Genet 6: 13–25
9. Haldane JBS (1938) The estimation of the frequencies of recessive conditions in man. Ann Eugen 8: 255–262
10. Stene J (1977) Assumptions for different ascertainment models in human genetics. Biometrics 33: 523–527
11. Ewens WJ, Shute NCE (1986) A resolution of the ascertainment sampling scheme. I. Theory. Theor Popul Biol 30: 388–412
12. George VT, Elston RC (1991) An overview of the classical segregation analysis model for independent sibships. Biom J 33: 741–753
13. Ewens WJ (1991) Ascertainment biases and their resolution in biological surveys. In: Rao CR, Chakraborty R (eds) Handbook of Statistics. Elsevier North-Holland, Amsterdam
14. Cannings CC, Thompson EA (1977) Ascertainment in the sequential sampling of pedigrees. Clin Genet 12: 208–212
15. Vieland VJ, Hodge SE (1995) Inherent intractability of the ascertainment problem for pedigree data: a general likelihood framework. Am J Hum Genet 56: 33–43
16. Hodge SE, Vieland VJ (1996) The essence of single ascertainment. Genetics 144: 1215–1223
17. Rabinowitz D (1997) A pseudo-likelihood approach to correcting for ascertainment in family studies. Am J Hum Genet 59: 726–730
18. Vieland VJ, Hodge SE (1996) The problem of ascertainment for linkage analysis. Am J Hum Genet 58: 1072–1084
19. Slager SL, Vieland VJ (1997) Investigating the numerical effects of ascertainment bias in linkage analysis: development of methods and preliminary results. Genet Epidemiol 14: 1119–1124
20. S.A.G.E. 6.1 (2010) Statistical Analysis for Genetic Epidemiology: http://darwin.cwru.edu/sage/
Chapter 12

Segregation Analysis Using the Unified Model

Xiangqing Sun

Abstract

Segregation analysis is a basic tool in human genetics. It is a statistical method to determine whether a trait, continuous or binary, has a transmission pattern in pedigrees that is consistent with Mendelian segregation. In the unified model, major locus segregation is combined with multifactorial/polygenic inheritance. This chapter introduces segregation analysis as a procedure to identify the presence of segregation at a major Mendelian locus, with or without multifactorial inheritance. It is illustrated with the program SEGREG in the Statistical Analysis for Genetic Epidemiology (S.A.G.E.) package, which can use either regressive models or the finite polygenic mixed model to incorporate the multifactorial/polygenic component.

Key words: Segregation analysis, Unified model, Multifactorial inheritance, Polygenic variance, Familial correlation, Multivariate mixed model, S.A.G.E., Mendelian transmission, Susceptibility, Phenotypic distribution, Binary trait, Quantitative trait
1. Introduction

Segregation analysis uses statistical methods to determine whether, without the use of any genetic markers, the variation of a phenotype, continuous or binary, is in large part consistent with segregation at single loci, and to identify the mode of inheritance. In segregation analysis, a general single-locus transmission model (1, 2) infers major locus transmission by rejecting the hypothesis of no transmission of a major type (the three transmission probabilities from types AA, AB, and BB are equal) and accepting the hypothesis of Mendelian transmission (the three transmission probabilities are, respectively, 1, 0.5, and 0), when tested against a more general transmission model of major effect (three transmission probabilities freely estimated, with or without a constraint that ensures homogeneity of the phenotypic distributions across generations). The mixed model of inheritance developed by Morton and MacLean (3) includes a major locus with Mendelian transmission, a polygenic component and random

Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_12, © Springer Science+Business Media, LLC 2012
environmental effects that influence the phenotype. It infers major locus transmission by rejecting the hypothesis of no major locus, in contrast to a model in which the familial resemblance results from only a polygenic component. The unified model (4) for complex segregation analysis combines the mixed model of a major locus and polygenic variation with the general single-locus transmission model, and so tests familial transmission more thoroughly than either model alone. It can test the following hypotheses: (1) there is no major effect and/or no multifactorial component; (2) there is random environmental transmission of a major effect; (3) there is multifactorial/polygenic inheritance; and (4) a major type is transmitted in a Mendelian fashion. In this chapter, the basic theory of segregation analysis using the unified model is introduced first, and then the detailed steps to perform a segregation analysis using the unified model are illustrated with the program SEGREG in the Statistical Analysis for Genetic Epidemiology (S.A.G.E.) 6.1 package (5) to test the above hypotheses.

1.1. The Parameters in Segregation Analysis
Many of the parameters are the same whether the trait being analyzed is continuous or discrete, and examples of such analyses can be seen, respectively, in refs. 6 and 7. I therefore make statements that are true for a continuous trait, noting in brackets [] the differences for the case of a binary trait. Segregation at a possible major locus is detected by allowing one or more parameters to depend on an unobserved latent factor termed type, denoted u, which can take on one of the three values AA, AB, or BB. If the segregation is Mendelian, the type u represents a putative genotype underlying the distribution of the observed phenotype. The parameters that are estimated in the unified model include type means and type variance(s) for a continuous/quantitative trait [type susceptibilities for a binary trait, always expressed on the logit scale], type frequencies, type transmission probabilities, parameter(s) for a multifactorial/polygenic component, and the coefficients of any covariates affecting the type means [type susceptibilities]. These parameters and associated variables are listed in Table 1.
1.2. Segregation Models
The unified model reconciles the mixed model and the general single-locus transmission model. In this model, a quantitative phenotype x can be expressed as x = g + c + e, which assumes that x is determined by the joint additive contributions of three effects: a major transmissible effect g, a multifactorial transmissible component c, and a non-transmitted environmental effect e. The major effect g is a discrete random effect, and c and e are continuous random effects with means zero. These three factors are assumed independent. The major effect g results from segregation at a single locus with two alleles, A and B, with population allele frequency qA
Table 1
The parameters/variables in segregation models and their meaning

Parameter/variable | Values | Meaning
u | AA, AB, BB | The type, or genotype if the segregation is Mendelian
μAA, μAB, μBB | Any | For quantitative traits only, the means of the trait conditional on type AA, AB, and BB
σ² | >0 | The variance of the trait conditional on type (assumed the same for the three types)
βAA, βAB, βBB | Any | For binary traits only, the logit of the susceptibility of the major type AA, AB, and BB
qA | [0, 1] | Allele frequency of allele A
ψAA, ψAB, ψBB | ψAA + ψAB + ψBB = 1; under HWE: ψAA = qA², ψAB = 2qA(1 − qA), ψBB = (1 − qA)² | The population frequencies of the types AA, AB, and BB
τAA, τAB, τBB | [0, 1]; in Mendelian mode, τAA = 1, τAB = 0.5, τBB = 0 | The transmission probabilities that parents of types AA, AB, or BB transmit allele A to offspring
σv² | >0 | The total variance of the polygenic loci; v is the number of polygenic loci
ξcov | Any | The coefficient of a covariate for the type means [type susceptibilities]
γ | γ = e^(βu + Σi ξi xi) / (1 + e^(βu + Σi ξi xi)) | For binary traits only, the susceptibility of type u; xi is the ith covariate, which could be sex, age, etc.
ρFM | [−1, 1]ᵃ | The father–mother (spouse) correlation among the residuals from the type means
ρFO | [−1, 1]ᵃ | The father–offspring correlation among the residuals from the type means
ρMO | [−1, 1]ᵃ | The mother–offspring correlation among the residuals from the type means
ρSS | [−1, 1]ᵃ | The sib–sib correlation among the residuals from the type means

ᵃ Jointly, these correlations cannot take values anywhere within these limits
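As a small numeric illustration of two entries in Table 1, the sketch below computes the HWE type frequencies ψ and the binary-trait susceptibility γ on the probability scale. The parameter values used here are purely illustrative assumptions, not estimates from the chapter's data:

```python
import math

def type_frequencies(qA):
    """Type frequencies psi_AA, psi_AB, psi_BB under Hardy-Weinberg equilibrium."""
    return qA ** 2, 2 * qA * (1 - qA), (1 - qA) ** 2

def susceptibility(beta_u, xis, xs):
    """Susceptibility of type u: gamma = e^(beta_u + sum_i xi_i*x_i) / (1 + e^(...))."""
    logit = beta_u + sum(xi * x for xi, x in zip(xis, xs))
    return math.exp(logit) / (1 + math.exp(logit))

psi_AA, psi_AB, psi_BB = type_frequencies(0.6)   # e.g. qA = 0.6
print(psi_AA, psi_AB, psi_BB)                    # frequencies sum to 1
print(susceptibility(-2.0, [1.5], [1]))          # hypothetical beta_u and one covariate
```

Whatever qA is, the three frequencies sum to 1, and γ always lies strictly between 0 and 1 because it is a logistic transform of the logit.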
for allele A. The major genotypic effects can be expressed through the three type means, μAA, μAB, and μBB [three type susceptibilities βAA, βAB, and βBB]. Multifactorial/polygenic transmission is specified through residual parent–offspring and sibling correlation [association] parameters or a polygenic variance. In the program SEGREG, two kinds of models, regressive models and the finite polygenic mixed model (FPMM), allow for both major locus inheritance and multifactorial/polygenic effects. The first, Bonney's class D regressive model (8) for a quantitative trait [the regressive multivariate logistic model (MLM) for a binary trait (9)], allows the incorporation of nuclear family residual correlations [associations]. [The MLM model incorporates the familial associations into the logits of the type susceptibilities]. In multigenerational pedigrees, the class D assumption, as implemented for regressive models in SEGREG, is that all sib–sib residual correlations [associations] are equal, but not necessarily due to common parentage alone, and not necessarily equal to the parent–offspring residual correlation [association]. The residual correlation [association] parameters in the regressive models include ρFM for father–mother, ρFO for father–offspring, ρMO for mother–offspring, and ρSS for sib–sib. The second way to allow for a multifactorial component is by using the FPMM (10, 11), which assumes that the type means [logits of type susceptibilities] are influenced by a small number of additive diallelic loci in addition to possible segregation at a single major locus with large effect.
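The decomposition x = g + c + e can be made concrete with a minimal simulation: draw a major type under HWE, then add independent polygenic and environmental deviations. All parameter values below (qA, the type means, and the standard deviations) are illustrative assumptions, not the chapter's estimates:

```python
import random

# A minimal simulation of the trait decomposition x = g + c + e under a Mendelian
# major locus plus polygenic and environmental deviations (illustrative values only).
random.seed(1)
qA = 0.6
means = {"AA": 2.5, "AB": 2.5, "BB": 5.5}   # a dominant-type grouping of type means
sd_c, sd_e = 0.5, 1.0                       # polygenic and environmental SDs

def draw_type():
    alleles = ["A" if random.random() < qA else "B" for _ in range(2)]
    return "".join(sorted(alleles))         # "AA", "AB", or "BB"

def draw_phenotype():
    g = means[draw_type()]                  # major-type effect
    c = random.gauss(0, sd_c)               # multifactorial/polygenic component
    e = random.gauss(0, sd_e)               # non-transmitted environmental effect
    return g + c + e

sample = [draw_phenotype() for _ in range(20000)]
mean_x = sum(sample) / len(sample)
# Under HWE the population mean is psi_AA*mu_AA + psi_AB*mu_AB + psi_BB*mu_BB:
expected = qA**2 * 2.5 + 2 * qA * (1 - qA) * 2.5 + (1 - qA)**2 * 5.5
```

With a large sample the empirical mean closely matches the population mean implied by the type frequencies and type means, while the phenotypic distribution is a mixture of normals centered on the type means.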
In this model, the effect of a finite number (v) of additive polygenic loci is represented by the polygenic variance σv², which implies that within nuclear families the sibling correlation is assumed to be the same as the parent–offspring correlation (ρFO = ρMO = ρSS) and the correlation between kth-degree relatives is (½)^(k−1) times that of first-degree relatives. The multifactorial/polygenic component can often take care of any heterogeneity of the type means [susceptibilities] among the pedigrees. In the case of nuclear family data, regressive models subsume the mixed major locus/polygenic model as a special case; in the case of more extended pedigree structures, they do so approximately. In these models, the mode of transmission of the major types is determined by the three transmission probabilities τAA, τAB, and τBB, which are the probabilities that a parent of type AA, AB, or BB, respectively, transmits A to offspring. If there are no major types (μAA = μAB = μBB [βAA = βAB = βBB]), the transmission probabilities are not available. If there is a major type (μAA, μAB, μBB not all equal [βAA, βAB, βBB not all equal]), three transmission modes can be fitted, which are, respectively, no major type transmission (τAA = τAB = τBB = qA), Mendelian transmission (τAA = 1, τAB = 0.5, τBB = 0), and general transmission (τAA, τAB, τBB freely estimated). Another general transmission mode, homogeneous general transmission, is a special case of general transmission that
assumes homogeneity of the phenotypic distribution between founders and non-founders; τAB is then a function of qA, τAA, and τBB (12). The two general transmission models subsume both Mendelian transmission and homogeneous no transmission as special cases. The segregation models that can be fitted in SEGREG are listed in Table 2. The sporadic model (model 1) assumes no intergenerational transmission of the type, i.e., the phenotype has one distribution, and no major type or multifactorial component is transmitted. The random environmental transmission model (model 2) assumes that the trait segregation is caused purely by a random environmental factor and there is no transmission from generation to generation (τAA = τAB = τBB = qA). The polygenic transmission model (model 3) assumes that the phenotype is determined by polygenic inheritance, so the phenotype has one distribution, and familial correlations [associations] can explain the familial aggregation of the trait. If within nuclear families the estimated sibling correlation [association] is the same as the parent–offspring correlation and there is no spouse correlation [association], the polygenic variance in the FPMM model can account for such residual familial correlation [association]. The polygenic-environmental model (model 4) assumes that only a non-transmittable random environmental factor and a polygenic/multifactorial effect influence the trait. The pure major locus transmission models (models 5–8) assume major locus transmission in a Mendelian mode, without multifactorial/polygenic inheritance, and the major gene plus multifactorial/polygenic models (models 9–12) assume that both a major locus (transmitted in a Mendelian mode) and a multifactorial/polygenic effect influence the trait. Among the models with major locus transmission, the codominant inheritance models have three type means [type susceptibilities] and subsume the dominant, recessive, and additive models.
The general transmission models (models 13 and 14) are models in which the major type is transmitted with arbitrary probabilities between 0 and 1, with (model 14) or without (model 13) polygenic/multifactorial effects. The general model (model 14) is the unrestricted full model, which subsumes all of the other models. In SEGREG, all parameters of each model are estimated by numerical maximization of the likelihood, with standard errors calculated by numerical double differentiation of the log likelihood evaluated at the maximum likelihood estimates of all the model parameters.

1.3. Basic Segregation Analysis Steps and Tests
In segregation analysis, generally we use the following criteria developed by Elston et al. (13) to test for a major locus affecting a trait. We first test if the phenotype fits one distribution or a mixture of two or more distributions [this test is not possible for a binary trait unless the model includes a multifactorial/polygenic
Table 2
The segregation models and their parameters

Model | Transmission mode | Model name | Inheritance | Type means (quantitative) or type susceptibilities (binary)ᵃ | τAA, τAB, τBB | qA | Multifactorial/polygenic effect (regressive ρFMᵇ, ρFO, ρMO, ρSS; FPMM σv²)
1 | No transmission | Sporadic | | μAB = μAA, μBB = μAA | na | na | 0
2 | No transmission | Random environmental | | * | qA, qA, qA | * | 0
3 | No transmission | Polygenic | | μAB = μAA, μBB = μAA | na | na | *
4 | No transmission | Polygenic-environmental | | * | qA, qA, qA | * | *
5 | Mendelian transmission | Major locus only | Codominant | μAA, μAB, μBB all free | 1, 0.5, 0 | * | 0
6 | Mendelian transmission | Major locus only | Dominant | μAB = μAA | 1, 0.5, 0 | * | 0
7 | Mendelian transmission | Major locus only | Recessive | μAB = μBB | 1, 0.5, 0 | * | 0
8 | Mendelian transmission | Major locus only | Additive | μAB = ½(μAA + μBB) | 1, 0.5, 0 | * | 0
9 | Mendelian transmission | Major locus plus polygenic | Codominant | μAA, μAB, μBB all free | 1, 0.5, 0 | * | *
10 | Mendelian transmission | Major locus plus polygenic | Dominant | μAB = μAA | 1, 0.5, 0 | * | *
11 | Mendelian transmission | Major locus plus polygenic | Recessive | μAB = μBB | 1, 0.5, 0 | * | *
12 | Mendelian transmission | Major locus plus polygenic | Additive | μAB = ½(μAA + μBB) | 1, 0.5, 0 | * | *
13 | General transmission | Major type only | | * | *, *, * | * | 0
14 | General transmission | General | | * | *, *, * | * | *

na, not available
ᵃ For binary traits, μAA, μAB, μBB are replaced, respectively, by βAA, βAB, βBB
ᵇ The father–mother correlation ρFM should be 0 in the absence of assortative mating or consanguineous matings
* Parameters that are freely estimated within an appropriate range
component]; then if it fits a mixture distribution, we need to test if this is due to random environmental transmission or major type transmission. If random environmental transmission can be rejected, we need to test whether the major type is transmitted in a Mendelian mode, which would indicate major locus transmission. With a model allowing for a multifactorial/polygenic component, we can also test whether such a component plays a role in the transmission. Moreover, if covariates, such as sex or age, influence the type means [type susceptibilities], we need to allow for them in the testing. The various hypotheses are tested by imposing different restrictions on the unrestricted general model (model 14 in Table 2), which cause the likelihood to be smaller than that for the unrestricted model. A likelihood ratio test (LRT) is used to test the significance of the departure from a specified null hypothesis model using the asymptotic properties of the LRT: when the null hypothesis is not on a boundary of the less restricted model, twice the difference in ln(likelihood) between the two models is asymptotically distributed as χ², with the number of degrees of freedom equal to the difference in the number of independent parameters estimated. In some cases, where the null hypothesis is on the boundary of the unrestricted model, the asymptotic distribution is a mixture of χ² distributions (14). In other cases, the asymptotic distribution is unknown, and then Akaike's Information Criterion (AIC) can be used to select the better model (15).
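The LRT calculation can be sketched directly: the statistic is the difference in −2 ln(L) between a restricted model and the general model, referred to a χ² distribution whose degrees of freedom equal the difference in the number of independent parameters (boundary cases, which need a mixture of χ² distributions, are not handled here). The numbers below are taken from Table 4, and the survival functions use only the standard library:

```python
import math

def chi2_sf(x, df):
    """Chi-square survival function for small d.f., standard library only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    if df == 3:
        return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    raise ValueError("d.f. not implemented in this sketch")

# Values from Table 4: environmental no-transmission vs. general transmission
lrt = 1108.64 - 1085.72        # difference in -2 ln(L)
p = chi2_sf(lrt, df=7 - 4)     # about 4.19e-5, matching Table 4
```

The three closed forms are the standard χ² tail probabilities for 1, 2, and 3 degrees of freedom; larger (or mixture) cases would normally be delegated to a statistics library.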
2. Methods

Using the program SEGREG in the S.A.G.E. graphical user interface (GUI) (the Windows version of S.A.G.E.), we will show the analysis step by step. In all of the examples demonstrated here, Hardy–Weinberg equilibrium (HWE) is assumed, so the type frequencies in the population can be defined by a single parameter qA. No transformation was used in fitting the models for the quantitative traits demonstrated (see Note 1).

2.1. Testing Major Locus Transmission Without Considering a Multifactorial/Polygenic Effect
Step 1: According to the criteria to test for a major locus, first we test the null hypothesis that the phenotype fits one distribution. We need to compare two models, the one-distribution model (model 1 in Table 2, the sporadic model) and the two- or three-distribution model (model 2 in Table 2, the random environmental model). [This test is not relevant for a binary trait]. The analysis procedure for the pedigree data of a quantitative trait, named “Q1,” is illustrated here. In the SEGREG GUI, at the “Files” window, we select the prepared data file as the input file, and give a name for the output file. At the “Analysis Definition”
window, we select “Q1” as the dependent variate, and set the trait type as “Quantitative.” (If we want to use the fitted segregation model as an input model for model-based linkage analysis using the programs LODLINK or MLOD in S.A.G.E., then at “Output options” we should check the option “Output file of type probabilities and penetrance functions,” so that a type file, with extension name .typ (see Note 2 and Subheading 2.5), will be produced). At the “Quantitative” window, we select “Bonney’s class D model” as the “Model class,” and we can fit one, two, and three type means in one run by not specifying any options in the “Type mean” parameter setting window (see Note 3). If we want to specify options for this or for any other parameter, such as “Transmission,” we just click the “define” button following that parameter, and a pop-up window will appear that allows setting its options. The “Residual correlations” should be set to 0 (no multifactorial component) by specifying “parent–offspring and sib–sib correlations equal,” and then fixing the correlations between spouses and between sibs at 0 (Fig. 1). In the “Transformation” parameter setting window, we should select “none” for the illustrated quantitative trait Q1. So in this “Quantitative” page, only the “Residual correlations” and “Transformation” parameters are specified (they appear dark colored), as shown in
Fig. 1. The setting in SEGREG for the residual correlations to be fixed at 0.
Table 3
The fitting of one-, two-, and three-distribution models under homogeneous no transmission

Parameter | Sporadic: one mean | Random environmental: two means | Random environmental: three means
μAA | 3.52 ± 0.10 | 2.73 ± 0.14 | 1.87 ± 0.12
μAB | 3.52 ± 0.10 | 2.73 ± 0.14 | 3.86 ± 0.16
μBB | 3.52 ± 0.10 | 5.57 ± 0.26 | 6.35 ± 0.16
σ² | 3.09 ± 0.26 | 1.47 ± 0.18 | 0.67 ± 0.10
qA | 0 | 0.47 ± 0.05 | 0.60 ± 0.03
τAA | na | qA | qA
τAB | na | qA | qA
τBB | na | qA | qA
−2 ln(L) | 1126.36 | 1108.64 | 1099.73
d.f.ᵃ | 2 | 4 | 5
Akaike’s AIC | 1130.36 | 1116.64 | 1109.73

ᵃ Number of functionally independent parameters estimated
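The AIC comparison in Table 3 can be reproduced directly from the −2 ln(L) values and parameter counts, since AIC = −2 ln(L) + 2 × d.f.; a minimal sketch:

```python
# AIC = -2 ln(L) + 2 * d.f.; the smallest AIC indicates the best-fitting model.
# Values are the -2 ln(L) and d.f. entries of Table 3.
models = {
    "one mean":    (1126.36, 2),
    "two means":   (1108.64, 4),
    "three means": (1099.73, 5),
}
aic = {name: m2lnl + 2 * df for name, (m2lnl, df) in models.items()}
best = min(aic, key=aic.get)     # "three means", so one distribution is rejected
```

The extra penalty of 2 per estimated parameter is what keeps the more heavily parameterized models from winning on likelihood alone.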
Fig. 1. After running, an information file (with extension name .inf, see Note 4), a detailed file (with extension name .det), and a summary file (with extension name .sum) will be produced. In the .det file, we can see that three “homogeneous no transmission” models, respectively for one mean, two means, and three means, are produced (see Note 5). At the end of the .det file, the values of −2 ln(likelihood) and Akaike’s AIC for the three models are listed, and we can compare the three models by their Akaike’s AIC values: the smaller the Akaike’s AIC, the better the model fits the data. For the analysis that produced the three models shown in Table 3, a comparison of the Akaike’s AICs indicates that one distribution can be rejected. [If the trait is binary, some settings should be changed in SEGREG for the same analysis. First, in the “Analysis” window, we should select “Trait type” as “Binary trait,” then in the “Binary” window, set the “Model class” as “MLM.” For a binary trait, the “Type mean” is replaced by the “Type susceptibility.” There is no transformation parameter for a binary trait. For a binary trait, under no transmission (τAA = τAB = τBB = qA) we cannot compare the one susceptibility type model with two or three susceptibility type
models if there is no multifactorial/polygenic component incorporated into the model, because the phenotypic distribution for two or three susceptibility types is a mixture of two or three Bernoulli random variables, which is still a Bernoulli distribution (see Note 6)].

Step 2: After the trait is confirmed to fit a mixture of more than one distribution, for example assuming it fits a mixture of two distributions (μAA = μAB; μBB freely estimated), we test whether this is due to random environmental transmission by comparing the random environmental model (model 2 in Table 2) with the unrestricted general transmission model (model 14 in Table 2), and whether it is due to Mendelian transmission by comparing model 6 with the general transmission model (model 14 in Table 2). Using the same trait Q1 data, at the “Quantitative” window, we should set “Type mean” as “two” or “three” (here we set it as “two”). Compared with step 1, no change is needed in the settings for “Model class,” “Residual correlations,” and “Transformation.” The “Transmission” parameter should still be kept unspecified so that all five transmission models can be fitted in one run (see Note 7), or we can fit the three transmission models (homogeneous no transmission/environmental transmission, homogeneous Mendelian transmission, and general transmission) one by one by specifying them at the transmission parameter setting window, to be run separately. After running without specifying transmission parameters, in the output .det file we can see the five fitted transmission models. At the end
Fig. 2. The summary table in the .det file for all of the five transmission models.
Table 4
The fitting of no transmission, Mendelian transmission, and general transmission models with two type distributions

Parameter | Random environmental: no transmission | Major gene: Mendelian transmission | General major type: general transmission
μAA | 2.73 ± 0.14 | 2.54 ± 0.14 | 2.53 ± 0.13
μAB | 2.73 ± 0.14 | 2.54 ± 0.14 | 2.53 ± 0.13
μBB | 5.57 ± 0.26 | 5.18 ± 0.21 | 5.26 ± 0.21
σ² | 1.47 ± 0.18 | 1.46 ± 0.17 | 1.38 ± 0.16
qA | 0.47 ± 0.05 | 0.40 ± 0.05 | 0.35 ± 0.06
τAA | qA | 1 | 0.22 ± 0.27
τAB | qA | 0.5 | 0.80 ± 0.08
τBB | qA | 0 | 0.10 ± 0.08
−2 ln(L) | 1108.64 | 1089.10 | 1085.72
d.f.ᵃ | 4 | 4 | 7
LRT P valueᵇ | 4.19 × 10⁻⁵ | 0.19 |
Akaike’s AIC | 1116.64 | 1097.10 | 1099.72

ᵃ Number of functionally independent parameters estimated
ᵇ Compared with the general transmission model
of this .det file, there are two summary tables (Fig. 2): the first lists the ln(likelihood) and Akaike’s AIC for each model, and the second lists the LRT statistics (each equal to the difference in −2 ln(likelihood) between two nested models) and the asymptotic P values from the LRT. For the results as summarized in Table 4, we can see that, compared with the unrestricted general transmission model, the environmental no transmission model is rejected (P = 4.19 × 10⁻⁵), and Mendelian transmission cannot be rejected (P = 0.19). So for this quantitative trait, a major locus Mendelian transmission model is identified.

2.2. Testing Major Locus Transmission with a Multifactorial/Polygenic Effect by Regressive Models

A multifactorial/polygenic effect can be incorporated in the segregation model to test whether there is such an effect in the presence or absence of a major locus effect. In SEGREG, a polygenic effect can be tested in two ways: the first is by regressive models and the second is by the FPMM, which includes a polygenic effect due to a small number of additive diallelic loci (see Subheading 2.3).
A multifactorial/polygenic effect can be incorporated in the segregation model to test if there is such an effect in the presence or absence of a major locus effect. In SEGREG, a polygenic effect can be tested in two ways: the first is by regressive models and the second is by the FPMM, which includes a polygenic effect due to a small number of additive diallelic loci (see Subheading 2.3).
The analysis of a quantitative trait is illustrated here, using Bonney’s class D model to test a major locus in the presence of a multifactorial/polygenic effect.

Step 1: Test whether there are familial correlations within nuclear families and, if so, whether the sibling correlation equals the parent–offspring correlation. (If they are equal, the familial residual correlation can also be represented by a polygenic variance in the FPMM model). We need to fit and compare four one-phenotypic-distribution (no major gene) models, which, respectively, assume: (1) ρFO, ρMO, ρSS free; (2) ρFO = ρMO, ρSS free; (3) ρFO = ρMO = ρSS; and (4) ρFO = ρMO = ρSS = 0 (the no multifactorial component model, which is the sporadic model already fitted in Table 3). ρFM is assumed to be 0 for all these models. To fit the ρFO, ρMO, ρSS free model, in the SEGREG GUI “Quantitative” window, we again set the “Model class” as “Bonney’s class D,” and set “Type mean” as “one” mean. At the “Residual correlations” parameter setting window, we select the option “All correlations are functionally independent,” and then check all four boxes (Spouse, Mother/Offspring, Father/Offspring, and Sib/Sib) following that option. At the “Spousal” box, which is for the correlation ρFM, we input 0 in the “Value” box (see Note 8), and then check the “Fixed” box to fix it at 0 (Fig. 3). To estimate the remaining three correlations, we do not fix the values in their boxes. Then we can run SEGREG to fit this model. To fit the ρFO = ρMO, ρSS free model, we only make changes at the “Residual correlations” parameter setting window, by selecting the option “Mother–offspring and father–offspring correlations are equal,” and checking all the checkable boxes (Fig. 4). To fit the ρFO = ρMO = ρSS model, at the “Residual correlations” parameter setting window, we select the option “Parent–offspring and sib–sib correlations are equal” (Fig. 5).
The ρFO = ρMO = ρSS = 0 model (sporadic) has already been fitted in Subheading 2.1 (Table 3). The results of the above four models are summarized in Table 5. According to their Akaike’s AIC values, the equal parent–offspring and sib–sib correlation (ρFO = ρMO = ρSS) model fits the data best, and the sporadic model that includes no multifactorial component fits the data worst. By the LRT, we can also reject the sporadic model by comparing it with the unrestricted ρFO, ρMO, ρSS free model, while the model that assumes equal parent–offspring and sib–sib correlations cannot be rejected. According to these results, including the multifactorial component in the model improves the fit, and the multifactorial effect can be expressed by equal parent–offspring and sib–sib residual correlations. [All the settings of the model allowing for familial residual correlations (now associations) for a quantitative trait are applicable to a binary trait, except for the “Model class,” which should be set as “MLM”].
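As the footnote to Table 1 notes, the residual correlations cannot jointly take arbitrary values in [−1, 1]: the residual correlation matrix they imply for a family must be positive definite. A small sketch in plain Python (a father, mother, and two sibs are assumed purely for illustration) checks this with leading principal minors:

```python
# Check whether a set of residual correlations is jointly admissible for a
# father-mother-two-sib family: the implied correlation matrix must be
# positive definite (all leading principal minors positive, Sylvester's criterion).
def family_corr_matrix(r_fm, r_fo, r_mo, r_ss):
    # order: father, mother, sib1, sib2
    return [[1.0,  r_fm, r_fo, r_fo],
            [r_fm, 1.0,  r_mo, r_mo],
            [r_fo, r_mo, 1.0,  r_ss],
            [r_fo, r_mo, r_ss, 1.0]]

def det(m):
    """Determinant by Laplace expansion (fine for a 4x4 matrix)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def is_valid(r_fm, r_fo, r_mo, r_ss):
    m = family_corr_matrix(r_fm, r_fo, r_mo, r_ss)
    return all(det([row[:k] for row in m[:k]]) > 0 for k in range(1, 5))

print(is_valid(0.0, 0.21, 0.21, 0.21))   # True: estimates like those in Table 5
print(is_valid(0.0, 0.9, 0.9, -0.9))     # False: jointly impossible correlations
```

Each correlation in the second call is individually inside [−1, 1], yet together they are contradictory (two sibs each strongly resembling their parents cannot be strongly anticorrelated with each other), which is exactly the joint constraint the table footnote describes.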
Fig. 3. The setting in SEGREG for the residual correlations ρFO, ρMO, ρSS to be freely estimated.

Fig. 4. The setting in SEGREG for the residual correlations to be ρFO = ρMO, with ρSS freely estimated.
Fig. 5. The setting in SEGREG for the residual correlations to be ρFO = ρMO = ρSS.
Step 2: After confirming that including a multifactorial component improves the fit, we need to test whether, after incorporating a multifactorial component, the null hypothesis of one distribution can be rejected (similar to the analysis in Subheading 2.1, step 1). We will fit three no transmission models, which are, respectively, one-, two-, and three-type-mean models for the illustrated quantitative trait, while including the familial residual correlations in the models and assuming that the parent–offspring and sib–sib correlations are equal. In the “Quantitative” window, we select “Model class” as “Bonney’s class D,” do not specify the type mean (so that one-, two-, and three-mean models will be fitted in one run, see Note 3), but specify “Residual correlations” as “parent–offspring and sib–sib correlations are equal” and fix the “Spouse” correlation at “0” (Fig. 5). Transformation should be set as “None,” and the transmission parameter should not be specified (see Note 3).
Table 5
The pure multifactorial inheritance models with various residual familial correlations (ρFM = 0)

Parameter | ρFO = ρMO = ρSS = 0 | ρFO = ρMO = ρSS | ρFO = ρMO, ρSS free | ρFO, ρMO, ρSS free
μAA | 3.52 ± 0.10 | 3.54 ± 0.13 | 3.54 ± 0.13 | 3.54 ± 0.13
μAB | 3.52 ± 0.10 | 3.54 ± 0.13 | 3.54 ± 0.13 | 3.54 ± 0.13
μBB | 3.52 ± 0.10 | 3.54 ± 0.13 | 3.54 ± 0.13 | 3.54 ± 0.13
σ² | 3.09 ± 0.26 | 3.04 ± 0.27 | 3.04 ± 0.27 | 3.04 ± 0.27
ρFO | 0 | 0.21 ± 0.05 | 0.20 ± 0.07 | 0.21 ± 0.13
ρMO | 0 | 0.21 ± 0.05 | 0.20 ± 0.07 | 0.20 ± 0.09
ρSS | 0 | 0.21 ± 0.05 | 0.22 ± 0.08 | 0.23 ± 0.08
−2 ln(L) | 1126.36 | 1102.43 | 1102.40 | 1102.40
d.f.ᵃ | 2 | 3 | 4 | 5
LRT P valueᵇ | 2.55 × 10⁻⁵ | 0.99 | 1 |
Akaike’s AIC | 1130.36 | 1108.43 | 1110.40 | 1112.40

ᵃ Number of functionally independent parameters estimated
ᵇ Compared with the last (all correlations free) model
For the results of the above analysis shown in Table 6, the one-mean model is rejected compared with the two- or three-mean models, after the multifactorial component is included.

Step 3: Having confirmed that the phenotype fits a mixture of more than one distribution with a multifactorial component, which indicates a major type plus multifactorial inheritance, the next step is to test whether the major type is due to random environmental transmission plus multifactorial inheritance, or whether it fits a major locus (transmitted in a Mendelian mode) plus multifactorial inheritance. We set the model class as “Bonney’s class D model,” set the type mean as “two” or “three” (here we set it as “two”), and still set the residual correlations as “parent–offspring and sib–sib correlations equal.” We do not need to specify the transmission parameters, with the result that all the transmission models will be fitted (see Note 7). From the results of the above models summarized in Table 7, compared with the unrestricted general major type transmission model, polygenic-environmental transmission is rejected (P = 0.01), but major locus Mendelian transmission with a polygenic effect cannot be rejected (P = 0.08).
Table 6
The segregation models incorporating a multifactorial component, without major type transmission

Parameter | Pure multifactorial: one mean | Environmental + multifactorial: two means | Environmental + multifactorial: three means
μAA | 3.54 ± 0.13 | 2.93 ± 0.13 | 2.10 ± 0.17
μAB | 3.54 ± 0.13 | 2.93 ± 0.13 | 3.63 ± 0.18
μBB | 3.54 ± 0.13 | 5.94 ± 0.27 | 6.18 ± 0.22
σ² | 3.04 ± 0.27 | 1.62 ± 0.22 | 1.08 ± 0.21
qA | 0 | 0.56 ± 0.04 | 0.59 ± 0.04
ρFO = ρMO = ρSS | 0.21 ± 0.05 | 0.32 ± 0.06 | 0.52 ± 0.04
τAA | na | qA | qA
τAB | na | qA | qA
τBB | na | qA | qA
−2 ln(L) | 1102.43 | 1082.84 | 1078.52
d.f.ᵃ | 3 | 5 | 6
Akaike’s AIC | 1108.43 | 1092.84 | 1090.52

ᵃ Number of functionally independent parameters estimated
2.3. Testing Major Locus Transmission with a Polygenic Effect by the FPMM
The other way to incorporate a polygenic component is by the FPMM, in which an additive polygenic effect (with variance σv²) of a finite number of polygenic loci is incorporated into the type mean [logit of type susceptibility]. If the parent–offspring (father–offspring and mother–offspring) and sib–sib correlations are equal, the polygenic variance σv² can capture such familial residual correlations [associations]. If they are not equal, a polygenic variance is not sufficient to capture the effect. The FPMM is applicable to both quantitative traits and binary traits. Here we show the analysis of a quantitative trait, assuming that we have already confirmed that the trait fits a mixture of two distributions and that incorporating polygenic loci improves the fit (similar to the analyses in Subheading 2.2, steps 1 and 2). We want to test whether the major type is due to a random environmental effect or is transmitted from generation to generation in the pedigree and, if transmitted, whether it is transmitted in a Mendelian mode, together with polygenic inheritance. In the SEGREG GUI, compared with the settings of the analysis in Subheading 2.2, we now need to change the settings in the
Table 7
The fitting of no transmission, Mendelian transmission, and general transmission models under two distributions, multifactorial component incorporated

Parameter | Polygenic-environmental: no transmission | Major gene plus polygenic: Mendelian transmission | General major type plus polygenic: general transmission
μAA | 2.93 ± 0.13 | 2.71 ± 0.18 | 2.87 ± 0.13
μAB | 2.93 ± 0.13 | 5.61 ± 0.34 | 2.87 ± 0.13
μBB | 5.94 ± 0.27 | 5.61 ± 0.34 | 5.99 ± 0.25
σ² | 1.62 ± 0.22 | 1.46 ± 0.18 | 1.49 ± 0.18
qA | 0.56 ± 0.04 | 0.84 ± 0.04 | 0.52 ± 0.06
ρFO = ρMO = ρSS | 0.32 ± 0.06 | 0.18 ± 0.08 | 0.30 ± 0.06
τAA | qA | 1 | 1
τAB | qA | 0.5 | 0.40 ± 0.13
τBB | qA | 0 | 0.38 ± 0.16
−2 ln(L) | 1082.84 | 1077.52 | 1072.18
d.f.ᵃ | 5 | 5 | 8
LRT P valueᵇ | 0.01 | 0.08 |
Akaike’s AIC | 1092.84 | 1087.52 | 1086.18

ᵃ Number of functionally independent parameters estimated
ᵇ Compared with the general transmission model
“Quantitative” window. First, for the “Model class,” we should select the option “FPMM”; the “Residual correlations” setting is then not applicable and the “FPMM” setting option becomes active. Then we set “Type mean” as “two” (assuming two means). In the “FPMM” parameter setting window, we use the default settings: number of polygenic loci 3 and allele frequency of each polygenic locus 0.5 (see Note 9). Transformation is still set as “None,” and we do not specify the transmission parameter, so that all five transmission models will be fitted (see Note 7). From the results shown in Table 8, we can see that, compared with the unrestricted general major type plus polygenic model, the polygenic-environmental transmission model is rejected (P = 3.67 × 10⁻⁵), and the major locus Mendelian transmission plus polygenic effect model cannot be rejected (P = 0.99), so major locus together with polygenic transmission is confirmed.
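The polygenic component of the FPMM can be sketched numerically. With v additive diallelic loci each of allele frequency p, the polygenotype (the count of "positive" alleles) follows a Binomial(2v, p) distribution, and scaling the centered count gives a polygenic contribution with the desired variance. The sketch below is an illustration of that idea, not SEGREG's internal implementation; v = 3 and p = 0.5 match the SEGREG defaults mentioned above, while σv² = 0.25 is an arbitrary illustrative value:

```python
import math

# Polygenotype distribution for v additive diallelic loci: Binomial(2v, p),
# rescaled so the centered polygenic effect has total variance sigma2_v.
v, p, sigma2_v = 3, 0.5, 0.25

def polygenotype_probs(v, p):
    n = 2 * v
    return [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

probs = polygenotype_probs(v, p)
mean = sum(k * pr for k, pr in enumerate(probs))             # = 2*v*p
var = sum((k - mean) ** 2 * pr for k, pr in enumerate(probs))  # = 2*v*p*(1-p)
scale = math.sqrt(sigma2_v / var)    # per-allele effect so total variance = sigma2_v
effects = [(k - mean) * scale for k in range(2 * v + 1)]     # centered polygenic effects
```

With only 2v + 1 possible polygenic values, the likelihood can sum over polygenotypes exactly, which is what makes a finite number of loci computationally attractive compared with a continuous polygenic component.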
Table 8
The segregation models incorporating three polygenic loci (v = 3), without major type transmission

                   Polygenic-environmental   Major gene plus polygenic   General major type plus polygenic
Parameter          No transmission           Mendelian transmission      General transmission
mAA                2.70 ± 0.15               2.59 ± 0.11                 2.53 ± 0.13
mAB                2.70 ± 0.15               5.35 ± 0.19                 2.53 ± 0.13
mBB                5.55 ± 0.26               5.35 ± 0.19                 5.29 ± 0.19
s2                 1.20 ± 0.34               1.09 ± 0.26                 1.10 ± 0.30
qA                 0.46 ± 0.05               0.81 ± 0.04                 0.35 ± 0.06
s2v                0.23 ± 0.27               0.31 ± 0.24                 0.25 ± 0.26
tAA                qA                        1                           0.22 ± 0.29
tAB                qA                        0.5                         0.79 ± 0.08
tBB                qA                        0                           0.10 ± 0.08
−2ln(L)            1107.67                   1084.56                     1084.47
d.f.a              5                         5                           8
LRT P valueb       3.67 × 10⁻⁵               0.99                        –
Akaike's AIC       1117.67                   1094.56                     1100.47

a Number of functionally independent parameters estimated
b Compared with the general transmission model
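The likelihood ratio tests and AIC values in Table 8 can be reproduced from the −2ln(L) values and parameter counts alone; the following is a minimal sketch (a stand-alone calculation, not part of SEGREG) using SciPy's chi-squared survival function:

```python
from scipy.stats import chi2

# -2 ln(L) and number of parameters for the three models of Table 8
models = {
    "no transmission": (1107.67, 5),
    "Mendelian":       (1084.56, 5),
    "general":         (1084.47, 8),
}

general_neg2lnL, general_k = models["general"]
for name in ("no transmission", "Mendelian"):
    neg2lnL, k = models[name]
    # LRT against the general transmission model; d.f. = parameter difference
    p = chi2.sf(neg2lnL - general_neg2lnL, df=general_k - k)
    aic = neg2lnL + 2 * k          # Akaike's AIC = -2 ln(L) + 2k
    print(f"{name}: LRT P = {p:.3g}, AIC = {aic:.2f}")
```

Running this recovers the P values 3.67 × 10⁻⁵ and 0.99 and the AIC column of the table.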
Both regressive models with residual familial correlations and the FPMM with a polygenic variance can capture a polygenic effect, and the two kinds of models should lead to similar conclusions when the parent–offspring and sib–sib correlations are equal and there is no spouse correlation.

2.4. Covariates, Ascertainment, Prevalence Constraint
Covariates

If the trait is influenced by covariates, such as sex, age, or founder status, we should include them in the model. To determine whether a covariate should be included in a segregation model, we can initially fit two sporadic models (one type mean [one type susceptibility], no transmission), one including the covariate and the other not, and then compare them by an LRT or by Akaike's AIC criterion.
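As a concrete illustration, the sketch below applies the LRT and AIC criteria to the two sporadic models whose fits are reported in Table 9 (a stand-alone calculation, not SEGREG output; for a 1-d.f. test the chi-squared tail probability reduces to erfc(√(x/2))):

```python
from math import erfc, sqrt

# -2 ln(L) and number of parameters for the two sporadic models of Table 9
neg2lnL_without, k_without = 6164.76, 1   # no covariate
neg2lnL_with,    k_with    = 5713.33, 2   # sex as a susceptibility covariate

# Likelihood ratio test: chi-squared with k_with - k_without = 1 d.f.
lrt = neg2lnL_without - neg2lnL_with
p_value = erfc(sqrt(lrt / 2.0))           # vanishingly small here

# Akaike's AIC = -2 ln(L) + 2 * (number of parameters); smaller is better
aic_without = neg2lnL_without + 2 * k_without   # 6166.76
aic_with    = neg2lnL_with    + 2 * k_with      # 5717.33
```

Both criteria favour the model that includes the covariate.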
12
Segregation Analysis Using the Unified Model
229
Table 9
Testing a covariate (sex status is a covariate of the type susceptibility of a binary trait)

                   Environmental
Parameter          Without covariate   With covariate
bAA                2.05 ± 0.03         2.29 ± 0.04
bAB                2.05 ± 0.03         2.29 ± 0.04
bBB                2.05 ± 0.03         2.29 ± 0.04
xsex               na                  1.61 ± 0.09
−2ln(L)            6164.76             5713.33
d.f.a              1                   2
Akaike's AIC       6166.76             5717.33

a Number of functionally independent parameters estimated
[For example, suppose we want to test whether including sex as a covariate of the susceptibility of a binary trait fits the data better than not including it. In the SEGREG GUI, at the "Analysis" window, we select the "Trait type" as "Binary trait"; at the "Binary" window, we set the "Model Class" as "MLM" and "Type susceptibility" as "one." Then, at the settings for "Susceptibility covariates," for the first model we do not set any covariate; for the second, we select sex as the covariate. A multifactorial component is not necessary for comparing these two models (see Note 6), so in the "Residual associations" parameter setting window we set all the residual familial associations equal to 0 (Fig. 1). From the results of the two fitted models shown in Table 9, we can see that including sex as a covariate of the type susceptibility decreases Akaike's AIC, which indicates a better fit. In subsequent tests, for example to identify the disease transmission mode, we should therefore use sex as a covariate of the susceptibility.]

Ascertainment

Sampling ascertainment is an important factor that should be allowed for in segregation analysis. If the probability that a pedigree is ascertained is independent of the genetic parameters, or the particular pedigree structure sampled does not depend on the probands' phenotypes, there is no need to allow for ascertainment. If there is ascertainment (see Chapter 11), it can be allowed for by defining the proband sampling frame (PSF) and
Fig. 6. The setting in the ascertainment parameter setting window for a binary trait with single ascertainment.
conditioning all likelihoods on the phenotypes of those in the PSF. Provided there is single ascertainment, however, it is sufficient to condition on the phenotypes of the probands. To do this, a variable indicating the probands should first be prepared in the data; for example, probands can be coded 1 and non-probands 0 in this proband indicator variable. Then, in the SEGREG GUI, in the "Ascertainment" parameter setting window, we select the proband indicator as the PSF indicator (Fig. 6) and specify the value of the PSF indicator that indicates a person is a proband (here we specified 1 as the PSF indicator value). We can then fit different segregation models with adjustment for single ascertainment.

Prevalence constraint

For a binary trait, if we do not know how the pedigrees were ascertained, we can to some extent make use of the prevalence constraint to adjust for the ascertainment. The prevalence constraint is specified at the "Prevalence constraint" parameter setting window. If we set the constraint according to the population prevalence, instead of the prevalence in a subpopulation such as males or females, we do not need to specify a "Covariate" in this window. [Assuming we want to specify the prevalence for males and females separately, we first need to set sex as the covariate (see Note 10), and then
Fig. 7. The setting in the prevalence constraint parameter setting window specifying a covariate for the prevalence of a binary trait.
specify one of the two sex values. Assuming 0 is for males and 1 is for females, we can specify the sex value equal to 0 first (Fig. 7). For this specified sex value (0, for males), we then specify the prevalence constraint by two numbers: the "Number of affected" R and the "Sample size" N (Fig. 8). The two numbers R and N can be estimated for each sex as follows. Assume the estimated sex-specific prevalence is p ± SE, where SE is the standard error of p; then N = p(1 − p)/(SE)² and R = Np. After specifying the constraint
for males, we repeat the above steps to specify sex = 1 (for females) and set the prevalence for females by specifying the corresponding R and N.]

2.5. Using the Identified Trait Model in Linkage Analysis
Once a major gene Mendelian transmission model is identified by segregation analysis, it can be used in a model-based linkage analysis (see Chapter 14 for further details). To produce the type file for model-based linkage analysis, we select the option “Output file of type probabilities and penetrance functions” at the “Analysis Definition” window when fitting the appropriate Mendelian transmission model.
Fig. 8. The setting in the prevalence constraint parameter setting window specifying the number of affected R and sample size N for a specified covariate value.
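The conversion from an estimated prevalence p ± SE to the constraint numbers R and N described above can be sketched as follows (the prevalence and standard error used are hypothetical):

```python
def prevalence_constraint(p, se):
    """Convert an estimated prevalence p with standard error se into the
    'Number of affected' R and 'Sample size' N entered in SEGREG's
    prevalence constraint window: N = p(1 - p)/se^2, R = N * p."""
    n = p * (1.0 - p) / se ** 2
    r = n * p
    return round(r), round(n)

# Hypothetical male prevalence of 10% estimated with standard error 0.01
r, n = prevalence_constraint(0.10, 0.01)   # R = 90, N = 900
```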
3. Notes

1. If the trait is quantitative, the trait may need to be transformed in segregation analysis to make its distribution approach a normal distribution or a mixture of two or three normal distributions, which is what is assumed for quantitative traits. In SEGREG, we can choose the standardized Box–Cox transformation (16), which transforms the trait y according to

   h(y) = [(y + λ2)^λ1 − 1] / [λ1 · yG^(λ1 − 1)]   if λ1 ≠ 0,
   h(y) = yG · ln(y + λ2)                           if λ1 = 0,

   where yG = [∏i=1..N (yi + λ2)]^(1/N) and N is the number of individuals in the dataset. This leads to Jacobian-transformed likelihoods, so that likelihoods based on different transformations can be compared.

2. The .typ file includes two kinds of information for each individual: the probabilities of being AA, AB, or BB genotypes
Fig. 9. The standard error cannot be computed.
conditional on the model and all the available pedigree information, and the penetrance functions for the AA, AB, and BB genotypes.

3. If we do not specify the "type mean" settings for quantitative traits [the type susceptibility settings for binary traits], and at the transmission setting window do not specify the transmission mode, then all three no-transmission models for type means [susceptibilities] (models for one, two, and three type means [susceptibilities]) will be fitted.

4. The .inf file is the first file we should check before looking at the other two result files; it contains diagnostic information, warnings, and program errors. If there is a warning or error message here, we should first solve that problem and only then check the results.

5. If, in the .det file, no standard errors are available for the estimates of the parameters of a model (Fig. 9), or if the first derivative of any estimate is not close to 0, or, at the end of the model output, there is the warning "%SEGREG-I: The likelihood surface is flat and standard errors cannot be computed," we should refit the model after fixing one or more parameters, or fit a model with fewer parameters, so that the standard errors become estimable.
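The standardized Box–Cox transformation of Note 1 can be written out as follows; this is an illustrative re-implementation, not SEGREG's own code:

```python
import math

def box_cox_standardized(ys, lam1, lam2=0.0):
    """Standardized Box-Cox transform of Note 1: h(y) depends on the
    geometric mean yG of the shifted values, so the whole sample is needed."""
    n = len(ys)
    # yG = [prod(y_i + lam2)]^(1/N), computed on the log scale for stability
    y_g = math.exp(sum(math.log(y + lam2) for y in ys) / n)
    if lam1 != 0.0:
        return [((y + lam2) ** lam1 - 1.0) / (lam1 * y_g ** (lam1 - 1.0))
                for y in ys]
    return [y_g * math.log(y + lam2) for y in ys]

# With lam1 = 1 and lam2 = 0 the transform reduces to a shift: h(y) = y - 1
print(box_cox_standardized([1.0, 2.0, 4.0], lam1=1.0))  # [0.0, 1.0, 3.0]
```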
6. For a binary trait without any susceptibility covariate, the phenotypic distribution for two or three susceptibility types is a mixture of two or three Bernoulli distributions. Because a mixture of Bernoulli distributions is still a Bernoulli distribution, we cannot test the difference among these possibilities unless we also have familial correlations in the model. Therefore, the default in SEGREG is to include some correlations/associations (which would be equivalent to a polygenic component) where needed (for both quantitative and binary traits). However, when we have a binary trait with a susceptibility type covariate, we no longer have a mixture of Bernoulli distributions, and so we should turn off those correlations/associations if we do not want them.

7. If the transmission parameter is not specified, and two or three means [susceptibilities] are specified in the type means [susceptibilities] parameter setting window, all five transmission models will be fitted. The five transmission models are environmental transmission (corresponding to the option "Homogeneous no transmission"), Mendelian transmission, general transmission (corresponding to the option "All transmission estimated"), tAB free transmission, and homogeneous general transmission. tAB free transmission assumes tAA = 1 and tBB = 0, with tAB freely estimated in [0, 1]. Homogeneous general transmission assumes homogeneity of the phenotype distribution across generations (tAB is a function of qA, tAA, and tBB).

8. In the S.A.G.E. GUI, the "value" box is the place to input the initial value of a parameter; you can either input an initial value or leave it blank, letting the program determine one. With a proper initial value close to the maximum likelihood estimate, the estimation procedure converges much faster.

9.
We can increase the number of polygenic loci and see whether the fit improves according to the models' Akaike's AIC values, and finally select the number of polygenic loci beyond which the AIC value no longer decreases appreciably (i.e., the model no longer improves).

10. The covariate upon which the prevalence depends must be one of the "Susceptibility covariates," which means that the type susceptibilities also depend on it.
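The AIC-based choice of the number of polygenic loci described in Note 9 can be sketched as below; the AIC values and the threshold for an "appreciable" improvement are hypothetical:

```python
def select_polygenic_loci(aic_by_v, min_improvement=2.0):
    """Return the smallest number of polygenic loci v after which adding
    another locus no longer lowers Akaike's AIC appreciably.
    aic_by_v: AIC values for successive v = 1, 2, 3, ...; the 2.0 threshold
    for an 'appreciable' improvement is an arbitrary illustrative choice."""
    chosen = 1
    for v in range(1, len(aic_by_v)):
        if aic_by_v[v - 1] - aic_by_v[v] >= min_improvement:
            chosen = v + 1   # the fit still improved appreciably
        else:
            break
    return chosen

# Hypothetical AICs for v = 1..5: the improvement levels off after v = 3
print(select_polygenic_loci([1120.4, 1101.8, 1094.6, 1094.1, 1094.3]))  # 3
```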
Acknowledgment This work was supported by a US Public Health Service Resource grant (RR03655) from the National Center for Research Resources.
References

1. Elston RC, Stewart J (1971) A general model for the genetic analysis of pedigree data. Hum Hered 21: 523–542
2. Go RC, Elston RC, Kaplan EB (1978) Efficiency and robustness of pedigree segregation analysis. Amer J Hum Genet 30: 28–37
3. Morton NE, Maclean CJ (1974) Analysis of family resemblance. III. Complex segregation analysis of quantitative traits. Amer J Hum Genet 26: 489–503
4. Lalouel JM, et al (1983) A unified model for complex segregation analysis. Amer J Hum Genet 35: 816–826
5. S.A.G.E. 6.1: Statistical Analysis for Genetic Epidemiology (2010) http://darwin.cwru.edu/sage/
6. Guo X, et al (1999) Evidence of a major gene effect for angiotensinogen among Nigerians. Ann Hum Genet 63: 293–300
7. Sun X, et al (2010) A segregation analysis of Barrett's esophagus and associated adenocarcinomas. Cancer Epidemiol Biomarkers Prev 19: 666–674
8. Bonney GE (1984) On the statistical determination of major gene mechanisms in continuous human traits: regressive models. Amer J Med Genet 18: 731–749
9. Karunaratne PM, Elston RC (1998) A multivariate logistic model (MLM) for analyzing binary family data. Amer J Med Genet 76: 428–437
10. Fernando RL, Stricker C, Elston RC (1994) The finite polygenic mixed model: an alternative formulation for the mixed model of inheritance. Theor Appl Genet 88: 573–580
11. Lange K (1997) An approximate model of polygenic inheritance. Genetics 147: 1423–1430
12. Demenais FM, Elston RC (1981) A general transmission probability model for pedigree data. Hum Hered 31: 93–99
13. Elston RC, et al (1975) Study of the genetic transmission of hypercholesterolemia and hypertriglyceridemia in a 195 member kindred. Ann Hum Genet 39: 67–87
14. Self S, Liang K (1987) Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J Amer Statist Assoc 82: 605–610
15. Akaike H (1974) A new look at the statistical model identification. IEEE Trans Automat Contr AC-19: 716–723
16. Box GEP, Cox DR (1964) An analysis of transformations. J Roy Stat Soc B 26: 211–252
Chapter 13

Design Considerations for Genetic Linkage and Association Studies

Jérémie Nsengimana and D. Timothy Bishop

Abstract

This chapter describes the main issues that genetic epidemiologists usually consider in the design of linkage and association studies. For linkage, we briefly consider the situation of rare, highly penetrant alleles showing a disease pattern consistent with Mendelian inheritance, investigated through parametric methods in large pedigrees or with autozygosity mapping in inbred families, and we then turn our focus to the most common design, affected sibling pairs, of more relevance for common, complex diseases. Theoretical and more practical power and sample size calculations are provided as a function of the strength of the genetic effect being investigated. We also discuss the impact of other determinants of statistical power, such as disease heterogeneity and pedigree and genotyping errors, as well as the effect of the type and density of genetic markers. Linkage studies should be as large as possible to have sufficient power in relation to the expected genetic effect size. Segregation analysis, a formal statistical technique to describe the underlying genetic susceptibility, may assist in the estimation of the relevant parameters to apply. However, segregation analyses estimate the total genetic component rather than a single-locus effect. Locus heterogeneity should be considered when power is estimated and at the analysis stage, i.e. by assuming a smaller locus effect than the total genetic component from segregation studies. Disease heterogeneity should be minimised by considering subtypes if they are well defined, or otherwise by collecting known sources of heterogeneity and adjusting for them as covariates; the power will depend upon the relationship between the disease subtype and the underlying genotypes. Ultimately, identifying susceptibility alleles of modest effect (e.g.
RR ≈ 1.5) requires a number of families that seems unfeasible for a single study. Meta-analysis and data pooling between different research groups can provide a sizeable study, but both approaches require an even higher level of vigilance about locus and disease heterogeneity when data come from different populations. All necessary steps should be taken to minimise pedigree and genotyping errors at the study design stage, as they are, for the most part, due to human factors. A two-stage design is more cost-effective than a one-stage design when using short tandem repeats (STRs). However, dense single-nucleotide polymorphism (SNP) arrays offer a more robust alternative and, owing to their lower cost per unit, the total cost of studies using SNPs may in the future become comparable to that of studies using STRs in one or two stages. For association studies, we consider the popular case–control design for dichotomous phenotypes, and we provide power and sample size calculations for one-stage and multistage designs. For candidate genes, guidelines are given on the prioritisation of genetic variants, and for genome-wide association studies (GWAS), the issue of choosing an appropriate SNP array is discussed. A warning is issued regarding the danger of designing an underpowered replication study following an initial GWAS. The risk of finding spurious association due to population stratification, cryptic relatedness, and differential bias is underlined. GWAS have high power to detect common variants of high or moderate effect. For weaker effects (e.g. relative risk < 1.2), the power is greatly reduced, particularly for recessive loci.

Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_13, © Springer Science+Business Media, LLC 2012

While sample sizes
of 10,000 or 20,000 cases are not beyond reach for the most common diseases, only meta-analyses and data pooling can attain a study size of this magnitude for many other diseases. It is acknowledged that detecting the effects of rare alleles (i.e. frequency < 5%) is not feasible in GWAS, and it is expected that novel methods and technology, such as next-generation resequencing, will fill this gap. At the current stage, the choice of which GWAS SNP array to use does not influence the power in populations of European ancestry. A multistage design reduces the study cost but has less power than the standard one-stage design. If one opts for a multistage design, the power can be improved by jointly analysing the data from the different stages for the SNPs they share. The estimates of locus contribution to disease risk from genome-wide scans are often biased, and relying on them might result in an underpowered replication study. Population structure has so far caused fewer spurious associations than initially feared, thanks to systematic ethnicity matching and the application of standard quality control measures. Differential bias could be a more serious threat and must be minimised by strictly controlling all aspects of DNA acquisition, storage, and processing.

Key words: Linkage, Sib pairs, Heterogeneity, Marker density, Association, Power, False positives, Stratification, Cryptic relatedness, Differential bias
1. Introduction

Linkage and association are the two main approaches geneticists apply to map genetic loci predisposing to diseases or other phenotypes; we focus on dichotomous traits in this chapter, recognising that the principles carry over to mapping quantitative traits. We review the issues related to the design of these studies. The main concern in a linkage study is generally statistical power. Some factors are fundamental to any study, such as the strength of the genetic effect, allele frequencies, and locus and disease heterogeneity; we will have limited insight into the values of the parameters related to these factors and will have to make some assumptions about them to calculate power, realising that assumptions very different from the truth will mean that our power estimate is incorrect. We discuss these factors under Subheading 2.1.1. For many statistical investigations, the design relates simply to deciding sample size, but we take a broader view of design here and indicate relevant issues related not only to the design of sample collection but also to the design of the analysis procedure. Besides sample size, factors such as genotyping errors and the type and density of markers also impact power. Some of these are technology dependent and should be controllable by the investigator. They are discussed under Subheading 2.1.2. For association studies, both power and type I error (false positives) are usually of concern. We discuss in separate sections the factors that influence power and false-positive findings. Power determinants are similar in association and linkage studies, although some have stronger effects for one approach than for the other. For instance, linkage disequilibrium (LD) has limited effect
for linkage studies (although the use of very dense single-nucleotide polymorphism (SNP) arrays can make this an issue for linkage). In the discussion of the LD effect, we distinguish candidate gene studies, where we assume that the investigator will attempt to genotype the susceptibility locus itself, from genome-wide association studies (GWAS), which in principle exploit the LD between typed markers and untyped disease loci to locate the latter. Since we have included power and sample size calculations for both linkage and association designs, we now briefly describe the software that we used.
2. Methods

2.1. Linkage

2.1.1. Inherent Determinants of Statistical Power

Sample and Genetic Effect Size
Linkage analysis has been successful in identifying highly penetrant, rare alleles involved in "Mendelian" diseases because rare, highly penetrant alleles induce strong disease aggregation in families and a recognisable pattern of inheritance. The analysis of a large multigenerational pedigree with multiple cases easily picks up the disease alleles responsible for the aggregation, or marker alleles nearby, with high robustness to genetic model inaccuracy. For example, chilblain lupus (a monogenic form of cutaneous lupus erythematosus) was mapped to chromosome 3p with a LOD score of 5.04 using one large pedigree with 18 affected members from five different generations (1). Very rare alleles can be mapped using a pedigree with fewer affected relatives: a LOD score of 3.9 can be obtained for a disease allele with a frequency of the order of 10⁻⁶ using a pedigree of only three affected fourth-degree cousins (2). When analysing consanguineous families or isolated populations, a very powerful method to detect rare, highly penetrant recessive alleles is autozygosity mapping (3–5). Fewer than 15 families can provide strong evidence of linkage if the ascertained probands have second-degree cousin parents and the disease-causing variant has a population frequency below 0.8%, or if the parents are first-degree cousins and the variant has a frequency of 2% or less (3). Slightly larger sample sizes are required in more inbred pedigrees (4). For any pedigree of known structure, the program SIMLINK (6, 7) can be used to estimate the power to detect linkage using parametric methods. The majority of complex diseases, however, cannot be explained by highly penetrant, rare alleles. The most quoted paradigm is the "common disease, common variant" hypothesis, stating that common, complex diseases are determined by a mixture of environmental factors and multiple independent or interacting genes with relatively common susceptibility alleles, each having a small marginal effect.
Any single gene can therefore only explain a fraction of the observed familial risk. Most linkage studies have
relied on pedigrees of small size, most often affected sib pairs (ASPs). The simplicity of such a sampling unit, which is often logistically feasible to identify and sample, is offset by the high number of such families required to achieve reasonable power. Studies using about 2,000 or more ASPs have failed to find any significant linkage (8). The high sensitivity of parametric methods to genetic model uncertainty in these data has prompted investigators to favour non-parametric allele-sharing statistics (2, 9). Let Zα and Z1−β be values from the inverse cumulative distribution function of a standard normal distribution. Then, according to Risch and Merikangas (10), the number of fully informative ASPs (i.e. sib pairs in which the identity-by-descent (IBD) sharing of the siblings can be clearly identified for each family; this requires that both parents also be genotyped and the markers be highly polymorphic) required to find linkage with power 1 − β at significance level α is given by

N = (Zα + σZ1−β)² / (2μ²),

where μ = 2Y − 1, σ = √(4Y(1 − Y)), and Y is the expected proportion of alleles shared IBD between two affected siblings. Y is related to the sibling and offspring disease risk ratios λs and λo, reflecting the incidence of the disease in siblings and offspring of cases compared to the general population. Specifically, Y = 0.5z1 + z2, with z1 = λo/(2λs) and z2 = (4λs − 2λo − 1)/(4λs) (11). For power of 80%, Z1−β = 0.84, and for genome-wide significance, Zα = 4.11 (α = 2 × 10⁻⁵, see ref. 12). These equations are valid when the marker analysed is the disease locus itself and is fully informative (heterozygosity = 1.0). This is the best possible scenario, and the sample size estimates therefore represent a lower limit. If a marker lying at some distance from the disease locus is analysed and/or the locus information content is not maximal, then larger sample sizes are required (11, 13).
Table 1 shows the number of ASPs needed to allow detection of an allele conferring various levels of risk in the best scenario with 80% power, using the above formulas, under the assumption that the investigated locus is the only genetic effect responsible for the disease. For example, a locus with disease allele frequency of 0.05 and multiplicative relative risk of 4.0 per allele will show a proportion of IBD sharing of 0.57 between two affected sibs, and 630 such sib pairs are needed to achieve 80% power. If the locus only explained some of the familial risk, then the marginal effect of this locus would be the relevant recurrence risks to consider and the sample sizes would increase accordingly. Although Table 1 also shows the per-allele risk ratio under the assumed multiplicative model (see ref. 10 for more details on the relationships between allele frequency, IBD sharing statistics, relative risk and recurrence risk ratios), the power of the ASP design depends only on allele-sharing probabilities, and therefore similar sample sizes are required to achieve the same power for a given allele-sharing proportion under any genetic model.
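The sample size formula above can be transcribed directly; the sketch below reproduces, to within rounding of the inputs, the first row of Table 1:

```python
from math import sqrt
from statistics import NormalDist

def asp_sample_size(lam_s, lam_o, power=0.80, alpha=2e-5):
    """Number of fully informative ASPs for the Risch-Merikangas
    mean-sharing test, from sibling/offspring risk ratios lam_s, lam_o."""
    z1 = lam_o / (2.0 * lam_s)
    z2 = (4.0 * lam_s - 2.0 * lam_o - 1.0) / (4.0 * lam_s)
    y = 0.5 * z1 + z2                         # expected IBD sharing proportion
    mu = 2.0 * y - 1.0
    sigma = sqrt(4.0 * y * (1.0 - y))
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)   # about 4.11 for alpha = 2e-5
    z_beta = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    return (z_alpha + sigma * z_beta) ** 2 / (2.0 * mu ** 2)

# First row of Table 1: risk allele frequency 0.05, per-allele risk ratio 4.0
n = asp_sample_size(lam_s=1.349, lam_o=1.323)   # about 630 ASPs
```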
Table 1
Number of ASPs required to achieve 80% power at α = 2 × 10⁻⁵ in the best scenario (details in the text)

Per-allele    Risk allele   Proportion of         λs      λo      Number
risk ratio    frequency     alleles shared (Y)                    of ASPs
4.0           0.05          0.570                 1.349   1.323   630
              0.10          0.597                 1.537   1.479   326
              0.30          0.604                 1.592   1.523   283
              0.50          0.576                 1.392   1.360   524
2.0           0.05          0.511                 1.044   1.043   27,515
              0.10          0.518                 1.076   1.075   9,516
              0.30          0.529                 1.128   1.124   3,574
              0.50          0.526                 1.114   1.111   4,416
1.5           0.05          0.503                 1.011   1.011   387,467
              0.10          0.505                 1.021   1.020   119,927
              0.30          0.510                 1.040   1.040   32,302
              0.50          0.510                 1.040   1.040   31,825
If N fully informative ASPs are available and they share a total of R alleles identical by descent (R ≤ 2N), then the expected maximum LOD score (ELOD) from these data is given by ELOD = R log10 R + (2N − R) log10 (2N − R) − 2N log10 N (14). Figure 1 plots this ELOD score as a function of the allele-sharing proportion R/2N and sample size N. Both Table 1 and Fig. 1 illustrate that, even in the best scenario, for an allelic risk ratio of the order of 1.5 or smaller, the proportion of alleles shared between ASPs does not exceed 0.51, and a prohibitively large number of families is needed to detect this effect. The choice of sampling unit (extended family or ASP design) is the initial design question, but usually there is only one pragmatic solution (which differs depending on the characteristics of the disease under consideration). If there are sufficient extended families with multiple cases (>3) of disease and there is a clear pattern of inheritance (dominant, recessive, etc.) across families, then the most powerful approach is to collect and genotype such families. As noted before, exceptional multicase families can have high power to map the causative locus, and, of course, a few large families are more homogeneous than multiple small families (see genetic heterogeneity, below). However, for more common, complex diseases, such extended families with clear inheritance are not found, and so the only feasible solution will be to sample ASPs.
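The ELOD formula can likewise be computed directly; a minimal sketch that reproduces points on the curves of Fig. 1:

```python
from math import log10

def elod(sharing, n_pairs):
    """Expected maximum LOD score for n_pairs fully informative ASPs
    sharing the given proportion of alleles IBD (R = sharing * 2N)."""
    r = sharing * 2 * n_pairs
    return (r * log10(r)
            + (2 * n_pairs - r) * log10(2 * n_pairs - r)
            - 2 * n_pairs * log10(n_pairs))

print(elod(0.50, 1000))   # 0.0: no excess sharing, no evidence of linkage
print(elod(0.55, 1000))   # about 4.35
```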
Fig. 1. Expected maximum LOD score as a function of the proportion of alleles shared between two affected siblings, assuming a fully informative locus (curves shown for N = 250, 500, 1,000, 2,000, 5,000, and 10,000 ASPs).

Locus and Disease Heterogeneity
While collecting a large number of families is, in itself, a challenging task, large linkage studies have inherently two other problems: locus heterogeneity and disease heterogeneity. Locus heterogeneity means that multiple loci independently cause the same disease so that when a large number of families have been sampled, they are not all linked to the same locus. The apparent large sample actually corresponds to multiple smaller samples, each being linked to distinct loci. Disease heterogeneity corresponds to the situation when the disease has subtypes [e.g. coronary artery disease (CAD) is often defined as having one of myocardial infarction, percutaneous transluminal coronary angioplasty, coronary artery bypass graft surgery, or angina; see e.g. ref. 15], grades or stages, or the disease clinical definition is broad owing to difficulties met in attempts to identify such subtypes or grades. Whether well defined or not, disease subtypes or grades can potentially be induced by different functional processes and therefore different genes. The issue of disease heterogeneity is therefore similar to that of locus heterogeneity, except that it is significantly more challenging to consider in any analysis. Locus and disease heterogeneity not only result in smaller effective sample sizes than the apparent study size, but they also reduce the power to detect each of the loci involved because linked and unlinked families are pooled together, weakening the signal. Allowing for locus heterogeneity in the analysis by using a heterogeneity LOD (i.e. HLOD) can maintain the power (16), although
estimation of the proportion of linked families under this approach is generally biased, in the sense that the estimated contribution of each locus will be weighted by the way in which the families were identified (16, 17); in most circumstances, this is an insignificant problem, as more population-relevant estimates can be derived later. Also of note, the HLOD approach is only available in parametric methods, which, as already mentioned, are not the first choice for complex diseases. Altmüller et al. (18) reviewed 101 linkage studies of 31 complex diseases published up until December 2000. They found that 67 of them (66.3%) did not report any significant finding according to Lander and Kruglyak's (12) genome-wide significance criteria and that the results of analyses of the same disease were often inconsistent. The only common feature of the successful studies was that they analysed autoimmune diseases. Success could therefore be attributed to the higher heritability of these diseases and reduced locus heterogeneity. They identified population ethnicity and inadequate sample size as key factors in study failure. They concluded that studies should take account of the unique characteristics of the disease and the sample, aiming at the highest possible homogeneity. This implies defining the disease phenotype well and collecting information on all possible confounding factors, for use as covariates or to analyse the data in homogeneous subsets (19). The importance of covariate adjustment was exemplified in our CAD study: using approximately 2,000 families, the majority of which were ASPs, the highest LOD score in an initial whole-genome analysis was 1.86 (8), but this LOD score rose to 4.4 when hypercholesterolemia was included as a covariate. We subsequently showed that only 108 ASPs, with normal levels of cholesterol, were linked to this locus (20).
The vast majority of CAD patients in the study had high cholesterol, because these conditions are related. With no covariate adjustment, these patients obscured the linkage signal, which came only from the few patients with normal cholesterol. A common alternative to increase power, applicable to some diseases, is to identify a quantitative phenotype that can serve as a proxy and perform linkage analysis on that continuous phenotype (21). The challenge with this approach is to identify a quantitative phenotype that is sufficiently close to the disease endpoint and provides a meaningful measure of susceptibility in unaffected subjects.

2.1.2. Other Design Issues

Pedigree and Genotyping Errors
Errors in reporting or recording family relationships, and genotyping errors, are not rare, and it is well documented that they can reduce study power (22–25). Rigorous quality control allows these errors to be detected and the affected data excluded (26–29), at some further cost in power. Alternatively, methods have been developed
J. Nsengimana and D.T. Bishop
to account for these errors in the linkage analysis itself, with less power loss (30, 31). It is nonetheless important to take all necessary steps to minimise these errors at the study design stage. Pompanon et al. (24) classified the causes of genotyping errors into four categories: causes related to the DNA sequence itself (e.g. a mutation close to the marker, such as a null allele arising from an insertion or deletion), low DNA quality or quantity (e.g. dilution, degradation, contamination), biochemical artefacts in PCR cycles, and human factors. Among these, human factors are the most common. It is therefore possible to minimise these errors by, for example, ensuring sample quality and technical skill, conducting a pilot study to evaluate the expected error rate before the actual study, blindly and independently duplicating up to 5–10% of samples to assess the true error rate, and using negative and positive controls (24). Other good practices include genotyping all the samples in the same laboratory at the same time, and centralising data management to minimise manual sample handling (25, 30). Finally, high-quality data handling is a critical aspect of any study: minimising the number of times that data are transcribed or manipulated in producing the analysis files is best practice, as is maintaining data in purpose-built databases rather than simple spreadsheets, which are too prone to error.

Marker Type and Density
Linkage studies of human complex diseases have in large part been conducted using panels of 300–400 microsatellite markers, short tandem repeats (STRs) distributed approximately evenly across the genome. This density was chosen as an optimal balance between information content and genotyping cost (32, 33). The discovery of the much more abundant and cheaper-to-genotype SNPs has made it possible to increase the information content by genotyping more markers without increasing the cost (or at least not proportionally). The superiority of SNP chips over classical STR maps has been demonstrated by simulation for information content (34) and, in real data, for reliability (call rates and genotyping errors), information content, and ability to detect allele sharing between siblings (33). The results of the real data comparisons in Sawcer et al. (33) are summarised in Table 2, and they clearly indicate that SNP arrays can improve linkage studies. It is also clear from these results that genotyping parents is the most efficient way to increase the information content. When parents are not available, the information content can be increased by genotyping additional siblings (34, 35). An important issue requiring consideration when denser SNP chips are used is linkage disequilibrium (LD). Standard linkage methods assume linkage equilibrium between markers, and they have been shown to produce noisy signals inducing false positives when applied to dense SNP maps (36–40). Methods that deal with LD have been developed (41–43). Fukuda et al. (44) recently
13 Design Considerations for Genetic Linkage and Association Studies
Table 2
Comparison of two microsatellite maps and two linkage SNP chips

Compared feature           Standard map of    Dense map of       Illumina 3.0   Affymetrix 10K
                           microsatellites    microsatellites    SNP chip(a)    SNP chip
Number of markers          338                911                4,763          11,223
Call rate (%)              93.50              95.57              99.69          93.95
Mean IC-parents(b)         61.7               82.8               90.7           91.5
Mean IC-no parents(c)      32.4               49.7               54.4           58.4
Mean IBD-parents(d)        30.0               68.7               83.9           83.7
Mean IBD-no parents(e)     9.3                40.4               51.5           60.3

(a) The current Illumina linkage SNP array contains approximately 6,000 SNPs (http://www.illumina.com/products/humanlinkage_v_panel_set.ilmn)
(b), (c) Mean information content with/without using parental genotypes
(d), (e) Mean proportion (%) of the genome where allele sharing between siblings can be detected with >95% confidence using/not using parental genotypes
produced a software package (SNPHiTLink) that serves as an interface between very dense SNP chips (e.g. Affymetrix 100K, 500K, Human SNP Array 6.0) and linkage analysis programs. It facilitates data transfer while maximising the information content and filtering out genotyping errors and LD. New genotyping technologies that cluster SNPs into composite markers [e.g. the SNPlex Human Linkage Mapping Set (LMS), Applied Biosystems] have also emerged, reducing both LD and computing time (45). SNP clustering is carried out on the basis of genetic distance and optimisation of haplotype heterozygosity. Although less automated, the LMS compares well with the Affymetrix 10K chip in terms of call rate and genotyping errors (45) and has recently been used in a linkage study of sarcoidosis in German families (46). A cost-effective two-stage strategy, using a low-density map on the whole genome in stage 1 and a higher density in promising regions in stage 2, has been proposed (35, 47). Table 3 compares the sample sizes needed to achieve 80% power at genome-wide significance level using different types and densities of markers in one- and two-stage designs. Calculations were carried out using DESPAIR (48), assuming no locus heterogeneity, a polymorphism information content (PIC) of 0.70 for each STR and 0.30 for each SNP, and a total genome length of 36 Morgans. For the SNP array, we used the smallest marker spacing accepted by DESPAIR, which is 1 cM. For power or sample size calculations, DESPAIR (see Note 1) requires as input the disease recurrence risk ratios for siblings and offspring of a proband, and for convenience, we consider the same values as in Table 1.
Table 3
Comparison of one- and two-stage designs to detect linkage at genome-wide significance level with 80% power. These calculations are obtained using the program DESPAIR

Marker map in stage 1     λs      λo      K(a)   Number of ASPs   α*(b)     Spacing (cM)(c)   Number of genotypes
400 STRs (PIC = 0.70)     1.044   1.043   0      44,618           0.00002   9.0               35,694,400
                                          5      27,581           0.00115   0.8               22,616,420
                          1.076   1.074   0      15,767           0.00002   9.0               12,613,600
                                          6      9,676            0.00120   0.7               7,973,024
                          1.128   1.124   0      6,028            0.00002   9.0               4,822,400
                                          6      3,694            0.00123   0.7               3,043,856
                          1.392   1.360   0      884              0.00002   9.0               707,200
                                          5      541              0.00120   0.8               443,620
                          1.537   1.479   0      541              0.00002   9.0               432,800
                                          8      328              0.00130   0.5               272,896
3,600 SNPs (PIC = 0.30)   1.044   1.043   0      28,840           0.00002   1.0               207,648,000
                          1.076   1.074   0      10,189           0.00002   1.0               73,360,800
                          1.128   1.124   0      3,903            0.00002   1.0               28,101,600
                          1.392   1.360   0      569              0.00002   1.0               4,096,800
                          1.537   1.479   0      350              0.00002   1.0               2,520,000

(a) K = number of additional markers to genotype in the second stage on each side of the stage 1 linkage peak to achieve global significance with the lowest cost (K = 0 in the first stage)
(b) α* = threshold P-value to target in stage 1 to ensure that genome-wide significance α = 0.00002 is achieved in stage 2
(c) Marker spacing assuming that the total genome length is 36 Morgans
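The genotype totals in Table 3 follow from simple arithmetic: each ASP contributes two genotyped individuals, and the second stage types 2K extra markers in the same pairs. A minimal sketch (the function names are ours, not part of DESPAIR):

```python
def one_stage_genotypes(n_asps, n_markers):
    # each affected sib pair contributes two genotyped individuals
    return 2 * n_asps * n_markers

def two_stage_genotypes(n_asps, n_markers_stage1, k):
    # stage 2 types 2*k extra markers (k on each side of the stage 1
    # linkage peak) in the same sib pairs
    return 2 * n_asps * (n_markers_stage1 + 2 * k)

# reproduce the last STR row of Table 3 (λs = 1.537, λo = 1.479)
print(one_stage_genotypes(541, 400))     # → 432800
print(two_stage_genotypes(328, 400, 8))  # → 272896
```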
The table shows that using STRs in a two-stage design requires fewer samples and generates fewer total genotypes than a one-stage design. For example, 541 ASPs are needed to detect linkage at genome-wide significance level using 400 STRs in one stage if λs = 1.537 and λo = 1.479, generating 432,800 genotypes; in a two-stage design, 328 ASPs typed on 400 STRs must give a linkage peak with a P-value of 0.0013 or less, and adding 16 new markers (8 on each side of the peak location, for a total of 272,896 genotypes over the two stages) will then achieve genome-wide significance. The total number of genotypes (and therefore the cost) is thus reduced by about a third compared to a one-stage design, a typical reduction across the various scenarios considered (47). The total number of genotypes in a one-stage dense SNP design is larger than in either design using STRs. However, given the cheaper cost of
genotyping SNPs, the cost ratio between the designs should be lower than the ratio of the numbers of genotypes. The cost of SNP genotyping has been falling in recent years and may continue to fall, eventually making denser SNP arrays cheaper than low-density STR maps. Note that the final density reached in the second stage of the STR designs is comparable to the usual densities of commercial SNP arrays (e.g. 0.6 cM on the current Illumina linkage array), illustrating the need for dense maps to boost power. Also note that the results in Table 3 assume that all markers are successfully genotyped, which is generally not the case (24, 25). If one marker fails genotyping in the first stage in the region of a linked locus, the linkage signal can be severely attenuated and the linked locus may fail to be taken to the next stage. SNP arrays are associated with fewer genotyping errors and little or no impact of missing genotypes, since high coverage is well maintained after poor-quality genotypes are discarded. It is also noteworthy that, consistent with expectations, the sample size estimates in Table 3 are larger than those reported in Table 1, particularly for low-density STR maps. Recall that Table 1 assumes that the marker is the disease locus and is fully informative; as marker density increases, real data approach this best-case scenario and sample size estimates get closer to their theoretical values.

2.2. Association

2.2.1. Determinants of Statistical Power

Genetic Effect and Sample Size
Most genetic association studies are based on unrelated cases and controls, a common epidemiological design for testing the independence between exposure and outcome variables; in most circumstances, the inclusion of related cases will limit the power. In genetic studies, the exposure is carrying a particular genotype and the outcome is the disease. As in linkage, the effect size and risk allele frequency at the investigated locus, as well as the available sample size, i.e. the numbers of cases and controls, are the key determinants of statistical power. Table 4 shows the number of cases and controls needed in association studies to locate disease variants of varying frequency and relative risk with 80% power at the significance level usually used in GWAS, i.e. α = 5 × 10⁻⁷. The Genetic Power Calculator (49) was used in all calculations (see Note 2). We assumed a disease prevalence of 10% in the population, non-selected control subjects, and analysis at the disease locus. Despite the more stringent threshold for genome-wide significance used in association compared to linkage, the number of cases required in association is smaller than the number of ASPs needed in linkage. For example, an allele with frequency 0.50 and multiplicative relative risk (RR) of 1.5 can be detected with 80% power using only 852 cases and as many controls, while linkage requires nearly 32,000 ASPs (Tables 1 and 4); for such a common susceptibility allele, the genotype frequencies would differ substantially between cases and controls, but
Table 4
Number of cases and controls required for 80% power at α = 5 × 10⁻⁷. Calculations are obtained using the online Genetic Power Calculator program package

              Multiplicative         Dominant               Recessive
RAF    RR     N1=N2     N1=N2/3      N1=N2     N1=N2/3      N1=N2     N1=N2/3    N1=N2/10
0.1    1.1    40,564    26,532       47,859    31,443       731,102   476,037    386,761
0.1    1.2    10,722    6,892        12,755    8,272        191,657   122,087    97,736
0.1    1.3    5,026     3,180        6,026     3,863        89,132    55,631     43,904
0.1    1.4    2,975     1,855        3,593     2,280        52,362    32,064     24,958
0.1    1.5    1,999     1,230        2,432     1,529        34,937    21,016     16,141
0.1    1.6    1,455     884          1,782     1,111        25,252    14,938     11,326
0.1    1.7    1,119     672          1,379     853          19,281    11,228     8,407
0.1    1.8    895       532          1,110     682          15,320    8,791      7,122
0.1    1.9    737       434          920       562          12,547    7,099      5,190
0.1    2.0    622       363          781       474          10,521    5,875      4,246
0.3    1.1    17,724    11,707       30,393    20,280       89,066    58,226     47,430
0.3    1.2    4,770     3,127        8,333     5,569        23,522    15,105     12,156
0.3    1.3    2,275     1,481        4,042     2,707        11,017    6,959      5,537
0.3    1.4    1,368     886          2,471     1,658        6,516     4,055      3,191
0.3    1.5    934       601          1,711     1,152        4,376     2,685      2,091
0.3    1.6    690       442          1,282     865          3,183     1,928      1,487
0.3    1.7    537       344          1,013     685          2,445     1,464      1,118
0.3    1.8    436       278          831       564          1,954     1,157      876
0.3    1.9    364       231          702       478          1,609     943        708
0.3    2.0    311       197          606       414          1,357     788        586
0.5    1.1    15,176    10,121       41,439    27,964       39,511    26,035     21,316
0.5    1.2    4,157     2,776        11,598    7,916        10,588    6,906      5,615
0.5    1.3    2,015     1,347        5,731     3,954        5,027     3,250      2,626
0.5    1.4    1,230     824          3,563     2,483        3,012     1,933      1,552
0.5    1.5    852       572          2,507     1,764        2,048     1,305      1,042
0.5    1.6    637       429          1,905     1,353        1,508     955        759
0.5    1.7    503       340          1,525     1,092        1,171     737        584
0.5    1.8    413       279          1,268     915          946       593        467
0.5    1.9    348       236          1,084     788          788       492        386
0.5    2.0    301       205          946       693          671       417        326

Genetic Power Calculator online: http://pngu.mgh.harvard.edu/~purcell/gpc/. RAF, risk allele frequency; RR, genotype relative risk in the dominant and recessive models and allelic relative risk in the multiplicative model; N1, number of cases; N2, number of controls. The controls represent the population as a whole (i.e. unscreened for the disease)
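The orders of magnitude in Table 4 can be sanity-checked with a standard two-proportion normal approximation applied to allele counts. This is not the algorithm used by the Genetic Power Calculator, so it only approximates the tabulated values (the function name is ours):

```python
from math import ceil, sqrt
from statistics import NormalDist

def approx_cases_needed(raf, rr, alpha=5e-7, power=0.80):
    """Two-proportion approximation of the number of cases (= number of
    controls) under a multiplicative allelic relative risk model."""
    p_ctrl = raf                                # unscreened controls ~ population
    p_case = raf * rr / (raf * rr + 1 - raf)    # risk-allele frequency in cases
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p_case + p_ctrl) / 2
    n_alleles = ((z_a * sqrt(2 * p_bar * (1 - p_bar))
                  + z_b * sqrt(p_case * (1 - p_case) + p_ctrl * (1 - p_ctrl))) ** 2
                 / (p_case - p_ctrl) ** 2)
    return ceil(n_alleles / 2)                  # two alleles per subject

# Table 4 reports 852 for RAF = 0.5, RR = 1.5; the approximation lands nearby
print(approx_cases_needed(0.5, 1.5))
```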
the deviation in IBD sharing by the siblings would have a proportionately smaller impact in a linkage study. This high power explains the success of GWAS, which have become commonplace in recent years, helped by technological advances that have made dense, highly reproducible SNP arrays available. Table 4 shows that association
Fig. 2. Percentage reduction in required number of cases when more controls than cases are available.
studies are feasible with 5,000 cases or fewer if the disease variant has a frequency of at least 10% in the population and multiplies the relative risk by at least 1.3 per copy. For dominant models, slightly larger sample sizes are needed than under the multiplicative model. For loci acting recessively, much larger sample sizes are needed, and only common alleles with at least 30% frequency and relative risks of 1.5 or higher appear detectable with 5,000 cases and controls. Interestingly, the required number of cases is lower if a larger number of controls can be obtained. Given the explosion of genome-wide studies in recent years, the ability of the research community to share control data can greatly enhance our capacity to study the genetic components of many different diseases, as has already been demonstrated (50, 51). There is a limit, however, to the gain in efficiency from using more controls rather than cases. While using three controls for each case can reduce the number of cases by approximately 36% (the average from Table 4 across the three genetic models), higher control-to-case ratios have a more modest effect on case sample reduction (Table 4 and Fig. 2). Alleles conferring a relative risk of 1.2 or less under any inheritance model are generally difficult to detect in a single study; data pooling across different studies and meta-analyses will be required (52, 53). For practical, and sometimes ethical, reasons, control subjects are not routinely screened for the disease. Instead, the set of controls is assumed to be a representative sample from the study population, which means it is expected to contain a certain proportion of subjects who are undiagnosed cases. Such misclassification is most likely in very common diseases and results in a (usually modest) power loss. The remedy can be a more stringent disease definition (54).
Screening the controls to exclude misclassified subjects, where feasible, could reduce the required sample sizes such as those
presented in Table 4 by 15–20% if the disease prevalence is 10%, and this reduction can reach 50% if the disease prevalence is 30%.

Locus and Disease Heterogeneity
As with linkage, locus and disease heterogeneity cause a loss of statistical power. To ensure sample homogeneity, cases and controls need to be recruited from the same population, and the disease must be well defined. An interesting way of further increasing sample homogeneity is genetic enrichment, which consists of selecting only cases with a family history, multiple disease recurrence, early onset, or other characteristics of extreme phenotypes (51, 55). However, this can reduce the achievable sample size while increasing recruitment cost; one alternative is to collect information on confounding variables and either match cases and controls on each of them or adjust for them as covariates (54, 56). Another alternative is to use a quantitative definition of the disease where appropriate (48, 57).
Genotyping Errors
As mentioned in the linkage section, most genotyping errors are due to poor DNA quality or data manipulation, and they reduce statistical power. In large GWAS in particular, owing to the large amount of data, often obtained from different sources and processed by different laboratories, genotyping errors appear unavoidable (58–60). Automated allele calling, combined with the impracticality of manually inspecting individual SNP intensity plots, makes these errors all the more common. With no parents or siblings genotyped, such errors are not directly identifiable. Departure from Hardy–Weinberg equilibrium is an obvious screening tool, but its power to identify aberrant SNPs is limited. Furthermore, identifying and eliminating suspicious genotypes may exacerbate the power loss, as it may cause non-random missingness, whereby some genotypes are more likely to be missing or the missing data are unevenly distributed between cases and controls. It has therefore been recommended to account for uncertain genotype calls in the statistical analysis, which could result in a smaller power loss than sample rejection (58).
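The Hardy–Weinberg screen mentioned above can be sketched as a generic 1-df goodness-of-fit test; this is an illustration, not the implementation of any particular QC package:

```python
from math import sqrt
from statistics import NormalDist

def hwe_test(n_aa, n_ab, n_bb):
    """Chi-square (1 df) goodness-of-fit test for Hardy-Weinberg
    equilibrium from genotype counts, usually applied to controls only."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)                   # allele A frequency
    expected = (n * p * p, 2 * n * p * (1 - p), n * (1 - p) ** 2)
    chi2 = sum((obs - exp) ** 2 / exp
               for obs, exp in zip((n_aa, n_ab, n_bb), expected))
    p_value = 2 * (1 - NormalDist().cdf(sqrt(chi2)))  # upper tail, 1 df
    return chi2, p_value

# counts exactly at HWE proportions give chi-square = 0
chi2, p = hwe_test(360, 480, 160)
print(round(chi2, 6), round(p, 3))  # → 0.0 1.0
```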
Linkage Disequilibrium
We assumed in the sample size calculations of Table 4 that the analysis is conducted at the disease locus. In practice, however, the analysis is conducted on a set of SNPs, each of which may or may not be the actual disease variant. If a SNP is not the disease locus and/or is in incomplete LD with it, larger sample sizes than those shown in Table 4 are needed to detect the association. Larger sample sizes are also required to achieve similar power when there is maximal LD between the disease locus and the marker but their allele frequencies differ. To maximise power, association analysis should be conducted on suspected disease-causing loci themselves (as in a candidate-gene study) or on SNPs sufficiently tightly linked to the variant that they are likely to be in complete LD with it (as in GWAS).
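The cost of incomplete LD is often quantified by the rule of thumb that the required sample size at a marker scales as 1/r² of the size required at the disease locus itself; a sketch under our own naming:

```python
def ld_r_squared(p_marker, p_disease, p_haplotype):
    """r^2 between a marker allele and a disease allele, computed from the
    frequency of the haplotype carrying both."""
    d = p_haplotype - p_marker * p_disease
    return d * d / (p_marker * (1 - p_marker) * p_disease * (1 - p_disease))

def n_at_marker(n_at_locus, r2):
    """Approximate sample size needed at a marker in incomplete LD with the
    disease locus (n scales as 1/r^2)."""
    return round(n_at_locus / r2)

# Table 4 needs 852 cases at the locus itself (RAF 0.5, RR 1.5);
# with r^2 = 0.8 between marker and locus, roughly 852 / 0.8 cases
print(n_at_marker(852, 0.8))  # → 1065
```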
Table 5
Prioritisation in SNP selection (reproduced from Tabor et al. (61) with the permission of Nature Publishing Group)

Variant | Type | Function | Predicted relative risk | Frequency in the genome
Nonsense | Coding | Premature termination of amino acid sequence | Very high | Very low
Missense non-synonymous non-conservative | Coding | Changes an amino acid in a protein, altering its properties | Moderate to very high | Low
Missense non-synonymous conservative | Coding | Changes an amino acid in a protein without altering its properties | Low to very high | Low
Insertion/deletion frameshift | Coding | Changes the frame of the protein coding region (usually with negative consequences for the protein) | Very high | Low
Insertion/deletion in frame | Coding or non-coding | Changes amino acid sequence | Low to very high | Low
Sense/synonymous | Coding | No change in amino acid sequence but possibly alters splicing | Low to high | Medium
Promoter/regulatory | Promoter, 5' UTR/3' UTR | No change in amino acid sequence but can affect the level, timing, or location of gene expression | Low to high | Low to medium
Splice site/intron–exon boundary | Within 10 bp of an exon | Might change the splicing pattern | Low to high | Low
Intronic | Deep in introns | No known function but might affect expression or mRNA stability | Very low | Medium
Intergenic | Non-coding | No known function but might affect expression through enhancer or other mechanisms | Very low | High

UTR, untranslated region
Candidate Genes
Although different types of genetic markers have been used in association studies in the past, only SNPs are utilised today owing to their abundance and low genotyping cost. In SNP selection, priority is generally given to those most likely to cause the disease. Table 5, reproduced from Tabor et al. (61) with the permission of Nature Publishing Group, shows different types of sequence variants, their
known function, their predicted abundance, and the strength of their predicted effect on the phenotype. High priority should be given to nonsense, missense, and insertion/deletion variants. However, these variants may be too rare to observe, particularly in small datasets; priority should then be extended to regulatory regions (e.g. promoters and splice sites) and sense variants, which may have a more common minor allele while retaining functional relevance, particularly for common diseases (62, 63). To capture rare variants, haplotype analysis of tag SNPs in the region, or analysis of resequencing data, has been proposed (57).

GWAS
Array Selection

The current SNP arrays for GWAS were designed to cover most common variation (minor allele frequency > 0.05), and most are also filtered to minimise information redundancy by selecting tag SNPs in areas of high LD. Although these arrays have different genome coverage rates, they have virtually identical power in populations of European ancestry (53). Study design tries to balance expenditure on sample collection against genotyping costs; currently, the increasing density of arrays provides more genome coverage, increasing the chance of mapping relevant loci, although a critical density may soon be reached beyond which denser arrays offer comparatively little benefit. Denser arrays provide higher power in populations with weaker LD (e.g. of African descent), and they also help in interpreting results for European populations, since they provide more significant findings in areas of true association; finding multiple supporting SNPs in a small region is a major encouragement to believe that a signal is real rather than due to the aberrant performance of a single SNP. If one opts for a cheaper, less dense array, in silico genotyping can remedy the limited number of significant SNPs in a particular region. In fact, despite an apparent saturation of existing GWAS arrays in European-descent populations, they all benefit substantially from imputation of many more SNPs using the latest HapMap data as a gold standard (53).
Multistage Design

Multistage design is a strategy originally proposed to reduce the genotyping cost of GWAS (63). Under this design, a maximum number of SNPs are genotyped genome-wide in a small number of cases and controls in the first stage; thereafter, a decreasing number of SNPs are selected for genotyping in an increasing number of cases and controls in successive stages, using a more stringent selection threshold at each stage until genome-wide significance is reached. There is no predetermined number of stages, but it has been shown that up to four stages may be necessary to maximise power while minimising cost (64). Owing to the reasonable cost of SNP genotyping,
this scheme has not so far been as popular as anticipated. It can be a good avenue, however, for designing follow-up studies to strengthen the evidence of association for the multitude of SNPs that, in a typical large-scale single-stage study, fall below the stringent genome-wide significance threshold. Table 6 and Fig. 3 show the power of a two-stage design (α = 5 × 10⁻⁷) with a total sample of 1,000 cases and 1,000 controls when 20–50% of the samples are genotyped genome-wide (stage 1) and the 1–10% most promising SNPs are taken forward to stage 2 in the remaining samples. The assumed disease prevalence is 10%, the risk allele frequency 25%, and the allelic relative risk 1.5 (multiplicative model). All estimates were obtained with CaTS (65) (see Note 3). The maximum achievable power of a stage 2 analysis that ignores the stage 1 samples is 0.83, while the power of a one-stage design is 0.93. The probability of ranking a truly associated locus among the top 5% of stage 1 results is 0.94 or higher when 30% or more of the samples are genotyped in stage 1, or among the top 1% when 40% or more are (Table 6). There are two types of two-stage analysis: (1) a joint analysis of stage 1 and stage 2 samples, and (2) an analysis of the stage 2 data that ignores the stage 1 samples. The power of the joint analysis is higher and approaches that of the one-stage design when at least 40% of samples are genotyped in stage 1 (or 30% of samples with at least 5% of SNPs selected for stage 2). Joint analysis is therefore the most cost-effective approach (65): it reduces the total number of genotypes just as the stage 2-only analysis does (45–79% reduction in Table 6) while retaining nearly the power of a single-stage design.
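The genotype reduction rates quoted in Table 6 (45–79%) depend only on the two design fractions, not on power, and can be checked directly (our notation):

```python
def genotype_reduction(frac_samples_stage1, frac_snps_stage2):
    """Fraction of genotypes saved by a two-stage design, relative to
    typing all samples on the full array: stage 1 types a fraction of the
    samples on all SNPs, stage 2 types the rest on the selected SNPs only."""
    return 1 - (frac_samples_stage1 + (1 - frac_samples_stage1) * frac_snps_stage2)

# matches Table 6: 50% of samples in stage 1 with 10% of SNPs carried over,
# and 20% of samples with 1% of SNPs
print(round(genotype_reduction(0.50, 0.10), 2))  # → 0.45
print(round(genotype_reduction(0.20, 0.01), 2))  # → 0.79
```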
Replication

The requirement to validate findings before they can be published in high-impact journals has prompted most GWAS investigators to split their samples into two datasets: one for initial analysis and the other for replication. This scheme differs from a two-stage GWAS design in that only the significant findings from the original analysis are tested in the second dataset, rather than a predetermined proportion of the initial SNPs. Although this scheme allows the best findings to be confirmed, it reduces the overall study power because the sample size in each data subset is smaller than in the pooled data (65). Jointly reanalysing the two subsets increases the power, but only for those SNPs significant in the first subset for which replication is sought. The ideal replication remains an independent study. When planning a replication study following an initial GWAS, it should be borne in mind that there is a danger that the study will be
Table 6
Power of a two-stage GWAS (α = 5 × 10⁻⁷). Calculations obtained using the CaTS program package

% Samples genotyped   % Markers      Genotype    Probability of carrying   Power of two-stage design   Power of joint analysis
on whole genome       selected for   reduction   over an associated        ignoring stage 1 samples    of stages 1 and 2 for
(stage 1)             stage 2        rate        locus to stage 2          in stage 2 analysis         selected markers
50                    10             0.45        0.998                     0.57                        0.93
50                    5              0.48        0.996                     0.63                        0.93
50                    1              0.50        0.978                     0.74                        0.92
40                    10             0.54        0.99                      0.73                        0.93
40                    5              0.57        0.98                      0.77                        0.92
40                    1              0.59        0.94                      0.82                        0.89
30                    10             0.63        0.97                      0.82                        0.91
30                    5              0.67        0.94                      0.83                        0.90
30                    1              0.69        0.84                      0.79                        0.81
20                    10             0.72        0.86                      0.82                        0.85
20                    5              0.76        0.83                      0.78                        0.80
20                    1              0.79        0.63                      0.61                        0.62
Fig. 3. Power of a two-stage GWAS as a function of the proportion of samples genotyped in stage 1 and the proportion of SNPs selected after stage 1 analysis for genotyping in the stage 2 samples, with the stage 2 analysis not including stage 1 samples.
underpowered. It is well documented that genetic effect size estimates for the most significant or top-ranked SNPs, as are typically reported from GWAS, are biased, particularly when the true effects are small (66, 67). Furthermore, the estimates may also be subject to ascertainment bias if, for example, the case set was genetically enriched (68, 69). Indeed, GWAS are explorative in nature and are more concerned with maximising study power than with estimating effect sizes correctly. The consequence is that power calculations for a future study based on these estimates will be overly optimistic. Garner (67) examined, analytically and through simulations, the relationships between bias, true genetic effect, allele frequency, sample size, and significance threshold. These relationships should be used to gauge the bias in a particular study when planning its replication.

2.2.2. Sources of False Positives
In case–control studies, statistical power is not the only concern. It is well known that this design is subject to a number of confounders, which may explain, at least in part, the low replication rate of early candidate-gene studies. For GWAS, the high cost of collecting a large number of samples and of large-scale genotyping demands a careful approach to ensure that positive findings are not spurious associations. Three main confounders have been extensively discussed in the literature: population structure (or stratification), cryptic relatedness, and differential genotype calling between cases and controls (differential bias).
Population Structure and Cryptic Relatedness
According to Astle and Balding (70), population structure and cryptic relatedness are two ends of a spectrum of the same confounder: the unobserved pedigree of possibly distant relationships among cases and controls. Population structure (see Chapter 21) is a confounder when there are identifiable groups or (sub)populations (e.g. for geographical, social, or cultural reasons) with unequal disease prevalence rates, and the proportions of cases and controls sampled from each group do not reflect those prevalence rates. Cryptic relatedness (see Chapter 4), on the other hand, is the unobserved pedigree structure that links cases and/or controls across generations. The multigenerational pedigree structure is hidden and may not create any visible structure in the population, but if there is a genetic cause of the disease, then cases are likely to share some distant relatives from whom the disease-causing mutations originate (71). Tackling population stratification at the study design stage is both easy and difficult. Well-designed studies select cases and controls from the same population, matched by ancestry. However, it is impossible to match the study on all potentially confounding variables, and subgroups are rarely perfectly distinct, as there is almost always some admixture between them. In populations of European ancestry, population structure inflates
tests of genetic association by about 4%, which is regarded as negligible (50, 58). It is accepted that population structure has a limited impact as long as samples are matched by ethnic background and checks are undertaken to find and exclude samples with admixed or ambiguous ethnicity (54). However, the effect of stratification increases with sample size; hence, the larger samples needed to find low-risk variants can produce greater false-positive rates if stratification is not accounted for (72). Simulation studies have shown that cryptic relatedness has only a limited impact on genetic association with disease in most populations (71). However, it can be a serious concern in small expanding populations or in populations where marriages between relatives are common. Fortunately, there is a host of methods that can correct for the adverse effects of both stratification and cryptic relatedness. It is important to keep in mind that some of these methods deal only with population structure or only with cryptic relatedness (73–76), while others adjust for both confounders (70, 77, 78). Family-based designs offer an alternative scheme for investigating genetic association that is robust to these confounders. This robustness comes at a cost, however, as more subjects have to be genotyped than in a case–control design. Family designs are often criticised for being less powerful than unrelated cases and controls, but this criticism is not always justified, because any adjustment for confounding factors in cases and controls also causes a power loss. Furthermore, it has been shown that family trios can have equal or higher power than cases and controls under some genetic models (79). Additional advantages of this design are better detection of genotyping errors and the possibility of analysing complex genetic components such as maternal effects and imprinting.

Differential Bias
Differential genotype calling between cases and controls in large-scale GWAS will inflate association tests (58–60). It can be triggered by poor DNA quality in some samples, which causes suboptimal interaction with chemical reagents. Calling genotypes separately in cases and controls has been suggested as a way to reduce this problem when samples are collected by different research centres (59). However, it has recently emerged that randomising cases and controls on plates and genotyping them together offers the advantage of testing for a plate effect on genotype calls, and therefore allows the detection of sample-processing errors that would otherwise remain undetectable. In fact, a handful of poor-quality samples can induce erroneous genotype calling on the whole plate, and even when those few samples are detected through normal quality control measures, their removal does not eliminate all the errors (60). Furthermore, eliminating the errors may cause unequal call rates in cases and controls, which itself can create false positives (63). Although methods have been proposed to detect and adjust
13 Design Considerations for Genetic Linkage and Association Studies
Fig. 4. Top screen of parameter input in the DESPAIR program.
for the adverse effects of differential bias (58), it remains unclear whether these methods eliminate all false positives while keeping all true genetic associations. It is therefore necessary to take this aspect into account at the study design stage by controlling all aspects of DNA acquisition, storage, preparation, and processing. Even with extra care, subtle differences between samples generally persist and could, in some cases, outweigh population structure in causing spurious associations.
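One family of corrections mentioned above is genomic control (58), which rescales 1-df association chi-square statistics by an inflation factor lambda estimated from genome-wide data. A standard-library sketch (the 10% inflation used for the toy data is an arbitrary illustration, not a value from the text):

```python
import random
import statistics

CHI2_1DF_MEDIAN = 0.4549  # median of the chi-square distribution with 1 df

def genomic_control(stats):
    # lambda = median observed statistic / null median; statistics are
    # then deflated by lambda (never inflated, hence the max with 1.0)
    lam = max(statistics.median(stats) / CHI2_1DF_MEDIAN, 1.0)
    return lam, [s / lam for s in stats]

# toy data: chi-square(1) draws (squared standard normals), inflated
# by 10% to mimic the effect of uncorrected structure
rng = random.Random(1)
stats = [1.1 * rng.gauss(0.0, 1.0) ** 2 for _ in range(20000)]
lam, adjusted = genomic_control(stats)
```

Because a single lambda is applied to every test, genomic control can over- or under-correct individual markers, which is one reason the text notes that such methods may not remove all false positives while keeping all true associations.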
3. Notes
1. DESPAIR: This program calculates the power and sample size for multistage linkage studies using concordant and discordant relative pairs while minimising the cost (47). It is part of the
Fig. 5. Outlook of the parameter input screen for case–control power calculation using the Genetic Power Calculator software online.
S.A.G.E. package (see Chapter 30) and can be run interactively online at http://darwin.cwru.edu/despair. Figure 4 illustrates the top screen of input parameters when the program is run. All the required information is self-explanatory, and documentation can be downloaded for a detailed description. Non-recurrence risk parameters are only required when the data contain discordant pairs. Reference (46) gives guidelines on choosing between the mean and proportion tests. Briefly, the mean test is appropriate for most designs and is the only method implemented for data containing discordant
Fig. 6. Outlook of CaTS screen displaying parameter input boxes and power calculations for one-stage, two-stage, and joint analysis designs.
relative pairs, but the proportion test can be more powerful when the recurrence risk is as high as 5 or greater.
2. Genetic Power Calculator: The Genetic Power Calculator (49) can be used freely and interactively online (http://pngu.mgh.harvard.edu/~purcell/gpc/) to estimate association study power for both discrete and quantitative traits. Figure 5 displays the parameter input screen for the case–control design. Note that the user-defined power and sample size are both required: the program calculates the power achievable with the entered sample size, and the sample size required to attain the specified power, under different models. By ticking “unselected controls,” the user assumes that the controls represent the population and may contain undiagnosed cases. The program assumes by default that controls are screened for the disease to avoid misclassifications.
3. CaTS: This software calculates the power and sample sizes for a two-stage GWAS (65) and can be freely downloaded
(www.sph.umich.edu/csg/abecasis/CaTS/) for installation on a local machine. Like the other software described above, CaTS is interactive and easy to use. Figure 6 shows a self-explanatory CaTS screen with dedicated boxes for parameter input at the top and the estimated powers at the bottom.
References
1. Lee-Kirsch MA, et al (2006) Familial chilblain lupus, a monogenic form of cutaneous lupus erythematosus, maps to chromosome 3p. Amer J Hum Genet 79: 731–737 2. Kruglyak L, et al (1996) Parametric and nonparametric linkage analysis: A unified multipoint approach. Amer J Hum Genet 58: 1347–1363 3. Lander ES, Botstein D (1987) Homozygosity Mapping - a Way to Map Human Recessive Traits with the DNA of Inbred Children. Science 236: 1567–1570 4. Mueller RF, Bishop DT (1993) Autozygosity Mapping, Complex Consanguinity, and Autosomal Recessive Disorders. J Med Genet 30: 798–799 5. Wang S, Haynes C, Barany F, Ott J (2009) Genome-Wide Autozygosity Mapping in Human Populations. Genet Epidemiol 33: 172–180 6. Boehnke M (1986) Estimating the Power of a Proposed Linkage Study - a Practical Computer-Simulation Approach. Amer J Hum Genet 39: 513–527 7. Ploughman LM, Boehnke M (1989) Estimating the Power of a Proposed Linkage Study for a Complex Genetic Trait. Amer J Hum Genet 44: 543–551 8. Samani NJ, et al (2005) A genomewide linkage study of 1,933 families affected by premature coronary artery disease: The British Heart Foundation (BHF) family heart study. Amer J Hum Genet 77: 1011–1020 9. Whittemore AS, Tu IP (1998) Simple, robust linkage tests for affected sibs. Amer J Hum Genet 62: 1228–1242 10. Risch N, Merikangas K (1996) The future of genetic studies of complex human diseases. Science 273: 1516–1517 11. Risch N (1990) Linkage Strategies for Genetically Complex Traits. II. The Power of Affected Relative Pairs. Amer J Hum Genet 46: 229–241 12.
Lander E, Kruglyak L (1995) Genetic Dissection of Complex Traits - Guidelines for Interpreting and Reporting Linkage Results. Nat Genet 11: 241–247
13. Bishop DT, Williamson JA (1990) The Power of Identity-by-State Methods for Linkage Analysis. Amer J Hum Genet 46: 254–265 14. Risch NJ (2000) Searching for genetic determinants in the new millennium. Nature 405: 847–856 15. Brown BD, et al (2010) An evaluation of inflammatory gene polymorphisms in sibships discordant for premature coronary artery disease: the GRACE-IMMUNE study. BMC Medicine 8: 5 16. Hodge SE, Vieland VJ, Greenberg DA (2002) HLODs remain powerful tools for detection of linkage in the presence of genetic heterogeneity. Amer J Hum Genet 70: 556–558 17. Whittemore AS, Halpern J (2001) Problems in the definition, interpretation, and evaluation of genetic heterogeneity. Amer J Hum Genet 68: 457–65 18. Altmuller J, et al (2001) Genomewide scans of complex human diseases: True linkage is hard to find. Amer J Hum Genet 69: 936–50 19. Hauser ER, et al. (2004) Ordered subset analysis in genetic linkage mapping of complex traits. Genet Epidemiol 27: 53–63 20. Nsengimana J, et al (2007) Enhanced linkage of a locus on chromosome 2 to premature coronary artery disease in the absence of hypercholesterolemia. Eur J Hum Genet 15: 313–319 21. Almasy L, Blangero J (2009) Human QTL linkage mapping. Genetica 136: 333–340 22. Abecasis GR, Cherny SS, Cardon LR (2001) The impact of genotyping error on family-based analysis of quantitative traits. Eur J Hum Genet 9: 130–134 23. Abecasis GR, et al (2001) GRR: graphical representation of relationship errors. Bioinformatics 17: 742–743 24. Pompanon F, et al (2005) Genotyping errors: Causes, consequences and solutions. Nat Rev Genet 6: 847–859 25. Chang YPC, et al (2006) The impact of data quality on the identification of complex disease genes: experience from the Family Blood Pressure Program. Eur J Hum Genet 14: 469–477
26. Goring HHH, Ott J (1997) Relationship estimation in affected sib pair analysis of late-onset diseases. Eur J Hum Genet 5: 69–77 27. Boehnke M, Cox NJ (1997) Accurate inference of relationships in sib-pair linkage studies. Amer J Hum Genet 61: 423–429 28. Douglas JA, Boehnke M, Lange K (2000) A multipoint method for detecting genotyping errors and mutations in sibling-pair linkage data. Amer J Hum Genet 66: 1287–1297 29. Sun L, Wilder K, McPeek MS (2002) Enhanced pedigree error detection. Hum Hered 54: 99–110 30. Sobel E, Papp JC, Lange K (2002) Detection and integration of genotyping errors in statistical genetics. Amer J Hum Genet 70: 496–508 31. Ray A, Weeks DE (2008) Relationship uncertainty linkage statistics (RULS): Affected relative pair statistics that model relationship uncertainty. Genet Epidemiol 32: 313–324 32. Hauser ER, et al. (1996) Affected-sib-pair interval mapping and exclusion for complex genetic traits: Sampling considerations. Genet Epidemiol 13: 117–137 33. Sawcer SJ, et al (2004) Enhancing linkage analysis of complex disorders: an evaluation of high-density genotyping. Hum Mol Genet 13: 1943–1949 34. Evans DM, Cardon LR (2004) Guidelines for genotyping in genomewide linkage studies: Single-nucleotide-polymorphism maps versus microsatellite maps. Amer J Hum Genet 75: 687–692 35. Guo XQ, Elston RC (2000) Two-stage global search designs for linkage analysis II: Including discordant relative pairs in the study. Genet Epidemiol 18: 111–27 36. Huang QQ, Shete S, Amos CI (2004) Ignoring linkage disequilibrium among tightly linked markers induces false-positive evidence of linkage for affected sib pair analysis. Amer J Hum Genet 75: 1106–1112 37. Schaid DJ, et al (2004) Comparison of microsatellites versus single-nucleotide polymorphisms in a genome linkage screen for prostate cancer-susceptibility loci. Am J Hum Genet 75: 948–65 38.
Nsengimana J, Renard H, Goldgar D (2005) Linkage analysis of complex diseases using microsatellites and single-nucleotide polymorphisms: application to alcoholism. BMC Genet 6: S10 39. Wilcox MA, et al (2005) Comparison of single-nucleotide polymorphisms and microsatellite markers for linkage analysis in the COGA and simulated data sets for Genetic Analysis Workshop 14. Genet Epidemiol 29: S7–S28
40. Boyles AL, et al (2005) Linkage disequilibrium inflates type I error rates in multipoint linkage analysis when parental genotypes are missing. Hum Hered 59: 220–227 41. Abecasis GR, Wigginton JE (2005) Handling marker-marker linkage disequilibrium: Pedigree analysis with clustered markers. Am J Hum Genet 77: 754–67 42. Kurbasic A, Hossjer O (2008) A general method for linkage disequilibrium correction for multipoint linkage and association. Genet Epidemiol 32: 647–57 43. Webb EL, Sellick GS, Houlston RS (2005) SNPLINK: multipoint linkage analysis of densely distributed SNP data incorporating automated linkage disequilibrium removal. Bioinformatics 21: 3060–3061 44. Fukuda Y, et al (2009) SNP HiTLink: a high-throughput linkage analysis system employing dense SNP data. BMC Bioinformatics 10: 121 45. Selmer KK, et al (2009) Genome-wide Linkage Analysis with Clustered SNP Markers. J Biomol Screen 14: 92–96 46. Fischer ANM, et al (2010) A genome-wide linkage analysis in 181 German sarcoidosis families using clustered bi-allelic markers. Chest 138: 151–157 47. Guo XQ, Elston RC (2000) Two-stage global search designs for linkage analysis I: Use of the mean statistic for affected sib pairs. Genet Epidemiol 18: 97–110 48. Ochs-Balcom HM, et al (2010) Program update and novel use of the DESPAIR program to design a genome-wide linkage study using relative pairs. Hum Hered 69: 45–51 49. Purcell S, Cherny SS, Sham PC (2003) Genetic Power Calculator: design of linkage and association genetic mapping studies of complex traits. Bioinformatics 19: 149–150 50. WTCCC (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661–678 51. Bishop DT, et al (2009) Genome-wide association study identifies three loci associated with melanoma risk. Nat Genet 41: 920–925 52. Panoutsopoulou K, Zeggini E (2009) Finding common susceptibility variants for complex disease: past, present and future. Brief Funct Genomic Proteomic 8: 345–352 53.
Spencer CCA, et al (2009) Designing Genome-Wide Association Studies: Sample Size, Power, Imputation, and the Choice of Genotyping Chip. PLOS Genetics 5: e1000477
54. McCarthy MI, et al (2008) Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet 9: 356–369 55. Amos CI (2007) Successful design and conduct of genome-wide association studies. Hum Mol Genet Spec 2: R220-R225 56. Zondervan KT, Cardon LR, Kennedy SH (2002) What makes a good case-control study? Design issues for complex traits such as endometriosis. Hum Reprod 17: 1415–1423 57. Newton-Cheh C, Hirschhorn JN (2005) Genetic association studies of complex traits: design and analysis issues. Mutat Res-Fund Mol M 573: 54–69 58. Clayton DG, et al (2005) Population structure, differential bias and genomic control in a large-scale, case-control association study. Nat Genet 37: 1243–1246 59. Plagnol V, et al (2007) A method to address differential bias in genotyping in large-scale association studies. PLOS Genet 3: e74 60. Pluzhnikov A, et al. (2010) Spoiling the whole bunch: quality control aimed at preserving the integrity of high-throughput genotyping. Am J Hum Genet 87: 123–28 61. Tabor HK, Risch NJ, Myers RM (2002) Candidate-gene approaches for studying complex genetic traits: practical considerations. Nat Rev Genet 3: 391–7 62. Pettersson FH, et al. (2009) Marker selection for genetic case-control association studies. Nat Protoc 4: 743–752 63. Hirschhorn JN, Daly MJ (2005) Genomewide association studies for common diseases and complex traits. Nat Rev Genet 6: 95–108 64. Pahl R, Schafer H, Muller HH (2009) Optimal multistage designs – a general framework for efficient genome-wide association studies. Biostatistics 10: 297–309 65. Skol AD, et al. (2006) Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat Genet 38: 209–213 66. Bowden J, Dudbridge F (2009) Unbiased Estimation and Inference for Replicated Associations Following a Genome Scan. Genet Epidemiol 33: 406–418 67. Garner C (2007) Upward bias in odds ratio estimates from genome-wide association studies. Genet Epidemiol 31: 288–295 68. Goldgar D, et al (2007) BRCA phenocopies or ascertainment bias? J Med Genet 44: 10–15 69. Terwilliger JD, Weiss KM (2003) Confounding, ascertainment bias, and the blind quest for a genetic ‘fountain of youth’. Ann Med 35: 532–544 70. Astle W, Balding DJ (2009) Population Structure and Cryptic Relatedness in Genetic Association Studies. Stat Sci 24: 451–471 71. Voight BF, Pritchard JK (2005) Confounding from cryptic relatedness in case-control association studies. PLOS Genet 1: 302–311 72. Marchini J, et al (2004) The effects of human population structure on large genetic association studies. Nat Genet 36: 512–517 73. Choi Y, Wijsman EM, Weir BS (2009) Case-Control Association Testing in the Presence of Unknown Relationships. Genet Epidemiol 33: 668–678 74. Slager SL, Schaid DJ (2001) Evaluation of candidate genes in case-control studies: A statistical method to account for related subjects. Am J Hum Genet 68: 1457–1462 75. Bourgain C, et al (2003) Novel case-control test in a founder population identifies P-selectin as an atopy-susceptibility locus. Am J Hum Genet 73: 612–626 76. Pritchard JK, et al. (2000) Association mapping in structured populations. Am J Hum Genet 67: 170–181 77. Sillanpaa MJ (2011) Overview of techniques to account for confounding due to population stratification and cryptic relatedness in genomic data association analyses. Heredity 106(4): 511–519 78. Price AL, et al (2010) New approaches to population stratification in genome-wide association studies. Nat Rev Genet 11: 459–463 79. Laird NM, Lange C (2009) The Role of Family-Based Designs in Genome-Wide Association Studies. Stat Sci 24: 388–397
Chapter 14 Model-Based Linkage Analysis of a Quantitative Trait Audrey H. Schnell and Xiangqing Sun Abstract Linkage analysis is a family-based method of analysis to examine whether any typed genetic markers co-segregate with a given trait, in this case a quantitative trait. If linkage exists, this is taken as evidence in support of a genetic basis for the trait. Historically, linkage analysis was performed using a binary disease trait, but has been extended to include quantitative disease measures. Quantitative traits are desirable as they provide more information than binary traits. Linkage analysis can be performed using single marker methods (one marker at a time) or multipoint (using multiple markers simultaneously). In model-based linkage analysis, the genetic model for the trait of interest is specified. There are many software options for performing linkage analysis. Here, we use the program package Statistical Analysis for Genetic Epidemiology (S.A.G.E.). S.A.G.E. was chosen because it includes programs to perform data cleaning procedures and to generate and test genetic models for a quantitative trait, in addition to performing linkage analysis. We demonstrate in detail the process of running the program LODLINK to perform single marker analysis and MLOD to perform multipoint analysis using output from SEGREG, where SEGREG was used to determine the best fitting statistical model for the trait. Key words: Linkage analysis, Quantitative trait, LOD score, Recombination fraction, Statistical Analysis for Genetic Epidemiology, Single point, Multipoint, Model-based, Segregation analysis, Homogeneity
1. Introduction 1.1. Linkage Analysis
“Linkage” is the term used to denote the occurrence of two loci on the same chromosome (i.e., syntenic) that are inherited together. Two loci are said to be linked if they are close enough on the same chromosome that recombination during meiosis is rare enough for the loci to “cosegregate,” and this phenomenon can then be observed within families. Classic model-based linkage, where model-based refers to the genetic model for the trait, was developed to assess the association of a binary trait (e.g., affection status) and a genetic marker within families (pedigrees). However, procedures have also been developed to adapt linkage analysis for a
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_14, © Springer Science+Business Media, LLC 2012
A.H. Schnell and X. Sun
quantitative trait, and these will be demonstrated in this chapter using methods based on the exact likelihood calculation. Model-based linkage is sometimes referred to as parametric linkage, but the term model-based will be used here. If a trait demonstrates familial aggregation (see Chapters 8–10) and/or segregation (see Chapter 12), linkage analysis would be justified. A finding of linkage would provide a basis for further study using association analysis (described in later chapters). Classic model-based linkage analysis calculates the likelihood ratio in favor of linkage to a marker over a range of recombination fractions (θ). If recombination between the marker and trait loci is infrequent, it is possible to detect linkage. The LOD score is the base-10 logarithm of the likelihood ratio, often miscalled the log of the odds for linkage. Rather than calculating odds for linkage, it compares the likelihood of observing the data if the two loci are linked (θ < 1/2) to the likelihood of observing the data if the two loci are unlinked (θ = 1/2). LOD scores are calculated for each family and then summed over all families. Classically, a LOD score of 3 or greater indicates significant evidence of linkage, and a LOD score of −2 or less indicates rejection of linkage. A LOD score between −2 and 3 is considered inconclusive and warrants additional study, while a LOD score of at least 2 but less than 3 indicates suggestive linkage. In model-based linkage, the mode of inheritance is specified in the analysis. Specifying the disease/trait model leads to more power than not doing so, but if the model is misspecified, power is lost. Testing more than one model is also possible, but this has the drawback that a correction for performing multiple tests needs to be applied (1).
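For the simplest fully informative, phase-known situation, the LOD score can be computed directly; this is a didactic sketch only, not the pedigree likelihood that linkage software actually evaluates:

```python
import math

def lod(k, n, theta):
    # LOD = log10 L(theta) - log10 L(1/2), for k recombinants
    # observed among n informative, phase-known meioses
    if not 0.0 < theta < 0.5:
        raise ValueError("theta must lie in (0, 0.5)")
    loglik = k * math.log10(theta) + (n - k) * math.log10(1.0 - theta)
    return loglik - n * math.log10(0.5)

# 2 recombinants in 20 meioses: the score peaks near theta = 2/20
grid = [0.01, 0.05, 0.1, 0.2, 0.3, 0.4]
scores = {t: lod(2, 20, t) for t in grid}
best = max(grid, key=lambda t: scores[t])
```

At theta = 0.1 (the maximum-likelihood estimate 2/20) the score is about 3.2, which under the classical criterion would count as significant evidence of linkage.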
Analysis using more than one marker locus (multipoint analysis) can also be performed and increases the statistical power to detect linkage, because multiple markers increase the meiotic information available and can localize the trait locus more precisely to the interval between two neighboring markers (2, 3).
1.2. Model Specification
In a model-based linkage analysis, the mode of inheritance of the trait must be specified. If the mode of inheritance is unknown, segregation analysis can be used to determine the best fitting model for the trait (see Chapter 12). For both quantitative and binary traits, by using the program SEGREG and specifying a Mendelian transmission model, it is possible to request a “type” output file that has the extension .typ. This file contains all the information needed to model the trait for each individual and can be used as input for the model-based linkage analysis programs LODLINK and MLOD. Alternatively, you can specify the trait model by building a different kind of file, the “trait locus” file, which includes the necessary trait locus allele frequencies and genotype penetrances (see Note 1).
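The quantities that such a trait locus specification encodes, a disease-allele frequency and genotype penetrances, determine (for a binary trait) the population prevalence under Hardy-Weinberg equilibrium; a sketch with illustrative parameter values:

```python
def prevalence(q, penetrances):
    # q: disease-allele frequency; penetrances: P(affected | genotype)
    # for (DD, Dd, dd); genotype frequencies follow Hardy-Weinberg
    p = 1.0 - q
    geno_freqs = (q * q, 2.0 * p * q, p * p)
    return sum(g * f for g, f in zip(geno_freqs, penetrances))

# e.g. a fully penetrant dominant model with q = 0.01 (hypothetical values)
K = prevalence(0.01, (1.0, 1.0, 0.0))
```

Checking that the implied prevalence matches what is known about the trait is a quick sanity test of any trait model one intends to specify.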
14 Model-Based Linkage Analysis of a Quantitative Trait
1.3. Single-Marker or Multipoint Analysis
Information from the markers can be used in pairwise linkage analysis (2), typically called single-marker (or two-point: one marker locus and one trait locus) analysis, or all the markers in a given region can be used together, in what is called multipoint linkage analysis (3, 4). Multipoint analysis is more computationally intensive and limits the size of pedigree that can be analysed, as will be discussed later in this chapter. Options for overcoming this in MLOD are: the default value for pedigree size can be increased, pedigrees larger than the default value can be skipped, or the user may choose to “split” pedigrees into smaller families. Two-point linkage and multipoint linkage have different properties. In two-point linkage, misspecification of the trait model may inflate the estimate of the recombination fraction, with loss of power to detect linkage (5). In multipoint linkage, errors in the order of the markers or in the inter-marker distances, genotyping errors, or any misspecification of the trait model, including heterogeneity, tend to decrease the power to detect linkage more severely than in two-point linkage (5–7). It is therefore advisable to perform both two-point and multipoint linkage analyses and to compare the results carefully.
1.4. File Descriptions
The LODLINK program requires a pedigree file, locus description file, type file (or alternatively a trait marker description file), and parameter file. For multipoint analysis, MLOD additionally requires a genome description file, which provides the genetic distances between consecutive markers. These files are described below, where S.A.G.E. key words are italicized (and explanations are in parentheses).
1.4.1. Parameter File
The S.A.G.E. parameter file is organized into “blocks.” There is one section/block to specify the data format, and each program has a program-specific block. Using the GUI eliminates the need to know the exact syntax for the commands in the parameter file. Only one parameter file is required when running S.A.G.E., and when a particular program is run, any commands related to other programs are ignored.
1.4.2. Pedigree File
The pedigree file is a column-delimited (typically with either spaces or tabs) text file that contains the information required to describe the families, individuals, traits, and markers. There is a column for family identifier (id), individual id, father id, mother id, and sex (see Notes 2 and 3). Each pedigree is uniquely numbered, and for each individual either no parents (these individuals are founders) or both parents are specified. Sometimes “dummy records” are needed for the parents or other non-studied individuals to link the pedigree members accurately according to their relationship (see Note 4). The combination of family id and individual number uniquely identifies each individual. Any number of additional fields can be
included for traits, markers, and covariates. Variables can be in any column, and using the GUI, the columns are then mapped to S.A.G.E. fields. S.A.G.E. requires one record (row) per individual, with the pedigree relationship data and all other data in the same record (see Note 5). Families can consist of parents and offspring (a nuclear family) or be multigenerational. A nuclear family would be represented like this:

Family ID   Individual   Parent1   Parent2   Sex
1           1            0         0         M
1           2            0         0         F
1           3            1         2         M
1           4            1         2         F
Individuals 1 and 2 are father and mother, respectively, and 3 and 4 are full sibs. Any degree of family relationships/generations can be described using this scheme. For linkage analysis, multigenerational families with known trait values for each family member provide more information than nuclear families. Pedigrees with loops cannot be analyzed by LODLINK.
1.4.3. Marker Locus Description File
The marker locus description file contains the marker (locus) name, allele frequencies, and marker genotype to phenotype relationships (the latter need not be directly specified for fully penetrant codominant markers). An example would look like:
Marker1 (arbitrary marker name, but it must match the label in the parameter file and, if there is one, the label in the header row in the data file)
1 = 0.01 2 = 0.99; (this semicolon denotes the end of the allele frequency list)
; (this semicolon denotes the end of the marker description if it is codominant)
If the markers are fully penetrant codominant markers, the S.A.G.E. program FREQ can be used to generate this file from the data. All markers in the data should be present in the marker locus description file; those not in this file will be ignored. S.A.G.E. is not limited to diallelic markers (e.g., most SNPs) or codominant markers.
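For fully penetrant codominant markers typed on unrelated individuals, maximum-likelihood allele frequency estimation reduces to simple gene counting; a sketch (the genotype coding and the output format are assumptions modelled on the example above):

```python
from collections import Counter

def allele_frequencies(genotypes, sep="/", missing="."):
    # count each allele twice per genotype, skipping missing values
    counts = Counter()
    for g in genotypes:
        if g == missing:
            continue
        a1, a2 = g.split(sep)
        counts[a1] += 1
        counts[a2] += 1
    total = sum(counts.values())
    return {a: c / total for a, c in sorted(counts.items())}

genos = ["1/1", "1/2", "2/2", "1/2", ".", "2/2"]
freqs = allele_frequencies(genos)
# a locus-description-style entry assembled from the estimates
entry = "Marker1\n" + " ".join(f"{a} = {f}" for a, f in freqs.items()) + ";\n;"
```

In pedigree data the individuals are not independent, which is why FREQ offers maximum likelihood estimation and founder/non-founder weighting rather than naive counting.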
1.4.4. Genome File
In multipoint analysis, a genome file is necessary. This file contains at least one genomic region and the names of sequentially ordered
marker loci and the distances or recombination fractions between pairs of adjacent markers. (A map function—Kosambi or Haldane, which is the default—is used to convert map distances to recombination fractions.) The file looks like this:
As in the locus description file, the markers should match the marker names/labels in both the data file and the parameter file. Markers not included here, even if present in the data, are ignored (see Note 6).
1.4.5. Type File
This is the file output by SEGREG that contains the trait model information. Alternatively, a trait marker locus file can be constructed (see Note 1).
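The Haldane and Kosambi map functions mentioned for the genome file convert a map distance d (in Morgans) into a recombination fraction; a sketch taking distances in centiMorgans:

```python
import math

def haldane_theta(d_cm):
    # Haldane: theta = (1 - exp(-2d)) / 2, assuming no interference
    d = d_cm / 100.0
    return (1.0 - math.exp(-2.0 * d)) / 2.0

def kosambi_theta(d_cm):
    # Kosambi: theta = tanh(2d) / 2, allowing for interference
    d = d_cm / 100.0
    return math.tanh(2.0 * d) / 2.0

# for small distances both are close to d itself; for large
# distances both approach the unlinked value of 1/2
theta_h = haldane_theta(10.0)   # about 0.091
theta_k = kosambi_theta(10.0)   # about 0.099
```

The two functions differ appreciably only at intermediate distances, which is why the choice of map function rarely matters for dense marker maps.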
2. Methods 2.1. Linkage Analysis Software
There are several programs available to perform linkage analysis (for example, Genehunter, Vitesse, or Mendel (8–10)). In this example, we will use the S.A.G.E. program package (11), which has programs for segregation analysis (SEGREG), data quality control (MARKERINFO, RELTEST), model-based two-point (LODLINK) and multipoint (MLOD) linkage analyses. S.A.G.E. can be downloaded from http://darwin.cwru.edu after obtaining a free license. The program is available for Windows, UNIX, or Mac operating systems (see Note 7). The UNIX version is run from the command line and the PC or Mac version can be run from a GUI or command line. In this chapter, we demonstrate using S.A.G.E. via the GUI (see Note 8).
Fig. 1. Opening S.A.G.E. and starting a new project.
After downloading the program, it can be installed by double-clicking the .exe file. The installation wizard guides the user through the installation process in a standard fashion and is self-explanatory. For the Windows version, the GUI is automatically installed. Previous versions of the program, if they exist on your computer, are not uninstalled (see Note 9).
2.2. Starting S.A.G.E.: Creating a New Project
When S.A.G.E. is opened the user has the option to: “Create a New Project,” “Open Existing Project,” or “View Recently Accessed Projects.” For our purposes, we will choose to create a new project and, when prompted on the next screen, specify a project name and the directory where the project will be stored. Next, the following screen prompts the user to specify what files exist and what files need to be created by choosing one of three options: “I am creating a new S.A.G.E. project from scratch. I may not have all pedigree data fields required by S.A.G.E.” or “I have all the pedigree data required by S.A.G.E. but no parameter file” or “I have pedigree data in S.A.G.E. ready format and one or more parameter files” (see Fig. 1).
Fig. 2. Importing the data file into S.A.G.E. using the GUI.
For the purposes of this chapter, we assume the user has a data file but needs to create a parameter file. Therefore, the option “I have all the pedigree data required by S.A.G.E. but no parameter file” is selected. The path to the pedigree file is filled in and, next on the same screen, the format of the data is specified: the column delimiter, whether multiple delimiters are treated as a single delimiter, or, alternatively, whether the pedigree file is an Excel spreadsheet (see Fig. 2). In either case, if the file contains a header row (i.e., the first row contains the name of the variable in that column), the “Yes” circle under “Header” should be checked. It is possible to use a different column delimiter (e.g., spaces or tabs). It is important to remember that the chosen column delimiter should not also be used to denote missing values. In addition, the column delimiter should not be used as the allele separator [the character used to separate the alleles when the genotype is given in one column (e.g., 1/2, see Notes 10 and 11)].
2.3. Reading in the Data
Figures 3–5 illustrate reading in the pedigree file and describing the data. There is a section labeled “General Specifications and Variable Characteristics” that must be clicked and filled in (see Fig. 3). For this example, our trait will be a quantitative measure of DbH (dopamine beta-hydroxylase), and 5,539 SNP markers across the genome
Fig. 3. Variable characteristics and general specifications: Identifying the individual missing value and sex coding.
Fig. 4. Mapping the columns in the file to S.A.G.E. fields.
Fig. 5. Trait characteristics.
on chromosomes 1–22 will be analyzed. The pedigree file is a tab-delimited file with SNPs coded allele1/allele2. Missing parents are coded 0, males are coded M, females F, and missing marker and DbH values are denoted with a dot (.) (see Note 12). Only autosomal chromosomes are used in the analyses. In Fig. 4, the data fields in the file are mapped to S.A.G.E. fields. The data file is shown (in this case there is a header row) and the S.A.G.E. field that describes the variable is specified with a pull-down menu. Figure 5 shows how to describe the trait characteristics. Figure 6 illustrates how the markers are read into S.A.G.E. Fortunately, it is not necessary to describe all 5,539 markers; only the first marker is described and this description is then applied to the remaining markers. This assumes the markers are in sequential order and the same missing value code is used for all markers. The former is not necessary, but the latter is. The final parameter file would look like:
Fig. 6. Reading in the SNP markers.
This file can be viewed with any text editor (see Note 13).
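Outside S.A.G.E., the coding conventions described above (tab-delimited columns with a header row, missing parents coded 0, missing values as a dot) make the pedigree file easy to sanity-check with a few lines of code; the column names used here are hypothetical:

```python
import csv
import io

# a tiny pedigree in the format described: header row, tab-delimited,
# missing parents coded 0, sex M/F, missing trait values "."
data = (
    "PID\tID\tFATHER\tMOTHER\tSEX\tDBH\n"
    "1\t1\t0\t0\tM\t22.5\n"
    "1\t2\t0\t0\tF\t.\n"
    "1\t3\t1\t2\tM\t31.0\n"
)

rows = list(csv.DictReader(io.StringIO(data), delimiter="\t"))
founders = [r["ID"] for r in rows if r["FATHER"] == "0" and r["MOTHER"] == "0"]
dbh_values = [float(r["DBH"]) for r in rows if r["DBH"] != "."]
```

Simple checks of this kind (every non-founder has both parents present, missing codes are consistent) catch many data problems before the S.A.G.E. programs are run.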
Fig. 7. Running PEDINFO.
2.4. Quality Control
Once the data have been read into S.A.G.E. via the GUI and the parameter file created, the S.A.G.E. programs can be run. Figure 7 shows the GUI screen where programs are organized and selected, here with the program PEDINFO chosen. This program can be used to obtain basic information about the families, such as pedigree size and structure, and to ensure that the data are read in correctly. The .inf output file from PEDINFO reports how the data were read in by S.A.G.E. This is a very useful file: even if the program runs without error, this file should be examined, as it is still possible the data are not being read in as expected or the user is unaware of some other feature of the data. Family structure information on the first ten individuals is displayed, as are the phenotypes for the first ten individuals. S.A.G.E. also prints out what files were read in (see Note 14). Figure 8 illustrates some of the statistics in the PEDINFO .out output file. Note the column labeled “# of inheritance vector bits”: this is not relevant for single-point analysis, but will be important for multipoint analysis (see Subheading 2.7).
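As a rough, stand-alone counterpart to these PEDINFO checks, pedigree sizes and founder counts can be tallied directly from the raw records. This Python sketch is illustrative only; PEDINFO itself reports far more (sibship sizes, generations, inheritance vector bits, and so on).

```python
from collections import defaultdict

def pedigree_summary(rows):
    """Tally pedigree size and founder count from
    (pedigree_id, individual_id, father_id, mother_id) tuples,
    where "." marks a missing parent. A rough sanity check only."""
    size, founders = defaultdict(int), defaultdict(int)
    for ped, ind, father, mother in rows:
        size[ped] += 1
        if father == "." and mother == ".":
            founders[ped] += 1
    return {p: {"size": size[p], "founders": founders[p]} for p in size}
```

Comparing such counts against the pedigree drawings is a quick way to catch individuals dropped by a mis-specified delimiter.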
Fig. 8. A section of the PEDINFO.inf output file.
In addition to PEDINFO, S.A.G.E. has two other programs that provide information on data quality. MARKERINFO reports Mendelian errors and RELTEST uses genome scan data to evaluate the reported nuclear family relationships.

2.5. Making the Marker Locus Description File
The marker locus description file contains the name of each marker and allele frequencies. This file can be generated automatically using the S.A.G.E. program FREQ (Fig. 9). FREQ estimates, using maximum likelihood estimation (MLE), the marker allele frequencies and requires as input the pedigree file and a parameter file. The user has the option to estimate the marker-specific inbreeding coefficient, to choose all or a subset of markers and to skip MLE, instead specifying weights to give to the founders versus the non-founders.
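For intuition about what FREQ estimates, a simple counting estimate of allele frequencies among founders can be computed as below. This is only a rough stand-in: FREQ’s maximum likelihood estimation uses the whole pedigree structure, not just founder counts.

```python
from collections import Counter

def founder_allele_freqs(genotypes):
    """Counting estimate of allele frequencies from founder genotypes
    given as (allele1, allele2) tuples. Shown for intuition only;
    FREQ's MLE also extracts information from non-founders."""
    counts = Counter(a for g in genotypes for a in g)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}
```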
2.6. Running LODLINK
In this example, the program SEGREG will be used to generate the model. A dominant model with a square root transformation was chosen as the best fitting model. SEGREG fitted this model and output a type probability file with the extension .typ that will be used as input to LODLINK (and also MLOD). This file specifies, for each individual, the allele frequency and individual-specific penetrance information. With the type probability file and marker locus description file generated, we are ready to run LODLINK. We create the parameter file (see Fig. 10) and choose the default analysis options (see Fig. 11 and Note 15). The trait marker file is the type probability output file from SEGREG. The default values for the recombination fraction are 0.0000, 0.0100, 0.0500, 0.1000, 0.2000, 0.3000, and 0.4000, but the user can specify other values (see Note 16).
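LODLINK evaluates the pedigree likelihood at each recombination fraction in this grid. For the textbook case of phase-known, fully informative meioses, the two-point LOD score has a closed form, sketched in Python below for intuition only (LODLINK itself computes likelihoods over arbitrary pedigree structures).

```python
import math

def two_point_lod(r, n, theta):
    """Two-point LOD for r recombinants among n phase-known, fully
    informative meioses: log10 of the likelihood ratio versus
    theta = 0.5, i.e., log10[theta^r (1-theta)^(n-r) / 0.5^n]."""
    if theta == 0.0:
        return float("-inf") if r > 0 else n * math.log10(2.0)
    return (r * math.log10(theta)
            + (n - r) * math.log10(1.0 - theta)
            + n * math.log10(2.0))

grid = [0.0, 0.01, 0.05, 0.10, 0.20, 0.30, 0.40]  # LODLINK defaults
lods = [two_point_lod(1, 20, t) for t in grid]    # 1 recombinant in 20 meioses
```

In this toy case the grid maximum falls at theta = 0.05, matching the simple estimate of 1/20 recombinants.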
Fig. 9. The program FREQ to generate the marker locus description file.
Fig. 10. LODLINK file input.
Fig. 11. LODLINK analysis definitions.
The LODLINK section of the parameter file would look like this:
where “LODLINK1” is the name we give the output file, LODLINK1 is the title we chose, and segreg1 is the name of the trait in the SEGREG type file.
Fig. 12. LODLINK sum file part 1.
Fig. 13. LODLINK sum file part 2.
2.7. LODLINK Output
Figure 12 shows a section of the .sum file output. This output summary file contains two parts. The first part gives the LOD score for each marker at the recombination fractions specified in the parameter file (by default 0, 0.01, 0.05, 0.1, 0.2, 0.3, and 0.4), with one row per SNP and one column per recombination fraction. In Fig. 12, we see that the largest LOD score (3.604300) occurs at a recombination fraction of 0 (for rs1611122) (see Note 17). The second part of the file (see Fig. 13) shows, for each marker, the maximum LOD score and the recombination fraction at which it occurs. There are columns showing the estimate of the recombination fraction in the interval [0, 0.5] and in the interval [0, 1] (see Note 18), the chi-square statistic, and the P value. In our example, the P value is 2.31e−05. Examining nearby markers also shows LOD scores suggestive of linkage.
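The reported chi-square statistic and P value are simple functions of the maximum LOD score: the likelihood-ratio chi-square is 2 ln(10) times the LOD, and the P value is one-sided (halved) because the recombination fraction is tested on the boundary of its range. A Python sketch of the conversion:

```python
import math

def lod_to_pvalue(lod):
    """Convert a two-point LOD score to the one-sided P value of the
    corresponding likelihood-ratio test: chi-square = 2*ln(10)*LOD on
    1 df, halved because theta = 0.5 lies on the boundary."""
    chi2 = 2.0 * math.log(10.0) * lod
    # survival function of chi-square(1 df) is erfc(sqrt(x/2))
    return 0.5 * math.erfc(math.sqrt(chi2 / 2.0))

# lod_to_pvalue(3.6043) is approximately 2.31e-05,
# matching the P value reported in this example
```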
In addition, there is a .det file that contains the LOD scores for each pedigree. Some of the markers produce a LOD score of 3 or greater, but inspection of the pedigrees individually might warrant testing for linkage assuming heterogeneity (see Note 15). Additionally, we will look at MLOD to see if there is an increase in the LOD score using multipoint linkage analysis.

2.8. Multipoint Analysis
MLOD performs multipoint analysis, using information from multiple markers simultaneously to infer the identity by descent (IBD) status among family members. Multipoint analysis is generally considered to be more powerful than single-point analysis as it uses more information. Although it can analyze pedigrees with loops, MLOD has the limitation that only small pedigrees can be analyzed, owing to the nature of the method used to calculate IBD status. This limitation is based on the number of inheritance vector bits for the pedigree, defined as 2n − f, where n is the number of non-founders and f is the number of founders in the pedigree. The default maximum is 18 bits. In this example, “Max Size” for inheritance vector bits will be increased to 19 (see Note 19). The MLOD parameter file is illustrated in Figs. 14 and 15. An additional input file, the genome description file, is required (see Subheading 1.4). Figure 16 illustrates the first few lines of this file, which describes the location of, and distances between, sequential markers. All regions in the genome description file can be analyzed or a subset can be chosen. The trait marker is once again specified using the type file output from SEGREG.
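The bit count that limits MLOD is easy to compute by hand; a one-line Python version of the 2n − f formula:

```python
def inheritance_vector_bits(n_nonfounders, n_founders):
    """Number of inheritance vector bits for a pedigree: 2n - f,
    where n = non-founders and f = founders. MLOD's default cap
    is 18 bits; the chapter raises it to 19."""
    return 2 * n_nonfounders - n_founders

# e.g., a pedigree with 4 founders and 7 non-founders needs 2*7 - 4 = 10 bits
```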
Fig. 14. MLOD parameter file creation.
Fig. 15. MLOD analysis options.
Fig. 16. Beginning of the genome description file.
2.9. MLOD Output
The .sum file contains four columns: the first is the genetic position of the marker, the second is the marker name, the third is the multipoint LOD score at that position, and the fourth is the linkage information content at that position, which indicates how much information, on a scale from 0 to 1, there is for testing linkage (see Fig. 17). The LOD scores obtained from MLOD show significant
Fig. 17. MLOD .sum output.
evidence of linkage. Note that the highest LOD score (6.3397) is not at rs1611122, at which two-point analysis gave the largest LOD score (3.604300), but at a nearby marker, rs456396. Further analyses of these data can be found in ref. 12. The .det file contains information on individuals removed from the analysis and the position, marker name, LOD score, and information content, marker by marker (see Note 20). With the genetic positions and LOD scores from the .sum file, you can plot the multipoint linkage profile using other software, such as R.
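If the four .sum columns are exported to a plain-text file, locating the multipoint peak outside of S.A.G.E. takes only a few lines. The rows below are invented for illustration; the column layout follows the .sum description above (position, marker, LOD, information content).

```python
def peak_lod(records):
    """Return the row with the largest multipoint LOD from rows shaped
    like the MLOD .sum columns: (position_cM, marker, lod, information).
    The example rows below are hypothetical, not from the chapter's data."""
    return max(records, key=lambda r: r[2])

rows = [(10.0, "rsA", 1.2, 0.80),
        (12.5, "rsB", 6.3, 0.90),
        (15.0, "rsC", 4.1, 0.85)]
pos, marker, lod, info = peak_lod(rows)
```

The same position/LOD pairs can then be passed to any plotting library to draw the linkage profile.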
3. Notes

1. This file is most easily illustrated with an example of a binary trait. For example, consider an underlying model with disease alleles 1 and 2 with frequencies 10% and 90%, respectively. There are two phenotypes, A = affected and U = unaffected. Assume allele 1 predisposes to the disease, and that it is recessive to allele 2. The trait marker file would look like this:
The first line notes the disease name. Next the allele frequencies are given, followed by a semicolon to terminate the list of alleles, and then the penetrances are specified, followed by a semicolon to terminate the list. For a quantitative trait, instead of just two categories, A and U, each possible value of the trait is specified together with the penetrance function for each within the braces {}, followed by a semicolon to terminate the list.
2. The sex, family, individual, and parent IDs are typically numeric, but can be alphanumeric and of any length. For example, males may be coded “m”, “M,” or 1.
3. Some other software packages (e.g., LINKAGE) require that the fields be in a certain fixed order; S.A.G.E. does not require this fixed format.
4. S.A.G.E. allows omitting rows/records for parents and will then treat all individuals with the same pedigree ID as full sibs.
5. It is possible (and perhaps even likely) that the data will come in some other format, such as one file for the pedigree and trait information and another file with marker genotype information. This latter file may have one row per individual per marker. It will then be necessary to transpose that file and merge it with the pedigree/trait file. Also note that, when running S.A.G.E. from the command line, it is possible to have the data in more than one file.
6. If marker distance information is only available in base pairs, these must be converted. The genome wizard in S.A.G.E. Tools (available in the GUI) can be used to perform this task.
7. Currently there is not a specific version for Windows 7. However, the Windows XP version will work.
8. Even if a UNIX version is used or the user wants to run the program via the command line, the parameter file can still be created using the GUI.
9. This makes older installed versions of the program easily available if needed (e.g., to replicate previous analyses).
10.
Using a forward slash (/) is perhaps the easiest way to separate alleles in S.A.G.E. However, other packages may require a space, so you may wish to use a space to separate alleles; in this case simply specify a space (“ ”) as the allele delimiter and use a different option for the column delimiter. It is also
possible to format the two alleles in two separate columns and treat these two columns (allele1 and allele2) as one marker, using a different option for the column delimiter in the parameter file.
11. Instead of using a column-delimited format, it is possible to use fixed column widths. In this case, the column start and stop numbers are specified. This is more cumbersome, but could be useful if the columns already have fixed widths and there is no column delimiter in the file.
12. S.A.G.E. will accept 0 for missing alleles instead of the “.”, and this would make the file more compatible with other linkage programs (e.g., LINKAGE).
13. S.A.G.E. parameter files can be constructed in any text editor and the programs run via the command line.
14. More advanced options are available. For example, it is possible under the analysis options to select output on a per-family basis.
15. You can test for homogeneity of the recombination fraction among pedigrees using the Morton and/or Smith test. These tests are useful to evaluate whether there is linkage in only a subset of families. Morton’s test assesses the homogeneity of the recombination fraction among several groups of pedigrees; by default each pedigree is treated as an independent group. Smith’s test assesses such homogeneity by comparing two hypotheses: the null hypothesis is that all pedigrees have the same recombination fraction <0.5 and the alternative hypothesis is that only a proportion of the pedigrees have a recombination fraction <0.5.
If we want to perform the homogeneity test for linkage, in the “Linkage tests” window we should select the “Perform linkage tests” option, and not “Assume linkage homogeneity.” Then, in the “Homogeneity tests” window, select “Perform Smith’s test.” If, in the “Recombination fractions” window, we select to calculate the LOD scores at the specified recombination fractions, or just use the default settings, the LOD score under the estimated proportion of linked families corresponds to Faraway’s test, i.e., the LOD for linkage in the presence of heterogeneity. If options were chosen to perform homogeneity tests for linkage, these results would be reported in the summary file.
16. Optionally, sex-specific recombination fractions could be chosen.
17. The .sum output file could be imported into another program, such as SAS or Excel, and sorted by LOD score to make it easier to detect the largest LOD score.
18. It is informative to examine the LOD scores in the interval [0, 1]. If there is no phase information, they should be symmetric about θ = 0.5. If some phase information is available, they will be higher at θ than at 1 − θ when there is linkage.
19. If the option to obtain information by pedigree is selected in PEDINFO, the user can find the maximum number of inheritance vector bits in the dataset. How large an increase is possible depends on the computer running the analysis.
20. If, in the .det file, NaN (Not a Number) appears for the LOD scores of some pedigrees at the beginning region of the chromosome, this indicates that too many markers are being analyzed. In this case, you should cut the chromosome into several smaller overlapping regions and redo the linkage analysis on these smaller regions separately.
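The admixture (Smith) test of Note 15 has a simple numerical form that can be sketched for intuition: a proportion alpha of families is assumed linked at the tested recombination fraction and the remainder unlinked, and the heterogeneity LOD (HLOD) maximizes over alpha. This Python sketch is illustrative, not the LODLINK implementation.

```python
import math

def hlod(family_lods, alpha):
    """Smith's admixture model: a proportion alpha of families is
    linked at the tested theta, the rest unlinked. The heterogeneity
    LOD is sum_i log10(alpha * 10**lod_i + (1 - alpha))."""
    return sum(math.log10(alpha * 10 ** lod + (1.0 - alpha))
               for lod in family_lods)

def max_hlod(family_lods, steps=100):
    """Maximize over alpha on a grid; at alpha = 1 the HLOD reduces
    to the ordinary summed LOD score. Returns (hlod, alpha_hat)."""
    return max((hlod(family_lods, a / steps), a / steps)
               for a in range(1, steps + 1))
```

When every family shows positive evidence of linkage, the maximum occurs at alpha = 1 and the HLOD equals the ordinary total LOD; evidence confined to a subset of families pulls the estimate of alpha below 1.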
Acknowledgment

This work was supported by a US Public Health Service Resource grant (RR03655) from the National Center for Research Resources.

References

1. Weeks DE, et al (1990) Measuring the inflation of the LOD score due to its maximization over model parameter values in human linkage analysis. Genet Epidemiol 7:237–243
2. Ott J (1974) Estimation of the recombination fraction in human pedigrees: efficient computation of the likelihood for human linkage studies. Am J Hum Genet 26:588–597
3. Lathrop GM, et al (1984) Strategies for multilocus linkage analysis in humans. Proc Natl Acad Sci USA 81:3443–3446
4. Lathrop GM, et al (1985) Multilocus linkage analysis in humans: detection of linkage and estimation of recombination. Am J Hum Genet 37:482–498
5. Ott J (1999) Analysis of human genetic linkage. Johns Hopkins University Press, Baltimore, MD
6. Buetow KH (1991) Influence of aberrant observations on high-resolution linkage analysis outcomes. Am J Hum Genet 49:985–994
7. Risch N, Giuffra L (1992) Model misspecification and multipoint linkage analysis. Hum Hered 42:77–92
8. Li H, Schaid DJ (1997) GENEHUNTER: application to analysis of bipolar pedigrees and some extensions. Genet Epidemiol 14:659–663
9. O’Connell JR, Weeks DE (1995) The VITESSE algorithm for rapid exact multilocus linkage analysis via genotype set-recoding and fuzzy inheritance. Nat Genet 11:402–408
10. Lange K, et al (2001) Mendel version 4.0: a complete package for the exact genetic analysis of discrete traits in pedigree and population data sets. Am J Hum Genet S69:504
11. S.A.G.E. 6.1 (2010) Statistical Analysis for Genetic Epidemiology. http://darwin.cwru.edu/sage/
12. Cubells JF, et al (2011) Linkage analysis of plasma dopamine b-hydroxylase activity in families of patients with schizophrenia. Hum Genet [in press]
Chapter 15

Model-Based Linkage Analysis of a Binary Trait

Rita M. Cantor

Abstract

Linkage analysis is a statistical genetics method to localize disease and trait genes to specific chromosome regions. The analysis requires pedigrees with members who vary among each other in the trait of interest and who have been genotyped with known genetic markers. Linkage analysis tests whether any of the marker alleles cosegregate with the disease or trait within the pedigree. Evidence of cosegregation is then combined across the families. We describe here the background and methods to conduct a linkage analysis for a binary trait, such as a disease, when the model of the gene contributing to the trait can be formulated. There are a number of statistical genetics software packages that allow you to conduct a model-based linkage analysis of a binary trait. We describe in great detail how to run one of the programs, the LODLINK program of the Statistical Analysis for Genetic Epidemiology (S.A.G.E.) package. We provide directions for making the four input files and information on how to access and interpret the output files. We then discuss more complex analyses that can be conducted. We discuss the MLOD program for multipoint linkage analysis, including its relation to LODLINK and the additional file needed. Notes to improve your ability to run the program are included.

Key words: Recombination fraction, LOD score, Linkage analysis, Statistical Analysis for Genetic Epidemiology, LODLINK, MLOD, Locus heterogeneity, Single point, Multipoint, Genetic marker, Genetic disease model
1. Introduction

1.1. Rationale for Linkage Analysis
Linkage analysis is a statistical method used to localize to chromosome regions the genes that contribute to the development of a disease or a trait. Linkage analysis is part of a larger process that has been referred to as “reverse genetics” (1), because the approach works in the reverse order from how genes operate, biologically. That is, while genes act in a forward fashion to produce a trait, with this reverse approach, we start with the trait and conduct analyses to identify the predisposing genes. The first critical step when using this approach is to establish that the disease or trait is heritable. Methods to estimate trait heritability are discussed in detail in Chapters 8–10. With reverse genetics,
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_15, # Springer Science+Business Media, LLC 2012
R.M. Cantor
for methods that begin with linkage rather than association (association is discussed in Chapters 18–21), the first step is to ascertain pedigrees containing individuals who exhibit that trait. The pedigree members are then genotyped for markers that are inherited and show variation in the study sample. Linkage analysis is the tool that tests the markers for statistical evidence of their allelic cosegregation with the causal allele of a disease or trait gene within the pedigrees. The reverse genetics approach became feasible in the 1990s, when a very extensive panel of genetic markers that spanned the human genome was established (2). During the last 10 years, the markers have evolved from multiallelic markers to diallelic single-nucleotide polymorphisms (SNPs) (3). The markers are tested within and across the study sample pedigrees, and those showing the strongest evidence of linkage localize the disease or trait gene to the chromosome segment where the marker resides. However, the resolution is usually quite poor, and many genes will reside within a linked region (4). Nevertheless, a statistically significant linkage result will limit the search for the predisposing gene to that region, thus reducing cost and time. Linkage analysis with additional markers, targeted association with SNPs, and genetic sequencing are all methods to help identify the predisposing gene. Using the advances made by the Human Genome Project (5), this approach has been very effective in identifying genes causing rare disorders that result from the effects of a single gene and follow a Mendelian pattern of inheritance within a pedigree (6).

1.2. Biological Basis of Linkage Analysis
Linkage analysis is based on the phenomenon of genetic recombination, which occurs in the parental gametes during the process of meiosis, before the eggs and sperm are produced. In a parental gamete, when a pair of chromosomes, one from each grandparent, aligns in the first metaphase, an exchange of chromosome material often occurs via a crossover event, with the crossover location thought to be determined by chance. This recombination of genetic material results in chromosomes different from those that would be inherited from either parent alone (7). Thus, each child inherits a unique set of chromosomes that are recombinations of their grandparents’. Linkage analysis is based on identifying recombination events between genetic markers and trait loci and inferring whether trait and marker alleles are traveling in close proximity on the same chromosome, or are further away or on different chromosomes. The fundamental principle of linkage analysis is that for any two loci on the same chromosome, the closer they are to each other, the more likely it is that they will not undergo recombination. Although recombination rates are not uniform across the genome, this principle has provided an effective biological model for linkage analysis.
15 Model-Based Linkage Analysis of a Binary Trait
1.3. Modeled Binary Traits
This chapter focuses on model-based linkage analysis of binary traits. For a binary trait, (1) there are two possible values, typically representing the presence or absence of disease, and (2) one can pose a model of inheritance for the gene or genes acting to produce the trait. A causal gene acts in a Mendelian fashion, which can be dominant, when one copy of the disease allele causes the disease; recessive, when two copies are required; or X-linked, when the gene is located on the X chromosome and may be dominant or recessive. The model can include full or reduced penetrance; with reduced penetrance, the disease-predisposing genotype does not always lead to disease. Penetrance can be reduced because of other genes or environmental risk factors. Genetic heterogeneity occurs when the disease-causing gene differs among individuals, which usually results in locus heterogeneity. With allelic heterogeneity, the disease is due to different mutations in the same gene. If locus heterogeneity goes unrecognized, it will make linkage more difficult to detect (8); with allelic heterogeneity, however, linkage analysis will still localize the gene to its chromosome region. Model-free linkage analysis, discussed in Chapters 16 and 17, is applied to localize genes for complex traits where the number of predisposing genes and their inheritance patterns are unknown.
1.4. Statistical Methods: Recombination Fractions and LOD Scores
The parameter of interest in a linkage analysis is the recombination fraction, here denoted theta (7, 9). Genotyped marker alleles in pedigrees with affected and unaffected individuals allow the estimation of this parameter. Every pair of loci has a recombination fraction between them, estimated as the ratio of the number of inferred crossover events to the number of meiotic events between the two loci in the study sample (7). Evidence of recombination occurs if the marker allele traveling on the same chromosome as the putative trait allele changes to a different allele; the change would be due to a recombination event between these loci. If two loci are very close, this will happen rarely, and if they are far apart or on different chromosomes, so that they are not linked, recombination is expected to occur 50% of the time. Thus, theta varies between 0.0, when the loci are very close, and 0.5, when the loci are unlinked. The LOD score is a measure of the statistical support for linkage at a given recombination fraction. The term LOD comes from the logarithm of the odds, which is the ratio of the statistical likelihood for the data at a specific value of theta to the likelihood for the data when theta is 0.5, the value of theta under the null hypothesis of no linkage. The term was defined by Newton Morton in 1956 (10), and applying the logarithm allows the LOD scores from different pedigrees in different studies to be added together at a given value of theta. Table 1 illustrates how this works. Here, there are
288
R.M. Cantor
Table 1
Sample LOD score file: LOD scores for values of theta (θ)

Pedigree             θ = 0.01    θ = 0.05    θ = 0.10
1                    0.528       0.582       0.617
2                    0.0243      0.0452      0.0678
3                    0.756       0.891       0.973
Total LOD score      1.31        1.52        1.66
three pedigrees where the LOD score at theta = 0.01, theta = 0.05, and theta = 0.10 has been calculated for each pedigree. The final line of the table shows the total LOD scores across the three pedigrees for each value of theta. Given these results, 0.10 is the best estimate of theta, because it has the maximum LOD score among the three values of theta. It is possible that theta = 0.30 would give a better result, and the convention is to calculate the LOD score for 0.01, 0.05, 0.10, 0.20, 0.30, and 0.40 and choose the recombination fraction that gives the maximum value of the LOD score. Until recently, for binary model-based linkage analysis, a LOD score greater than +3.0 was taken to indicate that two loci were linked, whereas a LOD score less than −2.0 indicated there was no linkage, and a score between −2.0 and +3.0 indicated that additional pedigrees needed to be analyzed until the LOD score either exceeded +3.0 or fell below −2.0 (10). However, with the advent of additional markers covering the genome, providing additional linkage information, the LOD score criterion to indicate linkage has been set as high as 3.6 (11). We focus on two-point linkage analysis here, where each marker is tested for evidence of linkage with the binary disease or trait. Multipoint linkage analysis can also be conducted, but only in smaller pedigrees, owing to computational burdens. A discussion of multipoint analyses in the context of two-point analyses is provided.

1.5. Linkage Analysis Software
The software for conducting model-based linkage analysis of binary traits includes, but is not limited to, MENDEL (12), GENEHUNTER (13), ALLEGRO (14), and Statistical Analysis for Genetic Epidemiology (S.A.G.E.) (15). These computer programs have input files and formats that differ from each other, but if all are run with the same family structures, the same trait model, and the same markers, under the same parameter values, the recombination fraction estimates and the LOD scores will be the same. However, the programs do vary by the complexities in the genetic models that
they can accommodate, the maximum size and complexity of the pedigrees they can analyze, and the speed at which the analyses are conducted. Since the focus of this chapter is to illustrate how to address common problems that can be solved by all of these programs, one program, LODLINK, from the S.A.G.E. package is used to illustrate the approach. The S.A.G.E. package is very flexible, and not all of its options can be discussed here; only the simplest are presented and illustrated. Once some expertise is gained, the very extensive manual that can be downloaded with the software will allow the user to conduct more complex analyses. Specific limitations of the LODLINK program include not being able to analyze pedigrees containing loops. While larger numbers of pedigrees, more individuals within each pedigree, and a higher disease frequency within the study population provide more statistical power to detect linked loci, for the purposes of this chapter the example used to illustrate how to conduct an analysis is much smaller, to provide better clarity in its description.
2. Methods

2.1. Overview of the LODLINK Files
The LODLINK program requires you to provide four files for a basic linkage analysis; we describe that process in detail in this section. The Pedigree Data File provides the pedigree structure (how the pedigree members are related to each other), the trait values, and the marker genotypes for each individual who has been assessed. The Trait Locus Description File provides the model of how the putative causal gene gives rise to the trait. This model can be autosomal dominant or recessive, and the file allows for reduced penetrance. The Marker Locus Description File gives the marker alleles and their frequencies. Finally, the Parameter File guides the analysis by indicating that the model-based linkage method for a binary trait will be used. This file is created within LODLINK by selecting a series of responses or filling in values on menus.
2.2. Making the Pedigree Data File
The Pedigree Data File codes the family structures, the trait values and the genotype data for the members of the pedigrees in the study sample. The two pedigrees used to illustrate how this is done are in Fig. 1a, b, and the corresponding Pedigree Data File is in Fig. 2. There are a few things that can be done to save time and reduce errors. First, prior to coding the data, each pedigree should have a number that is the pedigree ID (see Note 1). Then each individual in each pedigree should be assigned an individual ID (see Note 2). These can be in the form of sequential numbers or something more complex. Here we use the sequential numbers 1–11, since each
Fig. 1. (a) Sample Pedigree 1 with disease status and genotypes at M1. (b) Sample Pedigree 2 with disease status and genotypes at M1.
Fig. 2. Pedigree data file for Pedigrees 1 and 2.
pedigree has 11 individuals. Second, each individual must have two parents in the data file, even if only one parent has participated in the study, so include both parents for every individual who is not a founder in the pedigree. If we did not study individual 1 in Fig. 1a, we would still include that person in the Pedigree Data File. A straightforward way to generate the Pedigree Data File is to first code the pedigrees from their drawings using a spreadsheet such as Excel. Below are the directions to take the information on the pedigrees in Fig. 1a, b and convert it to the Pedigree Data File in Fig. 2. Required fields are Pedigree (Family) ID, Individual ID, Parent IDs (both), and Sex. Here we also include a field for the binary trait, called “DX Trait,” and a marker genotype, called “M1.” Input the data to an Excel file, with the appropriate number of columns for each field. The spreadsheet should include the names of the fields, so that LODLINK can recognize the nature of the data. The labels for this example are given in Fig. 2. More specific information on the coded values is given below.
1. The Pedigree IDs go in the first column of the Pedigree Data File. They are 1 and 2 here, for the two pedigrees being coded.
2. The numbers on the circles and squares representing the individuals in each pedigree are the Individual IDs, which are put in the next column.
3. For each individual, put the Individual IDs of the parents in the Parent 1 and Parent 2 columns. The convention is that Parent 1 is the father and Parent 2 is the mother. If you are coding a pedigree founder with no parents in the analysis, use “.” as a missing value for each parent. In Pedigree 1, individuals 1 and 2 are the father and mother of individual 3, who has four children (numbered 5, 6, 7, and 8) with individual 4.
4. The sex of each individual is coded, with “M” for male and “F” for female (see Note 3).
5. The trait data column has affected individuals coded as “A” and unaffected individuals coded as “U.”
6.
Genotypes are coded in this file. For marker alleles, one may input both alleles into the same column with a delimiter such as “/” between them. For the given pedigree, individual 1 has alleles A and B for the first marker genotype (see Note 4). Homozygotes for the A allele are coded “A/A.”
7. Save the file as a “.csv” (comma-delimited) file. This will be important when uploading the file into the S.A.G.E. software.
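The spreadsheet steps above can equally be scripted. The sketch below writes a comma-delimited pedigree file with Python’s csv module; the field names follow the labels in Fig. 2, but the three data rows are invented for illustration and do not reproduce the chapter’s pedigrees.

```python
import csv

# Field names follow Fig. 2; the rows are hypothetical examples.
header = ["PID", "ID", "Parent1", "Parent2", "Sex", "DX Trait", "M1"]
rows = [
    ["1", "1", ".", ".", "M", "U", "A/B"],   # founder: parents coded "."
    ["1", "2", ".", ".", "F", "A", "B/B"],
    ["1", "3", "1", "2", "M", "A", "A/B"],   # non-founder: both parents given
]

with open("pedigree.csv", "w", newline="") as fh:
    writer = csv.writer(fh)                  # comma-delimited, as required
    writer.writerow(header)
    writer.writerows(rows)
```

Scripting the file also makes it easy to regenerate after corrections to the pedigree drawings.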
2.3. Making the Locus Description Files
Marker alleles and their frequencies, and the model of the putative trait or disease gene contributing to the development of the trait, are included in two Locus Description Files. These are best written using a text editor program, such as Notepad or WordPad (see Note 5).
1. The Trait Locus Description File contains information regarding the genetic model of the binary trait or disease under analysis. Three examples are given in Fig. 3. The first block, marked “Recessive,” indicates that there are two alleles at the trait or disease locus, 1 and 2, and that their frequencies in the population under analysis are 0.99 and 0.01, respectively. Allele frequencies should sum to 1.0 (see Note 6). The next line is a semicolon, which indicates that the frequencies for all of the alleles at that locus have been given.
2. The next two lines indicate how the two alleles of the predisposing gene combine to cause the trait or disease. The first line has “A,” which is how “affected” is coded in the pedigrees. The information in the brackets says that for two copies of allele 1, or one copy of allele 1 and one copy of allele 2, the value is 0.0. This indicates you will not get the disease or be affected if you have these genotypes at the disease gene.
Fig. 3. Trait locus file with three models of inheritance.
15 Model-Based Linkage Analysis of a Binary Trait
However, if you have two copies of allele 2, the value of 1.0 says that you will inevitably be affected. The alleles combine differently to give an unaffected or "U" phenotype for the trait. Here you will be unaffected if you have two copies of allele 1 or one copy of allele 1 and one copy of allele 2, both of which have the value 1.0. The value of 0.0 for 2/2 indicates that you cannot be unaffected with this genotype at that gene. Thus, you will be affected with this genotype, as also indicated in the line above. This penetrance model describes a disease or trait that is recessive, as two copies of the disease allele are needed for someone to be affected. Note that for each genotype the values after the equal signs for A and U should sum to 1.0. Here we add 0.0 and 1.0 for 1/1, 0.0 and 1.0 for 1/2, and 1.0 and 0.0 for 2/2. This information is again followed by a line with a semicolon to indicate that the penetrance model is complete.
3. The second box in Fig. 3 gives the model for a dominant disease or trait gene. Here, under "A," the values of 1.0 for both genotypes 1/2 and 2/2 indicate that you will inevitably get the disease and be affected if you carry at least one copy of allele 2.
4. The third box in Fig. 3 provides an example of a disease gene model with "reduced penetrance." Here reduced penetrance occurs for the risk genotype 2/2, because only 70% of the individuals with that genotype get the disease; the other 30% do not, which is what is coded for genotype 2/2 under "U."
5. Phenocopies are also modeled here. A phenocopy occurs when the non-risk allele leads to the disease status of "A" with a non-zero probability. That is, the 1/1 genotype under "A" has a probability value of 0.1, which says that 10% of those who have the 1/1 genotype will be affected, while 90% will not.
6. The Marker Locus Description File describes the alleles and their frequencies for each marker. An example file is given in Fig. 4.
Estimation of allele frequencies from the data in your samples is discussed in Chapter 5 (see Note 7). For the pedigrees in Fig. 1a, b, alleles A, B, C, and D are codominant, which means that the marker genotype given reflects the two alleles that contribute to it. Each of the alleles has a frequency of 0.25, which is unrealistic but makes the input to the program simpler.
2.4. Making the Parameter File, Loading the Files, and Running LODLINK
The Parameter file will be generated by LODLINK when you select menu options. This requires an installed version of S.A.G.E. on your computer.
Fig. 4. Marker locus file with M1 and two other markers.
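Before running LODLINK it is worth checking the two numeric constraints these files carry: allele frequencies (trait or marker) must sum to 1.0, and for each genotype the "A" and "U" penetrances must sum to 1.0. The sketch below uses a plain Python layout of the Fig. 3 and Fig. 4 examples; this is our own representation, not the S.A.G.E. file syntax, and the 1/2 entries of the reduced-penetrance model are an assumption.

```python
# Penetrance models from Fig. 3, written as plain dictionaries.
# This layout is illustrative only, not the S.A.G.E. file syntax; the 1/2
# entries of the reduced-penetrance model are an assumption.
models = {
    "recessive": {
        "freq": {"1": 0.99, "2": 0.01},
        "A": {"1/1": 0.0, "1/2": 0.0, "2/2": 1.0},
        "U": {"1/1": 1.0, "1/2": 1.0, "2/2": 0.0},
    },
    "dominant": {
        "freq": {"1": 0.99, "2": 0.01},
        "A": {"1/1": 0.0, "1/2": 1.0, "2/2": 1.0},
        "U": {"1/1": 1.0, "1/2": 0.0, "2/2": 0.0},
    },
    "reduced_penetrance": {
        "freq": {"1": 0.99, "2": 0.01},
        "A": {"1/1": 0.1, "1/2": 0.1, "2/2": 0.7},  # 1/2 entry assumed
        "U": {"1/1": 0.9, "1/2": 0.9, "2/2": 0.3},
    },
}

def validate(model):
    # Allele frequencies must sum to 1.0.
    assert abs(sum(model["freq"].values()) - 1.0) < 1e-9
    # For every genotype, P(affected) + P(unaffected) must equal 1.0.
    for g in model["A"]:
        assert abs(model["A"][g] + model["U"][g] - 1.0) < 1e-9
    return True

all_valid = all(validate(m) for m in models.values())

# Marker M1 from Fig. 4: four codominant alleles with equal frequencies.
marker_m1 = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
assert abs(sum(marker_m1.values()) - 1.0) < 1e-9
```

A check like this catches the most common hand-editing mistakes (frequencies not summing to 1.0, or A and U penetrances that are not complementary) before the files are uploaded.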
1. Click the shortcut to the S.A.G.E. GUI.exe to open the S.A.G.E. application. Enter the title of the new project. Choose model-based linkage and binary traits as your options.
2. Indicate that you have project data, but no parameter file.
3. Indicate the Pedigree Data File path on your computer by browsing and clicking, and that it is a single comma-delimited text file (.csv file).
4. Headers indicating the variable names will be in the first row of the Pedigree Data File. They can be seen here in Fig. 2 and will be referred to by name in the menu.
5. Setting the Pedigree Field Properties: under each header, select the correct field designation. For example, for Father select Parent 1, for "DX TRAIT" select TRAIT, and for M1 select MARKER. Include your symbols for missing values.
6. Indicate that TRAIT is binary and list the trait values used here, "A" and "U." Check the box signifying that this is the trait marker.
7. When MARKER is chosen, select that it is codominant. The example "M1" marker has four codominant alleles with equal allele frequencies, so this is marked in the pop-up selection. If you were doing this for M2 and M3 in Fig. 4, you would also select codominant.
8. Next, set the general specifications. These include the symbols used for the sex of the individual, M and F, and the symbol used for missing values.
9. Specify that no other raw data files are to be added. Then press "Next."
10. Adding the Marker and Trait Locus Description Files: under the analysis header, choose the appropriate model-based single-marker analysis to be performed, titled "LODLINK." Upload the Pedigree Data, Marker Locus Description, and Trait Locus Description Files, using the browser and clicking (see Note 8).
11. Setting the analysis definitions: designate the trait marker name, i.e., DX Trait, and make sure that the linkage analysis box is checked.
2.5. Interpreting the Output Files
LODLINK produces four output files: genome.inf, LODLINK.inf, LODLINK1.sum, and LODLINK1.det. They each provide important information and should all be examined. First, it is very important to check that no mistakes have been identified (see Note 9). Reading parts of the LODLINK manual will be informative for identifying mistakes (see Notes 10–12). The files are presented in the order in which you might examine them: first look for errors, and when there are none, look at the results.
1. The genome.inf file is the genome information output file. It contains diagnostic information for each marker genotype. There is no analysis output.
2. The LODLINK.inf file is the information output file. It lists diagnostic information about how the data files were read, warnings, and possible program errors. If errors are encountered, this file will contain information that can help identify the problem. The file gives the first ten individuals read from the pedigree, so it can help you check that the input file was read correctly. It contains no analysis output (see Notes 10–12).
3. The LODLINK_analysis1.sum file is the summary output from the analysis. It contains LOD scores for the entire set of families summed over the standard set of recombination fractions. An example of such a file is given in Fig. 5, where the two pedigrees in Fig. 1a, b are analyzed for the single marker, M1. The three analyses reported in Fig. 5 are for the three different penetrance models given in Fig. 3.
4. For Model 1, the recessive model, the LOD score falls below −2 at recombination fractions of 0.05 or less. Thus close linkage of the marker and the disease gene, if this model of inheritance is correct, is excluded. The −Infinity at a recombination fraction of 0.0 means that there has been at least one recombination between the marker and the disease gene under this model. An examination of the pedigrees may help you confirm this: by tracing allele A with the disease, you can see that Individual 7 in Pedigree 1 is a recombinant.
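The appearance of −Infinity at a recombination fraction of 0.0 can be reproduced with a toy calculation. For n phase-known, fully informative meioses with r observed recombinants, the two-point LOD score has a simple closed form; this is a textbook simplification for intuition, not LODLINK's general pedigree likelihood, and the counts below are made up.

```python
import math

def lod(theta, r, n):
    """Two-point LOD for r recombinants among n phase-known informative
    meioses: LOD(theta) = log10[ theta^r (1 - theta)^(n - r) / 0.5^n ].
    A textbook simplification, not LODLINK's full pedigree likelihood."""
    if theta <= 0.0:
        # a single observed recombinant makes theta = 0 impossible
        return -math.inf if r > 0 else n * math.log10(2.0)
    return (r * math.log10(theta)
            + (n - r) * math.log10(1.0 - theta)
            + n * math.log10(2.0))

# One recombinant (like individual 7) among 10 informative meioses:
scores = {t: lod(t, r=1, n=10) for t in (0.0, 0.05, 0.1, 0.2)}
```

With one recombinant the score is −infinity at θ = 0 and peaks near θ = 1/10, mirroring the behavior described for Model 1 above.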
Fig. 5. LODLINK summary file for Pedigrees 1 and 2 with M1 for three models of inheritance.
5. For Model 2, the LOD scores are very promising, but they do not exceed the threshold of 3. The table for this model indicates that the best estimate of the recombination fraction is 10%. You could ascertain two more pedigrees, and if the same gene contributes to the trait in all four pedigrees, the LOD score might exceed 3 and the recombination fraction estimate would be more accurate.
6. Model 3 in Fig. 5 permits reduced penetrance. With this level of uncertainty added, the statistical power to identify linkage is reduced. In addition, the model allows individual 7 in Pedigree 1 to be interpreted as a nonrecombinant.
7. The LODLINK_analysis1.det file is the detailed output from the analysis. It includes LOD scores by family. This information should be examined, as there might be locus heterogeneity that would reduce the LOD score at a linked locus. If this is observed, a parameter for heterogeneity can be included in the analysis, as discussed in Subheading 2.6.
2.6. Incorporating Additional Complexities into Linkage Analysis
As stated, this chapter presents the basic two-point model-based linkage analysis for a binary trait in pedigrees. Once you can run the analysis and interpret the results, additional complexities can be included.
1. Genetic heterogeneity, in the form of locus heterogeneity, occurs when a trait or disease is caused by different genes in different pedigrees. If locus heterogeneity is present but not accounted for, the statistical power to detect linkage at a causal locus will be reduced. This would occur, for example, if several families produced a summed LOD score of 3 while different families produced a summed LOD score of −4: linkage is masked unless heterogeneity is included in the LOD score analysis. When the heterogeneity is not so severe, the estimated recombination fraction can still be significantly greater than the true recombination fraction. Heterogeneity can be incorporated into model-based linkage analysis by testing the null hypothesis of linkage homogeneity along with the linkage analysis. In LODLINK, one has the option to run either Smith's test for homogeneity of the recombination fraction or Morton's likelihood ratio test for homogeneity of the recombination fraction (16) by selecting from the menus when setting up the analysis in Subheading 2.5.
2. The LODLINK program can also be set to provide sex-specific recombination fractions in the LOD score calculations: set the rates manually, or apply those automatically set by the program, i.e., 0.00, 0.01, 0.05, 0.10, 0.20, 0.30, and 0.40.
2.7. Multipoint Analyses
The LODLINK program conducts two-point model-based linkage analysis for a binary trait. However, one may also want to analyze several markers in the same region of a chromosome simultaneously for the same binary trait model. This is particularly helpful when each of a sequence of markers in the same region of a chromosome provides a positive LOD score that is nearly significant. When markers are analyzed simultaneously, the process is referred to as multipoint linkage analysis. In general, multipoint analyses increase the available genetic information, offer greater statistical power to observe a significant linkage signal, and localize the putative trait gene among the markers showing a positive LOD score. The number of markers that can be included in a multipoint analysis is negatively correlated with the sizes of the pedigrees, so only a few markers, say 3 or 4, should be attempted for the pedigrees used to illustrate linkage analysis here. Fortunately, once you have mastered two-point linkage using the LODLINK program, many of the same principles and files can be applied to run MLOD, the multipoint linkage program of the S.A.G.E. package. Briefly, as with LODLINK, the Pedigree Data File provides the pedigree structures, the trait values, and the marker genotypes for each individual. The same LODLINK file can be used for MLOD, but there must be genotypes for at least two markers in the same chromosome region to accomplish the analysis.
Fig. 6. Sample of file for markers M1, M2, and M3.
The Trait Locus Description File, providing the model of how the trait results from its putative causal gene, and the Marker Locus Description File, providing the marker alleles and their frequencies, are the same as those for the two-point analyses of LODLINK. MLOD requires you to provide five files for it to run, instead of four. The additional file, the Genome Description File, describes the genetic map, which contains the locations of the test markers in relation to each other in centimorgans (cM) on the chromosome. A cM is defined as 1/100 of a chromosome region across which there is a 100% chance of a crossover. This concept is relevant because linkage is assessed by comparing observed crossovers with possible crossovers. An example of this file for markers M1, M2, and M3 is given in Fig. 6. In this example the genome is given the name "test" and the region is named "M1–M3," but any names will work. The file indicates that M1 is the first marker along the chromosome and that it starts at 0.001 cM, which represents an arbitrary first point. This distance does not enter into the analysis of region "M1–M3." Marker M2 is 5 cM from M1, and M3 is 6 cM from M2. Thus, there are three markers covering a region of 11 cM on this chromosome. A map such as this one will be known from previously conducted large studies of markers, used to estimate the distances between them in cM. MLOD will give a LOD score at each marker and at every 2 cM between the markers.
1. The MLOD Parameter File is constructed using files similar to the ones used for LODLINK. The Parameter File will be generated for MLOD once menu options are selected.
2. When setting the Pedigree Field Properties, select MARKER for M1, M2, and M3, and for each select codominant.
3. For the scan type, select "intervals," to allow for a multipoint analysis.
4. For the region, be sure to give the interval name used in the Genome Description File.
5.
The output files generated for MLOD are consistent with those for LODLINK, except that many more LOD scores are
provided. They are estimated at each of the markers as well as at every 2-cM interval between markers. The point in the region with the largest LOD score localizes the trait gene. Use of additional markers in the region can often localize it further.
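The marker spacing just described, and the grid at which MLOD reports LOD scores (each marker plus every 2 cM between markers), can be laid out explicitly. The sketch below is our illustration of the arithmetic, not MLOD output.

```python
# Cumulative marker positions in cM, from the example Genome Description File:
# M1 starts at an arbitrary 0.001 cM, M2 is 5 cM further, M3 another 6 cM.
distances = [("M1", 0.001), ("M2", 5.0), ("M3", 6.0)]  # distance from previous

positions, pos = [], 0.0
for name, dist in distances:
    pos += dist
    positions.append((name, pos))

def evaluation_points(positions, step=2.0):
    """MLOD-style grid: a point at each marker and every `step` cM between."""
    pts = []
    for (_, left), (_, right) in zip(positions, positions[1:]):
        p = left
        while p < right - 1e-9:   # tolerance guards against float drift
            pts.append(round(p, 3))
            p += step
    pts.append(round(positions[-1][1], 3))
    return pts

grid = evaluation_points(positions)  # 7 points across the 11-cM region
```

For this map the grid runs from 0.001 cM at M1 to 11.001 cM at M3, and the evaluation point with the largest LOD score is the one that localizes the trait gene.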
3. Notes
1. Here we have selected the values 1 and 2. However, the width of this field can be larger, and you can start with 1,001 and 1,002 if you wish.
2. Because the pedigrees have their own IDs, it is fine to repeat the individual numbers, as each individual will be uniquely identified by the combination of pedigree ID and individual ID.
3. "M" and "F" are not fixed labels for the sex of the individual. You can use the numbers "0" and "1." You will be able to indicate these values when you make the Pedigree Data File.
4. Alleles A and B could be put into two separate columns, one for each marker allele, or both in the same column, as shown in Fig. 2, where the genotype is coded "A/B."
5. Be sure to check for any extra spaces, symbols, or characters, as these can lead to read errors in the program. An extra symbol can make much of the file unreadable, depending upon where it occurs.
6. The frequency of the disease allele can often be inferred from the rate of the disease in the population under analysis. This is done under the assumption of Hardy–Weinberg equilibrium, which is discussed in Chapter 6.
7. Alternatively, marker allele frequency information can be found using the Marshfield Clinic's Mammalian Genotyping Service website at http://www.marshfieldclinic.org/mgs/.
8. Keeping all of your files in a single folder will simplify the process of keeping track of your work. If you make errors, you may want to compare older work with newer work. LODLINK will number the folders with the same name for each analysis, which will also help in this process.
9. If mistakes are flagged, it may be challenging to find the errors that caused them. It is best to fix the things you think are incorrect, rerun the program, and see if the error messages are gone. If the files are too large and you do not want to change everything you think is incorrect, you can sometimes run subsets of the files to help diagnose the errors.
10. Do not read the manual sequentially, because there is a great deal of information that will not apply to your analysis. Pick out
some of the early chapters, which explain how the data are coded, and then focus on the LODLINK chapter.
11. Sometimes you can "Google" the key words in your error message together with the word LODLINK, and you will find helpful information from others who have encountered the same problems.
12. It is best to approach problems by being patient and methodical, realizing that everyone who begins encounters the same problems that you have. When you are totally frustrated, it is sometimes very helpful to start over again, because you can easily miss something that is very simple.
References
1. Orkin SH (1986) Reverse genetics and human disease. Cell 47:845–850
2. Broman KW, et al. (1998) Comprehensive human genetic maps: individual and sex-specific variation in recombination. Am J Hum Genet 63:861–869
3. Gabriel SB, et al. (2002) The structure of haplotype blocks in the human genome. Science 296:2225–2229
4. Boehnke M (1994) Limits of resolution of genetic linkage studies: implications for the positional cloning of human disease genes. Am J Hum Genet 55:379–390
5. The human genome (2001) Science genome map. Science 291:1218
6. Botstein D, Risch N (2003) Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat Genet 33 Suppl:228–237
7. Ott J (1999) Analysis of Human Genetic Linkage, 3rd edn. Johns Hopkins University Press, Baltimore, MD
8. Cavalli-Sforza LL, King MC (1986) Detecting linkage for genetically heterogeneous diseases and detecting heterogeneity with linkage data. Am J Hum Genet 38:599–616
9. Elston RC, Stewart J (1971) A general model for the genetic analysis of pedigree data. Hum Hered 21:523–542
10. Morton NE (1955) Sequential tests for the detection of linkage. Am J Hum Genet 7:277–318
11. Lander E, Kruglyak L (1995) Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results. Nat Genet 11:241–247
12. Lange K, et al. (2001) Mendel version 4.0: a complete package for the exact genetic analysis of discrete traits in pedigree and population data sets. Am J Hum Genet 69 (Suppl):504
13. Kruglyak L, et al. (1996) Parametric and nonparametric linkage analysis: a unified multipoint approach. Am J Hum Genet 58:1347–1363
14. Gudbjartsson DF, et al. (2000) Allegro, a new computer program for multipoint linkage analysis. Nat Genet 25:12–13
15. S.A.G.E. (2009) Statistical Analysis for Genetic Epidemiology, Release 6.0.1
16. Lemdani M, Pons O (1995) Tests for genetic linkage and homogeneity. Biometrics 51:1033–1041
Chapter 16
Model-Free Linkage Analysis of a Quantitative Trait
Nathan J. Morris and Catherine M. Stein
Abstract
Model-free methods of linkage analysis for quantitative traits are a class of easily implemented, computationally efficient, and statistically robust approaches to searching for linkage to a quantitative trait. By "model-free" we refer to methods of linkage analysis that do not fully specify a genetic model (i.e., the causal allele frequency and penetrance functions). In this chapter, we briefly survey the methods that are available, and then we discuss the necessary steps to implement an analysis using the programs GENIBD, SIBPAL, and RELPAL in the Statistical Analysis for Genetic Epidemiology (S.A.G.E.) software suite.
Key words: QTL, Haseman–Elston regression, Variance component, S.A.G.E., Two-stage Haseman–Elston, SIBPAL, RELPAL, GENIBD, FREQ, Linkage analysis
1. Introduction
One of the most important concepts in model-free methods is the difference between allele sharing identical by descent (IBD) and identical in state (IIS). Whenever two alleles are the same, they are IIS. When two alleles are the same allele inherited from the same ancestor, they are shared IBD. At a given locus, any two relatives may share 0, 1, or 2 alleles IBD. When performing model-free linkage analysis, generally the IBD values must be found for all pairs of relatives at a number of different loci. Often the marker data do not contain enough information to determine the marker IBD values exactly, so this information must be estimated from the marker data using some computational approach, such as the Elston–Stewart or the Lander–Green algorithm; for example, the IBD information may be estimated using the posterior mean of the IBD sharing. Before discussing linkage any further, we examine the basic model that underlies much of our thinking about quantitative trait genetics. A trait ($y_i$) is modeled as a linear combination of various independent genetic and environmental influences, such as: an intercept ($b_0$), the effect of some fixed covariates ($\sum_k b_k x_{ik}$), the
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_16, © Springer Science+Business Media, LLC 2012
effect of genotypes at a single causal gene ($g_i$), and an independent environmental effect ($e_i$). Also, a polygenic effect ($p_i$) is envisioned to be the result of many segregating genes, each with a very small effect size, acting additively to produce a normally distributed random effect. Thus, the phenotype may be written as
$$y_i = b_0 + \sum_k b_k x_{ik} + g_i + p_i + e_i \qquad (1)$$
One salient point here is that we actually expect a mixture distribution within the population because $g_i$ cannot be directly observed. Suppose, for example, that we have two alleles: A and a. In this case, $g_i$ may take on three possible values depending on the genotype of individual $i$: $\mu_{AA} = 0$, $\mu_{Aa} = (1 + k)a$, or $\mu_{aa} = 2a$. In model-based linkage analysis, as discussed in Chapter 14, an attempt is made to actually fit such a mixture distribution. In contrast, model-free linkage analysis focuses on the first and second moment structure of the phenotype, ignoring the higher moments suggested by the mixture distribution. If $k = 0$ in the above formula, then the model is said to be additive. If we assume that the model is additive and the genotype frequencies follow Hardy–Weinberg proportions, then $\mathrm{Var}[g_i] = 2(1 - p_a)p_a a^2$, which is known as the additive major gene variance ($\sigma_a^2$), where $p_a$ is the minor allele frequency. The parameter of interest in model-free linkage analysis is $\sigma_a^2$, and the actual values of $a$ and $p_a$ are not estimated. Similarly, we define $\sigma_p^2 = \mathrm{Var}[p_i]$ and $\sigma_e^2 = \mathrm{Var}[e_i]$. Let $\hat{\pi}_{lij}$ represent the estimated proportion of alleles shared IBD between individuals $i$ and $j$ at locus $l$, and let $\varphi_{ij} = E[\hat{\pi}_{lij}]$ be twice the kinship coefficient between individuals $i$ and $j$. It follows from Eq. 1 and some modest assumptions that, approximately,
$$E[y_i \mid x_i] = b_0 + \sum_k b_k x_{ik} \qquad (2)$$
and
$$\mathrm{Cov}[y_i, y_j] = \begin{cases} \sigma_a^2 + \sigma_p^2 + \sigma_e^2 & \text{if } i = j \\ \hat{\pi}_{lij}\,\sigma_a^2 + \varphi_{ij}\,\sigma_p^2 & \text{if } i \neq j \end{cases} \qquad (3)$$
The key point of Eq. 3 is that the larger the proportion of alleles shared IBD at the causal locus between two individuals, the larger the genetic correlation between those two individuals. The original groundbreaking paper on model-free linkage analysis was by Haseman and Elston (1). The method which they suggested (which we shall refer to as HE regression) was for sibling pairs only, and it involved performing a simple linear regression. The suggested outcome was the vector of all squared pairwise trait differences $d_{ij} = (y_i - y_j)^2$, and the suggested predictor was the corresponding vector of IBD values $\hat{\pi}_{lij}$. The rationale for this
may be derived easily from Eq. 3 because, assuming there are no covariates:
$$E[(y_i - y_j)^2 \mid \hat{\pi}_{lij}] = E[((y_i - b_0) - (y_j - b_0))^2 \mid \hat{\pi}_{lij}] = \sigma_p^2 + 2(\sigma_a^2 + \sigma_e^2) - 2\hat{\pi}_{lij}\,\sigma_a^2 \qquad (4)$$
Thus, the magnitude of the estimated regression coefficient reflects the additive genetic variance ($\sigma_a^2$). Further research has shown that the power of HE regression could be improved by using a weighted sum of the pairwise differences ($d_{ij}$) and pairwise sums $s_{ij} = (y_i + y_j - 2\bar{y})^2$ as the outcome in the regression model (2, 3). It has also been shown that further power can be obtained by using a best linear unbiased predictor (BLUP) in place of $\bar{y}$ in the formula for $s_{ij}$ above (4). While HE regression, especially in its more modern forms, is a powerful and robust method of analysis, it can generally only utilize full or half sibling pairs. Another modification of HE regression, which has been termed the variance component (VC) linkage method (5, 6), allows modeling general pedigree structures. Of course, HE also estimates variance components, but for lack of a better term we use VC to refer to these newer methods. VC methods ignore the fact that, for genetically determined traits, we fully expect the distribution of traits to follow a mixture distribution that may be multimodal. We simply pretend that the data are normal with the mean and covariance structure given by Eqs. 2 and 3. Simulation studies have shown that the model is robust to the presence of a mixture distribution (7). When the distribution of the traits is approximately normal, it has been found that the VC method is quite powerful relative to the older HE methods. This conclusion can be demonstrated both on a theoretical basis (8, 9) and via simulations (10). However, the difference in power relative to the newer HE flavors, which use a weighted combination of the sib-pair sums and differences (11), may not be particularly large (9).
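The relationship in Eq. 4 can be checked with a small simulation: generate sib pairs that share 0, 1, or 2 alleles IBD at a causal locus, regress the squared trait differences on the IBD proportion, and the slope should come out near $-2\sigma_a^2$. All names and parameter values below are our own illustration.

```python
import random

random.seed(1)
var_a, var_e = 1.0, 1.0   # additive QTL variance and environmental variance
n_pairs = 20000

pis, dsq = [], []
for _ in range(n_pairs):
    # Alleles shared IBD by a sib pair: 0, 1, or 2 with prob 1/4, 1/2, 1/4.
    k = random.choices([0, 1, 2], weights=[1, 2, 1])[0]
    # Each sib carries two allelic effects; shared ones are identical by descent.
    shared = [random.gauss(0, (var_a / 2) ** 0.5) for _ in range(k)]
    own1 = [random.gauss(0, (var_a / 2) ** 0.5) for _ in range(2 - k)]
    own2 = [random.gauss(0, (var_a / 2) ** 0.5) for _ in range(2 - k)]
    y1 = sum(shared) + sum(own1) + random.gauss(0, var_e ** 0.5)
    y2 = sum(shared) + sum(own2) + random.gauss(0, var_e ** 0.5)
    pis.append(k / 2.0)          # proportion of alleles shared IBD
    dsq.append((y1 - y2) ** 2)   # squared sib-pair trait difference

# Ordinary least-squares slope of dsq on pi; its expectation is -2 * var_a.
mp = sum(pis) / n_pairs
md = sum(dsq) / n_pairs
slope = (sum((p - mp) * (d - md) for p, d in zip(pis, dsq))
         / sum((p - mp) ** 2 for p in pis))
```

With these settings the fitted slope lands close to −2, i.e., close to $-2\sigma_a^2$ for $\sigma_a^2 = 1$, and a significantly negative slope is the HE evidence for linkage.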
In fact, the HE and VC methods, while quite different on the surface, may be shown to fall under a common framework (9). Furthermore, the two-level HE method (12) yields estimates that are in theory identical to those of the likelihood-based VC method. However, VC methods are known to be sensitive to deviations from multivariate normality; the possible consequences include inflated type I error rates (7). Numerous approaches have been developed to make the VC method more robust. The simplest approach is transforming the data. Others have used the multivariate t distribution (13). Amos (5) developed a generalized estimating equation (GEE) approach and suggested using a robust Wald test. Blangero et al. (14) developed a scaling factor for the likelihood ratio test that makes it
robust to non-normality. Chen et al. (15) give an excellent overview of these innovations, and point out that the multivariate t method and at least one transformation method have low power when the disease allele frequency is low. Several well-known programs, such as SOLAR (6) and MERLIN (16), perform VC linkage analysis. The Statistical Analysis for Genetic Epidemiology (S.A.G.E.) program RELPAL tests for linkage in a way that is mathematically equivalent to the likelihood-based score test in VC linkage analysis. However, RELPAL has several options that can make the test robust to deviations from multivariate normality.
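The structure of Eq. 3 can be made concrete by building the 2×2 covariance matrix for a sib pair at a few IBD values; the variance values below are arbitrary illustrations, not estimates from any data set.

```python
def sibpair_cov(pi_hat, var_a=1.0, var_p=0.5, var_e=1.0, phi=0.5):
    """2x2 covariance matrix for a sib pair under Eq. 3.

    pi_hat: estimated proportion of alleles shared IBD at the test locus.
    phi: twice the kinship coefficient (0.5 for full sibs).
    Variance values are arbitrary illustrations.
    """
    diag = var_a + var_p + var_e           # Var[y_i], same for both sibs
    off = pi_hat * var_a + phi * var_p     # Cov[y_i, y_j]
    return [[diag, off], [off, diag]]

# The off-diagonal (and hence the sib-sib correlation) grows with IBD sharing.
cov0 = sibpair_cov(0.0)   # no alleles shared IBD at the locus
cov2 = sibpair_cov(1.0)   # both alleles shared IBD
corr0 = cov0[0][1] / cov0[0][0]
corr2 = cov2[0][1] / cov2[0][0]
```

This is exactly the key point made after Eq. 3: pairs sharing more alleles IBD at the causal locus are more strongly correlated, which is the signal that both HE regression and the VC method exploit.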
2. Methods
2.1. The S.A.G.E. Software
S.A.G.E. is a freely available software package (though a license is required) containing programs for use in the genetic analysis of family, pedigree, and individual data. It was developed by the Human Genetic Analysis Resource, funded by the National Center for Research Resources of the National Institutes of Health. Every S.A.G.E. program takes as input at least two files: a parameter file, which tells the program how to read in the data, and a pedigree file, which contains the actual data. Analyzing data with S.A.G.E. can be accomplished either through the original command line interface or using the graphical user interface (GUI) as a front end. We discuss how to use the GUI to analyze the data, but it is a simple matter to use the parameter files generated by the GUI as input for the command line. Among the many programs that make up S.A.G.E., we highlight the following:
SNPCLIP: Can be used to thin the marker data.
PEDINFO: Reports descriptive statistics about the pedigree structures present in the data.
GENIBD: Estimates the number of alleles shared IBD at numerous different genomic locations.
FREQ: Estimates allele frequencies.
SIBPAL: Performs many different flavors of HE regression using the output of GENIBD as input. Can only use sibling pairs.
RELPAL: Performs the two-level HE regression using the output of GENIBD as input. Can analyze general pedigrees.
The typical workflow for a model-free analysis in S.A.G.E. is shown in Fig. 1. Three pieces of information are needed to begin the analysis: the phenotype/pedigree data, the marker data, and a genome description file (i.e., the genetic map file). Typically, the workflow is repeated for each chromosome to be analyzed. Because managing the output for all
Fig. 1. Typical workflow. Squares represent data files and circles represent programs. (The diagram connects the marker/pedigree/phenotype data, the Marker Locus Description File, and the Genome Description (Map) File through FREQ and GENIBD to an IBD file, which SIBPAL or RELPAL then analyze.)
chromosomes is difficult and the analysis time may tie up the analyst's personal computer, it may be useful to write a script to run the analysis on a server. We do not discuss how to do this, because we focus on the GUI. However, as mentioned earlier, the parameter files created by the GUI may easily be extracted and used in a script.
2.2. Initial Steps
The first steps in performing any data analysis are obtaining the software, cleaning/manipulating the data, and reading the data into the program. These initial steps have been covered in other chapters (see, for example, Chapter 30). However, we do need to make several additional important comments concerning them. First, we strongly suggest that in the data cleaning stages the user identify Mendelian errors in the markers and check for relationship misclassification using the marker data. These steps have been covered in Chapters 2 and 3, respectively. Markers with a large number of errors should be removed, and markers with a small number of errors should be left as missing for the pedigrees in which the errors occurred. Furthermore, if possible, relationships should be reclassified according to what is suggested by the molecular data, since incorrect relationships may bias linkage analysis findings (17, 18). Second, for dense single nucleotide polymorphism (SNP) data, it is important to thin the markers. This will increase the computational speed of the analysis later, and it is also important theoretically: when IBD proportions are estimated with missing founder genotypes, the markers are assumed to be in linkage equilibrium. Thus, the use of densely spaced SNPs may lead to inflated type I error or loss of power (19). SNPCLIP is one program which can easily thin out the SNP density, keeping only informative SNPs that are not in linkage disequilibrium. To access SNPCLIP from the S.A.G.E. GUI, just go to Tools > Run > SNP Clip. If a message
appears requesting the location where SNP Clip is installed, select the S.A.G.E. bin directory (e.g., "c:\SomeThing\S.A.G.E. v6.1.0\bin\," replacing "SomeThing" and the version number as appropriate). Third, we suggest that the reader use the program PEDINFO to get some descriptive statistics about the pedigrees before proceeding. This can give the user some idea of how much information about linkage is contained in the data. To run PEDINFO:
1. Go to Analysis > Summary Statistics > PEDINFO.
2. On the left panel of the Project Window, drag your pedigree file from its position in Data—Internal to Jobs—PEDINFO1—Errors—Missing Data File.
3. On the lower right hand side of the Project Window, press the Next button and then the Run button. At this point the program should run; the typical run time is on the order of seconds.
4. The resulting output should now appear under Jobs—PEDINFO1—Output. Simply double click on the output files to view the results. When viewing the results, first view the pedinfo.inf information file to look for potential errors. Next, view the actual results file, PEDINFO1.
Finally, we point out that, if the user has the phenotype data, pedigree data, and genotype data in separate files, it will be helpful for downstream analysis to merge the genotype, pedigree, and phenotype files, so that there is one file for each region/chromosome to be analyzed. (Note, however, that it is possible to keep the phenotype and genotype data separate. That is, you could have one file per chromosome that has pedigree and genotype information for calculating allele frequencies and IBD values, and one file that has pedigree and phenotype information for actually performing the final linkage analysis.)
2.3. Obtaining and Formatting a Genetic Map File
Besides the actual data, a second crucial piece of information for linkage analysis is the genetic map. In order to make full use of your genetic data, the program GENIBD, which estimates the IBD values, must know the genetic distance between linked markers. Some of the vendors of marker assays distribute a genetic map file. The Rutgers Genome Map file, based on the work of Kong et al. and Matise et al. (20, 21), contains a large number of genetic markers and can be found as a supplementary file. If no genetic map is available, then the genetic position of a marker should be estimated by linear interpolation of its physical position between two markers of known physical and genetic position.
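The linear interpolation described above can be sketched in a few lines; the function name, argument layout, and positions below are our own illustration, not part of S.A.G.E.:

```python
def interpolate_cm(pos_bp, left, right):
    """Estimate the genetic position (cM) of a marker at physical position
    pos_bp (bp) by linear interpolation between two flanking markers with
    known physical and genetic positions, given as (bp, cM) pairs."""
    (bp1, cm1), (bp2, cm2) = left, right
    # Fraction of the physical interval covered, applied to the genetic interval
    return cm1 + (pos_bp - bp1) * (cm2 - cm1) / (bp2 - bp1)

# A marker physically halfway between flanking markers at 1.0 cM and 3.0 cM
print(interpolate_cm(1_500_000, (1_000_000, 1.0), (2_000_000, 3.0)))
```

A marker halfway between the flanking markers in physical distance is placed halfway between them in genetic distance (here, 2.0 cM).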
16
Model-Free Linkage Analysis of a Quantitative Trait
In S.A.G.E., the genetic map file for the markers is known as the “Genome Description File.” The information must be saved in an ASCII text file in a format similar to this:
The distance in the above example is the distance between markers, and it must be recorded in centimorgans. The map function may be specified as Haldane or Kosambi, depending on how the genetic distance was measured. The first entry may be entered as the distance from the p terminus to the first marker, or the user may choose to start with the first marker, as exemplified above. The marker names (e.g., "chr1marker1," "chr1marker2," etc.) should be replaced by the true marker names. In order to use the "Genome Description File" in the S.A.G.E. GUI, the user must import the file. To accomplish this critical step, drag the genome file icon from the Palette to Data—Internal in the Project Window. It is also possible to import the map file from several other formats using the Tools > Create a Genome Description File option. We note that, historically, linkage studies were sometimes performed using what is sometimes referred to as "single-marker" or "two-point" linkage analysis. Single-marker linkage analysis analyzes the genetic markers independently, so it is not necessary to know the genetic distance between markers. We strongly recommend instead that "multipoint" IBD values be calculated to make full use of the information available. Thus, it is critical to obtain a correctly formatted genetic map file.
2.4. Obtaining the Allele Frequencies for the Markers
In order to calculate the IBD values, GENIBD needs to know the allele frequencies of your markers. In S.A.G.E., this information is stored in a file known as the “Marker Locus Description File.”
If the user already has estimates of the allele frequencies, then the frequencies may be entered as described in the S.A.G.E. manual. Here we assume that the user estimates the allele frequencies from the available data using the program FREQ. Estimating allele frequencies has already been covered in greater detail in Chapter 5, but we briefly go over the steps in S.A.G.E.:
1. Go to Analysis > Allele Frequency Estimation > FREQ.
2. On the left panel of the Project Window, drag the pedigree file from its position in Data—Internal to Jobs—FREQ1—Errors—Missing Data File.
3. On the lower right hand side of the Project Window, press the Next button.
4. On the right hand side of the Project Window, click on the drop down box for "Marker" and click "Select All."
5. Click Run on the lower right hand side of the Project Window.
6. Click OK in the "Analysis Information" pop-up box.
7. The run may take some time depending on the size of the data. The user may look at the tasks window to see the status of the job.
When the job is finished, as always check the information file (i.e., the file ending in .inf) first, to make sure there were no errors in analyzing the data. Next, the user should look at the summary file (ending in .sum) and the detail file (ending in .det). The final file that FREQ outputs is the "Marker Locus Description File," which ends in .loc. This file will be used as input for GENIBD.
2.5. Obtaining IBD Information Using GENIBD
Before any model-free analysis can proceed, the IBD information must be obtained using the program GENIBD. This program takes as input the "Marker Locus Description File" and the "Genome Description File." To run GENIBD in the S.A.G.E. GUI:
1. Go to Analysis > IBD Allele Sharing Estimation > GENIBD.
2. On the left panel of the Project Window, drag the pedigree file from its position in Data—Internal to Jobs—GENIBD1—Errors—Missing Data File.
3. On the left panel of the Project Window, drag the "Genome Description File" imported earlier from its position in Data—Internal to Jobs—GENIBD1—Errors—Missing Genome file.
4. On the left panel of the Project Window, drag the "Marker Locus Description File" created by FREQ from its position in Jobs—FREQ1—Output to Jobs—GENIBD1—Errors—Missing Marker locus file.
5. On the left panel of the Project Window, click Next. The left panel of the Project Window should now be on the Analysis Definition tab.
6. Select the desired options for the analysis. The default options are generally appropriate, but we briefly discuss the various options:
(a) Title. Specifies the title that will be displayed in the output.
(b) Region. Selects which unlinked regions in the "Genome Description File" will be analyzed.
(c) IBD mode. Selects whether the markers will be considered independent or not. As discussed in our section on the genetic map file, this should usually be set to Multipoint.
(d) Scan type. Determines whether the IBD values will be reported at the actual marker positions only, or at those positions and on a uniform grid across the entire region. The uniform grid (specified by selecting the intervals option) may result in slightly nicer plots of the linkage signal.
(e) Output pair types. Determines for which types of relative pairs the IBD values are calculated. If SIBPAL is going to be used, then it is sufficient to output All sibs, but doing the calculations for all relatives will not hurt anything. If RELPAL is going to be used, then the All relatives option is preferable.
(f) Use simulation. Decides whether an exact hidden Markov model or a Markov chain Monte Carlo (MCMC) method will be used to estimate the IBD values. If set to True, then the MCMC method will be used only when the pedigrees are large and an exact analysis would take up a large amount of memory. If set to False, then the MCMC method will never be used; the exact method is often faster than the MCMC method, but it may require too much memory. It is often more computationally efficient to set Use simulation to False and Split pedigrees to True. We do not recommend that Use simulation be set to Always. Also, we do not recommend that the user change the other settings for the MCMC simulation without a deeper understanding of the algorithm.
(g) Maximum bit. Determines the largest family that can be analyzed exactly.
We recommend that the user not change this without some knowledge of the underlying algorithm.
(h) Split pedigrees. Splits multigenerational pedigrees into nuclear families before analysis. This loses some information and makes it impossible to estimate IBD for non-sibling pairs. However, it is often much faster computationally than using MCMC simulations.
(i) Allow loop. Allows pedigrees with inbreeding to be analyzed. This option cannot be used with the multipoint option.
7. Once the appropriate options have been selected, click the Run button in the lower right hand side of the Project Window.
8. Click OK on the Analysis Information pop-up window.
9. Run time may vary depending on the amount of data. For large pedigrees with a large number of markers, the run can potentially take hours for a single chromosome. If this is the case, then the user may wish to run the analysis on a computer cluster instead of a personal computer; we briefly discuss this later.
The output may be found under Jobs—GENIBD1—Output. As always, check the information file (genibd.inf) first for errors and warnings. The file ending in .ibd contains the actual IBD information, which will be used by the other programs.
2.6. Analyzing Sibling Pair Data Using SIBPAL
After following all of the previous steps, it is now actually time to perform the linkage analysis. SIBPAL is a program that can robustly make use of sibling pair information. To run SIBPAL:
1. Go to Analysis > Linkage Analysis > Model-free > Sibling Pairs > SIBPAL.
2. On the left panel of the Project Window, drag the pedigree file from its position in Data—Internal to Jobs—SIBPAL1—Errors—Missing Data File.
3. On the left panel of the Project Window, drag the IBD file created earlier from its position in Jobs—GENIBD1—Output to Jobs—SIBPAL1—Missing IBD file.
4. On the left panel of the Project Window, click Next. The left panel of the Project Window should now be on the Analysis Definition tab.
5. Select the Trait Regression option. The left panel of the Project Window should now be on the Trait Regression tab.
6. Next to the Trait option, click Define. In the Specification pop-up box, select the trait that you wish to analyze. If the data involve unascertained samples, we recommend that the user select BLUP mean for the mean. For strongly ascertained traits, the user should look to the paper by Sinha and Gray-McGuire (22) for guidance (see Note 1). Click Add trait and OK.
7. If you have covariates that you wish to adjust out, click on the Define button next to Covariate. Select the covariate that you wish to use. Recall that in HE regression, covariates must be pair specific. SIBPAL will form a pair-specific covariate using one of the operations under option. For covariates that are measured at the individual level, the available options are sum, difference, and mean. The user should select the option that will be the easiest to interpret. If the covariate is already coded at the sibship level (e.g., ethnicity), then this can be indicated and the covariate analyzed as is. The power option will put the newly formed
pair-specific covariate to some power. Click Add covariate and OK. (Note that it is often more readily interpretable to preadjust the trait for the covariates by performing linkage analysis on the residuals of the trait after linear regression.)
8. Select the markers you wish to analyze. Generally, this will be All.
9. If you wish to analyze only a subset of the data, you may select a variable for the Subset option. This variable must be a dummy variable (i.e., take on only the values 0 or 1). Only those individuals with a value of 1 for this variable will be considered for analysis.
10. If you wish to model interactions between markers, you may do so using the Interactions option.
11. We recommend that the user compute empirical P-values. This requests that SIBPAL use a permutation method to assess statistical significance. To do this, click on the Define button next to Compute empirical P-values. The default in the Specification pop-up box is to compute empirical P-values only if the asymptotic P-value is less than 0.05. We recommend that the user keep the default values and click OK. If more precision is desired for the P-value, this may be obtained in a second analysis of the data by decreasing the Width of P-value option.
12. For the Dependent variable, we recommend that the user select the W4 option when the sample is not ascertained. As mentioned in Subheading 2, this has been shown to have good power compared to the other methods (see Note 1).
13. For Pair type, select full, half, or both, depending on whether full sibling pairs or half sibling pairs are available in the data.
14. It is particularly useful to select the Produce tab delimited output option, because this will allow the output to be easily imported into other programs to make nice looking plots and tables.
15. We do not recommend using the robust variance estimator, as it is believed to be overly conservative.
16. Click Next.
17. Click Run.
18. Click OK.
19.
The output should show up under Jobs—SIBPAL1—Output. View the information file first to find any errors and warnings. The .treg file contains a summary of the results, and the .treg_det file contains more details about the results. The .treg_export file is produced only if the Produce tab delimited output check box is selected; it may be easily imported into programs such as R or Excel. See Note 2 below for detail regarding interpretation of jagged P-value distributions. Also see Fig. 2 for some sample output and notes on interpretation.
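As a sketch of importing the tab-delimited export for further plotting or tabulation, the snippet below parses a small inline example and computes -log10 P-values; the column names here are hypothetical placeholders (the actual .treg_export headers are described in the S.A.G.E. manual):

```python
import csv
import io
import math

# Hypothetical tab-delimited export; substitute the real .treg_export
# file and its actual column names.
export = (
    "marker\tposition\tpvalue\n"
    "chr1marker1\t0.0\t0.04\n"
    "chr1marker2\t10.0\t0.5\n"
)

# Parse the tab-delimited text into one dict per marker row
rows = list(csv.DictReader(io.StringIO(export), delimiter="\t"))

# -log10 P-value per marker, the usual quantity plotted against map position
neglog10 = [(r["marker"], -math.log10(float(r["pvalue"]))) for r in rows]
for marker, score in neglog10:
    print(marker, round(score, 3))
```

With a real file, replace the inline string with `open("SIBPAL1.treg_export")` and pass the resulting positions and scores to any plotting library.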
Fig. 2. SIBPAL output (slightly edited).
2.7. Analyzing Extended Families Using RELPAL
If the data contain extended relative pairs, it may be more appropriate to use the program RELPAL. To analyze data using RELPAL:
1. Go to Analysis > Linkage Analysis > Model-free > Arbitrary Relative Pairs > RELPAL.
2. On the left panel of the Project Window, drag the pedigree file from its position in Data—Internal to Jobs—RELPAL1—Errors—Missing Data File.
3. On the right panel of the Project Window, click the "..." next to Data file. Select the IBD file created earlier.
4. Click Next.
5. Select a Trait from the drop down box.
6. For Model, select Single marker; this is the option for linkage analysis in general. The Multiple marker option allows you to adjust your signal for linkage at other locations. The Zero marker option allows you to estimate the covariate effects without doing linkage analysis.
7. If there are covariates to be adjusted for, click Define next to First level. Select a Covariate and leave Trait to be adjusted blank. Click Add Covariate. You may repeat this to add multiple covariates.
8. If the user wishes to analyze only specific markers, then click Define next to Second level. We do not recommend that the user select covariates for the second level, because it is difficult to interpret what such covariates mean. See the S.A.G.E. user manual for more about these options.
9. The Produce comma delimited output option may be used to create an output file that can be easily imported into other programs such as R or Excel.
10. Click Run.
11. The output should show up under Jobs—RELPAL1—Output. View the information file first to find any errors and warnings. The .out file contains a summary of the results. A comma-delimited export file is produced only if the Produce comma delimited output check box is selected; it may be easily imported into programs such as R or Excel. Note that the "Empirical P-value" reported by RELPAL is actually asymptotic, and does not involve permutations.
See Note 3 below for detail on the convergence of RELPAL, and Note 2 regarding numerical instability if P-values appear jagged. See the S.A.G.E. user manual for the most current overview of the different possible robust score tests that are reported. We recommend against using the naïve variance estimate unless the data are known to be very close to normally distributed. See Fig. 3 for some sample output.
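Both the SIBPAL covariate step and Note 3 below involve working with residuals after regressing the trait on covariates. A minimal sketch of that pre-adjustment, using ordinary least squares with a single covariate (function name and data are illustrative only), is:

```python
def residuals(trait, covariate):
    """Residuals from a simple least-squares regression of trait on one
    covariate: the covariate-adjusted trait values that could then be
    carried forward into the linkage analysis."""
    n = len(trait)
    mx = sum(covariate) / n
    my = sum(trait) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(covariate, trait))
    sxx = sum((x - mx) ** 2 for x in covariate)
    beta = sxy / sxx            # slope
    alpha = my - beta * mx      # intercept
    return [y - (alpha + beta * x) for x, y in zip(covariate, trait)]

# A trait perfectly explained by the covariate leaves zero residuals
print(residuals([2.0, 4.0, 6.0], [1.0, 2.0, 3.0]))
```

In practice one would use a full multiple-regression routine (e.g., `lm` in R) with all covariates at once; the principle is the same.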
Fig. 3. RELPAL output (slightly edited).
3. Notes
1. Choice of parameterization of dependent variable. As summarized by Sinha and Gray-McGuire (22), there are many options for the parameterization of the dependent variable and the mean correction factor. Their paper examines the implications of the ascertainment scheme and trait distribution for the power and type I error of the various parameterizations. We refer the interested reader to Figures 7 and 8 of their paper (22) for a scheme to select the dependent variable and mean correction. Unfortunately, not all the possible options in SIBPAL were investigated there; for example, the BLUP mean may be combined with W4.
2. Marks of numerical instability. If the plot of the -log10 P-value is extremely jagged (i.e., nearby markers do not have similar P-values), there may be a problem with numerical stability. In extreme cases, the reader ought to consider using a different method. However, note that if empirical P-values are being used, the problem may also be due to an insufficient number of simulations/permutations. In the GUI for SIBPAL, this may be changed by clicking on Define next to Compute empirical P-values.
3. Failure to converge in RELPAL. Occasionally, RELPAL may issue an error stating that it failed to converge. This is typically because the data display extreme non-normality. If this is the case, it may help to first transform the residuals. Alternatively, under the First level analysis definition, the user may specify Normalize residual, which may also help.

References
1. Haseman JK, Elston RC (1972) The investigation of linkage between a quantitative trait and a marker locus. Behav Genet 2: 3–19
2. Elston RC, et al (2000) Haseman and Elston revisited. Genet Epidemiol 19: 1–17
3. Shete S, Jacobs KB, Elston RC (2003) Adding further power to the Haseman and Elston method for detecting linkage in larger sibships: Weighting sums and differences. Hum Hered 55: 79–85
4. Wang T, Elston RC (2004) A modified revisited Haseman-Elston method to further improve power. Hum Hered 57: 109–116
5. Amos CI (1994) Robust variance-components approach for assessing genetic linkage in pedigrees. Amer J Hum Genet 54: 535–543
6. Almasy L, Blangero J (1998) Multipoint quantitative-trait linkage analysis in general pedigrees. Amer J Hum Genet 62: 1198–1211
7. Allison DB, et al (1999) Testing the robustness of the likelihood-ratio test in a variance-component quantitative-trait loci-mapping procedure. Amer J Hum Genet 65: 531–544
8. Sham PC, Purcell S (2001) Equivalence between Haseman-Elston and variance-components linkage analyses for sib pairs. Amer J Hum Genet 68: 1527–1532
9. Chen WM, Broman KW, Liang KY (2004) Quantitative trait linkage analysis by generalized estimating equations: Unification of variance components and Haseman-Elston regression. Genet Epidemiol 26: 265–272
10. Goldgar DE (1990) Multipoint analysis of human quantitative genetic variation. Amer J Hum Genet 47: 957–967
11. Shete S, Jacobs KB, Elston RC (2003) Adding further power to the Haseman and Elston method for detecting linkage in larger sibships: Weighting sums and differences. Hum Hered 55: 79–85
12. Wang T, Elston RC (2005) Two-level Haseman-Elston regression for general pedigree data analysis. Genet Epidemiol 29: 12–22
13. Lange KL, Little RJA, Taylor J (1989) Robust statistical modeling using the T distribution. J Amer Statist Assoc 84: 881–896
14. Blangero J, Williams JT, Almasy L (2000) Robust LOD scores for variance component-based linkage analysis. Genet Epidemiol Suppl 19: S8–S14
15. Chen WM, Broman KW, Liang KY (2005) Power and robustness of linkage tests for quantitative traits in general pedigrees. Genet Epidemiol 28: 11–23
16. Abecasis GR, et al (2001) Merlin—rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet 30: 97–101
17. Cardon LR, et al (1994) Quantitative trait locus for reading disability on chromosome 6. Science 266: 276–279
18. Cardon LR, et al (1995) Quantitative trait locus for reading disability: correction. Science 268: 1553
19. Goode EL, Jarvik GP (2005) Assessment and implications of linkage disequilibrium in genome wide single nucleotide polymorphism and microsatellite panels. Genet Epidemiol Suppl 29: S72–S76
20. Kong X, et al (2004) A combined linkage-physical map of the human genome. Amer J Hum Genet 75: 1143–1148
21. Matise TC, et al (2007) A second-generation combined linkage-physical map of the human genome. Genome Res 17: 1783–1786
22. Sinha R, Gray-McGuire C (2008) Haseman Elston regression in ascertained samples: Importance of dependent variable and mean correction factor selection. Hum Hered 65: 66–76
Chapter 17 Model-Free Linkage Analysis of a Binary Trait Wei Xu, Shelley B. Bull, Lucia Mirea, and Celia M.T. Greenwood Abstract Genetic linkage analysis aims to detect chromosomal regions containing genes that influence risk of specific inherited diseases. The presence of linkage is indicated when a disease or trait cosegregates through the families with genetic markers at a particular region of the genome. Two main types of genetic linkage analysis are in common use, namely model-based linkage analysis and model-free linkage analysis. In this chapter, we focus solely on the latter type and specifically on binary traits or phenotypes, such as the presence or absence of a specific disease. Model-free linkage analysis is based on allele-sharing, where patterns of genetic similarity among affected relatives are compared to chance expectations. Because the model-free methods do not require the specification of the inheritance parameters of a genetic model, they are preferred by many researchers at early stages in the study of a complex disease. We introduce the history of model-free linkage analysis in Subheading 1. Table 1 describes a standard model-free linkage analysis workflow. We describe three popular model-free linkage analysis methods, the nonparametric linkage (NPL) statistic, the affected sib-pair (ASP) likelihood ratio test, and a likelihood approach for pedigrees. The theory behind each linkage test is described in this section, together with a simple example of the relevant calculations. Table 4 provides a summary of popular genetic analysis software packages that implement model-free linkage models. In Subheading 2, we work through the methods on a rich example providing sample software code and output. Subheading 3 contains notes with additional details on various topics that may need further consideration during analysis. 
Key words: Genetic linkage analysis, Nonparametric linkage (NPL) score, Identity by descent (IBD) sharing, Affected relative pairs, Affection status, Likelihood ratio based linkage model, Pedigree structure, Kong and Cox model, Genetic heterogeneity, GENEHUNTER, ALLEGRO
1. Introduction 1.1. Overview
Genetic linkage analysis aims to detect chromosomal regions containing disease genes that influence risk of specific inherited diseases, by examining patterns of inheritance in families. Pedigree relationships are needed, as well as disease or trait information and genetic marker genotypes for at least some of the family members.
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_17, # Springer Science+Business Media, LLC 2012
The presence of linkage is indicated when a disease or trait cosegregates through the families with genetic markers at a particular region of the genome. Two main types of genetic linkage analysis are in common use. One of these, usually referred to as model-based or parametric linkage analysis, requires the assumption of a number of parameters about the relationship between the unknown gene mutations and the disease or trait, and about the genetic model underlying it. In this chapter, we focus solely on the other type, known as model-free linkage analysis, and specifically on binary traits or phenotypes, indicating the presence or absence of a specific disease. Model-free linkage analysis is based on allele-sharing, where patterns of genetic similarity among affected relatives are compared to chance expectations. Allele sharing models are called model-free methods because prior specification of a disease inheritance model is not required. The popularity of these methods is largely derived from the conceptual simplicity of the approach. In a linkage study, the disease susceptibility genotypes carried by each individual are unobserved. The relationship between a particular genetic variant and disease risk can be described by a penetrance function, which is defined as the probability of being affected given the genotype at the disease gene. Let y denote the disease status, with y = 1 denoting disease and y = 0 denoting no disease. Then P(y|g) denotes the penetrance for phenotype y of an individual with risk genotype g. If there are only two alleles at the locus, one abnormal (D) and one normal (d), then there are only three genotypes (DD, Dd, and dd). If the penetrances are known, a model-based linkage analysis that uses this information will be the most powerful test of linkage. However, this is rarely the case, especially for complex diseases.
Since the model-free methods do not require specification of the penetrance parameters, they are preferred by many researchers at early stages in the study of complex disease (1, 2). The relevant observation in studies of affected relative pairs (ARPs) is how frequently the two related individuals share copies of the same ancestral marker allele; such copies are said to be inherited with “identity by descent” (IBD) (3, 4). Usually, IBD sharing at a genetic marker cannot be unequivocally determined from genotype data, but rather, is determined probabilistically based on observed multiple marker (multipoint) data. The software GENEHUNTER (5) was one of the first of several programs that can carry out such multipoint calculations to estimate IBD sharing probabilities for pairs of relatives using the Lander–Green algorithm (6). This algorithm is based on a hidden Markov model formulation of the pattern of inheritance at multiple loci. Gudbjartsson et al. improved the algorithm to run much faster, allowing slightly larger families to be analyzed (ALLEGRO software) (7, 8). Here we describe three popular model-free linkage analysis methods. First, we discuss methods developed for particular types of relative
Table 1
Standard analysis workflow for model-free linkage analysis

Step 1. Assemble families for analysis, paying attention to consistent phenotype definitions (see Note 4) and completeness and quality of DNA collection. Estimate power to detect linkage in the available families, prior to genotyping (see Chapter 13).
Step 2. Obtain genotyping data from the desired set of markers. Perform error checking using the genotype data—Mendelian errors (see Chapter 2), pedigree errors (see Chapters 3 and 4), Hardy–Weinberg equilibrium to test for poorly performing markers (see Chapter 7)—and clean the data set as needed.
Step 3. Estimate allele frequencies, or obtain appropriate allele frequency estimates (see Chapter 5 and Note 2). Estimate marker informativity (see Note 1).
Step 4. Decide on the appropriate model-free linkage method to be used (see Subheading 1 and Note 5), and hence on the software to use (see Table 4 for a comparison of software features).
Step 5. Calculate patterns of IBD allele sharing (see Subheading 1.3 and Note 3).
Step 6. Calculate model-free evidence of linkage. Plot results as a function of genetic distance and calculate likely intervals for linkage peaks (see Note 12).
Step 7. Estimate significance levels using random gene dropping, to take pedigree structures and marker informativity into account (see Note 10).
Step 8. Assess sensitivity to assumptions such as (1) phenotype definition (see Note 4), (2) allele frequencies (see Note 3), and (3) choice of test statistic (see Note 5).
Step 9. Consider possible evaluation of evidence of linkage heterogeneity as a function of covariates (see Note 8).
pairs, in particular sibling pairs. For example, the affected sib-pair (ASP) likelihood ratio test of Risch (9, 10) is based on three parameters, the respective probabilities that a sibling pair shares 0, 1, and 2 parental alleles IBD at a marker or a particular place in the genome; this approach and its extensions are implemented in the S.A.G.E. software LODPAL. Secondly, we describe the nonparametric linkage (NPL) statistic, developed by Whittemore and Halpern (11) and implemented in several software packages. This approach can be used for larger families, not just simple relative pairs. Finally, we describe the likelihood approach of Kong and Cox (12), implemented in the ALLEGRO software, which has better properties than the NPL statistic when the information about the patterns of inheritance, obtained from the marker data, is incomplete. Table 1 describes a standard model-free linkage analysis workflow. The theory behind the linkage tests is described in Subheadings 1.3–1.7, together with a simple example of the relevant calculations. Subheading 2 works through a richer example in detail. Subheading 3 contains notes on various topics. Notes 1–3 address genotyping considerations, while Notes 4 and 5 address phenotyping and model choice, respectively. We discuss special topics in Notes 6–9, and issues of statistical inference in Notes 10–12.
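To convey the flavor of the ASP likelihood ratio test, the sketch below computes a LOD score from counts of affected sib pairs sharing 0, 1, or 2 alleles IBD, assuming fully informative IBD sharing. This is a simplified illustration only; the actual LODPAL implementation also handles incomplete IBD information and constraints on the z parameters:

```python
import math

def asp_lod(n0, n1, n2):
    """Simplified ASP likelihood-ratio LOD score under fully informative
    IBD. Null sharing proportions for sib pairs are (0.25, 0.5, 0.25);
    the alternative plugs in the unconstrained MLEs z_i = n_i / n."""
    counts = (n0, n1, n2)
    null = (0.25, 0.5, 0.25)
    n = sum(counts)
    lod = 0.0
    for c, p in zip(counts, null):
        if c > 0:  # empty categories contribute nothing to the log-likelihood
            lod += c * math.log10((c / n) / p)
    return lod

# 100 ASPs showing excess sharing: 15 share 0, 45 share 1, 40 share 2 alleles IBD
print(round(asp_lod(15, 45, 40), 3))
```

When the observed counts match the null proportions exactly (e.g., 25, 50, 25), the LOD score is zero, as expected.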
1.2. Simple Example
To illustrate the basic concepts, we present a simple example with one microsatellite marker genotyped in two families, each containing three individuals affected with a disease of interest. The pedigree diagrams are presented in Fig. 1. The genetic marker has five possible alleles (1/2/3/4/5) with allele frequencies (0.20, 0.20, 0.20, 0.20, and 0.20). In family 1, the parental genotype information is unavailable, but in family 2, everyone has genotype data. Linkage analysis can be undertaken to investigate whether this marker is linked to the disease locus or, in other words, whether the disease and the marker tend to be inherited together.
1.3. Identity by Descent
All linkage analysis methods depend on estimates of IBD, that is, estimates of whether a chromosomal segment has been inherited from the same ancestor. This is simplest to describe for a pair of siblings. At any location on the autosomes, siblings can share 0, 1, or 2 copies of their parents’ chromosomes. The two unaffected siblings in family 2, for example, share two alleles IBD since they inherited the same alleles from each parent. In contrast, sibs 23 and 25 (in family 2) share no alleles IBD since allele 3, which they each carry, was
Fig. 1. Pedigree diagrams for the simple example. Notes: Circles represent females, rectangles represent males, and black symbols represent affected individuals. Each individual is assigned an identifying label (above the symbols), and the genotypes of each individual are marked below each symbol. For example, individuals 21 and 22 in family 2 each inherited marker allele 4 from their father, and marker allele 5 from their mother.
Table 2
Expected IBD sharing under the null hypothesis of no linkage (z0, z1, z2), with IBD estimates based on the marker genotypes (ẑ0, ẑ1, ẑ2) for all sib pairs in two simple pedigrees (estimated by GENEHUNTER)

Pedigree   Sib pair   z0     z1    z2     ẑ0    ẑ1    ẑ2
Family 1   21, 22     0.25   0.5   0.25   0     0     1
Family 1   21, 23     0.25   0.5   0.25   0     1     0
Family 1   21, 24     0.25   0.5   0.25   1     0     0
Family 1   21, 25     0.25   0.5   0.25   0     1     0
Family 1   22, 23     0.25   0.5   0.25   0     1     0
Family 1   22, 24     0.25   0.5   0.25   1     0     0
Family 1   22, 25     0.25   0.5   0.25   0     1     0
Family 1   23, 24     0.25   0.5   0.25   0     1     0
Family 1   23, 25     0.25   0.5   0.25   0.5   0     0.5
Family 1   24, 25     0.25   0.5   0.25   0     1     0
Family 2   21, 22     0.25   0.5   0.25   0     0     1
Family 2   21, 23     0.25   0.5   0.25   0     1     0
Family 2   21, 24     0.25   0.5   0.25   1     0     0
Family 2   21, 25     0.25   0.5   0.25   0     1     0
Family 2   22, 23     0.25   0.5   0.25   0     1     0
Family 2   22, 24     0.25   0.5   0.25   1     0     0
Family 2   22, 25     0.25   0.5   0.25   0     1     0
Family 2   23, 24     0.25   0.5   0.25   0     1     0
Family 2   23, 25     0.25   0.5   0.25   1     0     0
Family 2   24, 25     0.25   0.5   0.25   0     1     0
inherited from a different parent. At a location that is not associated with disease, for a sibling pair: IBD = 0 with probability 1/4, IBD = 1 with probability 1/2, and IBD = 2 with probability 1/4. Table 2 shows the IBD estimates for all the sibling pairs in our simple example. When multiple markers have been genotyped, recombination or crossovers during meiosis in each parent will lead to observable changes in the IBD patterns between relatives along the chromosomes. Suppose that, for a pair of siblings, a large number of genetic markers spanning the genome have been genotyped. Then, assuming that recombination rates and allele frequencies are known, estimates of IBD can be calculated at any point in the genome,
322
W. Xu et al.
including between marker locations. Usually, IBD cannot be completely inferred from the available marker data, so IBD estimates are simply the probabilities of the three IBD states at a particular location. In larger pedigrees of more general structure, Kruglyak et al. (5) demonstrated how to obtain patterns of IBD in a computationally efficient manner. All possible inheritance patterns from the founders in each family down to the bottom generation are enumerated for each founder allele. These patterns are termed inheritance vectors, and the probability of each inheritance vector can be calculated by taking into account the recombination in each pattern. IBD is then estimated in a straightforward manner from the inheritance patterns and their probabilities. Details of the algorithm and more recent advances are well described elsewhere (8, 13, 14).

1.4. Model-Free Linkage Analysis for Affected Relative Pairs
Once estimates of IBD sharing at markers for ARPs have been calculated, tests of linkage can be developed. Let zi denote the probability that two relatives inherit i marker alleles IBD, i = 0, 1, and 2. Let z = (z0, z1, z2), and let ẑ = (ẑ0, ẑ1, ẑ2) be the IBD estimates from the data at a genomic location of interest. Outbred relatives cannot share more than two alleles IBD. Several tests of linkage have been proposed for ARPs and, in each case, the null hypothesis is that the proportions z follow their expectations in the absence of linkage, that is, z = (z0, z1, z2) = p, where p depends on the relative pair type. For example, for sib pairs, p = (0.25, 0.5, 0.25). Under an alternative hypothesis of linkage, excess allele sharing would be expected for ARPs at positions where the marker locus is linked to the disease gene, with a pattern of sharing along the chromosome that attenuates with distance from a linkage peak of excess sharing.

An early model-free ARP linkage test is the "mean test," which compares the mean number of alleles shared IBD by the ARPs, 2ẑ2 + ẑ1, with its null expectation 2z2 + z1. A second test, the proportion test, compares the proportion ẑ2 with its null expectation z2. This latter test is appropriate only for sib pairs, since other relative pairs are not expected to share two alleles IBD (4, 15–17). Both the mean test and the proportion test can be seen as special cases of a general test statistic obtained by taking a weighted linear combination (18):

w0(ẑ0 − z0) + w1(ẑ1 − z1) + w2(ẑ2 − z2).

Since the weights can be standardized arbitrarily, it is possible to set w0 = 0 and w2 = 1. Therefore, the test can be completely specified by assigning the weight w1. Specifically, w1 = 0.5 corresponds to the mean test, and w1 = 0 corresponds to the proportion test.
Tests in this family are referred to as “1 degree of freedom tests.” Although all are valid tests in the absence of linkage, they have different power for detecting linkage, and the power depends on the true penetrances and genetic model.
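As a concrete sketch of this 1-degree-of-freedom family of tests, the hypothetical helper below computes w0(ẑ0 − z0) + w1(ẑ1 − z1) + w2(ẑ2 − z2); with w = (0, 0.5, 1) it reproduces the mean test, and with w = (0, 0, 1) the proportion test. The six per-pair IBD counts are those of the affected sib pairs in the worked example discussed in the text:

```python
def ibd_proportions(ibd_counts):
    """Observed (z0, z1, z2) estimates from a list of per-pair IBD counts."""
    n = len(ibd_counts)
    return tuple(ibd_counts.count(i) / n for i in range(3))

def weighted_test(zhat, z_null=(0.25, 0.5, 0.25), w=(0.0, 0.5, 1.0)):
    """w = (0, 0.5, 1) gives the mean test; w = (0, 0, 1) the proportion test."""
    return sum(wi * (zh - z0) for wi, zh, z0 in zip(w, zhat, z_null))

# Affected sib pairs of the simple example: family 1 shares (1, 1, 2) alleles
# IBD, family 2 shares (0, 1, 1), treated as six independent pairs.
pairs = [1, 1, 2, 0, 1, 1]
zhat = ibd_proportions(pairs)                   # (1/6, 4/6, 1/6)
mean_excess = weighted_test(zhat)               # 0: observed mean 1 = expected 1
prop_excess = weighted_test(zhat, w=(0, 0, 1))  # 1/6 - 1/4 = -1/12
```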
In each of families 1 and 2, we can examine the three pairs of affected siblings. In family 1, two affected pairs share one allele IBD and one affected pair shares two alleles IBD, while in family 2, one affected pair shares zero alleles IBD and two affected pairs share one allele IBD. If for the moment we assume that all six sibling pairs are independent, the mean test would compare 2(1/6) + 1(4/6) = 6/6 = 1 to the expected value of 1. The proportion test would compare 1/6 to the expected value of 1/4.

1.5. Model-Free Linkage Analysis in General Pedigrees
For families containing a variety of configurations of affected individuals, more general test statistics are needed. Whittemore (19) showed how tests of linkage could be conceptually unified, by demonstrating that patterns of IBD sharing among affected relatives can be assigned scores with the property that the scores are larger when there is more sharing among affected relatives. Allele sharing is quantified by a scoring function S(vi(t), Φi) that depends on the inheritance vector vi(t) and phenotype Φi specific to pedigree i. The scoring function of each family is normalized under the null hypothesis of no linkage to obtain a pedigree-specific NPL score, and then inference about linkage is based on a linear combination of pedigree-specific NPL scores (5). This approach to linkage, often termed "nonparametric linkage" (NPL), became widely used with the development of software to calculate the IBD patterns simultaneously with the scores (5, 20).

Several different scoring functions have been proposed (see Note 5). One popular scoring function is Sall (11), which measures IBD allele sharing by giving a larger weight to alleles shared among many different individuals in a family. Let h represent a collection of alleles obtained by selecting one allele from each of the a affected individuals, and let bj(h) equal the number of times that the jth founder allele appears in h (for j = 1, . . ., 2f, where f is the number of founders). Then

Sall = 2^(−a) Σ_h [ Π_{j=1}^{2f} bj(h)! ].

An alternative scoring function is Spairs, which is the number of alleles IBD shared by two distinct affected relatives, summed over all possible pairs. Suppose that IBD can be inferred with certainty at a particular marker at chromosomal position t (this case is also known as "complete data"). Then for pedigree i with phenotype Φi the inheritance vector vi(t) is known with certainty and, for a chosen statistic S, the corresponding NPL score Zi(t) is defined as:

Zi(t) = [S(vi(t), Φi) − μi] / σi,
where the terms μi and σi are the mean and standard deviation of the scoring function, respectively, calculated under the null hypothesis of no linkage. Let P0[vi(t) = w] represent the null probability of inheritance vector w, and define:

Si,w(t) = S[vi(t) = w, Φi],
μi = E0[S(vi(t), Φi)], and
σi² = E0[S²(vi(t), Φi)] − E0²[S(vi(t), Φi)].

When the inheritance vector is known with certainty, i.e., in the complete data case, S[vi(t), Φi] can be directly calculated by enumerating all sets. For the collection of N pedigrees, the overall NPL score is:

Z(t) = Σ_{i=1}^{N} γi Zi(t),

where γi is a pedigree-specific weight. In the complete data case, the null variance of Z(t) is well approximated by 1. Kruglyak et al. (5) proposed using γi = 1/√N, so that in the absence of linkage the NPL score has asymptotically a standard normal distribution.

When full inheritance information is not available, several inheritance vectors may be compatible with the observed data, and the probability distribution of the set of possible inheritance vectors can be calculated from the marker data. Then, the expected value of the scoring function is computed using the possible inheritance vectors indicated by the data, weighted by their probabilities. Let gi,w(t) = P[vi(t) = w | marker data]. The expected value of the scoring function is then:

S̄(vi(t), Φi) = Σ_w Si,w(t) gi,w(t).
Given that

Zi,w(t) = [Si,w(t) − μi] / σi,

the expected value Z̄i(t) is the pedigree-specific NPL score

Z̄i(t) = Σ_w Zi,w(t) gi,w(t),

and the Z̄i(t) are combined over the N sampled families to obtain an overall NPL statistic:

Z̄(t) = Σ_{i=1}^{N} γi Z̄i(t).
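To make the Sall definition and the NPL normalization concrete, the sketch below evaluates Sall for an affected sib pair under each IBD state by direct enumeration of the definition (complete-data case, with each allele labeled by its founder allele of origin, so 2f = 4), and then standardizes with the null mean and variance; the names are ours:

```python
from itertools import product
from math import factorial, sqrt

def s_all(genotypes):
    """S_all for a set of affected individuals whose alleles are labeled by
    founder allele of origin; genotypes is a list of 2-tuples, one per
    affected person. Implements 2**(-a) * sum over h of prod b_j(h)!."""
    a = len(genotypes)
    total = 0
    for h in product(*genotypes):        # pick one allele from each affected
        prod = 1
        for allele in set(h):
            prod *= factorial(h.count(allele))
        total += prod
    return total / 2 ** a

# An affected sib pair: father carries founder alleles (1, 2), mother (3, 4).
asp = {0: [(1, 3), (2, 4)],   # sibs share 0 alleles IBD
       1: [(1, 3), (1, 4)],   # sibs share the paternal allele only
       2: [(1, 3), (1, 3)]}   # sibs share both alleles
scores = {i: s_all(g) for i, g in asp.items()}   # {0: 1.0, 1: 1.25, 2: 1.5}

# Normalize to a per-family NPL score with the null IBD distribution
null_p = (0.25, 0.5, 0.25)
mu = sum(p * scores[i] for i, p in enumerate(null_p))
var = sum(p * scores[i] ** 2 for i, p in enumerate(null_p)) - mu ** 2
z = {i: (scores[i] - mu) / sqrt(var) for i in scores}   # z[2] = sqrt(2)
```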
A large NPL value at a particular genomic location provides evidence of linkage between that locus and a gene that increases disease risk. Table 3 shows the NPL scores and P-values for the single genetic marker in our simple example, estimated using the program GENEHUNTER (5).

Table 3
NPL scores and P-values using Sall or Spairs in our simple example

Test: Sall (Spairs)
Family     NPL score   P-value
Family 1    0.816      0.438
Family 2   −0.816      1.000
Total       0.000      0.684

In family 2, the NPL score is negative, reflecting less than expected allele sharing. This yields a P-value of 1.0 due to the test being one-sided in favor of excess sharing. For ASPs, the scoring functions Sall and Spairs give the same result.

1.6. Likelihood-Based Models in Allele-Sharing Linkage Analysis
Risch (9, 10) proposed a likelihood-based model for analyzing marker data from n ARPs. A test of linkage is based on the likelihood ratio L(ẑ)/L(p); for affected sib pairs,

LLR = 2 Σ_{j=1}^{n} loge [ Σ_{i=0}^{2} ẑi wij / Σ_{i=0}^{2} zi wij ],

where wij are the probabilities of observing the marker genotypes of the jth pair, given that they share i alleles IBD. The ẑi are the probabilities of an affected pair sharing i alleles IBD (which are unknown and need to be estimated) under the alternative hypothesis, and the zi are the null IBD sharing probabilities, for example, (0.25, 0.5, 0.25) for sib pairs. The LLR test statistic compares the likelihood of the ARP data when the three allele sharing probabilities are estimated, L(ẑ0, ẑ1, ẑ2), to the likelihood under the null hypothesis, L(p). The EM algorithm is used to maximize this LLR statistic with respect to the parameters z. If the maximized LLR statistic is greater than the test criterion, the null hypothesis of no linkage is rejected. The only constraint on the maximum likelihood estimate (MLE) is that ẑ0 + ẑ1 + ẑ2 = 1. When no other constraints are imposed on the estimated proportions, this likelihood ratio test is expected to be asymptotically distributed as χ² with 2 degrees of freedom. In ASPs, however, the power of the LR test can be increased by evaluating the numerator not at the point ẑ but, rather, at a different point z′, where z′ is constrained to lie within the triangle of values consistent with the underlying genetics of IBD allele sharing by the sibs (21). Hence, this constrained likelihood ratio test is based on the ratio L(z′)/L(p).
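An EM loop for maximizing this likelihood over (z0, z1, z2) can be sketched as follows; the wij values are made-up illustrative numbers, not data from the chapter, and the function names are ours:

```python
from math import log

def em_ibd(w, n_iter=200):
    """w[j][i] = P(marker genotypes of pair j | pair shares i alleles IBD).
    Returns the MLE (z0, z1, z2) under the constraint that they sum to 1."""
    z = [1 / 3, 1 / 3, 1 / 3]                         # start from uniform
    for _ in range(n_iter):
        post_sum = [0.0, 0.0, 0.0]
        for wj in w:
            denom = sum(z[i] * wj[i] for i in range(3))
            for i in range(3):
                post_sum[i] += z[i] * wj[i] / denom   # E-step: IBD posterior
        z = [s / len(w) for s in post_sum]            # M-step: mean posterior
    return z

def llr(w, z, z_null=(0.25, 0.5, 0.25)):
    """2 * sum_j log of the per-pair likelihood ratio at z versus the null."""
    return 2 * sum(log(sum(z[i] * wj[i] for i in range(3)) /
                       sum(z_null[i] * wj[i] for i in range(3))) for wj in w)

# Illustrative (made-up) genotype likelihoods for four sib pairs
w = [(0.1, 0.3, 0.6), (0.1, 0.3, 0.6), (0.2, 0.5, 0.3), (0.6, 0.3, 0.1)]
z_hat = em_ibd(w)
stat = llr(w, z_hat)   # nonnegative, since z_hat maximizes the likelihood
```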
Using this likelihood ratio test for the sibling pairs in our simple example, the estimate (ẑ0, ẑ1, ẑ2) is (0.266, 0.509, 0.225), and the test statistic for linkage is 1.401 with corresponding P-value 0.496. The test is based on all the affected sib pairs, assuming independence of the pairs (see Note 11).

In general pedigrees, Whittemore (19) presented a differently parameterized likelihood approach to linkage analysis and showed its correspondence with NPL analysis. To construct the likelihood model, we can consider a pedigree i with M markers genotyped at positions t1, . . ., tM along a chromosome. Let Yi(tm) represent the observed genotypes at the marker at position tm, and let Wi(tm) denote the phase-known genotypes, such that Yi(tm) and Wi(tm) are equivalent to identity in state (IIS) and IBD allele configurations, respectively. Whittemore (19) defined the pedigree risk ratio Ri,w(t, δ) as the ratio of the conditional probability of phenotype Φi given IBD configuration w at locus t, over the probability of observing exactly the same pedigree phenotype irrespective of the IBD configuration at t, for an arbitrary pedigree of the same size and structure. This ratio depends on a parameter δ, which measures the effect of the gene at t on the disease risk:

Ri,w(t, δ) = P[Φi | Wi(t) = w, δ] / P[Φi].

A simple linear model for the pedigree phenotype risk ratio is:

Ri,w(t, δ) = 1 + xi,w(t)δ,

where xi,w(t) is an explanatory variable that is a function of the IBD configuration at the disease gene. The likelihood function incorporating the pedigree phenotype risk ratio is

Li(t, δ) = Σ_w Ri,w(t, δ) gi,w(t) L0,i.

The factors gi,w(t) and L0,i in the above likelihood are calculated directly using the observed marker genotypes and population-based estimates of marker allele frequencies. The pedigree phenotype risk ratio Ri,w(t, δ) is the only component of the likelihood that depends on the unknown parameter δ. To test for linkage, Whittemore (19) specified the null hypothesis H0: δ = 0 and the one-sided alternative HA: δ > 0. A likelihood ratio statistic LR(t) is constructed to compare the likelihood, maximized with respect to δ, to the likelihood under the null value δ0 = 0:

LR(t) = 2 ln [L(t, δ̂) / L(t, δ0)] = 2 Σ_{i=1}^{N} ln [ Σ_w Ri,w(t, δ̂) gi,w(t) ].
The distribution of LR(t) is asymptotically χ² with one degree of freedom. The null hypothesis of no linkage is rejected when the LOD(t) score exceeds a predetermined critical level, where

LOD(t) = log10 [L(t, δ̂) / L0(t)] = LR(t) / [2 ln(10)].
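The conversion from LR(t) to LOD(t), and the extraction of a 1-LOD support interval from a grid of scores, can be sketched as follows (the positions and LOD values are invented for illustration):

```python
from math import log

def lod_from_lr(lr):
    """LOD(t) = LR(t) / (2 ln 10)."""
    return lr / (2 * log(10))

def one_lod_support(positions, lods):
    """Chromosomal points whose LOD score is within 1 unit of the peak."""
    peak = max(lods)
    return [t for t, l in zip(positions, lods) if l >= peak - 1]

# Illustrative grid of positions (cM) and multipoint LOD scores
pos = [0, 10, 20, 30, 40, 50]
lods = [0.2, 1.1, 2.4, 3.0, 1.7, 0.4]
interval = one_lod_support(pos, lods)   # [20, 30]
```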
Based on the peak LOD estimator of location, we can construct a 1-LOD support interval (22). The 1-LOD support interval is determined by the chromosomal points where the LOD scores are within 1 LOD unit of the peak LOD score (see Note 12).

1.7. Kong and Cox Model
Following the approach of Whittemore (19), Kong and Cox (12) developed a linkage likelihood model in which the covariates of the pedigree phenotype risk ratio are weighted NPL scores. They defined xi,w = γi Zi,w(t), so that the corresponding pedigree phenotype risk ratio is

Ri,w(t, δ) = 1 + δ γi Zi,w(t).

For a set of N pedigrees, the resulting score statistic SS(t) is equivalent to the overall NPL score. When the pedigree-specific weights γi are constrained so that Σ_{i=1}^{N} γi² = 1, and the null variance of Z̄(t) is approximated by 1, the efficient score statistic simplifies to

ES(t) = Z̄(t) V0[SS(t)]⁻¹ Z̄(t) = [Z̄(t)]².

When inheritance information is only partially available, however, application of the complete data approximation of Kruglyak et al. (5) gives conservative results, because the null variance of Z̄(t) is then less than 1. The likelihood ratio test proposed by Kong and Cox (12) does not require the complete data approximation and provides a more accurate linkage test than ES(t) or NPL analysis in most situations where inheritance information is incomplete. The linear likelihood function specified by Kong and Cox (12) for a set of N pedigrees is:

L(t, δ) = Π_{i=1}^{N} [1 + δ γi Z̄i(t)] L0,i.

Testing for linkage via a likelihood ratio test requires the maximization of L(t, δ) with respect to δ. An upper bound b is imposed on the MLE δ̂ to ensure Ri(t, δ) ≥ 0: if ai(t) is the smallest possible value that the scoring function Si,w(t) can theoretically take at position t, then bi(t) = σi / [μi − ai(t)], and the upper bound of δ̂ is b = min(bi(t)) over the i = 1, . . ., N sampled pedigrees. The lower bound of δ̂ is 0, as the model does not permit a negative gene effect. A likelihood ratio test LR(t) is constructed using the likelihood maximized under the constraint 0 ≤ δ̂ ≤ b:

LR(t) = 2 ln [L(t, δ̂) / L(t, δ0)] = 2 [ℓ(t, δ̂) − ℓ(t, δ0)].
Taking the natural logarithm of the likelihood function gives:

ℓ(t, δ) = ln { Π_{i=1}^{N} [1 + δ γi Z̄i(t)] L0,i } = C + Σ_{i=1}^{N} ln[1 + δ γi Z̄i(t)],

where C = Σ_{i=1}^{N} ln[L0,i] is a constant calculated using the observed marker data. Equivalently, one can compute the statistic

ZLR(t) = √( 2 [ℓ(t, δ̂) − ℓ(t, δ0)] ) = √( 2 Σ_{i=1}^{N} ln[1 + δ̂ γi Z̄i(t)] ),
which is well approximated by a Gaussian distribution when the number of pedigrees is large. Kong and Cox (12) implemented the ZLR(t) linkage test in the program GENEHUNTER-PLUS, and this test statistic is currently available in the ALLEGRO and MERLIN software (see Table 4). When inheritance is incomplete, ZLR(t) is a more powerful test than the NPL method. However, in the presence of a gene effect, the upper bound imposed on the MLE of δ restricts the amount of possible deviation. This can lead to substantial power losses if the dataset consists of a small number of pedigrees and dramatic sharing is observed (12). Kong and Cox (12) therefore outlined an alternative exponential model in which the parameter δ has no upper bound. The exponential model can be written as

P(vi(t) = w | δ) = pi,w(t) ri(δ) exp{ δ γi [Si,w(t) − μi] / σi },

where

ri(δ) = ( Σ_w pi,w(t) exp{ δ γi [Si,w(t) − μi] / σi } )⁻¹

is the renormalization constant necessary to ensure that Σ_w P[vi(t) = w | δ] = 1. Computation of ZLR(t) is more demanding for the exponential model because the conditional distribution of each Zi(t) must be calculated. However, the upper bound is no longer a problem for this exponential model. Since our simple example contains only two families, the Kong and Cox tests of linkage would be unreliable and are not given here.

1.8. Summary
In contrast to model-based tests of linkage (see Chapter 15), model-free linkage tests do not need to specify the true relationship between the genes and the disease risks. However, while the presumed disease model is made explicit in model-based linkage analysis, model-free methods make implicit assumptions about the disease–gene relationships, which can influence the power of the tests of linkage (see Note 5). In a genome scan for linkage, model-free linkage tests are usually only the first step in the search for disease susceptibility genes, and a genome-wide scan for linkage peaks may be used simply to identify regions that may harbor such genes. These regions can then be further analyzed with alternative methods such as fine-mapping studies. These allele-sharing methods have been widely used for the study of linkage of complex diseases; however, in their simple form they cannot directly detect gene–gene or gene–environment interactions.

Table 4
Comparison of available software for model-free linkage analysis of binary traits

ALLEGRO (a)
- Family structures: ASP, ARP, pedigrees
- Error checking: by examination of estimated haplotypes for excessive obligate recombination
- Allele frequency estimation: must be provided by the user; estimation of information content
- IBD allele sharing estimation: prior and posterior pairwise IBD sharing for all relative pairs, single-point and multipoint
- Models and statistics: NPL statistics using S(all), S(pairs), and four other scoring functions; LOD score from Kong and Cox linear or exponential model
- P-value calculation: perfect-data approximation for NPL; asymptotic LR for LOD score; two types of multipoint data simulation
- Other: allows unequal family weights; use of sex-specific marker maps

GENEHUNTER (b)
- Family structures: ASP, ARP, pedigrees
- Error checking: calculation of observed crossover rate to detect genotyping errors or marker order problems
- Allele frequency estimation: must be provided by the user
- IBD allele sharing estimation: maximum likelihood estimation for all non-founder relative pairs (calculated as single-point or multipoint)
- Models and statistics: NPL statistics using S(all) or S(pairs) scoring functions
- P-value calculation: perfect-data approximation
- Other: haplotype reconstruction; multiple measures of information content; includes X chromosome analysis

MERLIN (Multipoint Engine for Rapid Likelihood INference) (c)
- Family structures: ASP, ARP, pedigrees
- Error checking: option to use pedigree information to identify unlikely genotypes
- Allele frequency estimation: may be provided by the user or estimated by maximum likelihood or counting
- IBD allele sharing estimation: single-point or multipoint for all relative pairs in a pedigree, with additional IBD states for inbred and non-inbred pedigrees; haplotype estimation in pedigrees
- Models and statistics: NPL statistics using S(all) or S(pairs), and LOD score from Kong and Cox linear or exponential model
- P-value calculation: asymptotic or by gene-dropping simulation
- Other: options to allow for LD among neighboring markers; option for analysis of large pedigrees using SIMWALK2

SAGE (Statistical Analysis for Genetic Epidemiology) (d)
- Family structures: ASP, ARP, Discordant Sib Pairs (DSP)
- Error checking: MARKERINFO for Mendelian errors; RELTEST for relationship errors
- Allele frequency estimation: FREQ: estimated in singletons and in pedigrees by maximum likelihood or by counting
- IBD allele sharing estimation: GENIBD: single- and multi-marker, pairwise for full sibs, half sibs, grandparental, avuncular, first cousin
- Models and statistics: SIBPAL: test of mean allele sharing, for full and half sibs; LODPAL: LOD score for ASPs, LOD score for ARPs in 1- and 2-parameter conditional logistic model
- P-value calculation: asymptotic and empirical
- Other: allows for discordant relative pairs, X-linked models, parent-of-origin models, and option for pair-level covariates

Links to software, documentation, and tutorials:
(a) http://www.decode.com/software/
(b) http://www.broad.mit.edu/ftp/distribution/software/genehunter/
(c) http://www.sph.umich.edu/csg/abecasis/Merlin/ and http://www.sph.umich.edu/csg/abecasis/Merlin/tour/
(d) http://darwin.cwru.edu/sage/
2. Methods

2.1. Data
The methods for model-free linkage analysis are illustrated by analysis of a data set containing families with multiple cases of inflammatory bowel disease. Inflammatory bowel disease is a disorder of the autoimmune system characterized by chronic inflammation of the gastrointestinal tract. Epidemiological studies have provided evidence of a substantial genetic contribution to susceptibility. Familial aggregation has been observed in 10% of cases, with monozygotic and dizygotic twin concordance rates of 40–50% and 8%, respectively, and disease prevalence ten times greater in first-degree relatives of affected persons than in unrelated individuals from the general population (23, 24). Cases with early age at onset are hypothesized to have a more strongly genetic etiology. The disease occurs in two main forms, Crohn disease (CD) and ulcerative colitis (UC). To illustrate model-free linkage methods, we apply them in an analysis of 122 Canadian CD families recruited from the Toronto area as part of a genome-wide linkage study (25). As reported previously (26, 27), all types of affected relatives, as available, were included in the analyses (Table 5). The 122 CD families are classified into two categories: CD16 (at least one patient diagnosed at 16 years or younger) and CD > 16 (families not in CD16). Genotyping data were available for 17 microsatellite markers on chromosome 5 spanning 156 cM, with an average inter-marker distance of 10.34 cM (Table 6). These were highly polymorphic markers with considerable variation in allele frequencies.
2.2. Statistical Analyses
Pedigree-specific NPL scores from multipoint linkage analyses, using the allele-sharing scoring functions Sall and Spairs, were obtained in ALLEGRO. We also fitted the Kong and Cox (12) linear and exponential likelihood models separately to CD, CD16, and
Table 5
A breakdown of the 122 CD families, according to the type of affected relatives

Relationship of other affected          Number of families
relatives to the affected siblings      1 affected sibling   2 affected siblings   3 affected siblings
None                                    –                    106                   7
Aunt/uncle                              2                    2                     –
Cousin                                  –                    2                     –
Other (a)                               1                    1                     1

(a) Other includes: child; parent and grandparent; great aunt/uncle and great niece/nephew
CD > 16 family subgroups. To compare the CD16 vs. CD > 16 family subgroups, analyses of genetic heterogeneity were conducted at the chromosome 5 position with the highest linkage signal, as reported previously (26, 27). A positive covariate value (X = 1) was assigned to the CD16 subgroup, in which the observed evidence for linkage was greater. The linear and exponential models were applied to yield the one-sided ZLR test for linkage and two likelihood ratio tests: LRC (2 df), assessing linkage and heterogeneity, and LRH (1 df), testing heterogeneity in allele sharing between the CD16 and CD > 16 family subgroups (27). The test statistics,

LRC = Z²LR(CD16) + Z²LR(CD > 16)

and

LRH = Z²LR(CD16) + Z²LR(CD > 16) − Z²LR(CD),

were computed using the ZLR allele-sharing test statistics evaluated in the CD16, CD > 16, and CD (combined) samples.

2.3. Linkage Analyses of CD16 Families Using ALLEGRO
The linkage analysis software package ALLEGRO can be downloaded from http://www.decode.com/software/. At this time, the latest version, ALLEGRO 2.0, includes distribution files for Unix, Redhat/Linux, Windows/DOS, and Mac/G5 platforms. Analyses are performed using a script or options file that specifies input data files, commands for various analyses, and output files for results. Here we illustrate running ALLEGRO on a Windows machine to analyze the CD16 families. The program is invoked by typing allegro-2_v0f CD.opt at the DOS command prompt in a directory that includes both the allegro-2v0f.exe executable file and
Table 6
Allele frequencies and marker map distances on chromosome 5

Marker      Allele frequency distribution (estimated from founders)                                               Distance from first marker (cM)
D5S1492     (0.002, 0.006, 0.0636, 0.0915, 0.3738, 0.4632)                                                        0
D5S807      (0.0019, 0.0075, 0.0188, 0.0226, 0.0245, 0.0245, 0.0414, 0.0546, 0.064, 0.1205, 0.2015, 0.4181)       9.6
D5S817      (0.0039, 0.0097, 0.0874, 0.2019, 0.2427, 0.4544)                                                      13.5
D5S1473     (0.002, 0.002, 0.002, 0.002, 0.0059, 0.0059, 0.0119, 0.0257, 0.0317, 0.0931, 0.1743, 0.2554, 0.3881)  26.8
D5S1470     (0.0019, 0.0039, 0.0154, 0.0751, 0.0963, 0.106, 0.1233, 0.1464, 0.1734, 0.2582)                       35.9
D5S2494     (0.0303, 0.0303, 0.0303, 0.0606, 0.3333, 0.5152)                                                      49.5
GATA67D03   (0.0021, 0.0083, 0.0145, 0.0455, 0.0663, 0.0787, 0.0911, 0.1863, 0.2505, 0.2567)                      59.8
D5S1501     (0.0077, 0.0251, 0.0271, 0.029, 0.0445, 0.0754, 0.1161, 0.1721, 0.1915, 0.3114)                       75.8
D5S1719     (0.0041, 0.0164, 0.0184, 0.0777, 0.0879, 0.2495, 0.2495, 0.2965)                                      85.4
D5S1453     (0.002, 0.004, 0.0061, 0.0242, 0.0323, 0.0384, 0.0505, 0.0545, 0.1273, 0.2061, 0.4545)                105.3
D5S1505     (0.0017, 0.0087, 0.0419, 0.0471, 0.1082, 0.1431, 0.1728, 0.2286, 0.2478)                              120.4
GATA68A03   (0.002, 0.0082, 0.0368, 0.0573, 0.1268, 0.1391, 0.2188, 0.411)                                        124.2
D5S816      (0.0018, 0.0036, 0.0344, 0.058, 0.1087, 0.1703, 0.1884, 0.2174, 0.2174)                               129.9
D5S1480     (0.002, 0.004, 0.0381, 0.0641, 0.0842, 0.0882, 0.2244, 0.2385, 0.2565)                                138.1
D5S820      (0.0019, 0.0243, 0.0598, 0.0654, 0.086, 0.1981, 0.2804, 0.2841)                                       150.4
D5S1471     (0.0018, 0.0037, 0.0221, 0.0314, 0.0978, 0.1402, 0.286, 0.417)                                        162.7
D5S1456     (0.0039, 0.0308, 0.1291, 0.1541, 0.1792, 0.1888, 0.3141)                                              165.4
the following CD.opt options file:
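A CD.opt options file of the kind described in the following paragraph might look roughly like the sketch below; the exact MODEL-line syntax and output-file arguments are assumptions to be checked against the ALLEGRO documentation:

```
% CD.opt -- options for the CD16 linkage analyses (reconstructed sketch; see text)
% pedigree structure, affection status, and genotypes
PREFILE cd16.ped
% marker map and allele frequencies
DATFILE ch5.dat
% request exact P-values for the LOD and NPL scores
LODEXACTP
NPLEXACTP
% multipoint linear and exponential models, S_all scoring, equal family weights
MODEL mpt lin all equal cd16lin.out cd16flin.out
MODEL mpt exp all equal cd16exp.out cd16fexp.out
```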
Lines preceded by % correspond to comments and do not affect the analyses. The PREFILE and DATFILE commands read the input pedigree file cd16.ped and the marker data file ch5.dat, respectively. Both of these are in LINKAGE format, as described by Terwilliger and Ott (22). The pedigree file cd16.ped specifies the pedigree structure, affection status, and genotypes at the 17 microsatellite markers on chromosome 5 for members of the CD16 families (N = 51). The marker file ch5.dat specifies the marker positions on chromosome 5 and the allele frequencies of the 17 microsatellite markers. The LODEXACTP and NPLEXACTP commands request computation of exact P-values for the LOD and NPL scores, respectively. ALLEGRO has the capacity to run multiple models simultaneously. The first MODEL command requests the following linkage analysis: multipoint (mpt) IBD estimation using the linear (lin) model and the Sall scoring function (all), with families receiving equal weights (equal). The output file cd16lin.out provides results summarized across families, whereas cd16flin.out provides family-specific results. The second MODEL command requests a similar linkage analysis, in this case for the exponential likelihood model (exp). The output file cd16exp.out is listed below.
ALLEGRO computes all statistics at (m − 1) positions between consecutive markers, with m = 2 as the default. The first column corresponds to the location in cM, relative to the first marker, and the last column contains the marker name, or "–" for positions between markers. The second and fourth columns provide the allele-sharing LOD score and the nonparametric linkage NPL score. The dhat column is the MLE δ̂, and the Zlr column gives the ZLR score as defined earlier. The nplexactp and lodexactp columns provide P-values for the NPL and LOD scores, respectively, and the info column gives a likelihood-based measure of information (see Note 1). The family-specific output file cd16fexp.out has a similar format, as do the linear model output files. For additional commands and further details regarding alternative analysis options, please see the documentation included with the ALLEGRO software.

2.4. Results
The information content of the genotyped markers for the CD families is provided in Fig. 2. The greatest evidence for linkage was observed near position 129.9 cM (Fig. 3) for both the CD and CD16 subgroups. At this locus, the summary NPL scores using Sall are Z = 2.11 (p = 0.017) for CD and Z = 2.68 (p = 0.0037) for CD16 families, but only Z = 0.49 for CD > 16 (Table 7). Although the NPL score was higher in the CD16 families, these data alone do not provide sufficient evidence to declare significant linkage (p < 2 × 10⁻⁵) according to established criteria (28). As expected when inheritance information is less than complete, the model-based ZLR values from the Kong and Cox (12) likelihood models are consistently higher than the summary Z statistics (Table 7). In all subgroups, the linkage parameter δ in the linear model maximized within the imposed constraints.
Fig. 2. Information content for CD families (n = 122).
Fig. 3. Summary multipoint NPL scores for CD (n = 122), CD16 (n = 51), and CD > 16 (n = 71) family subgroups. Calculations were performed using the Sall scoring function, with similar results obtained for Spairs.
Table 7
Results of linkage analyses at the 129.9 cM position in CD family subgroups defined according to age at diagnosis (a)

(1) Z and ZLR tests based on the Sall scoring function

                              Linear model            Exponential model
Subgroup   N     Info   Z     ZLR    δ     P-value    ZLR    δ     P-value
CD         122   0.74   2.11  2.47   0.26  0.007      2.46   0.26  0.007
CD16       51    0.77   2.68  3.10   0.46  0.001      3.06   0.49  0.001
CD > 16    71    0.71   0.49  0.58   0.08  0.28       0.59   0.08  0.28

(2) Z and ZLR tests based on the Spairs scoring function

                              Linear model            Exponential model
Subgroup   N     Info   Z     ZLR    δ     P-value    ZLR    δ     P-value
CD         122   0.74   2.25  2.62   0.27  0.004      2.63   0.28  0.004
CD16       51    0.77   2.71  3.10   0.46  0.001      3.11   0.51  0.001
CD > 16    71    0.71   0.66  0.78   0.11  0.22       0.78   0.11  0.22

(a) The summary NPL score Z and the one-sided ZLR tests for the linear and exponential models were evaluated using the Sall and Spairs scoring functions as implemented in the package ALLEGRO
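The P-values in Table 7 are one-sided normal tail probabilities of the ZLR statistics, which can be checked directly with the standard-normal survival function:

```python
from math import erfc, sqrt

def one_sided_p(z_lr):
    """Upper-tail P-value for the one-sided Z_LR test under N(0, 1)."""
    return 0.5 * erfc(z_lr / sqrt(2))

# Z_LR values from Table 7 (linear model, S_all scoring function)
for z in (2.47, 3.10, 0.58):
    print(z, round(one_sided_p(z), 3))   # 0.007, 0.001, 0.281
```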
Joint linkage and heterogeneity analyses, conducted at the 129.9 cM locus, yielded similar results under the linear and exponential likelihood models (Table 8). In the CD families, the P-value for the LRC (2 df) test for linkage, including a binary covariate defining the CD16 and CD > 16 subgroups, is of the same order of magnitude as that of the ZLR (1 df) test for linkage without covariates (Table 8). Marginal evidence for heterogeneity between the CD16 and CD > 16 groups was detected with the Sall scoring function in both the linear and exponential models, consistent with the observation of higher excess allele sharing in the families with early age at onset. This locus has become known as IBD5, and subsequent analyses refined the locus to a 250-kb risk haplotype (29). In a recent review (30), the authors note that the association of this haplotype with IBD has been widely replicated in a number of independent populations and that the IBD5 risk haplotype has been principally associated with CD. This region has proven particularly difficult to fine map because it contains a significant degree of linkage disequilibrium, making it difficult to discern a causative allele from a marker allele co-inherited with the disease-causing allele.

Table 8
P-values for test statistics: ZLR for linkage in all CD families, LRC for linkage in the CD16 and CD > 16 groups jointly, and LRH for allele-sharing heterogeneity between the two CD family subgroups defined according to age at diagnosis

                        Linear model                          Exponential model
CD16 vs. CD > 16    ZLR (1 df)   LRC (2 df)   LRH (1 df)   ZLR (1 df)   LRC (2 df)   LRH (1 df)
Sall                0.007        0.007        0.051        0.007        0.008        0.055
Spairs              0.004        0.006        0.068        0.004        0.006        0.065
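The LRC and LRH statistics of Subheading 2.2 can be recomputed from the ZLR values reported in Table 7 (linear model, Sall) using closed-form chi-square tail probabilities; up to rounding this recovers the corresponding entries of Table 8:

```python
from math import erfc, exp, sqrt

def chi2_sf(x, df):
    """Chi-square upper tail probability for df = 1 or 2 (closed forms)."""
    if df == 1:
        return erfc(sqrt(x / 2))
    if df == 2:
        return exp(-x / 2)
    raise ValueError("only df = 1 or 2 supported in this sketch")

# Z_LR values at 129.9 cM, linear model with S_all (Table 7)
z_cd, z_cd16, z_cdgt16 = 2.47, 3.10, 0.58

lr_c = z_cd16 ** 2 + z_cdgt16 ** 2              # linkage + heterogeneity, 2 df
lr_h = z_cd16 ** 2 + z_cdgt16 ** 2 - z_cd ** 2  # heterogeneity, 1 df

p_c = chi2_sf(lr_c, 2)   # ~0.007 (Table 8)
p_h = chi2_sf(lr_h, 1)   # ~0.05 (Table 8 reports 0.051)
```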
3. Notes

1. Marker Informativity. The ability to detect linkage rests on the ability to accurately estimate the patterns of identity by descent. Microsatellite markers, which tend to have four or more alleles, are more informative about inheritance patterns. However, SNP markers, which have only two alleles, are now the most commonly used marker type, since there are millions in the genome and they can be typed efficiently. To achieve the same level of information about the inheritance patterns across the genome as obtained with microsatellites, denser SNP typing is needed (31). Several alternative measures of information content have been proposed, including an entropy-based measure (5) and likelihood-based measures (8, 32).
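As a sketch of the entropy-based measure mentioned above (assuming the common form 1 − H/H0, where H is the entropy of the inheritance-vector distribution and H0 is its maximum under the uniform prior):

```python
from math import log2

def entropy_information(probs, n_meioses):
    """Entropy-based information content: 1 - H/H0, where H0 = n_meioses
    bits (one bit per meiosis, attained by the uniform distribution)."""
    h = -sum(p * log2(p) for p in probs if p > 0)
    return 1 - h / n_meioses

# A sib pair involves 4 meioses, hence 16 possible inheritance vectors.
uniform = [1 / 16] * 16   # completely uninformative marker
certain = [1.0]           # fully informative: a single compatible vector
print(entropy_information(uniform, 4), entropy_information(certain, 4))
```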
17  Model-Free Linkage Analysis of a Binary Trait
2. Allele Frequency Errors. Genotype data will be unavailable for some individuals, especially for the top generation of a pedigree. Estimation of IBD then relies on population frequency estimates of each allele (allele frequencies, see Chapter 5). Unfortunately, linkage results can be substantially biased if the wrong allele frequencies are used (33, 34). When a common allele is mistakenly assumed to be rare and the founders are not genotyped, false-positive linkage signals can be obtained, since family members carrying this allele will be inferred (erroneously) to have the same ancestor (35). Ideally, allele frequency estimates appropriate for the ancestral background of each pedigree should be used, but most software packages do not allow different allele frequencies for different sets of families.

3. Linkage Disequilibrium Between Markers. Most algorithms to calculate IBD assume linkage equilibrium between markers; however, closely spaced markers are likely to be in linkage disequilibrium. Hence, some haplotypes may be more common than would be expected under linkage equilibrium, and this can be falsely interpreted as within-family sharing (36, 37). It is, therefore, important to ensure that the markers used in linkage analysis are well spaced and in approximate equilibrium. This can be achieved by judiciously removing markers from the data set (38). The MERLIN software uses a clustering method to address this issue (39); an alternative approach is implemented in the EAGLET software (40).

4. Phenotype Definitions. The trait being studied is crucial to the success of any linkage study. Ideally, the best choices for phenotypes are diseases or traits that are closely or directly influenced by risk genes. However, since the mechanisms relating genes to phenotypes are unknown, choosing phenotypes on this basis is impossible. The chosen phenotypes for analysis should, at a minimum, be clearly measurable and show good inter- and intra-rater reliability.
Particularly for psychiatric disorders, it has been argued that the major diagnoses may be too heterogeneous, and so different phenotypes, possibly based on specific test results, may be more useful (41). For example, when studying suicide, impulsive aggressive behavior is associated with completed suicides, and this trait may be more directly affected by genes (42).

5. Choice of Models and Test Statistics. Although available test statistics provide valid tests of linkage (so that in the absence of linkage their distributions in large samples are known), their power varies, and the optimal statistic to test for linkage depends on the true underlying penetrances and genetic model. McPeek (43) compared a number of different scoring functions for tests of linkage against a number of different genetic models and identified the most powerful scores for
different situations. In the absence of knowledge about the best choice, some have tried maximizing the tests of linkage across a selection of models, with empirical P-values obtained through simulation (44). Unfortunately, this approach does not always lead to better power. In the Kong and Cox approach, even after specifying the scoring function, it is necessary to choose either the linear or the exponential model. Although estimation in the linear model is easier, when sharing deviates markedly from the null hypothesis the exponential model may be more powerful. Certainly, in the presence of heterogeneity, the exponential model appears to be more powerful (27). For multigenerational pedigrees, Basu et al. (45) proposed a new model specification, within the likelihood framework of Kong and Cox, implemented in the program lm ibdtests in the software package MORGAN (http://www.stat.washington.edu/thompson/Genepi/MORGAN/Morgan.shtml). Their approach is also applicable to smaller pedigrees, can incorporate information on both affected and unaffected individuals, and does not require specification of an IBD measure such as Sall or Spairs.

6. X Chromosome. Any linkage method, including estimation of IBD, needs modification to work properly for markers on the X chromosome. The most popular software packages for model-free linkage with binary traits all work properly on X chromosome data (see Table 4).

7. Unaffected Relative Pairs or Discordant Pairs. Although linkage tests can also be developed as a function of IBD sharing patterns in unaffected relatives, or between pairs of individuals who are discordant for the disease of interest, such study designs generally have low power (46). However, when the disease has very high prevalence, using discordant pairs can be more powerful than using pairs of affected relatives (47).

8. Detection of Heterogeneity in Allele-Sharing Linkage Analysis.
The etiology of complex disorders is varied and may involve several susceptibility loci with interaction among multiple genetic and environmental factors. A likely component of complex disease is genetic heterogeneity, which refers to the situation in which the disease trait is independently caused by two or more factors, at least one of which is genetic. Analytic approaches used in gene mapping studies, including traditional likelihood LOD scores and model-free allele-sharing methods, are generally sensitive to the underlying genetic mechanism, and particularly to the presence of genetic heterogeneity. If analyses overlook the presence of genetic heterogeneity, then results may be biased and conclusions misleading. Heterogeneity among the allele-sharing distributions of sampled pedigrees can arise when genetic susceptibility varies
among the sampled pedigrees. This is the case with locus heterogeneity, when two unlinked genes independently cause disease. Environmental factors can also independently affect the disease phenotype or interact with one or more genetic loci to alter the penetrance of susceptibility genotypes. Age, for example, is an important covariate that may affect the penetrance of susceptibility loci. Regardless of the cause, the presence of genetic heterogeneity affects the extent of observed IBD allele sharing at map positions closely linked to a disease gene, with serious consequences for linkage analysis. An approach to modeling heterogeneity in allele sharing in ASPs was developed by Greenwood and Bull (48, 49), who generalized the LR approach of Risch (9, 10) to include covariates and developed an EM approach to parameter estimation. They proposed a multinomial logistic regression model for the inclusion of covariates, and applied it with multiple covariates (50). Olson (51) developed a closely related conditional-logistic model for ARP linkage analysis, allowing different types of relative pairs to be included in the same ARP analysis (see Goddard et al. (52) for an example application). This model is parameterized in terms of the logarithms of allele-sharing-specific relative risks and is equivalent to Risch's LR models (10). For multiple covariates, Xu et al. (53, 54) developed a recursive partitioning algorithm to identify nonlinear G×E interactions based on allele-sharing likelihood ratio tests. The Kong and Cox (12) linear and exponential likelihood models were extended by Nicolae (55) and Mirea et al. (26, 27) to model heterogeneity in allele sharing among affected relatives using a family-level covariate vector and a corresponding regression parameter in the likelihood function. Incorporating individual-level covariates in model-free linkage analysis is conceptually challenging, since linkage is based on sharing between individuals.
However, Whittemore and Halpern (56) proposed a method for including individual, covariate-based weights in NPL linkage analysis, and they showed that this approach could lead to increased power to detect linkage.

9. Imprinting and Transmission Ratio Distortion. Most linkage models assume that inheritance of alleles from parents follows the usual Mendelian rules. However, violations of these assumptions are known to exist. Particular parental alleles may be preferentially inherited, leading to transmission ratio distortion. Such effects may be mediated through imprinting in the parental genome. Linkage models have been extended to allow for parent-of-origin imprinting effects (57, 58), and the performance of affected sib pair models has been examined when transmission ratio distortion is present (59).

10. Genome-Wide Significance for Linkage. Guidelines for genome-wide significance levels for linkage analysis were proposed in
1995 and became generally accepted (60). For sibling pairs, these authors recommended that a P-value of 7 × 10⁻⁴ be considered suggestive linkage, but a P-value of 2 × 10⁻⁵ would be needed to conclude significant linkage. Slightly different thresholds apply for different relative pair types. These stringent thresholds were recommended to control for the fact that researchers would repeat analyses with increasing numbers of families and increasingly dense markers in the hope of finding significant linkage, and these numbers rest on the assumption of infinitely dense markers (completely informative for IBD patterns) and on large numbers of families. It may be impossible to achieve anything like these P-values in a small data set with a fixed marker set. The ability to detect linkage will also depend on the available family structures. Therefore, simulation (or gene dropping) is recommended to estimate the expected distribution of linkage test statistics specific to the data set, in the absence of linkage. The observed family structures (with the disease status) are held fixed, and marker data are "dropped" through each family from the founders down to the bottom of each pedigree, allowing for recombination at the normal rates. Tests of linkage can then be calculated on the simulated data, and a null distribution obtained by repeating the gene dropping many times.

11. Dependence of Multiple Relative Pairs from the Same Family. When using a linkage method for ARPs, multiple pairs from the same family are not independent. Methods such as the mean test or the proportion test for ASPs will give valid tests of linkage, but ARP likelihood-based methods, as described in Subheading 1.6, will be biased, generally inflating the evidence for linkage. Weighting schemes have been proposed (61) to reduce the total contribution of families containing multiple pairs; however, such weighting schemes can lead to overly conservative tests of linkage (62).
When many pedigrees contain more than a single affected pair, it is better to use a pedigree-based method (such as the NPL score or the Kong and Cox method) for calculating the evidence for linkage.

12. Intervals for Disease Gene Localization. An interval for the likely location of a putative disease locus is often constructed as the "1-LOD interval": all points on the chromosome with a LOD score greater than the peak LOD score minus one. For parametric linkage models, this simple interval has approximately 95% coverage (22). However, these intervals are not always accurate, especially for model-free linkage tests, and so several other interval estimates have been proposed. Sinha et al. (63) describe several proposed methods, in addition to developing a new Bayesian method that has better coverage than many of the other approaches.
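The gene-dropping procedure of Note 10 can be sketched in miniature. The following is our own simplified Python illustration (assumptions: affected sib pairs only, a single fully informative marker unlinked to disease, so no recombination needs to be modeled), showing how repeated drops through fixed family structures yield a data-set-specific null distribution of mean IBD sharing:

```python
import random

def drop_sib_pair(rng):
    """Drop one marker through a nuclear family: each parent carries two
    distinguishable alleles, and each child inherits one allele from each
    parent at random. Returns the number of alleles two sibs share IBD."""
    sib1 = (rng.randint(0, 1), rng.randint(0, 1))  # (paternal pick, maternal pick)
    sib2 = (rng.randint(0, 1), rng.randint(0, 1))
    return (sib1[0] == sib2[0]) + (sib1[1] == sib2[1])

def null_distribution(n_families, n_reps, seed=1):
    """Null distribution of mean IBD sharing across n_families affected sib pairs."""
    rng = random.Random(seed)
    return [sum(drop_sib_pair(rng) for _ in range(n_families)) / n_families
            for _ in range(n_reps)]

null = null_distribution(n_families=100, n_reps=2000)
# Under no linkage, IBD sharing is 0/1/2 with probabilities 1/4, 1/2, 1/4,
# so the mean sharing per family hovers around 1.
print(sum(null) / len(null))
```

An observed mean sharing statistic would then be referred to this simulated distribution to obtain an empirical P-value tailored to the available family structures.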
Acknowledgments

We acknowledge the support of research grants from the Natural Sciences and Engineering Research Council of Canada and the Canadian Network of Centres of Excellence in Mathematics (MITACS, Inc.).

References

1. Ott J (1996) Complex traits on the map. Nature 379: 772–773
2. Elston RC (2000) Introduction and overview. Statistical methods in genetic epidemiology. Statist Meth Med Res 9: 527–541
3. Fishman et al (1978) A robust method for the detection of linkage in familial disease. Amer J Hum Genet 30: 308–321
4. Suarez BK (1978) The affected sib pair IBD distribution for HLA-linked disease susceptibility genes. Tissue Antigens 12: 87–93
5. Kruglyak L, et al (1996) Parametric and nonparametric linkage analysis: a unified multipoint approach. Amer J Hum Genet 58: 1347–1363
6. Lander ES, Green P (1987) Construction of multilocus genetic maps in humans. Proc Natl Acad Sci 84: 2363–2367
7. Gudbjartsson DF, et al (2000) Allegro, a new computer program for multipoint linkage analysis. Nat Genet 25: 12–13
8. Gudbjartsson DF, et al (2005) Allegro version 2. Nat Genet 37: 1015–1016
9. Risch N (1990a) Linkage strategies for genetically complex traits. I. Multilocus models. Amer J Hum Genet 46: 222–228
10. Risch N (1990b) Linkage strategies for genetically complex traits. II. The power of affected relative pairs. Amer J Hum Genet 46: 229–241
11. Whittemore AS, Halpern J (1994) A class of tests of linkage using affected pedigree members. Biometrics 50: 118–127
12. Kong A, Cox NJ (1997) Allele-sharing models: LOD scores and accurate linkage tests. Amer J Hum Genet 61: 1179–1188
13. Kruglyak L, Lander ES (1998) Faster multipoint linkage analysis using Fourier transforms. J Comp Biol 5: 1–7
14. Markianos K, Daly MJ, Kruglyak L (2001) Efficient multipoint linkage analysis through reduction of inheritance space. Amer J Hum Genet 68: 963–977
15. Suarez BK, Van Eerdewegh P (1984) A comparison of three affected-sib-pair scoring methods to detect HLA-linked disease susceptibility genes. Amer J Med Genet 18: 135–146
16. Blackwelder WC, Elston RC (1985) A comparison of sib-pair linkage tests for disease susceptibility loci. Genet Epidemiol 2: 85–97
17. Tierney C, McKnight B (1993) Power of affected sibling method tests for linkage. Hum Hered 43: 276–287
18. Schaid DJ, Nick TG (1990) Sib-pair linkage tests for disease susceptibility loci: common tests vs. the asymptotically most powerful test. Genet Epidemiol 7: 359–730
19. Whittemore AS (1996) Genome scanning for linkage: an overview. Amer J Hum Genet 59: 704–716
20. Kruglyak L, Lander ES (1995) Complete multipoint sib-pair analysis of qualitative and quantitative traits. Amer J Hum Genet 57: 439–454
21. Holmans P (1993) Asymptotic properties of affected-sib-pair linkage analysis. Amer J Hum Genet 52: 362–374
22. Terwilliger JD, Ott J (1994) Handbook of human genetic linkage. The Johns Hopkins University Press, Baltimore
23. Orholm M, et al (1991) Familial occurrence of inflammatory bowel disease. New England J Med 324: 84–88
24. Tysk C, et al (1988) Ulcerative colitis and Crohn's disease in an unselected population of monozygotic and dizygotic twins. A study of heritability and the influence of smoking. Gut 29: 990–996
25. Rioux JD, et al (2000) Genome wide search in Canadian families with inflammatory bowel disease reveals two novel susceptibility loci. Amer J Hum Genet 66: 1863–1870
26. Mirea L (1999) Detection of heterogeneity in allele sharing of affected relatives. M.Sc. Thesis, University of Toronto
27. Mirea L, Briollais L, Bull S (2004) Tests for covariate-associated heterogeneity in IBD allele sharing of affected relatives. Genet Epidemiol 26: 44–60
28. Kruglyak L, Lander ES (1995) High-resolution genetic mapping of complex traits. Amer J Hum Genet 56: 1212–1223
29. Rioux JD, et al (2001) Genetic variation in the 5q31 cytokine gene cluster confers susceptibility to Crohn disease. Nat Genet 29: 223–228
30. Walters TD, Silverberg MS (2006) Genetics of inflammatory bowel disease: current status and future directions. Can J Gastroenterol 20: 633–639
31. Evans DM, Cardon LR (2004) Guidelines for genotyping in genome wide linkage studies: single-nucleotide-polymorphism maps versus microsatellite maps. Amer J Hum Genet 75: 687–692
32. Nicolae DL, Kong A (2004) Measuring the relative information in allele-sharing linkage studies. Biometrics 60: 368–375
33. Goring HH, Terwilliger JD (2000) Linkage analysis in the presence of errors I: complex-valued recombination fractions and complex phenotypes. Amer J Hum Genet 66: 1095–1106
34. Margaritte-Jeannin P, et al (1997) Heterogeneity of marker allele frequencies hinders interpretation of linkage analysis: Illustration on chromosome 18 markers. Genet Epidemiol 14: 669–674
35. Williamson JA, Amos CI (1995) Guess LOD approach: sufficient conditions for robustness. Genet Epidemiol 12: 163–176
36. Huang Q, Shete S, Amos CI (2004) Ignoring linkage disequilibrium among tightly linked markers induces false-positive evidence of linkage for affected sib pair analysis. Amer J Hum Genet 75: 1106–1112
37. Schaid DJ, et al (2002) Caution on pedigree haplotype inference with software that assumes linkage equilibrium. Amer J Hum Genet 71: 992–995
38. Cho K, Dupuis J (2009) Handling linkage disequilibrium in qualitative trait linkage analysis using dense SNPs: a two-step strategy. BMC Genet 10: 44
39. Abecasis GR, Wigginton JE (2005) Handling marker-marker linkage disequilibrium: Pedigree analysis with clustered markers. Amer J Hum Genet 77: 754–767
40. Stewart WC, Peljto AL, Greenberg DA (2010) Multiple subsampling of dense SNP data localizes disease genes with increased precision. Hum Hered 69: 152–159
41. Gershon ES, Goldin LR (1986) Clinical methods in psychiatric genetics, I: Robustness of genetic marker investigative strategies. Acta Psychiatr Scand 74: 113–118
42. Zouk H, et al (2007) The effect of genetic variation of the serotonin 1B receptor gene on impulsive aggressive behavior and suicide. Amer J Med Genet B Neuropsychiatr Genet 144B: 996–1002
43. McPeek MS (1999) Optimal allele-sharing statistics for genetic mapping using affected relatives. Genet Epidemiol 16: 225–249
44. Margaritte-Jeannin P, Babron MC, Clerget-Darpoux F (2007) On the choice of linkage statistics. BMC Proc 2007 Suppl 1: S102
45. Basu S, et al (2010) A likelihood-based trait-model-free approach for linkage detection of a binary trait. Biometrics 66: 201–213
46. Risch N, Zhang H (1995) Extreme discordant sib pairs for mapping quantitative trait loci in humans. Science 268: 1584–1589
47. Rogus JJ, Krolewski AS (1996) Using discordant sib pairs to map loci for qualitative traits with high sibling recurrence risk. Amer J Hum Genet 59: 1376–1381
48. Greenwood CMT, Bull SB (1997) Incorporation of covariates into genome scanning using sib-pair analysis in bipolar affective disorder. Genet Epidemiol 14: 635–640
49. Greenwood CMT, Bull SB (1999) Analysis of affected sib pairs, with covariates: with and without constraints. Amer J Hum Genet 64: 871–885
50. Bull SB, et al (2002) Regression models for allele sharing: analysis of accumulating data in affected sib pair studies. Statist Med 21: 431–444
51. Olson JM (1999) A general conditional-logistic model for affected-relative-pair linkage studies. Amer J Hum Genet 65: 1760–1769
52. Goddard KA, et al (2001) Model-free linkage analysis with covariates confirms linkage of prostate cancer to chromosomes 1 and 4. Amer J Hum Genet 68: 1197–1206
53. Xu W, et al (2005) Recursive partitioning models for linkage in COGA data. BMC Genet 6 Suppl 1: S38
54. Xu W, et al (2006) A tree-based model for allele-sharing-based linkage analysis in human complex diseases. Genet Epidemiol 30: 155–169
55. Nicolae DL (1999) Allele sharing models in gene mapping: A likelihood approach. Ph.D. thesis, Dept. Statistics, Univ. Chicago
56. Whittemore AS, Halpern J (2006) Nonparametric linkage analysis using person-specific covariates. Genet Epidemiol 30: 369–379
57. Strauch K, et al (2000) Parametric and nonparametric multipoint linkage analysis with imprinting and two-locus-trait models: application to mite sensitization. Amer J Hum Genet 66: 1945–1957
58. Sinsheimer JS, Blangero J, Lange K (2000) Gamete-competition models. Amer J Hum Genet 66: 1168–1172
59. Greenwood CMT, Morgan K (2000) The impact of transmission ratio distortion on allele sharing in affected sibling pairs. Amer J Hum Genet 66: 2001–2004
60. Lander ES, Kruglyak L (1995) Genetic dissection of complex traits: Guidelines for interpreting and reporting linkage results. Nat Genet 11: 241–247
61. Sham PC, Zhao JH, Curtis D (1997) Optimal weighting scheme for affected sib-pair analysis of sibship data. Ann Hum Genet 61: 61–69
62. Greenwood CMT, Bull SB (1999) Downweighting of multiple affected sib pairs leads to biased likelihood ratio tests under no linkage. Amer J Hum Genet 64: 1248–1252
63. Sinha R, et al (2009) Bayesian intervals for linkage locations. Genet Epidemiol 33: 604–616
Chapter 18

Single Marker Association Analysis for Unrelated Samples

Gang Zheng, Jinfeng Xu, Ao Yuan, and Joseph L. Gastwirth

Abstract

Methods for single marker association analysis are presented for binary and quantitative traits. For a binary trait, we focus on the analysis of retrospective case–control data using Pearson's chi-squared test, the trend test, and a robust test. For a continuous trait, typical methods are based on a linear regression model or the analysis of variance. We illustrate how these tests can be applied using the publicly available R package "Rassoc" and some existing R functions. Guidelines for choosing among these test statistics are provided.

Key words: Additive, Association, ANOVA, Binary trait, Case–control design, Dominant, Genetic model, Genotype relative risks, MAX3, Mode of inheritance, Penetrance, Rassoc, Recessive, Quantitative trait, Robustness
1. Introduction

Statistical procedures for testing whether there is an association between a phenotype and a single nucleotide polymorphism (SNP) are described and illustrated. Usually, the phenotype of interest is either binary or quantitative. For a binary trait, we focus on a retrospective case–control study, in which cases and controls are randomly drawn from the case and control populations, respectively. For a continuous trait, the data are obtained from a random sample of the general population. Although a large number of SNPs is available for testing association, single marker analysis is often employed. The significance level to test a single hypothesis is 0.05. When multiple SNPs are tested, the Bonferroni correction can be applied.

Denote the genotypes of an SNP as G0, G1, and G2. For case–control data, denote the penetrances as f0, f1, and f2 with respect to the three genotypes, respectively. Under the null hypothesis H0, we have f0 = f1 = f2 = Pr(case). A genetic model is recessive if f1 = f0, additive if f1 = (f0 + f2)/2, or dominant if f1 = f2. For a single SNP,

Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_18, © Springer Science+Business Media, LLC 2012
G. Zheng et al.
the observed case–control data consist of genotype counts (r0, r1, r2) among r cases and (s0, s1, s2) among s controls. Denote nj = rj + sj (j = 0, 1, 2) and n = r + s.

The Cochran-Armitage trend test (referred to as the trend test) is one of the two most commonly used statistics for the analysis of case–control data. It can be written as (1, 2)

T_1(x) = \frac{\sum_{j=0}^{2} x_j \{(1-\varphi) r_j - \varphi s_j\}}{\big[ n\varphi(1-\varphi) \big\{ \sum_{j=0}^{2} x_j^2 n_j/n - \big( \sum_{j=0}^{2} x_j n_j/n \big)^2 \big\} \big]^{1/2}}    (1)

where (x0, x1, x2) = (0, x, 1), x is determined by the genetic model, and φ = r/n. Under H0, given x, T1(x) asymptotically follows a standard normal distribution N(0,1). When the genetic model is recessive and the risk allele is known, T1(0) is used. When the genetic model is dominant and the risk allele is known, T1(1) is used. When we only know that the genetic model is recessive (or dominant) but not the risk allele, T1(0) (or T1(1)) cannot be used alone. When the genetic model is additive, regardless of the risk allele, T1(1/2) is used.

Pearson's chi-squared test (referred to as Pearson's test) is another commonly used test. It can be written as

T_2 = \sum_{j=0}^{2} \frac{(r_j - r n_j/n)^2}{r n_j/n} + \sum_{j=0}^{2} \frac{(s_j - s n_j/n)^2}{s n_j/n}.

Under H0, T2 asymptotically follows a chi-squared distribution with two degrees of freedom (df), denoted as χ²₂. The robust test, MAX3, is given by

MAX3 = max(|T_1(0)|, |T_1(1/2)|, |T_1(1)|),

where T1(0), T1(1/2), and T1(1) are the trend tests given by Eq. 1 with different x values. Under H0, the asymptotic distribution of MAX3 is far more complex than those of T1(x) and T2. A procedure for determining its asymptotic null distribution and P-value is discussed in Note 2. For other robust tests, see Note 3. The power of each statistical test depends on the underlying genetic model. It will be seen that MAX3 is more robust than any single trend test or Pearson's chi-squared test when the genetic model is unknown; thus, it should be used in practice. Since MAX3 does not follow any chi-squared distribution, we discuss how to find its P-value using the R package Rassoc (3). This package is available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=Rassoc.

For a continuous trait Y, a typical model is Y = μ + g + ε, where μ is a fixed overall mean of the trait under H0, g is the random genetic effect due to G, and ε is a random error. The genetic value of g is -a when G = G0, d when G = G1, and a when G = G2. Under H0, we have a = d = 0. A genetic model is
recessive, additive, or dominant if d = -a, d = 0, or d = a, respectively. The observed data consist of pairs (Yij, Gj) for i = 1, ..., nj and j = 0, 1, 2, where Yij is the trait value of the ith individual with genotype Gj. Denote n = n0 + n1 + n2.

For the analysis of a quantitative trait, linear regression and the analysis of variance (ANOVA) are routinely used. Let (x0, x1, x2) = (0, x, 1) be the values for the genotypes (G0, G1, G2), where x = 0, 1/2, or 1 for the recessive, additive, or dominant models, respectively. Using the data (Yij, Gj), i = 1, ..., nj and j = 0, 1, 2, the F-test derived from a linear regression model is given by

F(x) = (n-2)\, \frac{\big\{ \sum_{j} \sum_{i} (x_j - \bar{x})(Y_{ij} - \bar{Y}) \big\}^2}{\sum_{j} \sum_{i} (x_j - \bar{x})^2 \sum_{j} \sum_{i} (Y_{ij} - \bar{Y})^2 - \big\{ \sum_{j} \sum_{i} (x_j - \bar{x})(Y_{ij} - \bar{Y}) \big\}^2}    (2)

where \bar{x} = \sum_{j=0}^{2} n_j x_j / n and \bar{Y} = \sum_{j=0}^{2} \sum_{i=1}^{n_j} Y_{ij} / n. Given x, F(x) has an asymptotic F-distribution with (1, n-2) df under H0.

Alternatively, denote \bar{Y}_{\cdot j} = \sum_{i=1}^{n_j} Y_{ij} / n_j for j = 0, 1, 2, SS_b = \sum_{j=0}^{2} n_j (\bar{Y}_{\cdot j} - \bar{Y})^2, and SS_e = \sum_{j=0}^{2} \sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_{\cdot j})^2. The F-test derived from the ANOVA is given by

F = \frac{SS_b/2}{SS_e/(n-3)}    (3)

which, under H0, has an asymptotic F-distribution with (2, n-3) df. We illustrate how to use these F-tests with some existing R functions later.

The allele-based analysis, valid under Hardy–Weinberg equilibrium (HWE), has performance similar to that of the genotype-based analysis under the additive model. Therefore, we focus on the genotype-based analysis, which does not require HWE. In Note 4, we present some power comparisons of different F-tests for quantitative traits.
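Equations 2 and 3 translate directly into code. The following is a minimal Python sketch of the two F-statistics (the chapter itself uses R functions for this; the helper names here are our own), checked on a small toy data set:

```python
def regression_F(groups, x_scores):
    """F(x) of Eq. 2: groups is a list of three lists of trait values for
    (G0, G1, G2); x_scores are the genotype scores (x0, x1, x2)."""
    n = sum(len(g) for g in groups)
    xbar = sum(xj * len(g) for xj, g in zip(x_scores, groups)) / n
    ybar = sum(y for g in groups for y in g) / n
    sxx = sum((xj - xbar) ** 2 * len(g) for xj, g in zip(x_scores, groups))
    syy = sum((y - ybar) ** 2 for g in groups for y in g)
    sxy = sum((xj - xbar) * (y - ybar)
              for xj, g in zip(x_scores, groups) for y in g)
    return (n - 2) * sxy ** 2 / (sxx * syy - sxy ** 2)

def anova_F(groups):
    """F of Eq. 3: between-group vs. within-group sums of squares."""
    n = sum(len(g) for g in groups)
    ybar = sum(y for g in groups for y in g) / n
    ss_b = sum(len(g) * (sum(g) / len(g) - ybar) ** 2 for g in groups)
    ss_e = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g)
    return (ss_b / 2) / (ss_e / (n - 3))

groups = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]  # toy trait values for G0, G1, G2
print(regression_F(groups, (0, 0.5, 1)))  # additive scores; ~10.667 here
print(anova_F(groups))                    # 4.0 here
```

The resulting values would be referred to F-distributions with (1, n-2) and (2, n-3) df, respectively, as described in the text.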
2. Methods

2.1. Analysis of Case–Control Data
Denote the genotypes as (G0, G1, G2) = (AA, AB, BB). If we happen to know the risk allele, it is always denoted as B. The choice of test statistic depends on which of the following four situations holds: (1) the genetic model and the risk allele are known, (2) the genetic model is known but not the risk allele, (3) the risk allele is known but not the genetic model, and (4) neither the genetic model nor the risk allele is known. Common genetic models include recessive, additive, and dominant. It is important that one does not determine the genetic model and/or the risk allele from
the same data that will be used in the subsequent association analysis. Of course, the genetic model and/or the risk allele may be known based on scientific knowledge or information from previous data. In our view, (3) and (4) are the most common situations in practice.

2.1.1. Which Test to Use?
Which test to choose depends on each of the four situations outlined above (see Note 1). Let χ²₂(1-α) be the upper 100(1-α)th percentile of χ²₂ and z(1-α) be the upper 100(1-α)th percentile of N(0,1).

1. The genetic model and the risk allele are known. T1(x) is optimal and should be used. Since the risk allele is known, a one-sided H1 is used and z(0.95) = 1.645. For the recessive (additive, or dominant) model, reject H0 if T1(0) > z(0.95) (T1(1/2) > z(0.95), or T1(1) > z(0.95)). In each case, the P-value equals the probability of Z > T1(x), where Z ~ N(0,1).

2. The genetic model is known but not the risk allele. When the model is additive, use T1(1/2) and reject H0 if |T1(1/2)| > z(0.975) = 1.96. The P-value equals two times the probability of Z > |T1(1/2)|. When the model is recessive or dominant, use MAX3 (see (3) next).

3. The genetic model is unknown (regardless of the risk allele). Use MAX3. Three approaches are available to calculate the P-value of MAX3 using the R package Rassoc, but we recommend the one based on the asymptotic null distribution of MAX3.

4. The same as (3). Thus, we only discuss (3) in the following.

Note that we do not recommend T2, because it is always less powerful than MAX3 (see Note 1). If T2 is used, reject H0 if T2 > χ²₂(0.95) = 5.9915. The P-value equals the probability of T > T2, where T ~ χ²₂.
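The decision rules above can be collected into a small helper. The following function is our own illustrative encoding of the four situations (it is not part of Rassoc):

```python
def recommended_test(model=None, risk_allele_known=False):
    """Map the situations of Subheading 2.1.1 to a recommended statistic.
    model: 'recessive', 'additive', 'dominant', or None if unknown."""
    if model is None:             # situations (3) and (4): model unknown
        return "MAX3"
    if model == "additive":       # T1(1/2) is optimal regardless of risk allele
        return "T1(1/2)"
    if not risk_allele_known:     # recessive/dominant but risk allele unknown
        return "MAX3"
    return "T1(0)" if model == "recessive" else "T1(1)"

print(recommended_test())                                    # MAX3
print(recommended_test("additive"))                          # T1(1/2)
print(recommended_test("dominant", risk_allele_known=True))  # T1(1)
```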
2.1.2. Examples Using R
The R package Rassoc can be loaded from CRAN; in R, running install.packages("Rassoc") and then library(Rassoc) makes its functions available.
There are two functions, CATT(data, x) and MAX3(data, method, m), in the package for computing the trend tests and MAX3 and their P-values. Pearson's test and its P-value can be obtained using an existing R function. In both functions, the "data" comprises a 2 × 3 contingency table, i.e., genotype counts (r0, r1, r2) for cases and (s0, s1, s2) for controls. In the first function, "x" is 0, 0.5, or 1 for the recessive, additive, or dominant models, respectively. In the second function, the "method" refers to the procedure used to calculate the P-value of MAX3. Three methods are available: "boot" for the bootstrap procedure, "bvn" for the bivariate normal procedure, or "asy" for the asymptotic procedure.
The first two procedures are simulation-based and the last one is based on the asymptotic distribution of MAX3 (see Note 2). The "m" in the second function refers to the number of replicates when "boot" or "bvn" is used. When "asy" is used, "m" can be any positive integer. For illustration, we use an SNP (rs420259) reported by the WTCCC (4), which was the only SNP showing strong association with bipolar disorder in a genome-wide association study (GWAS) with 500,000 SNPs (the actual number of SNPs tested after quality control steps is less than 500,000). The genome-wide significance level used by (4) was 5 × 10⁻⁷ for strong association. The genotype counts are (r0, r1, r2) = (83, 755, 1020) and (s0, s1, s2) = (260, 1134, 1537). The data can be entered as follows.
To check that the data are correctly entered, just type a.
We may not have sufficient scientific knowledge to claim a priori which allele (A or B) is the risk one and what the true genetic model is. For illustration purposes, let us say we know a priori that the true model is dominant and that B is the risk allele. The analysis is carried out based on the three situations outlined before.

1. If we know the genetic model is dominant and the risk allele is B, apply T1(1) using the R function CATT as follows.
The output shows that |T1(1)| = 5.7587 and its P-value is 8.478 × 10⁻⁹. This is a two-sided test; we use a one-sided test because the risk allele is known. Thus, the actual P-value is half of the reported one, that is, 4.239 × 10⁻⁹, which is less than the significance level 5 × 10⁻⁷. Hence we reject H0.

2. If we know the genetic model is dominant but not the risk allele, use MAX3, not T1(1). Suppose we happen to enter the data as b and apply T1(1) as before.
The two-sided P-value is 0.09656, which is not significant. This example shows that knowing the risk allele is necessary for using T1(0) or T1(1). The use of MAX3 is illustrated in (3) later. We first show how to use the following R function to obtain Pearson's test T2 = 33.165 and its P-value 6.285 × 10⁻⁸. This P-value is also significant but larger than that of T1(1) in case (1), because T1(1) is optimal for the dominant model. If we apply T2 to the dataset b, we would obtain the same results.
3. If we do not know the genetic model, we apply MAX3 and calculate its P-value using the "asy" procedure. The reported statistic is MAX3 = 5.7587 with P-value = 2.347 × 10⁻⁸. Thus, we reject H0. Note that this P-value is smaller than that of T2 but larger than that of T1(1) in case (1).
If x = 0.5 is used in T1(x) regardless of the true genetic model and the risk allele, the following results show that the P-value of T1(1/2) is not significant.
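The MAX3 value itself is just the largest of the three absolute trend statistics, which also shows why T1(1/2) alone falls short of the 5 × 10⁻⁷ threshold here. A sketch with our own helper (not the Rassoc code):

```python
import math

def catt(r, s, x):
    # Cochran-Armitage trend statistic with genotype scores (0, x, 1)
    scores = (0.0, x, 1.0)
    cases, ctrls = sum(r), sum(s)
    n = cases + ctrls
    cols = [a + b for a, b in zip(r, s)]
    u = sum(sc * (ctrls * a - cases * b) for sc, a, b in zip(scores, r, s))
    var = (cases * ctrls / n) * (
        n * sum(sc * sc * c for sc, c in zip(scores, cols))
        - sum(sc * c for sc, c in zip(scores, cols)) ** 2)
    return u / math.sqrt(var)

r, s = (83, 755, 1020), (260, 1134, 1537)
z = {x: catt(r, s, x) for x in (0.0, 0.5, 1.0)}
max3 = max(abs(v) for v in z.values())
p_half = math.erfc(abs(z[0.5]) / math.sqrt(2))   # two-sided P of T1(1/2)
print(round(max3, 4))   # the reported MAX3 of 5.7587
print(p_half)           # roughly 2e-4: far larger than 5e-7
```

The maximum is attained at x = 1, as expected when the true model is dominant.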
2.2. Quantitative Trait

2.2.1. Which Test to Use?
For a continuous trait, the four situations outlined before also apply. Let F_{u,v}(1 − α) be the upper 100(1 − α)th percentile of an F-distribution with (u, v) df. 1. When the genetic model and the risk allele are known, the statistic F(x) given in Eq. 2 is used. Since the risk allele is known, a one-sided H1 is used. For the recessive (additive, dominant) model, reject H0 if F(0) > F_{1,n−2}(0.95) (F(1/2) > F_{1,n−2}(0.95), F(1) > F_{1,n−2}(0.95)), where F(0) (F(1/2), F(1)) is the observed statistic. In each case, the P-value equals half of the probability that f_{1,n−2} > F(x), where f_{1,n−2} follows an F-distribution with (1, n − 2) df. 2. The genetic model is known but not the risk allele. When the genetic model is additive, F(x) given in Eq. 2 is used (with x = 1/2). Reject H0 if F(1/2) > F_{1,n−2}(0.95), where F(1/2) is the observed statistic. The P-value equals the probability that f_{1,n−2} > F(1/2). When the genetic model is not additive
18 Single Marker Association Analysis for Unrelated Samples
(either recessive or dominant), F given in Eq. 3 is used. Reject H0 if F > F_{2,n−3}(0.95), where F is the observed statistic. The P-value equals the probability that f_{2,n−3} > F, where f_{2,n−3} follows an F-distribution with (2, n − 3) df. 3. When the genetic model is unknown, F given in Eq. 3 is used. The rejection rule and P-value are similar to those in case (2) when F is used. 4. The same as (3); thus, we focus on (3) only.

2.2.2. Examples Using R
For illustration, we simulated a dataset called "QTLex.txt," which contains (Y, G) for n = 100 individuals. In the simulation, the true model was dominant and the risk allele was B with population frequency 0.3. HWE was assumed in the population. The heritability was set to 0.1. The trait Y was simulated from a normal distribution using the model given in Subheading 1, where μ = 0, E(ε) = 0, and Var(ε) = 1. The genotype G is AA, AB, or BB. The data can be read as follows.
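The file QTLex.txt and the corresponding R listing are not reproduced in this excerpt. A sketch of how such data might be generated follows; the mapping from heritability 0.1 to the genetic effect size a is our assumption, not taken from the chapter:

```python
import random

random.seed(1)
p, n, h2 = 0.3, 100, 0.1   # B-allele frequency, sample size, heritability

# Assumption of this sketch: the dominant genetic value is a*1{G != AA}, with
# a chosen so that Var(genetic)/(Var(genetic) + Var(eps)) = h2 and Var(eps) = 1.
carrier = 1 - (1 - p) ** 2                              # P(AB or BB) under HWE
a = (h2 / (1 - h2) / (carrier * (1 - carrier))) ** 0.5

rows = []
for _ in range(n):
    n_b = sum(random.random() < p for _ in range(2))    # B alleles per person
    genotype = ("AA", "AB", "BB")[n_b]
    y = a * (n_b > 0) + random.gauss(0.0, 1.0)          # mu = 0, eps ~ N(0, 1)
    rows.append((y, genotype))

print(len(rows), sorted({g for _, g in rows}))
```

Writing `rows` to a two-column text file would give a dataset of the same shape as QTLex.txt.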
If we know the true genetic model (dominant) and the risk allele a priori, we use F(x) given in Eq. 2 with x = 1 as follows.
The function “as.integer(G)” assigns 1 for AA, 2 for AB, and 3 for BB. Thus, “objF1 = aov(Y ~ (as.integer(G)==1), data = c)” conducts the ANOVA comparing the mean trait values between the two genotype groups: AA and AB + BB. In the first line of the output, “F value” is the statistic F(1) and “Pr(>F)” is the P-value. These values are reported in the second line. In this example, F(1) = 31.317 and the P-value is 1.993 × 10⁻⁷. The strength of association is indicated by “***” near the P-value, and the interpretation of this significance code is given in the last line of the output. Since the risk allele is known, a one-sided test should be used. Thus, the actual P-value is half of the reported one, i.e., 9.965 × 10⁻⁸. This P-value is very significant compared to the 0.05 significance level. If we did not know the genetic model, the statistic F given in Eq. 3 should be used. See the following output. In this case, we would not assign scores (1, 2, 3) to the three genotypes. The output given below shows F = 15.575 with P-value 1.361 × 10⁻⁶,
which is also significant, but larger than the P -value obtained from F(1), which is the most powerful test when the true model is dominant.
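The aov computations above are one-way ANOVAs over genotype groups. As a hand computation on made-up numbers (not the QTLex.txt data), the same F statistic covers both the two-group comparison of F(1) and the three-group, model-free F of Eq. 3:

```python
def oneway_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    all_y = [y for g in groups for y in g]
    n, k = len(all_y), len(groups)
    grand = sum(all_y) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# made-up trait values: two groups (AA vs. AB + BB) as in F(1), then three
# genotype groups as in the model-free F of Eq. 3
print(oneway_f([[0.0, 1.0], [2.0, 3.0]]))              # 8.0 with (1, 2) df
print(oneway_f([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]))  # 16.0 with (2, 3) df
```

Referring each observed value to the appropriate F-distribution then gives the P-values reported by aov.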
For illustration, we also calculate F(1/2) and F(0) and their P -values. F(1/2) can be obtained by
In this case, “as.integer(G)” is equivalent to using scores 1, 2, 3 for the three genotypes (under the additive model). The reported P-value is 1.011 × 10⁻⁶, which is the P-value if we do not know the risk allele. If we know the risk allele, the P-value is 1.011 × 10⁻⁶/2 = 5.055 × 10⁻⁷. Both one-sided and two-sided P-values are significant. Interestingly, even when the risk allele is unknown, the two-sided P-value of F(1/2) (1.011 × 10⁻⁶) is smaller than that of F. The test F(0), which is optimal for a recessive model, is obtained as follows.
In this case, the ANOVA is applied with two genotype groups: AA + AB and BB. The reported P-value is 0.1398. If we know the risk allele, the P-value is 0.1398/2 = 0.0699, which is not significant at the 0.05 level. This illustrates that a loss of power can occur if the test used is not appropriate for the underlying genetic model.
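For the score-based test F(1/2), the ANOVA with scores 1, 2, 3 reduces to simple linear regression of Y on the score. A minimal sketch on made-up numbers (not the chapter's data):

```python
def regression_f(x, y):
    """F statistic with (1, n-2) df from simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    ssr = sxy ** 2 / sxx            # regression sum of squares
    sse = sst - ssr                 # residual sum of squares
    return ssr / (sse / (n - 2))

# made-up scores (1 = AA, 2 = AB, 3 = BB) and trait values
print(round(regression_f([1, 2, 3], [1.0, 2.0, 2.0]), 6))   # 3.0
```

The same routine applied to the QTLex.txt scores would reproduce the additive F(1/2) reported above.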
3. Notes

1. Choosing among the trend tests, Pearson's test, and MAX3. In practice, neither the genetic model nor the risk allele is known. The trend test T1(1/2) and Pearson's test T2 are not robust when they are used alone. A robust test should protect against substantial loss of power when the model is misspecified (5, 6). To examine which test is most robust across the three genetic models, we conducted a simulation, choosing the genotype relative risk (GRR), given by f2/f0, for a given x so that the optimal trend test for that x had about 80% power. The results are reported in Table 1; the power of the optimal test for each genetic model is marked with an asterisk, the minimum power of each test across the three genetic models is also presented, and p = Pr(B).

Table 1 Empirical power (%) and robustness of different tests for the analysis of case–control data

p      x     GRR    T1(0)    T1(1/2)   T1(1)    T2       MAX3
0.10   0.0   3.15   80.40*   30.30     11.56    70.58    72.42
0.10   0.5   1.88   21.23    80.30*    79.20    73.09    76.89
0.10   1.0   1.47    8.94    77.60     79.85*   72.78    75.07
       Min           8.94    30.30     11.56    70.58    73.42
       Max of min                                        72.42
0.30   0.0   1.65   80.37*   52.07     15.52    71.18    72.46
0.30   0.5   1.60   44.78    81.71*    76.59    73.01    76.96
0.30   1.0   1.38   12.91    71.00     80.73*   70.92    72.54
       Min          12.91    52.07     15.52    70.92    72.54
       Max of min                                        72.54
0.45   0.0   1.45   79.69*   61.13     16.46    70.25    72.27
0.45   0.5   1.57   56.52    80.01*    67.91    71.52    76.07
0.45   1.0   1.44   14.63    64.92     80.81*   71.72    73.74
       Min          14.63    61.13     16.46    70.25    73.74
       Max of min                                        73.74

The test with the highest minimum power across the three possible underlying genetic models is the most robust. The results show that the power of T1(1/2) ranges from about 30 to 80% for p = 0.1, from 50 to 80% for p = 0.3, and from 60 to 80% for p = 0.45. MAX3 is the most robust test: its power always exceeds 70% regardless of the underlying genetic model or the allele frequency p. The minimum power of T2 across the three genetic models also exceeds 70%, although T2 has slightly lower power than MAX3 in these simulation studies. More extensive simulations and results can be found in ref. 7.

2. The asymptotic null distribution and P-value of MAX3. Three approaches to computing the P-value of MAX3 are presented in ref. 3. Let ρxy be the asymptotic null correlation of T1(x) and T1(y), where x, y = 0, 1/2, 1, and let pj be the population frequency of genotype Gj (j = 0, 1, 2). Then

ρxy = [(xy·p1 + p2) − (x·p1 + p2)(y·p1 + p2)] / { [(x²p1 + p2) − (x·p1 + p2)²]^(1/2) [(y²p1 + p2) − (y·p1 + p2)²]^(1/2) }.
Denote ω0 = (ρ0,1/2 − ρ0,1 ρ1/2,1)/(1 − ρ²0,1) and ω1 = (ρ1/2,1 − ρ0,1 ρ0,1/2)/(1 − ρ²0,1). In the following, ρxy, ω0, and ω1 are estimated under H0 by replacing pj with p̂j = nj/n (j = 1, 2). The asymptotic distribution of MAX3 under H0 is far more complex than those of the trend test and Pearson's test. An expression for the asymptotic null distribution of MAX3, P(t) = Pr(MAX3 < t), is given by

P(t) = 2 ∫₀^{t(1−ω1)/ω0} Φ[(t − ρ0,1u)/√(1 − ρ²0,1)] φ(u) du
     − 2 ∫₀^{t} Φ[(−t − ρ0,1u)/√(1 − ρ²0,1)] φ(u) du
     + 2 ∫_{t(1−ω1)/ω0}^{t} Φ[((t − ω0u)/ω1 − ρ0,1u)/√(1 − ρ²0,1)] φ(u) du,   (4)

where φ and Φ are the density and distribution functions of N(0, 1) (3). Using Eq. 4, the asymptotic P-value of MAX3 is given by 1 − P(max3), where max3 is the observed MAX3. This approach is denoted "asy" in the second R function in Rassoc. Alternatively, simulations can be used to approximate the null distribution of MAX3. Given the observed data (r0, r1, r2) and (s0, s1, s2), in the jth simulation (j = 1, ..., m) we generate (r0j, r1j, r2j) from the multinomial distribution Mul(r; p̂0, p̂1, p̂2) and (s0j, s1j, s2j) from the same distribution with r replaced by s, where p̂i = ni/n (i = 0, 1, 2). For each j, we compute MAX3, denoted MAX3j. Then MAX31, ..., MAX3m form an empirical null distribution of MAX3 when m is large enough. For single-marker analysis, we use m = 100,000 to determine the null distribution; a larger m may be used if the P-value of MAX3 is smaller than 10⁻⁵. This parametric bootstrap procedure is denoted "boot" in the second R function in Rassoc. A more efficient simulation approach, denoted "bvn" in the second R function in Rassoc, is to directly generate T1(0) and T1(1) in the jth simulation from a bivariate normal distribution with zero means, unit variances, and correlation ρ0,1, and then compute T1(1/2) = ω0T1(0) + ω1T1(1) and MAX3, denoted MAX3j. Hence, MAX31, ..., MAX3m form an empirical null distribution of MAX3.
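As a numerical sanity check of the identity T1(1/2) = ω0T1(0) + ω1T1(1) used by the "bvn" procedure, the correlations and weights defined in this note can be evaluated at the genotype frequencies of the rs420259 example. The statistic values t0 and t1 below are re-derived from the same counts, not quoted from the chapter:

```python
import math

# genotype column totals from the rs420259 example
n1, n2 = 755 + 1134, 1020 + 1537
n = (83 + 260) + n1 + n2
p1, p2 = n1 / n, n2 / n

def rho(x, y):
    """Asymptotic null correlation of the trend tests T1(x) and T1(y)."""
    num = (x * y * p1 + p2) - (x * p1 + p2) * (y * p1 + p2)
    den = math.sqrt(((x * x * p1 + p2) - (x * p1 + p2) ** 2)
                    * ((y * y * p1 + p2) - (y * p1 + p2) ** 2))
    return num / den

r01, r0h, rh1 = rho(0, 1), rho(0, 0.5), rho(0.5, 1)
w0 = (r0h - r01 * rh1) / (1 - r01 ** 2)
w1 = (rh1 - r01 * r0h) / (1 - r01 ** 2)

# trend statistics re-derived from the same counts (not quoted from the text)
t0, t1 = 1.6618, 5.7587              # |T1(0)| and |T1(1)|
print(round(w0 * t0 + w1 * t1, 2))   # about 3.70, matching T1(1/2)
```

The weighted combination recovers the T1(1/2) value directly computed from the table, as the identity requires.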
We recommend using the method "asy" to compute the P-value of MAX3, especially for small P-values, which would otherwise require a large number of replicates m. For example, in the illustration in Subheading 2.1, to obtain an accurate estimate of a P-value as small as 10⁻⁸, we need at least 10 million replicates, which is computationally intensive. With a smaller m, say m = 100,000, the estimated P-value would be 0, which may be reported as "0" or as "< 2.2e-16" by R. In the following example, we used the "boot" procedure with 100,000 replicates. The output shows the P-value is less than 2.2 × 10⁻¹⁶.
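The "boot" procedure can be sketched as follows (a simplified illustration with our own helper functions, not the Rassoc implementation; m is kept very small here, which is exactly why such an extreme statistic yields an estimated P-value of 0):

```python
import math, random

def catt(r, s, x):
    # Cochran-Armitage trend statistic with genotype scores (0, x, 1)
    scores = (0.0, x, 1.0)
    cases, ctrls = sum(r), sum(s)
    n = cases + ctrls
    cols = [a + b for a, b in zip(r, s)]
    u = sum(sc * (ctrls * a - cases * b) for sc, a, b in zip(scores, r, s))
    var = (cases * ctrls / n) * (
        n * sum(sc * sc * c for sc, c in zip(scores, cols))
        - sum(sc * c for sc, c in zip(scores, cols)) ** 2)
    return u / math.sqrt(var)

def max3(r, s):
    return max(abs(catt(r, s, x)) for x in (0.0, 0.5, 1.0))

def boot_pvalue(r, s, m=300, seed=7):
    """Parametric bootstrap: resample genotype counts under H0 (pooled
    genotype frequencies) and compare the observed MAX3 with the
    resulting empirical null distribution."""
    rng = random.Random(seed)
    cases, ctrls = sum(r), sum(s)
    pooled = [a + b for a, b in zip(r, s)]
    n = sum(pooled)
    probs = [c / n for c in pooled]          # estimated p_i under H0
    observed = max3(r, s)

    def draw(size):                          # multinomial draw of counts
        counts = [0, 0, 0]
        for g in rng.choices((0, 1, 2), weights=probs, k=size):
            counts[g] += 1
        return counts

    exceed = sum(max3(draw(cases), draw(ctrls)) >= observed for _ in range(m))
    return exceed / m

p = boot_pvalue((83, 755, 1020), (260, 1134, 1537))
print(p)   # 0.0 here: no null replicate comes near the observed 5.7587
```

This mirrors the limitation discussed above: at small m the bootstrap cannot resolve very small P-values, which is why "asy" is preferred in that regime.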
One can also use a closed-form approximation of the tail probability of MAX3 to approximate its P-value (8). This approximate P-value, however, is not reported by the R package Rassoc.

3. Other Robust Tests for Binary Traits. Nearly all robust methods have been developed for case–control studies. In addition to MAX3 (2), other robust tests have been developed for case–control association studies. A review of different robust tests for association studies can be found in refs. 9 and 10. Although different robust tests have been developed, they have similar performance under the alternative hypothesis. In ref. 9, an R function "casecontrol" is provided, which outputs the three trend tests, Pearson's test, MAX3, and other robust tests. The P-values of the trend tests and Pearson's test are based on the asymptotic distributions, while the P-value for MAX3 is based on the bootstrap simulation. Discussion of applying single-marker analysis with robust tests in genome-wide association studies can be found in ref. 11.

4. Comparison of F-Tests for Quantitative Traits. We conducted a simulation to compare F(x) (x = 0, 1/2, 1) and F by choosing the frequency of allele B (denoted p), the sample size n, the heritability h, and unit variance for the random error. Given a genetic model x and the values of p and h, we computed a and d. The empirical power is reported in Table 2. F(x) is most powerful when x is correctly specified. However, when x is misspecified, F(1/2) is the most robust among the three F(x) statistics. On the other hand, F is slightly less powerful than F(1/2) under the additive or dominant models, but it protects against substantial power loss under the recessive model.
Table 2 Empirical power (%) for the analysis of a quantitative trait using different test statistics given p, h = 0.1, and n = 100

Model (x)   p      F(0)    F(1/2)   F(1)    F
0.0         0.10   62.29   31.54    12.88   59.70
0.0         0.30   87.65   56.16    15.60   80.49
0.0         0.45   89.86   70.22    17.00   82.87
0.5         0.10   53.80   89.05    87.65   82.89
0.5         0.30   57.15   90.57    83.96   83.50
0.5         0.45   70.88   90.49    77.42   84.03
1.0         0.10   40.03   87.82    89.75   84.44
1.0         0.30   15.58   84.14    90.15   83.39
1.0         0.45   17.76   77.45    89.62   82.85
References

1. Sasieni PD (1997) From genotypes to genes: doubling the sample size. Biometrics 53:1253–1261
2. Freidlin B, Zheng G, Li Z, Gastwirth JL (2002) Trend tests for case–control studies of genetic markers: power, sample size and robustness. Hum Hered 53:146–152 (Erratum (2009) 68:220)
3. Zang Y, Fung WK, Zheng G (2010) Simple algorithms to calculate asymptotic null distributions of robust tests in case–control genetic association studies in R. J Stat Softw 33(8):1–24
4. The Wellcome Trust Case Control Consortium (WTCCC) (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447:661–678
5. Gastwirth JL (1966) On robust procedures. J Am Stat Assoc 61:929–948
6. Gastwirth JL (1985) The use of maximin efficiency robust tests in combining contingency tables and survival analysis. J Am Stat Assoc 80:380–384
7. Zheng G, Freidlin B, Gastwirth JL (2006) Comparison of robust tests for genetic association using case–control studies. In: Rojo J (ed) Optimality: the second Eric L. Lehmann symposium. IMS Lecture Notes–Monograph Series, Institute of Mathematical Statistics, Beachwood, Ohio
8. Li QZ, Zheng G, Li Z, Yu K (2008) Efficient approximation of P-value of the maximum of correlated tests, with applications to genome-wide association studies. Ann Hum Genet 72:397–406
9. Joo J, Kwak M, Chen Z, Zheng G (2010) Efficiency robust statistics for genetic linkage and association studies under genetic model uncertainty. Stat Med 29:158–180
10. Kuo CL, Feingold E (2010) What's the best statistic for a simple test of genetic association in a case–control study? Genet Epidemiol 34:246–253
11. Zheng G, Joo J, Tian X, Wu CO, Lin J-P, Stylianou M, Waclawiw MA, Geller NL (2009) Robust genome-wide scans with genetic model selection using case–control design. Stat Interface 2:145–151
Chapter 19

Single-Marker Family-Based Association Analysis Conditional on Parental Information

Ren-Hua Chung and Eden R. Martin

Abstract

Family-based designs have been commonly used in association studies. Different family structures, such as extended pedigrees and nuclear families, including parent–offspring triads and families with multiple affected siblings (multiplex families), can be ascertained for family-based association analysis. Flexible association tests that can accommodate different family structures have been proposed. The pedigree disequilibrium test (PDT) (Am J Hum Genet 67:146–154, 2000) can use full genotype information from general (possibly extended) pedigrees with one or multiple affected siblings but requires parental genotypes or genotypes of unaffected siblings. On the other hand, the association in the presence of linkage (APL) test (Am J Hum Genet 73:1016–1026, 2003) is restricted to nuclear families with one or more affected siblings but can infer missing parental genotypes properly by accounting for identity-by-descent (IBD) parameters. Both the PDT and APL are powerful association tests in the presence of linkage and can be used as complementary tools for association analysis. This chapter introduces these two tests and compares their properties. Recommendations and notes for performing the tests in practice are provided.

Key words: Family-based association test, Linkage disequilibrium, Transmission statistics, Non-transmission statistics, Parental information, EM algorithm, Rare variants, Genome-wide association, Extended pedigree, Nuclear family, Parallelization, Population stratification
1. Introduction

Family-based designs have been commonly used for association analysis in candidate gene studies and more recently in genome-wide association studies (GWAS). Several studies have identified candidate genes based on the family design (1–4). Family-based association tests using full genotype data are generally robust to population stratification, which can cause spurious results in population-based association analysis (e.g., case–control studies) (5). However, family data may be more difficult to ascertain than the unrelated individuals used in case–control studies.
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_19, # Springer Science+Business Media, LLC 2012
The transmission/disequilibrium test (TDT) for case–parent triads was the first popular family-based association test (6). The TDT statistic is calculated conditional on the parental genotypes, so that the statistic is robust to population stratification. Thereafter, the TDT was generalized to different family structures, such as extended pedigrees with multiple affected siblings in the pedigree disequilibrium test (PDT) (7, 8). The TDT requires parental genotypes; however, for late-onset diseases, parental genotypes are often missing. Two general approaches have been proposed to deal with this problem. The first approach is to compare the difference in allele frequencies between affected and unaffected siblings without using parental information; examples include the S-TDT, DAT, and SDT (9–11). The second approach is to infer missing parental genotypes based on the siblings' genotypes and then calculate the transmission and non-transmission statistics. The missing parental genotypes can be constructed based on sample allele frequencies, as done by the association in the presence of linkage (APL) test, TRANSMIT, and UNPHASED (12–14), or conditional on the observed genotypes within each family, as in the RC-TDT and FBAT (15, 16). It has been shown that methods that infer missing parental genotypes based on sample allele frequencies can have more power than methods that reconstruct parental mating types based on information within each family (17). However, methods that use sample allele frequencies to infer missing parental genotypes may have inflated type I error rates in the presence of population stratification, owing to differences in allele frequencies among subpopulations. In this chapter, we introduce two popular single-marker family-based association tests: the PDT and APL. The PDT considers phenotypically informative nuclear families and discordant sibships in each pedigree.
Phenotypically informative nuclear families are ones in which there is at least one affected child and both parents are genotyped at the marker. Phenotypically informative discordant sibships have at least one affected and one unaffected sibling (a discordant sibpair; DSP) and may or may not have parental genotype data. For a specific allele M1 at a marker, define a random variable XT as the difference between the transmission and non-transmission statistics:

XT = (# M1 transmitted) − (# M1 not transmitted)

and define a random variable XS as:

XS = (# M1 in affected sibs) − (# M1 in unaffected sibs).

Then for a pedigree with nT informative family triads and nS informative DSPs, a summary random variable D is defined as

D = (1/(nT + nS)) ( Σ_{j=1}^{nT} XTj + Σ_{j=1}^{nS} XSj ).   (1)
This is referred to as the "PDT-avg" statistic (18). Alternatively, the "PDT-sum" was proposed, which uses the sum in Eq. 1 without dividing by nT + nS (18). Note that triads with homozygous parents or DSPs with identical genotypes do not contribute to the PDT-sum statistic. If there are n unrelated informative pedigrees and Di is the summary random variable for pedigree i, the PDT statistic T is defined as

T = ( Σ_{i=1}^{n} Di ) / √( Σ_{i=1}^{n} Di² ),   (2)
which asymptotically follows a normal distribution with a mean of 0 and a standard deviation of 1 under the null hypothesis of no linkage or no association. The PDT-sum gives more weight to families of larger size, whereas the PDT-avg gives all families equal weight. Simulation results suggested that neither test is uniformly more powerful over all genetic models (18). The PDT was also extended to the geno-PDT, a genotype-based association test for general pedigrees (19). Simulation results showed that the geno-PDT can have more power than the allele-based PDT under recessive and dominant models, whereas the original allele-based PDT can have more power than the geno-PDT if the alleles have an additive effect. The most important property of the geno-PDT is its ability to test for association with particular genotypes, which can reveal underlying patterns of association at the genotypic level.

The APL considers independent informative nuclear families. Informative nuclear families for the APL are ones that have at least one affected sibling, with or without parental genotypes. Unaffected siblings are not required, but they can improve the parental genotype inference. The APL statistic is based on the difference between the observed number of alleles in affected siblings and its expected value conditional on parental genotypes under the null hypothesis of no linkage or no association. Specifically, for nuclear family i, let Xi be the number of copies of a specific allele in affected siblings, Gi be a vector of genotypes of the siblings, A be the siblings' affection status, Gpj be a vector of the parental mating type, C be the set of all possible parental mating types conditional on Gi, and Npij be the observed number of alleles in Gpj. Then the numerator of the APL statistic Ti is calculated as

Ti = Xi − Σ_{j∈C} P̂(Gpj | Gi, A) Npij.   (3)
When parental genotypes are available, C is the vector of observed genotypes. When parental genotypes are missing, we replace the parental genotype probabilities in Eq. 3 with the following:

P(Gp | G, A) = m_{Gp} Σ_{k=0}^{2} z_k P(G | Gp, IBD = k) / P(G | A),   (4)
where Gp is a vector of parental mating types, G is a vector of the siblings' genotypes, m_{Gp} is the unconditional parental mating-type probability, z_k is the identity-by-descent (IBD) parameter, and P(G | Gp, IBD = k) is the Mendelian transition probability conditional on the IBD status of the siblings. The APL correctly infers missing parental genotypes in the presence of linkage by considering the IBD parameters when estimating parental mating-type probabilities. The probabilities P(G | Gp, IBD = k) reduce to Mendelian transition probabilities P(G | Gp) if there is only one affected sibling in a family. The APL also uses genotypes from unaffected siblings and partial parental genotypes to help estimate parental mating-type probabilities (12). The expectation-maximization (EM) algorithm is used to estimate the parameters m_{Gp} and z_k. The APL statistic Ts is the sum of Ti over all nuclear families. The APL has been generalized to use nuclear families with different missingness patterns and different numbers of affected and unaffected siblings by adopting a bootstrap procedure to estimate the variance of Ts (17). Briefly, k bootstrap resamplings are performed, with each family treated as an independent unit for resampling. For each bootstrap sample, a new set of n families is resampled with replacement from the original n families. The APL statistic is calculated for each bootstrap sample, and the sample variance of these statistics provides the estimate of the variance of Ts, which is then standardized to be asymptotically normal with a mean of 0 and variance 1.
For example, the PDT can use full information for extended pedigrees, while the APL uses general nuclear families but can infer missing parental genotypes. Steps to perform the two tests will be described in the following section including (1) software download, (2) input files, (3) control file setup, and (4) interpretation of the results. Finally, practical notes regarding the two methods will be discussed.
2. Methods

2.1. Perform the PDT Test

2.1.1. Software Download
The PDT software, PDT2 (currently PDT version 6.0), is available for download at the Hussman Institute for Human Genomics (HIHG) website: http://hihg.med.miami.edu/software-download/ pdt. PDT2 is implemented in C++, and the source code is included in the package for users to compile on local machines.
2.1.2. Input Files
PDT2 requires a map file, which contains three columns (chromosome, marker name, and base-pair position), and a ped file, which contains pedigree and genotype information. For SNPs with two alleles, the alleles are coded as 1s and 2s, and missing alleles are coded as 0s. Both map and ped files can be generated using PLINK (26) with the "--recode12" option. However, a PLINK map file contains two columns of map information (one for the genetic map distance and the other for the base-pair position). The genetic-distance column in a PLINK map file needs to be removed, as PDT2 accepts only one column of map information. Currently, covariates are not considered in the PDT or APL (see Note 1); therefore, covariates are not accepted in the input files. More detailed descriptions of the input files can be found in the PDT2 user manual included in the software package.
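Stripping the genetic-distance column takes only a few lines; this is a convenience sketch, and the example rows are made up:

```python
def strip_genetic_distance(plink_map_lines):
    """Convert 4-column PLINK .map rows (chr, name, genetic distance, bp)
    into the 3-column format PDT2 expects (chr, name, bp)."""
    out = []
    for line in plink_map_lines:
        chrom, name, _gd, bp = line.split()
        out.append(f"{chrom}\t{name}\t{bp}")
    return out

lines = ["1 rs123 0.5 12345", "1 rs456 0.7 23456"]   # made-up example rows
print(strip_genetic_distance(lines))
```

The same transformation can of course be done with a one-line awk or cut command.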
2.1.3. Control File Setup
A control file is necessary for PDT2, and the parameters in the control file should be carefully specified because they determine how PDT2 performs the tests. Some parameters that need special attention are listed as follows:

geno_pdt
This option decides whether the allele-based PDT or the global test of the geno-PDT will be performed. The null hypothesis for the global test of the geno-PDT is that none of the genotypes is associated with the disease or that the locus is not linked.

max_cpus
To handle GWAS data efficiently, which may comprise more than 1 million markers, PDT2 is implemented with parallel algorithms based on the POSIX threads (p-threads) technique. Each independent family, with all of its markers, is analyzed in a parallel thread. The number of threads is specified in max_cpus. Each thread keeps receiving and analyzing one independent family at a time, over all markers, until all families are analyzed. The number of threads should be equal to or less than the number of cores on the computer. Our simulation results suggested that, on a computer with dual quad-core processors, using seven threads achieved the optimum performance. It is ideal
to leave one thread available so that it can perform routine work such as memory or task management on the system.

options
This parameter lets users decide whether to use only the transmission and non-transmission statistics in nuclear families, only the difference in the counts of a specific allele in discordant sibships, or all of the available information as in Eq. 1. This option allows users to examine separately whether an association signal comes largely from the transmission/non-transmission component or from the DSP component of the PDT statistic. Using only the transmission/non-transmission statistics may be more powerful if parents are available and if reduced penetrance is expected, so that unaffected siblings may actually be carriers of the disease allele.

2.1.4. Interpretation of the Results
In the PDT2 output files, users can find the numbers of triads, DSPs, and independent pedigrees considered in the PDT statistics for each marker. For the allele-based PDT, the transmission and non-transmission statistics from parents to affected siblings, the allele counts in DSPs, and the statistic calculated from Eq. 2 for each allele will be shown. For the geno-PDT, the statistic analogous to Eq. 2 for each genotype and the P-values for the global tests will be reported.
2.2. Perform the APL Test
2.2.1. Package Download

The APL test, which is included in the CAPL software package, is available for download at the HIHG website (http://hihg.med.miami.edu/software-download/capl). CAPL is implemented in C++, and the source code is included in the package. The CAPL software package provides the CAPL association test, a generalization of the APL that can accommodate family and case–control data and adjusts for population stratification (see Note 2). When there are only family data in the sample and one population is considered, CAPL reduces to the APL. In the following text, we refer to CAPL as the software implementation of the APL.

2.2.2. Input Files
Three input file formats are accepted by CAPL, all of which can be generated by the commonly used software PLINK: binary files (bim, bed, and fam files); text files with alleles coded as A, T, C, and G (map and ped files); and text files with alleles coded as 1s and 2s and missing alleles coded as 0s (also map and ped files). Note that each marker should have only two allele codes plus the missing-allele code, because the current implementation of CAPL assumes only diallelic markers (see Note 3). More detailed descriptions of the input files can be found in the user manual included in the CAPL software package.
2.2.3. Control File Setup
The control file for CAPL determines how CAPL performs the test, and therefore the parameters in the control file should be carefully specified. Some parameters that need special attention are as follows:

em_precision
The EM algorithm is used in the APL test to estimate parameters such as the allele frequencies and IBD parameters. CAPL stops the EM iterations if the difference in the allele frequency estimates between the current iteration and the previous iteration is less than em_precision. Therefore, the smaller the em_precision specified, the more EM iterations CAPL performs and the more CPU time is required. Our simulation results suggested that 10⁻⁶ for em_precision gives good estimates of the parameters in CAPL.

bootstrap_length
This parameter determines how many bootstrap resamplings will be performed in the APL for the variance estimator. Generally, 200 bootstrap replicates are enough to give a good estimate of a variance (27). Since the bootstrap procedure is a stochastic process, the estimated variance, and consequently the resulting test statistic and P-value for a marker, may not be exactly the same if the test is repeated. Our simulation results suggested that 1,000 bootstrap replicates generally give a reasonably accurate estimate of the variance. Note that each bootstrap replicate involves several EM iterations to estimate the parameters in the APL; therefore, increasing the number of bootstrap replicates also increases the running time linearly. In practice, users can first use 200–500 bootstrap replicates for initial estimates of P-values. Markers with P-values less than a certain threshold can then be retested with 1,000 bootstrap replicates for more accurate estimates of their P-values.

max_cpus
CAPL is also implemented with parallel algorithms to handle GWAS data efficiently. Since each marker can be analyzed independently in the APL, different markers can be tested in parallel. We provide two versions of CAPL.
The first version is implemented with p-threads, and the second with both the message passing interface (MPI) and p-threads. The first version of CAPL uses parallel threads with shared memory on one machine to take advantage of current multi-core computer designs. Each thread keeps receiving and analyzing one marker at a time until all markers are analyzed. The number of threads is specified in max_cpus.

mpi_processes
The second version of CAPL is implemented with a hybrid of MPI and p-threads, which allows jobs to be distributed across a cluster
R.-H. Chung and E.R. Martin
of computers, with parallel threads run on each computer with shared memory. The number of computer nodes is specified in mpi_processes, and the number of threads to be invoked on each node is specified in max_cpus. The total number of threads is therefore the number of nodes times the number of threads per node.

start and stop: These two parameters are useful when users are interested only in a specific region, such as a candidate region or a chromosome. The start and stop parameters correspond, respectively, to the start and end positions of the SNPs of a region in the map file.

2.2.4. Interpretation of the Results
CAPL reports the numbers of families for the different nuclear family structures used in the APL test. Five types of family structure are shown: C for unrelated cases, U for unrelated controls, A for families with one affected sib, AA for families with two affected sibs, and AAA for families with at least three affected sibs (see Note 4 for more details). The CAPL output file shows the marker names, allele frequencies, observed allele counts in affected siblings, expected allele counts conditional on parental genotypes, variances of the APL statistics, and P-values. The variance of the APL statistic in the output file can be used to evaluate the asymptotic properties of the APL statistic (see Note 5). One frequently asked question from users is how to identify the risk allele from the results: for a significant marker, an allele can be identified as the risk allele if its observed allele count is greater than its expected allele count. Users also frequently ask about the choice between the PDT and APL; detailed comparisons between the two can be found in Note 6. Currently, odds ratios for markers are not calculated in either the PDT or APL; however, odds ratios can be calculated with other tools (see Note 7).
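The interpretation rule just described (observed count greater than expected count implies a risk allele) amounts to a small post-processing step over the output rows. The dictionary keys below are hypothetical stand-ins for the CAPL output columns described in the text:

```python
def flag_risk_alleles(rows, alpha=5e-8):
    """Label the tested allele at each significant marker as 'risk' when
    its observed count in affected siblings exceeds the expected count,
    and 'protective' otherwise. Non-significant markers are skipped."""
    calls = {}
    for row in rows:
        if row["p_value"] < alpha:
            direction = "risk" if row["observed"] > row["expected"] else "protective"
            calls[row["marker"]] = direction
    return calls

# Hypothetical example rows mirroring the output fields named above:
rows = [
    {"marker": "rs1", "p_value": 1e-9, "observed": 132.0, "expected": 110.5},
    {"marker": "rs2", "p_value": 0.31, "observed": 98.0, "expected": 101.2},
]
# flag_risk_alleles(rows) -> {"rs1": "risk"}
```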
3. Notes

1. Covariates are not considered in either the PDT or APL. We have developed APL-OSA (36), an extension of ordered subset analysis (OSA) (37). APL-OSA identifies a subset of families that provides the most evidence of association based on covariate values, and tests the null hypothesis that there is no relationship between the family-specific covariate and the family-specific evidence for allelic association. Alternatively, the software UNPHASED (14) can be used to include covariates in association tests.

2. The APL uses allele frequencies estimated from the entire sample to infer the missing parental mating-type probabilities.
Single-Marker Family-Based Association Analysis Conditional. . .
Table 1 Properties for the PDT and APL software

Test/properties                     | PDT                                                                                   | APL
Null hypothesis                     | No linkage or no association                                                          | No linkage or no association
Family structures                   | Independent extended pedigrees or nuclear families with one or more affected siblings | Independent nuclear families with one or more affected siblings
Parental genotypes                  | Complete/incomplete (incomplete requires DSP)                                         | Complete/incomplete
Missing parental genotype inference | No                                                                                    | Yes
Allele-based test                   | Yes                                                                                   | Yes
Genotype-based test                 | Yes                                                                                   | No
Parallelization in software         | Threads with shared memory                                                            | Threads with shared memory and MPI with threads in a distributed system
When there is population stratification in the data, the allele frequency estimates may not reflect the true allele frequency for each family, which can cause an inflated type I error rate for the APL (32). In practice, software that identifies population structure, such as STRUCTURE (33) or EIGENSTRAT (34), should be used in the quality control (QC) step to generate a homogeneous sample. Recently, the CAPL, an extension of the APL, was proposed to account for population stratification in the APL statistic and to allow the use of unrelated cases and controls (32). Briefly, a clustering algorithm is used in the CAPL to identify subpopulations in the sample, and Eq. 4 is calculated conditional on the subpopulation information. The CAPL is useful when users have discrete subpopulations in the sample and would like to perform a joint analysis using all of the samples from the subpopulations.

3. The current implementation of CAPL assumes diallelic markers, while PDT2 can handle markers with multiple alleles. To analyze a multi-allelic marker using CAPL, users can collapse the alleles other than the allele to be tested into one allele. For example, the allele to be tested can be coded as 1, and all other alleles can be coded as 2 in the ped file.

4. The APL test implemented in the CAPL software currently uses nuclear families with up to three affected siblings (17). When there are extended pedigrees or nuclear families with more than three affected siblings, the APL will select the
most informative nuclear families. For a nuclear family with more than three affected siblings, the APL selects the three affected siblings with the highest genotyping rates, based on all markers, among all affected siblings in the nuclear family. There is no restriction on the number of unaffected siblings in the APL, so all unaffected siblings in a nuclear family will be used. In an extended pedigree, the APL examines every nuclear family and extracts the nuclear family with the most affected siblings and the highest genotyping rate.

5. When the sample size is small or the allele frequency is low, the APL may have an inflated type I error rate because the assumption of an asymptotic normal distribution for the APL statistic may not hold (17). Chung et al. (17) used simulations to demonstrate that APL statistics with variance >5 are generally valid. In practice, APL tests with variance <5 should be ignored or examined carefully. Our analysis results suggested that the PDT is more robust for rare variants (low-frequency alleles); however, PDT results for markers with rare variants should also be examined carefully. The issue of rare alleles will be especially important for next-generation sequencing data, in which many rare variants will be identified. The collapsing methods proposed to deal with rare variants in sequencing data (28–31), or similar multivariant approaches, will need to be incorporated into the APL or PDT for valid and powerful tests.

6. The PDT and APL share some properties but also have distinct ones. Table 1 summarizes the properties of the PDT and APL, which can be used as complementary tools. The PDT has the advantage of being able to use full information from extended pedigrees. If the sample consists of triad families with complete genotypes, the PDT and APL are expected to have similar power.
Note that the PDT specifically considers the difference in allele counts between affected and unaffected siblings, whereas the APL uses unaffected siblings only to help infer allele frequencies. Therefore, the unaffected siblings contribute indirectly to the APL statistic when there are missing parents. The APL does not use information from unaffected siblings when there are no missing parental data. Therefore, the PDT may have more power than the APL for complete nuclear families with unaffected siblings. However, if the sample has nuclear families with missing parents, as are often collected in studies of late-onset diseases such as Alzheimer’s disease and Parkinson’s disease, the APL, which can infer missing parental genotypes, should be used to obtain more power. As discussed in (12), nuclear families with only affected siblings and no parental genotype data can be used in the APL, whereas the PDT cannot use families with only affected siblings. Therefore, the APL can have more power than the PDT when a portion of the sample consists of families with only
affected siblings and no parents. In summary, the choice between the PDT and APL depends on the family structures in the sample. The general rule of thumb is that when the sample has many extended pedigrees, the PDT should have more power than the APL, but when the sample consists mainly of nuclear families, the APL should be used, especially when there are missing parental genotypes.

7. Currently, odds ratios for markers are not calculated in either the PDT or APL. To obtain an estimate of effect size, users can apply regression analysis with generalized estimating equations (GEE) (35) to estimate odds ratios while properly accounting for the correlation among family data. PLINK also provides odds ratio calculations in its TDT test; however, these calculations are restricted to triad families.

References

1. Ma DQ, Salyakina D, Jaworski JM et al (2009) A genome-wide association study of autism reveals a common novel risk locus at 5p14.1. Ann Hum Genet 73:263–273
2. International Multiple Sclerosis Genetics Consortium, Hafler DA, Compston A et al (2007) Risk alleles for multiple sclerosis identified by a genomewide study. N Engl J Med 357:851–862
3. Sklar P, Gabriel SB, McInnis MG et al (2002) Family-based association study of 76 candidate genes in bipolar disorder: BDNF is a potential risk locus. Brain-derived neurotrophic factor. Mol Psychiatry 7:579–593
4. Oudot T, Lesueur F, Guedj M et al (2009) An association study of 22 candidate genes in psoriasis families reveals shared genetic factors with other autoimmune and skin disorders. J Invest Dermatol 129:2637–2645
5. Lander ES, Schork NJ (1994) Genetic dissection of complex traits. Science 265:2037–2048
6. Spielman RS, McGinnis RE, Ewens WJ (1993) Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet 52:506–516
7. Martin ER, Monks SA, Warren LL, Kaplan NL (2000) A test for linkage and association in general pedigrees: the pedigree disequilibrium test. Am J Hum Genet 67:146–154
8. Martin ER, Kaplan NL, Weir BS (1997) Tests for linkage and association in nuclear families. Am J Hum Genet 61:439–448
9. Spielman RS, Ewens WJ (1998) A sibship test for linkage in the presence of association: the sib transmission/disequilibrium test. Am J Hum Genet 62:450–458
10. Boehnke M, Langefeld CD (1998) Genetic association mapping based on discordant sib pairs: the discordant-alleles test. Am J Hum Genet 62:950–961
11. Horvath S, Laird NM (1998) A discordant-sibship test for disequilibrium and linkage: no need for parental data. Am J Hum Genet 63:1886–1897
12. Martin ER, Bass MP, Hauser ER, Kaplan NL (2003) Accounting for linkage in family-based tests of association with missing parental genotypes. Am J Hum Genet 73:1016–1026
13. Clayton D (1999) A generalization of the transmission/disequilibrium test for uncertain-haplotype transmission. Am J Hum Genet 65:1170–1177
14. Dudbridge F (2008) Likelihood-based association analysis for nuclear families and unrelated subjects with missing genotype data. Hum Hered 66:87–98
15. Knapp M (1999) The transmission/disequilibrium test and parental-genotype reconstruction: the reconstruction-combined transmission/disequilibrium test. Am J Hum Genet 64:861–870
16. Rabinowitz D, Laird N (2000) A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information. Hum Hered 50:211–223
17. Chung RH, Hauser ER, Martin ER (2006) The APL test: extension to general nuclear families and haplotypes and examination of its robustness. Hum Hered 61:189–199
18. Martin ER, Bass MP, Kaplan NL (2001) Correcting for a potential bias in the pedigree disequilibrium test. Am J Hum Genet 68:1065–1067
19. Martin ER, Bass MP, Gilbert JR, Pericak-Vance MA, Hauser ER (2003) Genotype-based association test for general pedigrees: the genotype-PDT. Genet Epidemiol 25:203–213
20. Gregory SG, Schmidt S, Seth P et al (2007) Interleukin 7 receptor alpha chain (IL7R) shows allelic and functional association with multiple sclerosis. Nat Genet 39:1083–1091
21. Martin ER, Scott WK, Nance MA et al (2001) Association of single-nucleotide polymorphisms of the tau gene with late-onset Parkinson disease. JAMA 286:2245–2250
22. Schmidt S, Hauser MA, Scott WK et al (2006) Cigarette smoking strongly modifies the association of LOC387715 and age-related macular degeneration. Am J Hum Genet 78:852–864
23. Wang L, Hauser ER, Shah SH et al (2007) Peakwide mapping on chromosome 3q13 identifies the kalirin gene as a novel candidate gene for coronary artery disease. Am J Hum Genet 80:650–663
24. Prokunina L, Castillejo-Lopez C, Oberg F et al (2002) A regulatory polymorphism in PDCD1 is associated with susceptibility to systemic lupus erythematosus in humans. Nat Genet 32:666–669
25. Deak KL, Dickerson ME, Linney E et al (2005) Analysis of ALDH1A2, CYP26A1, CYP26B1, CRABP1, and CRABP2 in human neural tube defects suggests a possible association with alleles in ALDH1A2. Birth Defects Res A Clin Mol Teratol 73:868–875
26. Purcell S, Neale B, Todd-Brown K et al (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81:559–575
27. Efron B, Tibshirani R (1993) An introduction to the bootstrap. Chapman & Hall, New York
28. Li B, Leal SM (2008) Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet 83:311–321
29. Madsen BE, Browning SR (2009) A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet 5:e1000384
30. Morris AP, Zeggini E (2010) An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet Epidemiol 34:188–193
31. Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, Wei LJ, Sunyaev SR (2010) Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet 86:832–838
32. Chung RH, Schmidt MA, Morris RW, Martin ER (2010) CAPL: a novel association test using case–control and family data and accounting for population stratification. Genet Epidemiol 7:747–755
33. Pritchard JK, Stephens M, Donnelly P (2000) Inference of population structure using multilocus genotype data. Genetics 155:945–959
34. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38:904–909
35. Liang KY, Zeger SL (1986) Longitudinal data analysis using generalized linear models. Biometrika 73:13–22
36. Chung RH, Schmidt S, Martin ER, Hauser ER (2008) Ordered-subset analysis (OSA) for family-based association mapping of complex traits. Genet Epidemiol 32:627–637
37. Hauser ER, Watanabe RM, Duren WL, Bass MP, Langefeld CD, Boehnke M (2004) Ordered subset analysis in genetic linkage mapping of complex traits. Genet Epidemiol 27:53–63
Chapter 20

Single Marker Family-Based Association Analysis Not Conditional on Parental Information

Junghyun Namkung

Abstract

Family-based association analysis unconditional on parental genotypes models the effects of observed genotypes. This approach has been shown to have greater power than conditional methods. In this chapter, I review two popular association analysis methods that account for familial correlations: the marginal model using generalized estimating equations (GEE) and the mixed model with a polygenic random component. The marginal approach does not explicitly model familial correlations but uses that information to improve the efficiency of the parameter estimates. This model, using GEE, is useful when the correlation structure is not of interest; the correlations are treated as nuisance parameters. In the mixed model, familial correlations are modeled as random effects; e.g., the polygenic inheritance model accounts for correlations originating from genomic components shared within a family. These unconditional methods provide a flexible modeling framework for general pedigree data to accommodate traits with various distributions and many types of covariate effects. The analysis procedures are demonstrated using the ASSOC program in the S.A.G.E. package and the R package gee, including how to prepare input data, conduct the analysis, and interpret the output. ASSOC allows models to include random components for additional familial correlations that may not be sufficiently explained by a polygenic effect, and it addresses nonnormality of response variables through transformation methods. With its ease of use, ASSOC provides a useful tool for association analysis of large pedigree data.
Key words: Family-based association test, Unconditional method, Observed genotype, Polygenic inheritance, Linear mixed model, Marginal model, Generalized estimating equations, Generalized linear mixed model, Variance components, ASSOC, S.A.G.E., R package gee, Working correlation, Heritability
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_20, © Springer Science+Business Media, LLC 2012

1. Introduction

Family-based association analysis unconditional on parental information directly models the phenotypes associated with observed genotypes, while conditional approaches, the so-called transmission disequilibrium test (TDT) methods, model the phenotypes associated with parental allelic transmission. Two modeling
strategies for unconditional family-based association analysis are widely used: the mixed model with a random polygenic component and the marginal model using generalized estimating equations (GEE). Both approaches provide a flexible modeling framework to incorporate traits with various distributions and many types of covariate effects, and interpretation of the estimates from these models is straightforward. Methods that are unconditional on parental genotypes are very efficient when there is no concern about population substructure. The greater efficiency of unconditional methods, in terms of power, compared to TDT-type methods has been shown through a simulation study (1).

1.1. The Marginal Model Using GEE
GEE can be seen as an extension of the generalized linear model (GLM) for the analysis of data with correlated responses (2). The GEE approach models the marginal expectation of responses, assuming subpopulations share common values of the covariates, and thus it provides estimates of a population-averaged effect. GEE is a quasi-likelihood method that uses only the mean and variance of the observations to estimate parameters. Thus, GEE is applicable even when the complete distribution of the responses cannot be given. One of its attractive features is that GEE yields consistent estimates of the regression coefficients and their variances, even with misspecification of the covariance model, when the robust variance estimator is used. Although better efficiency is obtained the closer the defined working covariance structure is to the true structure, the loss of efficiency caused by model misspecification may not be significant if the sample size is large (3, 4). The GEE method can be outlined as follows.

1. The family-specific mean is related to a linear combination of the covariates through the known link function g:

$g(\mu_{ij}) = x_{ij}^{T}\beta,$

where $\mu_{ij} = E(y_{ij})$, $y_{ij}$ denotes the response for member j of family i, $x_{ij}$ is the vector of covariate values for this person, and $\beta$ is a vector of regression parameters. For binomial response variables, the logit link is frequently used.

2. The variance of $y_{ij}$ is defined as a function of the mean:

$\mathrm{Var}(y_{ij}) = \phi\, v_{ij},$

where $v_{ij} = v(\mu_{ij})$, $v$ is the variance function, and $\phi$ is a scale parameter.

3. The form of the working correlation matrix $R_i(\alpha)$ is defined. The $(j, j')$ element of $R_i(\alpha)$ is the known, hypothesized, or estimated correlation between $y_{ij}$ and $y_{ij'}$, which corresponds to observations from two members of the ith family in a family
data analysis. This working correlation matrix may depend on a vector of unknown parameters $\alpha$, which needs to be estimated from the data. Although for data with variable family size $t_i$ the $t_i \times t_i$ matrix $R_i(\alpha)$ can differ from family to family, $R_i(\alpha) = R(\alpha)$ is frequently assumed to approximate the average dependency among family members. The choice of $R(\alpha)$ in practice is further discussed in Subheading 2.1.2.

4. The parameter vector $\beta$ and its covariance matrix are estimated. The working covariance matrix of $y_i$ is

$V_i(\alpha) = \phi\, A_i^{1/2} R_i(\alpha) A_i^{1/2},$

where $A_i = \mathrm{diag}(v_{i1}, v_{i2}, \ldots, v_{it_i})$. The parameter estimates $\hat\beta$ are the solution of the following GEE:

$U(\beta) = \sum_{i=1}^{n} \left(\frac{\partial \mu_i}{\partial \beta}\right)^{T} \left[V_i(\hat\alpha)\right]^{-1} (y_i - \mu_i) = 0.$

The solution of the above equation is found by iterating the following two steps until convergence: (1) estimate $\beta$ using the estimating equation for fixed $\alpha$ and $\phi$, and (2) obtain consistent estimates of $\alpha$ and $\phi$ for fixed $\beta$.

A popular form of inference on the regression parameters in GEE approaches is the Wald test using naïve or robust standard errors. The variance–covariance matrix of $\hat\beta$ can be estimated using the inverse of the Fisher information matrix, which is called the naïve or model-based estimator. However, this estimator is not consistent when the model specification is incorrect; in other words, the estimates are not robust to the choice of the working correlation matrix or to a wrong assumption about the variance function. To achieve robustness, Liang and Zeger (2) suggested using the following variance estimator:

$\widehat{\mathrm{Var}}(\hat\beta) = M_0^{-1} M_1 M_0^{-1},$

where

$M_0^{-1} = \left[\sum_{i=1}^{n} \left(\frac{\partial \hat\mu_i}{\partial \beta}\right)^{T} V_i(\hat\alpha)^{-1} \frac{\partial \hat\mu_i}{\partial \beta}\right]^{-1}$

is the model-based variance estimator, and

$M_1 = \sum_{i=1}^{n} \left(\frac{\partial \hat\mu_i}{\partial \beta}\right)^{T} V_i(\hat\alpha)^{-1} (y_i - \hat\mu_i)(y_i - \hat\mu_i)^{T} V_i(\hat\alpha)^{-1} \frac{\partial \hat\mu_i}{\partial \beta}.$
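As a concrete instance of these formulas, with an identity link and an independence working correlation ($R_i(\alpha) = I$, $\phi = 1$), $\partial\mu_i/\partial\beta = X_i$, the estimating equation reduces to pooled least squares, and the sandwich $M_0^{-1} M_1 M_0^{-1}$ is assembled family by family. A minimal numpy sketch (illustrative only, not the gee package's implementation):

```python
import numpy as np

def gee_identity_independence(X_groups, y_groups):
    """GEE with identity link and independence working correlation:
    beta solves sum_i X_i'(y_i - X_i beta) = 0 (i.e., pooled OLS), and
    Var(beta) is the robust sandwich M0^{-1} M1 M0^{-1}."""
    X = np.vstack(X_groups)
    y = np.concatenate(y_groups)
    M0 = X.T @ X                      # model-based information (phi = 1)
    beta = np.linalg.solve(M0, X.T @ y)
    M1 = np.zeros_like(M0)
    for Xi, yi in zip(X_groups, y_groups):
        ri = yi - Xi @ beta           # family-level residual vector
        u = Xi.T @ ri
        M1 += np.outer(u, u)          # empirical "meat" of the sandwich
    V = np.linalg.inv(M0) @ M1 @ np.linalg.inv(M0)
    return beta, V
```

Unlike the naïve variance $M_0^{-1}$, the sandwich estimate stays asymptotically valid even when responses within a family are in fact correlated.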
This estimator is called the robust sandwich estimator because the empirical evidence is sandwiched between the model-driven covariance matrices. Because $\hat\beta$ is assumed to be asymptotically normal, tests of significance are conducted using Wald statistics, which follow asymptotic $\chi^2$ distributions under the null hypothesis. If estimation of the underlying covariance structure is also of interest, an extension of GEE, GEE2, can be used (5). GEE2 uses higher moments of the distribution function instead of only the mean and variance and produces more efficient estimates, while sacrificing a little robustness. Because the gain in efficiency is not substantial (6), GEE is commonly used when the covariance structure is not of main interest. General statistical software packages implementing GEE are widely available. For example, a commonly used statistical tool, SAS (SAS Institute, Inc., North Carolina, USA), implements GEE in proc GENMOD. Free alternatives include the gee and geepack packages, which run in the R environment.

1.2. The Mixed Model with Random Polygenic Effects
The linear mixed model (7) has been widely applied to data with various levels of dependency, such as multicenter clinical trial outcomes or repeated measures. The mixed model includes fixed effects and random effects in the same model: the fixed effects are assumed to be shared across subpopulations, while the random effects are subject specific. The random effect is used to model individual-specific errors (a decomposition of the residual errors) in a regression model. In genetic studies, trait values can be modeled as the sum of a major genotypic effect and a polygenic effect that models the background genetic effects. The mixed model has been used to explain phenotypic variation with a major gene effect and polygenic inheritance by modeling the former as a fixed effect and the latter as a random effect. The polygenic inheritance model was first proposed to explain how Mendelian inheritance can underlie a quantitative trait (8). The model assumes that the effects of an infinite number of genetic loci determine the trait values, each individual locus having the same small effect. The contribution of each locus to the trait is independent and additive, and by the central limit theorem the resulting trait values follow a normal distribution (8). When an offspring inherits chromosomes, half from the mother and half from the father, half of the phenotypic variation due to parental genetic effects is inherited from each parent, but the phenotypic value of the offspring has the same variance as that of the parents. This relationship can also be explained by the kinship coefficient. A typical linear mixed model for an association test using family data is expressed as

$y_i = X_i\beta + Z_i u_i + e_i,$
where $y_i$ is a $t_i \times 1$ vector of observed phenotypes, $X_i$ is a $t_i \times q$ matrix of fixed effects including the intercept, SNPs, and adjusting covariates, $\beta$ is a $q \times 1$ vector of coefficients of the fixed effects, $Z_i$ is an incidence matrix of 0s and 1s, and $u_i$ is an unobservable random vector of dimension $t_i \times 1$ with $\mathrm{Var}(u_i) = \sigma_g^2 D$, where $D$ is the $t_i \times t_i$ matrix whose entries represent the correlations within a family. The residual effects $e_i$ are assumed to have mean zero and $t_i \times t_i$ covariance matrix $\mathrm{Var}(e_i) = \sigma_e^2 I$. The resulting variance matrix of the phenotypes is

$\mathrm{Var}(y_i) = \sigma_g^2 Z_i D Z_i^{T} + \sigma_e^2 I.$

Parameters are estimated by either maximum likelihood (ML) or restricted maximum likelihood (REML). The significance of the parameter estimates can be tested by Wald tests, likelihood ratio tests (LRTs), or score tests. Linear mixed effect models are most efficient when the data satisfy the assumption of normality and the variances are constant among the compared groups (homoscedasticity). To address non-normality of the distribution, two methods can be used: the mixed model with a transformation, or the generalized linear mixed model (GLMM), an extension of the mixed model as a GLM (9). For applying the mixed model with a polygenic effect, the former approach is implemented in the ASSOC program of S.A.G.E. (http://darwin.cwru.edu/sage/) (10) and the latter in the grammar function of the R package GenABEL (1, 11). I give further details about ASSOC, which is used for the demonstration below. The ASSOC program estimates the association between a dependent trait and independent covariates in pedigree data, and also estimates familial variance components and heritability (User manual for S.A.G.E. 6.1). ASSOC is designed to analyze associations using large pedigree data. In addition to polygenic effects, the program allows a model to include other types of familial correlations that may not be sufficiently explained by an additive polygenic effect.
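To make the basic covariance structure above concrete: for a nuclear family measured once per person, $Z_i$ is the identity, and $D$ is conventionally taken to be twice the kinship matrix (1 on the diagonal, 1/2 for parent-offspring and full-sib pairs, 0 for spouses). A small numpy sketch under these assumptions:

```python
import numpy as np

# Twice the kinship matrix for a father, mother, and two full sibs:
D = np.array([
    [1.0, 0.0, 0.5, 0.5],   # father
    [0.0, 1.0, 0.5, 0.5],   # mother
    [0.5, 0.5, 1.0, 0.5],   # sib 1
    [0.5, 0.5, 0.5, 1.0],   # sib 2
])

def phenotype_covariance(sigma_g2, sigma_e2, D):
    """Var(y) = sigma_g^2 * D + sigma_e^2 * I for one pedigree
    (Z is the identity when each individual is measured once)."""
    return sigma_g2 * D + sigma_e2 * np.eye(D.shape[0])

V = phenotype_covariance(sigma_g2=0.6, sigma_e2=0.4, D=D)
# Narrow-sense heritability implied by these variance components:
h2 = 0.6 / (0.6 + 0.4)
```

With these illustrative values, the implied sib-sib phenotypic covariance is 0.6 × 0.5 = 0.3 and the heritability is 0.6.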
For example, many traits show positive correlations between spouses even though the spouses do not share a genetic background. Siblings who share the same environment while growing up may also show stronger correlations than implied by the calculated kinship coefficient (10). The ASSOC model is a mixed model with a both-sides transformation. For any individual i with trait $y_i$ and covariate values $X_i$, including genotypic values, the model in ASSOC has the form

$h(y_i) = h(\beta X_i) + \sum_{R \in \{p, f, m, s, r\}} e_{Ri},$
where $e_{Ri}$ is the random effect comprising a subset of polygenic ($R = p$), nuclear family ($R = f$), sibling ($R = s$), marital ($R = m$), and random error ($R = r$) effects, $\beta$ is a vector of parameters for the effects of the covariates, including an intercept, $X_i$ is the vector of covariate values for individual i, and $h(\cdot)$ is a transformation function. The transformation is applied to induce the assumed normality of the residuals. After transformation, each random component is assumed
to have a normal distribution with mean 0 and variance $\sigma_R^2$, where $R \in \{p, f, s, m, r\}$. Thus the total variance of y is $V[h(y)] = \sum_{R \in \{p,f,m,s,r\}} \sigma_R^2$. It has been shown that transforming both sides of the regression equation yields median-unbiased estimators on the original scale of the response values, and so does not lose interpretability of the regression coefficient estimates (12, 13). In ASSOC, the transformation methods of Box and Cox (14) and George and Elston (15, 16) are implemented. Generally, the George and Elston transformation is recommended because it does not restrict the range of the variables, whereas the Box and Cox transformation can only be applied to non-negative values. The George and Elston transformation function $h(\cdot)$ is

$h(y) = \begin{cases} \mathrm{sign}(y + \lambda_2)\,\dfrac{(|y + \lambda_2| + 1)^{\lambda_1} - 1}{\lambda_1\,\bar{y}^{\,\lambda_1 - 1}}, & \lambda_1 \neq 0 \\[1ex] \mathrm{sign}(y + \lambda_2)\,\bar{y}\,\ln(|y + \lambda_2| + 1), & \lambda_1 = 0 \end{cases}$

where $\bar{y} = \left[\prod_{i=1}^{N}(|y_i + \lambda_2| + 1)\right]^{1/N}$, N is the number of individuals with complete data, $\lambda_2$ is a location (shift) parameter, and $\lambda_1$ is the power parameter. The geometric mean is used to standardize the transformed variables so that all likelihoods are comparable. This is also called the generalized modulus power transformation (15, 16). The location and power parameters control the skewness and kurtosis (peakedness) of the distribution, i.e., the George and Elston transformation can induce normality for data having skewness and/or kurtosis different from that of a normal distribution. Parameter estimates are obtained by a numerical method that searches the parameter values to maximize the likelihood, an iterative algorithm that proceeds until convergence (17).

1.3. Practical Considerations in Modeling
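The transformation can be coded directly from the formula above; the sketch below is one reading of it, with the geometric mean $\bar{y}$ computed from the sample:

```python
import math

def george_elston(ys, lam1, lam2):
    """Generalized modulus power transformation h(.) applied to a sample;
    ybar is the geometric mean of (|y_i + lam2| + 1) over the data."""
    n = len(ys)
    ybar = math.exp(sum(math.log(abs(y + lam2) + 1.0) for y in ys) / n)
    def h(y):
        s = math.copysign(1.0, y + lam2)
        if lam1 != 0:
            return s * ((abs(y + lam2) + 1.0) ** lam1 - 1.0) / (lam1 * ybar ** (lam1 - 1.0))
        return s * ybar * math.log(abs(y + lam2) + 1.0)
    return [h(y) for y in ys]
```

With λ1 = 1 and λ2 = 0 the transformation reduces to the identity, which makes a convenient sanity check.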
The simplest approach to detecting associations between genotypes and a trait using family data is to ignore the familial correlations. However, a regression model that treats individuals as independent is not adequate, because ignoring strong familial correlations can incur a large type I error (lack of validity) (1). Two more appropriate modeling strategies were introduced in Subheadings 1.1 and 1.2. A choice between the mixed model and GEE should first be made based on the objectives of the study. Because GEE models the marginal expectation of responses over all subpopulations, the estimates obtained are interpreted as population-averaged effects. In the mixed model, individual-specific variation is explicitly modeled by the random effects, and thus the estimates of the fixed effects are interpreted as subject-specific effects. For quantitative traits with a normal distribution, subject-specific and population-averaged effects are the same. However, when the responses are not normally distributed and are modeled with a nonlinear link function, such as binary traits with a logit link, the effect sizes may differ, and the marginal model tends to underestimate the individual-specific effects. Therefore, if subject-specific effects are of interest, the mixed model is recommended. For screening a large number of markers to find significant associations, GEE is in many cases computationally more attractive than the mixed model. However, remember that GEE gains robustness by sacrificing a small amount of efficiency (power loss), so mixed models may give more significant results. When the pedigree size is large, mixed models, especially those with a nonlinear link function (GLMM), sometimes fail to converge to the maximum likelihood estimates. ASSOC is designed to provide test optimality via transformation for non-normal data, and with its user-friendly software interface it provides a useful tool for association analysis of large pedigree data. Compared to approaches that condition on parental genotypes, unconditional approaches provide greater power because they use more of the available information. However, when there is substructure in the collected samples, such as from population stratification, spurious associations can occur. Several studies have proposed solutions to this issue in the mixed model framework. For example, Amin et al. suggested adjusting for the inflation of the statistics by using statistics obtained from null markers, as in genomic control (18). The genetic relatedness measured from large amounts of genotype data can also be modeled as a random effect; related methods were proposed by Yu et al. (19) and Kang et al. (20). Additionally, unconditional methods should be used with caution for nonrandomly ascertained samples, because the analysis can be sensitive to the sampling design; to account for ascertainment, the sampling procedure should be incorporated in the model (21).
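The gap between subject-specific and population-averaged effects under a logit link can be checked numerically: integrating a logistic model with a normal random intercept over the random effect gives a marginal slope visibly smaller than the conditional one. A self-contained sketch (not tied to any of the packages discussed):

```python
import math

def marginal_prob(x, beta, sigma, n_grid=4001, width=10.0):
    """P(y = 1 | x) after integrating a logistic model with a
    N(0, sigma^2) random intercept over the random effect
    (trapezoidal rule on a wide grid)."""
    lo = -width * sigma
    step = 2.0 * width * sigma / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        u = lo + k * step
        w = 0.5 if k in (0, n_grid - 1) else 1.0
        dens = math.exp(-0.5 * (u / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        total += w * dens / (1.0 + math.exp(-(beta * x + u)))
    return total * step

def logit(p):
    return math.log(p / (1.0 - p))

beta_c = 1.0   # conditional (subject-specific) log odds ratio
sigma = 2.0    # random-intercept standard deviation
# Implied marginal (population-averaged) log odds ratio per unit of x:
beta_m = logit(marginal_prob(1.0, beta_c, sigma)) - logit(marginal_prob(0.0, beta_c, sigma))
```

With these values, beta_m comes out well below beta_c = 1 and close to the classical approximation beta_c/sqrt(1 + 0.346 sigma^2), illustrating the attenuation of marginal effects described above.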
2. Methods

There are several methods to test for association using family data. In this section, two methods will be demonstrated: gee in the R software for binary traits, and the ASSOC program in the S.A.G.E. package for continuous traits. We start with gee for binary traits.

2.1. The Family-Based Association Analysis of Binary Traits Using gee
R is a programming language and software package freely available from the R website (http://r-project.org). The following examples were run with R ver. 2.11.0 (see Note 1). In this section, the package gee will be used to demonstrate a family-based association analysis procedure with an example of binary trait data.
J. Namkung
To use functions in the gee package, install the package by typing the following command in the R environment:
> install.packages("gee")
When the user input prompt appears, select a mirror site near you from which to download the package files.

2.1.1. Input Data for R
Frequently in R, input data are read from a plain text file with a character separator. External data are classified as numeric and nonnumeric, and are read as numeric variables and factors, respectively. A factor is a vector object used to specify multi-categorical variables; it is coded internally as integers, with a matching table of integers and original values (see Note 2). The entire dataset read from a text file is formatted as a data frame (see Note 3), that is, a matrix-like object whose columns may be numeric variables or factors. The example data shown in Table 1 can be read by executing the following command line:
> dat = read.table("example.dat", sep="\t", header=T, na.strings=c(".", "./."))
This command reads the external file “example.dat” as a tab-separated text file with a header line, codes “.” and “./.” as missing values, and assigns the table to a data frame named “dat”. To model additive genetic effects, we convert the genotype coding to the values 0, 1, and 2 for homozygotes for the major allele, heterozygotes, and homozygotes for the minor allele, respectively, where 2 is the minor allele and we assume that it is the risk allele (see Notes 4 and 5). Using a function in the genetics package, this can be done with the following lines:
> library(genetics)
> dat$M1 = allele.count(genotype(dat$M1), 2)
library() loads an installed package, so the genetics package must first be installed on your computer. To use the affection status AFF as a response variable, it should be converted to a 0/1 binary variable:
> dat$bin = as.numeric(dat$AFF == "A")
This generates a new variable that has the value 1 when AFF is “A” and 0 otherwise. The new variable will be used as the response variable in the following analysis. When the dataset is very large, the head() function is useful for viewing the generated data at a glance; it shows only the first several lines of the data. Make sure that the data have been created as intended (see Note 6).
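The recoding steps above can also be sketched without the genetics package, using only base R. The data frame below simply retypes the relevant columns of the Table 1 example rows, and count_allele2 is a hypothetical helper written for this sketch, not a function of any package.

```r
## Base-R sketch of the recoding steps, typed in from the Table 1 example
## (count_allele2 is a hypothetical helper, not a package function).
dat <- data.frame(
  PID = rep(1256, 6),
  ID  = c(3146, 3046, 4346, 2046, 2096, 1002),
  AFF = c(NA, "U", "A", "U", "U", "A"),
  M1  = c(NA, "1/2", "1/1", NA, "1/2", "1/2"),
  stringsAsFactors = FALSE
)

## Additive coding: count the copies of the assumed risk allele "2".
count_allele2 <- function(g) {
  vapply(strsplit(g, "/"), function(a) sum(a == "2"), integer(1))
}
dat$M1 <- count_allele2(dat$M1)

## Binary response: 1 for affected ("A"), 0 otherwise.
dat$bin <- as.numeric(dat$AFF == "A")
```

Missing genotypes and missing affection statuses propagate to NA in both derived variables, which read.table's na.strings handling produces automatically when the data are read from file instead.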
2.1.2. Run Analysis Using gee

A model formula in R looks like Response ~ Covariate1 + Covariate2 + ... Covariates may include existing variables in the data or a function of given variables (see Note 7). Full hierarchical models with interaction terms can be expressed by the highest interaction term. For example, a three-way interaction model can be expressed as Y ~ X1 * X2 * X3, which is the same as Y ~ X1 + X2 + X3 + X1:X2 + X1:X3 + X2:X3 + X1:X2:X3. To use the gee package, load it first. The following is an example of a logistic regression analysis of a binary trait in correlated samples using GEE:
> library(gee)
> fit = gee(bin ~ as.numeric(M1), id=PID, data=dat, family=binomial, corstr="exchangeable")
Among the arguments of the gee function, id specifies a grouping variable for the correlated observations. In this example, a family is the unit of correlated observations, so PID, the family id, is assigned as the grouping variable. family defines the type of distribution of the response variable (see Note 8). For a binary trait, "binomial" is used to fit a logistic regression model. corstr defines the working correlation structure. The allowed values for this option are "independence", "fixed", "stat_M_dep" for stationary m-dependent, "non_stat_M_dep" for non-stationary m-dependent, "exchangeable", "AR-M" for m-autoregressive, and "unstructured" (see Note 9). "independence" assumes no correlation. "exchangeable" assumes the same correlation coefficient among all members; this correlation structure is sometimes called complete symmetry. If corstr is not defined, independence is assumed by default. Although the GEE estimates are robust to incorrect modeling of the working correlation when the robust variance estimator is used, specifying a structure close to the true one improves the efficiency of the estimates. Two other analyses, using partial information from the example pedigree data, are presented below.
- For siblings only
Generate the sibling data by extracting individuals with information on both parents, and then assign a new group id from the pedigree id, father's id, and mother's id. For data consisting only of siblings, "exchangeable" should be a reasonable choice of correlation structure.
> dat_sib = dat[which(!is.na(dat$FA) & !is.na(dat$MO)),]
> dat_sib$grp = as.integer(as.factor(apply(dat_sib[,c(1,3,4)], 1, paste, collapse="_")))
> fit = gee(bin ~ as.numeric(M1), id=grp, data=dat_sib, family=binomial, corstr="exchangeable")
- For nuclear families
For data with similar pedigree structures (not too much heterogeneity), a kinship coefficient matrix can be used as the working correlation structure to account for the correlation due to a polygenic effect. For example, we can define a kinship coefficient matrix for nuclear families with at most four offspring that looks like

      1    0    0.5  0.5  0.5  0.5
      0    1    0.5  0.5  0.5  0.5
      0.5  0.5  1    0.5  0.5  0.5
      0.5  0.5  0.5  1    0.5  0.5
      0.5  0.5  0.5  0.5  1    0.5
      0.5  0.5  0.5  0.5  0.5  1

and specify it as a user-defined correlation matrix. The commands are as follows:
> kin_mat = matrix(1/2, 6, 6)
> for(i in 1:6) kin_mat[i,i] = 1
> kin_mat[1,2] = 0; kin_mat[2,1] = 0
> fit = gee(bin ~ as.numeric(M1), id=nuc_fam, data=nuc, family=binomial, corstr="fixed", R=kin_mat)
The most general model for the correlation structure is the unstructured model, which uses t(t − 1)/2 parameters, where t is the number of correlated members. Although an unstructured working correlation can be specified when the covariance structure is unknown, it cannot be applied in all situations (see Note 10).
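The working correlation matrix above can also be built programmatically for any family size. nuclear_cor below is a hypothetical helper written for this sketch (not part of the gee package), under the assumption that the two unrelated parents come first, followed by their offspring.

```r
## Sketch: build the nuclear-family working correlation matrix shown above
## for any number of offspring (nuclear_cor is a hypothetical helper).
nuclear_cor <- function(n_off) {
  n <- 2 + n_off                # two parents first, then the offspring
  R <- matrix(0.5, n, n)        # first-degree relatives correlate 0.5 here
  diag(R) <- 1
  R[1, 2] <- R[2, 1] <- 0       # the two parents are assumed unrelated
  R
}
kin_mat <- nuclear_cor(4)       # the 6 x 6 matrix used in the text
```

Passing the result via corstr="fixed" and R=kin_mat, as above, keeps the working correlation constant across all families of that structure.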
2.1.3. gee Output
summary(fit) prints summary results, including model information, a coefficient table, and the working correlation matrix (Fig. 1). The coefficient table shows the estimates and their standard errors. Naïve standard errors are obtained from the model-based variance estimates, and robust standard errors are obtained from the sandwich variance estimates. z scores are computed as Wald-type statistics. P values for one-sided tests can be obtained by
> 1 - pnorm(z)
where z is the reported z score. If the estimate for the marker is significant, it is interpreted as the mean change in the log odds ratio of being affected per unit increase in the number of risk alleles. The output of the gee function also provides residuals and fitted values, which can be accessed as follows:
> fit = gee(f, ...)
> fit$residuals
> fit$fitted.values
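Reading one line of the coefficient table can be sketched as follows; the z score 2.1 and coefficient 0.3 are arbitrary illustrative values, not output from the example data.

```r
## Sketch: turning one line of a gee coefficient table into a one-sided
## P value and an odds ratio (z and beta are arbitrary illustrative values).
z <- 2.1                       # robust z score from the coefficient table
p_one_sided <- 1 - pnorm(z)    # one-sided P value, as described in the text

beta <- 0.3                    # estimated coefficient = log odds ratio per
or <- exp(beta)                # extra risk allele; exponentiate for the OR
```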
Fig. 1. The output of the gee function.
fit$residuals contains the estimated residuals from the model f. fit$fitted.values contains the expected values given each individual's covariates. In logistic regression, the expected value is p̂ = exp(Xβ̂)/(1 + exp(Xβ̂)), where β̂ comprises the estimated intercept and covariate parameters, and X is the matrix of covariate values. GEE uses quasi-likelihood, so classical model selection criteria based on the likelihood, such as Akaike's information criterion (AIC) and the Bayesian information criterion (BIC), may not be adequate. Instead, the quasi-information criterion (QIC), a related measure for GEE methods, has been proposed (24). The compare.gee() function in the R package ape provides various selection tools, including QIC calculation.

2.2. The Family-Based Association Analysis for Continuous Traits Using ASSOC
The ASSOC program is embedded in the S.A.G.E. package, which can be freely downloaded from its webpage (http://darwin.cwru.edu/sage/). ASSOC can be run either in command-line mode from a command prompt
$ path/assoc.exe
or in graphical user interface (GUI) mode. To run ASSOC in GUI mode, the user only needs to drag the ASSOC icon to the analysis tree. The following demonstration will be conducted mostly with the S.A.G.E. GUI.
2.2.1. Prepare Input Data and Create a New Project in S.A.G.E.
Two types of input file are required to run ASSOC: a pedigree file and a parameter file. The pedigree file format is presented, with a header, in the example data in Table 1, and this example is used in the following demonstrations. The pedigree file should contain the pedigree structure information and the values of the traits, covariates, and markers to be used in the analysis. S.A.G.E. can read pedigree data from an Excel file with the extension .xls or from a plain text file with a character separator. One may prepare the data in Excel and save them as a tab-delimited text file, for example. The parameter file contains specifications for the analysis and output options. It can be prepared in a text editor following a predefined syntax, or it can be generated in the S.A.G.E. GUI mode; thus, we only need to prepare a pedigree file to use ASSOC in the S.A.G.E. GUI mode. After starting the S.A.G.E. program, users will be requested to create a new project or open an existing one. Because we only have a pedigree file, choose “I have all pedigree data required by S.A.G.E. but no parameter file,” then open the pedigree file by clicking the Browse button at the top and specify its format. After the pedigree file is read, the types of the data columns should be specified, including pedigree ID, individual ID, parent 1, parent 2, and sex. The variable that is to be the primary response in the analysis should be specified as a trait. Other observed values should be specified as continuous or binary covariates. To conduct an association analysis between a trait and genotypes, markers should be specified as covariates using the following steps.
1. Select the Non-codominant marker option in the popup window that appears once you choose MARKER as the variable type.
Table 1
Leading lines, with header, of an example dataset in pedigree file format for the S.A.G.E. package

PID   ID    FA    MO    SEX  AGE  AFF  Q1    M1   M2
1256  3146  .     .     M    75   .    .     ./.  ./.
1256  3046  .     .     F    76   U    .     1/2  2/2
1256  4346  3146  3046  M    40   A    .     1/1  2/2
1256  2046  3146  3046  M    48   U    .     ./.  ./.
1256  2096  .     .     F    45   U    2.88  1/2  2/1
1256  1002  2046  2096  M    23   A    .     1/2  2/1
Fig. 2. Set values to use markers as covariates.
2. A new popup box will then appear. Check Use this marker as covariate and choose the marker inheritance model among additive (ADD), dominant (DOM), or recessive (REC). Name the allele that is to be modeled as the risk allele (Fig. 2). This generates new internal variables (see Note 11) named “original marker name_mode of inheritance_risk allele,” for example, M1_ADD_2. If there are many markers to be analyzed, one can apply the current specification to the following columns by checking Apply to next [nnn] column(s).
3. After specifying the data types for all the variables to be used, click the general specification button at the top left of the data preview table and set the individual missing-value codes to be applied to the pedigree information fields and the code for the sex variable.
4. Go to the next page, where the names of the data file and parameter file can be changed. Once the steps to create a new project are completed, you are ready to run an analysis.

2.2.2. Run ASSOC
In the main analysis page, start an analysis job by clicking the ASSOC icon in the tool box and dropping it on the branch named Jobs in the tree in the central panel. Error messages will appear under the new job subbranch. Once the data file is loaded by dragging and dropping a pedigree file from the Data > Internal branch to the branch of the ASSOC job, the error messages will disappear (Fig. 3).

Fig. 3. Create an ASSOC analysis job.

Then click the Analysis Definition tab at the top to specify the analysis options. At the top, one can change the title of the analysis report. Suppose we fit the following regression model:

Q1 = AGE + SEX + SNP + e_p + e_m,

where e_p and e_m represent random effects for polygenic and marital effects, respectively.
1. Specify the response variable in Trait by selecting a variable.
2. Covariates in the model can be defined in the popup box opened by clicking the Define button on the Covariate query line. In the popup box, select the SEX variable from the covariate list and click the Add covariate button. Repeat this for AGE. Specifying Value sets an initial estimate for the covariate effect; when Fixed is also checked, the relevant parameter will not be estimated (Fig. 4). It is not necessary to set an initial value if you wish to estimate the effect.

Fig. 4. Model specification.

3. To account for the correlations within a family, random polygenic, marital, sibling, and familial effects can be included in the analysis model by clicking the relevant checkboxes. To add random effects for two types of familial correlation, check, for example, Estimate polygenic and Estimate Marital in the Variance components section (see Note 12).
4. Check the Batch mode enabled checkbox to include the markers as test covariates one by one (see Note 11). To induce normality of the distribution of the trait values, a transformation option can be specified (see Note 13). Box and Cox (14) with λ2 = 1 is set as the default method for quantitative response variables (see Note 14); this performs a power transformation without estimating a location parameter. To apply the George and Elston transformation, click the Define button in the line under Transformation and remove the check in the Fixed checkboxes for λ1 and λ2 (Fig. 5). The values in the boxes will then be estimated, using the given values as starting values for the iterative estimation procedure. Box and Cox with λ1 = λ2 = 1, and George and Elston with λ1 = 1, λ2 = 0, are the same as None for the transformation option.
Fig. 5. Specification of the transformation function.
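The transformation just described can be sketched in base R from the Box–Cox form with shift parameter given in Note 14. boxcox_shift below is a hypothetical helper written for this sketch, not ASSOC's own implementation.

```r
## Sketch of the Box-Cox transformation with shift parameter lambda2
## (the form given in Note 14); boxcox_shift is a hypothetical helper.
boxcox_shift <- function(y, l1, l2) {
  gm <- exp(mean(log(y + l2)))                 # geometric mean of y + l2
  if (l1 != 0) ((y + l2)^l1 - 1) / (l1 * gm^(l1 - 1))
  else gm * log(y + l2)
}

y <- c(0.5, 1.2, 3.7, 8.1)
## With lambda1 = lambda2 = 1 the transform reduces to the identity,
## matching the "same as None" remark in the text.
h <- boxcox_shift(y, 1, 1)
```

With λ1 = 1 the denominator becomes 1 and the power term collapses to y + λ2 − 1, so λ2 = 1 recovers y exactly, which is why that parameter pair is equivalent to no transformation.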
Covariates added without specifying a model name, such as SEX and AGE in this example, will be used as baseline model covariates. If covariates are added with a defining model name, they will be used as test covariates. The baseline model, which includes an intercept, the baseline model covariates, and the random effects, is compared to a model that additionally includes the test covariates, in order to test the significance of these covariates. Each covariate can be included in multiple models, and each model can include multiple covariates. Models may include new variables derived from existing variables. For example, one may want to test a quadratic effect of AGE. To generate a new variable, open Tools in the menu bar at the top and select Create New Variable in the drop-down menu. In the popup box, click the Add button. Give the new variable a name, such as “age_sq” in this example. Choose the covariate variable type. In the bottom left box, click the Existing Variables folder; the list of variable names will appear. Choose AGE and double-click to insert the variable name into the textbox in the center. Now, open the Operator folder and double-click the ** character, which is the power operator. Complete the formula by typing 2 after the power operator. The completed formula will look like AGE**2 (Fig. 6).

Fig. 6. Create a new variable from an existing variable.

Variance components provides four types of random components as options: Polygenic, Marital, Sibling, and Familial. Polygenic represents the shared additive genetic effects within a family, and Marital represents the shared effect between spouses originating from a shared environment. Familial is for a shared environmental effect in a nuclear family, i.e., a household effect (see Note 15). Sibling represents a common environmental effect and/or a dominant polygenic effect among siblings (10). Additional random components can be included by specifying variables with categorical values in the Class effect section.
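The covariance structure that such random effects imply can be sketched directly. This is a simplified illustration under assumed values (loosely echoing the magnitudes in Table 3), not ASSOC's internal computation: the trait covariance for one nuclear family with polygenic and marital components, and a GLS fixed-effect estimate built from it.

```r
## Sketch (assumed values, not ASSOC internals): trait covariance implied by
## polygenic + marital random effects for one nuclear family
## (father, mother, two offspring), plus a GLS fixed-effect estimate.
sg2 <- 1.76; sm2 <- 0.20; se2 <- 2.78    # assumed variance components
twoPhi <- matrix(c(1,  0,  .5, .5,
                   0,  1,  .5, .5,
                   .5, .5, 1,  .5,
                   .5, .5, .5, 1), 4, 4)  # twice the kinship matrix
M <- matrix(0, 4, 4)
M[1:2, 1:2] <- 1                          # spouses share the marital effect
V <- sg2 * twoPhi + sm2 * M + se2 * diag(4)

## GLS estimate of the fixed effects for an assumed design matrix and trait.
X <- cbind(1, c(0, 1, 0, 1))              # intercept + an indicator covariate
y <- c(4.2, 5.1, 4.8, 5.6)
beta <- solve(t(X) %*% solve(V) %*% X, t(X) %*% solve(V) %*% y)
```

The point of the sketch is that each checked variance component adds one structured matrix to V, and the fixed effects are then estimated against that correlated error structure rather than against an independence assumption.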
If Allow averaging is set to Yes, missing covariate values are imputed with the mean of the observed covariate values; the default is No (see Note 16). Residuals from a particular analysis model can be retrieved by specifying a model name in the Residuals popup menu. The model should be one specified in the Covariate section; it is used to obtain the expected value for each individual. It can be the model without test covariates (the null, or baseline, model) or the alternative model with test covariates. If no model name has been specified by the user in the Covariate section, the model name should be given as “Baseline” to get residuals. The residual for individual i is calculated as

Resid_i = h(y_i) − h(ŷ_i) = h(y_i) − h(bX_i) = Σ_{R ∈ {p, f, m, s, r}} e_i^R.
In the Summary display popup window, options for the summary output file can be specified. Filters allows you to order the results by filtering criteria, such as test P values, and to limit the number of outputs. This is useful when the number of tests is very large, as in a genome-wide analysis, and only a certain number of the most significant results are of interest. Once the specification of the analysis models and the other analysis options is done, ASSOC is ready to run. Click the Run button; an Analysis information popup box will then appear. Its content is the analysis block of the parameter file created according to the user's specifications. The parameter file will be saved as a text file with a .par extension and appears in the analysis tree under the Input branch. Table 2 shows an example of a parameter file. Modifying a parameter file is a simple way to specify various models (see Note 17). After reviewing the analysis options, click OK to execute the analysis.

2.2.3. ASSOC Output
When an analysis is completed, four output files are generated in the Output branch of the analysis tree: inf, sum, tsv, and det.
- *.inf contains messages regarding exceptions and errors that occurred during the analysis (see Note 18).
- *.sum contains summary results with parameter estimates and test P values.
- *.tsv provides a table with P values for the test covariates.
- *.det contains detailed results, including the estimated variance–covariance matrix of all the parameter estimates.
For each model, a sample description and summary statistics for the covariates included in the model are presented at the top of the detailed output file. Results of the parameter estimates and the variance–covariance matrix of all the estimates from the model without test covariates, and then those from the model with test covariates, follow.

Table 2
Parameter file

As an example, the parameter estimates from a model named M2_ADD_2 with test covariates are shown in Table 3.

Table 3
Detailed output – parameter estimates

MAXIMIZATION RESULTS: M2_ADD_2 with test covariates

Parameter            Estimate    S.E.       P value    Deriv
Variance components
  Random             2.776642    0.979833   0.0046     7.2E-06
  Polygenic          1.757669    0.656563   0.003713   1.12E-05
  Marital            0.204385    0.661608   0.37869    2.82E-05
Other parameters
  Total variance     4.738696    0.420511   1.00E-07   8.09E-06
  Heritability       0.370918    0.124927   0.001493   3.78E-06
Correlations
  Full sibs          0.185459    0.062463   0.002987   0
  Half sibs          0.092730    0.031232   0.002987   0
  Parent offspring   0.185459    0.062463   0.002987   0
  Spouses            0.043131    0.139246   0.756753   0
  Marital            0.068562    0.225725   0.761325   0
  Intercept          4.825051    0.162064   1.00E-07   0
Covariates
  SEX                0.13359     0.255368   0.600884   0
  AGE                0.005627    0.008106   0.487605   0
  M2_ADD_2           0.026475    0.209168   0.899278   0
Transformation
  Lambda1            1           Fixed
  Lambda2            0           Fixed

Recorded for each parameter are the estimate, the standard error of the estimate, the Wald-type test P value for the estimate, and Deriv, a criterion indicating how well the estimation algorithm converged, which should be close to 0. The parameter estimates for the covariates modeled as fixed effects are presented under the heading Covariates. In the case of an additive mode of inheritance, the estimate for a marker effect is interpreted as the increase in the response value per additional copy of the risk allele (allele “2” in this example). Because sex is coded 1 for female and 0 for male, the estimated parameter is the effect of being female (see Note 19). Under Variance components are the parameter estimates for the random effect terms specified in the model. These estimates and the estimate of the total variance are used to derive the estimates of heritability and the correlations.
Table 4
Definition of correlations listed in the detailed output file

Environmental intraclass correlations (based on non-zero variance components)
  Nuclear family      σ²_F / (σ²_T − σ²_G)
  Marital             (σ²_F + σ²_M) / (σ²_T − σ²_G)
  Sibship             (σ²_F + σ²_S) / (σ²_T − σ²_G)
Residual familial correlations (based on non-zero variance components)
  Full sib            (σ²_F + σ²_S + (1/2)σ²_G) / σ²_T
  Half sib            (σ²_F + (1/4)σ²_G) / σ²_T
  Spouses             (σ²_F + σ²_M) / σ²_T
  Parent offspring    (σ²_F + (1/2)σ²_G) / σ²_T

σ²_T, σ²_G, σ²_F, σ²_S, and σ²_M represent the estimates of the total, polygenic, nuclear family, sibling, and marital variance components, respectively.
Heritability is computed as the polygenic variance component divided by the total variance (22). Some correlations are caused by shared genetic components, while others are caused by shared environment (see Note 20). The correlations are defined based on the non-zero estimates of the variance components. Of these correlations, the nuclear family, marital, and sibship correlations are defined as environmental intraclass correlations, in which the denominator is the total variance minus the polygenic variance. The remaining correlations, including the full sib, half sib, spouses, and parent offspring correlations, are defined to describe the residual familial correlations, in which the denominator is the total variance of the response values. The definitions of each are given in Table 4. If a calculated value is zero, the correlation does not appear in the list. For example, the correlation for full sibs in Table 3 is computed as (σ²_F + σ²_S + (1/2)σ²_G)/σ²_T = 1.757669/(2 × 4.738696) = 0.1854591, since σ²_F = σ²_S = 0 here. Below the estimate result tables, the final log likelihoods are presented. The log likelihoods from two nested models are used to compute an LRT statistic (see Note 21). The result of the LRT for the significance of the test covariates in the defined models, compared to the baseline model, is presented as a joint test in the detailed output file (Table 5). Degrees of freedom gives the difference in the number of parameters between the baseline and test models; when the defined model has multiple test covariates, the degrees of freedom will be greater than 1. P values from the LRT and the Wald test for each of the test covariates are also presented side by side in the summary output file. The two tests are asymptotically equivalent, i.e., they give the same values when the sample size is large enough (23).
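The derived quantities in Table 3 can be reproduced from its variance-component estimates in a few lines of base R:

```r
## Reproducing Table 3's derived quantities from its variance components.
random_c  <- 2.776642                    # random (residual) component
polygenic <- 1.757669
marital   <- 0.204385
total <- random_c + polygenic + marital  # total variance, 4.738696

h2       <- polygenic / total            # heritability
full_sib <- (polygenic / 2) / total      # sigma_F = sigma_S = 0 here
spouses  <- marital / total
```

The recomputed values match Table 3's heritability (0.370918), full-sib correlation (0.185459), and spouse correlation (0.043131) to the printed precision.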
Table 5
Detailed output – joint test

Joint test
  H0 ln likelihood, M2_ADD_2 without test covariates   −5,968.315890
  H1 ln likelihood, M2_ADD_2 with test covariates      −5,968.307883
  2 × |H0 − H1|                                        0.016013
  Degrees of freedom                                   1
  P value                                              0.899303
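The joint test in Table 5 can be recomputed from its two log likelihoods (−5,968.315890 and −5,968.307883) with base R's chi-square distribution function:

```r
## Recomputing the joint (likelihood ratio) test reported in Table 5.
ll0 <- -5968.315890       # H0: M2_ADD_2 without test covariates
ll1 <- -5968.307883       # H1: M2_ADD_2 with test covariates
lrt <- 2 * (ll1 - ll0)    # Table 5 reports 0.016013
df  <- 1                  # one test covariate
p   <- 1 - pchisq(lrt, df)
```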
The variance–covariance matrix presents the variances and covariances of the parameter estimates. Those of the estimates included in the model are computed based on Fisher's information; all the variances and covariances of estimates derived from the model parameters are also presented. When the Residuals option is specified, a *.res file is generated. This file contains records of the pedigree id, the individual id, and the estimated residual for each individual. By multiplying the residual vector by the inverse of the Cholesky factor of the estimated covariance matrix, asymptotically independent errors can be obtained. Using these transformed residuals, model diagnostic methods for independent data can be adopted for the analysis procedure.
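The decorrelation step just described can be sketched with an assumed 3 × 3 covariance matrix; both S and the residual vector r below are illustrative values, not output from the example analysis.

```r
## Sketch of decorrelating residuals via the Cholesky factor of an (assumed)
## estimated covariance matrix; S and r are illustrative values only.
S <- matrix(c(2.0, 0.5, 0.3,
              0.5, 1.5, 0.4,
              0.3, 0.4, 1.0), 3, 3)
L <- t(chol(S))          # lower-triangular factor with S = L %*% t(L)
r <- c(0.8, -0.2, 0.5)   # assumed residual vector
z <- solve(L, r)         # transformed, asymptotically independent errors

## The transformation whitens the covariance: L^{-1} S L^{-T} = I.
W <- solve(L) %*% S %*% t(solve(L))
```

Because the transformed errors have (asymptotically) an identity covariance, standard diagnostics for independent data, such as normal Q-Q plots, can be applied to z.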
3. Notes

1. The R program is actively updated, and useful packages for various applications are frequently uploaded to the CRAN website (http://www.r-project.org/). The functions used in this book may be updated, too. Type help("function name") to open the help document when you use a new function. One can also search for functions by keyword by typing help.search("keyword").
2. When a mathematical operation is applied to a factor variable, unexpected results may be obtained. When reading data from a file, a column of numeric values will be formatted as a factor if it contains any nonnumeric value, including a blank character. This is a common mistake that gives wrong results. The unique values appearing in a factor can be retrieved with the levels() function.
Fig. 7. Finding allele frequencies in the genetics package.
3. A data frame is a special data class in R that allows a table to include various data types at the same time.
4. Alternative steps to convert the genotypes into integers for an additive genetic effect model, without installing the genetics package, are as follows:
> tmp = dat[,9:10]
> tmp = apply(tmp, 2, as.character)
> tmp[tmp == "1/1"] = 0
> tmp[tmp %in% c("1/2","2/1")] = 1
> tmp[tmp == "2/2"] = 2
> dat2 = cbind(dat[,1:8], tmp)
5. Using the summary.genotype() function in the genetics package, the minor allele can easily be found, as shown in Fig. 7.
6. Some useful functions can be used to check whether the data have been generated or loaded correctly. dim(data) outputs the dimensions of a data table as a vector of row and column numbers. class(data$columnName) gives the type of a variable, such as factor, numeric, integer, or character.
7. To include a function of a variable as a covariate, the I() indicator function should be used. Otherwise, R will produce a single estimate for the derived variable and its component variable. For example, to include age and age squared in the model, the formula should be written as R ~ cov + age + I(age^2).
8. Frequently, family="gaussian" is used for a quantitative trait with a continuous variable; this assigns the identity link function. For a count variable, family="poisson" is specified to use the log link function. For a binary variable, family="binomial" is specified to use the logit link function.
9. The permitted values for corstr are "independence," "fixed," "stat_M_dep," "non_stat_M_dep," "exchangeable," "AR-M," and "unstructured." The correlation models defined by these values are as follows:

Value                    Cor(y_ij, y_ik)
exchangeable             ρ for j ≠ k
AR-M                     ρ^|j−k| for |j−k| ≤ m;  0 for |j−k| > m
stat_M_dep (Toeplitz)    ρ_|j−k| for |j−k| ≤ m;  0 for |j−k| > m
non_stat_M_dep           ρ_jk for |j−k| ≤ m;  0 for |j−k| > m
unstructured             ρ_jk for j ≠ k

When corstr is "stat_M_dep," "non_stat_M_dep," or "AR-M," then Mv = m must be specified, with a value smaller than or equal to the matrix column/row number. For example, a command line for the first-order autoregressive correlation structure looks like
> gee(y ~ x, family=gaussian, id=group, data=dataset, corstr="AR-M", Mv=1)
When the dimension of the matrix is 4 × 4, the correlation model is then defined as

      1    ρ    ρ²   ρ³
      ρ    1    ρ    ρ²
      ρ²   ρ    1    ρ
      ρ³   ρ²   ρ    1

10. The unstructured working correlation structure requires a large number of parameters to be estimated. Thus, it is limited in application to studies with small families (a few time points, in the case of an individual's repeated measures) and balanced, complete data (3).
11. The covariate-type variables generated from markers do not appear in the variable lists or in a data file in the Internal branch. Note that the markers cannot be used as marker-type variables
in an analysis once the markers have been read as covariates. Some variables, such as PEDIGREE_SIZE, FOUNDER_INDICATOR, and SEX_CODE, are generated automatically. All the automatically generated variables are used as test covariates when Batch mode enabled is checked. They are also accessible by naming them in a parameter file.
12. The default is to include marital, sibling, and polygenic effects as random components. When the estimates of some of the variance components converge to zero, ASSOC automatically recalculates the likelihood after removing those variance components from the model. If a variance component is removed from a test model, the disparity between the null and the test model is noted in the output, and no LRT is performed.
13. ASSOC is being continually improved, and one change that will be available in S.A.G.E. 6.2 is the option to transform the difference, i.e., h(y_i − E(y|X_i)), rather than both sides, so check that transformation of both sides is still the default in the version you use. Currently, transformation is not allowed for binary response variables.
14. The Box–Cox transformation with shift parameter is

h(y) = ((y + λ2)^λ1 − 1) / (λ1 ỹ^(λ1 − 1))   if λ1 ≠ 0,
h(y) = ỹ ln(y + λ2)                           if λ1 = 0,

where ỹ = [Π_{i=1}^{N} (y_i + λ2)]^(1/N) and N is the number of individuals with complete data. This can be applied only to positive values; thus, the George and Elston method is recommended when there are negative observations (such as residuals from another analysis).
15. The nuclear family correlation is meaningful only when the pedigree size and the number of pedigrees are adequately large.
16. When you have multiple covariates to be tested and missing values for the covariates occur sporadically in different individuals, the null models for the different covariates may involve different sets of individuals with complete data. Thus, ASSOC computes the likelihoods of the null models separately for each (set of) test covariate(s), and this takes more time than if you allow averaging to impute the missing values with mean values, because then the likelihood for the null model is computed only once.
17. For example, one can perform a 2 df test for the M2 marker by modifying the parameter file. To generate two dummy variables, add function blocks as in the following example:
In the assoc block, add the following two lines.
In the detailed output file, the joint test section will show the 2 df test result.
18. Exceptional cases in which analyses are skipped, and/or warning or error messages that occurred during the maximization process, are written in the inf file. Skipped analyses, due for example to no variation in the values of the test variables, are listed in this file. The top ten lines of the input data are presented at the top, so one can check whether the data have been read correctly. It is recommended to check this file every time before opening the other output files to read the analysis results.
19. All the covariates are automatically centered prior to the analysis, so the intercept should be interpreted accordingly.
20. Estimates of heritability may change with changes in the composition of the random effects, such as the inclusion or exclusion of marital effects. Transformation may also affect the heritability estimates, owing to changes in the polygenic variance.
21. Two types of test statistic, the LRT and the Wald test, are used for significance tests in ASSOC.
LRT: 2[L1 − L0] ~ χ², where L0 and L1 are the log likelihoods under the null model (without test covariates) and the alternative model (with test covariates), respectively.
Wald: ((β̂ − β̂0)/SE(β̂))² ~ χ², where β̂ is the estimate from the alternative model and β̂0 is the value under the null model (usually taken to be 0).

References

1. Aulchenko, Y. S., de Koning, D. J., and Haley, C. (2007) Genomewide rapid association using mixed model and regression: a fast and simple method for genomewide pedigree-based quantitative trait loci association analysis, Genetics 177:577–585.
2. Liang, K.-Y., and Zeger, S. (1986) Longitudinal data analysis using generalized linear models, Biometrika 73:13–22. 3. Diggle, P., Heagerty, P., and Liang, K.-Y. (2002) Analysis of Longitudinal Data, 2nd ed., Oxford University Press, USA.
20 Single Marker Family-Based Association Analysis. . .
4. Davis, C. S. (2002) Statistical Methods for the Analysis of Repeated Measurements, Springer. 5. Zhao, L., and Prentice, R. (1990) Correlated binary regression using a quadratic exponential model, Biometrika 77:642–648. 6. Balemi, A., and Lee, A. (2009) Comparison of GEE1 and GEE2 estimation applied to clustered logistic regression, Journal of Statistical Computation and Simulation 79:361–378. 7. McLean, R. A., Sanders, W. L., and Stroup, W. W. (1991) A unified approach to mixed linear models, The American Statistician 45:54–64. 8. Fisher, R. (1918) The correlation between relatives on the supposition of Mendelian inheritance, Transactions of the Royal Society of Edinburgh 52:399–433. 9. Breslow, N. E., and Clayton, D. G. (1993) Approximate inference in generalized linear mixed models, Journal of the American Statistical Association 88:9–25. 10. Gray-McGuire, C., Bochud, M., Goodloe, R., and Elston, R. C. (2009) Genetic association tests: a method for the joint analysis of family and case-control data, Hum Genomics 4:2–20. 11. Aulchenko, Y. S., Ripke, S., Isaacs, A., and van Duijn, C. M. (2007) GenABEL: an R library for genome-wide association analysis, Bioinformatics 23:1294–1296. 12. Carroll, R. J., and Ruppert, D. (1984) Power transformation when fitting theoretical models to data, J. Am. Stat. Assoc. 79:321–328. 13. Carroll, R. J., and Ruppert, D. (1988) Transformation and Weighting in Regression, Chapman and Hall/CRC. 14. Box, G. E. P., and Cox, D. R. (1964) An analysis of transformations, Journal of the Royal Statistical Society, Series B 26:211–252.
15. George, V. T., and Elston, R. C. (1987) Testing the association between polymorphic markers and quantitative traits in pedigrees, Genet. Epidemiol. 4:193–201. 16. George, V. T., and Elston, R. C. (1988) Generalized modulus power transformations, Comm. Statist. Theory Meth. 17:2933–2952. 17. Elston, R. C., George, V. T., and Severtson, F. (1992) The Elston-Stewart algorithm for continuous genotypes and environmental factors, Hum. Hered. 42:16–27. 18. Amin, N., van Duijn, C. M., and Aulchenko, Y. S. (2007) A genomic background based method for association analysis in related individuals, PLoS One 2:e1274. 19. Yu, J., Pressoir, G., Briggs, W. H., Vroh Bi, I., Yamasaki, M., Doebley, J. F., McMullen, M. D., Gaut, B. S., Nielsen, D. M., Holland, J. B., Kresovich, S., and Buckler, E. S. (2006) A unified mixed-model method for association mapping that accounts for multiple levels of relatedness, Nat Genet 38:203–208. 20. Kang, H. M., Zaitlen, N. A., Wade, C. M., Kirby, A., Heckerman, D., Daly, M. J., and Eskin, E. (2008) Efficient control of population structure in model organism association mapping, Genetics 178:1709–1723. 21. Pfeiffer, R. M., Pee, D., and Landi, M. T. (2008) On combining family and case-control studies, Genet Epidemiol 32:638–646. 22. Ritland, K. (1996) Inferring the genetic basis of inbreeding depression in plants, Genome 39:1–8. 23. Agresti, A. (2002) Categorical Data Analysis, 2nd ed., John Wiley and Sons. 24. Pan, W. (2001) Akaike's information criterion in generalized estimating equations, Biometrics 57:120–125.
Chapter 21
Allowing for Population Stratification in Association Analysis
Huaizhen Qin and Xiaofeng Zhu
Abstract
In genetic association studies, it is necessary to correct for population structure to avoid inference bias. During the past decade, prevailing corrections have often involved only adjustments for global ancestry differences between sampled individuals. Nevertheless, population structure may vary across local genomic regions due to the variability of local ancestries associated with natural selection, migration, or random genetic drift. Adjusting for global ancestry alone may be inadequate when local population structure is an important confounding factor. In contrast, adjusting for local ancestry can more effectively prevent false-positives due to local population structure. To more accurately locate disease genes, we recommend adjusting for local ancestries by interrogating local structure. In practice, locus-specific ancestries are usually unknown and cannot be accurately inferred when ancestral population information is not available. For such scenarios, we propose employing local principal components (PCs) to represent local ancestries and adjusting for local PCs when testing for genotype–phenotype association. With an acceptable computation burden, the proposed algorithm successfully eliminates the known spurious association, due to population structure, between SNPs in the LCT gene and height in European Americans.
Key words: Genome-wide association studies, Local ancestries, Local principal components, Migration, Random genetic drift, Natural selection, Genomic inflation factor, Genomic control, Local ancestry principal components correction, Fine mapping
1. Introduction
In association studies, concerns about population stratification date back more than twenty years (1). It is well recognized that population stratification—systematic ancestry differences between study subjects—can confound association tests (2–13). Population structure exists in study subjects as a result of distinct ethnic groups or a single pool of admixed individuals. Global population structure can be characterized by individual global ancestries. An individual's global ancestry can be calculated as the proportions of his/her
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_21, # Springer Science+Business Media, LLC 2012
H. Qin and X. Zhu
genome inherited from the underlying ancestral populations. Local population structure in a local genomic region or at a locus can be similarly characterized. During the past decade, statistical methods to control the false-positive rate due to population stratification have mainly involved adjustments for global population structure, which arises mainly from recent migration and random genetic drift. Prevailing paradigms include genomic control (3), structured association methods (4, 5), and principal component methods (6, 7, 10) that use markers randomly selected across the genome. In genomic control, the variance inflation factor of a test statistic is assumed to be constant across the entire genome. In a classical principal component (PC) method, the PCs of the genotype score matrix of genome-wide markers (referred to as global PCs) are used as ancestry surrogates in the association analysis for each marker tested. Many studies have shown that the global PCs can effectively represent human demographic history (6, 10–12, 14–16). Nevertheless, subtle local structures do occur in some small genomic regions, owing to demographic history, natural selection pressure, and random fluctuations of admixture (13, 17, 18). The imprint of natural selection, for example, has recently been identified in many regions across the genome (19, 20) and can create substantial variation in population differentiation, which in turn affects the degree of variance inflation at specific loci (21). Even subtle structure, if ignored, can either inflate the type I error or reduce statistical power, especially when the sample size is large (8). For recently admixed populations with known reference ancestral populations (e.g., African Americans), locus-specific ancestry can be inferred accurately using hidden Markov model-based methods (4, 22–26).
However, it is difficult to accurately infer locus-specific ancestries for an admixed individual owing to either the lack of ancestral population information or high similarity between the ancestral populations. For example, the population admixture of European Americans occurred among populations of similar origin. For such scenarios, we propose using local PCs of the genotype score matrix of the markers within local genomic regions to represent local ancestries, at the same time using global PCs to represent global ancestries, and interrogating the ancestries in local genomic regions for fine mapping (13). Extensive simulations and applications to data sets from three genome-wide association studies illustrate the necessity and practical implications of adjusting for local ancestry principal components. Both European Americans and African Americans demonstrate greater variability in local ancestry than do Nigerians. Adjusting for local PCs successfully eliminates the well-known spurious association between the LCT gene and height in European Americans due to the underlying population structure. In this chapter, we illustrate how to run the local ancestry principal components correction (LAPCC).
Fig. 1. A typical window consists of a 4-Mb core and an envelope with 8-Mb margins on each side of the core. The first ℓ PCs of the genotypic score matrix of the SNPs in the 20-Mb window are employed to adjust for local ancestries of the SNPs within the 4-Mb core.
In LAPCC, we divide each chromosome into 4-Mb adjacent segments (referred to as window cores hereafter) according to the SNP map of base pair positions. Typically, we add an envelope with an 8-Mb margin to each side of a core (Fig. 1) to construct a 20-Mb window. We choose this window width according to the linkage disequilibrium due to recent population admixture (22, 24). The left (right) envelopes of the first (last) two windows of some chromosomes might be shorter than 8 Mb. For each local window, we compute the first ℓ = 10 (see Note 1) PCs of the genotypic scores of window-wide SNPs to adjust for the population structure of the SNPs within the 4-Mb core. At each SNP in the core, we compute genotype score residuals g̃_i and trait value residuals ỹ_i by regressing the genotype scores g_i and trait values y_i on the 10 local PCs. We measure the evidence of genotype–phenotype association by s² = (N − ℓ − 2) r²/(1 − r²), where N is the number of individuals used after excluding individuals with missing genotypes and r is the correlation coefficient between the residuals ỹ_i and g̃_i. Asymptotically, s² follows a χ₁² distribution under the null of no genotype–phenotype association if all confounding factors are well adjusted. To be specific, we denote by G = (g_ij) the M × N genotypic matrix (of the entire genome or a local window), where g_ij ∈ {0, 1, 2} is the copy number of a reference allele at the ith marker of the jth subject. We then center each row i of matrix G by the row mean μ_i = N⁻¹ Σ_{j=1}^{N} g_ij and denote the centered matrix by X = (x_ij). We exclude each missing entry g_ij from the computation of μ_i and set the corresponding x_ij to 0. We denote the eigensystem of the N × N matrix C = X′X by C = VΛV′, where V = [v₁, ..., v_N], Λ = diag(λ₁, ..., λ_N), and λ₁ ≥ ... ≥ λ_{N−1} ≥ λ_N are the eigenvalues corresponding to the eigenvectors v₁, ..., v_N, respectively; in particular, λ_N = 0 and v_N = N^{−1/2}(1, ..., 1)′.
Following previous work (10), we define the kth axis of variation as the eigenvector v_k = (v_1k, ..., v_Nk)′ and select the leading eigenvectors (PCs) as the ancestry surrogates. We do not normalize the matrix X (see Note 2) and do not suggest thinning SNPs when calculating the PCs (see Note 3). Instead, we only center the genotypic scores and calculate the PCs from all available genotypic data.
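The centering, eigendecomposition, and residual-correlation steps described above can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the LAPCC.exe code; the function names `local_pcs` and `lapcc_stat` are our own.

```python
import numpy as np

def local_pcs(G, n_pcs=10):
    """Top PCs of an M x N genotype matrix G (markers x individuals).

    Missing entries are np.nan; each row is centered on the mean of its
    non-missing entries, and missing cells are then set to 0, as in the text.
    """
    X = G.astype(float)
    X = X - np.nanmean(X, axis=1, keepdims=True)
    X[np.isnan(X)] = 0.0
    C = X.T @ X                                  # N x N cross-product matrix
    vals, vecs = np.linalg.eigh(C)               # eigenvalues in ascending order
    return vecs[:, np.argsort(vals)[::-1][:n_pcs]]

def lapcc_stat(g, y, pcs):
    """s^2 = (N - l - 2) r^2 / (1 - r^2), asymptotically chi-square(1)."""
    keep = ~np.isnan(g)
    g, y, P = g[keep], y[keep], pcs[keep]
    D = np.column_stack([np.ones(len(y)), P])    # intercept plus local PCs
    g_res = g - D @ np.linalg.lstsq(D, g, rcond=None)[0]
    y_res = y - D @ np.linalg.lstsq(D, y, rcond=None)[0]
    r = np.corrcoef(g_res, y_res)[0, 1]
    N, l = len(y), P.shape[1]
    return (N - l - 2) * r ** 2 / (1 - r ** 2)
```

In practice one would call `local_pcs` once per 20-Mb window and `lapcc_stat` once per SNP in the 4-Mb core.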
To interrogate local structure, we calculate the coefficients of multiple determination (R²) and the squared coefficients of canonical correlation (λ²) between the global PCs and the local PCs. Let N denote the sample size, A = [a₁, ..., a_K] the N × K matrix consisting of the first K global PCs, and B = [b₁, ..., b_K] the N × K matrix consisting of the first K local PCs in a local window. The coefficient of multiple determination R_j² for b_j and A is the R² for the linear regression of b_j on A. The jth largest squared coefficient of canonical correlation λ_j² between A and B is the jth largest coefficient of determination between any linear combination of B's columns and any linear combination of A's columns. Mathematically, R₁² ≤ λ₁², and R₁² + ... + R_K² = λ₁² + ... + λ_K². R_j² equals the jth diagonal element, and λ_j² the jth largest eigenvalue, of B′AA′B, and R_j² = r_j1² + ... + r_jK², where r_ji² = (b_j′a_i)² is the coefficient of determination between b_j and a_i. Canonical correlation analysis and multiple-determination analysis enable us to evaluate the degree of discrepancy between local and global PCs and facilitate the exploration of population structure. R² indicates how much variation in the local PCs can be accounted for by the global PCs, and λ² measures the shared variance between the local and global PCs.
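Because the columns of A and B are orthonormal eigenvectors, both quantities can be read directly off the matrix B′AA′B; a minimal NumPy sketch (illustrative, not part of LAPCC.exe):

```python
import numpy as np

def r2_and_lambda2(A, B):
    """Multiple determination R_j^2 and squared canonical correlations lambda_j^2.

    A: N x K matrix of the first K global PCs (orthonormal columns).
    B: N x K matrix of the first K local PCs (orthonormal columns).
    R_j^2 is the j-th diagonal element, and lambda_j^2 the j-th largest
    eigenvalue, of B'AA'B; their sums coincide, as stated in the text.
    """
    M = B.T @ A @ A.T @ B
    R2 = np.diag(M).copy()
    lam2 = np.sort(np.linalg.eigvalsh(M))[::-1]
    return R2, lam2
```

For any orthonormal A and B, `R2.sum()` equals `lam2.sum()` and `lam2[0]` is never smaller than `R2.max()`, matching the identities above.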
2. Methods
We will illustrate how to use the software package LAPCC.exe with three genome-wide genotypic data sets: the Maywood, Nigeria, and Framingham data sets (27, 28). The file LAPCC.exe can be downloaded from http://darwin.cwru.edu/LAPCC/LAPCC.exe. The user needs to put the file LAPCC.exe and the formatted data in the same folder.
2.1. Formatting the Data
First, we will explain how to preprocess and format the data. For example, the Maywood cohort comprises 775 unrelated African Americans from Maywood, IL, with 909,622 SNPs genotyped on the Affymetrix 6.0 platform. We drop 74 individuals because of possible DNA contamination, false identity, and relatedness. For the 701 retained individuals, we remove 86,800 SNPs with missing rate >5% or minor allele frequency <1%. The final analysis data set includes 822,822 SNPs in each of 701 individuals. Similarly, after QC, the Nigeria Affymetrix 6.0 data set contains 759,222 SNPs genotyped for 982 individuals. For the Framingham data set, Mendelian errors are checked, and the corresponding SNPs with Mendelian inconsistencies are set missing. SNPs with HWE P values <10⁻⁶ are dropped. We select unrelated individuals from each family (i.e., spouses) based on an algorithm that prioritizes individuals
with higher genotyping rates, selecting individuals at random when needed. After QC, the Framingham data set contains 415,281 SNPs genotyped for 1,106 unrelated individuals. For each chromosome of each data set, five plain-text input files must be prepared, as in the following example. In the Maywood data set, the first autosome contains 67,242 SNPs. The first file is conf_chr1.txt, which contains three rows as below:
67242
711153
247165315
The first row is the number of SNPs on autosome 1; the second and third rows give the base pair positions of the first and the last SNPs on the autosome, respectively. The second input file is bp_chr1.txt, containing 67,242 rows. Each row contains the base pair position of one SNP, and all the base pair positions are sorted in ascending order:
711153
730720
742429
751595
......
The third file is snp_chr1.txt, containing the rs-numbers of all the 67,242 SNPs:
rs12565286
rs12082473
rs3094315
rs2286139
......
The fourth file is geno_chr1.txt, containing a 701-by-67,242 white-space-separated matrix of genotypic scores. For each SNP (column), a reference allele is randomly assigned, and the copy number of the reference allele in each person is recorded as his/her genotypic score at the SNP. Missing genotypes are recorded as 9. In the file geno_chr1.txt, the genotypic scores of the first five SNPs of the first four individuals are as below:
1 2 2 2 2 ...
2 1 2 2 2 ...
2 1 0 1 2 ...
2 9 2 2 2 ...
... ... ... ... ... ...
The last file is y_height.txt, listing the trait values (the residuals of height after adjusting out covariates such as sex, age, and age-squared), one person per row:
0.710633
0.346286
1.25865
0.350496
......
2.2. Running the Package
After preparing the five plain-text files, the user can run LAPCC.exe to process all or specified autosomes in parallel. For each specified autosome, LAPCC.exe outputs the windows, the window-wide R² and λ²-values, and the P values of all input SNPs on the autosome. For the three aforementioned populations, we compare local and global PCs to uncover local ancestry patterns. Figure 2 presents the λ²-values between the window-wide local PCs and the first 10 global PCs. In general, the local PCs and global PCs display only a small degree of correlation. However, the Maywood and Framingham data sets demonstrate substantially larger λ²-values, and larger variation in λ²-values, than does the Nigerian data set. These results suggest that there is relatively little population structure in the Nigerian sample, whereas Maywood and Framingham display relatively more population structure. Somewhat surprisingly, the Framingham data set demonstrates much more complex local population structure than do the other two data sets.
3. Notes
1. Setting ℓ = 10 is somewhat arbitrary. Patterson et al. (29) developed a formal procedure to determine ℓ from the genotypic data. This procedure is based on the Tracy–Widom theory (29, 30) and proves to be conservative if the normalized genotypic data matrix is a true Wishart matrix (29). Improvements may be possible, since the first ℓ PCs may not be the most informative. 2. According to our analysis of real data, normalization may hurt the representativeness of the global (local) PCs for the true ancestry when the data involve more than two ancestral populations (Fig. 3). For the case of only two ancestral populations, the results without normalization are mathematically valid. Thus, we do not suggest normalization, even though it may yield better results under certain conditions (29) that may be violated in practice.
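The distinction drawn in Note 2 is between centering each marker and additionally scaling it by an estimate of its standard deviation, as in ref. 29. A sketch of both variants (illustrative code, not the LAPCC.exe implementation; missing data are not handled here):

```python
import numpy as np

def genotype_pcs(G, n_pcs=2, normalize=False):
    """PC coordinates of an M x N genotype matrix (markers x individuals).

    normalize=False reproduces the centering-only choice recommended here;
    normalize=True additionally scales each marker by sqrt(p(1-p)), where p
    is the sample allele frequency, as in the normalization of ref. 29.
    """
    X = G.astype(float) - G.mean(axis=1, keepdims=True)
    if normalize:
        p = G.mean(axis=1) / 2.0
        sd = np.sqrt(p * (1.0 - p))
        X = X / np.where(sd > 0, sd, 1.0)[:, None]
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:n_pcs].T          # one row of PC coordinates per individual
```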
[Fig. 2 here: window-wide λ² values plotted by autosome in three panels: (a) Maywood, 670 windows; (b) Nigeria, 670 windows; (c) Framingham, 647 windows.]
Fig. 2. The distributions of the λ²-values of the local windows in three GWAS data sets. For each window in a given genotype data set, λ² is the largest squared coefficient of canonical correlation between the first 10 local PCs and the first 10 global PCs. Relatively, the Maywood participants demonstrate more population structure, the Nigerian samples demonstrate little population structure, and the Framingham participants demonstrate a much more complex local population structure than do the other two samples.
3. In our simulations, random correlations between markers have little impact on the global PCs of tens of thousands to millions of genome-wide SNPs (Fig. 4). The first global PC coordinates of a large number of unlinked ancestry-informative markers (AIMs) across the genome are highly correlated with the true individual global ancestries. The first global PC coordinates using more random markers represent the true global ancestries even better, regardless of the abundant LD. This suggests that global
[Fig. 3 here: PC1 vs. PC2 scatter plots from (a) normalization-based PCA and (b) centralization-based PCA of seven groups: CEPH (90), Denver Chinese (107), Han Chinese (109), Japanese (105), Luhya (108), Tuscan (66), and Yoruba (112).]
Fig. 3. PCs with (a) and without (b) normalizing the GAW17 genotypic data set of 697 unrelated individuals. Clearly, the PCs without normalization provide better discrimination, although both sets of PCs roughly classify the 697 individuals into three large groups: CEPH and Tuscan; Luhya and Yoruba; and Denver Chinese, Han Chinese, and Japanese. The PCs without normalization also appear more robust to outliers.
PCs of all available genotypic data would represent global ancestries well. In the same vein, we do not suggest thinning SNPs to calculate the local PCs. 4. The local regions are often empirically created and thus may be far from optimal. Standard local PCs may be sensitive to apparently random inter-SNP correlations, especially for
[Fig. 4 here: scatter plot of the standard first global PC coordinates against individual global ancestries for three marker sets: AIMs (r = −0.9844), 19,697 random SNPs (r = −0.9939), and 196,974 random SNPs (r = −0.9981).]
Individual global ancestries Fig. 4. Pearson correlation coefficients between the standard first global PC coordinates of distinct subsets of 1,969,739 SNPs and individual global ancestries of the 2,000 individuals. The data set is generated by the GenoAnceBase0 program (13) applied to the CEU and YRI haplotypes of the HapMap data (Phase II) to simulate African-American genomes. The standard first global PC coordinates of the 3,029 unlinked AIMs across the genome are highly correlated with true individual global ancestries. The first standard global PC coordinates using more random markers represent the true global ancestries even better, regardless of there being more abundant LD.
local genomic regions without sufficient ancestry-informative markers. Thus, for local PCs, one may weight the genotypic scores of the SNPs within a local window (31). It is reasonable to incorporate the adjustment of global PCs to prevent false-positives due to other sources of confounding. 5. The LAPCC was designed for data sets comprising unrelated individuals and would fail to account for other types of sample structure, i.e., cryptic relatedness and family structure. Cryptic relatedness may occur in a wide range of data sets, and it is necessary to model family structure in family-based association studies with sample ascertainment (32). It may be instructive to incorporate the full covariance structure across individuals using mixed models.
Acknowledgments
We thank the members of Dr. Zhu's lab for their helpful comments. This work was supported by NIH grants HL074166, HL086718, and HG003054 to XZ.
References
1. Knowler WC, Williams RC, Pettitt DJ, Steinberg AG (1988) Gm3;5,13,14 and type 2 diabetes mellitus: an association in American Indians with genetic admixture. Am J Hum Genet 43: 520–526 2. Lander ES and Schork NJ (1994) Genetic dissection of complex traits. Science 265: 2037–2048 3. Devlin B and Roeder K (1999) Genomic control for association studies. Biometrics 55: 997–1004 4. Pritchard JK, Stephens M, Rosenberg NA, Donnelly P (2000) Association mapping in structured populations. Am J Hum Genet 67: 170–181 5. Satten GA, Flanders WD, Yang Q (2001) Accounting for unmeasured population substructure in case–control studies of genetic association using a novel latent-class model. Am J Hum Genet 68: 466–477 6. Zhu X, Zhang S, Zhao H, Cooper RS (2002) Association mapping, using a mixture model for complex traits. Genet Epidemiol 23: 181–196 7. Zhang S, Zhu X, Zhao H (2003) On a semiparametric test to detect associations between quantitative traits and candidate genes using unrelated individuals. Genet Epidemiol 24: 44–56 8. Marchini J, Cardon LR, Phillips MS, Donnelly P (2004) The effects of human population structure on large genetic association studies. Nat Genet 36: 512–517 9. Campbell CD et al. (2005) Demonstrating stratification in a European American population. Nat Genet 37: 868–872 10. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38: 904–909 11. Zhu X et al. (2008a) A unified association analysis approach for family and unrelated samples correcting for stratification. Am J Hum Genet 82: 352–365
12. Zhu X et al. (2008b) Admixture mapping and the role of population structure for localizing disease genes. Adv Genet 60: 547–569 13. Qin et al. (2010) Interrogating local population structure for fine mapping in genomewide association studies. Bioinformatics 26 (23): 2961–2968 14. Cavalli-Sforza LL and Bodmer WF (1999) The genetics of human populations. Dover Publications, Mineola, New York 15. Epstein MP et al. (2007) A simple and improved correction for population stratification in case– control studies. Am J Hum Genet 80: 921–930 16. Novembre J and Stephens M (2008) Interpreting principal component analyses of spatial population genetic variation. Nat Genet 40: 646–649 17. Tang H et al. (2007) Recent genetic selection in the ancestral admixture of Puerto Ricans. Am J Hum Genet 81(3): 626–633 18. Wang X et al. (2011) Adjustment for local ancestry in genetic association analysis of admixed populations. Bioinformatics 27(5): 670–677 19. Voight BF et al. (2006) A map of recent positive selection in the human genome. PLoS Biol 4, e72 20. Sabeti PC et al. (2007) Genome-wide detection and characterization of positive selection in human populations. Nature 449: 913–918 21. Crow JF and Kimura M (1970) An introduction to population genetics theory. Harper & Row, New York, 469–478 22. Patterson N et al. (2004) Methods for highdensity admixture mapping of disease genes. Am J Hum Genet 74: 979–1000 23. Tang H et al. (2006) Reconstructing genetic ancestry blocks in admixed individuals. Am J Hum Genet 79: 1–12 24. Zhu X et al. (2006) A classical likelihood based approach for admixture mapping using EM algorithm. Hum Genet 120: 431–445 25. Sankararaman, S. et al. (2008) Estimating local ancestry in admixed populations. Am J Hum Genet 82: 290–303
26. Price AL et al. (2009) Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genet 5, e1000519 27. Kang, SJ et al. (2010) Genome wide association of anthropometric traits in African and African derived populations. Human Molecular Genetics 19 (13): 2725–2738 28. Levy D. et al. (2009) Genome-wide association study of blood pressure and hypertension. Nat Genet 41: 677–687 29. Patterson N, Price AL, Reich D (2006) Population structure and eigenanalysis. PLoS Genet 2
(12): 2074–2093, e190. doi:10.1371/journal.pgen.0020190 30. Johnstone I (2001) On the distribution of the largest eigenvalue in principal components analysis. Ann Stat 29: 295–327 31. Zou F et al. (2010) Quantification of population structure using correlated SNPs by shrinkage principal components. Human Heredity 70: 9–22 32. Price AL et al. (2010) New approaches to population stratification in genome-wide association studies. Nature Reviews 11: 459–463
Chapter 22
Haplotype Inference
Xin Li and Jing Li
Abstract
Haplotypes, as they specify linkage patterns between individual nucleotide variants, confer critical information for understanding the genetics of human diseases. However, haplotype information is not directly obtainable from high-throughput genotyping platforms. In this chapter, we introduce two representative methods for reconstructing haplotypes from unphased genotype data: one for unrelated individuals and the other for families.
Key words: Haplotype, Genotype, Pedigree, Population, Mendelian law, Linkage disequilibrium, PedPhase 3.0, FASTPHASE
1. Introduction: Haplotype Inference and Two Types of Methods
Humans are diploid, with two homologous chromosomes, one from each parent. Current genotyping technologies yield unphased genotypes, i.e., pairs of unordered alleles with unidentified parental origins. A haplotype refers to a combination of alleles on a single chromosome or, in other words, alleles of the same parental origin. The problem of phasing, or haplotyping, is to restore the parental origins of alleles, and consequently their haplotypes, from unphased genotypes using computational methods. As haplotype information is critical in various applications, the development of efficient haplotyping methods has drawn much attention (for reviews, see refs. 1–5). In this chapter, we introduce two state-of-the-art haplotyping methods and give step-by-step guidelines on how to apply them to different types of data. Methods for haplotype inference can be classified into two categories based on the type of data they use. One type of method is called population-based, which works on unrelated individuals; the other type is called family- or pedigree-based, which works on
X. Li and J. Li
relatives. Methods for the two data types are fundamentally different regarding the information they employ. Population-based methods basically exploit the property that short segments of common haplotypes are extensively shared among many individuals. By contrast, methods for pedigree data make use of the Mendelian law of inheritance to recover the parental origins of alleles. Here, we pick two representative programs to explore: FASTPHASE (6) for population data and PedPhase 3.0 (7) for pedigree data, both of which are pioneering in terms of computational efficiency and methodologically typical of their respective categories. FASTPHASE is an efficient program to reconstruct haplotypes among a large number of unrelated individuals. Like many population-based methods, FASTPHASE phases the data by utilizing the property that, over short chromosomal regions, haplotypes tend to form clusters. Compared with another popular program, PHASE (8), which was used to phase the HapMap (9) data, FASTPHASE is slightly less accurate, but it is much more efficient. PedPhase 3.0 is a program to infer haplotypes for pedigrees. As is typical of most family-based methods, PedPhase mainly relies on the Mendelian law of inheritance to infer haplotypes. The program uses linear systems to encode the Mendelian constraints in a family and then solves the systems using disjoint-set data structures. The program is very fast, as it runs in linear time with respect to both pedigree size and marker number. In the following section, we first briefly touch upon the algorithmic aspects of these two methods; after that, we focus on a detailed walkthrough of how to run each of them in a practical setting. In Subheading 3, we offer some useful advice from the perspective of an experienced user, including a list of precautions for running the two programs and important factors to keep in mind when interpreting the results.
So far we have discussed two typical haplotype inference methods for population and family data. However, there are many other approaches in the literature, not covered here, that tackle the same problem. Within each of the two categories, most methods share some common ideas, but they may still have their own advantages and disadvantages in treating different types of data. Here, we provide a literature guide to some of those methods for the convenience of the reader. Methods for population data (10–15) typically use statistical models to characterize haplotype sharing among unrelated individuals, similar to FASTPHASE. On the other hand, most methods for family data (16–23) utilize heuristic rules to capture Mendelian constraints, similar to PedPhase 3.0. However, not all of these have been implemented. Traditional linkage analysis tools (24–30), originally designed for sparse markers, can also be used to infer haplotypes for family data. However, owing to their assumptions, such approaches are usually slow and do not scale to large families or large numbers of markers.
2. Methods
2.1. FASTPHASE: A Population-Based Method
Human beings are a relatively young species; even between unrelated individuals, we can still observe small segments of shared haplotypes. Motivated by this observation, most haplotyping methods for unrelated individuals exploit the property that, over short chromosomal regions, haplotypes tend to form clusters. These clusters imply shared ancestries, and their behavior can be modeled by evolutionary mechanisms. However, such models are usually very parameter-rich and computationally intensive. FASTPHASE takes a convenient shortcut by directly characterizing the change of cluster membership along the chromosome using a hidden Markov model. Though this approach might sacrifice some biological plausibility, it still captures the local nature of haplotype clusters. As a result, it is comparably accurate though much faster than alternative approaches. FASTPHASE is freely available and can be downloaded from http://stephenslab.uchicago.edu/software.html.
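The haplotype sharing that such cluster models exploit can be illustrated with a toy function (our own, unrelated to the FASTPHASE code) that finds runs of identical alleles between two haplotype strings:

```python
def shared_segments(h1, h2, min_len=4):
    """Return [start, end) index pairs where h1 and h2 carry identical alleles
    for at least min_len consecutive markers."""
    runs, start = [], None
    for i, (a, b) in enumerate(zip(h1, h2)):
        if a == b:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    # close a run that extends to the end of the haplotypes
    if start is not None and len(h1) - start >= min_len:
        runs.append((start, len(h1)))
    return runs
```

For example, `shared_segments("1112212122", "1112112122")` returns `[(0, 4), (5, 10)]`: two short shared segments despite an intervening mismatch, the kind of local sharing a cluster-membership HMM tracks along the chromosome.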
2.2. Input Files
FASTPHASE works on unrelated individuals; therefore, we only need to supply to the program the genotype information of these individuals. The program requires explicit specification of how many individuals and how many markers to import from the input file. These numbers should appear as the beginning two lines of the input file. After that, we should provide the genotypes for individuals, with each individual occupying three lines: the first line is the individual's ID, followed by two lines specifying the genotypes at each marker. Since the input data are unphased genotypes, the order of a pair of alleles—which one should appear in which of the two lines—does not matter. In the output file, the actual orders will be restored according to the inferred allelic phases. The alleles can be specified using either integers or characters, with question mark "?" indicating a missing value. A simple sample input file is provided below. This input file contains five individuals with identifiers (IDs) 1–5, each with ten markers. We emphasize again that in the input file the two lines are not necessarily haplotypes. Pairs of unordered alleles are just arbitrarily arranged into two lines:

5
10
1
1112112?22
2212111?22
2
1122111122
1212222122
3
414
X. Li and J. Li
1112222122
1112112122
4
1111112122
1112112111
5
2112221122
1121112122

The program can be easily launched from the command line by typing the command name followed by the input file name, as below:

FASTPHASE sample_input.txt

2.3. Output Files
Among multiple output files, the .out file is the one holding the inferred haplotypes. Opening the output file, we can see a list of protocol declarations. After that, the actual inferred haplotypes are wrapped between the keywords “BEGIN GENOTYPES” and “END GENOTYPES.” The following is the haplotype portion of the output file generated from the sample input file. Here, each line now represents an actual inferred haplotype, as compared with the arbitrary orders in the input file. Notice that many loci are swapped to restore the correct phases.
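The structure just described can be read back programmatically. The sketch below is a hypothetical helper, not part of FASTPHASE itself: it assumes the section between the two keywords mirrors the input layout (an ID line followed by two haplotype lines per individual); the exact `.out` layout may vary slightly between program versions, so treat this as a starting point.

```python
def parse_fastphase_out(text):
    """Extract inferred haplotypes from the block between the
    BEGIN GENOTYPES / END GENOTYPES keywords. Assumes each individual
    occupies three lines: an ID line followed by two haplotype lines
    (mirroring the input layout described above)."""
    lines = [ln.strip() for ln in text.splitlines()]
    start = lines.index("BEGIN GENOTYPES") + 1
    end = lines.index("END GENOTYPES")
    block = [ln for ln in lines[start:end] if ln]
    haplotypes = {}
    for i in range(0, len(block), 3):
        ind_id, hap1, hap2 = block[i], block[i + 1], block[i + 2]
        haplotypes[ind_id] = (hap1, hap2)
    return haplotypes
```

For example, `parse_fastphase_out(open("sample_input.txt.out").read())` would return a dictionary mapping each individual ID to its pair of phased haplotype strings.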
For population data, the accuracy of haplotype reconstruction mainly depends on the sample size and the marker density. As individuals are not related, we need a large number of samples (preferably more than 100 individuals) in order to obtain the desired amount of haplotype sharing among these individuals. Meanwhile, the method also requires significant linkage disequilibrium to limit the haplotype diversity in a region. In this sense, we prefer the marker interval to be as small as possible. We discuss these issues further in Note 2.

2.4. PedPhase 3.0: A Family-Based Method
The Mendelian law of inheritance dictates that an individual inherits one allele from each of the two parents at each locus. This strong constraint can help resolve much of the uncertainty in allelic phases when family structures are available. As we know, a child's alleles must come from the parents. If one of the parents is homozygous (carrying two identical alleles), we can unambiguously recover the parental origin of the corresponding allele in the child. Beyond this information, we may also combine the constraints from different loci of a chromosome. Within short chromosomal regions, we can assume there is no meiotic recombination; hence, a child will inherit an intact chromosomal segment from a parent. This constraint, often termed the zero-recombinant constraint, is widely used in family-based haplotyping approaches. PedPhase is a haplotyping toolkit which implements a set of haplotyping algorithms (30–33). In this chapter, we focus only on the most recent development, which is implemented in PedPhase 3.0. The algorithm itself is termed "DSS" because it uses disjoint-set structures in its implementation. DSS identifies both types of constraints: Mendelian constraints and zero-recombinant constraints. The program uses a linear system to encode both types of constraints and solves the system using disjoint-set structures. The computational complexity of the method is linear in both family size and marker number. The approach is theoretically sound, in the sense that it exhausts all existing constraints in a family and guarantees to find all possible configurations compatible with those constraints. PedPhase is freely available and can be downloaded from http://vorlon.case.edu/~jxl175/haplotyping.html.
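The disjoint-set structure after which DSS is named is a standard data structure; the internals of DSS itself are not reproduced here, but a generic union-find sketch illustrates why merging and querying constraint groups costs near-constant amortized time per operation, which is what makes a linear overall complexity achievable.

```python
class DisjointSet:
    """Generic disjoint-set (union-find) with path compression and
    union by rank -- the data structure referred to by the name DSS.
    This is an illustrative sketch, not code from PedPhase 3.0."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        # Follow parent pointers to the set representative,
        # compressing the path along the way.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        # Merge the sets containing x and y (e.g., two groups of alleles
        # forced equal by a Mendelian or zero-recombinant constraint).
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
```

In a DSS-like algorithm, each constraint derived from the pedigree triggers a `union`, and checking whether two phase variables are already linked is a pair of `find` calls.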
2.5. Input Files
Unlike population-based methods, PedPhase 3.0 requires both genotypes and pedigree structures in order to perform haplotyping. Therefore, the input file should supply both types of information. PedPhase 3.0 takes as input a commonly used file format called the "linkage" format, which specifies both pedigree structures and genotypes for each individual in a single file. In this file format, each line represents an individual: the first six fields of a line are family structure fields, and the rest of the fields are a list of alleles of this individual. The six fields of the pedigree structure are family
ID, individual ID, father ID, mother ID, sex, and disease status (optional). All fields should be integers: sex should be 1 for male and 2 for female, and disease status should be 1 for normal, 2 for affected, and 0 for unknown. Haplotype inference does not use the disease information; this field is included merely for compatibility with the linkage format. The rest of a line lists the genotypes of the individual, with alleles separated by white space. We use 1 and 2 to denote different alleles, with 0 representing missing alleles. An input file can contain multiple families, each identified by a different family ID. Each family is treated separately; therefore, individuals in each family can be numbered independently. A sample input file is provided below. This input file contains two families, one with seven members and the other with three members. The family structures occupy the first six columns; the remaining columns are alleles. Each individual has 10 markers, and therefore 20 alleles, separated by white space.
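The field layout just described can be parsed with a few lines of code. The following is a hypothetical helper following the description above (six pedigree fields, then pairs of alleles); it is not part of PedPhase.

```python
def parse_linkage_line(line):
    """Parse one individual from a linkage-format line: six pedigree
    fields followed by pairs of alleles, one pair per marker."""
    fields = line.split()
    pedigree = {
        "family_id": int(fields[0]),
        "individual_id": int(fields[1]),
        "father_id": int(fields[2]),
        "mother_id": int(fields[3]),
        "sex": int(fields[4]),       # 1 = male, 2 = female
        "disease": int(fields[5]),   # 0 = unknown, 1 = normal, 2 = affected
    }
    alleles = [int(a) for a in fields[6:]]  # 0 denotes a missing allele
    # Pair consecutive alleles: one (a1, a2) genotype per marker.
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return pedigree, genotypes
```

For example, parsing the line `"1 3 1 2 1 2 1 2 0 0 2 2"` yields an affected male (individual 3 of family 1, with parents 1 and 2) typed at three markers, the second of which is missing.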
The program can be easily launched from the command line by typing the command name "DSS" followed by the input file name, as below:

DSS sample_input.txt

2.6. Output Files
The output file will contain the inferred haplotypes for each individual. Each individual will have three lines: the first line restates the family ID, individual ID, father ID, mother ID, sex, and disease status, and the second and third lines are the paternal and maternal haplotypes. A sample output file generated from the sample input is provided below:
Compared with population data, haplotypes inferred from family data are in general more accurate. Mendelian constraints are very powerful for resolving allelic phases. Even given very small family structures of 3–4 members, family-based methods can still yield much higher accuracy than population-based methods. However, the application of Mendelian constraints requires both parents and children to be genotyped. DSS can handle occasional ungenotyped family members, but this will significantly slow down the computing process. We elaborate on the scalability of the method to different data settings in Note 1.
In the notes, we discuss some precautions in running FASTPHASE and PedPhase 3.0, namely, what sorts of data are suitable for each of these two programs. We also give some empirical assessment of the relative reliabilities of the outputs produced by these two programs under different data settings. Obviously, FASTPHASE treats population data, and PedPhase 3.0 handles pedigree data. However, there are further requirements that must be met to warrant valid results. We describe the input data with respect to the following five characteristics: number of individuals, number of markers, average marker interval, genotyping errors, and missing genotypes.
3. Notes

1. Number of individuals and number of markers. Both FASTPHASE and PedPhase 3.0 have computational complexity linear in the number of individuals and the number of markers. Therefore, on a theoretical basis, both are scalable in these two data dimensions. However, comparing absolute running times, population-based methods are in general much slower than family-based methods. We have run these two programs on an ordinary PC with a 2-GHz CPU and 1 GB of memory. FASTPHASE takes more than an hour to finish 200 unrelated individuals with 1,000 markers, while PedPhase 3.0 takes only a few seconds for an equivalent amount of family data. These numbers are typical for most population-based methods owing to the nature of the search space. Therefore, in many modern data settings with millions of markers and hundreds of individuals, haplotyping using population information is still very time-consuming. One possible way to overcome this difficulty is to segment the markers and run each segment using distributed computing. One might also think of dividing the data by individuals and running each subgroup independently. However, this is not valid: for population data, individuals are mutually informative, so separating them will significantly impair accuracy. Family-based methods are in general very fast. However, complete family structures are required in order to apply Mendelian constraints. For missing members in a family, most methods need to enumerate their genotypes, which can be very time-consuming or even infeasible. We should therefore be aware of such excessive processing time if families contain untyped individuals.
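The marker-segmentation strategy mentioned above (split by markers, never by individuals) can be sketched as a simple windowing function. The chunk size and overlap below are the analyst's choices, not values prescribed by either program; an overlap between adjacent windows can help stitch the per-window solutions back together.

```python
def marker_chunks(n_markers, chunk_size, overlap=0):
    """Split marker indices 0..n_markers-1 into consecutive windows
    [start, end) so each window can be phased on a separate machine.
    Requires chunk_size > overlap."""
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < n_markers:
        chunks.append((start, min(start + chunk_size, n_markers)))
        if start + chunk_size >= n_markers:
            break  # last window reached the end
        start += step
    return chunks
```

For instance, 25 markers split into windows of 10 with an overlap of 2 gives windows (0, 10), (8, 18), and (16, 25), so every adjacent pair shares two markers for stitching.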
2. Marker density. Both FASTPHASE and PedPhase 3.0 require dense markers to work properly. FASTPHASE requires significant linkage disequilibrium between markers, or sufficient haplotype sharing among individuals, which holds only when markers are dense. The DSS function of PedPhase 3.0 is based on zero-recombinant haplotyping and can accommodate only occasional recombination events; therefore, it also requires high marker density. But the required density can be very different for the two approaches. For family-based data, PedPhase 3.0 can probably handle data with a marker interval no greater than 0.1 centimorgan, or 100 kbp. For population-based data, the recommended marker interval should be no greater than a few kilobase pairs.

3. Genotyping errors. Both methods can tolerate genotyping errors within a reasonably controlled fraction. However, neither program has an explicit mechanism to model and control the behavior of genotyping errors. Therefore, it is essential to clean the data beforehand, as the influence of typing errors can be unpredictable. For FASTPHASE, since population information cannot tell apart typing errors from correct genotypes, incorrect genotypes will be treated as correct ones. For PedPhase 3.0, genotyping errors may cause a violation of the Mendelian law, which is examined in each family; the program will discard loci showing Mendelian inconsistency. However, a genotyping error may accidentally appear as valid Mendelian inheritance or mimic a recombination event, which will consequently confuse the program. Therefore, in our experience, the genotyping error rate should not exceed 1% if reliable results are to be obtained. In practice, carefully managed experiments normally have lower error rates on many popular platforms.

4. Missing genotypes. Both programs can accept and impute missing genotypes.
However, the influence of missing genotypes, and the power of these programs to impute them, differs. We previously performed an empirical study of this issue (34) using the same number of individuals in both settings. We found that, given families of two parents and two children, family constraints could correctly impute roughly 85% of missing genotypes. By contrast, population information could impute around 90% of missing genotypes, given 100 or more unrelated individuals. The primary reason is that when all family members have missing genotypes at a particular locus, no information can be obtained from the family.
However, with respect to the accuracy in resolving allelic phases, family information is much more powerful than population information. In the same setting of families of four versus 100 unrelated individuals, the accuracy in restoring the correct phases of heterozygous alleles is 90% versus 50%. Generally speaking, population information is more powerful for imputing missing genotypes, while family information is more powerful for resolving allelic phases. We should be aware of their relative reliabilities when interpreting haplotyping results or using them in downstream applications.
Acknowledgments This work was supported in part by NIH R01 LM008991.
References

1. Bonizzoni, P., Della Vedova, G., Dondi, R., and Li, J. (2003) The haplotyping problem: an overview of computational models and solutions. Journal of Computer Science and Technology 18:675–688.
2. Gusfield, D. (2004) An overview of combinatorial methods for haplotype inference. Computational Methods for SNPs and Haplotype Inference. 599–600.
3. Halldorsson, B., Bafna, V., Edwards, N., Lippert, R., Yooseph, S., and Istrail, S. (2004) A survey of computational methods for determining haplotypes. Computational Methods for SNPs and Haplotype Inference. 613–614.
4. Zhang, X., Wang, R., Wu, L., and Chen, L. (2006) Models and algorithms for haplotyping problem. Current Bioinformatics 1:105–114.
5. Li, J. and Jiang, T. (2008) A survey on haplotyping algorithms for tightly linked markers. Journal of Bioinformatics and Computational Biology 6:241–259.
6. Scheet, P. and Stephens, M. (2006) A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. The American Journal of Human Genetics 78:629–644.
7. Li, X. and Li, J. (2009) An almost linear time algorithm for a general haplotype solution on tree pedigrees with no recombination and its extensions. Journal of Bioinformatics and Computational Biology 7:521–545.
8. Stephens, M., Smith, N., and Donnelly, P. (2001) A new statistical method for haplotype reconstruction from population data. The American Journal of Human Genetics 68:978–989.
9. The International HapMap Consortium. (2003) The International HapMap Project. Nature 426:789–796.
10. Excoffier, L. and Slatkin, M. (1995) Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Molecular Biology and Evolution 12:921–927.
11. Hawley, M. and Kidd, K. (1995) HAPLO: a program using the EM algorithm to estimate the frequencies of multi-site haplotypes. Journal of Heredity 86:409–411.
12. Niu, T., Qin, Z., Xu, X., and Liu, J. (2002) Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. The American Journal of Human Genetics 70:157–169.
13. Qin, Z., Niu, T., and Liu, J. (2002) Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms. American Journal of Human Genetics 71:1242–1247.
14. Sun, S., Greenwood, C., and Neal, R. (2007) Haplotype inference using a Bayesian hidden Markov model. Genetic Epidemiology 31:937–948.
15. Browning, S. and Browning, B. (2007) Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. The American Journal of Human Genetics 81:1084–1097.
16. O'Connell, J. (2000) Zero-recombinant haplotyping: applications to fine mapping using SNPs. Genetic Epidemiology 19:S64–S70.
17. Qian, D. and Beckmann, L. (2002) Minimum-recombinant haplotyping in pedigrees. The American Journal of Human Genetics 70:1434–1445.
18. Tapadar, P., Ghosh, S., and Majumder, P. (2000) Haplotyping in pedigrees via a genetic algorithm. Human Heredity 50:43–56.
19. Zhang, K., Sun, F., and Zhao, H. (2005) HAPLORE: a program for haplotype reconstruction in general pedigrees without recombination. Bioinformatics 21:90–103.
20. Chan, M., Chan, W., Chin, F., Fung, S., and Kao, M. (2006) Linear-time haplotype inference on pedigrees without recombinations. Algorithms in Bioinformatics. 56–67.
21. Xiao, J., Liu, L., Xia, L., and Jiang, T. (2007) Fast elimination of redundant linear equations and reconstruction of recombination-free Mendelian inheritance on a pedigree. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 655–664.
22. Li, X., Chen, Y., and Li, J. (2010) Detecting genome-wide haplotype polymorphism by combined use of Mendelian constraints and local population structure. Pacific Symposium on Biocomputing 15:348–358.
23. Liu, L., Xi, C., Xiao, J., and Jiang, T. (2007) Complexity and approximation of the minimum recombinant haplotype configuration problem. Theoretical Computer Science 378:316–330.
24. Elston, R. and Stewart, J. (1971) A general model for the genetic analysis of pedigree data. Human Heredity 21:523–542.
25. Lander, E. and Green, P. (1987) Construction of multilocus genetic linkage maps in humans. Proceedings of the National Academy of Sciences 84:2363–2367.
26. Sobel, E. and Lange, K. (1996) Descent graphs in pedigree analysis: applications to haplotyping, location scores, and marker-sharing statistics. American Journal of Human Genetics 58:1323–1327.
27. Kruglyak, L., Daly, M., Reeve-Daly, M., and Lander, E. (1996) Parametric and nonparametric linkage analysis: a unified multipoint approach. American Journal of Human Genetics 58:1347–1363.
28. Gudbjartsson, D., Jonasson, K., Frigge, M., and Kong, A. (2000) Allegro, a new computer program for multipoint linkage analysis. Nature Genetics 25:12–13.
29. Abecasis, G., Cherny, S., Cookson, W., and Cardon, L. (2001) Merlin – rapid analysis of dense genetic maps using sparse gene flow trees. Nature Genetics 30:97–101.
30. Abecasis, G. and Wigginton, J. (2005) Handling marker-marker linkage disequilibrium: pedigree analysis with clustered markers. The American Journal of Human Genetics 77:754–767.
31. Li, J. and Jiang, T. (2003) Efficient inference of haplotypes from genotypes on a pedigree. International Journal of Bioinformatics and Computational Biology 1:41–70.
32. Doi, K., Li, J., and Jiang, T. (2003) Minimum recombinant haplotype configuration on tree pedigrees. Algorithms in Bioinformatics. 339–353.
33. Li, J. and Jiang, T. (2005) Computing the minimum recombinant haplotype configuration from incomplete genotype data on a pedigree by integer linear programming. Journal of Computational Biology 12:719–739.
34. Li, X. and Li, J. (2007) Comparison of haplotyping methods using families and unrelated individuals on simulated rheumatoid arthritis data. BMC Proceedings 1:S55.
Chapter 23

Multi-SNP Haplotype Analysis Methods for Association Analysis

Daniel O. Stram and Venkatraman E. Seshan

Abstract

This chapter reviews the rationale for the use of haplotypes in association-based testing, discusses statistical issues related to haplotype uncertainty that complicate the analysis, and then gives practical guidance for testing haplotype-based associations with a phenotype or outcome trait, first for candidate gene regions and then for the genome as a whole. Haplotypes are interesting for two reasons: First, they may be in closer LD with a causal variant than any single measured SNP, and therefore may enhance the coverage value of the genotypes over single-SNP analysis. Second, haplotypes may themselves be the causal variants of interest, and some solid examples of this have appeared in the literature. This chapter discusses three possible approaches to the incorporation of SNP haplotype analysis into generalized linear regression models: (1) a simple substitution method involving imputed haplotypes; (2) simultaneous maximum likelihood (ML) estimation of all parameters, including haplotype frequencies and regression parameters; and (3) a simplified approximation to full ML for case–control data. Examples of the various approaches for a haplotype analysis of a candidate gene are provided. We compare the behavior of the approximation-based methods and show that in most instances the simpler methods hold up well in practice. We also describe the practical implementation of genome-wide haplotype risk estimation and discuss several shortcuts that can be used to speed up otherwise potentially very intensive computational requirements.

Key words: Haplotype-specific risk estimation, Phase estimation, Genetic association testing, Expectation-substitution methods, Maximum likelihood, Uncertainty analysis
1. Introduction

Haplotypes are composed of SNPs or other markers or variants on the same chromosome that are inherited together with little chance of contemporary recombination, i.e., recombination during the most recent generation of meioses. A basic approach to characterizing linkage disequilibrium in a region containing two or more (common) SNPs is to identify the common haplotypes of the SNPs that are segregating in the region, thereby defining the amount of
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_23, # Springer Science+Business Media, LLC 2012
423

424

D.O. Stram and V.E. Seshan

        h1  h2  h3
SNP1     1   0   0
SNP2     0   1   0
SNP3     1   1   0

Fig. 1. Three SNPs and three haplotypes. Even if the second SNP is not genotyped, the haplotype (0-1) of SNP1 and SNP3 is perfectly associated with this SNP, although neither SNP1 nor SNP3 alone is perfectly associated with it.
historical recombination that has taken place between these SNPs. The prevalence of recombination between segregating markers is, as described in Chapter 7, dependent upon both the length (or, more technically, the genetic distance from start to end) of the haplotype and upon population history. Within a region of high linkage disequilibrium there is a restriction on the number of haplotypes of a given set of SNPs. For example, it can easily be shown that without recombination the number of distinct haplotypes of a total of m SNPs must be less than or equal to m + 1, and often is far less than this (when many SNPs are in perfect LD with each other). This is much smaller than the total number of haplotypes, 2^m, which occur when all of the m SNPs are in perfect linkage equilibrium. If a causal variant (caused by a single mutation) occurs within a region of limited recombination and then (through selection or random drift) becomes common, the SNP haplotype that this variant fell on necessarily also becomes common. Since there may have been no single SNP that uniquely defined that haplotype as different from all others (see Fig. 1), it follows that haplotype-based association testing may add information for detecting causal variants beyond that contained in the analysis of any of the single SNPs that make up the observed haplotypes. In the figure, the second (causal) SNP falls exclusively on the (0 1) haplotype of the first and third SNPs and is perfectly associated with that haplotype, while not being perfectly correlated with either the first or the third SNP. If we measured only the first and third SNPs, then we would see a stronger haplotype association than any single-SNP association. The above illustration serves as a heuristic rationale for the use of what is often called haplotype block-based analysis. Haplotype blocks (1, 2) are simply regions of limited recombination between SNPs, demarcated by regions with higher levels of historical recombination.
Several algorithms can be used to define such blocks, most notably those implemented in the graphical program Haploview (3). Haplotype-specific association analysis is sometimes thought of as being restricted to such blocks, since the likelihood of picking up an association from an unmeasured variant is much smaller when the measured SNPs are nearly independent (i.e., NOT in the same haplotype block). Since the same comment
applies to single-SNP associations, the only realistic approach to finding risk variants in regions of high recombination is simply to genotype every known variant in the region, with the hope that there are no unknown causal variants there. We start by considering haplotype frequency estimation from genotype data for unrelated participants under the assumption of Hardy–Weinberg Equilibrium (HWE) applied to the haplotypes. An EM implementation of maximum likelihood estimation (4) involves the calculation, for each possible haplotype h, of an estimate of the haplotype dosage d_h(H), which is the count of the number of copies of h contained in the true (but generally unknown) pair of haplotypes H carried by that individual (i.e., d_h(H) = 0, 1, or 2). Starting with a set p_{h_1}, p_{h_2}, ..., p_{h_{2^m}} of haplotype frequency estimates, in each iteration of the EM the expectation step estimates, for each subject i (i = 1, ..., N), the haplotype dosage d_{h,i}(H) conditionally on the genotype data G_i for that subject, treating the current estimates of the haplotype frequencies as if they were known. Assuming HWE, the haplotype dosage estimates are computed for each haplotype h as

E(d_h | G_i) = \frac{\sum_{H \sim G_i} d_h(H) \, p_{h_1} p_{h_2}}{\sum_{H \sim G_i} p_{h_1} p_{h_2}},   (1)
where H ~ G refers to the haplotype pairs H = (h_1, h_2) that are compatible with the known genotypes G. Next, the haplotype frequency estimates p_h are updated in the maximization step as

p_h = \frac{1}{2N} \sum_{i=1}^{N} E(d_h | G_i).
This process is then repeated iteratively. Upon convergence of the algorithm, the haplotype dosage estimates E(d_h | G_i) can be used in association analysis to form score tests for haplotype-specific effects (5, 6), as described below. If the number of SNPs, m, is large, the EM algorithm can be very slow, since with 2^m possible haplotypes there are 2^{m-1}(2^m + 1) possible haplotype pairs being summed over for each subject. However, in regions of limited recombination, most of the haplotype frequency estimates p_h will rapidly approach zero. This observation suggests a simple divide-and-conquer strategy that can be used to increase the number of markers that can be utilized. This algorithm (called the partition-ligation EM algorithm (7)) applies the EM algorithm in "chunks" (partitions) of 5–10 markers, and then performs a stitching together (ligation) of adjoining partitions by running the EM algorithm again. In the ligation EM algorithm, the summation in Eq. 1 is over the cross-product of the haplotypes in each partition that are estimated to have nonzero probability from the earlier EM steps. This method can be effective in estimating haplotype frequencies for 20 or more SNPs.
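The EM iteration just described can be made concrete for a small number of SNPs. The sketch below is a simplified, hypothetical implementation (our own coding and function names, not taken from any published software): genotypes are coded per SNP as the count (0, 1, 2) of one allele, and the full enumeration over all 2^m haplotypes is exactly the exponential cost that partition-ligation avoids.

```python
from itertools import product

def compatible_pairs(genotype):
    """All ordered haplotype pairs (h1, h2) compatible with a
    multilocus genotype coded as per-SNP allele counts 0/1/2."""
    per_site = []
    for g in genotype:
        if g == 0:
            per_site.append([(0, 0)])
        elif g == 2:
            per_site.append([(1, 1)])
        else:  # heterozygous site: two possible phase assignments
            per_site.append([(0, 1), (1, 0)])
    for combo in product(*per_site):
        yield tuple(a for a, _ in combo), tuple(b for _, b in combo)

def em_haplotype_frequencies(genotypes, n_iter=50):
    """EM estimation of haplotype frequencies under HWE: the
    expectation step implements Eq. 1, and the maximization step
    averages the expected dosages over 2N chromosomes."""
    m = len(genotypes[0])
    haps = list(product((0, 1), repeat=m))
    p = {h: 1.0 / len(haps) for h in haps}  # uniform start
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haps}
        for g in genotypes:
            pairs = list(compatible_pairs(g))
            weights = [p[h1] * p[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total  # expected dosage contribution
                counts[h2] += w / total
        p = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return p
```

For example, three individuals with two-SNP genotypes (0, 2), (0, 2), and (2, 0) are all unambiguous, so the EM immediately assigns frequency 2/3 to haplotype (0, 1) and 1/3 to haplotype (1, 0).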
Even if haplotype frequencies are perfectly estimated and all SNPs making up those haplotypes are measured, there can remain an inherent uncertainty in the estimation of haplotypes: a formal calculation of the squared correlation, R_h^2, of the estimates E(d_h | G_i) with the true counts d_h of haplotype h carried by each subject is described in Stram et al. (8); this quantity will be less than one if recombinant haplotypes have nonzero frequency. Haplotype uncertainty reduces the effective sample size of a haplotype-based association analysis (compared to a study that could genotype haplotypes directly) approximately in proportion to R_h^2. For very small numbers of SNPs in linkage equilibrium (but of course near enough that contemporary recombination is rare), or for larger numbers of markers that are in high linkage disequilibrium, the predictability of common haplotypes is generally quite high. Haplotype imputation using fewer SNPs than were considered when defining the haplotypes can also be considered; the aim is to infer haplotypes seen in reference data, such as the HapMap, using the genotypes of tagging SNPs (for candidate gene analyses) or the SNPs available on a genome-wide association platform. The only modification to Eq. 1 that is needed is to have the symbol h represent the haplotypes of interest as seen in the reference data, while G_i represents just the measured genotypes. The uncertainty of such haplotype prediction can also be computed as described in Stram et al. (8).

1.1. Expectation Substitution
Haplotype association testing using the expectation-substitution method for studies of unrelated subjects involves three basic steps: (1) the estimation of haplotype frequencies for all the haplotypes of these SNPs; (2) the formation of estimated haplotype dosage variables E(d_h | G_i) for each individual i with measured genotypes G_i; and then (3) the use of these dosage variables, which we abbreviate as \hat{d}_{h,i}, as (continuous) predictor variables in the generalized linear model analysis. That is, we simply use standard methods to fit the generalized linear regression model (or GLM)

g[E(Y_i)] = \mu + \beta_h \hat{d}_{h,i}   (2)

to estimate haplotype-specific effects for carriers of one or more copies of a single haplotype compared to all other haplotypes. (Here g is the link function, as in logistic or linear regression.) More general estimation of all haplotype effects involves fitting the model

g[E(Y_i)] = \mu + \sum_h \beta_h \hat{d}_{h,i}   (3)
summing over the haplotypes with nonzero estimated probability. Subheading 2.1 gives an example, performed in R, of haplotype risk estimation using data from a published case–control study.
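In the simplest case of a quantitative trait with identity link, step (3) reduces to an ordinary least-squares fit of the phenotype on the estimated dosage. The sketch below is our own minimal illustration of this idea (a case–control analysis would instead use logistic regression, and real analyses would be run in standard GLM software such as R's glm).

```python
def fit_dosage_effect(y, dosage):
    """Ordinary least-squares fit of a quantitative phenotype y on an
    estimated haplotype dosage (identity-link version of Eq. 2).
    Returns (mu, beta_h): intercept and per-copy effect."""
    n = len(y)
    mx = sum(dosage) / n
    my = sum(y) / n
    sxx = sum((x - mx) ** 2 for x in dosage)
    sxy = sum((x - mx) * (yi - my) for x, yi in zip(dosage, y))
    beta = sxy / sxx          # change in mean phenotype per copy of h
    mu = my - beta * mx       # intercept
    return mu, beta
```

Note that the dosage values need not be integers; fractional values of \hat{d}_{h,i} enter the regression exactly like any other continuous covariate.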
Notice that the expectation-substitution method is extremely simple and convenient: once the haplotype dosage variables are estimated for each subject, they are used as ordinary continuous variables in generalized linear models (GLMs) in order to fit associations with phenotypes and disease. As just described, the models estimate the change in mean phenotype, or in log odds of disease, that is associated with a one-copy increase in d_h; i.e., we are fitting an additive or log-additive model to the phenotype means or log odds, respectively. As described in Chapter 18, we may also be interested in estimating other types of models, i.e., dominant, recessive, or unconstrained 2-degree-of-freedom models. Before we discuss the (minor) changes to the expectation-substitution approach that are needed to fit such models, we first consider the impact that uncertainty in the haplotype frequency estimation may have upon testing and estimation in these models. The expectation-substitution method, both as used in general statistical analysis (9) and in haplotype analysis (6, 10), typically has extremely good control of the type I error rate. So long as errors in dosage estimation are non-differential, type I error rates will be preserved. The assumption of non-differential errors means that the haplotype dosage estimates are not predictive of disease except to the degree that they are surrogate variables for the true haplotype dosages. For example, in the analysis of binary traits (D = 0 or 1) we assume that

Pr(D = 1 | \hat{d}_{h,i}, d_{h,i}) = Pr(D = 1 | d_{h,i}),   (4)
where d_{h,i} denotes the value of the true haplotype count d_h for individual i. The assumption of non-differential errors can be violated if, for example, we perform haplotype frequency estimation separately for the cases and controls in a case–control study, since then the random error in haplotype frequency estimation will cause consistent differences in the estimates of E(d_h | G_i) between cases and controls. Proper use of the expectation-substitution method combines cases and controls when estimating haplotype frequencies. It turns out that not only does the expectation-substitution method give correct type I error rates when \hat{d}_{h,i} is substituted for d_{h,i}, but doing so has been shown (6) to be equivalent to performing a score test that the regression parameter \beta_h in Eq. 2 is zero in a model in which both the haplotype frequencies and the regression parameters are estimated simultaneously by maximum likelihood (ML). Thus, for testing purposes, expectation-substitution should have near-optimal characteristics. Expectation-substitution methods have been noted in the exposure measurement error literature to have some biases, both in effect and confidence interval estimation, when used away from the null hypothesis. These biases are most important when the
D.O. Stram and V.E. Seshan
exposure is very influential (i.e., the magnitude of b_h is large) and where there is considerable error in the parameters (here the haplotype frequency estimates) that relate the observed data to the true exposure of interest. Rosner et al. (11) give a correction to the standard errors of an effect estimate that can be used when calibration study data are available. For haplotype analysis, simple corrections to improve the standard errors for regression parameters have not been worked out and would seem to be difficult to derive. Instead, a number of authors (12–14) have considered maximum likelihood estimation of both the haplotype frequency estimates and the risk parameters in general linear regression involving haplotypes, with the hope that these methods will have better statistical properties than the expectation-substitution method. Recently Hu and Lin (15) have shown in a slightly different setting (SNP rather than haplotype imputation) that, when haplotypes involving unmeasured SNPs (i.e., SNPs seen in a reference panel but not genotyped) are to be imputed, the statistical behavior of the ML methods may represent an important improvement over the substitution approach under the alternative hypothesis. That is, the ML estimates have less bias and likelihood-based confidence intervals for b_h have better coverage properties. Again, however, these benefits are mainly seen when the true magnitudes of b_h are quite large; tests for b_h = 0 based upon the expectation-substitution approach were observed by Hu and Lin to retain good type I error properties and to be reasonably powerful compared to ML. Note that, while the range of the values that E(d_h|G_i) takes is from 0 to 2, this expectation is not necessarily equal to an integer value. Only if R_h^2 is precisely equal to 1 for that haplotype will all values be integers (0, 1, and 2). Allowing E(d_h|G_i) to take non-integer values largely corrects for the uncertainty of haplotype estimation by removing attenuation bias.
This is especially true when we are only interested in inferring haplotypes of the main study genotypes G_i rather than inferring haplotypes that were seen in a reference panel that includes SNPs not genotyped in the main study.

1.2. Fitting Dominant, Recessive, or 2-Degree of Freedom Models for the Effect of Haplotypes
The EM algorithm can be used to compute the expectation of functions of the haplotype count; for example, we can fit a generalized linear model using a codominant (2 degree of freedom) coding as

g(E(Y_i)) = m + b_1 I(d_h,i = 1) + b_2 I(d_h,i = 2),
(5)
where I( ) is the indicator function, by replacing the indicator functions with their expectations given the observed genotype data, i.e.,

g(E(Y_i)) = m + b_1 E{I(d_h,i = 1)} + b_2 E{I(d_h,i = 2)}.
(6)
23
Multi-SNP Haplotype Analysis Methods for Association Analysis
The expectations of the indicator functions are computed, when the EM algorithm converges, as

E{I(d_h,i = 1)} = Σ_{H ⊃ G_i} I(d_h(H) = 1) p_h1 p_h2 / Σ_{H ⊃ G_i} p_h1 p_h2

and

E{I(d_h,i = 2)} = Σ_{H ⊃ G_i} I(d_h(H) = 2) p_h1 p_h2 / Σ_{H ⊃ G_i} p_h1 p_h2,
respectively. Fitting dominant or recessive models can be performed by constraining b_1 either to equal b_2 or to equal zero, respectively. There are a number of programs (fastPHASE (16), Beagle (17), PHASE (18)) which provide estimates of the two haplotypes that each individual carries. The EM estimate as described above is not estimating haplotype phase per se. The expectations described above provide an estimate of the marginal posterior probability that a person carries haplotype h1 given G_i and an estimate of the marginal posterior probability that a person carries h2, but not the joint posterior probability that the individual carries both h1 and h2. To estimate this joint probability, Eq. 1 needs to be adjusted to compute the expectation of the product of two indicator functions as in

E{I(d_h1,i = 1) I(d_h2,i = 1) | G_i} = Pr{I(d_h1,i = 1) I(d_h2,i = 1) = 1 | G_i}
  = Σ_{H ⊃ G_i} I(d_h1(H) = 1) I(d_h2(H) = 1) p_h1 p_h2 / Σ_{H ⊃ G_i} p_h1 p_h2.   (7)

Essentially, phasing programs like fastPHASE estimate these probabilities for many pairs of haplotypes and then nominate the haplotype pair with the greatest probability as the phased data for subject i. It is tempting to apply a program such as fastPHASE over the region of interest and then to use the estimated pair of haplotypes, Ĥ_i, as if they were actually equal to the unknown H_i, and to treat the d_h as known when fitting models in Eq. 2 or 5. However, this approach introduces attenuation bias (bias of the regression parameter estimates toward zero) and can also reduce power to reject a false null hypothesis relative to the expectation-substitution approach (10, 19). In general, the use of the most probable phased haplotypes when fitting models in Eq. 2 or 5 for association analysis produces less reliable estimates than the expectation-substitution method.
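To make the quantities above concrete, the following Python sketch (illustrative only; the haplotype labels, the invented frequencies, and the helper names are assumptions for this example) enumerates the haplotype pairs compatible with a genotype and computes the dosage E(d_h|G_i), the indicator expectations of Eq. 6, and the joint carriage probability of Eq. 7:

```python
def posterior_weights(pairs, freq):
    """Posterior weight of each haplotype pair (h1, h2) compatible with a
    genotype, proportional to p_h1 * p_h2 under HWE."""
    w = [freq[a] * freq[b] for a, b in pairs]
    total = sum(w)
    return [x / total for x in w]

def dosage(pairs, freq, h):
    """E(d_h | G): expected number of copies of haplotype h."""
    return sum(w * ((a == h) + (b == h))
               for (a, b), w in zip(pairs, posterior_weights(pairs, freq)))

def indicator_expectation(pairs, freq, h, k):
    """E{I(d_h = k) | G} for k = 1 or 2, as in Eq. 6."""
    return sum(w for (a, b), w in zip(pairs, posterior_weights(pairs, freq))
               if (a == h) + (b == h) == k)

def joint_carriage(pairs, freq, h1, h2):
    """Pr{I(d_h1 = 1) I(d_h2 = 1) = 1 | G} for h1 != h2, as in Eq. 7."""
    return sum(w for (a, b), w in zip(pairs, posterior_weights(pairs, freq))
               if {a, b} == {h1, h2})

# A subject heterozygous at both of two SNPs is compatible with two phases:
freq = {"AB": 0.4, "ab": 0.4, "Ab": 0.1, "aB": 0.1}  # invented frequencies
pairs = [("AB", "ab"), ("Ab", "aB")]
```

With these invented frequencies the first phase has posterior weight 0.16/0.17, so the dosage of haplotype AB is about 0.94 rather than the integer 1 that best-guess phasing would assign; this non-integer dosage is exactly what removes the attenuation bias discussed above.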
Haplotype imputation can be enhanced if genotypes for related individuals are available, as in the HapMap for certain of the populations. However, even if nuclear family data are available, some haplotype uncertainty may remain if one or both parents are heterozygous for two or more markers. While phasing is inherently uncertain for large numbers of SNPs, phased haplotypes are often used as a starting point for SNP imputation purposes, and this can be done effectively so long as an appropriate probabilistic model is incorporated into the imputation. Use of hidden Markov models (HMMs) for SNP imputation underlies the methods of Beagle, fastPHASE, and others for SNP and haplotype imputation. These methods typically start with phased haplotype data available from reference panels such as the HapMap while ignoring any errors or uncertainty in phasing, but only use phase information local to the SNPs that are being imputed. Moreover, the probability model incorporated in the HMM must allow for additional recombination between haplotypes that are not directly seen in the phased reference panel data.

1.3. Global Test for Haplotype Effects
Haplotype dosages estimated for the haplotypes in a given region can be used in several different ways to construct tests and assign significance levels to the results of the analysis. Besides the testing for the effects (on phenotypes) of single haplotypes, global tests of the null hypothesis can be constructed by fitting all haplotype dosage variables simultaneously. For example, suppose that in a given region or haplotype block there are r haplotypes and we wish to examine whether any of the haplotypes are (linearly) related to the phenotype. Assuming additive haplotype effects, the model we are interested in is

g(E(Y_i)) = m + b_1 d_h1,i + b_2 d_h2,i + ... + b_r d_hr,i
(8)
and the null hypothesis of interest is b_1 = b_2 = ... = b_r = 0. This hypothesis can be tested by comparing the log likelihood of the null model (with only m) to that of the full model, fitted by expectation substitution. Several practical issues arise in this testing. First, the model of Eq. 8 is (not surprisingly) overparameterized; for each subject the sum of the haplotype dosage variables is equal to 2, so that imposing a constraint is needed to make the model identifiable. A typical analysis (20) drops the most common haplotype from the model, so that the mean phenotypes of carriers of one or more of the less common haplotypes are compared to those of carriers of two copies of the most common haplotype, this group serving as the baseline comparison group (now captured by the estimate of m). When haplotypes are made up of more than
just a few SNPs, some haplotypes are likely to be present in low frequency. Since the power of detecting the effects of rare haplotypes can be limited (and also convergence problems may arise in the iteratively reweighted least squares algorithm used to fit a model that includes very sparse covariates), it may not be useful to include all rare haplotypes in the model in Eq. 8 directly. If the terms b_j that correspond to rare haplotypes are simply dropped from the model, then whatever effect these haplotypes have is mixed into the effect of the baseline (most common) haplotype. Separating the effects of the rare haplotypes from the common haplotypes is accomplished by computing the sum of all haplotype counts for haplotypes with frequency less than some cutoff value (typically 1–5%) and using this sum as a composite (rare haplotype) effect. If h1 is the most common haplotype and hk through hr are all rare haplotypes, then model Eq. 8 can be modified as

g(E(Y_i)) = m + b_2 d_h2,i + b_3 d_h3,i + ... + b_{k-1} d_h(k-1),i + b_rare (d_hk,i + d_h(k+1),i + ... + d_hr,i),

where all the b estimates relate to the differences in expected phenotype value compared to individuals carrying two copies of the most common haplotype.

1.4. Maximum Likelihood Methods
We now consider maximum likelihood methods for jointly estimating the regression parameters in the (additive) model of Eq. 8, where d_h,i is the count (0, 1, or 2) of copies of haplotype h carried by subject i. For the time being, we ignore issues of population structure or relatedness among subjects and concentrate on studies in which both HWE (for all haplotype counts and hence all markers as well) and independence (between Y_i and Y_j for i ≠ j) can be assumed. Assuming also that there has been no explicit sampling on case–control status or phenotype value, the likelihood to be maximized is

Π_i Pr(Y_i, G_i) = Π_i Σ_H Pr(Y_i, G_i, H)
  = Π_i Σ_H Pr(Y_i | H, G_i) Pr(H, G_i)
  = Π_i Σ_{H ⊃ G_i} Pr(Y_i | H) Pr(H).
Here the likelihood is a function of both the haplotype frequency parameters p_h and the regression parameters in the model in Eq. 8. Under HWE, Pr(H) is simply p_h1 p_h2, as in Eq. 1. The removal of G_i from Pr(Y_i|H, G_i) and Pr(H, G_i) follows because any genotype count can be constructed as the sum of the counts of the haplotypes that contain the alleles counted in G_i.
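As a purely illustrative rendering of this likelihood, the Python sketch below (all names and data are invented for the example) evaluates the observed-data log likelihood under a logistic model Pr(Y = 1 | H) = expit(m + b d_h(H)) for a single haplotype of interest, summing over the phases compatible with each genotype:

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def log_likelihood(subjects, freq, h, m, b):
    """sum_i log sum_{H compatible with G_i} Pr(Y_i | H) p_h1 p_h2, with a
    logistic model Pr(Y = 1 | H) = expit(m + b * d_h(H))."""
    ll = 0.0
    for y, pairs in subjects:            # pairs: phases compatible with G_i
        s = 0.0
        for h1, h2 in pairs:
            d = (h1 == h) + (h2 == h)    # d_h(H) for this phase
            p1 = expit(m + b * d)        # Pr(Y = 1 | H)
            s += (p1 if y == 1 else 1.0 - p1) * freq[h1] * freq[h2]
        ll += math.log(s)
    return ll

freq = {"AB": 0.4, "ab": 0.4, "Ab": 0.1, "aB": 0.1}   # invented frequencies
subjects = [(1, [("AB", "ab"), ("Ab", "aB")]),         # double heterozygote
            (0, [("AB", "AB")])]                       # phase-certain subject
```

At b = 0 the phenotype term factors out of the sum over phases, which provides a simple check of the bookkeeping.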
A generalized EM algorithm (12), which is based upon formulas described in (21) that can be used to maximize this likelihood, involves the following steps:
1. Given an initial set of parameter estimates for m and for each pair of haplotype frequencies p_h and haplotype effects b_h, a calculation of the conditional expected value of the vector of score statistics for the regression parameters is performed. This expectation is obtained by first computing, for each subject i, the score contributions S_i(m, b|G_i) as a weighted average of S_i(m, b|H) with the weights being equal to Pr(H|G_i, Y_i; m, b, p) (where p is the set of haplotype frequencies). These are computed as

Pr{H = (h1, h2) | G, Y} = I(H ⊃ G) Pr(Y|H) p_h1 p_h2 / Σ_{H′ ⊃ G} Pr(Y|H′) p_h1′ p_h2′.   (9)
2. Computing contributions i′_i(m, b) to a pseudo-information matrix as the weighted average over all possible H of the contributions of individual i to the information matrix i(m, b|H) given H, with the weights again equal to Pr{H = (h1, h2)|G, Y}.
3. Summing over i to compute the total score vector S(m, b) and pseudo-information matrix i′(m, b).
4. Updating the parameter vector as (m_new, b_new)^T = (m_old, b_old)^T + i′(m_old, b_old)^{-1} S(m_old, b_old).
We call i′(m, b) a pseudo-information matrix since computing the proper information matrix in the course of an EM algorithm involves additional calculation, as described in (21). Since the expectation given in Eq. 9 depends upon the population haplotype frequencies, updating these frequencies with each new estimate of b_h formally becomes part of step (4) as well. The updating of the haplotype frequencies is now modified from that given in Eq. 1 as

p_h^new = (1/2N) Σ_i E(d_h(H) | G_i, Y_i)
  = (1/2N) Σ_{i=1}^N Σ_{H ⊃ G_i} d_h(H) Pr(H | Y_i, G_i; m_old, b_old, p_old).
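The update in step 4 is a Fisher-scoring (Newton-type) step. As a minimal stand-alone illustration of that update form (not the haplotype model itself; the data are invented), here it is for an intercept-only logistic model:

```python
import math

def scoring_step(m, ys):
    """One update m_new = m_old + i(m_old)^(-1) S(m_old) for the
    intercept-only logistic model Pr(Y = 1) = expit(m)."""
    p = 1.0 / (1.0 + math.exp(-m))
    score = sum(y - p for y in ys)       # S(m): sum of residuals
    info = len(ys) * p * (1.0 - p)       # i(m): Fisher information
    return m + score / info

ys = [1, 1, 1, 0]       # invented data; the MLE is m = log(3/1)
m = 0.0
for _ in range(25):
    m = scoring_step(m, ys)
```

Iterating the update drives the score to zero, so m converges to the MLE log 3; in the haplotype model the scalar score and information are replaced by the vector S(m, b) and matrix i′(m, b) described above.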
1.5. Case–Control Sampling
Case–control sampling leads to the enrichment of cases with high-risk haplotypes (those where bh > 0), thereby distorting our estimates of haplotype frequencies ph, and possibly violating the assumption of HWE in the combined data. To deal with the complications due to
case–control sampling we adopt a simplistic “view” of the way in which the case–control data have been ascertained (12). This approach is appropriate when (1) frequency matching, rather than individual matching, of cases to controls is utilized and (2) the disease rate in the underlying population is known. We make the simplifying assumption that cases in the underlying population were chosen randomly with known probability p1 and that controls were chosen randomly with known probability p0. In this case, an approximation to the full likelihood of the case–control data set (22) is

Π_i Pr(Y_i, G_i | subject i is sampled) = Π_i [p_{Y_i} Pr(Y_i|G_i) Pr(G_i)] / [p1 Pr(Y = 1) + p0 Pr(Y = 0)].   (10)

By summation over the haplotypes this can be written as

Π_i [p_{Y_i} Σ_{H ⊃ G_i} Pr(Y_i|H) p_h1 p_h2] / [Σ_{all H} {p1 Pr(Y = 1|H) p_h1 p_h2 + p0 Pr(Y = 0|H) p_h1 p_h2}]^N.   (11)
To estimate simultaneously all parameters in the likelihood (Eq. 11), we first estimate initial values for all the p_h using the standard EM algorithm (if there are large numbers of SNPs we use an implementation of the partition-ligation EM algorithm of (7)). This is equivalent to maximizing Eq. 11 with an initial value of b = 0. Then, we drop from further consideration all haplotype frequency parameters estimated in this first stage whose estimated frequency is smaller than a fixed positive constant (we used ε = 0.001 for the calculations in the example below). We then construct the full score vector and full information matrix for all remaining parameters. This is done by using the Louis formulas for the likelihood in the numerator of Eq. 11, and then subtracting the first and (minus the) second derivatives of the log of the denominator from the appropriate elements of the resulting score and information. These calculations allow for a full Newton–Raphson update of all parameters simultaneously, which can be used iteratively to compute the final ascertainment-corrected estimates. Inverting the matrix of second derivatives can be problematic when there are numerous low-frequency haplotypes being considered. This problem is mitigated by merging the lowest frequency haplotypes into a single “rare haplotype” variable. A number of other authors (13, 14, 23) have discussed maximum likelihood or related approaches to haplotype analysis for case–control studies, including Venkatraman et al. (24), discussed below.
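The initial frequency-estimation step (the standard EM algorithm under HWE) can be sketched in a few lines of Python; this is an illustration with invented data and names, not the partition-ligation implementation used for large numbers of SNPs:

```python
def em_frequencies(subject_pairs, haplotypes, n_iter=100):
    """EM estimates of haplotype frequencies under HWE.
    subject_pairs[i] lists the haplotype pairs compatible with genotype G_i."""
    p = {h: 1.0 / len(haplotypes) for h in haplotypes}   # uniform start
    n = len(subject_pairs)
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haplotypes}
        for pairs in subject_pairs:
            den = sum(p[a] * p[b] for a, b in pairs)
            for a, b in pairs:               # E-step: posterior phase weights
                w = p[a] * p[b] / den
                counts[a] += w
                counts[b] += w
        p = {h: counts[h] / (2.0 * n) for h in haplotypes}   # M-step
    return p

# Two phase-certain subjects: three AB haplotypes and one ab haplotype.
freqs = em_frequencies([[("AB", "AB")], [("AB", "ab")]], ["AB", "ab"])
```

With phase-ambiguous subjects the weights w split each subject's two haplotypes across the compatible phases, and the iteration converges to a stationary point of the likelihood; with phase-certain data, as here, it reproduces the observed haplotype proportions.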
1.6. An Approximation to Case–Control Maximum Likelihood Estimation
Venkatraman et al. (24) proposed a method for obtaining estimates of the relative risks, making use of the EM algorithm and the assumption of HWE only for the controls. In Subheading 2 we describe the methodology and illustrate it using the function haplotypeOddsRatio in the R package genepi. For a specific disease haplotype h, each subject has 0, 1, or 2 copies, with odds ratios y1 and y2 for 1 and 2 copies, respectively. In this procedure, the haplotype frequencies for the controls, p_h, are estimated assuming HWE. The resulting estimates are then used in the following equations for the relationship between the relative frequencies of haplotype pairs in cases and controls: q_ij = p_ij/T if i, j ≠ h; q_ih = y1 p_ih/T if i ≠ h; and q_hh = y2 p_hh/T, where p_ij and q_ij are the population frequencies for the haplotype pair ij in the controls and cases, respectively (T is a normalizing constant equal to the sum of all the q_ij). We solve for the q_ij and the y's using an iterative procedure alternating between the two. Since the estimates of y1 and y2 are obtained by imputing the number of copies of the disease haplotype, their variance estimates should account for this imputation. The model can be expanded by writing the odds ratio as a log-linear function of the number of copies of the disease haplotype and a vector of covariates. That is, the logit of the probability of observing a case is given as m + b1 I(d_h = 1) + b2 I(d_h = 2) + aX, where as above the b's are the log odds ratios for a subject having 1 or 2 copies of the disease haplotype, the I's are the indicator functions denoting whether the subject has 1 or 2 copies of h, X is the covariate vector and a its log odds ratio. Observe that the model follows a logistic regression framework except that I(d_h = 1) is uncertain for subjects with haplotype ambiguity. The contribution to the likelihood of such subjects corresponds to the contributions for 0 and 1 copies of the disease haplotype, weighted by the probability of observing 0 and 1 copy, respectively.
We account for the ambiguity by entering these subjects twice, once each with H1 = 0 and H1 = 1, with weights equal to the probability of observing 0 and 1 copies of the disease haplotype. The estimation is done using an iterative procedure as before, where the parameter estimates are obtained and then used to estimate haplotype frequencies and vice versa. We can also account for population stratification by estimating the haplotype frequencies within strata while estimating the odds ratio from the whole data set (see below). Finally, the extra variation in the parameter estimates attributable to haplotype ambiguity is accounted for by simulating the copy number with the fitted probabilities of 0/1 copy, estimating the coefficients, and adding their variance to the estimated variance. This methodology is implemented, using the R programming language, in the function haplotypeOddsRatio of the package genepi. We describe its use in Subheading 2 below.
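The duplication device just described can be sketched as follows (a hypothetical helper with invented data, not the genepi implementation): each subject whose carriage of the disease haplotype is ambiguous contributes two weighted pseudo-observations to a weighted logistic regression.

```python
def expand_ambiguous(subjects):
    """subjects: list of (case_status, prob_one_copy), where prob_one_copy is
    the posterior probability of carrying exactly one copy of the disease
    haplotype (versus zero copies).
    Returns weighted rows (case_status, copies, weight) for a weighted fit."""
    rows = []
    for y, p1 in subjects:
        if p1 in (0.0, 1.0):            # phase-certain subject: a single row
            rows.append((y, int(p1), 1.0))
        else:                           # ambiguous subject: entered twice
            rows.append((y, 0, 1.0 - p1))
            rows.append((y, 1, p1))
    return rows

rows = expand_ambiguous([(1, 0.8), (0, 0.0), (1, 1.0)])
```

The weights of each subject's rows sum to one, so the effective sample size is unchanged; the extra variance induced by the imputation is then handled separately, as described above.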
1.7. Large-Scale Haplotype Risk Estimation
We now consider the problem of performing large-scale (e.g., genome-wide) haplotype risk estimation where the haplotypes to be considered are made up of contiguous SNPs either in haplotype blocks, which may be defined in a number of different ways, or are grouped together using windowing methods. Here the concentration is upon haplotypes made up of the measured SNPs and not directly upon haplotypes seen in the larger number of SNPs that may have been genotyped in a reference panel such as the HapMap.
1.8. Studies of Homogeneous Non-admixed Populations
As described in Chapter 21, correction for population structure, admixture, and possibly for hidden relatedness between subjects is an important issue in single SNP analysis, and these considerations carry over into haplotype analysis. Here, however, we consider only homogeneous populations; see Subheading 2 for a discussion of some issues involved with haplotype analysis performed in a multiethnic setting, i.e., for studies that involve multiple racial/ethnic groups. Within homogeneous populations HWE can be assumed for most markers and marker haplotypes, so that Eq. 1 applies for the population (except possibly for haplotypes under strong selective pressures). Note that in case–control studies it can generally be assumed that HWE holds for the controls (especially for a rare disease) but not for the cases, because selection of high-risk haplotypes could have distorted the underlying haplotype frequencies. However, as mentioned previously, ignoring this distortion in allele frequencies still leads to valid tests of the null hypothesis, for the simple reason that, under the null, HWE will hold for the combination of cases and controls, so that using the expectation-substitution method gives an appropriate score test (5, 6). If genome-wide SNP data are available, then either a block-based or a sliding window-based method can be considered to look for contiguous SNPs whose haplotypes are predictive of (unknown) variation related to phenotypes or disease. The block-based approach of Gabriel et al. (2), which is the most common method used to visualize LD blocks, has an important drawback when it comes to defining groups of SNPs to be used in haplotype association; specifically, many SNPs are declared by the algorithm not to be in blocks at all.
However, minor adjustments in the details of the block definition (specifically the values of D′ and the confidence intervals computed for D′ used in the algorithm implemented in Haploview) can cause SNPs outside blocks to be included in blocks, and so on. For example, in Fig. 2, over one-third of the common SNPs in a small region of chromosome 8q24 (for the CEU population) are found, with the default Gabriel et al. block definitions in Haploview, to be outside of blocks. Relatively minor adjustments will determine whether these SNPs are included in blocks or not, and overall this region appears to be in relatively high LD.
Fig. 2. Haploview Plot of LD pattern for 26 SNPs in region 8q24. Block definitions are based on the Gabriel et al. default criteria shown in the Customize Blocks dialog box. Minor modifications in the default parameters determine whether or not SNPs labeled as 13–19 or 24–26 are included in the block now encompassing only SNPs 20–22. Data are based on HapMap Phase 2 CEU participants.
An alternative to a block-based approach is windowing, in which a window of a certain size (the number of contiguous markers to be included in the window) is determined and then either overlapping or non-overlapping windows are placed over the region of interest. Within each window, haplotype frequencies are estimated and the expectation-substitution method used to investigate haplotype-specific risk. An inflexible window size (especially if it is quite short) will produce many correlated tests when it is run over regions of very high linkage disequilibrium, but if the window size is too large then haplotypes in regions of low LD will each be of very low frequency. A more flexible approach is to allow the window size to be larger in regions of high linkage disequilibrium and smaller in regions of low LD. A compromise approach is to define haplotype blocks according to some reasonable criteria but to include, using a windowing method with a short window size, SNPs outside of blocks. Overlapping, as well as non-overlapping, windows have been considered as an approach for discovery of haplotype effects. While offering a more comprehensive examination of each region, the downside to using sliding or otherwise overlapping windows in haplotype analysis is the increase in computation time and the increased difficulty in obtaining type I error bounds for global significance. The Bonferroni method will certainly perform poorly in such a setting, since overlapping windows will have very similar haplotypes. Adjustments to the Bonferroni method, such as ones based on the Ornstein–Uhlenbeck approximations (see Chapter 5 of Siegmund and Yakir (25)), as well as permutation methods and approximations (26–29), are possible.

1.9. The Four Gamete Rule for Rapid Block Definition
A very simple approach to estimating haplotype blocks is based on the four gamete rule. This rule is one of the methods implemented in Haploview for block definition, and it is also extremely simple to compute in R or in other appropriate languages. As described above, the presence of all four possible haplotypes (a-b, A-b, a-B, and A-B) of two SNPs (with alleles a and A, and b and B, respectively) is evidence (under an infinite sites approximation) for the occurrence of recombination between these two loci. A very quick way of checking, using the genotype data alone without having to impute haplotypes, whether there is evidence of recombination is to form the 3 by 3 table of joint genotype counts for the pair of SNPs as given in Table 1. Note that if any of n00, n01, and n10 are nonzero, then this implies that haplotype a-b must be present. Similarly, if any of n10, n20, or n21 are nonzero then haplotype A-b must be present. Extending this to the other “corners” of the table leads to the rule that if all of the quantities (n00 + n10 + n01), (n10 + n20 + n21), (n21 + n22 + n12), and (n01 + n12 + n02) are greater than zero then all four haplotypes must be present and therefore a recombination can be assumed to have occurred. This “four
Table 1
Genotype data for two SNPs

                       Genotypes for SNP 1
Genotypes for SNP 2    aa      aA      AA
bb                     n00     n10     n20
bB                     n01     n11     n21
BB                     n02     n12     n22

The body of the table contains counts of the number of individuals with the listed genotype combination
corners” rule can be checked rapidly while blocks are being defined. Note that the frequency of haplotype a-b is at least equal to (2n00 + n10 + n01)/(2n), with the only uncertain contribution being the number of copies of haplotype a-b that contributed to n11, since individuals in this cell have uncertain haplotypes (either (a-b, A-B) or (a-B, A-b)). A reasonable relaxation of the four gamete rule allows the minimum of (2n00 + n10 + n01)/(2n), and of the three similar quantities for the other haplotypes, to be nonzero but “small” while blocks are extended from an initial starting point. Checking all pairs of SNPs in this fashion as blocks are extended is thus a simple, easy-to-implement, and quite fast block formation rule.

1.10. Approximations to the Global Test for Haplotype Effects
The global test based on fitting the model of Eq. 8 is useful because it can help (in the course of a windowing or haplotype block-based analysis) to identify regions of interest for which more detailed examination may subsequently be undertaken. Considering regions in which there is no recombination between markers (i.e., an idealized haplotype block), three useful facts are worth noting. First of all, if the haplotype frequencies are known (and consistent with no recombination), then there is no haplotype uncertainty; all the dosage variables for all the haplotypes will take integer values 0, 1, or 2 equal to the true haplotype counts. Second, the likelihood ratio test constructed for a global test of no haplotype effects will be precisely the same test as the likelihood ratio test of no SNP effects, i.e., the likelihood obtained fitting model Eq. 8 will be the same as the likelihood obtained by fitting the model

g(E(Y_i)) = m + b_1 G_i1 + b_2 G_i2 + ... + b_s G_is,
(12)
where s is the number of SNPs in the region and the Gij are the genotype counts for SNP j for subject i. This model may also be overparameterized, not because of the inclusion of m but because some of the markers may be perfectly correlated with each other. In the absence of recombination the model degrees
of freedom when fitting Eq. 12 will be equal to r − 1, just as when fitting model Eq. 8. Bearing this in mind, a reasonably fast haplotype block-based search for haplotype effects may be conducted genome-wide as follows:
1. Use the four gamete rule to define haplotype blocks as in the previous algorithm.
2. Compute likelihood ratio or score tests (which are computationally faster) for model Eq. 12 using standard regression software.
3. For haplotype blocks for which the global test for model Eq. 12 is significant (see below), examine haplotype effects more carefully by first imputing dosage variables for all haplotypes and then fitting models such as Eq. 8 or 2.

1.11. Multiple Comparisons in Haplotype Analysis
If only SNPs showing no (or little) evidence of historical recombination are to be considered in genome-wide haplotype analysis, then (as follows from the above discussion) the effective number of tests in a haplotype analysis is very similar to the effective number of tests performed using SNP markers alone (i.e., after taking account of LD). Thus, similar criteria for global significance (p < 10^-8 or so) are likely to be useful for block-based haplotype analysis.
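As a concrete illustration of the block-definition step used in this kind of search (the four gamete rule of Subheading 1.9 applied to the 3 by 3 genotype table of Table 1), the following Python sketch implements the “four corners” check; the counts in the example are invented.

```python
def four_gamete_recombination(n):
    """Four corners check on a 3 x 3 genotype table for two SNPs.
    n[(i, j)]: count of subjects with i copies of allele A at SNP 1 and
    j copies of allele B at SNP 2 (so n[(0, 0)] is the aa/bb cell, n00).
    Returns True if all four haplotypes must be present, i.e., if a
    historical recombination between the two SNPs can be inferred."""
    corners = [
        n[(0, 0)] + n[(1, 0)] + n[(0, 1)],   # forces haplotype a-b
        n[(1, 0)] + n[(2, 0)] + n[(2, 1)],   # forces haplotype A-b
        n[(2, 1)] + n[(2, 2)] + n[(1, 2)],   # forces haplotype A-B
        n[(0, 1)] + n[(1, 2)] + n[(0, 2)],   # forces haplotype a-B
    ]
    return all(c > 0 for c in corners)

# Only haplotypes a-b and A-B segregate: genotypes fall on the diagonal.
no_recomb = {(i, j): 0 for i in range(3) for j in range(3)}
no_recomb[(0, 0)] = 5
no_recomb[(1, 1)] = 5
no_recomb[(2, 2)] = 5
```

For the diagonal table the check is negative; adding, say, an aA/bb subject and an aa/bB subject forces haplotypes A-b and a-B as well, and the check then signals a recombination.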
1.12. Extensions
In addition to serving as surrogates for unmeasured variants, haplotypes may themselves be the causal variants. Two or more potentially causal variants may only have an effect, or may have their effects enhanced, if they fall upon the same chromosome. For example, Nackley et al. (30) reported that three common haplotypes involving two synonymous and one non-synonymous SNP in the COMT gene code for differences in COMT enzymatic activity and are associated with pain sensitivity. Evidence that the haplotypes rather than the individual SNPs were significant was found in functional analysis of RNA loop structures and enzymatic activity. It is possible that causal haplotypes could include SNPs that are not in very high linkage disequilibrium with each other, and so restricting interest to haplotypes made up only of neighboring SNPs (as in haplotype block or windowing methods) may be too strict. A search over wide regions of the genome for such long-range haplotype effects would by necessity have to involve only 2 or possibly 3 SNPs, since using any more vastly increases the number of comparisons to be tested and would increase the uncertainty in haplotype estimation. Nevertheless, it would be interesting to consider, in a scan, estimation of the phenotype or risk effects of all two or three SNP haplotypes of all SNPs that lie within a centimorgan
or two of each other. While this is a large number of haplotypes, it is much smaller than the number of pairwise or three-way interactions between all SNPs in the human genome, which has sometimes been considered (31–33). Moreover, haplotypes containing SNPs within 1 or 2 centimorgans in genetic distance (up to several megabases in physical distance) away from each other could potentially contribute to the additive heritability of a trait, since the haplotype will be inherited mostly as an intact unit, even though over past history recombination has occurred. The same is not true of interactions between very distant SNPs—i.e., these interaction effects would not make a significant contribution to the additive heritability of a trait because they would rarely be passed on to descendants. As additional bioinformatics tools are developed for understanding gene and pathway interplay it may also be possible to develop a priori hypotheses to serve as a guide to interpreting and prioritizing results from really large-scale studies of (noncontiguous) haplotypes and gene by gene interactions.
2. Methods

2.1. An Example of Using the Expectation-Substitution Method
This section uses data originally published in Stram et al. (12). The data are from a nested case–control study of a candidate gene (CYP17) in breast cancer. The expectation substitution uses a partition-ligation version of the EM algorithm (originally written in Fortran 90) called from an R function named “expected_haplotypes.” This R function, other necessary programs, and the data used in this example are available at the URL http://www-hsc.usc.edu/~stram/tagsnpsv2.zip. The function expected_haplotypes takes genotype count data from unrelated individuals and returns an object that contains (1) a list of haplotypes with nonzero frequency, (2) the frequency of each such haplotype, and (3) the estimated haplotype count data for each haplotype for each individual. To see how this works we first reproduce the haplotype frequency estimates for the controls in the study as given in Stram et al. The dataset Cyp17_breast_WH_cases_and_controls.dat can be read into R as follows:
> x <- read.table("Cyp17_breast_WH_cases_and_controls.dat")
> cc <- x[,1] # the first column is case control status
> nsnps <- 6
> SNPTable <- as.matrix(x[,2:(nsnps+1)])
> # select the controls only for analysis
> controlh <- expected_haplotypes(SNPTable[cc==0,])
23
Multi-SNP Haplotype Analysis Methods for Association Analysis
441
The return value (here controlh) is a list with elements controlh$haplist (the list of haplotypes sorted by allele frequency), controlh$hfreqs (which gives the haplotype frequencies) and controlh$haps, which gives the predicted haplotype counts for each subject. We can view the haplotype frequencies by entering
> controlh$hfreqs
These are the same haplotype frequencies as shown in Table 1 of Stram et al. (12). Recomputing the haplotype frequencies using both the cases and the controls, so that the predicted haplotypes can be used in a model for case/control status, is accomplished by

> h_all <- expected_haplotypes(SNPTable)

Upon examination of h_all$hfreqs, we see that there are now four haplotypes with estimated frequency greater than 5%, and these constitute 88% of all segregating haplotypes. Using each of these haplotypes in turn to fit regression models is accomplished by

> summary(glm(cc~h_all$haps[,1]))
> summary(glm(cc~h_all$haps[,2]))
> summary(glm(cc~h_all$haps[,3]))
> summary(glm(cc~h_all$haps[,4]))

The results show modestly significant results for haplotype h000000 (log OR = 0.064, std err = 0.028, p = 0.0231), comparing carriers of this (most common) haplotype to all others. The global test for the significance of any of the first four haplotypes can be accomplished by first fitting the null model and calculating its deviance (978.24), and then fitting a model in which haplotypes 2–4 are compared to the most common haplotype, with an additional variable calculated as the sum of all rare haplotypes. This is accomplished by

> rare <- rowSums(h_all$haps[,5:20])
> summary(glm(cc~h_all$haps[,2:4]+rare, family=binomial()))

The deviance from the model is equal to 23.86 on 4 df (p = 0.097), so that again the evidence for any haplotype effect is
442
D.O. Stram and V.E. Seshan
Fig. 3. Comparisons of three profile likelihoods based on the approaches described in the text. The naïve likelihood refers to the profile likelihood from the expectation-substitution method.
only modest. There is some indication of risk associated with carrying the third most common haplotype (h000010) compared to the most common (h000000) (p = 0.0205), but enthusiasm for this finding is tempered by the fact that the overall test is not significant. A detailed analysis of profile likelihoods from three different methods of fitting the first model (comparing haplotype h000000 to all the other haplotypes) to these data is given in Stram et al. (12). The three methods compared were

1. Expectation substitution.
2. Full likelihood analysis without any case/control sampling ascertainment correction.
3. Full likelihood analysis including the ascertainment correction of Eq. 11, with selection probabilities p0 = 0.002 for controls and p1 = 1 for cases.

Figure 3 is from that paper. The profile likelihoods at each point on the curves are obtained by holding the log odds ratio parameter, b0, corresponding to h000000, fixed at the value given on the x-axis while maximizing over the other parameters in the model, and then calculating the log likelihood of that maximized model, which is then displayed on the y-axis. For the expectation-substitution method, the only other parameter is the intercept parameter m, which incorporates "all other" haplotypes. For the two full likelihood methods, the estimates of the haplotype frequencies ph are also maximized simultaneously as the value of b0 is fixed at each point shown on the x-axis. Note that, for this example at least, there are no important differences among the three likelihoods or among inferences using
them. The far simpler expectation-substitution approach appears to have provided an adequate analysis, including appropriate standard errors, etc., in this instance.

2.2. Using Multiple Populations in Haplotype Analysis
Haplotype analysis in multiple populations raises some interesting issues. Since LD structure changes, sometimes dramatically, between long-separated populations, the indirect use of haplotype analysis to search for unknown variants that may be in higher LD with SNP haplotypes than with individual SNPs may be attenuated in power by the inclusion of multiple groups. Even if the same unobserved variant is present and biologically related to phenotype or disease susceptibility in all groups, it may not be associated with the same haplotype or group of haplotypes in each group. This is very similar to the SNP imputation problem, where different tagging SNPs may be needed to serve as surrogates for ungenotyped SNPs. On the other hand, as emphasized below, haplotypes of known variants may themselves be disease- or phenotype-related biologically. Therefore, it is worth discussing the issue of multiple ethnic groups in haplotype analysis. A standard approach, taken in papers such as Haiman et al. (20), which analyzed data from the CYP19 gene (like CYP17, CYP19 is a candidate gene for breast cancer), is to estimate haplotype frequencies separately for each racial/ethnic group included in a given analysis. This is important since HWE, which underlies the EM algorithm, is violated when data for multiple populations are combined. To give an appreciation of the importance of this, we compare two different methods for predicting haplotype frequencies for African American participants in the substudy used for haplotype discovery in Haiman et al. First, we use the 70 African American subjects with dense genotyping for this gene to estimate the haplotype frequencies for the African Americans and produce the haplotype frequency estimates for these same subjects for 22 SNPs in "block 1" of this gene (see Note 1).
R code to do this using the data for CYP19 is listed below:

> AA_SNPTable <- read.table("AA.dat")
> ha <- expected_haplotypes(AA_SNPTable, code = 2)
> # code = 2 means 2 numbers per SNP

We next perform haplotype imputation using a total of five ethnic groups (whites, African Americans, Latinos, Japanese Americans, and Native Hawaiians) in one run (rather than separately) to compute haplotype frequencies. We then compare the haplotype imputations for the African American data from this run with the imputations from the "African American only" data derived above:

> All <- read.table("all.dat") # read data for 5 ethnic groups
> race <- as.character(All[,1])
Fig. 4. Plot of haplotype dosage estimates for the first most common African American haplotype when using (x-axis) only the African American data to estimate haplotype frequencies, compared to using data for a total of five different ethnic groups (y-axis). The correlation between the haplotype dosage estimates is quite high (0.91), but individual instances of major differences are evident.
> nsnps <- 22
> All_SNPTable <- as.matrix(All[,2:(nsnps*2 + 1)]) # 22 SNPs read with 2 columns per SNP
> hall <- expected_haplotypes(All_SNPTable, code = 2)
> AAlist <- (race == "A") # select the African Americans
> # The most common AA haplotype is the 2nd most common in the overall data, so
> plot(ha$haps[,1], hall$haps[AAlist,2]) # plots the two dosage estimates for the same haplotype

Figure 4 shows the results of the plot. While the two dosage estimates are similar for most individuals, in several instances the estimates are wildly different, emphasizing that haplotype imputation is sensitive to population stratification. Generally speaking, haplotype frequency estimation should be performed using reference panels as similar as possible to those being considered in the main study for analysis of haplotype-specific risk or phenotype associations (see Note 2).
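Why pooling distorts things: the EM algorithm assumes Hardy-Weinberg proportions, but a mixture of subpopulations that are each in HWE shows a heterozygote deficit relative to HWE at the pooled allele frequency (the Wahlund effect). A small numeric sketch in Python, with hypothetical allele frequencies:

```python
def pooled_genotype_freqs(freqs, weights):
    """Genotype frequencies in a mixture of HWE subpopulations.

    freqs: allele frequency of the variant in each subpopulation;
    weights: mixing proportions (must sum to 1)."""
    aa = sum(w * (1 - p) ** 2 for p, w in zip(freqs, weights))
    ab = sum(w * 2 * p * (1 - p) for p, w in zip(freqs, weights))
    bb = sum(w * p ** 2 for p, w in zip(freqs, weights))
    return aa, ab, bb

# Two subpopulations, each in HWE, with very different allele frequencies
aa, ab, bb = pooled_genotype_freqs([0.1, 0.6], [0.5, 0.5])
p_pooled = ab / 2 + bb                        # pooled allele frequency = 0.35
# Under HWE the heterozygote frequency would be 2 * 0.35 * 0.65 = 0.455,
# but the mixture yields only 0.33:
deficit = 2 * p_pooled * (1 - p_pooled) - ab  # 0.125
```

Because the EM algorithm expects the HWE heterozygote frequency, this deficit biases phase reconstruction in the pooled sample, which is why frequencies are estimated separately by group.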
2.3. Example Using Methods of Venkatraman et al. to Approximate a Maximum Likelihood Analysis
We now demonstrate the steps involved in such an analysis of data from the GEM (Genes, Environment and Melanoma) study, an international multicenter case–control study of melanoma, using the methods of (24) described above. The results of the analysis we illustrate have been published in Table V of Millikan et al. (34). In the GEM study the disease risk associated with several nucleotide excision repair (NER) genes was assessed; we focus on the XPD gene, with polymorphisms in codons 312 and 751. We want to evaluate the disease risk associated with the haplotype formed by having the disease variant in both codons. Since melanoma risk depends on age, sex, and geographic location (nine centers), we use these factors for our adjusted analysis. Table 2 shows five randomly chosen subjects from the data that are the input for the estimation procedure.

Preparing the genotype data: The possible genotypes for xpd312 are Asp/Asp, Asp/Asn, and Asn/Asn, with Asn the variant that confers higher disease risk. The genotype column for xpd312 gives the number of copies of Asn present in the subject. Similarly, the possible genotypes for xpd751 are Lys/Lys, Lys/Gln, and Gln/Gln, with Gln being the disease variant; the xpd751 column gives the number of copies of Gln in a subject. The genetic variants in this example are given in the form of the amino acids for which they code: changing the first base of the codon from G to A changes Asp to Asn, and a change from A to C in the first base of the codon changes Lys to Gln (35). Often the original data may not come coded as 0, 1, and 2, but rather as the base (DNA) or amino acid pair. It is also possible that the two alleles come in two separate columns, i.e., allele 1 is Asp and allele 2 is Asn. In both these cases, the data should be recoded to obtain the 0, 1, and 2 representation of the number of copies of the disease variant.
Finally, if the genotype comes coded numerically, the user needs to ensure that the numbers do indeed correspond to the number of copies of disease variant or translate them suitably, if needed. Once the data matrix is prepared,
Table 2
Data for five participants in the GEM study

Centre  Status  Age  Sex  xpd312  xpd751
2       0       41   1    0       0
5       1       83   1    1       1
5       0       52   1    1       0
6       0       59   2    1       1
9       1       52   1    2       2
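The recoding just described, from amino-acid pairs to counts of the disease variant, is mechanical in any language. A hypothetical Python helper (the function name and coding convention are ours) might look like:

```python
def count_risk_alleles(genotype, risk_allele):
    """Recode an amino-acid genotype like 'Asp/Asn' as the number of
    copies of the risk allele (0, 1, or 2); None if genotyping failed."""
    if genotype is None:
        return None  # corresponds to NA in the R data frame
    return sum(1 for allele in genotype.split("/") if allele == risk_allele)
```

For xpd312 the risk allele is "Asn", so "Asp/Asn" recodes to 1 and "Asn/Asn" to 2; for xpd751 the risk allele is "Gln".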
the estimation and inference can be carried out. Missing values, including failed genotyping, should be coded as NA.

Calling the estimation procedure: Once the genotype data are coded, the estimation procedure can be called for various models. It is available as a function in the R package "genepi," which should be loaded prior to issuing the following call:

haplotypeOddsRatio(formula, gtypevar, data, stratvar = NULL, nsim = 100, tol = 1e-8)

There are six arguments in total, three of which are required and three optional, with default values specified. The three required arguments are "formula," which specifies the model being fit; "gtypevar," which gives the names of the genotype variables being considered; and "data," the object (data frame) that contains the data. The optional arguments are "stratvar," the population stratification variable; "nsim," the number of replications used to estimate the extra variation due to imputing the ambiguous haplotypes; and "tol," the tolerance level used for the convergence of the EM algorithm.

data: The data object being analyzed. This should be a data frame in which each column is a variable and each row is a subject. All the variables in the model, as well as the genotype and stratification variables, should be present in this data frame.

formula: The model being fitted is written in the form of an R formula. Since the procedure's goal is to estimate the haplotype disease risk, the variable corresponding to the number of copies of the risk-conferring haplotype is not included explicitly in the formula; only the other variables that we need to adjust for are used. For our example, the case–control variable is status. Some example formulas are

Unadjusted odds ratio:                       status ~ 1
Adjust for age:                              status ~ age
Adjust for age and sex:                      status ~ age + sex
Adjust for age, sex, and their interaction:  status ~ age*sex
gtypevar: The names of the variables for the loci from which we are constructing the haplotype. In our example it will be c("xpd312","xpd751"). The data frame being used could contain information from other genes; the example data have genotype information for other genes, such as ERCC6 and XPF. The data object needs to be prepared only once, and specifying "gtypevar" gives the estimates for the gene of interest.

stratvar: The variable to be used for population stratification. This need not be specified, in which case the data will be treated as
coming from a single stratum. The GEM study is multinational, with nine centers spread over the USA, Canada, Australia, and Italy. Since there could be differences in the population frequencies of the XPD genotypes, and hence haplotypes, we use center as the stratification variable.

nsim: The number of replicates simulated. The "true" haplotypes of the subjects with ambiguous haplotypes are simulated and then the model is fitted. The variation in the parameter estimates over a number of simulations is the extra variation attributable to imputing the haplotypes. The default of 100 provides reasonable results.

tol: The tolerance limit for the iterations. The haplotype frequencies and the odds ratios are computed using an iterative (EM) procedure. The tolerance level determines the stopping criterion for the iteration. The default value provides stable results.

Interpreting the results: The call to this procedure prints a table of estimates and the Wald tests for them, as well as the degrees of freedom and deviances from the model. In the following we show the results for various models; the command used is the first line of the printout (beginning with >) and the rest is the output from the model. First, we start with the unadjusted model.

> haplotypeOddsRatio(status ~ 1, gtypevar = c("xpd312","xpd751"), data = xpd)
The unadjusted model for the haplotypes suggests that there is increased risk (OR = 1.13, from the log odds ratio) for subjects with one copy of the disease haplotype (not statistically significant) and a significantly increased disease risk for subjects with two copies (OR = 1.53). The standard errors for the parameters are obtained from the estimated covariance matrix, which is returned, but not displayed, by a call to this function. As stated earlier, the incidence of melanoma is a function of age, and the rates are different for men and women, both overall and as functions of age. Additionally, the centers contain different mixtures of ethnicities representing genomically distinct populations (see Note 3); therefore, we use center as a stratification variable. The results of these model fits are
age only
> haplotypeOddsRatio(status ~ age, c("xpd312","xpd751"), xpd, stratvar = "centre")
There is a gradual but pronounced drop in the residual deviances of the models as the variables are added. The adjusted haplotype odds ratios are consistent across the models: (1.19, 1.18, 1.18) for one copy and (1.59, 1.58, 1.57) for two copies.
Finally, since the incidence rates of melanoma differ across the centers (Australia has a much higher rate than Italy and parts of the USA), we use center as a covariate in addition to using it as a stratification variable.

2.3.1. Age, Sex, Their Interaction, and Center
First, we set up center as a factor variable with center 5 (the center with the largest number of subjects) as the reference group. Then we fit the model

> xpd$centre0 <- relevel(factor(xpd$centre), ref = "5")
> haplotypeOddsRatio(status ~ age*sex + centre0, gtypevar = c("xpd312","xpd751"), stratvar = "centre", data = xpd)
Although, as expected, some of the centers have significantly lower cancer rates than the reference group (centre 5), the reduction in the deviance from this model seems minimal compared to the model without center as a covariate. Thus, we use age, sex, and their interaction as the final model. The adjusted odds ratio and confidence interval are given by exp(est) and exp(est ± 1.96*s.e.), which for one copy are 1.18 and (1.006, 1.385) and for two copies are 1.57 and (1.193, 2.066), reproducing the results in Table V of Millikan et al. (34). The above procedure presented estimates of the disease risk for subjects with one and two copies of a disease haplotype. Our model treats all the risk as coming from that specific haplotype, which may not be the case. The model can be expanded to include more than one disease haplotype. If r of the 2^k possible haplotypes increase disease risk, we can model the risk associated with carrying any single one of them and any pair of them. Note that in the resulting model
the number of parameters to be estimated increases as the square of r and thus inferences can become increasingly unreliable with increasing r.
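The back-transformation used in this section, exp(est) for the odds ratio and exp(est ± 1.96 × s.e.) for the 95% Wald interval, is generic. A small Python helper (ours, not part of genepi); the standard error of 0.14 below is inferred from the reported two-copy interval rather than taken from the printout:

```python
import math

def or_with_ci(est, se, z=1.96):
    """Odds ratio and Wald confidence interval from a log-OR and its s.e."""
    return math.exp(est), (math.exp(est - z * se), math.exp(est + z * se))

# With est = log(1.57) and s.e. ~ 0.14, this reproduces the two-copy
# interval quoted above to about two decimal places:
or2, (lo2, hi2) = or_with_ci(math.log(1.57), 0.14)
```

The same helper applied to the one-copy estimate with its standard error yields the (1.006, 1.385) interval quoted above.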
3. Notes

1. Today's African Americans constitute an incompletely admixed population, with African, European, and generally smaller amounts of American Indian or Hispanic ancestry, and with admixture fractions that vary significantly from individual to individual. This means that certain SNPs and haplotypes may fail to be in HWE, even in a study sample made up entirely of self-reported African American participants. This violation of HWE (generally only detectable with quite large sample sizes) is relatively limited in its effect on haplotype frequency estimation and individual haplotype prediction.

2. This remark pertains to the use of tagging SNP data to predict haplotypes composed of SNPs beyond the tags that are actually genotyped in the main study. In such a case, a reference panel is needed in which all the SNPs have been genotyped, and the haplotype frequency estimation is performed using only those subjects. Once a list of potential haplotypes has been formed, these haplotypes are imputed using the tagging data. The program tagSNPs, which does the EM estimation (called by the R function expected_haplotypes) in the examples, is not optimized for predicting haplotypes that include unmeasured SNPs, mainly because it does not allow for the possibility that specific rare genotype combinations not seen in the reference panel will appear in the (usually much larger) main study data. When these are encountered, the haplotype counts are set to missing values in the main study by tagSNPs. Note that this problem does not arise when haplotypes made up of only the measured SNPs are estimated. If haplotypes consisting of more SNPs than those measured in the main study are to be imputed, then other programs that allow for additional recombinant haplotypes (recombination between haplotypes seen in the reference panel), such as MaCH or IMPUTE, can be considered (36).

3.
The haplotypeOddsRatio function imputes the control haplotypes assuming that the control population is in HWE. However, as in Note 1, population admixture can violate this assumption and can lead to biased results. Additionally, as with any generalized linear model fitted, sparseness of some categories can lead to convergence issues.
References 1. Daly, M. J., Rioux, J., Schaffner, S., Hudson, T., and Lander, E. (2001) High-resolution haplotype structure in the human genome, Nature Genetics 29:229–232. 2. Gabriel, S. B., Schaffner, S. F., Nguyen, H., Moore, J. M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M., Liu-Cordero, S. N., Rotimi, C., Adeyemo, A., Cooper, R., Ward, R., Lander, E. S., Daly, M. J., and Altshuler, D. (2002) The structure of haplotype blocks in the human genome, Science 296:2225–2229. 3. Barrett, J. C., Fry, B., Maller, J., and Daly, M. J. (2005) Haploview: analysis and visualization of LD and haplotype maps, Bioinformatics 21:263–265. 4. Excoffier, L., and Slatkin, M. (1995) Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population, Mol Biol Evol 12:921–927. 5. Zaykin, D. V., Westfall, P. H., Young, S. S., Karnoub, M. A., Wagner, M. J., and Ehm, M. G. (2002) Testing association of statistically inferred haplotypes with discrete and continuous traits in samples of unrelated individuals, Hum Hered 53:79–91. 6. Xie, R., and Stram, D. O. (2005) Asymptotic equivalence between two score tests for haplotype-specific risk in general linear models, Genet Epidemiol 29:166–170. 7. Qin, Z. S., Niu, T., and Liu, J. S. (2002) Partition-ligation-expectation-maximization algorithm for haplotype inference with singlenucleotide polymorphisms, Am J Hum Genet 71:1242–1247. 8. Stram, D. O., Haiman, C. A., Hirschhorn, J. N., Altshuler, D., Kolonel, L. N., Henderson, B. E., and Pike, M. C. (2003) Choosing haplotype-tagging SNPs based on unphased genotype data from a preliminary sample of unrelated subjects with an example from the Multiethnic Cohort Study, Hum Hered 55 (1):27–36. 9. Carroll, R. J., Ruppert, D., Stefanski, L. A., and Crainiceanu, C. (2006) Measurement error in nonlinear models: A modern perspective, Second Edition, 2 ed., Chapman and Hall, New York. 10. Kraft, P., Cox, D. G., Paynter, R. A., Hunter, D., and De Vivo, I. 
(2005) Accounting for haplotype uncertainty in matched association studies: A comparison of simple and flexible techniques, Genetic Epidemiology 28:261–272. 11. Rosner, B., Spiegelman, D., and Willett, W. (1992) Correction of logistic relative risk estimates and confidence intervals for random within-person measurement error, Amer J of Epidemiology 136:1400–1409. 12. Stram, D. O., Pearce, C. L., Bretsky, P., Freedman, M., Hirschhorn, J. N., Altshuler, D., Kolonel, L. N., Henderson, B. E., and Thomas, D. C. (2003) Modeling and E-M Estimation of Haplotype-Specific Relative Risks from Genotype Data for a Case–control Study of Unrelated Individuals, Human Heredity 55:179–190. 13. Lin, D. Y., and Zeng, D. (2006) Likelihood-Based Inference on Haplotype Effects in Genetic Association Studies, Journal of the American Statistical Association 101:89–104. 14. Lin, D. Y., and Huang, B. E. (2007) The use of inferred haplotypes in downstream analyses, Am J Hum Genet 80:577–579. 15. Hu, Y., and Lin, D. (2010) Analysis of untyped SNPs, maximum likelihood and Imputation Methods, Genetic Epidemiology 34:803–815. 16. Scheet, P., and Stephens, M. (2006) A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase, Am J Hum Genet 78:629–644. 17. Browning, S. R., and Browning, B. L. (2007) Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering, Am J Hum Genet 81:1084–1097. 18. Stephens, M., Smith, N. J., and Donnelly, P. (2001) A new statistical method for haplotype reconstruction from population data, Am J Hum Genet 68:978–989. 19. Kraft, P., and Stram, D. O. (2007) Re: the use of inferred haplotypes in downstream analysis, Am J Hum Genet 81:863–865; author reply 865–866. 20. Haiman, C. A., Stram, D. O., Pike, M. C., Kolonel, L. N., Burtt, N. P., Altshuler, D., Hirschhorn, J., and Henderson, B. E. (2003) A Comprehensive Haplotype Analysis of CYP19 and Breast Cancer Risk: The Multiethnic Cohort Study, Hum Mol Genet 12:2679–2692. 21. Louis, T. (1982) Finding the Observed Information Matrix when using the EM algorithm, JRSS-B 44(2):226–233. 22. Spinka, C., Carroll, R.
J., and Chatterjee, N. (2005) Analysis of case–control studies of genetic and environmental factors with missing genetic information and haplotype-phase ambiguity, Genet Epidemiol 29:108–127.
23. Zhao, L. P., Li, S. S., and Khalid, N. (2003) A method for the assessment of disease associations with single-nucleotide polymorphism haplotypes and environmental variables in case–control studies, Am J Hum Genet 72:1231–1250. 24. Venkatraman, E. S., Mitra, N., and Begg, C. B. (2004) A method of evaluating the impact of individual haplotypes on disease incidence in molecular epidemiology studies, Statistical Applications in Genetics and Molecular Biology (Berkeley Electronic Press) 3:1–20. 25. Siegmund, D., and Yakir, B. (2007) The Statistics of Gene Mapping, Springer, New York. 26. Dudbridge, F. (2006) A note on permutation tests in multistage association scans, Am J Hum Genet 78:1094–1095; author reply 1096. 27. Dudbridge, F., and Koeleman, B. P. (2004) Efficient computation of significance levels for multiple associations in large studies of correlated data, including genomewide association studies, Am J Hum Genet 75:424–435. 28. Dudbridge, F., and Koeleman, B. P. (2003) Rank truncated product of P-values, with application to genomewide association scans, Genet Epidemiol 25:360–366. 29. Lin, D. Y. (2006) Evaluating Statistical Significance in Two-Stage Genomewide Association Studies, Am J Hum Genet 78(3):505–509. 30. Nackley, A. G., Shabalina, S. A., Tchivileva, I. E., Satterfield, K., Korchynskyi, O., Makarov, S. S., Maixner, W., and Diatchenko, L. (2006) Human catechol-O-methyltransferase haplotypes modulate protein expression by altering mRNA secondary structure, Science 314:1930–1933. 31. Marchini, J., Donnelly, P., and Cardon, L. R. (2005) Genome-wide strategies for detecting multiple loci that influence complex diseases, Nat Genet 37:413–417. 32. Millstein, J., Conti, D. V., Gilliland, F. D., and Gauderman, W. J. (2006) A testing framework for identifying susceptibility genes in the presence of epistasis, Am J Hum Genet 78:15–27. 33. Evans, D. M., Marchini, J., Morris, A. P., and Cardon, L. R. (2006) Two-stage two-locus models in genome-wide association, PLoS Genet 2:e157. 34. Millikan, R. C., Hummer, A., Begg, C., Player, J., de Cotret, A. R., Winkel, S., Mohrenweiser, H., Thomas, N., Armstrong, B., Kricker, A., Marrett, L. D., Gruber, S. B., Culver, H. A., Zanetti, R., Gallagher, R. P., Dwyer, T., Rebbeck, T. R., Busam, K., From, L., Mujumdar, U., and Berwick, M. (2006) Polymorphisms in nucleotide excision repair genes and risk of multiple primary melanoma: the Genes Environment and Melanoma Study, Carcinogenesis 27:610–618. 35. Scitable, Nature Education. http://www.nature.com/scitable/topicpage/the-information-in-dna-determines-cellular-function-6523228 36. Li, Y., Willer, C. J., Ding, J., Scheet, P., and Abecasis, G. R. (2010) MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes, Genetic Epidemiology 34:816–834.
Chapter 24

Detecting Rare Variants

Tao Feng and Xiaofeng Zhu

Abstract

The limitations of genome-wide association (GWA) studies that are based on the common disease common variants (CDCV) hypothesis have motivated geneticists to test the hypothesis that rare variants contribute to the variation of common diseases, i.e., common disease/rare variants (CDRV). The newly developed high-throughput sequencing technologies have made the study of rare variants practicable. Statistical approaches to test associations between a phenotype and rare variants are developing quickly. The central idea of these methods is to test a set of rare variants in a defined region or regions by collapsing or aggregating the rare variants, thereby improving statistical power. In this chapter, we introduce these methods as well as their applications in practice.

Key words: GWA, Common disease common variants, Common disease rare variants, SNPs, Haplotype, Collapsing, Aggregation
1. Introduction

1.1. Recent Successes, Limitations, and Challenges of Genome-Wide Association Studies
The underlying biology of common diseases is unclear. It has been suggested that both genetic and environmental factors contribute to the variation in risk of complex diseases. Studying the genetic risk factors for a common disease will be helpful for understanding the etiology of the disease and, eventually, for preventing it. The completion of the human genome sequence (1, 2) and the progress in SNP genotyping technologies have facilitated the creation of dense SNP databases such as the International HapMap Project (3, 4). Many genome-wide association (GWA) studies have been completed, and several genetic variants have been identified as being associated with complex diseases. A hypothesis underlying GWA studies is that causal variants are common in a population (5, 6). Under this assumption, variants contributing to the variation in common disease susceptibility can be detected by testing tagging SNPs across the genome. A well-known whole-genome association study, the Wellcome Trust Case Control
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_24, © Springer Science+Business Media, LLC 2012
Consortium (WTCCC) study (7), was based on the CDCV assumption. By comparing the allele frequencies between cases and controls in the WTCCC study, 24 independent association signals were identified. These signals were 1 for bipolar disorder, 1 for coronary artery disease, 9 for Crohn's disease, 3 for rheumatoid arthritis, 7 for type I diabetes, and 3 for type II diabetes (7). Despite the remarkable successes in identifying genetic variants that influence common diseases, many challenges still lie ahead. The majority of genetic variants contributing to disease susceptibility are yet to be discovered. Some common diseases, such as hypertension, are known to have strong genetic components, yet uncovering their underlying putative genetic variants using GWA studies has proven difficult. The common variants identified to date can explain only a small proportion of the variation in disease phenotype. For example, height is known to be a heritable trait with estimated heritability around 0.8, which implies that about 80% of the interindividual variation is attributable to genetic factors. Although three GWA studies in 2008 (8–10) found 40 previously unknown variants, each locus can explain only a small proportion of the phenotypic variance (0.3–0.5%). The failure of GWA studies to identify the majority of susceptibility variants can be attributed to several factors, such as population stratification, small effect sizes of genetic variants, copy number variations, and rare variants.

1.2. Multiple Rare Variants Hypothesis and Recent Evidence
Some recent studies persuade us to believe that multiple rare variants may contribute to disease risk in the human population. Recent findings, such as the association of BRCA1 and BRCA2 with breast cancer (11), indicate that multiple rare variants within the same gene can contribute to largely monogenic disorders. Most of the common variants identified in GWA studies have modest effect sizes, with odds ratios between 1.2 and 1.5 (12). It has been suggested that rare variants may have larger effect sizes than the common variants (13). However, the analysis of rare variants can be challenging. In this chapter, we describe some recently developed methods for analyzing rare genetic variants. There are several approaches to testing the association between rare variants and a disease outcome or trait. One approach is to test each variant individually using a standard contingency table or regression method. However, this approach has low statistical power to detect an association between the rare variants and the trait owing to their small allele frequencies (14–16). To overcome this issue, one viable strategy is to "collapse or aggregate" sets of rare variants into a single group and test their collective frequency difference between cases and controls (15, 17). We first introduce some notation and explain different strategies for collapsing the rare variants. We provide specific details for using a two-stage haplotype-based method for evaluating rare
variants. Next, in Subheading 2, we explain how to conduct the two-stage haplotype-based analysis using the R programming language (http://cran.r-project.org). We focus on case–control association studies, where cases are individuals who have the disease of interest and controls are disease-free individuals. We assume that the cases and controls are independent. We also refer to the cases as affected individuals and the controls as unaffected individuals.

1.2.1. Notation
Assume that within a region there are M SNPs that can independently cause disease susceptibility. The term “region” refers to the unit in which the variants will be collectively analyzed. It can be defined as a gene, a pathway, or a genomic block based on some criterion. Each SNP has two alleles, denoted A_i and a_i, i = 1, 2, ..., M, in which A_i always refers to the rare high-risk allele and has an allele frequency p_i. Furthermore, let G_k, k = 0, 1, 2 denote the genotypes aa, Aa, and AA, respectively.
1.3. Collapsing Method
Our goal is to test whether the presence of rare variants in a region is associated with disease. Define an indicator variable for the jth case individual as
\[ X_j = \begin{cases} 1 & \text{rare variant(s) present} \\ 0 & \text{otherwise.} \end{cases} \]
Y_j is defined similarly for control individuals. Owing to the rarity of the variants, the probability that an individual carries more than one variant is low (see Note 1), and the method collapses genotypes across all variants, such that an individual is coded as 1 if a rare allele is present at any of the M variant sites in the region and as 0 otherwise. The detection of an association of multiple rare variants is thereby transformed into a test of whether the proportions of individuals carrying rare variants differ between cases and controls. Any single-SNP association test used in GWA studies can be applied here, such as a chi-squared test for a contingency table or a regression analysis.
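For illustration, the collapsing and testing steps can be sketched in Python (the chapter's own examples use R); the genotype matrix and counts below are hypothetical.

```python
# Sketch of the collapsing method: code each individual as 1 if a rare allele
# is present at any variant site in the region, then compare carrier
# proportions between cases and controls with a 2x2 chi-squared test (1 df).

def collapse(genotypes):
    """genotypes: per-individual lists of minor-allele counts across the region."""
    return [1 if any(g > 0 for g in person) else 0 for person in genotypes]

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic (1 df) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom

# Hypothetical region: rows are individuals, columns are rare-variant sites.
cases = [[0, 1, 0], [1, 0, 0], [0, 0, 0], [0, 0, 1]]
controls = [[0, 0, 0], [0, 0, 0], [1, 0, 0], [0, 0, 0]]

x, y = collapse(cases), collapse(controls)
a, b = sum(x), len(x) - sum(x)   # carriers / non-carriers among cases
c, d = sum(y), len(y) - sum(y)   # carriers / non-carriers among controls
print(chi2_2x2(a, b, c, d))      # -> 2.0
```

With samples this small one would in practice use Fisher's exact test rather than the chi-squared approximation.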
1.4. Combined Multivariate and Collapsing
Li and Leal (15) considered an extension of the collapsing method, which they termed the combined multivariate and collapsing (CMC) method, to take advantage of both the multiple marker tests and the collapsing method. For a predefined region, they first divide the markers (e.g., SNPs) within the region into groups according to certain criteria (e.g., allele frequencies) and then collapse the rare variants within each group using the method described in Subheading 1.3. To analyze the groups of collapsed rare variants, a multivariate test such as Hotelling’s T² test is applied.
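A minimal sketch of the CMC grouping step, in Python for illustration (the frequency cutoffs and data are hypothetical): variants are divided into frequency groups and each group is collapsed into one indicator, yielding a short multivariate vector per individual to which a test such as Hotelling's T² can then be applied.

```python
# CMC sketch: divide variants into groups by allele frequency, then collapse
# each group into a single presence/absence indicator per individual.

def cmc_vector(person, freqs, cutoffs=(0.01, 0.05)):
    """Return one indicator per frequency group for a single individual.
    person: minor-allele counts per variant; freqs: allele frequency per variant."""
    bounds = (0.0,) + cutoffs + (1.0,)
    groups = []
    for lo, hi in zip(bounds, bounds[1:]):
        idx = [i for i, p in enumerate(freqs) if lo <= p < hi]
        groups.append(1 if any(person[i] > 0 for i in idx) else 0)
    return groups

freqs = [0.005, 0.02, 0.004, 0.10]   # hypothetical allele frequencies
person = [0, 1, 0, 0]                # carries one variant with p = 0.02
print(cmc_vector(person, freqs))     # -> [0, 1, 0]
```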
1.5. Weighted Sum Association Method
Madsen and Browning (18) proposed a statistic for testing a prespecified collapsed set of variants that weights each variant by its frequency, thus allowing one to include variants of any frequency into the collapsed set. This approach proceeds as follows.
T. Feng and X. Zhu
Let a be a predefined frequency threshold. For the ith variant (i = 1, 2, ..., M), if p_i < a, its weight is defined as
\[ w_i = \sqrt{n_i q_i (1 - q_i)}, \qquad q_i = \frac{m_{iu} + 1}{2(n_{iu} + 1)}, \]
where m_iu is the number of mutant alleles observed for variant i in the unaffected individuals, n_iu is the number of unaffected individuals genotyped for variant i, and n_i is the total number of individuals genotyped for variant i (affected and unaffected). If p_i ≥ a, the weight of the ith variant is 0. The next step is to collapse all the variants whose weight is not zero in this region by defining the genetic score of individual j as
\[ g_j = \sum_{i=1}^{M} \frac{I_{ij}}{w_i}, \]
where I_ij is the number of mutant alleles in variant i for individual j. Thus, for person j, g_j represents a single score obtained by combining information from all M variants in the region of interest. An association test is performed on this score rather than on the individual variants. Madsen and Browning (18) suggest using a nonparametric Wilcoxon test for the association test and calculating the P-value using a permutation approach (see Note 2). 1.6. Pooled Association Tests for Rare Variants
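The weight and score g_j defined in Subheading 1.5 can be computed as in this Python sketch (counts are hypothetical; the subsequent Wilcoxon test and permutation step are omitted).

```python
import math

def weight(m_u, n_u, n_total):
    """Weight for one variant: w_i = sqrt(n_i * q_i * (1 - q_i)), with
    q_i = (m_iu + 1) / (2 * (n_iu + 1)) estimated from the unaffected group."""
    q = (m_u + 1) / (2 * (n_u + 1))
    return math.sqrt(n_total * q * (1 - q))

def genetic_score(person, weights):
    """g_j = sum_i I_ij / w_i, where I_ij counts mutant alleles at variant i."""
    return sum(i_ij / w for i_ij, w in zip(person, weights))

# Hypothetical region with M = 3 variants.
m_u = [1, 0, 2]           # mutant alleles observed in unaffected individuals
w = [weight(m, n_u=99, n_total=199) for m in m_u]
print(round(genetic_score([1, 0, 1], w), 3))
```

Rarer variants receive smaller weights and therefore contribute more to the score.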
The weighted sum method (WSM) described above assigns weights to the variants whose rare-allele frequencies are less than a predefined threshold a, such as a = 0.02 or a = 0.01. However, it is difficult to pick an optimal threshold in practice. Price et al. (19) proposed a variable-threshold approach for testing rare coding variants.
1.6.1. Variable-Threshold Approach
The idea behind this approach is that there exists some (unknown) threshold T for which variants with a minor allele frequency (MAF) below T are substantially more likely to be functional than variants with an MAF above T. To obtain this data-driven threshold, a z-score z(T) is computed for each allele-frequency threshold T, and the maximum z-score across the different values of T is defined as zMax. A permutation procedure is used to assess the statistical significance of zMax; allowing zMax in the permuted data to be attained at values of T different from those in the unpermuted data ensures the validity of the permutation test. We refer the reader to Price et al. (19) for details about the calculation of the z-scores and for testing the statistical significance of the variants using this method.
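The variable-threshold idea can be sketched as follows, in Python for illustration. The z-score used here is a plain two-proportion z on collapsed carrier counts, an assumption of this sketch rather than Price et al.'s exact statistic; note that each permuted data set is allowed to attain its maximum at a different threshold.

```python
import math, random

def z_collapsed(cases, controls, freqs, T):
    """Two-proportion z-score after collapsing all variants with MAF < T."""
    idx = [i for i, p in enumerate(freqs) if p < T]
    if not idx:
        return 0.0
    x = [1 if any(g[i] > 0 for i in idx) else 0 for g in cases]
    y = [1 if any(g[i] > 0 for i in idx) else 0 for g in controls]
    p = (sum(x) + sum(y)) / (len(x) + len(y))
    se = math.sqrt(p * (1 - p) * (1 / len(x) + 1 / len(y)))
    return (sum(x) / len(x) - sum(y) / len(y)) / se if se > 0 else 0.0

def z_max(cases, controls, freqs):
    """Maximize the z-score over the candidate thresholds (the observed MAFs)."""
    return max(z_collapsed(cases, controls, freqs, t + 1e-9)
               for t in sorted(set(freqs)))

def perm_pvalue(cases, controls, freqs, n_perm=200, seed=1):
    """Permutation P-value for zMax: permute case/control labels, letting the
    maximizing threshold differ in each permuted data set."""
    rng = random.Random(seed)
    observed = z_max(cases, controls, freqs)
    pooled = cases + controls
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if z_max(pooled[:len(cases)], pooled[len(cases):], freqs) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```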
1.6.2. Incorporation of Computational Predictions of Functional Effects
The weights may also be defined based on the functional relevance of the individual variants. Price et al. suggest using PolyPhen-2 scores (20, 21), which evaluate the possible functional effect of an SNP, by calculating the distributions of PolyPhen-2 probabilistic scores for neutral and for damaging amino acid changes. The posterior probabilities that variants are functional are calculated using these distributions and used to define the weights. We refer the reader to Price et al. (19) for details about this method.
1.7. Two-Stage Haplotype-Based Methods
Because most rare variants are either not genotyped or not well called in GWAS, it is impossible to test specific rare variants or collapsed rare variants directly. Although rare variants are usually not well tagged by common variants, it is reasonable to assume that one or more rare variants may fall on only one haplotype consisting of common SNPs. So, another way of exploiting summary statistics for rare variant analysis involves comparing haplotype frequencies between the case and control groups. Zhu et al. (22) have developed a novel association method that can summarize multiple rare variants in this way. This two-stage approach proceeds as follows. In the first stage, a set of susceptibility haplotypes is identified by comparing the shared haplotype frequencies in the cases with those in a general population. These susceptibility haplotypes are then compared between cases and controls in the second stage to identify those having a significant association with the disease. Although we describe this approach for case–control studies, the method is also applicable to other types of designs such as affected sibpairs. Suppose we have a total of n individuals, of whom n_u are unaffected (controls) and the remaining n − n_u are affected (cases). In stage 1, randomly sample N (< n_u) unaffected and N (< n − n_u) affected individuals, and define the risk haplotype set S = {i : z_i > γ}, where z_i is the standardized difference between the frequency of haplotype i among the stage-1 affected individuals and its frequency among all stage-1 individuals, and γ is a predefined number that affects the misclassification rate and power. Here, N_2 = 2N denotes the total number of individuals used in stage 1. In the second stage, consider the n_u − N unaffected individuals and the n − n_u − N affected individuals who were not used in the
first stage. Further, consider the haplotypes in the risk haplotype set S identified in the first stage. Compare the frequencies of these haplotypes in the affected versus the unaffected individuals considered in the second stage (see Note 3).
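A schematic of the two stages in Python (for illustration; the chapter's own code is in R). The stage-1 selection rule shown here is a standardized frequency-difference rule with threshold gamma, an assumption of this sketch rather than Zhu et al.'s exact statistic; the haplotype strings and counts are hypothetical.

```python
import math
from collections import Counter

def risk_haplotypes(stage1_case_haps, stage1_all_haps, gamma=1.28):
    """Stage 1: keep haplotypes whose frequency among stage-1 cases exceeds
    the overall stage-1 frequency by more than gamma standard errors."""
    n_case, n_all = len(stage1_case_haps), len(stage1_all_haps)
    f_case = Counter(stage1_case_haps)
    risk = set()
    for h, count in Counter(stage1_all_haps).items():
        p = count / n_all
        se = math.sqrt(p * (1 - p) / n_all)
        if se > 0 and (f_case[h] / n_case - p) / se > gamma:
            risk.add(h)
    return risk

def stage2_table(risk, case_haps, ctrl_haps):
    """Stage 2: risk vs non-risk haplotype counts for a 2x2 association test."""
    a = sum(h in risk for h in case_haps)
    c = sum(h in risk for h in ctrl_haps)
    return (a, len(case_haps) - a), (c, len(ctrl_haps) - c)

# Hypothetical stage-1 haplotypes ("AB", "CD" stand in for long SNP haplotypes).
case1 = ["AB"] * 6 + ["CD"] * 4
all1 = ["AB"] * 8 + ["CD"] * 12
S = risk_haplotypes(case1, all1)
print(S)                                                  # -> {'AB'}
print(stage2_table(S, ["AB", "AB", "CD"], ["CD", "CD"]))  # -> ((2, 1), (0, 2))
```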
2. Methods We now explain the steps for conducting the two-stage haplotypebased analysis of rare variants using the R programming language (http://cran.r-project.org). We focus on the analysis of case–control samples here, although the two-stage method has also been described for sibpairs by Zhu et al. (22). This method involves the following three key steps. 2.1. Infer Haplotypes (Stage 1 Analysis)
We first infer the haplotypes of all cases and controls. For unrelated individuals, software such as PHASE, fastPHASE, and Beagle (24–27) can be applied. We use Beagle to infer haplotypes because of its accuracy and efficiency compared with other software. A description of the Beagle software is available at http://faculty.washington.edu/browning/beagle/beagle.html. (The method proposed by Li et al. (23) can be used to infer the haplotypes for sibpair data.)
2.2. Collapsing the Risk Haplotypes (Stage 1 Analysis)
Once haplotypes are inferred, the risk haplotype set S can be obtained as described above using the haplotypes of a randomly selected subset of cases and controls. When grouping the risk haplotypes, we set γ = 1.28 to control the misclassification rate.
2.3. Association Test (Stage 2 Analysis)
The haplotypes of the remaining cases and controls are used to compare the frequency difference in total risk haplotypes between cases and controls, using either an exact method, such as Fisher’s exact test, or a test statistic based on an asymptotic approximation.
2.3.1. Example of R Code
Suppose that the estimated haplotypes of the case–control dataset are stored in a file named data.txt. Assume that the total sample size is n = 5,000. The first n − nu = 2,000 records are case haplotypes, and the remaining nu = 3,000 records are control haplotypes. There is a total of M = 50 genetic variants in this region. The data format of the file data.txt is:

Case_1
14414141432122124324241214241424223222132131111324
14414141432122124324241214241424223222314133311214
Case_2
14232221434242344144233234342124422222332331311224
14232221434244324124233234222142232422314133311214
Case_3
14414141432244324124233234222142232422314133311214
14232221434242344144233234342124422224132131111324
...
Case_2000
14232221434122324124241214241124222212332331331222
14232221434242344144233234342124422224132131111324
Control_1
14232221434242344144233234342124422222332331311224
14232221434242344144233234342124422222332331311224
Control_2
44232223414244324124233234222142232422332331311224
14232221432122124324241214241424223222132131111324
Control_3
44232223414244324124233234222142232422314133311214
44232223414244324124233234222142232422314133311214
...
Control_3000
14232221434122324124241214241124222222132121111324
14232221434242344144233234342124422222332331311224

In this dataset, there are three rows for each individual. The first three rows correspond to person 1, who is a case. Case_1 (the first row) is the case identifier (ID). The second and third rows show the inferred haplotypes of this person, i.e., 14414141432122124324241214241424223222132131111324 and 14414141432122124324241214241424223222314133311214. These haplotypes were inferred by Beagle. The remaining rows of this file can be interpreted in a similar manner. The numbers 1, 2, 3, and 4 in the haplotypes represent the nucleotides A, C, G, and T. The two-stage haplotype-based association test can be done using R code, as described below: 1. Read the data file, and convert it into a matrix.
2. Randomly select a subset of cases and controls to define the risk haplotypes (stage 1 analysis), and retain the remaining data for the association test (stage 2 analysis). In this example, we select 400 cases and 1,000 controls for stage 1.
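The random split in step 2 can be sketched as follows (Python for illustration; the chapter's code is in R, and the seed and index layout are assumptions of this sketch).

```python
import random

def split_stage1(n_cases, n_controls, n1_cases=400, n1_controls=1000, seed=7):
    """Randomly partition individual indices into stage-1 and stage-2 sets:
    cases occupy indices 0..n_cases-1, controls the remaining indices."""
    rng = random.Random(seed)
    case_idx = list(range(n_cases))
    ctrl_idx = list(range(n_cases, n_cases + n_controls))
    rng.shuffle(case_idx)
    rng.shuffle(ctrl_idx)
    stage1 = case_idx[:n1_cases] + ctrl_idx[:n1_controls]
    stage2 = case_idx[n1_cases:] + ctrl_idx[n1_controls:]
    return stage1, stage2

s1, s2 = split_stage1(2000, 3000)
print(len(s1), len(s2))   # -> 1400 3600
```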
3. Calculate the haplotype frequencies.
The output of the above steps looks as follows. The variable tHaplotype contains the list of the haplotypes, and the variable tHaplotypeFre is the frequency of each haplotype in tHaplotype.

14232221432122124324241214241424223222132131111324  0.00625
14232221434242324122241214342124222212332331331222  0.00250
14232221434242344144233234342124422222332331311224  0.09875
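Step 3's frequency calculation amounts to tabulating each distinct haplotype among all chromosome copies; a Python sketch with made-up short haplotypes (the chapter's code does this in R):

```python
from collections import Counter

def haplotype_freqs(haps):
    """Frequency of each distinct haplotype among the 2n chromosome copies."""
    counts = Counter(haps)
    total = len(haps)
    return {h: c / total for h, c in counts.items()}

haps = ["142", "142", "144", "142"]   # hypothetical 3-SNP haplotypes
print(haplotype_freqs(haps))          # -> {'142': 0.75, '144': 0.25}
```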
4. Group the risk haplotypes using the randomly picked 400 haplotypes in cases and 1,000 haplotypes in controls.
The output of these steps looks as follows. Hapmat consists of three columns corresponding to the haplotypes, the case frequencies of the haplotypes, and the control frequencies. An example is shown below. The first line is the title of each column.

tHaplotype                                          tHaplotypeFre  sHaplotypeFre
14232221432122124324241214241424223222132131111324  0.00625        0.0075
14232221434242324122241214342124222212332331331222  0.0025         0.001
14232221434242344144233234342124422222332331311224  0.09875        0.1185
14232221434242344144233234342124422222314133311214  0.0025         0.0035
14232221434242344144233234342124422222132131111324  0.00125        0.0015
14232221434242344144233234342124422222132121111324  0.00125        5e-04
14232221434242344144233234342124422222134133311214  0.00125        0.001
14232221434242344144233234342124422224132133111324  0.00125        NA
14232221434242344144233234342124422224132131111324  0.045          0.037
14232221434242344144233234342124422224132131111222  0.0075         0.0045
14232221434242344144233234342124422224114133331214  0.00125        NA

In the third column, two haplotypes have NA values, meaning that these haplotypes are not found in the controls. For further use, these NA values are replaced with 0 by using the command Hapmat[is.na(Hapmat)] <- 0. 5. Define the risk haplotypes.
If there is at least one risk haplotype (i.e., if length(ERiskHap) > 0), then (and only then) run the following commands. 6. Association testing (stage 2 analysis). First, collect the defined risk haplotypes in cases and in controls for association testing.
7. Association testing (stage 2 analysis). Group the risk haplotypes for association testing.
pos <- which(Hapmat2[,1] %in% ERiskHap)
Rmat <- Hapmat2[pos,]
8. Association testing (stage 2 analysis). Do the association test using Fisher’s exact procedure.
9. Association test (stage 2 analysis). The asymptotic distribution may also be used.
stat <- (Fd - Fc)/sqrt(Fd*(1 - Fd)/nd + Fc*(1 - Fc)/nc)  # statistic
pval_asy <- pnorm(stat, mean = 0, sd = 1, lower.tail = TRUE)  # P-value
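The asymptotic test of step 9 is an ordinary two-proportion comparison. A Python equivalent of the R sketch (Fd and Fc are the total risk-haplotype frequencies, nd and nc the numbers of haplotypes in the stage-2 cases and controls; the numbers are hypothetical):

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function (equivalent to R's pnorm)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def asymptotic_test(fd, nd, fc, nc):
    """z statistic for the difference in risk-haplotype frequency and its
    lower-tail P-value, matching the R code in step 9."""
    stat = (fd - fc) / math.sqrt(fd * (1 - fd) / nd + fc * (1 - fc) / nc)
    return stat, norm_cdf(stat)

stat, p = asymptotic_test(fd=0.10, nd=3200, fc=0.12, nc=4000)
print(round(stat, 3), round(p, 4))
```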
10. Return the result.
return(list(pval_fisher, pval_asy))
In our example, the P-values of Fisher’s exact test and the asymptotic test were 0.9711 and 0.9994, respectively. Hence, the above code returned the P-value pair (0.9711, 0.9994). We prefer to use the exact test when interpreting the result, but the two P-values should be similar except in the case of a rare haplotype.
3. Notes 1. The collapsing strategy is based on an important assumption, namely that each individual is unlikely to have more than one rare variant. Given the low frequency of the rare variants, this may be a reasonable assumption. However, the assumption can be violated when the rare variants interact with each other or when a large genomic region is considered. 2. A challenge of this method is that it applies a permutation test, which is very time-consuming. 3. The haplotype inference can be a challenge if the testing region is large. It may take substantial computational time for several hundred SNPs and a large number of individuals. Also, if
many SNPs are considered, it is possible that each individual has his/her own unique haplotypes, and this may result in a failure to group risk haplotypes. When only unrelated individuals are available, the best approach is to use the whole sample in both stages, but then a permutation test should be performed when calculating the P-value (28).

References

1. Lander ES et al (2001) Initial sequencing and analysis of the human genome. Nature 409: 860–921
2. Venter JC, Adams MD, Myers EW et al (2001) The sequence of the human genome. Science 291: 1304–1351
3. The International HapMap Consortium (2003) The International HapMap Project. Nature 426: 789–796
4. Frazer KA, Ballinger DG, Cox DR et al (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature 449: 851–861
5. Chakravarti A (1999) Population genetics: making sense out of sequence. Nat Genet 21: 56–60
6. Lander ES (1996) The new genomics: global views of biology. Science 274: 536–539
7. The Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661–678
8. Gudbjartsson DF, Walters GB, Thorleifsson G et al (2008) Many sequence variants affecting diversity of adult human height. Nat Genet 40: 609–615
9. Lettre G, Jackson AU, Gieger C et al (2008) Identification of ten loci associated with height highlights new biological pathways in human growth. Nat Genet 40: 584–591
10. Weedon MN, Lango H, Lindgren CM et al (2008) Genome-wide association analysis identifies 20 loci that influence adult height. Nat Genet 40: 575–583
11. Easton DF et al (2007) A systematic genetic assessment of 1,433 sequence variants of unknown clinical significance in the BRCA1 and BRCA2 breast cancer predisposition genes. Am J Hum Genet 81: 873–883
12. Bodmer W, Bonilla C (2008) Common and rare variants in multifactorial susceptibility to common diseases. Nat Genet 40: 695–701
13. Schork NJ, Murray SS, Frazer KA, Topol EJ (2009) Common vs rare allele hypotheses for complex diseases. Curr Opin Genet Dev 19: 212–219
14. Gorlov IP, Gorlova OY, Sunyaev SR et al (2008) Shifting paradigm of association studies: value of rare single-nucleotide polymorphisms. Am J Hum Genet 82: 100–112
15. Li B, Leal SM (2008) Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet 83: 311–321
16. Altshuler D, Daly MJ, Lander ES (2008) Genetic mapping in human disease. Science 322: 881–888
17. Morgenthaler S, Thilly WG (2007) A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutat Res 615: 28–56
18. Madsen BE, Browning SR (2009) A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet doi:10.1371/journal.pgen.1000384
19. Price AL et al (2010) Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet 86: 832–838
20. Ramensky V, Bork P, Sunyaev S (2002) Human non-synonymous SNPs: server and survey. Nucleic Acids Res 30: 3894–3900
21. Adzhubei IA, Schmidt S, Peshkin L et al (2010) A method and server for predicting damaging missense mutations. Nat Methods 7: 248–249
22. Zhu X, Feng T, Li Y, Lu Q, Elston RC (2010) Detecting rare variants for complex traits using family and unrelated data. Genet Epidemiol 34: 171–187
23. Li X, Chen Y, Li J (2010) Detecting genome-wide haplotype polymorphism by combined use of mendelian constraints and local population structure. Pac Symp Biocomput 15: 348–358
24. Stephens M, Donnelly P (2003) A comparison of Bayesian methods for haplotype reconstruction from population genotype data. Am J Hum Genet 73: 1162–1169
25. Stephens M, Smith N, Donnelly P (2001) A new statistical method for haplotype reconstruction from population data. Am J Hum Genet 68: 978–989
26. Scheet P, Stephens M (2006) A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet 78: 629–644
27. Browning SR, Browning BL (2007) Rapid and accurate haplotype phasing and missing data inference for whole genome association studies using localized haplotype clustering. Am J Hum Genet 81: 1084–1097
28. Feng T, Zhu X (2010) Genome-wide searching of rare genetic variants in WTCCC data. Hum Genet 128: 269–280
Chapter 25 The Analysis of Ethnic Mixtures Xiaofeng Zhu Abstract Populations of ethnic mixtures can be useful in genetic studies. Admixture mapping, or mapping by admixture linkage disequilibrium (MALD), was specially developed for admixed populations and can supplement traditional genome-wide association analyses in the search for genetic variants underlying complex traits. Admixture mapping tests the association between a trait and locus-specific ancestries. The locus-specific ancestries are in linkage disequilibrium (LD), which is generated by the admixture process between genetically distinct ancestral populations. Because of the highly correlated locus-specific ancestries, admixture mapping performs many fewer independent tests across the genome than current genome-wide association analysis. Therefore, admixture mapping can be more powerful because of the smaller penalty due to multiple tests. In this chapter, I introduce the theory behind admixture mapping and how we conduct the analysis in practice. Key words: Admixture mapping, Population admixture, Ancestry informative marker, Hidden Markov model
1. Introduction In genetic epidemiology, we always want to study the relationship between a phenotype and a genetic marker. A popular design is a retrospective case–control design for a binary trait or a population-based design for a quantitative trait. Association for a genetic marker can be established by performing logistic regression or linear regression analysis. When the study samples are collected from a recently admixed population such as African Americans or Mexican Americans, each subject’s chromosome has a mosaic structure of chromosome segments that come from the ancestral populations. Intuitively, we are able to test the association between the ancestry at any position of the genome and a disease trait, provided such information is available. For example, a 2 by 2 table can be created and standard statistical methods for an association test can be applied (Table 1). The underlying assumption is that the risk allele at a Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_25, © Springer Science+Business Media, LLC 2012
Fig. 1. The mosaic structure of chromosomes of cases and controls sampled from an admixed population with two ancestral populations. The dark and light segments represent chromosome segments inherited from the two ancestral populations. The vertical line represents the location of a disease susceptibility variant. When the disease variant has a high frequency in the dark population and is rare in the other, more dark segments are observed at the disease locus in cases than outside the disease locus, and than in any region in controls.
Table 1
A 2 by 2 table for testing association between ancestry and a disease trait in samples from an African American population

Ancestry   Cases                    Controls                 Odds ratio
AA         n_AA^(D)                 n_AA^(C)                 Ψ_AA
EA         n_EA^(D)                 n_EA^(C)
Total      n_AA^(D) + n_EA^(D)      n_AA^(C) + n_EA^(C)      2n

AA African-ancestral allele, EA European-ancestral allele. Superscript D: case; C: control
locus occurs at different frequencies among ancestral populations. When this is true, we expect that, in the admixed population, affected individuals share excess ancestry from the ancestral population with the highest frequency of the risk allele. Figure 1 illustrates such chromosomes sampled at the current generation when admixture occurs between two ancestral populations. Assume that the dark chromosomes are from an ancestral population with a high disease prevalence and the light chromosomes are from the other ancestral population, whose disease prevalence is low. In the ideal case, we expect all the chromosomes at a disease locus in affected cases to be inherited from one ancestral population. In comparison, controls will be less likely to carry dark chromosomes. Caution should be taken when performing the association test in Table 1. As in many association studies of population-based samples,
confounding is a serious problem. Since the disease prevalence differs between the two ancestral populations, an affected individual is more likely to carry chromosomes from the high-risk ancestral population than from the low-risk one, a phenomenon of population structure. In fact, a disease variant, rather than the ancestry itself, contributes to the phenotypic variation. Thus, an analysis of the data in Table 1 should account for the effect of population structure. In general, admixture mapping methods can simply be viewed as testing the association between a locus-specific ancestry and a phenotype while controlling for the effect of population structure. The test statistics can be built by comparing the locus-specific ancestry between cases and controls, or by comparing the locus-specific ancestry to the ancestry distribution across the genome among cases only. 1.1. Test Statistics
Mathematical models for admixture mapping can be found in the literature (1–9). Suppose we have an admixed population C resulting from two ancestral populations, X and Y. Let P_d(θ) and P_c(θ) be the proportions of alleles that are from ancestral population X among cases and controls in the current admixed population, respectively, where θ represents the genetic distance between the disease location and the candidate marker. The null hypothesis is that the marker is unlinked to the disease risk, i.e., θ = 0.5 between the marker locus and the disease locus. In a case-only design, we test the null hypothesis P_d(θ) = P_d(0.5), and in a case–control design, we test the null hypothesis P_d(θ) − P_d(0.5) = P_c(θ) − P_c(0.5). If we knew which ancestral population each of an individual’s alleles at any marker locus came from, we would be able to estimate P_d and P_c at any genomic position; they are estimated by the frequencies of ancestry present in cases and controls, respectively, and P_d(0.5) and P_c(0.5) are estimated by the average ancestry across the genome in cases and controls, respectively. Let P̂_d(t) be the estimated proportion of ancestry from population X at chromosome location t, conditional on the observed marker genotypes. A test statistic for the case-only design is
\[ Z_C(t) = \frac{\hat{P}_d(t) - \hat{P}_d(\theta = 0.5)}{s(\hat{P}_d(t))}, \tag{1} \]
and a test statistic for the case–control design is
\[ Z_{CC}(t) = \frac{\hat{P}_d(t) - \hat{P}_d(\theta = 0.5) - [\hat{P}_c(t) - \hat{P}_c(\theta = 0.5)]}{s(\hat{P}_d(t) - \hat{P}_c(t))}, \tag{2} \]
respectively (3, 4, 7). Neither test is affected by population structure because we are testing the excess of ancestry at a marker position.
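Equation 2, together with the genome-wide empirical variance of Eq. 4 below, can be assembled into a short Python sketch (the ancestry proportions are hypothetical; the case-only statistic of Eq. 1 is analogous):

```python
def case_control_z(x_case, x_ctrl, j):
    """Z_CC at marker j (Eq. 2) with the empirical genome-wide variance (Eq. 4).
    x_case/x_ctrl: per-individual lists of ancestry proportions x_ij."""
    M = len(x_case[0])
    pd = [sum(row[k] for row in x_case) / len(x_case) for k in range(M)]
    pc = [sum(row[k] for row in x_ctrl) / len(x_ctrl) for k in range(M)]
    pd_bar = sum(pd) / M        # estimate of P_d(theta = 0.5)
    pc_bar = sum(pc) / M        # estimate of P_c(theta = 0.5)
    var = sum((pd[k] - pc[k] - (pd_bar - pc_bar)) ** 2 for k in range(M)) / M
    return (pd[j] - pd_bar - (pc[j] - pc_bar)) / var ** 0.5

# Hypothetical ancestry proportions (0, 0.5, or 1): 2 cases, 2 controls, 3 markers.
x_case = [[1.0, 0.5, 1.0], [0.5, 0.5, 1.0]]
x_ctrl = [[0.5, 0.5, 0.5], [0.0, 0.5, 1.0]]
print(round(case_control_z(x_case, x_ctrl, 0), 3))   # -> 1.225
```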
Consider a study consisting of n_1 unrelated cases and n_2 unrelated controls genotyped at M marker loci. Let x_ij be the proportion of alleles from ancestral population X for the ith individual at marker j. At marker j we have
\[ \hat{P}_d(j) = \frac{1}{n_1} \sum_{i=1}^{n_1} x_{ij} \quad \text{and} \quad \hat{P}_c(j) = \frac{1}{n_2} \sum_{i=n_1+1}^{n_1+n_2} x_{ij}, \]
for cases and controls, respectively. Similarly, we have
\[ \hat{P}_d(\theta = 0.5) = \frac{1}{M n_1} \sum_{j=1}^{M} \sum_{i=1}^{n_1} x_{ij} \quad \text{and} \quad \hat{P}_c(\theta = 0.5) = \frac{1}{M n_2} \sum_{j=1}^{M} \sum_{i=n_1+1}^{n_1+n_2} x_{ij}, \]
respectively. Here we assume that only a few loci contribute to the disease disparity among ancestral populations, which is reasonable. We estimate the variances in Eqs. 1 and 2 by
\[ s^2(\hat{P}_d) = \frac{1}{M} \sum_{j=1}^{M} \left[ \hat{P}_d(j) - \hat{P}_d(\theta = 0.5) \right]^2 \tag{3} \]
and
\[ s^2(\hat{P}_d - \hat{P}_c) = \frac{1}{M} \sum_{j=1}^{M} \left[ \hat{P}_d(j) - \hat{P}_c(j) - \left( \hat{P}_d(\theta = 0.5) - \hat{P}_c(\theta = 0.5) \right) \right]^2. \tag{4} \]
The rationale for estimating the variance by Eqs. 3 and 4 is that the proportion of X by descent at any locus comes from approximately the same distribution when the locus is not linked with a trait locus. The variance estimated in this way has been shown theoretically to be asymptotically unbiased (10). A likelihood-based method can also be applied (6). Let
\[ r = \frac{P(\text{disease} \mid \text{both alleles from } X)}{P(\text{disease} \mid \text{no allele from } X)} \]
be the ancestry risk ratio at the locus in a study under the assumption of a multiplicative model. Let λ be the admixture proportion from the high-risk parental population (i.e., population X). Then P_d(θ = 0) for an intermixture model is (2, 3)
\[ P_d(\theta = 0) = \frac{\lambda \sqrt{r}}{\lambda \sqrt{r} + 1 - \lambda}. \]
Thus the likelihood of the observed ancestral alleles at marker j is
\[ L(r) = \prod_i \frac{(\lambda \sqrt{r})^{x_{ij}} (1 - \lambda)^{1 - x_{ij}}}{\lambda \sqrt{r} + 1 - \lambda}. \]
A standard likelihood ratio test or score test can be carried out to test the null hypothesis r = 1 (6). For the case–control test, a logistic regression can be applied (6), which is
\[ \log \frac{P(y_i \mid x_{ij})}{1 - P(y_i \mid x_{ij})} = \beta_0 + \beta_1 (x_{ij} - \bar{x}_i), \]
where y_i is the disease status for individual i, x̄_i is the average ancestry for individual i, and β_1 is the log odds ratio of disease for individuals with 2 vs. 0 allele copies from the high-risk parental population. The null hypothesis is β_1 = 0. It is straightforward to extend the above method to a quantitative trait (11). A linear regression can be directly applied as
\[ y_i = \beta_0 + \beta_1 (x_{ij} - \bar{x}_i) + e_i, \]
where the null hypothesis is β_1 = 0. 1.2. Inferring Locus-Specific Ancestry
It is straightforward to perform admixture mapping analysis if we know the locus-specific ancestry. When only marker genotypes are available, statistical methods to infer locus-specific ancestry have been developed (4–7, 12–15). A typical method of inferring locus-specific ancestry is based on a hidden Markov model (HMM). Let {g_t}, t = 1, ..., M, denote the M ordered observed genotypes along a chromosome and {ν_t}, t = 1, ..., M, the number of alleles being X by descent at the corresponding marker loci. We can model {g_t, ν_t} in an HMM, as illustrated:

observed genotypes:  g_1    g_2    ...    g_M
                      ↑      ↑             ↑
hidden states:       ν_1 →  ν_2 →  ... →  ν_M
by assuming conditional independence given the underlying unobservable states, that is, P(g_t | g_1, ..., g_{t−1}, ν_1, ..., ν_t) = P(g_t | ν_t). This assumption may not hold when dense markers are available, such as the markers used for a genome-wide association analysis, in which pairwise linkage disequilibrium in the ancestral populations is often present. In contrast, in the Markov hidden Markov model (MHMM) proposed by Tang et al. (12), the observed state g_t depends not only on ν_t but also on the past history, as illustrated by

observed genotypes:  g_1 →  g_2 →  ... →  g_M
                      ↑      ↑             ↑
hidden states:       ν_1 →  ν_2 →  ... →  ν_M

For computational tractability, Tang et al. (12) consider only the first-order Markovian dependence, that is,
\[ P(g_t \mid g_1, \ldots, g_{t-1}, \nu_1, \ldots, \nu_t) = \begin{cases} P(g_t \mid g_{t-1}, \nu_t) & \text{if } \nu_t = \nu_{t-1} \\ P(g_t \mid \nu_t) & \text{otherwise.} \end{cases} \]
Thus, the MHMM is more general than the HMM and has the advantage of allowing for background linkage disequilibrium in the ancestral populations. The transition matrix can be obtained based on a continuous gene-flow model as presented by Zhu et al. (3). An alternative flexible transition matrix is implemented in STRUCTURE (15, 16), which assumes an intermixing model, that is, all chromosomes in the sampled admixed subjects descended from a mixed group of ancestral chromosomes n generations ago, who have subsequently mated randomly (27).
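A toy forward-algorithm sketch of the HMM formulation in Python: the two hidden states stand for carrying 0 or 1 alleles X by descent at a locus (a simplification of the 0/1/2 states in the text), and all transition and emission probabilities are invented for illustration; real implementations such as ADMIXPROGRAM or STRUCTURE derive them from the admixture model.

```python
def forward(obs, states, start, trans, emit):
    """P(observed genotype sequence) under an HMM: standard forward recursion,
    using the conditional-independence assumption P(g_t | history) = P(g_t | v_t)."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for g in obs[1:]:
        alpha = {s: emit[s][g] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())

states = (0, 1)                       # alleles X by descent at a locus (toy)
start = {0: 0.5, 1: 0.5}
trans = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}
emit = {0: {"A": 0.2, "B": 0.8}, 1: {"A": 0.7, "B": 0.3}}
print(forward(("A", "A", "B"), states, start, trans, emit))
```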
2. Methods In this section, I describe the procedures for conducting admixture mapping analysis in practice when both genotype and phenotype data are available. 2.1. Step 1. Quality Controls
When raw genotype data are obtained, quality control checks are necessary before the formal data analysis is performed. Typical genotype data come from custom-designed chips, such as the iSelect Custom BeadChip, or from standard arrays used for whole-genome association studies, such as the Affymetrix 5.0 and 6.0 platforms or the Illumina Human660W-Quad or HumanOmni1-Quad. The standard QC includes removing either individuals or SNPs because of a low call rate. For example, a sample with a call rate less than 0.9 may be removed, and an SNP with a call rate less than 0.95 may also be removed. Illumina platforms use the software GenomeStudio to make the genotype calls. An important parameter is the GenTrain score, ranging from 0 to 1, which is calculated from the GenTrain clustering algorithm. SNPs are often sorted by GenTrain score in the SNP table. SNPs with lower scores have poor clustering in the SNP graph and should be excluded from the analysis. The next level of QC includes examining heterozygosity, which measures the degree of inbreeding. Too low or too high a heterozygosity (defined as more than 4 SD below or above the mean) indicates possible DNA contamination or poor DNA quality. In admixture mapping analysis, the subjects are assumed to
Fig. 2. PCA (left panel) and MDS (right panel) of the genotype data of the 701 African Americans sampled from Maywood, Illinois. The PCA and MDS result in consistent patterns. The subjects within circles may indicate relatedness, and special attention should be paid to them.
be unrelated. To check for relatedness in the samples, a pairwise identity-by-descent (IBD) score is examined for each pair of samples. The outliers of the IBD scores can be determined based on the distribution of the IBD scores. One subject of each outlying pair should be removed; the subject with the lower genotyping rate is usually selected for removal. In addition, samples with IBD of 5% or more with other samples are also removed because of possible DNA contamination. Multidimensional scaling (MDS) or principal components (PCs) are used to estimate population substructure, and the identified outliers are excluded from the analysis. The results of MDS and PCs are usually robust, as indicated in Fig. 2, where the subjects in the circles indicate that these samples are related and may be excluded from the analysis. All the above procedures can be performed with the popular software PLINK (17). However, we do not suggest using HWD to filter SNPs (see Note 1). 2.2. Step 2. Inferring an Individual's Locus-Specific Ancestry (Local Ancestry)
After QC, we assume all the markers are correctly genotyped. If AIMs are selected, we can use the program ADMIXPROGRAM (4), STRUCTURE (15, 16), or ANCESTRYMAP (5) to estimate the locus-specific ancestry. We usually assume the AIMs are in linkage equilibrium in the ancestral populations, so we would like to examine whether any AIMs are in linkage disequilibrium in those populations. We can examine the LD among the AIMs in the HapMap data using the software PLINK; only one AIM should be retained if several AIMs are in strong LD in the ancestral populations. For the program ADMIXPROGRAM, the input genotype and phenotype files have the same format as for STRUCTURE. The first row gives the genetic distance between each pair of neighboring AIMs.
472
X. Zhu
Table 2 The input file for ADMIXPROGRAM
The first line represents the genetic distance between two adjacent markers. The distance between the last marker of one chromosome and the first marker of the next chromosome is 1. From the second line onward, the first column is the individual ID, followed by population ID, gender, and affected/not affected status, then two alleles for each marker separated by a space
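The example rows of Table 2 are not reproduced here. Under the layout described in the legend, a hypothetical fragment with four AIMs might look like the following (all values invented; the exact column conventions are as in the legend, with the chromosome-boundary distance coded as 1):

```
1 0.05 0.12 1
IND001 1 2 1  0 1  1 1  0 0  1 0
```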
Table 3
An example of the parameter file for ADMIXPROGRAM

600     African American sample size
2,806   Number of AIMs
2       Which ancestral population allele frequencies will be used. 0: no information, 1: only European, 2: only African, and 3: both European and African
6       Initial value of the number of generations since population admixture occurred
0       Use all, odd, or even markers. 0: all markers, 1: odd markers, and 2: even markers
0       Is there a file including the information of bad markers? 0: no and 1: yes
2       Estimate case-only Z-score and case–control Z-score. 0: no, 1: case-only Z-score, and 2: both case-only and case–control Z-scores
The distance between the last AIM on one chromosome and the first AIM on the next chromosome is coded as 1. The first four columns give the individual ID, population information, gender, and affection status. The remaining columns are the genotype codes, two columns per AIM, each column indicating one allele coded 0 or 1. An example of the input file is presented in Table 2. The parameter file includes the number of individuals, whether an ancestral population allele frequency file is provided, and an initial number of generations since the admixture events occurred (see Table 3). The parameter file also specifies which set of AIMs will be used to analyze the data, for example, all the markers, or even or
odd markers only. If any AIMs should be excluded from the analysis, they can also be flagged. The program will output either case-only or case–control Z-scores for admixture mapping analysis, as requested by the user, as well as the locus-specific ancestries for each individual. The results from ADMIXPROGRAM can be examined in several ways. The estimated average admixture rate can be examined as a first step: if the AIMs do not provide enough information for estimating the average admixture rate, the estimate is likely to come out near a 50%/50% admixture rate for the two ancestral populations. For African Americans, the estimated admixture rate is usually close to 20% European and 80% African ancestry, that is, a 20%/80% admixture. The software also outputs the estimated allele frequencies in the ancestral populations. Although these frequencies will not be the same as those observed in HapMap samples, for example, HapMap CEU and YRI, they should be highly correlated. However, if a region is under strong selection pressure, we may identify substantial differences between the estimated allele frequencies and those observed in the HapMap data. It should be noted that we have not observed any region with selection evidence strong enough to produce a strong deviation of allele frequencies, except in Puerto Ricans, who are admixed from Europeans, Native Americans, and Africans (18). For example, Zhu and Cooper (19) used ADMIXPROGRAM to estimate the allele frequencies of the AIMs in the European and African ancestral populations using the 1,743 African Americans enrolled in the Dallas Heart Study (DHS). The allele frequencies in the ancestral populations estimated by ADMIXPROGRAM and the corresponding current European and African allele frequencies, estimated from European Americans in the DHS and from Africans in the literature (26) (for the AIMs), are plotted in Fig. 3. The observed and estimated allele frequencies are highly correlated (correlation coefficient >0.97).
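As a quick numerical illustration of this sanity check (the frequency values below are invented, not taken from the DHS data), the correlation between estimated and observed AIM frequencies can be computed directly:

```python
import numpy as np

# Hypothetical estimated vs. observed allele frequencies at five AIMs.
est = np.array([0.10, 0.35, 0.62, 0.81, 0.28])  # ADMIXPROGRAM estimates
obs = np.array([0.12, 0.33, 0.60, 0.85, 0.25])  # e.g., reference-panel values

r = np.corrcoef(est, obs)[0, 1]
# A high positive r (>0.97 in the DHS example) supports the estimates;
# a strongly negative r would hint at switched ancestral labels (Fig. 4).
label_switch_suspected = r < 0
```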
However, it may happen that the allele frequency estimates in the ancestral populations are flipped (Fig. 4). In this case, different initial allele frequencies should be used in ADMIXPROGRAM: one way is to switch the estimated allele frequencies between the ancestral populations and use these as the initial allele frequencies in the next run of ADMIXPROGRAM. The software ADMIXPROGRAM usually switches the estimated ancestral population allele frequencies automatically. We also suggest running ADMIXPROGRAM at least three times: (1) without using ancestral population allele frequencies, (2) with allele frequencies provided for one ancestral population, and (3) with allele frequencies provided for both ancestral populations. The final results should then be compared by examining the negative log-likelihood values; if the program runs correctly, all the negative log-likelihood values should be similar. STRUCTURE (15, 16) and ANCESTRYMAP (5) are popular programs for inferring locus-specific ancestries when only AIMs
Fig. 3. Comparison of estimated allele frequencies in ancestral populations and observed allele frequencies in European and African populations (19). A. European ancestral allele frequencies estimated by ADMIXPROGRAM in the African American sample vs. the observed allele frequencies in European Americans. B. African ancestral allele frequencies estimated by ADMIXPROGRAM in the African American sample vs. the observed allele frequencies given by the weighted average from Ghana and Cameroon obtained from Smith et al. (26). The points along the off-diagonal line suggest that those SNPs have switched allele labels.
Fig. 4. Comparison of estimated allele frequencies in ancestral populations and observed allele frequencies in European and African populations when the estimated ancestral allele frequencies have been switched between the two ancestral populations.
Table 4
Programs for admixture mapping or inferring locus-specific ancestries

Program             References   Method         AIMs                    No of ancestral populations   Background LE
STRUCTURE/MALDSOFT  (7, 15, 16)  HMM MCMC       Microsatellite or SNPs  No limit                      Yes
ADMIXMAP            (2, 6)       HMM MCMC       Microsatellite or SNPs  No limit                      Yes
ANCESTRYMAP         (5)          HMM MCMC       SNPs                    2                             Yes
ADMIXPROGRAM        (3, 4)       HMM, ML        SNPs                    2                             Yes
SABER               (12)         MHMM ML        SNPs                    No limit                      No
HAPMIX              (25)         MHMM ML        SNPs                    2                             No
LAMP, LAMP-ANC      (14)         Moving window  SNPs                    No limit                      No
HAPAA, uSWITCH      (13)         HMM            SNPs                    No limit                      No
are available. STRUCTURE allows more than two ancestral populations, while both ADMIXPROGRAM and ANCESTRYMAP allow only two. STRUCTURE provides several models for admixed populations, including the no-admixture model, the admixture model, and the linkage model. For admixed populations such as African Americans or Mexican Americans, the linkage model is suggested to give the best results. Both STRUCTURE and ANCESTRYMAP are based on MCMC algorithms. STRUCTURE requires users to provide a burn-in length and the number of runs after burn-in; a burn-in of 10,000–100,000, with an additional 10,000–100,000 runs, is suggested. This is very time consuming when the number of subjects exceeds 5,000 and the number of AIMs exceeds 3,000. When the AIMs are well selected (highly informative AIMs), we have found that a burn-in length of 5,000 followed by an additional 5,000 runs is usually adequate. When genome-wide scan data are available, such as 500K SNPs or more, the AIMs can be selected from the available data, although information may be lost. Statistical methods and software have been developed to analyze such large data sets. SABER (12) applies an MHMM to incorporate the dense SNPs with possible background LD, which is typical for dense panels of 500K SNPs or more. Other similar software packages are listed in Table 4. Because the dense SNPs are used to reconstruct ancestry blocks, these methods are more accurate than using AIMs, but at the cost of more computation time.
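For illustration, the shortened run suggested above might be configured in STRUCTURE's parameter files roughly as follows (parameter names follow STRUCTURE's documented mainparams/extraparams conventions; the values are the chapter's suggestion for highly informative AIMs):

```
// mainparams (fragment)
#define BURNIN  5000   // length of the burn-in period
#define NUMREPS 5000   // number of MCMC reps after burn-in

// extraparams (fragment)
#define LINKAGE 1      // use the linkage model for admixed populations
```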
2.3. Step 3. Association Analysis of Testing Locus-Specific Ancestry
Once the locus-specific ancestries are inferred, testing the association between locus-specific ancestry and a phenotype is quite straightforward, although certain data manipulations are necessary. For case-only data, ADMIXPROGRAM, STRUCTURE/MALDSOFT, ANCESTRYMAP, and ADMIXMAP can output either a Z-score or a LOD score at the AIMs. For case–control data or quantitative traits, logistic regression or linear regression analysis can be applied using the SAS or R statistical packages. Let reg1 be a SAS data set including the phenotype, covariates, and ancestry proportion at each marker, which can be obtained from the output of ADMIXPROGRAM or STRUCTURE. We can apply the following SAS macro for association testing.
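The SAS macro itself is not reproduced here. As a language-neutral sketch of the same per-marker test (all variable names hypothetical; the P value uses a normal approximation rather than the t distribution to stay within NumPy and the standard library), the regression for one AIM can be written as:

```python
import math
import numpy as np

def ancestry_assoc(y, ancestry, covars=None):
    """OLS of phenotype y on locus-specific ancestry (plus optional
    covariates). Returns (estimate, std. error, t value, P value) for
    the ancestry coefficient; P uses a two-sided normal approximation."""
    y = np.asarray(y, dtype=float)
    cols = [np.ones_like(y), np.asarray(ancestry, dtype=float)]
    if covars is not None:
        cols.extend(np.atleast_2d(np.asarray(covars, dtype=float).T))
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    df = len(y) - X.shape[1]
    sigma2 = resid @ resid / df                # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)      # coefficient covariance
    est, se = beta[1], math.sqrt(cov[1, 1])
    t = est / se
    p = math.erfc(abs(t) / math.sqrt(2))       # two-sided normal approx.
    return est, se, t, p
```

In practice this test would be run in a loop over all markers, producing one row of output per AIM, as in the table below.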
In the SAS data file prob_1, the following information will be output for each AIM. The P value tests the association between the marker-specific ancestry and the phenotype.
Estimate   Std. err.   t Value    P          -log10(P)
0.174361   0.657886    0.265032   0.790994   0.101827
0.095116   0.64143     0.148287   0.882121   0.054472
0.019573   0.637086    0.030722   0.975492   0.010776
0.015492   0.629289    0.024618   0.98036    0.008614
0.02246    0.613737    0.03659    0.970814   0.012864
0.020716   0.613837    0.033749   0.973078   0.011852
0.245184   0.619785    0.395596   0.692417   0.159633
0.318612   0.620987    0.513075   0.607917   0.216156
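Empirical genome-wide significance for such a scan is often assessed by permutation, permuting phenotypes across individuals while keeping the marker (local-ancestry) data fixed. A sketch of that scheme follows (array names hypothetical; a simple correlation statistic stands in for the regression test):

```python
import numpy as np

def genomewide_perm_p(y, ancestry_mat, n_perm=1000, seed=1):
    """Empirical genome-wide P value: permute phenotypes across
    individuals (marker data fixed) and compare the observed maximum
    per-marker |correlation| with its permutation null."""
    rng = np.random.default_rng(seed)

    def max_stat(yv):
        yc = yv - yv.mean()
        A = ancestry_mat - ancestry_mat.mean(axis=0)
        r = yc @ A / (np.linalg.norm(yc) * np.linalg.norm(A, axis=0))
        return np.abs(r).max()

    obs = max_stat(y)
    null = [max_stat(rng.permutation(y)) for _ in range(n_perm)]
    # Add-one correction so the empirical P is never exactly zero.
    return (1 + sum(s >= obs for s in null)) / (n_perm + 1)
```

Because only the phenotype vector is shuffled, the locus-specific ancestries never need to be re-estimated within the permutation loop.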
The Z-scores from a case-only analysis, or the P values, can be summarized as shown in Fig. 5. When a large Z-score, such as one greater than 3, is observed, additional analysis should be performed to determine whether the significant result is due to genotyping error (see Note 2). Since admixture mapping tests the association between a phenotype and locus-specific ancestries, which can be highly correlated because of the recent admixture, the total number of independent tests in the whole genome can be much less than the number of tests actually conducted. It has been suggested that there are about 1,000 independent tests (7), which is consistent with a simulation study of African American samples (4). For case–control analysis or quantitative traits, a permutation test can also be applied to obtain empirical P values. However, the permutation should be performed by permuting each individual's phenotype and covariates together, keeping the entire marker data unchanged; thus, we do not need to re-estimate the locus-specific ancestries in each permutation. This permutation method can be applied to determine the empirical genome-wide significance level.

2.4. Additional Remarks
Admixture mapping analysis is similar to linkage analysis in which an entire admixed population is treated as one large family. Unlike genome-wide association studies, admixture mapping identifies chromosomal regions that may harbor disease variants. Such regions range from 10 to 20 Mb in length and may include many genes. When genome-wide data are available, several thousand markers genotyped on an array may also fall into the regions identified by admixture mapping. Follow-up direct association tests can then be performed for these SNPs, using standard association methods, to search for the SNPs responsible for the evidence observed in the admixture mapping analysis. This may also be done while accounting for population stratification, using
Fig. 5. The genome-wide Z-scores of admixture mapping in the Dallas Heart Study (19). Top: the Z-scores calculated using hypertensive cases only; Bottom: the Z-scores calculated based on case–control samples.
methods such as principal component analysis (20, 21) or the genomic control approach (22). However, only the markers with substantial allele frequency differences between the ancestral populations can contribute to the association evidence in admixture mapping analysis. Thus, rather than testing all available SNPs in a region, we only need to test the subset of markers whose allele frequencies differ between the ancestral populations by more than a predefined value (such as 0.2). This procedure can substantially reduce the number of tests. Furthermore, these selected SNPs can be strongly correlated because of the population admixture, and the actual number of tests can be estimated by the methods proposed in (23, 24). To define a region in admixture mapping analysis, we can use the 1-unit drop region from a peak of -log10(P value). The size of the region depends on the peak value, and the number of regions depends on how large a peak we want to follow up in further studies. For example, if we choose peaks having -log10(P value) greater than 3, we will expect one region if we assume there
are 1,000 independent regions in the genome under the null hypothesis that no genetic variant contributes to the disease disparity among the ancestral populations. This number increases to about 10 if we select regions having -log10(P value) greater than 2. To determine the significance level of a test, we should add the number of independent tests in the admixture mapping analysis and the numbers of tests in all the regions selected for single-marker association tests. Even so, the total number of tests is much less than the number of tests in a standard genome-wide association study, which is typically 500K to 1 million. Typically, a genome-wide significant P value is around 10^-5 to 10^-6, depending upon what criterion (the peak of -log10(P value)) is used for selecting regions for follow-up association analyses. This is one of the important advantages of admixture mapping: it reduces the penalty due to a large number of multiple comparisons. Finally, it should be noted that admixture mapping can be used to detect risk variants that are less frequent in a high-risk ancestral population than in a lower-risk ancestral population (Note 3).
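The back-of-the-envelope arithmetic behind that threshold can be made explicit (the follow-up test count below is an assumed illustration, not a figure from the chapter):

```python
# Bonferroni-style threshold over admixture-mapping tests plus
# follow-up single-marker tests in the selected regions.
n_admixture_tests = 1000  # ~1,000 independent tests genome-wide (7)
n_followup_tests = 4000   # assumed: e.g., 10 regions x 400 informative SNPs
alpha = 0.05

threshold = alpha / (n_admixture_tests + n_followup_tests)
# threshold is 1e-5, in line with the quoted 10^-5 to 10^-6 range,
# versus roughly 5e-8 for a standard 1-million-SNP genome-wide scan.
```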
3. Notes

1. Hardy-Weinberg equilibrium (HWE) is often used in QC to exclude SNPs. However, Hardy-Weinberg disequilibrium (HWD) can be created by the population admixture process itself. For example, suppose two ancestral populations have genotype frequencies (0.01, 0.18, 0.81) and (0.81, 0.18, 0.01) for genotypes (AA, Aa, aa), respectively. Assuming a 50%/50% admixture rate, the genotype frequencies in the admixed population are (0.41, 0.18, 0.41), with allele frequencies 0.5 for both A and a, so HWE is violated in the admixed population. Thus, we do not suggest using HWD to filter SNPs.

2. When a large Z-score is observed, caution should be exercised before claiming the success of the analysis. It is possible that a large Z-score is driven by specific SNPs whose genotyping quality is questionable or that are in strong LD. One way to examine this problem is to redo the analysis using half the markers, for example, only the even- or odd-numbered markers. If the result after reducing the markers is consistent with that using all the markers, the observed large Z-score may be robust.

3. It is often mistakenly believed that only risk variants that are more frequent in the high-risk ancestral population can be detected in admixture mapping studies. In fact, admixture mapping can also detect risk variants that are more frequent in the low-risk ancestral population. Indeed, the power of admixture mapping is larger when the risk allele is more frequent in the low-risk
ancestral population, with a low admixture rate contribution to the admixed population.
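The numerical example in Note 1 can be checked directly:

```python
# Verify Note 1: a 50/50 mixture of two populations, each in HWE,
# shows a heterozygote deficit (HWD) despite an allele frequency of 0.5.
p1 = (0.01, 0.18, 0.81)  # genotype freqs (AA, Aa, aa), population 1
p2 = (0.81, 0.18, 0.01)  # population 2
mix = tuple((a + b) / 2 for a, b in zip(p1, p2))  # admixed population

freq_A = mix[0] + mix[1] / 2              # allele frequency of A
expected_het = 2 * freq_A * (1 - freq_A)  # HWE expectation
# Observed heterozygosity mix[1] = 0.18 vs. expected 0.5: HWE is violated.
```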
Acknowledgment

This work was supported by a grant from the National Human Genome Research Institute (HG003054).

References

1. Risch NJ (1992) Mapping genes for complex disease using association studies with recently admixed populations. Am J Hum Genet 51:13
2. McKeigue PM (1998) Mapping genes that underlie ethnic differences in disease risk: methods for detecting linkage in admixed populations, by conditioning on parental admixture. Am J Hum Genet 63:241–251
3. Zhu X, Cooper RS, Elston RC (2004) Linkage analysis of a complex disease through use of admixed populations. Am J Hum Genet 74:1136–1153
4. Zhu X, Zhang S, Tang H, Cooper R (2006) A classical likelihood based approach for admixture mapping using EM algorithm. Hum Genet 120:431–445
5. Patterson N, Hattangadi N, Lane B, Lohmueller KE, Hafler DA, et al. (2004) Methods for high-density admixture mapping of disease genes. Am J Hum Genet 74:979–1000
6. Hoggart CJ, Shriver MD, Kittles RA, Clayton DG, McKeigue PM (2004) Design and analysis of admixture mapping studies. Am J Hum Genet 74:965–978
7. Montana G, Pritchard JK (2004) Statistical tests for admixture mapping with case-control and cases-only data. Am J Hum Genet 75:771–789
8. Zhang C, Chen K, Seldin MF, Li H (2004) A Hidden Markov modeling approach for admixture mapping based on case-control data. Genet Epidemiol 27:225–239
9. Zhu X, Tang H, Risch N (2008) Admixture mapping and the role of population structure for localizing disease genes. Adv Genet 60:547–569
10. Sha Q, Zhang X, Zhu X, Zhang S (2006) Analytical correction for multiple testing in admixture mapping. Hum Hered 62:55–63
11. Basu A, Tang H, Arnett D, Gu CC, Mosley T, et al. (2009) Admixture mapping of quantitative trait loci for BMI in African Americans: evidence for loci on chromosomes 3q, 5q, and 15q. Obesity (Silver Spring) 17:1226–1231
12. Tang H, Coram M, Wang P, Zhu X, Risch N (2006) Reconstructing genetic ancestry blocks in admixed individuals. Am J Hum Genet 79:1–12
13. Sundquist A, Fratkin E, Do CB, Batzoglou S (2008) Effect of genetic divergence in identifying ancestral origin using HAPAA. Genome Res 18:676–682
14. Sankararaman S, Sridhar S, Kimmel G, Halperin E (2008) Estimating local ancestry in admixed populations. Am J Hum Genet 82:290–303
15. Pritchard JK, Stephens M, Donnelly P (2000) Inference of population structure using multilocus genotype data. Genetics 155:945–959
16. Falush D, Stephens M, Pritchard JK (2003) Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164:1567–1587
17. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81:559–575
18. Tang H, Choudhry S, Mei R, Morgan M, Rodriguez-Cintron W, et al. (2007) Recent genetic selection in the ancestral admixture of Puerto Ricans. Am J Hum Genet 81:626–633
19. Zhu X, Cooper RS (2007) Admixture mapping provides evidence of association of the VNN1 gene with hypertension. PLoS ONE 2:e1244
20. Zhu X, Zhang S, Zhao H, Cooper RS (2002) Association mapping, using a mixture model for complex traits. Genet Epidemiol 23:181–196
21. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, et al. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38:904–909
22. Devlin B, Roeder K (1999) Genomic control for association studies. Biometrics 55:997–1004
23. Li J, Ji L (2005) Adjusting multiple testing in multilocus analyses using the eigenvalues of a correlation matrix. Heredity 95:221–227
24. Nyholt DR (2004) A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other. Am J Hum Genet 74:765–769
25. Price AL, Tandon A, Patterson N, Barnes KC, Rafaels N, et al. (2009) Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genet 5:e1000519
26. Smith MW, Patterson N, Lautenberger JA, Truelove AL, McDonald GJ, et al. (2004) A high-density admixture map for disease gene discovery in African Americans. Am J Hum Genet 74:1001–1013
27. Long JC (1991) The genetic structure of admixed populations. Genetics 127:417–428
Chapter 26

Identifying Gene Interaction Networks

Gurkan Bebek

Abstract

In this chapter, we introduce interaction networks by describing how they are generated, where they are stored, and how they are shared. We focus on publicly available interaction networks and describe a simple way of utilizing these resources. As a case study, we use Cytoscape, an open source and easy-to-use network visualization and analysis tool, to first gather and visualize a small network. We then analyze this network's topological features and look at the functional enrichment of the network nodes by integrating the gene ontology database. The methods described are applicable to larger networks that can be collected from various resources.

Key words: Interaction networks, Protein–protein interactions, Gene ontology, Cytoscape, Pathways, Network
1. Introduction

A gene interaction network is a set of genes (nodes) connected by edges representing functional relationships among these genes. These edges are named interactions, since the two genes are thought either to interact physically through their gene products, e.g., proteins, or one gene alters or affects the activity of the other gene of interest. The functional products of genes, e.g., proteins, work together to achieve particular tasks, and they often physically associate with each other to function or to form a more complex structure. These interactions can be long lasting, such as when forming protein complexes, or brief, as when proteins modify each other, such as the phosphorylation of a target protein by a protein kinase. Since these interactions are important for carrying out most biological processes, knowledge about interacting proteins is crucial for understanding these biological functions, which can readily be done by studying networks of these interactions.
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_26, © Springer Science+Business Media, LLC 2012
G. Bebek
Besides these physical interactions, there are genetic interactions, in which two gene variants have a combined effect that is not manifested by either one of them alone. These genetic interactions are also measured at high throughput. There are two general categories of such interactions: synthetic lethal interactions and suppressor interactions. Synthetic lethal interactions arise when mutations in two nonessential genes combine to produce an overall lethal effect, and suppressor interactions occur when a lethal variant of one gene is "negated" by a variant of another gene. These types of interactions are essential for understanding pathways and regulation in model organisms (1–3), as well as for providing insight into complex diseases (4). In recent years, high-throughput methodologies, such as the yeast two-hybrid (Y2H) system (5–7), co-immunoprecipitation followed by mass spectrometry (8, 9), and tandem affinity purification (TAP) (10), have been widely used to identify physical protein–protein interactions in a wide range of organisms. In addition, genetic interactions have been mapped for humans in high-throughput drug screens (11) and for many other organisms (12). As a result, in the past decade the number of known protein–protein interactions has increased significantly, and various public databases have been created to share these findings. There are also computational approaches that are used to predict protein–protein interactions; these utilize genomic data to establish a structural or evolutionary link among protein pairs (13, 14) or predict novel interactions by analyzing known interactions (15–17). Almost all of the interactions discovered through experiments are collected in public databases. While fairly new, these databases constantly grow in size and present protein–protein interaction data for multiple organisms. Initially, these databases collected their datasets independently.
However, through the International Molecular Exchange Consortium, these databases now keep a nonredundant set of protein–protein interaction data from a broad taxonomic range of organisms. Moreover, these databases commit to providing their datasets in standard file formats, such as MITAB or PSI-MI XML 2.5. Currently, the databases listed in Table 1 are actively producing relevant numbers of records curated to these standards and provide them via the Proteomics Standard Initiative Common Query Interface (PSICQUIC) service. These databases maintain interactions that can be all-encompassing, such as IntAct, MINT, and DIP; organism centric, such as BioGRID or MPIDB; or biological domain centric, such as MatrixDB. Regardless of the database, however, these interactions are available in standards-compliant, tab-delimited, and XML formats. Currently, these databases carry some redundancy; however, as the data collection pipelines for each database are established and internal data management systems are implemented, access to interaction datasets will become more robust.
26
Identifying Gene Interaction Networks
485
Table 1
The International Molecular Exchange Consortium partner databases: public databases that provide a nonredundant set of protein–protein interactions for a range of organisms are listed (as of January 2011)

Database       Web site                             Number of PPI
DIP (39)       http://dip.doe-mbi.ucla.edu          107,619
IntAct (33)    http://www.ebi.ac.uk/intact          272,410
MINT (40)      http://mint.bio.uniroma2.it/mint     90,537
MPact (23)     http://mips.gsf.de/genre/proj/mpact  15,454
MatrixDB (41)  http://matrixdb.ibcp.fr              845
MPIDB (42)     http://www.jcvi.org/mpidb            24,295
BioGRID (37)   http://www.thebiogrid.org            365,574
InnateDB (43)  http://www.innatedb.com              9,909
BIND (44)      http://www.blueprint.org             192,961
Visualization and analysis of these interaction networks is also essential for researchers. In recent years, a variety of software for various platforms has been introduced, such as Cytoscape (18), Osprey (19), and Pajek (20). These tools visualize networks by employing graph layout algorithms and by displaying data attributes as well as visual mappings (e.g., protein images, coloring). Moreover, via filters and plug-ins, they analyze these networks and assist in the integration of external data sources such as the gene ontology (21, 22). In this chapter, we will build an interaction network using one of these databases as a resource. We will visualize the network and analyze it using Cytoscape (18), an open source (and free), platform-independent network browser. We will first acquire a publicly available interaction dataset; we will then visualize this network, analyze its network properties, and, using available plug-ins, check its functional enrichment. In the near future, we will have access to more centralized systems for maintaining and storing these datasets, so this approach can easily be extended to multiple datasets, broadening its use in biomedical research.
2. Methods

IntAct, a network database maintained by EMBL-EBI, hosts almost 300,000 binary interactions shared among more than 50,000 proteins (23). In this chapter, to establish an easier example to follow, a
Fig. 1. An IntAct database of the month archive entry is shown. Like most databases, IntAct provides the data shared in multiple formats. The images in the figure are links to the same dataset in different formats. Internal link to IntAct table page with external links (left ); PSI-MI XML (versions 1.0 and 2.5) (middle); and PSI-MI Tab (right ).
small dataset from IntAct will be acquired and analyzed. The same methodology can be extended to larger datasets as needed.

2.1. Acquiring Datasets
The IntAct database of protein–protein interactions can be accessed freely through the IntAct Web site (http://www.ebi.ac.uk/intact). The whole database can be downloaded in PSI-MI XML or PSI-MI TAB format, or it can be queried for a list of proteins via a Web interface; the databases mentioned in Table 1 provide similar services. IntAct also provides analysis tools with documentation. In this chapter, a more general approach, applicable to a wider number of datasets, is described. All interactions in IntAct are derived from literature curation or direct user submissions. IntAct also provides these datasets in smaller chunks; for a list of highlighted datasets, please visit IntAct's Dataset of the month archive (24). In this chapter, as an example, we are going to download and analyze the interactions submitted in support of a study that targeted the phosphatidylinositol 3-kinase-mammalian target of rapamycin (PI3K-mTOR) pathway (25). As shown in Fig. 1, the dataset can be accessed in multiple formats; the basic plain text version (PSI-MI Tab) will be used for simplicity. After downloading the dataset, simply open it in a text editor or spreadsheet editor. The tab-delimited file is organized as a table, where each line describes an interaction (an extended edge list), and each record contains many fields that describe the interaction. For visualization purposes, the gene name field is preferred, since most genes/proteins are annotated with these names. To be able to load this file into Cytoscape, use the Find-Replace function of the editor to remove the "uniprotkb:" and "(gene name)" texts from the records. Essentially, an edge list with two columns holding gene names is created. Similar datasets can be generated from other sources. Although there are means to import the XML dataset into Cytoscape (see Note 1), the example will focus on the flat file for applicability to other databases or sources.
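The Find-Replace step can equally be scripted. A minimal sketch follows (the sample record is invented but follows the PSI-MI TAB alias style named in the text):

```python
# Strip "uniprotkb:" prefixes and "(gene name)" suffixes so that the
# gene-name columns load into Cytoscape as a plain edge list.
def clean_field(field: str) -> str:
    return field.replace("uniprotkb:", "").replace("(gene name)", "")

record = "uniprotkb:MTOR(gene name)\tuniprotkb:RPTOR(gene name)"
cleaned = "\t".join(clean_field(f) for f in record.split("\t"))
# cleaned == "MTOR\tRPTOR"
```

Applied line by line to the downloaded file, this produces the same two-column, gene-name edge list as the manual edit.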
2.2. Visualizing Interaction Networks
Cytoscape is a platform-independent application (http://www.cytoscape.org) (18). The open source software supports many standard network and annotation file formats, including Simple Interaction Format (SIF), XML-based BioPAX, PSI-MI, SBML, etc. The software can also load delimited text files and MS Excel™
Workbooks. In this chapter, loading text files is described, to ensure the applicability of this methodology to various data sources. Moreover, Cytoscape can also import data files, such as expression profiles, GO annotations generated by other applications or spreadsheet programs, or images for nodes. Using these features, you can load and save arbitrary attributes on nodes, edges, and networks; for instance, you can input a set of custom annotation terms for a protein (26), create a set of confidence values for protein–protein interactions, and filter them later in the software (27). Cytoscape can establish powerful visual mappings across systems biology, genomics, and proteomics data. It supports advanced analysis and modeling via specific plug-ins that can be installed with one click; visualization and analysis of human-curated pathway datasets such as Reactome or KEGG is also possible. After launching the software, make sure that the plug-in required for this chapter, BiNGO, is loaded. If not installed, BiNGO can be installed through the Plug-in Manager under the Plug-ins menu (search for BiNGO and install the latest supported version). The network prepared in Subheading 2.1 can be loaded by clicking Import > Network From Table (Text/MS Excel). . . under the File menu. The dialog box shown in Fig. 2 will ask for the columns to be picked. As the file was altered, the gene names correspond to columns 3 and 4. First, click on Show Text File Import Options for more options. Since the first line contains headers, ignore it by incrementing the Start Import Row to 2. Next, select column 3 and column 4 in the source interaction and target interaction drop-down lists. This should highlight the preview as shown in the dialog box (Fig. 2). The network should load after the Import button is clicked. The user can manipulate the network by selecting and dragging nodes, or alter the layout using the Layout menu.
For further details of the available layouts and options, please refer to the software documentation. In this example, the network is shown in the yFiles > Organic layout found under the Layout menu.

2.3. Analyzing Interaction Networks
In order to gain more insight into the network, a number of network analyses can be performed. Network Topology gives an overview of topological features, including the diameter, degree distribution, shortest path distribution, and clustering coefficient of the interaction network. A path in a protein–protein interaction network is a list of nodes in which each node has an edge to the next. The shortest path between two nodes is the path connecting them with the fewest edges. The diameter of the network is the maximum shortest-path distance over all pairs of distinct nodes in the graph. These values are mostly used to see how connected a network is.
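These topological quantities are straightforward to compute directly; the sketch below does so for a toy undirected network using breadth-first search (the adjacency list is invented purely for illustration):

```python
from collections import deque

# Toy undirected network as an adjacency dict (hypothetical nodes).
net = {
    "A": {"B", "C"}, "B": {"A", "C", "D"},
    "C": {"A", "B"}, "D": {"B", "E"}, "E": {"D"},
}

def bfs_distances(net, start):
    """Breadth-first search: shortest-path distance from start to each node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in net[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Diameter: maximum shortest-path distance over all pairs of nodes.
diameter = max(max(bfs_distances(net, u).values()) for u in net)

def clustering_coefficient(net, u):
    """Fraction of pairs of u's neighbors that are themselves connected."""
    nbrs = list(net[u])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in net[nbrs[i]])
    return 2 * links / (k * (k - 1))

print(diameter)                          # 3 (e.g., A -> B -> D -> E)
print(clustering_coefficient(net, "B"))  # 1/3: only (A, C) of B's neighbor pairs touch
```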
Fig. 2. Screen shot of Cytoscape’s import network from table (Text/MS Excel). . . dialog box. As input, a flat file is selected, and appropriate columns are marked to import the text file as an edge list into the software.
The degree distribution gives the proportion of nodes in a graph with a specified number of edges. In protein–protein interaction networks this distribution is commonly assumed to follow a power law, i.e., the fraction of nodes with a given degree decays as a power of that degree, and there has been much interest in how this property relates to the robustness of a system represented as an interaction network (28–30). The clustering coefficient is another measure that describes how well connected the neighbors of a node are, ranging from zero to one. The Network Analysis plug-in displayed under the Plug-ins menu of Cytoscape has various functions that can be utilized for these analyses. In Fig. 3, we selected the whole network and selected Analyze Network, treating it as an undirected graph. The plug-in displays various simple-to-calculate features of the network, such as the degree distribution (Subheading 2.2), the number of connected components, etc. When the Visualize Parameters button is clicked, for
Fig. 3. Cytoscape (18) and the network analysis plug-in are shown. The plot at the bottom left shows that the degree distribution of the analyzed network likely follows a power law. The nodes in the network are sized by their node degree.
instance by selecting the node degree from the drop-down list named Map node size to, this information can be reflected onto the network for a more visual representation. The next tab in this analysis window shows the node degree distribution, where a power law line can be fitted (as seen in Fig. 3) to the degree distribution. Other properties and functions can be accessed by clicking through the tabs in this window, and detailed documentation can be accessed by hitting Help.

2.4. Functional Enrichment Analysis
We further analyze the network by querying the network genes for overrepresentation of gene ontology annotations. The network we established is the result of a study that targeted the PI3K-mTOR pathway. This pathway plays pivotal roles in cell survival, growth, and proliferation downstream of growth factors (25). Moreover, its perturbation has been associated with cancer progression, type 2 diabetes, and neurological disorders. In Fig. 4, we show how one of the plug-ins, BiNGO (22), can be used to investigate this network. We first selected the biggest component of the interaction network we established (Fig. 3). The 56 selected nodes are then passed
onto the BiNGO plug-in by launching it from the Plug-ins menu. Before running the plug-in, adjust the input form as described below:
1. Give a name to your new analysis in the Cluster name text box.
2. Leave the Get Cluster from Network box checked.
3. Select the organism of the network from the Select organism/annotation drop-down list; select Homo sapiens for this example. If, in the future, you require a different species, you will need a Gene Association file for that species. You can download one from The Gene Ontology Project (21) and load it as a custom file.
4. Select the statistical test that is required. In this example, we used the Hypergeometric test. The Binomial test is preferred when the amount of data is very large.
5. Select the statistical correction that will be used. You can choose the Benjamini and Hochberg False Discovery Rate (FDR). This is the correction to use in most cases, as the Bonferroni correction is too conservative. The Bonferroni correction can be used when you want to demonstrate that your results are significant beyond doubt.
6. Choose a significance level. You can leave the default of 0.05. This threshold controls which nodes are detailed.
7. Under Select the categories to be visualized, choose the overrepresented categories after correction so that they are visualized.
8. Select the ontology file that will be used for the analysis. The three main categories, Biological Process, Cellular Component, and Molecular Function, and a combined list are available. In this example, we will use the GO_Biological_Process file.
9. Note that if you want to save these settings, you can hit the Save settings as default button at this stage.
10. Start BiNGO.
11. A smaller window named Parsing Annotation. . . should appear, showing progress and listing the number of entities (proteins) and classifications (assignments of proteins to terms).
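The statistics behind steps 4 and 5 can be reproduced in a few lines. The sketch below implements a one-sided hypergeometric over-representation p-value and the Benjamini-Hochberg step-up adjustment; the gene counts are made-up toy numbers, not output from the chapter's data set:

```python
from math import comb

def hypergeom_upper_p(N, K, n, k):
    """P(X >= k): probability of k or more annotated genes among n drawn
    from N total, of which K carry the GO term (one-sided hypergeometric)."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg FDR step-up adjustment; returns adjusted p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank, i in reversed(list(enumerate(order, start=1))):
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy numbers: 6,000 annotated genes, 40 carry the term,
# 56 genes selected, 5 of them carry the term.
p = hypergeom_upper_p(6000, 40, 56, 5)
print(p < 0.001)  # True: 5 hits is far more than the ~0.37 expected by chance

print(benjamini_hochberg([0.001, 0.010, 0.030]))
```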
The plug-in then generates an acyclic graph of gene ontology (GO) terms, labeled accordingly. The graph shows the terms that are most strongly associated (overrepresented) among the network nodes (note the color gradient from white to yellow). Since the full GO graph is quite large, branches of the GO with no significant terms do not appear in this network. As shown in Fig. 4, terms under biological regulation, such as fatty acid metabolic processes or regulation of cell proliferation, have been highlighted, which is
Fig. 4. Enrichment analysis of the biggest component of the network analyzed (Fig. 3) is shown. A section of the gene ontology term graph is shown in the background. The Bingo plug-in settings and the output are shown in the foreground.
relevant to previous associations of the PI3K-mTOR pathway (see Note 2 for other sources to use). Cytoscape has a node/edge/network attribute browser named Data Panel that is displayed below the networks. Select a node from the GO network and browse the Node Attribute field. Some of the fields populated under the node attributes include:
- description_test: the name of the GO biological process
- adjustedPValue_test: the P-value for the node, adjusted for multiple hypothesis testing (for comparison, the unadjusted P-value is also there, under the name pValue_test)
- n_test: the number of network nodes in your selection set
- x_test: the number of nodes in your selection set mapping to the term
Moreover, the BiNGO plug-in will produce an output window listing the P-values of all nodes with significant enrichment (Fig. 4, bottom). You can select these terms, and the Select nodes button
will highlight the nodes in the network that are associated with those terms. This window also links the GO terms to AmiGO (31), the gene ontology browser.
3. Notes

1. As data integration and analysis can become challenging, datasets cannot always be retrieved and used easily. For those who are less computer savvy, the data collection and visualization described here can be facilitated through additional plug-ins. For instance, Cytoscape can also work as a web service client, meaning that it can directly connect to external public databases and import network and annotation data. Currently, Pathway Commons (32), IntAct (33), BioMart (34), NCBI Entrez Gene (35), and the Protein Identifier Cross-Referencing (PICR) service (36) are supported through Cytoscape. As more standards are established and accepted throughout the research community, the number of accessible databases should likewise increase. Moreover, databases such as BioGRID (37) have developed their own plug-in for Cytoscape, BiogridPlugin2, to import interaction datasets into Cytoscape (note that BioGRID is a prospective IMEx consortium member).

2. As mentioned earlier, if the additional plug-ins for analyzing networks do not work, there are always web services that can accomplish similar tasks. Although using them is less convenient, similar results should be obtained, as most of these resources use similar techniques. There are other Cytoscape plug-ins, such as PiNGO (45), that can identify significantly associated user-defined target Gene Ontology terms. There are also services independent of Cytoscape (e.g., FuncAssociate (38)) that provide functions similar to BiNGO's. Researchers can consider using these tools for verification and extended analysis of their networks.

References

1. Avery L, Wasserman S (1992) Ordering gene function: the interpretation of epistasis in regulatory hierarchies. Trends Genet 8: 312–316
2. Guarente L (1993) Synthetic enhancement in gene interaction: a genetic tool come of age. Trends Genet 9: 362–366
3. Sham P (2001) Shifting paradigms in gene-mapping methodology for complex traits. Pharmacogenomics 2: 195–202
4. Dolma S, Lessnick SL, Hahn WC, Stockwell BR (2003) Identification of genotype-selective antitumor agents using synthetic lethal chemical screening in engineered human tumor cells. Cancer Cell 3: 285–296
5. Fields S, Song O (1989) A novel genetic system to detect protein-protein interactions. Nature 340: 245–246
6. Gavin AC, et al (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415: 141–147
7. Ito T, et al (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 98: 4569–4574
8. Hartman JL, Garvik B, Hartwell L (2001) Principles for the buffering of genetic variation. Science 291: 1001–1004
9. Ho Y, et al (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415: 180–183
10. Rigaut G, et al (1999) A generic protein purification method for protein complex characterization and proteome exploration. Nat Biotechnol 17: 1030–1032
11. Huang LS, Sternberg PW (2006) Genetic dissection of developmental pathways. WormBook, 1–19
12. Tong AH, et al (2004) Global mapping of the yeast genetic interaction network. Science 303: 808–813
13. Goh CS, Cohen FE (2002) Co-evolutionary analysis reveals insights into protein-protein interactions. J Mol Biol 324: 177–192
14. Overbeek R, et al (1999) The use of gene clusters to infer functional coupling. Proc Natl Acad Sci USA 96: 2896–2901
15. Bebek G, Yang J (2007) PathFinder: mining signal transduction pathway segments from protein-protein interaction networks. BMC Bioinformatics 8: 335
16. Ng SK, Zhang Z, Tan SH (2003) Integrative approach for computationally inferring protein domain interactions. Bioinformatics 19: 923–929
17. Aloy P, et al (2004) Structure-based assembly of protein complexes in yeast. Science 303: 2026–2029
18. Shannon P, et al (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13: 2498–2504
19. Breitkreutz BJ, Stark C, Tyers M (2003) Osprey: a network visualization system. Genome Biol 4: R22
20. Batagelj V, Mrvar A (1998) Pajek - Program for Large Network Analysis. Connections 21: 47–57
21. Ashburner M, et al (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25–29
22. Maere S, Heymans K, Kuiper M (2005) BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21: 3448–3449
23. Guldener U, et al (2006) MPact: the MIPS protein interaction resource on yeast. Nucleic Acids Res 34: D436–441
24. IntAct (2011) Datasets of the Month Archive. http://www.ebi.ac.uk/intact/pages/dotm/dotm_archive.xhtml
25. Pilot-Storck F, et al (2010) Interactome mapping of the phosphatidylinositol 3-kinase-mammalian target of rapamycin pathway identifies deformed epidermal autoregulatory factor-1 as a new glycogen synthase kinase-3 interactor. Mol Cell Proteomics 9: 1578–1593
26. Smoot ME, et al (2011) Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 27: 431–432
27. Linderman G, Chance M, Bebek G (2011) Magnet: Micro Array Gene Expression Network Evaluation Toolkit. (submitted)
28. Barabasi AL, Albert R (1999) Emergence of scaling in random networks. Science 286: 509–512
29. Jeong H, et al (2001) Lethality and centrality in protein networks. Nature 411: 41–42
30. Bebek G, et al (2006) The degree distribution of the generalized duplication model. Theoretical Computer Science 369: 239–249
31. Carbon S, et al (2009) AmiGO: online access to ontology and annotation data. Bioinformatics 25: 288–289
32. Cerami EG, et al (2011) Pathway Commons, a web resource for biological pathway data. Nucleic Acids Res 39: D685–690
33. Aranda B, et al (2010) The IntAct molecular interaction database in 2010. Nucleic Acids Res 38: D525–531
34. Haider S, et al (2009) BioMart Central Portal: unified access to biological data. Nucleic Acids Res 37: W23–27
35. Maglott D, et al (2011) Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res 39: D52–57
36. Cote RG, et al (2007) The Protein Identifier Cross-Referencing (PICR) service: reconciling protein identifiers across multiple source databases. BMC Bioinformatics 8: 401
37. Stark C, et al (2006) BioGRID: a general repository for interaction datasets. Nucleic Acids Res 34: D535–539
38. Berriz GF, et al (2003) Characterizing gene sets with FuncAssociate. Bioinformatics 19: 2502–2504
39. Salwinski L, et al (2004) The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 32: D449–451
40. Ceol A, et al (2010) MINT, the molecular interaction database: 2009 update. Nucleic Acids Res 38: D532–539
41. Chautard E, et al (2011) MatrixDB, the extracellular matrix interaction database. Nucleic Acids Res 39: D235–240
42. Goll J, et al (2008) MPIDB: the microbial protein interaction database. Bioinformatics 24: 1743–1744
43. Lynn DJ, et al (2008) InnateDB: facilitating systems-level analyses of the mammalian innate immune response. Mol Syst Biol 4: 218
44. Isserlin R, El-Badrawi RA, Bader GD (2011) The Biomolecular Interaction Network Database in PSI-MI 2.5. Database (Oxford) 2011, baq037
45. Smoot M, Ono K, Ideker T, Maere S (2011) PiNGO: a Cytoscape plugin to find candidate genes in biological networks. Bioinformatics 27: 1030–1031
Chapter 27

Structural Equation Modeling

Catherine M. Stein, Nathan J. Morris, and Nora L. Nock

Abstract

Structural equation modeling (SEM) is a multivariate statistical framework that is used to model complex relationships between directly observed and indirectly observed (latent) variables. SEM is a general framework that involves simultaneously solving systems of linear equations and encompasses other techniques such as regression, factor analysis, path analysis, and latent growth curve modeling. Recently, SEM has gained popularity in the analysis of complex genetic traits because it can be used to better analyze the relationships between correlated variables (traits), to model genes as latent variables defined by multiple observed genetic variants, and to assess the association between multiple genetic variants and multiple correlated phenotypes of interest. Though the general SEM framework only allows for the analysis of independent observations, recent work has extended SEM to the analysis of general pedigrees. Here, we review the theory of SEM for both unrelated and family data, describe the available software for SEM, and provide an example of SEM analysis.

Key words: Multivariate analysis, Latent variables, Modeling, Candidate gene analysis, Complex traits, Path analysis, Structural equation modeling, Association, Population studies, Family studies
1. Introduction Structural equation modeling (SEM) is a multivariate statistical method that involves the estimation of parameters for a system of simultaneous equations. SEM is a generalized framework that includes regression analysis, pathway analysis, factor analysis, simultaneous econometric equations, and latent growth curve models, to name a few (1). Here, we provide an overview of the methodology behind the SEM framework, how this framework has been extended to analyze related individuals, and currently available software that can be used to conduct SEM analyses. After the overview, we provide a step-by-step procedure for SEM analysis. Finally, we provide some notes on challenges faced when performing these types of analyses.
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_27, # Springer Science+Business Media, LLC 2012
1.1. Overview of Methodology
SEM is used to estimate a system of linear equations to test the fit of a hypothesized "causal" model. Thus, the first step involves visualizing the hypothesized model or creating a "path diagram" based on prior knowledge and/or theories. In path diagrams, rectangles represent observed or directly measured variables, and circles/ovals typically represent unobserved or latent constructs that are defined by measured variables. Unidirectional arrows represent causal paths, where one variable influences another directly, and double-headed arrows represent correlations between variables. Some prefer the term "arc" rather than "causal path" (2, 3). Fig. 1 illustrates an example SEM. The system of equations can be written as a number of separate equations or in general matrix notation. SEMs comprise two submodels. First, the measurement model estimates relationships between the observed variables, also referred to as indicators, and the latent variables; this is the same framework used in factor analysis. Please note that here we use the term "indicator variable" in a very different way than in typical statistical models. In regression and other statistical theories, "indicator variable" implies a binary yes/no sort of variable. Here, as is customary for SEM, "indicator variable" refers to a variable that is directly associated with a latent variable, such that differences in the values of the latent variable mirror differences in the values of the indicator (4). Second, the structural model develops the relationships between the latent variables. For clarity of presentation, here we describe the system of equations for this particular example. The measurement model consists of the following equations, using the standard notation of Bollen (1):

x1 = λ1 ξ1 + δ1    y1 = λ3 η1 + ε1
x2 = λ2 ξ1 + δ2    y2 = λ4 η1 + ε2
x3 = λ3 ξ1 + δ3    y3 = λ5 η1 + ε3,

where the x's and y's are observed indicators for latent variables, the ξ's and η's are latent variables, the λ's are factor loadings, and the ε's and δ's are error, or disturbance, terms. In general matrix notation, the measurement model is written as

x = Λx ξ + δ
y = Λy η + ε.

Fig. 1. Example SEM diagram.
Using the path diagram, the arrows point to the x's and y's, so they are modeled as dependent variables. Also, note that the factor loadings for x1 and y1 can be set to 1, which can be done for two reasons: so that the model is identifiable, and so that the latent variable is on the same scale as the observed variables. Model identification, which is discussed in further detail in Subheading 2.1, can also be achieved in other ways, such as setting the variance of the latent variable to 1. Generally, the indicator whose factor loading is set to 1 is chosen based on what the analyst deems the best descriptor of the latent construct, but the choice can be arbitrary. Finally, we can differentiate between exogenous variables, which have no directed arcs ending on them, and endogenous variables, which have at least one arc ending on them. The structural model consists of the following equations:

η1 = γ11 ξ1 + ζ1
η2 = β21 η1 + ζ2,

where the γ and β terms are coefficients for the latent variables and the ζ's are error terms. Here, we can evaluate causal relationships between unobserved variables. In general, the structural model may be written in matrix form as

η = α + Bη + Γξ + ζ,

where η is an m × 1 vector of latent endogenous variables, ξ is an n × 1 vector of latent exogenous variables, α is an m × 1 vector of intercept terms, B is an m × m matrix of coefficients that give the influence of the η's on each other, Γ is an m × n matrix of coefficients for the effect of ξ on η, and ζ is the m × 1 vector of disturbances that contains the unexplained parts of the η's. Though it may appear counterintuitive to regress η on itself, each variable ηi is influenced by other variables in η, so B represents relationships between latent variables and not necessarily feedback loops. We assume that ε, δ, and ζ are mutually uncorrelated. Traditional regression approaches are robust to measurement error in the outcome but not in the predictors. Also, univariate regression approaches cannot model the correlation between the error terms of two different outcomes. SEM allows us to model measurement error for both the predictors and the outcome, and it allows a high degree of flexibility in modeling the correlation between the various error terms. For example, if two of the indicators were lab measurements assayed in one lab, while another two were measurements conducted in another lab, the analyst could model the correlation between the first pair of measurements separately from the second pair. SEM also allows for the decomposition of effects when the direct and indirect effects of variables on the outcome are of interest. For example, the direct effect of η1 on η2 is estimated by β21, and the indirect effect of ξ1 on η2 is estimated by the product γ11β21. Alternatively, one could model a direct effect of ξ1 on η2 with the model depicted in Fig. 2, with corresponding coefficient γ12. More detail on mediation models can be found elsewhere (5, 6).
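As a numerical check on this decomposition, the sketch below simulates data from a chain of this shape with made-up path coefficients (plain Python, no SEM package) and verifies that the regression slope of the final outcome on the upstream variable recovers the product of the two path coefficients:

```python
import random

# Made-up path coefficients for a xi1 -> eta1 -> eta2 chain (illustrative only).
gamma11 = 0.8  # effect of xi1 on eta1
beta21 = 0.5   # effect of eta1 on eta2

random.seed(1)
n = 50_000
xi1 = [random.gauss(0, 1) for _ in range(n)]
eta1 = [gamma11 * x + random.gauss(0, 0.3) for x in xi1]  # eta1 = gamma11*xi1 + zeta1
eta2 = [beta21 * e + random.gauss(0, 0.3) for e in eta1]  # eta2 = beta21*eta1 + zeta2

def ols_slope(x, y):
    """Ordinary least squares slope of y regressed on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# The indirect effect of xi1 on eta2 is the product of the two paths.
print(round(ols_slope(xi1, eta2), 2))  # close to gamma11 * beta21 = 0.40
```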
Fig. 2. Example SEM diagram, illustrating the addition of a direct effect in the model.
These models are estimated using the variance–covariance matrix of the data. Usually, maximum likelihood fitting functions are used to fit the system of equations to the data, but this method requires that the data be normally distributed and the observations independent. Variations that relax the assumption of multivariate normality have been developed, including the robust weighted least squares estimator (WLSMV), which allows for binary and categorical dependent variables (7). To assess overall model fit, there are a number of fit statistics, including the root mean square error of approximation (RMSEA) and the comparative fit index (CFI) (1); for categorical data, the weighted root mean square residual (WRMR) is appropriate (8). Hu and Bentler (9) categorize these fit statistics as "comparative" or "absolute." One could also compare nested models, as is done with traditional regression models and segregation analysis models, using a likelihood ratio test (LRT), and non-nested models using Akaike's AIC; by contrast, the aforementioned fit statistics (RMSEA, CFI, WRMR, etc.) do not require the models being compared to be nested.

1.2. SEM for Genetics
As pointed out by Pearl (3), SEM was first developed by geneticists. Early models, developed by Sewall Wright (10), were called path analysis. Later models were parameterized in very specific ways. Twin-pair data could be used to estimate the proportions of variance due to additive genetic, dominance genetic, and shared environmental effects (11), the so-called ACE models. Nuclear family data could also be used to estimate additive genetic and shared environmental variance, using the so-called Tau and Beta models (12). SEM is easily extended for the analysis of genetic and environmental influences on traits. For example, genes may be modeled as unobserved latent constructs with single nucleotide polymorphisms (SNPs) as indicators of these gene constructs (13, 14). If the investigator has a specific polymorphism of interest, it may be modeled as an observed variable. Our work has shown that densely spaced SNPs
are best for modeling latent gene constructs, and linkage disequilibrium (LD) between these SNPs may be modeled by correlating the error terms within the measurement model (13). A word of caution is needed here regarding the selection of genes or SNPs for analysis within a SEM framework. We emphasize that SEM is a hypothesis-driven approach; it is not agnostic like genome-scan approaches. In genome-wide searches for genes, the analyst conducts linkage or association analysis without a biological model in mind. SEM is not amenable to this agnostic approach; the implication is that the model developer must have a set of genes or biological pathways in mind. One approach is to select very specific candidate genes, and only these genes are included in the model. If genome-scan data are available, another approach is a two-stage one, first conducting association analysis between the SNPs and the traits considered within the SEM (14). It should be emphasized up front that while algorithmic model searching and comparison may be useful, we do not advocate such an approach. Instead, we believe that it is perfectly reasonable to start with a small number of hypothesized models and compare them. The above theory is applicable for independent observations. Muthén has proposed using a robust maximum likelihood estimator that provides test statistics and standard errors robust to nonindependence of observations (www.statmodel.com). However, in genetic studies, we often have data collected from families, and it is preferable to model the family structures explicitly. One approach to dealing with family relationships in SEM was proposed by Todorov et al. (15), whose framework allowed for causal links between measured phenotypes and could include linkage information. However, their approach lacked an explicit measurement model for the traits and was difficult to extend to general pedigrees.
We have developed a generalized framework for modeling familial correlations in SEM (16). By using Kronecker notation, this framework allows for the incorporation of both a measurement and a structural model, as well as polygenic, environmental, and genetic variance components within the SEM. This allows linkage and family-based association analyses to be conducted within a complex modeling framework, and it can be used to build and compare causal models in family data with or without genetic marker data. Because of its generality, this framework is easily extended to sophisticated models such as latent growth curve models.

1.3. Software
For general SEM analysis, a number of packages are available (summarized in Table 1). Of the packages that were not explicitly designed to conduct genetic analyses, none is freely available. Mplus does have the capability to conduct genetic analyses, but these are not as general as the methods described above. For example, SNP genotypes may be incorporated as observed variables, and there are special ways to conduct linkage analysis, but
Table 1
Overview of available SEM software packages

Package    | Weblink or base package                                                      | Notes
Amos       | Add-on to SPSS                                                               |
Proc CALIS | Procedure in SAS                                                             |
EQS        | Multivariate Software: http://www.mvsoft.com/                                |
GLLAMM     | Add-on to STATA: http://www.gllamm.org/                                      |
HYBALL     | http://web.psych.ualberta.ca/~rozeboom/                                      | Free
LISREL     | http://www.ssicentral.com/index.html                                         |
Mplus      | http://www.statmodel.com/                                                    |
Mx         | http://www.vcu.edu/mx/                                                       | Free; best for twin data; now has GUI
OpenMx     | http://openmx.psyc.virginia.edu/                                             | Free R package, based on Mx
NEUSREL    | Uses MATLAB: http://www.neusrel.com/                                         |
SYSTAT     | http://www.systat.com/products.aspx                                          |
sem        | Package for R: http://socserv.socsci.mcmaster.ca/jfox/Misc/sem/index.html    | Free
SEGPATH    | Weblink broken!                                                              | Free
SEPATH     | In Statistica: http://www.statsoft.com/products/statistica-advanced-linear-non-linear-models/itemid/5/ |
TETRAD     | http://www.phil.cmu.edu/projects/tetrad/                                     | Free beta

List partially abstracted from Ed Rigdon's Webpage: http://www2.gsu.edu/~mkteer/
besides that, Mplus cannot currently be used for more sophisticated genetic analyses. Another major consideration in the choice of software is ease of use vs. capability. For example, Amos allows the user to literally draw the SEM diagram to be fitted, whereas Mplus requires the user to write code. However, Mplus may have additional capabilities in terms of specific algorithms and user support. In addition, we recommend that Amos be used with extreme caution, since it makes it too easy to draw a path diagram without thinking through the parameterization, theoretical implications, etc. A comparison of the most commonly used SEM software packages is provided by Buhi et al. (17). Currently, two packages implement SEM for genetic analysis. SEGPATH was originally developed for path analysis of sibling pairs and has been extended to conduct segregation analysis, linkage analysis, and interaction effects and to analyze multiple phenotypes simultaneously (18) using the method of Todorov et al. (15). However, this software has a few limitations.
The likelihood formulation assumes multivariate normality, which makes the analysis of binary or categorical traits impossible without making strong assumptions. Also, though Province et al. (18) state that their method has been extended to extended pedigrees, it is not trivial to estimate polygenic effects for such pedigrees without making other assumptions. Finally, at the time of this writing, the weblinks for SEGPATH were broken, so it is unknown whether this package is still available or actively maintained. Second, the Mx software (http://www.vcu.edu/mx/) was originally developed for the analysis of twin data. Recently, a graphical user interface (GUI) and an R package version have been made available, in which the user can draw SEM diagrams, similar to how Amos is used for general SEM. These recent developments are quite important, since the coding language for Mx is not intuitive and rather difficult to use without very specific examples. The Mx GUI can be used for SEM analysis of general data (unrelated individuals) and twin data, but sib-pair data and other general pedigrees cannot be analyzed without extensive programming in the underlying Mx script language. Mx is also limited to traits that follow multivariate normality, though OpenMx can handle binary traits. In addition, we are currently developing software for our own methodology (16). Recall that our framework includes both a measurement and a structural model, allows for general pedigree structures, and is generalizable for both genetic and general SEM analyses. At the time of this writing, MATLAB code for our framework is freely available by request from the authors. We are also developing an R package for our method and will eventually release a GUI version as well.
2. Methods

As we described above, software for general SEM is not freely available, and software for SEM with genetics has its limitations. Here, we provide an example using Mplus. We also note that the availability and functionality of genetic SEM packages will change in the next couple of years. Below we provide a worked example using data from the 1000 Genomes Project (Pilot Project 3) generated for the Genetic Analysis Workshop (GAW) held on October 17, 2010. The example is worked in two parts. Part 1 (Model 1) shows how to build the latent gene construct for one gene and evaluate that gene's potential association with Q1, including potential effects of covariates. Part 2 (Model 2) demonstrates how to simultaneously model two genes and evaluate their potential associations with Q1 (Fig. 1). In the following sections, we provide the Mplus v5.1 code with annotations for the various steps embedded within the code, and we highlight important findings in the subsequent discussion.
C.M. Stein et al.
2.1. Develop the Model
SEM is a strongly hypothesis-driven analytical method. One danger with methods like SEM is the temptation to fit all sorts of models that have no grounding in biology or other scientific background. That is why it is essential to develop a model first. Draw the hypothesized relationships. If there are several plausible models, draw them all; in step #4, we discuss how these models are compared. There are several issues to keep in mind when developing the model. The measurement model (factor analysis) should be fitted first, followed by the structural model (2). First, as a general rule, when modeling latent constructs, each latent variable requires at least two observed indicator variables, but three is preferable (1); if there are only two indicators, then the latent variable must be correlated with another latent variable. This relates to the issue of model identification, which we discuss subsequently. When conducting a factor analysis, the factor loadings should form independent clusters (2). Second, the analyst must be mindful of the default procedures of the software being used. For example, many software packages automatically estimate correlations between all latent variables. If the analyst does not want this, he/she must specify the analysis appropriately. Third, it is important to specify the disturbance/error terms and the correlations between them. If disturbance terms are left out, the assumption is made that the variable (xi or yi) is perfectly measured. Fourth, many software packages are unable to validly estimate parameters for binary or categorical dependent variables (endogenous variables); see Note 1 for more on this. Those software packages that can handle categorical outcomes use different computational approaches that should be considered. Finally, one must consider how to parameterize the latent variables. There are a couple of approaches here.
In one approach, the analyst may select one indicator variable for which the factor loading will be set to 1. The result is that the variance of the latent construct is set to the variance of that specific indicator variable; at the same time, the importance of that variable to the latent construct cannot be estimated, because no factor loading is estimated for it. Alternatively, the analyst may fix the variance of the latent variable to 1 and its mean to 0, which then allows factor loadings to be estimated for all indicators. Both approaches are valid, and the decision comes down to interpretation. It is important to assess whether the model is identified. Identification concerns whether it is possible to uniquely solve for the model parameters in terms of the moments of the observed variables using the model's equations. A SEM is identifiable if all of its parameters can be determined uniquely from the mean and covariance structure. One quick check of model identification is to see whether each equation specified by the model is a regression and the covariance of all disturbance variables is zero (2). Another step in this process is to assign a scale to each latent variable that is measured with error. This can be done by either choosing one indicator for each latent variable
27 Structural Equation Modeling
and setting its factor loading to 1, or by fixing the variance of the latent variable. Evaluating the identification of a model is easier said than done, and a full discussion of this topic is outside the scope of this review. Bollen (1) provides algebraic arguments to assess model identification, and Pearl (3) presents graphical arguments. Also, see Note 1 for more on model identification.
2.2. Worked Example
Briefly, the 1000 Genomes Project is an international, public–private consortium aimed at building the most detailed map of human genetic variation, with the overarching goal of improving our understanding of the genetic contribution to common human diseases. Since the project's launch in 2008, three pilot studies have been completed to sequence the full genomes of 1,000 individuals in order to identify rare variants in diverse populations. Pilot Project 3 involved sequencing the coding regions (exons) of 3,205 genes in 697 individuals from seven populations, which revealed 24,487 rare and common genetic variants. To illustrate the latent gene construct SEM approach of Nock et al. (13, 14) using the GAW 17 data (unrelated subjects, Replicate 137), we selected two genes (OR52E4: olfactory receptor, family 52, subfamily E, member 4; OR2T3: olfactory receptor, family 2, subfamily T, member 3), which are biologically related to each other. We focused on Q1 as the phenotype because both OR52E4 and OR2T3 had at least one SNP each that was associated with Q1 in Replicate 137. We have taken a similar approach in our previous work (13, 14); sometimes it is helpful to first perform a standard association analysis to identify SNPs associated with the trait(s) of interest, and then include those genes within the SEM. In this example, we demonstrate how to model the variation in these genes with latent constructs using multiple SNPs and how to evaluate the potential associations of these genes with the simulated quantitative phenotype, Q1, including the potential effects of covariates [age, sex, population (pop1), and smoking], using Mplus v5.1 (Muthén and Muthén, 1998–2008, www.statmodel.com).
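In equation form, the model just described can be summarized as follows; the symbols (λ for factor loadings, γ and β for path coefficients, ε and ζ for residual terms) are our own illustrative notation rather than anything taken from the chapter's Mplus code:

```latex
\begin{aligned}
\mathrm{SNP}_{ij} &= \lambda_{ij}\, G_j + \varepsilon_{ij}
  && \text{measurement model: SNP $i$ of gene $j$ loads on latent construct $G_j$}\\
Q1 &= \gamma_{1}\, G_{\mathrm{OR52E4}} + \gamma_{2}\, G_{\mathrm{OR2T3}}
      + \beta_{1}\,\mathrm{age} + \beta_{2}\,\mathrm{sex}
      + \beta_{3}\,\mathrm{pop1} + \beta_{4}\,\mathrm{smoking} + \zeta
  && \text{structural model}
\end{aligned}
```

Model 1 (Part 1) corresponds to fixing γ2 = 0, i.e., including only the latent construct for OR52E4; Model 2 (Part 2) estimates both γ1 and γ2 and the correlation between the two latent gene constructs.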
2.2.1. Prepare the Data
Once the model has been developed, the analyst can consider which variables are to be included in the dataset and how the data file will be prepared. Many software packages accept typical flat files with standard delimiters, with each variable in a separate column and each line of the file representing one individual subject's data. However, some software packages allow a covariance or correlation matrix to be input as data. For this example, Q1, sex, age, smoking status, Pop1, and the SNP genotypes for OR52E4 and OR2T3 were included in a comma-delimited (csv) file, which was then uploaded into Mplus. In the Mplus code below, the data upload step can be seen in the “DATA:” section. For coding the SNP genotype data, we employed an additive genetic model whereby SNPs were coded as 0, 1, or 2 for having 0, 1, or 2 copies of the variant (minor) allele, respectively.
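A minimal sketch of this coding step in Python follows; the SNP names, alleles, and sample values are invented for illustration, not the actual GAW 17 variables, and note that Mplus data files are typically written without a header row:

```python
import csv
import io

# Hypothetical minor alleles for two made-up SNPs.
minor_allele = {"snp1": "A", "snp2": "T"}

# Hypothetical subject records: phenotype Q1, covariates, raw genotype strings.
samples = [
    {"id": "s1", "q1": 0.8, "sex": 1, "age": 47, "snp1": "AG", "snp2": "TT"},
    {"id": "s2", "q1": -0.2, "sex": 0, "age": 52, "snp1": "GG", "snp2": "CT"},
]

def additive_code(genotype, minor):
    """Additive genetic model: count copies of the minor allele (0, 1, or 2)."""
    return genotype.count(minor)

# Write one comma-delimited row per subject, with each variable in its own column.
buf = io.StringIO()
writer = csv.writer(buf)
for s in samples:
    writer.writerow([s["q1"], s["sex"], s["age"],
                     additive_code(s["snp1"], minor_allele["snp1"]),
                     additive_code(s["snp2"], minor_allele["snp2"])])
```

The resulting `buf` contents would then be saved to a .csv file for upload in the "DATA:" section of the Mplus input.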
2.2.2. Assess First Model
Next, the analyst can fit the first model. Model fit assessment has two parts: overall fit and component fit (19). Again, each software package differs in how the causal paths, correlations, and factor analyses for latent variables are specified. Regardless of the software package, after the model is fitted, a variety of statistics will be output. These include path coefficients and corresponding p-values, correlations, R2 for each indicator of a latent variable, and model fit statistics based on maximum likelihood or generalized least squares, such as the chi-square, AIC, BIC, RMSEA, CFI, and other similar statistics. Further discussion of the assessment of global fit vs. the comparison of nested models can be found in Subheading 1. In addition, model fit can also be assessed by the identification of “Heywood cases,” which are negative estimates of variance (1). A word of caution: sometimes, with large sample sizes, the power of significance tests based on the chi-square is so great that even trivial departures lead to rejection of the null hypothesis (19). On the other hand, indexes of model fit are for the most part ad hoc. See refs. 20 and 21 for some interesting discussions of these issues. Also see Note 2 for more on the analysis of categorical variables. We present the worked example in two parts. Part 1 (Model 1) shows how to build the latent gene construct for one gene, OR52E4, and evaluate the gene's potential association with Q1, including potential effects of covariates. Part 2 (Model 2) demonstrates how to simultaneously model two genes, OR52E4 and OR2T3, and evaluate their potential associations with Q1 (Fig. 1). The following provides the Mplus v5.1 code for the worked examples, with annotations for the various steps embedded within the code. Example: Part 1: OR52E4 Gene on Q1 Simulated Trait:
As expected, given the large sample size, the χ2 test statistic is statistically significant; however, given a CFI value of 0.90, an RMSEA value <0.06, and an SRMR value <0.08, the overall fit of the model is basically good (9). As such, the Model 1 results are interpretable, and we highlight that the standardized path coefficient of OR52E4 is statistically significant, although its magnitude is less than that of age or smoking. Furthermore, we note that this single-gene model explains only ~0.15 of the variance in Q1.
2.2.3. Fit Other Models and Compare
Once the analyst has examined the results of the first model, decisions must be made on how to modify the model. This is the crux of SEM modeling. The goal is to find the most plausible, best-fitting model. When assessing changes to make to the initial model, a variety of issues may be considered. Which path coefficients are not statistically significant? Should they remain in the model because they are biologically or epidemiologically important? What other relationships are worth examining—how would these be depicted in the model? Once another model is fitted, another set of statistics will be output as above (path coefficients, R2, and fit statistics).
Specifically, the following statistical issues should be considered when comparing models. Are the R2 values good? Do some indicators of latent variables have lower R2 values and, if so, should those be removed? Sometimes a path coefficient, though not statistically significant by itself, may contribute to the overall fit of the model, such that the inclusion of that variable results in a better AIC, RMSEA, and/or CFI. It is then up to the analyst to decide which model is “better.” It is also important to replicate findings in an independent dataset (19). Finally, we must comment on the modification of models before arriving at one with the “best fit.” Wright (10) advocated careful thought and prior knowledge in the comparison of alternative models. All models should be based on substantive theory and causal conjectures (2). We and others (2, 3) recommend against a “quasi-random walk” through a sequence of models and instead promote theoretical justification of all models. Example: If we add another gene, OR2T3, which is biologically related to OR52E4, to our first model, the fit of the model is slightly better and the amount of variance explained in Q1 increases to ~0.19. Example: Part 2: OR2T3 and OR52E4 Genes on Q1 Simulated Trait (Model 2)
Although both gene constructs are significantly associated with Q1 (Model 2), the magnitude of the path coefficient for OR2T3 is larger than that for OR52E4. It is also interesting to note that the path coefficient of OR52E4 is attenuated when OR2T3 is included (Model 2) compared to when OR2T3 is not included (Model 1). We can also see, via the magnitude and significance of the standardized coefficients, that population structure (pop1) is more influential on OR2T3 than on OR52E4.
2.2.4. Presentation of “Final” Model
Once the best model has been selected, presenting it for publication is also not trivial. Often, the path diagram is very complex, including indicators for latent variables with corresponding factor loadings, correlations between all latent variables, and, of course, the causal pathways. The authors will want to present P-values for each path coefficient, and also some assessment of the goodness-of-fit of the final model. Clearly, presentation of every model considered, together with its fit statistics, would be out of the question. The authors should make every attempt to draw the path diagram simply and clearly. One may consider listing the factor loadings for latent variables in a separate table, and providing the correlations between latent variables in another table, so that the path diagram is not too cluttered. McDonald and Ho (2) propose several guidelines for presenting SEM results in addition to those stated above. The report should give theoretical grounds for the presence or absence of each causal path (“arc”) in the model and also some discussion of the use of causal pathways instead of correlations. If space allows, the full covariance matrix of the observed variables should be provided; if not, means and standard deviations of each variable are sufficient. The global χ2 statistic should also be provided, in addition to other fit statistics, such as the RMSEA and CFI. In Fig. 3, we present the final model (Model 2) from our worked example. Here, we present SNPs as rectangles, genes as ovals (since they are modeled as latent variables), and provide the factor loadings ± standard errors above the single-headed arrows directed from the SNPs to the genes. Q1 is an observed continuous trait and thus is presented as a rectangle. Similarly, age, sex, and smoking status are observed covariates and thus are represented by rectangles. The correlation between OR52E4 and OR2T3 is represented by a double-headed arrow between the two latent variables.
3. Notes

1. As might be expected with models that include many pathways and many variables, particularly many latent variables, model convergence might be a problem. One thing to look at is the existence of (phenotypic) outliers in the data. If there are
Fig. 3. Modeling the aggregate effects of common and rare variants in multiple potentially interesting genes using latent variable SEM. Model of the associations between two genes (11 SNPs) and potential associations with Q1 (CFI = 0.91; RMSEA = 0.04; SRMR = 0.03). Standardized loadings and standard errors are shown above the arrows. *p < 0.05; **p < 0.01. Residuals are not shown for clarity.
outliers in the observed variables, the removal of these datapoints may enable the model to converge. In addition, model over- or under-specification may result in problems with model convergence. As stated before, evaluating the identification of the model is difficult. If faced with this concern, it is best to draw the model, then write out the simultaneous equations evaluated within the model, and then apply the aforementioned rules described by Bollen (1) to assess identification. A final method to increase model convergence is to increase the sample size. Since SEM is really the estimation of simultaneous regression equations, a similar rule of thumb may be applied: at least 20 observations per variable are recommended, but more is better. 2. The analysis of binary or categorical traits is not trivial. Most of the original SEM methodology was developed for quantitative traits and made the assumption of multivariate normality and linear causal effects. The so-called “Asymptotic Distribution Free” approach to model fitting relaxes the assumption of multivariate normality. However, it does not relax the assumption of linearity, and it has been shown that in finite samples, its behavior is quite poor (22). Numerous methods have been developed that explicitly model categorical traits using a threshold model. That is, it is assumed that an underlying quantitative
multivariate normal trait exists which belongs to a specific category if it falls into a specific range. Some software packages, such as Mplus, GLLAMM, and OpenMx, support such explicit models for various types of categorical traits. For instance, the MLR estimator in Mplus is robust to non-normality and can be used for categorical variables. In general, we recommend that modelers who have categorical traits avoid using software that does not support an explicit model for such categorical traits.

References
1. Bollen K (1989) Structural equations with latent variables. John Wiley & Sons, New York
2. McDonald RP, Ho MH (2002) Principles and practice in reporting structural equation analyses. Psychol Methods 7: 64–82
3. Pearl J (2000) Causality: models, reasoning, and inference. Cambridge University Press, New York
4. Bollen K (2001) Indicator: methodology. In: Smelser N, Baltes P (eds) International encyclopedia of the social and behavioral sciences. Elsevier Sciences, Oxford, pp 7282–7287
5. Sobel M (1982) Asymptotic confidence intervals for indirect effects in structural equation models. Sociological Methodology 13: 290–312
6. Baron RM, Kenny DA (1986) The moderator–mediator variable distinction in social psychological research: conceptual, strategic, and statistical considerations. J Pers Soc Psychol 51: 1173–1182
7. Muthén BO (1984) A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika 49: 115–132
8. Hancock GR, Mueller RO (2006) Structural equation modeling: a second course. Information Age Publishing, Greenwich, CT
9. Hu LT, Bentler PM (1999) Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Structural Equation Modeling 6: 1–55
10. Wright S (1923) The theory of path coefficients: a reply to Niles's criticism. Genetics 8: 239–255
11. Neale M, Cardon LR (1992) Methodology for genetic studies of twins and families. Kluwer Academic Publishers, Dordrecht, The Netherlands
12. Rao DC (1985) Application of path analysis in human genetics. In: Krishnaiah PR (ed) Multivariate analysis. Elsevier Science Publishers, pp 467–484
13. Nock NL, Larkin EK, Morris NJ, Li Y, Stein CM (2007) Modeling the complex gene x environment interplay in the simulated rheumatoid arthritis GAW15 data using latent variable structural equation modeling. BMC Proc 1 Suppl 1: S118
14. Nock NL, Wang X, Thompson CL, Song Y, Baechle D, Raska P, Stein CM, Gray-McGuire C (2009) Defining genetic determinants of the metabolic syndrome in the Framingham Heart Study using association and structural equation modeling methods. BMC Proc 3 Suppl 7: S50
15. Todorov AA, Vogler GP, Gu C, Province MA, Li Z, Heath AC, Rao DC (1998) Testing causal hypotheses in multivariate linkage analysis of quantitative traits: general formulation and application to sibpair data. Genet Epidemiol 15: 263–278
16. Morris NJ, Elston RC, Stein CM (2011) A framework for structural equation models in general pedigrees. Hum Hered 70: 278–286
17. Buhi ER, Goodson P, Neilands TB (2007) Structural equation modeling: a primer for health behavior researchers. Am J Health Behav 31: 74–85
18. Province MA, Rice TK, Borecki IB, Gu C, Kraja A, Rao DC (2003) Multivariate and multilocus variance components method, based on structural relationships to assess quantitative trait linkage via SEGPATH. Genet Epidemiol 24: 128–138
19. Bollen K (1998) Structural equation models. In: Armitage P, Colton T (eds) Encyclopedia of biostatistics. John Wiley & Sons, Sussex, England, pp 4363–4372
20. Mulaik S (2007) There is a place for approximate fit in structural equation modeling. Personality and Individual Differences 42: 883–891
21. Barrett P (2007) Structural equation modeling: adjudging model fit. Personality and Individual Differences 42: 815–824
22. Curran PJ, Finch JF, West SG (1996) The robustness of test statistics to nonnormality and specification error in confirmatory factor analysis. Psychol Methods 1: 16–29
Chapter 28

Genotype Calling for the Affymetrix Platform

Arne Schillert and Andreas Ziegler

Abstract

The analysis of high-throughput genotyping data in genome-wide association (GWA) studies has become a standard approach in genetic epidemiology. Data of high quality are crucial for the success of these studies. The first step in the statistical analysis is the generation of genotypes from signal intensities, and several approaches have been proposed for obtaining as accurate genotypes as possible. For the Affymetrix Genome-Wide Human SNP Array 6.0, the genotype calling algorithms Birdseed and CRLMM are commonly used in applications. After a brief description of the statistical methods for both algorithms, their usage is described in detail. Links are provided to the software and to sample code for the installation and execution of the algorithms. Additionally, a suggestion for processing the result files is made.

Key words: Affymetrix, Birdseed, CEL files, CRLMM, Genotype calling algorithm, Signal intensities
1. Introduction

In the last 5 years, genome-wide association (GWA) studies have become a standard approach for unraveling the genetic basis of complex disorders. With GWA studies, hundreds of thousands of single nucleotide polymorphisms (SNPs) can be interrogated simultaneously. This advance has only been made possible by the rapid developments in microarray technology, and two companies have been the major players in this area, i.e., Affymetrix and Illumina. The approaches taken by the companies differ substantially in the technology, and we therefore restrict our attention to Affymetrix microarrays in this chapter; genotype calling with Illumina microarrays is described in Chapter 29. The underlying principle is the specific hybridization of DNA fragments from a sample to complementary sequences on the microarray (1).
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_28, © Springer Science+Business Media, LLC 2012
A. Schillert and A. Ziegler
The workflow is as follows; for details of how the array is built, see ref. 2:
1. Genomic DNA is cut into smaller pieces (using restriction enzymes).
2. Fragments are amplified via polymerase chain reaction (PCR).
3. PCR products are cleaned, fragmented, and marked.
4. Fragments are hybridized on the microarrays (one microarray per sample).
5. Arrays are washed to remove DNA fragments which have not bound to complementary probes on the array.
6. The bound DNA fragments are stained with a fluorescent dye.
7. Arrays are scanned to convert the fluorescence intensities into a gray scale image.
8. The gray values are summarized per pixel into intensities per probe; this generates CEL files.
The summarized intensities are assumed to reflect the amount of DNA present for each of the two alleles. The assignment of genotypes from probe intensities is referred to as genotype calling. After genotype calling, standard and advanced quality control is performed as described elsewhere (3–8).
1.1. Genotype Calling with Microarrays
The most commonly used Affymetrix microarray is the Genome-Wide Human SNP Array 6.0, which will be superseded in due course by the Axiom array. The Affymetrix Genome-Wide Human SNP Array 6.0 allows the investigation of approximately 900,000 SNPs and the same number of copy number variations (CNVs). For this array, two genotype calling algorithms are commonly used. Birdseed, developed by Korn et al. (9), is the algorithm propagated by Affymetrix, and it is implemented in the Affymetrix Power Tools. CRLMM (10, 11) is implemented in the Bioconductor package crlmm. Both algorithms are multichip methods, i.e., the signal intensities of all samples for a specific SNP are used for the assignment of genotypes. Multichip approaches lead to increased accuracy of the genotype calling but require massive preprocessing of the data to make the intensities from different samples (CEL files) comparable. The preprocessing usually consists of the following steps: background correction, normalization, and summarization. While quantile normalization (12) is used by both algorithms, different approaches for background correction and summarization are utilized. For CRLMM, the effect of DNA fragment length and probe sequence is removed, and a linear model is fitted using median polish. This approach is known as robust multiarray averaging (RMA) (10). The method applied in Birdseed is the probe logarithmic intensity error (PLIER) estimation (13). An important aspect of this
approach is that probes with high between-sample variability are down-weighted to decrease their influence on the summarized intensity value for a probe set.
1.2. Principles of Genotype Calling
The scatterplot of A-alleles versus B-alleles of one SNP after normalization typically shows three clusters, representing the AA, AB, and BB genotype clusters (5); for rare SNPs, i.e., SNPs with a low minor allele frequency, we may observe only two clusters. The detection of the three clusters and the assignment of genotypes to the clusters are the tasks of the genotype calling algorithms. Both Birdseed and CRLMM use HapMap calls to define known genotypes, which are subsequently used to define a training set for classification. For Birdseed, the intensities are modeled with a two-dimensional Gaussian mixture model. The initial guesses for the position and variability of each cluster per SNP are stored in the model file (see Subsection 2.1). Using the classical EM algorithm, the cluster membership of each sample is computed in the expectation step, and these estimates are then used in the maximization step to update the cluster positions and shapes. In the case of missing clusters, the position and variance are imputed from the values in the model file. This process is repeated until convergence of the model or a specified number of iterations is reached. CRLMM follows a two-stage hierarchical model for genotype calling (10). To make the algorithm more robust to probe effects, log2-normalized signal intensities are used. Using a mixture model, splines are fitted to eliminate the dependence of the ratio of intensities on the overall intensity. Then, for a given SNP, the distribution of intensity ratios, conditioned on the genotype, is modeled using the normal distribution. The precision of the model parameter estimates is improved by using a hierarchical model. Specifically, an empirical Bayes approach is employed to borrow strength from other SNPs using a multivariate normal prior (14, 15). The posteriors are used as a confidence measure. Lin et al. (14) found that the confidence measures provided by CRLMM version 1 were not optimal and proposed an ad hoc adjustment based on a training approach.
CRLMM version 2 uses these adjusted confidence measures.
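To make the clustering idea concrete, the following sketch fits a three-component two-dimensional Gaussian mixture by EM to simulated A/B intensities for a single SNP, using prior cluster positions as starting values (loosely mirroring how Birdseed is initialized from its model file). All data and parameter values are simulated for illustration; this is a generic mixture-model sketch, not the actual Birdseed or CRLMM implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate normalized A/B allele intensities for one SNP across 300 samples.
# Three genotype clusters: AA (high A, low B), AB (both mid), BB (low A, high B).
true_means = np.array([[2.0, 0.2], [1.1, 1.1], [0.2, 2.0]])
genotypes = rng.integers(0, 3, size=300)
X = true_means[genotypes] + rng.normal(scale=0.15, size=(300, 2))

# EM for a 3-component spherical Gaussian mixture, initialized from
# perturbed "prior" cluster positions (standing in for model-file priors).
means = true_means + rng.normal(scale=0.3, size=(3, 2))
var = np.full(3, 0.5)
weights = np.full(3, 1 / 3)

for _ in range(50):
    # E-step: responsibilities r[i, k] proportional to w_k * N(x_i | mu_k, var_k * I)
    d2 = ((X[:, None, :] - means[None]) ** 2).sum(axis=2)
    log_r = np.log(weights) - d2 / (2 * var) - np.log(2 * np.pi * var)
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: update mixing weights, cluster centers, and variances.
    nk = r.sum(axis=0)
    weights = nk / len(X)
    means = (r.T @ X) / nk[:, None]
    d2 = ((X[:, None, :] - means[None]) ** 2).sum(axis=2)
    var = (r * d2).sum(axis=0) / (2 * nk)

calls = r.argmax(axis=1)        # hard genotype call per sample
confidence = r.max(axis=1)      # posterior probability as a confidence measure
accuracy = float((calls == genotypes).mean())
```

With well-separated clusters, the recovered calls match the simulated genotypes almost perfectly; the per-sample posterior plays the role of the confidence score discussed above.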
2. Methods

The software described below is intended for users who are familiar with command-line-controlled software. The result files have to be processed further to fit a format readable by the intended analysis software. Thus, knowledge of a programming language specifically designed to handle large text files, such as Perl (http://www.perl.org) or Python (http://www.python.org), is required.
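For instance, a large tab-delimited result file can be processed line by line in Python without loading it into memory; the file contents and column layout below are invented for illustration:

```python
import io

def stream_columns(lines, keep=(0, 1)):
    """Yield selected columns from an iterable of tab-delimited lines,
    skipping '#'-prefixed comment lines, in constant memory."""
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        yield [fields[i] for i in keep]

# Tiny made-up example standing in for a multi-gigabyte result file;
# with a real file, pass an open file handle instead of a StringIO.
demo = io.StringIO("#header line\nprobeset_1\tAA\t0.99\nprobeset_2\tAB\t0.95\n")
first_two_cols = list(stream_columns(demo, keep=(0, 1)))
```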
2.1. Genotype Calling with Birdseed
A flowchart of the steps followed for genotype calling with Birdseed is provided in Fig. 1:
1. Installation: Birdseed is provided as part of a bundle of command-line tools, termed the Affymetrix Power Tools. After registration at http://www.affymetrix.com, one can select the appropriate version for download (see Note 1).
2. Collecting necessary files: For genotype calling, the CEL files are required plus some additional files, for example, files defining the chip layout and providing priors for the algorithm. For a detailed list of required files and a link to the Web site, see Note 2.
3. Chip-wise and batch-wise quality control: The quality of the scanned images (CEL files) depends on several factors. To identify CEL files with insufficient quality,
Fig. 1. Schematic workflow of the genotype calling with Birdseed. QC calling with apt-geno-qc (inputs: list of all CEL files, CDF file, qca and qcc files; keep samples with contrast QC > 0.4 and batches with mean contrast QC > 1.7) yields the list of quality-controlled CEL files; genotype calling with apt-probeset-genotype (inputs: this list, CDF file, special SNPs file) produces the calls, confidence scores, and signal intensities.
the following procedure is proposed (see Note 3 for an example):
(a) Run apt-geno-qc to compute the contrast QC and to estimate the sex.
(b) Exclude samples with nonmatching sex information when comparing the sex information from your database with the sex estimated by apt-geno-qc.
(c) Exclude all samples with a contrast QC < 0.4.
(d) Compute the mean contrast QC per batch; exclude the samples of all batches with a mean contrast QC < 1.7.
4. The calling command: The remaining samples are now used for the genotype calling with apt-probeset-genotype; see Note 4 for an example.
5. Conversion of files: The output files of the calling have to be converted into a format that is more accessible for further analysis. A typical approach would be to convert the calls file into a tped file, which can be processed by standard packages for GWA analyses, such as PLINK (http://pngu.mgh.harvard.edu/~purcell/plink/) or GenABEL (16). See Note 8 for additional information.
2.2. Genotype Calling with CRLMM
1. Installation and setup: The genotype calling with CRLMM is done using the Bioconductor package crlmm. Bioconductor is a collection of packages for the open source statistical software R that deals specifically with biological topics. Details of the download and installation are provided in Note 5. The download requires some knowledge of the R language; tutorials can be found on the R homepage.
2. Calling: The genotype calling with CRLMM is a simple call to the function crlmm(). The only decision which has to be made beforehand is whether the memory of the computer is sufficiently large to store all the results in RAM or whether large objects should be stored on disk. The more general approach of using crlmm2(), which supports large GWA studies, is described in Note 6.
3. Conversion of files: For downstream analysis with different software, export of the R objects is needed. When the large data support is enabled, the calls and confidence scores are stored on disk in a binary format. These binary files can easily be converted into plain text files; see Note 7 for an example. Alternatively, one can analyze the data in R using statistical approaches that take the genotyping uncertainty into account (17).
3. Notes

1. Installation of the Affymetrix Power Tools: Precompiled 32-bit versions can be downloaded for Windows and Linux, and precompiled 64-bit programs are available for Windows, Linux, and MacOS. The source code for self-compiling is also provided. After compilation, the regression tests should be carried out. Test data are available under https://bioinfo.affymetrix.com/APT.
2. Files required for calling with Birdseed: The following files are required:
- CEL files containing the signal intensity information
- cdf file—chip definition format
- qca and qcc files for QC calling
- Model file
- Special SNP file
- Annotation file—not for calling, but annotation afterward
The files can be downloaded from http://www.affymetrix.com/support/mas/index.affx#1_1. The model file is continuously improved by the data gathered from many genotype callings. Up-to-date versions of the model file are provided by the Broad Institute and can be downloaded from http://www.broadinstitute.org/mpg/birdsuite/birdseed.html.
3. QC Calling with apt-geno-qc: Code example
Listing 1: Shell code invocation of the quality control procedure for Affymetrix DNA microarrays
The resulting text file contains various quality metrics per CEL file. The most important variables are stored in the columns cel_files, contrast_qc, and em-cluster-chrX-het-contrast_gender. The remaining columns report the contrast QC computed in different subsets of SNPs or the now deprecated QC call rate.
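Steps (b)–(d) of the quality-control procedure described in Subheading 2.1 can be sketched as follows. The column names are taken from the description above, but the miniature report contents, the sex coding, the batch assignments, and the database entries are all invented for illustration; a real apt-geno-qc report has many more columns and may carry additional header lines.

```python
import collections
import csv
import io
import statistics

# Hypothetical miniature QC report (tab-delimited) standing in for the
# apt-geno-qc output file.
report = """cel_files\tcontrast_qc\tem-cluster-chrX-het-contrast_gender
s1.CEL\t2.1\tmale
s2.CEL\t0.3\tfemale
s3.CEL\t1.9\tmale
"""

# Hypothetical sex information from our own database, and batch assignments.
db_sex = {"s1.CEL": "male", "s2.CEL": "female", "s3.CEL": "female"}
batches = {"s1.CEL": "b1", "s2.CEL": "b1", "s3.CEL": "b1"}

rows = list(csv.DictReader(io.StringIO(report), delimiter="\t"))

# (b) drop samples whose estimated sex does not match the database;
# (c) drop samples with contrast QC < 0.4.
passing = [r for r in rows
           if db_sex[r["cel_files"]] == r["em-cluster-chrX-het-contrast_gender"]
           and float(r["contrast_qc"]) >= 0.4]

# (d) drop all samples of any batch whose mean contrast QC is < 1.7.
by_batch = collections.defaultdict(list)
for r in passing:
    by_batch[batches[r["cel_files"]]].append(float(r["contrast_qc"]))
good_batches = {b for b, v in by_batch.items() if statistics.mean(v) >= 1.7}
keep = [r["cel_files"] for r in passing if batches[r["cel_files"]] in good_batches]
```

The surviving file names in `keep` would then be written to the list of quality-controlled CEL files passed to apt-probeset-genotype.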
A detailed description of the various command-line options can be found on the Affymetrix Web site at http://media.affymetrix.com/support/developer/powertools/changelog/apt-probeset-genotype.html. The output files are shown in Fig. 1. Briefly, the signal intensities, the genotype calls, and the confidence scores per genotype are returned in separate files.
5. Installation of the crlmm package
R can be downloaded from http://www.r-project.org/; versions are provided for the major operating systems. After installation of R, the Bioconductor package crlmm needs to be installed from within R. Listing 3 provides example R code for installing the basic Bioconductor packages needed for the genotype calling of Genome-Wide Human SNP Array 6.0 data with crlmm. Because all objects created during an R session are stored in memory, the genotype calling of hundreds of samples would require a huge amount of memory. To solve this issue,
520 A. Schillert and A. Ziegler
the crlmm package can use the ff package to store large objects on disk instead.
Listing 3: R code for the installation of the crlmm package and its prerequisites
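The code of Listing 3 did not survive extraction; the following R sketch shows an installation using the biocLite mechanism that was current when this chapter was written (newer Bioconductor releases use BiocManager instead; the supporting package names beyond crlmm and ff are assumptions):

```r
## Install crlmm, ff, and supporting packages from Bioconductor.
source("http://bioconductor.org/biocLite.R")
biocLite(c("crlmm", "ff", "oligoClasses", "genomewidesnp6Crlmm"))
```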
6. Calling with crlmm:
Listing 4 shows a minimal example for the genotype calling of all CEL files in the specified directory. As an alternative approach, the vector of CEL files can be constructed by hand. To keep track of the objects, we recommend customizing the path where ff objects are created. The resulting object crlmmRes is of the class SnpSet and provides pointers to the ff objects of the genotype calls and confidence scores. Additionally, the quality scores are accessible through the respective list elements of the SnpSet object. Currently, the normalized and summarized signal intensities cannot be accessed when using crlmm2 with the ff package. Signal intensities can be accessed by using the function crlmm, specifying save.it = TRUE and providing a file name in intensityFile. The signal intensities are then stored in an R object.
Listing 4: R code for running the genotype calling with crlmm using the ff package. CEL files from the HapMap project are used
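The body of Listing 4 is missing from this copy; a minimal sketch along the lines described above might look as follows. The directory names are placeholders, and the use of ldPath() from oligoClasses to customize the ff path is an assumption:

```r
library(ff)
library(oligoClasses)
library(crlmm)

ldPath("ffObjects")   # customize where the ff objects are created
## Genotype calling for all CEL files in the given directory
celFiles <- list.celfiles("hapmapCEL", full.names = TRUE)
crlmmRes <- crlmm2(celFiles)
```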
7. Converting ff objects into text files:
If downstream analysis of called genotypes is to be carried out outside of R, genotype calls need to be converted into a format that is readable by various programs. A plain text file is a good starting point. Listing 5 shows the required steps. First, one needs to specify a file name where the genotype calls should be stored. After conversion into an ff data.frame, the calls object can be written via write.table.ffdf.
Listing 5: Conversion of an ff object into a white-space-delimited text file
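The code of Listing 5 is also missing here; a sketch of the steps just described, assuming the crlmmRes object from Listing 4 and the as.ffdf()/write.table.ffdf() functions of the ff package:

```r
library(ff)

outFile <- "genotype_calls.txt"        # placeholder output file name
callsFfdf <- as.ffdf(calls(crlmmRes))  # ff matrix -> ff data.frame
write.table.ffdf(callsFfdf, file = outFile, quote = FALSE)
```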
If genotype calls with lower confidence are to be excluded, the confidence scores need to be exported as well. This is done analogously to Listing 5 by substituting the accessor function calls() with conf() and providing new file names. By default, no genotypes are excluded. Thresholds for the confidence scores are typically chosen so that the percentage of no calls is between 0 and 10%. CRLMM's measure for evaluating the quality of a sample is the signal-to-noise ratio (SNR) [10]. By default, samples with an SNR < 5 are excluded from calling. The computation of the SNR is done in the preprocessing step of crlmm(), and thus allows the instant exclusion of samples failing a given threshold without the necessity of rerunning the calling. A batch quality score is provided in addition to the sample-specific quality measures. The batch quality score should simplify identifying batches of low quality, which in turn should be quality-controlled in greater detail and might be excluded from downstream analysis. To obtain the batch quality score, it is necessary to call the samples batchwise, i.e., only samples which have been processed together in the lab should be called jointly.
8. Conversion of apt output files:
Using an in-house Perl script, we convert the calls file into a tped file. This conversion requires additional input of an annotation file provided by Affymetrix and a table that assigns each CEL file its corresponding sample name. During
conversion, the following tasks are simultaneously accomplished:
- Recoding numeric genotype values into alphabetic genotype values.
- Flipping allele codings to the forward strand.
- Extraction of the genomic position for each SNP.
- Exclusion of samples read from a list.
- Recoding of sample IDs.
- Recoding of Affymetrix SNP IDs into rs-numbers.
- Output is written into a tped and a tfam file.
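The numeric-to-alphabetic recoding step can be sketched as follows. This is a generic illustration, not the authors' in-house Perl script, and it assumes the APT calls-file coding of 0 = AA, 1 = AB, 2 = BB, and -1 = no call:

```python
def recode_genotype(call, alleles):
    """Map a numeric genotype call to alphabetic alleles.

    Assumes the coding 0 = AA, 1 = AB, 2 = BB, -1 = no call, with
    `alleles` such as "CT" giving the bases of the A and B alleles
    (both conventions are assumptions for illustration).
    """
    a, b = alleles[0], alleles[1]
    return {0: a + a, 1: a + b, 2: b + b}.get(call, "00")  # "00" = missing
```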
Acknowledgments
The work presented in this chapter was funded by the German Ministry of Education and Research (grant: 01EZ0874) and the German Research Foundation (grant: ZI 591/17-1).
References
1. Kennedy GC, et al (2003) Large-scale genotyping of complex DNA. Nat Biotechnol 21: 1233–1237
2. Lipshutz RJ, Fodor SP, Gingeras TR, Lockhart DJ (1999) High density synthetic oligonucleotide arrays. Nat Genet 21: 20–24
3. Ziegler A (2009) Genome-wide association studies: Quality control and population-based measures. Genet Epidemiol 33: S45–S50
4. Weale ME (2010) Quality control for genomewide association studies. Methods Mol Biol 628: 341–372
5. Ziegler A, König IR, Thompson JR (2008) Biostatistical aspects of genome-wide association studies. Biom J 50: 8–28
6. Laurie CC, et al (2010) Quality control and quality assurance in genotypic data for genome-wide association studies. Genet Epidemiol 34: 591–602
7. Fardo DW, Ionita-Laza I, Lange C (2009) On quality control measures in genomewide association studies: a test to assess the genotyping quality of individual probands in family-based association studies and an application to the HapMap data. PLoS Genet 5: e1000572
8. Ziegler A, König IR (2010) A Statistical Approach to Genetic Epidemiology: Concepts and Applications, Second ed., Wiley-VCH, Weinheim
9. Korn JM, et al (2008) Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 40: 1253–1260
10. Carvalho B, Bengtsson H, Speed TP, Irizarry RA (2007) Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data. Biostatistics 8: 485–499
11. Carvalho B, Louis TA, Irizarry RA (2010) Quantifying uncertainty in genotype calls. Bioinformatics 26: 242–249
12. Bolstad BM, Irizarry RA, Astrand M, Speed TP (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19: 185–193
13. Lin HY, Myers L (2006) Power and Type I error rates of goodness-of-fit statistics for binomial
generalized estimating equations (GEE) models. Comput Stat Data An 50: 3432–3448
14. Lin S, Carvalho B, Cutler DJ, Arking DE, Chakravarti A, Irizarry RA (2008) Validation and extension of an empirical Bayes method for SNP calling on Affymetrix microarrays. Genome Biol 9: R63
15. Zhang L, et al (2010) Assessment of variability in GWAS with CRLMM genotyping algorithm on WTCCC coronary artery disease. Pharmacogenomics J 10: 347–354
16. Aulchenko YS, Ripke S, Isaacs A, van Duijn CM (2007) GenABEL: an R library for genome-wide association analysis. Bioinformatics 23: 1294–1296
17. Ruczinski I, Li Q, Carvalho B, Fallin MD, Irizarry RA, Louis TA (2009) Association tests that accommodate genotyping errors. The Berkeley Electronic Press. http://www.bepress.com/jhubiostat/paper181/
Chapter 29

Genotype Calling for the Illumina Platform

Yik Ying Teo

Abstract

Genome-wide association studies have been made possible because of advancements in the design of genotyping technologies to assay a million or more single nucleotide polymorphisms (SNPs) simultaneously. This has resulted in the introduction of automated and unsupervised statistical approaches for translating the probe hybridization intensities into the actual genotype calls. This chapter aims to provide an introduction to this process of genotype calling, highlighting in particular the design and approach used for the Illumina BeadArray platforms that are commonly used in large-scale genetic studies. The chapter also provides detailed instructions for preparing the input files required as well as the actual Linux commands and options to execute the ILLUMINUS software. Finally, it concludes with a brief exposition on the different outcomes from genotype calling and the use of perturbation analysis for identifying SNPs with erroneous genotype calls.

Key words: Genotype calling, Illumina, Mixture model, Quality control, Perturbation analysis, Hybridization, Normalization, Clusterplots, Expectation maximization, Oligonucleotide microarray
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_29, © Springer Science+Business Media, LLC 2012

1. Introduction

The advent of genome-wide association studies (GWAS) that survey the entire human genomic landscape for correlation with disease severity or onset has really been spurred by the remarkable advancements in genotyping technologies. Commercial companies like Affymetrix (Santa Clara, CA, USA) and Illumina (San Diego, CA, USA) that manufacture genotyping microarrays have been particularly successful in allowing up to a million single nucleotide polymorphisms (SNPs) to be surveyed simultaneously, and the next generation of genotyping technology that assays up to five million SNPs is already being introduced by Illumina as part of their Omni family of microarrays. While the number of SNPs assayed concurrently has increased exponentially, the fundamental technology for genotyping remains similar to traditional methods of examining fluorescent dye intensities for differential expression. For a diallelic SNP, two oligonucleotide
Fig. 1. Expected hybridization intensity profiles for the three valid genotypes at a diallelic SNP, as well as the situation where hybridization has failed and has resulted in comparatively low intensities for both alleles.
probing sequences differing only at the position of the target location in the genome are used to query the DNA sample of an individual, where the fluorescence tag on an appropriate probing sequence illuminates when binding to the target allele is achieved. In theory, fluorescence binding happens only for one of the two probing sequences when the individual is carrying a homozygous genotype at the target location, while both probes fluoresce for a heterozygous individual (Fig. 1). In reality, the extent of fluorescence can be ambiguous or there may be no clear evidence of hybridization, and it is common to assign an unclassified call (or a NULL call) in such situations. The genotype of an individual at a SNP is thus determined by comparing the relative hybridization intensities of the two probing sequences. Prior to the advent of large-scale genotyping, this process was performed manually through a qualitative visual assessment of the degree of fluorescence for the two probes. However, large-scale genotyping crucially depends on the application of automated strategies for translating the hybridization intensities for the two alleles at each SNP into a categorical genotype call. This automated process is commonly known as genotype calling. Numerous genotype calling algorithms exist, some of which are specific to genotyping arrays from one of the two manufacturers whereas others are generic and can be applied to both Affymetrix and Illumina technologies. The earlier generation of automated genotype assignments, like the dynamic modeling (DM) algorithm for Affymetrix (1), relied on assigning the genotypes for each sample independently through a direct quantification of the relative extent of hybridization for the two alleles, although this has been shown to generate more unclassified calls for heterozygotes, because a heterozygous call relies on successful hybridization at both alleles.
Such single-sample genotype assignment fails to utilize an extremely valuable source of information that is available from simultaneously genotyping multiple samples—that the hybridization profile at a SNP tends to be similar across multiple samples—
Fig. 2. The hybridization intensity profiles for multiple samples at a specific SNP illustrated within the same plot. The triangles and circles correspond to samples that are homozygous for the A and B alleles, respectively, while the squares correspond to samples that are heterozygous.
and this can improve the accuracy of the genotype assignments. Subsequent generations of calling algorithms essentially converged towards the use of clustering methods for determining the genotypes of multiple samples simultaneously, with most relying on multicomponent mixture models (e.g., BRLMM (2), CHIAMO (3), XTYPING (4), ILLUMINUS (5), GENTRAIN (6), and BIRDSEED (7)), and some even include second-order information from the linkage disequilibrium between neighboring SNPs to improve the performance of the calling (e.g., BEAGLE (8) and the method by Yu and colleagues (9)). Pooling information across multiple individuals generally improves the performance of the genotype calling, as samples with similar intensity profiles are likely to be carrying the same genotype (Fig. 2). However, this assumes that nonbiological differences in the intensities across the multiple samples are minimized to avoid artifactual variations in the intensity profiles of samples carrying the same genotype. The scheme for normalizing the intensities within each sample and across the pooled samples has thus become as central to the genotype calling as the mathematical modeling of the algorithm itself (10–13). Downstream of the normalization, the coordinate system that the mixture models are subsequently applied to can also affect the performance of the algorithms. The number of samples and SNPs assayed simultaneously in large-scale genetic studies necessitates the application of automated and unsupervised methods in translating the fluorescence intensities into the valid genotype calls. While the promise of affordable whole-genome sequencing appears to overshadow large-scale
genotyping as the de rigueur method for surveying the human genome, in the short to medium term, next-generation genotyping technologies that assay an ever increasing number of SNPs discovered from population-based whole-genome sequencing projects like the 1000 Genomes Project (www.1000genomes.org) are likely to continue to dominate medical and population genetics. With up to five million SNPs being concurrently genotyped for each sample, there will be a continued demand for more powerful, more accurate, and faster genotype calling algorithms, particularly those that will be calibrated for assigning the genotypes of rare variants. This chapter aims to provide a brief overview of the design of the Illumina BeadArray technology, as well as a description of the proprietary normalization protocol that is performed in the GenomeStudio (previously BeadStudio) software suite. The chapter then introduces one of the more commonly used algorithms, ILLUMINUS, for calling the genotypes of samples that have been assayed with the Illumina technology, highlighting the inbuilt capability within the software for performing perturbation analysis in order to yield a quantitative measure of the robustness of the assigned genotype calls. The commands for running ILLUMINUS in a Linux environment will be provided, along with instructions on preparing the input file format that the software requires.

1.1. Chip Design
The Illumina BeadArray technology is used for designing the Illumina HumanHap family of microarrays (which includes the 550K, 610K, 650K, and 1M versions), as well as the next generation of genotyping microarrays in the Omni family (14, 15), and the following provides an example description of the design for the 550K version (5). Every microarray chip possesses ten lateral strips, where each strip contains a pool of 55,000 beadtypes. Each of these beadtypes queries a specific SNP and is made up of 20 beads on average. These beads are locus-specific probe sequences 50-mer long, corresponding to the nucleotide sequences that are directly adjacent to the target SNPs. One end of the probe sequence is thus extended by a single base for the genomic position that is being assayed, carrying the specific complement allelic makeup for binding. For each DNA sample, there are thus on average 20 measurements of hybridization binding for each SNP. The extent of hybridization is measured by the intensity and wavelength of the fluorescence.
1.2. Normalization Protocol
The fluorescence intensities of the beads across the entire microarray are divided into a number of sub-bead pools (e.g., there are 25 sub-bead pools for the 550K chip), and normalization of the bead intensities occurs within each sub-bead pool. The normalization process first aims to identify and remove SNPs that are assayed by beads with extremely low or high levels of hybridization intensities relative to the rest of the SNPs in each sub-bead pool, defined as either the fifth smallest intensity value or the intensity at the first
percentile, or the fifth largest intensity value or the intensity at the 99th percentile. With the remaining SNPs, an estimation of the background hybridization intensity is performed by uniformly sampling 400 intensity observations on each intensity axis for candidate SNPs that are exhibiting homozygous genotypes. A linear regression is performed for each collection of 400 intensity values, and the intercept of the regression lines from the two linear models is subsequently used to define the origin of the normalized scale. With the two collections of 400 uniformly sampled intensities, estimations of the degrees of rotation and shearing are performed with respect to the defined origin, in order to yield near-perfect conjugate intensity values for the two collections of homozygous genotypes. A scaling of the modified intensities is subsequently performed with the use of virtual control points in order to standardize the intensity spectrum across the samples. This normalization procedure produces a pair of standardized intensity values for each assayed SNP that corresponds to the hybridization signals for the two alleles, and is internally performed by the Illumina GenomeStudio software (16). These signals are subsequently used as input information for most genotype calling software.

1.3. Clustering Space Transformation
The coordinate scale in which the normalized intensities are found provides an intuitive basis for determining the genotypes, since the specific homozygous genotype is assigned when a strong hybridization signal is observed only for the appropriate allele. However, ILLUMINUS assigns genotypes through the use of a separate coordinate system that is equivalent to the polar coordinates, measuring the distance of the observed intensity profile from the origin (strength) and a scaled definition of the counterclockwise angle spanned from the positive horizontal axis (contrast), such that the transformed axis is bounded between −1 and 1 inclusive. Mathematically, suppose $x_{jl}$ and $y_{jl}$ denote the normalized signal intensities for alleles A and B for sample j at SNP l; the corresponding strength ($s_{jl}$) and contrast ($c_{jl}$) are defined respectively as
$$s_{jl} = \log(x_{jl} + y_{jl}), \qquad c_{jl} = \frac{x_{jl} - y_{jl}}{x_{jl} + y_{jl}}.$$
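As a concrete illustration, the transformation can be sketched in a few lines of Python (an illustration only; ILLUMINUS itself is a stand-alone program):

```python
import math

def contrast_strength(x, y):
    """Transform the normalized allele A and B intensities (x, y)
    into the contrast-strength coordinates used by ILLUMINUS."""
    strength = math.log(x + y)
    contrast = (x - y) / (x + y)
    return contrast, strength

# With this formula, x >> y gives a contrast near +1, y >> x a
# contrast near -1, and x = y a contrast of exactly 0.
c, s = contrast_strength(1000.0, 10.0)
```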
Figure 3 provides a pictorial illustration of the relationship between the normalized signal intensities ($x_{jl}$, $y_{jl}$) and the transformed contrast–strength coordinates ($c_{jl}$, $s_{jl}$).

1.4. Mixture Modeling for ILLUMINUS
ILLUMINUS fits a three-component bivariate mixture model to the contrast–strength coordinates for each sample, where the parameters of the mixture model are estimated in an expectationmaximization (EM) approach across all the combined samples. The three components correspond to the expected intensity
Fig. 3. An illustration of the hybridization intensities on the normalized scale (left) and after transforming to the contrast–strength coordinates (right), which is equivalent to a scaled version of the polar coordinates, where the contrast represents the angular deviation from the y = x line and the strength is a measure of the distance from the origin. The hybridization profiles for three samples are shown in both plots to illustrate the effects of the transformation.
profiles for the genotype classes AA, AB, and BB, and a multivariate truncated t-distribution is used to model each of the three classes, since this allows the algorithm to control for the heavier tails that are common in the homozygous clusters, due to the normalization procedure from the Illumina GenomeStudio technology. Mathematically, let $f(x; \mu, \Sigma, \nu)$ define the density function of a t-distribution at $x$ with location parameter $\mu$, variance–covariance matrix $\Sigma$, and possessing $\nu$ degrees of freedom. The density for the intensity profile for sample j at SNP l, or $X_{jl} = (c_{jl}, s_{jl})$, can be expressed as
$$F(X_{jl}) = \sum_{k=1}^{3} \lambda_k f_k(X_{jl}; \mu_k, \Sigma_k, \nu_k),$$
where $(\lambda_1, \lambda_2, \lambda_3)$ denotes the mixture proportions according to Hardy–Weinberg equilibrium, and each component density is the t-density truncated to the admissible range of the contrast,
$$f_k(X_{jl}; \mu_k, \Sigma_k, \nu_k) = \frac{f(X_{jl}; \mu_k, \Sigma_k, \nu_k)}{\int_{-1}^{1} f((c, s_{jl}); \mu_k, \Sigma_k, \nu_k)\,dc}, \qquad k = 1, 2, 3.$$
A fourth component is introduced as an outlier class for samples with intensity profiles that clearly do not belong to any of the
three components, and this is modeled as a bivariate Gaussian distribution with zero covariance and comparatively large variances such that the density is effectively flat across the likely range of intensity values. While the parameters $\nu_k$ are predetermined such that $\nu_1 = \nu_3 < \nu_2$ for most of the SNPs and with $\nu_1 = \nu_2 = \nu_3$ for the remaining handful of SNPs (when empirical evidence suggests the variance profiles for the contrast axis of the three genotype classes are similar), the parameters $\mu_k$ and $\Sigma_k$ are determined from the data assuming the genotype assignments are known, such that
$$\mu_k = (\bar{c}_k, \bar{s}_k) = \left( \frac{1}{n_k} \sum_{j}^{n_k} c_{jl}^{k},\; \frac{1}{n_k} \sum_{j}^{n_k} s_{jl}^{k} \right)$$
and
$$\Sigma_k = \frac{1}{n_k - 1} \begin{pmatrix} \sum_{j}^{n_k} (c_{jl}^{k} - \bar{c}_k)^2 & \sum_{j}^{n_k} (c_{jl}^{k} - \bar{c}_k)(s_{jl}^{k} - \bar{s}_k) \\ \sum_{j}^{n_k} (c_{jl}^{k} - \bar{c}_k)(s_{jl}^{k} - \bar{s}_k) & \sum_{j}^{n_k} (s_{jl}^{k} - \bar{s}_k)^2 \end{pmatrix},$$
with $n_k$ denoting the number of samples that are assigned to genotype class k, and the superscript referring to the intensity profiles of these same samples. Estimating the parameters of the mixture models assumes that the genotype membership of each sample is known, except during the start of the analysis, where the algorithm relies on only the contrast axis, and the required parameters are predefined in order to initialize the algorithm. During the first iteration of the algorithm to assign putative genotypes to the samples, five guided starts are tested in order to identify the optimal set of parameters that yields the largest likelihood of the data for seeding the algorithm. Out of the five starting sets of parameters, two are determined from a simple summary of the empirical contrast–strength characteristics, and this flexibility has proven to be useful in accommodating the handful of SNPs with unusual hybridization profiles that result in genotype clusters that are shifted from the normal positions. The location ($\mu_{kc}$) and spread ($\sigma_{kc}$) parameters for the contrast axis of the five guided starts are:
- $\mu_{1c} = -0.9$, $\mu_{2c} = 0$, $\mu_{3c} = 0.9$; $\sigma_{1c} = \sigma_{2c} = \sigma_{3c} = 0.1$
- $\mu_{1c} = -0.9$, $\mu_{2c} = -0.5$, $\mu_{3c} = 0.9$; $\sigma_{1c} = \sigma_{2c} = \sigma_{3c} = 0.1$
- $\mu_{1c} = -0.9$, $\mu_{2c} = 0.5$, $\mu_{3c} = 0.9$; $\sigma_{1c} = \sigma_{2c} = \sigma_{3c} = 0.1$
- $\mu_{1c} = -0.9$, $\mu_{2c} = 0.5\,[\max(c) + \min(c)]$, $\mu_{3c} = 0.9\,\max(c)$; $\sigma_{1c} = \sigma_{2c} = \sigma_{3c} = 0.05\,[\max(c) + \min(c)]$
- $\mu_{1c} = 0.9\,\min(c)$, $\mu_{2c} = 0.5\,[\max(c) + \min(c)]$, $\mu_{3c} = 0.9$; $\sigma_{1c} = \sigma_{2c} = \sigma_{3c} = 0.05\,[\max(c) + \min(c)]$.
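To make the clustering idea concrete, the following toy Python sketch runs a one-dimensional EM on the contrast axis only, with three Gaussian components seeded by the first guided start. This is a simplification for illustration, not the truncated-t model that ILLUMINUS actually fits:

```python
import math

def em_contrast_call(contrast, iters=50, thresh=0.95):
    """Toy 1-D EM genotype caller on the contrast axis: a
    three-component Gaussian mixture seeded with the first
    guided start (means -0.9, 0, 0.9; spreads 0.1)."""
    mu = [-0.9, 0.0, 0.9]
    sd = [0.1, 0.1, 0.1]
    lam = [1.0 / 3] * 3
    n = len(contrast)
    resp = []
    for _ in range(iters):
        # E-step: posterior membership probabilities per sample
        resp = []
        for c in contrast:
            w = [lam[k] / sd[k] * math.exp(-0.5 * ((c - mu[k]) / sd[k]) ** 2)
                 for k in range(3)]
            tot = sum(w)
            resp.append([wk / tot for wk in w])
        # M-step: maximum-likelihood update of the cluster parameters
        for k in range(3):
            nk = sum(r[k] for r in resp)
            lam[k] = nk / n
            if nk > 0:
                mu[k] = sum(r[k] * c for r, c in zip(resp, contrast)) / nk
                var = sum(r[k] * (c - mu[k]) ** 2
                          for r, c in zip(resp, contrast)) / nk
                sd[k] = max(math.sqrt(var), 1e-3)
    # Call the class with the largest posterior; NULL (4) below threshold
    return [max(range(3), key=lambda k: r[k]) + 1 if max(r) >= thresh else 4
            for r in resp]

print(em_contrast_call([-0.92, -0.88, 0.02, -0.01, 0.90, 0.85]))
```

With three well-separated clouds, as here, the posterior probabilities are decisive and all six samples receive confident calls.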
ILLUMINUS adopts an EM procedure that alternates between (1) assigning the genotype membership based on the intensity data conditional on knowing the parameters of the genotype clusters (E-step) and (2) recalibrating the parameter estimates of the genotype clusters using maximum likelihood conditional on knowledge of the genotype memberships of the samples (M-step). In addition, an intuitively simple step is introduced such that samples with contrast values larger than the average contrast of the heterozygotes will never be assigned to the AA genotype class, and similarly samples with contrast values smaller than the average contrast of the heterozygotes will never be assigned to the BB genotype class. The genotype membership for each sample is defined as the genotype class with the largest posterior probability, subject to the condition that this maximum posterior probability is above some user-defined threshold (with a default of 0.95). When the maximum posterior probability is below the threshold, or when the maximum posterior probability is found to occur at the fourth outlier class, a NULL genotype is assigned, and this is commonly considered as a missing genotype call. The EM procedure stops when two consecutive iterations yield the same genotype configuration for all the samples.

1.5. Genotype Calling at Chromosome X
For SNPs that are not located in the pseudo-autosomal region (PAR) of chromosome X, assigning the genotypes requires additional information on the genders of the samples. As males only possess a single copy of the X chromosome, males will never be heterozygous for any SNPs that are found in the non-PAR regions of chromosome X. However, as females possess two copies of the X chromosome, there is effectively no difference in determining the genotypes for females. The algorithm is modified to remove the dependency on Hardy–Weinberg equilibrium for determining the mixture proportions, and to set $f_2(x_{jl}; \mu_2, \Sigma_2, \nu_2) = 0$ for male individuals.
2. Methods

2.1. Perturbation Analysis
Given the automated and unsupervised nature of calling the genotypes for up to a million SNPs, it is inevitable that the genotypes for a handful of SNPs will be erroneously determined. For most of such SNPs with problematic genotyping, the calling algorithm will appropriately assign the genotypes as NULL calls to reflect the higher degree of uncertainty, and these SNPs can be easily identified and filtered from downstream analyses by the higher amount of missing calls. However, there will be some SNPs where, despite stringent thresholds on the
maximum posterior probabilities, the genotype assignment may be incorrect without generating a high amount of missing calls. It is particularly important to identify and remove such SNPs from downstream analyses, especially in population genetics, as they may artificially present evidence of population differentiation when compared against other datasets or public databases, like the genotypes from the International HapMap Project (www.hapmap.org) or the 1000 Genomes Project (www.1000genomes.org). At present, curation of the genotype assignments is performed manually through visual inspection of the clusterplots—intensity profiles with the genotype calls overlaid in color. This is tedious and unrealistic to extend to all the genotyped SNPs. ILLUMINUS presents a simple yet efficient solution for evaluating the robustness of the assigned genotype calls. For each SNP, in addition to performing the usual genotype assignment with ILLUMINUS, the contrast–strength intensities for each sample are perturbed slightly by the addition of a small degree of white noise, and a second round of genotype calling is performed on the perturbed intensities for the same SNP. The proportion of concordant calls from the two rounds of genotype calling using the unperturbed and perturbed data, between 0 and 1 inclusive, provides a metric for quantifying the stability of the assigned genotypes, and this has been shown to correspond to the quality of the genotype assignment at the SNP level (17).

2.2. Input File and Command Line for Executing ILLUMINUS
A single space-delimited input file is required for running ILLUMINUS, except in the situation of calling SNPs on the non-PAR region of chromosome X, where a separate file with binary indicators (0 = female; 1 = male) is required to identify which samples in the input file are males. The input file has a header row and is arranged in a matrix with L subsequent rows representing the L SNPs and (2N + 3) columns, where the first three columns carry information on the SNP identifier, coordinates, and allele designations, while the remaining 2N columns contain the normalized allele A and allele B intensities from GenomeStudio for the N samples (Fig. 4). Note that GenomeStudio by default assigns "NaN" as the intensities for samples where hybridization failed to generate a valid fluorescence profile, and ILLUMINUS has been configured to acknowledge the presence of "NaN" in the input file and produces NULL calls for such observations. ILLUMINUS can be executed in a Linux environment with the following command:
illuminus [-i INPUT] [-o OUTPUT] [options]
where the options include:
- -t NUM: threshold on the maximum posterior probability, default at 0.95 (see Note 1).
Fig. 4. An example of the formatting for the input file for ILLUMINUS.
- -p: generates an additional output file containing the posterior probabilities for each of the four possible calls (corresponding to the genotype classes 1, 2, 3, NULL).
- -w: for running the version of ILLUMINUS that has been specifically optimized for whole genome–amplified DNA (see Note 2).
- -a: for performing perturbation analysis (see Note 3).
- -x FILE: for indicating that the SNPs belong to the non-PAR region of chromosome X; the file required contains binary (0 for females; 1 for males) entries representing the genders of the samples.
- -s NUM1 NUM2: for running ILLUMINUS from the NUM1th SNP to the NUM2th SNP in the input file; this is useful for parallelizing the use of ILLUMINUS (see Note 4).
There are two possible space-delimited files that can be produced after executing ILLUMINUS. The file containing the genotype calls has the additional suffix "_calls" appended to the end of the output file name. In this file, the first two columns contain the coordinates and identifier of the SNPs. The third column contains the concordance score from the perturbation analysis, which is set as 0 if the "-a" option was not specified while executing ILLUMINUS. The fourth column contains the allele designations in the "AB" format, which is required for translating the numerical genotype calls made by ILLUMINUS to the actual genotype, where 1 = AA, 2 = AB, 3 = BB, and 4 = NULL. The rest of the columns contain the genotype calls in the (1, 2, 3, 4) format for the samples, arranged in the identical order as in the input intensity file. When the "-p" option is specified, an additional output file with the suffix "_probs" is generated and, for
each sample, there are four columns carrying the respective posterior probabilities for assigning the sample to the four genotype classes. An example command will thus be
illuminus -i chr6.txt -o chr6_illuminus -t 0.90 -p -a
which will execute ILLUMINUS for the input file that has been named "chr6.txt" and produce the corresponding output files with the prefix "chr6_illuminus." Specifically, the two files that are generated and are immediately relevant to the users are "chr6_illuminus_calls" and "chr6_illuminus_probs." The latter file is produced because the "-p" option is specified. In the former file, "chr6_illuminus_calls," the third column will now contain numerical entries that are between 0 and 1 inclusive, and which represent the concordance in the genotype calls with the original versus the perturbed intensity data. A valid genotype is assigned when the corresponding posterior probability is at least 0.90, which is controlled by the "-t" option.

2.3. Examples of Unequivocal, Ambiguous, and Erroneous Genotype Clustering
In Fig. 5, we provide examples of the clusterplots for SNPs with unequivocal genotype assignments (left box), with accurate genotyping in the presence of noisy genotype clouds that result in higher rates of missingness (middle box), and for SNPs where the genotype assignments contain a high level of errors despite the maximum posterior probabilities exceeding the defined threshold of 0.95 (right box). The mixture model adopted by ILLUMINUS specifically allows the truncated t-distribution corresponding to the heterozygotes to possess a smaller degree of freedom, and this can result in an excess of heterozygous calls when the hybridization intensity profiles of most of the samples are particularly ambiguous, as seen in the three SNPs in the right column. The excess heterozygous calls almost certainly result in a gross departure from Hardy–Weinberg equilibrium, and evaluating whether the genotype distribution at each SNP deviates significantly from Hardy–Weinberg equilibrium has become a useful criterion for identifying SNPs with erroneous genotyping (18).
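The Hardy–Weinberg check mentioned above can be illustrated with a simple chi-square goodness-of-fit computation (a generic sketch, not tied to any particular software package):

```python
def hwe_chisq(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit statistic (1 df) comparing observed
    genotype counts with their Hardy-Weinberg expectations."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2.0 * n)   # frequency of the A allele
    q = 1.0 - p
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    observed = [n_aa, n_ab, n_bb]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# A marked excess of heterozygotes, as produced by the erroneous
# calls described above, yields a large statistic:
print(hwe_chisq(10, 80, 10))   # counts in HWE would be about 25/50/25
```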
3. Notes

In executing available computational software, the default settings are often used without the realization that changing specific program parameters can improve the outcome of the analysis. In this section, we highlight various features of ILLUMINUS that may improve the accuracy of the genotype calling.

1. Changing the threshold of the posterior probability. The default threshold of the posterior probability is set at 0.95. This
Y.Y. Teo
Fig. 5. Example clusterplots of three SNPs with unequivocal genotype assignments (left box), three SNPs with accurate genotype assignments and appropriately assigned NULL calls (middle box), and three SNPs with indistinct hybridization patterns resulting in high confidence but erroneous genotype assignments (right box).
can be manually changed by specifying the preferred threshold using the "-t" option. Depending on the purpose of the genotype calls, it is often useful to generate a set of extremely high-confidence genotype calls. For example, in investigating the extent of genomic diversity between the study population and reference populations from public resources such as the HapMap and the 1000 Genomes Project, genotyping error in the primary study can inadvertently inflate the apparent extent of genetic differentiation. Increasing the threshold to 0.99 (by including the additional option "-t 0.99") will increase the confidence and the accuracy of the resulting valid calls.

2. Whole-genome-amplified DNA. Whole-genome amplification (WGA) is a laboratory procedure that performs in vitro reproduction of template DNA. While this procedure allows a small quantity of DNA to be amplified to a quantity sufficient for performing genome-wide assays with microarray technologies, the success of the reproduction is not uniform across the genome. Specifically, it has been reported that WGA can lower
29 Genotype Calling for the Illumina Platform
the extent of allele hybridization in genotyping, as well as increase the variability in the hybridization intensity. Both factors directly compromise genotype calling, resulting in lower accuracy and greater rates of missingness (or NULL calls). The "-w" option in ILLUMINUS explicitly calls a routine that has been optimized to handle the noisier hybridization intensities, and this recovers the performance of the genotype calling.

3. Perturbation analysis. When the "-a" option is specified in the command, ILLUMINUS performs two rounds of genotype calling at each SNP: the first on the original intensities and the second on perturbed intensities, in which a small amount of white noise has been added to the original data. The concordance between the two sets of calls reflects the stability of the genotype calling and is a useful metric for quantifying call quality at the SNP level; it is often used to decide whether a SNP should be retained for downstream analyses or excluded. The concordance is a value between 0 and 1 inclusive, and empirically we have found concordance exceeding 98% to be reflective of highly consistent genotype calls. The recommendation is thus to adopt a concordance filter of between 95% and 99%.

4. Batch running. The increasing number of samples genotyped, together with the higher SNP density (e.g., five million SNPs) of next-generation genotyping technologies, means that greater computational resources are required to process the intensity data and to perform the genotype calling. The "-s" option is particularly useful in such situations, as it breaks a large job (e.g., 300,000 SNPs on one chromosome for 10,000 samples) into several smaller jobs (e.g., each of 3,000 SNPs for 10,000 samples). This dramatically reduces the computation time as well as the memory required.

References

1. Di X, et al (2005) Dynamic model based algorithms for screening and genotyping over 100K SNPs on oligonucleotide microarrays. Bioinformatics 21:1958–1963
2. Affymetrix Inc (2006) BRLMM: an improved genotype calling method for the GeneChip Human Mapping 500K Array Set. http://media.affymetrix.com/support/technical/whitepapers/brlmm_whitepaper.pdf
3. Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447:661–678
4. Plagnol V, et al (2007) A method to address differential bias in genotyping in large-scale association studies. PLoS Genet 3:e74
5. Teo YY, et al (2007) A genotype calling algorithm for the Illumina BeadArray platform. Bioinformatics 23:2741–2746
6. Illumina Inc (2005) Illumina GenCall data analysis software. http://www.illumina.com/Documents/products/technotes/technote_gencall_data_analysis_software.pdf
7. Korn JM, et al (2008) Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 40:1253–1260
8. Browning BL, Browning SR (2009) A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet 84:210–223
9. Yu Z, et al (2009) Genotype determination for polymorphisms in linkage disequilibrium. BMC Bioinformatics 10:63
10. Bolstad BM, et al (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19:185–193
11. Rabbee N, Speed TP (2006) A genotype calling algorithm for Affymetrix SNP arrays. Bioinformatics 22:7–12
12. Carvalho B, et al (2007) Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data. Biostatistics 8:485–499
13. Xiao Y, et al (2007) A multi-array multi-SNP genotyping algorithm for Affymetrix SNP microarrays. Bioinformatics 23:1459–1467
14. Gunderson KL, et al (2006) Whole-genome genotyping of haplotype tag single nucleotide polymorphisms. Pharmacogenomics 7:641–648
15. Steemers FJ, et al (2006) Whole-genome genotyping with the single-base extension assay. Nat Methods 3:31–33
16. Kermani BG (2005) Artificial intelligence and global normalization methods for genotyping. US Patent 20060224529
17. Teo YY, et al (2008) Perturbation analysis: a simple method for filtering SNPs with erroneous genotyping in genome-wide association studies. Ann Hum Genet 72:368–374
18. Teo YY, et al (2007) On the usage of HWE for identifying genotyping errors. Ann Hum Genet 71:701–703
Chapter 30

Comparison of Requirements and Capabilities of Major Multipurpose Software Packages

Robert P. Igo Jr. and Audrey H. Schnell

Abstract

The aim of this chapter is to introduce the reader to commonly used software packages and illustrate their input requirements, analysis options, strengths, and limitations. We focus on packages that perform more than one function and include programs for quality control, linkage, and association analyses. Additional inclusion criteria were that the programs be (1) free to academic users and (2) currently supported, maintained, and developed. Using those criteria, we chose to review three packages: Statistical Analysis for Genetic Epidemiology (S.A.G.E.), PLINK, and Merlin. We describe the required input formats and analysis options. We will not go into detail about every possible program in the packages, but we will give an overview of the packages' requirements and capabilities.

Key words: Software, S.A.G.E., Merlin, PLINK, Statistical analysis, Linkage analysis, Association analysis, Quality control, GUI, Data management
1. Introduction

A vast number of software packages have been made available for genetic data analysis. The Rockefeller list (http://linkage.rockefeller.edu/soft/) includes over 500 programs for genetic analysis. Some programs are very specialized, while others are part of a package that performs multiple types of analyses and often includes data management tools. Historically, most packages were designed for family data to be used in segregation and linkage analyses; more recently, packages have included analyses of unrelated individuals from case/control association studies. A major problem over the years has been that the input file formats for the different programs differ in such a way that using more than one program required reformatting the data, often extensively. Other recent trends are toward
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8_30, # Springer Science+Business Media, LLC 2012
Table 1
Comparison of three multiuse genetic analysis packages

Program     Model-based linkage   Model-free linkage   Association (unrelated)   Association (families)   QC    GUI   Current version
S.A.G.E.    Yes                   Yes                  Yes(a)                    Yes                      Yes   Yes   6.1
Merlin(b)   Yes                   Yes                  No                        Yes                      Yes   No    1.1.2
PLINK       No                    No                   Yes                       Yes                      Yes   Yes   1.07

All programs are available for Windows, Apple, Linux, and Solaris operating systems.
(a) Via ASSOC
(b) Quantitative traits only
much larger datasets, now that millions of single-nucleotide polymorphisms (SNPs) are available and large population studies with thousands of individuals are to be analyzed. Major factors influencing the choice of programs include compatibility with other programs, available documentation, the types of analysis performed, ease of use, the platforms for which a program is available, speed, and possibly user support. The initial format of the data and collaborators' choice of programs may also be considered. As the majority of the programs are free [SAS (SAS Institute, Inc., Cary, NC) being the most notable exception], cost is usually not a factor. Unlike traditional statistical packages, genetic software has typically been written and supported by individuals rather than businesses, and hence may be less well documented and supported. Moving from one program to another can be problematic, and as such a format that is "as common as possible" should be considered. While forethought is often difficult, the pattern so far in genetic epidemiology is that life changes quickly and the analyses of today will not be the analyses of tomorrow. Given that, the more flexible the file specifications, the easier it may be to deal with newer developments down the road. This chapter reviews three major commonly used packages that are free, updated regularly, and available for UNIX, PC, Mac, and Sun operating systems. All packages have programs that perform quality control measures, linkage analysis, and association analysis (Table 1).
2. Methods

2.1. Statistical Analysis for Genetic Epidemiology
http://darwin.case.edu/sage/ Statistical Analysis for Genetic Epidemiology (S.A.G.E.) comprises 17 programs for the analysis of genetic epidemiological data and is designed mainly for pedigree data. It originally appeared in 1986 and continues to be updated frequently. S.A.G.E. programs are
Fig. 1. Mapping text fields to S.A.G.E. variables via the GUI. The five standard pedigree fields and a binary trait have been defined.
designed specifically for pedigree data analyses, although certain programs, including FREQ and ASSOC, can also be used for analyses of unrelated individuals. (Here, we use the terms pedigree and family interchangeably.) For this chapter, we will focus on the most frequently used programs. In addition to formal tests for genetic segregation, linkage, and association, S.A.G.E. offers programs for descriptive statistics and quality control procedures. The package has evolved significantly over the years: beginning from a simple command-line interface, it now has a graphical user interface (GUI) and supplies many well-chosen default options.

2.1.1. Input Files
Required Files. S.A.G.E. requires as minimal input a pedigree file and a parameter file; the individual programs may have additional requirements. S.A.G.E. is very versatile: it accepts a variety of file formats, with the requirement of one record (line) per individual. The parameter file can be constructed from scratch with the aid of the GUI, which can also assist with formatting of the data file (see Note 1). There is no designated order to the columns; the variables may appear in any order. Fields for pedigree, individual, parents, and sex must be present, and these key fields are mapped to the variables in the data file (Fig. 1). Columns in the data file are
mapped to pedigree identifier (ID), individual ID, parents, sex, traits, covariates, and markers. The exceptions to this requirement are that (1) the sex field may be absent and (2) the parent fields may be absent, in which case individuals with the same family ID are treated as full siblings (see Note 2). The combination of pedigree and individual ID should be such that every individual has a unique identifier. S.A.G.E. will recognize single or multiple column delimiters, as specified by the user. Multiple delimiters allow space-delimited files with fixed column widths to be read in. Allele delimiters for marker loci are specified separately from column delimiters. Custom missing values may be specified and may be coded in any manner as long as they differ from the column delimiters. Any alphanumeric character is acceptable for all fields, and any number of columns may be present, in any order. A powerful feature of S.A.G.E. is the ability of the user to define new variables for analysis as functions of existing variables. S.A.G.E. will calculate the values of user-defined variables on the fly, as specified in the parameter file. A large number of mathematical and Boolean operations are supported. In addition, genetic model functions for marker loci are available that implement additive, dominant, and recessive coding for a given reference allele. A transmitted-allele indicator function, for use in ASSOC, specifies whether or not a given allele was transmitted from a heterozygous parent.

Other Files. For each program, the GUI prompts for the required input files. For certain programs, a marker locus description file is required. This file lists the markers and the associated allele frequencies. These are typically codominant markers, but non-codominant markers may also be specified (e.g., the ABO blood group). For codominant markers, this file can be generated by the S.A.G.E. program FREQ.
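The additive, dominant, and recessive codings just mentioned can be sketched as follows. The function name and genotype notation here are illustrative assumptions, not S.A.G.E. syntax:

```python
# Illustrative sketch of additive, dominant, and recessive codings for a
# chosen reference allele. Genotypes are two-allele strings such as "A/G".

def code_genotype(genotype, ref_allele, model="additive"):
    count = genotype.split("/").count(ref_allele)  # 0, 1, or 2 copies
    if model == "additive":
        return count                    # 0 / 1 / 2 copies of the reference
    if model == "dominant":
        return 1 if count >= 1 else 0   # at least one copy
    if model == "recessive":
        return 1 if count == 2 else 0   # two copies required
    raise ValueError("unknown model")

for g in ("A/A", "A/G", "G/G"):
    print(g,
          code_genotype(g, "A", "additive"),
          code_genotype(g, "A", "dominant"),
          code_genotype(g, "A", "recessive"))
```

For reference allele A this prints codes 2/1/1, 1/1/0, and 0/0/0 for the three genotypes, which is exactly the contrast each model tests in a regression.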
Multipoint linkage analysis and multipoint identity-by-descent (IBD) calculations require a genome description file. This file specifies one or more mutually unlinked regions, usually chromosomes, and, within each region, a list of markers in order, with the genetic map distances between adjacent markers. Optionally, male and female map distances may be specified. Map units may be either Haldane or Kosambi centimorgans (cM), but, as with all other linkage programs, all S.A.G.E. programs that require map information use the Haldane map function internally. The use of intermarker distances rather than map positions is potentially problematic; the S.A.G.E. GUI, however, can generate genome description files from tables of marker locations. Model-based linkage analyses on binary traits via the programs LODLINK and MLOD can be run under a mode of inheritance specified in a trait marker file. Alternatively, a type probability file
may first be generated from a segregation analysis using the program SEGREG. We recommend obtaining penetrance parameters from SEGREG whenever the sample is large enough to obtain reasonably precise estimates, because SEGREG offers a wide variety of segregation models. 2.1.2. Quality Control Programs
PEDINFO. PEDINFO provides descriptive statistics on pedigrees (e.g., pedigree sizes and counts of various types of relative pairs) and reports summary statistics for traits. It detects and reports "broken" pedigrees, in which not all members are connected, and certain types of relationships that may be in error (e.g., individuals with multiple mates and consanguineous marriages).

MARKERINFO. MARKERINFO reports Mendelian errors for markers in family data and can generate new pedigree files with the errors removed. The program generates a summary file that lists the number of Mendelian incompatibilities encountered, by marker and by pedigree. This information is valuable for identifying poorly typed markers and for quickly identifying pedigrees possibly containing one or more relationship errors. In addition, a detailed output file displays the relevant genotypes for each Mendelian error.

RELTEST. RELTEST uses marker data to classify the pedigree data according to the "true" familial relationships. The program examines relationships within, but not across, pedigrees, as does RELPAIR (1) (see, however, Note 3). It evaluates two genome-wide IBD-sharing statistics to distinguish true parent-offspring, full-sibling, half-sibling, and unrelated relationships. For ease of interpreting results, RELTEST provides histograms of the IBD-sharing statistics, which have an asymptotic standard normal distribution for all types of putative relationship pairs (see also Note 4).
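The Mendelian-consistency check at the heart of MARKERINFO can be sketched for a single trio and a single marker. This is a deliberately simplified illustration: missing-data handling and pedigree traversal are omitted, and the function name is an assumption:

```python
# A child's genotype at a codominant marker is Mendelian-consistent if one
# of its alleles can have come from the father and the other from the
# mother. Genotypes are unordered allele pairs.

def mendel_consistent(child, father, mother):
    a, b = child
    return ((a in father and b in mother) or
            (b in father and a in mother))

# Compatible trio: child A/G from an A/A father and a G/G mother.
print(mendel_consistent(("A", "G"), ("A", "A"), ("G", "G")))  # True
# Incompatible trio: a G/G child cannot come from an A/A father.
print(mendel_consistent(("G", "G"), ("A", "A"), ("A", "G"))) # False
```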
2.1.3. Aggregation/Segregation Analysis Programs
FCOR. FCOR estimates familial correlations and their asymptotic standard errors via the Pearson product-moment estimator (2). A test of homogeneity is available to compare correlations among subtypes of relative pairs defined by sex. This program is useful for estimating heritability through the sibling correlation, and for testing, between two correlated traits, the cross-correlations for related individuals, which are evidence of a common genetic basis.

SEGREG. Segregation analysis for quantitative traits is performed using the regressive models of Bonney (3). These models are extremely flexible and allow for covariates that affect either the trait genotype means or the variances. Correction for ascertainment is possible by conditioning on threshold or actual trait values for members of a user-defined proband sampling frame. Several types of within-family correlation may be added to the model to correct for polygenic inheritance and other factors, such as assortative mating and shared environment. Data may be transformed using the Box–Cox or George–Elston (4) transformations, but parameter
estimates are always reported on the original scale and are median unbiased. Two models are available for binary traits: a finite polygenic mixed model (5, 6), with support for traits with variable age of onset, and a multivariate logistic model (7). SEGREG generates a type probability file, listing posterior trait genotype probabilities and individual penetrances for phenotyped individuals, for use in model-based linkage analysis. By default, SEGREG performs several likelihood-ratio tests comparing different transmission models, including no transmission, Mendelian transmission, and general transmission. The user may specify other tests as needed.

2.1.4. Linkage Analysis Programs
Model-Based Linkage Analysis. Two S.A.G.E. programs, LODLINK and MLOD, perform two-point and multipoint model-based linkage analysis, respectively. LODLINK is based on the Elston–Stewart algorithm (8) and thus can handle pedigrees of arbitrary size; however, it does not currently accommodate pedigrees with loops. MLOD evaluates the likelihood and reports a LOD score at each marker locus and, optionally, at a grid of cM locations a specified distance apart. MLOD is based on the Lander–Green algorithm (9) and is thus limited in the size of pedigree that it can analyze; pedigrees larger than the maximum size will be skipped. MLOD defaults to a maximum size of 18 inheritance bits (defined as twice the number of nonfounders minus the number of founders), but in our experience, a pedigree size of 22 bits is manageable (see Note 5). MLOD also provides the Shannon information (10) at each location evaluated. Both programs accept SEGREG type probability files consisting of individual-specific penetrance functions; consequently, highly sophisticated segregation models are accommodated.

IBD-Sharing Probabilities. The program GENIBD estimates single- and multipoint IBD-sharing probabilities using variations on the Elston–Stewart and Lander–Green algorithms, respectively. GENIBD calculates sharing probabilities for five types of relative pairs: full siblings, half siblings, grandparent–grandchild, avuncular, and first cousins. Calculations may be restricted to a subset of these types in the interest of time, as for regression-based model-free linkage analysis requiring only sib-pair IBD estimates. Multipoint IBD-sharing estimation carries the same limitations on pedigree size as does MLOD, but GENIBD offers two options for calculating IBD probabilities on large pedigrees. First, it can break large pedigrees into nuclear families and analyze the nuclear families as separate pedigrees.
Second, it can use a Markov chain Monte Carlo algorithm, based on the method of Sobel and Lange (11), to estimate sharing probabilities by sampling from the distribution of inheritance patterns.
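The inheritance-bit counts quoted here for MLOD (and later for Merlin) follow directly from the pedigree structure. A small helper, using an assumed record layout of (individual, father, mother) with "0" for an unknown parent:

```python
# Inheritance bits = 2 * nonfounders - founders, where a founder is an
# individual with no parents listed in the pedigree.

def inheritance_bits(pedigree):
    founders = sum(1 for _, fa, mo in pedigree if fa == "0" and mo == "0")
    nonfounders = len(pedigree) - founders
    return 2 * nonfounders - founders

# A nuclear family with two parents and three children:
ped = [("1", "0", "0"), ("2", "0", "0"),
       ("3", "1", "2"), ("4", "1", "2"), ("5", "1", "2")]
print(inheritance_bits(ped))  # 2*3 - 2 = 4 bits
```

Because the Lander–Green algorithm scales exponentially in this quantity, adding one nonfounder doubles (roughly) the work twice over, which is why the 18- to 29-bit ceilings cited in the text matter in practice.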
Model-Free Linkage Analysis. S.A.G.E.'s flagship program for model-free linkage analysis is SIBPAL, an implementation of the Haseman–Elston regression-based approach (12), modified to increase power by taking into account information from the mean-adjusted squared trait sum, YS, in addition to the squared difference, YD (13, 14). SIBPAL is valid for use with binary as well as quantitative traits. Several weighting schemes for YD and YS are available. The more powerful weighting schemes are also more computationally complex: because the statistical model may not converge under the most powerful scheme for a particular data set, a less powerful but more stable scheme may be required. Half-sibling pairs may also be incorporated into the analysis, but because separate parameter estimates are initially obtained for the half-sib pairs, care must be taken to ensure that the analysis is computationally stable. SIBPAL estimates empirical P-values by permuting IBD-sharing values within sibships and across sibships of the same size. In addition to its comprehensive output, the program provides a tabular output file that may be imported directly into a graphing program, such as R, for display of results as a function of map position.

The program LODPAL implements a conditional logistic regression approach for affected relative pairs (15, 16). A limitation of LODPAL as currently implemented is that it assumes that all relative pairs within a pedigree are independent. Interpretation of results from LODPAL, especially in the presence of covariates, is challenging, as the number of degrees of freedom assigned to the LOD score changes with the number of covariates. A significant covariate effect indicates that the evidence for linkage depends on the covariate value, implying a gene-environment interaction. Recently, an extension of the Haseman–Elston approach (17, 18) has been implemented in the program RELPAL.
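The regression idea underlying SIBPAL (and extended by RELPAL) can be sketched in its original form: regress the squared trait difference of each sib pair on the pair's estimated proportion of alleles shared IBD at the test locus; under linkage the slope is negative, since pairs sharing more alleles have more similar trait values. The data below are invented for illustration, and the YS-weighted extensions described above are not shown:

```python
# Ordinary least-squares slope of squared sib-pair trait differences (YD)
# on estimated IBD sharing: the classic Haseman-Elston test regressor.

def he_slope(ibd, sq_diff):
    n = len(ibd)
    mx = sum(ibd) / n
    my = sum(sq_diff) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(ibd, sq_diff))
    sxx = sum((x - mx) ** 2 for x in ibd)
    return sxy / sxx

ibd_share = [0.0, 0.5, 0.5, 1.0, 1.0, 0.0]      # proportion shared IBD
squared_diff = [4.1, 2.0, 2.4, 0.3, 0.5, 3.8]   # (trait1 - trait2)^2
print(he_slope(ibd_share, squared_diff))  # negative slope -> linkage signal
```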
This program implements a two-level Haseman–Elston method that incorporates fixed effects for covariates and trait loci at the first, or individual, level, and a polygenic effect and random effects for covariates and trait loci at the second, or pedigree, level. RELPAL conducts a Wald test for the significance of first-level effects and a score test for second-level effects. The linear mixed model in RELPAL is highly adaptable: multiple traits or marker loci may be included, as well as epistatic interactions.

2.1.5. ASSOC: S.A.G.E. Association Analysis Program
The program ASSOC conducts family-based association analysis via a linear mixed model in which marker genotypes are included as fixed effects. One or more types of familial correlation may be incorporated into the model as random effects. Available familial effects include a polygenic effect and random effects for sibships, spousal pairs, and nuclear families (see Note 6). Moreover, custom “group” random effects (e.g., for multifamily households) may be incorporated by defining a categorical variable for which all
members of each group share a unique value. ASSOC is not restricted to pedigree data, but for unrelated individuals such a complicated regression model as that in ASSOC is probably not necessary. ASSOC currently reports a likelihood-ratio statistic and a Wald statistic for each test (see Note 7), and a score test is currently in development. ASSOC can apply the George–Elston transformation (4) during the analysis to normalize regression residuals. Parameter estimates other than variances, however, are reported on the original scale and are median unbiased. An additive, dominant, or recessive model for markers may be specified via a user-defined function. Any type of marker, suitably encoded, may be included as a predictor. Thus, unlike PLINK, ASSOC will perform association analysis on microsatellites and other types of polyallelic markers. Covariates may also be included as fixed effects. Multiple test models may be defined and compared, using a likelihood-ratio test, for each marker. If transmitted-allele indicators are used as a predictor, ASSOC can conduct transmission/disequilibrium test (TDT)-like tests. A major advantage of ASSOC for family-based association analysis is that it can use all family members without conditioning on parental genotypes, as is done by FBAT (19). For region- or genome-wide association testing, ASSOC can run in "batch" mode, applying the same set of test models to every marker within a large pedigree file, or to a contiguous subset of markers. However, for the purpose of defining genetic models, all markers must be encoded with the same allele designations; thus, for SNP markers, nucleotide alleles must be converted to a common coding scheme (e.g., A/B). A major disadvantage of ASSOC, compared to the association analyses in Merlin and PLINK, is speed: a genome-wide association study (GWAS) may require several days' CPU time.
However, a GWAS comprising nearly 500,000 SNPs was completed in less than 1 day by subdividing the SNP set by chromosome and running ASSOC in parallel on a multi-CPU Linux server (20). Score tests, which will be included in a future release of S.A.G.E., are expected to greatly accelerate GWAS using ASSOC.

2.1.6. Haplotyping: DECIPHER
The DECIPHER program for haplotyping and haplotype-based association analysis can estimate posterior probabilities of haplotype pairs for individuals within a pedigree, and haplotype frequencies based on family data. However, in its current form it has severe limitations. First, it reports posterior probabilities for only a single individual in a pedigree per run. Second, it is currently based on the EM algorithm and therefore has all the disadvantages of that algorithm: it is slow, may not converge to a global maximum, and is severely limited in the number of SNPs it can handle. Nonetheless, it will perform haplotyping for polyallelic markers and will even accommodate polyploid genotypes.
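The EM approach on which DECIPHER rests can be illustrated for the simplest case of two diallelic SNPs in unrelated individuals, where only double heterozygotes have ambiguous phase. This is a generic textbook sketch, not DECIPHER's implementation, and it also shows the scaling weakness: each added SNP doubles the number of haplotypes to track:

```python
# Two-SNP EM for haplotype frequencies. Genotypes are (i, j): counts of
# allele A at SNP 1 and allele B at SNP 2. Haplotype order: AB, Ab, aB, ab.

def em_two_snp(genotypes, n_iter=100):
    f = [0.25] * 4
    # Haplotype-pair resolutions for the eight phase-unambiguous classes.
    known = {
        (0, 0): (3, 3), (0, 1): (2, 3), (0, 2): (2, 2),
        (1, 0): (1, 3), (1, 2): (0, 2),
        (2, 0): (1, 1), (2, 1): (0, 1), (2, 2): (0, 0),
    }
    for _ in range(n_iter):
        counts = [0.0] * 4
        for g in genotypes:
            if g in known:
                h1, h2 = known[g]
                counts[h1] += 1
                counts[h2] += 1
            else:  # double heterozygote (1, 1): AB/ab versus Ab/aB
                w1, w2 = f[0] * f[3], f[1] * f[2]
                p = w1 / (w1 + w2)          # E-step: expected resolution
                counts[0] += p
                counts[3] += p
                counts[1] += 1 - p
                counts[2] += 1 - p
        total = sum(counts)
        f = [c / total for c in counts]     # M-step: refit frequencies
    return dict(zip(["AB", "Ab", "aB", "ab"], f))

# A sample carrying only AB and ab haplotypes: EM recovers ~0.5 each.
data = [(2, 2)] * 4 + [(0, 0)] * 4 + [(1, 1)] * 4
print(em_two_snp(data))
```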
2.1.7. Documentation
S.A.G.E. is thoroughly documented in a PDF manual totaling nearly 500 pages. An advantage of the S.A.G.E. documentation is that the statistical theory behind the programs is described in detail. All features and options for each program are shown, along with the correct syntax for the parameter file and the default settings. Examples of parameter files are included for many analyses. No online documentation currently exists in HTML format, but the PDF is freely available through the S.A.G.E. Web site. As it stands, the structure of the documentation may seem intimidating; a more accessible edition is now in preparation, with the statistical theory, general program information, and parameter-file syntax split into separate parts.
2.2. Merlin
http://www.sph.umich.edu/csg/abecasis/Merlin/index.html

Merlin (Multipoint Engine for Rapid Likelihood Inference) (21) performs a variety of tests for linkage and association on pedigree data. Whereas S.A.G.E. is designed for versatility, Merlin is designed for speed: it performs tasks very quickly or not at all. Based on the Lander–Green algorithm (9) (as are the S.A.G.E. MLOD and GENIBD programs), Merlin uses sparse binary trees to improve the efficiency of calculation and, by extension, to increase the size of data sets amenable to analysis relative to other Lander–Green-based programs, such as Genehunter (22). Merlin is still limited in the complexity of families that it can analyze, but with modern computers it seems generally capable of processing pedigrees of at least 29 inheritance bits (defined as twice the number of nonfounders minus the number of founders). Unlike S.A.G.E., Merlin automatically calculates allele frequencies as required for other analyses. A separate program, MINX (Merlin in X), is available for analyses on the X chromosome. It has not been as carefully tested as Merlin, and the authors urge caution in using this program.
2.2.1. Merlin Input Files
A Merlin dataset comprises a pedigree file, containing the data; a data file, providing information on the pedigree file; and a map file. Genetic data for Merlin are included in a pedigree file somewhat similar to that of PLINK (Fig. 2). The first five columns contain
1  101  0    0    1  1  A G  C T  G T  C C  A A  A T
1  102  0    0    2  1  G G  T T  G T  C G  A C  A A
1  103  101  102  2  2  G G  C T  G G  C G  A C  A A
1  104  101  102  1  2  A G  T T  T T  C G  A A  A A
2  201  0    0    1  1  G G  T T  T T  G G  A C  A T
2  202  0    0    2  2  A A  C T  G T  C C  A A  A T
2  203  201  202  2  2  A G  T T  G T  C G  A C  A A
2  204  201  202  2  2  A G  C T  T T  C G  A C  A A
2  205  201  202  1  1  A G  T T  C T  C G  A A  A T
Fig. 2. Tab-delimited text file compatible with PLINK and Merlin. Fields are, in order, family ID, individual ID, father, mother, sex, binary trait, and six marker genotypes with alleles separated by space characters. This file can be converted to a S.A.G.E. file by replacing the allele delimiters with slashes (/) and by adding a header row containing the names of the pedigree fields, binary trait, and markers.
pedigree information: they are obligatory and always appear in the same order: family ID, individual ID, father's ID, mother's ID, and sex. Subsequent columns may contain binary trait, quantitative trait, covariate, or marker information. The variable names and the types of data in the sixth and later columns are specified in the Merlin data file. In marker fields, alleles may be space- or slash-delimited and must be coded as integers or as nucleotides (A, C, G, or T). The Merlin map file lists, for each marker locus, the chromosome, marker name, and sex-averaged genetic map position (sex-specific map positions are optional). Beginning with version 1.1.1, multiple pedigree and data files may be specified as input at the command line, and compressed (gzipped) files are automatically decompressed for import. Merlin also accepts data in the format used for the LINKAGE package ((23); http://linkage.rockefeller.edu/soft/linkage/), but the "QTDT" format described here is far more straightforward and is easily converted to PLINK and S.A.G.E. files.

2.2.2. Quality Control
Merlin does not itself conduct general quality control analysis and error checking on pedigrees. It does, however, perform a test for unlikely, Mendelian-compatible genotyping errors by detecting genotypes that force unlikely recombination patterns (21). The test statistic, r, is a function of the likelihood ratio for the data with a given genotype present versus counted as missing, evaluated under the linkage map provided and under the assumption that all markers are unlinked. A large value of r indicates an unlikely genotype. Values of r are difficult to interpret directly, but a threshold of r > 40 corresponded in simulation analyses to a false-positive rate for genotype errors of 0.001.
2.2.3. Linkage Analysis
Model-Based Linkage Analysis. Merlin conducts model-based linkage analysis on binary traits but not on quantitative traits. For model-based analysis, Merlin requires an accessory model file containing the trait name, the risk allele frequency for a diallelic locus, and the penetrances for the three trait-locus genotypes. Penetrances may depend on covariates, as in S.A.G.E. SEGREG, but they must be defined exhaustively as liability classes. Consequently, a continuous function for age-dependent penetrance is not possible; penetrance must be described in terms of discrete age classes. For linkage analysis on binary traits, Merlin will work in conjunction with the Markov chain Monte Carlo-based program Simwalk2 (24), "farming out" intractably large pedigrees to Simwalk2 and integrating the output from both programs.

IBD-Sharing Probabilities. Merlin rapidly calculates exact single- and multipoint IBD-sharing probabilities for any relative pair type within pedigrees (see Note 8). Merlin also calculates probabilities for extended single-locus IBD sharing, which accounts for inbreeding and distinguishes the maternal and paternal alleles at a
30 Comparison of Requirements and Capabilities. . .
549
locus. The package does not, however, have an option to break up intractably large families for IBD-sharing estimation, as the S.A.G.E. program does. Hence, the inheritance-bit limit is a strict ceiling for IBD calculations, which represents a distinct disadvantage for using large pedigrees in sibling-pair-based analyses such as Haseman–Elston regression. Model-Free Linkage Analysis. Three model-free linkage analyses are available in Merlin: Whittemore and Halpern’s NPL scores (25), a regression-based statistic for quantitative traits (26), and variance-component linkage analysis. Merlin calculates, on request, both the NPLpairs and NPLall statistics (25), as well as the Kong and Cox (27) LOD score. The Kong and Cox linear model, most appropriate for small genetic effects, is used by default; the exponential model is optional. A version of the NPL score is also available for quantitative traits, using as the score a function of the squared, mean-adjusted trait values assigned to founder alleles. For quantitative traits, the accessory program MERLIN-REGRESS performs a version of Haseman–Elston regression (12) modified to take into account the mean-adjusted squared trait sum as well as the squared difference (28), generalized to include all relative pairs within a pedigree (26). The variance-component linkage model implemented in Merlin includes as random effects an additive polygenic component, which is reported as heritability, and an additive quantitative-trait-locus (QTL) effect (see Note 9). Covariates may be included in the model as fixed effects. Linkage Disequilibrium (LD) in Linkage Analysis. Merlin allows for LD among markers by clustering sets of markers assumed to be in LD. However, the program assumes no recombination within a cluster of markers, and no LD between clusters.
The user may supply a file defining clusters, may set a distance threshold in cM below which markers are clustered, or may set a threshold for the r² linkage disequilibrium measure above which markers are clustered.

2.2.4. Family-Based Association Analysis
A family-based test of association for quantitative traits, based on a variance-component model (29), is available in newer versions of Merlin. Like ASSOC, the linear mixed model used in Merlin incorporates SNP effects as fixed effects, although in Merlin only an additive SNP model is available. Family correlations are modeled as random effects, including an additive polygenic effect and an additive effect due to a linked major gene. Genotypes of individuals with phenotype but not marker information may be imputed as the expected number of alleles, as estimated from the posterior probabilities of each genotype given relatives’ genotypes. The user may choose for association testing a fast score test or a slower, but more powerful, likelihood-ratio test based on the multivariate normal distribution. As in ASSOC, Merlin will restrict the analysis to a
550
R.P. Igo and A.H. Schnell
consecutive subset of SNPs in the pedigree file. Both ASSOC and Merlin’s association tests use only population-level information and ignore transmission information; they are therefore susceptible to confounding due to population stratification. Whereas Merlin runs association analysis only on quantitative traits, ASSOC also handles binary traits under a logistic model.

2.2.5. Haplotyping
Merlin is unusual among multipurpose genetic software packages in conducting fast, efficient haplotype phase estimation in pedigrees for tightly linked markers (30). Because the program fills in missing alleles during haplotype phasing, Merlin can impute genotypes in family data given full data in the founders (31) (see Note 10). Three types of haplotyping output are available: the haplotypes corresponding to the most likely inheritance vector (a particular inheritance pattern) of each pedigree, a set of one or more random inheritance vectors, or all possible haplotype configurations. Merlin does not report posterior diplotype probabilities, as do DECIPHER in S.A.G.E. and dedicated haplotyping programs such as PHASE (32). Merlin does not allow recombination within haplotypes, unless linkage equilibrium is assumed in one or more intermarker intervals, and will report an error for any pedigree with obligate recombination. Family-based haplotyping in Merlin is critically hampered by its semi-pictorial output format. Haplotypes are depicted vertically by default, with maternal and paternal alleles separated by a delimiter that shows recombination status. Equally probable, ambiguous haplotype configurations are indicated by listing both SNP alleles at the ambiguous locus. These features make automated processing of haplotyping results by a computer program or script exceedingly difficult. An alternative format displays chromosomes horizontally, but retains the complicated ambiguity codes.
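The founder-based imputation idea can be illustrated with a trivial trio rule: when each parent is homozygous, the offspring genotype is fully determined. This is a minimal sketch of the principle only, not Merlin's haplotype-phasing algorithm:

```python
# Toy illustration of family-based genotype imputation: with complete
# founder data, some missing offspring genotypes are fully determined.
# When both parents are homozygous, the child must carry one copy of
# each parental allele. This rule is illustrative, not Merlin's algorithm.

def impute_offspring(father, mother):
    """Return the offspring genotype when both parents are homozygous,
    else None (the genotype is ambiguous without phasing information)."""
    if father[0] == father[1] and mother[0] == mother[1]:
        return (father[0], mother[0])  # one allele transmitted by each parent
    return None

print(impute_offspring(("A", "A"), ("G", "G")))  # → ('A', 'G')
print(impute_offspring(("A", "G"), ("G", "G")))  # → None
```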
2.2.6. Documentation
Documentation for Merlin is maintained mostly on the World Wide Web. A useful Quick Reference page (http://www.sph.umich.edu/csg/abecasis/Merlin/reference.html) provides a comprehensive listing of commands and parameters. In addition, several tutorials are available that walk the user step by step through basic analyses on example files packaged with the software. Input file formats are thoroughly described. Extensive documentation is not, however, included with the software itself; the README file contains a brief overview of the major functions and refers the user to the Web site.
2.3. PLINK
http://pngu.mgh.harvard.edu/~purcell/plink/index.shtml

PLINK (33) is used extensively for handling and analyzing large data sets containing SNP genotypes. Its major advantages are its ability to handle the vast quantities of data typical of GWAS, including easy reformatting and transposing of genotype files, its speed of use, and
well-maintained documentation. Its greatest limitation is its restriction to association analysis and diallelic (e.g., SNP) markers. A Java program, gPLINK, provides a GUI front end for the software, in the same manner as S.A.G.E., but it must be installed separately from the main program (see Note 11). PLINK is optimized for speedy completion of simple statistical tests and is extremely fast for basic association analyses.

2.3.1. PLINK Input Files
PLINK’s great strength is its ability to manage gigantic datasets and to store SNP data efficiently. A PLINK dataset comprises a pedigree file, a map file, and optionally phenotype and covariate files. Files may be space- or tab-delimited. The first five columns of the pedigree file contain the same information as in Merlin (Fig. 1). The sixth column of the pedigree file is by default a binary trait, and the seventh and subsequent columns are marker alleles. A pedigree file may contain only one trait; additional traits must be read in from accessory files. The PLINK map file lists markers in order, including chromosome number, SNP name, physical position, and optionally genetic map position in cM. Alternatively, PLINK will accept “long file” formats for genotypes, consisting of one marker per individual per row, and given the pedigree and map files will recode the data and output a pedigree file (now containing genotype data) and a map file. It also accepts transposed genotype files with one row per SNP, in which columns represent individuals. This format is useful when a small number of individuals are typed for hundreds of thousands of SNPs. PLINK optionally supports compression libraries and will automatically decompress gzipped files. Data files may be split or merged as needed. PLINK’s binary file format can store genotypes for thousands of individuals typed on even the densest SNP chips, such as the Affymetrix 6.0 and Illumina 1M chips, in a form the program can handle. The genotypes are read in from a separate file, which has no header row. Alleles must be coded as single characters, but any character except 0, the missing-value code, may be used. Some variations on the standard pedigree file format are accepted; for example, the family or trait columns may be omitted. These options, like other features of PLINK, are specified at the command line.
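The PED/MAP layout described above is simple enough to generate programmatically. A minimal sketch that writes a two-individual, two-SNP dataset; all sample data are invented for illustration:

```python
# Build a minimal space-delimited PLINK pedigree (.ped) and map (.map)
# file. The first six .ped columns are family ID, individual ID,
# father's ID, mother's ID, sex, and a binary trait; marker alleles
# follow, one pair per SNP. All sample data below are invented.

individuals = [
    # fam, ind, father, mother, sex, trait, genotypes (one pair per SNP)
    ("FAM1", "1", "0", "0", 1, 2, [("A", "A"), ("G", "T")]),
    ("FAM1", "2", "0", "0", 2, 1, [("A", "G"), ("T", "T")]),
]
markers = [
    # chromosome, SNP name, genetic position (cM), physical position (bp)
    ("1", "rs1001", "0", "15000"),
    ("1", "rs1002", "0", "25000"),
]

ped_lines = [
    " ".join([fam, ind, fa, mo, str(sex), str(trait)]
             + [allele for pair in genos for allele in pair])
    for fam, ind, fa, mo, sex, trait, genos in individuals
]
map_lines = [" ".join(m) for m in markers]

with open("toy.ped", "w") as f:
    f.write("\n".join(ped_lines) + "\n")
with open("toy.map", "w") as f:
    f.write("\n".join(map_lines) + "\n")

print(ped_lines[0])  # → FAM1 1 0 0 1 2 A A G T
```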
2.3.2. Quality Control Functions
Genotype QC. PLINK can screen SNPs for common characteristics used in quality control, including missingness (call rate), minor allele frequency, deviation from Hardy–Weinberg proportions and, when pedigree data are available, Mendelian error rate. If X-chromosome data are available, it will perform a check on the recorded sex of individuals. It will filter SNPs for analysis or data export by user-defined thresholds for any of these measures. As of version 1.07, however, in our experience it does not support
filtering on two or more criteria simultaneously. The program will also accept a separate text file listing SNPs to be removed for analysis or used to create a new dataset. Individuals may be screened in the same manner for Mendelian error rate and missingness. Within pedigrees, PLINK identifies and reports Mendelian-inconsistent genotypes. Like S.A.G.E. MARKERINFO, it optionally sets to missing all genotypes for a SNP within every nuclear family containing a Mendelian inconsistency, and will save genotype data cleaned of Mendelian errors as new datasets. Other QC. PLINK performs multidimensional scaling (MDS) to explore genetic substructure within samples. The MDS function generates a table of coordinates on each dimension for every individual, which may then be plotted using a graphing program. Outliers may be detected within PLINK via a statistic that compares the nearest-neighbor (or nth-nearest-neighbor for n > 1) distance for an individual with the overall distribution of distances to (nth-) nearest neighbors across the entire sample. Pairwise relationship testing is conducted in PLINK by estimating genome-wide IBD sharing, based on identity-in-state comparisons, using an extension of the maximum-likelihood framework of Milligan (34). Allele-sharing proportions for 0, 1, and 2 alleles IBD are reported, as well as the overall sharing proportion. This information can inform error detection in pedigree cleaning, but the program does not make formal inferences about IBD sharing. Moreover, we have found that PLINK tends to overestimate IBD sharing among distantly related relatives. It does have one major advantage over RELTEST, however: it will, by default, estimate IBD sharing between all pairs of individuals in a sample, whether or not they are putatively related, whereas RELTEST does not have an option to make comparisons across families. A more precise test for relationship is possible by combining results from PLINK and RELTEST (see Note 4).

2.3.3. Association Analyses
PLINK offers basic tests of association, plus linear and logistic regression including covariates, analyses for SNP × SNP and SNP × environment interactions, and haplotype-based tests. An exhaustive description of the association tests and their variations for stratified samples, subsets, copy number variants, etc., is beyond the scope of this chapter. Instead, we concentrate on the basic, most commonly used analyses. Tests of Association on Unrelated Samples. For case–control samples, the simplest association test available is the allele-based 2 × 2 χ² test of association, reported in a table listing the allele frequencies in cases and controls, the χ² statistic, and an estimate of the odds ratio. A more comprehensive suite of tests, the standard for most users, is also available through a single command. They include the additive model (Cochran–Armitage trend test),
dominant and recessive models for the minor allele, plus a 2-d.f. genotype test. The same results are reported for each of these tests as for the allele-based χ² test. Exact tests are also available for all of these models, as are permutation-based empirical P-values. For quantitative traits, the basic test of association is a simple linear regression with the number of minor SNP alleles included as a predictor. When covariate adjustment is required, PLINK can conduct linear or logistic regression, drawing the covariate data, and perhaps the trait values, from accessory files. Association analysis output from regression models with covariates is not straightforward to parse, as the regression results for each SNP appear on several lines (one line for the SNP and one for each covariate). Regression-based tests run more slowly than the simpler tests, but it is still possible to analyze dense genome-wide scans on large samples in a reasonable amount of time (less than 1 day). Family-Based Tests of Association. Family-based association tests include the TDT and a variation, parenTDT, that incorporates parental phenotype data. An additional family-based test for binary traits is based on the sib-TDT (35) and incorporates unrelated individuals. A version of the QTDT (36) with permutation to adjust for family structure is also implemented. These tests are limited to nuclear families. Meta-analysis. The Mantel–Haenszel test is implemented for stratified samples, with stratification specified by a clustering variable. PLINK has recently acquired an option to run meta-analysis, which takes as input two or more association output files. It reports Cochran’s Q, I², and results from both fixed-effects and random-effects analysis, although the documentation does not state which methods are used for the latter two.

2.3.4. Haplotyping

PLINK estimates haplotype phase using the EM algorithm, as does S.A.G.E. DECIPHER. For family data, it first assigns haplotypes to the founders and then phases the offspring given the parents. Haplotype assignments may be incorporated into most of the standard tests of association. The limitations of the EM algorithm make this function unsuitable for phasing large numbers of SNPs simultaneously.
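For two diallelic loci the EM idea can be sketched compactly: only the double heterozygote is phase-ambiguous, and each iteration splits those individuals between the two possible phasings in proportion to the current haplotype-frequency estimates. This is a toy sketch of the general technique with invented counts, not PLINK's or DECIPHER's implementation:

```python
# EM estimation of haplotype frequencies for two diallelic SNPs from
# unphased genotypes. Haplotypes are ordered AB, Ab, aB, ab; `known`
# holds haplotype counts from phase-unambiguous individuals, and n_dhet
# is the number of double heterozygotes (two haplotypes each).
# Toy counts; a sketch, not PLINK's actual implementation.

def em_two_snps(known, n_dhet, n_iter=100):
    total = sum(known) + 2 * n_dhet
    p = [0.25] * 4                        # initial haplotype frequencies
    for _ in range(n_iter):
        # E-step: probability that a double het carries the AB/ab phasing
        w = p[0] * p[3] / (p[0] * p[3] + p[1] * p[2])
        expected = [known[0] + n_dhet * w,        # AB
                    known[1] + n_dhet * (1 - w),  # Ab
                    known[2] + n_dhet * (1 - w),  # aB
                    known[3] + n_dhet * w]        # ab
        # M-step: re-estimate frequencies from the expected counts
        p = [c / total for c in expected]
    return p

freqs = em_two_snps(known=[40, 10, 10, 20], n_dhet=10)
print([round(f, 3) for f in freqs])
```

With many loci the number of possible haplotypes, and hence the E-step, grows exponentially, which is the limitation alluded to above.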
2.3.5. Imputation
Imputation in PLINK is carried out by a multiple-marker tagging approach. In a comparison with other imputation programs, PLINK ran more efficiently but performed considerably less well in terms of imputation efficacy and accuracy (37). The authors do not recommend using PLINK for serious imputation studies, as the approach is still under development.
2.3.6. Documentation
The online documentation for PLINK is exemplary. All functions and options are explained in meticulous detail, with examples. At first, the sheer volume of the documentation can be daunting to the
newcomer, but a hierarchical table of contents provided on each page, arranged by topic, enables fast and easy navigation. One important omission, however, is explanation of the statistical underpinnings of many of the functions and of the more complicated tests of association. Some theory is provided in the publication describing the software (33). No documentation is supplied directly with the software; however, a 300-page PDF version is available for download (http://pngu.mgh.harvard.edu/~purcell/plink/pdf.shtml).

2.4. Comparison and Conclusions
S.A.G.E. and PLINK have the greatest number of analysis options and data management tools. S.A.G.E. offers by far the most flexibility in terms of pedigree file formats, at the expense of requiring a separate parameter file (see Notes 2 and 12). To save effort in writing the parameter file, users may export a parameter file generated by the S.A.G.E. GUI for running S.A.G.E. at the command line. PLINK also offers a GUI, which requires additional installation and configuration steps to set up. All three programs accept multiple input files for a single analysis, but S.A.G.E. currently has the most versatility in accepting multiple files of varying formats, although it does not generate new, unified data files as PLINK does, except in certain circumstances such as removing Mendelian-incompatible genotypes. S.A.G.E. allows some pedigree data fields to be missing, under certain assumptions (see Note 13). Merlin and PLINK are very fast relative to S.A.G.E., at the cost of flexibility. PLINK is designed to handle very large data files. Merlin has an enormous advantage in speed when phasing haplotypes, since S.A.G.E. and PLINK both (currently) use the EM algorithm. However, Merlin does very little quality control (see Note 14), and its other functions are constrained by the computational limits of its modified Lander–Green architecture. PLINK is restricted to diallelic markers, greatly limiting its usefulness outside of genome-wide association studies. A major advantage of the S.A.G.E. documentation is that it explains the theory behind each of the programs. PLINK has the most comprehensive online documentation of the three packages, but does not reference statistical approaches. All programs would benefit from the inclusion of more examples to illustrate analysis options. All programs generate tabular output files that can be imported into graphing software such as R. In addition, the programs themselves create simple graphs from certain analyses.
Merlin currently has the best graphical capabilities: the program can generate PDF files displaying graphical summaries for some analyses. S.A.G.E. provides text-based histograms for RELTEST. The tables of association results generated by PLINK from genome-wide SNP data can become cumbersome through sheer size. Also, when covariates are included in PLINK association tests, results for each SNP are presented over several lines, thus requiring additional file manipulation to extract the SNP-specific effect sizes and P-values.
In summary, whereas these three powerful software packages overlap considerably, each offers unique strengths. It is likely that users performing a variety of analyses on genetic data will find occasion to use all of them.
3. Notes

1. The S.A.G.E. GUI may be used to construct a parameter file for use in running S.A.G.E. at the command line in Linux.

2. In S.A.G.E., some pedigree structure fields, including parental IDs and sex, may be omitted in programs that do not use them.

3. RELTEST can be made to evaluate all pairs of individuals, related or unrelated, by omitting the family ID data and instructing S.A.G.E. to assume all individuals are full siblings. This step violates certain assumptions underlying the program’s algorithm for differentiating pair types, but a manual investigation can still be performed from the distributions of the test statistics.

4. We have found that plotting the Yj statistic from S.A.G.E. RELTEST against the average proportion of IBD sharing, p̂, from PLINK can increase resolution in identifying second-degree relative pairs compared with either program by itself.

5. PEDINFO reports the number of inheritance bits for each pedigree under the “report pedigree-specific statistics” option. Pedigrees that exceed the limit for MLOD may be omitted or manually split into smaller pedigrees.

6. Including a polygenic variance component in the model yields an estimate of the narrow-sense heritability for the trait under study.

7. We have used a comparison between the likelihood-ratio and Wald test statistics as a quality-control measure. A large discrepancy between the P-values from these tests suggests that the likelihood may not have converged to a maximum.

8. Merlin is often used for IBD calculations to export into other programs, especially SOLAR (38) (http://www.sfbr.org/Departments/genetics_detail.aspx?p=37), the flagship program for variance-component linkage analysis (39). SOLAR does not itself compute exact IBD probabilities.

9. SOLAR does allow incorporation of dominance variance components for both QTL and polygenic effects in the model, but these may be difficult to estimate unless the number of sibling pairs in the sample is very large, and estimating dominance components is generally not recommended.
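The heritability estimate in Note 6 is simply a variance ratio, the additive (polygenic) component over the total. A trivial sketch with invented component values:

```python
# Narrow-sense heritability from variance-component estimates: additive
# (polygenic) variance divided by total trait variance, as in Note 6.
# Component values below are invented for illustration.

def narrow_sense_h2(var_additive, var_qtl, var_env):
    """var_qtl and var_env are the remaining (non-additive) components."""
    return var_additive / (var_additive + var_qtl + var_env)

print(narrow_sense_h2(var_additive=0.3, var_qtl=0.2, var_env=0.5))  # → 0.3
```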
10. The programs Genehunter (40) (http://www.broadinstitute.org/ftp/distribution/software/genehunter/) and Simwalk2 (24) (http://www.genetics.ucla.edu/software/simwalk) also estimate haplotypes on pedigree data, but both assume linkage equilibrium among all markers.

11. gPLINK is integrated with Haploview (41) (http://www.broadinstitute.org/haploview) for visualization of results.

12. Several programs that reformat data among several file types are available; the most versatile is Mega2 (42). These programs, however, do not generalize well to the widely varied types of datasets encountered in practice. Of particular importance is that S.A.G.E. allows for column and allele delimiters other than spaces, tabs, or slashes. Ideally, any research group working with genome-wide data from current SNP genotyping platforms includes a programmer well versed in a scripting language, for example, Perl or Python, that can handle files of over 1 GB. Statistical packages like R and SAS are not well suited to manipulating files of this size or complexity.

13. Many programs in S.A.G.E. accommodate deviations from the strictest requirements for pedigree data. For example, all programs will run on a sample in which two unrelated families have the same pedigree number. However, in many cases when deviations from the “ideal” format occur, the programs will issue warnings that unexpected data structures were encountered. Users are strongly encouraged to consult the program information (*.inf) file, which contains all generated warning and error messages, after each analysis.

14. Some functions that S.A.G.E. performs are available in the program PedStats (http://www.sph.umich.edu/csg/abecasis/PedStats/index.html), also maintained by Gonçalo Abecasis’s research group.

References

1. Epstein MP, Duren WL, Boehnke M (2000) Improved inference of relationships for pairs of individuals. Am J Hum Genet 67: 1219–1231
2. Keen KJ, Elston RC (2003) Robust asymptotic sampling theory for correlations in pedigrees. Stat Med 22: 3229–3247
3. Bonney GE (1984) On the statistical determination of major gene mechanisms in continuous human traits: regressive models. Am J Med Genet 18: 731–749
4. George VT, Elston RC (1988) Generalized modulus power transformations. Commun Statist Theory Meth 17: 2933–2952
5. Fernando RL, Stricker C, Elston RC (1994) The finite polygenic mixed model: an alternative formulation for the mixed model of inheritance. Theor Appl Genet 88: 573–580
6. Lange K (1997) An approximate model of polygenic inheritance. Genetics 147: 1423–1430
7. Karunaratne PM, Elston RC (1998) A multivariate logistic model (MLM) for analyzing binary family data. Am J Med Genet 76: 428–437
8. Elston RC, Stewart J (1971) A general model for the analysis of pedigree data. Hum Hered 21: 523–542
9. Lander E, Green P (1987) Construction of multilocus genetic linkage maps in humans. Proc Natl Acad Sci USA 84: 2363–2367
10. Kruglyak L, Lander ES (1995) Complete multipoint sib-pair analysis of qualitative and quantitative traits. Am J Hum Genet 57: 439–454
11. Sobel E, Lange K (1996) Descent graphs in pedigree analysis: applications to haplotyping, location scores, and marker-sharing statistics. Am J Hum Genet 58: 1323–1337
12. Haseman J, Elston R (1972) The investigation of linkage between a quantitative trait and a marker locus. Behav Genet 2: 3–19
13. Elston R, Buxbaum S, Jacobs K, Olson J (2000) Haseman and Elston revisited. Genet Epidemiol 19: 1–17
14. Shete S, Jacobs K, Elston R (2003) Adding further power to the Haseman and Elston method for detecting linkage in larger sibships: weighting sums and differences. Hum Hered 55: 79–85
15. Goddard KAB, Witte JS, Suarez BK, Catalona WJ, Olson JM (2001) Model-free linkage analysis with covariates confirms linkage of prostate cancer to chromosomes 1 and 4. Am J Hum Genet 68: 1197–1206
16. Olson JM (1999) A general conditional-logistic model for affected-relative-pair linkage studies. Am J Hum Genet 65: 1760–1769
17. Wang T, Elston RC (2005) Two-level Haseman-Elston regression for general pedigree data analysis. Genet Epidemiol 29: 12–22
18. Wang T, Elston RC (2007) Regression-based multivariate linkage analysis with an application to blood pressure and body mass index. Ann Hum Genet 71: 96–106
19. Laird NM, Horvath S, Xu X (2000) Implementing a unified approach to family-based tests of association. Genet Epidemiol 19: S36–S42
20. Kopplin LJ, Igo RP Jr, Wang Y et al (2010) Genome-wide association identifies SKIV2L and MYRIP as protective factors for age-related macular degeneration. Genes Immun 11: 609–621
21. Abecasis GR, Cherny SS, Cookson WO, Cardon LR (2002) Merlin—rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet 30: 97–101
22. Markianos K, Daly MJ, Kruglyak L (2001) Efficient multipoint linkage analysis through reduction of inheritance space. Am J Hum Genet 68: 963–977
23. Lathrop GM, Lalouel JM, Julier C, Ott J (1984) Strategies for multilocus linkage analysis in humans. Proc Natl Acad Sci USA 81: 3443–3446
24. Sobel E, Sengul H, Weeks DE (2001) Multipoint estimation of identity-by-descent probabilities at arbitrary positions among marker loci on general pedigrees. Hum Hered 52: 121–131
25. Whittemore AS, Halpern J (1994) A class of tests for linkage using affected pedigree members. Biometrics 50: 118–127
26. Sham PC, Purcell S, Cherny SS, Abecasis G (2002) Powerful regression-based quantitative-trait linkage analysis of general pedigrees. Am J Hum Genet 71: 238–253
27. Kong A, Cox NJ (1997) Allele-sharing models: LOD scores and accurate linkage tests. Am J Hum Genet 61: 1179–1188
28. Sham PC, Purcell S (2001) Equivalence between Haseman-Elston and variance-components linkage analyses for sib pairs. Am J Hum Genet 68: 1527–1532
29. Chen W-M, Abecasis GR (2007) Family-based association tests for genomewide association scans. Am J Hum Genet 81: 913–926
30. Zhang K, Zhao H (2006) A comparison of several methods for haplotype frequency estimation and haplotype reconstruction for tightly linked markers from general pedigrees. Genet Epidemiol 30: 423–437
31. Li Y, Willer C, Sanna S, Abecasis GR (2009) Genotype imputation. Annu Rev Genomics Hum Genet 10: 387–406
32. Stephens M, Donnelly P (2003) A comparison of Bayesian methods for haplotype reconstruction from population genotype data. Am J Hum Genet 73: 1162–1169
33. Purcell S, Neale B, Todd-Brown K et al (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81: 559–575
34. Milligan BG (2003) Maximum-likelihood estimation of relatedness. Genetics 163: 1153–1167
35. Spielman RS, Ewens WJ (1998) A sibship test for linkage in the presence of association: the sib transmission/disequilibrium test. Am J Hum Genet 62: 450–458
36. Abecasis GR, Cookson WO, Cardon LR (2000) Pedigree tests of transmission disequilibrium. Eur J Hum Genet 8: 545–551
37. Nothnagel M, Ellinghaus D, Schreiber S, Krawczak M, Franke A (2009) A comprehensive evaluation of SNP genotype imputation. Hum Genet 125: 163–171
38. Almasy L, Blangero J (1998) Multipoint quantitative-trait linkage analysis in general pedigrees. Am J Hum Genet 62: 1198–1211
39. Amos CI (1994) Robust variance-components approach for assessing genetic linkage in pedigrees. Am J Hum Genet 54: 535–543
40. Kruglyak L, Daly MJ, Reeve-Daly MP, Lander ES (1996) Parametric and nonparametric
linkage analysis, a unified multipoint approach. Am J Hum Genet 58: 1347–1363
41. Barrett JC, Fry B, Maller J, Daly MJ (2005) Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics 21: 263–265
42. Mukhopadhyay N, Almasy L, Schroeder M, Mulvihill WP, Weeks DE (1999) Mega2, a data-handling program for facilitating genetic linkage and association analyses. Am J Hum Genet 65: A436
INDEX A AAF. See Ascertainment, ascertainment-assumption-free method (AAF) Acute guttate psoriasis ........................................121, 124 Additive.............................................. 3–5, 7, 8, 153, 154 Additive genetic variance .............................. 9, 152, 172, 173, 179, 303, 498 Admixed population ..................................... 79, 80, 400, 450, 465–467, 475, 477, 479, 480 ADMIXPROGRAM ........................................... 471–476 Affected relative pairs (ARPs)..............................56, 318, 322–323, 325, 329, 330, 341, 342, 545 Affymetrix CEL files .......................................514, 516, 518–521 CRLMM package................514, 515, 517, 519–521 AIBS. See Allele-sharing statistics AIMs. See Ancestry informative markers (AIMs) ALLEGRO ............................. 288, 318, 319, 328, 329, 331–337 Allele, definition ............................................................... 2 Allele frequency ancestral vs European and African populations .......................................................474 definition ........................................................... 61–62 errors.......................................................................339 Allele-sharing statistics AIBS................................................................... 32–33 EIBD.................................................................. 32–33 IBS ..................................................................... 32–33 Allelic association ......................................... 6–7, 89, 366 Allelic linkage disequilibrium ............................. 104–106 Analysis of variance (ANOVA) .................. 349, 353, 354 Ancestry informative markers (AIMs) ...................... 405, 407, 471–473, 475, 476 ANCESTRYMAP.............................. 471, 473, 475, 476 ANOVA. See Analysis of variance APL. 
See Association in the presence of linkage (APL) Apt-probeset-genotype .......................................517, 519 Ascertainment ascertainment-assumption-free method (AAF) ...... 188, 195–198, 203, 204, 206 complete ascertainment ............. 190–195, 197, 198, 202–204, 206 multiple ascertainment......................... 192–195, 204 perfect single ascertainment ..................................203 proband, definition of............................................189
PSF .......................................189, 200, 201, 229, 230 single ascertainment........... 139, 191–195, 197–200, 202–207, 230 ASSOC. See Statistical Analysis for Genetic Epidemiology (S.A.G.E.) Association in the presence of linkage (APL) .......................................... 360–369 Assortative mating.......................78, 79, 108, 154, 156, 174, 216, 543 Automatic genotype elimination...................................12 Autosome ................................................... 320, 403, 404
B Batch quality score .......................................................521 Best linear unbiased estimator (BLUE)................. 63, 71 Binary trait.........................................122–141, 212–217, 219, 222, 226, 229–231, 233, 234, 263, 264, 280, 285–300, 317–342, 347, 357, 377–381, 427, 465, 501, 541, 542, 544, 547, 548, 550, 551, 553 Bingo .......................................................... 487, 489–492 BioGRID database ..................................... 484, 485, 492 Birdseed ...................................................... 514–518, 527 Block-based haplotype ...................... 424, 435, 437–439 BLUE. See Best linear unbiased estimator
C Case–control data analysis empirical power and tests .............................355, 358 genetic model and risk allele ........................ 349–352 CaTS software ................................... 253, 254, 259–260 CCVM. See Congenital cardiovascular malformation (CCVM) CD16 ................................................................... 331–338 CEL files ............................................ 514, 516, 518–521 Chi-square goodness-of-fit test ...........82–83, 85, 90–92 Cholesky decomposition ........................... 156, 157, 392 Chronic obstructive pulmonary disease (COPD)............ 121, 127, 128, 130, 134 Classical twin model............................................ 154–157 Classic model-based linkage analysis..................263, 264 CMC. See Combined multivariate and collapsing method CMH. See Cochran–Mantel–Haenszel Cochran-Armitage trend test ...................... 89, 348, 552 Cochran–Mantel–Haenszel (CMH) ......... 124–126, 147
Robert C. Elston et al. (eds.), Statistical Human Genetics: Methods and Protocols, Methods in Molecular Biology, vol. 850, DOI 10.1007/978-1-61779-555-8, © Springer Science+Business Media, LLC 2012
STATISTICAL HUMAN GENETICS | Index
Codominant, definition ....... 3
Collapsing ....... 368, 454, 455, 458, 462
Combined multivariate and collapsing method (CMC) ....... 455
Complete ascertainment. See Ascertainment, complete ascertainment
Complex trait ....... 4–5, 154, 287
Congenital cardiovascular malformation (CCVM) ....... 121, 135, 136
COPD. See Chronic obstructive pulmonary disease (COPD)
Critical-genotype algorithm ....... 15–18
CRLMM ....... 514, 515, 517, 519–521
Cryptic relatedness
  family structure ....... 407
  measures of population structure ....... 79, 255–256
Cytoscape ....... 485–489, 491, 492
D
DECIPHER. See Statistical Analysis for Genetic Epidemiology (S.A.G.E.)
DESPAIR program ....... 246, 257–259
Diallelic ....... 2, 30, 75, 82, 87, 90, 95, 104–107, 201, 214, 221, 266, 286, 364, 367, 525, 526, 548, 551, 554
Discordant pairs ....... 258, 340
Disease heterogeneity ....... 238, 242–243, 250
Disturbance/error terms ....... 496, 502
Dominant, definition ....... 7–8
E
EIBD. See Allele-sharing statistics
Elston–Stewart algorithm ....... 75, 301, 544
EM algorithm. See Expectation maximization (EM) algorithm
Epistasis ....... 6, 8, 154
Ethnic mixtures
  admixture mapping methods ....... 467, 469, 470, 473, 475, 477–479
  ancestral populations ....... 465–475, 478–480
  individual's locus-specific ancestry ....... 467, 469–477
  STRUCTURE program ....... 470, 471, 473, 475, 476
Expectation-substitution method ....... 427, 442
Expectation maximization (EM) algorithm ....... 33, 71, 109, 325, 362, 365, 425, 428, 429, 432–434, 440, 443, 446, 515, 546, 553, 554
F
Factor loadings ....... 496, 497, 502, 503, 510
False positive ....... 25, 59, 79, 238, 244, 255–257, 339, 400, 407, 548
Familial aggregation ....... 119–148, 177, 215, 264, 331
Family case–control designs ....... 122, 133–137, 141
Family history (FH) ....... 121–129, 133, 141, 144, 145, 250
Family history exposure
  positive and negative ....... 123
Family history score (FHS) ....... 127–129
Family size distribution (FSD) ....... 190
FASTPHASE ....... 412–414, 418, 419, 429, 430, 458
FCOR. See Statistical Analysis for Genetic Epidemiology (S.A.G.E.)
FH. See Family history (FH)
FHS. See Family history score (FHS)
Finite polygenic mixed model (FPMM) ....... 214–216, 221, 222, 226–228, 544
FPMM. See Finite polygenic mixed model (FPMM)
F-tests ....... 349, 357–358
Functional enrichment analysis ....... 489–492
G
Gametic phase disequilibrium (GPD) ....... 6–7, 103, 110
GEE. See Generalized estimating equations (GEE)
GenABEL ....... 112, 375, 517
GENASSOC ....... 112
Gene
  definition ....... 2
  interaction network ....... 483–492
  ontology ....... 485, 487, 489–492
Gene–environment correlation ....... 155
Gene–environment interaction ....... 129–131, 134–135, 155, 331
GENEHUNTER program ....... 267, 288, 318, 321, 324, 329, 547, 556
Generalized estimating equations (GEE) ....... 134–136, 147, 303, 369, 372–374, 376, 377, 379, 381
Generalized linear models (GLM) ....... 147, 372, 375, 426–428, 450
General transmission model ....... 211, 215, 220, 221, 227, 228
Genetic drift ....... 60, 63, 64, 77, 79, 400
Genetic heterogeneity ....... 241, 287, 297, 332, 340, 341
Genetic marker ....... 12, 88, 94, 130, 172, 211, 251, 263, 286, 306, 307, 317, 318, 320, 321, 465, 499
Genetic Power Calculator ....... 247, 259
Genome-wide association studies (GWAS) ....... 34, 40, 47–53, 55, 74, 86, 95, 239, 247, 248, 250, 252–256, 259, 351, 357, 359, 363, 365, 400, 405, 453–455, 457, 477, 479, 513, 517, 525, 546, 550, 554
  array selection ....... 252
Genome-wide significance ....... 240, 243, 245–247, 252, 253, 341–342, 351, 477, 479
Genotype
  conditional probability ....... 29, 30, 33
  definition ....... 3, 5, 15
  errors ....... 11–23, 82, 548
  frequencies ....... 60, 87, 93, 103, 104, 302, 479
Genotype-elimination algorithm ....... 12, 14–16, 18
Genotypic LD measures ....... 104, 106, 109, 115
Genotypic linkage disequilibrium ....... 106–107
George and Elston transformation ....... 177, 184, 376, 385, 546
GLM. See Generalized linear models (GLM)
GOLD ....... 112
GPD. See Gametic phase disequilibrium (GPD)
Graphical user interface (GUI) ....... 75, 99, 175, 217, 222, 226, 229, 230, 234, 265–269, 273, 281, 294, 304, 305, 307, 308, 315, 381, 382, 500, 501, 541, 542, 551, 554, 555
GUI. See Graphical user interface (GUI)
GWAS. See Genome-wide association studies (GWAS)
H
Haldane map function ....... 542
Haplotype ....... 5, 15, 41, 103, 245, 329, 407, 411, 423, 454, 546
Haploview ....... 98, 110–113, 424, 435–437, 556
HapMap ....... 49, 51–52, 55, 108, 110, 252, 407, 412, 426, 430, 435, 436, 453, 471, 473, 515, 520, 533, 536
Hardy–Weinberg equilibrium (HWE) ....... 60–62, 65, 75, 78, 87, 92, 93, 95, 98, 104, 107, 201, 213, 217, 250, 299, 319, 349, 402, 425, 431, 432, 434, 435, 443, 450, 479, 530, 532, 535
Hardy–Weinberg proportion test ....... 77–99
Haseman–Elston (HE) ....... 302–304, 310, 545, 549
HE. See Haseman–Elston (HE)
Heritability
  broad sense heritability ....... 172
  narrow sense heritability ....... 172, 555
Hidden Markov model (HMM) ....... 29, 309, 318, 400, 413, 430, 469, 470, 475
HMM. See Hidden Markov model (HMM)
Homogeneity test ....... 282
Homogeneous general transmission ....... 214, 234
HWE. See Hardy–Weinberg equilibrium (HWE)
I
IBD. See Identical by descent (IBD)
Identical by descent (IBD) ....... 7, 25–37, 41–43, 49–56, 75, 173, 240, 241, 245, 248, 278, 301, 302, 304–310, 313, 318–323, 325, 326, 329, 330, 334, 338–342, 362, 365, 471, 542–545, 548, 549, 552, 555
Identical in state (IIS) ....... 7, 32, 34–36, 38, 41, 42, 44, 301, 326
IIS. See Identical in state (IIS)
Illumina
  chip design ....... 528
  whole-genome sequencing ....... 527, 528
Imprinting ....... 256, 341
Inbreeding ....... 40, 42, 64, 75, 78, 79, 88, 154, 274, 309, 470, 548
IntAct database ....... 486
Intraclass correlation ....... 142, 391
K
Kinship coefficients ....... 26–29, 32, 33, 49, 56, 302, 374, 375, 380
Kong and Cox model ....... 327–328
Kosambi map function ....... 267, 307
L
Lander–Green–Kruglyak algorithm ....... 13
LAPCC. See Local ancestry principal components correction (LAPCC)
Latent variable ....... 5, 153, 496, 497, 502–504, 507, 510, 511
LD. See Linkage disequilibrium (LD)
Likelihood ratio test (LRT) ....... 31, 54, 56, 85, 87, 89, 174, 217, 221, 222, 225, 227, 228, 297, 303, 319, 325–327, 332, 341, 375, 391, 395, 396, 438, 439, 469, 498, 544, 546, 549, 555
Linear mixed effect model ....... 375
Linear regression ....... 173, 302, 311, 349, 402, 426, 465, 469, 476, 529, 553
LINKAGE ....... 12, 18, 281, 282, 334, 548
Linkage analysis software ....... 267–268, 288–289
Linkage disequilibrium (LD)
  definition ....... 103
  multilocus LD measures ....... 108
Local ancestry ....... 400, 401, 404, 471–475
Local ancestry principal components correction (LAPCC) ....... 400–402, 404, 407
Locus
  definition ....... 2, 4
  description files ....... 67, 71, 74, 265–267, 274, 275, 289, 292–293, 295, 298, 307, 308, 542
  heterogeneity ....... 242–243, 245, 287, 296, 297, 341
LocusZoom ....... 112
LODLINK. See Statistical Analysis for Genetic Epidemiology (S.A.G.E.)
LODPAL. See Statistical Analysis for Genetic Epidemiology (S.A.G.E.)
LODs, LOD scores ....... 12, 239, 241–243, 264, 277–280, 282, 283, 287–288, 295–299, 327, 329, 330, 336, 340, 342, 476, 544, 545, 549
Loop ....... 23, 75, 175, 177, 182, 183, 266, 278, 289, 309, 439, 497, 544
LRT. See Likelihood ratio test (LRT)
M
MAF. See Minor allele frequency (MAF)
Major locus transmission ....... 211, 212, 215, 217–228
Marginal model ....... 372–374, 377
MARKERINFO. See Statistical Analysis for Genetic Epidemiology (S.A.G.E.)
Marker informativity ....... 319, 338
Markov chain Monte Carlo (MCMC) algorithm ....... 85, 475, 544
Matching criteria ....... 133
Maximum likelihood (ML) ....... 159, 375, 427, 428, 475
Maximum likelihood ratio test (MLRT) ....... 26, 29–31, 35, 36, 39–42, 45
MCMC. See Markov chain Monte Carlo (MCMC) algorithm
Mendelian inconsistencies ....... 12, 26, 35, 42, 402, 419, 552
Mendelian transmission models ....... 221, 231, 264
MENDEL program ....... 13
MERLIN. See Multipoint Engine for Rapid Likelihood INference (MERLIN)
Microsatellite map ....... 245
Migration/gene flow ....... 63, 64
Minor allele frequency (MAF) ....... 51, 53, 65, 67, 456
Mixed model ....... 5, 211, 212, 214, 372, 374–377, 407, 544, 545, 549
Mixture Hardy–Weinberg proportion (mHWP) exact test ....... 87–88
ML. See Maximum likelihood (ML)
MLOD. See Statistical Analysis for Genetic Epidemiology (S.A.G.E.)
MLRT. See Maximum likelihood ratio test (MLRT)
Model-based linkage analysis ....... 218, 231, 263–283, 285–300, 302, 318, 331, 542, 544, 548–549
Model-free linkage analysis ....... 25, 287, 301–315, 317–343, 544, 545, 549
Model identification ....... 497, 502, 503
Monoallelic ....... 2
Monogenic ....... 4–5, 130, 239, 454
Monte Carlo test ....... 84, 85
Mplus ....... 499–501, 503, 504, 512
Multiallelic ....... 2, 107, 286, 367
Multifactorial inheritance models ....... 225
Multilocus genotype ....... 5–6
Multiple ascertainment. See Ascertainment, multiple ascertainment
Multipoint analysis ....... 264–266, 273, 278–279, 297, 298
Multipoint Engine for Rapid Likelihood INference (MERLIN) ....... 23, 304, 328, 329, 339, 540, 547–551, 554, 555
Multi-SNP haplotype analysis methods ....... 423–450
Multistage design ....... 252–255
Mutation ....... 4, 60, 63–64, 77, 244, 255, 287, 318
N
Natural selection ....... 60, 63, 77, 400
NER. See Nucleotide excision repair (NER)
Network analyses ....... 487–489
Nonadditive genetic effects ....... 154
Nonparametric linkage score (NPL) ....... 319, 323–329, 331, 334, 336, 337, 341, 342, 549
Non-shared environmental influences ....... 153
Non-transmission statistics ....... 360, 364
NPL score. See Nonparametric linkage score (NPL)
Nuclear families ....... 13–16, 142, 171–185, 188, 199, 214, 215, 222, 266, 274, 309, 360–362, 364, 366–369, 375, 380, 387, 391, 395, 430, 498, 544, 545, 552, 553
Nuclear pedigree algorithm ....... 13–14, 18
Nucleotide excision repair (NER) ....... 445
O
Odds ratio ....... 8, 15–17, 106, 115, 123, 124, 126–130, 132, 134, 135, 143–147, 366, 369, 380, 434, 442, 446–450, 454, 466, 469, 552
OpenMx ....... 151, 153, 157, 158, 165–169, 500, 501, 512
Osprey ....... 485
P
Pajek ....... 485
Parametric linkage analysis ....... 12, 318
Partition-ligation EM algorithm ....... 425, 433
Path coefficient ....... 163, 164, 506, 507, 510
Path diagram ....... 153, 496, 497, 500, 510
PDT. See Pedigree disequilibrium test (PDT)
PedCheck ....... 13, 15–18, 21, 23, 26
Pedigree disequilibrium test (PDT) ....... 360–364, 366–369
Pedigree relationship errors ....... 25–45, 52
PEDINFO. See Statistical Analysis for Genetic Epidemiology (S.A.G.E.)
PedPhase ....... 412, 415, 418, 419
Perfect single ascertainment. See Ascertainment, perfect single ascertainment
Permutation test ....... 92, 456, 462, 463, 477
Phenotype, definition ....... 3
PICR. See Protein Identifier Mapping Service (PICR)
Pleiotropy ....... 6
PLINK software ....... 95–98, 471
Polyallelic ....... 2, 546
Polygenic ....... 4–5, 172–174, 179, 211–228, 234, 372, 374–376, 380, 384, 385, 387, 390, 391, 395, 396, 499, 501, 543–545, 549, 555
  inheritance model ....... 5, 212, 215, 226, 376, 543
  transmission model ....... 214, 215
Polygenic-environmental model ....... 215
Polymorphism ....... 2, 4, 92, 114, 129, 239, 245, 347, 445, 498
Population stratification ....... 48, 78, 79, 255, 359, 360, 364, 367, 377, 399–407, 434, 444, 446, 454, 477, 550
Population structure ....... 6, 79, 255–257, 367, 399–402, 404, 405, 431, 435, 467, 474, 510
Power ....... 26, 39, 42, 45, 48–50, 56, 59, 80, 81, 86, 89, 107, 114, 131, 141, 156, 161, 162, 171, 174, 177, 178, 184, 238–250, 252–260, 264, 265, 289, 296, 297, 303–305, 310, 311, 315, 319, 322, 325, 328, 331, 339–341, 348, 349, 354, 355, 357, 360, 361, 368, 369, 377, 385, 387, 400, 419, 429, 431, 443, 454, 457, 479, 488, 489, 504, 514, 516, 518, 545
PREST ....... 26, 28, 34–36, 39–41, 48–56
Principal components analysis ....... 400, 471
Proband. See Ascertainment, proband, definition of
Proband-dependent sampling ....... 198–201
Proband sampling frame (PSF) ....... 189, 200, 201, 229, 230, 543
Protein Identifier Mapping Service (PICR) ....... 492
Protein–protein interactions ....... 484–487
Proteomics ....... 484, 487
Proteomics Standard Initiative Common Query Interface (PSICQUIC) ....... 484
Pseudo-likelihood ....... 199, 205–208
PSF. See Proband sampling frame (PSF)
PSICQUIC. See Proteomics Standard Initiative Common Query Interface (PSICQUIC)
Psoriasis ....... 121, 124
Q
QTL. See Quantitative trait locus
Quantitative trait locus ....... 7, 549
QUANTO ....... 131
R
Random environmental transmission model ....... 215, 217, 220
Rare variant ....... 48, 90, 252, 368, 453–463, 503, 511, 528
Recessive, definition ....... 3
Recombination fraction ....... 12, 30, 200, 264, 265, 267, 274, 277, 282, 287–288, 295–297
Regressive models ....... 214, 216, 221–226, 228, 543
RELPAIR ....... 23, 543
RELPAL. See Statistical Analysis for Genetic Epidemiology (S.A.G.E.)
RELTEST. See Statistical Analysis for Genetic Epidemiology (S.A.G.E.)
RMSEA. See Root mean square error of approximation (RMSEA)
Root mean square error of approximation (RMSEA) ....... 498, 504, 506, 507, 510, 511
R package Rassoc ....... 348, 350
S
SAS/Genetics ....... 90–93
SEGPATH ....... 500, 501
SEGREG. See Statistical Analysis for Genetic Epidemiology (S.A.G.E.)
Segregation analysis ....... 211–234, 264, 267, 498, 500, 543–544
SEM. See Structural equation modeling (SEM)
SIBPAL. See Statistical Analysis for Genetic Epidemiology (S.A.G.E.)
Single ascertainment. See Ascertainment, single ascertainment
Single-nucleotide polymorphism (SNP) ....... 2, 26, 47, 65, 82, 104, 130, 237, 239, 266, 286, 305, 329, 347, 363, 375, 401, 423, 453, 470, 498, 513, 525, 540
  imputation ....... 430, 443
SIR. See Standardized incidence ratio (SIR)
SMR. See Standardized mortality ratio (SMR)
SNP. See Single-nucleotide polymorphism (SNP)
Sporadic model ....... 215, 217, 222, 228
Standardized incidence ratio (SIR) ....... 139–141, 147
Standardized mortality ratio (SMR) ....... 139, 147, 148
Statistical Analysis for Genetic Epidemiology (S.A.G.E.)
  ASSOC ....... 173–175, 375, 377, 381, 382, 395, 541, 542, 545–546
  DECIPHER ....... 547, 550, 553
  FCOR ....... 543
  LODLINK ....... 218, 265–267, 289, 291, 293, 294, 297, 542, 544
  LODPAL ....... 319, 545
  MARKERINFO ....... 267, 274, 552
  MLOD ....... 218, 265, 267, 297, 542, 544, 547, 555
Statistical Analysis for Genetic Epidemiology (S.A.G.E.) (cont.)
  PEDINFO ....... 175, 176, 183, 273, 274, 304, 306, 555
  RELPAL ....... 304, 313, 545
  RELTEST ....... 267, 274, 552, 554, 555
  SEGREG ....... 201, 212, 217, 218, 234, 267, 544, 548
  SIBPAL ....... 304, 545
Structural equation modeling (SEM)
  factor loadings ....... 496, 497, 502, 503, 510
  path diagram ....... 496, 497, 500, 510
STRUCTURE program ....... 471, 473, 475, 476
Suppressor interactions ....... 484
Synthetic lethal interactions ....... 484
T
Tandem affinity purification (TAP) ....... 484
TAP. See Tandem affinity purification (TAP)
TDT. See Transmission/disequilibrium test (TDT)
TGFA. See Transforming factor alpha locus (TGFA)
Transforming factor alpha locus (TGFA) ....... 129–132
Transmission/disequilibrium test (TDT) ....... 360, 371, 372, 546, 553
Transmission probability ....... 5, 211–214, 216
Transmission ratio distortion ....... 341
Two-locus genotypes ....... 7
Two-stage association analysis ....... 253, 499
U
Unaffected relative pairs ....... 340
Unified model ....... 211–234
Unrelated individuals ....... 48, 51, 60–62, 64, 65, 71, 74, 79, 86, 165, 331, 359, 402, 403, 406, 407, 411–413, 418–420, 440, 458, 463, 501, 539, 541, 553
V
Variance components ....... 152, 153, 162, 164, 173, 174, 177, 180, 184, 303, 375, 385, 387, 390, 391, 395, 499, 549, 555
Variance-covariance matrix ....... 163, 165, 166, 168, 373, 388, 392, 498, 530
Variance prioritization ....... 130
W
Weighted least squares estimator (WLSMV) ....... 498
Weighted root mean square residual (WRMR) ....... 498
Wellcome Trust Case Control Consortium (WTCCC) ....... 351, 453–454
WGA. See Whole-genome amplification (WGA)
Whole-genome amplification (WGA) ....... 536
WLSMV. See Weighted least squares estimator (WLSMV)
WRMR. See Weighted root mean square residual (WRMR)
WTCCC. See Wellcome Trust Case Control Consortium (WTCCC)
X
X chromosome ....... 287, 329, 340, 532–534, 547, 551