Volume 20
Number 12
2010
Executive Editor Hillary E. Sussman
Assistant Editor Robert C. Majovski
Editors
Aravinda Chakravarti Johns Hopkins Univ. School of Medicine
Evan E. Eichler Univ. of Washington
Richard A. Gibbs Baylor College of Medicine
Eric Green* National Human Genome Research Institute
Richard Myers HudsonAlpha Institute for Biotechnology
William J. Pavan* National Human Genome Research Institute
Editorial Secretary Peggy Calicchia
Production Manager Linda Sussman
Production Editor Nora Ruth
Production Assistant Pauline Henick
*Serving in their own capacity.
Editorial Board J. Akey (Univ. of Washington) B. Andersson (Karolinska Institute) S.E. Antonarakis (Univ. of Geneva Medical School) B.E. Bernstein (Broad Institute) W. Bickmore (Medical Research Council) M. Boehnke (Univ. of Michigan) M. Brudno (Univ. of Toronto) M. Bulyk (Brigham & Women’s Hospital and Harvard Medical School) L. Carrel (Penn State College of Medicine) N.P. Carter (Wellcome Trust Sanger Institute) G.A. Churchill (The Jackson Laboratory) B. Cohen (Washington Univ. in St. Louis School of Medicine) G.M. Cooper (HudsonAlpha Institute for Biotechnology) G.E. Crawford (Duke Univ.) A. Di Rienzo (Univ. of Chicago) M. Dunham (Univ. of Washington) P.J. Farnham (Univ. of California, Davis) S. Gabriel (Broad Institute) M.B. Gerstein (Yale Univ.) M. Hahn (Indiana Univ.) I.M. Hall (Univ. of Virginia) D.B. Jaffe (Broad Institute) L. Jin (Fudan University) S. Jones (BC Cancer Agency) J. Korbel (European Molecular Biology Laboratory) J.D. Lieb (Univ. of North Carolina in Chapel Hill) E. Liu (Genome Institute of Singapore) J.R. Lupski (Baylor College of Medicine) T.F.C. Mackay (North Carolina State Univ.) P. Majumder (Indian Statistical Institute)
K. Makova (Pennsylvania State Univ.) E. Mardis (Washington Univ. in St. Louis School of Medicine) E.H. Margulies (National Human Genome Research Institute) G.T. Marth (Boston College) A.S. McCallion (Johns Hopkins Univ. School of Medicine) M.L. Meyerson (Dana-Farber Cancer Institute and Harvard Medical School) A. Milosavljevic (Baylor College of Medicine) R. Mitra (Washington Univ. in St. Louis School of Medicine) J.V. Moran (Univ. of Michigan Medical School) M.A. Nobrega (Univ. of Chicago) J.P. Noonan (Yale Univ. School of Medicine) J. Parkhill (The Wellcome Trust Sanger Institute) W.R. Pearson (Univ. of Virginia) J.H. Postlethwait (Univ. of Oregon) O. Rando (Univ. of Massachusetts Medical School) A. Regev (Broad Institute) D.A. Relman (Stanford Univ.) J. Rogers (Baylor College of Medicine) S.L. Salzberg (Univ. of Maryland) P.C. Scacheri (Case Western Reserve Univ.) E. Segal (Weizmann Institute) J. Shendure (Univ. of Washington) J.A. Stamatoyannopoulos (Univ. of Washington) M.R. Stratton (Wellcome Trust Sanger Institute) S. Tishkoff (Univ. of Pennsylvania) A.J.M. Walhout (Univ. of Massachusetts Medical School) S.T. Warren (Emory Univ. School of Medicine) D.A. Wheeler (Baylor College of Medicine) K. Wolfe (Trinity College, Dublin) K. Zhao (National Heart, Lung, and Blood Institute)
Genome Research (ISSN 1088-9051) is published monthly by Cold Spring Harbor Laboratory Press, 500 Sunnyside Blvd., Woodbury, NY 11797-2924. Periodicals paid at Woodbury, NY and additional mailing offices. Canada Post International Publications Mail Product (Canadian distribution) Sales Agreement No. 1321846. POSTMASTER: Send address changes to Cold Spring Harbor Laboratory Press, 500 Sunnyside Boulevard, Woodbury, NY 11797-2924. Subscriptions: Kathleen Cirone, Subscription Manager. Individual subscribers have a choice of "online only" or "print + online" subscriptions for this journal. Online only: $85; Print + Online: U.S., $135; Canada and Mexico, $205; R.O.W., $230 (includes airlift). For 2011 institutional pricing, visit http://genome.org/site/subscriptions/cost.dtl. Orders may be sent to Cold Spring Harbor Laboratory Press, Fulfillment Department, 500 Sunnyside Boulevard, Woodbury, New York 11797-2924. Telephone: Continental U.S. and Canada 1-800-843-4388; all other locations 516-422-4100. Fax 516-422-4097. Personal subscriptions must be prepaid by personal check, credit card, or money order. Claims for missing issues must be received within four months of issue date. Advertising: Marcie Siconolfi, Advertising Manager, Cold Spring Harbor Laboratory Press, 1 Bungtown Rd., Cold Spring Harbor, New York 11724-2203. Phone: 516-422-4010; fax: 516-422-4092. Information for Contributors: Author instructions are available at our website, http://www.genome.org. Online Manuscript Submission: http://submit.genome.org. Copyright Information: Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by Cold Spring Harbor Laboratory Press for libraries and other users registered with the Copyright Clearance Center (CCC). Contact Copyright Clearance Center at www.copyright.com or 978-750-8400.
This consent does not extend to other kinds of copying, such as copying for general distribution for advertising or promotional purposes, for creating new collective works, or for resale. Copyright © 2010 by Cold Spring Harbor Laboratory Press
Volume 20 Issue 12 December 2010
Perspective
Transgenerational epigenetic inheritance: More questions than answers
1623
Lucia Daxinger and Emma Whitelaw
Research
A recombination hotspot leads to sequence variability within a novel gene (AK005651) and contributes to type 1 diabetes susceptibility
1629
Iris K.L. Tan, Leanne Mackin, Nancy Wang, Anthony T. Papenfuss, Colleen M. Elso, Michelle P. Ashton, Fiona Quirk, Belinda Phipson, Melanie Bahlo, Terence P. Speed, Gordon K. Smyth, Grant Morahan, and Thomas C. Brodnicki
Regulated post-transcriptional RNA cleavage diversifies the eukaryotic transcriptome
1639
Tim R. Mercer, Marcel E. Dinger, Cameron P. Bracken, Gabriel Kolle, Jan M. Szubert, Darren J. Korbie, Marjan E. Askarian-Amiri, Brooke B. Gardiner, Gregory J. Goodall, Sean M. Grimmond, and John S. Mattick
Assessing the effect of the CLPG mutation on the microRNA catalog of skeletal muscle using high-throughput sequencing
1651
Florian Caiment, Carole Charlier, Tracy Hadfield, Noelle Cockett, Michel Georges, and Denis Baurain
Selective sweeps and parallel mutation in the adaptive recovery from deleterious mutation in Caenorhabditis elegans
1663
Dee R. Denver, Dana K. Howe, Larry J. Wilhelm, Catherine A. Palmer, Jennifer L. Anderson, Kevin C. Stein, Patrick C. Phillips, and Suzanne Estes
Coevolution within a transcriptional network by compensatory trans and cis mutations
1672 OA
Dwight Kuo, Katherine Licon, Sourav Bandyopadhyay, Ryan Chuang, Colin Luo, Justin Catalana, Timothy Ravasi, Kai Tan, and Trey Ideker
RNA synthesis precision is regulated by preinitiation complex turnover
1679
Kunal Poorey, Rebekka O. Sprouse, Melissa N. Wells, Ramya Viswanathan, Stefan Bekiranov, and David T. Auble
Pervasive gene content variation and copy number variation in maize and its undomesticated progenitor
1689
Ruth A. Swanson-Wagner, Steven R. Eichten, Sunita Kumari, Peter Tiffin, Joshua C. Stein, Doreen Ware, and Nathan M. Springer
Localized hypermutation and associated gene losses in legume chloroplast genomes
1700
Alan M. Magee, Sue Aspinall, Danny W. Rice, Brian P. Cusack, Marie Sémon, Antoinette S. Perry, Saša Stefanović, Dan Milbourne, Susanne Barth, Jeffrey D. Palmer, John C. Gray, Tony A. Kavanagh, and Kenneth H. Wolfe
Methods
High-throughput discovery of rare insertions and deletions in large cohorts
1711
Francesco L.M. Vallania, Todd E. Druley, Enrique Ramos, Jue Wang, Ingrid Borecki, Michael Province, and Robi D. Mitra
Evaluation of affinity-based genome-wide DNA methylation data: Effects of CpG density, amplification bias, and copy number variation
1719
Mark D. Robinson, Clare Stirzaker, Aaron L. Statham, Marcel W. Coolen, Jenny Z. Song, Shalima S. Nair, Dario Strbenac, Terence P. Speed, and Susan J. Clark
Gene expression profiling of human breast tissue samples using SAGE-Seq
1730
Zhenhua Jeremy Wu, Clifford A. Meyer, Sibgat Choudhury, Michail Shipitsin, Reo Maruyama, Marina Bessarabova, Tatiana Nikolskaya, Saraswati Sukumar, Armin Schwartzman, Jun S. Liu, Kornelia Polyak, and X. Shirley Liu
Scaffolding a Caenorhabditis nematode genome with RNA-seq
1740
Ali Mortazavi, Erich M. Schwarz, Brian Williams, Lorian Schaeffer, Igor Antoshechkin, Barbara J. Wold, and Paul W. Sternberg
Erratum
1748
Author Index
1749
Reviewer Index
1755
OA: Open Access paper.
Cover
RNA transcripts are ultimately cleaved into constituent nucleotides during degradation and recycling. In this issue, it is reported that RNA transcripts can also be cleaved in a regulated manner to generate smaller, stable RNAs that contribute to the diversity of the eukaryotic transcriptome. The disjunction illustrated on the cover represents the prevalent cleavage of RNA transcripts in geometrically abstract terms. (Cover illustration by Tim Mercer. [For details, see Mercer et al., pp. 1639–1650.])
Downloaded from genome.cshlp.org on December 25, 2010 - Published by Cold Spring Harbor Laboratory Press
Perspective
Transgenerational epigenetic inheritance: More questions than answers

Lucia Daxinger and Emma Whitelaw¹

Epigenetics Laboratory, Queensland Institute of Medical Research, Herston, Brisbane, Queensland 4006, Australia

Epigenetic modifications are widely accepted as playing a critical role in the regulation of gene expression and thereby contributing to the determination of the phenotype of multicellular organisms. In general, these marks are cleared and reestablished each generation, but there have been reports in a number of model organisms that at some loci in the genome this clearing is incomplete. This phenomenon is referred to as transgenerational epigenetic inheritance. Moreover, recent evidence shows that the environment can stably influence the establishment of the epigenome. Together, these findings suggest that an environmental event in one generation could affect the phenotype in subsequent generations, and these somewhat Lamarckian ideas are stimulating interest from a broad spectrum of biologists, from ecologists to health workers.

Epigenetics became an established discipline in the 1970s and 1980s as a result of work carried out by geneticists using model organisms such as Drosophila (Henikoff 1990). Originally, this research area aimed to understand those instances in which stable changes in genome function could not be explained by changes in DNA sequence. This definition suited Waddington’s original purpose, i.e., to explain how a multicellular organism could develop from one genome (Waddington 1942). More recently, with increasing knowledge of the underlying molecular mechanisms, the field has taken on a more biochemical flavor (Bird 2007; Kouzarides 2007).
Constant progress is being made in the identification of epigenetic marks, i.e., the molecular marks on the chromosome that influence genome function, and while DNA methylation remains the most extensively studied, the importance of histone modifications as well as the contribution of RNA has become increasingly clear.

There has always been much interest in the idea that some epigenetic marks can be inherited across generations. However, despite the fact that these marks are considered relatively stable during development (i.e., transmissible across mitosis), in theory they must undergo reprogramming in primordial germ cells (PGCs) and in the zygote to ensure the totipotency of cells of the early embryo, enabling them to differentiate down any pathway. For transgenerational epigenetic inheritance to occur at a particular locus, this reprogramming must be bypassed (Hadchouel et al. 1987; Roemer et al. 1997; Morgan et al. 1999). Recent reports that the establishment of epigenetic states can be altered by the environment, combined with the idea that epigenetic states can be inherited across generations, have resurrected an interest from the scientific community in Lamarckism.

Here, we will highlight recent developments in our understanding of transgenerational epigenetic inheritance in multicellular organisms and discuss how alterations of the epigenotype might contribute to the determination of the adult phenotype of future generations. In particular, recent advances in our ability to study the integrity of the genome will help to identify true epigenetic phenomena.
¹Corresponding author. E-mail [email protected]. Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.106138.110.

Naturally occurring epialleles

Some of the earliest evidence for transgenerational epigenetic inheritance came from studies in plants (Bender and Fink 1995; Jacobsen and Meyerowitz 1997; Soppe et al. 2000; Rangwala et al. 2006). One of the oldest examples involves a change in flower symmetry from bilateral to radial in Linaria vulgaris. This change appears to be explained by a change in DNA methylation rather than DNA sequence. The phenotype of the flower correlates tightly with the degree of DNA methylation at the promoter of the Lcyc gene, and the presence or absence of DNA methylation at the promoter correlates with its silent or active state, respectively. Occasionally, reversions to wild-type flowers occur in one branch of a plant, concomitant with hypomethylation at the locus (Cubas et al. 1999). Sequencing of around 1 kb upstream of the Lcyc locus did not detect any differences between wild-type and mutant plants (Cubas et al. 1999). It is still not clear how the silent state is maintained across generations at the locus.

It is important to note that care has to be taken when meiotically heritable changes in phenotype are described as epigenetic, because in most systems it is almost impossible to completely rule out mutations either at the locus or elsewhere in the genome that could contribute directly or indirectly to the phenotype. Only recently it was shown that the bal variant in Arabidopsis, which was long thought to have an underlying epigenetic explanation, arose from a gene duplication, and that the duplication alone is necessary and sufficient for the phenotype observed in the bal variant (Yi and Richards 2009). Duplicated regions of genomes are difficult to detect and therefore difficult to rule out.

It was shown recently that altered patterns of DNA methylation in plants (as seen in ddm1 or met1 mutants) can be heritable over many generations, even following backcrossing to wild-type plants (Johannes et al. 2009; Reinders et al. 2009). However, at some specific loci this is not the case; the methylation state reverts to that of the wild-type state. The reason for these differences is not clear.
The best evidence for transgenerational epigenetic inheritance in the mouse comes from the study of epialleles, such as agouti viable yellow and axin fused, in which DNA methylation levels of intracisternal A particle (IAP) retrotransposons control the expression of the neighboring gene. Alleles of this type are referred to as epialleles because the epigenetic state of the IAP transcriptional control element determines the phenotype. IAP elements are among a small group of long terminal repeat (LTR) retrotransposons, and it is interesting that this group appears to be resistant to the erasure of DNA methylation during reprogramming events in the gametes and early embryos (Morgan et al. 1999; Lane et al. 2003; Rakyan et al. 2003; Popp et al. 2010). Variable phenotypes
20:1623–1628 © 2010 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/10; www.genome.org
Genome Research www.genome.org
(a range of coat colors from yellow, yellow and brown patches, to brown) are observed in inbred (and, therefore, presumably isogenic) mice carrying an agouti viable yellow allele, and this correlates with the DNA methylation status of the IAP. The epigenetic state can be transmitted to the next generation, maternally for agouti viable yellow and both maternally and paternally for axin-fused (Morgan et al. 1999; Rakyan and Whitelaw 2003; Rakyan et al. 2003; Blewitt et al. 2006). Although the Avy and the axin-fused loci are the best-characterized epialleles in the mouse, the epigenetic marks that facilitate transgenerational epigenetic inheritance are not known. The simplest explanation has always been that DNA methylation at the locus escapes the reprogramming events. However, at least one study has shown that in blastocysts, following maternal transmission, DNA methylation at Avy is completely lost, suggesting that DNA methylation is not the inherited mark (Blewitt et al. 2006).
Transgenerational epigenetic inheritance at transgenes

There has been a long history of reports of epigenetic silencing at transgenes both in animals and plants (Hadchouel et al. 1987; Allen et al. 1990; Dorn et al. 1993; Kearns et al. 2000; Matzke et al. 2000; Sutherland et al. 2000; Lane et al. 2003; Xing et al. 2007). In some cases these silent states have been shown to be passed on to the next generation. For example, a transgene construct consisting of a Polycomb response element (PRE), Fab-7, placed upstream of a GAL4 UAS-inducible reporter gene, provided the first evidence for transgenerational epigenetic inheritance in Drosophila (Cavalli and Paro 1998, 1999). During embryogenesis the normally silent state of the Fab-7 PRE could be switched to active. This active state was stably maintained not only through many rounds of mitosis, but also through meiosis in the absence of the initial GAL4 inducer (Paro et al. 1998; Cavalli and Paro 1999).

A recent study in Arabidopsis has shown that after extreme temperature or UV-B stress, a silent transgene and some endogenous transposable elements were activated, and these changes were heritable for two generations. Interestingly, the loss of silencing at these loci correlated with an increase in histone acetylation, a mark known to be associated with active transcription, but was not accompanied by a loss of DNA methylation (Lang-Mladek et al. 2010). However, the changes in histone acetylation may well be a response to the change in transcriptional activity rather than a driver. The molecular basis for the meiotic memory in both of these cases of transgenerational epigenetic inheritance at transgenes remains unknown.
Paramutation

Paramutation and paramutation-like phenomena have been described in plants, fungi, and mammals, and have been most extensively studied in maize. Paramutation involves an allelic interaction (in trans) that leads to a heritable change in gene expression. At the b1 locus the “paramutagenic” B′ allele (normally associated with pale plants) changes the epigenetic state of the “paramutable” B-I allele (normally associated with dark plants) when crossed, which results in a phenotypic change at the B-I allele from dark purple to lightly pigmented mature plant tissues. No changes from B′ to B-I have been observed, making the B′ allele extremely stable over many generations (Arteaga-Vazquez and Chandler 2010). While the presence of tandem repeats in cis has been shown to play an important role in the b1 paramutation system, the molecular basis of these phenomena has been difficult to understand (Stam et al. 2002).
Exciting results have been obtained recently from forward genetic screens using the b1 locus and another paramutable locus, pl1, and several genes required for paramutation have been uncovered (Dorweiler et al. 2000; Alleman et al. 2006; Erhard et al. 2009; Sidorenko et al. 2009; Stonaker et al. 2009). To date, the majority of the genes identified overlap with factors required for small RNA-mediated silencing (Arteaga-Vazquez and Chandler 2010). An RNA-dependent RNA polymerase has been shown to be absolutely required for the establishment and maintenance of paramutation (Alleman et al. 2006; Sidorenko and Chandler 2008).

A paramutation-like phenomenon has also been observed in the mouse. This involves the inheritance of a white-tail phenotype caused by an insertional mutation (a transgene was inserted downstream from the Kit promoter that produces an aberrant transcript) at the Kit locus, resulting in no KIT protein. Upon analysis of the offspring of heterozygous intercrosses (i.e., when breeding mice heterozygous for this mutant Kit allele) the number of phenotypically wild-type mice was lower than expected by Mendelian rules. Further analysis revealed that genotypically wild-type offspring were generated in the expected Mendelian ratio, but that most had maintained the mutant phenotype, displaying white tail tips. Importantly, this mutant phenotype (Kit*) could be transmitted by the genotypically wild-type mice to the next generation (Rassoulzadegan et al. 2006). A role for RNA in this paramutation-like phenomenon was proposed because elevated RNA levels were found in the sperm of mice heterozygous for the Kit mutation and in Kit* wild-type males. Paramutation could be induced following microinjection of microRNAs targeted to the Kit locus (Rassoulzadegan et al. 2006). However, concerns have been raised regarding the particular phenotype.
The white tail is found surprisingly frequently in inbred C57BL/6 mice obtained from the Jackson Lab, i.e., those that we can be confident did not have ancestors carrying the Kit tm1Alf allele (Arnheiter 2007). Similar paramutation-like phenomena have been reported by the same group at some other loci in the mouse (Wagner et al. 2008; Grandjean et al. 2009).

In the case of paramutation in the mouse, it seems that the amount of RNA originally present to trigger the response is important for the transgenerational inheritance of the phenotype (Grandjean et al. 2009). A similar observation was described in C. elegans, where it was shown that heritable silencing of the oocyte maturation factor (oma-1) following dsRNA injection is dose dependent (Alcazar et al. 2008). Long-term silencing effects lasted three to four generations, but dropped significantly afterward. This transgenerational silencing has also been described with other target genes (Grishok et al. 2000; Vastenhouw et al. 2006). While molecular mechanisms for this transgenerational silencing in worms remain unknown, it has been shown that the silencing can be transmitted independently of the originally targeted locus, indicating a mobile silencing signal (Grishok et al. 2000; Alcazar et al. 2008). Alcazar and colleagues propose that RNA molecules are the inherited signal (Alcazar et al. 2008).
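The Mendelian reasoning in the mouse Kit example can be made concrete with a goodness-of-fit calculation. The following is a minimal sketch in Python, not the analysis used in the cited studies. The offspring counts are hypothetical and chosen only for illustration, and the sketch assumes, as a simplification, that all three genotype classes from a heterozygous intercross are viable and scored. The function name `mendelian_chi_square` is likewise an illustrative choice.

```python
# Illustrative sketch only: all counts below are hypothetical,
# not data from any study cited in this article.

def mendelian_chi_square(observed, ratio):
    """Return (Pearson chi-square statistic, expected counts)
    for observed class counts tested against an expected ratio."""
    if len(observed) != len(ratio):
        raise ValueError("observed and ratio must have the same length")
    total = sum(observed)
    ratio_sum = sum(ratio)
    expected = [total * r / ratio_sum for r in ratio]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, expected

# Genotypes from a heterozygous intercross: expected 1 : 2 : 1
# (wild type : heterozygous : homozygous mutant), simplifying assumption.
genotype_stat, genotype_expected = mendelian_chi_square([26, 49, 25], [1, 2, 1])

# Phenotypes: expected 1 : 3 (normal tail : white tail tip) if only
# genotypically wild-type animals looked wild type.
phenotype_stat, phenotype_expected = mendelian_chi_square([4, 96], [1, 3])

# Standard 5% critical values of the chi-square distribution.
CRITICAL_DF2 = 5.991  # three classes, 2 degrees of freedom
CRITICAL_DF1 = 3.841  # two classes, 1 degree of freedom

print(f"genotypes:  chi2 = {genotype_stat:.2f}, "
      f"deviates from 1:2:1 = {genotype_stat > CRITICAL_DF2}")
print(f"phenotypes: chi2 = {phenotype_stat:.2f}, "
      f"deviates from 1:3 = {phenotype_stat > CRITICAL_DF1}")
```

With these made-up numbers the genotype classes fit the 1:2:1 expectation while the phenotype classes do not, which is the pattern described above: wild-type genotypes arise at the expected frequency, yet too few animals look wild type.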
Epigenetics and environment

Throughout their life cycle, organisms are constantly exposed to environmental influences that pose a threat to the stability of their genome and/or epigenome. Several cases have been reported in various organisms, in which environmental influences such as exposure to chemicals (Anway et al. 2005; Vandegehuchte et al. 2009), nutritional supplements/nutrient availability (Wolff et al. 1998; Cooney et al. 2002; Dolinoy et al. 2006; Kaminen-Ahola
et al. 2010), maternal behavior (Weaver et al. 2004), pathogens (Boyko et al. 2007), or temperature (Lang-Mladek et al. 2010) cause alterations in gene expression that persist throughout life and sometimes appear to be transmitted to the next generation.

Honeybees provide an interesting example of where a change in nutritional input dramatically alters the phenotype of the developing, genetically identical larvae. In the wild, larvae fed with royal jelly become queens, which differ significantly in their physiology from workers. It has been shown recently that downregulation of the bee’s DNA methyltransferase, during a critical “decision-making” period in larval development, results in the emergence of an increased number of queens from the larvae not fed royal jelly (Kucharski et al. 2008). These studies highlight the importance of DNA methylation as an intermediary between the environment and the developmental outcomes.

A recent study in the mouse reported that ethanol consumption by pregnant females can influence the adult phenotype of the developing embryos. Developmental abnormalities (decrease in body weight, smaller skull size, and differences in cranial shape) were observed in adolescent offspring from mothers that were exposed to ethanol during the first half of pregnancy (Kaminen-Ahola et al. 2010). Moreover, using the epigenetically sensitive agouti viable yellow (Avy) as a read-out system, it was shown that ethanol exposure led to an increase in transcriptional silencing associated with hypermethylation at the Avy locus and a shift toward pseudoagouti (brown) (Kaminen-Ahola et al. 2010). It remains to be determined whether the effects observed after ethanol exposure can be transmitted to the next generations or are restricted to directly exposed animals.
It is important to remember that when transgenerational phenomena are observed in mice that have been exposed to environmental stresses during pregnancy, not only the mother, but the F1 generation (embryo) and the developing germ line of the F2 generation are also exposed to these triggers (Youngson and Whitelaw 2008).

Another instance in which the Avy allele has been used as a biosensor revealed a shift of coat color toward pseudoagouti after feeding the mice a methyl-rich diet (Wolff et al. 1998; Cooney et al. 2002; Waterland and Jirtle 2003; Cropley et al. 2006, 2007; Waterland et al. 2006, 2007). DNA methylation at the Avy locus was found to be increased in animals that were fed with a methyl-rich diet (Waterland and Jirtle 2003). Follow-up studies were performed to determine whether increased DNA methylation levels at Avy were inherited by the next generation. Waterland et al. (2007) came to the conclusion that the acquired DNA methylation marks were not transgenerationally inherited, whereas using a slightly different breeding strategy, Cropley et al. (2006) reported the opposite.

An epigenetic memory of stress has been observed in dandelions (Taraxacum officinale) (Verhoeven et al. 2010). Dandelions are apomictic, i.e., they reproduce through unfertilized seeds, and are therefore assumed to be genetically identical, providing the opportunity of studying epigenetic variation in the absence of genotypic variation. In a recent study, isogenic dandelions were exposed to a variety of stresses (biotic and abiotic) and, together with the first generation of unstressed offspring, were analyzed for genome-wide DNA methylation changes using methylation-sensitive amplified-fragment-length polymorphism. The results showed that stress-induced DNA methylation changes occurred, and that these changes were transmitted to the next generation (Verhoeven et al. 2010).
The nature of the differentially methylated loci, i.e., whether they are genes or transposable elements, is not yet clear. Moreover, in the absence of a complete genome sequence it is hard to rule out underlying genetic changes. Despite the technical challenges in studying dandelions, they provide an interesting example of a situation in which genetic variation is limited and where transgenerational epigenetic inheritance could provide a useful mechanism for adaptation to environmental changes.

Whereas inbred mouse strains and apomictic dandelions provide an opportunity to study epigenetic variation in a situation in which genetic variation is greatly minimized, the situation is different in outbred populations, such as humans. Studies in monozygotic (MZ) twin pairs, which are genetically identical, provide some evidence for epigenetic variation between individuals within a twin pair (Fraga et al. 2005; Mill et al. 2006; Oates et al. 2006; Kaminsky et al. 2009). However, a more recent genome-wide study of the genetic, epigenetic, and transcriptomic differences in monozygotic twins discordant for multiple sclerosis failed to find any significant genetic or epigenetic differences (Baranzini et al. 2010). Clearly, more work needs to be carried out in this area. MZ twins provide a unique opportunity to unravel the extent to which the epigenome is hard-wired in humans.

The effects of environmental influences and the possibility that the resulting epigenetic alterations are heritable to the next generation are of considerable interest to those studying disease in humans. A recent study investigated the long-term effects of prenatal exposure to famine on DNA methylation at the imprinted IGF2 gene. Individuals conceived during the Dutch Hunger Winter (1944–1945) showed hypomethylation at the IGF2 differentially methylated region (DMR) when analyzed six decades later. Interestingly, no differences in DNA methylation were observed in individuals exposed to famine late in gestation. The finding suggests that the protein-deficient diet of the mother contributed to the loss of DNA methylation at the IGF2 DMR (Heijmans et al. 2008).
It is difficult to tease out cause and effect. The loss of methylation in old age may be a consequence of some as yet unknown physiological changes. Unfortunately, in this study there is no record of DNA methylation patterns earlier in development. A prospective cohort study would be best, and epidemiologists are now collecting biospecimens from MZ twins at birth (Foley et al. 2009). This will provide us with exciting new data in the coming decades. A large epidemiological study carried out in Sweden reported that early paternal smoking was associated with a greater body mass index in sons (Pembrey et al. 2006). Additionally, they found a correlation between mortality risk ratio of grandsons and paternal grandfather’s food supply in mid-childhood. The mortality risk ratio of the granddaughters was linked to the paternal grandmother’s food supply (Pembrey et al. 2006). While it is possible to explain these observations based on transgenerational epigenetic inheritance, other equally plausible explanations exist. In these types of studies, cultural confounders are almost impossible to rule out. An epimutation in humans has been described in an individual with hereditary nonpolyposis colorectal cancer. The patient had altered DNA methylation patterns at one allele of MLH1, a DNA mismatch repair gene. Silencing of the MLH1 allele was detectable in all three germ layers, suggesting that an epimutation had occurred in the parental germ line. Some siblings inherited the same allele in an unmethylated state, and no DNA mutations were identified in the MLH1 coding or promoter regions, supporting the idea that this was a case of transgenerational epigenetic inheritance (Hitchins et al. 2005). However, a mechanism for the MLH1 epimutation has yet to be identified, and trans-acting genetic alterations cannot be ruled out (Hesson et al. 2010). There is
increasing evidence that trans-effects can influence the epigenetic state of a locus so that undetected copy number variations, large duplications, and inversions anywhere in the genome could be the cause of the effect (Hesson et al. 2010).
Toward a molecular mechanism

In those cases in which transgenerational epigenetic inheritance has been observed, the underlying molecular mechanisms are poorly understood. The recent discovery that germ-line cells contain large numbers of different small RNA species in mice, flies, and plants suggests a novel way of transmitting epigenetic information through the germ line (Aravin and Hannon 2008; Mosher et al. 2009; Slotkin et al. 2009; Teixeira et al. 2009). Indeed, maternal transmission of Piwi-interacting RNAs (piRNAs) in Drosophila has been shown to influence fertility of the offspring via piRNA-directed silencing of transposable elements in a phenomenon termed hybrid dysgenesis (Brennecke et al. 2008). In mammals, microRNAs have been implicated as the trigger for a paramutation-like phenomenon, and this has been discussed above. However, the involvement of other classes of small RNAs in transgenerational epigenetic inheritance in mammals remains to be determined.

As we have indicated already, transgenerational epigenetic inheritance does appear to preferentially occur at transgenes, transposable elements, or genes that are under the transcriptional control of transposable elements. It has been suggested that what makes these states special is that they involve insertions, which, when heterozygous, trigger events such as “meiotic silencing of unpaired DNA” (MSUD), a process that has been extensively studied in Neurospora (Shiu et al. 2001).

Until recently, a role for chromatin in transgenerational epigenetic inheritance in mammals was considered unlikely, because in sperm the histones are replaced by smaller, arginine-rich protamines. This replacement would erase any epigenetic modification at histone tails, thereby preventing epigenetic inheritance.
However, it has now become clear that some nucleosomes are retained, and these are not random remnants of insufficient clearance of epigenetic marks, but enriched at specific loci important for embryonic development. High levels of H3K27me3, H3K4me2, and H3K4me3 are found at these loci (Hammoud et al. 2009a,b; Brykczynska et al. 2010). Brykczynska et al. (2010) propose that H3K27me3 might be the epigenetic modification that is transmitted paternally to the next generation. While the mechanisms of inheritance of histone modifications in mammals are still under debate, there is at least one report in C. elegans demonstrating the importance of complete erasure of H3K4me2 patterns in the germ line to prevent transmission of this epigenetic mark to the next generation (Katz et al. 2009). The absence of the H3K4me2 demethylase LSD1/KDM1 in C. elegans over many generations was shown to result in a significant increase of H3K4me2 levels at genes required for spermatogenesis. The accumulation of active marks was shown to correlate with an increase in gene expression at these loci (Katz et al. 2009). It has also been shown that haploinsufficiency for DNMT1 (a DNA methyltransferase) and SNF2H (SMARCA5) (a chromatin remodeler) in male mice can trigger phenotypic abnormalities in the offspring that did not inherit the mutated gene (Chong et al. 2007). These are referred to as paternal effects. Chong and colleagues proposed a model whereby a shift in dosage (or the compromised function) of epigenetic modifiers can modify the epigenome of wild-type gametes at regions that are not cleared, and that these can, in turn, act in trans on alleles introduced only via the egg. It
will be interesting to see whether haploinsufficiency for other proteins involved in epigenetic reprogramming displays similar effects.
Genome Research www.genome.org
Conclusion and future directions
Multicellular organisms have evolved complex mechanisms to clear epigenetic states between generations. However, in some cases these mechanisms can be circumvented. Recent studies across a wide range of species have strengthened the idea that the direct inheritance of RNA molecules and of chromatin states does occur, making these plausible explanations. The development of high-throughput methods of sequencing both RNA and DNA in combination with antibodies specific to particular histone modifications will enable us to fully characterize the epigenetic marks across the entire genome of gametes and early embryos in the near future. Together, these studies will provide us with exciting new insights on how and to what extent transgenerational epigenetic inheritance occurs in various organisms. Certainly, we are only at the beginning, and most likely we will have to revise our current models about the nature and stability of the epigenetic marks to fully understand this mechanism.
Acknowledgments
L.D. is supported by the Austrian Science Fund (FWF) Erwin Schroedinger Fellowship (J-2891-B12). E.W. is a National Health and Medical Research Council (NHMRC) Australia Fellow.
References Alcazar RM, Lin R, Fire AZ. 2008. Transmission dynamics of heritable silencing induced by double-stranded RNA in Caenorhabditis elegans. Genetics 180: 1275–1288. Alleman M, Sidorenko L, McGinnis K, Seshadri V, Dorweiler JE, White J, Sikkink K, Chandler VL. 2006. An RNA-dependent RNA polymerase is required for paramutation in maize. Nature 442: 295–298. Allen ND, Norris ML, Surani MA. 1990. Epigenetic control of transgene expression and imprinting by genotype-specific modifiers. Cell 61: 853– 861. Anway MD, Cupp AS, Uzumcu M, Skinner MK. 2005. Epigenetic transgenerational actions of endocrine disruptors and male fertility. Science 308: 1466–1469. Aravin AA, Hannon GJ. 2008. Small RNA silencing pathways in germ and stem cells. Cold Spring Harb Symp Quant Biol 73: 283–290. Arnheiter H. 2007. Mammalian paramutation: A tail’s tale? Pigment Cell Res 20: 36–40. Arteaga-Vazquez MA, Chandler VL. 2010. Paramutation in maize: RNA mediated trans-generational gene silencing. Curr Opin Genet Dev 20: 156–163. Baranzini SE, Mudge J, van Velkinburgh JC, Khankhanian P, Khrebtukova I, Miller NA, Zhang L, Farmer AD, Bell CJ, Kim RW, et al. 2010. Genome, epigenome and RNA sequences of monozygotic twins discordant for multiple sclerosis. Nature 464: 1351–1356. Bender J, Fink GR. 1995. Epigenetic control of an endogenous gene family is revealed by a novel blue fluorescent mutant of Arabidopsis. Cell 83: 725–734. Bird A. 2007. Perceptions of epigenetics. Nature 447: 396–398. Blewitt ME, Vickaryous NK, Paldi A, Koseki H, Whitelaw E. 2006. Dynamic reprogramming of DNA methylation at an epigenetically sensitive allele in mice. PLoS Genet 2: e49. doi: 10.1371/journal.pgen.0020049. Boyko A, Kathiria P, Zemp FJ, Yao Y, Pogribny I, Kovalchuk I. 2007. Transgenerational changes in the genome stability and methylation in pathogen-infected plants (virus-induced plant genome instability). Nucleic Acids Res 35: 1714–1725. 
Brennecke J, Malone CD, Aravin AA, Sachidanandam R, Stark A, Hannon GJ. 2008. An epigenetic role for maternally inherited piRNAs in transposon silencing. Science 322: 1387–1392. Brykczynska U, Hisano M, Erkek S, Ramos L, Oakeley EJ, Roloff TC, Beisel C, Schubeler D, Stadler MB, Peters AH. 2010. Repressive and active histone methylation mark distinct promoters in human and mouse spermatozoa. Nat Struct Mol Biol 17: 679–687.
Downloaded from genome.cshlp.org on December 25, 2010 - Published by Cold Spring Harbor Laboratory Press
Transgenerational epigenetic inheritance Cavalli G, Paro R. 1998. The Drosophila Fab-7 chromosomal element conveys epigenetic inheritance during mitosis and meiosis. Cell 93: 505–518. Cavalli G, Paro R. 1999. Epigenetic inheritance of active chromatin after removal of the main transactivator. Science 286: 955–958. Chong S, Vickaryous N, Ashe A, Zamudio N, Youngson N, Hemley S, Stopka T, Skoultchi A, Matthews J, Scott HS, et al. 2007. Modifiers of epigenetic reprogramming show paternal effects in the mouse. Nat Genet 39: 614– 622. Cooney CA, Dave AA, Wolff GL. 2002. Maternal methyl supplements in mice affect epigenetic variation and DNA methylation of offspring. J Nutr 132: 2393S–2400S. Cropley JE, Suter CM, Beckman KB, Martin DI. 2006. Germ-line epigenetic modification of the murine A vy allele by nutritional supplementation. Proc Natl Acad Sci 103: 17308–17312. Cropley JE, Suter CM, Martin DI. 2007. Methyl donors change the germline epigenetic state of the Avy allele. FASEB J 21: 3021–3022. Cubas P, Vincent C, Coen E. 1999. An epigenetic mutation responsible for natural variation in floral symmetry. Nature 401: 157–161. Dolinoy DC, Weidman JR, Waterland RA, Jirtle RL. 2006. Maternal genistein alters coat color and protects Avy mouse offspring from obesity by modifying the fetal epigenome. Environ Health Perspect 114: 567–572. Dorn R, Krauss V, Reuter G, Saumweber H. 1993. The enhancer of positioneffect variegation of Drosophila, E(var)3-93D, codes for a chromatin protein containing a conserved domain common to several transcriptional regulators. Proc Natl Acad Sci 90: 11376–11380. Dorweiler JE, Carey CC, Kubo KM, Hollick JB, Kermicle JL, Chandler VL. 2000. mediator of paramutation1 is required for establishment and maintenance of paramutation at multiple maize loci. Plant Cell 12: 2101–2118. Erhard KF Jr, Stonaker JL, Parkinson SE, Lim JP, Hale CJ, Hollick JB. 2009. RNA polymerase IV functions in paramutation in Zea mays. Science 323: 1201–1205. 
Foley DL, Craig JM, Morley R, Olsson CA, Dwyer T, Smith K, Saffery R. 2009. Prospects for epigenetic epidemiology. Am J Epidemiol 169: 389–400. Fraga MF, Ballestar E, Paz MF, Ropero S, Setien F, Ballestar ML, Heine-Suner D, Cigudosa JC, Urioste M, Benitez J, et al. 2005. Epigenetic differences arise during the lifetime of monozygotic twins. Proc Natl Acad Sci 102: 10604– 10609. Grandjean V, Gounon P, Wagner N, Martin L, Wagner KD, Bernex F, Cuzin F, Rassoulzadegan M. 2009. The miR-124-Sox9 paramutation: RNAmediated epigenetic control of embryonic and adult growth. Development 136: 3647–3655. Grishok A, Tabara H, Mello CC. 2000. Genetic requirements for inheritance of RNAi in C. elegans. Science 287: 2494–2497. Hadchouel M, Farza H, Simon D, Tiollais P, Pourcel C. 1987. Maternal inhibition of hepatitis B surface antigen gene expression in transgenic mice correlates with de novo methylation. Nature 329: 454–456. Hammoud S, Emery BR, Dunn D, Weiss RB, Carrell DT. 2009a. Sequence alterations in the YBX2 gene are associated with male factor infertility. Fertil Steril 91: 1090–1095. Hammoud SS, Nix DA, Zhang H, Purwar J, Carrell DT, Cairns BR. 2009b. Distinctive chromatin in human sperm packages genes for embryo development. Nature 460: 473–478. Heijmans BT, Tobi EW, Stein AD, Putter H, Blauw GJ, Susser ES, Slagboom PE, Lumey LH. 2008. Persistent epigenetic differences associated with prenatal exposure to famine in humans. Proc Natl Acad Sci 105: 17046– 17049. Henikoff S. 1990. Position-effect variegation after 60 years. Trends Genet 6: 422–426. Hesson LB, Hitchins MP, Ward RL. 2010. Epimutations and cancer predisposition: Importance and mechanisms. Curr Opin Genet Dev 20: 290–298. Hitchins M, Williams R, Cheong K, Halani N, Lin VA, Packham D, Ku S, Buckle A, Hawkins N, Burn J, et al. 2005. MLH1 germline epimutations as a factor in hereditary nonpolyposis colorectal cancer. Gastroenterology 129: 1392–1399. Jacobsen SE, Meyerowitz EM. 1997. 
Hypermethylated SUPERMAN epigenetic alleles in Arabidopsis. Science 277: 1100–1103. Johannes F, Porcher E, Teixeira FK, Saliba-Colombani V, Simon M, Agier N, Bulski A, Albuisson J, Heredia F, Audigier P, et al. 2009. Assessing the impact of transgenerational epigenetic variation on complex traits. PLoS Genet 5: e1000530. doi: 10.1371/journal.pgen.1000530. Kaminen-Ahola N, Ahola A, Maga M, Mallitt KA, Fahey P, Cox TC, Whitelaw E, Chong S. 2010. Maternal ethanol consumption alters the epigenotype and the phenotype of offspring in a mouse model. PLoS Genet 6: e1000811. doi: 10.1371/journal.pgen.1000811. Kaminsky ZA, Tang T, Wang SC, Ptak C, Oh GH, Wong AH, Feldcamp LA, Virtanen C, Halfvarson J, Tysk C, et al. 2009. DNA methylation profiles in monozygotic and dizygotic twins. Nat Genet 41: 240–245.
Katz DJ, Edwards TM, Reinke V, Kelly WG. 2009. A C. elegans LSD1 demethylase contributes to germline immortality by reprogramming epigenetic memory. Cell 137: 308–320. Kearns M, Preis J, McDonald M, Morris C, Whitelaw E. 2000. Complex patterns of inheritance of an imprinted murine transgene suggest incomplete germline erasure. Nucleic Acids Res 28: 3301–3309. Kouzarides T. 2007. Chromatin modifications and their function. Cell 128: 693–705. Kucharski R, Maleszka J, Foret S, Maleszka R. 2008. Nutritional control of reproductive status in honeybees via DNA methylation. Science 319: 1827–1830. Lane N, Dean W, Erhardt S, Hajkova P, Surani A, Walter J, Reik W. 2003. Resistance of IAPs to methylation reprogramming may provide a mechanism for epigenetic inheritance in the mouse. Genesis 35: 88–93. Lang-Mladek C, Popova O, Kiok K, Berlinger M, Rakic B, Aufsatz W, Jonak C, Hauser MT, Luschnig C. 2010. Transgenerational inheritance and resetting of stress-induced loss of epigenetic gene silencing in Arabidopsis. Mol Plant 3: 594–602. Matzke MA, Mette MF, Matzke AJ. 2000. Transgene silencing by the host genome defense: Implications for the evolution of epigenetic control mechanisms in plants and vertebrates. Plant Mol Biol 43: 401–415. Mill J, Dempster E, Caspi A, Williams B, Moffitt T, Craig I. 2006. Evidence for monozygotic twin (MZ) discordance in methylation level at two CpG sites in the promoter region of the catechol-O-methyltransferase (COMT) gene. Am J Med Genet B Neuropsychiatr Genet 141B: 421–425. Morgan HD, Sutherland HG, Martin DI, Whitelaw E. 1999. Epigenetic inheritance at the agouti locus in the mouse. Nat Genet 23: 314–318. Mosher RA, Melnyk CW, Kelly KA, Dunn RM, Studholme DJ, Baulcombe DC. 2009. Uniparental expression of PolIV-dependent siRNAs in developing endosperm of Arabidopsis. Nature 460: 283–286. Oates NA, van Vliet J, Duffy DL, Kroes HY, Martin NG, Boomsma DI, Campbell M, Coulthard MG, Whitelaw E, Chong S. 2006. 
Increased DNA methylation at the AXIN1 gene in a monozygotic twin from a pair discordant for a caudal duplication anomaly. Am J Hum Genet 79: 155–162. Paro R, Strutt H, Cavalli G. 1998. Heritable chromatin states induced by the Polycomb and trithorax group genes. Novartis Found Symp 214: 51–61; discussion 61–56, 104–113. Pembrey ME, Bygren LO, Kaati G, Edvinsson S, Northstone K, Sjostrom M, Golding J. 2006. Sex-specific, male-line transgenerational responses in humans. Eur J Hum Genet 14: 159–166. Popp C, Dean W, Feng S, Cokus SJ, Andrews S, Pellegrini M, Jacobsen SE, Reik W. 2010. Genome-wide erasure of DNA methylation in mouse primordial germ cells is affected by AID deficiency. Nature 463: 1101–1105. Rakyan V, Whitelaw E. 2003. Transgenerational epigenetic inheritance. Curr Biol 13: R6. doi: 10.1016/S0960-9822(02)1377-5. Rakyan VK, Chong S, Champ ME, Cuthbert PC, Morgan HD, Luu KV, Whitelaw E. 2003. Transgenerational inheritance of epigenetic states at the murine Axin(Fu) allele occurs after maternal and paternal transmission. Proc Natl Acad Sci 100: 2538–2543. Rangwala SH, Elumalai R, Vanier C, Ozkan H, Galbraith DW, Richards EJ. 2006. Meiotically stable natural epialleles of Sadhu, a novel Arabidopsis retroposon. PLoS Genet 2: e36. doi: 10.1371/journal.pgen.0020036. Rassoulzadegan M, Grandjean V, Gounon P, Vincent S, Gillot I, Cuzin F. 2006. RNA-mediated non-mendelian inheritance of an epigenetic change in the mouse. Nature 441: 469–474. Reinders J, Wulff BB, Mirouze M, Mari-Ordonez A, Dapp M, Rozhon W, Bucher E, Theiler G, Paszkowski J. 2009. Compromised stability of DNA methylation and transposon immobilization in mosaic Arabidopsis epigenomes. Genes Dev 23: 939–950. Roemer I, Reik W, Dean W, Klose J. 1997. Epigenetic inheritance in the mouse. Curr Biol 7: 277–280. Shiu PK, Raju NB, Zickler D, Metzenberg RL. 2001. Meiotic silencing by unpaired DNA. Cell 107: 905–916. Sidorenko L, Chandler V. 2008. 
RNA-dependent RNA polymerase is required for enhancer-mediated transcriptional silencing associated with paramutation at the maize p1 gene. Genetics 180: 1983–1993. Sidorenko L, Dorweiler JE, Cigan AM, Arteaga-Vazquez M, Vyas M, Kermicle J, Jurcin D, Brzeski J, Cai Y, Chandler VL. 2009. A dominant mutation in mediator of paramutation2, one of three second-largest subunits of a plant-specific RNA polymerase, disrupts multiple siRNA silencing processes. PLoS Genet 5: e1000725. doi: 10.1371/journal.pgen.1000725. Slotkin RK, Vaughn M, Borges F, Tanurdzic M, Becker JD, Feijo JA, Martienssen RA. 2009. Epigenetic reprogramming and small RNA silencing of transposable elements in pollen. Cell 136: 461–472. Soppe WJ, Jacobsen SE, Alonso-Blanco C, Jackson JP, Kakutani T, Koornneef M, Peeters AJ. 2000. The late flowering phenotype of fwa mutants is caused by gain-of-function epigenetic alleles of a homeodomain gene. Mol Cell 6: 791–802. Stam M, Belele C, Dorweiler JE, Chandler VL. 2002. Differential chromatin structure within a tandem array 100 kb upstream of the maize b1 locus is associated with paramutation. Genes Dev 16: 1906–1918.
Stonaker JL, Lim JP, Erhard KF Jr, Hollick JB. 2009. Diversity of Pol IV function is defined by mutations at the maize rmr7 locus. PLoS Genet 5: e1000706. doi: 10.1371/journal.pgen.1000706. Sutherland HG, Kearns M, Morgan HD, Headley AP, Morris C, Martin DI, Whitelaw E. 2000. Reactivation of heritably silenced gene expression in mice. Mamm Genome 11: 347–355. Teixeira FK, Heredia F, Sarazin A, Roudier F, Boccara M, Ciaudo C, Cruaud C, Poulain J, Berdasco M, Fraga MF, et al. 2009. A role for RNAi in the selective correction of DNA methylation defects. Science 323: 1600–1604. Vandegehuchte MB, Lemiere F, Vanhaecke L, Vanden Berghe W, Janssen CR. 2009. Direct and transgenerational impact on Daphnia magna of chemicals with a known effect on DNA methylation. Comp Biochem Physiol C Toxicol Pharmacol 151: 278–285. Vastenhouw NL, Brunschwig K, Okihara KL, Muller F, Tijsterman M, Plasterk RH. 2006. Gene expression: Long-term gene silencing by RNAi. Nature 442: 882. doi: 10.1038/442882a. Verhoeven KJ, Jansen JJ, van Dijk PJ, Biere A. 2010. Stress-induced DNA methylation changes and their heritability in asexual dandelions. New Phytol 185: 1108–1118. Waddington CH. 1942. The epigenotype. Endeavor 1: 10–20. Wagner KD, Wagner N, Ghanbarian H, Grandjean V, Gounon P, Cuzin F, Rassoulzadegan M. 2008. RNA induction and inheritance of epigenetic cardiac hypertrophy in the mouse. Dev Cell 14: 962–969.
Waterland RA, Jirtle RL. 2003. Transposable elements: Targets for early nutritional effects on epigenetic gene regulation. Mol Cell Biol 23: 5293– 5300. Waterland RA, Dolinoy DC, Lin JR, Smith CA, Shi X, Tahiliani KG. 2006. Maternal methyl supplements increase offspring DNA methylation at Axin Fused. Genesis 44: 401–406. Waterland RA, Travisano M, Tahiliani KG. 2007. Diet-induced hypermethylation at agouti viable yellow is not inherited transgenerationally through the female. FASEB J 21: 3380–3385. Weaver IC, Cervoni N, Champagne FA, D’Alessio AC, Sharma S, Seckl JR, Dymov S, Szyf M, Meaney MJ. 2004. Epigenetic programming by maternal behavior. Nat Neurosci 7: 847–854. Wolff GL, Kodell RL, Moore SR, Cooney CA. 1998. Maternal epigenetics and methyl supplements affect agouti gene expression in Avy/a mice. FASEB J 12: 949–957. Xing Y, Shi S, Le L, Lee CA, Silver-Morse L, Li WX. 2007. Evidence for transgenerational transmission of epigenetic tumor susceptibility in Drosophila. PLoS Genet 3: 1598–1606. Yi H, Richards EJ. 2009. Gene duplication and hypermutation of the pathogen Resistance gene SNC1 in the Arabidopsis bal variant. Genetics 183: 1227–1234. Youngson NA, Whitelaw E. 2008. Transgenerational epigenetic effects. Annu Rev Genomics Hum Genet 9: 233–257.
Research
A recombination hotspot leads to sequence variability within a novel gene (AK005651) and contributes to type 1 diabetes susceptibility
Iris K.L. Tan,1,2,6 Leanne Mackin,1,6 Nancy Wang,1,2 Anthony T. Papenfuss,3 Colleen M. Elso,1 Michelle P. Ashton,1,2 Fiona Quirk,3 Belinda Phipson,3,4 Melanie Bahlo,3 Terence P. Speed,3 Gordon K. Smyth,3 Grant Morahan,5 and Thomas C. Brodnicki1,7
1St. Vincent’s Institute of Medical Research, Fitzroy, Victoria 3065, Australia; 2Department of Medicine, The University of Melbourne, Parkville, Victoria 3010, Australia; 3The Walter & Eliza Hall Institute of Medical Research, Parkville, Victoria 3052, Australia; 4Department of Medical Biology, The University of Melbourne, Parkville, Victoria 3010, Australia; 5The Western Australian Institute of Medical Research, Perth, Western Australia 6000, Australia
6These authors contributed equally to this work.
7Corresponding author. E-mail [email protected]; fax 613-9416-2676.
Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.101881.109.
More than 25 loci have been linked to type 1 diabetes (T1D) in the nonobese diabetic (NOD) mouse, but identification of the underlying genes remains challenging. We describe here the positional cloning of a T1D susceptibility locus, Idd11, located on mouse chromosome 4. Sequence analysis of a series of congenic NOD mouse strains over a critical 6.9-kb interval in these mice and in 25 inbred strains identified several haplotypes, including a unique NOD haplotype, associated with varying levels of T1D susceptibility. Haplotype diversity within this interval between congenic NOD mouse strains was due to a recombination hotspot that generated four crossover breakpoints, including one with a complex conversion tract. The Idd11 haplotype and recombination hotspot are located within a predicted gene of unknown function, which exhibits decreased expression in relevant tissues of NOD mice. Notably, it was the recombination hotspot that aided our mapping of Idd11 and confirms that recombination hotspots can create genetic variation affecting a common polygenic disease. This finding has implications for human genetic association studies, which may be affected by the approximately 33,000 estimated hotspots in the genome.
[Supplemental material is available online at http://www.genome.org. The sequence data from this study have been submitted to dbSNP (http://www.ncbi.nlm.nih.gov/projects/SNP/) under accession nos. ss262803370, ss262803372, ss262803374, ss262803376, ss262803379, ss262803382, ss262803385, ss262803388, ss262803390, ss262803391, ss262803392, ss262803394, ss262803397, ss262803400, ss262803402, ss262803403, ss262803404, and ss262803405, and to the NCBI Probe Database (http://www.ncbi.nlm.nih.gov/sites/entrez?db=probe) under accession nos. 10544425–10544446.]
Type 1 diabetes (T1D) is a polygenic autoimmune disease in which lymphocytes mediate the destruction of insulin-producing beta cells in the pancreas (Atkinson and Eisenbarth 2001). The events that trigger the pathogenic autoimmune response are still not clear, but recent genome-wide association studies indicate that there are more than 40 loci affecting susceptibility to T1D in humans (Hakonarson et al. 2007; Todd et al. 2007; The Wellcome Trust Case Control Consortium 2007; Barrett et al. 2009; Concannon et al. 2009a). These studies have detected association of T1D with the common variants for previously identified genes and new candidates. Except for the HLA locus, however, these loci have small effects upon disease risk (odds ratio < 2.5) and fail to adequately explain the genetic variance for T1D (Concannon et al. 2009b). Instead, it has been proposed that rare/private mutations with larger effects may account for the missing genetic variance in complex genetic diseases (Goldstein 2009). Whether rare or
common, identifying the actual causative variants for T1D has proved challenging due to genetic heterogeneity among affected individuals. A complementary approach to human studies is the use of inbred mouse strains, in which genetic heterogeneity is avoided, and selective mating can precisely map genes for which allelic variation affects disease susceptibility. Analysis of the nonobese diabetic (NOD) mouse strain, which spontaneously develops T1D similar to humans, has been widely used to better understand disease pathology and gain key insights into the genetics of T1D (Atkinson and Leiter 1999). In parallel to human studies, more than 25 loci (termed Idd) have been linked to T1D in the NOD mouse (Serreze and Leiter 2001; Ridgway et al. 2008). Confirmation of these loci is best achieved using congenic mouse strains (Rogner and Avner 2003), which are generated by controlled mating of NOD mice with diabetes-resistant strains to introduce a donor-derived chromosome interval carrying a resistant allele onto the susceptible NOD genetic background. By testing smaller donor-derived intervals for their effect upon diabetes onset, a region small enough to be sequenced for disease-causing variants can be identified. To date, congenic NOD strains have confirmed Idd loci on chromosomes (chr) 1–4, 6, 7, 11, 13, 17, and 18 (Serreze and
Leiter 2001; Ridgway et al. 2008). Besides the MHC region on Chr 17, at least eight of these loci have been dissected into smaller intervals using congenic NOD strains, and accumulating evidence has identified B2m, Il2, Il21, Ctla4, Slc11a1, and Trpv1 as genes for which NOD mice harbor T1D susceptibility alleles (Hamilton-Williams et al. 2001; Kissler et al. 2006; Razavi et al. 2006; Yamanouchi et al. 2007; Araki et al. 2009; McGuire et al. 2009). Preliminary evidence also suggests that Arntl2 and Stat5b are potential T1D susceptibility genes (Hung et al. 2006; Laloraya et al. 2006). Although few causative variants have been defined to date, the NOD alleles that increase T1D risk in this mouse strain can be either rare or common among inbred mouse strains. For example, the effect of the Idd1 locus is attributed to the relatively rare MHC class II variant (Abg7), together with a more common variant found in other inbred mouse strains that encodes a deletion in the I-E-a chain promoter (Serreze and Leiter 2001). This observation, along with human genetic studies, suggests that increased T1D risk in humans may also result from the combination of rare and common variants within the human population (Concannon et al. 2009b). Despite the identification of several Idd genes to date, this limited collection does not fully explain T1D pathogenesis or the underlying genetic architecture for T1D risk. One of the many Idd loci still to be identified is Idd11, which is located on Chr 4 and originally linked to T1D in NOD backcrosses to the C57BL/6 (B6) and SJL strains (Morahan et al. 1994). As B6 mice carry a resistance allele for Idd11, congenic strains with different Chr 4-B6–derived intervals on the NOD genetic background were produced (NOD.B6Idd11A, NOD.B6Idd11B, NOD.B6Idd11C, NOD.B6Idd11D). Three of these congenic NOD strains demonstrated significant diabetes resistance, thus confirming and localizing Idd11 to an ~8-Mb interval on Chr 4 (Brodnicki et al. 2000, 2005). To localize Idd11 further and identify the underlying gene, we established new congenic NOD strains to dissect this ~8-Mb interval and monitored them for diabetes onset. Remarkably, each of these smaller congenic intervals was derived from recombination breakpoints within the same 6.9-kb interval that resulted in varying levels of T1D susceptibility for the different congenic NOD strains. Here, we report the sequence analysis of these breakpoints and the identification of a recombination hotspot that led to the discovery of a novel candidate gene (GenBank mRNA: AK005651) for Idd11. This gene of unknown function exhibits decreased expression in the thymus and spleen of NOD mice. Furthermore, NOD mice carry a unique haplotype for Idd11 compared with 25 other inbred mouse strains analyzed. Our findings demonstrate that recombination hotspots, which have been relatively neglected in human association studies, can create unique DNA sequence variation that has relatively large effects upon the risk for a common polygenic disease.
20:1629–1638 © 2010 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/10; www.genome.org
Results
Table 1. Genetic intervals for Idd11 congenic mouse strains

                                 Congenic strains^c
Marker^a     Position (bp)^b     B^d   D     E     F     G
D4Mit12      124,048,407         N     N     N     N     N
D4Mit338     125,017,654         N     N     N     N     N
D4Mit73      126,497,935         N     N     N     N     N
D4Mit72      128,630,230         N     B     B     N     B
D4Mit203     129,249,262         N     B     B     N     B
D4Wehi1      129,422,665         N     B     B     N     B
D4Wehi2      129,521,871         N     B     B     N     B
D4Wehi5      129,633,461         N     B     B     N     B
Idd11                            *     B     *     *     *
D4Wehi6      129,640,320         B     B     N     B     N
D4Wehi13     129,666,984         B     B     N     B     N
D4Wehi17     129,711,291         B     B     N     B     N
D4Wehi21     130,392,752         B     B     N     B     N
A892         130,848,378         B     B     N     B     N
D4Wehi22     132,008,637         B     B     N     B     N
D4Mit204     132,983,282         B     B     N     B     N
D4Mit339     133,923,341         B     B     N     B     N
D4Mit69      135,916,989         N     N     N     N     N
D4Mit126     142,152,658         N     N     N     N     N
D4Mit256     154,364,548         N     N     N     N     N

a For D4Wehi marker oligonucleotides and their NCBI Probe Database accession numbers, see Supplemental Table 1.
b Genomic coordinates are from NCBI build 37 assembly, mm9.
c Strain names have been abbreviated (e.g., D = Idd11D = NOD.B6Idd11D).
d The Idd11B congenic interval is presented here for comparison. The T1D incidence curve for Idd11B mice has been previously reported and is similar to NOD mice (Brodnicki et al. 2000, 2005).
B (boldface), C57BL/6 genotype; N, NOD genotype; *, location of breakpoints described in Table 2 that aided in localizing Idd11.
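The marker coordinates and genotypes in Table 1 are sufficient to recompute the size of the interval that contains the starred breakpoints. The Python sketch below is illustrative only: the variable and function names are hypothetical, and the genotype strings are transcribed from Table 1 on the assumption that each strain column lists the markers in the order shown.

```python
# Illustrative sketch; data transcribed from Table 1, all names are hypothetical.

# Marker name and position (bp, NCBI build 37 / mm9), in chromosomal order.
MARKERS = [
    ("D4Mit12", 124_048_407), ("D4Mit338", 125_017_654),
    ("D4Mit73", 126_497_935), ("D4Mit72", 128_630_230),
    ("D4Mit203", 129_249_262), ("D4Wehi1", 129_422_665),
    ("D4Wehi2", 129_521_871), ("D4Wehi5", 129_633_461),
    ("D4Wehi6", 129_640_320), ("D4Wehi13", 129_666_984),
    ("D4Wehi17", 129_711_291), ("D4Wehi21", 130_392_752),
    ("A892", 130_848_378), ("D4Wehi22", 132_008_637),
    ("D4Mit204", 132_983_282), ("D4Mit339", 133_923_341),
    ("D4Mit69", 135_916_989), ("D4Mit126", 142_152_658),
    ("D4Mit256", 154_364_548),
]

# One letter per marker: N = NOD genotype, B = C57BL/6 genotype.
GENOTYPES = {
    "Idd11B": "N" * 8 + "B" * 8 + "N" * 3,
    "Idd11D": "N" * 3 + "B" * 13 + "N" * 3,
    "Idd11E": "N" * 3 + "B" * 5 + "N" * 11,
    "Idd11F": "N" * 8 + "B" * 8 + "N" * 3,
    "Idd11G": "N" * 3 + "B" * 5 + "N" * 11,
}


def breakpoints(genotype):
    """Return index pairs (i, i + 1) where the genotype switches between N and B."""
    return {(i, i + 1) for i in range(len(genotype) - 1)
            if genotype[i] != genotype[i + 1]}


def shared_breakpoint_intervals(strains):
    """Marker intervals that contain a breakpoint in every listed strain."""
    common = set.intersection(*(breakpoints(GENOTYPES[s]) for s in strains))
    return [(MARKERS[i][0], MARKERS[j][0], MARKERS[j][1] - MARKERS[i][1])
            for i, j in sorted(common)]


if __name__ == "__main__":
    for left, right, size_bp in shared_breakpoint_intervals(
            ["Idd11B", "Idd11E", "Idd11F", "Idd11G"]):
        print(f"{left} to {right}: {size_bp} bp")
```

For the four strains with a starred breakpoint, the only shared interval is the one between D4Wehi5 and D4Wehi6, whose marker coordinates differ by 6,859 bp, in line with the ~6.9-kb interval quoted in the text.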
Mapping Idd11 using congenic mouse strains
To localize Idd11 further, new congenic mouse strains were derived from NOD.B6Idd11D because this strain carried the smallest Chr 4-B6–derived interval providing diabetes protection among our panel of previously characterized congenic strains (Table 1; Brodnicki et al. 2000, 2005). Briefly, heterozygous NOD.B6Idd11D mice were intercrossed to generate F2 progeny that were screened for recombination events by genotyping novel markers we identified within this interval (Table 1; Supplemental Table 1; Supplemental Fig. 1). Three new congenic NOD strains (NOD.B6Idd11E, NOD.B6Idd11F, NOD.B6Idd11G) were established from recombinant F2 mice, and these were monitored for diabetes onset compared to NOD and Idd11D mice (note that henceforth congenic strain names are abbreviated, e.g., NOD.B6Idd11D = Idd11D) (Fig. 1). Comparison of the resulting diabetes incidence curves indicated that Idd11 mapped to an ~6.9-kb interval, between D4Wehi5 and D4Wehi6 (Table 1; Fig. 1A,B). However, the Idd11 locus appeared to be more complex than expected because the congenic strains exhibited varying levels of T1D incidence. For example, Idd11E and Idd11G seemed to have identical B6-derived intervals, but Idd11E was more susceptible to T1D than Idd11G (65% vs. 33% diabetic by 300 d). On the other hand, the Idd11G interval provided less diabetes protection compared with the larger Idd11D interval (33% vs. 6%) but provided greater protection compared with the Idd11F interval (33% vs. 52%). As Idd11E, Idd11F, and Idd11G were derived from the Idd11D strain, we postulated that sequencing the recombinant boundary between D4Wehi5 and D4Wehi6 would explain the variability in T1D susceptibility between the congenic strains.
Sequence analysis of the Idd11 critical interval
Sequence analysis of the NOD, B6, and congenic NOD strains identified several sequence variants within the ~6.9-kb interval (Table 2; Supplemental Tables 2, 3). Remarkably, Idd11E and Idd11G were isogenic except at sequence variant 3, demonstrating that allelic variation at this position can significantly affect T1D susceptibility. However, the B6-derived allele at this position alone could not account for all of the Idd11 effect since it conferred different levels of diabetes protection to Idd11D (~6% diabetic by 300 d), Idd11F (~52%), and Idd11G (~33%) (Fig. 1A,B; Table 2). This comparison of diabetes incidence curves indicated that Idd11 was likely due to a haplotype effect in which B6-derived sequences are required at more than one position to provide optimal T1D protection. These data raised a critical question: How big is the Idd11 haplotype? NOD crosses with either SJL, NON, 129, C57L, C57BL/10 (B10), or NOR mouse strains have previously demonstrated the linkage of T1D to Chr 4, suggesting that these strains, similar to B6, may harbor a T1D-resistance allele for Idd11 (Morahan et al. 1994; Rodrigues et al. 1994; McAleer et al. 1995; McDuffie 2000; Reifsnyder et al. 2005; Leiter et al. 2009). The B10-identified and the NOR-identified loci, which overlap the B6-defined Idd11 locus, have also been confirmed by congenic strains (Lyons et al. 2000; Reifsnyder et al. 2005). Sequence analysis determined that the B10 and NOR strains were identical to B6 across the congenic breakpoint positions (Table 2; Supplemental Table 4). In contrast, the SJL strain was identical to NOD at these variants except at 1, 3, and 5, for which SJL was identical to B6, B10, and NOR (Table 2; Supplemental Table 4). Further sequence analysis for these variants within 20 other inbred mouse strains (including wild-derived strains) indicated that the Idd11 haplotype consisted of at least variants 1–5, with NOD mice representing a unique haplotype marked by a 12-bp deletion at variant 1 (Table 2; Supplemental Table 4).
Figure 1. Cumulative diabetes incidence curves for Idd11 congenic strains. Female cohorts were monitored for diabetes by measuring urinary glucose levels. Pairwise comparisons of diabetes incidence curves were performed using the log-rank test. As a historical footnote, Idd11E was monitored first because it was presumed that this strain would be protected against diabetes given that Idd11B exhibited a similar diabetes incidence curve to NOD (Brodnicki et al. 2000). Once we observed that Idd11E was not protected against diabetes, we established cohorts for Idd11F and Idd11G to confirm this result using independent congenic strains, which initially appeared to have congenic intervals similar to Idd11B and Idd11E, respectively (Table 1). Subsequent sequence analysis of the interval between D4Wehi5 and D4Wehi6 (Table 2) identified genetic variation that explained the diabetes incidence observed for Idd11B (Brodnicki et al. 2000) and Idd11E (A).
Confirmation of a recombination hotspot
It was conspicuous that recombination breakpoints for four of our congenic strains occurred within an ~1.2-kb interval and resulted in one instance of a crossover with a complex conversion tract (Idd11E) (Table 2). This dense clustering of crossover events and the presence of a complex conversion tract is characteristic of a recombination hotspot (Petes 2001; Jeffreys and May 2004; Bois 2007; Kauppi et al. 2007). To confirm and measure the frequency of the meiotic crossovers within the Idd11 haplotype, 723 F2 progeny were generated by intercrossing heterozygous (NOD × Idd11D)F1 mice and were screened for recombination events between D4Wehi5 and D4Wehi6. The calculated crossover activity was ~50 cM Mb⁻¹, which is 100-fold greater than the mouse genome average (~0.5 cM Mb⁻¹) (Shiroishi et al. 1995), and confirms that this interval harbors a recombination hotspot (Fig. 2). Genomic DNA available for 231 of these F2 progeny (representing 462 meiosis events) that did not exhibit a crossover between D4Wehi5 and D4Wehi6 was further genotyped for sequence variants 1–8 (Table 2). Only one noncrossover event (also termed gene conversion) was identified, at variant 4. This suggests a relatively low frequency (<0.5%) for noncrossover events in this hotspot, but gene conversion events can only be detected and estimated using the available sequence variants located within the hotspot interval. Hence, this conversion frequency is likely underestimated due to the relatively few variants within this interval between NOD and B6 mice. While other local recombination hotspots may reside elsewhere within the Idd11D interval, we observed no recombination events either ≤112 kb proximal to D4Wehi5 or ≤71 kb distal to D4Wehi6 (Fig. 2, inset). These flanking recombination “cold regions” have restricted our ability to generate smaller congenic intervals encompassing only the proposed Idd11 haplotype.
It should be noted that while this recombination hotspot may generate noncrossover events, more than one variant within this haplotype would need to be converted to confer T1D protection in subsequent congenic NOD mice.
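The crossover activity and noncrossover frequency reported for this interval can be reproduced from the counts given in the text and in the Figure 2 legend (723 F2 progeny, five crossovers, a 6859-bp interval, and one conversion among 462 meioses). The following is a minimal sketch of that arithmetic; the function and variable names are illustrative and are not taken from the study.

```python
def crossover_activity(n_crossovers: int, n_meioses: int, interval_bp: int):
    """Return (genetic distance in cM, crossover activity in cM/Mb).

    Genetic distance is taken as the percentage of meioses that carry a
    crossover in the interval; activity divides that by the interval
    length expressed in megabases.
    """
    distance_cm = 100.0 * n_crossovers / n_meioses
    activity_cm_per_mb = distance_cm / (interval_bp / 1_000_000)
    return distance_cm, activity_cm_per_mb


# Counts reported for the D4Wehi5-D4Wehi6 interval.
F2_PROGENY = 723
MEIOSES = 2 * F2_PROGENY          # each F2 pup represents two meioses
distance_cm, activity = crossover_activity(5, MEIOSES, 6859)

# Comparison with the genome-wide average quoted in the text (~0.5 cM/Mb).
fold_over_average = activity / 0.5

# One noncrossover (gene conversion) event among 231 progeny (462 meioses).
noncrossover_frequency = 1 / (2 * 231)

print(f"distance: {distance_cm:.3f} cM, activity: {activity:.1f} cM/Mb")
print(f"fold over genome average: {fold_over_average:.0f}")
print(f"noncrossover frequency: {100 * noncrossover_frequency:.2f}%")
```

The result is about 0.35 cM across the interval and roughly 50 cM/Mb, in line with the values stated above; the noncrossover estimate falls below the 0.5% bound quoted in the text.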
Bioinformatic and expression analysis of a novel candidate gene for Idd11
The Idd11 haplotype, including the recombination hotspot, is located within a predicted gene, termed AK005651 (i.e., GenBank mRNA accession number), supported by a collection of expressed sequence tags (ESTs) (Supplemental Fig. 2). The encoded transcript consists of five exons (Fig. 3A; Supplemental Table 5) and two splice isoforms (±exon 3) confirmed by RT-PCR and sequencing data (data not shown). AK005651 has a number of open reading frames, but none demonstrate clear protein domains or homology with known proteins nor does the transcript encode an obvious microRNA. Conservation between mammalian sequences across this region suggests the presence of a human ortholog, although no orthologous human spliced EST or transcript has yet been described (bioinformatic analysis is summarized in Supplemental Table 6). If AK005651 does encode a protein, then variants 2 and 4 result in amino acid changes that depend on the open reading
Genome Research www.genome.org
Downloaded from genome.cshlp.org on December 25, 2010 - Published by Cold Spring Harbor Laboratory Press
Tan et al.
frame (as shown in Supplemental Table 7). For the three possible open reading frames, only one results in silent substitutions due to variants 2 and 4 (Supplemental Table 7, ORF 1), whereas these two variants produce amino acid changes in all other open reading frames between NOD and B6. The longest open reading frame for AK005651 (methionine to stop) is 53 amino acids and is located in exon 5. No sequence variation between NOD and B6 occurs in this open reading frame.

Table 2. Genotypes for the proposed Idd11 haplotype

            Sequence variantᵃ
Strain      1              2   3   4   5   6   7    8   Cumulative diabetes incidence
NOD         —              C   C   A   T   C   —    T   71%, 73%
Idd11Bᵇ     —              C   C   A   C   T   AG   C   73%
Idd11Eᶜ     CCTCTCGGTGTT   T   C   G   T   C   —    T   65%ᶜ
Idd11F      —              T   T   G   C   T   AG   C   52%
Idd11G      CCTCTCGGTGTT   T   T   G   T   C   —    T   33%
Idd11D      CCTCTCGGTGTT   T   T   G   C   T   AG   C   6%, 12%
NONᵈ        CCTCTCGGTGTT   C   C   A   T   C   —    T   0%
SJLᵉ        CCTCTCGGTGTT   C   T   A   C   C   —    T   0%
PWD         CCTCTCGGTGTT   C   T   G   C   C   AG   C   0%ᵍ
CAST        CCTCTCGGTGTT   T   T   G   C   C   AG   T   0%ᵍ
C57BL/6ᶠ    CCTCTCGGTGTT   T   T   G   C   T   AG   C   0%
Distance between markers (bp): 128 (1–2), 152 (2–3), 138 (3–4), 820 (4–5), 149 (5–6), 557 (6–7), 37 (7–8)

ᵃ Sequence variant 1, ss262803379 (129,638,063); 2, rs32465505 (129,638,192); 3, rs27507073 (129,638,344); 4, rs27507071 (129,638,482); 5, rs27507066 (129,639,302); 6, rs27507065 (129,639,451); 7, ss262803405 (129,640,008); and 8, rs48826903 (129,640,045). Chr 4 positions are listed in parentheses and based on NCBI build 37 assembly, mm9. Boldface text indicates sequence identical to C57BL/6 sequence.
ᵇ Idd11B mice exhibit a similar diabetes incidence curve to NOD and Idd11E mice (Brodnicki et al. 2000, 2005).
ᶜ Idd11E represents a crossover with a complex conversion tract as confirmed by direct sequencing of a PCR product encompassing markers 1–4 derived from genomic DNA from five independent Idd11E mice.
ᵈ Variants shared by NON, CTS, BALB/c, NZO, NZW, C3H, CBA, DBA/2, A/J, ALR, ALS, AKR, 129/Sv, WSB.
ᵉ Variants shared by SJL, SWR, FVB.
ᶠ Variants shared by C57BL/6, C57BL/10, NOR, NZB, DBA/1, MOLF.
ᵍ Spontaneous diabetes has not been reported for PWD and CAST mouse strains.
Note that sequencing of genomic DNA obtained from independent NOD and B6 strains from the Jackson Laboratory were identical to the NOD and B6 strains within our mouse colony. Sequence variation across this region showing the individual mouse strains is provided in Supplemental Table 4.

Sequence variation within the proposed Idd11 haplotype may also affect gene expression. The sequence encompassing variants 1–4 contains predicted transcription factor binding sites (TFBSs) based upon conservative analyses using TRANSFAC matrices (Matys et al. 2006), but none of the Idd11 sequence variants occur within these predicted TFBSs (Supplemental Table 8). Variants 1 and 2 do occur in a promoter-associated regulatory feature (Supplemental Fig. 2) predicted by Ensembl based upon DNase I hypersensitivity and the enrichment of histone H3K4me3 in different cell types (Mikkelsen et al. 2007). Real-time PCR analysis indicated that AK005651 is differentially expressed between NOD and diabetes-resistant mouse strains (i.e., B6 and Idd11D) in tissues relevant to T1D pathogenesis. Exon 4 and 5 were present in both splice variants, and B6 and Idd11D mice had at least a twofold increase in thymic and splenic expression of these spliced exons (Fig. 3B). On the other hand, expression analysis of Spocd1 and Bai2, two genes flanking AK005651 (Supplemental Fig. 2), indicated that the sequence variation within AK005651, in particular variants 1–4 for the Idd11 haplotype, is unlikely to affect transcription levels of these two flanking genes (Supplemental Fig. 3).

Expression differences may also reflect deficient exon splicing. Sequence variants 2–5 do not disrupt canonical splice sites, although they may alter cryptic intron/exon splicing motifs (Chasin 2007; Wang and Burge 2008). In contrast, the unique NOD 12-bp deletion at position 1 shortens the intron between exons 2 and 3. Real-time PCR detected a significant decrease in a spliced product from these two exons in thymic and splenic RNA isolated from NOD mice compared with B6 and Idd11D mice (Fig. 3C). The spliced product from exons 2 and 4 was also significantly decreased in the NOD thymus and spleen, as well as the pancreas, compared with B6 and Idd11D (Fig. 3D). We note that the anomalous result for the liver (Fig. 3C) was due to one extreme outlier. This B6 sample did not result in an outlier using other real-time PCR probes and was the only such outlier observed within our expression data set (Figs. 3, 4; Supplemental Fig. 3).

To further dissect the effect of the Idd11 haplotype upon AK005651, expression analysis was performed using thymic and splenic RNA isolated from our series of congenic mouse strains (Fig. 4; note: NOD, B6, Idd11D represent different cohorts to those in Fig. 3). Variant 1 (12-bp ins/del) had the most distinct effect upon AK005651 expression. Congenic mice harboring the B6-derived insertion (Idd11D, Idd11G, Idd11E) exhibited significantly higher transcription levels of the spliced product for exons 2/3 compared with NOD mice (Fig. 4B). However, the effect of the Idd11 haplotype upon the transcription levels of spliced products for exons 2/4 and for exons 4/5 appears to be more complex. While the two spliced products had similar expression profiles in the respective tissues (Fig. 4A,C), the different Idd11 haplotypes led to varied expression between strains, as well as between tissues. For example, only Idd11D exhibited twofold or more increases for both splice products in both tissues; whereas Idd11E and Idd11F had more than twofold increases only in the spleen. These results indicate that different combinations of B6-derived sequences within the Idd11 haplotype have varying effects upon AK005651 expression, with increased expression for all three spliced products in both the thymus and spleen correlating with the greatest degree of T1D protection (i.e., Idd11D).

Discussion
Positional cloning using congenic mouse strains has been used to confirm and localize a number of T1D susceptibility loci (Serreze and Leiter 2001; Ridgway et al. 2008). Nonetheless, only a handful of genes for Idd loci have been identified to date, principally because it takes large mouse colonies and significant breeding time to generate a series of congenic mouse strains that maps a locus to within a genomic interval containing only one or a few candidate genes. Even when the candidate genes are identified (whether in mouse or human genetic studies), it is often difficult to determine which sequence variant(s) within the mapped interval are causal, unless they are obvious mutations that disrupt known coding, splicing, or regulatory motifs (Hindorff et al. 2009). In this study,
A recombination hotspot defines Idd11
Figure 2. Distribution of crossover breakpoints within the NOD.B6Idd11D congenic interval. Seven-hundred-twenty-three F2 pups (1446 meiosis events) were derived from intercrossing NOD.B6Idd11D F1 mice and were genotyped for markers within the NOD.B6Idd11D congenic interval. The crossover activity was calculated by dividing the frequency of crossovers in each interval by its length. For example, five pups had a recombination between D4Wehi5 and D4Wehi6, giving rise to a genetic distance of 0.345 cM across this 6859-bp interval, which is ~50 cM Mb⁻¹. Bars, crossover frequency in cM/Mb in each interval. Above each bar, the number of crossovers observed in that marker interval is shown. The dashed line indicates the average crossover activity for the mouse genome: ~0.5 cM/Mb (Shiroishi et al. 1995). Genetic markers are indicated by ticks above the plot. The inset displays an enlarged view of the interval between D4Wehi2 and D4Wehi17.
a recombination hotspot generated a unique series of congenic mouse strains for Idd11 and enabled us to identify a novel candidate gene for which sequence variation affects T1D susceptibility in the NOD mouse. Comparison of the sequence and diabetes incidence curves between our congenic NOD strains indicated that Idd11 is due to a haplotype, consisting of at least five variants, for which the composition has variable effects on susceptibility to T1D. For example, the Idd11D congenic strain has the complete B6-derived haplotype and the best protection against diabetes onset, whereas Idd11E and Idd11B lack two or more B6-derived alleles within this haplotype and have similar diabetes incidence to NOD mice (Table 2; Fig. 1; Brodnicki et al. 2000). In contrast, Idd11F and Idd11G mice lack one B6-derived allele within the proposed Idd11 haplotype but do so at different positions and exhibited different T1D incidence curves (Table 2; Fig. 1). Thus these variable effects depend on the number and combination of B6-derived alleles within this haplotype. Similar associations of polymorphic haplotypes with autoimmune disease are well established for the major histocompatibility complex (Fernando et al. 2008), the human cytokine gene cluster on chr 5q31 (Rioux et al. 2001), and the mouse SLAM/CD2 gene cluster (Wandstrat et al. 2004). However, the Idd11 haplotype appears to be confined to a single gene, whereas these other haplotypes encompass multiple genes.

Sequence analysis determined that NOD mice have a unique Idd11 haplotype due to a 12-bp deletion. Notably, the NOD strain was generated from the same inbreeding program as the CTS and NON strains (Makino et al. 1980; Beck et al. 2000), but these two strains (as well as 18 other laboratory and four wild-derived inbred strains) do not have the 12-bp deletion, suggesting this microdeletion arose as a de novo mutation or was preferentially inherited from a different parental ancestor during the derivation of the NOD strain. In either case, further sequence comparison was required because this microdeletion did not fully explain the varying levels of T1D susceptibility observed in our congenic mouse strains. Two additional sequence variants (rs27507073 and rs27507066) were associated with T1D based on resistance phenotypes determined by segregation and congenic analyses (Brodnicki et al. 2000, 2005; Lyons et al. 2000; Reifsnyder et al. 2005). Our finding does not imply that the Idd11 haplotype is limited to these three variants, but only indicates the minimum size of the B6-derived haplotype providing protection against T1D in NOD mice.

Idd11 is not the only T1D susceptibility locus on chr 4. Genetic studies have mapped at least three other Idd loci to this chromosome. An outcross between NOD and B10 originally detected a locus, termed Idd9, linked next to the telomere of chr 4 (Rodrigues et al. 1994). Congenic NOD strains, harboring B10-derived intervals, subsequently expanded and dissected this locus into three subloci (Idd9.1, Idd9.2, Idd9.3) (Lyons et al. 2000), with Idd9.1 mapped to the same region we had previously localized Idd11 (Morahan et al. 1994). The haplotype analysis reported here suggests that Idd11 and Idd9.1 are the same because B6 and B10 are identical by descent for this chromosome region. A NOD outcross with the NOR strain also detected linkage to the region encompassing Idd11, which was confirmed by NOD.NOR-chr 4 congenic mice (Reifsnyder et al. 2005). NOR is an inbred recombinant congenic strain derived from NOD and C57BLKS/J (Prochazka et al. 1992), and our sequence analysis indicated that the Idd11 haplotype for NOR is B6-derived. However, NOD.NOR-chr 4 mice harbor a larger congenic interval than the B6-derived intervals in our Idd11 congenic strains. Thus, resistant NOR-derived alleles at other chr 4 loci may also contribute to the overall diabetes protection observed for this congenic strain (Reifsnyder et al. 2005). Lastly, NOD outcrosses with the NON and 129/SvImJ mouse strains also detected linkage of T1D to chr 4 (McAleer et al. 1995; Leiter et al. 2009), but it is more likely that these linkage results were due to allelic variation at other loci because these two strains are NOD-like for the Idd11 haplotype; their allelic composition is not predicted to be protective based on comparison with the Idd11E haplotype for Idd11 (Table 2).

Idd11 appears to represent a “gene-based functional haplotype”: a defined sequence interval taken as a unit because individual variants are not sufficient to act as separate disease markers or completely account for the associated phenotypic effect (Hoehe 2003). The Idd11 haplotype, including the recombination hotspot, is located within AK005651, a predicted gene of unknown function. None of the open reading frames demonstrate homology with known proteins or evolutionary conservation, suggesting that AK005651 may encode a long noncoding RNA (Mercer et al. 2009; Ponting et al. 2009), although it was not detected in a recent large noncoding RNA screen of four mouse cell types (Guttman et al. 2009). As current bioinformatics approaches are limited in deciphering the function of AK005651, we investigated the effect of sequence variation on AK005651 expression. The different Idd11 haplotypes, represented by our series of congenic strains, have variable effects on AK005651 expression, which is perhaps not surprising given their variable effects on susceptibility to T1D. Real-time PCR indicated that the NOD Idd11 haplotype is associated with decreased expression of AK005651 in the thymus and spleen. In particular, the unique NOD microdeletion was associated with the decreased expression of the exon
2/3 splice product, suggesting that this microdeletion may shorten the intron in which it resides, impeding the splicing lariat structure and altering expression of splice isoforms (Black 2003). However, increased splicing of these two exons was not sufficient to increase T1D protection (e.g., Idd11E mice develop T1D similar to NOD mice but do not harbor the NOD microdeletion and have increased expression of the exon 2/3 splice product). Further comparison of our congenic strains also indicated that only Idd11D, which has the greatest T1D protection, exhibited significantly increased expression of AK005651 for all three spliced products in both the thymus and spleen compared with NOD. Idd11G and Idd11F have the next best T1D protection, respectively, but lack a similar increase and correlation between AK005651 expression and T1D protection. This lack of correlation may reflect that sequence variation within the Idd11 haplotype also affects the function of the encoded gene product (whether a large noncoding RNA or protein) leading to increased T1D protection, especially in combination with altered gene expression. The low level of expression for AK005651, although seemingly technically challenging (e.g., Ct values > 30 using the maximum amount of RNA/cDNA), was reproducibly detected by three different real-time PCR assays in different mouse cohorts. Given the relatively small differences observed between congenic strains, further studies are required to determine the function of AK005651 and how sequence composition within the Idd11 haplotype affects gene function and expression to alter T1D susceptibility. The recent availability of efficient NOD ES cells (Nichols et al. 2009) should also enable replacement of the NOD variants with B6-derived protective variants to ultimately confirm the effect of this apparent gene-based functional haplotype. At present, AK005651 represents a novel disease susceptibility gene compared with the current set of identified Idd genes, which mainly encode proteins of the immune system (e.g., MHC molecules, B2M, IL2, IL21, CTLA4, SLC11A1) that were relatively well characterized before their discovery as T1D genes (Serreze and Leiter 2001; Ridgway et al. 2008).

Figure 3. Schematic diagram and expression analysis of AK005651. (A) The Idd11 haplotype (variants 1–5) is located within a predicted gene, termed AK005651, consisting of five exons (Supplemental Table 5). Exon 3 is alternatively spliced giving rise to two different transcripts (±exon 3). Quantitative real-time PCR was performed to detect expression differences between mouse strains at 50 d of age for the spliced product derived from: (B) exons 4 and 5, which are present in both splice variants; (C) exons 2 and 3, (*) one of the four C57BL/6 samples was an extreme outlier (the mean relative normalized expression is 2.29 excluding this outlier); (D) exons 2 and 4. Approximate fold change (≥2) is shown only for significant pairwise comparisons between NOD and other mouse strains (P < 0.05, adjusted for multiple testing). Bars, mean expression level (±pooled SEM for each tissue).

Discovery of Idd11 relied on the presence of a recombination hotspot. The ability to map a disease locus using congenic mice depends on the occurrence of recombination events to generate successively smaller genomic intervals to identify the underlying gene. However, recombination during meiosis is not random (Petes 2001; Kauppi et al. 2004). Recombination hotspots and associated flanking coldspots ultimately restrict how small the congenic interval can become and dictate the number of candidate genes to be investigated. Paradoxically, Idd11 mapped to a recombination hotspot. Although this hotspot demonstrated lower crossover activity compared with other hotspots within the mouse genome (Guillon and de Massy 2002; Yauk et al. 2003; Kauppi et al. 2007; Paigen et al. 2008), it was highly localized (~1.2 kb) and resulted in one instance of a crossover with a complex conversion tract. This recombination hotspot provided the molecular mechanism that generated the “hybrid haplotypes” in our congenic NOD strains. It seems likely it is also responsible for the underlying allelic variation at this locus within the Mus species.

Rather than restricting our ability to map Idd11, this recombination hotspot enabled us to identify specific variants accounting for the varying levels of T1D susceptibility observed in our congenic strains. For example, the noncrossover/conversion event at rs27507073 (variant 3 in the Idd11 haplotype) significantly increased the risk of T1D in Idd11E mice compared with Idd11G mice. Gene conversion, as well as genomic rearrangement, associated with recombination hotspots has been implicated in a number of Mendelian diseases (Lupski and Stankiewicz 2005; Chen et al. 2007; Turner et al. 2008). Our congenic strains demonstrate that recombination hotspots can generate unique hybrid haplotypes, due to shuffled haplotypes and/or complex conversion tracts, which increase the risk for a complex genetic disease if they arise on a genetic background of some liability. Up to 85% of T1D cases in the human population are sporadic (Karvonen et al. 2001). Undoubtedly these affected individuals inherited some combination of diabetogenic alleles from their parents, but common variants (including copy number variation) identified in recent genome-wide association studies have failed to explain the total T1D risk attributable to genetic factors (Concannon et al.
2009b; Conrad et al. 2009; The Wellcome Trust Case Control Consortium 2010). Instead, recombination hotspots may contribute to the generation of rare variants and/or hybrid haplotypes accounting for the significantly increased T1D risk in those individuals who have inherited a certain level of genetic liability due to common variants.

Our study represents a primary example of a recombination hotspot associated with a complex genetic disease. Ng et al. (2010) have also recently described a recombination hotspot in a region of GABRB2 for which haplotypes are associated with schizophrenia. Similar to our study in congenic mice, their human study suggests that recombination hotspots are likely to contribute to the etiology of complex genetic diseases (Ng et al. 2010). However, sequence variation for the region on the human chr 1p35 encompassing the homologous Idd11 locus has not been associated with T1D (Barrett et al. 2009). This is not necessarily unexpected as the NOD mouse represents a “single case study” with a collection of diabetogenic alleles, some of which are unique to the NOD mouse strain, while others are common within the Mus species (Atkinson and Leiter 1999; Serreze and Leiter 2001). Given that the NOD Idd11 haplotype is unique, the equivalent susceptibility locus in humans might not exist or the equivalent human allele(s) may not be detectable in recent association studies because it is rare and/or resides in a recombination hotspot. HapMap data phase III (release 2, February 2009) shows that the homologous interval on the human chr 1p35 (between rs16834708 and rs13426) exhibits weak to no linkage disequilibrium (r² < 0.1) (The International HapMap Consortium 2007). Such potential recombination hotspots (~33,000 estimated in the human genome) (The International HapMap Consortium 2007) have so far been neglected in genetic studies due to the statistical methods employed for detecting association (Kauppi et al. 2004; The International HapMap Consortium 2007), but their ability to give rise to gene conversion and unique hybrid haplotypes, as observed for the Idd11 locus, demonstrates their capacity for producing sequence variation with demonstrable effects upon complex traits.

Figure 4. Expression analysis of AK005651 in congenic NOD mouse strains. Quantitative real-time PCR was performed to detect expression differences between mouse strains at 50 d of age for the spliced product derived from the following: (A) exons 4 and 5, which are present in both splice variants; (B) exons 2 and 3; (C) exons 2 and 4. Different cohorts of NOD, B6, and Idd11D mice to those in Figure 3 were generated and used in conjunction with Idd11B, Idd11E, Idd11F, and Idd11G mice for this experiment. Fold change (≥2) is shown only for significant pairwise comparisons between NOD and other mouse strains (P < 0.05, adjusted for multiple testing). Bars, mean expression level (±pooled SEM for each tissue).

Methods
Mice
NOD/Lt (NOD) and C57BL/6 (B6) mouse strains were obtained from The Walter & Eliza Hall Institute specific pathogen-free (SPF) facilities. NOD.B6Idd11A, NOD.B6Idd11B, NOD.B6Idd11C, and NOD.B6Idd11D were established after 10 backcross generations or more using a conventional breeding approach for congenic mouse strains as previously described (Rogner and Avner 2003; Brodnicki et al. 2005). NOD.B6Idd11E, NOD.B6Idd11F, and NOD.B6Idd11G mouse strains were generated from (NOD × NOD.B6Idd11D)F2 progeny (Supplemental Fig. 1), which were screened for recombination events between D4Mit72 and D4Mit204. New congenic intervals, which dissected the Idd11D interval (Fig. 1), were fixed to homozygosity by brother–sister mating.

Genotyping and sequencing
DNA samples were extracted from tail biopsies by standard methods and genotyped with polymorphic markers by PCR (Supplemental Table 1; Brodnicki et al. 2000). NOD.B6Idd11D was genotyped using a 10-cM averaged genome-wide marker panel, and no B6-derived alleles were found outside the congenic interval (note that all other congenic strains described were derived from NOD.B6Idd11D). To fine-map the recombination sites, new genetic markers (i.e., nucleotide repeats) were identified using the publicly available mouse genome sequence (NCBI Build 37 assembly; mm9), the UCSC Genome Browser (http://genome.ucsc.edu/) (Kuhn et al. 2009), and the Tandem Repeats Finder program (Benson 1999). These markers (D4Wehi1–D4Wehi22) were shown to be polymorphic between the NOD and B6 mouse strains by PCR and gel electrophoresis of genomic DNA using sequence-specific oligonucleotides (Supplemental Table 1). Sequence within the defined Idd11 critical interval was determined by direct sequencing of overlapping PCR products (Supplemental Table 2) and the BigDye Terminator v3.1 sequence kit (Applied Biosystems). Sequence contigs for each strain were aligned to determine sequence
variation between inbred strains (Megalign, DNAstar, Inc.). Genotyping of sequence variants 1–8 within the Idd11 haplotype (Table 2) was performed on a Roche LightCycler 480 using TaqMan probes (Applied Biosystems) (Supplemental Table 3) and Roche Probe Master Mix (Roche Applied Science) according to manufacturer’s instructions.
Diabetes monitoring
Cohorts of female mice were housed in an SPF facility and tested once a week for elevated urinary glucose (>110 mmol/L) using Diastix reagent strips (Bayer Australia, Ltd.) over a 300-d time course. Three consecutive elevated readings indicated the onset of diabetes. Pairwise comparisons of the diabetes incidence between mouse strains were done using the log-rank test.
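The onset rule described above (weekly urinary glucose readings, with three consecutive readings above 110 mmol/L indicating diabetes) can be expressed as a short scan over one animal's readings. This is only an illustrative sketch: the function name and data layout are hypothetical, and reporting onset at the third elevated reading (rather than the first of the three) is an assumption, since the text does not say which week was recorded.

```python
from typing import Optional, Sequence

GLUCOSE_THRESHOLD_MMOL_L = 110.0   # elevated if strictly above this value
CONSECUTIVE_REQUIRED = 3           # three consecutive elevated weekly readings


def diabetes_onset_week(weekly_glucose: Sequence[float]) -> Optional[int]:
    """Return the 0-based index of the week that completes the first run of
    three consecutive elevated readings, or None if no such run occurs.

    Assumption: onset is reported at the third reading of the run.
    """
    run_length = 0
    for week, value in enumerate(weekly_glucose):
        if value > GLUCOSE_THRESHOLD_MMOL_L:
            run_length += 1
        else:
            run_length = 0  # a normal reading breaks the run
        if run_length == CONSECUTIVE_REQUIRED:
            return week
    return None


# Two elevated readings, a normal one, then three in a row.
example = [0, 0, 120, 130, 0, 150, 160, 170]
onset = diabetes_onset_week(example)
```

The per-animal onset times produced this way (with animals that never reach onset treated as censored at the end of the 300-d course) are the kind of input a log-rank comparison would take.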
Bioinformatic analysis of AK005651
Genomic sequence and transcript sequences were aligned to the mouse genome (NCBI Build 37 assembly; mm9) using BLAT via the UCSC Genome Browser (http://genome.ucsc.edu/) (Kuhn et al. 2009). Sequence was compared with annotated ESTs, multispecies alignments, conserved elements, and the gene prediction algorithms available through the UCSC Genome Browser. Transcript sequences were also aligned to the NCBI EST database using BLASTN and to the NR protein database using BLASTX (http://blast.ncbi.nlm.nih.gov/Blast.cgi; Altschul et al. 1990). All potential transcripts were translated in six frames and searched for known domains represented by profile hidden Markov models from the PFAM database using hmmpfam (http://hmmer.janelia.org) (Finn et al. 2008). The RFAM database was used to evaluate the presence of a noncoding gene (http://rfam.janelia.org/; Gardner et al. 2009), and BLASTN was used to align transcript sequence to mature miRNAs and stem-loop sequences in miRBase (Griffiths-Jones et al. 2006). To determine if variants are within predicted TFBSs, 71 nucleotides (nt) of sequence centered on the individual variants were extracted from the mouse genome, and TFBSs were predicted in these sequences using only high-quality TRANSFAC matrices (Matys et al. 2006) representing vertebrate transcription factors and match score thresholds selected to give the minimum false-positive rate (Kel et al. 2003).
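The extraction of 71 nt centered on each variant amounts to taking 35 flanking bases on either side of the variant position. A minimal sketch follows, assuming the chromosome sequence is already available as a string and that positions are 1-based as in genome-browser coordinates; the function name and the toy sequence are hypothetical and are not part of the described pipeline.

```python
def centered_window(sequence: str, variant_pos: int, width: int = 71) -> str:
    """Return `width` nucleotides centered on a 1-based variant position.

    With the default width of 71, this yields 35 nt upstream, the variant
    base itself, and 35 nt downstream.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so the variant sits at the center")
    flank = width // 2
    start = variant_pos - 1 - flank   # convert to a 0-based slice start
    end = variant_pos + flank         # slice end is exclusive
    if start < 0 or end > len(sequence):
        raise ValueError("window extends beyond the available sequence")
    return sequence[start:end]


# Toy sequence with a single G at 1-based position 51.
toy_sequence = "A" * 50 + "G" + "C" * 50
window = centered_window(toy_sequence, 51)
```

For an insertion/deletion such as variant 1, a single center position is not well defined, so how the window was anchored in that case is not specified here.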
Quantitative real-time PCR
Tissues were taken from female mice (~50 d old) and RNA isolated using TRIzol reagent (Invitrogen). cDNA was synthesized using Superscript III reverse transcriptase (Invitrogen). Quantitative real-time PCR was performed on a Roche LightCycler 480 using LightCycler Probe Master Reagent (Roche Diagnostics) or TaqMan Gene Expression Master Mix (Applied Biosystems) according to the manufacturer’s instructions. Technical replicates were done in triplicate to calculate the average Ct value for each biological sample. Oligonucleotide primers and fluorescent probes were synthesized by Sigma Genosys, Roche Applied Sciences, or Applied Biosystems (Supplemental Table 9). The fluorescent probes used to detect the spliced products between AK005651 exon 2 and exon 3, as well as exon 2 and exon 4, are located across the spliced exon boundaries, whereas the fluorescent probe used to detect the spliced product between exon 4 and exon 5 is located entirely within exon 5 (Fig. 3) or across the spliced exon boundary (Fig. 4). The fluorescent probe used to detect the spliced product for Spocd1 was located in exon 9. The fluorescent probe used to detect the spliced product for Bai2 was located in exon 29. Thermal cycling consisted of a denaturation step (10 min at 95°C) and 45 amplification cycles (10 sec at 95°C, 15–30 sec at 60°C–66°C, 30 sec at
40°C). Products observed for each primer pair were confirmed by sequencing. Standard curves were generated for all primer sets to ensure exponential increase of targeted transcripts during amplification (efficiency $= 10^{-1/\mathrm{slope}} \approx 2$). $\Delta C_t$ for each tissue was calculated as $C_{t,\mathrm{AK005651}} - \mathrm{Ref}$, where Ref is the average Ct value of the reference genes Hmbs and Hprt1. Relative normalized log2 expression values were calculated for graphing purposes as $41 - \mathrm{mean(Ref)} - \Delta C_t$, where mean(Ref) is the grand mean of the reference genes Ct for each tissue. Here, 40 represents the practical Ct detection limit for real-time PCR, and 41 establishes the 0 point (i.e., no detectable gene expression) for the y-axis. Subtracting mean(Ref) converts the scores to log2-expression relative to the detection threshold scale from 0–41. Statistical significance for the difference in expression was obtained using pairwise t-tests with pooled standard deviations for each tissue. P-values were adjusted for multiple testing using Holm’s method (Holm 1979).
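The normalization described above can be written out as a few small functions: amplification efficiency from the standard-curve slope, ΔCt against the mean of the two reference genes, the relative normalized log2 expression with 41 as the zero point, and Holm's step-down adjustment of P-values. This is a sketch under stated assumptions rather than the analysis code used in the study; the function names and all example Ct values are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def amplification_efficiency(slope: float) -> float:
    """Efficiency = 10^(-1/slope); a slope near -3.32 gives roughly 2."""
    return 10 ** (-1.0 / slope)


def delta_ct(ct_target: float, ct_refs: Sequence[float]) -> float:
    """Ct of the target minus the average Ct of the reference genes."""
    return ct_target - mean(ct_refs)


def relative_log2_expression(ct_target: float, ct_refs: Sequence[float],
                             tissue_ref_grand_mean: float,
                             zero_point: float = 41.0) -> float:
    """41 - mean(Ref) - delta Ct, as described in the text."""
    return zero_point - tissue_ref_grand_mean - delta_ct(ct_target, ct_refs)


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm (1979) step-down adjustment, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


# Hypothetical sample: target Ct 32, reference gene Cts 24 and 26,
# and a tissue-wide grand mean reference Ct of 25.5.
d_ct = delta_ct(32.0, [24.0, 26.0])
expr = relative_log2_expression(32.0, [24.0, 26.0], 25.5)
adj = holm_adjust([0.01, 0.04, 0.03])
```

With these made-up inputs, ΔCt is 7 and the relative log2 expression is 8.5; a one-unit difference on this scale corresponds to a twofold difference in expression when the efficiency is close to 2.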
Acknowledgments
We thank S. Foote, J. Stankovich, Y. Hu, and S. Mannering for useful discussions; V. Marshall, M. Martyn, and G. Brammar for technical assistance; E.H. Leiter for providing genomic DNA for various inbred mouse strains; and the mouse care facility staff at the Walter & Eliza Hall Institute and Department of Medicine at The University of Melbourne. This work was supported by the Juvenile Diabetes Research Foundation (1-2005-925), the Cooperative Research Centre for Discovery of Genes for Common Human Diseases, the Australian NHMRC (575552), and the NIH/NIDDK (1R01 DK062882-01A1). I.K.L.T. is supported by a Melbourne International Research Scholarship. N.W. and M.P.A. are supported by Australian Postgraduate Awards. M.P.A. is also supported by a St. Vincent’s Institute Foundation Scholarship. C.E. is supported by a Peter Doherty Fellowship. M.B. is supported by an NHMRC Career Development Award. G.S. is supported by an NHMRC Senior Research Fellowship. G.M. is supported by NHMRC Program Grant 516700 and by the Diabetes Research Foundation of Western Australia. T.S. is supported by an Australia Fellowship.
References
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol 215: 403–410.
Araki M, Chung D, Liu S, Rainbow DB, Chamberlain G, Garner V, Hunter KM, Vijayakrishnan L, Peterson LB, Oukka M, et al. 2009. Genetic evidence that the differential expression of the ligand-independent isoform of CTLA-4 is the molecular basis of the Idd5.1 type 1 diabetes region in nonobese diabetic mice. J Immunol 183: 5146–5157.
Atkinson MA, Eisenbarth GS. 2001. Type 1 diabetes: New perspectives on disease pathogenesis and treatment. Lancet 358: 221–229.
Atkinson MA, Leiter EH. 1999. The NOD mouse model of type 1 diabetes: As good as it gets? Nat Med 5: 601–604.
Barrett JC, Clayton DG, Concannon P, Akolkar B, Cooper JD, Erlich HA, Julier C, Morahan G, Nerup J, Nierras C, et al. 2009. Genome-wide association study and meta-analysis find that over 40 loci affect risk of type 1 diabetes. Nat Genet 41: 703–707.
Beck JA, Lloyd S, Hafezparast M, Lennon-Pierce M, Eppig JT, Festing MF, Fisher EM. 2000. Genealogies of mouse inbred strains. Nat Genet 24: 23–25.
Benson G. 1999. Tandem repeats finder: A program to analyze DNA sequences. Nucleic Acids Res 27: 573–580.
Black DL. 2003. Mechanisms of alternative pre-messenger RNA splicing. Annu Rev Biochem 72: 291–336.
Bois PR. 2007. A highly polymorphic meiotic recombination mouse hot spot exhibits incomplete repair. Mol Cell Biol 27: 7053–7062.
Brodnicki TC, McClive P, Couper S, Morahan G. 2000. Localization of Idd11 using NOD congenic mouse strains: Elimination of Slc9a1 as a candidate gene. Immunogenetics 51: 37–41.
Brodnicki TC, Fletcher AL, Pellicci DG, Berzins SP, McClive P, Quirk F, Webster KE, Scott HS, Boyd RL, Godfrey DI, et al. 2005. Localization of Idd11 is not associated with thymus and NKT cell abnormalities in NOD mice. Diabetes 54: 3453–3457.
Downloaded from genome.cshlp.org on December 25, 2010 - Published by Cold Spring Harbor Laboratory Press
Chasin LA. 2007. Searching for splicing motifs. Adv Exp Med Biol 623: 85–106.
Chen JM, Cooper DN, Chuzhanova N, Ferec C, Patrinos GP. 2007. Gene conversion: Mechanisms, evolution and human disease. Nat Rev Genet 8: 762–775.
Concannon P, Chen WM, Julier C, Morahan G, Akolkar B, Erlich HA, Hilner JE, Nerup J, Nierras C, Pociot F, et al. 2009a. Genome-wide scan for linkage to type 1 diabetes in 2,496 multiplex families from the Type 1 Diabetes Genetics Consortium. Diabetes 58: 1018–1022.
Concannon P, Rich SS, Nepom GT. 2009b. Genetics of type 1A diabetes. N Engl J Med 360: 1646–1654.
Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, Aerts J, Andrews TD, Barnes C, Campbell P, et al. 2009. Origins and functional impact of copy number variation in the human genome. Nature 464: 704–712.
Fernando MM, Stevens CR, Walsh EC, De Jager PL, Goyette P, Plenge RM, Vyse TJ, Rioux JD. 2008. Defining the role of the MHC in autoimmunity: A review and pooled analysis. PLoS Genet 4: e1000024. doi: 10.1371/journal.pgen.1000024.
Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, et al. 2008. The Pfam protein families database. Nucleic Acids Res 36: D281–D288.
Gardner PP, Daub J, Tate JG, Nawrocki EP, Kolbe DL, Lindgreen S, Wilkinson AC, Finn RD, Griffiths-Jones S, Eddy SR, et al. 2009. Rfam: Updates to the RNA families database. Nucleic Acids Res 37: D136–D140.
Goldstein DB. 2009. Common genetic variation and human traits. N Engl J Med 360: 1696–1698.
Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ. 2006. miRBase: MicroRNA sequences, targets and gene nomenclature. Nucleic Acids Res 34: D140–D144.
Guillon H, de Massy B. 2002. An initiation site for meiotic crossing-over and gene conversion in the mouse. Nat Genet 32: 296–299.
Guttman M, Amit I, Garber M, French C, Lin MF, Feldser D, Huarte M, Zuk O, Carey BW, Cassady JP, et al. 2009. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 458: 223–227.
Hakonarson H, Grant SF, Bradfield JP, Marchand L, Kim CE, Glessner JT, Grabs R, Casalunovo T, Taback SP, Frackelton EC, et al. 2007. A genome-wide association study identifies KIAA0350 as a type 1 diabetes gene. Nature 448: 591–594.
Hamilton-Williams EE, Serreze DV, Charlton B, Johnson EA, Marron MP, Mullbacher A, Slattery RM. 2001. Transgenic rescue implicates β2-microglobulin as a diabetes susceptibility gene in nonobese diabetic (NOD) mice. Proc Natl Acad Sci 98: 11533–11538.
Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA. 2009. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci 106: 9362–9367.
Hoehe MR. 2003. Haplotypes and the systematic analysis of genetic variation in genes and genomes. Pharmacogenomics 4: 547–570.
Holm S. 1979. A simple sequentially rejective multiple test procedure. Scand J Stat 6: 65–70.
Hung MS, Avner P, Rogner UC. 2006. Identification of the transcription factor Arntl2 as a candidate gene for the type 1 diabetes locus Idd6. Hum Mol Genet 15: 2732–2742.
The International HapMap Consortium. 2007. A second generation human haplotype map of over 3.1 million SNPs. Nature 449: 851–861.
Jeffreys AJ, May CA. 2004. Intense and highly localized gene conversion activity in human meiotic crossover hot spots. Nat Genet 36: 151–156.
Karvonen M, Sekikawa A, LaPorte R, Tuomilehto J, Tuomilehto-Wolf E. 2001. Type 1 diabetes: Global epidemiology. In The epidemiology of diabetes mellitus (ed. J.M. Ekoe et al.), pp. 71–102. John Wiley & Sons, West Sussex.
Kauppi L, Jeffreys AJ, Keeney S. 2004. Where the crossovers are: Recombination distributions in mammals. Nat Rev Genet 5: 413–424.
Kauppi L, Jasin M, Keeney S. 2007. Meiotic crossover hotspots contained in haplotype block boundaries of the mouse genome. Proc Natl Acad Sci 104: 13396–13401.
Kel AE, Gossling E, Reuter I, Cheremushkin E, Kel-Margoulis OV, Wingender E. 2003. MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res 31: 3576–3579.
Kissler S, Stern P, Takahashi K, Hunter K, Peterson LB, Wicker L. 2006. In vivo RNA interference demonstrates a role for Nramp1 in modifying susceptibility to type 1 diabetes. Nat Genet 38: 479–483.
Kuhn RM, Karolchik D, Zweig AS, Wang T, Smith KE, Rosenbloom KR, Rhead B, Raney BJ, Pohl A, Pheasant M, et al. 2009. The UCSC Genome Browser Database: Update 2009. Nucleic Acids Res 37: D755–D761.
Laloraya M, Davoodi-Semiromi A, Kumar GP, McDuffie M, She JX. 2006. Impaired Crkl expression contributes to the defective DNA binding of Stat5b in nonobese diabetic mice. Diabetes 55: 734–741.
Leiter EH, Reifsnyder PC, Wallace R, Li R, King B, Churchill GC. 2009. NOD × 129.H2g7 backcross delineates 129S1/SvImJ-derived genomic regions modulating type 1 diabetes (T1D) development in mice. Diabetes 58: 1700–1703.
Lupski JR, Stankiewicz P. 2005. Genomic disorders: Molecular mechanisms for rearrangements and conveyed phenotypes. PLoS Genet 1: e49. doi: 10.1371/journal.pgen.0010049.
Lyons PA, Hancock WW, Denny P, Lord CJ, Hill NJ, Armitage N, Siegmund T, Todd JA, Phillips MS, Hess JF, et al. 2000. The NOD Idd9 genetic interval influences the pathogenicity of insulitis and contains molecular variants of Cd30, Tnfr2, and Cd137. Immunity 13: 107–115.
Makino S, Kunimoto K, Muraoka Y, Mizushima Y, Katagiri K, Tochino Y. 1980. Breeding of a non-obese, diabetic strain of mice. Jikken Dobutsu 29: 1–13.
Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, et al. 2006. TRANSFAC and its module TRANSCompel: Transcriptional gene regulation in eukaryotes. Nucleic Acids Res 34: D108–D110.
McAleer MA, Reifsnyder P, Palmer SM, Prochazka M, Love JM, Copeman JB, Powell EE, Rodrigues NR, Prins JB, Serreze DV, et al. 1995. Crosses of NOD mice with the related NON strain. A polygenic model for IDDM. Diabetes 44: 1186–1195.
McDuffie M. 2000. Derivation of diabetes-resistant congenic lines from the nonobese diabetic mouse. Clin Immunol 96: 119–130.
McGuire HM, Vogelzang A, Hill N, Flodstrom-Tullberg M, Sprent J, King C. 2009. Loss of parity between IL-2 and IL-21 in the NOD Idd3 locus. Proc Natl Acad Sci 106: 19438–19443.
Mercer TR, Dinger ME, Mattick JS. 2009. Long non-coding RNAs: Insights into functions. Nat Rev Genet 10: 155–159.
Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, Alvarez P, Brockman W, Kim TK, Koche RP, et al. 2007. Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature 448: 553–560.
Morahan G, McClive P, Huang D, Little P, Baxter A. 1994. Genetic and physiological association of diabetes susceptibility with raised Na+/H+ exchange activity. Proc Natl Acad Sci 91: 5898–5902.
Ng SK, Lo WS, Pun FW, Zhao C, Yu Z, Chen J, Tong KL, Xu Z, Tsang SY, Yang Q, et al. 2010. A recombination hotspot in a schizophrenia-associated region of GABRB2. PLoS ONE 5: e9547. doi: 10.1371/journal.pone.0009547.
Nichols J, Jones K, Phillips JM, Newland SA, Roode M, Mansfield W, Smith A, Cooke A. 2009. Validated germline-competent embryonic stem cell lines from nonobese diabetic mice. Nat Med 15: 814–818.
Paigen K, Szatkiewicz JP, Sawyer K, Leahy N, Parvanov ED, Ng SH, Graber JH, Broman KW, Petkov PM. 2008. The recombinational anatomy of a mouse chromosome. PLoS Genet 4: e1000119. doi: 10.1371/journal.pgen.1000119.
Petes TD. 2001. Meiotic recombination hot spots and cold spots. Nat Rev Genet 2: 360–369.
Ponting CP, Oliver PL, Reik W. 2009. Evolution and functions of long noncoding RNAs. Cell 136: 629–641.
Prochazka M, Serreze DV, Frankel WN, Leiter EH. 1992. NOR/Lt mice: MHC-matched diabetes-resistant control strain for NOD mice. Diabetes 41: 98–106.
Razavi R, Chan Y, Afifiyan FN, Liu XJ, Wan X, Yantha J, Tsui H, Tang L, Tsai S, Santamaria P, et al. 2006. TRPV1+ sensory neurons control β cell stress and islet inflammation in autoimmune disease. Cell 127: 1123–1135.
Reifsnyder PC, Li R, Silveira PA, Churchill G, Serreze DV, Leiter EH. 2005. Conditioning the genome identifies additional diabetes resistance loci in Type I diabetes resistant NOR/Lt mice. Genes Immun 6: 528–538.
Ridgway WM, Peterson LB, Todd JA, Rainbow DB, Healy B, Burren OS, Wicker LS. 2008. Gene-gene interactions in the NOD mouse model of type 1 diabetes. Adv Immunol 100: 151–175.
Rioux JD, Daly MJ, Silverberg MS, Lindblad K, Steinhart H, Cohen Z, Delmonte T, Kocher K, Miller K, Guschwan S, et al. 2001. Genetic variation in the 5q31 cytokine gene cluster confers susceptibility to Crohn disease. Nat Genet 29: 223–228.
Rodrigues NR, Cornall RJ, Chandler P, Simpson E, Wicker LS, Peterson LB, Todd JA. 1994. Mapping of an insulin-dependent diabetes locus, Idd9, in NOD mice to chromosome 4. Mamm Genome 5: 167–170.
Rogner UC, Avner P. 2003. Congenic mice: Cutting tools for complex immune disorders. Nat Rev Immunol 3: 243–252.
Serreze DV, Leiter EH. 2001. Genes and cellular requirements for autoimmune diabetes susceptibility in nonobese diabetic mice. Curr Dir Autoimmun 4: 31–67.
Shiroishi T, Koide T, Yoshino M, Sagai T, Moriwaki K. 1995. Hotspots of homologous recombination in mouse meiosis. Adv Biophys 31: 119–132.
Todd JA, Walker NM, Cooper JD, Smyth DJ, Downes K, Plagnol V, Bailey R, Nejentsev S, Field SF, Payne F, et al. 2007. Robust associations of four new chromosome regions from genome-wide analyses of type 1 diabetes. Nat Genet 39: 857–864.
Turner DJ, Miretti M, Rajan D, Fiegler H, Carter NP, Blayney ML, Beck S, Hurles ME. 2008. Germline rates of de novo meiotic deletions and duplications causing several genomic disorders. Nat Genet 40: 90–95.
Wandstrat AE, Nguyen C, Limaye N, Chan AY, Subramanian S, Tian XH, Yim YS, Pertsemlidis A, Garner HR Jr, Morel L, et al. 2004. Association of extensive polymorphisms in the SLAM/CD2 gene cluster with murine lupus. Immunity 21: 769–780.
Wang Z, Burge CB. 2008. Splicing regulation: From a parts list of regulatory elements to an integrated splicing code. RNA 14: 802–813.
The Wellcome Trust Case Control Consortium. 2007. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661–678.
The Wellcome Trust Case Control Consortium. 2010. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature 464: 713–720.
Yamanouchi J, Rainbow D, Serra P, Howlett S, Hunter K, Garner VE, Gonzalez-Munoz A, Clark J, Veijola R, Cubbon R, et al. 2007. Interleukin-2 gene variation impairs regulatory T cell function and causes autoimmunity. Nat Genet 39: 329–337.
Yauk CL, Bois PR, Jeffreys AJ. 2003. High-resolution sperm typing of meiotic recombination in the mouse MHC Eb gene. EMBO J 22: 1389–1397.
Received October 15, 2009; accepted in revised form August 24, 2010.
Research
Assessing the effect of the CLPG mutation on the microRNA catalog of skeletal muscle using high-throughput sequencing
Florian Caiment,1 Carole Charlier,1 Tracy Hadfield,2 Noelle Cockett,2 Michel Georges,1,3 and Denis Baurain1
1Unit of Animal Genomics, Department of Animal Production, GIGA-R, and Faculty of Veterinary Medicine, University of Liège (B34), 4000-Liège, Belgium; 2Department of Animal, Dairy and Veterinary Sciences, Utah State University, Logan, Utah 84322, USA

The callipyge phenotype is a monogenic muscular hypertrophy that is only expressed in heterozygous sheep receiving the CLPG mutation from their sire. The wild-type phenotype of CLPG/CLPG animals is thought to result from translational inhibition of paternally expressed DLK1 transcripts by maternally expressed miRNAs. To identify the miRNA responsible for this trans effect, we used high-throughput sequencing to exhaustively catalog miRNAs expressed in skeletal muscle of sheep of the four CLPG genotypes. We have identified 747 miRNA species of which 110 map to the DLK1–GTL2 or callipyge domain. We demonstrate that the latter are imprinted and preferentially expressed from the maternal allele. We show that the CLPG mutation affects their level of expression in cis (~3.2-fold increase) as well as in trans (~1.8-fold increase). In CLPG/CLPG animals, miRNAs from the DLK1–GTL2 domain account for ~20% of miRNAs in skeletal muscle. We show that the CLPG genotype affects the levels of A-to-I editing of at least five pri-miRNAs of the DLK1–GTL2 domain, but that levels of editing of mature miRNAs are always minor. We present suggestive evidence that the miRNAs from the domain target the ORF of DLK1, thereby causing the trans inhibition underlying polar overdominance. We highlight the limitations of high-throughput sequencing for digital gene expression profiling as a result of biased and inconsistent amplification of specific miRNAs.

[Supplemental material is available online at http://www.genome.org. The sequence and miRNA expression data from this study have been submitted to NCBI’s GenBank (http://www.ncbi.nlm.nih.gov/genbank/) and Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/) under accession nos. AF354168 and GSE24146, respectively. All new ovine miRNAs corresponding to the DLK1–GTL2 locus also have been submitted to miRBase (http://www.mirbase.org).]
3Corresponding author. E-mail [email protected]; fax 32-4-366-41-98.
Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.108787.110.

The callipyge phenotype is an inherited muscular hypertrophy of sheep. It is characterized by an unusual inheritance pattern referred to as polar overdominance: Only heterozygous individuals having received the CLPG mutation from their father express the phenotype (Cockett et al. 1996). The CLPG point mutation inactivates a muscle-specific silencer controlling the expression of a subset of imprinted genes in the DLK1–GTL2 domain (i.e., the paternally expressed protein-encoding DLK1 and PEG11 [also known as RTL1] genes and the maternally expressed non-coding GTL2 [also known as MEG3], anti-PEG11 [also known as anti-RTL1], MEG8 [also known as RIAN], and MIRG genes) (Charlier et al. 2001a; Freking et al. 2002; Smit et al. 2003). Hence, padumnal heterozygotes (+Mat/CLPGPat) are characterized by ectopic expression of PEG11 (Byrne et al. 2010) and DLK1 (Davis et al. 2004) in skeletal muscle. DLK1 is thought to contribute to the callipyge phenotype as its ectopic expression increases muscle mass in transgenic mice (Davis et al. 2004). Whether ectopic expression of PEG11 is also involved in phenotypic expression remains to be established.

Although CLPG/CLPG animals, like their +Mat/CLPGPat counterparts, show increased levels of DLK1 mRNA in muscle, no DLK1 protein is observed in their muscle, which accounts for their wild-type phenotype (Davis et al. 2004). The absence of DLK1 protein despite increased levels of DLK1 mRNA in these animals is thought to result from the ectopic expression of the madumnal noncoding RNAs, as this feature distinguishes CLPG/CLPG from +Mat/CLPGPat individuals (Georges et al. 2003, 2004). The madumnal long noncoding RNA genes host a large array of small C/D snoRNAs and miRNAs of unknown function (Seitz et al. 2004). We have postulated that these small RNAs are the mediators of the trans effect down-regulating DLK1 in CLPG/CLPG animals, thus causing polar overdominance (Supplemental Fig. 1; Georges et al. 2003, 2004). This hypothesis received strong support from the demonstration, in the same DLK1–GTL2 locus, of RNAi-mediated trans inhibition of the paternally expressed PEG11 by miRNAs processed from the maternally expressed anti-PEG11 transcript (Seitz et al. 2003; Davis et al. 2005).

To identify small RNAs that might be involved in the trans inhibition of DLK1 in skeletal muscle of CLPG/CLPG animals, we have performed high-throughput sequencing (HTS) of small RNA libraries generated from skeletal muscle of sheep of the four possible CLPG genotypes. To qualify as mediators of the trans effect underlying polar overdominance (Georges et al. 2003, 2004), the corresponding small RNAs should (1) map to the DLK1–GTL2 domain; (2) be imprinted with expression from the maternal allele; (3) be subject to the cis effect of the CLPG mutation (i.e., be ectopically expressed in skeletal muscle upon maternal transmission of the mutation); and (4) have the ability to guide the RISC complex to DLK1 transcripts for inhibition.
20:1651–1662 © 2010 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/10; www.genome.org
Results
A catalog of miRNAs expressed in skeletal muscle of sheep
The callipyge phenotype is most pronounced in muscles of the hindquarters and manifests itself at ~1 mo of age. At 8 wk of age, DLK1 protein is detected in skeletal muscle of +Mat/CLPGPat but not of CLPG/CLPG animals, suggesting that the trans effect operates at that stage (Davis et al. 2004). Thus, we elected to extract RNA from longissimus dorsi (LD) of two 8-wk-old animals per CLPG genotype. Small RNA (~18 to 30 bp) libraries were generated and sequenced on a Genome Analyzer I (Illumina). We obtained an average of 6,324,668 reads per animal (range: 5,222,920–6,685,342). Filtered, adapter-trimmed sequences (94.7%) were aligned to the bovine genome used as reference, with the exception of 390 kb of ovine sequence corresponding to the DLK1–GTL2 domain (GenBank AF354168). The resulting alignments were used to predict miRNA precursors using miRDeep (Friedlander et al. 2008). This yielded 472 precursors, capturing 98.3% (range: 98.15%–99.32%) of trimmed reads. Sequence comparison with precursors in miRBase, combined with mapping data, indicates that 228 and 87 are the orthologs and paralogs, respectively, of previously reported bovine miRNAs (Glazov et al. 2009), while nine are the orthologs of miRNAs described in non-ruminant mammals. Thus, 148 precursors might correspond to previously unknown miRNAs (Supplemental Table 1). The chromosomal distribution of miRNA precursors is shown in Supplemental Figure 2. Chromosome 21, harboring the CLPG locus, stands out with 61 precursors (chromosomal average: 14.8).

As the genomic sequence of sheep is not yet complete, we used the bovine sequence as reference. The effect of this substitution was estimated by comparing the output of miRDeep using either the bovine or the ovine sequence of the DLK1–GTL2 domain as reference. miRDeep predicts 49 precursors in this domain when using the ovine sequence, of which three are missed when using the bovine reference. Sensitivity is thus decreased by ~6%, but specificity does not seem affected.

We aligned all precursor pairs using BLASTN and used the bitscores (>35 bits) to identify precursor families using the MCL algorithm (Enright et al. 2002). The unique inflation parameter was set to the most aggregative value (2.0) for which the miR-376 family was recovered without contaminants (Seitz et al. 2004). Using this approach, 256 of the 472 precursors (= 54%) clustered in 62 families. The largest family (miR-2284 family) comprised 99 members. The remaining 61 families counted 2.6 members on average (range: 2–6).

For 197 (= 41.7%) precursors, we observed reads mapping to only one arm (5p or 3p) of the pre-miRNA, whereas both types of reads were observed for the remaining 275 (= 58.3%), jointly defining 747 distinct miRNA “species.” The fraction (F) of 5p over total reads distinguishes five types of precursors (Landgraf et al. 2007): 5p-mature/3p-star (1 > F > 0.87), 5p > 3p (0.87 > F > 0.50), 5p = 3p (F = 0.5), 5p < 3p (0.50 > F > 0.13), and 5p-star/3p-mature (0.13 > F > 0). The frequency distribution of F-values is shown in Supplemental Figure 3. The five types represent, respectively, 46.0%, 10.2%, 1.2%, 10.2%, and 32.4% of precursors. Precursors spawning miRNAs preferentially from the 5p arm were 1.4 times more abundant than those with 3p excess.

Aligning the reads with the identified precursors revealed considerable 3′ length variability. As this might reflect trimming artifacts due to decreased sequencing fidelity toward the 3′-end, we will not elaborate further on it. 5′-Ends were in general more consistent, although there was considerable evidence for the occurrence of isomirs (Morin et al. 2008). For 65% of the miRNAs, ≥90% of reads shared the same 5′ extremity; for 27%, ≥90% of reads shared one of two 5′ extremities; and for 6%, ≥90% of reads shared one of three 5′ extremities. In general, more than 91% of alternative 5′ extremities were within 4 bp of the most common one.
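As an illustration of the five precursor classes defined above by the fraction F of 5p reads, the following Python sketch assigns a class from the two read counts. The function name and read counts are hypothetical. The handling of the exact boundary values (F = 0.87, F = 0.13, and the single-arm cases F = 1 and F = 0) is an assumption, since the text only gives open intervals.

```python
def classify_precursor(reads_5p, reads_3p):
    """Assign one of the five precursor types from the fraction F of 5p reads."""
    total = reads_5p + reads_3p
    if total == 0:
        raise ValueError("precursor has no reads on either arm")
    f = reads_5p / total
    # Assumption: boundary values are assigned to the more extreme class.
    if f >= 0.87:
        return "5p-mature/3p-star"
    if f > 0.50:
        return "5p > 3p"
    if f == 0.50:
        return "5p = 3p"
    if f > 0.13:
        return "5p < 3p"
    return "5p-star/3p-mature"


# Hypothetical read counts, for illustration only.
for counts in [(900, 100), (600, 400), (500, 500), (200, 800), (0, 1000)]:
    print(counts, classify_precursor(*counts))
```

Precursors with reads on a single arm fall into one of the two outermost classes under this convention.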
Annotating miRNAs expressed from the DLK1–GTL2 domain
Forty-nine of the precursors identified by miRDeep mapped to the DLK1–GTL2 domain. Of these, 39 corresponded to known miRNAs reported in miRBase, while 10 were unknown. Detailed examination of the SOAP (Li et al. 2008) alignments revealed 5729 reads mapping to 14 regions not recognized by miRDeep as miRNA precursors. Six of these corresponded to miRNAs reported in miRBase and were included in the catalog. In addition, 487 reads mapped to 12 predicted C/D snoRNAs within MEG8 (of note, bona fide C/D snoRNAs are ~80 bp long and would therefore have been excluded from the small RNA libraries). We found no reads for 14 miRNAs reported in human and/or in mice (of which five are conserved in sheep). 5′-Ends showed the level of variability observed in the genome-wide catalog, that is, respectively, 58%, 34%, and 8% of miRNAs with one, two, and three isomirs representing ≥90% of the reads. Remarkably, 49 precursors (89.1% of the expressed precursors) had reads mapping to both 5p and 3p arms, compared with 58.3% genome-wide.

A summary of all miRNA precursors identified in the DLK1–GTL2 domain is given in Figure 1. A total of 110 distinct small RNA species (not distinguishing isomirs) were identified, mapping to 61 miRNA and 12 C/D snoRNA precursors. All detected small RNAs derive from the same strand as GTL2, anti-PEG11, MEG8, and MIRG. Using a 10-way mammalian sequence alignment of MIRG, we generated a plot of sequence conservation within 8-nt windows (Supplemental Fig. 4A). We observed a striking coincidence between the peaks of conservation and the positions of the miRNAs, supporting miRNA generation as the primary function of MIRG. A similar colocalization of conservation peaks and C/D snoRNAs is not observed for MEG8 (Supplemental Fig. 4B).
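One way to compute conservation within fixed-size windows of a multiple alignment is sketched below in Python. The scoring rule used here (fraction of columns in each window that are identical across all sequences and free of gaps) is an assumption chosen for illustration, because the text does not describe how conservation was scored. The function name and the toy alignment are hypothetical.

```python
def window_conservation(alignment, window=8):
    """Fraction of fully conserved columns in each sliding window of an alignment."""
    length = len(alignment[0])
    if any(len(sequence) != length for sequence in alignment):
        raise ValueError("aligned sequences must all have the same length")
    # A column counts as conserved when every sequence carries the same base.
    conserved = [
        1 if len(set(column)) == 1 and column[0] != "-" else 0
        for column in zip(*alignment)
    ]
    return [
        sum(conserved[start:start + window]) / window
        for start in range(length - window + 1)
    ]


# Toy three-sequence alignment, for illustration only.
toy_alignment = ["ACGTACGTAA", "ACGTACGTAC", "ACGTACGTAG"]
print(window_conservation(toy_alignment))  # one score per 8-column window
```

Plotting these scores against window position would give a conservation profile of the kind that can be compared with annotated miRNA positions.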
Figure 1. Comparative map of the small RNA genes in the DLK1–GTL2 domain: snoRNAs (upper panel) and miRNAs (lower panel). Square boxes correspond to small RNAs detected in sheep (red), cow (orange), human (green), and mouse (blue). Gray lines connect orthologs in the four species and indicate their chromosomal position with respect to the four long noncoding RNA genes in the domain: GTL2, anti-PEG11, MEG8, and MIRG. The position of precursors not detected in sheep is indicated by squares nested between vertical gray lines. The numbers above and below snoRNA and miRNA columns, respectively, correspond to numbers of additional paralogs for the snoRNAs (from +1 to +8) and names of additional miRNAs. The red squares are filled when reads for the corresponding small RNA were found in the conducted HTS experiments, empty when not. (Black dots) miRNAs predicted by miRDeep (Friedlander et al. 2008) in sheep. Numbers below the black dots identify the cluster/family to which the corresponding miRNA was assigned using the BLAST/MCL algorithm (Enright et al. 2002). The family number of miR-544 is underlined as the other members map outside of the DLK1–GTL2 domain. (Black dots) snoRNAs predicted by HMMER (Durbin et al. 1998) in sheep. snoRNAs in mouse and human correspond to predictions made by Cavaille et al. (2002). snoRNAs in the cow were predicted by HMMER (Durbin et al. 1998). miRNAs in cow, human, and mouse were extracted from miRBase (Griffiths-Jones 2006).

Limits of HTS for the quantitative assessment of miRNA expression
Read numbers are assumed to faithfully reflect expression levels, allowing for accurate digital gene expression profiling. However, recent data indicate that the amplification steps during library construction may introduce substantial, protocol-specific biases (Linsen et al. 2009). To evaluate accuracy and precision of our HTS data in measuring miRNA expression, we (1) repeated the HTS experiment for seven of the eight animals (including RNA extraction, library construction, and sequencing on an Illumina GA-II instrument); (2) hybridized skeletal muscle (LD) RNA from the eight animals on Exiqon miRCURY LNA (Version 9.2, updated to miRBase 11.0) arrays (GEO GSE24146); and (3) performed QRT-PCR for eight miRNAs spanning a broad range of expression levels as determined by HTS. While the Exiqon arrays allow interrogation of 569 human miRNAs, we restricted the analysis to 265 for which the LNA probes were perfectly complementary to the orthologous ruminant miRNA. The main conclusions of this experiment can be summarized as follows:

1. Spearman rank correlations (rS) between sequencing replicates were 0.80 on average, thus suggesting adequate reproducibility of HTS (Supplemental Fig. 5). Note that correlations were slightly higher when comparing pairs of animals with the same CLPG genotype within sequencing runs (average rS = 0.83) (data not shown). Examination of specific miRNAs, however, highlighted limitations of digital expression profiling by HTS. For example, while miR-127 accounted on average for 4% of reads originating from the DLK1–GTL2 domain in the first experiment, its contribution increased to 25% on average in the second, pointing toward systematic discrepancies between the two experiments for some miRNAs. Moreover, while miR-1 represented ≥83% of reads (average 86%) in the first series of eight libraries, and ≥80% of reads (average 82%) in 5/7 libraries of the second series, it only reached 32% and 49% in the two remaining ones, thus showing substantial discrepancies even within an experiment. Finally, within sequencing experiments, the 5p/3p ratio differed significantly between individuals for nearly all miRNA precursors (chi-squared test). In extreme cases, different individuals would appear to have inverted 5p/3p ratios despite sequence depths of hundreds or even thousands of reads. In no case were these opposite 5p/3p ratios confirmed in the second experiment. A representative example (miR-382) is shown in Supplemental Figure 6. Thus, while the repeatability of HTS may seem satisfactory in general, our findings suggest that amplification efficiency of specific miRNAs may vary considerably between experiments.

2. rS values between expression levels (ranks) assessed using the Exiqon miRCURY LNA arrays averaged 0.86 between individuals of the same CLPG genotype. This value has to be compared with a value of 0.90 when restricting the HTS data to the 265 miRNAs interrogated with the Exiqon array. Thus, in these experiments, HTS and array hybridization were characterized by comparable reproducibility. Yet, when comparing ranks obtained with the two methods, rS values dropped to 0.63 (first sequencing experiment) and 0.68 (second sequencing experiment) (Supplemental Fig. 7A,B). Supplemental Figure 7, C and D, illustrates the impact of this correlation drop in terms of probability of reversed rank order between alternative methods as a function of observed fold difference in expression level: miRNA pairs showing a fivefold difference in expression level on the Exiqon arrays still have a probability of ~0.15 to be ranked inversely by HTS. Of note, the RNAs hybridized on the arrays were not size-selected (thus potentially including pri-, pre-, and mature miRNA molecules), while the RNAs used to construct the libraries for HTS were size-selected to include only mature miRNAs.

3. Given the observed discrepancies between HTS and array hybridization, we performed QRT-PCR for eight miRNAs (let-7d, miR-1, miR-206, miR-127, miR-382-5p, miR-382-3p, miR-3958, miR-3959) spanning a range of expression levels using the looped RT primer approach (targeting mature miRNAs) (Chen et al. 2005). QRT-PCR experiments were conducted in duplicate on the RNA samples used for HTS. Abundance of miRNA “x” relative to let-7d was estimated as $e_x^{-Ct_x} / e_{let7d}^{-Ct_{let7d}}$, where e’s are the experimentally determined amplification efficiencies and Ct’s the threshold-exceeding cycle numbers. From these analyses it appeared that the QRT-PCR results were more consistent with the array hybridization than with HTS in terms of expression ranks and estimated fold differences in expression levels (Supplemental Fig. 8A,B). Our data strongly suggest that some miRNAs undergo preferential amplification during the HTS procedure.
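The two quantities used in the comparisons above, the efficiency-corrected abundance relative to let-7d and the Spearman rank correlation between platforms, can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, the numbers are made up rather than taken from the study, and the rank correlation is written out with the standard library instead of reproducing whichever statistical package the authors used.

```python
from math import sqrt
from statistics import mean


def relative_abundance(eff_x, ct_x, eff_ref, ct_ref):
    """Efficiency-corrected abundance of miRNA x relative to a reference miRNA."""
    return eff_x ** (-ct_x) / eff_ref ** (-ct_ref)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    covariance = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    spread_x = sqrt(sum((a - mx) ** 2 for a in rx))
    spread_y = sqrt(sum((b - my) ** 2 for b in ry))
    return covariance / (spread_x * spread_y)


# Made-up values: equal efficiencies of 2 and a 3-cycle Ct difference.
fold = relative_abundance(2.0, 20.0, 2.0, 23.0)  # 2**-20 / 2**-23 = 8.0

# Made-up read counts and QRT-PCR abundances for four miRNAs.
hts_reads = [100, 2500, 40, 900]
qpcr_abundance = [0.5, 8.0, 0.1, 1.0]
rho = spearman(hts_reads, qpcr_abundance)  # identical ordering, so rho is ~1
```

Because only ranks enter the correlation, two platforms can agree closely on rS while still disagreeing on the fold difference for an individual miRNA.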
Effect of CLPG genotype on relative expression levels of miRNAs in the DLK1–GTL2 domain
The previous findings call for caution when interpreting variations in expression levels of individual miRNAs. To overcome this limitation, we examined the effect of CLPG genotype on the expression level of the miRNAs from the DLK1–GTL2 domain considered as a group. We first confirmed the previously described cis effect of the CLPG mutation on neighboring genes in the sequenced RNA samples. QRT-PCR experiments were conducted using primer sets specific for mature DLK1 and GTL2 transcripts, and for two internal controls (RPLP0, RPS18) selected with geNorm out of five housekeeping genes (Vandesompele et al. 2002). The expected CLPG effects were clearly observed (Supplemental Fig. 9). Expression levels of DLK1 were increased ~12-fold and approximately fourfold in, respectively, +Mat/CLPGPat and CLPG/CLPG animals when compared to +/+, while being slightly decreased (~0.6) in CLPGMat/+Pat. Expression levels of GTL2 were increased ~30-fold, ~14-fold, and approximately fivefold in CLPG/CLPG, CLPGMat/+Pat, and +Mat/CLPGPat, respectively, when compared to +/+ animals. These results were indistinguishable from those previously reported using samples originating from other animals (Charlier et al. 2001a; Davis et al. 2004, 2005). The previously observed approximately fivefold increase of GTL2 expression in +Mat/CLPGPat when compared to +/+ animals, and approximately twofold increase of GTL2 expression in CLPG/CLPG when compared to CLPGMat/+Pat animals, remain particularly intriguing and point toward a trans effect of the padumnal CLPG mutation on the expression level of the madumnal noncoding RNA genes (Charlier et al. 2001a). We then analyzed the HTS data. Read numbers corresponding to a given miRNA species (i.e., mapping either to the 5p or 3p arm of a precursor) were first adjusted to account for the different numbers of total "mappable" reads per individual.
The relative expression level for a given animal was expressed as log2(i/m), where i corresponds to the adjusted number of reads for that individual and m is the experiment-specific average number of adjusted reads for that miRNA across the seven individuals that were sequenced twice. Average log2(i/m) across miRNAs differed considerably between individuals, including for miRNAs outside of the CLPG locus. This was thought to reflect experimental issues rather than genuine biological differences (Supplemental Fig. 10). Therefore, log2(i/m) values were corrected for the average log2(i/m) value across miRNAs mapping outside of the DLK1–GTL2 domain (for that individual). We then tested the effect of CLPG genotype on the corrected relative expression levels by ANOVA, using both sequencing experiments jointly. Figure 2A shows the corresponding log(1/p) values. The effect of CLPG genotype on the relative expression level of miRNAs from the DLK1–GTL2 domain is clearly visible from the localized cluster of significant log(1/p) values. Six (miR-379, miR-411a, miR-495, miR-154b, miR-655, and miR-299) of the 99 "regular" miRNAs (i.e., excluding small RNAs derived from C/D snoRNAs) exhibited P-values <6 × 10⁻⁵, corresponding to the Bonferroni-corrected 5% threshold, while 47 were characterized by nominal P-values <0.05. Figure 2B shows, for each of the eight studied animals, the average relative expression levels over all "regular" miRNAs of the DLK1–GTL2 domain. The order is identical to that observed for the long noncoding RNA genes including GTL2: CLPG/CLPG > CLPGMat/+Pat > +Mat/CLPGPat > +/+. The magnitude of the effect, however, was smaller: Expression levels were increased ~6.4-fold, ~4.4-fold, and ~2.0-fold in CLPG/CLPG, CLPGMat/+Pat, and +Mat/CLPGPat, respectively, when compared to +/+ animals. The previous figures pertain to "regular" miRNAs from the DLK1–GTL2 domain. As can be seen from Figure 2A, the effect of CLPG genotype on the expression level of small RNAs derived from C/D snoRNAs was not significant, suggesting that MEG8 might escape the cis effect of the CLPG mutation, contradicting previous findings (Charlier et al. 2001a). Examination of the effect of CLPG genotype on C/D snoRNA-derived small RNAs, however, revealed the expected trend in all but one +Mat/CLPGPat individual (Supplemental Fig. 11).
Expression levels of C/D snoRNA-derived species (average number of reads: 70; median: 6) were low when compared to miRNAs (average number of reads: 21,300; median: 249). Low levels, combined with the aberrant behavior of one individual, explain the nonsignificance of the CLPG effect on the expression level of small RNAs derived from C/D snoRNAs, an effect which we nevertheless believe exists. The effect of CLPG genotype on relative miRNA expression levels was also evaluated from the array data (Supplemental Fig. 12A,B). The effect of CLPG genotype was equally clear, manifesting itself as a clustered rise in log(1/p) values. Expression levels were increased ~4.8-fold, ~3.1-fold, and ~2.2-fold in CLPG/CLPG, CLPGMat/+Pat, and +Mat/CLPGPat, respectively, when compared to +/+ animals. Hence, the ranking was as expected (CLPG/CLPG > CLPGMat/+Pat > +Mat/CLPGPat > +/+), yet the magnitude of the effect was slightly lower than the HTS estimates.
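The read-count normalization described in this section (adjustment for library size, log2(i/m), then subtraction of each animal's average over miRNAs outside the domain) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function and variable names are hypothetical, reads are rescaled to the mean library size (the text says only that they were "adjusted"), m is averaged over whichever animals are supplied, and all counts are assumed to be nonzero.

```python
import math


def corrected_log_ratios(reads, total_mappable, in_domain):
    """reads: {animal: {mirna: raw read count}} (same miRNAs for every animal)
    total_mappable: {animal: total number of mappable reads}
    in_domain: set of miRNA names located in the domain of interest
    Returns {animal: {mirna: corrected log2(i/m)}}.
    """
    # Assumption: rescale every library to the mean library size.
    scale = sum(total_mappable.values()) / len(total_mappable)
    adjusted = {
        animal: {m: c * scale / total_mappable[animal] for m, c in counts.items()}
        for animal, counts in reads.items()
    }
    mirnas = list(next(iter(adjusted.values())))
    # m: average adjusted read number per miRNA across animals.
    mean_reads = {
        m: sum(adjusted[a][m] for a in adjusted) / len(adjusted) for m in mirnas
    }
    corrected = {}
    for animal, counts in adjusted.items():
        log_ratio = {m: math.log2(counts[m] / mean_reads[m]) for m in mirnas}
        # Per-animal offset: average log2(i/m) over miRNAs outside the domain.
        outside = [v for m, v in log_ratio.items() if m not in in_domain]
        offset = sum(outside) / len(outside)
        corrected[animal] = {m: v - offset for m, v in log_ratio.items()}
    return corrected


# Toy data: animal B was sequenced twice as deeply as animal A and carries
# four times more of the in-domain miRNA relative to the rest.
reads = {
    "A": {"out1": 100, "out2": 200, "dom1": 50},
    "B": {"out1": 200, "out2": 400, "dom1": 400},
}
totals = {"A": 1000, "B": 2000}
res = corrected_log_ratios(reads, totals, in_domain={"dom1"})
print(round(res["B"]["dom1"] - res["A"]["dom1"], 6))  # 2.0 (a fourfold difference)
```

The per-animal offset removes shifts that affect all miRNAs of an individual alike, so that only differences specific to the domain remain to be tested by ANOVA.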
Imprinting status of miRNAs in the DLK1–GTL2 domain
The obvious interpretation of miRNA expression levels in +Mat/CLPGPat and CLPGMat/+Pat intermediate between +/+ and CLPG/CLPG (Fig. 2B) is that most miRNAs from the DLK1–GTL2 domain are not imprinted, yet affected by the CLPG cis effect. All previous evidence, however, indicates that the maternally expressed noncoding RNA genes, including the embedded C/D snoRNA and miRNA genes, are exclusively expressed from the maternal allele. In the mouse, tested C/D snoRNAs and miRNAs from the DLK1–GTL2 domain were expressed in mice with maternal uniparental disomies of chromosome 12 (mUPD12) but not with paternal UPD12 (Cavaille et al. 2002; Seitz et al. 2003, 2004). The same small RNA genes were expressed in mice inheriting a deletion of the
A microRNA catalog of callipyge skeletal muscle
Figure 2. (A) Log10(1/p) values of the effect of CLPG genotype on the expression level of 851 small RNAs in skeletal muscle of eight 8-wk-old sheep. Expression levels were estimated from the number of Illumina GA reads from two independent HTS experiments. The statistical significance of the CLPG effect was estimated by ANOVA. Gray vertical bars correspond to miRNAs outside of the DLK1–GTL2 domain, red vertical bars to miRNAs from the DLK1–GTL2 domain, and orange vertical bars to small RNAs derived from C/D snoRNA precursors. Horizontal black lines correspond to the nominal (plain line) and Bonferroni-adjusted (dotted line) 5% significance thresholds. Horizontal blue bars mark the different chromosomes (right Y-axis). (UN) Unassigned sequence contigs. (B) Average expression level, relative to the mean expression level of seven individuals sequenced twice (HTS1 and HTS2), of 99 "regular" miRNAs (i.e., excluding small RNAs derived from C/D snoRNAs) from the DLK1–GTL2 domain in skeletal muscle of eight sheep sorted by CLPG genotype (gray: +/+; blue: +Mat/CLPGPat; red: CLPGMat/+Pat; purple: CLPG/CLPG). Error bars correspond to 1.96 × the standard error of the estimate.
IG-DMR imprinting control element when on the paternal allele, but not when on the maternal allele (Lin et al. 2003; Seitz et al. 2004). In human, MEG8 (also known as RIAN), from which the C/D snoRNA genes are processed (Cavaille et al. 2002), was not expressed in patients with pUPD14 (Kagami et al. 2008). We have previously shown that in sheep muscle, anti-PEG11 and MEG8 (hosting miRNAs and C/D snoRNAs, respectively) are exclusively expressed from the maternal allele, irrespective of CLPG genotype (Charlier et al. 2001a,b). To more directly assess the imprinting of the miRNAs from the DLK1–GTL2 domain in sheep and the effect of the CLPG mutation on it, we searched for SNPs in the vicinity of pre-miRNAs for which at least one of the four studied CLPGMat/+Pat or +Mat/CLPGPat animals would be heterozygous. We found nine such SNPs tagging six pre-miRNAs. Seven of these SNPs were within 160 bp from the corresponding pre-miRNA (miR-379, miR-134, miR-485, miR-453, miR-154b), one was in the loop (miR-453), and one was at position 20 of the miRNA* (miR-377). One +Mat/CLPGPat animal was homozygous for all SNPs and hence noninformative, but the other three were heterozygous for most (Table 1). For each SNP the allele associated with the CLPG mutation was determined by sequencing a CLPG/CLPG animal. For miR-377, knowing that the SNP mapped to the miRNA*, we determined imprinting status from HTS data. We exclusively detected reads corresponding to the madumnal allele, both in the CLPGMat/+Pat and +Mat/CLPGPat animals, thus supporting tight imprinting, exclusive madumnal expression, and no effect of CLPG genotype on imprinting. For miRNAs with SNPs lying outside of the mature miRNAs, we amplified the pri-miRNA with primers within 179 bp from the pre-miRNA, directly sequenced the resulting amplicons, and measured the allelic ratio using PeakPicker (Ge et al. 2005). For four of the five miRNAs, the results were identical to miR-377 (Table 1): tight imprinting, near exclusive madumnal expression, and no effect of CLPG genotype. For miR-485, however, we observed relaxation of imprinting in one CLPGMat/+Pat animal and one +Mat/CLPGPat animal, for which ~20% of transcripts were derived from the padumnal allele. There was no evidence for relaxation of imprinting of miR-485 in the other informative CLPGMat/+Pat animal (Table 1). It is noteworthy that miR-134, located 588 bp upstream, and miR-154b, located 4293 bp downstream of miR-485, did not show evidence for relaxation of imprinting in the same individuals.
Effect of CLPG genotype on absolute expression levels of miRNAs in the DLK1–GTL2 domain
The previous analyses are not informative about absolute miRNA expression levels: Are they just present in minute, biologically irrelevant amounts? Or do they make a significant contribution to the pool of miRNAs in muscle? Analysis of the two sequencing experiments indicates that, after exclusion of miR-1 whose read numbers were inflated (see above), the percentage of reads originating from miRNA precursors in the DLK1–GTL2 domain was 4% in +/+ animals, but increased to 10%, 11%, and 21% of the total in
Table 1. Imprinting status of miRNAs in the ovine DLK1–GTL2 domain and effect of the CLPG genotype
miRNA (a): miR-379
CLPG:+ (c); (CLPG/+)1 (d); (CLPG/+)2 (d); +/CLPG (d)
miR-134
miR-485
miR-453
miR-154b
miR-377
80 bp-5′ (b)
3′-32 bp (b)
3′-160 bp (b)
3′-78 bp (b)
3′-94 bp (b)
57 bp-5′ (b)
Loop (b)
69 bp-5′ (b)
miR* (b)
T:C
A:C 100/0 100/0 94/6
A:C 100/0 95/5 97/3
T:C 100/0 84/16 83/17
A:G 98/2 86/14 88/12
T:C 99/1 100/0 95/5
T:C 99/1
T:C 97/3 96/4 98/2
T:C 14/0 7/0 46/0
98/2
(a) Name of the tested miRNA. miR-379 to miR-154b: percent of the madumnal/padumnal alleles found in cDNA; miR-377: number of HTS reads corresponding to the madumnal/padumnal allele.
(b) Position of the interrogated SNP with respect to the corresponding pre-miRNA.
(c) SNP alleles associated with the CLPG and + allele.
(d) CLPG genotype of corresponding animal.
+Mat/CLPGPat, CLPGMat/+Pat, and CLPG/CLPG animals, respectively. The Exiqon arrays only measure the abundance of some of the miRNAs present in a tissue. However, as the proportion of miRNAs mapping to the DLK1–GTL2 domain was virtually identical in the HTS (99/826 = 12%) and hybridization data (34/265 = 12.8%), the ratio of the sum of fluorescence intensities for miRNAs in the domain over the sum of fluorescence intensities over all miRNAs would also provide an estimate of the cellular abundance of miRNAs originating from the DLK1–GTL2 domain. miRNAs from the domain accounted for 3.5%, 9.0%, 15.6%, and 22.3% of the Hy3 fluorescence on the Exiqon arrays in +/+, +Mat/CLPGPat, CLPGMat/+Pat, and CLPG/CLPG animals, respectively. Both approaches thus provided comparable estimates, indicating that the DLK1–GTL2 domain contributes a sizeable fraction of the miRNA population, especially in CLPG/CLPG animals, in which these miRNAs are predicted to mediate the trans inhibition of DLK1. While claims about expression levels of individual miRNAs are hazardous for the reasons mentioned before, the QRT-PCR and array experiments strongly suggest that miRNAs from the DLK1–GTL2 domain are characterized by an at least 30-fold range of expression levels (Supplemental Figs. 8, 13).
Effect of CLPG genotype on relative expression levels of miRNAs outside the DLK1–GTL2 domain
Skeletal muscles that express the callipyge hypertrophy have a profoundly altered physiology. Ectopic expression of DLK1 (Davis et al. 2005) triggers a cascade of secondary events leading to muscular hypertrophy (e.g., Vuocolo et al. 2007). These may involve altered miRNA expression. To detect such secondary miRNA perturbations, we tested the effect of CLPG genotype on relative expression levels of miRNAs outside of the domain. We first considered the HTS and array hybridization data separately. When accounting for multiple testing, no miRNA outside of the DLK1–GTL2 domain appeared to be significantly affected by the CLPG genotype (Fig. 2; Supplemental Fig. 12). In an attempt to increase power, we combined HTS and array data for the 265 miRNAs with information on both platforms. Even then, no miRNA outside of the domain was significant (Supplemental Fig. 14). We conclude that the molecular events connecting ectopic expression of DLK1 and muscular hypertrophy do not involve altered miRNA expression.
Editing of miRNAs from the DLK1–GTL2 domain in skeletal muscle
It was recently observed that a cluster of miRNAs mapping to the DLK1–GTL2 domain (human miR-368, miR-376a1 [5′ and 3′],
miR-376b and miR-376a2 [5′ and 3′]; murine miR-376a [5′ and 3′], miR-376b [5′ and 3′], and miR-376c) undergo extensive A-to-I editing in human and mice, particularly in the central nervous system (Kawahara et al. 2007). The most extensively edited sites correspond to position "+3" or "+4" of the 5p miRNAs and position "+6" of the 3p miRNAs. Analyses performed in knockout (KO) mice suggest that 5p editing is ADAR2 (also known as ADARB1) dependent, while 3p editing is ADAR1 (also known as ADAR) dependent. Editing seemed not to affect processing, as equally high levels were observed in pri-miRNAs and derived mature miRNAs. By changing the seed, editing was predicted to alter the target spectrum. As editing of miRNA seeds from the DLK1–GTL2 domain in sheep may likewise alter affinity for DLK1, we systematically searched for it. We first examined whether the precursors of miR-376a,b,c undergo editing in skeletal muscle of mice. We RT-PCR-amplified the corresponding pri-miRNAs from cDNA of brain, kidney, and skeletal muscle (quadriceps femoris) from an FVB mouse and sequenced the corresponding PCR products. Strong (~80%) and moderate (~60%) editing of the "+44" 3p position (residue "+6" of the mature miRNA) was observed for the three studied miRNAs in, respectively, brain and kidney, hence recapitulating part of the results of Kawahara et al. (2007). Contrary to these investigators, we found no evidence for editing of the 5p arms, whether in or upstream of the mature miRNA sequence. No editing was observed in skeletal muscle of mice (data not shown). We then scanned the pri-miRNAs corresponding to 56 precursors from the DLK1–GTL2 domain using RNA extracted from skeletal muscle (LD) of one CLPG/CLPG and one +/+ sheep.
Within the miR-376 cluster, we did observe substantial levels of editing of the "+44" 3p position (corresponding to position "+6" of the mature miRNA) of miR-376e (0%–25%), miR-376c (also known as miR-368; 5%–45%), miR-376a2 (0%–50%), and miR-376b (0%–95%), but not of miR-654 and miR-376a1. No editing was observed in the 5p arm for any of these pri-miRNAs. Outside of the miR-376 cluster, we observed strong editing of three other pri-miRNAs: at the equivalent "+44" 3p position (corresponding in this case to position "+5" of the mature miRNA) for miR-381 (10%–82%), at 5p position "+5" (corresponding to position "+5" of the mature miRNA) for miR-411a (0%–20%), and at 5p position −4 outside of the mature miRNA sequence for miR-369 (0%–40%). Note that neither miR-381 nor miR-411a shows obvious similarity with members of the miR-376 cluster. Based on these results, we evaluated the level of editing of miR-376e, miR-376c, miR-376a2, miR-376b, miR-381, and miR-411a in 16 additional animals representing the four possible CLPG
genotypes at 2 and 8 wk of age. Editing of miR-376c, miR-376b, and miR-381 was observed in some of the new animals, but not of miR-376e, miR-376a2, and miR-411a. Unexpectedly, we observed a highly significant (p ≤ 1.3 × 10⁻⁵) effect of CLPG genotype on the level of editing of miR-376c, miR-376b, and miR-381: +/+ animals had markedly higher levels of pri-miRNA editing than the three other genotypes (Fig. 3A). To verify whether editing of the pri-miRNAs resulted in equivalent proportions of edited mature miRNAs (as observed by Kawahara et al. [2007]), we evaluated the level of editing of mature miR-376e, miR-376c, miR-376a2, miR-376b, miR-381, and miR-411a in the HTS libraries (Fig. 3B). In general, editing levels of mature miRNAs were below 10%. For the three miRNAs with high levels of pri-miRNA editing (i.e., miR-376c, miR-376b, miR-381), levels dropped considerably in the fully processed miRNAs. This was most striking for miR-376b with virtually total absence of edited reads. For miR-376c and miR-381, editing levels dropped by a factor of ~10 when compared to the precursors. Thus, for these miRNAs, editing either inhibits pri/pre-miRNA processing and/or reduces the stability of the miRNA. The effect of CLPG genotype on editing levels was still apparent for miR-376c and miR-381. For miR-376a2, editing levels appeared higher after than before miRNA processing (although still well below 10%). In this case, editing may thus promote processing and/or stability. Finally, for miR-411a, mature editing levels were consistently of the order of 1%, which was well above background (average of 0.02% across 14 5p miRNAs with an A residue at position "+5"). Such levels would not have been reliably detected at the pri-miRNA level.
Evaluating the affinity of miRNAs in the DLK1–GTL2 domain for DLK1
Having generated an exhaustive catalog of miRNAs expressed in skeletal muscle of CLPG/CLPG animals allowed us to test the miRNA-mediated DLK1 trans inhibition hypothesis with unprecedented power. For each of the 114 miRNA species from the DLK1–GTL2 domain, we singled out the most abundant isomir (or pair of isomirs in ex aequo cases, leading to 127 distinct sequences) and quantified its affinity for DLK1 using two established metrics. The first one ("G-species score") follows Grimson et al. (2007) and counts the occurrences of 6-mer (Watson-Crick [WC] reverse complement of miRNA residues 2 to 7), 7-mer-m8 (WC reverse complement of miRNA residues 2 to 8), 7-mer-A1 (WC reverse complement of miRNA residues 2 to 7 plus 3′ A anchor), and 8-mer matches (WC reverse complement of miRNA residues 2 to 8 plus 3′ A anchor) in DLK1. Thus, an 8-mer match would increase the "G-species score" by 4, that of a 7-mer (without 8-nt match) by 2, and that of a 6-mer match by 1. The second one ("M-species score") sums scores (≥140) obtained with the more liberal miRNA-target miRanda identification engine (John et al. 2004). "G-species scores" were summed to generate a "G-quadrille score," and "M-species scores" were summed to generate an "M-quadrille score." "Quadrille scores" evaluate the affinity for DLK1 of the miRNAs considered as a team. Moreover, the same scores were generated for human and mouse (using species-specific miRNA sequences reported in miRBase), and corresponding scores were summed across species to generate "multiorganism (MO) scores." The latter should be more effective at identifying an unusual affinity for DLK1 if conserved across species. Whether the mechanisms underlying the trans inhibition of DLK1 observed in sheep are shared with other species remains unknown. To evaluate the statistical significance of the obtained metrics, we compared them with their distribution obtained on 10,000 random shuffles of the DLK1 sequence.
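The seed-match counting behind the G-species score can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, each 6-mer site is scored once according to its longest match type (8-mer = 4, either 7-mer type = 2, 6-mer only = 1), and the let-7a sequence and toy target are used purely as an example unrelated to DLK1.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA-alphabet sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def g_species_score(mirna, target):
    """Score seed matches of one miRNA (5'->3') in one target (5'->3')."""
    mirna = mirna.upper().replace("U", "T")
    target = target.upper().replace("U", "T")
    site = revcomp(mirna[1:7])            # pairs with miRNA residues 2 to 7
    m8 = mirna[7].translate(COMPLEMENT)   # target base pairing with residue 8
    score = 0
    start = target.find(site)
    while start != -1:
        # Residue 8 pairs with the target base just 5' of the 6-mer site.
        has_m8 = start > 0 and target[start - 1] == m8
        # The A anchor is the target base just 3' of the 6-mer site.
        has_a1 = start + 6 < len(target) and target[start + 6] == "A"
        if has_m8 and has_a1:
            score += 4   # 8-mer
        elif has_m8 or has_a1:
            score += 2   # 7-mer-m8 or 7-mer-A1
        else:
            score += 1   # 6-mer only
        start = target.find(site, start + 1)  # allow overlapping sites
    return score


def g_quadrille_score(mirnas, target):
    """Sum of species scores over a set of miRNAs considered as a team."""
    return sum(g_species_score(m, target) for m in mirnas)


let7a = "UGAGGUAGUAGGUUGUAUAGUU"
toy_target = "GGCTACCTCAGGGTACCTCGG"  # one 8-mer site and one 6-mer site
print(g_species_score(let7a, toy_target))  # 5
```

Scoring each site by its longest match avoids counting the 6-mer and 7-mer matches that are nested inside every 8-mer.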
Figure 3. (A) Percentage of A-to-I editing of pri-miRNAs at the "+5" position (pre-miR-411a) or "+44" position (pre-miR-376e, pre-miR-376c, pre-miR-376a2, pre-miR-381) in longissimus dorsi of 18 animals representing the four possible CLPG genotypes [gray: +/+ (5); blue: +Mat/CLPGPat (4); red: CLPGMat/+Pat (4); purple: CLPG/CLPG (5)]. The first two animals of each CLPG genotype were analyzed at 2 wk, the others at 8 wk. (B) Percentage of A-to-I editing of mature miRNAs at position "+5" (miR-411a = 5p), "+6" (miR-376e, miR-376c, miR-376a2, miR-376b = 3p), and "+5" (miR-381 = 3p). Animals are ordered as in A. (*) Animals without HTS data. The numbers above each column correspond to the total number of reads (edited + nonedited) available for analysis. The black horizontal lines correspond to the average level of A-to-G substitution observed for miRNAs derived from the 5p arm at position "+5" (miR-411a), from the 3p arm at position "+6" (miR-376e, miR-376c, miR-376a2, miR-376b), and from the 3p arm at position "+5" (miR-381).
Shuffling was conducted so as to maintain the original trinucleotide composition of the target gene (cf. Supplemental Methods). While miRanda is expected to inflate the number of target predictions, their statistical significance should be well controlled by this approach (i.e., there is no reason why the true DLK1 sequence should yield better miRanda scores than the shuffled sequences). We first tested the approach using the PEG11 ORF as positive control (1000 shufflings). PEG11 is indeed targeted by at least six miRNA species derived from five pre-miRNAs in anti-PEG11 in ovine, human, and mouse (miR-431, miR-433, miR-127, miR-432 [human and sheep], miR-434 [mouse], miR-136) (Davis et al. 2005). The targeting of PEG11 by these miRNAs is "plant-like," relying on WC complementarity over the entire length, resulting in target slicing (Davis et al.
2005). For proper comparison with the presumably "animal-like" situation of DLK1, we only considered the miRNA seed sequences (residues 1 to 8) when computing the G-species scores (replacing the 3′ A-anchor constraint [applying to 8-mer and 7-mer-A1 matches] by 3′ WC complementarity). miRanda scores were computed as before. When considering the ovine sequence (ORF) alone, both quadrille scores were significant (G: p = 0.003; M: p = 0.025), hence detecting the presence of one or more miRNAs targeting PEG11. Considered individually, however, none of the miRNAs exceeded the Bonferroni-corrected 5% threshold. Two (of 12) miRNAs processed from anti-PEG11 (miR-431-5p and miR-136-3p) achieved nominal significance (p = 0.003) for both G- and M-species scores. Intriguingly, the ovine-specific miR-3959-3p, although processed from MIRG, achieved equivalent significance (p = 0.003) (Supplemental Fig. 15A). When considering the three species simultaneously (hence exploiting evolutionary conservation known to exist), the significance of the two quadrille scores increased (G and M: p < 0.001). Moreover, six (of 12) miRNA species processed from anti-PEG11 yielded the highest possible signal (nominal p < 0.001; Bonferroni-corrected p ~ 0.10). Interestingly, miR-411a-5p processed from MIRG achieved the same top score, while the signal for miR-3959 remained essentially unchanged (nominal p = 0.003) as this Laurasiatheria-specific miRNA is not shared with human and mouse (Supplemental Fig. 15B). The high miR-411a scores reflect one 7-mer and 10 6-mer matches in mouse, one 7-mer and six 6-mers in human, and two 7-mers and two 6-mers in sheep. Three matches were conserved in the three species, and one in two species. The high ovine miR-3959 score is due to one 8-mer and one 7-mer match.
Thus, application of our method to PEG11 indicated that (1) significant quadrille scores, but not species scores, could be obtained without exploiting evolutionary conservation; (2) significant quadrille and species scores could be obtained when exploiting conservation. Most interestingly, this analysis strongly suggests that the paternally expressed PEG11 is not only targeted by fully complementary miRNAs processed from the maternally expressed anti-PEG11, but also by miRNAs processed from the maternally expressed MIRG pri-miRNAs, which recognize their target via seed-dominated complementarity. We then applied the same approach to DLK1 including the 5′-UTR, ORF, and 3′-UTR as it is established that miRNAs may target these different gene compartments (Baek et al. 2008; Selbach et al. 2008; Tay et al. 2008; Chi et al. 2009). When relying solely on ovine information, the most noteworthy result was the nearly significant G-quadrille score (p = 0.052) on the DLK1 ORF, hence suggesting an unusual affinity of the ovine miRNA team
for this segment of DLK1 (Fig. 4A). This signal was primarily driven by miR-377-3p (nominal p = 0.0012; one 8-mer and two 7-mer-m8 matches), miR-1193-3p (nominal p = 0.0098; one 8-mer and two 7-mer-m8 matches), and miR-370-5p (nominal p = 0.0098; one 8-mer and one 7-mer-m8 match; Fig. 4B). Note that none of these miRNAs achieves Bonferroni-adjusted significance. There was no convincing evidence for miRNA targeting of the DLK1 5′- or 3′-UTR. When adding the human and murine information, the significance of the MO G-quadrille score for the ORF increased slightly (p = 0.036), although none of the individual miRNAs clearly stood out (Fig. 4C). When applied to the 3′-UTR, the MO M-species score for miR-376c-3p achieved Bonferroni-corrected significance (nominal p = 0.0004; Bonferroni-corrected p = 0.05). This signal was due to a miRanda target site shared between mouse and sheep (Fig. 4C). Weighting miRNA scores by expression level estimated from HTS reads and including all isomirs in such analyses did not yield stronger signals (data not shown).
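The logic of the sequence-shuffling test and of the Bonferroni correction applied to species scores can be sketched in Python as follows. This is a simplified illustration, not the procedure of the Supplemental Methods: the shuffle used here permutes single nucleotides and therefore preserves only mononucleotide composition, whereas the study preserved trinucleotide composition. The function names, the toy scoring function, and the toy sequence are hypothetical.

```python
import random


def empirical_p(score_fn, target, n_shuffles=1000, seed=0):
    """Fraction of shuffled sequences scoring at least as high as the real one.

    Simplification: random.shuffle on single letters (mononucleotide shuffle).
    """
    rng = random.Random(seed)
    observed = score_fn(target)
    letters = list(target)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        if score_fn("".join(letters)) >= observed:
            hits += 1
    return hits / n_shuffles


def bonferroni(p, n_tests):
    """Bonferroni-corrected P-value, capped at 1."""
    return min(1.0, p * n_tests)


# Toy scoring function: number of occurrences of a fixed motif.
def count_motif(seq):
    return seq.count("AAAA")


toy_target = "AAAA" + "CGT" * 20
p = empirical_p(count_motif, toy_target, n_shuffles=200)
print(p)

# Species scores are corrected for the 127 sequences tested, e.g., for the
# nominal P-value reported for miR-376c-3p on the 3'-UTR:
print(round(bonferroni(0.0004, 127), 4))  # 0.0508
```

Because the P-value is a fraction of shuffles, its resolution is limited by the number of shuffles: with 1000 shuffles, a score that is never matched can only be reported as p < 0.001.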
Discussion
We herein establish a catalog of miRNAs expressed in skeletal muscle of sheep. Using miRDeep (Friedlander et al. 2008), we detected 747 small RNA species mapping to 472 miRNA precursors. Of these, 324 were classified as orthologs or close paralogs of known miRNAs, leaving 148 candidate novel miRNAs. It is noteworthy that expression levels of new miRNAs were considerably lower than those of known miRNAs (Supplemental Fig. 16). As the ovine genome sequence is not completed, we used the bovine as reference for most of the genome. Comparison of the miRDeep performances on 390 kb of contiguous sequence available in both bovine and sheep indicates a possible loss of ~6% sensitivity but not of specificity. Reanalyzing the sequence data with the ovine reference may thus increase the number of detected miRNAs. Within the DLK1–GTL2 domain, six known miRNA precursors were missed by miRDeep despite the occurrence of HTS reads. Assuming that the DLK1–GTL2 miRNA catalog of sheep is near complete, this corresponds to a sensitivity of 49/55 = 0.89. This figure is identical to that obtained by the miRDeep developers in Caenorhabditis elegans (Friedlander et al. 2008). When focusing on the annotation of miRNAs in the DLK1–GTL2 domain, we noted about 500 reads mapping to 12 predicted C/D snoRNA genes in MEG8. This is reminiscent of Ender et al. (2008), who reported human AGO1-4-associated small RNAs mapping to C/D and H/ACA snoRNA precursors. More specifically, Ender et al. (2008) demonstrated Drosha-independent (also known as RNASEN), Dicer-dependent (also known as DICER1) processing of miRNAs derived from the bona fide ACA45 snoRNA (also known as SCARNA15), thereby revealing an alternative pathway for the generation of functional miRNAs. It is not yet known which of the C/D snoRNAs in MEG8 (Cavaille et al. 2002) are genuine, associating with core components of the C/D RNP
(e.g., FBL, NHP2L1, NOP56, NOP58). Contrary to Ender et al. (2008), in four of six cases in which reads were derived from both arms, the two miRNA species were characterized by 2-nt 3′ overhangs compatible with Drosha-dependent processing of the pri-miRNA (Supplemental Fig. 17). Further work is needed to exclude the trivial possibility that the corresponding miRNAs derive from erroneously annotated C/D snoRNAs. We confirm that miRNAs from the DLK1–GTL2 domain are imprinted in skeletal muscle of sheep and preferentially expressed from the maternal allele. As for the other genes in the domain, the imprinting status of the miRNAs is not affected by the CLPG genotype. Of note, the imprinting status of the genes in the DLK1–GTL2 domain cannot be directly tested in CLPG/CLPG animals as the padumnal and madumnal alleles are identical. However, we have demonstrated that the IG-DMR is differentially methylated in CLPG/CLPG animals as in other genotypes, supporting regular imprinting (data not shown). Previous studies in skeletal muscle of sheep revealed tight imprinting control for both paternally and maternally expressed genes. For one of the miRNAs (miR-485), we observed relaxation of imprinting in two (one CLPGMat/+Pat and one +Mat/CLPGPat) out of three studied individuals: Molecules derived from the paternal allele represented ~15% of the total. Flanking miRNAs (located 600 bp upstream [miR-134] and 4000 bp downstream [miR-453]) did not show evidence of relaxation in the same samples. This strongly suggests that miR-485 is at least in part processed from a transcription unit independent of the one generating the two other miRNAs. Rather than being one unique large pri-miRNA, MIRG may thus encompass multiple transcription units controlled by distinct promoters.
Along the same lines, it has recently been suggested that miR-433 and the adjacent miR-127 are processed from distinct pri-miRNAs (Song and Wang 2008) rather than from a unique anti-PEG11 precursor shared with miR-431, miR-434, and miR-136 (Davis et al. 2005).
Caiment et al.
A microRNA catalog of callipyge skeletal muscle
Genome Research www.genome.org

We find that the miRNAs from the DLK1–GTL2 domain are affected by the CLPG mutation in the same manner as the maternally expressed long noncoding RNA genes (Charlier et al. 2001a). The main effect is in cis, causing an ~3.2-fold increase in miRNA expression from a CLPG versus wild-type maternal chromosome ([CLPG^Mat/+^Pat]/[+/+] ≈ 3.7; [CLPG/CLPG]/[+^Mat/CLPG^Pat] ≈ 2.7). As for the other genes affected by the CLPG mutation, this is thought to result from the inactivation of a muscle-specific cis-acting silencer element. In addition, we confirm a trans effect consisting of ~1.8-fold higher expression of miRNAs from the maternal allele when the paternal allele is CLPG versus wild type ([+^Mat/CLPG^Pat]/[+/+] ≈ 2.1; [CLPG/CLPG]/[CLPG^Mat/+^Pat] ≈ 1.5). While the MAT → PAT trans effect is known (PEG11) (Davis et al. 2005) or hypothesized (DLK1) (Georges et al. 2003, 2004) to reflect miRNA-mediated trans inhibition, the molecular mechanisms underlying this PAT → MAT trans effect remain elusive. One possible explanation is that the silencer element that is inactivated by the CLPG mutation has the capacity to exert its effect in trans on the homologous chromosome. Such mechanisms have been attributed to enhancers in Drosophila and other organisms and may underlie transvection (Kennison and Southworth 2002).

The cis effect on miRNA expression is consistent with that observed for the long noncoding RNA genes but is considerably weaker (~3.2-fold vs. ~9.5-fold). The reason for this difference is unknown. One explanation might be saturation of the miRNA processing machinery.

Although HTS-based digital expression profiling may not be as quantitative as initially assumed, the combined HTS, array, and QRT-PCR data strongly suggest that miRNA expression levels differ by at least ~30-fold. This could be due to differential processing efficiency of the precursors and/or stability of the processed miRNAs, but may also reflect the dependence on distinct promoters of unequal strength. The latter hypothesis is supported by the miR-485 imprinting data (see above), as well as by the recent identification of private host gene-independent promoters for intronic miRNAs (Ozsolak et al. 2008).

In contrast to skeletal muscle of mouse, in skeletal muscle of sheep we observed substantial levels of A-to-I editing for 4/6 pri-miRNAs from the miR-376 family (miR-376e, miR-376c, miR-376a2, miR-376b) and for three unrelated pri-miRNAs (miR-381, miR-411a, miR-369) from the DLK1–GTL2 domain. We noted a significant effect of CLPG genotype on pri-miRNA editing. Editing in +/+ animals, characterized by the lowest miRNA expression levels, was ~4.2-fold higher than in the other genotypes. The reasons underlying this observation remain unclear, but could involve saturation of the editing machinery or down-regulation of components of the editing machinery by miRNAs from the domain. Contrary to Kawahara et al. (2007), editing levels were lower in the mature miRNA population when compared to precursors: Edited molecules never made up >10% of reads.

Figure 4. (A) Statistical significance [log(1/p)] of the affinity of ovine miRNAs in the DLK1–GTL2 domain for the 5′-UTR, coding sequence (ORF), and 3′-UTR of the ovine DLK1. The affinity was measured using either G- (blue) or M-scores (orange) as defined in the text. Bars are dark colored for highly expressed and light colored for lowly expressed miRNAs. The last pair of bars (“quad”) at the right of the graph corresponds to the quadrille scores, the remaining bars to the species scores and are labeled accordingly. P-values were determined using the sequence-shuffling test described in the text. Species scores require a Bonferroni correction for 127 independent tests. (B) Position in the DLK1 mRNA of target sites (8-mers, 7-mers, and 6-mers as defined by Grimson et al. [2007]) for the same set of miRNA species. (C) Same as in A except that the scores are “multiorganism (MO) scores” combining information from sheep, human, and mouse.

The primary aim of this study was to lay the grounds for the identification of miRNAs that might account for the translational down-regulation of DLK1 observed in CLPG/CLPG animals, that is, the MAT → PAT trans effect. We approached this by positing that if an exceptional affinity of miRNAs from the DLK1–GTL2 domain for DLK1 could be clearly demonstrated in silico, this would very strongly support the hypothesis. The absence of such statistically significant affinity does not preclude actual interaction in vivo, but is neutral with respect to the hypothesis. We do not claim that our data reveal an unambiguous affinity of the DLK1–GTL2 miRNAs for DLK1, yet it is intriguing that both the G-quadrille score for the ORF and the miR-376c-3p MO M-species score for the 3′-UTR achieved 5% significance. This suggests that the DLK1–GTL2 miRNAs might, indeed, effectively down-regulate DLK1 in CLPG/CLPG animals. It is worth restating in this regard that miRNAs from the domain account for an estimated ~20% of cellular miRNAs in these animals. We are in the process of testing this prediction biochemically using both reporter assays and AGO-based (also known as EIF2C) target coimmunoprecipitation (e.g., Takeda et al. 2010), prioritizing miRNAs on the basis of the results presented in Figure 4.

While miRNAs from the domain are strong candidate direct mediators of the MAT → PAT trans effect on DLK1, alternative possibilities should not be excluded. These include (1) an indirect effect of the miRNAs from the domain, and (2) miRNA-independent mechanisms. It is interesting with regard to the latter that no clear function has yet been assigned to GTL2.

Individual miRNAs are predicted to target 200 to 300 genes on average (Grimson et al. 2007). What are the targets of the highly conserved miRNAs in the DLK1–GTL2 domain? To start addressing this question, we assembled the list of 2832 bovine genes having at least one conserved 8-mer or 7-mer target site in their 3′-UTRs for any of the 24 out of 153 “conserved miRNA families” with representatives in the DLK1–GTL2 domain (Friedman et al. 2009).
We then looked for the enrichment of specific gene ontology (GO) terms among these genes (Ashburner et al. 2000). To that end, we randomly sampled 10,000 (GO Slim analysis) or 200,000 (whole GO analysis) sets of 2832 genes from the complete TargetScan list of 8458 bovine miRNA-targeted genes and compared the hit number of each GO term for the list of target genes of the DLK1–GTL2 miRNAs with the distribution of hit numbers across the 10,000 (respectively, 200,000) random sets of genes. The resulting P-values were Bonferroni-corrected for the 6930 terms sampled out of the whole GO graph, or for the 55 terms of the GO Slim graph. Supplemental Table 2A shows the eight most enriched terms in both analyses, corresponding to a Bonferroni-corrected P-value ≤0.027 for the GO Slim analysis, or a nominal P-value ≤10^-4 for the whole GO analysis. The outcome of this analysis strongly suggests that the miRNAs from the DLK1–GTL2 domain are devoted to the targeting of regulators of the gene circuitry operating at the transcriptional, translational, and post-translational level, primarily in the nervous system. The list of genes corresponding to these top hits is provided in Supplemental Table 2, B and C.
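The resampling test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the Supplemental Methods: the function names, the toy gene identifiers, and the add-one P-value estimator are all introduced here for the example.

```python
import random


def empirical_enrichment_p(targets, background, term_genes, n_samples=10_000, seed=0):
    """Empirical one-sided P-value for enrichment of one GO term.

    Draws n_samples random gene sets of the same size as `targets` from
    `background` and counts how often a random set contains at least as
    many term-annotated genes as the target set does.
    """
    rng = random.Random(seed)
    term_genes = set(term_genes)
    background = list(background)
    observed = sum(gene in term_genes for gene in targets)
    as_extreme = 0
    for _ in range(n_samples):
        sample = rng.sample(background, len(targets))
        if sum(gene in term_genes for gene in sample) >= observed:
            as_extreme += 1
    # Add-one estimator (an assumption of this sketch): avoids reporting P = 0.
    return (as_extreme + 1) / (n_samples + 1)


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected P-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Toy data (hypothetical identifiers): 200 background genes, 20 of them
# annotated with the term, and a 30-gene target set holding 15 annotated genes.
background = [f"gene{i}" for i in range(200)]
term_genes = background[:20]
targets = background[:15] + background[100:115]

p_nominal = empirical_enrichment_p(targets, background, term_genes, n_samples=2000)
p_corrected = bonferroni(p_nominal, n_tests=55)  # 55 GO Slim terms, as in the text
```

With the real data, the background would be the 8458 TargetScan genes, the target set the 2832 genes, and the number of tests 55 (GO Slim) or 6930 (whole GO).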
Methods
Construction of small RNA libraries and high-throughput sequencing
Small RNA libraries were constructed using the “Small RNA sample preparation kit” following the instructions of the manufacturer (Illumina). Briefly, 10 μg of total RNA extracted with TRIzol (Invitrogen) was size-fractionated by denaturing polyacrylamide gel electrophoresis (PAGE; 15%), and molecules ranging from 18 to 30 nt were eluted. RNA adapters were successively ligated to the 5′- then 3′-end of the isolated small RNAs, and ligation products of the desired length (40–60 bp then 70–90 bp) were recovered by sequential PAGE (15% then 10%). Small RNAs appended with 5′ and 3′ adapters were reverse-transcribed with SuperScript II (Invitrogen) and amplified with Phusion DNA polymerase (Finnzymes Oy). Resulting amplicons were PAGE (6%) gel-purified, hybridized on a flow cell lane, clustered, and sequenced (36 cycles) using standard procedures (Illumina). Libraries corresponding to eight animals (two of each CLPG genotype) were first sequenced on a GA-I (Illumina) by Fasteris SA. The experiment was subsequently repeated for seven animals using a GA-II instrument (Illumina) at the GIGA-R core facilities.
Bioinformatics analysis of small RNA reads
The bioinformatics procedures applied for preprocessing of HTS reads; prediction, curation, and annotation of miRNA precursors; prediction of C/D snoRNAs in the DLK1–GTL2 domain; gene annotation and conservation analyses in the domain; quantitative analyses of HTS reads; comparison of HTS, Exiqon, and TaqMan data; analysis of non-miRNA HTS reads; evaluation of miRNA affinity for DLK1; and GO analyses of targets of miRNAs encoded in the domain are described in detail in the Supplemental Methods.
Exiqon array hybridization
Skeletal muscle RNA samples from the same eight animals (two of each CLPG genotype) extracted with TRIzol (Invitrogen) were hybridized on Exiqon miRCURY LNA Arrays (v.9.2) at Exiqon (Vedbaek). Briefly, RNA quality was evaluated on an Agilent Bioanalyzer 2100. Individual samples were labeled with Hy3 using the miRCURY Hy3/Hy5 power labeling kit, and cohybridized on the arrays with a Hy5-labeled equimolar mix of the eight samples. Arrays were scanned in an ozone-free environment. Fluorescence intensities were normalized with the global Lowess (LOcally WEighted Scatterplot Smoothing) regression algorithm using all probes except those corresponding to miRNAs from the DLK1–GTL2 domain.
Quantitative RT-PCR
QRT-PCR analyses of miRNAs were conducted using predesigned (miR-1, miR-127, miR-206, miR-382, miR-382*, let-7d) or custom (miR-3958 and miR-3959) TaqMan MicroRNA assays (ABI) on a 9700HT (ABI) instrument. Assay-specific amplification efficiencies were determined using serial RNA dilutions.
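The text does not state how efficiencies were derived from the dilution series. A common convention, assumed here, is to regress the threshold cycle (Ct) on log10 of the input amount and convert the slope to an efficiency. The sketch below uses a hypothetical function name and an idealized, made-up dilution series.

```python
import math


def amplification_efficiency(log10_inputs, ct_values):
    """Estimate PCR efficiency from a dilution series.

    Fits Ct = intercept + slope * log10(input) by ordinary least squares and
    converts the slope to an efficiency: E = 10 ** (-1 / slope) - 1,
    so that E = 1.0 corresponds to perfect doubling per cycle.
    """
    n = len(log10_inputs)
    mean_x = sum(log10_inputs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_inputs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log10_inputs, ct_values))
    slope = sxy / sxx
    return 10 ** (-1.0 / slope) - 1.0


# Hypothetical 10-fold dilution series with perfect doubling: each 10-fold
# dilution delays the threshold cycle by log2(10) cycles.
log10_inputs = [0.0, -1.0, -2.0, -3.0]
ct_values = [20.0 - x * math.log2(10) for x in log10_inputs]
efficiency = amplification_efficiency(log10_inputs, ct_values)
```

Real dilution series are noisy, so the fitted efficiency would normally deviate from 1.0 and differ between assays, which is why an assay-specific estimate is useful.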
Editing
The level of pri-miRNA editing was determined by sequence analysis of genomic- and cDNA-derived PCR products. Genomic DNA and total RNA were extracted using TRIzol (Invitrogen). Reverse transcription was carried out using SuperScript III (Invitrogen) on 1 μg of total RNA pre-treated with Turbo DNase (Ambion), and the cDNA was PCR-amplified using GOLD Taq (ABI; 35 cycles). The primer sequences used for mouse and sheep are provided in Supplemental Table 3. Amplicons were sequenced on a 3730 instrument (ABI), and the degree of editing was estimated from the electropherograms using PeakPicker (Ge et al. 2005).
Acknowledgments
This work was funded by grants from the Fonds National de la Recherche Scientifique, the University of Liège, the European FW6 program (CALLIMIR), the Communauté Française de Belgique (ARC Mirage and ARC Biomod), and the Belgian Science Policy Organisation (SSTC Genefunc PAI). Carole Charlier is Chercheur Qualifié au Fonds National de la Recherche Scientifique. We are grateful for the support of the GIGA-R sequencing core facility.
References
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. 2000. Gene Ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25–29.
Baek D, Villen J, Shin C, Camargo FD, Gygi SP, Bartel DP. 2008. The impact of microRNAs on protein output. Nature 455: 64–71.
Byrne K, Colgrave ML, Vuocolo T, Pearson R, Bidwell CA, Cockett NE, Lynn DJ, Fleming-Waddell JN, Tellam RL. 2010. The imprinted retrotransposon-like gene PEG11 (RTL1) is expressed as a full-length protein in skeletal muscle from Callipyge sheep. PLoS ONE 5: e8638. doi: 10.1371/journal.pone.0008638.
Cavaille J, Seitz H, Paulsen M, Ferguson-Smith AC, Bachellerie JP. 2002. Identification of tandemly-repeated C/D snoRNA genes at the imprinted human 14q32 domain reminiscent of those at the Prader-Willi/Angelman syndrome region. Hum Mol Genet 11: 1527–1538.
Charlier C, Segers K, Karim L, Shay T, Gyapay G, Cockett N, Georges M. 2001a. The callipyge mutation enhances the expression of coregulated imprinted genes in cis without affecting their imprinting status. Nat Genet 27: 367–369.
Charlier C, Segers K, Wagenaar D, Karim L, Berghmans S, Jaillon O, Shay T, Weissenbach J, Cockett N, Gyapay G, et al. 2001b. Human-ovine comparative sequencing of a 250-kb imprinted domain encompassing the callipyge (clpg) locus and identification of six imprinted transcripts: DLK1, DAT, GTL2, PEG11, antiPEG11, and MEG8. Genome Res 11: 850–862.
Chen C, Ridzon DA, Broomer AJ, Zhou Z, Lee DH, Nguyen JT, Barbisin M, Xu NL, Mahuvakar VR, Andersen MR, et al. 2005. Real-time quantification of microRNAs by stem–loop RT–PCR. Nucleic Acids Res 33: e179. doi: 10.1093/nar/gni178.
Chi SW, Zang JB, Mele A, Darnell RB. 2009. Argonaute HITS-CLIP decodes microRNA–mRNA interaction maps. Nature 460: 479–486.
Cockett NE, Jackson SP, Shay TL, Farnir F, Berghmans S, Snowder GD, Nielsen DM, Georges M. 1996. Polar overdominance at the ovine callipyge locus. Science 273: 236–238.
Davis E, Jensen CH, Schroder HD, Farnir F, Shay-Hadfield T, Kliem A, Cockett N, Georges M, Charlier C. 2004. Ectopic expression of DLK1 protein in skeletal muscle of padumnal heterozygotes causes the callipyge phenotype. Curr Biol 14: 1858–1862.
Davis E, Caiment F, Tordoir X, Cavaille J, Ferguson-Smith A, Cockett N, Georges M, Charlier C. 2005. RNAi-mediated allelic trans-interaction at the imprinted Rtl1/Peg11 locus. Curr Biol 15: 743–749.
Durbin R, Eddy SR, Krogh A, Mitchison G. 1998. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, UK.
Ender C, Krek A, Friedlander MR, Beitzinger M, Weinmann L, Chen W, Pfeffer S, Rajewsky N, Meister G. 2008. A human snoRNA with microRNA-like functions. Mol Cell 32: 519–528.
Enright AJ, Van Dongen S, Ouzounis CA. 2002. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30: 1575–1584.
Freking BA, Murphy SK, Wylie AA, Rhodes SJ, Keele JW, Leymaster KA, Jirtle RL, Smith TP. 2002. Identification of the single base change causing the callipyge muscle hypertrophy phenotype, the only known example of polar overdominance in mammals. Genome Res 12: 1496–1506.
Friedlander MR, Chen W, Adamidi C, Maaskola J, Einspanier R, Knespel S, Rajewsky N. 2008. Discovering microRNAs from deep sequencing data using miRDeep. Nat Biotechnol 26: 407–415.
Friedman RC, Farh KK, Burge CB, Bartel DP. 2009. Most mammalian mRNAs are conserved targets of microRNAs. Genome Res 19: 92–105.
Ge B, Gurd S, Gaudin T, Dore C, Lepage P, Harmsen E, Hudson TJ, Pastinen T. 2005. Survey of allelic expression using EST mining. Genome Res 15: 1584–1591.
Georges M, Charlier C, Cockett N. 2003. The callipyge locus: Evidence for the trans interaction of reciprocally imprinted genes. Trends Genet 19: 248–252.
Georges M, Charlier C, Smit M, Davis E, Shay T, Tordoir X, Takeda H, Caiment F, Cockett N. 2004. Toward molecular understanding of polar overdominance at the ovine callipyge locus. Cold Spring Harb Symp Quant Biol 69: 477–483.
Glazov EA, Kongsuwan K, Assavalapsakul W, Horwood PF, Mitter N, Mahony TJ. 2009. Repertoire of bovine miRNA and miRNA-like small regulatory RNAs expressed upon viral infection. PLoS ONE 4: e6349. doi: 10.1371/journal.pone.0006349.
Griffiths-Jones S. 2006. miRBase: The microRNA sequence database. Methods Mol Biol 342: 129–138.
Grimson A, Farh KK, Johnston WK, Garrett-Engele P, Lim LP, Bartel DP. 2007. MicroRNA targeting specificity in mammals: Determinants beyond seed pairing. Mol Cell 27: 91–105.
John B, Enright AJ, Aravin A, Tuschl T, Sander C, Marks DS. 2004. Human microRNA targets. PLoS Biol 2: e363. doi: 10.1371/journal.pbio.0020363.
Kagami M, Yamazawa K, Matsubara K, Matsuo N, Ogata T. 2008. Placentomegaly in paternal uniparental disomy for human chromosome 14. Placenta 29: 760–761.
Kawahara Y, Zinshteyn B, Sethupathy P, Iizasa H, Hatzigeorgiou AG, Nishikura K. 2007. Redirection of silencing targets by adenosine-to-inosine editing of miRNAs. Science 315: 1137–1140.
Kennison JA, Southworth JW. 2002. Transvection in Drosophila. Adv Genet 46: 399–420.
Landgraf P, Rusu M, Sheridan R, Sewer A, Iovino N, Aravin A, Pfeffer S, Rice A, Kamphorst AO, Landthaler M, et al. 2007. A mammalian microRNA expression atlas based on small RNA library sequencing. Cell 129: 1401–1414.
Li R, Li Y, Kristiansen K, Wang J. 2008. SOAP: Short oligonucleotide alignment program. Bioinformatics 24: 713–714.
Lin SP, Youngson N, Takada S, Seitz H, Reik W, Paulsen M, Cavaille J, Ferguson-Smith AC. 2003. Asymmetric regulation of imprinting on the maternal and paternal chromosomes at the Dlk1-Gtl2 imprinted cluster on mouse chromosome 12. Nat Genet 35: 97–102.
Linsen SE, de Wit E, Janssens G, Heater S, Chapman L, Parkin RK, Fritz B, Wyman SK, de Bruijn E, Voest EE, et al. 2009. Limitations and possibilities of small RNA digital gene expression profiling. Nat Methods 6: 474–476.
Morin RD, O’Connor MD, Griffith M, Kuchenbauer F, Delaney A, Prabhu AL, Zhao Y, McDonald H, Zeng T, Hirst M, et al. 2008. Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells. Genome Res 18: 610–621.
Ozsolak F, Poling LL, Wang Z, Liu H, Liu XS, Roeder RG, Zhang X, Song JS, Fisher DE. 2008. Chromatin structure analyses identify miRNA promoters. Genes Dev 22: 3172–3183.
Seitz H, Youngson N, Lin SP, Dalbert S, Paulsen M, Bachellerie JP, Ferguson-Smith AC, Cavaille J. 2003. Imprinted microRNA genes transcribed antisense to a reciprocally imprinted retrotransposon-like gene. Nat Genet 34: 261–262.
Seitz H, Royo H, Bortolin ML, Lin SP, Ferguson-Smith AC, Cavaille J. 2004. A large imprinted microRNA gene cluster at the mouse Dlk1–Gtl2 domain. Genome Res 14: 1741–1748.
Selbach M, Schwanhausser B, Thierfelder N, Fang Z, Khanin R, Rajewsky N. 2008. Widespread changes in protein synthesis induced by microRNAs. Nature 455: 58–63.
Smit M, Segers K, Carrascosa LG, Shay T, Baraldi F, Gyapay G, Snowder G, Georges M, Cockett N, Charlier C. 2003. Mosaicism of Solid Gold supports the causality of a noncoding A-to-G transition in the determinism of the callipyge phenotype. Genetics 163: 453–456.
Song G, Wang L. 2008. MiR-433 and miR-127 arise from independent overlapping primary transcripts encoded by the miR-433-127 locus. PLoS ONE 3: e3574. doi: 10.1371/journal.pone.0003574.
Takeda H, Charlier C, Farnir F, Georges M. 2010. Demonstrating polymorphic miRNA-mediated gene regulation in vivo: Application to the g+6223G→A mutation of Texel sheep. RNA 16: 1854–1863.
Tay Y, Zhang J, Thomson AM, Lim B, Rigoutsos I. 2008. MicroRNAs to Nanog, Oct4 and Sox2 coding regions modulate embryonic stem cell differentiation. Nature 455: 1124–1128.
Vandesompele J, De Preter K, Pattyn F, Poppe B, Van Roy N, De Paepe A, Speleman F. 2002. Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes. Genome Biol 3: research0034–research0034.11.
Vuocolo T, Byrne K, White J, McWilliam S, Reverter A, Cockett NE, Tellam RL. 2007. Identification of a gene network contributing to hypertrophy in callipyge skeletal muscle. Physiol Genomics 28: 253–272.
Received April 6, 2010; accepted in revised form October 8, 2010.
Research
Selective sweeps and parallel mutation in the adaptive recovery from deleterious mutation in Caenorhabditis elegans
Dee R. Denver,1,4 Dana K. Howe,1 Larry J. Wilhelm,1 Catherine A. Palmer,2 Jennifer L. Anderson,3 Kevin C. Stein,2 Patrick C. Phillips,3 and Suzanne Estes2
1Department of Zoology and Center for Genome Research and Biocomputing, Oregon State University, Corvallis, Oregon 97331, USA; 2Department of Biology, Portland State University, Portland, Oregon 97207, USA; 3Center for Ecology and Evolution, University of Oregon, Eugene, Oregon 97403, USA
Deleterious mutation poses a serious threat to human health and the persistence of small populations. Although adaptive recovery from deleterious mutation has been well-characterized in prokaryotes, the evolutionary mechanisms by which multicellular eukaryotes recover from deleterious mutation remain unknown. We applied high-throughput DNA sequencing to characterize genomic divergence patterns associated with the adaptive recovery from deleterious mutation using a Caenorhabditis elegans recovery-line system. The C. elegans recovery lines were initiated from a low-fitness mutation-accumulation (MA) line progenitor and allowed to independently evolve in large populations (N ~ 1000) for 60 generations. All lines rapidly regained levels of fitness similar to the wild-type (N2) MA line progenitor. Although there was a near-zero probability of a single mutation fixing due to genetic drift during the recovery experiment, we observed 28 fixed mutations. Cross-generational analysis showed that all mutations went from undetectable population-level frequencies to a fixed state in 10–20 generations. Many recovery-line mutations fixed at identical timepoints, suggesting that the mutations, if not beneficial, hitchhiked to fixation during selective sweep events observed in the recovery lines. No MA line mutation reversions were detected. Parallel mutation fixation was observed for two sites in two independent recovery lines. Analysis using a C. elegans interactome map revealed many predicted interactions between genes with recovery line-specific mutations and genes with previously accumulated MA line mutations. Our study suggests that recovery-line mutations identified in both coding and noncoding genomic regions might have beneficial effects associated with compensatory epistatic interactions. [Supplemental material is available online at http://www.genome.org.
The sequence data from this study have been submitted to the NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi) under accession no. SRA023539.]

4Corresponding author. E-mail [email protected]; fax (541) 737-0501. Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.108191.110.

The accumulation of deleterious mutations under conditions of relaxed selection threatens the persistence of organisms evolving in small populations (Lynch et al. 1993; Lande 1994) and is especially relevant to small captive populations of endangered species living in benign environments (Lande 1995). The recovery from deleterious mutations also serves as an analog to adaptation to a novel environment in which previously favored alleles are now detrimental. The evolutionary mechanisms by which organisms suffering from deleterious mutation are able to recover fitness have been well-studied in bacteriophage and bacterial laboratory evolution settings that showed rapid fitness recovery and a high incidence of parallel beneficial mutation fixation in independent experimental lineages (Reynolds 2000; Maisnier-Patin et al. 2002; Poon and Chao 2005; Poon et al. 2005). For example, DNA sequencing analysis of bacteriophage φX174 lines that had recovered from previously accumulated deleterious mutations revealed that ~30% of the beneficial mutations responsible for fitness recovery were back mutations (direct mutational reversals) and the remaining ~70% were compensatory mutations at other sites in the phage genome (Poon and Chao 2005). Similarly, ~95% of the beneficial mutations detected in Salmonella typhimurium lab populations recovering from the deleterious effects of antibiotic resistance were compensatory in nature rather than reversions (Maisnier-Patin et al. 2002).

A previous fitness analysis of Caenorhabditis elegans recovery lines, initiated from mutationally degraded MA line progenitors and then allowed to evolve in large populations (N > 1000), found that many lines were able to rapidly recover fitness and suggested that compensatory mutation was most likely responsible (Estes and Lynch 2003). However, the much larger genome sizes (>100 Mb for C. elegans versus ~5.3 kb for φX174) of multicellular species have thus far precluded analyses of genomic divergence patterns associated with adaptive recovery from deleterious mutation similar to those carried out in prokaryotic systems. Here, we use replicate C. elegans lines that have suffered a nearly 50% loss in fitness due to the accumulation of deleterious mutations to examine the molecular basis of rapid fitness recovery under experimental evolution via whole-genome resequencing. Using theoretical population genetic predictions, we are able to rule out neutral explanations for the relatively small number of nucleotide changes that we observe within each line, and show very strong positive selection acting on a subset of these nucleotides. Although certain classes of mutation were missed by our
20:1663–1671 © 2010 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/10; www.genome.org
analysis, our results show the promise of next-generation sequencing approaches for the comprehensive analysis of genomic change in evolutionary studies and demonstrate that compensatory mutations can be a powerful driver of the evolution of genetic systems.
Results
Recovery line experimental evolution and mutational analysis
We applied Illumina high-throughput DNA sequencing (also known as Solexa sequencing) to a C. elegans recovery-line system (Fig. 1) to characterize the spectrum of mutations associated with recovery from deleterious mutation accumulation in an animal system. MA12, a C. elegans MA line derived from the N2 strain and bottlenecked as single hermaphrodite nematodes for 323 generations (Vassilieva and Lynch 1999; Estes et al. 2005), was used as the progenitor of five independent sets of recovery lines (R12A–R12E) that were allowed to evolve in large populations (N > 1000) for 60 generations. Each of the five lines was initiated from a single MA12 progenitor and rapidly regained fitness levels similar to wild type (the N2 progenitor of the MA lines) as determined by life-history assays (Fig. 2).

The genomes of the N2 (wild type) C. elegans progenitor of the MA lines, the MA12 (generation 323) progenitor of the recovery lines, and three independent recovery lines (R12A, R12B, and R12C; generation 60) were analyzed using Illumina DNA sequencing. Seven Illumina lanes were used to collect DNA sequence data for each of the five samples. Using the same general approach previously applied to base-substitution mutation identification in a set of seven C. elegans MA line genomes (Denver et al. 2009), we surveyed virtually all nonrepetitive genomic regions (>80% of the total genome) in all lines analyzed (see Methods and Supplemental Methods). We identified 68 base-substitution changes between N2 and MA12; in all 68 cases, the mutation was also detected in each of the three MA12-derived recovery lines (Supplemental Table S1). Thus, no reversions of MA12 mutations were detected in the recovery lines. These 68 mutations originated and fixed during nematode bottlenecking along the N2 to MA12 lineage. We determined the base-substitution mutation rate (μbs) for MA12 using the same method previously applied to 10 C. elegans MA line genomes (Denver et al. 2009), and calculated a μbs value, 2.5 (±0.3) × 10^-9 per site per generation, highly similar to the 10-MA line average, 2.7 (±0.4) × 10^-9, from the previous study.

Figure 1. Schematic of the MA12 recovery-line system. The seven genotypes considered are shown in circles, and arrows represent generational time. MA12 was bottlenecked as single hermaphrodite nematodes across 323 generations and each of the five MA12-derived recovery lines (R12A–R12E) evolved in the lab for 60 generations at much larger (N > 1000 nematodes) population sizes. All genotypes were analyzed by Illumina sequencing, with the exception of the two recovery lines indicated by asterisks.

Figure 2. Fitness trajectories of individual C. elegans recovery lines. Black diamonds at generation 0 show average fitness trait values for the MA12 ancestor. Open circles represent average trait values for each of the five lines initiated from the MA12 genotype and evolved independently for 60 generations. Lines connect the same evolved populations across generations. Intrinsic rate of population increase (top panel), r (Giannelli et al. 1999), and total fecundity (middle panel) are reported as the proportion of N2 control values. The bottom panel shows lifespan; dashed line represents average lifespan of the N2 control. Bars, 1 SEM.

We extended our Illumina approach to identify base substitutions that originated and fixed during the recovery phase. We also carried out PCR and capillary DNA sequencing confirmation of all Illumina-identified mutant sites to rule out potential heterozygosity-related confounders (e.g., heterozygous MA12 mutations differentially fixing in recovery lines, new mutations still segregating in recovery-line populations) and to confirm that the detected mutations were in a fixed (or nearly fixed) state. We detected and confirmed seven fixed base substitutions for R12A, nine for R12B, and 12 for R12C (Table 1). Due to the relatively low and fluctuating site-specific coverage levels among different strains analyzed, we were unable to effectively extend our analysis to the identification of changes in heterozygosity between MA12 and its derivative recovery lines. None of the fixed recovery line
mutations detected in gene sequence occurred in genes that already harbored MA12 mutations; thus, no putative cases of intragenic compensatory mutation were identified.

We next analyzed the sites of recovery-line mutation fixation at nearly complete 10-generation intervals (from generation 10 to 60) using capillary DNA sequencing to investigate cross-generational patterns of mutational segregation and fixation. We visually scrutinized DNA sequence chromatogram data to search for evidence of heterozygosity; analysis of DNA sequence data from samples containing known relative molar abundances of wild-type and mutant bases suggests that we should be able to detect any segregating variants at levels >5% of the total (see Methods). Here, we assume recovery-line mutations to be in a fixed state if we were unable to detect any evidence for the wild-type base in the chromatogram data, although we cannot formally rule out the possibility that ancestral wild-type alleles are segregating at very low, undetectable frequencies. All but two mutations were observed to go from a nondetectable state in the recovery-line population to a fixed state in a single 10-generation interval (Fig. 3). The expected conditional time for a new neutral mutation, unaffected by linked non-neutral mutations, to reach fixation through drift is 4Ne generations; thus, neutral mutations would be expected to take an average of 4000 or more generations to fix in recovery-line populations, unless affected by linkage to beneficial mutations. Within the 60 generations of this experiment, the probability of a neutral allele arising via mutation and becoming fixed via genetic drift alone is on the order of 10^-38 (see Supplemental Methods). Even when multiplied by the ~8.8 × 10^7 nucleotide sites per genome examined here, the chance that any given observed change is not the result of natural selection, either directly or via hitchhiking, is vanishingly small.

Parallel mutation fixation
Although the majority (24/28) of mutations fixed in the recovery lines were specific to a single lineage, we observed two cases in which the same base-substitution events were detected in two different recovery lines: the two substitutions were each observed in both R12A and R12C (Table 1). One of these substitutions was on C. elegans chromosome (chr) II in an intergenic region 15 bp downstream from the lips-16 annotated gene boundary; the other was on chr IV in the middle of the second intron of the Y43B11AL.1 gene (Supplemental Fig. S1). Extending PCR and capillary sequencing analysis of these two sites to the two recovery lines not analyzed by Illumina (R12D, R12E) showed that these two mutations were present and fixed in the R12A and R12C lineages alone; there was no evidence for the mutations in R12B, R12D, or R12E at any generational time interval.

Table 1. Mutations detected in the recovery lines

Chromosome  Chromosome position^a  MA12^b  Recovery line^c  Coding^d   Gene
R12A
II          12,610,210             A       T                IG         NA
II          13,295,475             T       C                UTR        Y48C3A.5
III         9,098,894              C       G                EX: A→G    glp-1
IV          7,065,868              G       A                IN         Y43B11AL.1
V           7,961,426              G       A                EX: S→F    inx-4
X           11,305,323             T       C                UTR        F08G12.11
X           12,074,688             A       G                IG         NA
R12B
II          2,976,439              G       A                IG         NA
III         9,582,034              T       C                IN         C48B4.3
IV          14,682,043             T       A                IG         NA
IV          16,950,062             A       G                EX: Syn    Y116A8C.13
V           6,220,938              A       T                IG         NA
X           3,156,705              G       A                IG         NA
X           3,606,921              T       C                EX: N→S    C18B2.5
X           16,001,666             C       T                IG         NA
X           16,535,555             A       T                IG         NA
R12C
I           3,119,936              C       T                IN         lpd-6
I           6,066,034              C       T                EX: E→K    C27A12.9
I           14,119,073             C       G                EX: R→G    mys-2
II          8,531,956              C       T                IG         NA
II          9,617,561              A       T                UTR        T15H9.1
II          11,107,137             C       T                EX: A→V    nasp-1
II          12,610,210             A       T                IG         NA
II          12,838,521             T       A                IG         NA
IV          7,065,868              G       A                IN         Y43B11AL.1
V           20,830,449             G       A                UTR        num-1
X           7,422,575              C       A                IG         NA
X           14,278,186             A       C                IG         NA

a Relative to WS170 build.
b The ancestral base present in MA12 (and N2).
c The base present in the recovery line.
d Coding shows the coding sequence category of the mutation: EX, exon; IN, intron; UTR, untranslated region; IG, intergenic region. For exon mutations, Syn indicates synonymous mutations; for nonsynonymous mutations the specific amino acid changes are denoted. NA, not available.

Figure 3. Cross-generational analysis of recovery-line mutation sites. The 28 mutations identified in R12A, R12B, and R12C were analyzed at nearly complete 10-generation intervals in the recovery lines. Chromosome positions (approximate) are shown on the left for each mutation. (Green circles) Instances where the ancestral (MA12) base was detected; (red circles) the detection of recovery-line mutations in fixed states; (yellow circle) the single observed incidence of heterozygosity (Fig. 4); (gray circles) the R12B timepoint not assayed.
Genome Research www.genome.org
Downloaded from genome.cshlp.org on December 25, 2010 - Published by Cold Spring Harbor Laboratory Press
Denver et al.
Figure 4. Detection of a segregating mutation in R12A. The three chromatograms show DNA sequence data for the single instance where we were able to detect evidence for both ancestral DNA and recovery-line mutations. The mutation at position 12,610,210 on chr II (shared with R12C), indicated by the asterisk, was first detected as a segregating variant at generation 40 in R12A then appeared in a fixed state at generation 50. There was no evidence for this mutation at generation 30. We were unable to detect a segregating variant for any of the other 27 recovery-line mutations.
Cross-generational analysis of the two sites in R12A and R12C showed distinctive fixation patterns and distinctive patterns of linkage to other recovery line-specific fixed mutations (Fig. 3). The intergenic chr II mutation was first detectable in R12A as an unfixed, segregating allele in the experimental population at generation 40 and was fixed by generation 50 (Fig. 4) along with the chr IV shared mutation, the latter being undetectable until generation 50. These two mutations were the only fixed mutation sites detected in the R12A genome at generation 50. In R12C, both of the fixed mutations shared with R12A were first detected in a fixed state at generation 60 along with 10 other R12C-specific fixed alleles. Although we cannot with 100% certainty rule out the possibility that R12A nematode(s) contaminated the R12C population between generation 50 and 60 in the lab, we believe this possibility to be highly unlikely for two reasons. First, extreme technical care was taken during the experiment to avoid the possibility of cross-contamination (see Methods). Second, the two shared mutations appeared in R12C along with 10 additional R12C-specific fixed base substitutions. Thus, the cross-contaminating nematode lineage would have had to accumulate and fix 10 additional base substitutions in 10 generations or fewer. Given the base-substitution mutation rate ($\mu_{bs}$) and confidence interval for C. elegans genomic regions analyzed by Illumina (Denver et al. 2009), 2.7 (±0.4) × 10⁻⁹ per site per generation, and the number of sites surveyed (86.7 million, on average), a nematode is expected to acquire 0.23 (±0.4) detectable base-substitution mutations per generation. Dividing the number of observed R12C-specific mutations (N = 10) by 0.23 mutations/generation leads to an estimate of 43.5 (±11.2) expected generations for these 10 mutations to have accumulated. This suggests that 10 generations was an insufficient amount of time for these 10 mutations to have arisen in R12C.
The expected number of generations for 12 mutations to accumulate, following the same logic presented above, is 52.2 (±13.5) generations. We deduce that most or all of the 12 total fixed mutations detected in R12C at generation 60 most likely accumulated during the first 50–60 generations of the recovery experiment, persisting as very low-frequency (undetectable) variants until the acquisition of a beneficial mutation on that genetic background swept them all to fixation sometime between generations 50 and 60. The two fixed mutations shared by R12A and R12C most likely arose and fixed in these two recovery lineages in an independent, parallel fashion.
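The waiting-time arithmetic used in this argument can be reproduced in a few lines. The sketch below is only an illustration built from the point estimates quoted in the text (it does not propagate the reported uncertainties); the function and constant names are hypothetical, and the per-generation rate is rounded to two decimals because the text divides by 0.23.

```python
# Illustrative sketch: expected generations for k detectable base substitutions
# to accumulate in one lineage, using the point estimates quoted in the text.
# All names here are hypothetical helpers, not part of the original analysis.

MU_BS = 2.7e-9           # base-substitution rate per site per generation
SITES_SURVEYED = 86.7e6  # average number of sites surveyed


def mutations_per_generation(mu: float, sites: float) -> float:
    """Expected number of detectable mutations per nematode per generation."""
    return mu * sites


def expected_generations(k: int, per_generation: float) -> float:
    """Average number of generations needed to accumulate k mutations."""
    return k / per_generation


# Round to two decimals, matching the 0.23 mutations/generation used in the text.
rate = round(mutations_per_generation(MU_BS, SITES_SURVEYED), 2)

print(rate)                                      # per-generation expectation
print(round(expected_generations(10, rate), 1))  # 10 R12C-specific mutations
print(round(expected_generations(12, rate), 1))  # all 12 R12C fixed mutations
```

Either waiting time is several times longer than a single 10-generation sampling interval, which is the basis of the argument against contamination.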
Interactome analysis We investigated the possibility of intergenic epistatic interactions between recovery-line mutations and MA12 mutations in protein-coding genes using the C. elegans interactome map (Zhong and Sternberg 2006). GeneOrienteer (http://www.geneorienteer.org) was used to calculate log-likelihood ratio scores for all possible pairwise combinations of C. elegans genes that were found to harbor an MA12 or recovery-line mutation based on numerous underlying feature data sources (yeast two-hybrid experiments, microarray data, etc.) from C. elegans, Drosophila melanogaster, and Saccharomyces cerevisiae. Consistent with the original analysis of global C. elegans genetic interactions (Zhong and Sternberg 2006), we applied a score threshold of 0.9, which exceeds the maximum contribution that any single contributing feature can achieve, to identify putative interactions between genes harboring recovery-line and MA12 mutations. Fourteen predicted interactions were identified that met or surpassed our score threshold, all involving combinations of genes mutated in MA12 and those bearing fixed recovery-line mutations (Fig. 5); no interactions were detected between mutated MA12 genes or between recovery-line genes bearing fixed mutations. One R12A fixed mutation caused a nonsynonymous change in the glp-1 gene that has
Figure 5. Predicted interactions of recovery line and MA line mutations. Each red square represents a MA line mutation; inside the square, the genotype (MA12) is on top, the mutated gene is in the middle, and the functional effect is indicated on the bottom. UTR, mutations in untranslated regions; Ex Syn, exon mutations that are synonymous. For nonsynonymous mutations in exons, the resultant amino acid change is indicated using single-letter codes. Circles represent recovery line mutations: (green) R12A; (blue) R12C. Genetic interactions predicted by GeneOrienteer are indicated by double-headed arrows, and the log-likelihood score associated with each predicted interaction is listed next to corresponding arrows.
predicted interactions with 8/31 genes that suffered a mutation in MA12. glp-1 encodes a transmembrane receptor protein that, along with LIN-12, comprises one of two C. elegans members of the LIN-12/Notch family of receptors and plays a key role in the control of germ cell proliferation during postembryonic development (Austin and Kimble 1987; Priess et al. 1987). One R12C fixed mutation caused a change in the 3′ untranslated region of the num-1 gene that has predicted interactions with five genes mutated in MA12. NUM-1 affects the localization and recycling of cell membrane receptor proteins (Nilsson et al. 2008). Four of the MA12-mutated genes predicted to interact with NUM-1 (fixed mutation in R12C) also have predicted interactions with GLP-1 (fixed mutation in R12A) (see Fig. 5). Three of these four genes mutated in MA12 encode transmembrane proteins: CDH-4 is a cadherin involved in cell–cell adhesion (Schmitz et al. 2008), ITR-1 is a putative inositol (1,4,5) trisphosphate receptor that affects the defecation cycle and pharyngeal pumping (Walker et al. 2009), and Y74C10AL.2 encodes a protein bearing a conserved integral membrane protein domain (Rogers et al. 2008). Thus, we speculate that the R12A mutation in glp-1 and the R12C mutation in num-1 might have beneficial epistatic effects mediated through alteration of membrane protein activities.
Selective sweeps Cross-generational DNA sequencing analysis of the recovery lines revealed that many recovery line-specific mutations fixed at common 10-generational intervals (Fig. 3). This pattern is consistent with the occurrence of a series of selective sweeps over the course of the recovery experiments. In R12A, two sweep events were detected: two mutations fixed in unison between generations 40 and 50 (the two shared with R12C), then five R12A-specific mutations fixed between generations 50 and 60. In R12B, the data indicated three sweep events: one mutation fixation by generation 20, followed by seven additional mutations fixing at generation 50, followed by one mutation fixing at generation 60. In R12C, one sweep event was detected at generation 60 involving 12 mutations. We formally explored the expected dynamics of these sweeps using simulations and a diffusion approximation of the fixation of adaptive mutations under the influences of natural selection, genetic drift, and recurrent mutation (see Methods). The first thing to note is that although most of the sweeps that we observed are confined to a 10-generation window, even under a completely deterministic model a great deal of the allele frequency change for a new adaptive mutation initially occurs below our detection threshold (Fig. 6A), indicating that the actual time for the sweeps is probably more on the order of 20–30 generations. The diffusion approach allows the probability of fixation to be calculated for every possible combination of initial allele frequency and number of generations (Fig. 6B). Here, we are most interested in integrating the probability that a new mutation (initial frequency of 1/2000) will be fixed over the total length of the experiment and/or observation window.
Because of the very small probabilities involved, the diffusion approximation substantially underestimates the probability of fixation during early generations, but performs increasingly better over time, especially for strong selection (Fig. 6C). Since we observed at least one selective sweep in each of the replicates, the expected number of fixed mutations for a given time interval under a given set of parameter values must be at least 1.0. Assuming no interference between fixation events, moving from the single locus results to whole genome expectations
Figure 6. Population genetic analysis of the fixation of new mutations under positive selection and complete self-fertilization. (A) Deterministic sweep of a new adaptive mutation in an infinitely large self-fertilizing population under different strengths of selection. There can be a significant lag before the mutation reaches a high enough frequency to be detectable at the sensitivity threshold present in this experiment (dashed line). (B) Solution to the diffusion equation for finding the probability of fixation of a segregating allele (s = 0.5, Ne = 1000). The probability of fixation for a new mutation over the course of the experiment is calculated by summing over the cumulative probability of fixation for an initial allele frequency of 1/2000. (C) Cumulative probability of fixation over a given number of generations for varying levels of positive selection. Solid lines show simulation results that simultaneously include mutation, drift, and selection (Ne = 1000; μ = 2.6 × 10⁻⁹). Points below each line show the results of the diffusion approximation in which mutation is treated separately from drift and selection. The diffusion approximation tends to underestimate the probability of fixation, especially during early generations and under weaker selection.
requires multiplying the probability of fixation by the number of possible sites under selection. There are two domains of parameters that are consistent with the timescale of the response displayed by our populations. First, if the majority of the genome has the potential to contribute to the recovery observed here, then the probability of fixation (Fig. 6C) can be multiplied by a very large number (~8.8 × 10⁷), allowing mutations under moderately strong selection (s ≈ 0.3) to contribute to the response even though it is very unlikely that any particular mutation would become fixed. However, the consistency of the sweep dynamics and our observations of repeated fixations in several of the lines and an apparently limited set of interacting components suggest that this is unlikely. Fixation of a smaller subset of the genome (say 1 in 10,000 nucleotides) is feasible in the timescale observed here, but only if selection is strong. Thus, in order to satisfy the conditions (1) that we always observe fitness recovery in our experiments and (2) that the sweeps that we observed occur within at most a few dozen generations, we find that selection on at least one of the mutations is likely to be very strong: on the order of a 70%–90% increase in fitness relative to the ancestral genotype (Fig. 6C).
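The lag between the origin of an adaptive mutation and its first detection (Fig. 6A) can be illustrated with a deterministic recursion. The sketch below is not the authors' code: it assumes an infinitely large, completely selfing population in which the two homozygous classes have fitness 1 + s and 1, so the frequency x of the fitter class follows x' = x(1 + s)/(1 + sx). The starting frequency of 1/2000 and the 5% detection threshold come from the text; the function names are hypothetical.

```python
# Illustrative sketch (not the authors' code): deterministic sweep of a fitter
# homozygous class in an infinite, completely selfing population.

def next_frequency(x: float, s: float) -> float:
    """One generation of selection with fitness 1 + s (mutant) versus 1."""
    return x * (1.0 + s) / (1.0 + s * x)


def generations_to_reach(x0: float, target: float, s: float) -> int:
    """Generations until the mutant frequency first reaches `target`."""
    if s <= 0:
        raise ValueError("s must be positive for the mutant to increase")
    x, t = x0, 0
    while x < target:
        x = next_frequency(x, s)
        t += 1
    return t


X0 = 1 / 2000     # initial frequency of a new mutation (from the text)
DETECTION = 0.05  # approximate chromatogram detection threshold (from the text)

for s in (0.3, 0.5, 0.9):
    hidden = generations_to_reach(X0, DETECTION, s)
    near_fixed = generations_to_reach(X0, 1 - DETECTION, s)
    print(s, hidden, near_fixed)
```

Under these assumptions, a sizeable share of each sweep is spent below the detection threshold, which is why an apparently abrupt transition within one 10-generation window is compatible with a longer underlying sweep.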
Discussion Evolutionary implications Our study provides a broad-based view of genomic divergence patterns associated with the adaptive recovery from deleterious mutation in C. elegans. None of the recovery response is attributable to detectable reversion mutations, and no cases of putative intragenic compensatory mutations were identified. Thus, intergenic compensatory mutations likely drive all of the change that we observe, suggesting that this might be a common avenue for genetic change within complex multicellular organisms. Second, virtually all of the base-substitution polymorphisms detected in the recovery lines went from zero or near-zero (undetectable) population-level frequencies to a state of fixation (or near fixation) in a few dozen generations, suggesting the occurrence of selective sweeps in the adaptively evolving lab populations. The fact that the changes that we are able to observe all occur toward the end of our experiment suggests that these lab populations are likely subject to a constant genetic churn in which early sweeps are replaced by subsequent adaptive changes. If this is a common occurrence, then future studies will need to completely sequence individuals from multiple time points in order to fully characterize the underlying evolutionary dynamics. The timescale of simultaneous fixation requires very strong selection and suggests that the majority of the changes that we observed are generated via hitchhiking of very low frequency background mutations and/or that epistasis between multiple loci generates the fitness effects that we observe. The strength and genomic impacts of the sweeps observed here may help to explain the extreme haplotype structure observed within natural populations of C. elegans (Cutter 2006; Rockman and Kruglyak 2009). On the other hand, the fact that a number of mutations appear to readily hitchhike along with these sweeps indicates that local adaptation should drive strong genetic divergence among C. 
elegans isolates. Instead, we see very little evidence for genetic variation or population structure on a worldwide scale (Barriere and Felix 2005; Haber et al. 2005; Cutter 2006; Rockman and Kruglyak 2009). This observation provides support for the view that most extant C. elegans populations may have diverged from one another relatively recently (Phillips 2006; Cutter et al. 2008).
Parallel mutation and compensatory epistatic interactions Two identical fixed mutations were detected in R12A and R12C that most likely arose and fixed in these two recovery lineages in an independent, parallel fashion. The likelihood of two identical mutations occurring in two different recovery lines is ~0.0008 (see Methods). Although this probability is very low, it is possible that these sites might experience higher mutation rates than genomewide averages (Denver et al. [2009] were unable to effectively account for potential hotspots). Further, parallel mutation has previously been observed in similar experimental evolution studies in prokaryotes (Maisnier-Patin et al. 2002; Poon and Chao 2005). The observation of two sites fixing the exact same mutation type in independent recovery lines suggests that these mutations might have beneficial effects that were directly acted upon by natural selection. The observation of parallel mutation in the recovery lines might reflect a limited number of beneficial mutations available as potential substrates for adaptive recovery from MA12 mutations (Orr 2005). This interpretation is consistent with our population-genetic analysis of selective sweep dynamics that suggested very strong selection on a very small fraction of the C. elegans recovery-line genomes. Both parallel mutations, however, occurred in genomic regions that are not predicted to encode functional protein products, suggesting that any positive effects would be mediated through regulatory or DNA structural effects. The chr IV shared mutation occurred in an intron of the Y43B11AL.1 gene; the only functional information available for this gene from WormBase (Rogers et al. 2008) is that its product encodes F-box domains (involved in protein–protein interactions). The chr II shared mutation is just downstream from the lips-16 gene whose product is predicted to encode a lipase function and affect fat content.
Three of the detected mutations that accumulated in the MA12 genome prior to recovery (presumably deleterious) were in genes (nonsynonymous change in ZK682.2, intron change in H08M01.2, intron change in mgl-3) that play roles in maintaining fat content, as determined by RNAi experiments (Greer et al. 2008). Further, one R12A-specific fixed mutation resulted in a nonsynonymous change in the inx-4 gene whose product affects fat content in RNAi experiments (Greer et al. 2008). We speculate that the chr II mutation shared by R12A and R12C that is downstream from lips-16, as well as the inx-4 mutation specific to R12A, might have beneficial effects associated with lipid metabolism manifested through epistatic compensatory interactions with MA12 mutations. The presence of both the chr II and chr IV shared mutations in R12A and R12C indicates that any putative beneficial effects of these mutations might require epistatic interactions between these two loci, though there is no functional information in support of this possibility. More evidence for epistatic compensatory interactions underlying adaptive evolution in the recovery lines resulted from our interactome analysis (Fig. 5) that revealed strong evidence for interactions between genes that suffered (presumably deleterious) MA line mutations during the bottleneck phase and genes that acquired fixed recovery-line mutations. Although we were able to detect rare mutations that fixed in recovery-line lineages and characterize selective sweep events in our cross-generational analysis, we were unable to determine whether mutations identified were beneficial versus selectively neutral (or slightly deleterious) mutations that hitchhiked to fixation due to the complete linkage of all sites in the primarily self-reproducing nematodes. Our survey was also only able to identify base-substitution mutations; it is possible that other mutation
types (e.g., insertion-deletion mutations, large rearrangement events) left undetected were the primary drivers of fitness recovery. Backcrossing individual mutations identified in this study onto MA12 genetic backgrounds, followed by comparative fitness studies, would provide an avenue for understanding the effects of various mutations fixed in the recovery lines. We have repeatedly attempted such an analysis and, unfortunately, have been unable to successfully perform crosses with the low-fitness MA12 nematodes. Another limitation to this study is the fact that the mutations responsible for the very rapid recovery of fitness in the first 10–20 generations were most likely missed since all but one of the fixed mutations detected here occurred at later generations. Given that all five MA12-derived recovery lines independently regained wild-type fitness levels by generation 20 (Fig. 2), we speculate that the mutations responsible for most of the fitness recovery were of high-mutation rate types and/or involved highly plastic epistatic changes, rather than the base-substitution changes analyzed here. For example, gene duplication and deletion dynamics involving large gene families have the potential to have profound and rapid consequences on fitness via changes in dosage. Likewise, changes in ribosomal DNA cluster copy number are shown to occur at very high rates and have broad-based epigenetic effects on global chromatin and gene regulation in Drosophila (Paredes and Maggert 2009). Although unlikely, it is also possible that highly deleterious mutations present in a heterozygous state in the MA12 progenitor were largely responsible for the greatly reduced fitness of this MA line; individuals homozygous for the wild-type alleles at these sites would have an immediate large fitness advantage in the recovery lines.
Expanding our mutation survey to encompass heterozygous MA line mutation sites and repetitive DNA units will be required to pinpoint the nature of the beneficial changes responsible for the rapid regain of fitness in the recovery lines. We also note that although we detected very fast selective sweeps with associated selection coefficients of ≥0.3, two of our fitness measures (intrinsic rate of population increase, total fecundity) did not reveal fitness increases expected at these generational intervals (Fig. 2). Our third fitness measure, lifespan, did show increases across analyzed generational intervals. It is possible that our fitness measures lacked the sensitivity required to detect these fitness increases; competition assays involving recovery lines from different generations might provide a more powerful option for future analyses.
Adaptive mutation rate We can reasonably assume that each of the six selective sweeps detected in the cross-generational analysis was caused by positive selection acting on at least one beneficial mutation. Thus, a minimum of six beneficial mutations arose in the three recovery-line populations analyzed (two in R12A, three in R12B, and one in R12C), each underlying one of the six sweeps detected. This leads to a lower-bound adaptive genomic mutation rate estimate (Ua) of 3.8 × 10⁻⁵ per nematode per generation (see Supplemental Methods). Given the current total genomic mutation rate (Ut) estimate for C. elegans, 2.1 per genome per generation (Denver et al. 2004), this suggests that as few as one mutation in 55,263 (Ua/Ut) is adaptive. Our C. elegans Ua estimate is remarkably similar to a recent Ua estimate for Escherichia coli, 2.0 × 10⁻⁵, based on laboratory evolution studies (Perfeito et al. 2007). It is also consistent with our theoretical estimate of how much of the genome must be under strong positive selection in order for fixation to occur within the
timeframe observed here. Our Ua estimate for C. elegans, however, is likely to be an underestimate for two reasons. First, as discussed above, some selective sweep events likely went undetected, especially in earlier recovery generations where there might have been insufficient time for detectable base substitutions to accumulate. Second, the effects of clonal interference, the loss of competing beneficial mutations in the population as a consequence of selective sweeps at other loci, likely resulted in underestimation of the numbers of beneficial mutations arising in each recovery line. Thus, although whole-genome resequencing provides an unprecedented opportunity to identify the specific genetic changes responsible for fitness recovery, understanding the role of beneficial mutations in shaping natural patterns of genomic variation remains a formidable problem in evolutionary analysis.
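The "one mutation in 55,263" figure quoted earlier in this section follows directly from the two rate estimates: the number of mutations per adaptive mutation is Ut divided by Ua. A minimal check, using hypothetical names and the point estimates from the text:

```python
# Illustrative check of the adaptive fraction quoted in the text.
U_A = 3.8e-5  # lower-bound adaptive genomic mutation rate (per nematode per generation)
U_T = 2.1     # total genomic mutation rate (per genome per generation)


def mutations_per_adaptive_mutation(u_total: float, u_adaptive: float) -> float:
    """Number of mutations, on average, among which one is adaptive."""
    return u_total / u_adaptive


print(round(mutations_per_adaptive_mutation(U_T, U_A)))
```

Because Ua is a lower bound, this ratio is best read as an upper bound on how rare adaptive mutations are.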
Methods Experimental evolution and life-history analyses We selected five C. elegans MA lines for the current study that were shown to completely recover ancestral levels of fitness in a previous experiment (Estes and Lynch 2003). These lines were thawed, expanded for a single generation, and subdivided into five replicate populations. Each replicate was initiated from a single MA12 nematode and then independently maintained in large population sizes under standard laboratory conditions for 60 overlapping generations following Estes and Lynch (2003). The ancestral N2 control (progenitor of the MA lines) underwent the same treatment concurrently. Approximately 1000 individuals were transferred each generation, with populations expanding to roughly 10,000 individuals in between transfers (see Supplemental Methods for a discussion of effective population size). Extreme care was taken to avoid cross-contamination among experimental lines by keeping plates well separated on trays and through ethanol/flame sterilization of the metal core boring tool used for transfers. Finally, samples from each replicate were frozen at 10-generation intervals during the experimental evolution phase. The evolutionary trajectories of the other four recovery lines will be reported elsewhere. Because they exhibited the greatest total fitness increase during experimental evolution, we chose the MA12 recovery lines for Illumina analysis. Life-history assays were conducted as described in Estes and Lynch (2003). Briefly, for each line and each generational time point, total progeny production, population growth rate (r), and longevity were measured for 10–15 single individuals obtained from frozen stock. Single worms were transferred to fresh plates daily and progeny production measured by directly counting the progeny produced over the entire reproductive period. 
Intrinsic population growth rate, r, was calculated for each line by solving $\sum_x e^{-rx}\, l(x)\, m(x) = 1$ for r, where l(x) is the proportion of worms surviving to day x and m(x) is the fecundity at day x. Longevity was taken to be the total number of days lived from the L1 stage. Assays were carried out on standard OP50 E. coli-seeded NGM agar plates at 20°C. We tested for recovery of ancestral levels of fitness and for evolution of the ancestral control using analyses of variance for each fitness trait with population treatment (MA, recovery, ancestral control, evolved control) as a fixed effect. To test for differences between pairs of treatment group means, least-squares contrasts (Tukey’s HSD for all pairwise comparisons; Zar 1999) were performed on the data for each life-history trait.
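The equation for r has no closed-form solution in general, but its left-hand side decreases monotonically in r whenever some l(x)m(x) is positive, so a simple bisection suffices. The sketch below is one way to solve it, not necessarily the procedure used in the study; it assumes days are indexed from x = 1, and the life table is made up purely for illustration.

```python
import math

# Illustrative sketch: solve sum_x exp(-r*x) * l(x) * m(x) = 1 for r by bisection.
# Day indexing (x = 1, 2, ...) and the example life table are assumptions.


def euler_lotka_sum(r, survivorship, fecundity):
    """Left-hand side of the equation for a trial value of r."""
    return sum(
        math.exp(-r * x) * lx * mx
        for x, (lx, mx) in enumerate(zip(survivorship, fecundity), start=1)
    )


def solve_r(survivorship, fecundity, lo=-5.0, hi=5.0, tol=1e-10):
    """Bisection on r; the sum is strictly decreasing in r if any l(x)*m(x) > 0."""
    if euler_lotka_sum(lo, survivorship, fecundity) < 1 or \
            euler_lotka_sum(hi, survivorship, fecundity) > 1:
        raise ValueError("root is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if euler_lotka_sum(mid, survivorship, fecundity) > 1:
            lo = mid  # sum too large, so r must be larger
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical life table: proportion surviving to each day and daily fecundity.
l_x = [1.0, 1.0, 1.0, 0.9, 0.8]
m_x = [0.0, 0.0, 40.0, 120.0, 60.0]
print(solve_r(l_x, m_x))
```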
Male frequency analysis To approximate the amount of sexual recombination that may have been occurring during experimental evolution of the C.
elegans recovery lineages (A–E), we estimated the proportion of males produced by the MA12 ancestor and by each recovery line following 60 generations of laboratory evolution. Male frequency was scored by counting 200 individuals from five replicate plates per line. Male frequency in the MA12 line was estimated to be 0.7%; however, these males were apparently incapable of mating given our lack of success in backcrossing experiments. Male frequencies were even lower in the generation-60 recovery lines (Supplemental Table S2). We thus conclude that outcrossing-based sexual recombination most likely played a very minimal (if any) role in shaping the evolutionary trajectories of the recovery lines.
DNA sequence analysis We screened for base-substitution mutations in MA12, R12A, R12B, and R12C by applying the same Illumina high-throughput sequencing method previously used to accurately identify base substitutions in seven C. elegans MA line genomes (Denver et al. 2009). The details of our analytical approach are provided in the Supplemental Methods. For the current study it was especially important to account for the potential effects of heterozygous MA12 mutant sites that might be differentially fixed/segregated in different recovery lines. For Illumina sample preparations, nematode lab populations were initiated from frozen stocks (~50–100 nematodes) and allowed to expand for two generations in order to amass a sufficient number of animals for Illumina DNA extraction protocols. Our level of Illumina coverage (~7× for unique regions, on average) was insufficient to distinguish mutations fixed in the recovery lines from those still segregating in the population and coexisting with ancestral alleles. To address these concerns, we carried out PCR and conventional direct capillary DNA sequencing analysis for all 28 detected MA12 recovery-line changes (Supplemental Fig. S2) and in all cases there was no evidence of heterozygosity in MA12 or the recovery line. Many thousands of nematodes were used for DNA extractions used in PCR/capillary sequencing assays. The primers used to PCR-amplify and capillary-sequence these sites are provided in Supplemental Table S3. We performed a controlled capillary DNA sequencing experiment to evaluate allele frequencies required for detection in our cross-generational analyses. Using initial “wild-type” and “mutant” (containing base substitution) PCR product samples of known concentrations estimated through standard spectroscopy methods (NanoDrop), we made a series of samples where the molar ratios of input wild-type and mutant PCR products varied from 15% to 0.5%.
We evaluated chromatogram data from samples containing 15%, 10%, 5%, 1%, and 0.5% mutant PCR products and found that the mutant peak was discernible in the 15%, 10%, and 5% samples but not readily distinguishable from the baseline (“noise”) in the 1% and 0.5% samples. This result is generally consistent with similar studies aimed at identifying and characterizing low-frequency heteroplasmic mitochondrial DNA mutations (Theves et al. 2006). We conclude that any chromatograms that did not reveal any evidence for minority peaks (wild type or mutant, depending on the situation) correspond to DNA samples where the site is in a fixed state, or the frequency of the minority allele is <5%.
Mutation rate calculations Individual MA line-specific mutation rates were calculated with the equation $\mu_{bs} = m/(nT)$, where $\mu_{bs}$ is the base-substitution mutation rate (per nucleotide site per generation), m is the number of observed mutations, n is the number of nucleotide sites, and T is the time in generations, as previously described (Denver et al. 2004). The standard errors for individual mutation rates were calculated as $[\mu_{bs}/(nT)]^{1/2}$, as described (Denver et al. 2004). Values for n were defined as the total number of sites surveyed that met our criteria for consideration of a possible mutation site (Supplemental Methods). To address the matter of parallel mutation, we first estimated the probability of a particular mutation not occurring during the recovery experiment ($P_{nm}$) as $(1 - \mu_{bs})^{T N_e}$, where $N_e$ is the estimated effective population size (1000); $P_{nm} = 0.9998$. Given a fixed mutation in one recovery-line population, the probability of the same exact mutation occurring in any of the other four recovery lines is $(1 - P_{nm}) \times 4$, resulting in a value of 0.0008.
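The parallel-mutation probability can be reproduced numerically. The sketch below is an illustration with hypothetical names; it assumes T = 60 generations (the length of the recovery experiment) and the base-substitution rate of 2.7 × 10⁻⁹ quoted in the Results, neither of which is stated explicitly alongside the formula. It also shows that the quoted 0.0008 corresponds to using the probability of no mutation rounded to four decimals.

```python
# Illustrative sketch of the parallel-mutation probability.
# T = 60 and MU_BS = 2.7e-9 are assumptions taken from elsewhere in the text.

MU_BS = 2.7e-9   # base-substitution rate per site per generation
T = 60           # generations of recovery (assumed)
NE = 1000        # estimated effective population size
OTHER_LINES = 4  # remaining recovery lines in which the same change could arise


def prob_no_mutation(mu: float, generations: int, ne: int) -> float:
    """Probability that a particular mutation never arises: (1 - mu)^(T * Ne)."""
    return (1.0 - mu) ** (generations * ne)


def prob_parallel(p_nm: float, other_lines: int) -> float:
    """Probability of the same mutation arising in any of the other lines."""
    return (1.0 - p_nm) * other_lines


p_nm = prob_no_mutation(MU_BS, T, NE)
print(round(p_nm, 4))                                       # value reported in the text
print(round(prob_parallel(round(p_nm, 4), OTHER_LINES), 4)) # using the rounded P_nm
print(prob_parallel(p_nm, OTHER_LINES))                     # without intermediate rounding
```

Without the intermediate rounding the result is slightly smaller than 0.0008, but the conclusion that independent recurrence is improbable under the genome-wide average rate is unchanged.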
Population genetic analysis We used a combination of exact calculations, simulations, and a diffusion model to assess the probability of fixation of recurrent mutations in both the presence and absence of natural selection. In keeping with the mating system of C. elegans (particularly the N2 laboratory strain), we assume complete selfing throughout. Because of the very small probabilities involved, it is actually quite difficult to solve the case for complete neutrality for the timescales and population sizes used in this study. We outline upper bound calculations in the Supplemental Methods and Supplemental Fig. S3. For the case of positive selection operating on one or more loci, we use the standard Kolmogorov backward equation for genetic drift:
$$\frac{\partial u(p,t)}{\partial t} = \frac{V_{\delta p}}{2}\,\frac{\partial^2 u(p,t)}{\partial p^2} + M_{\delta p}\,\frac{\partial u(p,t)}{\partial p}$$
(Crow and Kimura 1970, equation 8.8.3.1). We follow Caballero et al. (1991) and Caballero and Hill (1992) in defining the instantaneous mean change in allele frequency under partial selfing as
$$M_{\delta x} = x(1-x)\,s\,[\,h + x(1-2h) + F(1 - x - h + 2xh)\,],$$
where s is the strength of selection for the new mutation with homozygous fitness 1 + s versus the initial genotype with fitness 1.0, h is the dominance coefficient of the new mutation, and F is the inbreeding coefficient (set to 1.0 for the case of complete selfing). The variance of this process generated by genetic drift is given by
$$V_{\delta x} = \frac{x(1-x)}{2N_e},$$
where $N_e$ is the effective size of the population (here taken to be 1000). The diffusion equation was solved numerically using Mathematica (Wolfram Research) under the boundary conditions $u(0,t) = 0$, $u(1,t) = 1$, and $u(x,0) = [x^2 + x(1-x)F]^{N_e}$ (code provided in Supplemental Methods). The probability of fixation at any time point was calculated by obtaining the value of $u(x_0, t)$, with $x_0$ set to the initial frequency of a new mutation (1/2000).
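As a short check on the mean-change expression, setting F = 1 (complete selfing, as assumed throughout) makes every term containing the dominance coefficient h cancel. The algebra below uses only the symbols defined in the text:

$$M_{\delta x}\big|_{F=1} = x(1-x)\,s\,[\,h + x - 2xh + 1 - x - h + 2xh\,] = x(1-x)\,s.$$

The same substitution simplifies the initial condition:

$$u(x,0)\big|_{F=1} = [\,x^2 + x(1-x)\,]^{N_e} = x^{N_e}.$$

Under complete selfing the mean change therefore depends only on s and x, as it would for a haploid locus.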
Treating each mutation as independent of one another, the number of mutations expected to reach fixation over a period of T generations was calculated as $\sum_{t=1}^{T} u(x_0, t)$. For the case of complete selfing, the dominance parameter h has no influence on the probability of fixation. These theoretical approximations were tested using a simulation of the entire mutation and fixation process. Mutation, genetic drift, and selection were sequentially imposed on a population initially fixed for the less fit allele. The number of generations needed for the alternative high fitness allele to become fixed within the population was recorded, with the probability of fixation at a given generation calculated as the fraction of $10^8$ replicate populations in which the high fitness allele became fixed. The results presented here assume a population size of 1000 individuals. Results for other population sizes are presented in Supplemental Figure S4.
Downloaded from genome.cshlp.org on December 25, 2010 - Published by Cold Spring Harbor Laboratory Press
Evolutionary recovery from mutation in C. elegans
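A much-reduced sketch of a simulation of the kind described above is given here for orientation. It is not the authors' code, and several details are assumptions made for illustration: genotype fitnesses of 1, 1 + hs, and 1 + s; new mutations arising as heterozygotes at rate 2μ per unmutated individual per generation; complete selfing; and fixation reported as the cumulative fraction of replicates fixed by each generation. All function and parameter names are hypothetical, and the parameter values in the usage lines are arbitrary and far smaller than those used in the study.

```python
import random

def simulate_fixation_time(n, s, h, mu, max_gen, rng):
    """Return the generation at which the fitter allele fixes, or None."""
    counts = [n, 0, 0]                       # genotype counts: aa, Aa, AA
    fitness = [1.0, 1.0 + h * s, 1.0 + s]
    for gen in range(1, max_gen + 1):
        # Mutation: each aa individual becomes Aa with probability 2 * mu.
        new_het = sum(1 for _ in range(counts[0]) if rng.random() < 2 * mu)
        counts[0] -= new_het
        counts[1] += new_het
        # Selection and drift: sample n parents weighted by count x fitness.
        weights = [c * w for c, w in zip(counts, fitness)]
        parents = rng.choices([0, 1, 2], weights=weights, k=n)
        # Selfing: homozygotes breed true; Aa segregates 1:2:1.
        counts = [0, 0, 0]
        for g in parents:
            if g == 1:
                g = rng.choices([0, 1, 2], weights=[1, 2, 1])[0]
            counts[g] += 1
        if counts[2] == n:
            return gen
    return None

def fixation_curve(n, s, h, mu, max_gen, replicates, seed=0):
    """Cumulative fraction of replicates fixed by generation 1..max_gen."""
    rng = random.Random(seed)
    fixed_at = [0] * (max_gen + 1)
    for _ in range(replicates):
        t = simulate_fixation_time(n, s, h, mu, max_gen, rng)
        if t is not None:
            fixed_at[t] += 1
    curve, running = [], 0
    for t in range(1, max_gen + 1):
        running += fixed_at[t]
        curve.append(running / replicates)
    return curve

curve = fixation_curve(n=50, s=0.2, h=0.5, mu=1e-3, max_gen=100, replicates=50, seed=1)
print(curve[-1])
```

Tracking genotype counts rather than individuals keeps the sketch short; a study-scale run would need a compiled or vectorized implementation.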
Acknowledgments
We thank L. Albergotti, H. Bui, A. Coleman-Hulbert, K. Hicks, B. Leon, S. Martha, S. Smith, T. Tague, and A. Woodbury for laboratory assistance. We thank M. Dasenko at the OSU Center for Genome Research and Biocomputing for DNA sequencing support. This work was supported by NSF grants DEB-062521 to S.E. and DEB-0743871 to S.E. and D.R.D., PSU Faculty Enhancement Grant to S.E., OSU Computational and Genome Biology Initiative support to D.R.D., and NSF grants DEB-0236180 and DEB-0641066 to P.C.P.
References
Austin J, Kimble J. 1987. glp-1 is required in the germ line for regulation of the decision between mitosis and meiosis in C. elegans. Cell 51: 589–599.
Barriere A, Felix M-A. 2005. Natural variation and population genetics of Caenorhabditis elegans. WormBook 26: 1–19.
Caballero A, Hill WG. 1992. Effects of partial inbreeding on fixation rates and variation of mutant genes. Genetics 131: 493–507.
Caballero A, Keightley PD, Hill WG. 1991. Strategies for increasing fixation probabilities of recessive mutations. Genet Res 58: 129–138.
Crow JF, Kimura M. 1970. An introduction to population genetics theory. Burgess, Minneapolis, MN.
Cutter AD. 2006. Nucleotide polymorphism and linkage disequilibrium in wild populations of the partial selfer Caenorhabditis elegans. Genetics 172: 171–184.
Cutter AD, Wasmuth JD, Washington NL. 2008. Patterns of molecular evolution in Caenorhabditis preclude ancient origins of selfing. Genetics 178: 2093–2104.
Denver DR, Morris K, Lynch M, Thomas WK. 2004. High mutation rate and predominance of insertions in the Caenorhabditis elegans nuclear genome. Nature 430: 679–682.
Denver DR, Dolan PC, Wilhelm LJ, Sung W, Lucas-Lledo JI, Howe DK, Lewis SC, Okamoto K, Thomas WK, Lynch M, et al. 2009. A genome-wide view of Caenorhabditis elegans base-substitution mutation processes. Proc Natl Acad Sci 106: 16310–16314.
Estes S, Lynch M. 2003. Rapid fitness recovery in mutationally degraded lines of Caenorhabditis elegans. Evolution 57: 1022–1030.
Estes S, Ajie BC, Lynch M, Phillips PC. 2005. Spontaneous mutational correlations for life-history, morphological and behavioral characters in Caenorhabditis elegans. Genetics 170: 645–653.
Giannelli F, Anagnostopoulos T, Green PM. 1999. Mutation rates in humans. II. Sporadic mutation-specific rates and rate of detrimental human mutations inferred from hemophilia B. Am J Hum Genet 65: 1580–1587.
Greer ER, Perez CL, Van Gilst MR, Lee BH, Ashrafi K. 2008. Neural and molecular dissection of a C. elegans sensory circuit that regulates fat and feeding. Cell Metab 8: 118–131.
Haber M, Schungel M, Putz A, Muller S, Hasert B, Schulenburg H. 2005. Evolutionary history of Caenorhabditis elegans inferred from microsatellites: Evidence for spatial and temporal genetic differentiation and the occurrence of outbreeding. Mol Biol Evol 22: 160–173.
Lande R. 1994. Risk of population extinction from fixation of new deleterious mutations. Evolution 48: 1460–1469.
Lande R. 1995. Mutation and conservation. Conserv Biol 9: 782–791.
Lynch M, Burger R, Butcher D, Gabriel W. 1993. The mutational meltdown in asexual populations. J Hered 84: 339–344.
Maisnier-Patin S, Berg OG, Liljas L, Andersson DI. 2002. Compensatory adaptation to the deleterious effect of antibiotic resistance in Salmonella typhimurium. Mol Microbiol 46: 355–366.
Nilsson L, Conradt B, Ruaud AF, Chen CC, Hatzold J, Bessereau JL, Grant BD, Tuck S. 2008. Caenorhabditis elegans num-1 negatively regulates endocytic recycling. Genetics 179: 375–387.
Orr HA. 2005. The probability of parallel evolution. Evolution 59: 216–220.
Paredes S, Maggert KA. 2009. Ribosomal DNA contributes to global chromatin regulation. Proc Natl Acad Sci 106: 17829–17834.
Perfeito L, Fernandes L, Mota C, Gordo I. 2007. Adaptive mutations in bacteria: High rate and small effects. Science 317: 813–815.
Phillips PC. 2006. One perfect worm. Trends Genet 22: 405–407.
Poon A, Chao L. 2005. The rate of compensatory mutation in the DNA bacteriophage φX174. Genetics 170: 989–999.
Poon A, Davis BH, Chao L. 2005. The coupon collector and the suppressor mutation: Estimating the number of compensatory mutations by maximum likelihood. Genetics 170: 1323–1332.
Priess JR, Schnabel H, Schnabel R. 1987. The glp-1 locus and cellular interactions in early C. elegans embryos. Cell 51: 601–611.
Reynolds MG. 2000. Compensatory evolution in rifampin-resistant Escherichia coli. Genetics 156: 1471–1481.
Rockman MV, Kruglyak L. 2009. Recombinational landscape and population genomics of Caenorhabditis elegans. PLoS Genet 5: e1000419. doi: 10.1371/journal.pgen.1000419.
Rogers A, Antoshechkin I, Bieri T, Blasiar D, Bastiani C, Canaran P, Chan J, Chen WJ, Davis P, Fernandes J, et al. 2008. WormBase 2007. Nucleic Acids Res 36: D612–D617.
Schmitz C, Wacker I, Hutter H. 2008. The Fat-like cadherin CDH-4 controls axon fasciculation, cell migration and hypodermis and pharynx development in Caenorhabditis elegans. Dev Biol 316: 249–259.
Theves C, Keyser-Tracqui C, Crubezy E, Salles JP, Ludes B, Telmon N. 2006. Detection and quantification of the age-related point mutation A189G in the human mitochondrial DNA. J Forensic Sci 51: 865–873.
Vassilieva LL, Lynch M. 1999. The rate of spontaneous mutation for life-history traits in Caenorhabditis elegans. Genetics 151: 119–129.
Walker DS, Vazquez-Manrique RP, Gower NJ, Gregory E, Schafer WR, Baylis HA. 2009. Inositol 1,4,5-trisphosphate signalling regulates the avoidance response to nose touch in Caenorhabditis elegans. PLoS Genet 5: e1000636. doi: 10.1371/journal.pgen.1000636.
Zar J. 1999. Biostatistical analysis. Prentice Hall, Upper Saddle River, NJ.
Zhong W, Sternberg PW. 2006. Genome-wide prediction of C. elegans genetic interactions. Science 311: 1481–1484.
Received March 22, 2010; accepted in revised form September 24, 2010.
Genome Research www.genome.org
Research
Coevolution within a transcriptional network by compensatory trans and cis mutations
Dwight Kuo,1 Katherine Licon,1 Sourav Bandyopadhyay,1 Ryan Chuang,1 Colin Luo,1 Justin Catalana,1 Timothy Ravasi,1,2 Kai Tan,3,4 and Trey Ideker1,4
1Departments of Bioengineering and Medicine, University of California, San Diego, La Jolla, California 92093, USA; 2Red Sea Laboratory of Integrative Systems Biology, Division of Chemical and Life Sciences and Engineering, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal 23955-6900, Kingdom of Saudi Arabia; 3Departments of Internal Medicine and Biomedical Engineering, University of Iowa, Iowa City, Iowa 52242, USA
4Corresponding authors. E-mail [email protected]. E-mail [email protected]. Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.111765.110. Freely available online through the Genome Research Open Access option.
Transcriptional networks have been shown to evolve very rapidly, prompting questions as to how such changes arise and are tolerated. Recent comparisons of transcriptional networks across species have implicated variations in the cis-acting DNA sequences near genes as the main cause of divergence. What is less clear is how these changes interact with trans-acting changes occurring elsewhere in the genetic circuit. Here, we report the discovery of a system of compensatory trans and cis mutations in the yeast AP-1 transcriptional network that allows for conserved transcriptional regulation despite continued genetic change. We pinpoint a single species, the fungal pathogen Candida glabrata, in which a trans mutation has occurred very recently in a single AP-1 family member, distinguishing it from its Saccharomyces ortholog. Comparison of chromatin immunoprecipitation profiles between Candida and Saccharomyces shows that, despite their different DNA-binding domains, the AP-1 orthologs regulate a conserved block of genes. This conservation is enabled by concomitant changes in the cis-regulatory motifs upstream of each gene. Thus, both trans and cis mutations have perturbed the yeast AP-1 regulatory system in such a way as to compensate for one another. This demonstrates an example of "coevolution" between a DNA-binding transcription factor and its cis-regulatory site, reminiscent of the coevolution of protein binding partners.
[Supplemental material is available online at http://www.genome.org. The sequence data from this study have been submitted to the NCBI Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/) under accession no. GSE15818.]
Transcriptional networks are central to understanding both evolution and phenotypic diversity among organisms. Of the many ways in which transcriptional networks can evolve, much attention has been given to changes in the so-called cis-regulatory regions of gene promoters (Wray 2007; Wagner and Lynch 2008). Such changes include gain, loss, or modification of DNA sequence motifs (Cliften et al. 2003; Kellis et al. 2003; Gasch et al. 2004; Stark et al. 2007) as well as alterations in motif spacing relative to the start of transcription, or to other motifs (Ihmels et al. 2005; Tanay et al. 2005). In addition to changes in cis, transcriptional networks can also evolve through alterations to transcription factor (TF) proteins and other trans-acting factors (Wagner and Lynch 2008). Although there have been fewer reports of evolutionary changes in trans, potential mechanisms include mutations to protein structure impacting transcriptional activation or DNA-binding domains (Wagner and Lynch 2008), modulation of TF expression (Sankaran et al. 2009) or post-translational modifications (Holt et al. 2009), or gain and loss of protein–protein interactions among TFs (Tuch et al. 2008; Lavoie et al. 2010).
Recently, a number of genome-scale studies have performed systematic comparisons of TF-binding patterns (Borneman et al. 2007; Tuch et al. 2008; Bradley et al. 2010; Lavoie et al. 2010; Schmidt et al. 2010) or mRNA expression profiles across species (Ihmels et al. 2005; Tanay et al. 2005; Hogues et al. 2008; Field et al. 2009; Wapinski et al. 2010). Almost universally, these studies have identified transcriptional programs that are dramatically rewired over short evolutionary time scales. As with earlier work, many of the observed differences in binding and expression have been linked to changes in cis-regulatory regions. For example, Borneman et al. (2007) found that the TF Tec1 binds only 20% of the same target genes in comparisons between Saccharomyces cerevisiae and the closely related Saccharomyces bayanus and Saccharomyces mikatae, and that this difference is due to gain and loss of canonical Tec1 cis-regulatory motifs. While some recent studies have associated genetic variants in TFs with gene expression changes observed in interspecies hybrids (Wilson et al. 2008; Wittkopp et al. 2008; Tirosh et al. 2009; Bullard et al. 2010; Emerson et al. 2010), in outbred crosses (Brem and Kruglyak 2005; Landry et al. 2005; Gerke et al. 2009; Sung et al. 2009; Zheng et al. 2010), or in human populations (Kasowski et al. 2010), the picture that emerges is that cis-regulatory regions are incredibly plastic over evolutionary time, while TFs (trans) evolve at a comparatively slower rate (Wray 2007).
Given the dramatic changes that appear to be occurring in transcriptional networks, a key question is how such systems retain essential functions over evolutionary time (Wray 2007). One solution is that changes in cis can occur by replacement of one TF cofactor with another, thereby maintaining regulatory control (Tsong et al. 2006). Alternatively, rather than replacing specific cofactors, it is conceivable that the DNA-binding domains of the TFs that bind these cis-regulatory sequences might be altered in lock-step with changes in cis, similarly to the evolution of protein-binding partners (Pazos and Valencia 2008). However, such a mechanism of evolution has yet to be observed. Here, we present a direct example of such "coevolution," where a specific change to a DNA-binding transcription factor and its cis-regulatory site have occurred in compensatory fashion.
20:1672–1678 © 2010 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/10; www.genome.org
As a model of transcriptional network evolution, we examined the yeast AP-1 (yAP-1) family, which, with a total of eight members, is one of the largest paralogous TF families in S. cerevisiae (Fernandes et al. 1997; Rodrigues-Pousada et al. 2010). Like other paralogous families, AP-1 factors have been born through the process of gene duplication, which gives rise to multiple copies that are free from selective pressure and may functionally diverge from their duplicates by sub- or neofunctionalization (Hittinger and Carroll 2007). AP-1 also provides a classic example of the basic leucine zipper (bZIP) motif, which is widely conserved across eukaryotes (Tan et al. 2008; Rodrigues-Pousada et al. 2010). In humans, AP-1 TFs have been heavily studied due to their crucial role in cell proliferation, death, and differentiation (Shaulian and Karin 2002). In yeast, yAP-1-mediated transcriptional networks carry out overlapping, but distinct biological responses to stress (Tan et al. 2008; Rodrigues-Pousada et al. 2010). In contrast to the widespread divergence in TF binding that has been demonstrated previously (Borneman et al. 2007; Tuch et al. 2008; Lavoie et al. 2010), we show that coupled trans and cis mutations enable conservation of a subset of genes targeted by yAP-1. These results provide an example of compensatory coevolution of a trans and cis regulatory system.
Results
A trans mutation is associated with AP-1 DNA-binding motif specificity
To identify trans mutations that could be associated with AP-1-binding preference, we performed an amino acid sequence alignment of the DNA-binding domains of all eight AP-1-like TFs in S. cerevisiae (Sc). This alignment and its associated phylogenetic tree (Fig. 1A) were searched to identify the key polymorphic amino acids whose patterns of conservation and divergence best explain the phylogeny (Methods, Evolutionary Trace Analysis). Such residues have been shown to frequently play important evolutionary roles (Innis et al. 2000).
Using this approach, we identified residue 12 of the DNA-binding domain basic region as the most important evolutionarily divergent position across the yAP-1 family (i.e., the one that was most highly correlated with the phylogeny; Fig. 1A). Residue 12 was also predictive of AP-1 family DNA-binding motif preference (Fig. 1A) (MacIsaac et al. 2006; Tan et al. 2008). AP-1 family members bind DNA as homo- or heterodimers, where each constituent monomer recognizes the consensus sequence TTAC (Suckow et al. 1999; Fujii et al. 2000). These "half-sites" are positioned in either adjacent or overlapping fashion (Fig. 1A), which we refer to as yAP-1 response element adjacent (YRE-A) or yAP-1 response element overlapping (YRE-O), respectively. Previous analyses of genome-wide chromatin immunoprecipitation with microarray hybridization (ChIP-chip) data in Sc (Harbison et al. 2004; Tan et al. 2008) have determined that five AP-1 family members (Yap1, Yap2, Yap5, Yap7, Yap8) recognize YRE-O, whereas two family members (Yap4 and Yap6) recognize YRE-A. We examined the binding of the remaining Sc AP-1 member Yap3 by ChIP-chip and determined it preferred YRE-A sites in both complete media and stress conditions (Supplemental Fig. 1). This preference for YRE-O or YRE-A-binding sites in Sc AP-1s correlates precisely with the presence of arginine or lysine at residue 12 (Fig. 1A).
Interestingly, residue 12 is part of an alpha-helical surface that forms multiple contacts to DNA (basic region residues 7–15) (Fig. 1B; Fujii et al. 2000). Previously, this residue was predicted as a likely determinant of DNA half-site spacing preference in Gcn4, another bZIP family TF (Kim and Struhl 1995). Although in vitro testing of Gcn4 mutants was not able to confirm this prediction (Kim and Struhl 1995), it has become apparent that such variations in half-site recognition are best distinguished in vivo (Suckow and Hollenberg 1998; Berger et al. 2008; Maerkl and Quake 2009).
Figure 1. A single residue determines yAP-1 DNA-binding motif specificity. (A) Alignment and phylogeny of AP-1 DNA-binding domain basic regions (residues 6 to 20 are shown). Residue 12 (red star) is predictive of preference for overlapping (YRE-O) or adjacent (YRE-A) DNA-binding motifs (left). Note that Yap8 possesses an Asp at residue 12 and binds a 2-bp overlapping YRE-O (Harbison et al. 2004). Positions affecting Gcn4 half-site spacing preference (Kim and Struhl 1995) are shown (gray stars). (B) Recognition of the yAP-1 half-site (Fujii et al. 2000). Residue 12 (red star) is in close proximity to residues conferring AP-1 sequence specificity. (C,D) ScYap1.R79K and ScYap4.K252R mutants have altered half-site spacing preference as evidenced by ChIP-chip (Methods). P-values refer to differences in binding to genes with either YRE-O or YRE-A sites as assessed by Fisher’s exact test. (E,F) ScYap1.R79K and ScYap4.K252R mutations cause mRNA expression changes among genes with YRE-O and YRE-A sites among the top 50 most differentially expressed genes. P-values denote the significance of YRE-A and YRE-O motifs among gene promoters compared with the genomic background.
Residue 12 point mutations cause rewiring of AP-1 transcriptional interactions
To further examine the regulatory role of residue 12, we mutated this residue in Yap1, a representative YRE-O-binding factor, and Yap4 (also known as Cin5), a representative YRE-A-binding factor. This process involved generating mutants Yap1.R79K and Yap4.K252R, changing arginine to lysine in Yap1 and lysine to arginine in Yap4 (Methods). Next, Yap1.R79K binding and Yap4.K252R binding were assayed in vivo using ChIP-chip (Methods). Comparison of the top 50 promoters bound by Yap1.R79K with the top 50 promoters bound by wild-type Yap1 (as determined in Tan et al. 2008) showed that mutation of Yap1 significantly altered its preference for YRE-O and YRE-A sites (Fig. 1C; Fisher’s exact test P = 0.0002). Comparison of promoters bound by mutant and wild-type Yap4 also showed the predicted shift in binding preference (Fig. 1D; Fisher’s exact test P = 0.037). These results were not dependent on the number of promoters examined (Supplemental Fig. 2).
Next, to assess the functional implications of changes in yAP-1 binding, we generated genome-wide mRNA expression profiles for each mutant in comparison to the unmutated parental strain (Methods). Both mutations, Yap1.R79K and Yap4.K252R, altered the expression of genes whose promoters were highly enriched for AP-1-binding sites (YRE-O and YRE-A, Fig. 1E,F; Supplemental Fig. 3). These genes were also enriched for Yap1.R79K and Yap4.K252R binding ($P < 10^{-5}$), respectively.
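The distinction between adjacent and overlapping half-sites can be made concrete with a small promoter scan. The sketch below is illustrative only: it assumes that YRE-A is the TTAC half-site followed directly by its reverse complement (TTACGTAA) and that YRE-O is the arrangement in which the two half-sites share the central base (TTA[C/G]TAA). These exact strings, and the function names, are assumptions made here for illustration rather than the motif models used in the study, and the 2-bp overlapping variant noted for Yap8 is not covered.

```python
import re

HALF_SITE = "TTAC"

def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

# Adjacent arrangement: half-site followed directly by its reverse complement.
YRE_A = re.compile("(?=" + HALF_SITE + reverse_complement(HALF_SITE) + ")")
# Overlapping arrangement (assumed): the two half-sites share the central base.
YRE_O = re.compile("(?=TTA[CG]TAA)")

def site_counts(promoter: str) -> dict:
    """Count YRE-A and YRE-O occurrences in one strand of a promoter sequence."""
    seq = promoter.upper()
    return {
        "YRE-A": len(YRE_A.findall(seq)),
        "YRE-O": len(YRE_O.findall(seq)),
    }

print(site_counts("GGTTACGTAACC"))        # one adjacent site
print(site_counts("GGTTAGTAACCTTACTAA"))  # two overlapping sites
```

Because the adjacent site is palindromic and the two overlapping variants are reverse complements of each other, scanning a single strand is sufficient under these assumptions.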
An apparent paradox: Candida AP-1 diverges at residue 12, but its targets are conserved
Based on our observation that residue 12 affects binding of AP-1 paralogs in S. cerevisiae, we next asked whether changes in this residue could lead to divergent binding of AP-1 orthologs across species. We searched the yeast phylogeny (Wapinski et al. 2007) for AP-1 orthologs that were anomalous in their use of Arg 12 or Lys 12, suggesting lineage-specific mutation (Supplemental Fig. 4). Among TFs orthologous to Sc YAP1, we found that the Candida glabrata (Cg) ortholog CgAP1 diverges from other yeasts (Fig. 2A–C) due to the presence of lysine at residue 12, in contrast to other yeasts in its clade that possess an arginine. This CgAp1 amino acid substitution was confirmed by sequencing of genomic DNA from two independent Cg isolates, 2001HTU and NCCLS84 (Fig. 2B).
We used ChIP-chip to determine whether this Lys 12 substitution had a functional effect on CgAp1 binding (Methods). To facilitate this assay, we tagged CgAp1 with the TAP epitope and designed a custom microarray tiling the Cg genome (Methods). As a control on both the TAP construct and the array design, we used ChIP-qPCR to successfully validate a panel of five randomly chosen Cg gene promoters that were determined to be bound by CgAp1 in the ChIP-chip experiment (Supplemental Fig. 5). We found that CgAp1 bound the promoters of a total of 114 genes, 90 of which had known orthologs in Sc (Methods). Comparison of these data with ChIP profiles for each of the AP-1 factors in Sc grown under the same treatment conditions (as determined in Tan et al. 2008) showed significant overlap between the targets of CgAp1 and ScYap1 (17 genes, $P < 10^{-17}$). Overlap with other Sc AP-1 factors was less substantial (Fig. 3A). This pattern of overlap was reinforced by sequence analysis, in which phylogenetic clustering of AP-1 DNA-binding domains places CgAp1 definitively with ScYap1 and not with other Sc AP-1 sequences (Fig. 2C; Methods). We were therefore faced with the following conundrum: On the one hand, the CgAp1 sequence diverges from Yap1 orthologs at residue 12, suggesting a shift in DNA binding. On the other hand, the CgAp1-binding profile is quite specifically conserved with that of Yap1, calling into question the importance of residue 12 for sequence recognition.
CgAp1 prefers YRE-A rather than YRE-O sites
To investigate this apparent contradiction, we next turned to the gene promoters targeted by CgAp1 in the ChIP assay. Promoters targeted by CgAp1 showed a clear preference for YRE-A sites over YRE-O sites (49 vs. four promoters, respectively). This preference significantly differs from ScYap1, which prefers YRE-O over YRE-A (Fisher’s exact test $P = 3.5 \times 10^{-8}$; Fig. 3B, 21 vs. 12 promoters, respectively). This preference could not be attributed to threshold effects on binding-site calls, as direct comparison of motif scores confirmed a preference for YRE-A over YRE-O sites (Mann-Whitney U test, P = 0.0072). This preference was also observed via de novo motif search in these promoters (Fig. 3C) and even among the Cg orthologs of all ScYap1 targets (Q = 0.05).
We further analyzed this cis-regulatory preference by examining the orthologs of genes targeted by both CgAp1 and ScYap1 across 20 sequenced yeast genomes (Wapinski et al. 2007). C. glabrata stood out clearly as the only species with enrichment for YRE-A sites (Fig. 3D). In contrast, the YRE-O site was enriched in all neighboring species in the yeast phylogeny, including S. cerevisiae and other sensu stricto species (S. paradoxus, S. mikatae, and S. bayanus) as well as the more diverged Saccharomyces castellii, Kluyveromyces waltii, Kluyveromyces lactis, Ashbya gossypii, and Candida tropicalis. These results indicate that upstream DNA-binding motifs of CgAp1 targets have evolved from YRE-O to YRE-A (Fig. 3E). Such a switch may have also been accompanied by concordant changes in secondary cis-regulatory DNA motifs (Supplemental material, Yap Cis-Regulatory Motifs Are Coincident with Those of Rtg3 and Aft1; Supplemental Fig. 6) and possible functional divergence (Supplemental material, Divergence and Conservation of Yap1 Function). The most plausible explanation is that these motifs have coevolved with a Lys 12 mutation in CgAp1, with the result that this transcriptional system has retained regulatory control of the same set of target genes over evolutionary time.
Figure 2. Evolution of the yAP-1 TFs. (A) CgAp1 possesses a lysine at residue 12 (CgAp1.46), while most other species possess an arginine. (B) Sequencing of CgAP1 in two unrelated isolates shows complete identity to the Cg reference genome. (C) Phylogenetic clustering of all Sc and Cg AP-1 DNA-binding domains reveals that CgAp1 and ScYap1 co-cluster. Internal branch point numbers refer to the Bayesian posterior probability, a measure of confidence (Drummond and Rambaut 2007).
Figure 3. The CgAp1 transcriptional network has been rewired. (A) For the promoters targeted by each yAP-1 transcription factor in Sc, the overlap with CgAp1 targets is shown (of 90 CgAp1 targets total). (B) CgAp1 prefers YRE-A-binding sites compared with ScYap1 (Fisher’s exact test). (C) The CgAp1 DNA-binding motif (green) clusters with YRE-A rather than YRE-O motifs. (D) The YRE-O site is enriched (star) among common ScYap1 and CgAp1 targets in other yeasts (hypergeometric test, Q < 0.05), but not Cg (Q ~ 1.0). The YRE-A site is enriched among these targets in Cg (star) but not other yeasts. (E) Compensatory mutations in both trans and cis maintain AP-1 binding.
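The comparison of site preferences between CgAp1 and ScYap1 rests on Fisher's exact test applied to a 2 × 2 table of promoter counts. A minimal sketch follows, assuming the table is built from the counts quoted in the text (49 YRE-A vs. 4 YRE-O promoters for CgAp1, and 12 YRE-A vs. 21 YRE-O for ScYap1) and assuming a two-sided test. The function name is hypothetical, and because the software and exact table used in the study are not specified here, the result need not match the reported P-value exactly.

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))

# Rows: CgAp1, ScYap1; columns: YRE-A, YRE-O (counts quoted in the text).
p_value = fisher_exact_two_sided(49, 4, 12, 21)
print(f"{p_value:.2e}")
```

Using exact integer binomial coefficients avoids the underflow that a factorial-based floating-point implementation can run into for larger tables.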
Discussion
Which mutation came first: the cis or trans? It is possible to envision two equally plausible scenarios (Fig. 3E): (1) An initial mutation in the Yap1 TF provided selective pressure for subsequent cis-regulatory changes in Yap1 targeted genes; (2) a change from YRE-O to YRE-A-binding site in key Yap1 target(s) provided selective pressure for a mutation in the Yap1 TF. In either scenario, mutations in trans and cis may have been facilitated by other AP-1 family members. The large size and interconnectivity of the AP-1 family may serve as a buffer for accumulation of cis and trans mutations, allowing for highly plastic evolution of the AP-1 regulatory network. In support of this hypothesis, several yAP-1s have been shown to bind each other along with common target genes, which might compensate for some loss in regulation by paralogs (Tan et al. 2008).
Examination of the protein sequences of all AP-1 family members across 20 available yeast genomes (Wapinski et al. 2007) suggests that mutations in residue 12 have occurred frequently during AP-1 family evolution (Supplemental Fig. 4). Interestingly, we found that all yeasts possess at least one AP-1 TF with Arg 12 (Fig. 4). In contrast, several yeasts lack AP-1 TFs with Lys 12, and these species are the most evolutionarily diverged from Sc. These results suggest that the common yeast AP-1 ancestor encoded arginine and that the emergence of TFs using lysine is a more recent evolutionary innovation (Supplemental material, AP-1 Family Ancestry). Within the Candida clade, several species (C. tropicalis, C. albicans, C. parapsilosis, and Lodderomyces elongisporus) have AP-1 families based exclusively on Arg 12, while others (C. lusitaniae, Debaryomyces hansenii, and C. guilliermondii) (Fig. 4) represent both Arg 12 and Lys 12 across the AP-1 family. This suggests two equally plausible scenarios for the emergence of Lys 12 in yAP-1 TFs: (1) Lysine emerged following the divergence of Yarrowia lipolytica from other hemi-ascomycetes, followed by a lineage-specific loss within the Candida clade. (2) Lysine emerged following the split of the Candida clade from the rest of the hemi-ascomycetes and emerged again within the Candida clade. In either scenario, a switch from arginine (coded by AGA or AGG) to lysine (coded by AAA or AAG) could be accomplished by a simple single base-pair mutation.
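The single base-pair argument can be checked by enumerating codon pairs. The short sketch below assumes the standard genetic code (six arginine codons, two lysine codons) and uses hypothetical helper names; it lists every arginine codon that lies one substitution away from a lysine codon.

```python
# Standard genetic code codons for arginine and lysine.
ARG_CODONS = ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"]
LYS_CODONS = ["AAA", "AAG"]

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length codons differ."""
    return sum(x != y for x, y in zip(a, b))

def single_step_pairs(src, dst):
    """All (source, destination) codon pairs separated by one substitution."""
    return [(s, d) for s in src for d in dst if hamming(s, d) == 1]

print(single_step_pairs(ARG_CODONS, LYS_CODONS))
# [('AGA', 'AAA'), ('AGG', 'AAG')]
```

Only the AGA and AGG codons are one step from lysine, in each case through a G-to-A change at the second position; the four CGN arginine codons require at least two substitutions.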
Open questions still remain regarding how arginine and lysine substitutions alter AP-1 DNA-binding motif preference. One hypothesis is that differences in their electrostatic charges alter the space required to accommodate other positively charged residues of the bZIP DNA-binding domain without electrostatic repulsion (Kim and Struhl 1995). This hypothesis suggests that the most positively charged residues such as arginine should be associated with YRE-A rather than YRE-O sites (Kim and Struhl 1995). However, in our findings lysine (pI = 9.59) rather than arginine (pI = 11.15) was associated with YRE-A sites. An alternative explanation involves a role for AP-1-induced DNA flexibility. Complexes of yAP-1 protein with the YRE-O DNA sequence have been associated with an increase in incorporated water molecules (Dragan et al. 2004a) leading to a decrease in DNA flexibility (Kim and Struhl 1995) compared with yAP-1/YRE-A complexes. A previous report has suggested that changes in DNA flexibility play a key role in determining half-site spacing preference and are responsible for differences between in vivo and in vitro measurements (Suckow and Hollenberg 1998). Since residue 12 is in close proximity to DNA (Fig. 1B) within the protein–DNA complex, residue changes may affect the ability of DNA to incorporate water during binding, thus affecting both yAP-1 DNA motif flexibility and binding (Dragan et al. 2004b). Interestingly, the higher positive charge of arginine induces a stronger dipole than that of lysine, providing a possible mechanism for the increase in the number of incorporated water molecules present at YRE-O sites and associated changes in DNA flexibility and binding preference.
In summary, we have shown that conservation of the AP-1 regulatory program in yeast occurs through coordinated evolution of both the sequence of the TF (trans) and in its DNA-binding motifs (cis). This finding echoes that of previous studies of protein–protein interaction, which have demonstrated cases in which compensatory mutations are required to maintain protein interaction over evolutionary time (Pazos and Valencia 2008). In the context of transcriptional networks, coevolution gives rise to "regulatory homeostasis," in which both mutations in a TF and its DNA-binding motif occur in compensatory fashion to maintain transcriptional regulation. This series of compensatory mutations, which maintains both the transcriptional circuit and regulatory logic, parallels that of previous work demonstrating evolution of alternative transcriptional circuits producing identical logic (Tsong et al. 2006). Such systems of tightly coupled compensatory mutations might serve to counter the widespread divergence observed in transcriptional networks, and may constitute a general evolutionary mechanism maintaining the regulation of transcriptional networks.
Figure 4. All yeasts in the species phylogeny (Wapinski et al. 2007) possess an AP-1 with an arginine. The Candida clade (shaded green) and the whole-genome duplication event (orange star) are noted. Note that the Candida clade does not include Candida glabrata.
Methods
Yeast strains
All immunoprecipitations were performed on strains where the appropriate gene has been endogenously fused to the TAP epitope (Rigaut et al. 1999). Sc TAP-tagged strains were obtained from Open Biosystems. In Cg, the 2001HTU strain was used for TAP-tagging (see below) and deletions. Cg 2001HTU and NCCLS84 were obtained from ATCC.
ScYap1.R79K and ScYap4.K252R mutants
Endogenously epitope-tagged ScYap1::TAP and ScYap4::TAP strains (Open Biosystems) were used to introduce the appropriate mutation (ScYap1.R79K, ScYap4.K252R) via the delitto perfetto method (Storici et al. 2001). In brief, ScYAP1 and ScYAP4 were disrupted with the URA3 selectable marker from pRS306 (Brachmann et al. 1998) using ~100-bp homology. Complementary 200-bp oligos with a mutation (ScYap1.R79K or ScYap4.K252R) were then transformed by electroporation (Thompson et al. 1998) to remove URA3 by 5-FOA (US Biological) selection and verified by sequencing. This process creates strains possessing endogenous yAP-1 proteins having both the desired mutation and epitope.
CgAp1 TAP-tagged strain
The TAP tag was amplified with ~100-bp homology (CgAP1) from pFA6a-TAP-HIS3MX6 (Longtine et al. 1998), transformed by electroporation (Thompson et al. 1998), and selected on complete –his media (Amberg et al. 2005). C-terminal integration was verified by PCR and DNA sequencing, with protein expression verified by immunoblot (Amberg et al. 2005) with the peroxidase antiperoxidase antibody (Sigma P1291).
Growth conditions, mRNA expression, and ChIP
Three (CgAp1) or two (ScYap1.R79K, ScYap4.K252R, ScYap3) biological replicates were grown from OD600 0.2 to 0.8 at 30°C in complete media (Amberg et al. 2005) and treated with 0.03% methyl methanesulfonate (Sigma) for 1 h as performed previously (Tan et al. 2008). For mRNA expression analysis, total RNA was isolated by hot phenol/chloroform extraction and labeled with Cy3 or Cy5 dyes (Invitrogen) (Kuo et al. 2010). Samples were hybridized to Agilent expression arrays and washed as recommended by Agilent (Agilent Technologies). For ChIP, all TAP-tagged strains were treated as previously described (Tan et al. 2008). In brief, cells were fixed with 1% formaldehyde for 20 min, inactivated with glycine and washed
Downloaded from genome.cshlp.org on December 25, 2010 - Published by Cold Spring Harbor Laboratory Press
with TBS. Cells were lysed for 2 h (Vibrax-VXR 2000) with glass beads and sonicated for four cycles of 20 sec (+100-sec rest) at power setting 2 (Misonix Sonicator 3000) on ice. Lysate was incubated overnight with Dynabeads M-280 conjugated with anti-TAP antibody (Open Biosystems CAB1001). Cross-links were reversed overnight at 65°C for both antibody-enriched and unenriched DNA, which was then amplified (Sigma-Aldrich) and labeled (Invitrogen) with Cy5/Cy3 dyes. Sc and Cg samples were hybridized to commercial or custom (see below) Agilent tiling arrays and washed as recommended by Agilent (Agilent Technologies).
Cg tiling microarray design and validation
We designed a custom microarray tiling the Cg genome at ~250-bp resolution with ~44,000 60-mer probes designed to avoid self-dimerization and variability in melting temperatures (Mfold; Zuker 2003), low-complexity and repetitive sequences (RepeatMasker; http://repeatmasker.org), and cross-hybridization (WU-BLAST2; Altschul and Gish 1996). Default settings were used for each program. Microarrays were manufactured using Agilent technology (Agilent Technologies). ChIP results were validated by qPCR of five targets compared against CgACT1 (Supplemental Fig. 5).
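As a rough illustration of the tiling step, the sketch below walks a sequence at a fixed stride and emits 60-mer candidates, skipping windows that contain long homopolymer runs. The function name `tile_probes`, the homopolymer filter, and the toy sequence are assumptions made for illustration only; the actual design relied on Mfold, RepeatMasker, and WU-BLAST2 filters, none of which are reproduced here.

```python
def tile_probes(seq, probe_len=60, step=250, max_homopolymer=8):
    """Return (start, probe) pairs spaced `step` bp apart along `seq`.

    Windows containing a homopolymer run longer than `max_homopolymer`
    are skipped. This filter is a simple stand-in (an assumption) for the
    low-complexity screening described in the text.
    """
    probes = []
    for start in range(0, len(seq) - probe_len + 1, step):
        probe = seq[start:start + probe_len]
        longest = run = 1
        for prev, cur in zip(probe, probe[1:]):
            run = run + 1 if cur == prev else 1
            longest = max(longest, run)
        if longest <= max_homopolymer:
            probes.append((start, probe))
    return probes


# Toy 1,000-bp sequence (hypothetical, not taken from the Cg genome).
toy = "ACGTTGCA" * 125
probes = tile_probes(toy)
```

At a stride of about 250 bp, roughly 44,000 probes cover on the order of 11 Mb (44,000 × 250 bp), which gives a sense of the scale implied by the stated design.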
Microarray data processing
Intensities were background subtracted and normalized by LOESS (Smyth 2005). Expression microarrays were analyzed using the limma package (Smyth 2005) with default parameters. ChIP tiling array errors were estimated by the Rosetta error model (Weng et al. 2006) with resulting P-values of binding for each promoter calculated by combining P-values of adjacent probes as previously described (Tan et al. 2008).
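The text does not restate how P-values of adjacent probes were combined, deferring to Tan et al. (2008). Purely as an illustration of the general idea, the sketch below uses Fisher's method, which is an assumption and may differ from the procedure actually used; it also treats the probes as independent, which adjacent tiling probes generally are not.

```python
import math


def fisher_combine(pvalues):
    """Combine P-values with Fisher's method (assumed stand-in).

    The statistic X = -2 * sum(ln p) follows a chi-square distribution
    with 2k degrees of freedom under independence; for even degrees of
    freedom the upper tail has the closed form
    exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    All P-values must be greater than zero.
    """
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one P-value")
    half_x = -sum(math.log(p) for p in pvalues)  # this is X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total


# Two hypothetical adjacent probes, each with P = 0.05.
combined = fisher_combine([0.05, 0.05])
```

With a single probe the function returns the input P-value unchanged, which is a convenient sanity check.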
Phylogenetic trees, orthologs, and evolutionary trace analysis
The yAP-1 DNA-binding domain phylogenetic tree was created with BEAST (Drummond and Rambaut 2007) using default settings. Sequences, species trees, and orthologs were obtained from the Fungal Orthogroups Repository (Wapinski et al. 2007). Multiple-sequence alignment was performed using MUSCLE (Edgar 2004) with default parameters. Evolutionary trace analysis was performed using TraceSuite II (Innis et al. 2000) with default settings.
Motif finding
De novo motifs were identified by SOMBRERO (Mahony et al. 2005) using default parameters and compared with literature (MacIsaac et al. 2006; Tan et al. 2008) using the default settings for STAMP (Mahony et al. 2005). Promoters were scanned with the default settings for Patser (Hertz and Stormo 1999) for motif enrichment by the hypergeometric test with multiple test correction (Storey and Tibshirani 2003). A motif was considered "present" in a promoter when the fraction of maximal information content was ≥0.7.
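The enrichment test named above can be sketched with the standard library alone. The function and argument names below are hypothetical, and the sketch omits the multiple-test correction (Storey and Tibshirani 2003) as well as the Patser scan that would produce the score fractions.

```python
from math import comb


def hypergeom_enrichment(n_total, n_with_motif, n_bound, n_overlap):
    """Upper-tail hypergeometric P-value.

    Probability of observing at least `n_overlap` motif-containing
    promoters among `n_bound` bound promoters, when `n_with_motif` of
    `n_total` promoters contain the motif.
    """
    upper = min(n_with_motif, n_bound)
    numerator = sum(
        comb(n_with_motif, k) * comb(n_total - n_with_motif, n_bound - k)
        for k in range(n_overlap, upper + 1)
    )
    return numerator / comb(n_total, n_bound)


def motif_present(score_fraction, threshold=0.7):
    """Call a motif 'present' when its score fraction is >= threshold."""
    return score_fraction >= threshold


# Hypothetical toy numbers: 10 promoters, 5 with the motif, 5 bound,
# all 5 bound promoters carrying the motif.
p_value = hypergeom_enrichment(10, 5, 5, 5)
```

Using exact binomial coefficients keeps the sketch dependency-free; for genome-scale counts a statistics library would normally be the more practical choice.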
Acknowledgments
We thank Jonathan Weissman for providing pFA6a-TAP-HIS3MX6, Paul Russell for pFA6a-kanMX6, Nevan Krogan for pFA6a-natMX6, and Richard Kolodner for pRS303 and pRS306. We also thank Ilan Wapinski, Lorraine Pillus, and members of the T.I. laboratory for helpful discussions. D.K. was supported by the Natural Sciences and Engineering Research Council of Canada. K.T. and T.I. were supported by the David and Lucile Packard Foundation and NIH grant no. R01 ES014811 to T.I.
Author contributions: D.K., K.T., T.R., and T.I. designed the study. D.K., K.L., S.B., R.C., J.C., and C.L. performed the experimental work. D.K. and K.T. analyzed the data. D.K. and T.I. wrote the manuscript. K.T. and T.I. supervised the work.
References
Altschul SF, Gish W. 1996. Local alignment statistics. Methods Enzymol 266: 460–480.
Amberg DC, Burke DJ, Strathern JN. 2005. Methods in yeast genetics: A Cold Spring Harbor Laboratory course manual. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
Berger MF, Badis G, Gehrke AR, Talukder S, Philippakis AA, Pena-Castillo L, Alleyne TM, Mnaimneh S, Botvinnik OB, Chan ET, et al. 2008. Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences. Cell 133: 1266–1276.
Borneman AR, Gianoulis TA, Zhang ZD, Yu H, Rozowsky J, Seringhaus MR, Wang LY, Gerstein M, Snyder M. 2007. Divergence of transcription factor binding sites across related yeast species. Science 317: 815–819.
Brachmann CB, Davies A, Cost GJ, Caputo E, Li J, Hieter P, Boeke JD. 1998. Designer deletion strains derived from Saccharomyces cerevisiae S288C: A useful set of strains and plasmids for PCR-mediated gene disruption and other applications. Yeast 14: 115–132.
Bradley RK, Li X-Y, Trapnell C, Davidson S, Pachter L, Chu HC, Tonkin LA, Biggin MD, Eisen MB. 2010. Binding site turnover produces pervasive quantitative changes in transcription factor binding between closely related Drosophila species. PLoS Biol 8: e1000343. doi: 10.1371/journal.pbio.1000343.
Brem RB, Kruglyak L. 2005. The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proc Natl Acad Sci 102: 1572–1577.
Bullard JH, Mostovoy Y, Dudoit S, Brem RB. 2010. Polygenic and directional regulatory evolution across pathways in Saccharomyces. Proc Natl Acad Sci 107: 5058–5063.
Cliften P, Sudarsanam P, Desikan A, Fulton L, Fulton B, Majors J, Waterston R, Cohen BA, Johnston M. 2003. Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science 301: 71–76.
Dragan AI, Frank L, Liu Y, Makeyeva EN, Crane-Robinson C, Privalov PL. 2004a. Thermodynamic signature of GCN4-bZIP binding to DNA indicates the role of water in discriminating between the AP-1 and ATF/CREB sites. J Mol Biol 343: 865–878.
Dragan AI, Liu Y, Makeyeva EN, Privalov PL. 2004b. DNA-binding domain of GCN4 induces bending of both the ATF/CREB and AP-1 binding sites of DNA. Nucleic Acids Res 32: 5192–5197.
Drummond AJ, Rambaut A. 2007. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol Biol 7: 214. doi: 10.1186/1471-2148-7-214.
Edgar RC. 2004. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32: 1792–1797.
Emerson JJ, Hsieh LC, Sung HM, Wang TY, Huang CJ, Lu HH, Lu MY, Wu SH, Li WH. 2010. Natural selection on cis and trans regulation in yeasts. Genome Res 20: 826–836.
Fernandes L, Rodrigues-Pousada C, Struhl K. 1997. Yap, a novel family of eight bZIP proteins in Saccharomyces cerevisiae with distinct biological functions. Mol Cell Biol 17: 6982–6993.
Field Y, Fondufe-Mittendorf Y, Moore IK, Mieczkowski P, Kaplan N, Lubling Y, Lieb JD, Widom J, Segal E. 2009. Gene expression divergence in yeast is coupled to evolution of DNA-encoded nucleosome organization. Nat Genet 41: 438–445.
Fujii Y, Shimizu T, Toda T, Yanagida M, Hakoshima T. 2000. Structural basis for the diversity of DNA recognition by bZIP transcription factors. Nat Struct Biol 7: 889–893.
Gasch AP, Moses AM, Chiang DY, Fraser HB, Berardini M, Eisen MB. 2004. Conservation and evolution of cis-regulatory systems in ascomycete fungi. PLoS Biol 2: e398. doi: 10.1371/journal.pbio.0020398.
Gerke J, Lorenz K, Cohen B. 2009. Genetic interactions between transcription factors cause natural variation in yeast. Science 323: 498–501.
Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, MacIsaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, et al. 2004. Transcriptional regulatory code of a eukaryotic genome. Nature 431: 99–104.
Hertz GZ, Stormo GD. 1999. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15: 563–577.
Hittinger CT, Carroll SB. 2007. Gene duplication and the adaptive evolution of a classic genetic switch. Nature 449: 677–681.
Hogues H, Lavoie H, Sellam A, Mangos M, Roemer T, Purisima E, Nantel A, Whiteway M. 2008. Transcription factor substitution during the evolution of fungal ribosome regulation. Mol Cell 29: 552–562.
Holt LJ, Tuch BB, Villen J, Johnson AD, Gygi SP, Morgan DO. 2009. Global analysis of Cdk1 substrate phosphorylation sites provides insights into evolution. Science 325: 1682–1686.
Ihmels J, Bergmann S, Gerami-Nejad M, Yanai I, McClellan M, Berman J, Barkai N. 2005. Rewiring of the yeast transcriptional network through the evolution of motif usage. Science 309: 938–940.
Innis CA, Shi J, Blundell TL. 2000. Evolutionary trace analysis of TGF-beta and related growth factors: Implications for site-directed mutagenesis. Protein Eng 13: 839–847.
Kasowski M, Grubert F, Heffelfinger C, Hariharan M, Asabere A, Waszak SM, Habegger L, Rozowsky J, Shi M, Urban AE, et al. 2010. Variation in transcription factor binding among humans. Science 328: 232–235.
Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES. 2003. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423: 241–254.
Kim J, Struhl K. 1995. Determinants of half-site spacing preferences that distinguish AP-1 and ATF/CREB bZIP domains. Nucleic Acids Res 23: 2531–2537.
Kuo D, Tan K, Zinman G, Ravasi T, Bar-Joseph Z, Ideker T. 2010. Evolutionary divergence in the fungal response to fluconazole revealed by soft clustering. Genome Biol 11: R77. doi: 10.1186/gb-2010-11-7-r77.
Landry CR, Wittkopp PJ, Taubes CH, Ranz JM, Clark AG, Hartl DL. 2005. Compensatory cis-trans evolution and the dysregulation of gene expression in interspecific hybrids of Drosophila. Genetics 171: 1813–1822.
Lavoie H, Hogues H, Mallick J, Sellam A, Nantel A, Whiteway M. 2010. Evolutionary tinkering with conserved components of a transcriptional regulatory network. PLoS Biol 8: e1000329. doi: 10.1371/journal.pbio.1000329.
Longtine MS, McKenzie A III, Demarini DJ, Shah NG, Wach A, Brachat A, Philippsen P, Pringle JR. 1998. Additional modules for versatile and economical PCR-based gene deletion and modification in Saccharomyces cerevisiae. Yeast 14: 953–961.
MacIsaac KD, Wang T, Gordon DB, Gifford DK, Stormo GD, Fraenkel E. 2006. An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC Bioinformatics 7: 113. doi: 10.1186/1471-2105-7-113.
Maerkl SJ, Quake SR. 2009. Experimental determination of the evolvability of a transcription factor. Proc Natl Acad Sci 106: 18650–18655.
Mahony S, Golden A, Smith TJ, Benos PV. 2005. Improved detection of DNA motifs using a self-organized clustering of familial binding profiles. Bioinformatics 21: i283–i291.
Pazos F, Valencia A. 2008. Protein co-evolution, co-adaptation and interactions. EMBO J 27: 2648–2655.
Rigaut G, Shevchenko A, Rutz B, Wilm M, Mann M, Seraphin B. 1999. A generic protein purification method for protein complex characterization and proteome exploration. Nat Biotechnol 17: 1030–1032.
Rodrigues-Pousada C, Menezes RA, Pimentel C. 2010. The Yap family and its role in stress response. Yeast 27: 245–258.
Sankaran VG, Xu J, Ragoczy T, Ippolito GC, Walkley CR, Maika SD, Fujiwara Y, Ito M, Groudine M, Bender MA, et al. 2009. Developmental and species-divergent globin switching are driven by BCL11A. Nature 460: 1093–1097.
Schmidt D, Wilson MD, Ballester B, Schwalie PC, Brown GD, Marshall A, Kutter C, Watt S, Martinez-Jimenez CP, Mackay S, et al. 2010. Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding. Science 328: 1036–1040.
Shaulian E, Karin M. 2002. AP-1 as a regulator of cell life and death. Nat Cell Biol 4: E131–E136.
Smyth GK. 2005. Limma: Linear models for microarray data. In Bioinformatics and computational biology solutions using R and Bioconductor (ed. VCR Gentleman et al.), pp. 397–420. Springer, New York.
Stark A, Lin MF, Kheradpour P, Pedersen JS, Parts L, Carlson JW, Crosby MA, Rasmussen MD, Roy S, Deoras AN, et al. 2007. Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature 450: 219–232.
Storey JD, Tibshirani R. 2003. Statistical significance for genomewide studies. Proc Natl Acad Sci 100: 9440–9445.
Storici F, Lewis LK, Resnick MA. 2001. In vivo site-directed mutagenesis using oligonucleotides. Nat Biotechnol 19: 773–776.
Suckow M, Hollenberg CP. 1998. The activation specificities of wild-type and mutant Gcn4p in vivo can be different from the DNA binding specificities of the corresponding bZip peptides in vitro. J Mol Biol 276: 887–902.
Suckow M, Kisters-Woike B, Hollenberg CP. 1999. A novel feature of DNA recognition: A mutant Gcn4p bZip peptide with dual DNA binding specificities dependent of half-site spacing. J Mol Biol 286: 983–987.
Sung HM, Wang TY, Wang D, Huang YS, Wu JP, Tsai HK, Tzeng J, Huang CJ, Lee YC, Yang P, et al. 2009. Roles of trans and cis variation in yeast intraspecies evolution of gene expression. Mol Biol Evol 26: 2533–2538.
Tan K, Feizi H, Luo C, Fan SH, Ravasi T, Ideker TG. 2008. A systems approach to delineate functions of paralogous transcription factors: Role of the Yap family in the DNA damage response. Proc Natl Acad Sci 105: 2934–2939.
Tanay A, Regev A, Shamir R. 2005. Conservation and evolvability in regulatory networks: The evolution of ribosomal regulation in yeast. Proc Natl Acad Sci 102: 7203–7208.
Thompson JR, Register E, Curotto J, Kurtz M, Kelly R. 1998. An improved protocol for the preparation of yeast cells for transformation by electroporation. Yeast 14: 565–571.
Tirosh I, Reikhav S, Levy AA, Barkai N. 2009. A yeast hybrid provides insight into the evolution of gene expression regulation. Science 324: 659–662.
Tsong AE, Tuch BB, Li H, Johnson AD. 2006. Evolution of alternative transcriptional circuits with identical logic. Nature 443: 415–420.
Tuch BB, Galgoczy DJ, Hernday AD, Li H, Johnson AD. 2008. The evolution of combinatorial gene regulation in fungi. PLoS Biol 6: e38. doi: 10.1371/journal.pbio.0060038.
Wagner GP, Lynch VJ. 2008. The gene regulatory logic of transcription factor evolution. Trends Ecol Evol 23: 377–385.
Wapinski I, Pfeffer A, Friedman N, Regev A. 2007. Natural history and evolutionary principles of gene duplication in fungi. Nature 449: 54–61.
Wapinski I, Pfiffner J, French C, Socha A, Thompson DA, Regev A. 2010. Gene duplication and the evolution of ribosomal protein gene regulation in yeast. Proc Natl Acad Sci 107: 5505–5510.
Weng L, Dai H, Zhan Y, He Y, Stepaniants SB, Bassett DE. 2006. Rosetta error model for gene expression analysis. Bioinformatics 22: 1111–1121.
Wilson MD, Barbosa-Morais NL, Schmidt D, Conboy CM, Vanes L, Tybulewicz VL, Fisher EM, Tavare S, Odom DT. 2008. Species-specific transcription in mice carrying human chromosome 21. Science 322: 434–438.
Wittkopp PJ, Haerum BK, Clark AG. 2008. Regulatory changes underlying expression differences within and between Drosophila species. Nat Genet 40: 346–350.
Wray GA. 2007. The evolutionary significance of cis-regulatory mutations. Nat Rev Genet 8: 206–216.
Zheng W, Zhao H, Mancera E, Steinmetz LM, Snyder M. 2010. Genetic analysis of variation in transcription factor binding in yeast. Nature 464: 1187–1191.
Zuker M. 2003. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 31: 3406–3415.
Received June 15, 2010; accepted in revised form September 20, 2010.
Research
RNA synthesis precision is regulated by preinitiation complex turnover
Kunal Poorey, Rebekka O. Sprouse,1 Melissa N. Wells,1 Ramya Viswanathan, Stefan Bekiranov,2 and David T. Auble2
Department of Biochemistry and Molecular Genetics, University of Virginia Health System, Charlottesville, Virginia 22908, USA
1 These authors contributed equally to this work.
2 Corresponding authors. E-mail [email protected]; fax (434) 924-5069. E-mail [email protected].
Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.109504.110.
TATA-binding protein (TBP) nucleates the assembly of the transcription preinitiation complex (PIC), and although TBP can bind promoters with high stability in vitro, recent results establish that virtually the entire TBP population is highly dynamic in yeast nuclei in vivo. This dynamic behavior is surprising in light of models that posit that a stable TBP-containing scaffold facilitates transcription reinitiation at active promoters. The dynamic behavior of TBP is a consequence of the enzymatic activity of the essential Snf2/Swi2 ATPase Mot1, suggesting that ensuring a highly mobile TBP population is critical for transcriptional regulation on a global scale. Here high-resolution tiling arrays were used to define how perturbed TBP dynamics impact the precision of RNA synthesis in Saccharomyces cerevisiae. We find that Mot1 plays a broad role in establishing the precision and efficiency of RNA synthesis: In mot1-42 cells, RNA length changes were observed for 713 genes, about twice the number observed in set2Δ cells, which display a previously reported propensity for spurious initiation within open reading frames. Loss of Mot1 led to both aberrant transcription initiation and termination, with prematurely terminated transcripts representing the largest class of events. Genetic and genomic analyses support the conclusion that these effects on RNA length are mechanistically tied to dynamic TBP occupancies at certain types of promoters. These results suggest a new model whereby dynamic disassembly of the PIC can influence productive RNA synthesis.
[Supplemental material is available online at http://www.genome.org. The microarray data from this study have been submitted to the NCBI Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo) under accession no. GSE18283.]
The RNA polymerase II (Pol II) transcription machinery consists of a collection of general transcription factors (GTFs) and the multisubunit Pol II enzyme itself (Reese 2003; Hahn 2004). Assembly of the Pol II preinitiation complex (PIC) on promoters is highly orchestrated by transcriptional regulators and coregulators that influence GTF recruitment by direct interaction with the transcription machinery and by modulating the promoter chromatin template (Hahn 2004). PIC assembly is nucleated by the TATA-binding protein (TBP), which physically interacts with multiple GTFs and DNA. TBP recruitment to promoters is often rate limiting for transcription in vivo (Pugh 2000).
Interaction of the TBP saddle with the TATA box results in severe bending and unwinding of the DNA (Burley and Roeder 1996). In vitro, the resultant complex forms a specialized, long-lived substrate for accrual of the other GTFs. Biochemical evidence indicates that a TBP-containing subcomplex remains on promoter DNA following the departure of Pol II (Hahn 2004). This complex, termed the scaffold, can facilitate transcription reinitiation in vitro (Hahn 2004). Although the in vitro evidence in support of a stable reinitiation intermediate is strong, PIC dynamics may be influenced by other factors in vivo. For example, stable TBP–DNA binding is antagonized by Mot1, a Snf2/Swi2-related ATPase that dissociates the TBP–DNA complex (Auble 2009). As another example, the NC2 heterodimer interacts with TBP to form an encircling clamp that allows TBP to diffuse along the DNA contour (Kamada et al. 2001; Schluesche et al. 2007). In fact, recent measurements of TBP mobility in living yeast cells demonstrate that all detectable TBP is highly mobile, displaying Mot1-dependent FRAP recovery times of <15 sec (Sprouse et al. 2008). Importantly, while the recovery times are rapid, they are markedly slower than can be explained by diffusion and are instead consistent with transient interaction with chromatin. This suggests that the entire (or nearly entire) TBP pool is rapidly recycled, leading to rapid redistribution of TBP among chromatin binding sites.
Several fundamental questions are raised by the observed high mobility of TBP in vivo. If TBP is rapidly recycled from sites on chromatin, what is the nature of these sites? Given the pervasive RNA synthesis in yeast cells under these conditions, it would appear that there may be active promoters for which PICs are rapidly recycled. If this is true, how and why are such dynamics important for promoter function? When TBP dynamics are compromised, are new or different types of RNA made, or is simply the quantity at the annotated genes changed? To begin to address these questions, we developed a general genomic strategy to identify aberrant RNA species in mutant strains of interest. Surprisingly, we find that compromising TBP dynamics via a conditional mutation of Mot1 gave rise to many hundreds of changes in RNA length, the largest category of which includes transcripts that were apparently initiated properly but failed to reach the end of the gene. In parallel, we determined how Mot1 affects TBP occupancy genome-wide for comparison with the RNA effects. The results support a model in which Mot1-mediated TBP dynamics at the promoter influence transcription elongation efficiency. These results argue that in contrast to prevailing views, at many promoters, PIC dynamics can play an important role in conferring efficiency and accuracy of transcription elongation.
Results
We first compared RNA from wild-type (WT) and mot1-42 yeast cells using Affymetrix genomic tiling arrays that interrogate the
20:1679–1688 © 2010 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/10; www.genome.org
yeast genome at 5-bp resolution. mot1-42 is a temperature-sensitive allele that encodes a protein that is biochemically inactive in vitro (Darst et al. 2003), and prior work established that this allele induces changes in gene expression in vivo that closely parallel other severe, conditional mot1 alleles (Dasgupta et al. 2002). WT and mutant strains were grown at permissive temperature (30°C) and then shifted for 45 min to 35°C prior to harvesting RNA. This temperature shift did not impair growth of WT cells but did dramatically inhibit growth of the mot1-42 cells (Darst et al. 2003). We calculated the gene expression changes from the tiling array data using methods similar to those applied to conventional gene expression arrays (see Methods). These expression changes were well
correlated with expression changes defined previously (Supplemental Fig. 1; Sprouse et al. 2009), thus validating the use of tiling arrays for quantitative RNA analysis. Although Mot1 exerts a global effect on transcription, most of these transcriptional effects are modest in magnitude. To better understand why Mot1-catalyzed TBP recycling is essential, we took advantage of the tiling arrays to determine whether Mot1-mediated TBP dynamics affect RNA precision as well as quantity. To test this idea, we developed a method to capture significant RNA hybridization signals that deviate from gene annotations (Fig. 1A,B). This approach was possible because in the overwhelming majority of cases, the WT RNA signals were closely aligned with annotated
Figure 1. Global changes in RNA length in mot1-42 and set2Δ cells. (A) Integrated Genome Browser (Nicol et al. 2009) screen shot of log2 RNA profiles in WT (black) and mot1-42 (red) cells. The differential RNA profile (log2 mot1-42/WT) is shown in the blue track. Annotated genes are in green, with positions above or below the chromosomal coordinate indicating whether the genes are transcribed on the top or bottom strand. The short black and red horizontal bars denote RNA length changes captured by the method, as illustrated more clearly in panel B. (B) Overview of the method, illustrated with an enlargement of the chromosomal region in the center of the screen shot in A. In the case of the gene YBR284W, the significant differential RNA segment found using signal log2(mot1-42/WT) > 0.3 (blue segment) was compared with the annotation, and aberrant transcript length changes (denoted by the red and the black segments) were calculated. Thresholds requiring a ≥100-bp overlapping gene segment and length changes ≥150 bp long were applied. S_overlap and S_ext refer to the differential RNA signal in the region where the differential segment overlaps (orange segment) and is external (black segment) to the annotation, respectively. S_int refers to the differential signal within the portion of the annotation that does not overlap the differential segment (red segment). By provisionally defining the differential RNA with respect to the overlapping gene, the aberrant RNAs were sorted among four groups (see text). In this example, a 5′ length change is classified as an upstream initiation event, whereas the partial overlapping gene segment and consequent change in 3′ length define a premature termination event. Although this example shows, for illustrative purposes, a gene with two types of RNA length changes, most of the genes with length changes in mot1-42 cells (92.4%) displayed only a single type as indicated in the table in panel C. (C) The table summarizes the number of aberrant RNAs identified from the differential RNA signal as in B in mot1-42 and set2Δ cells compared with WT cells. Of particular note is the large number of premature termination events in mot1-42 cells and the enrichment in downstream (cryptic) initiation events in set2Δ cells.
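The coordinate comparison described in this legend can be sketched as follows. This is one reading of the method under stated assumptions: the function name, the half-open coordinate convention, and the use of segment coordinates alone (without the S_overlap, S_ext, and S_int signal values) are all illustrative choices, not the authors' implementation.

```python
def classify_length_changes(gene_start, gene_end, strand,
                            seg_start, seg_end,
                            min_overlap=100, min_change=150):
    """Classify a differential RNA segment relative to one gene annotation.

    Coordinates are start-inclusive, end-exclusive on the top strand.
    Returns a list of event labels (possibly empty).
    """
    overlap = min(gene_end, seg_end) - max(gene_start, seg_start)
    if overlap < min_overlap:
        return []

    # Segment lying outside the gene (ext) and gene not covered by the
    # segment (int), on the left and right in top-strand coordinates.
    left_ext = max(0, gene_start - seg_start)
    right_ext = max(0, seg_end - gene_end)
    left_int = max(0, seg_start - gene_start)
    right_int = max(0, gene_end - seg_end)

    # Orient left/right as 5'/3' according to the gene's strand.
    if strand == "+":
        ext5, ext3, int5, int3 = left_ext, right_ext, left_int, right_int
    else:
        ext5, ext3, int5, int3 = right_ext, left_ext, right_int, left_int

    events = []
    if ext5 >= min_change:
        events.append("upstream initiation")
    if int5 >= min_change:
        events.append("downstream initiation")
    if int3 >= min_change:
        events.append("premature termination")
    if ext3 >= min_change:
        events.append("downstream termination")
    return events


# Hypothetical top-strand gene at 1000-3000 with a segment at 700-2000.
example = classify_length_changes(1000, 3000, "+", 700, 2000)
```

Returning a list rather than a single label allows for cases like the YBR284W example, where one gene shows two types of length change.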
genes (an example is shown in Fig. 1A, and others are discussed below). By comparing the coordinates of gene annotations with differential RNA segments, a methodology was developed that can detect RNA segments that extend outside a gene annotation, or differential signals that overlap with only a portion of an open reading frame (ORF). The gene-dense nature of the budding yeast genome presented challenges for the analysis because differential RNA signals could potentially overlap with more than one gene annotation, giving rise to different types of apparent RNA length changes based on the relative orientation of the two genes (Supplemental Fig. S2). Nonetheless, it was possible to segregate the differential RNA signals into four categories in which the RNA segment overlap was defined provisionally with respect to the gene annotation (Fig. 1B). Thus, "upstream initiation" segments are RNAs that extended upstream of the normal transcription start site (TSS); "downstream initiation" events are RNAs apparently initiated within the ORF (also known as cryptic initiation events); "premature termination" segments correspond to RNAs within an ORF that do not include the normal 3′ end; and "downstream termination" events are those in which RNA extended beyond the termination site in WT cells. Although the arrays do not provide information about the strand specificity of the hybridized RNA, the strand-specific analyses described below confirm that most, if not all, of the detected events correspond to changes in RNA sense strand abundance.
To validate the array-detected RNA length changes, RNA from set2Δ cells was compared with the WT using the same methodology. Loss of the H3K36 methyltransferase activity in set2Δ cells is well known to result in cryptic initiation within ORFs (Workman 2006). Consistent with the published data (Li et al. 2007), 206 instances of cryptic initiation were detected in set2Δ cells, and these "downstream initiation" events comprised the largest class of RNA length changes by far (Fig. 1C). Among the set2Δ-induced array-detected RNA length changes, we found and validated cryptic initiation in STE11 and PCA1, two genes previously shown to be susceptible to cryptic initiation in set2Δ cells (Fig. 2; Li et al. 2007).
Using the same approach, we characterized RNA length changes from differential mot1-42 versus WT RNA. Strikingly, twice as many aberrant RNA species were detected in mot1-42 cells compared with set2Δ cells (Fig. 1C), and the mot1-42-induced length changes had a significantly different distribution among the four length change classes. Notably, a Mot1 defect led to 338 genes showing "premature termination," the most prevalent class of events. The "premature termination" events fell into two categories: (1) "differential down" instances in which the RNA level was similar in the 5′ end of the ORF but diminished in the 3′ end of the ORF in mot1-42 cells compared with WT (77%) and (2) "differential up" instances in which a defect in Mot1 led to increased RNA in the 5′ portion of the ORF but the differential RNA failed to extend to the 3′ end of the ORF (23%). Although the mot1-42 and set2Δ data sets have different numbers of genes and different distributions among the length change categories, there are 12 genes that displayed premature termination in both mot1-42 and set2Δ cells. While few in number, the overlap is statistically significant (P = 0.02), suggesting that elongation efficiency in some genes may be under both Mot1 and Set2 control.
Selected RNA length changes were validated by real-time PCR, including two genes with premature termination defects (Fig. 3A–C). Validation data for four other array-detected RNA length changes are shown in Figure 4, A, B, E, and F. Note that strand-specific real-time PCR showed that the differential RNA effects are attributable largely, if not entirely, to changes in sense strand abundance.
The results thus far support a role for Mot1 in maintaining RNA profiles that match annotated genes but do not address whether this effect is mediated through transcription or some other effect on RNA processing. To address this question, chromatin immunoprecipitation (ChIP) was performed to assess
Figure 2. Confirmation of RNA length changes in set2Δ cells. (A–C) Integrated Genome Browser screenshots of log2 WT, set2Δ, and differential RNA levels (set2Δ/WT) for STE11, PCA1, and ACT1 genes. Bar graphs, relative RNA levels (arbitrary units) quantified by real-time PCR using sense- and antisense-specific 5′ and 3′ primer sets for each gene (shown as small black boxes above each gene). Average values are shown ±SE obtained by analysis of two independent RNA samples for each strain. Cryptic initiation was detected in STE11 and PCA1, whereas there was no significant change in ACT1 expression.
Genome Research www.genome.org
Downloaded from genome.cshlp.org on December 25, 2010 - Published by Cold Spring Harbor Laboratory Press
Poorey et al.
Figure 3. Validation of premature termination RNA length changes in mot1-42 cells and correlation with Pol II density. (A–C, top) Screenshots showing log2 WT, mot1-42, and differential RNA signals across PDR11, EMP47, and ACT1 genes. (Bottom) Relative RNA levels were quantified by real-time PCR for both the sense and antisense strands. Primers were specific for either the 5′ or 3′ end of each gene, and the amplified product is represented by the small black square above each gene. Results shown are the means of two independent RNA samples with associated SE. PDR11 is an example of a gene with increased 5′ transcription in mot1-42 cells (classified as differentially up), whereas EMP47 had similar levels of 5′ RNA but less 3′ RNA in mot1-42 versus WT cells (classified as differentially down). No RNA length change was detected for ACT1, and the change in RNA level across the open reading frame was small (≤5%) in comparison to the total level of ACT1 RNA in both strains. Antisense ACT1 RNA was not detectable. (D–F) Relative Pol II ChIP signals in the promoter, 5′ end, and 3′ end of each ORF. The results were obtained using the 8WG16 antibody and are shown as the mean of three independent biological replicates ± SD. The indicated P-values were determined using a one-tailed paired t-test of the log-transformed ChIP values. Note the correspondence between the changes in Pol II ChIP and RNA length changes: The 5′ ORF of PDR11 had an increased level of Pol II in mot1-42 cells that corresponded with increased 5′ ORF RNA level. Similarly, Pol II ChIP signal was decreased in the 3′ ORF of EMP47, consistent with the decrease in 3′ RNA in mot1-42 cells. As expected, there were no significant changes in Pol II ChIP for ACT1.
the Pol II density on genes for which the RNA length changes occurred. As shown in Figures 3, D through F, and 4, C, D, G, and H, differential changes in the RNA level correlated with changes in Pol II occupancy, as expected if the differential RNA signals arose through changes in transcription. To further confirm the interpretations of the differential RNA tiling array data, we analyzed RNAs by Northern blotting using strand-specific probes. We chose genes associated with premature termination for which the full-length and predicted short RNAs were appropriately sized for detection on the blot, as well as being reasonably well resolved from each other. THI2 displayed premature termination and was in the differentially up class. As shown in Figure 5A, primarily two species of THI2 RNA were detected in WT cells, a full-length species and a discrete shorter RNA that has not been characterized. In mot1-42 cells, most of the full-length THI2 RNA was slightly shifted to a position of smaller size (marked by asterisk), and a heterogeneous smear of shorter RNAs was detected (marked by bracket). This is consistent with the tiling array and RT-PCR data showing an increase in prematurely terminated RNA in mot1-42 cells. As expected, all of the detectable RNA was sense RNA, confirming the premature termination designation. PAN1 was also detected as a gene associated with premature
Figure 4. Confirmation of cryptic initiation and 3′ transcript length changes in mot1-42 cells. (A,B,E,F) Screenshots of log2 WT, mot1-42, and differential RNA levels for four selected genes are shown in the upper part of each panel. Relative RNA levels quantified by real-time PCR using sense- and antisense-specific primers are shown in the bar graphs. The graphs show the average ± SE obtained by analysis of two independent RNA samples for each strain. The small black boxes above each gene indicate the locations of each primer set. (C,D,G,H) Relative RNA Pol II ChIP signals were obtained at the 5′ end and the 3′ end of the ORF (THI2 and PDC5) or the 3′ end of the ORF and the 3′ intergenic region (TAT1 and ARN1). The results were obtained using the 8WG16 antibody and are shown as the mean of three biological replicates ± SD. The indicated P-values were determined using a one-tailed paired t-test of the log-transformed ChIP values. The same primer sets were used as for the RNA analysis. THI2 is an example of prematurely terminated RNA, and PDC5 is an example of downstream initiation. In contrast, TAT1 and ARN1 are examples of downstream termination RNAs. For all of the genes, there is an increase in Pol II ChIP levels that corresponds with the RNA length change. Note that the Pol II ChIP data for PDC5 are shown on a log scale. The strong signal at the 5′ end of the PDC5 ORF is consistent with previously published data (Steinmetz et al. 2006).
Figure 5. Confirmation of RNA length changes by Northern blotting. Total RNA from WT or mot1-42 cells was resolved by gel electrophoresis, transferred to nylon membrane, and probed with radiolabeled single DNA strands to detect gene-specific sense or antisense RNAs, as indicated. (A) THI2. Full-length RNA is indicated by the arrow. Note the slight shift in mobility of full-length RNA (indicated by the asterisk) and the smear of smaller RNA species (bracket) in mot1-42 versus WT samples. Only sense RNA was detected. The probes spanned +240 to +930 with respect to the start of the open reading frame. (For map of probe location, see Supplemental Fig. S8.) (B) PAN1 was identified as a Mot1-activated gene displaying premature termination in the differentially down class. Consistent with this, note the decrease in full-length RNA (arrow) in mot1-42 cells, which was in contrast to the shorter RNAs (large bracket) whose abundance was comparable in mot1-42 versus WT cells. However, the small RNA species were distributed such that the shortest RNAs were more prominent in mot1-42 cells compared with WT (denoted by small brace). Only sense RNA was detected. The probes spanned +420 to +864 with respect to the start of the ORF. (For screen shot of PAN1 tiling RNA signals and map of probe location, see Supplemental Fig. S8.) (C ) ACT1 control RNA was detected with a sense-strand specific probe. As expected, discrete full-length bands were detected in both WT and mot1-42 cells, with no quantitative difference between them. The ACT1 probe spanned +14 to +184 with respect to the start of the ORF.
termination, but in the differentially down class. PAN1 is a Mot1-activated gene, and as expected, there was less full-length PAN1 RNA in mot1-42 cells compared with WT. A smear of shorter PAN1 RNA was readily detected, and as expected for a gene in the differentially down class, PAN1 short RNA was at least as abundant in mot1-42 cells as it was in WT cells. Moreover, we observed that the shortest RNA species detected were more abundant in mot1-42 cells compared with WT cells (small bracket). As expected, all of the detectable PAN1 RNA was sense RNA. In contrast, ACT1 RNA was a discrete species unaffected by mot1-42 (Fig. 5C). Interestingly, the presence of shorter THI2 and PAN1 RNA in WT cells suggests that these genes are prone to premature termination even in WT cells but that these effects are exaggerated by mutation of MOT1. In a previous study (Sprouse et al. 2009), we identified two TBP alleles that bypass the requirement for Mot1 in vivo. These TBPs restored appropriate expression levels to many Mot1-regulated genes in vivo and allowed cells to grow without Mot1, which is otherwise essential (Sprouse et al. 2009). Biochemically, the bypass TBPs were defective for interaction with other GTFs or DNA, consistent with the critical activity Mot1 provides in destabilizing TBP-containing complexes in vivo (Sprouse et al. 2009). RNA tiling array analysis demonstrated that the two TBP bypass alleles suppressed the premature termination RNA length changes observed in mot1-42 cells (Fig. 6A,B). Suppression of the premature termination RNA synthesis from the differential up class was essentially complete, whereas the bypass TBPs partially restored efficient RNA synthesis to the differentially down gene class. Interestingly, the bypass TBPs were able to suppress each of the other RNA length change classes as well (Supplemental Fig. S7). We conclude that the mot1-42-mediated effects on RNA length can be explained by a direct effect of Mot1 on TBP dynamics.
Computational approaches were employed to determine if there are promoter features or aspects of local genomic organization that correlate with the RNA length changes observed in mot1 cells. First, we found that upstream initiation and downstream termination classes of RNA length changes are enriched in genes whose promoters possess TATA boxes (Table 1; Basehoar et al. 2004). The TATA box association with premature termination genes was only detected for the differentially up class; the differentially down premature termination genes are significantly underrepresented in TATA-containing genes. In contrast, no significant TATA box enrichment or depletion was seen for the downstream initiation genes. These results argue that the nature of the core promoter is related to the propensity to generate a certain type of aberrant RNA. This observation provides further support for the notion that these RNA length changes result from direct, promoter-mediated effects on transcriptional elongation. Recent work has revealed two classes of noncoding RNA in yeast, termed SUTs (stable unannotated transcripts) and CUTs (cryptic unstable transcripts) (Wyers et al. 2005; Davis and Ares 2006; David et al. 2006; Neil et al. 2009; Xu et al. 2009). An extensive and detailed investigation uncovered no statistically significant relationship between the occurrence of a particular type of aberrant RNA in mot1-42 cells and the occurrence of a SUT or CUT flanking or within the annotated gene in which the RNA length change was detected (Supplemental Fig. S6; data not shown). As a second approach, we classified annotated genes as Mot1-activated, Mot1-repressed, or Mot1-unaffected, based on the overall change in RNA level quantified in the gene-centric analysis of the tiling data described above (Supplemental Fig. S1). Interestingly, in this case, significant relationships were discovered between Mot1-regulated genes and the presence of a SUT or CUT proximal to or overlapping the annotation.
As shown in Figure 7, Mot1-repressed genes (scored as differentially up) are enriched in genes with a SUT that overlaps their transcribed regions. These SUTs are transcribed in the opposite sense as the affected gene. On the other hand, Mot1-activated genes (scored as differentially down) tend to have an overlapping CUT near the 3′ end of the gene. Again, these CUTs are transcribed in the opposite sense as the affected gene. Although the underlying mechanisms are unknown, collectively, these results suggest that Mot1’s effects on transcription are influenced by antisense transcription of SUTs or CUTs with a particular proximal relationship to the affected gene. Next, we sought to correlate TBP chromatin localization with the global changes in the transcription observed in mot1 cells. Published work has documented TBP distribution genome-wide (Kim and Iyer 2004; Zanton and Pugh 2004; van Werven et al. 2008; Venters and Pugh 2009), as well as the Mot1 distribution and its correlation with TBP genome-wide (Geisberg and Struhl 2004; Zanton and Pugh 2004; van Werven et al. 2008). To determine the genome-wide dependence of TBP localization on Mot1 and to compare Mot1-mediated localization to changes in RNA length, ChIP with microarray hybridization (ChIP-chip) was performed using these same tiling arrays. These results allowed us to define the locations of TBP binding as well as the relative TBP occupancies genome-wide in both WT and mot1-42 cells. To distinguish productive TBP binding from nonproductive binding to Pol II promoters, ChIP-chip was performed in parallel for TFIIB. The results obtained with WT cells are in good agreement with published data (van Werven et al. 2009; Venters and Pugh 2009). For example, TBP and TFIIB were localized near each other and primarily in promoters, 126 and 123 bp, respectively, upstream from the TSS
(Fig. 8A,B; Nagalakshmi et al. 2008). As expected, these locations correlate well with nucleosome-free regions (Supplemental Fig. S5; data not shown; Whitehouse et al. 2007). In addition, although there was a weak positive correlation between TBP promoter occupancy and gene expression, TFIIB promoter occupancy in both WT and mot1-42 cells tracked much more closely with RNA level (Fig. 8C).
Figure 6. Suppression of the premature termination defects in mot1 cells by mutations in TBP. (A,B) Profiles of average differential RNA, defined as the average of the differential RNA signal as determined in Supplemental Methods (RNA Tiling Array Data), for genes showing premature termination length changes in mot1-42 versus WT cells (black lines). The x-axis indicates the position along the chromosome in base pairs relative to the start of the gene annotation (zero). For comparison, the plots show the average differential RNA profiles for these same genes in mot1-42 cells harboring TBP-F207L versus WT (red lines) and mot1-42 TBP-Y185C versus WT (green lines). Premature termination length effects were subclassified depending on whether the differential signal was positive (A, 77 genes) or negative (B, 264 genes). Average signals were obtained by smoothing over a 30-bp window.
Table 1. Relationship between RNA length change class and occurrence of a TATA box
Interestingly, in mot1-42 cells, TBP promoter occupancies were increased genome-wide compared with WT cells (Fig. 8A,B). More detailed analyses supported the conclusion that overall, TBP occupancies increased in promoters across the board regardless of whether Mot1 exerted a detectable effect on expression of the gene (Fig. 8; data not shown). The implications of these observations are discussed below. Figure 8C shows the global relationship between TBP promoter occupancy (x-axis), TFIIB promoter occupancy (y-axis), and RNA level (color). The diagonal arrow shows that for a significant number of genes, increasing TBP and TFIIB promoter occupancy was correlated with increasing RNA level, as indicated by a transition in the color along the diagonal from green (lower RNA level) to brown/red (higher RNA level). However, the plot also shows that TFIIB promoter occupancy was much better correlated with RNA level than TBP, as indicated by the substantial number of genes with low associated RNA levels (colored green) but with high TBP occupancy (demarcated by the rectangle). The global effects of Mot1 on gene expression, as well as TBP and TFIIB occupancy, are shown in Figure 8D. Differential TFIIB promoter occupancy was reasonably well correlated with differential expression mediated by loss of Mot1. Note that an increase in gene expression in mot1-42 cells (brown/red) was generally associated with an increase in TFIIB promoter occupancy (higher y-axis value), whereas decreases in gene expression (green) generally displayed a decrease in TFIIB promoter occupancy (lower y-axis value). In contrast, there was no obvious correlation between the change in gene expression (color) and the change in TBP promoter occupancy (x-axis value). Figure 8, E and F, supports these general conclusions. These histograms show how the global distributions of TBP and TFIIB changed in mot1-42 versus WT cells. Most promoters (60%) had increased TBP occupancies in mot1-42 versus WT cells, as illustrated by the large peak of positive differential TBP in Figure 8E. The small peak centered over zero indicates that there were some genes (20%) whose TBP occupancies did not change, and the left-hand shoulder reflects promoters whose TBP occupancies decreased in mot1-42 cells. In contrast, as shown in Figure 8F, the two large peaks indicate that TFIIB occupancies were increased or decreased at roughly equal numbers of promoters (39% increased and 42% decreased). The correlation of the differential TFIIB signal with differential RNA (Fig. 8D) indicates that the bimodal distribution of differential TFIIB in Figure 8F was a consequence of the nature of the Mot1-mediated transcriptional effect. To further explore the nature of Mot1-mediated effects on TBP, a peak finding algorithm was employed to map more precisely the loci of protein binding (Supplemental Fig. S3). Most promoters possessed single TBP peaks (67.4%); about a third (32.7%) possessed TFIIB peaks; and close to half of the detected TBP peaks (47.3%) were associated with a TATA motif (Supplemental Fig. S4). Thus, in many instances the effects of Mot1 on TBP and TFIIB promoter occupancy appear to reflect changes in the occupancies of single, discrete complexes formed on promoters.
Finally, using several computational approaches, we investigated the relationship between TBP occupancy and the RNA length changes that occurred in mot1-42 cells (see Supplemental materials and Methods). For the altered initiation events in particular, the relatively small number of affected genes made statistical analysis difficult. Nonetheless, we observed that the “downstream initiation” events appear to have a relatively straightforward origin: In mot1-42 cells, the shift in TBP localization parallels the apparent site of initiation of the new RNA (Pearson correlation = 0.51; data not shown). This suggests that changes in initiation occurred because Mot1 failed to clear TBP from cryptic sites that can nucleate the assembly of functional transcription complexes. It is unclear how a defect in Mot1 could give rise to RNAs with
extended 3′ ends, but the correlation of these different aberrant RNA species with promoter type (Table 1) suggests that termination or 3′ processing events can somehow be influenced by TBP dynamics, depending on the type of promoter.
Figure 7. Relationship between Mot1-regulated genes and CUT and SUT RNAs. The diagrams illustrate the two significant relationships detected between Mot1-mediated gene expression and noncoding transcription. Mot1-activated genes tend to have an overlapping CUT transcribed in the antisense direction. The CUT transcripts initiate within the transcribed region of the Mot1-activated gene, ∼40 bp on average upstream of the Mot1-activated gene termination site. In contrast, Mot1-repressed genes tend to have an overlapping antisense SUT. These SUTs initiate ∼40 bp on average downstream from the termination site for the Mot1-repressed gene.
Figure 8. Global effects of Mot1 on TBP and TFIIB genomic distribution and correlations with transcription. Plots of average TBP (A) and average TFIIB chromatin (B) occupancies obtained by aligning all annotated Pol II genes with respect to the transcription start site (TSS). The signals were smoothed over a 50-bp sliding window (WT, solid curves; mot1-42, dashed curves). In both cases, the differences in the distributions are highly significant as determined by the Kolmogorov-Smirnov test (P-values as indicated). (C) The heat map displays the transcriptome-wide relationship between TBP occupancy (x-axis), TFIIB occupancy (y-axis), and RNA level (color) in WT cells. Each box represents one or more genes whose TBP and TFIIB occupancies fall within the box’s x-axis/y-axis values. The box color is the relative median expression value on a scale defined by the range of all medians in the data set. Note the general trend that increasing TBP and TFIIB promoter occupancy is correlated with increasing RNA level (arrow). However, the correlation between TFIIB and RNA levels is better because there are genes with high TBP occupancy but low expression (black rectangle). (D) The heat map is similar to that in panel C but shows the transcriptome-wide relationship between the differential TBP signal (x-axis), the differential TFIIB signal (y-axis), and the change in RNA level (color) in mot1-42 versus WT cells. As in C, changes in TFIIB occupancy are reasonably well correlated with changes in transcription, whereas changes in TBP promoter occupancy do not correlate as well with changes in RNA level. (E,F) Density distributions of the differential TBP and differential TFIIB signals (as indicated), over the promoters for all the genes. The plots show that in mot1-42 cells compared with WT, TBP occupancies increased at the majority of promoters. In contrast, TFIIB occupancies increased or decreased at roughly equal numbers of promoters, consistent with changes in gene expression in both positive and negative directions.
Discussion
The results presented here reveal several new and unanticipated relationships between TBP dynamics and transcription. Notably, we find that impaired turnover of TBP at promoters correlates with the production of a predominant class of “premature termination” RNAs. The Pol II ChIP and TBP allele suppression results argue that Mot1-mediated effects on RNA length distribution are attributable to changes in transcription of the affected genes, and that these occur as a consequence of altered TBP dynamic behavior at promoters. Moreover, the genes that display RNA length changes tend to have certain promoter attributes (TATA versus TATA-less). The stochastic failure of Pol II elongation that ensues when TBP dynamics are perturbed suggests that Mot1-catalyzed clearance of TBP may be important for promoter recruitment or activity of the accessory factors that subsequently promote Pol II elongation. Another attractive model is that there may be communication between factors associated with the promoter and the elongating Pol II. A possible explanation is that a physical connection persists between the promoter and elongating Pol II, which can stall elongation some distance from the promoter. The role of Mot1 then would be to facilitate the dissolution of the TBP complex at the promoter to release a restrained Pol II. Alternatively, there may be communication between the promoter and an elongating Pol II to facilitate transit through chromatin or to ensure that downstream RNA processing events are coupled to transcription. Regardless of the mechanism, it is striking that a large number of yeast genes rely to some extent on TBP dynamics to ensure accurate and efficient RNA elongation.
The genome-wide increase in TBP occupancy observed in mot1-42 cells compared with WT reported here fits well with the rapid mobility of essentially the whole TBP pool assessed by live cell imaging (Sprouse et al. 2008) and shows that TBP occupancy is generally limiting at promoters in vivo. On the other hand, although Mot1 regulates a substantial proportion of the transcriptome, not all yeast genes are affected in mot1-42 cells. Presumably, the Mot1-affected genes are especially sensitive to TBP occupancy, whereas other genes are rate-limited in some other critical step of the transcription cycle. A recent study concluded that TBP turnover rates are different at different classes of promoters (van Werven et al. 2009). This study relied on a replacement strategy in which expression of a differentially tagged form of TBP was induced and the rate of its association with promoters was tracked. As the production of the new TBP pool required on the order of 30 min, this approach would not capture the very rapid Mot1-catalyzed dynamics (occurring on the order of seconds) that our previous work has shown accounts for the behavior of the majority of the TBP pool (Sprouse et al. 2008). Although no genomic scale method exists yet for detection of such rapid, locus-specific dynamic behavior, the analysis of the TBP occupancy profile in mot1-42 cells reported here significantly extends prior work by showing that Mot1 does target virtually the entire pool of promoter-bound TBP in vivo. This observation provides support for a recent study that modeled the dynamic behavior of GTFs based on previously published ChIP-chip data, which found that interactions between TBP and chromatin are best described by models in which the interactions are rather transient (Samorodnitsky and Pugh 2010). Our suspicion is that GTF–promoter interactions span a wide range of lifetimes. Because the long-lived interactions detected by kinetic ChIP experiments appear to represent a relatively small (albeit very important) proportion of the total number of chromatin interactions, they may well be below the limit of detection in the live cell imaging experiments that capture the behavior of the overall GTF pool.
Recent results show that constitutive yeast genes are expressed by infrequent initiation events clearly separated in time (Zenklusen et al. 2008). These observations also fit well with our measurements of highly dynamic TBP in vivo (Sprouse et al. 2008) and suggest that, in contrast to establishing a stable scaffold, many active promoters are subject to occupancy by transcription complexes that undergo rapid cycles of assembly and disassembly. Such dynamic behavior has been speculated to be important for ensuring appropriate start site selection and timely transcriptional regulatory responses (Auble 2009). However, the results presented here support the notion of a fundamental role for PIC dynamics in the process of RNA synthesis itself.
More generally, the ability to rapidly characterize the spectrum of aberrant RNAs present in a particular mutant background using the approach outlined here will likely be of use in unraveling the molecular mechanisms responsible for these effects.
Methods
Yeast strains and growth conditions
Saccharomyces cerevisiae WT and mot1-42 strains used for RNA and ChIP analyses are derivatives of YPH499 (Sikorski and Hieter 1989) and were previously described (Darst et al. 2003; Dasgupta et al. 2005). SET2 was replaced with the kanMX cassette in YPH499 cells, and proper integration was confirmed by PCR. For TBP ChIP-chip, the C-terminally myc-tagged TBP strains were derived by transformation of PCR-generated DNA fragments into the same background containing the mutant alleles (YAD155, WT; YAD156, mot1-42). For the RNA analysis of mutant TBPs, previously described pRS314-borne alleles of TBP (WT, Y185C, or F207L) (Sprouse et al. 2009) were transformed into AY51 (WT) or AY87 (Y185C, F207L). Yeast strains were grown in YPD at 30°C to an OD600 of ∼1.0. Cells were then shifted immediately to 35°C for 45 min and were harvested for isolation of total RNA or treated for ChIP as previously described (Dasgupta et al. 2005).
Expression analysis
Total RNA was isolated using hot-acid phenol extraction (Schmitt et al. 1990). For the tiling arrays, cDNA was synthesized from 7 μg of total RNA using the Affymetrix GeneChip WT Double-Stranded cDNA Synthesis Kit as recommended by the manufacturer, but with the addition of 0.4 mM dUTP for subsequent fragmentation and biotin end-labeling. dsDNA was purified using the GeneChip Sample Cleanup Module (Affymetrix, Inc.). Fragmentation and labeling were performed using the GeneChip WT Double-Stranded DNA Terminal Labeling Kit. Fragmented DNA was confirmed to be between 25 and 100 bp using the Agilent RNA 6000 Nano Kit and an Agilent 2100 Bioanalyzer. Efficient labeling was confirmed by gel shift assay using NeutrAvidin. Samples were hybridized to S. cerevisiae Tiling 1.0R Arrays (Affymetrix), and raw data were generated by the Microarray Core Facility at the University of Virginia (Charlottesville). WT, mot1-42, and set2Δ RNA analyses were performed using two independent biological replicates for each. For the experiment in Figure 6, A and B, WT (TBP), mot1-42 (TBP-Y185C), and mot1-42 (TBP-F207L) RNA was analyzed as a single replicate. For quantitative real-time PCR validation, RT-PCR was first performed using the iScript Select cDNA Synthesis Kit (Bio-Rad) according to the manufacturer’s instructions. cDNA was then quantitated by real-time PCR using iQ SYBR Green Supermix (Bio-Rad) and the Bio-Rad MyiQ Single Color Real-Time PCR Detection System. Each experiment includes two independent biological replicates. Northern blotting was performed as previously described (Dasgupta et al. 2005) using strand-specific probes obtained by single-primer PCR in reactions that included [α-32P]dATP. In each case, the template for probe synthesis was a gel-purified PCR fragment spanning the differentially affected transcribed region. Primer sequences are shown in Supplemental Table 1.
Chromatin immunoprecipitation
ChIP assays were performed exactly as previously described (Dasgupta et al. 2005) with the following antibodies: for TBP ChIP, anti-myc (9E10) (Dasgupta et al. 2005); for TFIIB ChIP, a TFIIB rabbit polyclonal antibody (Dasgupta et al. 2005); and for RNA Pol II ChIP, the Pol II monoclonal antibody 8WG16 (Thompson et al. 1989; Bhaumik and Green 2001). ChIP material was then used for hybridization to tiling arrays or for quantitation by real-time qPCR. For the tiling arrays, library preparation and amplification of DNA for both ChIP and mock IP samples were performed using the GenomePlex Complete Whole Genome Amplification Kit (Sigma) as described (O’Geen et al. 2006) with several modifications: dUTP was added in equimolar concentration to the dNTP mix (0.4 mM), and a second amplification was performed for both the ChIP and mock samples using 10 ng of the previously amplified material. Samples were purified with the QIAquick PCR purification kit (Qiagen) prior to re-amplification and fragmentation. Duplicate samples were combined to obtain 7.5 μg of material for fragmentation and labeling. Samples were hybridized to S. cerevisiae Tiling 1.0R arrays (Affymetrix), and raw data were generated by the Microarray Core Facility at the University of Virginia (Charlottesville). Three independent biological replicates were analyzed for TBP, and two independent biological replicates were analyzed for TFIIB. Quantitative real-time PCR was performed on ChIP, mock IP, and total samples as described for the RNA analysis. ChIP signals were obtained by subtracting mock IP signal from ChIP signal and normalizing against the input. ORF primers were identical to those used in the expression analysis. Three independent biological replicates were performed for each ChIP analysis. P-values were obtained by log-transforming the calculated ChIP signals and conducting a one-tailed, paired t-test.
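The ChIP quantitation and P-value calculation described in the chromatin immunoprecipitation methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function names are hypothetical, the natural log is assumed for the transformation (the base does not affect the t statistic), and the P-value uses the closed-form Student's t distribution for two degrees of freedom, which applies only to three paired replicates.

```python
import math
from statistics import mean, stdev


def chip_signal(chip: float, mock: float, input_total: float) -> float:
    """Mock-subtracted ChIP signal normalized against the input."""
    return (chip - mock) / input_total


def paired_one_tailed_t(mutant: list[float], wild_type: list[float]) -> tuple[float, float]:
    """One-tailed paired t-test on log-transformed signals (H1: mutant > WT).

    Restricted to exactly three pairs (df = 2) so that the upper-tail
    probability has a closed form. Signals must be positive for the log.
    """
    if len(mutant) != 3 or len(wild_type) != 3:
        raise ValueError("this sketch handles exactly three paired replicates")
    diffs = [math.log(m) - math.log(w) for m, w in zip(mutant, wild_type)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("identical differences: t statistic is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    # For df = 2 the t CDF is F(t) = 1/2 + t / (2 * sqrt(t^2 + 2)).
    p_upper = 0.5 - t / (2 * math.sqrt(t * t + 2))
    return t, p_upper


if __name__ == "__main__":
    # Illustrative, made-up replicate signals (not data from the paper).
    t_stat, p_val = paired_one_tailed_t([2.0, 4.0, 8.0], [1.0, 1.0, 1.0])
    print(f"t = {t_stat:.3f}, one-tailed P = {p_val:.4f}")
```

With more than three replicates the degrees of freedom change and a general t distribution (for example from a statistics library) would be needed in place of the closed form.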
Tiling array data analysis
We performed both gene-biased and -unbiased analyses of the tiling array data. For unbiased analysis, we used the Affymetrix Tiling Array Software (TAS), which quantile normalizes replicate arrays, scales their median intensity to a user-defined value, and calculates the −10log10(P-value) and log2(pseudo-median) (or signal strength) associated with a one- or two-sample Wilcoxon signed
rank test over a sliding window (Cawley et al. 2004). For the ChIP-chip analysis of TBP and TFIIB for mot1-42 and WT strains, we applied the two-sample test of the ChIP sample compared with mock IP using a window size of 500 bp. For the total RNA analysis, we applied the one-sample test using a window size of 50 bp. For the biased analysis, a gene-centric library (CDF) file was generated from yeast gene annotations (Fisk et al. 2006) and the Affymetrix tiling array library (BPMAP) file, which contained a probe set for every annotated yeast gene composed of all probes whose central position fell within the annotated start and stop of the gene. Normalized gene expression estimates were obtained by quantile normalizing the arrays and applying GCRMA. Lists of differentially expressed genes were obtained using the limma package in Bioconductor and applying a 5% false discovery rate cutoff. Additional methods are described in the Supplement. Tiling array data have been deposited in the NCBI Gene Expression Omnibus under accession number GSE18283.
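The sliding-window pseudo-median signal mentioned above can be illustrated with a short sketch. It is not the TAS implementation: the function names are hypothetical, the pseudo-median is taken to be the Hodges-Lehmann estimator (the median of all pairwise Walsh averages), and the window size is assumed to be the full width centered on each probe, which may differ from how TAS defines its bandwidth.

```python
from statistics import median


def pseudo_median(values: list[float]) -> float:
    """Hodges-Lehmann estimator: median of all Walsh averages (v_i + v_j) / 2, i <= j."""
    n = len(values)
    walsh = [(values[i] + values[j]) / 2 for i in range(n) for j in range(i, n)]
    return median(walsh)


def sliding_pseudo_median(
    positions: list[int], log2_signal: list[float], window_bp: int = 50
) -> list[float]:
    """Pseudo-median of the probe signals within window_bp / 2 of each probe position.

    Quadratic in the number of probes; adequate for illustration only.
    """
    half = window_bp / 2
    smoothed = []
    for center in positions:
        in_window = [s for p, s in zip(positions, log2_signal) if abs(p - center) <= half]
        smoothed.append(pseudo_median(in_window))
    return smoothed


if __name__ == "__main__":
    # Made-up probe positions (bp) and log2 intensities.
    probe_positions = [0, 10, 20, 100]
    probe_signal = [1.0, 2.0, 3.0, 9.0]
    print(sliding_pseudo_median(probe_positions, probe_signal, window_bp=50))
```

Because the pseudo-median is based on pairwise averages, it is less sensitive to a single outlying probe than the mean while using more of the data than the plain median.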
Acknowledgments
We thank Mitch Smith and Jeff Smith for comments on the manuscript. This work was supported by the NIH grant GM55763 to D.T.A.
References Auble DT. 2009. The dynamic personality of TATA-binding protein. Trends Biochem Sci 34: 49–52. Basehoar AD, Zanton SJ, Pugh BF. 2004. Identification and distinct regulation of yeast TATA box-containing genes. Cell 116: 699–709. Bhaumik SR, Green MR. 2001. SAGA is an essential in vivo target of the yeast acidic activator Gal4p. Genes Dev 15: 1935–1945. Burley SK, Roeder RG. 1996. Biochemistry and structural biology of transcription factor IID (TFIID). Annu Rev Biochem 65: 769–799. Cawley S, Bekiranov S, Ng HH, Kapranov P, Sekinger EA, Kampa D, Piccolboni A, Sementchenko V, Cheng J, Williams AJ, et al. 2004. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116: 499–509. Darst RP, Dasgupta A, Zhu C, Hsu J-Y, Vroom A, Muldrow TA, Auble DT. 2003. Mot1 regulates the DNA binding activity of free TATA-binding protein in an ATP-dependent manner. J Biol Chem 278: 13216–13226. Dasgupta A, Darst RP, Martin KJ, Afshari CA, Auble DT. 2002. Mot1 activates and represses transcription by direct, ATPase-dependent mechanisms. Proc Natl Acad Sci 99: 2666–2671. Dasgupta A, Juedes SA, Sprouse RO, Auble DT. 2005. Mot1-mediated control of transcription complex assembly and activity. EMBO J 24: 1717–1729. David L, Huber W, Granovskaia M, Toedling J, Palm CJ, Bofkin L, Jones T, Davis RW, Steinmetz LM. 2006. A high-resolution map of transcription in the yeast genome. Proc Natl Acad Sci 103: 5320–5325. Davis CA, Ares M Jr. 2006. Accumulation of unstable promoter-associated transcripts upon loss of the nuclear exosome subunit Rrp6p in Saccharomyces cerevisiae. Proc Natl Acad Sci 103: 3262–3267. Fisk DG, Ball CA, Dolinski K, Engel SR, Hong EL, Issel-Tarver L, Schwartz K, Sethuraman A, Botstein D, Cherry JM, et al. 2006. Saccharomyces cerevisiae S288C genome annotation: A working hypothesis. Yeast 23: 857–865. Geisberg JV, Struhl K. 2004. 
Cellular stress alters the transcriptional properties of promoter-bound Mot1–TBP complexes. Mol Cell 14: 479–489. Hahn S. 2004. Structure and mechanism of the RNA polymerase II transcription machinery. Nat Struct Mol Biol 11: 394–403. Kamada K, Shu F, Chen H, Malik S, Stelzer G, Roeder RG, Meisterernst M, Burley SK. 2001. Crystal structure of negative cofactor 2 recognizing the TBP-DNA transcription complex. Cell 106: 71–81. Kim J, Iyer VR. 2004. Global role of TATA box-binding protein recruitment to promoters in mediating gene expression profiles. Mol Cell Biol 24: 8104–8112.
Li B, Gogol M, Carey M, Pattenden SG, Seidel C, Workman JL. 2007. Infrequently transcribed long genes depend on the Set2/Rpd3S pathway for accurate transcription. Genes Dev 21: 1422–1430. Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M. 2008. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320: 1344–1349. Neil H, Malabat C, d’Aubenton-Carafa Y, Xu Z, Steinmetz LM, Jacquier A. 2009. Widespread bidirectional promoters are the major source of cryptic transcripts in yeast. Nature 457: 1038–1042. Nicol JW, Helt GA, Blanchard SG Jr, Raja A, Loraine AA. 2009. The Integrated Genome Browser: Free software for distribution and exploration of genome-scale datasets. Bioinformatics 25: 2730–2731. O’Geen H, Nicolet CM, Blahnik K, Green RD, Farnham PJ. 2006. Comparison of sample preparation methods for ChIP-chip assays. Biotechniques 41: 577–580. Pugh BF. 2000. Control of gene expression through regulation of the TATA-binding protein. Gene 255: 1–14. Reese JC. 2003. Basal transcription factors. Curr Opin Genet Dev 13: 114–118. Samorodnitsky E, Pugh BF. 2010. Genome-wide modeling of transcription preinitiation complex disassembly mechanisms using ChIP-chip data. PLoS Comput Biol 6: e1000733. doi: 10.1371/journal.pcbi.1000733. Schluesche P, Stelzer G, Piaia E, Lamb DC, Meisterernst M. 2007. NC2 mobilizes TBP on core promoter TATA boxes. Nat Struct Mol Biol 14: 1196–1201. Schmitt ME, Brown TA, Trumpower BL. 1990. A rapid and simple method for preparation of RNA from Saccharomyces cerevisiae. Nucleic Acids Res 18: 3091–3092. Sikorski RS, Hieter P. 1989. A system of shuttle vectors and yeast host strains designed for efficient manipulation of DNA in Saccharomyces cerevisiae. Genetics 122: 19–27. Sprouse RO, Karpova TS, Mueller F, Dasgupta A, McNally JG, Auble DT. 2008. Regulation of TATA binding protein dynamics in living yeast cells. Proc Natl Acad Sci 105: 13304–13308. Sprouse RO, Wells MN, Auble DT. 2009. 
TATA-binding protein variants that bypass the requirement for Mot1 in vivo. J Biol Chem 284: 4525–4535. Steinmetz EJ, Warren CL, Kuehner JN, Panbehi B, Ansari AZ, Brow DA. 2006. Genome-wide distribution of yeast RNA polymerase II and its control by Sen1 helicase. Mol Cell 24: 735–746. Thompson NE, Steinberg TH, Aronson DB, Burgess RR. 1989. Inhibition of in vivo and in vitro transcription by monoclonal antibodies prepared against wheat germ RNA polymerase II that react with the heptapeptide repeat of eukaryotic RNA polymerase II. J Biol Chem 264: 11511–11520. van Werven FJ, van Bakel H, van Teeffelen HAAM, Altelaar AFM, Koerkamp MG, Heck AJR, Holstege FCP, Timmers HTM. 2008. Cooperative action of NC2 and Mot1p to regulate TATA-binding protein function across the genome. Genes Dev 22: 2359–2369. van Werven FJ, van Teeffelen HAAM, Holstege FCP, Timmers HTM. 2009. Distinct promoter dynamics of the basal transcription factor TBP across the yeast genome. Nat Struct Mol Biol 16: 1043–1048. Venters BJ, Pugh BF. 2009. A canonical promoter organization of the transcription machinery and its regulators in the Saccharomyces genome. Genome Res 19: 360–371. Whitehouse I, Rando OJ, Delrow J, Tsukiyama T. 2007. Chromatin remodelling at promoters suppresses antisense transcription. Nature 450: 1031–1035. Workman JL. 2006. Nucleosome displacement in transcription. Genes Dev 20: 2009–2017. Wyers F, Rougemaille M, Badis G, Rousselle J-C, Dufour M-E, Boulay J, Regnault B, Devaux F, Namane A, Seraphin B, et al. 2005. Cryptic Pol II transcripts are degraded by a nuclear quality control pathway involving a new poly(A) polymerase. Cell 121: 725–737. Xu Z, Wei W, Gagneur J, Perocchi F, Clauder-Münster S, Camblong J, Guffanti E, Stutz F, Huber W, Steinmetz LM. 2009. Bidirectional promoters generate pervasive transcription in yeast. Nature 457: 1033–1037. Zanton SJ, Pugh BF. 2004. Changes in genomewide occupancy of core transcriptional regulators during heat stress. 
Proc Natl Acad Sci 101: 16843–16848. Zenklusen D, Larson DR, Singer RH. 2008. Single-RNA counting reveals alternative modes of gene expression in yeast. Nat Struct Mol Biol 15: 1263–1271.
Received April 23, 2010; accepted in revised form September 7, 2010.
Research
Pervasive gene content variation and copy number variation in maize and its undomesticated progenitor
Ruth A. Swanson-Wagner,1,5 Steven R. Eichten,1,5 Sunita Kumari,2 Peter Tiffin,1 Joshua C. Stein,2 Doreen Ware,2,3 and Nathan M. Springer1,4,6
1Department of Plant Biology, University of Minnesota, Saint Paul, Minnesota 55108, USA; 2Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA; 3United States Department of Agriculture, Agricultural Research Service, Cold Spring Harbor, New York 11724, USA; 4Microbial and Plant Genomics Institute, University of Minnesota, Saint Paul, Minnesota 55108, USA
Individuals of the same species are generally thought to have very similar genomes. However, there is growing evidence that structural variation in the form of copy number variation (CNV) and presence–absence variation (PAV) can lead to variation in the genome content of individuals within a species. Array comparative genomic hybridization (CGH) was used to compare gene content and copy number variation among 19 diverse maize inbreds and 14 genotypes of the wild ancestor of maize, teosinte. We identified 479 genes exhibiting higher copy number in some genotypes (UpCNV) and 3410 genes that have either fewer copies or are missing in the genome of at least one genotype relative to B73 (DownCNV/PAV). Many of these DownCNV/PAV are examples of genes present in B73, but missing from other genotypes. Over 70% of the CNV/PAV examples are identified in multiple genotypes, and the majority of events are observed in both maize and teosinte, suggesting that these variants predate domestication and that there is not strong selection acting against them. Many of the genes affected by CNV/PAV are either maize specific (thus possible annotation artifacts) or members of large gene families, suggesting that the gene loss can be tolerated through buffering by redundant functions encoded elsewhere in the genome. While this structural variation may not result in major qualitative variation due to genetic buffering, it may significantly contribute to quantitative variation.
[Supplemental material is available online at http://www.genome.org. 
The sequence data from this study have been submitted to the NCBI Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo) under accession no. GSE23756.] It is generally assumed that the genomes of different individuals of the same species are similar in content. However, there is growing evidence for structural variation among the genomes of different individuals. Structural variation includes rearrangements (inversions and translocations) and copy number variation (CNV). The most extreme form of CNV is presence–absence variation (PAV), in which a particular sequence is present in some individuals and missing in others. While single nucleotide polymorphisms (SNPs) are the most common and most frequently assayed type of intraspecific genetic variation, there is evidence that more nucleotide bases are affected by CNV than by SNPs between any two individuals (Zhang et al. 2009). This structural variation challenges the notion of understanding the genome of a species through the analysis of a single reference sequence from one individual or genotype. CNV and PAV are likely to have functional significance and may explain some variation not captured by SNP-based genome-wide association studies (Manolio et al. 2009). For example, both CNV and PAV can contribute to phenotypic variation for some human diseases (Feuk et al. 2006; Sharp et al. 2006; Beckmann et al. 2007; Cooper et al. 2007; Sebat 2007; Hurles et al. 2008; Bucan et al. 2009; Merikangas et al. 2009; Zhang et al. 2009; Beroukhim et al. 2010; Conrad et al. 2010; Wellcome Trust Case Control Consortium 2010). Specifically, phenotypic variation results from CNV in dosage effect-sensitive genes (Charcot-Marie-Tooth disease), genes influenced by position effect (spastic paraplegia), and genes with a
5 These authors contributed equally to this work.
6 Corresponding author. E-mail [email protected]; fax (612) 625-1738.
Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.109165.110.
mutant allele unmasked when the functional copy is deleted (for review, see Stankiewicz and Lupski 2010). CNVs underlying complex traits such as Alzheimer disease and Autism spectrum disorders have been detected in human patients (Stankiewicz and Lupski 2010). Copy number variation has been documented in several species, including the human genome (Sebat et al. 2004; Sharp et al. 2005; Tuzun et al. 2005; Conrad et al. 2006, 2010; Redon et al. 2006; McCarroll and Altshuler 2007; Wong et al. 2007; Kidd et al. 2008; Wellcome Trust Case Control Consortium 2010) and several other mammalian species, including mice (Graubert et al. 2007), rats (Guryev et al. 2008), chimpanzees (Perry et al. 2008), rhesus monkeys (Lee et al. 2008), and canines (Chen et al. 2009). It is difficult to compare the number of CNV in different studies, as the number of observed CNV is heavily influenced by the diversity of individuals that are examined and the technology used for detection. The general consensus is that there are several hundred to over a thousand CNVs between individuals within a species. It should be noted that in most cases these studies include segregating individuals, and many of the CNVs are observed as heterozygotes. Studies of several highly inbred model organisms including C. elegans (Maydan et al. 2010) and Arabidopsis thaliana (Santuari et al. 2010) have also identified numerous CNVs. Zea mays (maize) is a highly polymorphic species (for review, see Buckler et al. 2006; Messing and Dooner 2006; Springer and Stupar 2007). The recent completion of a reference genome from one genotype, B73, affords the opportunity to assess structural variation and complexity within this species (Schnable et al. 2009). Detailed analyses of specific loci as well as genomic approaches have identified numerous duplications within the maize genome, many of which are located in colinear regions (Schnable et al.
20:1689–1699 © 2010 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/10; www.genome.org
2009) derived from an ancient allopolyploidization event (Gaut and Doebley 1997; Swigonova et al. 2005; Wei et al. 2007). There is also evidence for a high frequency of tandem duplicates within maize (Messing et al. 2004; Emrich et al. 2007; Schnable et al. 2009), including several well-characterized genes affecting pigmentation such as R-r (Robbins et al. 1991), P1 (Zhang and Peterson 2005), and A1-b (Yandeau-Nelson et al. 2006). In addition, there is evidence for many dispersed duplications that are not located within colinear regions, but are instead likely derived from transposition events (Bennetzen 2005; Lai et al. 2005; Morgante et al. 2005; Yang and Bennetzen 2009). There are many examples of structural variation among different maize genotypes. Cytogenetic studies have provided evidence for structural variation in maize chromosomes (McClintock et al. 1981; Kato et al. 2004). More recent studies have sequenced multiple haplotypes for specific loci and have identified structural variation affecting both repetitive and low-copy sequences (Fu and Dooner 2002; Yao et al. 2002; Brunner et al. 2005; Wang and Dooner 2006). For example, Wang and Dooner (2006) documented that only 25%–84% of bases within a ~100-kb region were shared among eight haplotypes. The frequency of CNV and PAV between the reference genome (B73) and a second genotype (Mo17) has been assayed using BAC libraries (Morgante et al. 2005) and comparative genomic hybridization (CGH) (Springer et al. 2009; Belo et al. 2010). These scans have identified hundreds of copy number variants as well as several thousand sequences present in the reference genome but absent in Mo17 (PAVs). A proportion of these CNV and PAV identified in Mo17 relative to B73 affect the copy number or content of genes present in these two lines. 
In this study, we used a gene-focused microarray to assess the frequency and identity of genes affected by CNV or PAV within a diverse panel of maize and teosinte (Zea mays ssp. parviglumis) genotypes. We included the teosinte lines to evaluate whether extensive structural variation in maize predates or is related to domestication. Over 10% of the ~32,500 genes surveyed exhibit CNV/PAV relative to the B73 reference genome. The majority of the CNV/PAV events are observed in both maize and teosinte, suggesting that these have not entered the genome during maize domestication or improvement. This study provides evidence for prevalent CNV/PAV within maize and provides an opportunity to characterize the types of genes affected by structural variation.
Results
Identification of genes affected by structural variation
Structural variation can include rearrangements (inversions and translocations), CNV, and PAV. Comparative genomic hybridization (CGH) of DNA samples to microarrays can be used to detect both CNV and PAV. A custom long oligonucleotide microarray was designed using the reference sequence of the B73 maize genotype (Schnable et al. 2009) and was used to perform CGH analyses of 32,487 maize genes (see Methods). High-quality hybridization data were obtained for 33 genotypes, including 19 diverse maize genotypes and 14 teosinte genotypes (listed in Supplemental Table 1). The visualization of log2 signal intensity relative to B73 reveals that many genes have variable signal over scales of the entire genome (Supplemental Fig. 2), single chromosomes (Fig. 1A), or small regions of a chromosome (Fig. 1B).
The array-based CGH analysis detected genes with consistently higher (UpCNV) or consistently lower (DownCNV/PAV) signal than in the reference B73 genome (Table 1). Because the array was designed using B73 genomic sequence, the primary biological cause for increased CGH signal for a genotype would be an increase in the copy number of the probe sequence in that genotype relative to B73. In contrast, there are multiple potential causes for significantly negative log2 ratios, including polymorphisms within a probed sequence relative to B73, fewer copies of the gene in the other genotype (DownCNV), or absence of the sequence in the other genotype (PAV). It should be noted that numerous polymorphisms would be required for all probes from a gene to exhibit low signal. Our previous data (Springer et al. 2009) suggest that hybridization intensity is not strongly affected until there are at least four to five SNPs within the probe sequence. This level of polymorphism spread across multiple portions of the coding region would represent a highly divergent allele.
Analysis of the array CGH data identified 479 UpCNV genes and 3410 DownCNV/PAV (Supplemental Table 2). The array CGH analysis cannot distinguish between DownCNV and PAV, as these both exhibit lower hybridization intensities than in the reference samples. However, it is possible to use the B73 reference genome to classify these events as either DownCNV or PAV. In order to be classified as a DownCNV, a gene would need to have multiple copies in the B73 genome. Of the 3410 DownCNV/PAV genes, 586 have probes with multiple close matches in the B73 reference genome and were classified as DownCNV candidates, while the remaining 2824 genes are single copy in the B73 reference genome sequence and were classified as likely examples of PAV (Table 1). This is a useful classification scheme to estimate the relative frequency of DownCNV and PAV, but it may result in some false assignments as PAV if additional copies of the sequence reside in the ~5% of the B73 genome that was not sequenced or in regions of the B73 sequence that were collapsed during assembly. Due to the potential misclassification of DownCNV and PAV, these two classes were grouped together for many subsequent analyses. Interestingly, there are a number of genes that were classified as UpCNV in some genotypes and DownCNV in other genotypes. This suggests that genes that are present in multiple copies in B73 can frequently exhibit either increases or decreases in copy number in other genotypes.
The structural variants were observed throughout the maize genome (Fig. 2). There are more structural variants near the end of the chromosomes than within the central centromeric regions of the maize chromosomes, but this generally mirrors the genic density. Chromosomal regions were classified as high, moderate, or low recombination rates based upon a comparison of the genetic and physical map (Liu et al. 2009). The proportion of genes within each of these regions that exhibit PAV or CNV were determined (Table 2). The CNV exhibit a significantly (χ2, P < 0.0005) different distribution than expected with higher levels of CNV in the low recombination regions. In contrast, the PAV do not show altered rates in high and low recombination regions.

Figure 1. Structural variation affects many genes. The average log2(other/B73) is plotted for all 2767 genes on chromosome 6 (A) or for 293 genes within a 20-Mb region of chromosome 1 (B) for eight genotypes. (Blue data points) UpCNV with more copies in the other inbred line relative to B73; (red data points) genes with significantly lower signal in the other line relative to B73, which are examples of DownCNV or PAV; (red arrows) several multigene structural variants that are observed in multiple genotypes; (black arrows) the position of several single gene structural variants that are observed in multiple genotypes.

Table 1. Discovery of structural variation affecting maize genes

Class           All genes         “Classic” maize genes(a)   Chromatin genes(b)   Cell wall genes(c)   Transcription factors(d)
UpCNV           402 (1.2%)        2                          3                    4                    0
DownCNV         554 (1.7%)        3                          1                    4                    5
PAV             2779 (8.6%)       19                         2                    68                   91
Up & Down(e)    77 (0.2%)         0                          0                    0                    2
Not changed     28,675 (88.3%)    396                        263                  1122                 1625
Total           32,487            420                        269                  1198                 1723

(a) Includes a set of 420 genes identified by classical genetic studies and curated at CoGe (website).
(b) Includes all non-histone maize chromatin genes curated by http://www.ChromDB.org.
(c) Includes all genes with putative cell wall function identified by Penning et al. (2009).
(d) Includes the set of maize genes curated by GRASSIUS.
(e) Includes genes that show increased signal in some genotypes and decreased signal in others.

Validation of structural variants
Several approaches validated the detected structural variants. Primer pairs for 12 genes located within putative PAV were used to perform PCR on the same genotypes used for microarray analysis (Table 2; Supplemental Fig. 3). The presence–absence patterns were largely supported by the PCR analysis with 92% of “absent” calls and 81% of present calls confirmed (Table 3). In some cases, the PCR failed to amplify a band in genotypes that were not predicted to be missing the sequence (Supplemental Fig. 3). These additional failed reactions could be due to polymorphisms within the primer sites or large insertions between the primers. Further PCR-based validation was conducted by using previous data on insertion/deletion polymorphism (IDP) markers (Fu et al. 2006) at 75 CGH predicted PAV in Mo17 or Oh43. The data from Fu et al. (2006) supported the existence of structural variation at ~85% of the tested loci. Finally, 657 genes identified in this study with structural variation between B73 and Mo17 were also represented (with a minimum of three probes) in a previous high-density CGH analysis of these two lines (Springer et al. 2009), and 96% of these genes exhibit consistent signal changes in the two studies. Many of the same genes (108/180) that were identified in a previous study of B73 and Mo17 (Springer et al. 2009) were identified in the current study.

Distribution of structural variation within maize diversity
The physical positions of genes with structural variants were visualized across the maize chromosomes (Fig. 2). While the majority of structural variants were limited to single genes, there are many examples of larger structural variants that affected multiple nearby genes (Fig. 1A,B; Table 4). These larger structural variants were often observed in numerous maize genotypes. The largest PAV event includes 25 adjacent genes on chromosome 6 (Fig. 1A), which is present in 11 of 25 domesticated maize lines and 3 of 14 teosinte lines, and absent in other genotypes. This region was previously identified as present in B73 and absent in Mo17 (Springer et al. 2009) as well as several other genotypes (Belo et al. 2010). This insertion/deletion variant also segregates among teosinte individuals, suggesting that this large insertion/deletion is not a result of selection or inbreeding that has occurred during maize domestication or improvement. The largest UpCNV event, which includes nine genes located on chromosome 7, is observed in 6/25 domesticated maize lines and 6 of 14 teosinte lines.
The observation of genes affected by structural variation in a diverse set of maize and teosinte lines provided the opportunity to address several questions about the distribution of these events within maize. Individual genotypes differed from B73 at between 21 and 217 (mean = 114) UpCNVs and between 405 and 1375 DownCNV/PAV (mean = 917). As expected, the teosinte lines showed slightly greater divergence, differing by an average of 999 DownCNV/PAV compared with an average of 852 in maize. The majority of structural variants were observed in more than five of the genotypes tested (Fig. 3A). The finding that many of the structural variants are present at common frequencies suggests
Figure 2. Distribution and frequency of structural variation throughout the maize genome. The physical locations of the 32,487 genes are plotted along the 10 maize chromosomes. The color of each gene indicates whether structural variation was observed and the type of variation and the y-axis indicates the number of genotypes that contain the structural variant.
Table 2. Recombination rate affects frequency of CNV

Recombination frequency   Total genes   Mb/cM   Proportion of genes with CNV(a)   Proportion of genes with PAV(b)
High                      19,234        0.45    0.030                             0.087
Moderate                  4699          1.61    0.027                             0.077
Low                       8493          7.42    0.042                             0.088

(a) The proportion of genes within each class of recombination frequency that are affected by CNV is shown. The observed distribution is significantly (P < 0.0001) different from expected (χ2 test). Both the UpCNV and DownCNV show similar distributions.
(b) The observed proportion of genes affected by PAV is not significantly different from the expected proportions.
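The χ2 comparison summarized in Table 2 can be approximated from the published values. The sketch below is an illustration under stated assumptions, not the authors' calculation: CNV gene counts are reconstructed from the rounded proportions in the table, the test is a Pearson χ2 on a 3 × 2 table (CNV versus non-CNV genes per recombination class), and the names `TABLE2` and `chi_square_cnv` are hypothetical.

```python
import math

# Values copied from Table 2: class -> (total genes, proportion with CNV).
TABLE2 = {
    "High": (19234, 0.030),
    "Moderate": (4699, 0.027),
    "Low": (8493, 0.042),
}


def chi_square_cnv(table):
    """Pearson chi-square for CNV vs. non-CNV genes across three classes.

    Assumption: CNV counts are rebuilt from rounded proportions, so the
    statistic is only an approximation of the published test.
    """
    if len(table) != 3:
        raise ValueError("closed-form P-value below assumes 2 degrees of freedom")
    counts = [(n, round(n * p)) for n, p in table.values()]
    grand_total = sum(n for n, _ in counts)
    overall_rate = sum(cnv for _, cnv in counts) / grand_total

    stat = 0.0
    for n, cnv in counts:
        cells = ((cnv, n * overall_rate), (n - cnv, n * (1 - overall_rate)))
        for observed, expected in cells:
            stat += (observed - expected) ** 2 / expected

    # With 2 degrees of freedom the chi-square survival function is exp(-x/2).
    p_value = math.exp(-stat / 2)
    return stat, p_value


if __name__ == "__main__":
    stat, p_value = chi_square_cnv(TABLE2)
    print(f"chi-square = {stat:.1f}, P = {p_value:.2e}")
```

The closed-form P-value is used only because three classes give exactly two degrees of freedom; a general implementation would call a statistics library instead.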
that these structural variants are tolerated in the homozygous state and, at least for the domesticated lines, are not associated with major fitness costs. We proceeded to assess the frequency of rare events separately in maize and teosinte (Fig. 3B,C). Teosinte has a higher frequency of unique structural variants than maize, possibly reflecting higher levels of diversity in teosinte or structural variants that are tolerated in heterozygotes, but would be deleterious in inbred genotypes. It should be noted that our power to detect structural variants is much lower when they are present as heterozygotes than as homozygotes based on a comparison of the CNV detected in B73xMo17 F1 plants relative to those detected in Mo17. Given this limitation and the fact that the majority (10/14) of teosinte genotypes tested are segregating individuals from wild populations, it is likely that the bias toward rare events in teosinte is even higher than actually observed within our data. A small proportion of structural variants (3%) are observed only in teosinte, while ~11% of the structural variants are only observed in domesticated maize lines. The remaining 86% of the variants are observed in both maize and teosinte. It is likely that the identification of fewer teosinte-specific events is due in part to the inclusion of fewer teosinte genotypes.
We proceeded to further assess the frequency of structural variants in subpopulations of maize. The reference B73 genome represents the stiff stalk subpopulation of maize. Each of the other genotypes was assigned to one of five other subpopulations based on pedigree or SNP data (Hansey et al. 2010). The subpopulations include nonstiff stalk (n = 4), tropical (n = 5), ex-plant varietal protection (n = 6), inbred teosinte (n = 4), or wild teosinte (n = 10). To visualize the distribution of both UpCNV and DownCNV/PAV within these subpopulations, event frequencies within subpopulations were used for hierarchical clustering (Fig. 4; Supplemental Fig. 4). The clustering identified variants that are restricted to certain subpopulations of maize or those that are present in multiple populations.

Characterization of genes affected by structural variation
The observation that many maize genes can vary in copy number or presence among genotypes leads to queries about the potential functional impacts. Two observations suggest that genes affected by structural variation are enriched for sequences with low levels of conservation among species. First, genes showing structural variation are significantly enriched (~1.5-fold) for genes that do not have any Gene Ontology (GO) annotation. Second, all classes of CNV genes are significantly enriched (2.8-fold overall) for maize-specific genes based on homolog clustering with annotated genes of rice, sorghum, and Arabidopsis (Table 5). The 1488 maize-specific genes affected by CNV/PAV include 1097 for which no additional homologs were found within maize and 391 that are in multigene families. The lack of clear homologs in other species is consistent with the prediction that many PAV genes may have nonessential functions, and may indicate that some of these sequences are previously unclassified transposable elements. Indeed, some examples of “gene” content variation among rice subspecies were later identified as transposons, and it can be difficult to identify and eliminate all transposons in genome-wide analyses (Bennetzen et al. 2004).
The remaining 2317 maize genes affected by CNV/PAV are conserved in other plant species, and among these, 2231 have orthologs identified in rice and/or sorghum. The relative genomic positions of orthologous genes were compared with rice and sorghum to determine how often the structural variant genes are located at syntenic positions. Among all orthologous maize genes (n = 27,550), 85.5% are syntenic. This compares with only 64.9% of orthologous CNV/PAV genes, a significant reduction (χ2, P < 0.0001). Lack of synteny could have resulted from gene movement from its ancestral position or from gene duplication concomitant with movement, thereby leaving an intact ancestral copy. Such cases would be manifested by the existence of syntenic co-orthologs, i.e., genes that are paralogous to CNV genes and have a common rice or sorghum ortholog to which synteny has been maintained. Overall, we detected 1964 nonsyntenic maize genes that have syntenic co-orthologs. Over 21% of these (424) correspond to CNV/PAV genes identified in this study, almost twice that expected by chance (χ2, P < 0.0001). Thus, many of the structural variant genes with orthologs in other grasses are within-species duplicates that have moved from their ancestral positions. No evidence was found that these genes belong to the PACK-MULE or helitron classes of transposons (Schnable et al. 2009), which are known to mediate gene capture and movement in maize (Bennetzen 2005; Lai et al. 2005; Morgante et al. 2005; Schnable et al. 2009; Yang and Bennetzen 2009). Thus, other mechanisms appear to be at play.

Table 3. Validation of multigene PAV events

                    Validation of aCGH absent calls                    Validation of aCGH present calls
Gene ID             No. of genotypes absent (aCGH)   No. of PCR consistent   No. of genotypes present (aCGH)   No. of PCR consistent
GRMZM2G143324       16                               15                      22                                15
GRMZM2G016150       15                               13                      23                                11
GRMZM2G117319       11                               10                      27                                26
GRMZM2G098697       14                               14                      24                                24
GRMZM2G109830       10                               10                      28                                11
GRMZM2G072567       9                                8                       29                                28
GRMZM2G300077       15                               13                      23                                23
GRMZM2G095634       8                                8                       30                                26
GRMZM2G704345       20                               18                      18                                16
GRMZM2G703559       20                               19                      18                                15
GRMZM2G093712       10                               8                       28                                26
AC194853.1_FG002    12                               10                      26                                20
Total               160                              146 (91%)               296                               241 (81%)
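The totals and confirmation rates in Table 3 follow directly from the per-gene counts. The short sketch below recomputes them as a consistency check; the row values are copied from Table 3, while the names `TABLE3` and `confirmation_rates` are hypothetical and introduced only for this illustration.

```python
# Rows copied from Table 3:
# (gene ID, absent calls, absent confirmed by PCR,
#  present calls, present confirmed by PCR)
TABLE3 = [
    ("GRMZM2G143324", 16, 15, 22, 15),
    ("GRMZM2G016150", 15, 13, 23, 11),
    ("GRMZM2G117319", 11, 10, 27, 26),
    ("GRMZM2G098697", 14, 14, 24, 24),
    ("GRMZM2G109830", 10, 10, 28, 11),
    ("GRMZM2G072567", 9, 8, 29, 28),
    ("GRMZM2G300077", 15, 13, 23, 23),
    ("GRMZM2G095634", 8, 8, 30, 26),
    ("GRMZM2G704345", 20, 18, 18, 16),
    ("GRMZM2G703559", 20, 19, 18, 15),
    ("GRMZM2G093712", 10, 8, 28, 26),
    ("AC194853.1_FG002", 12, 10, 26, 20),
]


def confirmation_rates(rows):
    """Sum calls and PCR-consistent calls, returning counts and percentages."""
    absent_calls = sum(row[1] for row in rows)
    absent_confirmed = sum(row[2] for row in rows)
    present_calls = sum(row[3] for row in rows)
    present_confirmed = sum(row[4] for row in rows)
    return {
        "absent": (absent_confirmed, absent_calls,
                   100 * absent_confirmed / absent_calls),
        "present": (present_confirmed, present_calls,
                    100 * present_confirmed / present_calls),
    }


if __name__ == "__main__":
    for call_type, (confirmed, calls, pct) in confirmation_rates(TABLE3).items():
        print(f"{call_type}: {confirmed}/{calls} confirmed ({pct:.1f}%)")
```

Keeping the per-gene rows rather than only the totals makes it easy to see which loci account for most of the unconfirmed present calls.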
Table 4. Structural variants affecting multiple genes

Genes per event    UpCNV    DownCNV/PAV
1                  353      2979
2                  32       134
3                  10       31
4                  2        3
5                  0        5
6                  0        4
7                  1        3
8                  1        1
9                  1        0
25                 0        1
Total events       400      3161
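As a reading aid for Table 4, the sketch below tallies how many events span more than one gene and how many genes the events cover in total. It only restates the tabulated counts; the helper name `summarize` is hypothetical.

```python
# Event counts transcribed from Table 4, keyed by the number of genes per event.
UP_CNV = {1: 353, 2: 32, 3: 10, 4: 2, 5: 0, 6: 0, 7: 1, 8: 1, 9: 1, 25: 0}
DOWN_CNV_PAV = {1: 2979, 2: 134, 3: 31, 4: 3, 5: 5, 6: 4, 7: 3, 8: 1, 9: 0, 25: 1}


def summarize(events):
    """Return (total events, events spanning >1 gene, genes covered by all events)."""
    total_events = sum(events.values())
    multigene_events = sum(n for size, n in events.items() if size > 1)
    genes_covered = sum(size * n for size, n in events.items())
    return total_events, multigene_events, genes_covered


up_summary = summarize(UP_CNV)
down_summary = summarize(DOWN_CNV_PAV)
print("UpCNV:", up_summary)
print("DownCNV/PAV:", down_summary)
```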
The genes affected by structural variation are often part of large gene families in the reference genome and are significantly less likely to be single-copy genes (Table 4). In particular, the genes affected by structural variation are often found within paralogous clusters. Only 18.6% (4263/22,948) of all maize genes within multigene families reside in paralogous clusters, compared with 30.4% (730/2399) of CNV genes (χ², P < 0.0001). This observation is consistent with the expectation that paralogous clusters are rapidly evolving and unstable with respect to copy number. The functional annotations of the CNV/PAV genes were assessed using the biological network gene ontology tool (BiNGO) (Maere et al. 2005) to identify overrepresented functional categories. There are relatively few functional categories that exhibit over-representation (Supplemental Fig. 5; Supplemental Table 4). The UpCNV genes exhibit enrichment for thylakoid-related genes, which may reflect intraspecific variation for specific chloroplast/mitochondrial DNA insertions as previously noted by Lough et al. (2008). The enrichment for membrane proteins (UpCNV) and genes involved in stress response (DownCNV/PAV) may be a consequence of the enrichment for these types of genes in tandem arrays (Rizzon et al. 2006). The list of genes affected by structural variation was compared with several manually curated gene lists (Table 1), including 420 genes defined by classical genetics, 269 non-histone chromatin genes, 1198 cell wall genes, and 1723 transcription factors. For each of these lists, the number of genes affected by structural variation was less than expected based on the frequency of all genes affected by structural variation (χ², P < 0.005). However, a number of genes within these lists do exhibit structural variation. For example, several instances of copy number variation were supported by prior analyses of variation in maize.
The pericarp color1 (p1) gene was identified as a putative DownCNV, and previous studies have documented tandem repeats for this gene (Chopra et al. 1998), including 11 tandem repeats in B73 (Goettel and Messing 2009). A qPCR analysis of the copy number for the p1 coding sequence (data not shown) indicates that many of the genotypes with relatively low signal, such as M37W, P39, TIL1, TIL9, TIL17, and TIL15, are likely to have only a single copy of the p1 coding sequence and confirms the relative copy number changes observed by aCGH. The bZIP factor opaque-2 heterodimerizing protein1 (OHP1) was also identified as a DownCNV (Fig. 5). At least two closely related OHP1 sequences are present as tandem duplicates in B73. Copy number variation for OHP1 was previously documented (Pysh and Schmidt 1996) in some maize genotypes, including a single copy of OHP1 in Oh43 and Tx303 and multiple copies in W22. Our data are in agreement for these varieties and additionally provide evidence that OHP1 is present as a single copy in approximately 17 of the genotypes tested, and that there are at least two copies in the other genotypes (Fig. 5). There is also evidence for potential copy number variation from the CGH data (Fig. 5), as well as previous SSR studies (http://maizegdb.org/) for the globulin1 gene. Interestingly, in each of these three examples, the majority of teosinte lines have low copy number for these genes, while many of the domesticated maize genotypes have complex, multicopy alleles.
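The paralogous-cluster comparison reported above (730/2399 CNV genes versus 4263/22,948 genes in multigene families) can be illustrated with a Pearson χ² test on a 2×2 table. This is a sketch under stated assumptions rather than a reconstruction of the authors' test: it assumes the 2399 CNV genes are a subset of the 22,948 multigene-family genes, and it uses no continuity correction. The function name `chi_square_2x2` is illustrative.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (chi2, p_value). For 1 degree of freedom the upper-tail
    probability is erfc(sqrt(chi2 / 2)).
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


# Counts quoted in the text; the subset relationship is an assumption.
cnv_in_cluster, cnv_total = 730, 2399
all_in_cluster, all_total = 4263, 22948

other_in_cluster = all_in_cluster - cnv_in_cluster
other_total = all_total - cnv_total

chi2, p_value = chi_square_2x2(
    cnv_in_cluster, cnv_total - cnv_in_cluster,
    other_in_cluster, other_total - other_in_cluster,
)
print(f"chi2 = {chi2:.1f}, P = {p_value:.3g}")
```

Under these assumptions the P-value falls far below the 0.0001 threshold quoted in the text.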
Figure 3. Enrichment for rare CNV/PAV in teosinte genotypes. (A) The number of genotypes containing each event was determined and the percent of events was plotted. Only 10% of structural variants are detected in one or two genotypes, while over 60% of structural variant events are detected in at least six genotypes. (B) The plot shows the allele frequency distribution for structural variant events in teosinte (black) and maize (gray). The proportion of DownCNV/PAV that are observed in one to 16 genotypes is shown. Teosinte has an excess of DownCNV/PAV observed in a single genotype relative to maize genotypes. (C) A similar plot is shown illustrating the distribution of allele frequency for UpCNV in maize (gray) and teosinte (black) genotypes.

Figure 4. Structural variation haplotype frequencies in subpopulations of maize. Each of the genotypes was assigned to a subpopulation based on pedigree information or structure analysis. The subpopulations are nonstiff stalk (NSS, n = 4), ex-plant varietal protection varieties (exPVP, n = 6), inbred teosinte (TeoI, n = 4), wild teosinte (TeoW, n = 10), or tropical (Trop, n = 5). The frequency of the structural variant within this subpopulation was used to perform hierarchical clustering of both the structural variants and the subpopulations. The color indicates the type and frequency of each structural variant, with blue indicating DownCNV/PAV and red indicating UpCNV. The brighter colors represent higher allele frequencies.

Discussion
The CGH analysis of diverse domesticated maize genotypes as well as teosinte lines revealed pervasive structural variation affecting over 10% of the genes annotated in the B73 reference genome (61% of which have homologs in other grasses). If we restrict our analysis to genes associated with GO annotation terms, we find that 8% of these genes are affected by CNV/PAV. The identification of genes affected by CNV or PAV in a diverse panel of maize genotypes allowed us to characterize the portion of this complex plant genome for which loss is tolerated. In addition, it provided an opportunity to investigate the distribution of structural variant events in domesticated and undomesticated maize and to speculate on potential phenotypic contributions of structural variation in maize. The presence of substantial structural variation affecting gene content has implications for the application of the reference genome concept and how a reference genome is used to "anchor" next-generation sequence data from DNA or RNA of other individuals. It is worth noting that the number of CNV and PAV identified in this study is likely an underestimate of the actual number of CNV and PAV, since the current analysis could only detect structural variation within genes, and previous studies have found that only about one-third of the variants in low-copy maize DNA included genes (Springer et al. 2009). The actual number of genic CNV/PAV may also be underestimated since relatively strict criteria were used to call variants, and we may not have had sufficient power to detect rare CNV/PAV, particularly in the segregating teosinte individuals for which many variants may be present as heterozygotes, and thus not detected.
Mechanisms of structural variation
Current understanding of evolutionary mechanisms for producing and maintaining copy number variants (specifically gene duplications) is limited. Recombination- and replication-based mechanisms of CNV emergence have been proposed (Innan and Kondrashov 2010). The variation in the frequency of CNV in regions of high and low recombination suggests that recombination-based mechanisms play a role in either creating or maintaining CNV within the maize genome. Interestingly, the low-recombination regions had elevated frequencies of CNV. Both UpCNV and DownCNV show elevated frequencies within the low-recombination central portion of maize chromosomes. It is possible that this reflects a requirement for recombination in order to remove local duplications and eliminate CNV. Alternatively, it is possible that intrachromosomal recombination is elevated in these regions with lower interchromosomal recombination. In contrast, PAV rates were not influenced by recombination rate and are likely produced by mechanisms different from CNV. Woodhouse et al. (2010) studied the fractionation of genome regions that result from whole-genome duplication events. They found evidence for a short deletion mechanism that utilizes short direct repeats to explain differences in gene content within the duplicated regions of the reference maize genome. This mechanism is likely to also contribute to the high rate of PAV that we observe among maize genotypes.
Toleration of gene loss in maize inbreds
It is surprising that individuals of the same species can have such variable gene copy number and content. A recent study (Conrad et al. 2010) found that two human individuals have ~1000 CNV/PAV that affect approximately 600 genes, and that roughly 35% of these could be identified as homozygotes. However, there are relatively few examples of CNV/PAV sequences being linked to disease (The Wellcome Trust Case Control Consortium 2010), suggesting that relatively few of these CNV/PAV have major phenotypic consequences. Similarly, in the current study we have identified numerous CNV/PAV within both maize and teosinte genotypes. Given that the maize genotypes we assayed are highly inbred (and therefore homozygous for the CNV/PAV) and have been selected for agricultural productivity, the majority of these CNV/PAV are not likely associated with lethality or major loss of fitness in an agricultural environment. Moreover, the presence of most of these variants in teosinte means that these variants are segregating in natural populations and are therefore unlikely to have strongly negative effects on fitness, at least not as heterozygotes. This leads to a question of how substantial levels of gene loss can be tolerated with relatively low perturbation of phenotype. The types of genes that are affected and the complex structure of the maize genome may provide clues as to how gene loss is tolerated in maize inbred lines. A subset (~40%) of the genes subject to structural variation are not found in the genomes of other model plants (Arabidopsis, rice, sorghum). These sequences may be novel transposon sequences or novel transcribed sequences that do not encode functional genes. Many of the remaining CNV/PAV genes that did have annotations and/or orthologs are present in gene families that include members at syntenic positions or within paralogous clusters. The maize genome, which arose from an ancient allopolyploidization event (Gaut and Doebley 1997; Swigonova et al. 2005; Wei et al. 2007), has many gene families with a very high level of redundancy. Gene losses within these gene families may be tolerated if they result in only minor differences in phenotype or
fitness. For example, the Gnarley1 (Gn1) locus, a member of the knox gene family, was identified as "absent" in five genotypes. Ectopic expression of Gn1 can result in morphological phenotypes, but loss-of-function alleles of Gn1 do not result in major phenotypic consequences (Foster et al. 1999). Analysis of 16 of the genes affected by PAV that are included on the list of classically defined maize genes (http://synteny.cnr.berkeley.edu/wiki/index.php/Classical_Maize_Genes) reveals that the majority of these (14/16) have duplicates located within the collinear portion of the maize genome.
The observation that many of the genes affected by CNV or PAV are members of gene families has some important implications for the phenotypic consequences of PAV in plant genomes. Many plant genomes have substantial levels of gene duplication that have arisen from whole-genome duplications as well as other mechanisms (Freeling 2009). Even the relatively small genomes of Arabidopsis and rice contain evidence for ancient whole-genome duplications (Blanc et al. 2003; Yu et al. 2005; Paterson et al. 2006). Comparisons of plant genomes have revealed relatively high levels of instability and frequent gene loss that often affects members of gene families (Bennetzen 2007; Freeling 2009; Woodhouse et al. 2010). If we assume that there is redundancy or partial redundancy for function within the gene family, then the effect of losing a single member of a gene family can be genetically buffered by the family members. In effect, this means that within complex, highly duplicated genomes, a PAV is likely to contribute quantitative variation rather than major, qualitative defects. This may result in high levels of structural variation in crop plant genomes that contributes to important quantitative variation. Indeed, there are recent examples of rice quantitative trait loci (QTL) that are caused by deletion of genes (Shomura et al. 2008; Zhou et al. 2009).

Table 5. Gene family sizes and species specificity of CNV genes

                                   Maize             Percent of nonspecific maize genes by family size
Gene class       Count     specific (%)(a)     1        2–5      6–10     >10
All maize        32,540    4538 (13.9)         21.7     44.4     13.3     20.7
CNV (All)(b)     3804      1487 (39.1)         13.3     40.8     16.4     29.4
CNV (Down)(b)    3325      1307 (39.3)         14.1     41.9     15.5     28.6
CNV (Up)(b)      402       133 (33.1)          8.6      32.7     21.9     36.8
CNV (Both)(b)    77        47 (61.0)           6.7      43.3     30       20

(a) Includes genes not assigned to families due to failure to cluster (3521 of total and 1096 of CNV genes) and genes assigned to families that lack membership of rice, sorghum, or Arabidopsis (1017 of total and 392 of CNV genes). Deviation from expected values were highly significant (P < 0.0001) for all CNV classes based on χ² tests.
(b) Deviation from expected family size distributions were highly significant (P < 0.0001) for all CNV classes except for CNV (Both), which yielded a significant P-value of 0.0241.

Figure 5. Examples of CNV for previously characterized maize genes. The CGH data are summarized for three maize genes. For each genotype the average log2 ratio for all probes from the gene is summarized as the height of the bar, and the standard deviation for the multiple probes is represented by the error bars.

Implications of structural variation for heterosis
The concept of partial redundancy within gene families, coupled with high rates of CNV/PAV that affect different inbred varieties, may have implications for heterosis. Generally, heterosis is considered at the level of two alleles that may provide complementation when present in a heterozygote. However, it may be useful to consider each member of a gene family as an "allele" that provides partial to complete functionality for the gene family. Inbred lines show relatively high rates of CNV/PAV that affect the copy number, or presence, of individual members of gene families. The loss of a single member of a gene family may result in a relatively minor loss of the total functionality of the gene family as other family members provide compensatory function. The cumulative effect of many gene families lacking partially redundant members would result in decreased vigor in the inbreds. However, the loss-of-function would be complemented (at a genomic, not allelic) level in the hybrid, resulting in substantial hybrid vigor. The hypothesis that heterosis is the result of restoring full functionality of gene families would suggest that heterosis would be more prevalent in organisms with high levels of gene duplication and variation affecting individual family members.
It has been suggested that variation in gene content among maize inbred lines could contribute to heterosis or hybrid vigor (Fu and Dooner 2002; Springer and Stupar 2007; Springer et al. 2009). High levels of variability in gene content among inbred lines will result in hybrids containing more genes than either inbred parent and, indeed, expression studies have found that hybrids express more genes than either parent (Stupar and Springer 2006; Stupar et al. 2008). Historically, the complementation model of heterosis has been supported by the fact that an inbred line has not been created with all superior alleles (Birchler et al. 2003). Due to the high number of PAVs, it would be very difficult to create an inbred line containing all genes. Many of the maize inbreds were missing 500–1000 genes relative to B73. If we assume that each of these lines contains a similar number of genes that are not in B73, it becomes quite difficult to identify a series of recombination events that would create a chromosome containing all genes. Furthermore, the current complex arrangement of different complements of genes in the two haplotypes of a heterozygote can lead to apparent pseudo-overdominance. This would be a particular problem in the low-recombination centromeric regions of each chromosome. In total, these low-recombination regions include ~750 PAV genes, and the low rate of recombination events would make it quite difficult to generate ideal haplotypes. Recent analyses of residual heterozygosity suggest that these low-recombination regions may be particularly important for heterosis (McMullen et al. 2009).
The allele frequencies that we observed for structural variants suggest that some variants have been entirely removed from certain populations. Maize breeding efforts are often focused on breeding within a heterotic group, or subpopulation, to create inbreds that are crossed to an inbred from another heterotic group. We found evidence for a number of structural variants that are entirely missing from one subpopulation, limiting the potential for improvement on inbred lines through selection within that subpopulation only.
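The genome-level complementation argument made in this section (each inbred lacks some members of partially redundant gene families, and a hybrid carries the union of both parental gene complements) can be made concrete with a toy example. Everything below is hypothetical: the gene and family names are invented, and the set union is a deliberately simplified picture of gene content, not a model of heterosis or of the authors' data.

```python
def hybrid_gene_content(parent_a, parent_b):
    """Genes present in an F1 hybrid: the union of both parental complements."""
    return set(parent_a) | set(parent_b)


def family_members_present(genes, families):
    """Count, for each gene family, how many of its members are present."""
    return {name: len(genes & members) for name, members in families.items()}


# Hypothetical gene families and inbred lines (illustrative names only).
families = {
    "famA": {"A1", "A2", "A3"},
    "famB": {"B1", "B2"},
}
inbred_1 = {"A1", "A2", "B1"}   # lacks A3 and B2
inbred_2 = {"A1", "A3", "B2"}   # lacks A2 and B1

hybrid = hybrid_gene_content(inbred_1, inbred_2)

print(family_members_present(inbred_1, families))
print(family_members_present(inbred_2, families))
print(family_members_present(hybrid, families))
```

In this toy case each inbred is missing one member of every family, while the hybrid carries the full complement, which is the situation the complementation hypothesis describes.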
Distribution of structural variation within maize and teosinte
The identification of relatively few rare variants suggests that many of the structural variants represent haplotypes that have been segregating for some time in maize and teosinte populations. While technical aspects (such as the genome used as reference) and statistical power issues (the numbers of lines representing each subpopulation) may influence the ability to discover rare structural events, these are unlikely to completely account for the paucity of rare events observed in this study. The majority (~86%) of structural variants in this study were observed in both maize and teosinte, suggesting that they are relatively old events in terms of domestication. In addition, the presence of these events in teosinte would indicate that they are tolerated within natural populations and are not an artifact of many generations of artificial selection. A small proportion (~10%) of the variants were observed only in domesticated maize lines. Interestingly, many of these maize-specific events (252/347) are observed in three or fewer genotypes. Therefore, the maize-specific variants are enriched for rare alleles, and these may represent relatively new events that have arisen within breeding populations. The wild teosinte individuals used for this study were collected from populations located near the probable location of domestication (Piperno et al. 2009; Ranere et al. 2009). We searched for structural variants potentially associated with domestication by using the relative frequencies within maize and teosinte. We did not find any structural variants that were present in the majority of maize genotypes but not detected in any teosinte genotypes. However, it should be remembered that structural variants were documented based on comparisons to a reference domesticated maize line, and that genes present in teosinte, but not maize, cannot be detected.
Therefore, domestication-associated copy number variants would be expected to be present in most teosintes, but in few or no maize lines. There were only four variants that were observed in most (>85%) teosinte lines but in very few (less than three) maize genotypes, and thus there was no evidence for strong effects of domestication on structural variation.
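The frequency filter described above (variants seen in more than 85% of teosinte lines but in fewer than three maize genotypes) can be sketched as a simple screen over per-variant genotype counts. The variant identifiers and counts below are invented for illustration; only the two thresholds and the teosinte sample size (4 inbred plus 10 wild lines) come from the text.

```python
def domestication_candidates(counts, n_teosinte,
                             min_teosinte_fraction=0.85,
                             max_maize_genotypes=3):
    """Return variant ids present in most teosinte lines but very few maize lines.

    counts maps a variant id to a pair:
    (teosinte genotypes carrying the variant, maize genotypes carrying it).
    """
    hits = []
    for variant, (n_teo, n_maize) in counts.items():
        common_in_teosinte = n_teo / n_teosinte > min_teosinte_fraction
        rare_in_maize = n_maize < max_maize_genotypes
        if common_in_teosinte and rare_in_maize:
            hits.append(variant)
    return hits


# Hypothetical counts for four made-up variants.
toy_counts = {
    "sv_a": (13, 1),    # common in teosinte, rare in maize
    "sv_b": (14, 10),   # common in both
    "sv_c": (5, 0),     # rare in both
    "sv_d": (12, 2),    # 12/14 is about 86%, just above the threshold
}
candidates = domestication_candidates(toy_counts, n_teosinte=14)
print(candidates)
```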
The analysis of structural variation in maize and teosinte provides evidence for widespread genome content variation. This high level of variation could reflect the ancestral polyploid nature of the maize genome, the fact that maize has high rates of outcrossing, or active transposition and genome contraction processes that create a dynamic genome. In addition, studies on genome content variation within a species can be used to develop an understanding of the core genome (shared by all members of a species) and the non-core genome ("dispensable genome", as suggested by Morgante et al. 2007). It is likely that these structural variants will be associated with phenotypic diversity within maize, and further research is important to document how these variants affect phenotype. An understanding of which genes are affected by structural variation may provide a valuable resource to probe the function of many maize genes.
Methods
Array design
A custom long oligonucleotide microarray was designed by NimbleGen (Roche NimbleGen) using the 32,540 filtered maize genes predicted from the B73 reference genome (Schnable et al. 2009). Partial-length gene fragments and transposable elements are not included in this filtered gene set. The custom array included three to four probes (45–60 mers) each for 32,487 genes for which probes could be designed, as well as 17,995 control probes that are not present in the maize genome, but exhibit nucleotide frequencies similar to maize. Of the 119,609 genic probes, 114,854 (96%) were unique in the genome and 118,730 (99%) were present no more than two times in the B73 genome. Detailed information about the array format is available at GEO accession no. GPL10846 and this array can be ordered from Roche NimbleGen (product OID24389).
Plant materials
Maize inbred lines were obtained from the USDA North Central Regional Plant Introduction Station. Teosinte inbred lines were provided by John Doebley (University of Wisconsin, Madison). Teosinte accessions (Ames 21809, Ames 21810, and Ames 21814) originally collected from the Guerrero state of Mexico were obtained from the USDA North Central Regional Plant Introduction Station. All genotypes are listed in Supplemental Table 1 along with germplasm accession numbers. These include diverse maize inbred lines (n = 19), inbred teosinte lines (n = 4), and wild teosinte individuals (n = 10). Maize inbred lines B73 and Mo17 were hybridized in multiple replicates to assess consistency within array measurements.
DNA labeling and microarray hybridization
DNAs were isolated (Saghai-Maroof et al. 1984) from above-ground seedling tissue. DNA (0.5–1 μg) samples were labeled, amplified, and hybridized for 72–96 h at 42°C according to the array manufacturer’s protocol (NimbleGen Arrays User’s Guide: CGH Analysis v5.1). After washing, slides were immediately scanned using the GenePix 4000B Scanner (Molecular Devices) according to the array manufacturer’s protocol. Array images and data were processed using NimbleScan software. Experimental integrity was verified by evaluation of the signal uniformity across each array and the signal-to-noise ratio of experimental probes. A total of 71 samples (genotypes listed in Supplemental Table 1) provided high-quality data and were used for subsequent analyses; the raw data are available at GEO accession no. GSE23756.
Data normalization
The different genotypes examined are not equally diverged from the B73 reference genome used to develop the probe sequences. For this reason we normalized the data using an approach that does not assume similar distributions of data from each genotype. The implemented normalization approach assumes that, for any genotype, the majority of probes will not exhibit any significant variation relative to B73 and, therefore, the peak of the log2(signal/B73) histogram should be centered at a value of zero (Supplemental Fig. 1). Briefly, the DNAcopy algorithm was used to produce spatially normalized hybridization values for all probes for the 71 samples using NimbleScan (Roche NimbleGen). A robust B73 average (henceforth termed B73avg) was generated from nine replicate samples of B73 hybridization. Subsequently, the log2(signal/B73avg) was calculated for each probe for all 71 samples. The distributions of these ratios were normalized so that the mode of the distribution of log2(signal/B73avg) for each genotype equaled zero.
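The mode-centering step described above, together with the percentile-based calling rule given in the "Identification of CNV and PAV" subsection of these Methods, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the histogram bin width, the percentile convention, and all function names are hypothetical choices, and the full procedure is described only in Supplemental Fig. 1.

```python
import math
from collections import Counter


def log2_ratios(signal, b73_avg):
    """Per-probe log2(signal / B73avg)."""
    return [math.log2(s / b) for s, b in zip(signal, b73_avg)]


def center_on_mode(ratios, bin_width=0.05):
    """Shift ratios so the most populated histogram bin is centered at zero.

    bin_width is an illustrative assumption, not a value from the paper.
    """
    bins = Counter(math.floor(r / bin_width) for r in ratios)
    mode_bin = max(bins, key=bins.get)
    mode = (mode_bin + 0.5) * bin_width
    return [r - mode for r in ratios]


def positive_cutoff(ratios, percentile=99.0):
    """Percentile of the positive values only (nearest-rank convention)."""
    positives = sorted(r for r in ratios if r > 0)
    rank = math.ceil(percentile * len(positives) / 100)
    return positives[max(rank, 1) - 1]


def call_gene(probe_ratios, cutoff):
    """Call a gene only if every one of its probes passes the cutoff."""
    if all(r > cutoff for r in probe_ratios):
        return "UpCNV"
    if all(r < -cutoff for r in probe_ratios):
        return "DownCNV/PAV"
    return "no call"
```

A per-genotype workflow would compute the ratios, center them, derive one cutoff from that genotype's positive values, and then apply `call_gene` to the three or four probes of each gene. Using the mirrored (negative) cutoff for DownCNV/PAV follows the text's reasoning that negative ratios also reflect SNP polymorphism and are therefore not used to set the threshold.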
Identification of CNV and PAV
For each probe, the log2 ratio (relative to B73) is expected to be near zero if the same sequence is present in both genotypes. Following normalization, the histogram of all log2 ratios (relative to B73) revealed varying distributions of the data (Supplemental Fig. 1). The distribution of the log2 ratios is affected by both measurement error and biological variation. Because the amount of technical variation can vary between hybridizations, we calculated 99th percentile cut-off values for each genotype separately. The cut-off values were determined from the distribution of all data with values over 0, and subsequently used to identify genes with structural variation for each genotype (see Supplemental Fig. 1 for a full description of this process). UpCNV (more copies of a gene in some genotypes relative to B73) were identified as genes for which all probes (three or four) per gene had values above the 99% cut-off value. DownCNV/PAV (fewer copies or no copies of a gene in some genotypes relative to B73) were identified as genes for which all probes exhibited a log2 ratio below the negative value of the 99% cut-off value. It should be noted that the cut-off values for both UpCNV and DownCNV/PAV were determined based on the confidence interval for the subset of data with positive log2 values, since this subset of data will only reflect noise and structural variation, while negative log2 ratios would additionally reflect SNP polymorphism rates (Supplemental Fig. 1). This approach was quite stringent in that it required significant variation to be observed at all probes for a gene. We observed a very low false-positive rate (none to eight genes detected) when this approach was applied to any single B73 replicate. Following the stringent discovery process, a relaxed set of criteria was implemented (>95% cut-offs) to characterize the structural variant across all genotypes.
Functional characterization of genes affected by structural variation
The genomic distribution of genes was assessed using the genomic coordinates from the B73 reference genome for each of the genes. We identified multigene structural variants in cases where two or more adjacent genes exhibit the same type of structural variation (UpCNV or DownCNV/PAV) in a highly similar set of genotypes. The GOslim annotation (http://www.maizesequence.org) of genes that were affected by structural variation was assessed using BiNGO (Maere et al. 2005), a Cytoscape (Shannon et al. 2003) plugin that maps over-represented functional themes present in a given gene-set onto the GO hierarchy. P-values for enrichment of GOslim terms were calculated using a hypergeometric distribution statistical testing method with false discovery rate (FDR) correction (Benjamini and Hochberg 1995). The maize-specific genes and gene families were identified based on homolog clustering with annotated genes of rice, sorghum, and Arabidopsis using the method of Vilella et al. (2009) as previously described (Schnable et al. 2009). Paralogous clusters were defined as two or more genes belonging to the same gene family that were separated on a chromosome by no more than two nonparalogous intervening genes. Syntenic mapping of maize genes to rice and sorghum was previously described (Schnable et al. 2009). In addition, we examined the frequency of CNV/PAV using several manually curated gene lists, including classically defined maize genes (http://synteny.cnr.berkeley.edu/wiki/index.php/Classical_Maize_Genes), nonhistone chromatin genes (http://www.chromdb.org), transcription factors (http://www.grassius.org), and maize cell wall genes (http://cellwall.genomics.purdue.edu/).
Distribution of genes affected by structural variation
The distribution of CNVs and PAVs was compared within each of the 10 maize chromosomes (Table 2). Regions of high, moderate, and low recombination were determined based on the integrated physical-genetic map generated by Liu et al. (2009). The high-recombination regions are toward the ends of the chromosomes, while the low-recombination regions surround the centromeres. The distribution of CNVs and PAVs within the high- and low-recombination regions of all chromosomes were tested and P-values were produced from the χ² analysis (Table 2).
Validation of CNV/PAV
PCR primers were designed to amplify genomic sequence for 12 genes located within putative PAVs (Table 3). PCR and gel electrophoresis were conducted on the same samples and genotypes from the microarray experiment as per previously published methods (Haun and Springer 2008) using 60°C as the annealing temperature.
Acknowledgments
We thank John Doebley who very graciously provided stocks of the inbred teosinte lines, and several anonymous reviewers who provided very helpful suggestions for analyses. Peter Hermanson helped with DNA isolation and microarray hybridization. The Minnesota Supercomputing Institute provided access to software and user support for data analyses. This project was supported by USDA Hatch funds, the Microbial and Plant Genomics Institute, and a grant from the National Science Foundation to N.M.S. (IOS0922095), and by grants from the NSF (DBI-0527192) and USDA (1907-21000-030-00) to D.W.
References
Beckmann JS, Estivill X, Antonarakis SE. 2007. Copy number variants and genetic traits: Closer to the resolution of phenotypic to genotypic variability. Nat Rev Genet 8: 639–646. Belo A, Beatty MK, Hondred D, Fengler KA, Li B, Rafalski A. 2010. Allelic genome structural variations in maize detected by array comparative genome hybridization. Theor Appl Genet 120: 355–367. Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol 57: 289–300. Bennetzen J. 2005. Transposable elements, gene creation and genome rearrangement in flowering plants. Curr Opin Genet Dev 15: 621–627. Bennetzen JL. 2007. Patterns in grass genome evolution. Curr Opin Plant Biol 10: 176–181. Bennetzen JL, Coleman C, Liu R, Ma J, Ramakrishna W. 2004. Consistent over-estimation of gene number in complex plant genomes. Curr Opin Plant Biol 7: 732–736.
Beroukhim R, Mermel CH, Porter D, Wei G, Raychaudhuri S, Donovan J, Barretina J, Boehm JS, Dobson J, Urashima M, et al. 2010. The landscape of somatic copy-number alteration across human cancers. Nature 463: 899–905. Birchler J, Auger D, Riddle N. 2003. In search of the molecular basis of heterosis. Plant Cell 15: 2236–2239. Blanc G, Hokamp K, Wolfe KH. 2003. A recent polyploidy superimposed on older large-scale duplications in the Arabidopsis genome. Genome Res 13: 137–144. Brunner S, Fengler K, Morgante M, Tingey S, Rafalski A. 2005. Evolution of DNA sequence nonhomologies among maize inbreds. Plant Cell 17: 343–360. Bucan M, Abrahams BS, Wang K, Glessner JT, Herman EI, Sonnenblick LI, Alvarez Retuerto AI, Imielinski M, Hadley D, Bradfield JP, et al. 2009. Genome-wide analyses of exonic copy number variants in a family-based study point to novel autism susceptibility genes. PLoS Genet 5: e1000536. doi: 10.1371/journal.pgen.1000536. Buckler E, Gaut B, McMullen M. 2006. Molecular and functional diversity of maize. Curr Opin Plant Biol 9: 172–176. Chen WK, Swartz JD, Rush LJ, Alvarez CE. 2009. Mapping DNA structural variation in dogs. Genome Res 19: 500–509. Chopra S, Athma P, Li XG, Peterson T. 1998. A maize Myb homolog is encoded by a multicopy gene complex. Mol Gen Genet 260: 372–380. Conrad DF, Andrews TD, Carter NP, Hurles ME, Pritchard JK. 2006. A high-resolution survey of deletion polymorphism in the human genome. Nat Genet 38: 75–81. Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, Aerts J, Andrews TD, Barnes C, Campbell P, et al. 2010. Origins and functional impact of copy number variation in the human genome. Nature 464: 704–712. Cooper GM, Nickerson DA, Eichler EE. 2007. Mutational and selective effects on copy-number variants in the human genome. Nat Genet 39: S22–S29. Emrich SJ, Barbazuk WB, Li L, Schnable PS. 2007. Gene discovery and annotation using LCM-454 transcriptome sequencing. Genome Res 17: 69–73.
Feuk L, Carson AR, Scherer SW. 2006. Structural variation in the human genome. Nat Rev Genet 7: 85–97. Foster T, Yamaguchi J, Wong BC, Veit B, Hake S. 1999. Gnarley1 is a dominant mutation in the knox4 homeobox gene affecting cell shape and identity. Plant Cell 11: 1239–1252. Freeling M. 2009. Bias in plant gene content following different sorts of duplication: Tandem, whole-genome, segmental, or by transposition. Annu Rev Plant Biol 60: 433–453. Fu H, Dooner HK. 2002. Intraspecific violation of genetic colinearity and its implications in maize. Proc Natl Acad Sci 99: 9573–9578. Fu Y, Wen TJ, Ronin YI, Chen HD, Guo L, Mester DI, Yang Y, Lee M, Korol AB, Ashlock DA, et al. 2006. Genetic dissection of intermated recombinant inbred lines using a new genetic map of maize. Genetics 174: 1671– 1683. Gaut BS, Doebley JF. 1997. DNA sequence evidence for the segmental allotetraploid origin of maize. Proc Natl Acad Sci 94: 6809–6814. Goettel W, Messing J. 2009. Change of gene structure and function by non-homologous end-joining, homologous recombination, and transposition of DNA. PLoS Genet 5: e1000516. doi: 10.1371/ journal.pgen.1000516. Graubert TA, Cahan P, Edwin D, Selzer RR, Richmond TA, Eis PS, Shannon WD, Li X, McLeod HL, Cheverud JM, et al. 2007. A high-resolution map of segmental DNA copy number variation in the mouse genome. PLoS Genet 3: e3. doi: 10.1371/journal.pgen.0030003. Guryev V, Saar K, Adamovic T, Verheul M, van Heesch SA, Cook S, Pravenec M, Aitman T, Jacob H, Shull JD, et al. 2008. Distribution and functional impact of DNA copy number variation in the rat. Nat Genet 40: 538–545. Hansey CN, Johnson JM, Sekhon RS, Kaeppler SM, de Leon N. 2010. Genetic diversity of a maize association population with restricted phenology. Crop Sci (in press). doi: 10.2135/cropsci2010.03.0178. Haun WJ, Springer NM. 2008. Maternal and paternal alleles exhibit differential histone methylation and acetylation at maize imprinted genes. Plant J 56: 903–912. 
Hurles ME, Dermitzakis ET, Tyler-Smith C. 2008. The functional impact of structural variation in humans. Trends Genet 24: 238–245. Innan H, Kondrashov F. 2010. The evolution of gene duplications: Classifying and distinguishing between models. Nat Rev Genet 11: 97–108. Kato A, Lamb JC, Birchler JA. 2004. Chromosome painting using repetitive DNA sequences as probes for somatic chromosome identification in maize. Proc Natl Acad Sci 101: 13554–13559. Kidd JM, Cooper GM, Donahue WF, Hayden HS, Sampas N, Graves T, Hansen N, Teague B, Alkan C, Antonacci F, et al. 2008. Mapping and sequencing of structural variation from eight human genomes. Nature 453: 56–64.
1698
Genome Research www.genome.org
Lai J, Li Y, Messing J, Dooner HK. 2005. Gene movement by Helitron transposons contributes to the haplotype variability of maize. Proc Natl Acad Sci 102: 9068–9073. Lee AS, Gutierrez-Arcelus M, Perry GH, Vallender EJ, Johnson WE, Miller GM, Korbel JO, Lee C. 2008. Analysis of copy number variation in the rhesus macaque genome identifies candidate loci for evolutionary and human disease studies. Hum Mol Genet 17: 1127–1136. Liu S, Yeh CT, Ji T, Ying K, Wu H, Tang HM, Fu Y, Nettleton D, Schnable PS. 2009. Mu transposon insertion sites and meiotic recombination events co-localize with epigenetic marks for open chromatin across the maize genome. PLoS Genet 5: e1000733. doi: 10.137l/journal.pgen.1000733. Lough AN, Roark LM, Kato A, Ream TS, Lamb JC, Birchler JA, Newton KJ. 2008. Mitochondrial DNA transfer to the nucleus generates extensive insertion site variation in maize. Genetics 178: 47–55. Maere S, Heymans K, Kuiper M. 2005. BiNGO: A Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21: 3448–3449. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, et al. 2009. Finding the missing heritability of complex diseases. Nature 461: 747–753. Maydan JS, Lorch A, Edgley ML, Flibotte S, Moerman DG. 2010. Copy number variation in the genomes of twelve natural isolates of Caenorhabditis elegans. BMC Genomics 11: 62. doi: 10.1186/1471-216411-62. McCarroll SA, Altshuler DM. 2007. Copy-number variation and association studies of human disease. Nat Genet 39: S37–S42. McClintock B, Yamakake TAK, Blumenschein A. 1981. Chromosome constitution of races of maize. Its significance in the interpretation of relationships between races and varieties in the Americas. Colegio de Postgraduados, Chapingo, Mexico. McMullen MD, Kresovich S, Villeda HS, Bradbury P, Li H, Sun Q, Flint-Garcia S, Thornsberry J, Acharya C, Bottoms C, et al. 2009. 
Genetic properties of the maize nested association mapping population. Science 325: 737– 740. Merikangas AK, Corvin AP, Gallagher L. 2009. Copy-number variants in neurodevelopmental disorders: Promises and challenges. Trends Genet 25: 536–544. Messing J, Dooner H. 2006. Organization and variability of the maize genome. Curr Opin Plant Biol 9: 157–163. Messing J, Bharti AK, Karlowski WM, Gundlach H, Kim HR, Yu Y, Wei F, Fuks G, Soderlund CA, Mayer KF, et al. 2004. Sequence composition and genome organization of maize. Proc Natl Acad Sci 101: 14349–14354. Morgante M, Brunner S, Pea G, Fengler K, Zuccolo A, Rafalski A. 2005. Gene duplication and exon shuffling by helitron-like transposons generate intraspecies diversity in maize. Nat Genet 37: 997–1002. Morgante M, De Paoli E, Radovic S. 2007. Transposable elements and the plant pan-genomes. Curr Opin Plant Biol 10: 149–155. Paterson AH, Chapman BA, Kissinger JC, Bowers JE, Feltus FA, Estill JC. 2006. Many gene and domain families have convergent fates following independent whole-genome duplication events in Arabidopsis, Oryza, Saccharomyces and Tetraodon. Trends Genet 22: 597–602. Penning BW, Hunter CT III, Tayengwa R, Eveland AL, Dugard CK, Olek AT, Vermerris W, Koch KE, McCarty DR, Davis MF, et al. 2009. Genetic resources for maize cell wall biology. Plant Physiol 151: 1703–1728. Perry GH, Yang F, Marques-Bonet T, Murphy C, Fitzgerald T, Lee AS, Hyland C, Stone AC, Hurles ME, Tyler-Smith C, et al. 2008. Copy number variation and evolution in humans and chimpanzees. Genome Res 18: 1698–1710. Piperno DR, Ranere AJ, Holst I, Iriarte J, Dickau R. 2009. Starch grain and phytolith evidence for early ninth millennium B.P. maize from the Central Balsas River Valley, Mexico. Proc Natl Acad Sci 106: 5019–5024. Pysh LD, Schmidt RJ. 1996. Characterization of the maize OHP1 gene: Evidence of gene copy variability among inbreds. Gene 177: 203–208. Ranere AJ, Piperno DR, Holst I, Dickau R, Iriarte J. 2009. 
The cultural and chronological context of early Holocene maize and squash domestication in the Central Balsas River Valley, Mexico. Proc Natl Acad Sci 106: 5014–5018. Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, et al. 2006. Global variation in copy number in the human genome. Nature 444: 444–454. Rizzon C, Ponger L, Gaut BS. 2006. Striking similarities in the genomic distribution of tandemly arrayed genes in Arabidopsis and rice. PLoS Comput Biol 2: e115. doi: 10.1371/journal.pcbi.0020115. Robbins TP, Walker EL, Kermicle JL, Alleman M, Dellaporta SL. 1991. Meiotic instability of the R-r complex arising from displaced intragenic exchange and intrachromosomal rearrangement. Genetics 129: 271– 283. Saghai-Maroof MA, Soliman KM, Jorgensen RA, Allard RW. 1984. Ribosomal DNA spacer-length polymorphisms in barley: Mendelian inheritance, chromosomal location, and population dynamics. Proc Natl Acad Sci 81: 8014–8018.
Downloaded from genome.cshlp.org on December 25, 2010 - Published by Cold Spring Harbor Laboratory Press
Copy number variation in diverse maize genotypes Santuari L, Pradervand S, Amiguet-Vercher AM, Thomas J, Dorcey E, Harshman K, Xenarios I, Juenger TE, Hardtke CS. 2010. Substantial deletion overlap among divergent Arabidopsis genomes revealed by intersection of short reads and tiling arrays. Genome Biol 11: R4. doi: 10.1186/gb-2010-11-1-r4. Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, Liang C, Zhang J, Fulton L, Graves TA, et al. 2009. The B73 maize genome: Complexity, diversity, and dynamics. Science 326: 1112–1115. Sebat J. 2007. Major changes in our DNA lead to major changes in our thinking. Nat Genet 39: S3–S5. Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Maner S, Massa H, Walker M, Chi M, et al. 2004. Large-scale copy number polymorphism in the human genome. Science 305: 525–528. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. 2003. Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res 13: 2498–2504. Sharp AJ, Locke DP, McGrath SD, Cheng Z, Bailey JA, Vallente RU, Pertz LM, Clark RA, Schwartz S, Segraves R, et al. 2005. Segmental duplications and copy-number variation in the human genome. Am J Hum Genet 77: 78–88. Sharp AJ, Hansen S, Selzer RR, Cheng Z, Regan R, Hurst JA, Stewart H, Price SM, Blair E, Hennekam RC, et al. 2006. Discovery of previously unidentified genomic disorders from the duplication architecture of the human genome. Nat Genet 38: 1038–1042. Shomura A, Izawa T, Ebana K, Ebitani T, Kanegae H, Konishi S, Yano M. 2008. Deletion in a gene associated with grain size increased yields during rice domestication. Nat Genet 40: 1023–1028. Springer NM, Stupar RM. 2007. Allelic variation and heterosis in maize: How do two halves make more than a whole? Genome Res 17: 264–275. Springer NM, Ying K, Fu Y, Ji T, Yeh CT, Jia Y, Wu W, Richmond T, Kitzman J, Rosenbaum H, et al. 2009. 
Maize inbreds exhibit high levels of copy number variation (CNV) and presence/absence variation (PAV) in genome content. PLoS Genet 5: e1000734. doi: 10.1371/ journal.pgen.1000734. Stankiewicz P, Lupski JR. 2010. Structural variation in the human genome and its role in disease. Annu Rev Med 61: 437–455. Stupar RM, Springer NM. 2006. Cis-transcriptional variation in maize inbred lines B73 and Mo17 leads to additive expression patterns in the F1 hybrid. Genetics 173: 2199–2210. Stupar RM, Gardiner JM, Oldre AG, Haun WJ, Chandler VL, Springer NM. 2008. Gene expression analyses in maize inbreds and hybrids with varying levels of heterosis. BMC Plant Biol 8: 33. doi: 10.1186/1471-2229-8-33. Swigonova Z, Bennetzen JL, Messing J. 2005. Structure and evolution of the r/b chromosomal regions in rice, maize and sorghum. Genetics 169: 891–906. Tuzun E, Sharp AJ, Bailey JA, Kaul R, Morrison VA, Pertz LM, Haugen E, Hayden H, Albertson D, Pinkel D, et al. 2005. Fine-scale structural variation of the human genome. Nat Genet 37: 727–732.
Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E. 2009. EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Res 19: 327–335. Wang Q, Dooner HK. 2006. Remarkable variation in maize genome structure inferred from haplotype diversity at the bz locus. Proc Natl Acad Sci 103: 17644–17649. Wei F, Coe E, Nelson W, Bharti AK, Engler F, Butler E, Kim H, Goicoechea JL, Chen M, Lee S, et al. 2007. Physical and genetic structure of the maize genome reflects its complex evolutionary history. PLoS Genet 3: e123. doi: 10.1371/journal.pgen.0030123. Wellcome Trust Case Control Consortium. 2010. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature 464: 713–720. Wong KK, deLeeuw RJ, Dosanjh NS, Kimm LR, Cheng Z, Horsman DE, MacAulay C, Ng RT, Brown CJ, Eichler EE, et al. 2007. A comprehensive analysis of common copy-number variations in the human genome. Am J Hum Genet 80: 91–104. Woodhouse MR, Schnable JC, Pedersen BS, Lyons E, Lisch D, Subramaniam S, Freeling M. 2010. Following tetraploidy in maize, a short deletion mechanism removed genes preferentially from one of the two homologs. PLoS Biol 8: e1000409. doi: 10.1371/journal.pbio.1000409. Yandeau-Nelson MD, Xia Y, Li J, Neuffer MG, Schnable PS. 2006. Unequal sister chromatid and homolog recombination at a tandem duplication of the a1 locus in maize. Genetics 173: 2211–2226. Yang L, Bennetzen JL. 2009. Distribution, diversity, evolution and survival of Helitrons in the maize genome. Proc Natl Acad Sci 106: 19922– 19927. Yao H, Zhou Q, Li J, Smith H, Yandeau M, Nikolau BJ, Schnable PS. 2002. Molecular characterization of meiotic recombination across the 140kb multigenic a1-sh2 interval of maize. Proc Natl Acad Sci 99: 6157– 6162. Yu J, Wang J, Lin W, Li S, Li H, Zhou J, Ni P, Dong W, Hu S, Zeng C, et al. 2005. The Genomes of Oryza sativa: A history of duplications. PLoS Biol 3: e38. 
doi: 10.1371/journal.pbio.0030038. Zhang F, Peterson T. 2005. Comparisons of maize pericarp color1 alleles reveal paralogous gene recombination and an organ-specific enhancer region. Plant Cell 17: 903–914. Zhang F, Gu W, Hurles ME, Lupski JR. 2009. Copy number variation in human health, disease, and evolution. Annu Rev Genomics Hum Genet 10: 451–481. Zhou Y, Zhu J, Li Z, Yi C, Liu J, Zhang H, Tang S, Gu M, Liang G. 2009. Deletion in a quantitative trait gene qPE9-1 associated with panicle erectness improves plant architecture during rice domestication. Genetics 183: 315–324.
Received May 1, 2010; accepted in revised form September 30, 2010.
Genome Research www.genome.org
Research
Localized hypermutation and associated gene losses in legume chloroplast genomes
Alan M. Magee,1 Sue Aspinall,2 Danny W. Rice,3 Brian P. Cusack,1 Marie Sémon,4 Antoinette S. Perry,1 Saša Stefanović,5 Dan Milbourne,6 Susanne Barth,6 Jeffrey D. Palmer,3 John C. Gray,2 Tony A. Kavanagh,1 and Kenneth H. Wolfe1,7
1Smurfit Institute of Genetics, Trinity College, Dublin 2, Ireland; 2Department of Plant Sciences, University of Cambridge, Cambridge CB2 3EA, United Kingdom; 3Department of Biology, Indiana University, Bloomington, Indiana 47405, USA; 4Institut de Génomique Fonctionnelle de Lyon, Université de Lyon, CNRS, INRA, UCB Lyon 1, Ecole Normale Supérieure de Lyon, 69364 Lyon Cedex 07, France; 5Department of Biology, University of Toronto, Mississauga, Ontario L5L 1C6, Canada; 6Teagasc Crops Research Centre, Oak Park, Carlow, Ireland
Point mutations result from errors made during DNA replication or repair, so they are usually expected to be homogeneous across all regions of a genome. However, we have found a region of chloroplast DNA in plants related to sweetpea (Lathyrus) whose local point mutation rate is at least 20 times higher than elsewhere in the same molecule. There are very few precedents for such heterogeneity in any genome, and we suspect that the hypermutable region may be subject to an unusual process such as repeated DNA breakage and repair. The region is 1.5 kb long and coincides with a gene, ycf4, whose rate of evolution has increased dramatically. The product of ycf4, a photosystem I assembly protein, is more divergent within the single genus Lathyrus than between cyanobacteria and other angiosperms. Moreover, ycf4 has been lost from the chloroplast genome in Lathyrus odoratus and separately in three other groups of legumes. Each of the four consecutive genes ycf4-psaI-accD-rps16 has been lost in at least one member of the legume “inverted repeat loss” clade, despite the rarity of chloroplast gene losses in angiosperms. We established that accD has relocated to the nucleus in Trifolium species, but were unable to find nuclear copies of ycf4 or psaI in Lathyrus. Our results suggest that, as well as accelerating sequence evolution, localized hypermutation has contributed to the phenomenon of gene loss or relocation to the nucleus.
[Supplemental material is available online at http://www.genome.org. The sequence data from this study have been submitted to GenBank (http://www.ncbi.nlm.nih.gov/Genbank/) under accession nos. HM029359–HM029371, HM048906–HM048910, and GO313838–GO322539.]
7Corresponding author. E-mail [email protected]. Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.111955.110.
20:1700–1710 © 2010 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/10; www.genome.org
The genome organization and gene content of chloroplast DNA (cpDNA) are highly conserved among most flowering plant species (Palmer 1985; Sugiura 1992; Jansen et al. 2007). The chloroplast genome of the most recent common ancestor of all angiosperms contained 113 different genes (four rRNA genes, 30 tRNA genes, and 79 protein genes), and this content has been retained in many angiosperms (Kim and Lee 2004). Rates of synonymous nucleotide substitution in chloroplast genes are generally low (a few fold lower than plant nuclear genes) and relatively homogeneous within a genome except for a threefold difference in rate between the large inverted repeat (IR) and single-copy regions (Wolfe et al. 1987; Drouin et al. 2008). Lineage-specific variation in chloroplast synonymous rates has been documented (Gaut et al. 1993; Guo et al. 2007) but is relatively modest compared to the vast differences seen among some plant mitochondrial lineages (Palmer et al. 2000; Mower et al. 2007; Sloan et al. 2009).
Some angiosperm cpDNAs have fewer than the 79 canonical protein genes due to gene losses. Most notable here are parasitic plants such as Cuscuta and Epifagus that have lost some or all photosynthetic ability (Wolfe et al. 1992; Funk et al. 2007; McNeal et al. 2007). Chloroplast gene losses are rarer in photosynthetic species, because in many cases the gene cannot simply be discarded and must instead be either functionally transferred to the nuclear genome or functionally replaced by a nuclear gene (“gene substitution”). Successful gene transfers from the chloroplast to the nuclear genome during angiosperm evolution have been reported for rpl22 in legumes (Gantt et al. 1991); for infA in several lineages, including almost all rosids (Millen et al. 2001); and for rpl32 in two families of Malpighiales (Cusack and Wolfe 2007; Ueda et al. 2007). In addition, Ueda et al. (2008) identified gene substitution as the mechanism of loss of the rps16 gene from cpDNA in Medicago and Populus. The loss of rps16 from cpDNA is compensated by dual targeting (to chloroplasts as well as mitochondria) of mitochondrial ribosomal protein S16, which is encoded by a nuclear gene. Several other examples of losses of genes from cpDNA in photosynthetic angiosperms have been reported, and it is striking that the few species in which gene losses have occurred tend also to be those whose chloroplast genomes are highly rearranged relative to the ancestral angiosperm organization (Jansen et al. 2007). As with angiosperm mitochondrial genomes (Adams and Palmer 2003), most of the genes that have been lost from chloroplast genomes during recent evolution have coded for ribosomal proteins (Jansen et al. 2007). There have been no published reports of the loss of genes coding for components of photosystems I or II (psa and psb genes), the electron transfer chain (pet genes), or the chloroplast ATP synthase (atp genes) from cpDNA in any angiosperms except parasitic species (Wolfe et al. 1992; Funk et al. 2007; McNeal et al. 2007).
One group of angiosperms that is known to be relatively prone to cpDNA rearrangement and gene loss is the legume family
(Fabaceae) (Palmer et al. 1988). The large IR that is otherwise almost universally present in chloroplast genomes is absent from one large clade of legumes (the IR loss clade, or IRLC) (Wojciechowski et al. 2004), some of which also show other rearrangements of gene order. Chloroplast genomes in the IRLC species are also notable for having significant amounts of repetitive DNA, something not usually seen in angiosperm cpDNA (Milligan et al. 1989; Saski et al. 2005; Cai et al. 2008). Five instances of gene loss from the IRLC chloroplast genomes have been discovered. As well as the aforementioned gene transfers of rpl22 and infA and substitution of rps16 (Gantt et al. 1991; Millen et al. 2001; Ueda et al. 2008), it has been reported that accD is completely absent from Trifolium subterraneum (subclover) cpDNA, and that ycf4 in Pisum sativum (pea) is either absent or a pseudogene (Nagano et al. 1991a; Smith et al. 1991; Cai et al. 2008). Slot-blot hybridization experiments suggested that ycf4 and rps16 may have been lost independently multiple times in different lineages of legumes (Doyle et al. 1995).
In the course of this study, we reviewed all reported instances (in published papers or in GenBank annotations) of gene loss among the 103 complete angiosperm chloroplast genome sequences that are publicly available, and found that 27 different protein-coding genes have been lost in at least one lineage (Supplemental Table S1). We found that some reported gene losses are simply due to annotation errors; because of this, the numbers of losses we describe here are slightly different from those in Jansen et al. (2007). In particular, we noticed that the gene ycf4, which was originally not identified in the genome sequences of the legumes Glycine max (soybean; Saski et al. 2005), T. subterraneum (subclover; Cai et al. 2008), Cicer arietinum (chickpea), and Medicago truncatula (Jansen et al. 2008), is in fact present in the cpDNAs of all these species but is so divergent that it was not recognized by the DOGMA software (Wyman et al. 2004) used to annotate them. This discovery prompted us to investigate the rapid evolution of ycf4 and its surrounding region in legumes.
Ycf4 is a thylakoid protein that has been shown to play a role in regulating photosystem I assembly in cyanobacteria (Wilde et al. 1995) and to be essential for photosystem I assembly in Chlamydomonas (Boudreau et al. 1997; Onishi and Takahashi 2009). Experiments in Chlamydomonas indicate that Ycf4 is the second of three scaffold proteins that act sequentially during the assembly process, with Ycf4’s roles being to stabilize an intermediate subcomplex consisting of the PsaAB heterodimer and the three stromal subunits PsaCDE, and to add the PsaF subunit to this subcomplex (Ozawa et al. 2009).
As well as the loss of ycf4 in P. sativum, several other previous studies have indicated that the evolution of ycf4 in legumes may be unusual. In soybean and Lotus japonicus, the Ycf4 protein, which is almost universally 184 or 185 amino acids long, has expanded to about 200 residues (Reverdatto et al. 1995; Kato et al. 2000). The gene also has a high rate of synonymous nucleotide substitution between the latter two species (Perry and Wolfe 2002). Phylogenetic trees for phaseoloid legumes constructed using ycf4 were incongruent with trees constructed using seven other genes, due to accelerated evolution of codon positions 1 and 2 in ycf4 (Stefanovic et al. 2009). In blot hybridizations to DNAs from 280 diverse angiosperms (as in Millen et al. 2001) using a ycf4 probe from tobacco, we observed (SS and JDP, unpubl.) strong hybridization to all DNAs except those from the only Papilionoid legumes surveyed: Medicago (no signal from five species) and Vigna (considerably diminished signal). We show here that ycf4 is situated in a local mutation hotspot in Lathyrus, and possibly in other legume species, resulting in dramatic acceleration of sequence evolution in some species and evolutionary gene losses in others.
Results
Rapid evolution of ycf4 in legumes
To investigate acceleration of the evolutionary rate of ycf4 in legumes, we compared its nonsynonymous and synonymous nucleotide substitution rates in different angiosperm lineages to the rates observed in two other, widely sequenced chloroplast genes, rbcL and matK. This analysis included new ycf4 sequence data from Lathyrus and other legumes, together with sequences from a previous phylogenetic study (GenBank [http://www.ncbi.nlm.nih.gov/Genbank/] accession nos. EU717431–EU717464; Stefanovic et al. 2009) and other database sequences. For each gene, we used a likelihood model to estimate the numbers of nonsynonymous (dN) and synonymous (dS) nucleotide substitutions that occurred on each branch of an angiosperm phylogenetic tree (see Methods).
In the dN trees, ycf4 is seen to evolve much faster in most legumes than in other angiosperms (Fig. 1) but no similar acceleration is seen in legume rbcL or matK, which suggests that the acceleration is locus-specific, as well as lineage-specific. Within legumes, the first accelerated branch is the one leading to a large clade (Millettioids, Robinioids, and the IRLC; asterisk in Fig. 1), and the legumes that are outgroups to this branch do not show acceleration. This branch is also the first one on which the Ycf4 protein size expands above 200 amino acids (Fig. 1). Even faster periods of dN evolution are seen in the genera Desmodium and Lathyrus relative to other legumes. Ycf4 is a pseudogene in three of six Desmodium species we sequenced and in Clitoria ternatea (Supplemental Fig. S1C, left panel). In the dS trees, some acceleration is seen in ycf4 of legumes relative to other angiosperms, particularly in Lathyrus, but again no similar acceleration is seen in legume rbcL or matK (Fig. 1). The genus Lathyrus also shows by far the greatest increases in Ycf4 size, reaching 340 residues in Lathyrus latifolius and Lathyrus cirrhosus.
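The distinction between synonymous and nonsynonymous changes that underlies the dN and dS estimates can be illustrated with a minimal Python sketch. It is not the codon-based likelihood model used in this study: it only classifies codon pairs that differ at a single position, assuming two in-frame, gap-free aligned coding sequences and the standard genetic code. The function name and the toy sequences are hypothetical and serve purely as an illustration.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering
# (first base varies slowest, third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_simple_differences(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) counts of single-base codon differences.

    Hypothetical helper for illustration only. Codon pairs that are identical,
    differ at more than one position, or contain non-ACGT characters are skipped.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame, and equal in length")
    syn = nonsyn = 0
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue
        n_diff = sum(x != y for x, y in zip(codon_a, codon_b))
        if n_diff != 1:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1      # same amino acid: synonymous change
        else:
            nonsyn += 1   # different amino acid: nonsynonymous change
    return syn, nonsyn


# Toy (made-up) sequences: TTT->TTC keeps Phe, GCT->GAT changes Ala to Asp.
toy_a = "ATGTTTGCTAAA"
toy_b = "ATGTTCGATAAA"
print(count_simple_differences(toy_a, toy_b))  # (1, 1)
```

Raw counts of this kind are not rates. Estimating dN and dS also requires normalizing by the numbers of nonsynonymous and synonymous sites and correcting for multiple substitutions at the same site, which is what likelihood methods are designed to handle.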
Remarkably, there is less amino acid sequence conservation between the Ycf4 proteins of two species within the genus Lathyrus (31% identity between Lathyrus palustris and L. cirrhosus), than between tobacco and the cyanobacterium Synechocystis (45% identity). Nevertheless, ycf4 can be inferred to be functional in the four Lathyrus species in which it is intact (Fig. 2), for two reasons. First, even though the level of amino acid sequence conservation among Lathyrus species is very low, many of the sites in the C-terminal part of the protein (beginning at position 248 in Fig. 2) that are conserved among other land plants and cyanobacteria are also conserved in Lathyrus. Second, comparing ycf4 sequences among Lathyrus species shows that they have lower levels of nonsynonymous than synonymous nucleotide substitutions (dN/dS < 1) (Table 1), which is a hallmark of sequences that are being constrained to code for proteins (Kimura 1977; Graur and Li 1999). We therefore infer that these long ycf4 genes in Lathyrus species are biologically functional. However, the level of constraint on Lathyrus ycf4 is lower than on other angiosperm ycf4s (e.g., dN/dS = 0.15 between tobacco and spinach ycf4, compared with dN/dS = 0.36–0.81 within the genus Lathyrus). Tests for positive (Darwinian) selection suggested that some Desmodium branches within the ycf4 tree have undergone adaptive evolution, and in separate analyses, site-specific tests for positive selection were significant for some codons in ycf4 when the whole legume tree was considered (data not shown). However, in view of the evidence that the whole region around ycf4 has a high mutation rate (see below), and because we also found some dN/dS values greater than 1 within the genus Lathyrus for two genes flanking ycf4 (cemA and accD) (Table 1), we suspect that the high dN/dS values are artifacts stemming from a combination of an increased mutation rate and lessened constraints on protein sequences, rather than being indicative of positive selection on multiple adjacent genes.
We may have slightly overestimated the divergence of Lathyrus Ycf4 proteins because we inferred protein sequences from chloroplast DNA sequences, whereas some chloroplast transcripts are known to undergo mRNA editing (Stern et al. 2010). Editing in angiosperms involves C → U changes and typically occurs at 30–40 sites per genome (Tsudzuki et al. 2001; Inada et al. 2004). However, even extensive C → U editing could only marginally reduce the divergence in Lathyrus Ycf4. For example, if we assume that every possible C → U editing event that could increase the similarity between L. palustris and L. cirrhosus Ycf4 proteins actually occurs, their sequence identity only increases from 31% to 32%. Furthermore, no sites in ycf4 are known to undergo mRNA editing in other species (Tsudzuki et al. 2001; Chateigner-Boutin and Small 2007; and our analyses of EST data from M. truncatula, Lotus japonicus, and G. max).
Figure 1. Synonymous and nonsynonymous divergence in angiosperm chloroplast ycf4 sequences. Shown are dN (upper) and dS (lower) trees resulting from a codon-based likelihood analysis and a constrained topology, rooted using gymnosperm sequences (which are not included in the trees). All trees are drawn to the same scale. The species are in the same order from top to bottom in all trees, to the greatest extent possible, and are named in full in Supplemental Figure S1. Magenta branches in the dN tree for ycf4 indicate those on which the Ycf4 protein length is (or is inferred to have been) ≥200 amino acid residues; green branches indicate lengths ≥300 residues. The asterisk marks the branch (leading to Millettioids, Robinioids, and IRLC) in which rate acceleration is first seen. Trees for chloroplast rbcL and matK genes do not show comparable rate heterogeneity at either synonymous or nonsynonymous sites.
Gene losses and repetitive DNA in the region around ycf4 in legumes
We sequenced the region flanking the ycf4 locus in five Lathyrus species, P. sativum (pea) and Vicia faba (broad bean) and compared it to the available data for other legumes (Fig. 3). This comparison reveals a history of multiple gene losses and gene length changes within a small region of cpDNA. We identified ycf4 pseudogenes in both P. sativum and Lathyrus odoratus (sweetpea), which must be the result of two separate losses of the gene (Fig. 3). The small photosystem I gene psaI, normally found immediately upstream of ycf4, is missing from a clade of four Lathyrus species but is present in L. palustris. Also in this region of the genome, the ribosomal protein gene rps16 was lost from cpDNA in the common ancestor of the IRLC clade (Doyle et al. 1995), and accD, coding for a subunit of acetyl-CoA carboxylase, is missing from T. subterraneum cpDNA, which has become rearranged in this region (Cai et al. 2008).
Both ycf4 and accD show extensive length variation among the legume species that retain them (Fig. 3). The expansion of the accD open reading frame is partly explained by the presence of numerous tandemly repeated sequences in this region of legume cpDNA. As reported previously (Nagano et al. 1991a; Smith et al. 1991), and shown by a dot-matrix plot in Supplemental Figure S2A, P. sativum accD contains several in-frame internal repeats of up to 37 codons long. L. sativus accD has a similarly repetitive structure, but the sections of the gene that are repeated are different in the two species (Supplemental Fig. S2B,C). There are tandem repeats in the intergenic DNA between accD and ycf4 in L. latifolius (Supplemental Fig. S2D), and a tandem repeat of 15 codons is located within the 5′ end of L. sativus ycf4 (Supplemental Fig. S2B). All the repeats are species-specific, which suggests that these minisatellite-like sequences have a high turnover rate. However, some other species, such as L. odoratus, do not contain tandem repeats in this region, and the expanded size of ycf4 in most Lathyrus species is not primarily due to the accumulation of repeats.
Sequences of the P. sativum and L. sativus chloroplast genomes
To establish whether the patterns of evolution seen around the ycf4 locus are atypical of the rest of the genome, we sequenced the chloroplast genome of L. sativus (grasspea; 121,020 bp) and completed the genome sequence of P. sativum cpDNA (pea; 122,169 bp)
Genome Research www.genome.org
Downloaded from genome.cshlp.org on December 25, 2010 - Published by Cold Spring Harbor Laboratory Press
Chloroplast hypermutation
Figure 2. Alignments of Ycf4 protein sequences from four Lathyrus species, four diverse land plants, and the cyanobacterium Synechocystis. Cons #1 shows residues that are absolutely conserved among the four Lathyrus species. Cons #2 shows residues that are absolutely conserved among Nicotiana tabacum, Oryza sativa, Pinus thunbergii, Marchantia polymorpha, and Synechocystis species PCC 6803. The alignment was made using MUSCLE as implemented in SeaView (Gouy et al. 2010) with default coloring of conservative amino acid substitution groups.
Table 1. Sequence divergence in cpDNA regions compared among Lathyrus species

Protein-coding genes

                               L. palustris vs.        L. palustris vs.        L. odoratus vs.         L. cirrhosus vs.
                               L. latifolius           L. sativus              L. latifolius           L. latifolius
Sequence             Sites(a)  dN/dS  dS ± SE (%)      dN/dS  dS ± SE (%)      dN/dS  dS ± SE (%)      dN/dS  dS ± SE (%)
ycf4                 109       0.362  152.2 ± 52.6     0.520  108.4 ± 23.7     NA     NA               0.805  4.8 ± 2.2
accD                 112       0.677  7.1 ± 2.7        0.255  13.3 ± 3.7       1.399  2.3 ± 1.5        ∞      0.0 ± 0.0
cemA                 150       1.723  1.4 ± 1.0        2.971  1.4 ± 1.0        0.951  0.7 ± 0.7        ∞      0.0 ± 0.0
rbcL                 200       0.149  3.6 ± 1.4        0.130  2.0 ± 1.0        ND     ND               0.000  0.5 ± 0.5
matK                 176       0.384  6.5 ± 2.0        0.550  5.4 ± 1.8        0.260  2.9 ± 1.3        ND     ND

Intergenic regions

                               L. palustris vs.        L. palustris vs.        L. odoratus vs.         L. cirrhosus vs.
                               L. latifolius           L. sativus              L. latifolius           L. latifolius
Sequence             Sites     Kimura's K ± SE (%)     Kimura's K ± SE (%)     Kimura's K ± SE (%)     Kimura's K ± SE (%)
rbcL-atpB spacer     747       3.9 ± 0.7               4.1 ± 0.8               ND                      0.4 ± 0.2
trnF-trnL spacer(b)  475       3.0 ± 0.8               3.7 ± 0.9               2.3 ± 0.7               ND
trnS-trnG spacer(b)  616       2.7 ± 0.7               4.3 ± 0.8               2.5 ± 0.6               ND

For protein-coding genes the synonymous divergence (dS), its standard error (SE), and the nonsynonymous-to-synonymous ratio (dN/dS, also called ω) are shown. For intergenic regions, divergence (K) was calculated using Kimura's two-parameter method. NA, not applicable (gene not present); ND, not determined.
(a) Average number of sites compared across the reported species pairs. The ycf4, accD, cemA, and rbcL comparisons are all not full-length. For ycf4, only the relatively conserved section between position 164 in Figure 2 and the C terminus was compared.
(b) Sequence data from Kenicer et al. (2005).

(Supplemental Fig. S3). Both of these genomes lack the IR. They have rearrangements of gene order relative to the ancestral angiosperm order, as represented by tobacco, and also relative to each other. The gene order in P. sativum can be obtained from the tobacco order by eight inversion steps (Palmer et al. 1988), beginning with a 50-kb inversion that is shared by most legumes and that placed rps16 beside accD. The first three inversions occurred before the separation of the lineages giving rise to P. sativum and L. sativus, after which there were five more inversions specific to P. sativum, and three more inversions specific to L. sativus (Supplemental Fig. S3). None of the inversions in P. sativum or L. sativus is shared with the highly rearranged cpDNA of T. subterraneum (Cai et al. 2008), other than the initial 50-kb inversion (Supplemental Fig. S3E). The L. sativus genome sequence shows that it shares four gene losses that have already been reported in P. sativum: infA, rps16, rpl22, and rpl23 (Gantt et al. 1991; Nagano et al. 1991a,b; Millen et al. 2001); whereas L. sativus ycf4 is intact. The status of rpl23 in P. sativum has been unclear because it contains a 190-bp
Magee et al.
Figure 3. Gene organization around the ycf4 locus in some legumes. Triangles indicate evolutionary losses of the indicated genes. Numbers indicate the numbers of codons in accD and ycf4 genes. Psi symbols denote pseudogenes. All genes are transcribed from left to right. Fading colors denote genes that were not completely sequenced. The half-height region in rps16 represents an intron. The slash marks indicate a genomic rearrangement in T. subterraneum (Cai et al. 2008). The topology of the phylogenetic tree (not drawn to scale) is from Asmussen and Liston (1998), Wojciechowski et al. (2004), and Kenicer et al. (2005).
frameshifting insert close to the normal 3′ end (Nagano et al. 1991b), but in L. sativus the same insert is present and more than half of rpl23 is missing, so we infer that rpl23 is a pseudogene in both species. In addition, these two species differ by the absence of psaI in L. sativus, and of ycf4 in P. sativum. As well as the gene losses, P. sativum and L. sativus both lack two of the 21 introns normally found in angiosperm cpDNA: the first intron of clpP, and the cis-intron of rps12. These intron losses were reported previously as part of a survey that showed their occurrence at about the time of origin of the IRLC clade (Jansen et al. 2008).

Legume chloroplast genomes have long been thought to contain more repetitive sequences than other cpDNAs (e.g., Saski et al. 2005), and this is confirmed by dot-matrix analysis. Using a cutoff of 28 matching bases per 30-bp window, there are very few repeated sequences of this size in the tobacco and spinach chloroplast genomes other than the IR and some similarities among group II introns and among iso-accepting tRNA genes (Supplemental Fig. S4). However, in P. sativum, L. sativus, and Lotus japonicus (as a representative IR-containing legume), it is striking that there are many tandem or near-tandem repeats (i.e., dots near the main diagonal in Supplemental Fig. S4), and the region around ycf4 stands out as particularly repetitive.
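The windowed cutoff used for the dot-matrix analysis (28 matching bases per 30-bp window) can be illustrated with a minimal sketch. This is not the DNAMAN implementation, whose internals are not described here; the function names, the forward-strand-only comparison, the `max_offset` definition of "near the main diagonal", and the toy sequence are all assumptions made for illustration.

```python
def dot_matrix_hits(seq_a, seq_b, window=30, min_matches=28):
    """Return (i, j) window start positions where seq_a[i:i+window] and
    seq_b[j:j+window] agree at min_matches or more positions.
    Brute force and forward strand only: suitable for short toy inputs."""
    a, b = seq_a.upper(), seq_b.upper()
    hits = []
    for i in range(len(a) - window + 1):
        win_a = a[i:i + window]
        for j in range(len(b) - window + 1):
            matches = sum(x == y for x, y in zip(win_a, b[j:j + window]))
            if matches >= min_matches:
                hits.append((i, j))
    return hits


def near_diagonal_hits(seq, max_offset=200, window=30, min_matches=28):
    """Self-comparison hits lying just off the main diagonal, i.e. candidate
    tandem or near-tandem repeats. max_offset is a hypothetical threshold."""
    return [(i, j)
            for i, j in dot_matrix_hits(seq, seq, window, min_matches)
            if 0 < j - i <= max_offset]


# Made-up 40-bp unit repeated twice in tandem (not a real cpDNA sequence).
unit = "ACGTTGCAAGCTTAGGCTAACGGATCCATGCAATTGGCCA"
toy = unit * 2
repeat_hits = near_diagonal_hits(toy)
print(len(repeat_hits), "off-diagonal windows pass the 28/30 cutoff")
```

On a whole chloroplast genome the quadratic scan above would be far too slow; a k-mer index or a dedicated dot-plot tool would be the practical choice.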
Sites of gene loss coincide with a mutation hotspot

We measured synonymous divergence in each protein-coding gene between the P. sativum and L. sativus chloroplast genomes using dS (black circle symbols in Fig. 4), calculated by the yn00 program (Yang 2007). For most loci, the divergence between these species is less than 0.1 substitutions per site (median dS = 0.055 synonymous substitutions per site). Ycf4 cannot be included directly in this comparison because it is absent from P. sativum cpDNA, so instead, for ycf4 in Figure 4 we have plotted the dS value between L. palustris and L. sativus, which is 20-fold higher (1.084) than the median even though the comparison is over a shorter divergence time. We observed even higher dS values in comparing ycf4 between L. palustris and L. cirrhosus (1.481) or L. latifolius (1.522). Similarly, psaI is missing from L. sativus cpDNA so instead we compared P. sativum psaI to L. palustris psaI and found dS = 0.580, which again is much higher than the genome average for P. sativum versus L. sativus (0.055). For accD we compared only the regions of the gene that could be reliably aligned between P. sativum and L. sativus, and obtained a dS value (0.212) that is 3.8 times the genome average. This spike in local dS values is matched by a local increase in divergence in the intergenic regions near the ycf4 and accD loci (all compared between P. sativum and L. sativus using Kimura's K; open symbols in Fig. 4).

The very high level of synonymous substitution in ycf4 made us question whether the mutational process at this locus might somehow be different than elsewhere in the genome. We investigated this possibility by sequencing regions of cpDNA from L. latifolius and L. cirrhosus (Fig. 3), two species that are evidently very closely related because there is only 1 nucleotide (nt) substitution between their rbcL genes, and only three substitutions in the atpB-rbcL intergenic spacer (Fig. 5). There are only two differences out of 1256 bp in the combined partial accD and cemA sequences obtained from these species, compared with 56 differences in the 1023-bp-long ycf4 (dS = 0.048, dN = 0.039).
Most strikingly, there are 19 differences (10% divergence) in the spacer between accD and ycf4. This spacer (from which psaI has been lost) is 238 bp in L. cirrhosus, most of which can be aligned to L. latifolius, but in L. latifolius the spacer has expanded to 648 bp due to the presence of multiple tandem repeat sequences comprising a 57-bp repeat unit (six complete and three partial copies) and a 67-bp repeat unit (two complete and one partial copy) (Supplemental Fig. S5). These results show that a region of ~1500 bp in these Lathyrus genomes, extending through the accD-ycf4 spacer and most if not all of ycf4 itself, is a hotspot with a mutation rate that is dramatically higher than in the rest of the genome. Despite this high mutation rate, the types of nucleotide substitution occurring in the ycf4 region do not seem particularly biased, with an overall transition/transversion ratio of 0.9 (Fig. 5).

There also appears to be a smaller second peak of divergence values in the region around the genes rpl14 and rps8 (Fig. 4), which is the site from which infA was lost in an early ancestor of Fabales and Cucurbitales (Millen et al. 2001). We found that sites of gene loss coincide with fast-evolving intergenic regions more often than expected by chance: Five of the six most divergent intergenic regions between P. sativum and L. sativus are also the sites of five of their six gene losses (ycf4, psaI, rps16, infA, rpl22; P = 2 × 10⁻⁶ by the hypergeometric test).
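The overlap test in the preceding sentence can be sketched as an upper-tail hypergeometric probability. The total number of intergenic regions compared is not stated in this passage, so `N_REGIONS` below is a hypothetical placeholder and the printed value is illustrative only; it is not claimed to reproduce the reported P value. The function name is also an assumption.

```python
from math import comb


def hypergeom_upper_tail(population, marked, draws, observed):
    """P(X >= observed) when `draws` items are sampled without replacement
    from `population` items, of which `marked` are of interest."""
    upper = min(marked, draws)
    total = comb(population, draws)
    favourable = sum(
        comb(marked, k) * comb(population - marked, draws - k)
        for k in range(observed, upper + 1)
    )
    return favourable / total


N_REGIONS = 70        # hypothetical: number of intergenic regions compared
LOSS_SITES = 6        # gene-loss sites (from the text)
TOP_DIVERGENT = 6     # most divergent intergenic regions examined (from the text)
OVERLAP = 5           # regions that are both (from the text)

p_value = hypergeom_upper_tail(N_REGIONS, LOSS_SITES, TOP_DIVERGENT, OVERLAP)
print(f"P(overlap >= {OVERLAP}) = {p_value:.2e} for N = {N_REGIONS} (assumed)")
```

Using exact integer binomial coefficients avoids any dependence on a statistics library; with SciPy available, the survival function of its hypergeometric distribution would give the same tail.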
Figure 4. Sequence divergence between the P. sativum and L. sativus chloroplast genomes. The x-axis lists genes or exons in the order in which they occur in the L. sativus genome. Black filled circles show dS (number of synonymous substitutions per synonymous site) for each orthologous protein gene pair, calculated using yn00 (Yang 2007). White and gray filled circles show divergence (K) for each intergenic region or intron, respectively, calculated by Kimura's two-parameter method (Kimura 1983). Vertical bars, dS or K ± 1 SE. Because ycf4 is a pseudogene in P. sativum and psaI is not present in L. sativus, the dS value plotted for ycf4 is for a comparison between L. sativus and L. palustris, and the dS value plotted for psaI is for a comparison between P. sativum and L. palustris (see text). No divergence values are plotted for intergenic regions that are not flanked by the same genes in the two species or that are shorter than 100 bp.

Mutation rate in ycf4 exceeds the rate in the nuclear genome

Early work on rates of nucleotide substitution in plant genomes concluded that the synonymous substitution rate (assumed to be equal to the mutation rate) is about four times higher in plant nuclear genomes than in the single-copy regions of chloroplast genomes (Wolfe et al. 1987, 1989; Gaut 1998; Muse 2000). Given the heterogeneity of synonymous rates seen within the Lathyrus chloroplast genomes, we wondered how these rates compared with the rate in the nuclear genome. Relatively few nuclear genes have been sequenced from Lathyrus species, so we generated new expressed sequence tag (EST) data from Lathyrus odoratus (sweetpea; see below) and identified putatively orthologous nuclear genes between these and database sequences from P. sativum. Among 56 putative orthologs, the median dS is 0.131 (Supplemental Table S2), which is 2.4 times higher than the median dS (0.055) for chloroplast genes compared between P. sativum and L. sativus. Thus in comparisons between Lathyrus and P. sativum, as in other flowering plant comparisons, the synonymous divergence in most parts of the chloroplast genome is lower than in the nuclear genome. The synonymous divergence in ycf4, however, is at least 10 times greater than in the nuclear genome (the ratios of the dS values given above, 1.084/0.131 = 8.3 and 1.522/0.131 = 11.6, are underestimates of the actual ratio because the numerators involve a shorter divergence time).
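The fold differences quoted above follow directly from the dS values given in the text. The short sketch below only redoes that arithmetic; the constant and function names are illustrative, not taken from the authors' analysis.

```python
# dS values quoted in the text (substitutions per synonymous site).
MEDIAN_DS_CHLOROPLAST = 0.055   # chloroplast genes, P. sativum vs. L. sativus
MEDIAN_DS_NUCLEAR = 0.131       # 56 putative nuclear orthologs
DS_YCF4 = {
    "L. palustris vs. L. sativus": 1.084,
    "L. palustris vs. L. latifolius": 1.522,
}


def fold_difference(numerator, denominator):
    """Ratio of two divergence estimates."""
    return numerator / denominator


nuclear_vs_chloroplast = fold_difference(MEDIAN_DS_NUCLEAR, MEDIAN_DS_CHLOROPLAST)
ycf4_vs_nuclear = {
    pair: fold_difference(ds, MEDIAN_DS_NUCLEAR) for pair, ds in DS_YCF4.items()
}
ycf4_vs_chloroplast = {
    pair: fold_difference(ds, MEDIAN_DS_CHLOROPLAST) for pair, ds in DS_YCF4.items()
}

print(f"nuclear / chloroplast median: {nuclear_vs_chloroplast:.1f}")
for pair, ratio in ycf4_vs_nuclear.items():
    print(f"ycf4 ({pair}) / nuclear median: {ratio:.1f}")
```

Because the ycf4 comparisons are within Lathyrus while the medians are between Pisum and Lathyrus, these ratios are lower bounds on the rate difference, as the text notes.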
Transfer of Trifolium accD to the nucleus

We suspect that ycf4 and psaI have been transferred to the nuclear genome in the Lathyrus species that lack them in cpDNA, because these species are fully photosynthetic and must have a functional photosystem I. However, we were unable to find nuclear copies of these genes. We made numerous unsuccessful attempts (see Methods) to amplify ycf4 and psaI by PCR from genomic DNA of L. odoratus (which lacks both of them in its cpDNA and has a smaller ycf4 pseudogene than P. sativum). We then made cDNA from young green leaves of L. odoratus and sequenced 8702 ESTs. None of the ESTs were derived from a nuclear ycf4 or psaI, even though we did find ESTs corresponding to seven of the nine other nuclear-encoded subunits of photosystem I (Jolley et al. 2005), and to the older nuclear-transferred genes infA and rpl22. PsaI is a very small protein (34–40 amino acids) that is conserved between cyanobacteria and land plants and is physically located toward the exterior of photosystem I in P. sativum, where it interacts strongly with PsaH (Jolley et al. 2005; Amunts et al. 2007). It seems unlikely that photosystem I in Lathyrus could function efficiently without PsaI, although tobacco plants with a psaI knockout do not show a mutant phenotype under standard growth conditions (MA Schöttler and R Bock, pers. comm.). Most of the small membrane-spanning subunits of photosystem I appear to be nonessential, and knockout lines do not display visible mutant phenotypes (Varotto et al. 2002; Jensen et al. 2007; Schöttler et al. 2007). However, the loss of individual small membrane-spanning subunits usually affects the assembly of other subunits and results in lower
Figure 5. Sequence divergence between L. latifolius and L. cirrhosus in the accD-ycf4-cemA region (left) and the atpB-rbcL region (right). Vertical tickmarks indicate the locations of each nucleotide substitution, categorized according to whether it occurs at codon position 1, 2, or 3; or in intergenic DNA; and as a transversion (Tv; tickmarks above the horizontal lines) or a transition (Ti; tickmarks below the horizontal line). The total numbers of each type of substitution are shown on the right. Supplemental Figure S5 shows the nucleotide sequence alignment summarized in the left panel.
efficiencies of excitation transfer and electron transfer (Varotto et al. 2002; Jensen et al. 2007; Schöttler et al. 2007), which would be evolutionarily deleterious. The only other known cases of loss of psaI from plastid DNA are in the parasitic species Cuscuta gronovii and Cuscuta obtusiflora, which have reduced levels of photosynthesis but retain all other photosynthesis genes (Funk et al. 2007; McNeal et al. 2007), and in the nonphotosynthetic parasite Epifagus (Wolfe et al. 1992).

Although we have direct evidence for association between gene losses and a mutation hotspot only in the genus Lathyrus, it is intriguing that other species in the IRLC legume clade show evolutionary losses of other genes that neighbor ycf4 and psaI (Fig. 3). The loss of rps16 in the common ancestor of the IRLC clade can be explained in terms of gene substitution by the nuclear gene for mitochondrial RPS16, as already demonstrated for Medicago (Ueda et al. 2008), and so does not necessitate a gene transfer to the nucleus. Rps16 has been lost on multiple independent occasions during land plant evolution (Supplemental Table S1; Ohyama et al. 1986; Tsudzuki et al. 1992; Ueda et al. 2008), so it is possible that its multiple losses in legumes are simply the result of relatively easy and/or early substitution by the mitochondrial gene.

The other IRLC legume gene loss in the neighborhood of ycf4 and psaI is the loss of accD in Trifolium (Fig. 3). AccD codes for a subunit of acetyl-CoA carboxylase, which functions in lipid synthesis and is an essential chloroplast gene in tobacco (Kode et al. 2005). The loss in Trifolium is one of five separate known instances of loss of accD in angiosperm cpDNAs (Supplemental Table S1). In grass species, the only case that has been studied in detail, the prokaryotic multisubunit carboxylase in the plastid has been completely replaced by a nuclear-encoded single-chain carboxylase of eukaryotic ancestry (Konishi et al. 1996; Gornicki et al.
1997). We identified an evolutionary transfer of accD to the nucleus in Trifolium. Using high-throughput EST sequence data from Trifolium repens (white clover), we found a cDNA structure consisting of a fusion between a gene for plastid lipoamide dehydrogenase (LPD2) and accD (Supplemental Fig. S6A–D). We confirmed the presence of a fused mRNA by reverse transcriptase PCR and Sanger sequencing (Supplemental Fig. S6E). In plastids, lipoamide dehydrogenase is a component of pyruvate dehydrogenase, a complex that makes acetyl-CoA (Lutziger and Oliver 2000; Drea et al. 2001). The T. repens nuclear transcript codes for a predicted protein of 805 amino acids, with residues 1–512 (including a transit peptide) derived from LPD2 and residues 513–805 derived from accD. By comparison to the known genomic structures of LPD genes in M. truncatula, we infer that in T. repens the accD sequence has replaced the final two exons (exons 14 and 15) of its LPD2 gene, with the point of fusion occurring at the third codon of exon 14. We did not find any evidence for alternative splicing of the LPD2–accD fusion to form two products, as occurs with the SOD–rpl32 fusion in mangrove trees (Cusack and Wolfe 2007). The fusion to accD probably rendered LPD2 unable to code for functional lipoamide dehydrogenase, because the fusion protein lacks some conserved residues normally provided by exons 14 and 15, but T. repens retains and expresses a paralogous gene LPD1 that also codes for plastid lipoamide dehydrogenase (Supplemental Fig. S6A). We found the transferred gene in T. repens, but we presume that the transfer is shared by other Trifolium species, including the two that have been demonstrated to have no accD in their cpDNAs (T. subterraneum and Trifolium pratense) (Doyle et al. 1995; Cai et al. 2008). We also found database ESTs for a nuclear accD in T. pratense (red clover), but they are too short to confirm that this species also has the LPD2–accD fusion. Phylogenetic
analysis indicates that the T. repens and T. pratense nuclear sequences have a monophyletic origin and that the transfer of accD to the nucleus occurred within the IRLC clade (Supplemental Fig. S6F), consistent with the change in LPD2 gene structure that occurred after Trifolium diverged from Medicago (Supplemental Fig. S6A). The Trifolium nuclear accD gene is transcribed in both T. repens and T. pratense, is predicted to have a functional transit peptide in T. repens (TargetP cTP score 0.976) (Emanuelsson et al. 2000), and shows evidence of selection to maintain its AccD-coding function (dN/dS = 0.26 between T. repens and T. pratense in the accD region of the transcript). Moreover, the Trifolium nuclear mRNAs code for a leucine residue at a site that undergoes an essential Ser → Leu mRNA edit in P. sativum plastids (Supplemental Fig. S6D; Sasaki et al. 2001; Inada et al. 2004).
Discussion

The genomic region around ycf4 in Lathyrus is a dramatic hotspot for point mutations. It is difficult to quantify the factor by which its mutation rate is increased relative to the rest of the genome, but comparisons of synonymous site divergence indicate an increase of at least 20-fold, both in comparisons between P. sativum and L. sativus (Fig. 4) and among Lathyrus species (Table 1). Between L. latifolius and L. cirrhosus, the increase may be even greater (Fig. 5; Table 1). Even a 20-fold mutation rate increase only goes partway toward explaining how the protein sequence divergence between L. palustris and L. cirrhosus (with a divergence time of <10 Myr) (Kenicer et al. 2005) exceeds that between other angiosperms and cyanobacteria (separated by >1000 Myr); a relaxation of selective constraints on the Ycf4 protein in legumes must be involved too.

Although there have been previous reports that the variance of synonymous substitution rates among genes in many eukaryotic genomes is greater than expected by chance (e.g., Baer et al. 2007; Fox et al. 2008), there are few if any precedents for the phenomenon that we describe here: a sharply localized mutation rate acceleration of great magnitude in one specific region of a genome. The existence of the hotspot violates the common assumption that the point mutation rate is approximately constant in all regions of the same genome (Kimura 1983), which underpins the silent molecular clock hypothesis (Ochman and Wilson 1987). Our results bear some similarities to the “mutation showers” (transient localized hypermutation events) that have been found in some studies on the genomic distribution of spontaneous mutations (Drake 2007; Wang et al. 2007; Nishant et al. 2009). As well as being a mutation hotspot, ycf4 and its neighbors also appear to be a hotspot for the formation and turnover of minisatellite sequences in Lathyrus.
The previous study most relevant to our findings is that of Erixon and Oxelman (2008), who reported somewhat similar results for the chloroplast clpP gene in Silene and Oenothera species. For some interspecies comparisons in their study, both dN and dS were elevated in clpP compared with other chloroplast genes, although the dS elevations were at most fivefold for clpP, compared with at least 20-fold for ycf4 in Lathyrus. Also, insertions of repetitive amino acid sequence regions occurred in some of the fast-evolving taxa. Locus-specific rate accelerations affecting both dN and dS were reported in cpDNA of Geraniaceae, but in this case, the accelerations occurred in numerous genes (Guisinger et al. 2008). In all IR-containing cpDNAs, the synonymous rate is higher in single-copy genes than in IR-located genes, probably due to a copy-number effect during DNA repair (Wolfe et al. 1987; Birky and Walsh 1992; Perry and Wolfe 2002). Dramatic accelerations of
synonymous rates have been found in the mitochondrial genomes of some plants, such as Plantago, Pelargonium, and certain Silene species (Cho et al. 2004; Parkinson et al. 2005; Mower et al. 2007; Sloan et al. 2009). Most of these mitochondrial accelerations appear to affect all genes in the genome similarly, but among-gene rate heterogeneity was found within the mtDNAs of a few species (Mower et al. 2007), including a 40-fold difference in synonymous rates between atp9 and three other mitochondrial genes in Silene (Sloan et al. 2009). Because plant mitochondrial genomes are relatively large and do not show much gene order conservation, most studies have only examined individual genes so the sizes of the genomic regions affected by rate acceleration are not known.

Apart from these organellar examples, there are very few precedents for a mutation rate change that is so pronounced over such a short physical distance. One early study (Martin and Meyerowitz 1986) reported a 2-kb region of noncoding DNA near the glue gene cluster of three Drosophila species, which contained an abrupt boundary between a conserved region and a nonconserved region with a 10-fold elevated substitution rate, but this report has not been followed up with more extensive analyses based on complete genome sequence data. An abrupt boundary of evolutionary rates also occurs on the mammalian X chromosome at the junction between the pseudoautosomal region and the X-specific region. The pseudoautosomal part of the gene Fxy, which spans this junction in laboratory mice, has a synonymous rate about 60 times faster than the X-specific part of the gene, probably because the high recombination rate in the pseudoautosomal part leads to high levels of biased gene conversion (Perry and Ashworth 1999; Duret and Galtier 2009).

Is the chloroplast hypermutation phenomenon unique to Lathyrus?
At present, Lathyrus is the only legume genus for which we have extensive sequence data from more than one species, so we are unable to say whether the same hotspot is present in legumes outside this genus. Therefore the only gene losses we can potentially attribute directly to hypermutation are those of ycf4 in L. odoratus and of psaI in the ancestor of four Lathyrus species. Ycf4 is also evolving fast in Desmodium and has been lost in three species of that genus. The losses of ycf4 in P. sativum, of accD in Trifolium, and the older loss of rps16 in the ancestor of the IRLC clade are suggestive, but we have no direct evidence that these loci were fast-evolving prior to the gene losses. It is possible that a hotspot has existed throughout legume evolution and was the cause of the ycf4 acceleration seen in the common ancestor of Millettioids, Robinioids, and the IRLC (Fig. 1) but that the exact location of the hotspot (and its associated tandem repeat sequences) has varied somewhat among lineages, affecting ycf4 in some taxa, but accD or psaI in others.

We do not know the molecular basis for the increases in either the point mutation rate or the length mutation rate, but we speculate that they might be connected. We suggest that a correlation between the two rates could develop if, for some reason, the genomic region around ycf4 was subject to repeated DNA breakage and repair (cf. Guisinger et al. 2008; Yang et al. 2008). In this regard, it is interesting to note that only a few angiosperm species have cpDNAs that are highly rearranged relative to the canonical gene order, but among these, there are several independent lineages that are both highly rearranged and contain rapidly-evolving protein genes (Jansen et al. 2007). These lineages include Jasminum (acceleration of accD; Lee et al. 2007), Silene (acceleration of clpP; Erixon and Oxelman 2008), and now Lathyrus (acceleration of ycf4).
The phylogenetic diversity of these lineages suggests that hypermutable regions may exist in other angiosperm cpDNAs, and our findings may go some way toward explaining the
apparent bursts of organelle-to-nucleus gene transfer seen in some angiosperms.

It is likely that many factors dictate whether a gene can be lost from an organelle genome. One property that is common to the gene transfer and gene substitution processes is that they both involve a phase during which the organelle gene and the nuclear gene coexist in the same species (Timmis et al. 2004). Analogous to a gene duplication, this two-gene phase can be resolved either by losing the organelle copy (resulting in a successful transfer of function) or by losing the nuclear copy (restoring the status quo). Intermediates in this process, and sister lineages where the two-gene phase was resolved in opposite ways, have been identified (Adams et al. 1999). Brandvain and Wade (2009) have shown theoretically that the ratio between the point mutation rates in the organelle and nuclear copies has a profound influence on the direction in which the two-gene phase is resolved. If the organelle mutation rate is lower than the nuclear mutation rate, as is true for most plant mitochondrial and chloroplast genes, then gene transfer will not occur unless there is a benefit to relocating the gene. By contrast, if the organelle rate exceeds the nuclear rate, then gene transfer is predicted to occur even in the absence of any benefit (Brandvain and Wade 2009). Therefore, in a genome such as Lathyrus cpDNA, in which the mutation rate exceeds the nuclear rate only in one hypermutable region, we should expect to see more transfers, substitutions, or losses of genes from the hypermutable region than from the rest of the genome. This argument provides a plausible explanation for the losses of ycf4 and psaI seen in some Lathyrus species and, perhaps more generally, for the cluster of losses from the rps16-accD-psaI-ycf4 region seen in other legume cpDNAs.
Methods

Plant material

Seeds of Lathyrus sativus (cv. Cicerchia Marchigiana) were purchased from B&T World Seeds. Seeds of L. cirrhosus (accession no. LAT17) were obtained from the Leibniz Institute of Plant Genetics and Crop Plant Research. Other Lathyrus species were purchased from Thompson & Morgan. Additional sequencing of P. sativum cpDNA was done using cv. Feltham First.
Nucleotide sequencing

The P. sativum (pea) chloroplast genome sequence was completed by S.A. and J.C.G. using the chain termination method (Sanger et al. 1977) with fluorescent dideoxynucleotides on PCR products amplified from cloned PstI fragments (Palmer and Thompson 1981), from cpDNA extracted from isolated chloroplasts, or from total DNA extracted from shoots of 8-d-old seedlings. Chloroplasts were isolated by the high-salt method (Bookjans et al. 1984), and DNA was extracted by the CTAB (hexadecyltrimethylammonium bromide) method (Milligan 1989). Previously published regions were not resequenced except at the borders or where there were discrepancies between publications. Newly sequenced regions were completed on both strands, and all the PstI sites used for cloning were confirmed by sequencing spanning PCR fragments. At the ycf4 locus, there is a 2-bp deletion in the sequence reported by Nagano et al. (1991a), relative to the sequence reported by Smith et al. (1991), both of which were obtained from the same cloned 17.3-kb PstI fragment from P. sativum cv. Alaska. We confirmed that this 2-bp deletion exists, in both cv. Alaska and cv. Feltham First. This correction means that the reported ORF157 (Smith et al. 1991) does not exist.
The L. sativus (grasspea) chloroplast genome sequence was determined by A.M.M., T.A.K., and K.H.W. Approximately 150 seeds were grown on soil in the greenhouse. Seedling shoots were harvested at 7 d post-germination, and cpDNA was prepared according to the method described by Milligan (1989) except that chloroplast lysis and cpDNA recovery procedures were modified. Chloroplasts were lysed by adding a 1/5 volume of 10% CTAB (Sigma-Aldrich) and heating for 20 min at 70°C. This was followed by a chloroform extraction, treatment with RNaseA (10 mg/mL), and isopropanol precipitation of cpDNA. A plasmid library of nebulized fragments was constructed from 5 mg of cpDNA by GATC-Biotech. The genome sequence was assembled from 1536 Sanger shotgun sequence reads with primer-walking to close gaps.

We tried unsuccessfully to amplify ycf4 and psaI by PCR from L. odoratus genomic DNA using 16 and 9 primer combinations, respectively, and a range of amplification conditions. These primers were designed based on amino acid residues conserved among known Fabaceae Ycf4 and PsaI proteins, but primer design for these genes is difficult due to the fast rate of ycf4 evolution and the short length of psaI, as well as the high A+T content of the region.

To obtain EST data from L. odoratus, we isolated poly(A) mRNA from leaves of 3-d-old seedlings. A normalized cDNA library was constructed by GATC-Biotech, and the 3′ ends of 8702 cDNAs were sequenced by Agencourt Biosciences. ESTs were assembled into contigs, and putative orthologs between these contigs and P. sativum sequence data from GenBank were identified according to the method of Sémon and Wolfe (2008).

The other new sequence data indicated in Figures 3 and 5 were generated by PCR amplification and sequencing (by primer walking) of at least three independent cloned products for each region. The cpDNA region that normally contains ycf4 was PCR amplified from L. latifolius, L. cirrhosus, L. odoratus, and L. palustris using primers designed from the P. sativum accD (5′-AAACAGGCACAGGTCAASTAAATGG-3′) and cemA (5′-GACGGAGATACACGATTTAAATAACG-3′) genes. The atpB–rbcL region from L. latifolius, L. cirrhosus, and L. palustris was amplified with primers 5′-TGRAAAARCTACATCGAGTACCGGAGG-3′ and 5′-TATGATCTCCACCAGACATACG-3′.

T. repens mRNA sequences coding for LPD1 and the LPD2–accD fusion gene were identified among 700,000 ESTs obtained by high-throughput pyrosequencing of flower, leaf, and stolon mRNA from the inbred line S (7S.4.6.3.3.4.4.10) (DM and SB, unpubl.) and assembled manually. The structure of the LPD2–accD junction was confirmed by reverse transcriptase-PCR from T. repens leaf mRNA (commercial variety Nusiral) and Sanger sequencing.
Computational methods
Sequence divergence for most analyses was calculated using yn00 from the PAML package (Yang 2007) for coding regions and Kimura’s two-parameter method (Kimura 1983) for noncoding regions. Gene sequences were aligned by reverse-translation of ClustalW alignments of the corresponding protein sequences. Noncoding sequences were aligned using ClustalW with manual adjustment for regions around ycf4. For the analysis in Figure 1 and Supplemental Figure S1, we first constructed a maximum likelihood phylogeny (in PAUP) from matK sequences using the HKY substitution model with a four-category gamma rate distribution. The transition/transversion ratio and shape parameter were estimated iteratively until the topology converged. This analysis included legume ycf4 sequences from Stefanovic et al. (2009; GenBank [http://www.ncbi.nlm.nih.gov/Genbank/] accession nos. EU717431–EU717464). The dN and dS branch lengths for the matK, ycf4, and rbcL trees were estimated based on the PAML/codeML free-ratio model, using the fixed topology obtained from the above
Genome Research www.genome.org
matK ML analysis. Dot-matrix plots were made using DNAMAN (http://www.lynnon.com) (Huang and Zhang 2004).
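For readers unfamiliar with Kimura's two-parameter method used above for noncoding regions, the distance is d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of sites that differ by a transition and by a transversion, respectively. The sketch below is a plain illustration of that formula, not the software used in the study. The function name and the choice to skip columns with gaps or ambiguous bases are assumptions.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences.

    Columns containing gaps or ambiguous bases are skipped (an assumption
    of this sketch). Raises ValueError for unaligned or empty input; very
    divergent pairs can make the logarithm undefined.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Toy alignment: two transitions in ten sites.
print(k2p_distance("ACGTACGTAC", "ACGTGCGTAT"))
```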
Acknowledgments
We thank Gavin Conant for help with Figure S3, Shusei Sato for Trifolium cDNA clones, Greg Kenicer for prepublication access to sequence data, and Ralph Bock for discussion. S.A. and J.C.G. thank Chris Maddren (Department of Genetics, University of Cambridge) for DNA sequencing. This study was supported by Science Foundation Ireland (K.H.W., T.A.K.), European Commission FP5 Plastid Factory (J.C.G., T.A.K.), US National Institutes of Health (J.D.P.), and the METACyt Initiative of Indiana University, funded in part through a major grant from the Lilly Endowment, Inc. (J.D.P., D.W.R.).
References
Adams KL, Palmer JD. 2003. Evolution of mitochondrial gene content: Gene loss and transfer to the nucleus. Mol Phylogenet Evol 29: 380–395.
Adams KL, Song K, Roessler PG, Nugent JM, Doyle JL, Doyle JJ, Palmer JD. 1999. Intracellular gene transfer in action: Dual transcription and multiple silencings of nuclear and mitochondrial cox2 genes in legumes. Proc Natl Acad Sci 96: 13863–13868.
Amunts A, Drory O, Nelson N. 2007. The structure of a plant photosystem I supercomplex at 3.4 Å resolution. Nature 447: 58–63.
Asmussen CB, Liston A. 1998. Chloroplast DNA characters, phylogeny, and classification of Lathyrus (Fabaceae). Am J Bot 85: 387–401.
Baer CF, Miyamoto MM, Denver DR. 2007. Mutation rate variation in multicellular eukaryotes: Causes and consequences. Nat Rev Genet 8: 619–631.
Birky CW Jr, Walsh JB. 1992. Biased gene conversion, copy number, and apparent mutation rate differences within chloroplast and bacterial genomes. Genetics 130: 677–683.
Bookjans G, Stummann BM, Henningsen KW. 1984. Preparation of chloroplast DNA from pea plastids isolated in a medium of high ionic strength. Anal Biochem 141: 244–247.
Boudreau E, Takahashi Y, Lemieux C, Turmel M, Rochaix JD. 1997. The chloroplast ycf3 and ycf4 open reading frames of Chlamydomonas reinhardtii are required for the accumulation of the photosystem I complex. EMBO J 16: 6095–6104.
Brandvain Y, Wade MJ. 2009. The functional transfer of genes from the mitochondria to the nucleus: The effects of selection, mutation, population size and rate of self-fertilization. Genetics 182: 1129–1139.
Cai Z, Guisinger M, Kim HG, Ruck E, Blazier JC, McMurtry V, Kuehl JV, Boore J, Jansen RK. 2008. Extensive reorganization of the plastid genome of Trifolium subterraneum (Fabaceae) is associated with numerous repeated sequences and novel DNA insertions. J Mol Evol 67: 696–704.
Chateigner-Boutin AL, Small I. 2007. A rapid high-throughput method for the detection and quantification of RNA editing based on high-resolution melting of amplicons. Nucleic Acids Res 35: e114. doi: 10.1093/nar/gkm640.
Cho Y, Mower JP, Qiu YL, Palmer JD. 2004. Mitochondrial substitution rates are extraordinarily elevated and variable in a genus of flowering plants. Proc Natl Acad Sci 101: 17741–17746.
Cusack BP, Wolfe KH. 2007. When gene marriages don’t work out: Divorce by subfunctionalization. Trends Genet 23: 270–272.
Doyle JJ, Doyle JL, Palmer JD. 1995. Multiple independent losses of two genes and one intron from legume chloroplast genomes. Syst Bot 20: 272–294.
Drake JW. 2007. Too many mutants with multiple mutations. Crit Rev Biochem Mol Biol 42: 247–258.
Drea SC, Mould RM, Hibberd JM, Gray JC, Kavanagh TA. 2001. Tissue-specific and developmental-specific expression of an Arabidopsis thaliana gene encoding the lipoamide dehydrogenase component of the plastid pyruvate dehydrogenase complex. Plant Mol Biol 46: 705–715.
Drouin G, Daoud H, Xia J. 2008. Relative rates of synonymous substitutions in the mitochondrial, chloroplast and nuclear genomes of seed plants. Mol Phylogenet Evol 49: 827–831.
Duret L, Galtier N. 2009. Biased gene conversion and the evolution of mammalian genomic landscapes. Annu Rev Genomics Hum Genet 10: 285–311.
Emanuelsson O, Nielsen H, Brunak S, von Heijne G. 2000. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol 300: 1005–1016.
Downloaded from genome.cshlp.org on December 25, 2010 - Published by Cold Spring Harbor Laboratory Press
Erixon P, Oxelman B. 2008. Whole-gene positive selection, elevated synonymous substitution rates, duplication, and indel evolution of the chloroplast clpP1 gene. PLoS ONE 3: e1386. doi: 10.1371/journal.pone.0001386.
Fox AK, Tuch BB, Chuang JH. 2008. Measuring the prevalence of regional mutation rates: An analysis of silent substitutions in mammals, fungi, and insects. BMC Evol Biol 8: 186. doi: 10.1186/1471-2148-8-186.
Funk HT, Berg S, Krupinska K, Maier UG, Krause K. 2007. Complete DNA sequences of the plastid genomes of two parasitic flowering plant species, Cuscuta reflexa and Cuscuta gronovii. BMC Plant Biol 7: 45. doi: 10.1186/1471-2229-7-45.
Gantt JS, Baldauf SL, Calie PJ, Weeden NF, Palmer JD. 1991. Transfer of rpl22 to the nucleus greatly preceded its loss from the chloroplast and involved the gain of an intron. EMBO J 10: 3073–3078.
Gaut BS. 1998. Molecular clocks and nucleotide substitution rates in higher plants. Evol Biol 30: 93–120.
Gaut BS, Muse SV, Clegg MT. 1993. Relative rates of nucleotide substitution in the chloroplast genome. Mol Phylogenet Evol 2: 89–96.
Gornicki P, Faris J, King I, Podkowinski J, Gill B, Haselkorn R. 1997. Plastid-localized acetyl-CoA carboxylase of bread wheat is encoded by a single gene on each of the three ancestral chromosome sets. Proc Natl Acad Sci 94: 14179–14184.
Gouy M, Guindon S, Gascuel O. 2010. SeaView version 4: A multiplatform graphical user interface for sequence alignment and phylogenetic tree building. Mol Biol Evol 27: 221–224.
Graur D, Li W-H. 1999. Fundamentals of molecular evolution. Sinauer, Sunderland, MA.
Guisinger MM, Kuehl JV, Boore JL, Jansen RK. 2008. Genome-wide analyses of Geraniaceae plastid DNA reveal unprecedented patterns of increased nucleotide substitutions. Proc Natl Acad Sci 105: 18424–18429.
Guo X, Castillo-Ramirez S, Gonzales V, Bustos P, Fernandez-Vazquez JL, Santamaria RI, Arellano J, Cevallos MA, Davila G. 2007. Rapid evolutionary change of common bean (Phaseolus vulgaris L.) plastome and genomic diversification of legume chloroplasts. BMC Genomics 8: 228. doi: 10.1186/1471-2164-8-228.
Huang Y, Zhang L. 2004. Rapid and sensitive dot-matrix methods for genome analysis. Bioinformatics 20: 460–466.
Inada M, Sasaki T, Yukawa M, Tsudzuki T, Sugiura M. 2004. A systematic search for RNA editing sites in pea chloroplasts: An editing event causes diversification from the evolutionarily conserved amino acid sequence. Plant Cell Physiol 45: 1615–1622.
Jansen RK, Cai Z, Raubeson LA, Daniell H, Depamphilis CW, Leebens-Mack J, Muller KF, Guisinger-Bellian M, Haberle RC, Hansen AK, et al. 2007. Analysis of 81 genes from 64 plastid genomes resolves relationships in angiosperms and identifies genome-scale evolutionary patterns. Proc Natl Acad Sci 104: 19369–19374.
Jansen RK, Wojciechowski MF, Sanniyasi E, Lee SB, Daniell H. 2008. Complete plastid genome sequence of the chickpea (Cicer arietinum) and the phylogenetic distribution of rps12 and clpP intron losses among legumes (Leguminosae). Mol Phylogenet Evol 48: 1204–1217.
Jensen PE, Bassi R, Boekema EJ, Dekker JP, Jansson S, Leister D, Robinson C, Scheller HV. 2007. Structure, function and regulation of plant photosystem I. Biochim Biophys Acta 1767: 335–352.
Jolley C, Ben-Shem A, Nelson N, Fromme P. 2005. Structure of plant photosystem I revealed by theoretical modeling. J Biol Chem 280: 33627–33636.
Kato T, Kaneko T, Sato S, Nakamura Y, Tabata S. 2000. Complete structure of the chloroplast genome of a legume, Lotus japonicus. DNA Res 7: 323–330.
Kenicer GJ, Kajita T, Pennington RT, Murata J. 2005. Systematics and biogeography of Lathyrus (Leguminosae) based on internal transcribed spacer and cpDNA sequence data. Am J Bot 92: 1199–1209.
Kim K-J, Lee H-L. 2004. Complete chloroplast genome sequences from Korean ginseng (Panax schinseng Nees). Comparative analysis of sequence evolution among 17 vascular plants. DNA Res 11: 247–261.
Kimura M. 1977. Preponderance of synonymous changes as evidence for the neutral theory of molecular evolution. Nature 267: 275–276.
Kimura M. 1983. The neutral theory of molecular evolution. Cambridge University Press, Cambridge, UK.
Kode V, Mudd EA, Iamtham S, Day A. 2005. The tobacco plastid accD gene is essential and is required for leaf development. Plant J 44: 237–244.
Konishi T, Shinohara K, Yamada K, Sasaki Y. 1996. Acetyl-CoA carboxylase in higher plants: Most plants other than gramineae have both the prokaryotic and the eukaryotic forms of this enzyme. Plant Cell Physiol 37: 117–122.
Lee HL, Jansen RK, Chumley TW, Kim KJ. 2007. Gene relocations within chloroplast genomes of Jasminum and Menodora (Oleaceae) are due to multiple, overlapping inversions. Mol Biol Evol 24: 1161–1180.
Lutziger I, Oliver DJ. 2000. Molecular evidence of a unique lipoamide dehydrogenase in plastids: Analysis of plastidic lipoamide dehydrogenase from Arabidopsis thaliana. FEBS Lett 484: 12–16.
Martin CH, Meyerowitz EM. 1986. Characterization of the boundaries between adjacent rapidly and slowly evolving genomic regions in Drosophila. Proc Natl Acad Sci 83: 8654–8658.
McNeal JR, Kuehl JV, Boore JL, Depamphilis CW. 2007. Complete plastid genome sequences suggest strong selection for retention of photosynthetic genes in the parasitic plant genus Cuscuta. BMC Plant Biol 7: 57. doi: 10.1186/1471-2229-7-57.
Millen RS, Olmstead RG, Adams KL, Palmer JD, Lao NT, Heggie L, Kavanagh TA, Hibberd JM, Gray JC, Morden CW, et al. 2001. Many parallel losses of infA from chloroplast DNA during angiosperm evolution with multiple independent transfers to the nucleus. Plant Cell 13: 645–658.
Milligan BG. 1989. Purification of chloroplast DNA using hexadecyltrimethylammonium bromide. Plant Mol Biol Rep 7: 144–149.
Milligan BG, Hampton JN, Palmer JD. 1989. Dispersed repeats and structural reorganization in subclover chloroplast DNA. Mol Biol Evol 6: 355–368.
Mower JP, Touzet P, Gummow JS, Delph LF, Palmer JD. 2007. Extensive variation in synonymous substitution rates in mitochondrial genes of seed plants. BMC Evol Biol 7: 135. doi: 10.1186/1471-2148-7-135.
Muse SV. 2000. Examining rates and patterns of nucleotide substitution in plants. Plant Mol Biol 42: 25–43.
Nagano Y, Matsuno R, Sasaki Y. 1991a. Sequence and transcriptional analysis of the gene cluster trnQ-zfpA-psaI-ORF231-petA in pea chloroplasts. Curr Genet 20: 431–436.
Nagano Y, Ishikawa H, Matsuno R, Sasaki Y. 1991b. Nucleotide sequence and expression of the ribosomal protein L2 gene in pea chloroplasts. Plant Mol Biol 17: 541–545.
Nishant KT, Singh ND, Alani E. 2009. Genomic mutation rates: What high-throughput methods can tell us. BioEssays 31: 912–920.
Ochman H, Wilson AC. 1987. Evolution in bacteria: Evidence for a universal substitution rate in cellular genomes. J Mol Evol 26: 74–86.
Ohyama K, Fukuzawa H, Kohchi T, Shirai H, Sano T, Sano S, Umesono K, Shiki Y, Takeuchi M, Chang Z, et al. 1986. Chloroplast gene organization deduced from complete sequence of liverwort Marchantia polymorpha chloroplast DNA. Nature 322: 572–574.
Onishi T, Takahashi Y. 2009. Effects of site-directed mutations in the chloroplast-encoded ycf4 gene on photosystem I complex assembly in the green alga Chlamydomonas reinhardtii. Plant Cell Physiol 50: 1750–1760.
Ozawa SI, Nield J, Terao A, Stauber EJ, Hippler M, Koike H, Rochaix JD, Takahashi Y. 2009. Biochemical and structural studies of the large Ycf4-photosystem I assembly complex of the green alga Chlamydomonas reinhardtii. Plant Cell 21: 2424–2442.
Palmer JD. 1985. Comparative organization of chloroplast genomes. Annu Rev Genet 19: 325–354.
Palmer JD, Thompson WF. 1981. Clone banks of the mung bean, pea and spinach chloroplast genomes. Gene 15: 21–26.
Palmer JD, Osorio B, Thompson WF. 1988. Evolutionary significance of inversions in legume chloroplast DNAs. Curr Genet 14: 65–74.
Palmer JD, Adams KL, Cho Y, Parkinson CL, Qiu YL, Song K. 2000. Dynamic evolution of plant mitochondrial genomes: Mobile genes and introns and highly variable mutation rates. Proc Natl Acad Sci 97: 6960–6966.
Parkinson CL, Mower JP, Qiu YL, Shirk AJ, Song K, Young ND, DePamphilis CW, Palmer JD. 2005. Multiple major increases and decreases in mitochondrial substitution rates in the plant family Geraniaceae. BMC Evol Biol 5: 73. doi: 10.1186/1471-2148-5-73.
Perry J, Ashworth A. 1999. Evolutionary rate of a gene affected by chromosomal position. Curr Biol 9: 987–989.
Perry AS, Wolfe KH. 2002. Nucleotide substitution rates in legume chloroplast DNA depend on the presence of the inverted repeat. J Mol Evol 55: 501–508.
Reverdatto SV, Beilinson V, Nielsen NC. 1995. The rps16, accD, psaI, ORF 203, ORF 151, ORF 103, ORF 229 and petA gene cluster in the chloroplast genome of soybean (PGR95-051). Plant Physiol 109: 338.
Sanger F, Nicklen S, Coulson AR. 1977. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci 74: 5463–5467.
Sasaki Y, Kozaki A, Ohmori A, Iguchi H, Nagano Y. 2001. Chloroplast RNA editing required for functional acetyl-CoA carboxylase in plants. J Biol Chem 276: 3937–3940.
Saski C, Lee SB, Daniell H, Wood TC, Tomkins J, Kim HG, Jansen RK. 2005. Complete chloroplast genome sequence of Glycine max and comparative analyses with other legume genomes. Plant Mol Biol 59: 309–322.
Schöttler MA, Flugel C, Thiele W, Stegemann S, Bock R. 2007. The plastome-encoded PsaJ subunit is required for efficient photosystem I excitation, but not for plastocyanin oxidation in tobacco. Biochem J 403: 251–260.
Sémon M, Wolfe KH. 2008. Preferential subfunctionalization of slow-evolving genes in Xenopus laevis. Proc Natl Acad Sci 105: 8333–8338.
Sloan DB, Oxelman B, Rautenberg A, Taylor DR. 2009. Phylogenetic analysis of mitochondrial substitution rate variation in the angiosperm tribe Sileneae. BMC Evol Biol 9: 260. doi: 10.1186/1471-2148-9-260.
Smith AG, Wilson RM, Kaethner TM, Willey DL, Gray JC. 1991. Pea chloroplast genes encoding a 4 kDa polypeptide of photosystem I and a putative enzyme of C1 metabolism. Curr Genet 19: 403–410.
Stefanovic S, Pfeil BE, Palmer JD, Doyle JJ. 2009. Relationships among phaseolid legumes based on sequences from eight chloroplast regions. Syst Bot 34: 115–128.
Stern DB, Goldschmidt-Clermont M, Hanson MR. 2010. Chloroplast RNA metabolism. Annu Rev Plant Biol 61: 125–155.
Sugiura M. 1992. The chloroplast genome. Plant Mol Biol 19: 149–168.
Timmis JN, Ayliffe MA, Huang CY, Martin W. 2004. Endosymbiotic gene transfer: Organelle genomes forge eukaryotic chromosomes. Nat Rev Genet 5: 123–135.
Tsudzuki J, Nakashima K, Tsudzuki T, Hiratsuka J, Shibata M, Wakasugi T, Sugiura M. 1992. Chloroplast DNA of black pine retains a residual inverted repeat lacking rRNA genes: nucleotide sequences of trnQ, trnK, psbA, trnI and trnH and the absence of rps16. Mol Gen Genet 232: 206–214.
Tsudzuki T, Wakasugi T, Sugiura M. 2001. Comparative analysis of RNA editing sites in higher plant chloroplasts. J Mol Evol 53: 327–332.
Ueda M, Fujimoto M, Arimura SI, Murata J, Tsutsumi N, Kadowaki KI. 2007. Loss of the rpl32 gene from the chloroplast genome and subsequent acquisition of a preexisting transit peptide within the nuclear gene in Populus. Gene 402: 51–56.
Ueda M, Nishikawa T, Fujimoto M, Takanashi H, Arimura SI, Tsutsumi N, Kadowaki KI. 2008. Substitution of the gene for chloroplast RPS16 was assisted by generation of a dual targeting signal. Mol Biol Evol 25: 1566–1575.
Varotto C, Pesaresi P, Jahns P, Lessnick A, Tizzano M, Schiavon F, Salamini F, Leister D. 2002. Single and double knockouts of the genes for photosystem I subunits G, K, and H of Arabidopsis. Effects on photosystem I composition, photosynthetic electron flow, and state transitions. Plant Physiol 129: 616–624.
Wang J, Gonzalez KD, Scaringe WA, Tsai K, Liu N, Gu D, Li W, Hill KA, Sommer SS. 2007. Evidence for mutation showers. Proc Natl Acad Sci 104: 8403–8408.
Wilde A, Hartel H, Hubschmann T, Hoffmann P, Shestakov SV, Borner T. 1995. Inactivation of a Synechocystis sp strain PCC 6803 gene with homology to conserved chloroplast open reading frame 184 increases the photosystem II-to-photosystem I ratio. Plant Cell 7: 649–658.
Wojciechowski MF, Lavin M, Sanderson MJ. 2004. A phylogeny of legumes (Leguminosae) based on analysis of the plastid matK gene resolves many well-supported subclades within the family. Am J Bot 91: 1846–1862.
Wolfe KH, Li WH, Sharp PM. 1987. Rates of nucleotide substitution vary greatly among plant mitochondrial, chloroplast, and nuclear DNAs. Proc Natl Acad Sci 84: 9054–9058.
Wolfe KH, Sharp PM, Li W-H. 1989. Rates of synonymous substitution in plant nuclear genes. J Mol Evol 29: 208–211.
Wolfe KH, Morden CW, Palmer JD. 1992. Function and evolution of a minimal plastid genome from a nonphotosynthetic parasitic plant. Proc Natl Acad Sci 89: 10648–10652.
Wyman SK, Jansen RK, Boore JL. 2004. Automatic annotation of organellar genomes with DOGMA. Bioinformatics 20: 3252–3255.
Yang Z. 2007. PAML 4: Phylogenetic analysis by maximum likelihood. Mol Biol Evol 24: 1586–1591.
Yang Y, Sterling J, Storici F, Resnick MA, Gordenin DA. 2008. Hypermutability of damaged single-strand DNA formed at double-strand breaks and uncapped telomeres in yeast Saccharomyces cerevisiae. PLoS Genet 4: e1000264. doi: 10.1371/journal.pgen.1000264.
Received June 20, 2010; accepted in revised form September 20, 2010.
Method
High-throughput discovery of rare insertions and deletions in large cohorts
Francesco L.M. Vallania, Todd E. Druley, Enrique Ramos, Jue Wang, Ingrid Borecki, Michael Province, and Robi D. Mitra1
Center for Genome Sciences and Systems Biology, Department of Genetics, Washington University in St. Louis School of Medicine, St. Louis, Missouri 63108, USA
Pooled-DNA sequencing strategies enable fast, accurate, and cost-effective detection of rare variants, but current approaches are not able to accurately identify short insertions and deletions (indels), despite their pivotal role in genetic disease. Furthermore, the sensitivity and specificity of these methods depend on arbitrary, user-selected significance thresholds, whose optimal values change from experiment to experiment. Here, we present an experimental and computational strategy that combines a synthetically engineered DNA library inserted in each run with a new computational approach named SPLINTER, which detects and quantifies short indels and substitutions in large pools. SPLINTER integrates information from the synthetic library to select the optimal significance thresholds for every experiment. We show that SPLINTER detects indels (up to 4 bp) and substitutions in large pools with high sensitivity and specificity, accurately quantifies variant frequency (r = 0.999), and compares favorably with existing algorithms for the analysis of pooled sequencing data. We applied our approach to analyze a cohort of 1152 individuals, identifying 48 variants and validating 14 of 14 (100%) predictions by individual genotyping. Thus, our strategy provides a novel and sensitive method that will speed the discovery of novel disease-causing rare variants.
[Supplemental material is available online at www.genome.org. Sequencing data is available at http://cgs.wustl.edu/~fvallania/4_splinter_2010/5_splinter_webpage/SPLINTER_supporting_material.html.
Novel SNP data have been submitted to the NCBI dbSNP (http://www.ncbi.nlm.nih.gov/snp) under accession nos. rs113740468, rs78985299, and rs113225202. SPLINTER is available at http://www.ibridgenetwork.org/wustl/splinter.]
Understanding the genetic basis of common diseases is an important step toward the goal of personalized medicine (Ng et al. 2008). At present, two distinct hypotheses are under debate (Goldstein 2009; Manolio et al. 2009). The common variant, common disease (CVCD) hypothesis states that disease-causing alleles are common in the human population (frequency > 5%) (Reich and Lander 2001). In contrast, the rare variant, common disease (RVCD) hypothesis posits that multiple disease-causing alleles, which individually occur at low frequencies (<<1%), cumulatively explain a large portion of disease susceptibility (Cohen et al. 2004; Ji et al. 2008). Recent evidence favors the RVCD hypothesis, as common variants have failed to explain many complex traits (Manolio et al. 2009), while rare genetic variants have been successfully associated with HDL levels (Cohen et al. 2004), blood pressure (Ji et al. 2008), obesity (Ahituv et al. 2007), and colorectal cancer (Fearnhead et al. 2004, 2005).
Due to their low frequencies, identifying rare, disease-associated variants requires genotyping large cohorts in order to reach the appropriate statistical power (e.g., 5000 individuals are required to detect mutations present at 0.1% in the population with a probability of 96%). “Collapsing” methods, in which rare variants are grouped together before association with disease, have been shown to improve statistical power (Li and Leal 2008), but analysis of large cohorts is still required.
One recent strategy for genotyping large cohorts consists of pooled-sample sequencing, where individual samples are pooled prior to analysis on a next-generation
1 Corresponding author. E-mail [email protected]; fax (314) 362-2157. Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.109157.110.
sequencing platform (Van Tassel et al. 2008; Druley et al. 2009; Erlich et al. 2009; Koboldt et al. 2009; Prabhu and Pe’er 2009). By leveraging the massively parallel output of second-generation DNA sequencing, pooled-sample sequencing allows fast and accurate detection of rare variants in thousands of samples at a fraction of the time and cost of traditional methods. Individual sample identities can be recovered using a combinatorial pooling strategy (such as DNA Sudoku) (Erlich et al. 2009). Despite the promise of this method for studying rare genetic variants, current computational approaches pose a bottleneck because they are focused either on single individual genotyping (Li et al. 2008) or on the detection of common variants in small-sized pools (Koboldt et al. 2009). Our previously developed SNPseeker algorithm allows the detection of single nucleotide substitutions in large pooled samples (Druley et al. 2009), but still fails to address two key challenges in rare variant detection.
First, no algorithm has yet been able to detect indels in pools larger than 42 individuals without many false-positives (~40%) (Koboldt et al. 2009), despite the fact that indels account for one-quarter of the known mutations implicated in Mendelian diseases (Ng et al. 2008; Stenson et al. 2009). In particular, short indels represent the most common type of this class of variation (Ng et al. 2008) and have been reported to occur as rare germline variants associated with genetic diseases such as breast and ovarian cancer (King et al. 2003). Efforts to detect disease-associated genetic variants will therefore greatly benefit from the ability to accurately detect rare short indels.
Second, in order to accurately detect rare variants in a large pooled sample, an optimal significance cutoff for the accurate discrimination of true variants from false-positives must be chosen. This parameter is, in practice, affected by sequencing error
20:1711–1718 © 2010 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/10; www.genome.org
rates and average coverage, which have been shown to change for every run (Druley et al. 2009). Failure to define an optimal cutoff results in lower sensitivity and increased false-positive rates. Since the rare variant hypothesis posits that individual disease-associated mutations will be extremely rare (but cumulatively common), it is critical to be able to discriminate, in every experiment, a single heterozygous individual in a large cohort from the background noise. Until now this has not been reliably demonstrated.
To address these challenges, we have developed a novel experimental and computational strategy that combines a synthetically engineered DNA library inserted in each run with a new computational approach named SPLINTER (short indel prediction by large deviation inference and nonlinear true frequency estimation by recursion). This approach detects and quantifies short insertions, deletions, and substitutions accurately by using the synthetic DNA library to tune SPLINTER and to measure specificity and sensitivity in every experiment (Fig. 1; Supplemental Fig. 1). SPLINTER requires two components: a negative control (1–2 kb of cloned plasmid DNA) used to generate a run-specific error model, and a positive control consisting of a synthetic DNA library simulating an artificial pool with mutations engineered at known positions and frequencies. We tested
SPLINTER on synthetically engineered pooled samples containing different mutations at different frequencies in a variety of sequence-context backgrounds, obtaining 100% sensitivity with no false-positives in pools of up to 500 individuals. SPLINTER was also able to accurately quantify allele frequencies: predicted and observed allele frequencies had a correlation of 0.999. We find that SPLINTER significantly outperforms the other algorithms we evaluated for the analysis of pooled sequencing data: it is the most sensitive approach and returns almost no false-positives. We then applied our strategy to multiple pooled samples, identifying novel and already described sequence variants, all of which were independently validated.
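The cohort-size figure quoted in the introduction (5000 individuals, a variant at 0.1%, 96% probability) can be explored with a simple binomial calculation. The text does not state the model behind that figure, so the sketch below rests on assumptions: each of n sampled individuals independently carries the variant with probability f, and "detection" means observing at least k carriers. Under these assumptions, k = 2 gives a probability of roughly 0.96, which matches the quoted value, but this reading is an inference and not something the authors state.

```python
from math import comb


def prob_at_least(k, n, f):
    """P(X >= k) for X ~ Binomial(n, f).

    n: number of sampled individuals (assumed independent draws)
    f: probability that one individual carries the variant
    k: minimum number of carriers counted as a detection (assumption)
    """
    below = sum(comb(n, i) * f**i * (1 - f) ** (n - i) for i in range(k))
    return 1.0 - below


# Probability of seeing at least one, and at least two, carriers.
print(round(prob_at_least(1, 5000, 0.001), 4))
print(round(prob_at_least(2, 5000, 0.001), 4))
```

Varying `n` in this function shows how quickly the probability falls off for smaller cohorts, which is the motivation the introduction gives for pooling.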
Results
Detection of rare insertions and deletions in synthetic libraries
For each experiment, we first pooled equimolar amounts of sample DNA together with the controls and generated a DNA library to be sequenced on the Illumina Genome Analyzer IIx sequencing platform. We then mapped back the sequencing reads to their reference and built a run-specific error model from the negative control reads. Next, we optimized our cutoff parameters on the positive control and then called SNPs and indels on our sample (see Supplemental material). We first sought to determine the upper
Figure 1. Experimental and computational pipeline for detection of indels and substitutions in large pooled DNA samples: DNA samples from a selected group of patients are pooled in a complex mixture to be used as a template for PCR amplification of selected genomic loci. The pool PCR products are then combined in an equimolar mix containing a DNA fragment without variants (negative control) and a synthetic pool with engineered mutations present at the lowest expected variant frequency present in the sample (positive control). The mix is then sequenced on the Illumina Genome Analyzer IIx, and sequencing reads are mapped back to the sample and control reference sequences by gapped alignment. The negative control reads are used to generate a second-order error model to be used in the variant calling phase. The positive control allows determination of the optimal cutoff for maximizing specificity and sensitivity of the analysis. SPLINTER is then used to analyze the pooled sample, resulting in detection and quantification of indels and substitutions present in the pool. The SPLINTER algorithm detects true segregating variants by comparing the frequency vector of observed read bases to an expected frequency vector defined by the error model. If the observed vector is significantly different from the expected vector, then SPLINTER calls that position a sequence variant. For each identified variant, SPLINTER then performs a maximum likelihood fit in order to estimate its frequency in the pooled sample.
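The comparison described in the Figure 1 legend can be illustrated with a toy score. The sketch below is not SPLINTER's implementation: it ignores the second-order error model and indels, and the names `observed_counts`, `expected_freqs`, and `cutoff`, as well as all numbers, are hypothetical values chosen for illustration. The score is the read depth times the Kullback-Leibler divergence between the observed and expected base frequencies, the quantity that governs large-deviation probabilities for multinomial samples (Sanov's theorem).

```python
import math

BASES = ("A", "C", "G", "T")


def deviation_score(observed_counts, expected_freqs):
    """Depth-scaled KL divergence D(observed || expected), in nats.

    observed_counts: dict base -> number of reads showing that base
    expected_freqs:  dict base -> probability under the error model
    """
    depth = sum(observed_counts.get(base, 0) for base in BASES)
    score = 0.0
    for base in BASES:
        count = observed_counts.get(base, 0)
        if count == 0:
            continue  # zero counts contribute nothing to the sum
        score += count * math.log((count / depth) / expected_freqs[base])
    return score


def call_variant(observed_counts, expected_freqs, cutoff):
    """Flag a position whose observed bases deviate beyond the cutoff."""
    return deviation_score(observed_counts, expected_freqs) > cutoff


# Hypothetical error model for a position whose reference base is A.
expected = {"A": 0.997, "C": 0.001, "G": 0.001, "T": 0.001}
background = {"A": 29910, "C": 30, "G": 30, "T": 30}   # matches the error model
with_variant = {"A": 29880, "C": 30, "G": 60, "T": 30}  # extra G reads

cutoff = 5.0  # illustrative only; the paper tunes its cutoff on the positive control
print(call_variant(background, expected, cutoff))
print(call_variant(with_variant, expected, cutoff))
```

In the workflow described in the legend, the expected frequencies would come from the negative control and the cutoff from the positive control, so neither is a fixed constant across runs.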
limit of the number of samples that SPLINTER can analyze in a pool. To do so, we generated three synthetic DNA libraries, each containing 15 different indels and substitutions (Supplemental Tables 1, 2; Supplemental material) introduced at frequencies of 0.005, 0.002, and 0.001, respectively (corresponding to a single heterozygous variant in cohorts of 100, 250, and 500 diploid individuals). We sequenced these libraries using the workflow shown in Figure 1. In each instance, SPLINTER was able to correctly identify every variant (15/15 variants) without making false-positive calls (2254/2254 true-negatives) (Fig. 3A, below; Supplemental Table 4). We concluded that SPLINTER can accurately and reliably detect single heterozygous mutations in pools of up to 500 individuals.
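The correspondence between variant frequency and cohort size used above follows from counting alleles: one heterozygous carrier among N diploid individuals contributes one variant allele out of 2N. The short sketch below makes that arithmetic explicit and, as an extension, converts a per-allele coverage target into total coverage per base. The function names are illustrative, and the 30-fold per-allele value passed in the example is the figure reported in the coverage analysis of this paper.

```python
def single_het_frequency(n_diploid):
    """Allele frequency of one heterozygous carrier among n diploid individuals."""
    return 1 / (2 * n_diploid)


def total_coverage_needed(n_diploid, per_allele_coverage):
    """Average coverage per base needed to reach a per-allele coverage target."""
    return 2 * n_diploid * per_allele_coverage


for n in (100, 250, 500):
    print(n, single_het_frequency(n), total_coverage_needed(n, 30))
```

For the three synthetic libraries this reproduces the frequencies 0.005, 0.002, and 0.001 stated in the text.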
Estimation of required sequencing coverage for optimal indel and substitution detection
We next investigated how SPLINTER’s accuracy changed as a function of average sequencing coverage. To do so, we sampled the sequencing data obtained for each of the three previous libraries at different fractions (Supplemental material) and then computed the accuracy of our predictions in the form of an area under a receiver operating characteristic curve (AUC), a commonly used metric of
accuracy ranging from 0.5 (random guessing) to 1 (100% sensitivity and specificity). By plotting AUC as a function of average sequencing coverage, we found that accuracy increased with coverage, with high-frequency variants requiring less coverage than lower-frequency variants (Fig. 2A). By analyzing AUC as a function of coverage per allele, we observed a clear overlap of the curves for each pool, reaching AUC equal to 1 at ~30-fold average coverage per haploid genome (Fig. 2B), indicating that accurate detection can be achieved given enough coverage independently of pool size.
Recent resequencing efforts show that indel detection remains challenging, as the false-positive rate for indels is 15-fold higher than that for substitutions (Pleasance et al. 2010). Our initial data suggested that indels can be detected as sensitively and accurately as substitutions. To test this hypothesis, we generated five additional DNA libraries with synthetic insertions, deletions, and substitutions included at a wide range of frequencies (from one to 50 variants in 1000 total alleles) (Supplemental Tables 2, 4). We achieved 100% sensitivity for all of the pools (9/9 indel variants and 10/10 substitution variants) with specificities between 99.91% and 100% (between 2263/2265 and 2259/2259 true-negatives). We then plotted the relationship between AUC and coverage for
Figure 2. Relationship between variant detection accuracy and average sequencing coverage per base. (A) Accuracy expressed as AUC (area under the curve) (y-axis) plotted as a function of average sequencing coverage per base (x-axis) for synthetic pools with variants present at frequencies 1/200, 1/500, and 1/1000. (B) Same as in A, with average sequencing coverage per base per allele on the x-axis. (C–E ) AUC (y-axis) as a function of average sequencing coverage per base (x-axis) for insertions (C ), deletions (D), and substitutions (E ). Variants are present at frequencies 1/1000, 5/1000, 10/1000, 15/1000, and 50/1000.
Genome Research www.genome.org
1713
Downloaded from genome.cshlp.org on December 25, 2010 - Published by Cold Spring Harbor Laboratory Press
Vallania et al. each set. Indels converged to AUC equal to 1 at a rate comparable to substitutions independently of the frequency of the mutation (Fig. 2A–C). Thus, we conclude that SPLINTER detects indels as accurately and as sensitively as it does substitutions. Since many deleterious indels are 4 bp or shorter (King et al. 2003; Ng et al. 2008), we wanted to determine whether SPLINTER could accurately detect indels as large as 4 bp. We generated and sequenced two synthetic pools containing eight and 10 4-bp indels with frequencies ranging from 0.001 to 0.020 and from 0.025 to 0.045, respectively. SPLINTER achieved 100% sensitivity 10/10 variants and 100% specificity (2253/2253 true-negatives) for allele frequencies between 0.025 and 0.045 and 100% sensitivity (8/8 variants) and 99.5% specificity (2243/2253 true-negatives) between 0.001 and 0.020 (Supplemental Tables 3, 4). These results suggest that SPLINTER is sensitive and specific in detecting 4-bp indels.
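The two quantities used in the coverage analysis, AUC and coverage per allele, can be sketched as follows. This is a minimal illustration, assuming each candidate position carries a numeric score for which higher means "more likely a variant" (for example, a negative log P-value); the function names and the scoring convention are assumptions and are not taken from the SPLINTER source.

```python
def auc(scores_true, scores_false):
    """Area under the ROC curve computed from its rank interpretation.

    AUC equals the probability that a randomly chosen true variant scores
    higher than a randomly chosen non-variant position (ties count as 0.5).
    """
    if not scores_true or not scores_false:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for t in scores_true:
        for f in scores_false:
            if t > f:
                wins += 1.0
            elif t == f:
                wins += 0.5
    return wins / (len(scores_true) * len(scores_false))


def coverage_per_allele(avg_coverage_per_base, n_diploid):
    """Average coverage per base divided by the number of alleles (2N) in the pool."""
    return avg_coverage_per_base / (2 * n_diploid)


# Hypothetical scores: perfect separation gives AUC = 1, identical scores give 0.5.
print(auc([9.0, 7.5], [1.2, 0.3]))
print(auc([1.0], [1.0]))
# A pool of 500 individuals sequenced at 30,000-fold per base has 30-fold per allele.
print(coverage_per_allele(30000, 500))
```

The pairwise formulation is quadratic in the number of positions, which is acceptable for a few thousand positions; a sort-based rank sum would be the natural choice for larger inputs.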
Comparison of SPLINTER with other variant discovery approaches

We next compared SPLINTER with existing tools for variant calling. We used the synthetic DNA libraries previously described to benchmark the sensitivity and positive predictive value of each method. We compared SPLINTER with SNPseeker (Druley et al. 2009), MAQ (Li et al. 2008), SAMtools (Li et al. 2009), and VarScan (Koboldt et al. 2009) for the detection of substitutions (Fig. 3A,B) and with SAMtools and VarScan for the detection of indels (Fig. 3C,D). For each data set analyzed, SPLINTER significantly outperformed every other approach. In all of the synthetic libraries containing substitutions, SPLINTER detected all of the synthetic variants with no false-positives, thus achieving 100% sensitivity and specificity. SNPseeker also achieved perfect accuracy in the pool simulating 100 individuals, but had a 20% positive predictive value in the libraries simulating 250 and 500 individuals, and had only 80% sensitivity in the 500-individual library. The other approaches detected variants with substantially lower sensitivity and positive predictive values in all libraries. For each indel set, SPLINTER returned all of the true variants with no false-positives, except for the indel 1 set and the 4-bp 1 set (~30% and ~50% positive predictive values, respectively). In comparison, every other approach resulted in false-positive rates greater than 80%, while achieving low sensitivity, with the exception of the second 4-bp set. We also compared SPLINTER with a recently published algorithm for pooled DNA variant detection called CRISP (Vikas 2010) for both substitution and indel detection (Supplemental Fig. 2). SPLINTER outperformed CRISP in both sensitivity (an improvement of up to 40%) and positive predictive value (an improvement of up to 80%).

To distinguish whether the improved accuracy in variant finding originated from improved alignments or improved variant calling, we also compared the performance of SPLINTER using our alignment algorithm versus using reads aligned with Novoalign (http://www.novocraft.com). Both aligners resulted in comparable performance in finding true variants (Supplemental Fig. 3), although our aligner showed small increases in sensitivity and positive predictive value in several of the analyzed pools. This result suggests that the improved variant calling accuracy depended mostly on the variant calling algorithm and not on the underlying aligner. Taken together, these results demonstrate that SPLINTER outperforms other approaches at detecting single nucleotide substitutions and indels in large pools.

Figure 3. Comparison between SPLINTER and other variant calling algorithms: Substitutions (A,B) and indels (C,D) were analyzed independently. For each approach, performance was evaluated by assessing sensitivity (true-positive hits divided by total positives in the set) and positive predictive value (true-positive hits divided by total hits).
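The performance measures used in this comparison can be written out explicitly. The sketch below uses the standard definitions of sensitivity, specificity, and positive predictive value; the function names are illustrative, and the false-positive count in the last example is hypothetical (chosen only to show what a 20% positive predictive value looks like).

```python
def sensitivity(true_pos, false_neg):
    """True-positive hits divided by all real variants in the set."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Correctly rejected positions divided by all non-variant positions."""
    return true_neg / (true_neg + false_pos)


def positive_predictive_value(true_pos, false_pos):
    """True-positive hits divided by all hits reported by the caller."""
    return true_pos / (true_pos + false_pos)


# Counts quoted in the text: 15/15 variants found, 2263/2265 true-negatives.
print(f"{sensitivity(15, 0):.2%}")
print(f"{specificity(2263, 2):.2%}")
# Hypothetical caller: 15 true hits among 75 total hits gives a PPV of 20%.
print(f"{positive_predictive_value(15, 60):.2%}")
```

Positive predictive value is the more informative companion to sensitivity here because non-variant positions vastly outnumber variants, so specificity stays close to 100% even for callers that report many false hits.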
Estimation of the frequency of rare insertions and deletions in synthetic libraries

Having established that SPLINTER could detect rare variants in pooled samples, we next examined whether SPLINTER could also accurately determine the frequencies of the identified variants. We compared estimated and expected indel frequencies from all of our libraries (frequency range 0.001–0.050) and found a very high correlation (r = 0.969, P < 2.2 × 10⁻¹⁶; Fig. 4A), indicating that SPLINTER was able to accurately estimate allele frequencies.

We next sought to better understand the causes of the observed errors in our allele frequency estimates. Allele quantification can be affected by pipetting errors during DNA pooling and by preferential amplification of specific alleles in the pooled PCR. To distinguish these experimental sources of error from errors introduced by the analysis itself, we constructed all of our plasmids so that each contained two mutations spaced far enough apart to be analyzed independently (i.e., with no overlapping reads). If pipetting error and amplification bias are the major sources of error in allele quantification, then the estimated allele frequencies of mutations on the same plasmid will be highly correlated. This was indeed the case. Frequency estimates for mutations within the same molecule were very highly correlated (r = 0.995, P < 2.2 × 10⁻¹⁶; Fig. 4C), indicating that most of the noise in variant quantification was due to experimental error. We similarly observed very high correlations with substitutions (frequency correlation r = 0.956, P < 2.2 × 10⁻¹⁶; pair correlation r = 0.993, P < 2.2 × 10⁻¹⁶; Fig. 4B,D) and 4-bp indels (frequency correlation r = 0.962, P = 1.501 × 10⁻¹¹; pair correlation r = 0.939, P = 5.599 × 10⁻⁵) (Supplemental Fig. 4).

Based on these results, we reasoned that robotic pooling of samples might improve allelic quantification. Therefore, we robotically pooled and sequenced a large cohort of 974 people previously analyzed in a GWA study (see Methods). As expected, we observed an almost perfect correlation (r = 0.999, P < 2.2 × 10⁻¹⁶; Fig. 4E) between the GWA frequencies and the frequencies estimated by SPLINTER, indicating that inaccurate pipetting was indeed a primary source of error.
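The frequency and pair correlations reported in this section are Pearson correlation coefficients. A minimal sketch of that computation is given below; the expected frequencies are those listed for the synthetic pools, whereas the "estimated" values are invented for illustration and are not data from this study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Expected frequencies of the synthetic pools (1, 5, 10, 15, and 50 per 1000 alleles).
expected = [0.001, 0.005, 0.010, 0.015, 0.050]
# Hypothetical estimates, made up for this example only.
estimated = [0.0012, 0.0046, 0.0108, 0.0139, 0.0521]
print(round(pearson_r(expected, estimated), 3))
```

For the pair analysis, the same function would be applied to the two frequency estimates obtained from mutations carried on the same plasmid: pipetting and amplification errors shift both estimates together, so a high pair correlation points to experimental rather than computational noise.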
Figure 4. Precise quantification of rare genetic variants in synthetic and real samples. (A,B) Correlation between variant frequency measured by SPLINTER (y-axis) and expected variant frequency (x-axis) from eight synthetic pools for indels (A) and substitutions (B). (C,D) Pair correlation between mutation pairs present in the same DNA molecule for indels (C ) and substitutions (D). (E) Correlation between variant frequency measured from GWA study (x-axis) and SPLINTER estimated frequency (y-axis).
High-throughput discovery of rare indels in large patient cohorts

Finally, we applied SPLINTER to a large human cohort as a “real-world” test of the algorithm. We sequenced 14 loci (2596 bp total) in 1152 individuals, which were divided into nine pools (94–178 individuals per pool) (see Methods). For every sequenced pool, we included a negative and positive control to tune SPLINTER. We identified, on average, 19 variants per pool (for a total of 151 variants, see Supplemental Table 6). To confirm SPLINTER’s accuracy, we examined the overlap of our hits with variants listed in dbSNP. We observed large overlapping fractions: between 68.5% and 100% of the identified variants in each pool could be found in dbSNP (Supplemental Tables 5, 6). In all cases, statistical significance was reached (Fisher’s exact test; Supplemental Table 5). We selected 14 variants (three novel variants and 11 from dbSNP) from the largest analyzed pool for independent validation by individual genotyping using the Sequenom iPLEX platform. All 14 variants were confirmed, resulting in 100% positive predictive value. Furthermore, allele frequencies were highly correlated with those estimated by SPLINTER (r = 0.985, P = 5.958 × 10⁻⁹; Supplemental Table 8; Supplemental Fig. 5). Together, these results demonstrate the utility of the SPLINTER methodology for the rapid analysis of large populations of individuals.

All of the computational tools, source codes, and the experimental datasets presented in this study can be accessed at http://cgs.wustl.edu/~fvallania/4_splinter_2010/5_splinter_webpage/SPLINTER_supporting_material.html.
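The significance of the dbSNP overlap was assessed with Fisher’s exact test. The text does not state how the contingency table was constructed or whether the test was one- or two-sided, so the sketch below makes both explicit as assumptions: a 2 × 2 table of sequenced positions (called versus not called, in dbSNP versus not in dbSNP) and a one-sided test for enrichment. All counts are hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins.
    Assumed layout for this illustration:
        a = called and in dbSNP        b = called, not in dbSNP
        c = not called, in dbSNP       d = not called, not in dbSNP
    """
    row1 = a + b          # all called positions
    col1 = a + c          # all dbSNP positions
    n = a + b + c + d     # all sequenced positions
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / comb(n, col1)


# Hypothetical pool: 19 calls, 16 of them in dbSNP, among 2596 sequenced positions
# of which 25 are assumed to be listed in dbSNP.
p_value = fisher_exact_greater(16, 3, 9, 2568)
print(p_value < 1e-10)
```

Integer binomial coefficients keep the calculation exact until the final division, which avoids underflow for the very small P-values expected when most calls coincide with known polymorphisms.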
Discussion

Rare genetic variation is likely to account for a substantial portion of heterogeneity in common and complex diseases. Identifying disease-associated rare variants requires the analysis of multiple loci in large cohorts. We have shown that a novel experimental design combined with SPLINTER can accurately identify genetic variants in large pools, leading to several advantages over other computational strategies.

First, we found that SPLINTER identified genetic variants with high sensitivity and precision, whereas the other methods were unable to detect a large fraction of the variation present in the samples. We found that a sequencing coverage of ~30× per haploid genome was required to detect mutations with high sensitivity and specificity. In earlier work, we successfully analyzed pooled samples using SNPseeker at lower sequencing coverage (~13.8-fold per haploid genome) (Druley et al. 2009). However, in that study most of the variants were present in many individuals in the pool, suggesting that higher sequence coverage is required to detect singleton alleles with ~100% confidence in a variety of different sequence contexts. This finding is confirmed by recent resequencing studies of single cancer genomes, where near-optimal accuracy of somatic SNP detection (3% false discovery rate) was achieved at ~40-fold average haploid genome coverage (Pleasance et al. 2010), and by the lower sensitivity and precision of SNPseeker compared with SPLINTER in detecting substitutions present at one in 1000.

Second, our strategy incorporates a synthetic positive control and a negative control, which allow estimation of sensitivity and specificity for each experiment. This is important because run-to-run variations in sequencing error rates can influence accuracy and perturb the optimal P-value cutoffs. The inclusion of the control DNA has a negligible impact on experiment cost. One single-end sequencing lane (~30 million 36-bp-long reads per lane) can provide enough coverage to analyze ~25 kb of genomic DNA in 500 patients, with the control sequences accounting for ~4% of the total sequencing data.

Third, SPLINTER detects indels with high sensitivity and accuracy. Detection of indels, even in single genome resequencing studies, is a challenging problem due to the difficulties in reducing the false-positive rate while retaining good sensitivity (Pleasance et al. 2010). In addition, previously published approaches cannot detect indels (Li et al. 2008; Druley et al. 2009), or can only be applied to small-sized cohorts (42 people) (Koboldt et al. 2009). Together, these issues have limited the application of pooled DNA sequencing. We have shown here that SPLINTER can accurately discriminate single indels in pools as large as 500 individuals with high sensitivity and specificity. By comparison, the best-performing alternative algorithm achieved, at best, an 80% false-positive rate.

Fourth, SPLINTER can accurately quantify the frequency of the alleles present in the pool. Although high correlations between real and estimated frequencies were observed, small discrepancies may result in errors in variant association to a phenotype if the variant is rare and the effect of the variant is high. Our pair correlation analysis shows that the major source of errors in quantification does not come from SPLINTER, but rather from pipetting errors in pool construction, as indicated by the improved correlations after robotic pipetting of the pools. This issue can, in fact, be resolved by performing orthogonal validation of the samples, which will be greatly facilitated by the overall performance of SPLINTER in detecting rare variants compared with other methods. In contrast, the major source of error in array-based pooled DNA analysis is array variation, which is seven times higher than pool construction variation (Macgregor 2007). This observation argues that our approach shows even higher accuracy compared with other experimental platforms.

Finally, our approach can be applied to any pooled cohort or any heterogeneous sample of any size and can be easily scaled up to whole-exome and whole-genome analysis. Given the presence of a positive control to infer the optimal parameters, pooled samples can accurately be analyzed without limitations on experimental design or achieved coverage. In this study, we used PCR to amplify the various genomic regions, but our strategy is also compatible with solid and liquid-phase genomic capture approaches (Mamanova et al. 2010).

We found that alignment errors decreased our ability to detect large indels. This explains why SPLINTER performed slightly worse in the analysis of the 4-bp indel libraries relative to the 1–2-bp indel libraries. To detect the longer indels, it was necessary to allow larger gaps in our read alignments, which increased the overall alignment noise. We believe this was due to potential sequencing artifacts or sample contaminants aligning back to the reference sequence, thereby reducing the signal coming from true variants. This limitation can be overcome with longer sequencing read lengths, which should reduce the ambiguity in aligning reads while allowing larger gaps (in this work, all sequencing reads were 36 bp in length). Similarly, while whole-genome analysis may present additional challenges due to increased sequence complexity compared with the analyzed synthetic controls, we expect this to mostly impact the read alignment step in the analysis pipeline, which can be addressed by generating paired-end and/or longer sequencing reads. In addition, with a reduced error rate, fewer observations at a given variant position will be needed to provide confidence in the variant call. Nevertheless, our approach is the first one to accurately call short indels in large pooled samples.
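The lane-capacity estimate given in the Discussion (one lane of about 30 million 36-bp reads for about 25 kb of target in 500 patients, with about 4% of the data spent on controls) can be checked with simple arithmetic. The sketch below assumes that every read aligns and that coverage is perfectly uniform, which real experiments will not achieve; the function name is hypothetical.

```python
def per_allele_coverage(n_reads, read_length, target_bp, n_diploid, control_fraction=0.0):
    """Idealized average coverage per base per allele for one sequencing lane.

    Assumes all reads align uniquely and coverage is uniform (an upper bound).
    """
    usable_bases = n_reads * read_length * (1.0 - control_fraction)
    n_alleles = 2 * n_diploid
    return usable_bases / (target_bp * n_alleles)


# Figures quoted in the text: ~30 million 36-bp reads, ~25 kb, 500 patients, ~4% controls.
cov = per_allele_coverage(30_000_000, 36, 25_000, 500, control_fraction=0.04)
print(f"{cov:.1f}-fold per allele")
```

Under these idealized assumptions the lane yields roughly 41-fold coverage per allele, which leaves some margin above the ~30-fold per haploid genome found to be required for accurate detection.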
One departure of our algorithm from other variant calling programs is that SPLINTER does not incorporate quality scores in any step of the analysis. We have found that our error model captures essentially the same information that is contained in quality scores (see Supplemental material; Druley et al. 2009), and so including quality score information does not improve SPLINTER’s performance. The high performance of our method compared with others that use quality scores (Li et al. 2008; Koboldt et al. 2009) suggests that this viewpoint is likely correct. Additionally, analyzing reads aligned with quality scores resulted in equal or lower performance when compared with reads aligned using our aligner (see Supplemental Fig. 5).

To obtain a complete understanding of the molecular causes of common diseases, it is critical to be able to detect and analyze rare variants (Van Tassell et al. 2008; Druley et al. 2009; Erlich et al. 2009; Koboldt et al. 2009; Prabhu and Pe’er 2009). Pooled DNA sequencing is an important method for rare variant analysis, since it enables the rapid and cost-effective analysis of thousands or tens of thousands of individuals. SPLINTER will also be useful for analyzing samples that are naturally heterogeneous, e.g., for the detection and quantification of rare somatic mutations in tumor samples (Stingl and Caldas 2007). A second promising application is the detection of induced mutations in in vitro evolution experiments (Barrick et al. 2009; Beaumont et al. 2009). Thus, we expect SPLINTER will become a useful tool for the analysis of data generated by next-generation sequencing methods.
Methods

Preparation of the synthetic pools

Every synthetic pool library consists of a mixture of different oligonucleotides, where one is referred to as the wild-type allele and the others are mutants with respect to the wild type. We used the consensus sequence of the 72-bp exon 9 from TP53 (RefSeq accession no. NM_000546) as the “wild-type” insert into a pGEM-T Easy vector (Promega). We then designed a panel of different variations of this consensus sequence (see Supplemental Tables 1–3) containing single, double, and 4-bp indels, as well as single nucleotide substitutions. These vectors could then be pooled such that each mutation was present at different frequencies. Once pooled, a single PCR reaction was performed using primers that flanked the insertion site and generated a 335-bp amplicon.

To facilitate ligation into the vector, each oligonucleotide was ordered with 5′ phosphorylation and an overhanging 3′ A from Integrated DNA Technologies. Complementary oligonucleotide pairs were annealed as follows: 1 µL of sense and antisense oligonucleotide at 100 µM were mixed with 5 µL of 10× PCR buffer (Sigma-Aldrich) and brought to a final volume of 50 µL. The annealing mix was then warmed up to 95°C for 5 min, followed by 20 min at 25°C. Each annealed sequence was then ligated into the pGEM-T Easy Vector (Promega) according to the manufacturer’s protocol and reagents. The final ligation product was then transformed into GC10 competent cells (GeneChoice) using a standard cloning protocol. Colonies were screened using “Blue/White” selection induced by X-gal and IPTG. White colonies were picked and grown on Luria broth agar with ampicillin for 12–16 h. Plasmid was then recovered from the transformed bacteria suspension using the Qiaprep Spin Miniprep kit according to the manufacturer’s protocol (Qiagen).

Following insert validation by Sanger sequencing, plasmid pools were prepared by pooling each plasmid at the appropriate number of molecules in order to introduce the desired mutations at the desired frequency with respect to the wild-type background. Each pool was generated with a total of 10¹¹ plasmid molecules. This number was chosen in order to mimic the best conditions described in the original pooled-DNA sequencing protocol11 to maximize the number of molecules available for analysis, while keeping fluid volumes tractable. Each pool was then PCR amplified using primer sequences flanking the plasmid insertion site (see Supplemental Table 4). Each PCR reaction was performed as follows: (1) 93°C for 2 min; (2) 93°C for 30 sec; (3) 56°C for 30 sec; (4) 65°C for 2 min; (5) repeat steps 2–4 for 18 cycles; (6) 65°C for 10 min. Each PCR mix contained 2.5 µL of 10× PfuUltra buffer, 10 mM forward and reverse primers, 1 M betaine (Sigma-Aldrich/Fluka), 1.25 U PfuUltra DNA polymerase, and between 30 and 50 ng of template DNA in a final volume of 25 µL. Each pool was then sequenced using a single lane of the Illumina Genome Analyzer II platform.
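The composition of a synthetic pool follows directly from the target frequencies and the total number of plasmid molecules. The sketch below shows that bookkeeping under the simplifying assumption that each mutant plasmid is pooled at its target frequency and the wild-type plasmid makes up the remainder; the plasmid names and the function are hypothetical.

```python
def pool_composition(total_molecules, mutant_frequencies):
    """Number of molecules of each plasmid needed for a synthetic pool.

    mutant_frequencies maps a (hypothetical) plasmid name to its target
    frequency; the wild-type plasmid fills the rest of the pool.
    """
    if sum(mutant_frequencies.values()) >= 1.0:
        raise ValueError("mutant frequencies must sum to less than 1")
    counts = {name: round(total_molecules * freq)
              for name, freq in mutant_frequencies.items()}
    counts["wild_type"] = total_molecules - sum(counts.values())
    return counts


# A pool of 10^11 molecules with two illustrative mutant plasmids.
composition = pool_composition(10**11, {"mutant_A": 0.001, "mutant_B": 0.005})
for name, molecules in composition.items():
    print(name, molecules)
```

Because each plasmid in this study carried two mutations, a single entry in such a table fixes the frequency of both mutations at once, which is what makes the pair-correlation analysis possible.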
DNA library preparation and sequencing for pooled samples

After PCR amplification of target loci, a second pool was created by adding PCR products to the positive and negative controls for the analyzed pooled sample. In order to generate uniform sequencing coverage, every PCR product and control was pooled at the same number of molecules (chosen to be at least 10¹¹ molecules [~1 µg] in order to have enough material for the sequencing library preparation). Random ligation, sonication, and sequencing library preparation were performed as previously described, with a few changes. DNA ligation was performed in a final volume of 50 µL. Prior to sonication, ligation products were diluted 1:10 using Qiagen PBI buffer from the QIAquick PCR Purification Kit (Qiagen). Fragmentation was then performed using the Bioruptor XL sonicator (Diagenode). Samples were sonicated in parallel with the following settings: 25 min of total sonication time, 40 sec of pulse followed by 20 sec without pulse, high power pulse setting. This resulted in each pool of large concatemers being randomly fragmented to between 500 and 4000 bp (data not shown). Following sonication, DNA samples were purified via the QIAquick PCR purification kit (Qiagen) and sequencing libraries were prepared according to the standard protocol for genomic sample preparation by Illumina (Illumina). Each library was then sequenced on a single lane of an Illumina Genome Analyzer II, generating 36-bp read lengths.
Variant calling in pooled samples

For each pooled sample, reads were compressed in order to reduce computational run-time and then aligned to their reference using a dynamic programming algorithm (see Supplemental material). Aligned reads were used to generate a run-specific error model from the incorporated negative control. The aligned file and the error model were then used as input to SPLINTER to detect the presence of a sequence variant in the pool at any analyzed position. Optimal detection of true variants was achieved by calibrating the P-value cutoff used by SPLINTER with information generated from the included positive control (see Supplemental material).
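The overall logic of this workflow (estimate an error rate from the negative control, test each position against it, and calibrate the P-value cutoff on the positive control) can be sketched as below. This is not the SPLINTER algorithm, whose error model is described in the Supplemental material; a single pooled error rate and a binomial tail test are substituted here as simplifying assumptions, and all function names and counts are hypothetical.

```python
import math


def log_binom_pmf(k, n, p):
    """Natural log of the Binomial(n, p) probability mass at k."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if k <= 0:
        return 1.0
    logs = [log_binom_pmf(i, n, p) for i in range(k, n + 1)]
    peak = max(logs)
    return min(1.0, math.exp(peak) * sum(math.exp(v - peak) for v in logs))


def estimate_error_rate(mismatches, total_bases):
    """Pooled per-base error rate from the negative control (simplified model)."""
    return mismatches / total_bases


def call_variants(counts, error_rate, cutoff):
    """Positions whose variant-read count is unlikely under sequencing error alone.

    counts maps position -> (variant_reads, depth).
    """
    return sorted(pos for pos, (k, n) in counts.items()
                  if binom_upper_tail(k, n, error_rate) <= cutoff)


def calibrate_cutoff(counts, true_positions, error_rate, candidates):
    """Most stringent candidate cutoff maximizing sensitivity + specificity
    on a positive control with known variant positions."""
    best_cutoff, best_score = None, -1.0
    for cutoff in sorted(candidates):
        called = set(call_variants(counts, error_rate, cutoff))
        tp = len(called & true_positions)
        fp = len(called - true_positions)
        fn = len(true_positions - called)
        tn = len(counts) - tp - fp - fn
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        if sens + spec > best_score:
            best_cutoff, best_score = cutoff, sens + spec
    return best_cutoff


# Hypothetical numbers: 1 error per 10,000 bases, ~30,000-fold depth per position.
rate = estimate_error_rate(300, 3_000_000)
control = {1: (40, 30000), 2: (2, 30000), 3: (35, 30000), 4: (4, 30000)}
cutoff = calibrate_cutoff(control, {1, 3}, rate, [1e-12, 1e-6, 1e-2, 0.9])
sample = {10: (30, 30000), 20: (3, 30000)}
print(cutoff, call_variants(sample, rate, cutoff))
```

The example keeps the two control roles separate, as in the text: the negative control only informs the error rate, and the positive control only informs the cutoff, so a run with a higher error rate would shift both without changing the procedure.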
Acknowledgments

This work was partly supported by the Children’s Discovery Institute grant MC-II-2006-1 (R.D.M., T.E.D.), the NIH Epigenetics Roadmap grant (1R01DA025744-01 and 3R01DA025744-02S1; R.D.M., F.L.M.V.), the Saigh Foundation (F.L.M.V., T.E.D.), and Hope Street Kids and Alex’s Lemonade Stand “A” Award support (T.E.D.). We thank Michael DeBaun at Washington University in Saint Louis, School of Medicine, for providing the samples from the SIT cohort; Lee Tessler, David Mayhew, and the other members of the Mitra lab for helpful discussion on the SPLINTER algorithm; the Sequenom Core facility at the Human Genetics Division at Washington University in Saint Louis, School of Medicine, for the Sequenom validation; and Jessica Hoisington-Lopez at the Center for Genome Science, Washington University in Saint Louis, School of Medicine, for running the samples on the Illumina GAIIx platform. This work is dedicated to Natalina Vallania.

Author contributions: F.L.M.V. designed and implemented the SPLINTER algorithm. F.L.M.V., T.E.D., and R.D.M. designed the experiments. F.L.M.V. performed the sequencing experiments on the synthetic and real samples. F.L.M.V. performed the data analysis and variant validation. J.W. and E.R. performed sequencing on the GWA sample and analyzed its frequency correlation. I.B. and M.P. provided reagents. F.L.M.V., R.D.M., and T.E.D. wrote the manuscript.
References

Ahituv N, Kavaslar N, Schackwitz W, Ustaszewska A, Martin J, Hebert S, Doelle H, Ersoy B, Kryukov G, Schmidt S, et al. 2007. Medical sequencing at the extremes of human body mass. Am J Hum Genet 80: 779–791.
Barrick JE, Yu DS, Yoon SH, Jeong H, Oh TK, Schneider D, Lenski RE, Kim JF. 2009. Genome evolution and adaptation in a long-term experiment with Escherichia coli. Nature 461: 1243–1247.
Beaumont HJ, Gallie J, Kost C, Ferguson GC, Rainey PB. 2009. Experimental evolution of bet hedging. Nature 462: 90–93.
Cohen JC, Kiss RS, Pertsemlidis A, Marcel YL, McPherson R, Hobbs HH. 2004. Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science 305: 869–872.
Druley TE, Vallania FL, Wegner DJ, Varley KE, Knowles OL, Bonds JA, Robison SW, Doniger SW, Hamvas A, Cole FS, et al. 2009. Quantification of rare allelic variants from pooled genomic DNA. Nat Methods 6: 263–265.
Erlich Y, Chang K, Gordon A, Ronen R, Navon O, Rooks M, Hannon GJ. 2009. DNA Sudoku–harnessing high-throughput sequencing for multiplexed specimen analysis. Genome Res 19: 1243–1253.
Fearnhead NS, Wilding JL, Winney B, Tonks S, Bartlett S, Bicknell DC, Tomlinson IP, Mortensen NJ, Bodmer WF. 2004. Multiple rare variants in different genes account for multifactorial inherited susceptibility to colorectal adenomas. Proc Natl Acad Sci 101: 15992–15997.
Fearnhead NS, Winney B, Bodmer WF. 2005. Rare variant hypothesis for multifactorial inheritance: Susceptibility to colorectal adenomas as a model. Cell Cycle 4: 521–525.
Goldstein DB. 2009. Common genetic variation and human traits. N Engl J Med 360: 1696–1698.
Ji W, Foo JN, O’Roak BJ, Zhao H, Larson MG, Simon DB, Newton-Cheh C, State MW, Levy D, Lifton RP. 2008. Rare independent mutations in renal salt handling genes contribute to blood pressure variation. Nat Genet 40: 592–599.
King MC, Marks JH, Mandell JB, New York Breast Cancer Study Group. 2003. Breast and ovarian cancer risks due to inherited mutations in BRCA1 and BRCA2. Science 302: 643–646.
Koboldt DC, Chen K, Wylie T, Larson DE, McLellan MD, Mardis ER, Weinstock GM, Wilson RK, Ding L. 2009. VarScan: Variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics 25: 2283–2285.
Li B, Leal SM. 2008. Methods for detecting associations with rare variants for common diseases: Application to analysis of sequence data. Am J Hum Genet 83: 311–321.
Li H, Ruan J, Durbin R. 2008. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 18: 1851–1858.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078–2079.
Macgregor S. 2007. Pooling sources of error. Eur J Hum Genet 15: 501–504.
Mamanova L, Coffey AJ, Scott CE, Kozarewa I, Turner EH, Kumar A, Howard E, Shendure J, Turner DJ. 2010. Target-enrichment strategies for next-generation sequencing. Nat Methods 7: 111–118.
Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, et al. 2009. Finding the missing heritability of complex diseases. Nature 461: 747–753.
Ng PC, Levy S, Huang J, Stockwell TB, Walenz BP, Li K, Axelrod N, Busam DA, Strausberg RL, Venter JC. 2008. Genetic variation in an individual human exome. PLoS Genet 4: e1000160. doi: 10.1371/journal.pgen.1000160.
Pleasance ED, Stephens PJ, O’Meara S, McBride DJ, Meynert A, Jones D, Lin ML, Beare D, Lau KW, Greenman C, et al. 2010. A small-cell lung cancer genome with complex signature of tobacco exposure. Nature 463: 184–190.
Prabhu S, Pe’er I. 2009. Overlapping pools for high-throughput targeted resequencing. Genome Res 19: 1254–1261.
Reich DE, Lander ES. 2001. On the allelic spectrum of human disease. Trends Genet 17: 502–510.
Stenson PD, Mort M, Ball EV, Howells K, Phillips AD, Thomas NS, Cooper DN. 2009. The Human Gene Mutation Database: 2008 Update. Genome Med 1: 13. doi: 10.1186/gm13.
Stingl J, Caldas C. 2007. Molecular heterogeneity of breast carcinomas and the cancer stem cell hypothesis. Nat Rev Cancer 7: 791–799.
Van Tassell CP, Smith TP, Matukumalli LK, Taylor JF, Schnabel RD, Lawley CT, Haudenschild CD, Moore SS, Warren WC, Sonstegard TS. 2008. SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nat Methods 5: 247–252.
Vikas B. 2010. A statistical method for the detection of variants from next-generation resequencing of DNA pools. Bioinformatics 26: 318–324.
Received April 23, 2010; accepted in revised form September 27, 2010.
Method
Evaluation of affinity-based genome-wide DNA methylation data: Effects of CpG density, amplification bias, and copy number variation

Mark D. Robinson,1,2 Clare Stirzaker,1 Aaron L. Statham,1 Marcel W. Coolen,1,3 Jenny Z. Song,1 Shalima S. Nair,1 Dario Strbenac,1 Terence P. Speed,2 and Susan J. Clark1,4,5

1Epigenetics Laboratory, Cancer Research Program, Garvan Institute of Medical Research, Sydney 2010, New South Wales, Australia; 2Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Parkville, Melbourne 3052, Victoria, Australia; 3Department of Molecular Biology, Faculty of Science, Nijmegen Centre for Molecular Life Sciences, Radboud University Nijmegen, 6500 HB Nijmegen, The Netherlands; 4St. Vincent’s Clinical School, University of New South Wales, Sydney 2010, New South Wales, Australia

5Corresponding author. E-mail [email protected]; fax 61-2-92958316. Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.110601.110.

DNA methylation is an essential epigenetic modification that plays a key role associated with the regulation of gene expression during differentiation, but in disease states such as cancer, the DNA methylation landscape is often deregulated. There are now numerous technologies available to interrogate the DNA methylation status of CpG sites in a targeted or genome-wide fashion, but each method, due to intrinsic biases, potentially interrogates different fractions of the genome. In this study, we compare the affinity-purification of methylated DNA between two popular genome-wide techniques, methylated DNA immunoprecipitation (MeDIP) and methyl-CpG binding domain-based capture (MBDCap), and show that each technique operates in a different domain of the CpG density landscape. We explored the effect of whole-genome amplification and illustrate that it can reduce sensitivity for detecting DNA methylation in GC-rich regions of the genome. By using MBDCap, we compare and contrast microarray- and sequencing-based readouts and highlight the impact that copy number variation (CNV) can make in differential comparisons of methylomes. These studies reveal that the analysis of DNA methylation data and genome coverage is highly dependent on the method employed, and consideration must be made in light of the GC content, the extent of DNA amplification, and the copy number.

[Supplemental material is available online at http://www.genome.org. The data from this study have been submitted to Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo) under SuperSeries accession no. GSE24546.]

DNA methylation, which is one of the most studied epigenetic marks, involves the addition of a methyl group to the 5 position of the cytosine pyrimidine ring and occurs primarily at CpG dinucleotides in mammals (Jones 1999). DNA methylation patterns are established early in development and are associated with the regulation and maintenance of gene expression during differentiation (Sorensen et al. 2010). Methylation patterns can also be disrupted in many disease states, and in particular, changes in DNA methylation at CpG island-associated promoters can play a role in the development of cancer (Jones and Baylin 2002, 2007; Jaenisch and Bird 2003).

There are now numerous methods available for determining CpG methylation status (for review, see Laird 2010), including methods focused at the level of CpG islands (Ponzielli et al. 2008; Kaminsky et al. 2009), individual promoters (Weber et al. 2005, 2007; Novak et al. 2008), and, increasingly, genome-“scale” (Meissner et al. 2008; Gu et al. 2010) and genome-wide methods, either at high (Lister et al. 2009) or low resolution (Serre et al. 2009; Ruike et al. 2010). These latter methods can be broadly classified into the following designations: reduced representation approaches that are based on methylation-sensitive (e.g., HELP) (Oda et al. 2009) or specific (e.g., CHARM) (Irizarry et al. 2008) restriction digestion (for review, see Jeddeloh et al. 2008), affinity-based methods such as methyl-DNA immunoprecipitation (MeDIP) (Weber et al. 2005, 2007; Novak et al. 2008) and methyl-CpG binding domain-based capture (MBDCap) (Rauch et al. 2006, 2008; Serre et al. 2009), and the more direct bisulphite treatment-based methods (Lister et al. 2009); coupling of reduced representation and bisulphite treatment has now been demonstrated (Meissner et al. 2008; Gu et al. 2010), and other combinations are also possible.

However, there are still many challenges involved in interpreting data from DNA methylation-based assays, due to complex effects, both technical and biological, that are introduced at various steps in the procedure, in addition to implicit biases from the methods employed. These include cellular purity and DNA quality, DNA amplification bias in GC-rich regions, and the effects of copy number aberrations.

The focus of this study is on DNA methylation analyses using affinity-based approaches, in combination with promoter DNA microarrays or high-throughput DNA sequencing readouts. Chromatin immunoprecipitation (ChIP) has been used extensively to study protein–DNA interactions (Ren et al. 2000), and recently, an extensive benchmarking study has been conducted comparing microarray platforms and analysis methods (Johnson et al. 2008). Comparison studies for DNA methylation platforms are now starting to emerge (Li et al. 2010). MeDIP uses an antibody to 5-methyl-cytosine, targeting single-stranded DNA (Weng et al.
20:1719–1729 © 2010 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/10; www.genome.org
Genome Research www.genome.org
Downloaded from genome.cshlp.org on December 25, 2010 - Published by Cold Spring Harbor Laboratory Press
2009; Ruike et al. 2010), while the MBDCap approach uses the methyl-CpG binding domain of the MBD2 protein to capture double-stranded DNA (Serre et al. 2009). Furthermore, the MBDCap approach can use a series of salt fractionation steps that allows specific methylation density to be assessed (Fig. 1A). Methylation status of the immunoprecipitated DNA isolated from MeDIP or MBDCap is analyzed using tiling microarrays or high-throughput sequencing, and multiple platforms of each are available.

Because DNA is often limiting and derived from mixed cell types, especially in studies with clinical samples, it can be difficult to isolate sufficient amounts of pure DNA to hybridize to microarrays directly or to sequence. DNA amplification techniques have been developed to address this problem (Paris 2009). Here, we show that DNA amplification can result in depletion of GC-rich regions and therefore may particularly affect the interpretation of DNA methylation in CpG islands. Most importantly, copy number variation (CNV) can affect the interpretation of DNA methylation levels, and in cancer, this can be critical within the regions harboring gene amplification and/or deletions. Furthermore, promoter tiling array data can be used to adjust for copy number changes, and we show that copy number aberrations can have a significant impact on genome-wide DNA methylation analysis.
Results

MeDIP and MBDCap enrich different fractions of the genome based on CpG density

MeDIP and MBDCap are two capture methods commonly used to interrogate genome-wide DNA methylation patterns. These two techniques have inherent differences, namely, antibody versus MBD capture. We asked if each method was comparable in the detection of the same methylated genomic DNA sequences. Fully methylated human genomic DNA treated with SssI methyltransferase was used to benchmark the two affinity-based DNA methylation mapping platforms. For the MBDCap comparison, the MethylMiner protocol was used, where DNA can be eluted in a high-salt buffer (2 M NaCl) as a single fraction, here referred to as MBD-SF, or eluted as distinct subpopulations based on the degree of methylation, using an increasing concentration of NaCl from 200 mM to 2000 mM. MBD-Elu5 denotes the 1 M fraction of the elution series (Fig. 1A). After SssI treatment, essentially every CpG site in the genome is methylated, allowing a direct comparison of enrichment between MeDIP and MBDCap, which was interrogated using Affymetrix Human Promoter 1.0R arrays containing more than 4.5 million 25-mer probes spanning 23,155 promoters.

Figure 1B shows summarized input-subtracted promoter tiling array signals from MeDIP, MBD-Elu5, and MBD-SF of fully methylated DNA after stratifying probes according to the local genomic CpG density (for a formal definition, see Methods). Several key observations can be made: First, the overall degree of enrichment is higher for MBDCap-based procedures, especially for CpG-dense methylated DNA (probes with high local
Figure 1. (A) Schematic showing the capture of methylated DNA into populations of single-stranded (MeDIP) or double-stranded (MBDCap) fragments. (B) Summarized probe intensities for enrichment of fully methylated DNA with MeDIP and two variations of MethylMiner-based enrichment. X-axis shows the local CpG density group (1–50). Y-axis shows the log2-scale input-normalized intensity. Each line shows the median of the input-normalized intensities for the probes in the bin (here, for probes with GC content of 11 only). The intensities are further normalized such that the median in the lowest bin is 0. The location of probes within CpG islands is shown by the gray-shaded region, corresponding to a local CpG density score between 12 and 40. (C) Summarized read counts in bins of 1000 bases over the same genomic regions interrogated by the Affymetrix Promoter 1.0R array. Each line represents the median log2 read count (RCpM indicates read counts per million mapped); the summaries are normalized such that the median in the lowest bin is 0.
CpG density); second, due to the nature of the salt elution steps, MBD-Elu5 enriches primarily for CpG-dense material only, whereas MBD-SF enriches for a broader range of CpG densities, albeit at a lower average level. An attenuation of promoter microarray signal was observed at the highest CpG density regions, and notably, the attenuation seems to occur at a higher local CpG density for the MethylMiner-based procedure than MeDIP. As MeDIP recovers single-stranded DNA and MBDCap recovers double-stranded DNA, it is possible that topoisomerase activity from M.SssI may compromise amplification of DNA recovered from MeDIP relative to MBDCap (Matsuo et al. 1994). However, we found that enrichment levels of methylated LNCaP DNA were also favored by MBDCap in comparison to MeDIP for CpG-dense regions, albeit at reduced levels (Supplemental Fig. 1), supporting the use of MBDCap-based procedures for favoring interrogation of CpG-rich regions.

Next we asked if the apparent CpG bias was also observed with sequencing data. Similar to the previous experiment, fully methylated DNA was used to perform the MBDCap enrichment, and we analyzed the eluted DNA using a high-throughput sequencing readout. Using one lane of Illumina Genome Analyzer sequencing with 36-base single-end reads for each of MBD-Elu5 and MBD-SF, 8,616,022 and 11,557,035 uniquely mapped reads were obtained, respectively, to the reference human genome (see Methods). In order to compare the methylation readout against promoter arrays, the number of reads (in nonoverlapping 1000-bp bins) mapping to the subset of the genome interrogated by the Affymetrix tiling array was calculated. Figure 1C shows the read counts normalized to depth, similarly grouped by local CpG density. Supplemental Figure 2 shows enrichment profiles across the whole genome, highlighting the different profile of enrichment (and CpG density) of promoters compared to the entire genome.
The CpG density bins in Figure 1C represent approximately the local CpG density bins in Figure 1B, but the enrichment levels are not directly comparable. Overall, a similar enrichment profile was observed for the sequencing data, where MBD-Elu5 enriches for densely methylated regions and MBD-SF enriches for a slightly broader range of CpG density. As before, a slight drop in the number of reads in very high CpG density regions was observed, but not to the same extent as observed in the tiling array data. It is not clear whether the decrease can be attributed to PCR-based amplification in the library preparation or cluster generation step, or whether there are other biases introduced in mapping reads to the genome, or whether this is an inherent property of the affinity-based techniques.
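The read-count summary described above (reads tallied in nonoverlapping 1000-bp bins, then normalized to sequencing depth) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the toy read positions, and the pseudocount used to avoid taking the log of zero are all assumptions made for illustration.

```python
import math
from collections import Counter


def bin_read_counts(read_starts, bin_size=1000):
    """Count reads per nonoverlapping bin on one chromosome.

    read_starts: 0-based start positions of uniquely mapped reads.
    Returns a Counter keyed by bin index (position // bin_size).
    """
    return Counter(pos // bin_size for pos in read_starts)


def log2_rcpm(count, total_mapped, pseudocount=1.0):
    """log2 read counts per million mapped reads (RCpM).

    The pseudocount is an assumption, added so that empty bins
    do not produce log2(0).
    """
    return math.log2((count + pseudocount) * 1e6 / total_mapped)


# Hypothetical toy data: six reads on one chromosome.
reads = [120, 950, 1001, 1500, 1999, 2500]
counts = bin_read_counts(reads)
for bin_index in sorted(counts):
    print(bin_index, counts[bin_index], log2_rcpm(counts[bin_index], len(reads)))
```

Normalizing to depth in this way is what allows libraries of different sizes, such as the 8.6 million MBD-Elu5 reads and 11.6 million MBD-SF reads reported above, to be placed on one scale.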
Whole-genome amplification can bias DNA methylation calls in CpG-dense regions

Given the signal attenuation observed in the fully methylated enrichment experiments, we next examined if this is due to whole-genome amplification (WGA), a required step in the protocol to generate enough material for hybridization to the promoter arrays. It is well established that the GC content of a DNA template can affect the efficiency of amplification, often resulting in a bias against the GC-rich regions of the genome (Bredel et al. 2005; Pugh et al. 2008; Teo et al. 2008). Johnson et al. (2008) also report a significant drop in sensitivity, most notably for Affymetrix tiling arrays, when amplified DNA is hybridized. Amplification bias is perhaps even more of a potential concern in DNA methylation mapping, since CpG islands are GC-rich by their very nature and therefore more prone to any extant bias. We were initially alerted to a potential problem involving GC bias when GSTP1, which is highly methylated at its CpG island-
associated promoter in prostate cancer and is unmethylated in normal cells (Song et al. 2002; Nakayama et al. 2004), showed little differential enrichment at the probe level on the Affymetrix promoter tiling arrays after MeDIP enrichment and WGA of prostate cancer (LNCaP) and normal prostate epithelial cells (PrECs) (see Supplemental Fig. 3A; Coolen et al. 2010). MeDIP-qPCR experiments, used as a control before hybridization, confirmed a strong affinity of methylated DNA at the GSTP1 locus, both before (77-fold) and after (76-fold) amplification (Supplemental Fig. 3B). Consequently, even though the degree of enrichment was maintained between LNCaP and PrECs, the absolute copy number of GSTP1 molecules in the population decreased proportionally after WGA, from 676 and 8.7 copies/ng beforehand to 32 and 0.43 copies/ng afterward, respectively. These data suggest that DNA amplification can result in an apparent loss of methylation detection for regions of the genome that are amplified less efficiently, such as GC-rich regions.

To illustrate this further, Supplemental Figure 4 shows probe-level data for the CpG-rich promoters of WNT2 and CAV2 and the CpG-poor promoters of AGR2, PTN, and SOSTDC1, all of which are validated to be hypermethylated in prostate cancer cells. For the WNT2 and CAV2 promoters, the microarray signal representing differential methylation is observed only in regions flanking the CpG island. We hypothesize that WGA has ablated the absolute levels of these DNA molecules, similar to GSTP1. Notably, the AGR2, PTN, and SOSTDC1 promoters, which are of lower CpG content, exhibit a differential signal throughout the region validated to be differentially methylated. It is realistic to expect that many GC-rich regions, such as CpG islands, while differentially enriched between populations before amplification, become diluted to be below the lower detection limit on the promoter microarray after WGA.
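The GSTP1 numbers quoted above make the point that relative enrichment can be preserved while absolute abundance collapses. A small worked calculation, using only the rounded copies/ng values given in the text, is sketched below; the variable names are illustrative, and because the inputs are rounded the ratios only approximate the reported 77-fold and 76-fold figures.

```python
# GSTP1 copies/ng from the text, before and after WGA.
before = {"LNCaP": 676.0, "PrEC": 8.7}
after = {"LNCaP": 32.0, "PrEC": 0.43}

# Relative enrichment (LNCaP over PrEC) is roughly preserved by WGA...
fold_before = before["LNCaP"] / before["PrEC"]
fold_after = after["LNCaP"] / after["PrEC"]

# ...while absolute abundance drops about 20-fold in both cell types.
loss_lncap = before["LNCaP"] / after["LNCaP"]
loss_prec = before["PrEC"] / after["PrEC"]

print(f"fold enrichment before WGA: {fold_before:.1f}")
print(f"fold enrichment after WGA:  {fold_after:.1f}")
print(f"absolute loss, LNCaP: {loss_lncap:.1f}x; PrEC: {loss_prec:.1f}x")
```

A roughly 20-fold drop in template abundance at a GC-rich locus is the kind of change that could push a truly methylated region below the array's detection limit even though the cancer-to-normal ratio is unchanged.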
By using Affymetrix promoter tiling array data, the effect of WGA was studied directly by comparing the probe intensities from unamplified and amplified DNA from the same origin. For this experiment, genomic input DNA was used, without a methylated DNA affinity step. Figure 2 shows the distribution of the raw promoter array signal intensity for both an unamplified and WGA sample of the same input genomic DNA, across equally sized bins of local CpG density. Here, only probes with a GC content of 8, 11, and 14 (of the 25-mer probe) are displayed (Supplemental Fig. 5 displays the full range of probe GC contents). Notably, a substantial drop in the promoter tiling microarray signal is observed for the probes in CpG-rich regions (local CpG density greater than 12 defines a CpG island). In addition, CpG-rich regions of the genome also show some attenuation in the hybridization signal from unamplified DNA when the probe GC content was greater than 11, suggesting that other effects, such as cross-hybridization and probe-specific temperature effects (Wei et al. 2008), may influence the signal observed on microarrays. It is also noted that the observed CpG density bias is not unique to the Affymetrix platform, since unamplified MeDIP-enriched fully methylated DNA samples analyzed on a NimbleGen platform (Gal-Yam et al. 2008) also exhibit attenuation over probes with a broad range of GC contents (Supplemental Fig. 6).

Since some form of genome-wide amplification is required to obtain sufficient DNA for array experiments, we next asked if recent variations to enhance the amplification of GC-rich regions (Zhang et al. 2009) could reduce the attenuation observed on the tiling arrays. We compared unamplified genomic DNA with samples subjected to standard WGA, WGA with additives (betaine, DMSO, ethylene glycol, 1,2-propanediol), and the different amplification conditions suggested by Affymetrix for ChIP-chip experiments (Fig. 3). To
Robinson et al.
Figure 3. Observed cumulative bias of various amplification methods. X-axis denotes the probe GC content. Y-axis denotes the cumulative bias score, which captures the cumulative signal attenuation over the 50 bins of local CpG density (for definition, see Methods). Each line represents a different amplification strategy.
summarize the results, a cumulative bias score was calculated to quantify the signal attenuation in tiling array data across local CpG density bins for each group of probes at a GC content (for details of the bias score, see Methods; for explanation, see Supplemental Fig. 7). As shown previously in Figure 2, more CpG-rich attenuation (cumulative bias) at the higher probe GC content (greater than 11) occurs for unamplified DNA (Fig. 3). However, all the different amplification conditions tested show a much greater bias score than does unamplified DNA and did not reduce the signal attenuation (Fig. 3).
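The comparisons in Figures 2 and 3 rest on stratifying probes into equally sized bins of local CpG density and summarizing the intensity within each bin. The sketch below illustrates only that stratification step on hypothetical toy data. It does not reproduce the cumulative bias score, whose definition is given in Methods, and the function name and example values are assumptions.

```python
from statistics import median


def median_by_density_bin(density, intensity, n_bins=50):
    """Median intensity per equally sized bin of local CpG density.

    Probes are ranked by density and split into n_bins groups of
    (nearly) equal size. Requires at least n_bins probes.
    """
    order = sorted(range(len(density)), key=lambda i: density[i])
    n = len(order)
    medians = []
    for b in range(n_bins):
        lo = b * n // n_bins
        hi = (b + 1) * n // n_bins
        medians.append(median(intensity[i] for i in order[lo:hi]))
    return medians


# Hypothetical toy data: 10 probes, 5 bins, log2 intensities.
cpg_density = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
unamplified = [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
amplified = [10, 10, 10, 10, 9, 9, 8, 8, 6, 6]

m_unamp = median_by_density_bin(cpg_density, unamplified, n_bins=5)
m_wga = median_by_density_bin(cpg_density, amplified, n_bins=5)

# Per-bin attenuation: larger values mean more signal lost after WGA.
attenuation = [u - w for u, w in zip(m_unamp, m_wga)]
print(attenuation)
```

In this toy example the attenuation grows with CpG density, mirroring the qualitative pattern described for the amplified samples.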
Detection of differentially methylated regions
Figure 2. Box-and-whisker plots of unnormalized log2-scale microarray intensities for unamplified and WGA-amplified genomic DNA. To control for the association between probe GC content and intensity, probes with GC content of 8, 11, and 14 (out of 25) are shown in A–C, respectively. Plots for the remaining probe GC contents (and further experimental samples) are shown in Supplemental Figure 5. Probes are grouped into 50 equally sized bins genome-wide based on their local CpG density, as shown in Figure 1, B and C. Box-and-whisker plots show the 25th and 75th percentile as the bottom and top of the box, and the band represents the median; the whiskers show the lowest data point within 1.5 interquartile range (IQR) of the 25th percentile and the highest data point within 1.5 IQR of the 75th percentile.
To explore the impact of these biases, we looked for differentially methylated regions (DMRs) across the promoter regions represented on the Affymetrix promoter tiling array, comparing the prostate cancer (LNCaP) and normal epithelial (PrEC) cell lines. Using a statistical procedure similar to model-based analysis of tiling arrays (MAT) (Johnson et al. 2006) at an estimated 5% false discovery rate (see Methods), 7384, 4398, and 3815 DMRs were detected between the two cell lines for MeDIP, MBD-Elu5, and MBD-SF, respectively. Given the enrichment profiles from Figure 1, it is not surprising that the detected DMRs showed CpG density distributions that reflect the enrichment profile, as shown in Figure 4. The differentially methylated promoters were split into hypermethylated (Fig. 4A) and hypomethylated in cancer (Fig. 4B) to illustrate the asymmetry in differential methylation between the cell lines. As expected, MBD-Elu5 discovers substantially more DMRs in CpG-rich regions and identifies the greatest proportion of CpG islands, and MBD-SF finds DMRs in a broader range of CpG density, while MeDIP identifies the lowest percentage of CpG-rich regions.

Next, MBD-SF enriched DNA was analyzed from the two cell lines using both promoter tiling array (MBDCap-chip) and high-throughput sequencing (MBDCap-seq), allowing us to compare directly the concordance of the two readouts. Since the tiling arrays only measure promoters, the sequencing data were summarized into bins of read counts at promoters so they could be directly compared. The statistical procedures used to detect differentially methylated promoters from the two platforms are fundamentally different, due to the nature of the data (probe intensities vs. read counts, see Methods). However, P-values should be on scales that
Evaluation of genome-wide methylation data
Figure 4. Box-and-whisker plots of CpG density for putative DMRs (at estimated false discovery rate of 5%) between LNCaP and PrEC cells. Shown are hypermethylated (A) and hypomethylated (B) regions.
are directly comparable. The P-values from each platform are back-transformed into Z-scores (i.e., quantiles of a standard normal distribution) to summarize the evidence for differential methylation, signed according to the direction of the change. Differential methylation Z-scores for both platforms are shown in Figure 5A. As expected, there is a general concordance (r = 0.46) since the two platforms are comparing enriched DNA from the same origin. One noticeable difference between the platforms is the range of Z-scores, with sequencing data giving a much larger range of evidence for differential methylation. Although it is not certain that every promoter-level difference above a threshold is indeed differentially methylated, the much wider range of Z-scores suggests the sequencing-based data have a higher sensitivity. Furthermore, comparison of MBDCap-seq data and quantitative Sequenom bisulphite-based DNA methylation data highlights a strong concordance (r = 0.81) (Supplemental Fig. 8). Figure 5A highlights that all six hypermethylated genes discussed previously are called differentially methylated by one of the readouts. Notably, the GSTP1 CpG island promoter shows a significantly higher number of reads in the region around the gene’s transcription start site (TSS) for the
cancer cells, but similar to the MeDIP-chip data (Fig. 5A), there is little evidence of differential methylation from the MBDCap-chip data (Supplemental Fig. 8). Similarly, the promoter of CAV2, a CpG-rich region, shows strong differential methylation for the sequencing but not for the microarray. The promoters of SOSTDC1 and AGR2, from low CpG density regions, are found by the microarray and only show moderate differential methylation in the sequencing data (Supplemental Fig. 9). Differential methylation calls for PTN and WNT2 are reasonably concordant. Furthermore, if a Z-score cutoff of three is set, the promoter array only detects 256 of the 2854 promoters detected by the sequencing data as differentially methylated, suggesting that its sensitivity is much lower. However, the tiling array also finds 246 regions differentially methylated that the sequencing data do not, suggesting there may also be inherent biases in the genomic regions that are suitable for sequencing.

We next explored the CpG density of the concordant and discordant promoters (as defined by Z-score cutoffs; see Methods) that were detected using the two platforms (Fig. 5B), split into groups of hyper- and hypomethylated regions. The array-based
Figure 5. Comparison of MBD-SF tiling array and sequencing data. (A) Differential methylation Z-scores between LNCaP and PrEC cells using MBD-SF-seq (y-axis) and MBD-SF-chip (x-axis). The six validated genes that are shown in Supplemental Figures 3A and 4 are indicated with black dots. The remaining dot colors are chosen according to the differential methylation concordance between MBD-SF-seq and MBD-SF-chip Z-score as depicted in B. Note that some truly differentially methylated promoters, such as WNT2, are deemed "Indeterminate" by this concordance classification. (B) Box-and-whisker plots of CpG density for concordant and discordant differentially methylated promoters, with colors corresponding to the cutoffs shown in A. (C) Box-and-whisker plots of sequencing mapability of the concordant and discordant differentially methylated promoters, using the colors from A.
readout detects a smaller percentage of CpG-rich regions in comparison with the sequencing-based data, supporting the observation that DNA amplification and other biases have a direct impact on the regions detected. The majority of regions detected by microarray and not by sequencing are in low CpG density regions. Conversely, sequencing-only detections are largely from CpG-rich regions. Furthermore, promoters deemed differentially methylated by microarray but not by sequencing, on average, have lower mapability (Fig. 5C). However, it should be noted that the probes in these same microarray-only detected regions exhibit a slightly higher probe copy number (see Supplemental Fig. 10).

The number of differentially methylated promoters identified by sequencing data is ultimately dependent on read depth. To estimate whether the saturation of promoter differential methylation has been achieved with the current sequencing depth, the MBD-SF-seq data set for LNCaP cells and PrECs was down-sampled at various fractions, and a curve was fitted to the number of differentially methylated promoters. The presented experiments capture an estimated 60%–68% of the differentially methylated promoters, while doubling the number of mapable reads will result in an estimated 78%–89% sensitivity (see Methods; Supplemental Fig. 11).
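The platform comparison in this section relies on converting P-values into signed Z-scores and then applying a cutoff (three, in the text) to each platform. A minimal sketch of both steps follows. It assumes two-sided P-values, and the category labels are simplified placeholders: the study's own concordance classes (including "Indeterminate") are defined in Methods and are not reproduced here.

```python
from statistics import NormalDist


def signed_z(p_value, direction):
    """Back-transform a two-sided P-value (assumed) to a signed Z-score.

    direction: +1 for hypermethylated in cancer, -1 for hypomethylated.
    Using the lower tail (p/2) avoids rounding 1 - p/2 up to 1.0
    for very small P-values. Requires 0 < p_value <= 1.
    """
    z = -NormalDist().inv_cdf(p_value / 2.0)
    return z if direction >= 0 else -z


def classify(z_seq, z_chip, cutoff=3.0):
    """Simplified, hypothetical concordance label for one promoter."""
    seq_hit = abs(z_seq) >= cutoff
    chip_hit = abs(z_chip) >= cutoff
    if seq_hit and chip_hit:
        return "both" if z_seq * z_chip > 0 else "opposite"
    if seq_hit:
        return "seq_only"
    if chip_hit:
        return "chip_only"
    return "neither"


print(round(signed_z(0.05, +1), 2))
print(classify(5.0, 0.5))
```

Putting both platforms on the Z-score scale is what makes the counts in the text (256 shared detections, 246 array-only) directly comparable despite the different statistical models behind the P-values.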
Copy number, input DNA, and DNA methylome data

Another important yet subtle aspect of identifying epigenetic changes, especially in a cancer setting, is the impact of copy number aberrations on the DNA methylation signal. Copy number aberrations can have a direct effect on transcript levels (Stranger et al. 2007). The effect is less clear in DNA methylome data, but the expectation is that genetically amplified (deleted) regions of the genome should be captured at a higher (lower) rate if the DNA is methylated. In the analysis of the ChIP-chip experiments, it is common practice to subtract the genomic DNA input signals from the immunoprecipitation signals in order to account for copy number, while indirectly this adjustment can also account for hybridization, sonication, or probe-specific effects. In two-color microarray experiments, this adjustment is done explicitly (Gal-Yam et al. 2008). Early sequencing-based DNA methylation mapping exercises have not included input DNA controls (Serre et al. 2009), and if included, biases are known to exist (Teytelman et al. 2009; Vega et al. 2009). It is highlighted here that signals from the input DNA can adequately identify copy number aberrations, and this knowledge will be critical to disentangling differential methylation from changes in copy number.

To validate the use of genomic input DNA tiling array data to account for copy number changes, Affymetrix SNP 6.0 array data were collected on the same two cell lines that have genome-wide DNA methylation data. To define copy number changes between the two cell lines, the input DNA tiling arrays were processed similarly to the gene expression data, resulting in a promoter-level summary of the change in copy number after accounting for probe-specific effects (see Methods) (Irizarry et al. 2003). Figure 6 shows the change in copy number between the two cell lines for chromosome 5 (for all other chromosomes, see Supplemental Fig. 12), emphasizing the strong association between promoter-level summaries of genomic input DNA on the promoter tiling array (Fig. 6A), and summarized SNP and copy number probes from the genotyping arrays (Fig. 6B). It also demonstrates the potential of using promoter-level summaries of input DNA signals for discovering copy number aberrations in the absence of directly collecting SNP array data or similar. Figure 6C shows the relationship between smoothed estimates of copy number (see Methods) for the two platforms, suggesting the genome-wide correspondence of copy number changes is quite high (r = 0.86).
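The smoothed copy number estimates referred to above can be pictured as a local average of promoter-level differences along the chromosome. The sketch below uses a simple box kernel over a fixed genomic window; the kernel shape, the function name, and the toy values are assumptions, since the actual smoothing procedure is described in Methods.

```python
def box_smooth(positions, values, bandwidth):
    """Average values within +/- bandwidth/2 bases of each position.

    positions: genomic coordinates of promoter-level summaries.
    values:    log2 copy number differences (cancer minus normal).
    A box kernel is assumed here purely for illustration.
    """
    smoothed = []
    half = bandwidth / 2
    for p in positions:
        window = [v for q, v in zip(positions, values) if abs(q - p) <= half]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical toy data: three nearby promoters in a gained region
# and one distant promoter in a lost region.
pos = [0, 100_000, 200_000, 1_000_000]
diff = [0.8, 1.2, 1.0, -1.0]
smooth = box_smooth(pos, diff, bandwidth=200_000)
print(smooth)
```

Smoothing of this kind pools noisy promoter-level estimates, which is one reason two platforms with very different probe densities can still agree well once both are smoothed over a common set of loci.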
Next, we highlight that the effects of copy number aberrations are prominent in affinity-based epigenome data, affecting both DNA methylation and ChIP assays. The differential methylation
Figure 6. Using promoter tiling arrays to estimate changes in copy number. (A) Y-axis is the difference in copy number between the prostate cancer and normal epithelial cell line using the Affymetrix Promoter 1.0R array along human chromosome 5. The gray line represents kernel-smoothed differences over 200 kb. (B) Y-axis shows the difference in copy number using the Affymetrix SNP 6.0 array along the same region of chromosome 5. The gray line represents kernel-smoothed differences over 50 kb. (C ) X-axis and y-axis represent the smoothed copy number changes between the prostate cancer and epithelial cell lines for the Promoter 1.0R and SNP 6.0 arrays, respectively, genome-wide over a common set of loci.
detection exercise (using MBD-SF on prostate cancer LNCaP and PrEC lines) was extended to the entire genome, using read counts over 1500-bp bins, and Z-scores were computed for each region. Figure 7A illustrates changes in MBDCap-seq read counts along chromosome 13. As expected, they are correlated with changes in copy number (Fig. 7B) for the same regions. Figure 7C shows the distribution of DMR Z-scores genome-wide, stratified by their corresponding change-in-copy-number status. Taken together, these observations underscore the need to integrate CNV explicitly with epigenome analyses. Interestingly, there is a large (~10 Mb) region near 70 Mb on chromosome 13 that shows deletion (SNP array data), but no corresponding change in methylation (MBDCap-seq). The MBDCap-seq data from the rest of this chromosome suggest that it is a region of hypermethylation, implying a potential link between regional hypermethylation and genome stability.
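The stratification shown in Figure 7C, in which differential methylation Z-scores are grouped by copy number status, can be sketched as follows. The threshold used to call a gain or loss, the function names, and the toy values are hypothetical and are not taken from the study.

```python
from collections import defaultdict
from statistics import median


def cn_status(log2_change, threshold=0.3):
    """Label a region as gain, loss, or neutral (threshold is assumed)."""
    if log2_change >= threshold:
        return "gain"
    if log2_change <= -threshold:
        return "loss"
    return "neutral"


def median_z_by_status(z_scores, cn_changes, threshold=0.3):
    """Median differential methylation Z-score per copy number class."""
    groups = defaultdict(list)
    for z, change in zip(z_scores, cn_changes):
        groups[cn_status(change, threshold)].append(z)
    return {status: median(zs) for status, zs in groups.items()}


# Hypothetical toy data for six genomic bins.
z = [2.5, 3.0, 0.1, -0.2, -2.0, -3.0]
cn = [0.6, 0.5, 0.0, 0.1, -0.7, -0.4]
summary = median_z_by_status(z, cn)
print(summary)
```

If copy number had no influence on the methylation readout, the three groups would be expected to share a similar median; a shift of the gain and loss groups in opposite directions is the signature of confounding that the text describes.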
Discussion

With the rapid growth of the different genome-wide technologies available for DNA methylation analysis (Lister et al. 2009; Oda et al. 2009; Serre et al. 2009; Ruike et al. 2010), it is timely to stop and reevaluate the limitations and benefits of the different techniques. We have evaluated the technical and data analysis aspects of promoter-level tiling microarray and genome-wide sequencing-based DNA methylation data and find that there are several hurdles that need to be overcome before a high-sensitivity platform with genome-wide methylation coverage will emerge. Perhaps not unexpectedly, different enrichment techniques and readouts give different snapshots of the DNA methylome. Knowledge of inherent biases and limitations of each method should encourage protocol improvements and facilitate data integration from multiple platforms, as well as the development of improved bioinformatics tools to extract meaningful biological interpretation.

Comparisons of affinity-based methylome mapping techniques are now beginning to appear (Li et al. 2010). Here, a comparison of MBDCap- and MeDIP-based affinity capture strategies was performed, as well as a comparison of promoter tiling arrays and sequencing-based readouts. In addition, we highlight the technical limitations due to WGA, describe a novel method to assess amplification genome-wide with tiling arrays, and illustrate biological biases attributable to copy number that are relevant to the effective analysis of the DNA methylome. By using fully methylated DNA and LNCaP DNA, enrichment profiles were compared across the CpG density spectrum of MeDIP and two versions of MBDCap: MBD-Elu5, the 1000-mM fraction that elutes densely methylated DNA, and MBD-SF, a single elution encompassing all methylated fractions. Our data reveal higher overall enrichment using the methyl DNA binding domain protocol from MethylMiner, compared with immunoprecipitation with the 5-methylcytosine monoclonal antibody, especially in CpG-rich regions. MBD-Elu5 preferentially elutes CpG-rich DNA, while MBD-SF contains DNA molecules spanning a broader range of CpG densities.

For all methods analyzed by promoter microarrays, a marked drop in signal intensity was observed at the CpG-dense regions, including at many of the CpG islands. This attenuation was found to be largely due to WGA but is also confounded by other effects, such as GC content of the microarray probes and cross-hybridization, which affect the signal and therefore sensitivity and dynamic range. Interestingly, CHARM has compared favorably in performance to MeDIP (Irizarry et al. 2008) and notably without an amplification step. However, CHARM does require large amounts of starting DNA and is typically used with a custom microarray. Unfortunately for the MeDIP and MBDCap approaches, it is rarely feasible to get sufficient affinity-purified DNA for genome-wide analyses, thereby necessitating amplification before hybridization to microarrays. In some cases, pooling multiple samples may be a reasonable alternative, but this results in a loss of valuable replicate information. Furthermore, pooling affinity-purified DNA is often not practical when analyzing DNA from low cell numbers, such as formalin-fixed paraffin-embedded clinical samples. Our results suggest that amplification reduces sensitivity and permits only a subfraction of the genome to be interrogated.
Specifically, tiling array probes representing CpG-rich regions, which are arguably of greatest interest for methylation mapping, appear to show a lower intensity and a compressed dynamic range. As a result, the ability to detect differential methylation using amplified DNA is compromised at many CpG islands. Sequencing-based assays require less starting material. However, amplification during the sequencing protocol also has the potential to introduce some sequence bias in CpG-rich regions, albeit to a lesser extent. Promoter tiling array and high-throughput sequencing readouts of the same populations of MBDCap-enriched methylated DNA were compared. Even though the concordance is strong, it is the discordance that highlights the differences in the
Figure 7. Effects of copy number changes on differential methylation detection. (A) Differential methylation Z-scores between LNCaP and PrEC cells, using MBD-SF-seq, for human chromosome 13. (B) Smoothed Affymetrix SNP 6.0 array data showing corresponding changes in copy number. (C) Genome-wide distributions of Z-scores, stratified by the change-in-copy-number status of the corresponding regions.
Genome Research www.genome.org
Downloaded from genome.cshlp.org on December 25, 2010 - Published by Cold Spring Harbor Laboratory Press
platform-specific snapshots of the methylome. DMRs that are detected by sequencing and not by microarray are commonly located in the CpG-rich regions of the genome, validating the loss in sensitivity on microarrays that is partly attributable to WGA. Furthermore, sequencing-based assays are strongly affected by enrichment levels, such that highly enriched regions are sequenced to a greater depth, resulting in higher power to detect changes. Therefore, microarrays may be better suited to interrogating regions of lower enrichment, such as those in lower CpG density areas, where the cost of sequencing to obtain sufficient coverage may become limiting. To a lesser extent, some of the DMRs detected by microarrays and not by sequencing are in regions of lower “mapability,” suggesting microarrays may have improved sensitivity in these regions, if unique probes can be designed. Furthermore, the extent to which the genomic repeat elements are present in affinity-captured methylated DNA is largely unknown. Longer or paired-end reads will result in higher mapability, while paired-end reads will be essential to studying the effects of repeat elements. Overall, sequencing data appear to be more sensitive for the discovery of DMRs, in terms of total detections, and they carry the obvious advantage that the entire genome can be interrogated. However, the complexities introduced by the enrichment levels, the amount of sequencing used, amplification, methylated repeat elements, CpG density, and mapability are cumulatively significant, suggesting that array and sequencing platforms may be complementary for cost-effective and comprehensive analysis of differential methylation. Last, we studied the subtle effects that CNV may introduce into DNA methylation data sets.
The use of input DNA hybridized to tiling arrays for economical copy number aberration detection was validated, and we highlighted that genetic changes can significantly confound the identification of epigenetic differences if they are not explicitly integrated into the analysis. This will be of particular importance when analyzing cancer methylomes, for example, where copy number aberrations are widespread. Extrapolation of the information within existing copy number databases or from existing genomic DNA microarray or arrayCGH data should be straightforward. However, our results imply that, in some cases, additional resources will need to be dedicated to collecting CNV information (e.g., CNV-seq) (Xie and Tammi 2009). Copy number biases will not only confound genome-wide methylation analyses but will also be present in other affinity-based epigenome mapping exercises, such as chromatin immunoprecipitation experiments studying histone modifications. Many exciting new approaches have recently emerged to study genome-wide DNA methylation (Clarke et al. 2009; Lister et al. 2009; Flusberg et al. 2010), and along with these novel approaches have come an abundance of challenges, mainly associated with the interpretation of the growing masses of data. A better understanding of these technologies and the impact of current laboratory protocols, such as MeDIP and MBDCap, will go a long way toward the development of suitable and sensitive protocols for the genome-wide analysis of the methylome.
Methods

Cell lines and culture conditions
LNCaP prostate cancer cells were cultured as described previously (Song et al. 2002). Normal PrECs (Cambrex Bio Science catalog no. CC-2555) were cultured according to the manufacturer’s instructions in Prostate Epithelial Growth Media (PrEGM; Cambrex Bio Science catalog no. CC-3166).

Methylation profiling by MeDIP
DNA was extracted from the cell lines using the Puragene extraction kit (Gentra Systems). For fully methylated positive control DNA, CpG genome universal methylated DNA was obtained from Millipore (catalog no. 57821). The MeDIP assay was performed on 4 µg of sonicated genomic DNA (300–500 bp) in 1× IP buffer (10 mM sodium phosphate at pH 7.0, 140 mM NaCl, and 0.05% Triton X-100). Ten micrograms of anti-5-methylcytosine mouse monoclonal antibody (Calbiochem clone 162 33 D3, catalog no. NA81) was incubated overnight in 500 µL 1× IP buffer, and the DNA/antibody complexes were collected with 80 µL Protein A/G PLUS agarose beads (Santa Cruz sc-2003). The beads were washed three times with 1× IP buffer at 4°C and twice with 1 mL TE buffer at room temperature. The immune complexes were eluted with freshly prepared 1% SDS and 0.1 M NaHCO3, and the DNA was purified by phenol/chloroform extraction and ethanol precipitation and resuspended in 30 µL H2O. Input samples were processed in parallel.

Isolation of methylated DNA by MBDCap
The MethylMiner Methylated DNA Enrichment Kit (Invitrogen) was used to isolate the methylated DNA. One microgram of genomic DNA was sonicated to 100–500 bp. Then 3.5 µg (7 µL) of MBD-Biotin Protein was coupled to 10 µL of Dynabeads M-280 Streptavidin according to the manufacturer’s instructions. The MBD-magnetic bead conjugates were washed three times and resuspended in 1 volume of 1× bind/wash buffer. The capture reaction was performed by adding 1 µg sonicated DNA to the MBD-magnetic beads on a rotating mixer for 1 h at room temperature. All capture reactions were done in duplicate. The beads were washed three times with 1× bind/wash buffer. The methylated DNA was eluted in one of two ways: (1) as a single fraction with a high-salt elution buffer (2000 mM NaCl), denoted MBD-SF; or (2) as distinct subpopulations based on the degree of methylation using an increasing NaCl concentration of the elution buffer, from 200 mM to 2000 mM in a stepwise gradient (elution 1, 200 mM; elution 2, 350 mM; elution 3, 450 mM; elution 4, 600 mM; elution 5, 1000 mM; and elution 6, 2000 mM). Each fraction was concentrated by ethanol precipitation using 1 µL glycogen (20 µg/µL), 1/10th volume of 3 M sodium acetate (pH 5.2), and two sample volumes of 100% ethanol, and was resuspended in 60 µL H2O.

WGA and promoter array analyses
Immunoprecipitated DNA and input DNA from MeDIP immunoprecipitations and MBD-Capture reactions were amplified with the GenomePlex Complete WGA Kit (Sigma catalog no. WGA2), according to the manufacturer’s instructions. Fifty nanograms of DNA was used in each amplification reaction. The reactions were cleaned up using cDNA cleanup columns (Affymetrix no. 900371), and 7.5 µg of amplified DNA was fragmented and labeled according to Affymetrix Chromatin Immunoprecipitation Assay Protocol P/N 702238 Rev. 3. Affymetrix GeneChip Human Promoter 1.0R arrays (P/N 900777) were hybridized using the GeneChip Hybridization wash and stain kit (P/N 900720).

Amplification bias experiments
WGA reactions were performed in the presence of reagents known to enhance the amplification of GC-rich DNA (Zhang et al. 2009). Betaine (Sigma B0300-IVL; final 2.2 M), ethylene glycol (Sigma E-9129; lot 23H00252; final 1.075 M), 1,2-propanediol (Sigma 398039; final 0.816 M), and DMSO (Stratagene catalog no. 60026053; final 4%) were used in separate 100 µL WGA reactions. Following the amplification, the DNA was purified, labeled, fragmented, and hybridized to the Affymetrix GeneChip Human Promoter 1.0R arrays as described above.
Local CpG density
We use the definition of local CpG density given by Pelizzola et al. (2008), with a window of 600 bp, because we hybridize genomic DNA fragments with an average length of 600 bases; individual probes therefore measure signal from adjacent genomic regions and are affected by the number of CpG sites in those regions. Briefly, the local CpG density is a weighted count of CpG sites within 600 bases upstream and downstream of a given point of interest (e.g., a microarray probe location). The weight decreases linearly from 1 at the point of interest to 0 at 600 bases upstream or downstream. The score thus reflects the number of CpG sites in close proximity to the point of interest.
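As a rough illustration of the linear weighting described above, the following Python sketch computes a local CpG density on a toy sequence. It is not the implementation used in this study or by Pelizzola et al. (2008): the function name `local_cpg_density`, the 0-based coordinates, and the choice to place each CpG site at the position of its cytosine are assumptions made for the example.

```python
def local_cpg_density(sequence: str, center: int, window: int = 600) -> float:
    """Weighted count of CpG sites within `window` bases of `center` (0-based).

    Each CpG is located at the index of its C (an assumption of this sketch).
    The weight falls linearly from 1 at `center` to 0 at `window` bases away.
    """
    seq = sequence.upper()
    start = max(0, center - window)
    end = min(len(seq) - 1, center + window)  # last index that can still hold a C followed by a base
    density = 0.0
    for i in range(start, end):
        if seq[i] == "C" and seq[i + 1] == "G":
            distance = abs(i - center)
            density += max(0.0, 1.0 - distance / window)
    return density
```

For a probe at position `p` in a chromosome string `chrom`, `local_cpg_density(chrom, p)` would give the score with the 600-base window used in the text; a smaller window is convenient for checking the weights by hand.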
Cumulative signal attenuation
The cumulative bias score captures the degree of attenuation at high local CpG density for a set of probes with given probe GC content. Using the statistics (median, 25th percentile, and 75th percentile) from bins 3–17 for each combination of probe GC content and sample, a median and variance for the combination were calculated. The cumulative bias is the sum of absolute deviations from the calculated median and variance among the 50 bins for the combination. A pictorial description is given in Supplemental Figure 7.

Untargeted promoter-level analysis of promoter array data
A probe-level score for the difference of interest was calculated (LNCaP signal − PrEC signal), smoothed using a trimmed mean (600-bp window), and searched for a significant and persistent difference. To calculate a false discovery rate, the order of the probes is randomized and the same procedure is followed. The method is implemented in the regionStats function of the Repitools package (Statham et al. 2010).

Targeted promoter-level analysis of promoter array and sequencing data
For tiling array data, a probe-level score for the difference of interest was calculated (LNCaP signal − PrEC signal) using all the probes within 750 bases of every TSS. A one-sample t-statistic was then calculated to determine whether the average probe-level score for each TSS is significantly different from zero, as implemented in the blockStats function of the Repitools package (Statham et al. 2010). P-values are calculated from t-statistics. For sequencing data, the number of reads that mapped to within 750 bases of every TSS was counted. Then, an exact test for the difference in counts between LNCaP and PrEC was calculated using the Bioconductor edgeR package (Robinson et al. 2010).

Data normalization
The normalization for Affymetrix Human Promoter 1.0R arrays follows the adjustment proposed from the model-based analysis of tiling arrays (MAT) (Johnson et al. 2006), which compensates for the global effects of base composition and probe copy number. By the nature of high-throughput sequencing experiments, each sample is sequenced to a different depth. The compensation for total read depth occurs at the following stages: (1) in the analysis of MBDCap-enriched SssI-treated DNA, signal levels are presented as read counts per million uniquely mapped, as shown in Figure 1C (promoters) and Supplemental Figure 2 (genome); and (2) the differential analysis of read counts at promoters, for comparing LNCaP-MBDCap versus PrEC-MBDCap, explicitly compensates for read depth (i.e., library size) in the edgeR software (Robinson et al. 2010).

Back-transformed Z-scores
To put observed differences between LNCaP and PrEC cells on a common scale, for both the tiling array and sequencing platforms, P-values were back-transformed into signed Z-scores. For each P-value, the Z-score is the value, z, of the standard normal distribution such that Pr(Z > z) = p/2, where p is the P-value. Regions with higher signal (or higher relative count) in LNCaP cells have positive Z-scores; all other regions have negative Z-scores.

Mapping Genome Analyzer sequencing reads
We mapped 36-bp reads to the hg18 reference genome using Bowtie (Langmead et al. 2009), with up to three mismatches. Reads that mapped more than once (i.e., identical start sites) to a single genomic location were excluded.
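The back-transformation of P-values into signed Z-scores described in the Methods above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the study: the function name `signed_z_score` and the boolean flag `higher_in_lncap` are hypothetical, and the standard-library `statistics.NormalDist` stands in for whatever quantile function was actually used.

```python
from statistics import NormalDist


def signed_z_score(p_value: float, higher_in_lncap: bool) -> float:
    """Return z such that Pr(Z > |z|) = p/2, signed by the direction of change."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must lie in (0, 1]")
    # The lower-tail quantile at p/2 is -|z|; working in the lower tail keeps
    # precision for very small P-values.
    magnitude = -NormalDist().inv_cdf(p_value / 2.0)
    return magnitude if higher_in_lncap else -magnitude
```

Under this definition a two-sided P-value of 0.05 maps to a Z-score of about ±1.96, so the cutoff of 3 used later corresponds to a P-value of roughly 0.0027.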
Concordance of DMRs between promoter arrays and sequencing
Promoters were deemed to be concordant and hypermethylated if both platforms give a Z-score greater than 3 and to be concordant and hypomethylated if both Z-scores are less than −3. Hypermethylated discordant promoters were defined as one platform having a Z-score greater than 3 and the other platform having a Z-score less than 1.5. Similarly, cutoffs of −3 and −1.5 were used to define discordant hypomethylated promoters. Note that promoters deemed as differentially methylated by one platform and not by the other (i.e., between 1 and 3 or between −3 and −1) are considered indeterminate.
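The classification rules above can be written as a small decision function. The sketch below is an illustration under stated assumptions, not the authors' code: the function name, the category labels, and the handling of cases the text does not cover (for example, strongly opposite Z-scores on the two platforms, which fall into the first matching discordant class here) are choices made for the example. It uses the cutoffs of 3 and 1.5 and their negatives.

```python
def classify_promoter(z_array: float, z_seq: float,
                      strong: float = 3.0, weak: float = 1.5) -> str:
    """Label a promoter from its Z-scores on the array and sequencing platforms."""
    if z_array > strong and z_seq > strong:
        return "concordant hypermethylated"
    if z_array < -strong and z_seq < -strong:
        return "concordant hypomethylated"
    high, low = max(z_array, z_seq), min(z_array, z_seq)
    if high > strong and low < weak:
        return "discordant hypermethylated"
    if low < -strong and high > -weak:
        return "discordant hypomethylated"
    if abs(z_array) > strong or abs(z_seq) > strong:
        # Called by one platform, with the other in the intermediate band.
        return "indeterminate"
    return "not differential"  # label assumed for this sketch
```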
Down-sampling analysis
The counts were down-sampled for each gene promoter to accumulate total read counts between 20% and 100% of the original data set (10 data sets are sampled for each level of down-sampling). For each down-sampled data set, the number of DMRs (using an absolute Z-score cutoff of 3) was calculated. Using the median number of DMRs for each level of subsampling, a nonlinear curve of the form a x^c / (b + x^c) was fitted (using the R nls function) in order to estimate the total number of DMRs (i.e., parameter a). The 95% confidence interval for the total number of DMRs is (5555, 6323). The data sets sampled at 100% reveal 3777 DMRs.
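To make the role of parameter a concrete, the sketch below fits the saturation curve a x^c / (b + x^c) to synthetic data. It deliberately swaps the R nls function for a much cruder technique that needs only the Python standard library: a grid search over b and c, with a obtained in closed form by linear least squares for each (b, c) pair. The function names, the grids, the use of the sampled fraction as x, and the data are all assumptions for illustration; no values from the study are reproduced.

```python
def saturation(x: float, a: float, b: float, c: float) -> float:
    """Saturation curve a * x^c / (b + x^c); tends to a as x grows."""
    xc = x ** c
    return a * xc / (b + xc)


def fit_saturation(xs, ys, b_grid, c_grid):
    """Grid search over (b, c); for each pair, a is the least-squares scale factor."""
    best = None
    for b in b_grid:
        for c in c_grid:
            shape = [x ** c / (b + x ** c) for x in xs]
            a = sum(y * g for y, g in zip(ys, shape)) / sum(g * g for g in shape)
            sse = sum((y - a * g) ** 2 for y, g in zip(ys, shape))
            if best is None or sse < best[0]:
                best = (sse, a, b, c)
    return best[1], best[2], best[3]
```

Here `xs` would hold the down-sampling levels (0.2 to 1.0) and `ys` the median number of DMRs at each level; the fitted `a` is the estimated plateau. A proper nonlinear least-squares routine would also be needed to obtain a confidence interval, which this sketch does not attempt.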
Smoothed copy number estimates
The change in copy number was smoothed with a truncated Gaussian kernel smoother, using a bandwidth of 50 kb for the promoter array and 200 kb for the SNP array, as implemented in the aroma.core R package (Bengtsson et al. 2008).
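For readers unfamiliar with kernel smoothing, a truncated Gaussian kernel smoother can be sketched as follows. This is a generic illustration and does not reproduce the aroma.core implementation: treating the bandwidth as the kernel's standard deviation and truncating at three bandwidths are assumptions of the example, as are the function and parameter names.

```python
import math


def gaussian_kernel_smooth(positions, values, bandwidth, truncate=3.0):
    """Smooth `values` measured at genomic `positions` with a truncated Gaussian kernel.

    Points farther than `truncate * bandwidth` from the target get zero weight.
    Quadratic in the number of points, which is fine for a sketch.
    """
    smoothed = []
    for target in positions:
        weighted_sum = 0.0
        weight_total = 0.0
        for pos, val in zip(positions, values):
            distance = abs(pos - target)
            if distance <= truncate * bandwidth:
                weight = math.exp(-0.5 * (distance / bandwidth) ** 2)
                weighted_sum += weight * val
                weight_total += weight
        # The target itself always contributes weight 1, so weight_total > 0.
        smoothed.append(weighted_sum / weight_total)
    return smoothed
```

With positions in base pairs, `bandwidth=50_000` would correspond to the promoter array setting quoted in the text and `bandwidth=200_000` to the SNP array setting.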
Mapability
Using Bowtie, all possible 36-bp reads from the entire human genome were mapped back to the genome. At every base, a read starting at that position either can or cannot be mapped unambiguously. Mapability is the proportion of such reads that can be mapped unambiguously within a given genomic region.
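The definition above can be illustrated on a toy genome held in memory. The sketch below is a simplification and not the procedure used in the study: it replaces Bowtie alignment with exact forward-strand k-mer counting, ignores reverse complements and mismatches, and uses hypothetical names throughout.

```python
from collections import Counter


def mapability(genome: str, region_start: int, region_end: int,
               read_length: int = 36) -> float:
    """Fraction of read start positions in [region_start, region_end) whose
    read occurs exactly once in `genome` (exact, forward-strand matches only)."""
    k = read_length
    n_starts = len(genome) - k + 1
    kmer_counts = Counter(genome[i:i + k] for i in range(n_starts))
    starts = range(region_start, min(region_end, n_starts))
    if len(starts) == 0:
        return 0.0
    unique = sum(1 for i in starts if kmer_counts[genome[i:i + k]] == 1)
    return unique / len(starts)
```

On a real genome this exhaustive in-memory counting would be impractical, which is why an aligner is used in practice; the sketch is only meant to make the proportion being computed explicit.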
Acknowledgments
We thank Kate Patterson for help with preparation of the figures and critical reading of the manuscript and Oleg Mayba for mappability calculation code. This work is supported by National Health and Medical Research Council (NH&MRC) project (427614, 481347) (M.D.R., C.S., D.S.) and Fellowship (S.J.C.), Cancer Institute NSW grants (CINSW: S.J.C., A.L.S.), and NBCF Program Grant (S.J.C.) and ACRF. We also thank the Ramaciotti Centre, University of New South Wales (Sydney, Australia) for array hybridizations and Illumina GAII sequencing of MBDCap DNA.
References
Bengtsson H, Simpson K, Bullard J, Hansen K. 2008. aroma.affymetrix: A generic framework in R for analyzing small to very large Affymetrix data sets in bounded memory, tech report no. 745. Department of Statistics, University of California, Berkeley.
Bredel M, Bredel C, Juric D, Kim Y, Vogel H, Harsh GR, Recht LD, Pollack JR, Sikic BI. 2005. Amplification of whole tumor genomes and gene-by-gene mapping of genomic aberrations from limited sources of fresh-frozen and paraffin-embedded DNA. J Mol Diagn 7: 171–182.
Clarke J, Wu HC, Jayasinghe L, Patel A, Reid S, Bayley H. 2009. Continuous base identification for single-molecule nanopore DNA sequencing. Nat Nanotechnol 4: 265–270.
Coolen MW, Stirzaker C, Song JZ, Statham AL, Kassir Z, Moreno CS, Young AN, Varma V, Speed TP, Cowley M, et al. 2010. Consolidation of the cancer genome into domains of repressive chromatin by long-range epigenetic silencing (LRES) reduces transcriptional plasticity. Nat Cell Biol 12: 235–246.
Flusberg BA, Webster DR, Lee JH, Travers KJ, Olivares EC, Clark TA, Korlach J, Turner SW. 2010. Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat Methods 7: 461–465.
Gal-Yam EN, Egger G, Iniguez L, Holster H, Einarsson S, Zhang X, Lin JC, Liang G, Jones PA, Tanay A. 2008. Frequent switching of Polycomb repressive marks and DNA hypermethylation in the PC3 prostate cancer cell line. Proc Natl Acad Sci 105: 12979–12984.
Gu H, Bock C, Mikkelsen TS, Jager N, Smith ZD, Tomazou E, Gnirke A, Lander ES, Meissner A. 2010. Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nat Methods 7: 133–136.
Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP. 2003. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 31: e15. doi: 10.1093/nar/gng015.
Irizarry RA, Ladd-Acosta C, Carvalho B, Wu H, Brandenburg SA, Jeddeloh JA, Wen B, Feinberg AP. 2008. Comprehensive high-throughput arrays for relative methylation (CHARM). Genome Res 18: 780–790.
Jaenisch R, Bird A. 2003. Epigenetic regulation of gene expression: How the genome integrates intrinsic and environmental signals. Nat Genet 33: 245–254.
Jeddeloh JA, Greally JM, Rando OJ. 2008. Reduced-representation methylation mapping. Genome Biol 9: 231. doi: 10.1186/gb-2008-9-8-231.
Johnson WE, Li W, Meyer CA, Gottardo R, Carroll JS, Brown M, Liu XS. 2006. Model-based analysis of tiling-arrays for ChIP-chip. Proc Natl Acad Sci 103: 12457–12462.
Johnson DS, Li W, Gordon DB, Bhattacharjee A, Curry B, Ghosh J, Brizuela L, Carroll JS, Brown M, Flicek P, et al. 2008. Systematic evaluation of variability in ChIP-chip experiments using predefined DNA targets. Genome Res 18: 393–403.
Jones PA. 1999. The DNA methylation paradox. Trends Genet 15: 34–37.
Jones PA, Baylin SB. 2002. The fundamental role of epigenetic events in cancer. Nat Rev Genet 3: 415–428.
Jones PA, Baylin SB. 2007. The epigenomics of cancer. Cell 128: 683–692.
Kaminsky ZA, Tang T, Wang SC, Ptak C, Oh GH, Wong AH, Feldcamp LA, Virtanen C, Halfvarson J, Tysk C, et al. 2009. DNA methylation profiles in monozygotic and dizygotic twins. Nat Genet 41: 240–245.
Laird PW. 2010. Principles and challenges of genome-wide DNA methylation analysis. Nat Rev Genet 11: 191–203.
Langmead B, Trapnell C, Pop M, Salzberg SL. 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10: R25. doi: 10.1186/gb-2009-10-3-r25.
Li N, Ye M, Li Y, Yan Z, Butcher LM, Sun J, Han X, Chen Q, Zhang X, Wang J. 2010. Whole genome DNA methylation analysis based on high throughput sequencing technology. Methods (in press). doi: 10.1016/j.ymeth.2010.04.009.
Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, Nery JR, Lee L, Ye Z, Ngo QM, et al. 2009. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 462: 315–322.
Matsuo K, Silke J, Gramatikoff K, Schaffner W. 1994. The CpG-specific methylase SssI has topoisomerase activity in the presence of Mg2+. Nucleic Acids Res 22: 5354–5359.
Meissner A, Mikkelsen TS, Gu H, Wernig M, Hanna J, Sivachenko A, Zhang X, Bernstein BE, Nusbaum C, Jaffe DB, et al. 2008. Genome-scale DNA methylation maps of pluripotent and differentiated cells. Nature 454: 766–770.
Nakayama M, Gonzalgo ML, Yegnasubramanian S, Lin X, De Marzo AM, Nelson WG. 2004. GSTP1 CpG island hypermethylation as a molecular biomarker for prostate cancer. J Cell Biochem 91: 540–552.
Novak P, Jensen T, Oshiro MM, Watts GS, Kim CJ, Futscher BW. 2008. Agglomerative epigenetic aberrations are a common event in human breast cancer. Cancer Res 68: 8616–8625.
Oda M, Glass JL, Thompson RF, Mo Y, Olivier EN, Figueroa ME, Selzer RR, Richmond TA, Zhang X, Dannenberg L, et al. 2009. High-resolution genome-wide cytosine methylation profiling with simultaneous copy number analysis and optimization for limited cell numbers. Nucleic Acids Res 37: 3829–3839.
Paris PL. 2009. A whole-genome amplification protocol for a wide variety of DNAs, including those from formalin-fixed and paraffin-embedded tissue. Methods Mol Biol 556: 89–98.
Pelizzola M, Koga Y, Urban AE, Krauthammer M, Weissman S, Halaban R, Molinaro AM. 2008. MEDME: An experimental and analytical methodology for the estimation of DNA methylation levels based on microarray derived MeDIP-enrichment. Genome Res 18: 1652–1659.
Ponzielli R, Boutros PC, Katz S, Stojanova A, Hanley AP, Khosravi F, Bros C, Jurisica I, Penn LZ. 2008. Optimization of experimental design parameters for high-throughput chromatin immunoprecipitation studies. Nucleic Acids Res 36: e144. doi: 10.1093/nar/gkn735.
Pugh TJ, Delaney AD, Farnoud N, Flibotte S, Griffith M, Li HI, Qian H, Farinha P, Gascoyne RD, Marra MA. 2008. Impact of whole genome amplification on analysis of copy number variants. Nucleic Acids Res 36: e80. doi: 10.1093/nar/gkn378.
Rauch T, Li H, Wu X, Pfeifer GP. 2006. MIRA-assisted microarray analysis, a new technology for the determination of DNA methylation patterns, identifies frequent methylation of homeodomain-containing genes in lung cancer cells. Cancer Res 66: 7939–7947.
Rauch TA, Zhong X, Wu X, Wang M, Kernstine KH, Wang Z, Riggs AD, Pfeifer GP. 2008. High-resolution mapping of DNA hypermethylation and hypomethylation in lung cancer. Proc Natl Acad Sci 105: 252–257.
Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, et al. 2000. Genome-wide location and function of DNA binding proteins. Science 290: 2306–2309.
Robinson MD, McCarthy DJ, Smyth GK. 2010. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26: 139–140.
Ruike Y, Imanaka Y, Sato F, Shimizu K, Tsujimoto G. 2010. Genome-wide analysis of aberrant methylation in human breast cancer cells using methyl-DNA immunoprecipitation combined with high-throughput sequencing. BMC Genomics 11: 137. doi: 10.1186/1471-2164-11-137.
Serre D, Lee BH, Ting AH. 2009. MBD-isolated genome sequencing provides a high-throughput and comprehensive survey of DNA methylation in the human genome. Nucleic Acids Res 38: 391–399.
Song JZ, Stirzaker C, Harrison J, Melki JR, Clark SJ. 2002. Hypermethylation trigger of the glutathione-S-transferase gene (GSTP1) in prostate cancer cells. Oncogene 21: 1048–1061.
Sorensen AL, Jacobsen BM, Reiner AH, Andersen IS, Collas P. 2010. Promoter DNA methylation patterns of differentiated cells are largely programmed at the progenitor stage. Mol Biol Cell 21: 2066–2077.
Statham AL, Strbenac D, Coolen MW, Stirzaker C, Clark SJ, Robinson MD. 2010. Repitools: An R package for the analysis of enrichment-based epigenomic data. Bioinformatics 26: 1662–1663.
Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, Thorne N, Redon R, Bird CP, de Grassi A, Lee C, et al. 2007. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 315: 848–853.
Teo YY, Inouye M, Small KS, Fry AE, Potter SC, Dunstan SJ, Seielstad M, Barroso I, Wareham NJ, Rockett KA, et al. 2008. Whole genome-amplified DNA: Insights and imputation. Nat Methods 5: 279–280.
Teytelman L, Ozaydin B, Zill O, Lefrancois P, Snyder M, Rine J, Eisen MB. 2009. Impact of chromatin structures on DNA processing for genomic analyses. PLoS ONE 4: e6700. doi: 10.1371/journal.pone.0006700.
Vega VB, Cheung E, Palanisamy N, Sung WK. 2009. Inherent signals in sequencing-based chromatin-immunoprecipitation control libraries. PLoS ONE 4: e5241. doi: 10.1371/journal.pone.0005241.
Weber M, Davies JJ, Wittig D, Oakeley EJ, Haase M, Lam WL, Schubeler D. 2005. Chromosome-wide and promoter-specific analyses identify sites of differential DNA methylation in normal and transformed human cells. Nat Genet 37: 853–862.
Weber M, Hellmann I, Stadler MB, Ramos L, Paabo S, Rebhan M, Schubeler D. 2007. Distribution, silencing potential and evolutionary impact of promoter DNA methylation in the human genome. Nat Genet 39: 457–466.
Wei H, Kuan PF, Tian S, Yang C, Nie J, Sengupta S, Ruotti V, Jonsdottir GA, Keles S, Thomson JA, et al. 2008. A study of the relationships between oligonucleotide properties and hybridization signal intensities from NimbleGen microarray datasets. Nucleic Acids Res 36: 2926–2938.
Weng YI, Huang TH, Yan PS. 2009. Methylated DNA immunoprecipitation and microarray-based analysis: Detection of DNA methylation in breast cancer cell lines. Methods Mol Biol 590: 165–176.
Xie C, Tammi MT. 2009. CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinformatics 10: 80.
Zhang Z, Yang X, Meng L, Liu F, Shen C, Yang W. 2009. Enhanced amplification of GC-rich DNA with two organic reagents. Biotechniques 47: 775–779.
Received May 18, 2010; accepted in revised form September 23, 2010.
Method
Gene expression profiling of human breast tissue samples using SAGE-Seq

Zhenhua Jeremy Wu,1,2 Clifford A. Meyer,1,2,9 Sibgat Choudhury,3,4,5,9 Michail Shipitsin,3,4,5 Reo Maruyama,3,4,5 Marina Bessarabova,6 Tatiana Nikolskaya,6 Saraswati Sukumar,7 Armin Schwartzman,1,2 Jun S. Liu,8,10 Kornelia Polyak,3,4,5,10 and X. Shirley Liu1,2

1 Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02115, USA; 2 Harvard School of Public Health, Boston, Massachusetts 02115, USA; 3 Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, Massachusetts 02115, USA; 4 Department of Medicine, Brigham and Women’s Hospital, Boston, Massachusetts 02115, USA; 5 Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115, USA; 6 Vavilov Institute for General Genetics, Russian Academy of Sciences, Moscow 119331, Russia; 7 Johns Hopkins Oncology Center, Baltimore, Maryland 21231, USA; 8 Department of Statistics, Harvard University, Science Center 715, Cambridge, Massachusetts 02138, USA

We present a powerful application of ultra high-throughput sequencing, SAGE-Seq, for the accurate quantification of normal and neoplastic mammary epithelial cell transcriptomes. We develop data analysis pipelines that allow the mapping of sense and antisense strands of mitochondrial and RefSeq genes, the normalization between libraries, and the identification of differentially expressed genes. We find that the diversity of cancer transcriptomes is significantly higher than that of normal cells. Our analysis indicates that transcript discovery plateaus at 10 million reads/sample, and suggests a minimum desired sequencing depth around five million reads. Comparison of SAGE-Seq and traditional SAGE on normal and cancerous breast tissues reveals higher sensitivity of SAGE-Seq to detect less-abundant genes, including those encoding for known breast cancer-related transcription factors and G protein–coupled receptors (GPCRs). SAGE-Seq is able to identify genes and pathways abnormally activated in breast cancer that traditional SAGE failed to call. SAGE-Seq is a powerful method for the identification of biomarkers and therapeutic targets in human disease.

[Supplemental material is available online at http://www.genome.org. The data from this study have been submitted to the NCBI Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo) under accession no. GSE24491. Software for SAGE-Seq data analysis is available at http://www.liulab.dfci.harvard.edu/sageExpress/.]

Microarrays and sequencing-based technologies have been widely used for gene expression profiling to create global pictures of cellular function (Adams et al. 1991; Schena et al. 1995; Velculescu et al. 1995). Early gene expression data analysis algorithms focused on biases and limitations introduced by each technology. For array-based technologies such as Affymetrix and NimbleGen microarrays, methods have been developed to overcome probe-specific behavior, GC content bias, dye bias, and cross-hybridization (Yang and Speed 2002; Johnson et al. 2006; Song et al. 2007). While traditional sequencing-based gene expression methods such as serial analysis of gene expression (SAGE) (Velculescu et al. 2000; Polyak and Riggins 2001) and expressed sequence tag (EST) (Adams et al. 1991) sequencing allow the identification and quantification of both known and novel genes, they were severely limited by sequencing throughput and cost (Adams et al. 1991; Velculescu et al. 1995). As next-generation sequencing platforms provide increased throughput at reduced cost (Johnson et al. 2007), their applications to SAGE become a natural choice for comprehensive analysis of gene expression (SAGE-Seq) or other applications (Bloushtain-Qimron et al.
9 These authors contributed equally to this work.
10 Corresponding authors. E-mail [email protected]. E-mail [email protected].
Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.108217.110.
2008) and promise greater sensitivity and specificity (Morrissy et al. 2009). However, SAGE-Seq poses its own challenges with regard to data normalization, read alignment, identification of differentially expressed genes, and comparison to traditional SAGE. To address these challenges, we describe data analysis pipelines to process SAGE-Seq data on mammary epithelial cells isolated from normal and cancerous human breast tissue samples and deep sequenced on the Illumina platform (formerly known as Solexa). To normalize the SAGE-Seq raw data across different libraries, we utilize a nonparametric empirical Bayes method to reduce the sequence sampling bias (Robbins 1956; Gale and Sampson 1995). Appropriate global diversity measurements within and across data sets are evaluated and used to cluster the libraries. We propose a mapping strategy to align SAGE-Seq tags to the genome. We utilize the mapping information to minimize sequencing errors and obtain accurate quantification of sense and antisense transcripts corresponding to RefSeq and mitochondrial genes. We develop a method to identify differentially expressed genes with statistical significance and show its utility on differential gene detection between normal and neoplastic mammary epithelial cells. We also compare traditional SAGE and SAGE-Seq data sets and demonstrate the overwhelming power of SAGE-Seq to detect 20 times more differentially expressed genes with higher statistical confidence. Pathway analysis shows that the greater sequencing depth obtained by SAGE-Seq allows the identification of more than three times as many statistically significant Gene Ontology (GO) terms as by traditional
20:1730–1739 © 2010 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/10; www.genome.org
SAGE and improves their statistical significance score. Many of these pathways are newly identified by SAGE-Seq and are completely missed by traditional SAGE.
Results

SAGE-Seq library generation
SAGE-Seq libraries in this study were generated from 50,000 to 100,000 uncultured mammary epithelial cells isolated from breast tissue of normal healthy women and from primary invasive ductal breast carcinomas (Table 1). Immunomagnetic bead purification of the cells and SAGE library generation were performed essentially as previously described (Shipitsin et al. 2007), except where modifications were necessary for sequencing on the Illumina platform (see Methods). The raw Illumina data consist of millions of sequence tags, but only the first 21 bp of each read are useful here. The first 4 bp are always “CATG,” the recognition site of NlaIII, the mapping restriction enzyme used during the construction of the SAGE libraries. MmeI is used as a tagging enzyme to cut 21 bp 3′ of its recognition site, which is present in the linker immediately 5′ to the NlaIII site. Thus, a SAGE-Seq tag is composed of a 5′ “CATG” followed by a 17-bp unique transcript-specific sequence. The cross-lane correlation shows high reproducibility of the abundance measurement in SAGE-Seq libraries (Supplemental Fig. S1).
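The tag structure described above (a leading “CATG” followed by 17 transcript-specific bases) suggests a simple way to turn raw reads into tag counts. The Python sketch below is illustrative only and is not the authors' pipeline: the function names and the decision to discard reads that lack the anchor or contain ambiguous bases are assumptions of the example.

```python
from collections import Counter

ANCHOR = "CATG"   # NlaIII recognition site at the start of every usable read
TAG_LENGTH = 17   # transcript-specific bases that follow the anchor


def extract_sage_tag(read: str):
    """Return the 17-bp tag following the leading CATG, or None if unusable."""
    read = read.upper()
    end = len(ANCHOR) + TAG_LENGTH
    if len(read) < end or not read.startswith(ANCHOR):
        return None
    tag = read[len(ANCHOR):end]
    if set(tag) - set("ACGT"):  # reject ambiguous base calls such as N
        return None
    return tag


def count_tags(reads):
    """Count occurrences of each valid tag across an iterable of reads."""
    tags = (extract_sage_tag(r) for r in reads)
    return Counter(t for t in tags if t is not None)
```

Only the first 21 bases of each read are inspected, matching the statement that the remainder of the read is not useful here.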
Pipelines for tag mapping and sequencing error minimization
To analyze the expression of individual genes, we used SeqMap (Jiang and Wong 2008) and propose a mapping pipeline (Supplemental Fig. S3) to align tags to RefSeq genes. This mapping pipeline allows us to map tags to mitochondrial, sense, and antisense transcripts of RefSeq genes. If a transcript has multiple CATGs, then the one closest to the poly(A) tail (3′ end) is called the best tag (Fig. 1A). If a tag maps to multiple RefSeq locations but is a best tag at only one of them, that location is taken as its unique mapping. Otherwise, the tag is a nonunique tag, and its count is divided evenly among the mapped locations. Sense tags are defined as tags that map to the sense strand of exons of known transcripts. Antisense tags are defined as tags that cannot be mapped as sense tags, but that map to the antisense strand of known transcribed genes (He et al. 2008). Mapping results can also be used to identify sequencing errors, as shown in Figure 1C. We combined the counts of the tags that are uniquely mapped to the same genes at the same locations to reduce noise and sampling bias due to sequencing error (sequencing error minimization), which reduces the number of false positives in the subsequent analysis of differentially expressed genes. The tag in the reference genome is used as the consensus tag for sequencing error minimization. For example, suppose there are 190,793 occurrences of the tag “GCCGTGTCCGCCTGCTA,” which maps exactly to the reference genome. If there are 3198 tags that differ from this tag by a single base pair, combined together after sequencing error minimization there is a total of 193,961; therefore, the fraction of single base pair mismatches is 1.6% (3198/193,961). This corresponds to a sequencing error rate of about 0.1% per base (17 × 0.001 × [1 − 0.001]^16 ≈ 1.7%), which is consistent with the estimate for high-quality Illumina reads (Shendure and Ji 2008). Using library N1 as an example, we demonstrate that about 76% of the tags can be uniquely mapped using our pipeline: 6% of all tags are mitochondrial tags, 46% are unique RefSeq sense best tags, 14% are unique sense non-best tags, and 10% are unique antisense tags (Fig. 1B; see Supplemental spreadsheet 1 for the mapping results of other libraries). All subsequent analyses are conducted on the 46% of unique sense best tags.
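The error-minimization step and the per-base error estimate above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it merges every tag within one mismatch of a single reference tag, whereas the pipeline described in the text combines tags that are uniquely mapped to the same gene at the same location. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length tags."""
    if len(a) != len(b):
        raise ValueError("tags must have the same length")
    return sum(x != y for x, y in zip(a, b))


def collapse_single_mismatches(counts: Dict[str, int], reference_tag: str) -> Counter:
    """Assign the counts of tags within one mismatch of the reference to the reference tag."""
    merged: Counter = Counter()
    for tag, n in counts.items():
        if len(tag) == len(reference_tag) and hamming(tag, reference_tag) <= 1:
            merged[reference_tag] += n   # presumed sequencing error (or the reference itself)
        else:
            merged[tag] += n             # left untouched
    return merged


def expected_single_mismatch_fraction(tag_len: int = 17, per_base_error: float = 0.001) -> float:
    """Binomial probability that exactly one of tag_len bases is misread."""
    return tag_len * per_base_error * (1.0 - per_base_error) ** (tag_len - 1)
```

With the default arguments, `expected_single_mismatch_fraction()` evaluates the expression 17 × 0.001 × (1 − 0.001)^16 quoted in the text, which can be compared with the observed fraction 3198/193,961.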
Overview of normal and cancer transcriptomes
Gene expression patterns of cell populations in many ways resemble the populations of different species in an ecosystem, where an individual of a species corresponds to a transcript in our study. In typical ecosystems some species are abundant, whereas the majority of species are rare (Magurran 2003). Similarly, SAGE-Seq profiling data show that most genes are expressed at low levels (rare transcripts) and a few genes are expressed in large amounts (abundant transcripts) (Fig. 2A,B). Interestingly, although rarely expressed tags are the majority of unique tags, highly expressed unique tags still dominate the total tag count (expression level). By plotting the cumulative fraction of the total tag count as a function of unique tag count, we found that although the unique tags with a count of one are 63% of S (the overall number of unique tags), they account for only 3% of N (the total tag count) (Fig. 2B). The question arises as to whether these low-count tags are spurious tags dominated by sequencing errors or true tags expressed at very low levels. As described above, after sequencing error minimization, tags with one mismatch due to sequencing errors can be identified and corrected based on mapping information. Tags with more than two mismatches represent only 0.01% of all the reads (see Methods). Thus, these low-count tags are much more abundant than sequencing errors could explain. They are possibly a mixture of low-abundance transcripts and nucleic acid contamination (possibly genomic DNA), as the SAGE-Seq preparation protocol does not include a step for the elimination of genomic DNA from the RNA samples.
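The two percentages contrasted above (the share of unique tags that have a count of one versus the share of the total tag count they represent) are simple functions of the tag count table. A minimal Python sketch follows; the function name and the toy counts are made up for illustration and are not data from this study.

```python
from typing import Dict, Tuple


def singleton_shares(tag_counts: Dict[str, int]) -> Tuple[float, float]:
    """Return (fraction of unique tags with count 1, fraction of total count they hold)."""
    s = len(tag_counts)              # S: number of unique tags
    n = sum(tag_counts.values())     # N: total tag count
    n1 = sum(1 for c in tag_counts.values() if c == 1)
    return n1 / s, n1 / n


# Toy library: three singletons and one abundant tag.
share_of_unique, share_of_total = singleton_shares({"a": 1, "b": 1, "c": 1, "d": 97})
```

In this toy library, singletons are 75% of the unique tags but only 3% of the total count, the same qualitative pattern as in Figure 2B.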
Table 1. SAGE-Seq libraries of normal and cancer groups

Normal
Library   N (Total tags)   S (No. of unique tags)
N1        9,618,916        475,975
N2        13,522,703       533,972
N3        2,983,207        342,066
N4        1,800,069        222,314
N5        1,824,933        173,973
N6        1,045,874        129,323
N7        11,007,864       333,462

Cancer
Library   N (Total tags)   S (No. of unique tags)
C1        4,334,958        477,158
C2        4,710,675        657,887
C3        4,116,502        383,450
C4        3,550,342        374,584
C5        3,848,898        434,886
C6        4,263,862        466,774
C7        3,373,871        372,801

N1–N7 denote mammary epithelial cells isolated from reduction mammoplasty specimens of normal healthy women, whereas C1–C7 denote primary invasive breast carcinomas. Total and unique tag counts are listed for each library.
Genome Research www.genome.org
Downloaded from genome.cshlp.org on December 25, 2010 - Published by Cold Spring Harbor Laboratory Press
Wu et al.
Figure 1. SAGE-Seq tag alignment and sequencing error minimization. (A) The best tag is defined as the tag adjacent to the 3′-most NlaIII site (CATG), that is, the site closest to the poly(A) tail. (B) Tag alignment statistics of sample N1 according to the tag alignment pipeline in Supplemental Figure S3. Detailed mapping of other data sets is shown in Supplemental spreadsheet 1. (C) Presumed sequencing errors revealed during tag mapping. All of the listed tags are best tags uniquely mapped to the same RefSeq gene “NM_001010.” (Not every tag uniquely mapped to this gene at the same location is listed.) The x-axis lists the tag sequences in descending order of tag count. The one-base difference in sequence most likely due to sequencing error is marked in red. The sequencing error minimization step for this particular example is done in the following way: sum the counts of all these tags, assign the sum to tag “CATGGCCGTGTCCGCCTGCTA,” and remove all the other tags.
Nonparametric empirical Bayes normalization
If each SAGE library were sequenced to the same depth (i.e., the same N), tag counts in different libraries would be directly comparable. However, although most of the samples were subjected to one lane of Illumina sequencing, N varies from 1 million to 13 million in different libraries (Table 1). Thus, in order to accurately compare gene expression patterns of different libraries, normalization of tag counts is needed. One intuitive way to normalize is to use the proportion p, defined as n/N, where n is the count of a unique tag; this is known as the maximum likelihood (ML) estimator for population frequency (Fisher 1922). This approach has the drawback that the p of any tag that is missing from the sequencing data (undetected tags) is assigned to be zero, while it overestimates the p of low and intermediate abundance tags and underestimates the p of highly abundant tags (Fig. 2C, black symbols). In addition, high-throughput quality reads from the Illumina Genome Analyzer provide a good estimate for the population frequency of tags at different enrichment levels (Fig. 2A). This heterogeneous population of tag frequency, applied as a prior, indicates that the best estimate of tag enrichment should be smaller than the observed count, because the population of less-abundant transcripts is larger than that of abundant ones (Fig. 2A). Thus, a more sophisticated approach for SAGE-Seq data normalization is needed. We applied the nonparametric empirical Bayes (NEB) method (Good 1953; Robbins 1956; Orlitsky et al. 2003) to normalize libraries with different sequencing depths (see Methods). There are two advantages of NEB over ML. First, whereas ML simply considers the proportion of undetected tags to be zero, NEB estimates the proportion of undetected tags as P0 = n1/N, where n1 is the number of unique tags with count one. To validate the accuracy of the NEB estimator of P0, we randomly sampled library N1 from 1% to 10% at a step of 1% to generate 10 pseudo libraries with different sequencing depths. We used NEB to estimate the P0 of the undetected tags in each pseudo library, compared them with their respective proportion in the original library, and found that they were in good agreement (Supplemental Fig. S2). Second, NEB adjusts tag counts based on both the observed count and the nature of the frequency distribution of unique tag counts (Fig. 2A), which is applied as the empirical prior to reduce the sequence sampling bias (Gale and Sampson 1995). It also renormalizes the adjusted proportions so that the estimated total proportion of detected tags is 1 − P0 (see Supplemental material section “Algorithms comparison for differentially expressed genes” for a comparison between NEB and ML normalization). To show the effect of sampling bias, we randomly sampled 10% (pseudo library 1) and 1% (pseudo library 2) of the tags from library N1 to generate two pseudo libraries with a 10-fold difference in sequencing depth. When comparing the proportion of tags in the two pseudo libraries, we found that the proportions are much more comparable after NEB normalization, whereas ML overestimates p for low-count tags and underestimates p for high-count tags in pseudo library 2, which has the lower sequencing depth (Fig. 2C).
Diversity of normal and cancer transcriptomes
One advantage of sequence-based gene expression profiling is that it measures the absolute expression levels of many genes simultaneously. Thus, we can obtain a global view of transcript diversity within the cells and also among libraries. We used two different measures to compare transcript diversity in the libraries we analyzed. First, we used the Simpson index of diversity (SID) (Simpson 1949) to characterize transcriptome diversity within each library. SID captures the variance of the tag count distribution and is independent of sequencing depth (see Methods). Higher values indicate higher diversity, which means the tag counts are more widely distributed among different genes. We found in our data sets that libraries from cancer samples, in general, have higher diversity than those from normal samples (Fig. 3A,B; Wilcoxon rank-sum test, P = 0.07284; the P-value is of borderline significance, owing to the limited number of samples). This trend could be due to the fact that tumors express many more genes, either because they are composed of more diverse populations of cells or because they have lost the normal epigenetic controls that maintain tissue- and cell type-specific gene expression patterns (Fig. 4D). The second type of diversity is measured across libraries, to study the diversity of gene expression among libraries derived from different individuals. To ask how similar two libraries, A and B, are, we used the Morisita-Horn (MH) similarity index CMH(A, B) (Wolda 1983) (see Methods) and calculated their distance as D ≡ 1 − CMH(A, B). The Morisita-Horn index has several advantages over other distance measures. First, the MH index is not strongly influenced by N and S, which is essential to ascertain that differences in the measurement are not due to differences in sequencing depth. Second, compared with a distance based on Pearson cross correlation, the MH index has no singularity for data with a standard deviation approaching zero.
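The two diversity measures can be sketched as follows. The exact formulas used in this study are given in Methods and are not reproduced here, so the code assumes the common textbook forms: SID = 1 − Σ p_i² for the Simpson index of diversity, and CMH = 2 Σ p_i q_i / (Σ p_i² + Σ q_i²) for the Morisita-Horn similarity, where p_i and q_i are the tag proportions in libraries A and B. Function names are hypothetical.

```python
from typing import Dict


def simpson_index_of_diversity(counts: Dict[str, int]) -> float:
    """Assumed form: 1 - sum of squared tag proportions within one library."""
    total = sum(counts.values())
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def morisita_horn_distance(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Distance D = 1 - C_MH(A, B), using the assumed proportion-based form of C_MH."""
    total_a, total_b = sum(a.values()), sum(b.values())
    tags = set(a) | set(b)
    cross = sum((a.get(t, 0) / total_a) * (b.get(t, 0) / total_b) for t in tags)
    d_a = sum((n / total_a) ** 2 for n in a.values())
    d_b = sum((n / total_b) ** 2 for n in b.values())
    return 1.0 - 2.0 * cross / (d_a + d_b)
```

Because both measures are computed from proportions, two libraries with identical tag proportions but a 10-fold difference in depth have a distance of zero, which illustrates the insensitivity to N noted above; two libraries with no tags in common have a distance of one.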
We found that cancer samples are not only more diverse within each individual (Simpson index), but are also more diverse (MH index) across different individuals (Fig. 3C). This is not entirely surprising, since normal cells have a physiologic role that is essentially the same in different individuals, whereas tumors are genetically diverse and have no functional role in the body; thus, there is no selection pressure to keep their phenotype within certain limits. The hierarchical clustering of the libraries based on the MH index showed that cancer libraries are more different from each other (larger distance across libraries), and they are also very clearly separated from the normal libraries (Fig. 3D).
Data quality of SAGE-Seq compared with traditional SAGE
Following read alignment and sequencing error minimization, we further evaluated the ability of SAGE-Seq to profile genome-wide gene expression and compared it with traditional SAGE. With deeper sequencing coverage, SAGE-Seq gave much higher data correlation between different libraries within the same group (Supplemental Fig. S4). In addition, traditional SAGE can only detect genes with proportions from 10^−5 to 10^−3, whereas SAGE-Seq shows a much larger dynamic range (defined as the detected range of gene enrichment), covering about five orders of magnitude from 10^−7 to 10^−2. For example, genes encoding transcription factors are often expressed at intermediate or low levels, and SAGE-Seq detected the expression of around 1300 transcription factors (out of 1658 total in the human genome) in our samples. Most of the transcription factors detected by traditional SAGE are also detected by SAGE-Seq, whereas 384 transcription factors are detected only by SAGE-Seq (Fig. 4A). We observed similar phenomena for genes encoding GPCRs and ABC transporters (Fig. 4B,C), which are known to be differentially expressed between normal and cancer cells and are expressed at relatively low levels (Li et al. 2005; Dean 2009). To determine how far the sequenced SAGE-Seq libraries are from saturation, we calculated the number of unique best-tag genes (uniquely mapped best tags) detected in relation to the sequencing depth of each library. The number of uniquely mapped best tags is a good indicator of the number of genes detected. Deeper sequencing is expected to detect more genes until a plateau is reached when all of the genes are detected. To overcome the lack of data covering a broad range of sequencing depth, we combined all cancer (or normal) libraries and analytically calculated the number of detected unique tag genes at different sequencing depths to obtain the saturation curve (solid lines in Fig. 4D; see Methods for analytic calculation). For sequencing depths below 3 million reads, the number of genes detected increases dramatically with sequencing depth (fast-growth region). The number continues to grow at a slower rate (slow-growth region) above 5 million reads, until it plateaus around 10 million reads for both normal and cancer samples (Fig. 4D). This suggests that the ideal sequencing depth for SAGE-Seq should be above 10 million reads, with a minimum desired sequencing depth of 5 million per library. SAGE-Seq data points (triangles) are all close to or in the slow-growth region, where most of the transcriptome is sequenced. Traditional SAGE data points (circles) are still in the fast-growth region, where more than half of the transcriptome is not detected due to low sequencing depth. Figure 4D also indicates that more genes are expressed in cancer (red triangle) than in normal (black triangle) samples, which is consistent with our previous finding that cancer samples have higher transcript diversity within each library and across libraries.
Figure 2. Frequency plot of unique tag count and nonparametric empirical Bayes method. (A) Frequency of unique tag counts in libraries N1 (black) and N5 (red). The x-axis is the observed tag count and the y-axis is the frequency, that is, the number of unique tags with a specific count. (B) Pie chart depicting the distribution of unique tags in library N1: 62.6% of unique tags have tag count 1, 12.1% count 2, 5.5% count 3, and 19.8% counts larger than 3. The outer plot shows the cumulative fraction of unique tag counts. Although 62.6% of unique tags have count 1, they account for only 3% of total tag counts. (C) Scatter plot of tag proportion. The x-axis is the proportion of tags in pseudo library 1, obtained by randomly sampling 10% of library N1. The y-axis is the average proportion in pseudo library 2, obtained by randomly sampling 1% of library N1. The data points are obtained in the following way. For example, find all of the tags in pseudo library 1 with proportion 1 × 10^−6, then calculate the mean proportion of these tags in pseudo library 2, which gives, for example, 1 × 10^−5. This gives a data point at (1 × 10^−6, 1 × 10^−5). The dashed line is y = x. Black symbols indicate the proportion using the maximum likelihood estimator, where overestimation in the low and intermediately expressed tags (<100/million) and underestimation in the highly expressed tags (>100/million) are observed. Red symbols mark the proportion calculated using the nonparametric empirical Bayes method, which gives more comparable corrected proportions between two libraries with different sequencing depths for both low and highly abundant tags.
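The undetected-tag estimate P0 = n1/N described under nonparametric empirical Bayes normalization, and the contrast with the maximum likelihood proportion n/N, can be sketched as follows. The count adjustment used here is the unsmoothed Good-Turing rule r* = (r + 1) n_{r+1} / n_r, with a fallback to the observed count when no tag has count r + 1; this is a textbook simplification assumed for illustration and is not necessarily the estimator given in Methods. Function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple


def ml_proportions(tag_counts: Dict[str, int]) -> Dict[str, float]:
    """Maximum likelihood estimate p = n / N for every detected tag."""
    total = sum(tag_counts.values())
    return {tag: n / total for tag, n in tag_counts.items()}


def neb_proportions(tag_counts: Dict[str, int]) -> Tuple[Dict[str, float], float]:
    """Good-Turing style proportions that sum to 1 - P0, plus P0 itself."""
    total = sum(tag_counts.values())
    freq_of_freq = Counter(tag_counts.values())   # n_r: number of unique tags with count r
    p0 = freq_of_freq.get(1, 0) / total           # P0 = n1 / N
    adjusted = {}
    for tag, r in tag_counts.items():
        n_r, n_next = freq_of_freq[r], freq_of_freq.get(r + 1, 0)
        # Assumed fallback: keep the raw count when n_{r+1} is zero.
        adjusted[tag] = (r + 1) * n_next / n_r if n_next else float(r)
    scale = (1.0 - p0) / sum(adjusted.values())   # renormalize detected tags to 1 - P0
    return {tag: value * scale for tag, value in adjusted.items()}, p0


toy = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 2, "f": 10}
neb, p0 = neb_proportions(toy)
ml = ml_proportions(toy)
```

For this toy library P0 is 4/16 = 0.25, the adjusted proportions sum to 0.75, and the proportion assigned to each singleton is smaller than its ML proportion, in line with the statement that the best estimate for low-count tags should be smaller than the observed count.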
Figure 3. Diversity of normal and cancer transcriptomes. (A) Simpson index of diversity to measure within-library gene expression diversity. Libraries in the cancer group show higher within-library diversity compared with the normal group. (B) Box plot depicting the Simpson index of diversity of normal and cancer samples. P = 0.07284 (Wilcoxon rank-sum test). (C) Distance, defined as “1 − Morisita-Horn similarity index,” is used to measure gene expression diversity across libraries. Libraries in the normal group are more similar to one another, whereas cancer libraries are more diverse. (D) Hierarchical clustering using the “distance” defined in C separates normal and cancer libraries.
Sampling noise and biological variability
To detect differentially expressed genes between two conditions, it is important to know the sources of gene expression variability. A major source of variability in SAGE-Seq, or in any sequencing-based technique, is sampling variation, and most algorithms for analyzing traditional SAGE data tackled this using various approaches (Velculescu et al. 1995; Cai et al. 2004). Sequence-based transcriptome profiling can be modeled as a binomial sampling process with replacement that approximates a Poisson distribution, because with current technologies (Kharchenko et al. 2008) the sequenced transcripts are a tiny minority of the total amount of cDNA loaded on the sequencers. If the same library is sequenced multiple times, the Poisson model dictates that the variance in tag counts of a particular gene is equal to its abundance. When examining the empirical variance of genes in the normal libraries versus their respective normalized count (normalized by NEB and scaled to the same sequencing depth of N = 1 million), the observed variance indeed depends on the gene expression level (count) (Fig. 5A; red dashed line with slope αobv ≈ 2.0 in the log-log plot). However, if we denote the variance-to-mean slope for random binomial sampling as αrbs, which is expected to be 1.0 according to the Poisson distribution (rbs stands for random binomial sampling; blue dashed line in Fig. 5A), we observe overdispersion (αobv ≈ 2αrbs > αrbs = 1), which means that the variability of the observed data is significantly larger than the variability expected in the random reference model (the Poisson model in this case). This suggests a nonlinear dependency of gene expression variability on the mean expression level, which indicates that in our data sets overdispersion could be the result of variation between biological individuals in addition to the sampling variation (see Methods). We also observed overdispersion among a subset of housekeeping genes and subsets of uniquely mapped best tags in both the normal and cancer groups (data not shown).
Analysis of differentially expressed genes
One of the major applications of transcriptome profiling is the identification of genes differentially expressed between different samples. After tag alignment and sequencing error minimization, our analysis pipeline for the identification of differentially expressed genes (Fig. 5B) first applies the nonparametric empirical Bayes method as a normalization step to reduce sampling bias and to bring different libraries to the same sequencing depth (N = 1 million; the choice of normalized sequencing depth, unlike the actual sequencing depth of a library, has no influence on the differential genes detected). After normalization, tags with counts ≥3 per million in ≤2 out of all the libraries were discarded. This effectively removes a significant portion of noninformative tags, which either contain outliers or have counts too low to detect differential expression with statistical significance, and saves computational time and storage space during subsequent analysis. The logarithmic transformation is then applied to obtain the expression index and to decouple the observed variance from the mean expression level of genes (Fig. 5A). Quantitatively, the observed variance in our libraries is proportional to the square of the expression level. According to the delta method in statistics, the logarithmic transformation is the right transformation to stabilize the variance (see Methods). An alternative transformation is arcsinh, which is also a logarithm-like transformation, but with the advantage of having no singularity at zero (Huber et al. 2002). Supplemental Figure S5 shows that after applying a base-2 logarithmic transformation to the normalized count, the variance of the expression index is almost independent of its mean for intermediate and high abundance tags. Finally, the SAM (significance analysis of microarrays) algorithm is applied to the expression indices in the two groups of samples to identify differentially expressed genes (Fig. 5B; Tusher et al. 2001).
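The variance-stabilizing effect of the logarithmic transformation can be checked with a small deterministic Python sketch. It assumes idealized replicates whose values are a fixed set of multiplicative factors around the mean, so that the standard deviation is proportional to the mean (variance proportional to the mean squared); the function name, the factors, and the toy means are illustrative only.

```python
import math
from statistics import pvariance
from typing import List


def expression_index(count_per_million: float, transform: str = "log2") -> float:
    """Expression index from a normalized count, by log2 or arcsinh."""
    if transform == "log2":
        return math.log2(count_per_million)   # undefined at zero
    if transform == "arcsinh":
        return math.asinh(count_per_million)  # logarithm-like, finite at zero
    raise ValueError(f"unknown transform: {transform}")


def replicate_values(mean: float, factors: List[float]) -> List[float]:
    """Idealized replicates whose spread scales with the mean."""
    return [mean * f for f in factors]


factors = [0.5, 1.0, 2.0]
low = replicate_values(10.0, factors)
high = replicate_values(1000.0, factors)

raw_ratio = pvariance(high) / pvariance(low)   # grows as the square of the mean ratio
log_var_low = pvariance([expression_index(x) for x in low])
log_var_high = pvariance([expression_index(x) for x in high])
```

On the raw scale the variance of the high-abundance replicates is 10,000 times that of the low-abundance ones (a 100-fold change in mean, squared), whereas after the log2 transformation both sets have the same variance. This is the behavior the delta method predicts when the variance is proportional to the square of the mean.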
We also tried the standard t-test and found many false positives resulting from the underestimated empirical standard deviation, which gives rise to extreme t values. The SAM algorithm stabilizes the variance to reduce false positives. Other statistical tests could be used in this step instead of SAM, such as Robinson and Smyth’s moderated t-test or Baggerly’s t_w test (Baggerly et al. 2003, 2004; Lu et al. 2005; Robinson and Smyth 2007). Another alternative for the analysis of differentially expressed genes is to use overdispersed models such as overdispersed logistic regression or an overdispersed log-linear model (Baggerly et al. 2004; Lu et al. 2005). However, whether these model-based methods can be scaled up to the deeper sequencing depths of SAGE-Seq data needs to be verified through systematic analysis with more data. We compared the lists of differentially expressed genes between normal and cancer for both SAGE-Seq and traditional SAGE. The expression (i.e., presence) of 10,052 and 4953 best-tag genes is detected by SAGE-Seq and traditional SAGE, respectively (Supplemental spreadsheet 2), with 99% (4904) overlap. We calculated the false discovery rate (FDR) using the Q-value package of Storey and Tibshirani (2003). Traditional SAGE does not sequence deep enough to allow P-value or FDR cutoffs similar to those of SAGE-Seq. SAGE-Seq identifies about 4000 differentially expressed best-tag genes at 1% FDR, whereas traditional SAGE detects fewer than 200 at 10% FDR (Fig. 5C). Deeper sequencing gives SAGE-Seq increased statistical power to detect more differentially expressed genes. To compare the two lists of differentially expressed genes, we examined the rank order of genes based on their t-scores. The top 10% of genes with the highest t-scores (495 genes for traditional SAGE and 1005 for SAGE-Seq) are used as differentially expressed gene lists for comparison between these two methods. SAGE-Seq detected all 26 genes known to be differentially expressed between normal and breast cancer samples based on prior studies, whereas traditional SAGE identified only four (Supplemental Table S1). Surprisingly, we identified only 54 genes in the overlap between the top 10% of genes identified as differentially expressed by the two methods. Further analysis confirmed that the top differentially expressed genes detected by traditional SAGE and SAGE-Seq are quite different (Fig. 5D, black symbols). Many factors could contribute to this discrepancy, such as differences in library preparation protocols and samples. Besides these factors, we observed that the top differentially expressed genes detected by SAGE-Seq are often expressed at moderate or low levels (~100/million; see Supplemental Fig. S7), which traditional SAGE either completely fails to detect or detects with a tag count too low (two or three) to show differential expression with statistical power. These differentially expressed tags in SAGE-Seq are unlikely to be from sequencing errors, based on the tag counts observed. These data imply that the increased sequencing depth of SAGE-Seq results in the detection of a different set of differentially expressed genes. To demonstrate this we resort to simulations, as the use of defined cell populations with limited numbers of cells isolated from primary breast tissues did not allow the generation of both SAGE-Seq and traditional SAGE libraries from the same sample. We took the 14 SAGE-Seq libraries and sampled them down (binomial sampling) to the sequencing-depth level of traditional SAGE (~50,000). The top differentially expressed genes of these simulated libraries also show little overlap with those of the original SAGE-Seq libraries (Fig. 5D, red symbols).
Figure 4. SAGE-Seq tag mapping and sequencing depth saturation curve. (A–C) Differential coverage of expression profiles in three selected gene families: transcription factors (A), GPCRs (B), and ABC transporters (C). The y-axis lists the genes and the x-axis is the mean gene expression index (logarithm of the normalized tag count). Red and blue colors mark traditional SAGE and SAGE-Seq, respectively. SAGE-Seq detects many more genes in these gene families than traditional SAGE does. (D) Number of unique best-tag genes (y-axis) in relation to sequencing depth (x-axis). The number of best-tag genes is the number of unique genes mapped by best tags, counted as one if multiple tags are mapped to the best tag of the same gene. Black and red colors indicate normal and cancer groups, respectively. Circles and triangles mark traditional SAGE and SAGE-Seq, respectively. Solid curves (saturation curves) are from simulation by sampling the combination of all libraries in the normal (or cancer) group, and depict the trend with increasing sequencing depth. Traditional SAGE identifies far fewer best-tag genes than SAGE-Seq. SAGE-Seq shows that cancer samples (red triangles) have a larger number of unique best-tag genes than normal samples (black triangles). This difference is not detected by traditional SAGE (red circles vs. black circles).
Pathways and networks differentially activated between normal and cancer samples
To determine what signaling pathways are identified as differentially activated by SAGE-Seq and traditional SAGE, we applied a combination of gene ontology and pathway analyses for the differentially expressed gene sets using MetaCore (Nikolsky et al. 2009). However, SAGE-Seq identifies 3587 differentially expressed genes at 1% FDR cut off, whereas the most significantly differentially expressed gene identified by traditional SAGE has an FDR >9%. Thus, we decided to take the top 10% of differentially expressed genes identified by traditional SAGE genes (493) and SAGE-Seq genes at 1% FDR (3587), since an FDR cutoff gives too few differentially expressed genes in traditional SAGE (Supplemental spreadsheet 3). MetaCore provides a P-value for each tested GO term or pathway name. Using a P-value of 103 as the cutoff for significance, SAGESeq identifies 99 pathways to be significant, whereas with traditional SAGE only 32 have an overlap of 19 (Fig. 5C; Supplemental spreadsheet 4). The following pathways and GO processes are commonly enriched between SAGE-Seq and traditional SAGE: apoptosis, cell adhesion, cytoskeleton remodeling, development, immune response, G-protein signaling, signal transduction, and transcription. These are all pathways known to be relevant to breast cancer, and in each category SAGE-Seq identified the term with higher statistical significance. The 80 additional significant GO categories identified by SAGE-seq but not by traditional SAGE are all related to cancer, generally or specifically to breast cancer, based on published literature, especially categories such as Apoptosis and survival, Cell cycle,
Genome Research www.genome.org
1735
Downloaded from genome.cshlp.org on December 25, 2010 - Published by Cold Spring Harbor Laboratory Press
Wu et al. et al. 2007), are found to be significant in SAGE-Seq, but insignificant in traditional SAGE.
Discussion In this study, we systematically evaluated SAGE-Seq for transcriptome profiling and its ability to identify differentially expressed genes between normal and neoplastic mammary epithelial cells. We are the first to apply the NEB method to normalize different high-throughput SAGE-Seq libraries in order to correct the sampling bias due to incomplete sampling. NEB normalization can be applied to other types of techniques based on random sampling such as RNA-seq. We designed a pipeline to align SAGE tags to sense and antisense transcripts and minimize sequencing error through tag alignment and proposed an approach to detect differentially expressed genes by considering both sampling and biological variability. We compared SAGE-Seq and traditional SAGE to examine the effect of sequencing depth on gene coverage and differentially expressed gene detection. Comparison of SAGE-Seq data between normal and neoplastic mammary epiFigure 5. Differentially expressed genes and their variance. (A) Mean-to-variance plot for the seven thelial cells revealed that breast cancers normal libraries after removing the noise and normalization. Red dashed line is the best linear fit in loglog plot. The slope gives the exponent aobv » 1.9 Blue dashed line is the mean-to-variance line inhave higher within- and across-library ditroduced by sampling. (B) Pipeline for the identification of differentially expressed genes: (1) Seversity than normal breast cells. SAGEquencing error minimization: After tag alignment, tags that are mapped to the same genes at the same Seq identifies 20 times more differentially locations are combined together; (2) NEB is used to normalize different libraries with different seexpressed genes at 10-fold more stringent quencing depth; (3) filtering to remove tags with counts $3 per million in less than two libraries followed by log2 transformation; (4) SAM is used for the detection of differentially expressed genes. 
(C ) cutoff (1% FDR) than traditional SAGE Detected differentially expressed genes (top) and activated pathways (bottom) in SAGE-Seq and tradi(10% FDR), and three times more pathtional SAGE. SAGE-Seq identifies approximately 4000 differential genes at 1% FDR, while traditional ways specifically activated in breast canSAGE identifies <200 at a much looser cut off (10% FDR). At P = 0.001, SAGE-Seq identifies 99 pathways cer, indicating its higher sensitivity and significantly activated in breast cancer, while traditional SAGE only shows 32. The 80 pathways only identified by SAGE-Seq and missed by traditional SAGE are all breast cancer-related pathways. (D) The specificity. overlap ratio (defined as the number of overlapping genes divided by the gene number in traditional Identifying changes in gene exSAGE in the top x percent differentially expressed genes, where x changes between 0 and 1. The black pression associated with physiologic symbols depict actual data (SAGE-Seq vs. traditional SAGE). It indicates that there is little overlap in the processes is a central issue in biology, estop differentially expressed genes list between SAGE-Seq and traditional SAGE. The red symbols indicate simulation (SAGE-Seq vs. sampled down SAGE-Seq). Sampled down SAGE-Seq means to bipecially in the study of human diseases nomially sample 50 k tags from each SAGE-Seq library; 50,000 is a typical sequencing depth for (Zhu et al. 2008). Commonly used methtraditional SAGE. Simulation confirms the same conclusion as that drawn from the actual data: SAGEods include EST sequencing, cDNA microSeq gives a different top differentially expressed gene list compared with traditional SAGE. Deeper array hybridization, subtractive cloning, sequencing reveals that traditional SAGE identifies different sets of top differentially expressed genes differential display, and serial analysis than that of SAGE-Seq, confirming our conclusion that traditional SAGE lacks sufficient sequencing depth. 
of gene expression (traditional SAGE) (Adams et al. 1991; Schena et al. 1995; Velculescu et al. 1995). Compared with Androgen receptor signaling, TGFB signaling, NFKB signaling, array-based hybridization methods, SAGE-Seq has many advanBRCA1-mediated DNA damage, p53 signaling pathway, Detages. First, SAGE-Seq has higher sensitivity, which allows the development and cell cycle regulation by ESR1 and ESR2 (estrogen retection of less-abundant genes with high-confidence levels. Secceptor), G-protein signaling, and translation and transport pathways. ond, SAGE-Seq is less subject to technical artifacts such as probe The genes in these pathways are typically expressed at low levels, effects and hybridization bias (Yang and Speed 2002). Third, SAGEand this is consistent with Figure 4, B and C, showing many genes Seq does not require the a priori knowledge of transcripts to be in the GPCR- and ABC-transporter families as detected by SAGEanalyzed; thus, it allows a global analysis of transcriptome present Seq, but missed by traditional SAGE (Li et al. 2005; Dean 2009). It in the cells. is especially worth noting that the NFKB and TGFB pathways, Overdispersion of expression levels of highly expressed genes which appeared in multiple GO and pathway branches and are was observed in microarray data, and as a result, analyses were often known to be differentially regulated in breast cancer (Shipitsin conducted at the log intensity level (Irizarry et al. 2003). However,
Genome Research www.genome.org
Downloaded from genome.cshlp.org on December 25, 2010 - Published by Cold Spring Harbor Laboratory Press
most people attribute the overdispersion to probe hybridization and scanning biases inherent to the microarray platform. We quantitatively identified the relationship between biological variability and mean expression level. The SAGE-Seq data presented here not only show that the Poisson distribution used in many SAGE analysis algorithms is insufficient to capture the biological variance, but also indicate that abundant genes have higher variability among biological samples (Fig. 5A). This suggests that cells tolerate variations in the levels of highly expressed genes much better. These findings also imply that efforts on disease marker and drug target discovery might be more fruitful if focused on intermediate or low-abundance transcripts, as these show less variation among samples within the same tissue type, and differences in their expression might play a more important role in the disease process.
SAGE-Seq, with its deeper sequencing depth, is able to detect many more differentially expressed transcripts than traditional SAGE, and with higher significance. The top differentially expressed genes identified by SAGE-Seq are not the most abundant genes, but rather are expressed at intermediate or low levels (~100/million). For traditional SAGE, which is sequenced at 20 times less depth, these tags will be at the borderline of being detected. Thus, traditional SAGE has no power to differentiate these genes between different conditions. At the same time, these less-abundant genes often are transcription factors and receptors that play important roles in cell regulation and in tumorigenesis (Fig. 4A–C). Thus, even small changes in the expression of these genes might have pronounced effects on the whole cellular environment. It seems that less-abundant genes also have less variability (Fig. 5A), which enables them to be detected as top differentially expressed genes despite the fact that the absolute change in their expression levels is not the largest. Thus, high-throughput sequencing technologies provide the opportunity to unveil subtle changes in gene expression in more detail and with improved statistical power.
In summary, we show here that SAGE-Seq is a powerful and cost-effective method for the gene expression profiling of small numbers of cells isolated from primary human tissue samples, and we present data analysis tools that enable researchers to decipher the physiological meaning of the immense SAGE-Seq data sets.
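The depth argument above can be made concrete with a short calculation. The sketch below is illustrative only: it assumes simple Poisson sampling, takes 50,000 tags as the traditional SAGE depth, and takes 20 × 50,000 = 1,000,000 tags as a stand-in SAGE-Seq depth. The function names are hypothetical.

```python
import math

def expected_count(abundance_per_million, depth):
    """Expected number of tags for a transcript at a given sequencing depth."""
    return abundance_per_million * depth / 1e6

def poisson_cv(mean):
    """Coefficient of variation of a Poisson count with the given mean."""
    return 1.0 / math.sqrt(mean)

def prob_undetected(mean):
    """Poisson probability of observing zero tags."""
    return math.exp(-mean)

# A transcript at ~100 tags per million:
shallow = expected_count(100, 50_000)      # traditional SAGE depth (assumed)
deep = expected_count(100, 1_000_000)      # 20 times deeper (assumed)
print(shallow, poisson_cv(shallow), prob_undetected(shallow))
print(deep, poisson_cv(deep), prob_undetected(deep))
```

Under these assumptions the shallow library expects only a handful of tags for such a transcript, so sampling noise alone is a large fraction of the signal, whereas the deeper library reduces that relative noise several-fold.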
Methods
SAGE-Seq library construction
We posted our detailed protocol for SAGE-Seq library generation at http://research4.dfci.harvard.edu/polyaklab/protocols_linkpage.php. All of the SAGE and SAGE-Seq libraries in this study were generated from immunomagnetic bead-purified cells freshly isolated from human breast tissue samples; thus, the cell numbers are estimates based on the microscopic examination of the number of captured cells in 10 µL of volume, and they are in the 50,000–100,000 cell range. However, based on FACS analyses and sorting of the same cell type we know their approximate abundance in the tissue sample. All of the cells are directly lysed and processed for poly(A) RNA selection, followed by library preparation. The amount of poly(A) RNA is either measured by Nanodrop or by SYBR green II, but if the number of cells is very limited, we just go straight to library preparation. Based on the estimate that one cell contains 10 pg of total RNA, 100,000 cells have ~100 ng of total RNA and ~1–10 ng of poly(A) RNA (depending on cell type, tumor cells in general have higher RNA content/cell). In addition, 10% of the poly(A) RNA is saved for semiquantitative RT-PCR testing of cell purity prior to proceeding with SAGE-Seq sample preparation, and this also gives an estimate of the transcribable mRNA present.
Sequencing error minimization
At a 0.1% error rate per base, the population of tags with no error is $(1 - 0.001)^{17} = 0.9831353$, and the population of tags with one error is $17 \times 0.001 \times (1 - 0.001)^{16} = 0.01673003$. Thus, the population of tags with error in at least two bases is $1 - 0.9831353 - 0.01673003 = 0.0001346471$.
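The three fractions above are binomial probabilities for a 17-base tag, and can be checked with a few lines of code. This is a minimal sketch that assumes independent errors at each base; the function name and defaults are illustrative, not part of the published pipeline.

```python
from math import comb

def tag_error_fractions(tag_length=17, error_rate=0.001):
    """Return the fractions of tags with 0, exactly 1, and >= 2 base errors,
    assuming independent errors at each base (binomial model)."""
    no_error = (1 - error_rate) ** tag_length
    one_error = comb(tag_length, 1) * error_rate * (1 - error_rate) ** (tag_length - 1)
    two_or_more = 1 - no_error - one_error
    return no_error, one_error, two_or_more

p0, p1, p2_plus = tag_error_fractions()
print(p0, p1, p2_plus)
```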
Simpson index of diversity
Simpson's measure of diversity (Simpson 1949) (SID) is defined as:
$$\mathrm{SID} = -\ln D = -\ln \sum_i \frac{n_i (n_i - 1)}{N (N - 1)},$$
where $n_i$ is the count of the $i$th tag and $N$ is the total tag count. $D = 1$ (equivalently, $\mathrm{SID} = 0$) indicates that one tag dominates all of the tag counts of the system, which means that there is no diversity (highest dominance). The larger the value of SID, the higher the diversity (less dominance). SID is not strongly influenced by the sequencing depth, which is confirmed by simulation.
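As an illustration of the definition, the following sketch computes $D$ and $\mathrm{SID} = -\ln D$ from a list of tag counts. It assumes the negative-log form of the index written above; the function names are hypothetical.

```python
import math

def simpson_dominance(counts):
    """Simpson's D = sum_i n_i (n_i - 1) / (N (N - 1)) for a list of tag counts."""
    total = sum(counts)
    if total < 2:
        raise ValueError("need at least two tags in total")
    return sum(n * (n - 1) for n in counts) / (total * (total - 1))

def simpson_diversity(counts):
    """SID = -ln D; returns infinity when no tag is observed more than once."""
    d = simpson_dominance(counts)
    return math.inf if d == 0 else -math.log(d)

print(simpson_diversity([10]))       # a single dominating tag: no diversity
print(simpson_diversity([5, 5]))     # two equally abundant tags
print(simpson_diversity([1] * 10))   # every tag seen once
```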
Morisita-Horn similarity index
The Morisita-Horn similarity index, $C_{MH}(A,B)$, between two libraries, A and B, is defined as:
$$C_{MH}(A,B) \equiv \frac{2 \sum_i p_i(A)\, p_i(B)}{\sum_i \left[ p_i^2(A) + p_i^2(B) \right]},$$
where $p_i(A)$ and $p_i(B)$ are the proportions of gene (or tag) $i$ in libraries A and B, respectively. The MH similarity index is independent of the sequencing depth $N$.
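A minimal sketch of the index follows, assuming each library is given as a mapping from tag to raw count; tags missing from one library are treated as having proportion zero. The function name and data layout are illustrative.

```python
def morisita_horn(counts_a, counts_b):
    """Morisita-Horn similarity between two libraries given as {tag: count} dicts."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    numerator = 0.0
    denominator = 0.0
    for tag in set(counts_a) | set(counts_b):
        p_a = counts_a.get(tag, 0) / total_a   # proportion of this tag in library A
        p_b = counts_b.get(tag, 0) / total_b   # proportion of this tag in library B
        numerator += 2 * p_a * p_b
        denominator += p_a ** 2 + p_b ** 2
    return numerator / denominator

# Same proportions at different depths give full similarity;
# libraries with no shared tags give zero.
print(morisita_horn({"x": 10, "y": 30}, {"x": 1, "y": 3}))
print(morisita_horn({"x": 10}, {"y": 10}))
```

Because only proportions enter the formula, scaling all counts in one library leaves the index unchanged, which is the depth independence noted above.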
Sampling variance and biological variance
The high-throughput sequencing technology is modeled as a binomial sampling procedure with replacement. Mixing with the biological variability from different samples, another layer of variability is introduced due to sampling. Define $p_n$, the proportion of a tag with count $n$, as $p_n = n/N$, where $N$ is the total tag count. From our hierarchical model, the mean proportion $p$ of a gene is an unbiased estimator of the true abundance of genes. However, $\sigma^2(x_i)$, the observed variance of the scaled count $x_i$ of gene $i$, is the sum of the true biological variance from individuals and the sampling variance (see Supplemental material for the analytical proof):
$$\sigma^2(x) = N p_t (1 - p_t) + N (N - 1) \sigma_t^2.$$
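The decomposition can be checked numerically without simulation. The sketch below reads $p_t$ as the mean and $\sigma_t^2$ as the variance of the true proportion across individuals (our interpretation of the symbols), lets the true proportion take a few discrete values, and compares the exact variance of the resulting binomial mixture with the closed-form expression. All names and the example values are illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def mixture_variance(depth, proportions, weights):
    """Exact variance of the count x when the true proportion is drawn from a
    discrete distribution and x is then binomially sampled at the given depth."""
    pmf = [sum(w * binomial_pmf(k, depth, p) for p, w in zip(proportions, weights))
           for k in range(depth + 1)]
    mean = sum(k * q for k, q in enumerate(pmf))
    return sum((k - mean) ** 2 * q for k, q in enumerate(pmf))

def formula_variance(depth, proportions, weights):
    """N p_t (1 - p_t) + N (N - 1) sigma_t^2 from the text."""
    p_t = sum(w * p for p, w in zip(proportions, weights))
    var_t = sum(w * (p - p_t) ** 2 for p, w in zip(proportions, weights))
    return depth * p_t * (1 - p_t) + depth * (depth - 1) * var_t

props, wts = [0.01, 0.03], [0.5, 0.5]   # two hypothetical individuals
print(mixture_variance(50, props, wts), formula_variance(50, props, wts))
```

With no biological variability (all proportions equal) the second term vanishes and the expression reduces to the ordinary binomial variance.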
Normalization using nonparametric empirical Bayes correction
We implement the empirical Bayes using the simple Good-Turing estimator (SGT) (Gale and Sampson 1995). Assume the total number of all unique tags in the mRNA is $s$, and $p_i$ is the true proportion of tag $i$, which is what we want to estimate from data. The empirical Bayes estimate of an observed tag count $r$ is:
$$r^* = (r + 1)\, n_{r+1} / n_r,$$
where $n_r$ is the number of tags with count $r$. Thus, the expected total chance of all tags that are each represented $r$ times ($r \geq 1$) is $(r + 1)\, n_{r+1}/N$, where $N$ is the sequencing depth ($N = n_1 + 2n_2 + 3n_3 + \ldots$). Therefore, the expected total chance of all tags represented in the sample is $(2n_2 + 3n_3 + \ldots)/N = 1 - n_1/N$. In SGT, the proportion of undetected tags, $P_0$, is estimated as
$$P_0 = n_1 / N,$$
where $n_1$ represents the number (frequency) of unique tags with count one. The corrected total tag count after SGT, $N^*$, is $N^* = \sum_r n_r r^*$. The empirical Bayes estimator for the proportion of a gene with count $r$, $p_r^*$, is renormalized by $N^*$ as
$$p_r^* = (1 - P_0)\, r^* / N^*.$$
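The estimator above can be sketched in a few lines. Note the assumptions: this uses the raw Turing ratio $(r+1)\,n_{r+1}/n_r$ rather than the smoothed $n_r$ values of the full SGT procedure of Gale and Sampson, and it falls back to $r^* = r$ whenever $n_{r+1} = 0$, which is a simplification introduced here for illustration. The function name and input format are hypothetical.

```python
from collections import Counter

def good_turing(tag_counts):
    """tag_counts: {tag: observed count >= 1}.
    Returns (P0, r_star, p_star), where r_star and p_star are keyed by count r."""
    n = Counter(tag_counts.values())             # n[r] = number of tags seen r times
    depth = sum(r * n_r for r, n_r in n.items()) # N = n1 + 2 n2 + 3 n3 + ...
    p0 = n.get(1, 0) / depth                     # estimated proportion of unseen tags
    r_star = {}
    for r in n:
        if n.get(r + 1, 0) > 0:
            r_star[r] = (r + 1) * n[r + 1] / n[r]
        else:
            r_star[r] = float(r)                 # assumption: no adjustment when n_{r+1} = 0
    n_star = sum(n[r] * r_star[r] for r in n)    # corrected total tag count N*
    p_star = {r: (1 - p0) * r_star[r] / n_star for r in n}
    return p0, r_star, p_star

counts = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3}
p0, r_star, p_star = good_turing(counts)
print(p0, r_star, p_star)
```

By construction the corrected proportions of all observed tags sum to $1 - P_0$, leaving probability mass $P_0$ for tags that were never sequenced.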
Variance of Good-Turing estimator for unseen tags
The variance of $P_0$ can be calculated as follows: $\mathrm{Var}(P_0) = \mathrm{Var}(n_1)/N^2$, with $n_1 = \sum_i N p_i (1 - p_i)^{N-1}$ in expectation under the assumption of the binomial sampling approximation. Introduce a new random variable $x_i$: $x_i = 1$ if the $i$th tag is sequenced exactly once at sequencing depth $N$, and $x_i = 0$ otherwise. Then:
$$E\left(n_1^2\right) = E\left[\left(\sum_i x_i\right)^2\right] = E\left[\sum_{i,j} x_i x_j\right] = \sum_i E\left(x_i^2\right) + \sum_{i \neq j} E\left(x_i x_j\right)$$
$$= \sum_i N p_i (1 - p_i)^{N-1} + \sum_{m \neq n} N p_m (1 - p_m)^{N-1}\, N p_n (1 - p_n)^{N-1},$$
and
$$\left[E(n_1)\right]^2 = \left[\sum_i E(x_i)\right]^2 = \left[\sum_i N p_i (1 - p_i)^{N-1}\right]^2 = \sum_{i,j} N p_i (1 - p_i)^{N-1}\, N p_j (1 - p_j)^{N-1}$$
$$= \sum_i N^2 p_i^2 (1 - p_i)^{2(N-1)} + \sum_{m \neq n} N p_m (1 - p_m)^{N-1}\, N p_n (1 - p_n)^{N-1}.$$
Thus,
$$\mathrm{Var}(P_0) = \mathrm{Var}(n_1)/N^2 = \left[\sum_i N p_i (1 - p_i)^{N-1} - \sum_i N^2 p_i^2 (1 - p_i)^{2(N-1)}\right] / N^2 = \frac{P_0}{N} - \sum_i p_i^2 (1 - p_i)^{2(N-1)},$$
which gives:
$$\mathrm{Var}(P_0) < \frac{P_0}{N}.$$
This indicates that the Good-Turing estimator for unseen tags is a stable estimator, which is shown in Supplemental Figure S2.

Saturation curve calculation
We define the total number of all unique tags in the mRNA as $s$ and $p_i$ as the true proportion of tag $i$. $p_i$ is estimated from the data. Thus, the mean number of unseen tags according to binomial sampling is:
$$n_0(N) = \sum_i C_N^0\, p_i^0 (1 - p_i)^N = \sum_i (1 - p_i)^N,$$
where the summation is over all the possible unique best tags and $N$ is the sequencing depth. Thus, the number of detected unique best tag genes is the total number of unique best tag genes minus $n_0(N)$, as shown in Figure 4D.

Variance stabilization
For the variance to mean relationship observed in our data, the correct transformation to stabilize variance is the logarithm transformation, based on the delta method. We provide the proof as follows. Suppose a random variable $x$ follows a distribution with mean $\mu$ and variance $\sigma^2$. Consider a transformation $g(x)$. The Taylor expansion of $g(x)$ around $\mu$ up to the first order is $g(x) \approx g(\mu) + (x - \mu)\, g'(\mu)$. Thus, the transformed variable $g(x)$ has approximate mean $g(\mu)$ and approximate variance $\mathrm{Var}[g(x)] \approx \sigma^2 [g'(\mu)]^2$. In our data, $\mu$ and $\sigma^2$ satisfy the observed dependency $\sigma^2 \sim \mu^2$, yielding $\mathrm{Var}[g(x)] \approx \mu^2 [g'(\mu)]^2$. Assuming the transformation $g$ stabilizes the variance, $\mathrm{Var}[g(x)]$ is a constant independent of $\mu$, and thus $g'(\mu) = c/\mu$, where $c$ is a constant. Integrating with respect to $\mu$ gives that the form of the stabilization transformation $g$ should be $g(x) = \log x$.

Network and pathway analysis using METACORE
Network and pathway analysis using METACORE was performed essentially as previously described (Nikolsky et al. 2008). Specific details are in the Supplemental Methods.

Databases used
The transcription-factor gene list for humans is obtained from NCBI (http://www.ncbi.nlm.nih.gov). Go to Entrez Gene and search for human transcription factor. After filtering out nonhuman genes, 1658 human genes encoding for transcription factors are in this list. The seven normal and seven cancer raw SAGE-Seq data have been deposited in GEO (accession no. GSE24491).

Acknowledgments
We thank Andrea Richardson (Brigham and Women’s Hospital) for her help with the acquisition of breast tumor samples; Haiyan Huang, Li Cai, Molin Wang, and David Harrington for valuable discussions; and Love Nickerson for English proofreading. This work was supported by the Friends of Dana-Farber Women’s Cancer Program (X.S.L.), NIH R01 1HG004069 (X.S.L.), NCI P50 CA89393 CA and R01 CA116235-04S1 (K.P.), the AVON Foundation (K.P.), Terri Brodeur Breast Cancer Research Foundation (S.C.), and the Susan G. Komen Foundation PDF0707996 (M.S., R.M.).
Author contributions: Z.W. did the computational analysis. Z.W., X.S.L., and K.P. designed the study and wrote the manuscript. Z.W., X.S.L., C.M., A.S., and J.L. developed the analytical methodology. S.C., M.S., R.M., M.B., and T.N. carried out experiments and analyzed data. S.S. provided normal tissue samples for the study.

References
Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, Merril CR, Wu A, Olde B, Moreno RF, et al. 1991. Complementary DNA sequencing: Expressed sequence tags and human genome project. Science 252: 1651–1656.
Baggerly KA, Deng L, Morris JS, Aldaz CM. 2003. Differential expression in SAGE: Accounting for normal between-library variation. Bioinformatics 19: 1477–1483.
Baggerly KA, Deng L, Morris JS, Aldaz CM. 2004. Overdispersed logistic regression for SAGE: Modelling multiple groups and covariates. BMC Bioinformatics 5: 144. doi: 10.1186/1471-2105-5-144.
Bloushtain-Qimron N, Yao J, Snyder EL, Shipitsin M, Campbell LL, Mani SA, Hu M, Chen H, Ustyansky V, Antosiewicz JE, et al. 2008. Cell type-specific DNA methylation patterns in the human breast. Proc Natl Acad Sci 105: 14076–14081.
Cai L, Huang H, Blackshaw S, Liu JS, Cepko C, Wong WH. 2004. Clustering analysis of SAGE data using a Poisson approach. Genome Biol 5: R51. doi: 10.1186/gb-2004-5-7-51.
Dean M. 2009. ABC transporters, drug resistance, and cancer stem cells. J Mammary Gland Biol Neoplasia 14: 3–9.
Fisher RA. 1922. On the mathematical foundations of theoretical statistics. Philos Trans R Soc Lond 222: 309–368.
Gale WA, Sampson G. 1995. Good-Turing frequency estimation without tears. J Quant Ling 2: 217–237.
Good IJ. 1953. The population frequencies of species and the estimation of population parameters. Biometrika 40: 237–264.
He Y, Vogelstein B, Velculescu VE, Papadopoulos N, Kinzler KW. 2008. The antisense transcriptomes of human cells. Science 322: 1855–1857.
Huber W, von Heydebreck A, Sültmann H, Poustka A, Vingron M. 2002. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18: S96–S104.
Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. 2003. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4: 249–264.
Jiang H, Wong WH. 2008. SeqMap: Mapping massive amount of oligonucleotides to the genome. Bioinformatics 24: 2395–2396.
Johnson WE, Li W, Meyer CA, Gottardo R, Carroll JS, Brown M, Liu XS. 2006. Model-based analysis of tiling-arrays for ChIP-chip. Proc Natl Acad Sci 103: 12457–12462.
Johnson DS, Mortazavi A, Myers RM, Wold B. 2007. Genome-wide mapping of in vivo protein-DNA interactions. Science 316: 1497–1502.
Kharchenko PV, Tolstorukov MY, Park PJ. 2008. Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol 26: 1351–1359.
Li S, Huang S, Peng SB. 2005. Overexpression of G protein-coupled receptors in cancer cells: Involvement in tumor progression. Int J Oncol 27: 1329–1339.
Lu J, Tomfohr JK, Kepler TB. 2005. Identifying differential expression in multiple SAGE libraries: An overdispersed log-linear model approach. BMC Bioinformatics 6: 165. doi: 10.1186/1471-2105-6-165.
Magurran AE. 2003. The commonness, and rarity, of species. In Measuring biological diversity, p. 18, Wiley-Blackwell, Hoboken, NJ.
Morrissy AS, Morin RD, Delaney A, Zeng T, McDonald H, Jones S, Zhao Y, Hirst M, Marra MA. 2009. Next-generation tag sequencing for cancer gene expression profiling. Genome Res 19: 1825–1835.
Nikolsky Y, Sviridov E, Yao J, Dosymbekov D, Ustyansky V, Kaznacheev V, Dezso Z, Mulvey L, Macconaill LE, Winckler W, et al. 2008. Genome-wide functional synergy between amplified and mutated genes in human breast cancer. Cancer Res 68: 9532–9540.
Nikolsky Y, Kirillov E, Zuev R, Rakhmatulin E, Nikolskaya T. 2009. Functional analysis of OMICs data and small molecule compounds in an integrated “knowledge-based” platform. Methods Mol Biol 563: 177–196.
Orlitsky A, Santhanam NP, Zhang J. 2003. Always Good Turing: Asymptotically optimal probability estimation. Science 302: 427–431.
Polyak K, Riggins GJ. 2001. Gene discovery using the serial analysis of gene expression technique: Implications for cancer research. J Clin Oncol 19: 2948–2958.
Robbins H. 1956. An empirical Bayes approach to statistics. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 157–163. University of California Press, Berkeley, CA.
Robinson MD, Smyth GK. 2007. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics 23: 2881–2887.
Schena M, Shalon D, Davis RW, Brown PO. 1995. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270: 467–470.
Shendure J, Ji H. 2008. Next-generation DNA sequencing. Nat Biotechnol 26: 1135–1145.
Shipitsin M, Campbell LL, Argani P, Weremowicz S, Bloushtain-Qimron N, Yao J, Nikolskaya T, Serebryiskaya T, Beroukhim R, Hu M, et al. 2007. Molecular definition of breast tumor heterogeneity. Cancer Cell 11: 259–273.
Simpson EH. 1949. Measurement of diversity. Nature 163: 688.
Song JS, Johnson WE, Zhu X, Zhang X, Li W, Manrai AK, Liu JS, Chen R, Liu XS. 2007. Model-based analysis of two-color arrays (MA2C). Genome Biol 8: R178. doi: 10.1186/gb-2007-8-8-r178.
Storey JD, Tibshirani R. 2003. Statistical significance for genome-wide studies. Proc Natl Acad Sci 100: 9440–9445.
Tusher VG, Tibshirani R, Chu G. 2001. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci 98: 5116–5121.
Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. 1995. Serial analysis of gene expression. Science 270: 484–487.
Velculescu VE, Vogelstein B, Kinzler KW. 2000. Analysing uncharted transcriptomes with SAGE. Trends Genet 16: 423–425.
Wolda H. 1983. Diversity, diversity indices and tropical cockroaches. Oecologia 58: 290–298.
Yang YH, Speed T. 2002. Design issues for cDNA microarray experiments. Nat Rev Genet 3: 579–588.
Zhu J, Zhang B, Schadt EE. 2008. A systems biology approach to drug discovery. Adv Genet 60: 603–635.
Received April 1, 2010; accepted in revised form September 24, 2010.
Method
Scaffolding a Caenorhabditis nematode genome with RNA-seq
Ali Mortazavi,1,2,3 Erich M. Schwarz,1,2,3 Brian Williams,1 Lorian Schaeffer,1 Igor Antoshechkin,1 Barbara J. Wold,1 and Paul W. Sternberg1,2,4
1 Division of Biology, California Institute of Technology, Pasadena, California 91125, USA; 2 Howard Hughes Medical Institute, Pasadena, California 91125, USA

Efficient sequencing of animal and plant genomes by next-generation technology should allow many neglected organisms of biological and medical importance to be better understood. As a test case, we have assembled a draft genome of Caenorhabditis sp. 3 PS1010 through a combination of direct sequencing and scaffolding with RNA-seq. We first sequenced genomic DNA and mixed-stage cDNA using paired 75-nt reads from an Illumina GAII. A set of 230 million genomic reads yielded an 80-Mb assembly, with a supercontig N50 of 5.0 kb, covering 90% of 429 kb from previously published genomic contigs. Mixed-stage poly(A)+ cDNA gave 47.3 million mappable 75-mers (including 5.1 million spliced reads), which separately assembled into 17.8 Mb of cDNA, with an N50 of 1.06 kb. By further scaffolding our genomic supercontigs with cDNA, we increased their N50 to 9.4 kb, nearly double the average gene size in C. elegans. We predicted 22,851 protein-coding genes, and detected expression in 78% of them. Multigenome alignment and data filtering identified 2672 DNA elements conserved between PS1010 and C. elegans that are likely to encode regulatory sequences or previously unknown ncRNAs. Genomic and cDNA sequencing followed by joint assembly is a rapid and useful strategy for biological analysis.
[Supplemental material is available online at http://www.genome.org. The sequence data from this study have been submitted to GenBank (http://www.ncbi.nlm.nih.gov/Genbank/) under accession no. AEHI01000000 and to the NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi) under accession no. SRA023844.]
3 These authors contributed equally to this work.
4 Corresponding author. E-mail [email protected]; fax (626) 568-8012.
Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.111021.110.

Sanger sequencing of eukaryotic genomes and transcriptomes has enabled large-scale gene discovery and evolutionary comparisons, but has also been a laborious process requiring multiple centers, millions of dollars, and years per genome. The human genome draft sequence was at first so fragmented that mRNAs and expressed sequence tags (ESTs) were used to scaffold it (Kent and Haussler 2001); three more years were needed to drive the human genome sequence to its near-finished state (International Human Genome Sequencing Consortium 2004). The search for variation in the human genome led to new DNA sequencing methods, producing short reads at much lower cost that can be aligned to a reference genome (Bentley et al. 2008). To allow de novo genome assembly from these short reads, programs that use de Bruijn graphs rather than classic overlapping have been developed and applied to microbial genomes (Zerbino and Birney 2008; Chaisson et al. 2009). Similarly, a transcriptome can be assembled into expressed sequence tags either by mapping reads onto a genome sequence or by de novo assembly (Haas and Zody 2010).
There are many organisms that have been studied by, at most, a small cohort of researchers, which are unlikely ever to be sequenced by a genome center, but which, nevertheless, could be useful to biology and medicine if their genome could be characterized. For instance, there are between 40,000 and 10 million nematode species (Blaxter 1998), many of which are important parasites and pests. As an instance of their possible analysis, we have generated a draft genome and transcriptome of the nematode Caenorhabditis sp. 3 PS1010 (NCBI taxonomy ID 96668; henceforth, “PS1010”). PS1010 is a nematode in the same genus as C. elegans, but is more distantly related to C. elegans and C. briggsae than they are to each other (Fig. 1; Kiontke and Fitch 2005; K Kiontke, pers. comm.). DNA sequence divergence between PS1010 and C. elegans is comparable to that between mammals and birds (Kiontke and Fitch 2005). Nevertheless, PS1010 still has identifiable, highly conserved noncoding DNA elements in common with C. elegans (Kuntz et al. 2008).
We have sequenced the genome and transcriptome of PS1010 using Illumina paired and unpaired 75-nt reads, used Velvet (Zerbino and Birney 2008) to assemble supercontigs from both cDNA and genomic DNA, and then assembled both sequence sets into a gene-centric draft genome assembly of 79.8 Mb (Fig. 2). This approach both improved the assembly and produced better gene models over the entire expression range of the transcriptome. Our assembly has a supercontig N50 of 9.4 kb, nearly twice the average gene lengths of C. elegans and C. briggsae (Stein et al. 2003); it encodes a full Caenorhabditis proteome, and 2672 highly conserved DNA elements that may be regulatory.
20:1740–1747 © 2010 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/10; www.genome.org

Figure 1. Phylogenetic relationship of Caenorhabditis sp. 3 PS1010 to representative Caenorhabditis species and other nematode species discussed in this work, as determined by Kiontke et al. (2007), Meldal et al. (2007), and K Kiontke (pers. comm.). PS1010 is an outgroup to previously sequenced Caenorhabditis species, all of which closely resemble C. elegans, and most of which are formally considered part of an Elegans species group (Sudhaus and Kiontke 1996; Kiontke et al. 2007). PS1010 itself falls into a newly characterized Drosophilae species group, whose members substantially differ from Elegans species in morphology and mating behavior (K Kiontke, pers. comm.). However, PS1010 is still more closely related to C. elegans than non-Caenorhabditis nematodes such as Pristionchus, Meloidogyne, and Brugia, and hence, is more likely to show noncoding sequence conservation with C. elegans and C. briggsae. Phylogram not drawn to scale.

Results
Sequencing of the genome and RNA-seq-mediated scaffolding
We sequenced and assembled 200-bp fragment libraries at 100-fold nominal combined coverage (assuming a 100-Mb genome like C. elegans) to get an initial assembly of 79.8 Mb. This assembly’s supercontig N50 improved from 1.5 kb to 5.0 kb with the addition of 375- and 450-bp fragment libraries, each having 35-fold coverage, for a final nominal coverage of 170-fold; however, coverage dropped significantly in regions of very low and very high GC content. These numbers do not include 497 supercontigs, totaling 4.6 Mb with an N50 of 64.7 kb and with ≥90% identity to genome sequences of Escherichia coli (PS1010’s laboratory food source); 10% of the raw reads mapped to E. coli sequences.
We also sequenced the mixed-stage larval transcriptome of PS1010 to a depth of 53.2 million 2 × 75 nt reads. Using pair-mates that mapped to different Velvet genomic supercontigs, we performed RNA-seq-mediated DNA scaffolding with the RNAPATH module in ERANGE 3.2 (Mortazavi et al. 2008) and generated a 79.8-Mb draft genome of PS1010 (Table 1); 15,450 Velvet supercontigs were placed into 4072 RNAPATH supercontigs. To test the completeness of our assembly, we mapped its supercontigs onto 429 kb of PS1010 sequences already in GenBank, including 417 kb of pilot genome sequence (Kuntz et al. 2008).
A total of 576 supercontigs generated by Velvet covered 90.7% of the previously known 429 kb. In contrast, the RNAPATH-based assembly covered the same sequence with only 402 supercontigs (Fig. 3). RNAPATH excluded 138 supercontigs that had more than 50% overlap with 81 PS1010 genomic repeats that amounted to an additional 1.1% of pilot sequence (or equivalently 10% of the gap sequence) along with 1% of standalone intronic supercontigs from the cDNA-mediated scaffolds (Fig. 3). The rest of the missing pilot sequence is in regions of low coverage with very low or very high GC content. Supercontigs composed of intergenic DNA or genes lacking RNA-seq data were left untouched. While still fragmented, this assembly is sufficient to analyze genes and should also contain a substantial fraction of noncoding elements conserved between PS1010 and C. elegans.
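Two summary quantities used throughout this section, N50 and nominal coverage, can be computed as in the following sketch. It uses the common definition of N50 (the length of the supercontig at which the cumulative length, taken from longest to shortest, first reaches half the assembly); whether Velvet or RNAPATH apply exactly this convention is an assumption here. The coverage example assumes that all 230 million genomic reads are 75 nt long and that the genome is 100 Mb, as stated in the text; function names are illustrative.

```python
def n50(lengths):
    """Length L such that supercontigs of length >= L contain at least half of all bases."""
    if not lengths:
        raise ValueError("no supercontig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

def nominal_coverage(n_reads, read_length, genome_size):
    """Total sequenced bases divided by the assumed genome size."""
    return n_reads * read_length / genome_size

print(n50([2, 2, 2, 3, 3, 4, 8, 8]))                     # toy assembly
print(nominal_coverage(230_000_000, 75, 100_000_000))    # close to the 170-fold quoted above
```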
Assembly of the transcriptome and gene annotation
To optimize our parameters for assembling PS1010 RNA-seq data into cDNA supercontigs, we first tested our parameters on a staged C. elegans L3 2 × 75 RNA-seq data set for which the correct outputs of assembly would be largely known. We found that Velvet typically made better assemblies of cDNA from moderately expressed genes than from strongly expressed ones. We thus assembled cDNA from such high-expression genes from a small subset of RNA-seq reads (one million), while doing a separate assembly with all of the reads for low-abundance cDNA and merging the resulting supercontigs by concatenating the FASTA sequences. Using this two-tiered strategy on our PS1010 RNA-seq data, we assembled 17.8 Mb of cDNA into 27,923 supercontigs with an N50 of 1.06 kb, 99% of which mapped back onto the genome. In contrast to genomic DNA, less than 0.02% of the RNA-seq reads mapped to E. coli.
Velvet cDNA supercontigs were used as EST hints for the AUGUSTUS genefinder run with Caenorhabditis settings (Stanke et al. 2008) to predict 22,851 genes encoding 28,978 proteins. These gene models were used to evaluate expression levels with ERANGE (Supplemental Fig. S1); 63.2% of 161,032 predicted exons in 78.1% of the genes showed expression over 1 RPKM (read per kilobase per million reads) in our PS1010 RNA-seq data, corresponding to ≥6 reads for the median PS1010 exon length of 143.
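The relationship between an RPKM threshold and a raw read count can be illustrated with a short calculation. The sketch below assumes that the normalizing total is the 47.3 million mappable reads quoted in the abstract and that the exon length of 143 is in nucleotides; ERANGE's exact bookkeeping of spliced and multiply mapped reads is not reproduced here, and the function names are hypothetical.

```python
def rpkm(read_count, length_nt, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return read_count / (length_nt / 1e3) / (total_mapped_reads / 1e6)

def reads_for_rpkm(rpkm_value, length_nt, total_mapped_reads):
    """Number of reads on a feature that corresponds to a given RPKM."""
    return rpkm_value * (length_nt / 1e3) * (total_mapped_reads / 1e6)

# Median exon (143 nt) at 1 RPKM with 47.3 million mapped reads (assumed total):
print(reads_for_rpkm(1.0, 143, 47_300_000))
```

Under these assumptions a median-length exon at 1 RPKM carries between six and seven reads, in line with the figure given above.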
Figure 2. Sequencing strategy. (A) Genomic reads are assembled using the Velvet short read assembler and then filtered for high similarity to E. coli. RNA-seq paired reads are mapped using a combination of Bowtie and BLAT onto this preliminary genomic assembly. The RNA-seq reads are then imported into ERANGE, where those reads with ends on separate supercontigs serve as input to the RNAPATH module. This process can be repeated with trimmed reads to increase mappable reads, if necessary. (B) Paired RNA reads (each pair represented by a blue and an orange triangle connected by a green line), where each read end maps to a separate genomic supercontig, can be used to scaffold (i.e., to join, order, and orient) those genomic supercontigs based on the two read ends pointing toward each other.
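The joining step described in panel B can be sketched as follows. This is not the RNAPATH algorithm itself: it is a simplified illustration that only groups supercontigs linked by read pairs, ignores ordering and orientation, and uses a hypothetical `min_pairs` threshold to discard weakly supported links.

```python
from collections import Counter

def scaffold_groups(pair_hits, min_pairs=2):
    """pair_hits: iterable of (supercontig_of_read1, supercontig_of_read2).
    Returns sorted groups of supercontigs connected by >= min_pairs read pairs."""
    links = Counter()
    for contig_a, contig_b in pair_hits:
        if contig_a != contig_b:                       # only pairs spanning two supercontigs
            links[tuple(sorted((contig_a, contig_b)))] += 1

    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (contig_a, contig_b), n_pairs in links.items():
        if n_pairs >= min_pairs:
            parent[find(contig_a)] = find(contig_b)    # merge the two groups

    groups = {}
    for contig in list(parent):
        groups.setdefault(find(contig), []).append(contig)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)

pairs = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c2", "c3"),
         ("c4", "c5"), ("c1", "c1")]
print(scaffold_groups(pairs))
```

A full scaffolder would additionally use the strand and position of each read end to order and orient the grouped supercontigs, as the caption describes.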
Mortazavi et al. Table 1.
Assembly statistics
found 2672 filtered elements in all, comprising 0.08% of the C. elegans genome, rangLargest No. of ing from 7 to 160 bp in size, with an average supercontig N50 No. of genes supercontigs Total of 29 bp; 9.5% of these elements are $50 bp (kb) (kb) predicted ($100 bp) Assembly (Mb) long (Supplemental Fig. S2). Two elements fall into a lin-39 enhancer conserved beGenomic Velvet 79.8a 44,965 45.7 5.1 27,741 Genomic Velvet+RNAPATH 79.8a 33,587 96.3 9.4 22,851 tween C. elegans and PS1010 (Kuntz et al. RNA-seq Velvet 17.8 27,923 14.5 1.1 — 2008). In C. elegans, these elements disproportionately reside near genes annotated a An additional 4.6 Mb of Velvet supercontigs were filtered out because of their high similarity to with 28 Gene Ontology (GO) terms (SupE. coli. plemental Table S5). The most significantly enriched terms relate to reproduction, growth, and embryonic development. To characterize the completeness and content of PS1010’s One possible explanation of our persistent residue of longer predicted genes, we used OrthoMCL (Li et al. 2003) and HMMER/ elements might be that they overlap highly conserved noncoding PFAM-A (Finn et al. 2008; http://hmmer.janelia.org) to identify RNA genes whose expression is rare enough to have eluded anorthology groups and protein domains from PS1010, C. elegans, notation. To test this idea, we checked our elements for overlaps C. briggsae, Pristionchus pacificus (Dieterich et al. 2008), Meloidogyne with 3672 putative ncRNAs predicted in C. elegans by Missal et al. hapla (Opperman et al. 2008), M. incognita (Abad et al. 2008), and (2006) with RNAz, of which 1290 remained completely novel by Brugia malayi (Ghedin et al. 2007). All three Caenorhabditis proJanuary 2010 (i.e., they still did not overlap annotated protein- or teomes had comparable similarity to those of the other nematodes nonprotein-coding exons in WormBase WS210). A total of 128 of (Fig. 4). 
A total of 5623 PS1010 genes (25% of 22,851) showed strict orthology (1:1 gene ratios in an orthology group) with C. elegans genes; this is ~70% as many gene pairs as had strict orthology between C. briggsae and C. elegans. 11,633 (51%) displayed homology with at least one gene in another nematode genome; 11,630 (51%) encoded at least one of 3466 PFAM-A domains; 45 PFAM-A domains, encoded by 69 PS1010 genes, were found both in PS1010 and at least one non-Caenorhabditis nematode, yet were missing from both C. elegans and C. briggsae, suggesting that they have been specifically lost from the Elegans group (Supplemental Table S1). Two of these domains, encoded by one PS1010 gene apiece, are found in all four non-Caenorhabditis nematodes that we searched. Conversely, 85 PFAM-A domains were found in all nematode genomes except for PS1010 (Supplemental Table S2); at least some of these domains might exist within genes present in PS1010, but are missing from our assembly. (By comparison, 19 and 30 PFAM-A domains were found in all nematode genomes except for C. elegans and C. briggsae, respectively.) Some PFAM-A domains are represented in PS1010 by up to 370 genes, indicating that our genome assembly successfully captured extensively paralogous gene sets: for instance, 602 genes in PS1010 encode possible serpentine receptors, and 54 genes encode major sperm proteins (Supplemental Table S3). The number of predicted receptors is close to that for P. pacificus (613), though half that for C. elegans (1411) and C. briggsae (1120).

Identifying conserved noncoding elements

To identify noncoding sequences in the C. elegans genome that are highly conserved between C. elegans and PS1010, including ones likely to be regulatory, we used TBA/MULTIZ (Blanchette et al. 2004) to align PS1010 to the C. elegans and C. briggsae genomes and scanned the alignments with phastCons (Siepel et al. 2005) for three-species conservation. A total of 6.21% of the C. elegans genome showed conservation in 95,712 elements with an average size of 65 bp; 97.2% of these elements overlapped with repetitive DNA, known protein-coding or ncRNA exons, or alternative exon predictions (in some cases generated by us from ESTs and RNA-seq data; Supplemental Table S4). For example, the unc-2 gene had 53 unfiltered phastCons elements, but only three passed all of our filters, one of which marked a potential new promoter (Fig. 5A). Overlaps were conservatively defined as any match of ≥1 nt. We

our elements (4.8%) indeed overlapped RNAz predictions, and 72 elements (2.7%) had ≥80% overlap with the novel RNAz subset; the latter set of elements ranged in size from 10 to 120 nt, with a mean of 35 nt (Supplemental Table S4; Supplemental Fig. S2). However, 95% of our elements had no overlap with RNAz predictions, and this nonoverlapping majority ranged from 7 to 160 nt in size with a mean of 29 nt. This observation suggests that the filtered elements may identify novel, highly conserved ncRNAs, but that such cryptic ncRNAs do not currently account for either the bulk of elements or even most of the larger ones.

If these elements are genuinely regulatory, they should share recurrent motifs that at least partially match known regulatory sequences. We detected 22 motifs in 1193 elements with MEME (Bailey and Elkan 1994) and FIMO (Bailey et al. 2009). We then compared them with published motifs with TOMTOM (Gupta et al. 2007), finding significant similarities to motifs from C. elegans and the general literature (Table 2; Supplemental Table S6). Our
Figure 3. An example of Velvet supercontigs (blue) and RNAPATH supercontigs (red) on 20 kb of Sanger-sequenced PS1010 fosmids (Kuntz et al. 2008) along with the ERANGE mapped coverage of the transcriptome reads (red), Velvet assembly of transcriptome (green), AUGUSTUS gene predictions on the original Velvet assembly and AUGUSTUS with Velvet-computed cDNA sequences assisted predictions on the final cDNA-scaffolded assembly. cDNA-mediated scaffolding combined with a genefinder improves the accuracy of gene models by allowing genes fragmented between genomic supercontigs to be on the same scaffold (broken line box). Summary statistics on the right are for the entire Sanger-sequenced 429 kb of sequence (corresponding to ~0.5% of the genome).
Figure 4. The pattern of pairwise strict orthologies between the three Caenorhabditis species and the non-Caenorhabditis nematodes P. pacificus and M. hapla matches their known phylogeny. In particular, the numbers of strict orthologs between PS1010 and each of the outgroup species are comparable to those seen between C. elegans and C. briggsae and the same outgroups, suggesting that our protein-coding gene set is largely complete.
most statistically significant predicted motif was equivalent to a slr-2/jmjc-1-dependent stress-response motif conserved between ecdysozoa and deuterostomes (Fig. 5B; Kirienko and Fay 2010). Other predicted motifs matched a miRNA promoter-associated motif (Ohler et al. 2004), the core promoter SP1 site (Li et al. 2004), muscle-specific motifs 1–4 of Zhao et al. (2007), early and late PHA-4 binding sites (Gaudet et al. 2004), the M-2 pharyngeal muscle motif (Ao et al. 2004), and the E2F binding site (van den Heuvel and Dyson 2008). Subsets of those elements with matches to muscle-specific motifs were disproportionately found near genes annotated with GO terms for locomotion, body morphogenesis, and nematode larval development. The unc-2 element in Figure 5A contains a match to the E2F-like motif 2-30, which is also associated with locomotion; unc-2 itself encodes a voltage-gated calcium channel α1 subunit required for normal movement (Mathews et al. 2003). Other predicted motifs were of comparable statistical significance, but did not match sites with known functions. Four of them matched previous predictions by Beer and Tavazoie (2004), indicating that at least some of these novel motifs are likely to be real, uncharacterized regulatory sites conserved between PS1010 and C. elegans (Table 2; Fig. 5C).
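The consensus sequences reported for these motifs use IUPAC degenerate nucleotide codes (for example, S for C or G and R for A or G). As a rough illustration of what a "match" to such a consensus means, the following Python sketch scans both strands of a sequence for exact consensus matches. It is a simplification under stated assumptions: the motifs in this study were position weight matrices scored with FIMO, not exact consensus strings, and the function names and test sequence here are hypothetical.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def reverse_complement(seq):
    """Reverse-complement an upper-case A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_matches(consensus, sequence):
    """Return sorted (start, strand) pairs for exact consensus matches.

    Starts are 0-based positions on the sequence as given; the zero-width
    lookahead lets overlapping matches be reported.
    """
    k = len(consensus)
    body = "".join(IUPAC[base] for base in consensus.upper())
    pattern = re.compile("(?=" + body + ")")
    seq = sequence.upper()
    hits = [(m.start(), "+") for m in pattern.finditer(seq)]
    # Scan the reverse complement and convert back to forward coordinates.
    hits += [(len(seq) - m.start() - k, "-")
             for m in pattern.finditer(reverse_complement(seq))]
    return sorted(hits)
```

For instance, `find_matches("SGCGCSRA", "TTGGCGCCAATT")` reports one hit on each strand, because the core of this made-up test sequence is palindromic.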
Discussion

We have carried out next-generation sequencing and analysis of the nematode Caenorhabditis sp. 3 PS1010, identifying approximately 18,000 expressed protein-coding genes in ~90% of its genome, along with approximately 2700 noncoding DNA elements highly conserved between PS1010 and C. elegans. PS1010 is a member of the Caenorhabditis genus, but is not part of the Elegans group (Kiontke and Fitch 2005), which includes C. briggsae, three other recently sequenced nematodes (C. remanei, C. brenneri, and C. japonica), and an increasing number of unnamed elegans-like species. Instead, PS1010 belongs to a newly defined Drosophilae group within Caenorhabditis (Fig. 1; K Kiontke, pers. comm.). PS1010's conservation of genes and noncoding DNA therefore defines traits likely to be strongly required throughout the Caenorhabditis genus, despite overt differences in morphology and behavior between Elegans and Drosophilae group species (Sudhaus and Kiontke 1996; K Kiontke, pers. comm.) and despite sequence divergence comparable to that between humans and birds (Kiontke and Fitch 2005).

In particular, the 2672 candidate DNA elements that passed our extensive filters probably encode either highly conserved regulatory elements or cryptic exons missed in the extensive annotation of C. elegans. While elements have an average size of 66 bp before being filtered with known exons, filtered elements average 29 bp in size (Supplemental Table S4). In addition, recurrent motifs found within the filtered elements include matches to several published regulatory motifs, and two elements mapped into a lin-39 enhancer previously shown to be conserved between C. elegans and PS1010 (Kuntz et al. 2008). These results are consistent with the hypothesis that many filtered elements are regulatory. Kuntz et al. (2008) found three other lin-39 enhancers conserved in PS1010 that our elements did not detect; we suspect that this arises from a limited ability of our gene-centric assembly to be aligned by TBA/MULTIZ in regions far from exons. This, in turn, suggests that a better PS1010 genome assembly might reveal significantly more than 2700 noncoding DNA elements to be conserved between C. elegans and PS1010.

One goal of our work was to devise an analytical tool kit for animal or plant genomes 70–300 Mb in size, usable by a small research group, with (in our case) a particular focus on nematode species. PS1010 was a good test case for this, because we had previously Sanger-sequenced fosmids representing roughly 0.5% of its genome (Kuntz et al. 2008) and, therefore, we could assess the quality of our genomic data through several rounds of sequencing and assembly. We were able to produce a genome assembly, which, though unsuitable for analyses of long-range regulation or multigene synteny, does support analyses of gene function, orthology, and short-range gene regulation. We also found that the PS1010
Figure 5. Conserved noncoding elements. (A) A highly conserved noncoding element (red star) in an intron of unc-2, as identified by phastCons and passing all of our filters, is a possible promoter element. Conserved PS1010 elements are typically subsets of the conserved elements shared between Elegans group genomes. (B) The most statistically significant predicted motif from the highly conserved noncoding elements (motif 1-1) (Table 2) is equivalent to the slr-2/jmjc-1-responsive motif of Kirienko and Fay (2010). (C) A functionally uncharacterized motif (1-8) closely resembles motif 4 of Beer and Tavazoie (2004).
genome and transcriptome could be determined effectively with a single round of sequencing (e.g., one run of an Illumina flow cell). This finding opens the prospect of a wider survey of the nematode phylum at a reasonable cost.

Can this approach be extended to the vast number of uncharacterized nematodes, most of which probably cannot be cultured in the laboratory? Setting aside the daunting issue of chromosome diminution in some nematode clades such as Ascaris (Müller and Tobler 2000), this will depend on whether PS1010 is representative of other nematode genomes in its polymorphism and repeat structure. We did not inbreed PS1010 before sequencing, as was necessary for species such as C. brenneri (Barrière et al. 2009). However, PS1010 was isolated from a small sample of Caenorhabditis sp. 3 worms and underwent years of continuous culture before being frozen (K Kiontke, pers. comm.). Thus, PS1010 had probably already undergone a bottleneck that lowered its polymorphism and facilitated assembly. For new species that are difficult or impossible to inbreed, DNA from a single worm would allow reconstructing (at worst) two haplotypes at the cost of deeper sequencing. Moreover, current short-read assemblers are designed to deal with error-prone reads and so should tolerate higher levels of polymorphism than their predecessors.

When sequencing DNA from one or a few worms of an unculturable species, constructing jumping libraries from paired ends of larger genomic fragments with specified lengths will probably not be an option for scaffolding genome assemblies. In this case, transcriptomes could be a useful alternative for local scaffolding of genomes: RNA, after being reverse-transcribed, can be amplified from small samples in the same way as genomic DNA. Moreover, such RNA-seq samples would be from whole organisms, and so would be likely to express large fractions of all genes at some level. While cDNA-scaffolded assemblies will not match the quality of assemblies based on jumping libraries and run the risk of excluding intronic fragments, their scaffolding will improve as the depth and variety of RNA-seq samples from different developmental stages increase. Such assemblies will be best for genes most strongly expressed in biologically important life stages (e.g., infectious larvae). This is similar to the analysis of a cancer transcriptome in the context of its matching cancer genome. Both features could help decipher genomes replete with intronic and intergenic repeats, such as those of the nematode Panagrellus redivivus (de Chastonay et al. 1990) and of some plants.

Velvet-assembled RNA-seq data makes gene predictions more reliable, and our overall strategy makes it feasible for individual laboratories to sequence the genomes of multicellular eukaryotes. cDNA-scaffolded assembly should thus enable draft genomes of many neglected organisms.

Table 2. Motifs predicted in highly conserved noncoding DNA elements

Motif | Description | Size (nt) | Consensus sequence | E-value | GO term (P-value <1 × 10^-6)
1-1 | slr-2/jmjc-1 | 11 | STCTGCGTCTC | 3.4 × 10^-140 |
1-2 | Novel | 15 | MGTGGSSRGASCCWA | 6.0 × 10^-131 |
1-3 | Novel | 11 | GTGGCCTAGAA | 3.1 × 10^-129 |
1-4 | Novel | 14 | GCAARYGCGCTCYA | 8.7 × 10^-135 |
1-5 | Muscle 3 | 15 | SMGMSMCSMSMCMSC | 1.5 × 10^-54 | Nematode larval development (GO:0002119; 1.31 × 10^-7)
1-6 | Muscle 1 | 15 | GASRRAGASASRSAG | 4.6 × 10^-59 | Locomotion (GO:0040011; 2.23 × 10^-7); body morphogenesis (GO:0010171; 4.93 × 10^-7)
1-8 | Uncharacterized: previously found by Beer and Tavazoie (2004) as highly significant motif (4th out of 375) | 11 | ACGACACTCCG | 6.6 × 10^-48 |
1-9 | miRNA 5′ flank/Sp1 | 8 | CYCCGCCC | 7.2 × 10^-42 |
1-10 | Novel | 10 | CTACAGTAAY | 1.6 × 10^-34 |
1-11 | Resembles early pha-4/Muscle 4 | 15 | RYGTSWBKGTGTKTG | 2.4 × 10^-31 | Locomotion (GO:0040011; 5.41 × 10^-7)
1-14 | Muscle 2 | 15 | AGRAGAWGAARAMGA | 4.4 × 10^-30 | Nematode larval development (GO:0002119; 4.53 × 10^-7)
1-15 | Uncharacterized: previously found by Beer and Tavazoie (2004); also has possible mammalian homolog (PF0082.1; Xie et al. 2005) | 11 | TGCGCCTTTAA | 1.9 × 10^-26 |
1-17 | Novel | 15 | GTCCKAGAGGASTAC | 1.8 × 10^-17 |
1-18 | Novel | 11 | GGTTCGAHYCC | 7.4 × 10^-14 |
1-19 | Uncharacterized: previously found in most significant 10% of Beer and Tavazoie (2004) motifs | 13 | TCGYKKCRAGACC | 4.9 × 10^-10 |
1-20 | Novel | 15 | WTTACWGTTTCAAAA | 4.9 × 10^-11 |
2-18 | Uncharacterized: motif 140 of Beer and Tavazoie (2004) | 15 | BCYCGTAAATCSACA | 3.5 × 10^-16 |
2-22 | Novel | 15 | GACMCCCAWMWYGMC | 9.4 × 10^-11 |
2-24 | Novel | 18 | CRKTKRATRCTCASSSAM | 3.8 × 10^-11 |
2-26 | Novel | 18 | ATYWKAWTTGACGMGCAA | 8.2 × 10^-6 |
2-27 | Novel | 11 | RRCTSAAAATB | 3.5 × 10^-25 |
2-30 | E2F | 8 | SGCGCSRA | 1.9 × 10^-2 | Locomotion (GO:0040011; 4.2 × 10^-7)
Methods

Worm culture

The strain PS1010 was obtained from the Caenorhabditis Genetics Center. These worms did not thrive on normal C. elegans growth media. We thus grew PS1010 on nutrient agar (Difco) supplemented with 0.1% v/v cholesterol and incubated with E. coli HB101 as food. Worms were grown to high density on five to ten 10-cm plates, collected with M9 buffer, cleaned of bacteria by sucrose centrifugation (Lewis and Fleming 1995), and bleached before growing cultures for DNA or RNA harvests.
Isolation of DNA and RNA

Recently bleached cultures of PS1010 were expanded on five to ten 10-cm nutrient agar/HB101 plates to starvation. After being starved for 1–2 d to rid them of E. coli, worms were collected with M9, sucrose-centrifuged, and snap-frozen with liquid nitrogen in ~100-µL aliquots before storing at −80°C. Worms were thawed and refrozen three times to promote cuticle breakage before extracting either genomic DNA or bulk RNA. Genomic DNA was extracted by two rounds of proteinase K digestion and phenol-chloroform extraction, with an intermediate step of RNase A digestion in TE; bulk RNA was extracted with the Qiagen RNeasy mini kit.
Transcriptome scaffolding of the genome using RNAPATH

The mapped RNA-seq reads were imported into ERANGE sqlite datasets, and paired mates with both uniquely mappable (ungapped or spliced) ends on different supercontigs were exported for genomic scaffolding by the new RNAPATH module within ERANGE 3.2; while we could have exported these reads instead to a general scaffolding program such as Bambus (Pop et al. 2004), we opted to have the code more tightly integrated within ERANGE. The read-mates were used to build an edge-weighted adjacency matrix of the supercontigs; only the top two edges per supercontig with weights greater than two were kept. Scaffolding proceeded by starting at leaves and following the highest-weighted edges, reverse-complementing supercontigs as necessary to keep read-mates oriented toward each other; supercontigs and edges that were included in a scaffold could not be reused in any subsequent scaffold. We repeated the scaffolding a second time to obtain the final assembly, which is available in GenBank (accession no. AEHI01000000) and WormBase.
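The scaffolding procedure described above can be sketched in Python as follows. This is a minimal illustration, not the RNAPATH code itself: the function names are hypothetical, orientation handling (reverse-complementing supercontigs) is omitted, an edge is assumed to be kept only when it ranks among the top two for both of its supercontigs, and cycles without leaves are simply left unscaffolded.

```python
from collections import Counter, defaultdict


def build_graph(mate_links, min_weight=3, max_edges=2):
    """Build a pruned, edge-weighted supercontig graph from read-mate links.

    mate_links: iterable of (supercontig_a, supercontig_b) pairs, one per
    read pair whose ends map uniquely to two different supercontigs.
    """
    weights = Counter(frozenset(link) for link in mate_links
                      if link[0] != link[1])
    ranked = defaultdict(list)
    for pair, weight in weights.items():
        if weight >= min_weight:              # keep weights greater than two
            a, b = sorted(pair)
            ranked[a].append((-weight, b))
            ranked[b].append((-weight, a))
    # Top-weighted neighbours of each supercontig (ties broken by name).
    top = {c: {n for _, n in sorted(links)[:max_edges]}
           for c, links in ranked.items()}
    graph = defaultdict(dict)
    for c, neighbours in top.items():
        for n in neighbours:
            if c in top[n]:                   # assumption: edge must be mutual
                graph[c][n] = weights[frozenset((c, n))]
    return graph


def scaffold(graph):
    """Walk from each leaf along the heaviest unused edge; no reuse."""
    used, scaffolds = set(), []
    leaves = sorted(c for c, nbrs in graph.items() if len(nbrs) == 1)
    for leaf in leaves:
        if leaf in used:
            continue
        path, current = [leaf], leaf
        used.add(leaf)
        while True:
            options = [(-w, n) for n, w in graph[current].items()
                       if n not in used]
            if not options:
                break
            current = min(options)[1]         # heaviest remaining edge
            path.append(current)
            used.add(current)
        scaffolds.append(path)
    return scaffolds
```

With links A–B (5 read pairs), B–C (4), C–D (1), and E–F (3), the sketch drops the weakly supported C–D link and yields the two scaffolds A-B-C and E-F.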
Genome and transcriptome sequencing

Genomic DNA libraries were built using Illumina's standard paired-end protocol (Bentley et al. 2008). Four libraries were built using different size cuts ranging from 200- to 450-bp fragments and were sequenced as 75-mers (Supplemental Table S7). The 200-nt fragment RNA library was built largely as described (Mortazavi et al. 2008) with an added 12 rounds of column filtration following the last PCR steps and was sequenced as paired 75-mers. All libraries were sequenced on the Illumina Genome Analyzer II following the manufacturer's recommendations. Genome and RNA-seq reads were submitted to the Sequence Read Archive under accession number SRA023844.
Genome and cDNA assembly using Velvet

Raw reads were first mapped using Bowtie 0.12.1 (Langmead et al. 2009) onto the existing 439 kb of PS1010 sequence in GenBank to determine optimal insert sizes for paired mates using ERANGE 3.2 (Mortazavi et al. 2008). Raw reads were assembled with Velvet v.0.7.56 (Zerbino and Birney 2008) using k = 47 nt, expected coverage of 200, minimum coverage of four, minimum pair count of two, supercontigs ≥100 nt, and specified insert sizes for the two longest fragment libraries, where the settings were optimized for the highest N50. Supercontigs showing ≥90% matches to any E. coli assembly in GenBank with BLAT (Kent 2002) were filtered out. The remaining supercontigs were used for the transcriptome-mediated scaffolding of the genome (Fig. 2).

Ungapped transcriptome mate-ends were mapped with Bowtie 0.12.1 using the settings "-v 2 -e 240 -k 11 -m 10 --strata --best", allowing matches of ≥70/75. Reads that did not map with Bowtie were then mapped at 70/75 using BLAT, filtered with pslReps, and imported as 3.6 million splice reads (with at least 6 nt on the short end of the splice) with ERANGE. We mapped RNA-seq reads first with Bowtie onto the cDNA-scaffolded genomic assembly; ERANGE extracted 43.7 million uniquely mappable reads and 0.6 million multireads from the Bowtie mappings. The remaining reads did not map primarily because of poor sequence quality.

RNA-seq reads were assembled into cDNA with Velvet using a two-tiered strategy. One million paired reads were used to assemble cDNA from highly expressed genes with the settings "-exp_cov 100 -ins_length 200 -cov_cutoff 4. -min_contig_lgth 100". In parallel, all of the reads were used to assemble cDNA from moderately expressed genes with the settings "-exp_cov 1000 -ins_length 200 -cov_cutoff 4. -min_contig_lgth 100". The resulting cDNA supercontigs were collectively mapped to the genome with BLAT and used as hints to the AUGUSTUS 2.3 genefinder (Stanke et al. 2008).
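Because the assembly settings were tuned for the highest N50, it may help to recall how that statistic is computed. The short Python sketch below uses the usual definition (the length L such that supercontigs of length at least L contain at least half of all assembled bases); the function name and the example lengths are illustrative and are not taken from the PS1010 assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or supercontig lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            # This contig is the one that pushes the running sum past half.
            return length
    raise ValueError("no lengths given")
```

Comparing `n50` across candidate parameter sets (for example, different k values) is one simple way to rank assemblies, though N50 alone says nothing about correctness.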
Annotating genomic DNA and protein-coding genes

AUGUSTUS was run on the PS1010 Velvet+RNAPATH assembly with C. elegans parameters. We also ran AUGUSTUS on the C. elegans genome using either de novo parameters, or the following data sources for hints: ~355,000 C. elegans ESTs from GenBank; public RNA-seq data from GenBank; or our own Velvet-assembled cDNA supercontigs, from our own C. elegans RNA-seq data. To calculate RPKM expression levels on a per-exon and per-gene basis with ERANGE using the AUGUSTUS gene models, RNA reads were mapped onto the Velvet+RNAPATH assembly. Bowtie and BLAT were run at the same settings used for genomic assembly, but using the first 50 bp of each read and allowing up to five mismatches.

To find repetitive elements in our genome assembly, we ran RepeatModeler, which itself runs both RECON (Bao and Eddy 2002) and RepeatScout (Price et al. 2005) before merging their predictions. We identified 422 repetitive elements in the PS1010 genome, which we mapped to 429 kb of Sanger-sequenced PS1010 genomic DNA with BLAT.

Protein sequence analyses were done on the PS1010 predicted proteome itself, the WormBase WS210 predicted proteomes of C. elegans, C. briggsae, and P. pacificus, the WormBase WS207 predicted proteome of M. hapla, the predicted proteome of M. incognita from http://www.inra.fr/meloidogyne_incognita/genomic_resources/downloads, and a hybrid predicted proteome for B. malayi from both WormBase WS209 and our own de novo AUGUSTUS predictions using B. malayi parameters. We determined orthologies with OrthoMCL 1.3 (Li et al. 2003), run with standard settings. Since OrthoMCL outputs are protein based (overcounting genes with multiple products), groups were mapped from proteins to genes with Perl, and "orthology groups" with one gene were discarded. Genes were considered to belong to strict orthology groups if there was no more than one gene from each species in that group (i.e., 1:1, 1:1:1, etc.). Less stringently, a PS1010 gene was considered to have homology of some sort if it fell into any orthology group that had non-PS1010 genes as members. Protein domains with an E-value of ≤10^-6 were found in all seven proteomes with hmmscan from HMMER 3.0b3 (http://hmmer.janelia.org) and release 24.0 of PFAM-A (Finn et al. 2008).
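The orthology rules above (collapse proteins to genes, discard one-gene groups, call a group strict when no species contributes more than one gene, and call a PS1010 gene homologous when its group contains any non-PS1010 gene) can be sketched in Python as follows. The original post-processing was done in Perl; the function names, input layout, and gene identifiers here are hypothetical.

```python
from collections import Counter


def classify_groups(groups, gene_species):
    """Label each multi-gene orthology group as 'strict' or 'non-strict'.

    groups: dict mapping group id -> list of gene ids (a gene may appear
            more than once if several of its proteins were clustered).
    gene_species: dict mapping gene id -> species name.
    """
    labels = {}
    for group_id, members in groups.items():
        genes = set(members)                  # collapse multiple products
        if len(genes) < 2:
            continue                          # one-gene groups are discarded
        per_species = Counter(gene_species[g] for g in genes)
        strict = all(count == 1 for count in per_species.values())
        labels[group_id] = "strict" if strict else "non-strict"
    return labels


def genes_with_homology(groups, gene_species, focal="PS1010"):
    """Return focal-species genes sharing a group with another species."""
    hits = set()
    for members in groups.values():
        genes = set(members)
        if any(gene_species[g] != focal for g in genes):
            hits.update(g for g in genes if gene_species[g] == focal)
    return hits
```

A group holding one gene each from PS1010, C. elegans, and C. briggsae would be labeled strict (1:1:1), whereas a group with two PS1010 genes and one C. elegans gene would count only toward the looser homology tally.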
Conserved genomic DNA elements and motifs

The PS1010 draft assembly was aligned to the genomes of C. elegans and C. briggsae with TBA/MULTIZ (Blanchette et al. 2004) at BLASTZ settings T = 0, W = 8, K = 2200 and then analyzed for conservation using phastCons (Siepel et al. 2005) with standard settings. We trained phastCons on alignments of C. elegans chromosome IV, and then used it with otherwise standard settings to generate a list of elements in the C. elegans genome conserved in both C. briggsae and PS1010. We required that they be at least 10 nt long, with 80% overlap in all three genomes, and that they not overlap any of the following datasets in C. elegans: complex or simple repetitive elements; known exons from protein-coding or noncoding RNA (ncRNA) genes annotated in WormBase WS210; BLASTN hits (E ≤ 10^-3) against C. elegans ncRNA sequences from WS210; alternative exon predictions by genefinders such as mGene, which are provided in WormBase as supplementary data rather than official gene models (Schweikert et al. 2009); and our own exon predictions with AUGUSTUS. Statistically overrepresented GO terms of neighboring C. elegans genes were found with the Cistematic module of ERANGE (Mortazavi et al. 2006).

We extracted motifs of 6–18 nt in size from the filtered elements with MEME (-minw 6 -maxw 18) (Bailey and Elkan 1994), allowing any number of motif occurrences per sequence (-mod anr), setting a minimum significance of P ≤ 0.05, allowing up to 50 motifs (-nmotifs 50), and using a Markov-1 background dinucleotide frequency model that we generated from 47.7 MB of filtered C. elegans genome sequence. This consisted of sequence from which we had first removed repeats and exons (known or predicted), and then had removed any sequence fragments under 30 nt. We compared motifs with TOMTOM (Gupta et al. 2007) using Euclidean distances to measure their similarity, Q-values (Storey and Tibshirani 2003) to define highly significant matches, and P-values to qualify weak ones. Previously published motifs were extracted from JASPAR (Portales-Casamar et al. 2010), Drosophila Flyreg v2 (http://www.danielpollard.com/bergman2004_matrices.html; Bergman et al. 2005), TRANSFAC (Matys et al. 2006) via the TOMTOM Web portal (Bailey et al. 2009), and WormBase (Harris et al. 2010). To find subsets of conserved DNA elements containing instances of particular motifs, we ran FIMO with default parameters (Bailey et al. 2009); these subsets were, in turn, scanned for overrepresented GO terms in neighboring genes with ERANGE/Cistematic. More details of some procedures above are given in the Supplemental Methods.
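The element filter described in this section (a minimum length of 10 nt, and rejection of any element that overlaps an excluded annotation by even 1 nt) can be illustrated with the Python sketch below. It is an assumption-laden example rather than the pipeline actually used: coordinates are taken to be 0-based and half-open, the three-genome 80% overlap requirement is not modeled, and all names and test intervals are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def filter_elements(elements, excluded, min_length=10):
    """Keep elements >= min_length nt with no overlap to excluded regions.

    elements, excluded: iterables of (chrom, start, end) tuples.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in excluded:
        by_chrom[chrom].append((start, end))
    merged = {c: merge_intervals(v) for c, v in by_chrom.items()}
    starts = {c: [s for s, _ in blocks] for c, blocks in merged.items()}

    kept = []
    for chrom, start, end in elements:
        if end - start < min_length:
            continue
        blocks = merged.get(chrom, [])
        # Index of the last excluded block that begins before the element ends.
        i = bisect_right(starts.get(chrom, []), end - 1) - 1
        if i >= 0 and blocks[i][1] > start:
            continue                          # overlap of at least 1 nt
        kept.append((chrom, start, end))
    return kept
```

In practice the excluded set would be the union of repeats, known and predicted exons, and ncRNA BLASTN hits; merging them first keeps each overlap test to a single binary search.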
Acknowledgments

We thank Robin Giblin-Davis for providing PS1010 in 1991, Oren Schaedel for use of his C. elegans L3 RNA-seq data, Todd Ciche and Karin Kiontke for advice on worm culture and RNA extractions, Henry Amrhein and Diane Trout for computational support, and Adler Dillman, Karin Kiontke, Adrienne Roeder, Hillel Schwartz, and Allyson Whittaker for comments on the manuscript. Sequencing was performed in the Millard and Muriel Jacobs Genetics and Genomics Laboratory at Caltech (I.A., L.S.). This work was supported by the Howard Hughes Medical Institute, with which P.W.S. is an Investigator, the Beckman Institute Functional Genomics Center, the Caltech Moore Cell Center, grants HG02223 and HG003162 from the National Human Genome Research Institute, and grant GM084389 from the National Institute of General Medical Sciences.
References

Abad P, Gouzy J, Aury JM, Castagnone-Sereno P, Danchin EG, Deleury E, Perfus-Barbeoch L, Anthouard V, Artiguenave F, Blok VC, et al. 2008. Genome sequence of the metazoan plant-parasitic nematode Meloidogyne incognita. Nat Biotechnol 26: 909–915.
Ao W, Gaudet J, Kent WJ, Muttumu S, Mango SE. 2004. Environmentally induced foregut remodeling by PHA-4/FoxA and DAF-12/NHR. Science 305: 1743–1746.
Bailey TL, Elkan C. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2: 28–36.
Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS. 2009. MEME SUITE: Tools for motif discovery and searching. Nucleic Acids Res 37: W202–W208.
Bao Z, Eddy SR. 2002. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res 12: 1269–1276.
Barrière A, Yang SY, Pekarek E, Thomas CG, Haag ES, Ruvinsky I. 2009. Detecting heterozygosity in shotgun genome assemblies: Lessons from obligately outcrossing nematodes. Genome Res 19: 470–480.
Beer MA, Tavazoie S. 2004. Predicting gene expression from sequence. Cell 117: 185–198.
Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, et al. 2008. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456: 53–59.
Bergman CM, Carlson JW, Celniker SE. 2005. Drosophila DNase I footprint database: A systematic genome annotation of transcription factor binding sites in the fruitfly, Drosophila melanogaster. Bioinformatics 21: 1747–1749.
Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al. 2004. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 14: 708–715.
Blaxter M. 1998. Caenorhabditis elegans is a nematode. Science 282: 2041–2046.
Chaisson MJ, Brinza D, Pevzner PA. 2009. De novo fragment assembly with short mate-paired reads: Does the read length matter? Genome Res 19: 336–346.
de Chastonay Y, Muller F, Tobler H. 1990. Two highly reiterated nucleotide sequences in the low C-value genome of Panagrellus redivivus. Gene 93: 199–204.
Dieterich C, Clifton SW, Schuster LN, Chinwalla A, Delehaunty K, Dinkelacker I, Fulton L, Fulton R, Godfrey J, Minx P, et al. 2008. The Pristionchus pacificus genome provides a unique perspective on nematode lifestyle and parasitism. Nat Genet 40: 1193–1198.
Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, et al. 2008. The Pfam protein families database. Nucleic Acids Res 36: D281–D288.
Gaudet J, Muttumu S, Horner M, Mango SE. 2004. Whole-genome analysis of temporal gene expression during foregut development. PLoS Biol 2: e352. doi: 10.1371/journal.pbio.0020352.
Ghedin E, Wang S, Spiro D, Caler E, Zhao Q, Crabtree J, Allen JE, Delcher AL, Guiliano DB, Miranda-Saavedra D, et al. 2007. Draft genome of the filarial nematode parasite Brugia malayi. Science 317: 1756–1760.
Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS. 2007. Quantifying similarity between motifs. Genome Biol 8: R24. doi: 10.1186/gb-2007-8-2-r24.
Haas BJ, Zody MC. 2010. Advancing RNA-Seq analysis. Nat Biotechnol 28: 421–423.
Harris TW, Antoshechkin I, Bieri T, Blasiar D, Chan J, Chen WJ, De La Cruz N, Davis P, Duesbury M, Fang R, et al. 2010. WormBase: A comprehensive resource for nematode research. Nucleic Acids Res 38: D463–D467.
International Human Genome Sequencing Consortium. 2004. Finishing the euchromatic sequence of the human genome. Nature 431: 931–945.
Kent WJ. 2002. BLAT--the BLAST-like alignment tool. Genome Res 12: 656–664.
Kent WJ, Haussler D. 2001. Assembly of the working draft of the human genome with GigAssembler. Genome Res 11: 1541–1548.
Kiontke K, Fitch DHA. 2005. The phylogenetic relationships of Caenorhabditis and other rhabditids. In WormBook (ed. The C. elegans Research Community), pp. 1–11. doi: 10.1895/wormbook.1.11.1, http://www.wormbook.org.
Kiontke K, Barrière A, Kolotuev I, Podbilewicz B, Sommer R, Fitch DH, Félix MA. 2007. Trends, stasis, and drift in the evolution of nematode vulva development. Curr Biol 17: 1925–1937.
Kirienko NV, Fay DS. 2010. SLR-2 and JMJC-1 regulate an evolutionarily conserved stress-response network. EMBO J 29: 727–739.
Kuntz SG, Schwarz EM, DeModena JA, De Buysscher T, Trout D, Shizuya H, Sternberg PW, Wold BJ. 2008. Multigenome DNA sequence conservation identifies Hox cis-regulatory elements. Genome Res 18: 1955–1968.
Langmead B, Trapnell C, Pop M, Salzberg SL. 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10: R25. doi: 10.1186/gb-2009-10-3-r25.
Lewis JA, Fleming JT. 1995. Basic culture methods. Methods Cell Biol 48: 3–29.
Li L, Stoeckert CJ Jr, Roos DS. 2003. OrthoMCL: Identification of ortholog groups for eukaryotic genomes. Genome Res 13: 2178–2189.
Li L, He S, Sun JM, Davie JR. 2004. Gene regulation by Sp1 and Sp3. Biochem Cell Biol 82: 460–471.
Mathews EA, García E, Santi CM, Mullen GP, Thacker C, Moerman DG, Snutch TP. 2003. Critical residues of the Caenorhabditis elegans unc-2 voltage-gated calcium channel that affect behavioral and physiological properties. J Neurosci 23: 6537–6545.
Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, et al. 2006. TRANSFAC and its module TRANSCompel: Transcriptional gene regulation in eukaryotes. Nucleic Acids Res 34: D108–D110.
Meldal BH, Debenham NJ, De Ley P, De Ley IT, Vanfleteren JR, Vierstraete AR, Bert W, Borgonie G, Moens T, Tyler PA, et al. 2007. An improved molecular phylogeny of the Nematoda with special emphasis on marine taxa. Mol Phylogenet Evol 42: 622–636.
Missal K, Zhu X, Rose D, Deng W, Skogerbø G, Chen R, Stadler PF. 2006. Prediction of structured non-coding RNAs in the genomes of the nematodes Caenorhabditis elegans and Caenorhabditis briggsae. J Exp Zool B Mol Dev Evol 306: 379–392.
Mortazavi A, Leeper Thompson EC, Garcia ST, Myers RM, Wold B. 2006. Comparative genomics modeling of the NRSF/REST repressor network: From single conserved sites to genome-wide repertoire. Genome Res 16: 1208–1221.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. 2008. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nat Methods 5: 621–628.
Müller F, Tobler H. 2000. Chromatin diminution in the parasitic nematodes Ascaris suum and Parascaris univalens. Int J Parasitol 30: 391–399.
Ohler U, Yekta S, Lim LP, Bartel DP, Burge CB. 2004. Patterns of flanking sequence conservation and a characteristic upstream motif for microRNA gene identification. RNA 10: 1309–1322.
Opperman CH, Bird DM, Williamson VM, Rokhsar DS, Burke M, Cohn J, Cromer J, Diener S, Gajan J, Graham S, et al. 2008. Sequence and genetic map of Meloidogyne hapla: A compact nematode genome for plant parasitism. Proc Natl Acad Sci 105: 14802–14807.
Pop M, Kosak DS, Salzberg SL. 2004. Hierarchical scaffolding with Bambus. Genome Res 14: 149–159.
Portales-Casamar E, Thongjuea S, Kwon AT, Arenillas D, Zhao X, Valen E, Yusuf D, Lenhard B, Wasserman WW, Sandelin A. 2010. JASPAR 2010: The greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res 38: D105–D110.
Price AL, Jones NC, Pevzner PA. 2005. De novo identification of repeat families in large genomes. Bioinformatics 21: i351–i358.
Schweikert G, Zien A, Zeller G, Behr J, Dieterich C, Ong CS, Philips P, De Bona F, Hartmann L, Bohlen A, et al. 2009. mGene: Accurate SVM-based gene finding with an application to nematode genomes. Genome Res 19: 2133–2143.
Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, et al. 2005. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15: 1034–1050.
Stanke M, Diekhans M, Baertsch R, Haussler D. 2008. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24: 637–644.
Stein LD, Bao Z, Blasiar D, Blumenthal T, Brent MR, Chen N, Chinwalla A, Clarke L, Clee C, Coghlan A, et al. 2003. The genome sequence of Caenorhabditis briggsae: A platform for comparative genomics. PLoS Biol 1: E45. doi: 10.1371/journal.pbio.0000045.
Storey JD, Tibshirani R. 2003. Statistical significance for genomewide studies. Proc Natl Acad Sci 100: 9440–9445.
Sudhaus W, Kiontke K. 1996. Phylogeny of Rhabditis subgenus Caenorhabditis (Rhabditidae, Nematoda). J Zool Syst Evol Res 34: 217–233.
van den Heuvel S, Dyson NJ. 2008. Conserved functions of the pRB and E2F families. Nat Rev Mol Cell Biol 9: 713–724.
Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M. 2005. Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Nature 434: 338–345.
Zerbino DR, Birney E. 2008. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18: 821–829.
Zhao G, Schriefer LA, Stormo GD. 2007. Identification of muscle-specific regulatory modules in Caenorhabditis elegans. Genome Res 17: 348–357.
Received May 26, 2010; accepted in revised form August 24, 2010.