Lecture Notes in Bioinformatics 5542

Edited by S. Istrail, P. Pevzner, and M. Waterman

Editorial Board: A. Apostolico, S. Brunak, M. Gelfand, T. Lengauer, S. Miyano, G. Myers, M.-F. Sagot, D. Sankoff, R. Shamir, T. Speed, M. Vingron, W. Wong

Subseries of Lecture Notes in Computer Science
Ion Măndoiu, Giri Narasimhan, Yanqing Zhang (Eds.)

Bioinformatics Research and Applications
5th International Symposium, ISBRA 2009
Fort Lauderdale, FL, USA, May 13-16, 2009
Proceedings
Series Editors
Sorin Istrail, Brown University, Providence, RI, USA
Pavel Pevzner, University of California, San Diego, CA, USA
Michael Waterman, University of Southern California, Los Angeles, CA, USA

Volume Editors

Ion Măndoiu
University of Connecticut, Computer Science & Engineering Department
371 Fairfield Way, Unit 2155, Storrs, CT 06269, USA
E-mail: [email protected]

Giri Narasimhan
Florida International University, School of Computing and Information Sciences
Bioinformatics Research Group (BioRG)
11200 SW 8th Street, Room ECS254, University Park, Miami, FL 33199, USA
E-mail: giri@cs.fiu.edu

Yanqing Zhang
Georgia State University, Department of Computer Science
Atlanta, GA 30302-3994, USA
E-mail: [email protected]
Library of Congress Control Number: Applied for
CR Subject Classification (1998): J.3, H.2.8, F.1, F.2.2, G.3
LNCS Sublibrary: SL 8 – Bioinformatics
ISSN 0302-9743
ISBN-10 3-642-01550-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-01550-2 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

springer.com

© Springer-Verlag Berlin Heidelberg 2009
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
SPIN: 12672264 06/3180 543210
Preface
The 5th edition of the International Symposium on Bioinformatics Research and Applications (ISBRA 2009) was held during May 13-16, 2009 at Nova Southeastern University in Ft. Lauderdale, Florida. The symposium provides a forum for the exchange of ideas and results among researchers, developers, and practitioners working on all aspects of bioinformatics and computational biology and their applications. The technical program of the symposium included 26 contributed papers, selected by the Program Committee from the 55 full submissions received in response to the call for papers. The technical program also included contributed papers and abstracts submitted to the Second Workshop on Computational Issues in Genetic Epidemiology (CIGE 2009), which was held in conjunction with ISBRA 2009. Additionally, the symposium included poster sessions and featured invited keynote talks by four distinguished speakers: Mikhail Gelfand from the Russian Academy of Sciences and Moscow State University spoke on the evolution of regulatory systems in bacteria, Nicholas Tsinoremas from the Miller School of Medicine and the College of Arts and Sciences at the University of Miami spoke on bioinformatics challenges in translational research, Esko Ukkonen from the University of Helsinki spoke on motif construction from high-throughput SELEX data, and Shamil Sunyaev from Brigham and Women's Hospital and Harvard Medical School spoke on interpreting population sequencing data.

We would like to thank the Program Committee members and external reviewers for volunteering their time to review and discuss symposium papers. We would also like to thank the Chairs and the Program Committee of CIGE 2009 for enriching the technical program of the symposium with a workshop on an important and active area of bioinformatics research. We would like to extend special thanks to the Steering and General Chairs of the symposium for their leadership, and to the Finance, Publicity, Local Organization, and Poster Chairs and the Web Master for their hard work in making ISBRA 2009 a successful event. Last but not least, we would like to thank all authors for presenting their work at the symposium.

May 2009
Ion Măndoiu
Giri Narasimhan
Yanqing Zhang
Organization
5th International Symposium on Bioinformatics Research and Applications (ISBRA 2009)

Steering Chairs
Dan Gusfield, University of California, Davis, USA
Yi Pan, Georgia State University, USA
Marie-France Sagot, INRIA, France

General Chairs
Matthew He, Nova Southeastern University, USA
Alexander Zelikovsky, Georgia State University, USA

Program Chairs
Ion Măndoiu, University of Connecticut, USA
Giri Narasimhan, Florida International University, USA
Yanqing Zhang, Georgia State University, USA

Publicity Chair
Raj Sunderraman, Georgia State University, USA

Finance Chair
Anu Bourgeois, Georgia State University, USA

Poster Chairs
Yufeng Wu, University of Connecticut, USA
Craig E. Nelson, University of Connecticut, USA

Local Organization Chairs
Edward Keith, Nova Southeastern University, USA
Miguel A. Jiménez-Montaño, Universidad Veracruzana, Mexico
Local Organization Committee
Ahmed Albatineh, Nova Southeastern University, USA
Ricardo Carrera, Nova Southeastern University, USA
Josh Loomis, Nova Southeastern University, USA
Evan Haskell, Nova Southeastern University, USA
Saeed Rajput, Nova Southeastern University, USA
Reza Razeghifard, Nova Southeastern University, USA
Raisa Szabo, Nova Southeastern University, USA

Web Master
Zejin Jason Ding, Georgia State University, USA
Program Committee
Srinivas Aluru, Iowa State University, USA
Danny Barash, Ben-Gurion University, Israel
Anne Bergeron, Université du Québec à Montréal, Canada
Tanya Berger-Wolf, University of Illinois at Chicago, USA
Daniel Berrar, University of Ulster, UK
Olivier Bodenreider, National Library of Medicine, NIH, USA
Paola Bonizzoni, Università degli Studi di Milano-Bicocca, Italy
Daniel Brown, University of Waterloo, Canada
Liming Cai, University of Georgia, USA
Luonan Chen, Osaka Sangyo University, Japan
Bhaskar Dasgupta, University of Illinois at Chicago, USA
Colin Dewey, University of Wisconsin-Madison, USA
Werner Dubitzky, University of Ulster, UK
Guillaume Fertin, Université de Nantes, France
Liliana Florea, George Washington University, USA
Jean Gao, University of Texas at Arlington, USA
Mikhail Gelfand, IITP, Russia
Michael Gribskov, Purdue University, USA
Katia Guimarães, Universidade Federal de Pernambuco, Brazil
Robert Harrison, Georgia State University, USA
Jieyue He, Southeast University, China
Vasant Honavar, Iowa State University, USA
Lars Kaderali, University of Heidelberg, Germany
Ming-Yang Kao, Northwestern University, USA
George Karypis, University of Minnesota, USA
Yury Khudyakov, Centers for Disease Control and Prevention, USA
Jing Li, Case Western Reserve University, USA
Yiming Li, National Chiao Tung University, Taiwan
Guohui Lin, University of Alberta, Canada
Stefano Lonardi, University of California at Riverside, USA
Jingchu Luo, Peking University, China
Osamu Maruyama, Kyushu University, Japan
Satoru Miyano, University of Tokyo, Japan
Bernard Moret, École Poly. Féd. de Lausanne, Switzerland
Craig Nelson, University of Connecticut, USA
Laxmi Parida, IBM T.J. Watson Research Center, USA
Itsik Pe'er, Columbia University, USA
Mihai Pop, University of Maryland, USA
Teresa Przytycka, NCBI, USA
Sven Rahmann, Technical University of Dortmund, Germany
Sanguthevar Rajasekaran, University of Connecticut, USA
Shoba Ranganathan, Macquarie University, Australia
Isidore Rigoutsos, IBM Research, USA
Cenk Sahinalp, Simon Fraser University, Canada
David Sankoff, University of Ottawa, Canada
Russell Schwartz, Carnegie Mellon University, USA
João Carlos Setubal, Virginia Polytechnic Institute and State University, USA
Mona Singh, Princeton University, USA
Steve Skiena, State University of New York at Stony Brook, USA
Donna Slonim, Tufts University, USA
Ramanathan Sowdhamini, NCBS, India
Jens Stoye, Universität Bielefeld, Germany
Wing-Kin Sung, National University of Singapore, Singapore
Sing-Hoi Sze, Texas A&M University, USA
Haixu Tang, Indiana University, USA
Gabriel Valiente, Technical University of Catalonia, Spain
Jean-Philippe Vert, Ecole des Mines de Paris, France
Stéphane Vialette, Université Paris-Est Marne-la-Vallée, France
Gwenn Volkert, Kent State University, USA
Li-San Wang, University of Pennsylvania, USA
Lusheng Wang, City University of Hong Kong, China
Carsten Wiuf, University of Aarhus, Denmark
Hongwei Wu, University of Georgia, USA
Yufeng Wu, University of Connecticut, USA
Dong Xu, University of Missouri-Columbia, USA
Kaizhong Zhang, University of Western Ontario, Canada
Leming Zhou, University of Pittsburgh, USA
External Reviewers

Angibaud, Sébastien
Araujo, Flavia
Assareh, Amin
Astrovskaya, Irina
Bernauer, Julie
Blin, Guillaume
Chen, Shihyen
Comin, Matteo
DeRonne, Kevin
Della Vedova, Gianluca
Dewal, Ninad
Dondi, Riccardo
Ghodsi, MohammadReza
Guillemot, Sylvain
Harris, Elena
Husemann, Peter
Jahn, Katharina
Jin, Guangxu
Kauffman, Chris
Kim, Dongchul
Kim, Yoo-Ah
Knapp, Bettina
Krishnan, Yamuna
Lara, James
Li, Weiming
Liu, Bo
Liu, Zhiping
Mangul, Serghei
Marschall, Tobias
Martin, Marcel
Mazur, Johanna
Monteiro, Carla
Offmann, Bernard
Palamara, Pierre
Podolyan, Yevgeniy
Pugalenthi, Ganesan
Radde, Nicole
Rizzi, Raffaella
Rosa, Rogerio
Rusu, Irena
Salari, Rahele
Schoenhuth, Alex
Sheikh, Saad
Stoffer, Deborah
Tripathi, Lokesh
Wittler, Roland
Wojtowicz, Damian
Wu, Lingyun
Zhao, Xingming
Zheng, Jie
Zola, Jaroslaw
Second Workshop on Computational Issues in Genetic Epidemiology (CIGE 2009)
Steering Committee
Andrew Allen, Duke University, USA
Ion Măndoiu, University of Connecticut, USA
Dan Nicolae, University of Chicago, USA
Yi Pan, Georgia State University, USA
Alex Zelikovsky, Georgia State University, USA
Program Chairs
Andrew Allen, Duke University, USA
Itsik Pe'er, Columbia University, USA
Program Committee
Dave Cutler, Emory University, USA
Frank Dudbridge, Cambridge University, UK
Eleazar Eskin, UCLA, USA
Eran Halperin, UC Berkeley/Tel Aviv University, USA/Israel
David Heckerman, Microsoft Research, USA
Chun Li, Vanderbilt University, USA
Eden Martin, Miami University, USA
Shaun Purcell, Harvard University, USA
Hongyu Zhao, Yale University, USA
Table of Contents
Evolution of Regulatory Systems in Bacteria (Invited Keynote Talk) . . . . 1
Mikhail S. Gelfand, Alexei E. Kazakov, Yuri D. Korostelev, Olga N. Laikova, Andrei A. Mironov, Alexandra B. Rakhmaninova, Dmitry A. Ravcheev, Dmitry A. Rodionov, and Alexei G. Vitreschak

Integrating Multiple-Platform Expression Data through Gene Set Features . . . . 5
Matěj Holec, Filip Železný, Jiří Kléma, and Jakub Tolar

Practical Quality Assessment of Microarray Data by Simulation of Differential Gene Expression . . . . 18
Brian E. Howard, Beate Sick, and Steffen Heber

Mean Square Residue Biclustering with Missing Data and Row Inversions . . . . 28
Stefan Gremalschi, Gulsah Altun, Irina Astrovskaya, and Alexander Zelikovsky

Using Gene Expression Modeling to Determine Biological Relevance of Putative Regulatory Networks . . . . 40
Peter Larsen and Yang Dai

Querying Protein-Protein Interaction Networks . . . . 52
Guillaume Blin, Florian Sikora, and Stéphane Vialette

Integrative Approach for Combining TNFα-NFκB Mathematical Model to a Protein Interaction Connectivity Map . . . . 63
Mahesh Visvanathan, Bernhard Pfeifer, Christian Baumgartner, Bernhard Tilg, and Gerald Henry Lushington

Hierarchical Organization of Functional Modules in Weighted Protein Interaction Networks Using Clustering Coefficient . . . . 75
Min Li, Jianxin Wang, Jianer Chen, and Yi Pan

Bioinformatics Challenges in Translational Research (Invited Keynote Talk) . . . . 87
Nicholas F. Tsinoremas

Untangling Tanglegrams: Comparing Trees by Their Drawings . . . . 88
Balaji Venkatachalam, Jim Apple, Katherine St. John, and Dan Gusfield

An Experimental Analysis of Consensus Tree Algorithms for Large-Scale Tree Collections . . . . 100
Seung-Jin Sul and Tiffani L. Williams
Counting Faces in Split Networks . . . . 112
Lichen Bao and Sergey Bereg

Relationship between Amino Acids Sequences and Protein Structures: Folding Patterns and Sequence Patterns . . . . 124
Alexander Kister

Improved Algorithms for Parsing ESLTAGs: A Grammatical Model Suitable for RNA Pseudoknots . . . . 135
Sanguthevar Rajasekaran, Sahar Al Seesi, and Reda Ammar

Efficient Algorithms for Self Assembling Triangular and Other Nano Structures . . . . 148
Vamsi Kundeti and Sanguthevar Rajasekaran

Motif Construction from High-Throughput SELEX Data (Invited Keynote Talk) . . . . 159
Esko Ukkonen

Rearrangement Phylogeny of Genomes in Contig Form . . . . 160
Adriana Muñoz and David Sankoff

Prediction of Contiguous Regions in the Amniote Ancestral Genome . . . . 173
Aïda Ouangraoua, Frédéric Boyer, Andrew McPherson, Éric Tannier, and Cedric Chauve

Pure Parsimony Xor Haplotyping . . . . 186
Paola Bonizzoni, Gianluca Della Vedova, Riccardo Dondi, Yuri Pirola, and Romeo Rizzi

A Decomposition of the Pure Parsimony Haplotyping Problem . . . . 198
Allen Holder and Thomas Langley

Exact Computation of Coalescent Likelihood under the Infinite Sites Model . . . . 209
Yufeng Wu

Imputation-Based Local Ancestry Inference in Admixed Populations . . . . 221
Bogdan Paşaniuc, Justin Kennedy, and Ion Măndoiu

Interpreting Population Sequencing Data (Invited Keynote Talk) . . . . 234
Shamil R. Sunyaev

Modeling and Visualizing Heterogeneity of Spatial Patterns of Protein-DNA Interaction from High-Density Chromatin Precipitation Mapping Data . . . . 236
Juntao Li, Fajrian Yunus, Zhu Lei, Majid Eshaghi, Jianhua Liu, and R. Krishna Murthy Karuturi
A Linear-Time Algorithm for Analyzing Array CGH Data Using Log Ratio Triangulation . . . . 248
Matthew Hayes and Jing Li

Mining of cis-Regulatory Motifs Associated with Tissue-Specific Alternative Splicing . . . . 260
Jihye Kim, Sihui Zhao, Brian E. Howard, and Steffen Heber

Analysis of Cis-Regulatory Motifs in Cassette Exons by Incorporating Exon Skipping Rates . . . . 272
Sihui Zhao, Jihye Kim, and Steffen Heber

A Class of Evolution-Based Kernels for Protein Homology Analysis: A Generalization of the PAM Model . . . . 284
Valentina Sulimova, Vadim Mottl, Boris Mirkin, Ilya Muchnik, and Casimir Kulikowski

Irreplaceable Amino Acids and Reduced Alphabets in Short-Term and Directed Protein Evolution . . . . 297
Miguel A. Jiménez-Montaño and Matthew He

A One-Class Classification Approach for Protein Sequences and Structures . . . . 310
András Bánhalmi, Róbert Busa-Fekete, and Balázs Kégl

Prediction and Classification of Real and Pseudo MicroRNA Precursors via Data Fuzzification and Fuzzy Decision Trees . . . . 323
Na'el Abu-halaweh and Robert Harrison

Author Index . . . . 335
Evolution of Regulatory Systems in Bacteria (Invited Keynote Talk)

Mikhail S. Gelfand1,2, Alexei E. Kazakov1, Yuri D. Korostelev2, Olga N. Laikova3, Andrei A. Mironov1,2, Alexandra B. Rakhmaninova1,2, Dmitry A. Ravcheev1, Dmitry A. Rodionov1,4, and Alexei G. Vitreschak1

1 A.A. Kharkevich Institute for Information Transmission Problems, RAS, Bolshoi Karetny pereulok 19, Moscow, 127994, Russia
{gelfand,kazakov,ravcheyev,rodionov,vitreschak}@iitp.ru
2 Faculty of Bioengineering and Bioinformatics, M.V. Lomonosov Moscow State University, Vorobievy Gory 1-73, Moscow, 119992, Russia
{[email protected],abr@belozersky}.msu.ru
3 Research Institute for Genetics and Selection of Industrial Microorganisms, Pervy Dorozhny proezd 1, Moscow, 127994, Russia
4 Burnham Institute for Medical Research, 10901 North Torrey Pines Road, La Jolla, CA 92037, USA
Abstract. Recent comparative studies indicate a surprising flexibility of regulatory systems in bacteria. These systems can be analyzed on several levels, and I plan to consider two of them. At the level of regulon evolution, one can attempt to characterize the evolution of regulon content formed by loss, gain and duplication of regulators and regulated genes, as well as gain and loss of individual regulatory sites and horizontal gene transfer. At the level of transcription factor families, one can study the co-evolution of DNA-binding proteins and the motifs they recognize. While this area is not yet ripe for fully automated analysis, the results of systematic comparative studies gradually start to coalesce into an understanding of how bacterial regulatory systems evolve.

Keywords: Comparative genomics, bacteria, regulation of transcription, regulation of translation, transcription factor, binding site, T-box.
1 Introduction

The sequencing of hundreds of bacterial genomes has created a situation where, in many taxa, we have a rather dense and relatively uniform sampling of genomes at varying evolutionary distances from each other. This paves the way for careful comparative genomic analysis of regulatory systems and their evolution. Identification of candidate transcription factor binding sites and regulatory RNA structures, and analysis of their distribution in related genomes, allows one to reconstruct the evolutionary history of regulons, whereas analysis of candidate binding sites for transcription factors forming a structural family creates an opportunity for studying the co-evolution of transcription factors and their binding motifs, and hence, the elucidation of the family-specific protein-DNA interaction code.
2 Evolution of Regulons

The list of basic events shaping regulons includes gain (via duplication and horizontal gene transfer) and loss of regulators, changes of specificity, gain, loss and duplication of regulated genes, shuffling of genes in operons, and gain and loss of individual regulatory sites. While it is not currently possible to estimate the rates of these events, it is clear that the rates are not uniform across regulons and their life stages. Further, in most cases there are significant overlaps between individual regulons, and it makes more sense to speak of interacting regulatory systems.

Life without FUR: Evolution of Iron Homeostasis in the Alpha-Proteobacteria [1]. One example where the evolutionary history could be reconstructed in sufficient detail is the regulation of iron homeostasis in the alpha-proteobacteria. In this case the starting event was a change of ligand specificity that transformed the usual iron repressor FUR into the manganese-responsive MUR in the common ancestor of the Rhizobiales and Rhodobacteriales. The role of iron regulator was assumed by a distant member of the FUR family, Irr, and this state is conserved in the Bradyrhizobiaceae. In the Rhizobiaceae, IscR, a regulator of genes involved in the synthesis of iron-sulfur clusters, was also lost. Further on, in the Rhizobiaceae the RirA regulator appeared, and the job of iron-dependent regulation is shared by Irr (mainly responsible for iron storage, Fe-S clusters, heme and iron-dependent enzymes) and RirA (the main regulator of iron acquisition and also some Fe-S and iron storage genes). In the Rhodobacteriales, iron acquisition is regulated by an unknown transcription factor binding to the motif CTGActrawtyagTCAG, which is somewhat similar to the binding motif of IscR; iron storage genes are co-regulated by this factor and Irr; Fe-S synthesis is regulated by IscR and Irr; and iron-dependent enzymes, solely by Irr.

Fatty Acid and Branched-Chain Amino Acid Utilization in the Gamma- and Beta-proteobacteria [2]. A similar reconstruction could be performed for a large system regulating the catabolism of fatty acids (FA) and branched-chain amino acids (ILV) in the gamma- and beta-proteobacteria. This system involves six transcriptional factors from the MerR, TetR, and GntR families binding to eleven distinct DNA motifs. The ILV degradation genes in the gamma- and beta-proteobacteria are regulated mainly by a newly identified regulator from the MerR family (e.g., LiuR in Pseudomonas aeruginosa) and, in some beta-proteobacteria, by a TetR-family regulator, LiuQ. In addition to the core set of ILV utilization genes, the LiuR regulon in some lineages is expanded to include genes from other metabolic pathways, such as the glyoxylate shunt and glutamate synthase, as well as salt- and alkaline-stress response in the Shewanella species. The FA degradation genes are controlled by four regulators: FadR in the gamma-proteobacteria, PsrA in the gamma- and beta-proteobacteria, FadP in the beta-proteobacteria, and LiuR orthologs in the alpha-proteobacteria. The most parsimonious evolutionary scenario for the ILV and FA regulons seems to be that LiuR and PsrA were likely present in the common ancestor of the gamma- and beta-proteobacteria, and that they have been partially or fully substituted by LiuQ and FadP in the Burkholderiales and by FadR in some groups of the gamma-proteobacteria.
T-boxes and Regulation of Amino Acid Metabolism in the Firmicutes [3]. T-boxes are regulatory RNA structures that bind uncharged tRNAs and regulate aminoacyl-tRNA synthetase genes as well as genes encoding amino acid transporters and metabolic enzymes. T-boxes are sufficiently large to retain the phylogenetic signal, at least at short evolutionary distances, and hence it is possible to follow the history of T-box duplications. Further, since the specificity of T-boxes is dictated by the interaction between a well-defined structural element (the so-called specifier codon) and the tRNA anticodon, they are ideal material for studying changes in specificity. One of the most interesting observations is the rapid, duplication-driven, lineage-specific expansion of specific T-box regulons following the loss of previously existing transcription factors.

Regulon Expansion, or How FruR Has Become CRA and Duplicated RbsR Has Become PurR. The fructose repressor FruR, a member of the LacI family, is a standard sugar regulator in most lineages of the gamma-proteobacteria, whereas in E. coli it is a well-studied global regulator named CRA (catabolite repressor and activator). Following the fate of known binding sites in genomes ordered by increasing phylogenetic distance from E. coli, one can see that the regulon expansion started with the glycolysis pathway and then extended to some genes of the Krebs cycle and sugar catabolic pathways. Similarly, the ribose operon regulator RbsR duplicated in the common ancestor of the Enterobacteriales and Vibrionales. The RbsR copy retained the ligand (ribose) specificity and the regulon, but its DNA motif changed somewhat (to AGCGAAACGTTTCGCT), whereas the other copy retained the DNA motif (ACGCAAACGTTTGCGT) but has become the purine repressor PurR, regulating, in E. coli, more than twenty genes from the purine biosynthesis pathway and some adjacent pathways.
3 Co-evolution of Transcription Factors and DNA Motifs They Recognize

As mentioned in the previous section, the evolution of regulons is often accompanied by changes in the DNA motifs. To study the co-evolution of transcription factors (TFs) and their binding sites systematically, we are performing large-scale comparative genomic analyses of several families of TFs. The outcome of such studies is a list of TFs, each with a set of candidate binding sites. Several recently developed programs are used to identify correlated positions in the proteins and DNA. Indeed, it turns out that when this analysis was applied to the LacI family of TFs, the identified set of correlated positions was consistent with several known X-ray structures of TF-DNA complexes. Notably, however, the set of protein positions correlated with specific nucleotides was not limited to residues in immediate contact with the DNA: in several families this set also included positions situated on the other side of the DNA-binding alpha-helix and forming hydrophobic interactions with the rest of the protein. Further, these studies revealed that the family-specific protein-DNA recognition code is not limited to known universal correlations
(like “arginine binds to guanine”), nor to pairwise correlations. Some of the predictions coming from these analyses have recently been confirmed experimentally [4].

Acknowledgments. The reported studies were supported by grants from the Howard Hughes Medical Institute (55005610 to M.S.G.), the Russian Foundation for Basic Research (08-04-01000 to A.E.K.), and the Russian Academy of Sciences (program «Molecular and Cellular Biology»).
References

1. Rodionov, D.A., Gelfand, M.S., Todd, J.D., Curson, A.R.J., Johnston, A.W.B.: Comparative Reconstruction of Transcriptional Network Controlling Iron and Manganese Homeostasis in Alpha-Proteobacteria. PLoS Comp. Biol. 2, e163 (2006)
2. Kazakov, A.E., Rodionov, D.A., Alm, E., Arkin, A., Dubchak, I., Gelfand, M.S.: Comparative Genomics of Regulation of Fatty Acid and Branched-Chain Amino Acid Utilization in Proteobacteria. J. Bacteriol. 191, 52–64 (2009)
3. Vitreschak, A.G., Mironov, A.A., Lyubetsky, V.A., Gelfand, M.S.: Functional and Evolutionary Analysis of the T-box Regulon in Bacteria. RNA 14, 717–735 (2008)
4. Desai, T., Rodionov, D., Gelfand, M., Alm, E., Rao, C.: Engineering Transcription Factors with Novel DNA-binding Specificity Using Comparative Genomics. Nucleic Acids Res. (in press)
Integrating Multiple-Platform Expression Data through Gene Set Features

Matěj Holec1, Filip Železný1, Jiří Kléma1, and Jakub Tolar2

1 Czech Technical University, Prague
2 University of Minnesota, Minneapolis
{holecm1,zelezny,klema}@fel.cvut.cz, [email protected]
Abstract. We demonstrate a set-level approach to the integration of multiple-platform gene expression data for predictive classification and show its utility for boosting classification performance when single-platform samples are rare. We explore three ways of defining gene sets, including a novel way based on the notion of a fully coupled flux related to metabolic pathways. In two tissue classification tasks, we empirically show that the gene set based approach is useful for combining heterogeneous expression data, while, surprisingly, in experiments constrained to a single platform, biologically meaningful gene sets acting as sample features are often outperformed by random gene sets with no biological relevance.
1 Introduction
The problem addressed in this paper is set-level analysis of gene expression data, as opposed to the more traditional gene-level analysis approaches. In the latter, one typically seeks single statistically significant genes or constructs classification models with gene expressions acting as sample features. In set-level analysis, genes are first grouped into sets determined a priori by a chosen relevant kind of background knowledge. For example, a gene set may correspond to a group of proteins acting as enzymes in a biochemical pathway, or be a set of genes sharing a Gene Ontology [3] term. Naturally, the gene sets considered for an analysis may on the one hand overlap, while on the other hand their union may not exhaust the entire gene set screened in the expression data. Any gene set may then be assigned descriptive values (such as expression, fold change, significance) by statistical aggregation of the analogical values pertaining to its members. Gene sets thus may act as derived sample features replacing the original gene expressions. The potential of set-level analysis of genomic data has been advocated recently [12,1] on the grounds of improved interpretation power and statistical significance of analysis results. The basic idea of set-level analysis is not new. Indeed, state-of-the-art tools such as DAVID [9] have supported the established protocol of enrichment analysis, detecting ontology terms or pathways related to a large subset of a user-supplied gene list, thus obviously following a simple form of set-level analysis. The biological utility of set-level analysis was demonstrated by the study [11], where a significantly downregulated pathway-based
gene set was discovered in a class of type 2 diabetes samples despite no significant expression change being detected for any individual gene. In another study [18], a method based on singular value decomposition was proposed to determine the 'level of activity' of a pathway based on the sampled expression values of its gene-members. The paper [5] reviews some common statistical pitfalls in the calculation of such statistics ascribed to gene sets. The recent work [15] suggests a more sophisticated method to estimate the activity level of a pathway, considering the pathway structure in addition to the expressions of the genes involved therein. Another innovative aspect of [15] is that the authors employ such pathway activities as derived features of samples and use these for sample classification by a machine learning algorithm. The main contribution of the present work is showing that the gene set based approach naturally makes it possible to analyze, in an integrated manner, gene expression data collected from heterogeneous platforms, which may even encompass different organism species. The significance of this contribution is at least twofold. First, microarray experiments are costly, often resulting in numbers of samples insufficient for reliable modeling. The possibility of systematically integrating the experimenter's data with numerous public expression samples coming from heterogeneous platforms would obviously help the experimenter. Second, such integrated analysis provides the principal means to discover biological markers shared by different-genome species. We consider three types of gene sets. The first type groups genes that share a common Gene Ontology [3] term. The second type groups genes acting in biological pathways formalized by the KEGG [10] database. The third gene set type represents a further novel contribution of our work and is based on the notion of a fully coupled flux, which is a pattern prescribing pathway partitions hypothesized by [13] to involve strongly co-expressed genes. These synergize in single gradually amplified biological functions such as enzymatic catalysis or translocation among different cellular compartments. Research papers concerned with gene set based analysis, including the aforementioned studies, usually point out the statistical advantages of results based on gene sets in comparison with those based on single genes. We conjecture, however, that to assess the utility of the gene set approach, the relevant question that must be asked is how data models based on biologically meaningful gene sets compare to those based on gene sets constructed randomly, with no biological relevance. This question is important, as we indeed show that even a random grouping of genes into sets may lead to improved predictive accuracies. By addressing this question we can determine whether the inclusion of background knowledge through gene sets has a positive effect on the analysis results. We are not aware of previous work considering this question1, and it is our third contribution to address it experimentally.
1 The suggested gene set randomization should not be confused with the standard class-permutation technique used for validation, also in the set-level analysis context [1].
The paper is organized as follows. In Section 2 we describe the methodological ingredients of our approach, consisting of normalization, gene set extraction, data integration and predictive classification. Section 3 describes the expression analysis case studies and the collected relevant data used for experimental validation. In Section 4 we show and discuss the experimental results. Section 5 lays out prospects for future work and concludes the paper.
2 Methods
The input of our workflow is a set of gene expression samples (real vectors), possibly measured by different microarray platforms. Each sample is assigned two labels: the first identifies the microarray platform from which the sample originates, the second identifies a sample class (e.g., tissue type). The output is a classification model, that is, a model that estimates the sample class given an expression sample and its platform label. The model is obviously applicable to any sample not present in the input ('training') data, as long as its platform label is also present in the input data. The remarkable property of the output model is that it is not a combination of separate models each pertaining to a single platform. Rather, it is a single classifier trained on the entire heterogeneous sample set and represented in terms of 'activity levels' of units that apply to all platforms, albeit the computation of these activity levels may differ across platforms. More specifically, the activity of a unit (such as a pathway) is calculated using a different gene set in each platform. We now describe the individual steps of the method in more detail.

Normalization. The first normalization step is conducted separately for each platform to consolidate same-platform samples. Quantile normalization [2] ensures that the distribution of expression values across such samples is identical. As a second step, scaling provides a means to consolidate the measurements across multi-platform samples. We subtract the sample mean from all sample components and divide them by the standard deviation within the sample. As a result, all samples, independently of the platform, exhibit zero mean and unit variance. We conduct these steps using the Bioconductor [4] software.
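Both steps are short enough to sketch. Below is a minimal NumPy version (the study itself uses Bioconductor; tie handling in the quantile step is ignored, and all function names are ours, not the paper's):

```python
import numpy as np

def quantile_normalize(X):
    """Quantile normalization of one platform's data (genes x samples):
    every sample is forced onto the same value distribution, namely the
    mean of the per-sample sorted values [2]."""
    order = np.argsort(X, axis=0)                # per-sample ranking
    reference = np.sort(X, axis=0).mean(axis=1)  # shared target distribution
    Xn = np.empty_like(X, dtype=float)
    for j in range(X.shape[1]):
        Xn[order[:, j], j] = reference           # k-th smallest -> k-th quantile
    return Xn

def scale_samples(X):
    """Cross-platform step: give every sample zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```

quantile_normalize is run separately within each platform; scale_samples is the step that makes samples from different platforms comparable downstream.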
Fig. 1. Fully coupled fluxes in a simplified network with nodes representing chemical compounds and arrows as symbols for chemical reactions among them. Each arrow can be labeled by a protein. R3, R4 and R5 are fully coupled as a flux in any of these reactions implies a flux in the rest of them. Note that R1 and R3 do not constitute a FCF as a flux in R3 does not imply a flux in R1.
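The chain R3-R4-R5 in Fig. 1 suggests a toy structural check that is easy to sketch: two reactions are coupled whenever their shared metabolite has exactly one producing and one consuming reaction, so that flux in one forces flux in the other. This captures only that simple sufficient condition; the actual extraction over KEGG graphs described below follows the general definition of [13] and was done in Prolog. All names here are ours:

```python
def fully_coupled_chains(reactions):
    """Pair up reactions whose shared metabolite has exactly one producer
    and one consumer, so that flux in one forces flux in the other.

    reactions: dict mapping a reaction name to a (substrate, product) pair,
    e.g. {"R3": ("A", "B"), "R4": ("B", "C"), "R5": ("C", "D")}.
    """
    producers, consumers = {}, {}
    for name, (substrate, product) in reactions.items():
        producers.setdefault(product, []).append(name)
        consumers.setdefault(substrate, []).append(name)
    pairs = []
    for metabolite in producers.keys() & consumers.keys():
        if len(producers[metabolite]) == 1 and len(consumers[metabolite]) == 1:
            pairs.append((producers[metabolite][0], consumers[metabolite][0]))
    return pairs  # merging the pairs transitively yields maximal chains
```

On the example in the docstring this returns (R3, R4) and (R4, R5), which merge into the fully coupled flux {R3, R4, R5}; a metabolite with more than one producer breaks the coupling, consistent with the caption's note that flux in R3 does not imply flux in R1.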
Set Construction. Here we consider three types of background knowledge in order to define a priori gene sets. Each such set is extracted from the initial pool of all genes measured by at least one of the involved platforms. The first type groups genes that share a common Gene Ontology [3] term. The second type groups genes acting in biological pathways formalized by the KEGG [10] database. A gene falls in a set corresponding to a pathway if it is mapped to a KEGG node of some organism-ortholog of that pathway. The third gene set type is based on the notion of a fully coupled flux (FCF), motivated as follows. Many notable biological conditions are characterized by the activation of only certain parts of pathways; for example, see references [16,19,21]. The notion of 'pathway activation' implied by the previous gene set type may thus often violate intuition and hinder interpretation. Therefore we extracted all pathway partitions which comply with the graph-theoretic notion of an FCF [13]. It is known that genes coupled by their enzymatic fluxes not only show similar expression patterns, but also share transcriptional regulators and frequently reside in the same operon in prokaryotes, or in similar eukaryotic multi-gene units such as the hematopoietic globin gene cluster. An FCF is a special kind of network flux that corresponds to a pathway partition in which a non-zero flux for one reaction implies a non-zero flux for the other reactions and vice versa. It is the strongest qualitative connectivity that can be identified in a network. The notion of an FCF is explained through an example in Fig. 1; for a detailed definition, see reference [13]. Pathway partitions forming FCFs constitute the third gene set type. Again, a gene falls in a set corresponding to an FCF if it is mapped to a KEGG node in some organism-ortholog of that FCF. The extraction of fully coupled fluxes from KEGG pathway graphs was conducted in Prolog. The source code as well as the Prolog representation [8] of the pathways are available on request to the first author. The bold numbers in Table 2 display the total numbers of gene sets extracted for the respective types.

In what follows, gene sets act as features acquiring a real value for each sample. Formally, let π be the set of genes interrogated by a given platform, and Σ a set of gene sets of a particular type. We define a mapping

A_π : R^{|π|} × Σ → R.

For an expression sample s = [e_1, ..., e_{|π|}] ∈ R^{|π|}, A_π(s, σ) should collectively quantify the 'activity level' of the genes in set σ ∈ Σ in the biological situation (e.g., a tissue type) sampled by s. Typically, not all members of σ will be measured by platform π, and the computation of A_π(s, σ) will be based on the expressions e_i of the genes in σ ∩ π. For transparency, in this study we define A_π(s, σ) as the average of the expressions measured in s for all genes in σ ∩ π. We only note here that more sophisticated methods have been proposed to instantiate A_π(s, σ), either linear, based e.g. on a weighted sum of expression values of the involved genes as in [18], or non-linear, based on additional structure information as in [15], but then constrained to pathway-type gene sets.
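In code, the averaging instantiation of A_π(s, σ) is a one-liner over σ ∩ π. A minimal sketch (the dict-based sample representation and all names are ours):

```python
def gene_set_activity(sample, platform_genes, gene_set):
    """A_pi(s, sigma): mean normalized expression over sigma ∩ pi.

    sample: dict mapping each gene measured by the platform to its
    normalized expression in s; platform_genes: the genes pi interrogated
    by the platform; gene_set: the set sigma (pathway, GO term, or FCF).
    """
    measured = gene_set & platform_genes   # sigma ∩ pi
    if not measured:                       # sigma not covered by this platform
        return float("nan")
    return sum(sample[g] for g in measured) / len(measured)
```

Samples from different platforms thus acquire comparable features even though each platform covers a different subset of every gene set.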
Fig. 2. Integrating expression data collected from heterogeneous platforms into a unified tabular representation of pathway activations. If these platforms pertain to different organisms, we assume that (an ortholog of) each pathway pi exists in each of the organisms.
Our reasoning above assumes the aggregation of gene expression measurements. Precisely speaking, genes themselves aggregate one or more measurements, since multiple probesets can represent the same gene. Here, the expression of a gene is simply defined as the average of the corresponding normalized probeset measurements, despite certain caveats of this approach. (For example, Affymetrix chips contain probesets representing the same gene that cannot be consolidated into unique measures of transcription due to alternative splicing, use of alternative poly(A) signals, or incorrect annotations [17].)

Data Integration. The goal of this methodological step is to integrate heterogeneous expression samples into a single-tabular representation (that is, into a set of samples sharing a common feature set) that predictive classification algorithms can process. Formally, we have a set of expression samples S = {s_1, s_2, ...} in which s_i ∈ ∪_j R^{|π_j|} for all i, with π_j ∈ Π, where Π is the set of the considered platforms. We wish to obtain a new representation S̄ = {s̄_1, s̄_2, ...} where each s̄_i ∈ R^n, n ∈ N. This aim is achieved using the above-introduced 'gene set activation' concepts. Formally, using gene set type Σ = {σ_1, σ_2, ..., σ_m}, for each sample s_i labeled with platform π we stipulate

s̄_i = [A_π(s_i, σ_1), ..., A_π(s_i, σ_m)].

Naturally, sample s̄_i then inherits the class label from s_i. The integration principle is exemplified in Fig. 2, with pathways p_i playing the role of gene sets σ_i. The described representation conversion is part of the functionality of the aforementioned Prolog code.

Classification and Validation. The final step of the workflow is to employ machine learning algorithms to induce predictive classification models from the integrated samples. As the achieved unified representation S̄ can be processed by virtually any machine learning algorithm, the choice appears rather arbitrary. Since one of the usual arguments in favor of gene set based analysis is the ease of interpretation, we decided to test decision-tree classifiers enabling direct human inspection. Specifically, we experimented with the J48 decision tree learner included in the machine learning environment Weka [20]. The design of the experiments and the validation protocol is dictated by the following questions we wish to address empirically:

– (Q1) How do classifiers based on original single gene expressions compare in terms of predictive accuracy to those based on activations of biologically meaningful gene sets?
– (Q2) How do classifiers based on biologically meaningful gene sets compare in terms of predictive accuracy to those based on gene sets constructed randomly, with no biological relevance?
– (Q3) How do classifiers learned from single-platform data compare in terms of predictive accuracy to those learned from data integrated from heterogeneous platforms?

In the case of (Q2), we constructed three families of random gene sets corresponding to the three respective kinds of genuine gene sets, for each of the involved platforms. The correspondence is in that a particular type of random gene sets contains exactly the same number of set-elements and exactly the same set-cardinality distribution as its genuine counterpart. For each platform, the members of each random gene set were drawn randomly without replacement from a uniform probability distribution cast on the genes measured by the platform. We are interested in the insights Q1-Q3 for both the 'data-rich' and 'data-poor' situation, i.e., for both small and large sets of expression samples. Therefore the preferred means of assessment is through learning curves, which are diagrams plotting an unbiased estimate of the classifier's predictive accuracy against the proportion p of the available data set used for its training. The accuracy estimate for each measured p was obtained by inducing a classifier 20 times, each time with a randomly chosen subset (of proportional size p) of the entire data set, and testing its accuracy on the remaining data not used for training. The 20 empirical accuracy results were then averaged into the reported value. We let p range from 0.2 to 0.8 to prevent statistical artifacts arising from overly small sets used for training or testing, respectively.
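To make the protocol concrete, here is a compact sketch of the remaining computational pieces: building the unified table of Fig. 2 from heterogeneous samples, generating size-matched random gene sets, and estimating a learning curve by repeated random holdout. It reuses the gene_set_activity sketch from above and substitutes scikit-learn's decision tree for Weka's J48; everything here is an illustration, not the authors' code:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def integrate(samples, gene_sets):
    """Build the unified table of Fig. 2: one row per sample, one column per
    gene set, entries A_pi(s_i, sigma_k) computed with the sample's own
    platform. samples: list of (expr_dict, platform_genes, label) triples."""
    X = np.array([[gene_set_activity(expr, platform, gs) for gs in gene_sets]
                  for expr, platform, _ in samples])
    y = np.array([label for _, _, label in samples])
    return np.nan_to_num(X), y   # uncovered sets (NaN) become 0; the paper
                                 # does not specify this detail

def random_gene_sets(gene_sets, platform_genes, seed=0):
    """Random counterparts: same number of sets, same cardinality
    distribution, members drawn uniformly without replacement from the
    genes measured by the platform."""
    rng = np.random.default_rng(seed)
    pool = sorted(platform_genes)
    return [set(rng.choice(pool, size=min(len(gs), len(pool)), replace=False))
            for gs in gene_sets]

def learning_curve(X, y, proportions=(0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8),
                   repeats=20):
    """Mean holdout accuracy of a decision tree vs. training proportion p."""
    curve = []
    for p in proportions:
        accs = []
        for r in range(repeats):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, train_size=p, stratify=y, random_state=r)
            clf = DecisionTreeClassifier(random_state=r).fit(X_tr, y_tr)
            accs.append(clf.score(X_te, y_te))
        curve.append((p, float(np.mean(accs))))
    return curve
```

Running learning_curve on a table built from genuine gene sets and again on one built with random_gene_sets reproduces the genuine-versus-random comparison of Figs. 4 and 5.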
3 Classification Tasks and Data
Here we validate our methodology in biological classification tasks. In order to avoid domain bias, we chose not to tackle overly special classification cases such as those addressing particular diseases. We therefore address two general tasks of tissue type classification. The first experiment focuses on distinct features of the blood-forming (hematopoietic; 'heme' in figure legends) and supportive (stromal; 'stroma') cellular compartments in the bone marrow. The second assesses differences in brain, liver and muscle tissues. Both experiments are of biological significance as they tackle novel challenges in the understanding of cellular behavior: the former in the complex functional unit termed the hematopoietic stem cell niche, where inter-dependent hematopoietic and stromal cell functions synergize in the blood-forming function of the bone marrow; the latter in the comparison of cell fate determined by tissue origin from the separate layers of the embryo: ectoderm (brain), endoderm (liver) and mesoderm (muscle). While of general character, the chosen classification tasks are not just random biological exercises, as these studies may illuminate cellular functions determined by gene expression signatures in a complex cell system seeded by cell-type-heterogeneous undifferentiated populations (hematopoietic and stromal stem cells in the cell niche), and in the cell-type-homogeneous differentiated tissues (brain, liver and muscle), respectively.

For both the first (2-class) and the second (3-class) classification problems, samples were downloaded from the Gene Expression Omnibus database [14]. We only downloaded control (non-treated, non-pathological) samples of each tissue in question. For ease of gene functional annotation, we only downloaded samples measured with platforms provided by Affymetrix. Table 1 provides the statistics on sample distribution among classes and platforms. Table 2 then shows statistics derived from the application of the a priori constructed gene sets onto the collected expression samples.

Table 1. Sample size statistics. Platforms are identified by NCBI's GPL keys. Organism keys stand for mus musculus (mmu), homo sapiens (hsa) and rattus norvegicus (rno).

Platform  1261  339  341  570  81   91   96   97
Organism  mmu   mmu  rno  hsa  mmu  hsa  hsa  hsa
Heme      46    7    4    19   6    -    18   18
Stroma    19    -    -    -    8    47   26   33

Platform  1261  91   96
Organism  mmu   hsa  hsa
Brain     6     15   20
Liver     11    2    6
Muscle    11    22   41

Table 2. Gene sets statistics. Numbers in bold are independent of the specific platforms measuring the expression data, being only determined by the respective types of background knowledge. The 'Probesets contained' columns capture statistics over all involved platforms. The first three rows correspond to the a priori defined sets. For accuracy, we list their sizes in terms of probesets, rather than genes. The statistical relation between genes and probesets is in turn shown in the last row.

                   Probesets contained
Set type  Total    Min  Max   Avg    Median
FCF       901      0    83    5.47   2
Pathway   251      0    457   52.09  33
GO term   5164     1    7605  25.75  3
Gene      12808*   1    49    1.58   1

* average across platforms
4 Results
Here we show the empirical results obtained by processing the data described in Section 3 by the method explained in Section 2 and comment on their relevance to questions Q1-Q3 formulated in the latter section. Results are of two types: single-platform (experiments conducted on a single type of microarray) and cross-platform (experiments on the integrated heterogeneous expression data). Single-platform experiments are shown in both classification tasks for the sample-richest platform pertaining to the homo sapiens organism (GPL97 and GPL96 respectively). The principal trends observed are as follows. Q1 is addressed by the top two panels of Fig. 3. While they do not provide a conclusive performance ranking of the four types of sample representation, they clearly demonstrate that predictive
[Figure 3: four learning-curve panels (top: Heme-Stroma / genuine sets, GPL97 and Brain-Liver-Muscle / genuine sets, GPL96; bottom: the same two tasks in the cross-platform setting), each plotting % correctly classified samples against % of samples used for training, with curves for Gene (single-platform panels only), Pathway, GO term, and Fully coupled fluxes.]

Fig. 3. Overall comparison of predictive classification performance using genes (only single-platform) and genuine gene sets. Top: single-platform. Bottom: cross-platform.
[Figure 4: six learning-curve panels, rows for the three gene set types (GO term, Pathway, Fully coupled fluxes) and columns for the Heme-Stroma (GPL97) and Brain-Liver-Muscle (GPL96) tasks, each comparing random gene sets against genuine gene sets on the same axes as Fig. 3.]

Fig. 4. Single-platform experiments comparing performance of predictive classification using genuine gene sets with that using random gene sets as sample features. Rows correspond to different gene set types, columns to different classification tasks.
[Figure 5: six learning-curve panels, rows for the three gene set types (GO term, Pathway, Fully coupled fluxes) and columns for the Heme-Stroma and Brain-Liver-Muscle tasks in the cross-platform setting, each comparing random gene sets against genuine gene sets on the same axes as Fig. 3.]

Fig. 5. Cross-platform experiments comparing performance of predictive classification using genuine gene sets with that using random gene sets as sample features. Rows correspond to different gene set types, columns to different classification tasks.
accuracy is not sacrificed by converting the representation from genes to gene sets. On the contrary, the gene set representation based on GO terms quite systematically outperforms the original gene-based representation. The lower two panels of Fig. 3 compare the three gene set based approaches in the cross-platform experiments, where the gene-based representation is not applicable. In the Heme-Stroma task, a clear ranking is observable, with fully coupled fluxes performing best, followed by GO terms and lastly pathways. The ranking induced by the Brain-Liver-Muscle task is much less crisp. Figures 4 and 5 relate to Q2. Fig. 4 provides the surprising finding that none of the three genuine gene set representations strictly outperforms its randomized counterpart in both tasks performed in the single-platform setting, with the pathway-based gene set representation being strikingly outperformed in the Brain-Liver-Muscle task. To make sure that these results were not a statistical artifact, we regenerated all the randomized gene sets and arrived at essentially the same results. Combining these results with the top row of Fig. 3, we make another observation: the random gene set approach often improves classification accuracy over the basic classification based on gene expressions. This latter observation can, however, be explained rather naturally by viewing the random gene set approach as a form of stochastic feature extraction [7], reducing the dimensionality of the data and thus suppressing the variance component [6] of the classification error. The trends are significantly different in the cross-platform setting (Fig. 5), where all genuine gene set types strictly outperform their random counterparts in both tasks. Here the value of biologically meaningful gene sets manifests itself clearly, in that the sets act as links connecting diverse genes distributed across platforms. Such a link is obviously broken when the gene sets are randomized. Finally, to answer Q3, we compare the upper panels of Fig. 3 against its lower panels. With large training data sizes, accuracy differences between single-platform (upper panels) and cross-platform (lower panels) learning are insignificant, letting us conclude that assembling multiple-platform data did not have a detrimental effect on classification performance. More importantly still, in the cross-platform setting, high accuracies are achieved much earlier along the x axis than in the single-platform setting. While the reason is obvious (the same sample set proportion corresponds to a higher absolute number of samples in the cross-platform case), this observation is reassuring: an experimenter possessing a sample set too small for reliable model induction may benefit from employing the gene set based approach to include further relevant public expression samples, even if they come from diverse microarray platforms.
5 Conclusions and Future Work
We have demonstrated a set-level approach to the integration of multiple-platform gene expression data for predictive classification and argued its utility for boosting classification performance when single-platform samples are rare. We explored three ways of defining gene sets, including a novel way based on
the notion of a fully coupled flux related to metabolic pathways. In two tissue classification tasks, we showed that the gene set based representation is unquestionably useful for combining heterogeneous expression data. This may serve to assemble a larger sample set or to obtain general biological insights not limited to a particular organism. On the other hand, in experiments constrained to a single platform, biologically meaningful gene sets were often outperformed by random gene sets with no biological relevance. Further studies are obviously needed to conclusively compare the performance of biologically relevant gene sets with their randomized counterparts; such studies would be especially interesting in problems where the genuine gene set approach was shown successful, such as in [11,18]. Another natural extension of this work would be the adoption of a less elementary approach to determining the pathway activation levels, e.g., along the lines of the study [15].

Acknowledgements. The authors are supported by the Czech Grant Agency through project 201/09/1665 (MH), the Czech Ministry of Education through projects ME910 (FZ) and MSM6840770012 (JK), and by the Children's Cancer Research Fund of the University of Minnesota (JT).
References

1. Bild, A., Febbo, P.G.: Application of a priori established gene sets to discover biologically important differential expression in microarray data. PNAS 102(43), 15278–15279 (2005)
2. Bolstad, B.M., Irizarry, R.A., Astrand, M., Speed, T.P.: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2), 185–193 (2003)
3. The Gene Ontology Consortium: Gene ontology: tool for the unification of biology. Nature Genetics 25 (2000)
4. Gentleman, R.C., Carey, V.J., Bates, D.M., et al.: Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology 5, R80 (2004)
5. Goeman, J., Bühlmann, P.: Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 23(8), 980–987 (2007)
6. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer, Heidelberg (2001)
7. Ho, T.K.: The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(8), 832–844 (1998)
8. Holec, M., Zelezny, F., Klema, J., et al.: Using bio-pathways in relational learning. In: Late Breaking Papers, 18th International Conference on Inductive Logic Programming (ILP 2008) (2008)
9. Huang, D.W., Sherman, B.T., Lempicki, R.A.: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols 4, 44–57 (2009)
10. Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., Hattori, M.: The KEGG resource for deciphering the genome. Nucleic Acids Res. 32, 277–280 (2004)
11. Mootha, V.K., Lindgren, C., Laureta, S., et al.: PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics 34, 267–273 (2003)
12. Nicolae, D.L., De la Cruz, O., Wen, W., Ke, B., Song, M.: Invited keynote talk: Set-level analyses for genome-wide association data. In: Măndoiu, I., Sunderraman, R., Zelikovsky, A. (eds.) ISBRA 2008. LNCS (LNBI), vol. 4983, p. 1. Springer, Heidelberg (2008)
13. Notebaart, R.A., Teusink, B., Siezen, R.J., Papp, B.: Co-regulation of metabolic genes is better explained by flux coupling than by network distance. PLoS Computational Biology 4(1) (2008)
14. Edgar, R., Domrachev, M., Lash, A.E.: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 30(1), 207-210 (2002)
15. Rapaport, F., Zinovyev, A., Dutreix, M., Barillot, E., Vert, J.-P.: Classification of microarray data using gene networks. BMC Bioinformatics 8, 35 (2007)
16. Shaw, A.S., Filbert, E.L.: Scaffold proteins and immune-cell signalling. Nat. Rev. Immunol. 9(1), 47-56 (2009)
17. Stalteri, M.A., Harrison, A.P.: Interpretation of multiple probe sets mapping to the same gene in Affymetrix GeneChips. BMC Bioinformatics 8, 13 (2007)
18. Tomfohr, J., Lu, J., Kepler, T.B.: Pathway level analysis of gene expression using singular value decomposition. BMC Bioinformatics 6 (2005)
19. Weichhart, T., Säemann, M.D.: The PI3K/Akt/mTOR pathway in innate immune cells: emerging therapeutic applications. Ann. Rheum. Dis. suppl. 3, iii70-iii74 (2008)
20. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
21. Sun, Y., Chen, J.: mTOR signaling: PLD takes center stage. Cell Cycle 7(20), 3118-3123 (2008)
Practical Quality Assessment of Microarray Data by Simulation of Differential Gene Expression

Brian E. Howard1, Beate Sick2, and Steffen Heber1,3

1 Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States
2 Institute of Data Analysis and Process Design, Zurich University of Applied Science, Winterthur, Switzerland
3 Department of Computer Science, North Carolina State University, Raleigh, North Carolina, United States
[email protected], [email protected], [email protected]
Abstract. There are many methods for assessing the quality of microarray data, but little guidance regarding what to do when defective data is identified. Depending on the scientific question asked, discarding flawed data from a small experiment may be detrimental. Here we describe a novel quality assessment method that is designed to identify chips that should be discarded from an experiment. This technique simulates a set of differentially expressed genes and then assesses whether discarding each chip enhances or obscures the recovery of this known set. We compare our method to expert annotations derived using popular quality diagnostics and show, with examples, that the decision to discard a chip depends on the details of the particular experiment. Keywords: Microarray, quality assessment, simulation.
1 Introduction

Considerable attention has been paid to methods and metrics that can be used to measure the quality of microarray data (for recent reviews, see [1, 2]). For example, a common approach employs a routine set of diagnostic plots and statistics to identify arrays having low quality relative to the other chips in an experiment [3-8]. In the majority of cases, these methods are used as a filtering step, with the assumption that discarding low quality arrays should increase both the sensitivity and specificity of tests for differentially expressed genes [2]; however, in reality, many of these chips still contain valuable signal, even if that signal is obscured by extensive statistical noise. For a given FDR level, increasing sample size can increase the power to identify differentially expressed genes with decreased probability of declaring false positives [9]. Hence, as demonstrated in [10], discarding moderately noisy chips can actually be detrimental in many cases. Unfortunately, no clear guidelines currently exist for differentiating scenarios in which it is advantageous to discard low quality data from situations where that data should be retained. Here we present a simple procedure that can be used to assess the quality of microarray data. In contrast to other methods, however, this procedure also provides
practical advice about what to do when low quality chips are identified. The method works by first simulating a set of differentially expressed genes, using gene expression distributions estimated from the dataset. Then, the procedure identifies arrays whose inclusion impairs the recovery of this known set of genes. This method is intended not to merely categorize arrays into binary “high quality” and “low quality” categories, but to identify arrays that should actually be excluded from a particular analysis. Because this approach to quality assessment depends on the details of the particular microarray experiment considered, the assessment framework we describe is easily adaptable to a variety of analysis protocols and experimental frameworks. In the first section, we will describe the dataset used in this paper, and explain the simulation algorithm. Then, we will compare the results obtained from this approach with previous expert annotations created with the aid of a set of popular quality diagnostics. We will illustrate the observation that any decision about whether to include a given array should be dependent not only on the noise profile of the array itself, but also on the details of the specific experiment being performed, including the number of replicates in each sample and the analysis method used to interpret the results.
2 Methods

2.1 Datasets

The dataset for this research consists of a set of 531 Affymetrix raw intensity (.CEL) files obtained from the NCBI GEO database [11]. These data are a subset of the dataset described in [8] and consist of all the experiments having at least three samples per treatment. Several of the most commonly used Affymetrix GeneChip 3' expression array types are represented, chosen to include a variety of frequently investigated tissue types, experimental treatments and species, including Arabidopsis (ath1121501 array), mouse (mgu74a, mgu74av2, moe430a, and mouse4302 arrays), rat (rae230a and rgu34a arrays), and human (hgu133a, hgu95av2, hgu95d, and hgu95e arrays).

2.2 Expert Annotations

Quality scores were assigned to each chip by a domain expert, according to a procedure previously established and applied in the Lausanne DNA Array Facility (DAFL) [7]. Briefly, this procedure involves the systematic analysis of a variety of common predictive quality assessment metrics, including: chip scan images, distributions of the log scale raw and normalized PM probe intensities, plots of the 5' to 3' probe intensity gradients, pseudo-images of the PLM weights and residuals, and boxplots of the Normalized Unscaled Standard Error (NUSE) and Relative Log Expression (RLE) scores for each chip. After consideration of each of these quality features, the expert identified arrays that appeared to be outliers with respect to other chips in the same experiment, and each array was assigned a quality score of 0, 1, or 2, with 0 being "acceptable quality" (462 chips), 1 being "suspicious quality" (45 chips) and 2 being "unacceptable quality" (24 chips). These scores were then used as a basis of comparison for the quality assessments made using the empirical approach described in this paper.
2.3 Quality Assessment Algorithm

Our approach takes a very practical definition of microarray data "quality": a low quality microarray is an array that diminishes the chances of accurately detecting differentially expressed genes, given a particular experimental design, dataset, and analysis methodology. To make this determination, our algorithm uses simulated data to find out if excluding a particular chip is likely to improve the ability to detect differentially expressed genes in an experiment similar to the one intended by the investigator. The simulated dataset is constructed using the observed gene expression distributions from the original experiment. Within this simulated dataset, which includes both a "treatment" group and a "control" group, some of the genes are differentially expressed. The quality assessment procedure operates by performing a statistical test for differential expression under two different scenarios: (1) using only the simulated data, and (2) using the simulated data plus the actual expression measurements for one of the chips. If excluding the actual expression measurements for this chip enhances the recovery of the known set of simulated differentially expressed genes, then that chip is flagged as "low quality". Note that this definition of quality depends on the details of the experiment examined, and, accordingly, our quality assessment framework is adaptable to a variety of microarray platforms and statistical procedures.

For concreteness, we will describe the algorithm as it might be applied to a set of one-color microarray data of the sort that comprises our previously described test dataset. However, the details of this approach, including the normalization procedure, gene expression parameters, and choice of statistical test, are flexible. These can, and should, be adapted to match the analysis approach used for the actual experimental data.

Goal
• To determine whether or not a particular microarray chip should be excluded from an experiment designed to test for differential expression between two treatment groups.

Input
• A set of microarray expression values from treatment Group 1, which contains N1 (≥ 2) replicate chips.
• A set of microarray expression values from treatment Group 2, which contains N2 replicate chips.
• A suspected low quality chip, c, from Group 1.

Output
• A decision whether or not to exclude chip c from the test for differential expression between Group 1 and Group 2.

Procedure
1. Normalize the complete dataset using whatever procedure would normally be used in the final analysis (e.g. quantile/RMA [12], etc.).
2. Exclude the suspect chip, c, and use the N1 - 1 remaining chips from Group 1 to estimate the mean, $\hat{\mu}_g$, and sample variance, $s_g^2$, for every probeset, g, on the chip.

Repeat steps 3-8 30 times:

3. Simulate a set of G1 consistently expressed genes (CEGs) as follows:
• Randomly select G1 probesets from the set of all probesets on the chip.
• For each selected probeset, sample N1 + N2 - 1 values from a Normal($\hat{\mu}_g$, $s_g^2$) distribution.
• Append the actual expression values from chip c to the simulated data for Group 1. The result is a G1 × (N1 + N2) expression matrix, where the first N1 columns correspond to "treatment 1" and the remaining N2 columns are "treatment 2".
4. Use the same procedure to simulate a set of G2 differentially expressed genes (DEGs), with the following additional step:
• Add a small multiple of the probeset-specific standard deviation, $s_g$, to the N2 "treatment 2" expression values, shifting the mean of the second treatment group relative to the first.
5. Perform a test for differential expression between the two treatments (e.g. using LIMMA [13]) in each of the G1 + G2 rows.
6. Evaluate the performance of this test by computing an ROC curve, which can be constructed from the sorted p-values from the tests in step 5. Using this ROC curve, compute the corresponding area under the curve (AUC). (A detailed guide to ROC curves can be found in [14].)
7. Discard the expression values from the suspect chip, and re-compute the ROC curve and AUC (i.e. repeat steps 5 and 6).
8. Record the difference between the AUC scores computed in steps 6 and 7.

9. Discard chip c if the AUC without chip c is significantly higher than with chip c.
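The procedure above is straightforward to prototype. The sketch below (Python/NumPy/SciPy) is an illustration under stated assumptions, not the authors' R code: a Welch t-test stands in for the LIMMA moderated t-test, the AUC is computed directly as the probability that a DEG outranks a CEG, and all function and variable names are hypothetical. It assumes at least two remaining Group 1 chips.

```python
import numpy as np
from scipy import stats

def auc_from_pvalues(p_ceg, p_deg):
    """P(a random DEG ranks ahead of a random CEG) when genes are sorted by p-value."""
    less = (p_deg[:, None] < p_ceg[None, :]).mean()
    ties = 0.5 * (p_deg[:, None] == p_ceg[None, :]).mean()
    return less + ties

def auc_gaps(group1, chip_c, n2, g1=500, g2=500, shift=1.0, reps=30, seed=0):
    """Per-replicate AUC(without suspect chip) - AUC(with suspect chip).

    group1: (N1-1) x P matrix of the remaining Group 1 chips (N1 >= 3 assumed)
    chip_c: length-P vector of the suspect chip
    """
    rng = np.random.default_rng(seed)
    n1 = group1.shape[0] + 1
    mu, sd = group1.mean(axis=0), group1.std(axis=0, ddof=1)
    gaps = []
    for _ in range(reps):
        probes = rng.choice(group1.shape[1], size=g1 + g2, replace=False)
        m, s = mu[probes], sd[probes]
        # simulate N1-1 "treatment 1" and N2 "treatment 2" columns for every gene
        sim = rng.normal(m[:, None], s[:, None], size=(g1 + g2, n1 - 1 + n2))
        sign = np.where(np.arange(g2) % 2 == 0, 1.0, -1.0)        # alternate +/- deltas
        sim[g1:, n1 - 1:] += shift * (sign * s[g1:])[:, None]     # make the DEGs differ
        def pvals(gr1):
            return stats.ttest_ind(gr1, sim[:, n1 - 1:], axis=1, equal_var=False).pvalue
        p_with = pvals(np.column_stack([chip_c[probes], sim[:, :n1 - 1]]))
        p_without = pvals(sim[:, :n1 - 1])
        gaps.append(auc_from_pvalues(p_without[:g1], p_without[g1:]) -
                    auc_from_pvalues(p_with[:g1], p_with[g1:]))
    return np.array(gaps)

# A chip would be flagged when the mean gap is positive and a paired test over
# the replicates is significant, e.g. stats.ttest_1samp(gaps, 0.0).pvalue < .001.
```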
3 Results

3.1 Comparison with Expert Annotations
After normalizing all arrays from each experiment using RMA [12], we applied the previously described simulation-based quality assessment procedure to each of the chips in our dataset. For each chip, we simulated 30 N × N experiments, where N is the number of replicates for that chip's treatment in the original dataset. Each experiment contained 500 consistently expressed genes (CEGs) and 500 differentially expressed genes (DEGs). Differential expression was simulated by adding ±1 standard deviation to the second treatment group (odd probesets were given positive deltas, and even probesets were given negative deltas). The R LIMMA [13] package was then used to identify differentially expressed genes, both with and without the suspect chip, and the resulting ROC curves were computed in each case. Chips whose inclusion significantly lowered the AUC according to a paired t-test (p-value < .001) were identified as having low quality. The entire analysis was performed using the R statistical programming language (code available from the authors by request).

Fig. 1. Low-quality calls by expert quality group. Expert quality score is shown on the x-axis. Light blue indicates the frequency of each category among expert annotations. Dark blue shows the proportion of each category flagged for exclusion using the simulation approach.

Fig. 2. Comparison of expert and simulation determined quality scores. Chips with expert quality scores of 1 or 2 are included in the "Flagged by Expert" set. Chips with simulation p-values < .001 are included in the "Flagged by Algorithm" set.

We then compared the chips identified using this procedure with those identified previously by the domain expert. Figure 1 shows that, for the 24 chips identified by the expert as having the lowest quality (i.e. scored as 2's), the simulation identified 8 chips as candidates for exclusion (33.3%). Among the 45 chips flagged by the expert as suspicious (1's), 11 were identified by the simulation procedure as candidates for exclusion (24.4%). For the 462 chips regarded by the expert as having acceptable quality, only 2 were identified by the algorithm as candidates for exclusion (0.43%). Figure 2 summarizes the chips flagged as low quality by the two methods.

3.2 Practical Quality Judgment Depends on the Details of the Experiment
Quality assessment procedures based on predictive quality metrics sometimes have difficulty determining the utility of excluding suspicious chips, because this decision is inextricably tied to the details of the particular experiment and the analysis method used. Unfortunately, the values for most quality metrics do not explicitly incorporate the sample size, target effect magnitude or analysis method employed. However, these experimental details are critical for making a rational decision regarding the inclusion or exclusion of low quality data. This scenario is illustrated in the following examples.

Example 1. GEO dataset GSE1873 [15] contains gene expression measurements taken from liver tissue of obese mice. The experiment used 5 Affymetrix microarrays to measure gene expression of obese mice exposed to intermittent hypoxic conditions and 5 microarrays to measure gene expression of obese mice used as controls. Using the protocol described in Section 2.2, our domain expert examined this dataset and identified 3 chips as having low or suspicious quality (GSM32860, GSM32861 and GSM32866). However, in a simulation using 5 chips in each treatment group, only GSM32860 and GSM32866 were found to be worthy of exclusion (when considered individually). On the other hand, in simulated 3x3 and 4x4 experiments, exclusion of chip GSM32866 is no longer recommended by our procedure. Conversely, as the simulated experiment size increases, the p-value for chip GSM32861 approaches the threshold for exclusion, with a p-value of less than 0.01 for experiments of size 9x9 or greater.

Example 2. Recent research has demonstrated that many of the common quality problems observed in a typical microarray experiment can be mitigated with the use of robust analysis methods. For example, many typical quality problems can be captured with a heteroscedastic variance model which allows each chip to have a different level of random noise [10]. Smyth showed in simulation that, in many cases, procedures that simply down-weight noisier chips perform better than methods that attempt to identify and exclude these low quality chips. Again, consider experiment GSE1873. Figure 3 shows the expression values for a few representative probesets (chips identified by the expert as low quality are shown as colored dots). The diagram illustrates the fact that there is greater variance between probesets than among chips within each probeset. On the other hand, the expression values for the low quality chips appear to have extreme values more often than the other chips, although not consistently in one direction.
Fig. 3. Normalized expression levels for 4 probesets from experiment GSE1873. Green circles correspond to ‘treatment 1’ and blue circles to ‘treatment 2.’ The colored circles represent chips flagged by the expert as low quality. The dashed lines indicate the median expression level for each treatment, while the dotted lines correspond to treatment median ± 1 MAD (median absolute deviation). X-axis is chip name; y-axis is normalized expression.
Fig. 4. Relative Log Expression (RLE) for experiment GSE1873. The height of each box corresponds to the interquartile range of the RLE, and the midline indicates the RLE median.
The RLE boxplot also reflects this observation (Figure 4): the interquartile ranges for the low quality chips are larger than for the high quality chips. These observations suggest that the heteroscedastic variance model may indeed be useful in the analysis of this data set. To test this hypothesis, we repeated the quality simulation with one modification: we used the "arrayWeights" functionality of the LIMMA package to identify and down-weight noisy chips. Under this analysis framework, the quality simulation showed that excluding these chips is no longer recommended. On the other hand, even robust methods cannot be expected to correct the most extreme types of errors. For example, we simulated mislabeled samples by interchanging data from different GEO datasets and observed that in these cases it was often still better to remove the foreign arrays than to apply the down-weighting procedure (data not shown).
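limma's arrayWeights estimates per-array weights within a full linear-model fit; the fragment below is only a schematic of the underlying idea, namely down-weighting chips in inverse proportion to their estimated residual variance rather than discarding them. It is not a reimplementation of [10], and the function names are our own.

```python
import numpy as np

def chip_weights(chips, eps=1e-8):
    """Inverse-variance weights for a (n_chips x n_genes) expression block:
    noisier chips get smaller weights instead of being discarded."""
    resid = chips - chips.mean(axis=0)        # deviation of each chip from gene means
    w = 1.0 / (resid.var(axis=1) + eps)       # one weight per chip
    return w / w.sum()

def weighted_group_mean(chips):
    """Per-gene group mean with noisy chips down-weighted."""
    return chip_weights(chips) @ chips
```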
4 Discussion

The quality assessment method described here addresses an important question not often considered by other procedures: what to do with the low-quality chips that are identified. In many real-world scenarios, better results can be obtained by retaining slightly flawed data, instead of discarding it completely. Unfortunately, there is currently little guidance available with regard to this decision. Our method takes an empirical approach to this problem by simulating a set of differentially expressed genes and then evaluating the contribution of each suspected chip with regard to identifying these genes. For the datasets examined in our research, the chips identified by the simulation algorithm as "excludable" were roughly a subset of the chips identified by the domain expert as having low quality (Figure 2). This may imply that although the expert is correctly identifying the chips with higher noise levels, many of those chips still retain useful signal, especially within the context of the small experiments considered. This approach is easily adapted to other analysis settings, and, in general, it is recommended that the analysis method and parameter settings chosen for the simulation should match the protocol intended for the real data set. For example, here we have used the LIMMA library for statistical analysis, but other methods, such as SAM [16] or Cyber-T [17], could just as easily be applied instead. Alternatively, if the researcher is interested in controlling false discoveries at a specific rate, then one could apply an FDR control procedure and compare the number of true discoveries made instead of the area under ROC curves. In future work we intend to explore more thoroughly the influence of these parameters on the resulting quality decisions. It would also be interesting to enhance our simulation approach to emulate more complex gene expression models, possibly allowing for correlated genes, non-normal distributions and variable effect sizes. It should be noted that when applying the procedure as described here, it is important to look not only at the resulting p-value, but also at the magnitude of the observed difference in AUC obtained with and without each chip. Very small differences can sometimes accompany significant p-values, especially if enough replications are performed; in these cases it is probably prudent to retain the chip anyway. Like other quality assessment procedures that attempt to identify outliers among a particular set of microarrays, our method is susceptible to scenarios where the dataset is corrupted by a majority of chips with systematic error. For example, in a dataset where one of the arrays is mislabeled with regard to the experimental treatment
applied, our method would likely identify the mislabeled array as an outlier; however, if all of the arrays except one particular array were mislabeled, our algorithm may erroneously identify the correctly labeled array as the outlier. Robust analysis methods such as the approach described in [10] can potentially mitigate many of the common problems observed in microarray datasets. On the other hand, there are still scenarios where even the most robust methods cannot recover useful signal from a particular low quality array. Arrays showing evidence of large spatial artifacts, contamination or other gross errors such as mislabeled samples can rarely be salvaged. Our method can be used to identify these scenarios. In addition, while the method we have described can be used on its own for quality assessment, this technique can also be used in conjunction with other traditional quality diagnostics, which may provide additional clues as to what sorts of errors are present in a batch of arrays and thereby assist in avoiding these problems in the future.
References
1. Larsson, O., Wennmalm, K., Sandberg, R.: Comparative microarray analysis. OMICS: A Journal of Integrative Biology 10(3), 381-397 (2006)
2. Wilkes, T., Laux, H., Foy, C.A.: Microarray data quality - review of current developments. OMICS: A Journal of Integrative Biology 11(1), 1-13 (2007)
3. Archer, K.J., Dumur, C.I., Joel, S.E., Ramakrishnan, V.: Assessing quality of hybridized RNA in Affymetrix GeneChip experiments using mixed-effects models. Biostatistics 7(2), 198-212 (2006)
4. Reimers, M., Weinstein, J.N.: Quality assessment of microarrays: visualization of spatial artifacts and quantitation of regional biases. BMC Bioinformatics 6, 166 (2005)
5. Stokes, T.H., Moffitt, R.A., Phan, J.H., Wang, M.D.: chip artifact CORRECTion (caCORRECT): a bioinformatics system for quality assurance of genomics and proteomics array data. Annals of Biomedical Engineering 35(6), 1068-1080 (2007)
6. Gentleman, R., Carey, V., Huber, W., Irizarry, R., Dudoit, S.: Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer, New York (2005)
7. Heber, S., Sick, B.: Quality assessment of Affymetrix GeneChip data. OMICS: A Journal of Integrative Biology 10(3), 358-368 (2006)
8. Howard, B.E., Sick, B., Heber, S.: Unsupervised assessment of microarray data quality using a Gaussian mixture model (2009) (manuscript, submitted)
9. Pawitan, Y., et al.: False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics 21, 3017-3024 (2005)
10. Ritchie, M.E., Diyagama, D., Neilson, J., van Laar, R., Dobrovic, A., Holloway, A., Smyth, G.: Empirical array quality weights in the analysis of microarray data. BMC Bioinformatics 7, 261 (2006)
11. Edgar, R., Domrachev, M., Lash, A.E.: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research 30(1), 207-210 (2002)
12. Irizarry, R.A., Bolstad, B.M., Collin, F., Cope, L.M., Hobbs, B., Speed, T.P.: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research 31(4), e15 (2003)
13. Smyth, G.K.: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 3(1) (2004)
14. Fawcett, T.: An introduction to ROC analysis. Pattern Recognition Letters 27, 861-874 (2006)
15. Li, J., Grigoryev, D.N., Ye, S.Q., Thorne, L., et al.: Chronic intermittent hypoxia upregulates genes of lipid biosynthesis in obese mice. Journal of Applied Physiology 99(5), 1643-1648 (2005)
16. Tusher, V.G., Tibshirani, R., Chu, G.: Significance analysis of microarrays applied to the ionizing radiation response. PNAS 98(9), 5116-5121 (2001)
17. Baldi, P., Long, A.D.: A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 17, 509-519 (2001)
Mean Square Residue Biclustering with Missing Data and Row Inversions

Stefan Gremalschi1,⋆, Gulsah Altun2, Irina Astrovskaya1,⋆, and Alexander Zelikovsky1

1 Department of Computer Science, Georgia State University, Atlanta, GA 30303
{stefan,iraa,alexz}@cs.gsu.edu
2 Department of Reproductive Medicine, University of California, San Diego, CA 92093
[email protected]

⋆ Partially supported by GSU Molecular Basis of Disease Fellowship.
Abstract. Cheng and Church proposed a greedy deletion-addition algorithm to find a given number k of biclusters whose mean squared residues (MSRs) are below certain thresholds, with the missing values in the matrix replaced by random numbers. In our previous paper, we introduced the dual biclustering method with quadratic optimization. In this paper, we modify the dual biclustering method with quadratic optimization and add three new features. First, we introduce a "row status" for each row in a bicluster, and we add and delete rows from biclusters based on their status in order to find the minimum MSR. We compare our results with Cheng and Church's approach, in which rows are inverted only while being added to biclusters; we select the row or the negated row not only at addition but also at deletion, and show an improvement. Second, we give a proof of the theorem introduced by Cheng and Church in [4]. Third, since missing data often occur in the input matrices for biclustering and are usually filled with random numbers, we show that ignoring the missing data is a better approach that avoids the additional noise caused by randomness. Since an ideal bicluster is a bicluster with an H value of zero, our results show a significant decrease of the H value of the biclusters, with less noise, compared to the original dual biclustering and the Cheng and Church method. Keywords: Biclustering, Mean Square Residue.
1 Introduction

The gene expression data are given in matrices. In these matrices, rows represent genes and columns represent experimental conditions. Each cell in the matrix represents the expression level of a gene under a specific experimental condition. It is well known that genes can be relevant for a subset of conditions. On the other hand, groups of conditions can be clustered by using different groups of genes. In this case, it is important to do clustering in these two dimensions simultaneously. This led to the discovery of
biclusters, corresponding to a subset of genes and a subset of conditions with a high similarity score, by Cheng and Church [4]. Biclustering algorithms perform simultaneous row-column clustering; their goal is to find homogeneous submatrices. Biclustering has been widely used to find appropriate subsets of experimental conditions in microarray data [1, 5, 7, 9, 11-13, 15, 18, 19]. Cheng and Church's algorithm is based on a natural uniformity model, the mean squared residue. They proposed a greedy deletion-addition algorithm to find a given number k of biclusters whose mean squared residues (MSRs) are below certain thresholds. However, in their method, missing values in the matrix are replaced with random numbers. It is possible that these random numbers interfere with the discovery of future biclusters, especially those that overlap with the discovered ones. Yang et al. [15, 16] referred to this as random interference. They generalized the model of the bicluster to incorporate missing values and proposed a probabilistic move-based algorithm, FLOC (FLexible Overlapped biClustering), which generalizes the concept of the mean squared residue and is based on the concepts of action and gain. However, the FLOC model is still not suitable for non-disjoint clusters, and there are more user parameters, including the number of biclusters; these additional features can have negative impacts on the clustering process.

In this paper, we propose a similar method to handle the missing data. We first mathematically characterize general "ideal" biclusters, i.e., biclusters with zero mean squared residue. We show that the new way of handling missing data is significantly more tolerant to noise. We also introduce a status for each row: status -1 means that the corresponding row is inverted (negated), and status +1 means that the original row is not inverted. We consider the problem of finding the minimum MSR over all possible row inversions. A limited use of row inversion (without introducing row status) has been applied in [4] when rows are added to biclusters. Based on our findings in [14], we developed a new dual biclustering algorithm and a quadratic program that treat missing data accordingly and use the best status assignment. Matrix entries with missing data are not taken into account when computing averages. When comparing our method with Cheng and Church [4], we show that it is better to ignore missing data when adjusting the mean squared residue (MSR) value for finding optimal biclusters. We use a set of methods which includes a dual biclustering algorithm, a quadratic program (QP) and a combination of dual biclustering with the QP, which finds a (k × l)-bicluster with minimal MSR using the greedy approach proposed in [14]. Finally, we apply the best row status assignments and obtain an even better average and median MSR over the set of all biclusters.

The remainder of this paper is organized as follows. Section 2 gives the formal definition of the mean squared residue. In Section 3, we give a new definition for adjusting the MSR and prove a necessary and sufficient criterion for a matrix to have a perfect correlation. Section 4 defines the inversion based MSR and shows how to compute it.
In Section 5, we introduce the dual problem formulation described in [14] and illustrate the comparison of the new adjusted MSR with Cheng and Church's method. The search for biclusters using the new MSR is given in Section 6. The analysis and validation of the experimental study is given in Section 7. Finally, we draw conclusions in Section 8.
2 Mean Squared Residue

The mean squared residue problem has been defined before by Cheng and Church [4] and Zhou and Khokhar [13]. In this paper, we use the same terminology as in [13]. In this section, we give a brief introduction to the terminology as given in [14]. Our input is an (N × M) data matrix A, where a cell $a_{ij}$ is a real value that represents the expression level of gene i (row i) under condition j (column j). Matrix A is defined by its set of rows $R = \{r_1, r_2, \ldots, r_N\}$ and its set of columns $C = \{c_1, c_2, \ldots, c_M\}$. Given a matrix, biclustering finds submatrices, i.e. subgroups of rows (genes) and subgroups of columns, in which the genes exhibit highly correlated behavior for every condition. Given a data matrix A, the goal is to find a set of biclusters such that each bicluster exhibits some similar characteristic.

Let $A_{IJ} = (I, J)$ represent a submatrix of A ($I \subseteq R$ and $J \subseteq C$). $A_{IJ}$ contains only the elements $a_{ij}$ belonging to the submatrix with set of rows I and set of columns J. A bicluster $A_{IJ} = (I, J)$ can be defined as a k by l submatrix of the data matrix, where k and l are the number of rows and the number of columns in the submatrix $A_{IJ}$. The concept of a bicluster was introduced by [4] to find correlated subsets of genes and a subset of conditions. Let $a_{iJ}$ denote the mean of the i-th row of the bicluster (I, J), $a_{Ij}$ the mean of the j-th column of (I, J), and $a_{IJ}$ the mean of all the elements in the bicluster. As given in [4], more formally,

$$a_{iJ} = \frac{1}{|J|} \sum_{j \in J} a_{ij}, \quad i \in I, \qquad (1)$$

$$a_{Ij} = \frac{1}{|I|} \sum_{i \in I} a_{ij}, \quad j \in J, \qquad (2)$$

$$a_{IJ} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} a_{ij}. \qquad (3)$$

According to [4], the residue of an element $a_{ij}$ in a submatrix $A_{IJ}$ equals

$$r_{ij} = a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}. \qquad (4)$$

The residue of an element gives the difference between its actual value $a_{ij}$ and its expected value predicted from its row, column, and bicluster means. It also reveals its degree of coherence with the other entries of the bicluster it belongs to. The quality of a bicluster can be evaluated by computing the mean squared residue H, i.e. the mean of all the squared residues of its elements [4]:

$$H(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2. \qquad (5)$$
A submatrix $A_{IJ}$ is called a δ-bicluster if H(I, J) ≤ δ for some given threshold δ ≥ 0. In general, the biclustering problem can be formulated bilaterally: maximize the size (area) of the biclusters and minimize the MSR. These two objectives contradict each other, because smaller biclusters have smaller MSR and vice versa. Therefore, there are two optimization problem formulations. Cheng and Church considered the following one: maximize the bicluster size (area) subject to an upper bound on the MSR.
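Definitions (1)-(5) translate directly into a few lines of NumPy. The sketch below (illustrative; not code from the paper) computes H(I, J) for a complete submatrix:

```python
import numpy as np

def msr(A):
    """Mean squared residue H(I, J) of a complete bicluster A (eqs. (1)-(5))."""
    a_iJ = A.mean(axis=1, keepdims=True)   # row means, eq. (1)
    a_Ij = A.mean(axis=0, keepdims=True)   # column means, eq. (2)
    a_IJ = A.mean()                        # overall mean, eq. (3)
    residue = A - a_iJ - a_Ij + a_IJ       # eq. (4)
    return (residue ** 2).mean()           # eq. (5)
```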
3 Adjusting MSR for Missing Data

Missing data often occur in biological data. A common practice is to fill the gaps with random numbers. However, this adds noise and may result in biclusters of lower quality. An alternative approach is to ignore the missing data, keeping only the originally available information. Let A be a bicluster (I, J). We denote by $J_i \subseteq J$ the bicluster's columns without missing data in the i-th row and by $I_j \subseteq I$ the rows without missing data in the j-th column. Then the mean of the i-th row of the bicluster, the mean of the j-th column, and the mean of all the elements in the bicluster are reformulated as follows:

$$a_{iJ} = \frac{1}{|J_i|} \sum_{j \in J_i} a_{ij}, \quad i \in I, \qquad (6)$$

$$a_{Ij} = \frac{1}{|I_j|} \sum_{i \in I_j} a_{ij}, \quad j \in J, \qquad (7)$$

$$a_{IJ} = \frac{1}{\sum_{j \in J} |I_j|} \sum_{j \in J} \sum_{i \in I_j} a_{ij}. \qquad (8)$$

In order to compare this approach with Cheng and Church's approach for handling missing data, a bicluster with zero H-value was used. A bicluster with H = 0 is called an ideal bicluster.

Theorem. Let the n × m matrix A be a bicluster (I, J). Then A has a zero H-value if and only if A can be represented as a sum of an n-vector X and an m-vector Y in the following way: $a_{ij} = x_i + y_j$, i ∈ I, j ∈ J.

Proof. First, assume that A is an n × m bicluster (I, J) with a zero H-value; we prove that A can be represented as the above-mentioned sum. A zero H-value means zero residues $r_{ij}$, i ∈ I, j ∈ J. Then each element of A can be calculated as $a_{ij} = a_{iJ} + a_{Ij} - a_{IJ}$. Denoting $X = \{x_i = a_{iJ} - \frac{a_{IJ}}{2}\}_{i \in I}$ and $Y = \{y_j = a_{Ij} - \frac{a_{IJ}}{2}\}_{j \in J}$ results in A = X + Y, where the vector addition is defined as $a_{ij} = x_i + y_j$.

In the other direction, assume that the bicluster A can be represented as a sum of an n-vector X and an m-vector Y; we show that A has a zero H-value. Since $a_{ij} = x_i + y_j$, i ∈ I, j ∈ J, the mean of the i-th row is

$$a_{iJ} = \frac{m x_i + \sum_{j \in J} y_j}{m},$$

the mean of the j-th column is

$$a_{Ij} = \frac{\sum_{i \in I} x_i + n y_j}{n},$$

and the mean of all the elements in the bicluster is

$$a_{IJ} = \frac{m \sum_{i \in I} x_i + n \sum_{j \in J} y_j}{nm}.$$

The residues are then equal to zero. Indeed,

$$r_{ij} = x_i + y_j - x_i - \frac{\sum_{j \in J} y_j}{m} - \frac{\sum_{i \in I} x_i}{n} - y_j + \frac{\sum_{i \in I} x_i}{n} + \frac{\sum_{j \in J} y_j}{m} = 0.$$

Thus, the bicluster A has a zero H-value. Q.E.D.

Note. The theorem also covers biclusters that are a product of two vectors. Indeed, applying a logarithm to them produces biclusters that are represented as a sum.
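With missing entries encoded as NaN, the adjusted means (6)-(8) reduce to NaN-aware means, and the theorem can be checked numerically on an additive bicluster. The encoding and the names below are our assumptions, not the authors' code:

```python
import numpy as np

def adjusted_msr(A):
    """MSR with missing entries encoded as NaN (eqs. (6)-(8)):
    every mean is taken over the observed cells only."""
    a_iJ = np.nanmean(A, axis=1, keepdims=True)
    a_Ij = np.nanmean(A, axis=0, keepdims=True)
    a_IJ = np.nanmean(A)
    return np.nanmean((A - a_iJ - a_Ij + a_IJ) ** 2)

# Theorem check: an additive bicluster a_ij = x_i + y_j is ideal (H = 0).
rng = np.random.default_rng(1)
x, y = rng.normal(size=(6, 1)), rng.normal(size=(1, 9))
A = x + y
assert adjusted_msr(A) < 1e-12

A_nan = A.copy()
A_nan[0, 0] = np.nan
print(adjusted_msr(A_nan))   # computed over the observed cells only
```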
4 MSR with Row Inversions

In the original definition of biclusters, it is possible to invert (negate) certain rows. A row inversion corresponds to negative correlation, rather than the usual positive correlation, of the inverted row with the other rows in the bicluster. Row inversion may result in a significant reduction of the bicluster MSR. In contrast to handling inversions algorithmically when adding rows (see [4]), we suggest embedding row inversion in the MSR definition as follows. We associate with each row a status, which equals -1 if the row is inverted and +1 otherwise.

Definition. The mean squared residue with row inversions is the minimum MSR over all possible row statuses.

Finding the optimal row status assignment is not a trivial problem. Since the MSR of a matrix does not change when a positive linear transformation is applied, we can show that there is a single global minimum of the MSR among all possible status assignments. A greedy iterative method that changes the status of a row whenever the resulting MSR of the entire matrix decreases will find such a minimum. Unfortunately, this greedy method is too slow to apply even once, whereas ideally it would be applied after each node deletion. Therefore, we suggest the following simple heuristic: iterate over the rows and, for each row, check which total row squared residue is lower, that of the original row or that of the row with all values inverted (negated). The better choice is used as the row status. In our experiments, this heuristic always found the optimal inversion status assignment.
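The heuristic can be sketched as follows (Python/NumPy; an illustration of the described procedure, not the authors' implementation): sweep over the rows, flip a row whenever the inverted row has a smaller total squared residue, and stop when a full pass changes nothing.

```python
import numpy as np

def row_sq_residue(A):
    """Per-row sum of squared residues of matrix A."""
    r = A - A.mean(axis=1, keepdims=True) - A.mean(axis=0, keepdims=True) + A.mean()
    return (r ** 2).sum(axis=1)

def best_inversion_status(A, max_passes=100):
    """Greedy heuristic for the row-status assignment minimizing the MSR:
    flip one row at a time whenever flipping lowers that row's residue."""
    A = A.copy()
    status = np.ones(A.shape[0], dtype=int)
    for _ in range(max_passes):
        changed = False
        for i in range(A.shape[0]):
            base = row_sq_residue(A)[i]
            A[i] = -A[i]                    # try the inverted row
            if row_sq_residue(A)[i] < base:
                status[i] = -status[i]      # keep the flip
                changed = True
            else:
                A[i] = -A[i]                # revert
        if not changed:
            break
    return status, A
```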
5 Dual Biclustering

In this section, we give a brief overview of the dual biclustering problem and the algorithm that we described in [14]. We formulate the dual biclustering problem as follows: given an expression matrix A, find a k × l bicluster with the smallest mean squared residue H. For a set of biclusters, we have: Given: a matrix $A_{n \times m}$, a set of bicluster sizes S, and a total overlapping V. Find: |S| biclusters with total overlapping at most V and minimum total sum of scores H.

This algorithm implements the new computation of the MSR, which ignores missing data; it uses only the data that are actually present. The greedy algorithm for finding a bicluster may start with the entire matrix and at each step try all single row (column) additions (deletions), applying the best operation if it improves the score and terminating when it reaches the bicluster size k × l. The output bicluster will have a small MSR for the given size. As in [4], the algorithm uses the structure of the mean residue score to enable faster greedy steps: for a given threshold α, at each deletion iteration all rows (columns) for which d(i) > αH(I, J) are removed. The algorithm also implements the addition of inverted rows to the matrix, allowing the identification of biclusters that contain co-regulation and inverse co-regulation. The single node deletion and addition algorithms are shown in Figure 1 and Figure 2, respectively.
Input: Expression matrix A on n genes and m conditions, and bicluster size (k, l).
Output: Bicluster $A_{I,J}$ with the smallest adjusted MSR.
Initialize: I = n, J = m; w(i, j) = 0 for all i ∈ n, j ∈ m.
Iteration:
1. Calculate $a_{iJ}$, $a_{Ij}$ and H(I, J) based on the adjusted MSR. If |I| = k and |J| = l, output (I, J).
2. For each row calculate $d(i) = \frac{1}{|J_i|} \sum_{j \in J_i} RS_{IJ}(i, j)$.
3. For each column calculate $e(j) = \frac{1}{|I_j|} \sum_{i \in I_j} RS_{IJ}(i, j)$.
4. Take the best row or column and remove it from I or J.

Fig. 1. Single node deletion algorithm
Input: Expression matrix A and bicluster size (k, l).
Output: Bicluster $A_{I',J'}$ with I ⊆ I' and J ⊆ J'.
Iteration:
1. Calculate $a_{iJ}$, $a_{Ij}$ and H(I, J) based on the adjusted MSR.
2. Add the columns with $\frac{1}{|I_j|} \sum_{i \in I_j} RS_{IJ}(i, j) \le H(I, J)$.
3. Calculate $a_{iJ}$, $a_{Ij}$ and H(I, J) based on the adjusted MSR.
4. Add the rows with $\frac{1}{|J_i|} \sum_{j \in J_i} RS_{IJ}(i, j) \le H(I, J)$.
5. If nothing was added or |I'| = k, |J'| = l, halt.

Fig. 2. Single node addition algorithm
This algorithm is used as a subroutine and repeatedly applied to the matrix. We use bicluster overlapping control (BOC) to avoid finding the same bicluster over and over again: a penalty is applied for using cells present in previously found biclusters. By using BOC, we preserve the information carried by the original data, because we do not mask found biclusters with random numbers. The general biclustering scheme is outlined in Figure 3, where $w_{ij}$ is an element of the weights matrix W, A′ is the resulting data matrix after node deletion on the original matrix A, and A″ is the resulting matrix after node addition on A′. We used the measure of bicluster overlapping V introduced in [14], which is the complement of the ratio between the number of distinct cells used in all found biclusters and the total area of all biclusters.
Input: Expression matrix A, parameter α and a set S of bicluster sizes.
Output: |S| biclusters in matrix A.
Iteration:
1. w(i, j) = 0 for all i ∈ n, j ∈ m.
2. while S not empty do
3.   (k, l) = get first element from S
4.   S = S - {(k, l)}
5.   Apply multiple node deletion on A giving (k, l).
6.   Apply node addition on A′ giving (k, l).
7.   Store A″ and update W.
8. end.

Fig. 3. Dual biclustering algorithm
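For concreteness, here is a compact rendering of the greedy deletion step of Figure 1 (a sketch on a complete matrix; the α-based multiple node deletion, the missing-data adjustment, and the weight-matrix bookkeeping of the full algorithm are omitted):

```python
import numpy as np

def greedy_delete(A, k, l):
    """Single node deletion (cf. Fig. 1): repeatedly drop the row or column
    with the largest mean squared residue until the bicluster is k x l."""
    rows, cols = list(range(A.shape[0])), list(range(A.shape[1]))
    while len(rows) > k or len(cols) > l:
        B = A[np.ix_(rows, cols)]
        R = B - B.mean(axis=1, keepdims=True) - B.mean(axis=0, keepdims=True) + B.mean()
        d = (R ** 2).mean(axis=1)   # row scores d(i)
        e = (R ** 2).mean(axis=0)   # column scores e(j)
        if len(rows) > k and (len(cols) <= l or d.max() >= e.max()):
            rows.pop(int(d.argmax()))
        else:
            cols.pop(int(e.argmax()))
    return rows, cols
```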
6 MSR Minimization via Quadratic Program

We have defined dual biclustering as an optimization problem [6], [3] in [14], where we also defined a quadratic program (QP) for biclustering. In this paper, we modify our QP from [14] by reformulating the objective and constraints in order to handle missing data. We define the dual biclustering formulation as an optimization problem [14]: for a given matrix $A_{n \times m}$, find the bicluster with bounded size (area) k × l and minimal mean squared residue. It can easily be seen that if the MSR were used directly as the QP objective, it would be of cubic form. Since a QP objective can contain only squared variables, the following requirement needs to be satisfied: the QP objective must be defined in such a way that only quadratic variables are present. To meet this requirement, we simulate variable multiplication by addition, as described in [14].

6.1 Integer Quadratic Program

For a given normalized matrix $A_{n \times m}$ and bicluster size k × l, the integer quadratic program is defined as follows:

Objective
Minimize: $\frac{1}{|I||J|} \sum_{i \in n, j \in m} (residue_{ij})^2$

Subject to
$|I| = k$, $|J| = l$
$residue_{ij} = a_{ij} x_{ij} - a_{iJ} x_{ij} - a_{Ij} x_{ij} + a_{IJ} x_{ij}$
$a_{iJ} = \frac{1}{|J|} \sum_{j \in m} a_{ij}$, $a_{Ij} = \frac{1}{|I|} \sum_{i \in n} a_{ij}$, $a_{IJ} = \frac{1}{|I||J|} \sum_{i \in n, j \in m} a_{ij}$
$x_{ij} \ge row_i + column_j - 1$
$x_{ij} \le row_i$
$x_{ij} \le column_j$
$\sum_{i \in n} row_i = k$
$\sum_{j \in m} column_j = l$
$x_{ij}, row_i, column_j \in \{0, 1\}$
End

The QP is used as a subroutine and repeatedly applied to the matrix. For each bicluster size, we generate a separate QP. In order to avoid finding the same bicluster over and over again, the discovered bicluster is masked by replacing the values of its submatrix with random values. Row inversion is simulated by adding to the input matrix A its inverted rows; the resulting matrix has twice as many rows. Missing data are handled in the following way: if an element of the matrix contains a missing value, it does not participate in the computation of the mean squared residue H. In this case, the row mean $a_{iJ}$ equals the sum of all cells in row i that are not marked as missing, divided by their number; similarly for the column mean $a_{Ij}$ and the bicluster average $a_{IJ}$. Since the integer QP is too slow and not scalable enough, we used the greedy rounding and random interval rounding methods proposed in [14].

6.2 Combining Dual Biclustering with Rounded QP

In this section, we combine the adjusted dual biclustering with the modified rounded QP algorithm. Our goal here is to reduce the instance size in order to speed up the QP. First, we apply the adjusted dual biclustering algorithm to the input matrix A to reduce the instance size, where the new size is specified by two parameters, $ratio_k$ and $ratio_l$. Then, we run the rounded QP on the output obtained from the dual biclustering algorithm. This combination improves the running time of the QP and increases the quality of the final bicluster, since an optimization method is applied. The general algorithm scheme is outlined in Figure 4, where W is the weights matrix, A′ is the resulting matrix after node deletion and A″ is the resulting matrix after node addition.
Input: Expression matrix A, parameters α, $ratio_k$, $ratio_l$ and a set of bicluster sizes S.
Output: |S| biclusters in matrix A.
1. while S not empty do
2.   (k, l) = get first element from S
3.   S = S - {(k, l)}
4.   k′ = k · $ratio_k$
5.   l′ = l · $ratio_l$
6.   Apply multiple node deletion on A giving (k′, l′).
7.   Apply node addition on A′ giving (k′, l′).
8.   Update W.
9.   Run QP on A″ giving (k, l).
10.  Round the fractional relaxation and store A″.
11. end.

Fig. 4. Combined Adjusted Dual Biclustering with Rounded QP algorithm
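The three linear constraints of Section 6.1 are exactly the standard linearization of the product $x_{ij} = row_i \cdot column_j$ for binary variables, which is what lets the cubic MSR objective be written with only squared terms. A short enumeration confirms this (a correctness check, not a QP solver):

```python
from itertools import product

# x >= r + c - 1, x <= r, x <= c force x = r * c for binary r, c, x
for r, c in product((0, 1), repeat=2):
    feasible = [x for x in (0, 1) if x >= r + c - 1 and x <= r and x <= c]
    assert feasible == [r * c]
```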
7 Experimental Results

In this section, we analyze the results obtained from dual biclustering with the MSR adjusted for missing data. We describe the comparison criteria, define the swap rule model and analyze the p value of the biclusters. We tested our biclustering algorithms on data from [10] and compared our results with Cheng and Church [4]. For a fair comparison, we used the bicluster sizes published in [4]. A systematic comparison and evaluation of biclustering methods for gene expression data is given in [17]. However, their model uses biologically relevant information, whereas our model is more generic and based on a statistical approach; therefore, we have not used their comparison results in this paper.

7.1 Evaluation of the Adjusted MSR

To measure the robustness of the proposed MSR to noise and to evaluate the quality of the obtained biclusters, the experiments were run on imputed data. Let A be an (I, J) bicluster with zero H-value and real-data variance $\sigma^2$. The corresponding imputed bicluster $A^p$ is defined as

$$a^p_{ij} = a_{ij} + \varepsilon_{ij}, \qquad (9)$$

where p is the percentage of added noise and $\{\varepsilon_{ij}\}_{i \in I, j \in J} \sim N(0, \frac{p}{100}\sigma^2)$.

7.2 The Goal of Our Experiments

The goal of our experiments is to find the percentage of noise at which the algorithm is still able to distinguish a bicluster of size k from non-biclusters in the imputed data. Although one can determine such a percentage with respect to submatrices of the bicluster, the probability of having a distinguishable submatrix when the bicluster itself can no longer be distinguished from a non-bicluster tends to zero, due to the uniformly distributed imputation of error.

7.3 Experimental Results

Figure 5 compares Cheng and Church, dual biclustering, dual biclustering coupled with QP, adjusted dual biclustering, adjusted dual biclustering coupled with QP, and adjusted dual biclustering with row inversion. The average MSR for the adjusted dual and QP represents 68 percent (average) and 48 percent (median) of the values published in [4]. These results show that ignoring missing data in the dual algorithm gives a much smaller MSR. The effect of noise on the MSR computation using synthesized data can be seen in Figure 6. Figure 7 shows the effect of noise on the adjusted MSR computation versus randomly filled missing data; the adjusted MSR is clearly less affected by noise. Figure 8 shows how noise affects the adjusted MSR and randomly filled missing data for different levels of noise.
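The imputation model of Eq. (9) is easy to reproduce; in the sketch below, $\sigma^2$ is taken as the empirical variance of the bicluster (an assumption consistent with "real-data variance $\sigma^2$"):

```python
import numpy as np

def impute_noise(A, p, rng):
    """Return the imputed bicluster A^p of eq. (9): add N(0, (p/100) * sigma^2) noise."""
    sigma2 = np.var(A)                    # variance of the real data
    eps = rng.normal(0.0, np.sqrt(p / 100.0 * sigma2), size=A.shape)
    return A + eps
```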
Algorithm                        | OC parameter | Covering | Average MSR (%)  | Median MSR (%)
Cheng and Church*                | n/a          | 39945    | 204.29 (100)     | 196.3095 (100)
Cheng and Church**               | n/a          | 39945    | 228.56 (112)     | 204.96 (105)
Dual                             | 1.8          | 40548    | 205.77 (100.72)  | 123.27 (62.79)
Dual and QP                      | 1.7          | 41037    | 171.5 (75.02)    | 104.47 (47.91)
Adjusted Dual                    | 1.8          | 40548    | 161.23 (70.54)   | 104.66 (51.1)
Adjusted Dual and QP             | 1.7          | 41087    | 154.66 (68)      | 95.46 (47)
Adjusted Dual with inverted rows | 1.6          | 43028    | 195.9 (95)       | 77.96 (39.71)

Fig. 5. Comparison of biclustering methods
Fig. 6. MSR computation for synthesized data (line plot: MSR vs. noise percentage, 0%-30%, for 0%, 5%, 10%, and 15% missing data)
Fig. 7. Adjusted MSR vs. randomly filled missing data (line plot: MSR vs. noise percentage, 0%-70%; series: 10% missing data ignored vs. 10% missing data filled with random values)
Fig. 8. MSR for randomly filled missing data at different levels of noise (line plot: MSR vs. noise percentage, 0%-70%; series: 0% missing data, 10% missing data, 10% randomly filled data)

We measure the statistical significance of the biclusters obtained by our algorithms using the p value. The p value is computed by running the dual problem algorithm on 100 randomly generated input data sets. The random data are obtained from matrix A by repeatedly selecting two cells in the matrix ($a_{ij}$, $d_{kl}$) at random and taking their diagonal elements ($b_{kj}$, $c_{il}$). If $a_{ij} > b_{kj}$ and $c_{il} < d_{kl}$, the algorithm swaps $a_{ij}$ with $c_{il}$ and $b_{kj}$ with $d_{kl}$; this is called a hit. Otherwise, two elements $a_{ij}$ and $d_{kl}$ are randomly chosen again. The matrix is considered randomized once there have been nm/2 hits. In our case, the p value is smaller than 0.001, which indicates that the results are not random and are statistically significant.
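The swap rule reads as follows in code (a sketch; the resampling guard for degenerate index pairs is our addition):

```python
import numpy as np

def swap_randomize(A, seed=0):
    """Swap-rule randomization used for the p-value computation: perform
    value-preserving diagonal swaps until n*m/2 hits have occurred."""
    A = A.copy()
    n, m = A.shape
    rng = np.random.default_rng(seed)
    hits, target = 0, n * m // 2
    while hits < target:
        i, k = rng.integers(n, size=2)
        j, l = rng.integers(m, size=2)
        if i == k or j == l:
            continue                      # need two distinct rows and columns
        # cells a = A[i,j], d = A[k,l]; diagonal partners b = A[k,j], c = A[i,l]
        if A[i, j] > A[k, j] and A[i, l] < A[k, l]:
            A[i, j], A[i, l] = A[i, l], A[i, j]   # swap a with c
            A[k, j], A[k, l] = A[k, l], A[k, j]   # swap b with d
            hits += 1
    return A
```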
8 Conclusions

Random numbers can interfere with the discovery of future biclusters, especially those that overlap with the already discovered ones. In this paper, we introduce a new approach to handling missing data which does not take entries with missing data into account. We have characterized ideal biclusters, i.e., biclusters with zero mean squared residue, and shown that this approach is significantly more stable with respect to increasing noise. Several biclustering methods have been modified accordingly. Our experimental results show a significant decrease of the H value of the biclusters when compared with counterparts that fill missing data with random numbers (e.g., the original Cheng and Church [4] method). The average MSR for the adjusted dual and QP represents 68 percent (average) and 48 percent (median) of the values published in [4]. These results show that ignoring missing data in the dual algorithm gives a much smaller MSR. We also define the MSR based on the best row inversion status and give an efficient heuristic for finding such an assignment. This new definition allows the MSR of a found set of biclusters to be reduced further.
References
1. Angiulli, F., Pizzuti, C.: Gene expression biclustering using random walk strategies. In: Tjoa, A.M., Trujillo, J. (eds.) DaWaK 2005. LNCS, vol. 3589, pp. 509-519. Springer, Heidelberg (2005)
2. Baldi, P., Hatfield, G.W.: DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modelling. Cambridge Univ. Press, Cambridge (2002)
3. Bertsimas, D., Tsitsiklis, J.: Introduction to Linear Optimization. Athena Scientific
4. Cheng, Y., Church, G.: Biclustering of expression data. In: Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB), pp. 93-103. AAAI Press, Menlo Park (2000)
5. Madeira, S.C., Oliveira, A.L.: Biclustering algorithms for biological data analysis: A survey. IEEE Transactions on Computational Biology and Bioinformatics 1(1), 24-45 (2004)
6. Papadimitriou, C.H., Steiglitz, K.: Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, Inc., Upper Saddle River
7. Prelic, A., Bleuler, S., Zimmermann, P., Wille, A., Bühlmann, P., Gruissem, W., Hennig, L., Thiele, L., Zitzler, E.: A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 22(9), 1122-1129 (2006)
8. Shamir, R.: Lecture notes, http://www.cs.tau.ac.il/~rshamir/ge/05/scribes/lec04.pdf
9. Tanay, A., Sharan, R., Shamir, R.: Discovering statistically significant biclusters in gene expression data. Bioinformatics 18, 136-144 (2002)
10. Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J., Church, G.M.: Systematic determination of genetic network architecture. Nature Genetics 22, 281-285 (1999)
11. Yang, J., Wang, H., Wang, W., Yu, P.: Enhanced biclustering on gene expression data. In: Proceedings of the 3rd IEEE Conference on Bioinformatics and Bioengineering (BIBE), pp. 321-327 (2003)
12. Zhang, Y., Zha, H., Chu, C.H.: A time-series biclustering algorithm for revealing co-regulated genes. In: Proc. Int. Symp. Information and Technology: Coding and Computing (ITCC 2005), Las Vegas, USA, pp. 32-37 (2005)
13. Zhou, J., Khokhar, A.A.: ParRescue: Scalable parallel algorithm and implementation for biclustering over large distributed datasets. In: 26th IEEE International Conference on Distributed Computing Systems, ICDCS 2006 (2006)
14. Gremalschi, S., Altun, G.: Mean squared residue based biclustering algorithms. In: Măndoiu, I., Sunderraman, R., Zelikovsky, A. (eds.) ISBRA 2008. LNCS (LNBI), vol. 4983, pp. 232-243. Springer, Heidelberg (2008)
15. Divina, F., Aguilar-Ruiz, J.S.: Biclustering of expression data with evolutionary computation. IEEE Transactions on Knowledge and Data Engineering 18(5), 590-602 (2006)
16. Yang, J., Wang, W., Wang, H., Yu, P.S.: Enhanced biclustering on expression data. In: Proceedings of the 3rd IEEE Conference on Bioinformatics and Bioengineering (BIBE 2003), pp. 321-327 (2003)
17. Prelic, A., Bleuler, S., Zimmermann, P., Wille, A., Bühlmann, P., Gruissem, W., Hennig, L., Thiele, L., Zitzler, E.: A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 22(9), 1122-1129 (2006)
18. Xiao, J., Wang, L., Liu, X., Jiang, T.: An efficient voting algorithm for finding additive biclusters with random background. Journal of Computational Biology 15(10), 1275-1293 (2008)
19. Liu, X., Wang, L.: Computing the maximum similarity bi-clusters of gene expression data. Bioinformatics 23(1), 50-56 (2007)
Using Gene Expression Modeling to Determine Biological Relevance of Putative Regulatory Networks

Peter Larsen1 and Yang Dai2

1 Core Genomics Laboratory (MC063), University of Illinois at Chicago, 835 South Wolcott Avenue, Chicago, IL 60612, USA
[email protected]
2 Department of Bioengineering (MC063), University of Illinois at Chicago, 851 South Morgan Street, Chicago, IL 60607, USA
[email protected]
Abstract. Identifying gene regulatory networks from high-throughput gene expression data is one of the most important goals of bioinformatics, but it remains difficult to define what makes a 'good' network. Here we introduce Expression Modeling Networks (EMNs), proposing that a 'good' regulatory network must be a functioning tool that predicts biological behavior. Interaction strengths between a regulator and a target gene are calculated by fitting observed expression data to the EMN. 'Better' EMNs should have a superior ability to model previously observed expression data. In this study, we generate regulatory networks by three methods using a Bayesian network approach from an oxidative stress gene expression time course experiment. We show that better networks, identified by the percentage of interactions between genes sharing at least one GO-Slim Biological Process term, do indeed generate more predictive EMNs. Keywords: Gene expression, linear model, least-squares, expression modeling network, regulatory network.
1 Introduction

Gene regulatory networks represent genetic control mechanisms as directed graphs, in which genes are the nodes and the connecting edges signify regulatory interactions [1]. Determination of gene regulatory networks from high-throughput gene expression data is an important goal of bioinformatics analysis. There are many proposed computational methods for inferring potential gene regulatory networks from microarray data, such as relevance networks [2], the clustering coefficient threshold method [3], nearest neighbor networks [4], ARACNE [5, 6], Asymmetric-N [7], and Bayesian networks (BN) [8]. Except for the BN approach, most of these methods use correlation or an information theoretical measurement (e.g. entropy) between gene expression profiles to determine whether two genes are related to each other. The predicted networks are essentially determined by binary interactions. There remains no universally accepted standard for identifying a 'good' network or for determining the 'best' network from a
collection of potential networks. Some methods that have been used assign a predicted interaction confidence determined by (1) previously observed interactions, (2) functional associations derived from other ‘omic’ data such as protein-protein interactions, or (3) evolutionary conservation observed in other organisms. Here we propose that the best predicted gene regulatory network is the one that can be used to most accurately predict experimentally observed expression data. By considering gene expression as a linear function of the expression of its regulators, a proposed gene regulatory network can be modeled as a set of algebraic equations with constant terms for the relative strength of a regulator’s expression effect on the expression of its target. Using a set of experimentally observed data, these equations can be solved using the least-squares method. With these results, the gene regulation network becomes a predictive tool, capable of estimating the gene expression for all genes in the network from a value for the top-most node. For the equations in the EMN to be solvable, it is required that the predicted gene regulatory networks are in the form of a Directed Acyclic Graph (DAG). How well the EMN-calculated gene expression levels coincide with the experimentally observed gene expression levels can be used as a measure of the biological relevance of the proposed gene regulatory network. For validation of the proposed framework of the EMN, we have selected a subset of time course microarray data investigating yeast’s gene expression response to oxidative stress. To generate the appropriate DAG for regulatory networks, we have selected the BN approach to infer putative interaction networks from the time course data. As it has been previously reported that the addition of biological knowledge to BN methods can improve the quality of networks [9-12], we generated three BNs: BN-Unrestricted imposes no restriction on possible edges, BN-Published restricts possible edges to those interactions previously reported in the published literature, and BN-LOI restricts edges to those interactions calculated to be biologically likely by their similarity to known interactions. Additionally, 10 random networks were generated to serve as a baseline control. To determine whether better networks generate more predictive EMNs, a metric to rank the inferred networks is required. For this study, we selected the percent of interactions in the networks that share a common biological process annotation as this metric. To measure how well EMN-calculated expression values match experimentally observed values, we have selected two metrics. The first is the Pearson Correlation Coefficient (PCC) and the second is the percent of genes that are differentially expressed in the same direction, positive or negative, in the EMN-calculated values and the experimental data.
2 Methods

This section introduces the concept of the Expression Modeling Network (EMN), a method that takes a directed acyclic graph representing a gene regulatory interaction set and estimates values representing the strength of the effect of a change in expression of a regulator gene on the expression of a target gene. A previously collected time course microarray experiment studying the gene expression response of yeast to oxidative stress was used to demonstrate EMN here. Putative gene regulatory networks were inferred from the data using the BN approach under three conditions: no restrictions on possible interactions, restricting interactions to those previously
published, and restricting interactions to those with high Likelihood of Interaction (LOI) scores. Additionally, 10 interaction networks were generated at random. The ‘best’ interaction network was identified as a function of the percent of interactions in which regulator and target share a Gene Ontology Biological Process annotation.

2.1 Microarray Data

The data used are a subset of microarray gene expression profiles from an experiment on the yeast response to various environmental stresses by Gasch et al. [15]. The subset is taken from those measured genes that showed at least a two-fold change in expression in at least one time point when cultured under conditions of oxidative stress with 0.32 mM hydrogen peroxide (H2O2), and that were missing expression values for no more than 25% of observations. The dataset consists of 189 genes at 11 time points. Gene expression patterns are characterized by a rapid change in expression levels of most genes, returning to initial expression levels over time.

2.2 Expression Modeling Networks

EMN is an extension of Toxicological Prediction Networks (TPN) [13], which associate small, heterogeneous networks of toxic ligands and proteins with broad biological process descriptions. EMN requires a DAG representing an expression regulatory network, in which nodes are genes with measured relative expressions or quantified environmental cues, and directed edges between nodes indicate that the expression of the parent node has a regulating effect on the expression of the child node. Also needed is a set of observed expression data and quantified environmental stimuli for multiple observations. The regulatory network is represented as a set of equations in which the expression of every gene is a linear function of the expression of its regulators:
E_i(t) = \sum_{j \,\mathrm{regulates}\, i} \left( E_j(t) \cdot v_{j,i} + c_{j,i} \right)    (1)
where E_i(t) is the measured expression of gene i at time point t, and v_{j,i} and c_{j,i} are constants describing the strength of the regulatory influence of the expression of gene j on gene i. The optimal EMN is determined by minimizing the sum of squared differences between the modeled and observed gene expression, i.e., minimizing

\sum_{t=1}^{T} \sum_{i=1}^{n} \left( E_i(t) - E_i^{obs}(t) \right)^2    (2)

where E_i^{obs}(t) is the observed gene expression of gene i at time point t, and T is the overall number of time points in the experiment. Using the set of experimental observations of expression of all genes at different time points, the optimization problem can be solved for all constants v_{j,i} and c_{j,i} by using the least-squares method. The ‘v’ terms model the strength of the effect of a change in expression in a regulator on its target. The ‘c’ terms model the influence that a regulator has on the expression of its target to maintain equilibrium at a control state. This procedure for EMN is summarized in Figure 1.
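Since Eq. (2) decomposes over target genes, each target can be fitted independently by ordinary least squares. Below is a minimal Python sketch of this step; the paper used R's 'lsfit', and numpy.linalg.lstsq plays the same role here, while the data layout and function names are illustrative assumptions. Note that the per-edge constants c_{j,i} enter Eq. (1) only through their sum, so the sketch fits a single intercept per target.

import numpy as np

def fit_emn_edge_weights(expr, regulators_of):
    # expr          : dict gene -> 1-D array of expression over the T time points
    # regulators_of : dict target gene -> list of its regulator genes (a DAG)
    # Returns dict target -> (array of v_{j,i} aligned with the regulator list,
    # a single fitted intercept standing in for the sum of the c_{j,i}).
    fitted = {}
    for target, regs in regulators_of.items():
        if not regs:
            continue  # top-most nodes (e.g. the H2O2 node) have no equation
        y = np.asarray(expr[target], dtype=float)
        # T x (|regs|+1) design matrix: one column per regulator series,
        # plus a constant column for the intercept
        X = np.column_stack([np.asarray(expr[r], dtype=float) for r in regs]
                            + [np.ones(len(y))])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        fitted[target] = (coef[:-1], coef[-1])
    return fitted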
For this study, expression data were considered as a log ratio relative to a control. The ‘expression’ value for H2O2 was set to an arbitrary 160, the total number of minutes in the time course experiment, at time 1, and then reduced by half for every subsequent time point, as this was empirically observed to best model the similar drop in the number of significantly expressed genes over time. The least-squares fitting procedure as implemented by the ‘lsfit’ function in the statistical computing language R (ver. 2.8.1) [14] was used to solve the optimization problem.

[Figure 1: (A) a small example gene regulatory network as a DAG, with constants v, c labeling each directed edge; (B) expression values E for nodes 0-5 at time points t1 through tn; (C) the network rewritten as a set of linear equations of the form of Eq. (1).]
Fig. 1. Using a small example regulatory network, the process for EMN is summarized. (A) Given a putative interaction network in which every directed interaction between a regulator j and target i is characterized by two constants v_{j,i} and c_{j,i}, and (B) a set of gene expression observations over multiple time points, the network can be described as a set of equations (C) in which the expression of a target gene is a linear function of the expression of its regulators, where E_n(t) is the expression of node n at time t, and v_{j,i} and c_{j,i} are the constants describing the interaction strength between regulator j and target i. Using the values in (B), the equations in (C) can be solved for all values v_{j,i} and c_{j,i}.
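Once the constants are fitted, the network can be used as the predictive tool described earlier, propagating expression from the top-most node in topological order. A minimal sketch continuing the previous one (the seed series for H2O2 would be the halving sequence 160, 80, 40, ... described above; names and layout are assumptions):

def emn_predict(fitted, regulators_of, seed_series):
    # fitted      : output of fit_emn_edge_weights above
    # seed_series : dict gene -> length-T array for nodes with no regulators
    import numpy as np
    pred = {g: np.asarray(v, dtype=float) for g, v in seed_series.items()}
    remaining = {g for g, regs in regulators_of.items() if regs} - set(pred)
    while remaining:
        # a DAG guarantees some gene has all of its regulators predicted
        ready = [g for g in remaining
                 if all(r in pred for r in regulators_of[g])]
        for g in ready:
            v, c = fitted[g]
            pred[g] = sum(w * pred[r]
                          for w, r in zip(v, regulators_of[g])) + c
            remaining.discard(g)
    return pred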
2.3 Previously Published Interactions

In order to obtain interactions involving the genes in the subset mentioned above from the literature, we used ‘PathwayStudio’ (Ariadne Genomics, Inc., Rockville, MD) to automatically extract them from PubMed references. Given an input set of query genes or
gene products, PathwayStudio searches the database of published abstracts, seeking instances in which genes are identified as interacting according to the information found in available PubMed abstracts. The nature of the interactions (‘expression’, ‘regulation’, ‘genetic interaction’, ‘binding’, ‘protein modification’, and ‘chemical modification’ as defined in that software package) can be used to screen for specific types of interactions. The interaction types ‘direct regulation’, ‘regulation’ and ‘expression’ were used for this study.

2.4 Likelihood of Interaction (LOI)

This study utilizes the concept of Likelihood of Interaction (LOI) scores for gene interaction pairs, developed in our previous study [16], for assigning confidence to an interaction between a pair of genes. The LOI-score is a measure of the likelihood that a gene or gene product with a particular molecular function annotation influences the expression of another gene or gene product. More specifically, if two genes closely resemble previously observed interaction pairs in their specific molecular function annotations, then they are considered likely to interact. The specific details of deriving the LOI-score can be found in [16]. In general, however, a negative LOI-score indicates that a particular GO Molecular Function (MF) annotation pair occurs less frequently than expected by random chance. A positive LOI-score indicates an interaction between GO MF annotations occurs more frequently than expected at random. A score near zero indicates that the frequency is at a level near that expected at random. For the derivation of LOI-scores in this study, a set of 6150 yeast genes was selected from the Saccharomyces cerevisiae database of PathwayStudio 3.0 and used to identify 576 directed gene interaction pairs. The 25 GO MF annotations specified by the Saccharomyces Genome Database (SGD) GO Slim Mapper [17] were considered for the annotation of the regulator and the target genes.

2.5 Bayesian Network (BN)

BN is a probabilistic framework for the robust inference of interactions in the presence of noise. For using BNs in gene regulatory networks, nodes are genes or gene products. A directed edge is a regulatory interaction between nodes in which a change in expression in the regulator leads to a change in expression of the target. This edge does not necessarily imply the physical nature of the regulatory interaction, and it can be assumed that regulation may occur through the physical interaction of proteins not in the set of differentially expressed genes. The input data are measures of gene expression under multiple observations and the output is a DAG. For determining the networks for analysis in this study, BANJO (Bayesian Network Inference with Java Objects) was employed [18] using the following conditions. Nodes in the network represent genes with measured expressions, plus one node representing the presence of H2O2. Data were discretized into five values: more than two standard deviations below, between one and two standard deviations below, within one standard deviation of, between one and two standard deviations above, and more than two standard deviations above the average expression value of all genes in the microarray study. Simulated annealing, the BDe Scoring Metric [19], and Random Local Moves were selected as parameters in BANJO. A maximum of 5 parents was considered, as this was in the range of the observed number of parents in this set of published interactions (the average number of parents is 3, with a standard deviation of 3).
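The five-level discretization can be written compactly; the following Python sketch assumes symmetric cut points at one and two standard deviations (the exact interval conventions, open versus closed, are an assumption since the paper lists only the five categories):

import numpy as np

def discretize_five_levels(values, mu, sigma):
    # returns levels 0..4 for: < -2*sigma, [-2,-1)*sigma, [-1,1)*sigma,
    # [1,2)*sigma, >= 2*sigma relative to the study-wide mean mu
    z = (np.asarray(values, dtype=float) - mu) / sigma
    return np.digitize(z, bins=[-2.0, -1.0, 1.0, 2.0])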
For this study, three separate networks were generated.
In ‘Unrestricted’, no restriction was placed on possible interactions, except that no gene was allowed to regulate the node for H2O2. The H2O2 node indicates the presence of H2O2 in the media, and no gene’s expression can regulate the state of environmental H2O2. In ‘Published’, possible interactions were restricted to those identified as previously observed using the tool ‘Pathway Studio’. ‘LOI’ restricted possible interactions to those that had an LOI-score in the top 25% of all LOI-scores calculated between all possible gene interaction pairs. Additionally, 10 random DAG networks were generated to serve as a baseline. All random networks have 729 interactions.

2.6 Ranking Putative Gene Regulatory Networks from Best to Worst

To determine the biological relevance of the identified gene networks, GO annotation descriptions [20] were used. There are three ontologies: molecular function, biological process, and cellular component. Molecular function (MF) annotation describes what a gene product does at the molecular level, without specifying where or when the activity takes place in the broader context. Biological process refers to a biological objective to which a gene product contributes, though GO biological process (BP) annotations are not the equivalent of a biological pathway. Cellular component (CC) annotation refers to the place in the cell where a gene product is found. GO annotations at their finest level do not describe specific gene products, and a given gene product may have multiple GO annotations from each ontology. The specific GO annotations considered in this study are the GO-Slim BP annotations as provided by the Saccharomyces Genome Database (SGD) [17], a curator-selected set of characteristic, most biologically relevant terms. For this study, the percent of proposed interactions in a regulatory network where the regulator and the target share at least one GO-Slim BP annotation is considered to be a measure of quality for the network. The higher the percentage of interactions that share an annotation, the better the proposed network is considered.

2.7 Evaluating the Fit of EMN-generated Expression Values to Experimental Observations

Since the networks obtained from the BN method by using different levels of prior knowledge are of different sizes, the corresponding least squares are not comparable. Two metrics are considered to determine how well EMN-generated expression data fit the experimentally observed data. The first is the Pearson Correlation Coefficient (PCC) between the EMN and experimentally observed data. A PCC close to one indicates a good fit between calculated and observed data. A PCC near zero indicates no similarity between calculated and observed data. Values close to negative one indicate an inverse correlation. The second metric used here is the percent of genes in which the direction of expression change, positive or negative, is in agreement between the EMN-calculated and the observed data. A high percentage indicates that the EMN-calculated data frequently agree with the observed data as to the direction of gene expression change.

2.8 Assigning Significance to Results

The significance of a calculated network was determined using the cumulative binomial distribution, which gives the probability B(x; n, p) of at most x successes out of n trials with a probability of success p:
B(x; n, p) = \sum_{i=0}^{x} \binom{n}{i} p^i (1 - p)^{n-i}    (3)
A p-value for each network is reported as 1 − B(x; n, p). This was used to determine the significance of the percent of interactions sharing GO-Slim BP terms, used as a metric of network quality, and of the percent of EMN-calculated gene expressions that are in the same direction of expression change as the experimentally observed data. For the significance of a particular percent of shared GO-Slim BP annotations between calculated interactions in a given network, n is equal to the total number of interactions in the predicted network, x is the number of interactions that share a GO-Slim BP term, and p is equal to the frequency of interactions in the complete graph of 189 genes that share a GO-Slim BP annotation. For the percent of calculated gene expression changes in the same direction as observed, n is equal to the number of genes, x is equal to the number of genes whose calculated expression changes are in the same direction as observed, and p is equal to:
f_{Obs}^{Pos} \cdot f_{Calc}^{Pos} + \left( 1 - f_{Obs}^{Pos} \right) \cdot \left( 1 - f_{Calc}^{Pos} \right),    (4)

where f_{Obs}^{Pos} is the frequency of positive expression changes in the experimentally observed data and f_{Calc}^{Pos} is the frequency of positive expression changes in the EMN-calculated expression data.
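Both significance tests reduce to the same binomial tail computation. A minimal Python sketch using scipy (function names are illustrative, not from the paper):

from scipy.stats import binom

def network_pvalue(x, n, p):
    # 1 - B(x; n, p) of Eq. (3), as reported in the text;
    # binom.cdf is the cumulative binomial B
    return 1.0 - binom.cdf(x, n, p)

def direction_null_p(f_obs_pos, f_calc_pos):
    # null success probability of Eq. (4) for the same-direction test
    return f_obs_pos * f_calc_pos + (1.0 - f_obs_pos) * (1.0 - f_calc_pos)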
3 Results

Here, the generated putative oxidative stress regulatory networks are compared to one another, ensuring that the networks are distinct from one another and that they can be ranked from ‘best’ to ‘worst’ with regard to the percent of interactions that share a GO-Slim BP annotation. EMN was used as described in the Methods section to estimate the interaction strengths between all regulator and target pairs in each network. Those interaction strengths were then used to model gene expression in the network in response to oxidative stress.

3.1 Evaluation of Generated Networks

First, the generated putative regulatory networks (BN-Unrestricted, BN-Published, BN-LOI, and the 10 randomly generated networks) have to be compared to one another. As the goal of this study is to determine whether better regulatory networks generate more predictive EMNs, it needs to be determined that the generated networks are sufficiently distinct from one another and that the networks can be ranked from ‘best’ to ‘worst’. To determine whether the networks are distinct, the percent of interactions shared between the networks was considered (Table 1). In general, the overlap is about 15% between the various BN-generated networks. BN-LOI and BN-Published have the most similarity, with 16% overlap. BN-Unrestricted and BN-LOI have the least overlap, with 13%. The 10 randomly generated networks have about a 3% overlap with each of the BN-generated networks and with one another.
Table 1. The percent of interactions in common between multiple potential gene regulation networks among 189 genes is summarized here. For the ’10 Random’ networks, results are presented as the average (standard deviation) over all ten randomly generated networks.
Network            BN-Unrestricted   BN-Published   BN-LOI       10 Random
BN-Unrestricted    --                --             --           --
BN-Published       14%               --             --           --
BN-LOI             13%               16%            --           --
10 Random          3% (0.007)        3% (0.007)     3% (0.003)   3% (0.006)
The number of interactions in each BN-generated interaction network is also distinct: BN-Unrestricted has 729, BN-Published has 739, and BN-LOI has 731 interactions. From this it can be judged that the different restrictions of the BN estimation did in fact yield distinctly different networks that share only a minority of proposed regulator-target interactions. In order to estimate the relative quality of an interaction network, the percent of interactions whose regulator and target share at least one GO-Slim BP annotation was chosen as a metric. A significance was assigned to this measure using a binomial-distribution-derived p-value, calculating the probability that a network of a given size would have the same or greater number of interactions that share a GO-Slim BP annotation by chance (Table 2). BN-Unrestricted has the greatest percent of shared terms at 32%, with a highly significant p-value of 3.93E-09. BN-Published is next with 30%, at a significance of 2.49E-06. BN-LOI is the worst, with 27% and a relatively poor p-value of 3.17E-03. The random networks averaged 22% shared GO-Slim BP annotations with a p-value of 0.671, indicating results no better than random chance. From this we can rank the BN-derived networks with BN-Unrestricted as best and BN-LOI as worst. All BN-generated interaction networks perform substantially better than the 10 randomly generated networks.

Table 2. The relative quality of several proposed gene interaction networks is defined here as the proportion of identified interactions in which the regulator and target gene share a GO-Slim BP annotation. For the ’10 Random’ networks, results are presented as the average (standard deviation) over all ten randomly generated networks. ‘# Edges’ is the total number of interactions between the 189 genes identified as involved in the response to H2O2. ‘% Shared GO-BP’ is the fraction of the interactions that share a GO-Slim BP term between regulator and target. ‘Binom.pVal’ is the binomial-distribution-derived significance of the ‘% Shared GO-BP’ in the identified interaction network relative to the percent of shared GO-BP annotations in the complete graph of all 189 genes.
Network           # Edges   % Shared GO-BP   Binom.pVal
BN-Unrestricted   729       32%              3.93E-09
BN-Published      739       30%              2.49E-06
BN-LOI            731       27%              3.17E-03
10 Random         729       22% (0.017)      6.71E-01
3.2 Using EMN to Model Gene Expression in Response to Oxidative Stress

Using the procedure described for EMN, the values of all v_{j,i} and c_{j,i} interaction strengths between all pairs of regulator j and target gene i were determined for the three BN-derived networks and the 10 randomly generated networks. Using these values, gene expressions for all genes were estimated using EMN for the first three time points in the time course. The first three time points, at which most genes are differentially expressed, were selected to validate EMN. At later time points, where differential gene expression tapers off, the correlation of EMN to the data is inflated, unfairly suggesting a very good fit of the model to the data.

Table 3. ‘A’ is the BN-Unrestricted network, ‘B’ is the BN-Published network, ‘C’ is the BN-LOI network, and ‘D’ is the average (standard deviation) over the 10 randomly generated networks. Data for the first three time points of the experiment and the average over them are presented here. ‘Correl with Obs.’ is the correlation of EMN-generated expression data with observed expression data. ‘%Same Dir as Obs.’ is the percentage of genes whose direction of expression change, positive or negative, is the same as the direction of expression change in the experimental data. ‘Binom.pVal’ is the binomial-distribution-derived significance of the ‘%Same Dir as Obs.’ relative to the distribution of positive and negative fold changes in the EMN-derived and experimentally observed data.
Network   Metric              A          B          C          D
Time 1    Correl with Obs.    0.290      0.429      0.023      -0.002 (0.083)
          %Same Dir as Obs.   70%        71%        54%        51% (0.036)
          Binom.pVal          1.05E-08   1.63E-09   1.21E-01   5.00E-01
Time 2    Correl with Obs.    0.381      0.360      -0.032     -0.021 (0.090)
          %Same Dir as Obs.   68%        65%        51%        49% (0.029)
          Binom.pVal          1.42E-07   1.30E-05   3.30E-01   5.00E-01
Time 3    Correl with Obs.    0.414      0.249      0.029      -0.041 (0.093)
          %Same Dir as Obs.   68%        65%        58%        49% (0.035)
          Binom.pVal          2.50E-07   5.17E-05   1.29E-02   4.47E-01
Average   Correl with Obs.    0.362      0.346      0.006      -0.021
          %Same Dir as Obs.   69%        67%        54%        50%
          Binom.pVal          1.34E-07   2.16E-05   1.55E-01   4.82E-01
Results of fitting EMN-generated expression values to observed expression values are summarized in Table 3. Using the EMN-calculated expression values, on average across the first three time points in the time course experiment, BN-Unrestricted had the best PCC with the experimental data at 0.362, the highest percent agreement in direction of expression change at 69%, and a significance of 1.34E-07. BN-Published performed next best on average, with a PCC of 0.346 and a 67% agreement in direction of expression change at a significance level of 2.16E-05. BN-LOI performed the worst on average of all the BN-generated interaction networks, with a PCC of 0.006 and a weakly significant percent same direction of 54% and
a p-value of 0.155. The 10 randomly generated networks averaged a PCC of -0.021, indicating no correlation with the observed data, and a percent same direction of 50% with a significance of 0.482, indicating a result attributable to random chance. The best result at an individual time point was for the BN-Published network at the first time point, with a PCC of 0.429 and a highly significant 71% agreement with the direction of expression change, at a p-value of 1.63E-09.
4 Conclusions

Here we have proposed Expression Modeling Networks (EMN), a method that (1) uses an interaction network and observed microarray gene expression data to estimate the strength of an interaction between a regulator and a target and (2) uses those estimated interaction strengths to model gene expression. Using a microarray dataset studying the gene expression of yeast in response to oxidative stress, we have demonstrated that EMNs can be used to calculate gene expression in response to an environmental stimulus. Better, more biologically relevant networks, as judged using a metric of the percent of interactions in the network in which the regulator and the target share at least one GO-Slim Biological Process annotation, generate EMNs that more accurately model gene expression data. This positive correlation between the rank of the networks and the quality of the EMNs indicates that it is the biological relevance of the proposed networks that is responsible for the fit of EMN-calculated expression values with experimentally observed data. Given this result, EMNs could be used to evaluate multiple proposed gene interaction networks and identify those proposed networks that best model experimentally observed data, and therefore have a greater likelihood of being biologically relevant. This ability to quantifiably measure how well a proposed gene regulatory interaction network fits experimentally observed expression data represents a potentially significant advancement, taking regulatory networks derived from high-throughput data from simple hypotheses to predictive, analytical tools. In the complementary approach, TPN [13] was able to model the effects of several toxic ligands on rats from gene expression in liver, suggesting that the linear modeling of complex systems in EMN can be extended to systems more complex than yeast. It should also be noted that the framework proposed here is generally applicable to more complex models, such as the S-system, for nonlinear representation of expression. Additional work remains to be done with EMN. Only a single environmental condition, the response to oxidative stress, was considered here. Ultimately, to determine whether EMN not only mimics the observed experimental data but also predicts relevant biology, one must perform a biological experiment to confirm the model. Fortunately, EMN is well suited to such hypothesis-driven experimentation. The effects of deleting or amplifying a gene node's expression, or of modifying the strength of an interaction edge, on the expression of other genes in the proposed network can be simulated in EMN. Then biological experimentation can confirm or reject the model-based prediction. A more complex EMN might incorporate several possible conditions and model the yeast’s more complete environmental stress response regulatory network, able to model not only single stresses but specific combinations of stresses. A well-designed EMN might have the ability to predict gene expression, not just model
previously observed conditions. With a predictive EMN tool, certain gene expression studies could be performed in silico before advancing to actual biological experiments, refining biological hypotheses, allowing researchers to better design proposed experiments, and perhaps reducing the number of biological experiments that need to be performed.

Acknowledgements. We thank Eyad Almasri for useful discussion.
References
1. Weaver, D., Workman, C., Stormo, G.: Modeling regulatory networks with weight matrices. In: Pacific Symp. Biocomp., vol. 99(4), pp. 112–123 (1999)
2. Butte, A.J., Tamayo, P., Slonim, D., Golub, T.R., Kohane, I.S.: Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proceedings of the National Academy of Sciences of the United States of America 97(22), 12182–12186 (2000)
3. Elo, L.L., Jarvenpaa, H., Oresic, M., Lahesmaa, R., Aittokallio, T.: Systematic construction of gene coexpression networks with applications to human T helper cell differentiation process. Bioinformatics 23(16), 2096–2103 (2007)
4. Huttenhower, C., Flamholz, A., Landis, J., Sahi, S., Myers, C., Olszewski, K., Hibbs, M., Siemers, N., Troyanskaya, O., Coller, H.: Nearest Neighbor Networks: clustering expression data based on gene neighborhoods. BMC Bioinformatics 8(1), 250 (2007)
5. Margolin, A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Favera, R., Califano, A.: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context. BMC Bioinformatics 7(suppl. 1), S7 (2006)
6. Basso, K., Margolin, A.A., Stolovitzky, G., Klein, U., Dalla-Favera, R., Califano, A.: Reverse engineering of regulatory networks in human B cells. Nat. Genet. 37(4), 382 (2005)
7. Chen, G., Larsen, P., Almasri, E., Dai, Y.: Rank-based edge reconstruction for scale-free genetic regulatory networks. BMC Bioinformatics 9(1), 75 (2008)
8. Friedman, N., Linial, M., Nachman, I., Pe’er, D.: Using Bayesian networks to analyze expression data. J. Comput. Biol. 7, 601 (2000)
9. Almasri, E., Larsen, P., Chen, G., Dai, Y.: Incorporating literature knowledge in Bayesian network for inferring gene networks with gene expression data. In: Măndoiu, I., Sunderraman, R., Zelikovsky, A. (eds.) ISBRA 2008. LNCS (LNBI), vol. 4983, pp. 184–195. Springer, Heidelberg (2008)
10. Hartemink, A.J., Gifford, D.K., Jaakkola, T.S., Young, R.A.: Combining location and expression data for principled discovery of genetic regulatory network models. In: Pac. Symp. Biocomput., pp. 437–449 (2002)
11. Imoto, S., Higuchi, T., Goto, T., Tashiro, K., Kuhara, S., Miyano, S.: Combining Microarrays and Biological Knowledge for Estimating Gene Networks via Bayesian Networks. In: Proceedings of the IEEE Computer Society Conference on Bioinformatics. IEEE Computer Society, Los Alamitos (2003)
12. Le Phillip, P., Bahl, A., Ungar, L.H.: Using prior knowledge to improve genetic network reconstruction from microarray data. In Silico Biology 4, 335–353 (2004)
13. Kulkarnia, K., Larsen, P., Linninger, A.A.: Assessing chronic liver toxicity based on relative gene expression data. Journal of Theoretical Biology 254(2), 308–318 (2008)
14. R Development Core Team: R: A Language and Environment for Statistical Computing, http://www.R-project.org
15. Gasch, A.P., Spellman, P.T., Kao, C.M., Carmel-Harel, O., Eisen, M.B., Storz, G., Botstein, D., Brown, P.O.: Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell 11, 4241–4257 (2000)
16. Larsen, P., Almasri, E., Chen, G., Dai, Y.: A statistical method to incorporate biological knowledge for generating testable novel gene regulatory interactions from microarray experiments. BMC Bioinformatics 8, 317 (2007)
17. GO Slim Mapper, http://db.yeastgenome.org/cgi-bin/GO/goTermMapper
18. BANJO, http://www.cs.duke.edu/~amink/software/banjo/
19. Herskovits, E., Cooper, G.: Algorithms for Bayesian belief-network precomputation. Methods Inf. Med. 30(2), 81–89 (1991)
20. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29 (2000)
Querying Protein-Protein Interaction Networks

Guillaume Blin, Florian Sikora, and Stéphane Vialette

Université Paris-Est, LIGM - UMR CNRS 8049, France
{gblin,sikora,vialette}@univ-mlv.fr
Abstract. Recent techniques increase the amount of our knowledge of interactions between proteins. To filter, interpret and organize these data, many authors have provided tools for querying patterns in the shape of paths or trees in Protein-Protein Interaction networks. In this paper, we propose an exact algorithm for querying graph patterns, based on dynamic programming and color-coding. We provide an implementation which has been validated on real data.
1 Introduction
Contrary to what was predicted years ago, the human genome project has highlighted that human complexity may not rely only on its genes (only 25 000 for human, compared to the 30 000 and 45 000 for the mouse and the poplar, respectively). This observation has led to an increase in interest in proteins (e.g., their numbers, functions, complexity and interactions). Among other protein properties, the set of all their interactions for an organism, called a Protein-Protein Interaction (PPI) network, has recently attracted a lot of interest. Knowledge of these networks increases in an exponential manner due to the use of various genome-scale screening techniques [10,12,23]. Unfortunately, acquiring such valuable resources is prone to a high noise rate [10,19]. Comparative analysis of PPI networks tries to determine the extent to which protein networks are conserved among species. Indeed, numerous lines of evidence suggest that proteins functioning together in a pathway (i.e., a path in the interaction graph) or a structural complex (i.e., an assembly of strongly connected proteins) are likely to evolve in a correlated fashion, and during evolution, all such functionally linked proteins tend to be either preserved or eliminated in a new species [17]. In this article, we focus on the following related problem, called Graph Query (formally defined later). Given a PPI network and a pattern in the shape of a graph, querying the pattern in the network consists of finding a subnetwork of the PPI network which is as similar as possible to the pattern. Similarity is measured both in terms of sequence similarity and graph topology conservation. Unfortunately, this problem is clearly equivalent to the NP-complete subgraph homeomorphism problem [9]. Recently, several techniques have been proposed to overcome the difficulty of this problem. By restricting the query to a path, Kelley et al. [16] were able to define a Fixed-Parameter Tractable (FPT) algorithm parameterized by the size of the query. Recall that a parameterized problem is FPT if it can be solved in f(k)·n^{O(1)} time, where f is a function only
depending on the parameter k, and n is the size of the input [8]. Pinter et al. [18] proposed an algorithm dealing with tree-shaped queries that is restricted to forest PPI networks (i.e., collections of trees). Later on, Shlomi et al. [21] proposed an alternative to [16], called QPath, for querying paths in a PPI network, which is based on the color-coding technique introduced by Alon, Yuster and Zwick [1]. In addition to being faster, QPath allows more flexibility by considering non-exact matches. Finally, Dost et al. [7] developed QNet, an algorithm to handle tree queries in the general context of PPI networks. The authors also gave some theoretical approaches for querying graphs by using a tree decomposition of the query. Since QNet is the major reference in this field and is quite related to the work presented in this article, let us present it briefly. QNet is an exact FPT algorithm for querying trees in a PPI network. The complexity is 2^{O(k)}·m, where k is the number of proteins in the query and m the number of edges of the PPI network. As QPath does, QNet uses dynamic programming and color-coding. For querying graphs in a network, QNet uses, as a subroutine, an exact algorithm to query trees. To do so, the authors perform a tree decomposition. A formal definition of a tree decomposition can be found in [4]. Roughly speaking, it is a transformation of a graph into a tree. A tree node (or a bag) can contain several graph nodes. There are several ways to perform such a transformation. The treewidth of a graph is the minimum (among all decompositions) of the cardinality of the largest bag minus one. Computing the treewidth is NP-hard [3]. From this tree decomposition, the time complexity of QNet is O(2^{O(k)} n^{t+1}), where k is the size of the query, n is the size of the PPI network, and t is the treewidth of the query. QNet is an exact algorithm for querying trees in a PPI network. A logical extension would be to query graphs. The authors of [7] provide a theoretical solution, without implementation, which depends on the query treewidth. We propose in this article an exact alternative solution, using color-coding (Section 2). We provide in Section 3 some experimental results.
2 PADA1 as an Alternative to QNet
In this section, we propose an alternative to QNet called PADA1 (Protein Alignment Dealing with grAphs). At the broadest level, QNet and PADA1 use the very same approach: transform the query into a tree and find an occurrence of that tree in the PPI network by dynamic programming. However, whereas QNet uses tree decompositions, PADA1 combines feedback vertex sets with node duplications (Algorithm Graph2Tree). It is worth mentioning that, following QPath and QNet, we will consider non-exact matches (i.e., allowing indels). Since we allow queries to be graphs, PADA1 is clearly an extension of QPath and an alternative to QNet.

2.1 Transforming the Query into a Tree
We begin by presenting Algorithm Graph2Tree to transform a graph G = (V, E) into a tree, without loss of information (i.e., one can reconstruct the graph
starting from the tree). Informally, the main idea of Algorithm Graph2Tree is to transform the graph into a tree by iteratively finding a cycle C, duplicating a node of C, and finally breaking cycle C by one edge deletion. Central to our approach is thus the node duplication procedure (Algorithm Duplicate); see Figure 1 for an illustration of breaking a cycle at vertex v1. For each u ∈ V, write d(u) for the set of all copies of vertex u, including itself.

Function Graph2Tree(G)
begin
    d(u) ← u for all u of V;
    for (i = 0; i < |V|; i++) do
        foreach subgraph G′ = (V′, E′) of G such that |V′| = |V| − i do
            if G′ is acyclic then
                foreach node u of V \ V′ do
                    foreach (u, v) ∈ E do
                        tmp ← G;
                        Duplicate(G, v, u, d);
                        if G is not connected anymore then
                            G ← tmp;
                        end
                    end
                end
                return G;
            end
        end
    end
end
Algorithm 1. “Brute-force” transformation algorithm
Function Duplicate(G = (V, E), va, vb, d)
begin
    Let i ← |d(vb)|;
    V ← V ∪ {vb^i};
    d(vb) ← d(vb) ∪ {vb^i};
    E ← E − {(va, vb)};
    E ← E ∪ {(va, vb^i)};
end
Algorithm 2. Algorithm to duplicate a node when a cycle is detected

Let F denote the set of all nodes of G that have been duplicated at the end of Algorithm Graph2Tree, i.e., F = {v ∈ V : |d(v)| > 1}. The cardinality of F turns out to be an important parameter since, as we will prove soon, the overall time complexity of PADA1 mostly depends on |F| and not on the
Fig. 1. Steps when Duplicate(G, v3, v1, d) is called on graph a). b) A node v1^1, a copy of v1, is created. c) The edge (v3, v1) is deleted: the cycle is then broken. d) The edge (v3, v1^1) is added. Finally, the resulting graph is acyclic, and d(v1) = {v1, v1^1}.
overall number of duplications. Minimizing the cardinality of F is the well-known NP-complete Feedback Vertex Set problem [15]: given a graph G, find a minimum-cardinality subset of vertices with the property that removal of these vertices from the graph eliminates all cycles. We have only implemented an algorithm using a “brute-force” solution for the Feedback Vertex Set problem. Since there are 2^|V| potential subgraphs, its complexity is O(2^|V| × |E|), but it still runs in seconds. Indeed, the overall complexity of PADA1 considerably limits the size of our graph query. However, one may also consider an efficient FPT algorithm such as the one of Guo et al. [11], using iterative compression, or a cubic [5] or quadratic [22] kernelization.
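For concreteness, a minimal Python sketch of such a brute-force Feedback Vertex Set search (enumerating subsets from smallest to largest and checking acyclicity); the data layout and function names are illustrative assumptions, not the paper's implementation:

from itertools import combinations

def is_acyclic(nodes, edges, removed):
    # DFS check that the undirected graph minus `removed` has no cycle
    adj = {v: [] for v in nodes if v not in removed}
    for u, v in edges:
        if u not in removed and v not in removed:
            adj[u].append(v)
            adj[v].append(u)
    seen = set()
    for root in adj:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, None)]
        while stack:
            node, parent = stack.pop()
            for nxt in adj[node]:
                if nxt == parent:
                    continue
                if nxt in seen:
                    return False  # non-tree edge => cycle
                seen.add(nxt)
                stack.append((nxt, node))
    return True

def min_feedback_vertex_set(nodes, edges):
    # smallest vertex subset whose removal leaves the graph acyclic
    nodes = list(nodes)
    for k in range(len(nodes) + 1):
        for removed in combinations(nodes, k):
            if is_acyclic(nodes, edges, set(removed)):
                return set(removed)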
2.2 Tree Matching
We now assume that the query has been transformed into a tree (with duplicated nodes) by Algorithm Graph2Tree, and hence we only consider tree queries from this point. We show that an occurrence of such a tree can be found in a PPI network by dynamic programming. Let us fix notations. PPI networks are represented by undirected weighted graphs GN = (VN, EN, w); each node of VN represents a protein and each weighted edge (vi, vj) ∈ EN represents an interaction between two proteins. A query is given by a tree TQ = (VQ, EQ) (the output of Algorithm Graph2Tree on the graph query). The set VQ represents proteins while EQ represents interactions between these proteins. There are no weights for the latter. Let h(p1, p2) be a function that returns a similarity score between two proteins p1 and p2. The similarity considered here is computed according to amino-acid sequence similarity (using BLAST [2]). In the following, given two nodes v1 and v2 of VQ (or VN), we write h(v1, v2) for the similarity between the two proteins corresponding to v1 and v2. A node v1 is considered to be homologous to a node v2 if the corresponding similarity score h(v1, v2) is above a given threshold. Biologically, one can assume that two homologous proteins probably have common functions. Clearly, for every node v of F, all nodes in d(v) are homologous with the same protein. An alignment of the query TQ and GN is defined as: (i) a subgraph GA = (VA, EA, w) ⊆ GN = (VN, EN, w), such that VA ⊆ VN and EA ⊆ EN, and (ii) a
Fig. 2. a) The graph query with a cycle, before calling the Graph2Tree algorithm. c) The query after calling Graph2Tree, where q1 has been duplicated. Thus, q1 and q1^1 have to be aligned with the same node of the network. b) and d) denote the resulting alignment graph GA, a subgraph of the network GN. The horizontal dashed lines denote a match between two proteins.
mapping σ : VQ → VA ∪ {del}. More precisely, the function σ is defined such that for all q of VQ, σ(q) = v if and only if q and v are homologous, and σ(q) = del otherwise. For a given alignment of TQ and GN, a node q of VQ is said to be deleted if σ(q) = del and matched otherwise. Moreover, any node va of VA such that σ^{-1}(va) is undefined is said to be inserted. Note that, similarly to QNet, only nodes of degree two can be deleted. For practical applications, the number of insertions (resp. deletions) is limited to at most Nins (resp. Ndel), each involving a penalty score δi (resp. δd). The Graph Query problem can thus be defined as follows: given a query TQ, a PPI network GN, a similarity function h, and penalty scores δi and δd for each insertion and deletion, find an alignment (GA, σ) between TQ and GN of maximal score. The score of an alignment is defined as the sum of (i) the similarity scores of aligned nodes, i.e., \sum_{v \in V_A, \sigma^{-1}(v) \text{ defined}} h(v, \sigma^{-1}(v)), (ii) the sum of the weights of all edges involved in GA, i.e., \sum_{e \in E_A} w(e), (iii) a penalty score δd for each node deletion, i.e., \sum_{q \in V_Q, \sigma(q) = del} \delta_d, and (iv) a penalty score δi for each node insertion, i.e., \sum_{v \in V_A, \sigma^{-1}(v) \text{ undefined}} \delta_i.
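A direct transcription of this scoring function as a minimal Python sketch; the sign convention for the penalties δd and δi is an assumption (they are taken to carry their sign), and the data layout is illustrative:

def alignment_score(sigma, query_nodes, aligned_nodes, aligned_edges,
                    h, w, delta_d, delta_i):
    # sigma : dict query node -> network node or the token 'del'
    # h     : similarity function; w : dict network edge -> weight
    matched = {v: q for q, v in sigma.items() if v != 'del'}
    score = sum(h(q, v) for v, q in matched.items())               # (i)
    score += sum(w[e] for e in aligned_edges)                      # (ii)
    score += delta_d * sum(1 for q in query_nodes
                           if sigma[q] == 'del')                   # (iii)
    score += delta_i * sum(1 for v in aligned_nodes
                           if v not in matched)                    # (iv)
    return score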
The general problem is NP-complete. However, it is Fixed-Parameter Tractable in case the query is a tree, by a combination of the color-coding technique [1] and dynamic programming. This randomized technique allows one to find simple paths of length k in a network in O(2^k) time (instead of the brute-force O(n^k) time algorithm), where n is the number of proteins in the network [20]. In [7], the authors
of QNet adapted this technique for their query algorithm. Since one is looking for an alignment, each node of the query has to be considered once (and only once) in an incremental build of the alignment by dynamic programming. Thus, one has to maintain a list of the nodes already considered in the query. Therefore, on the whole, one has to consider all O(n^k) potential alignments, with n = |VN| and k = |VQ|. Using color-coding, one may decrease this complexity to O(2^k). First, the nodes of the network are colored randomly using k colors, where k = |VQ|. Then, looking for a colorful alignment (i.e., an alignment that contains each color once) leads to a potential solution (i.e., not necessarily optimal). Therefore, one only needs to maintain a list of the colors already used in the alignment, storable in a table of size O(2^k). In order to get an optimal solution, this process is repeated. More precisely, according to QNet [7], since a colorful alignment happens with probability k!/k^k ≥ e^{-k}, the coloration step has to be done log(1/ε)·e^k times to obtain an optimal alignment with high probability (1 − ε, for any ε). The QNet dynamic programming algorithm can be summarized as follows. By an incremental construction, for each (qi, qj) of EQ, when one considers qi of VQ aligned with a node vi of VN, check whether the score of the alignment is improved through: (i) a match of qj and any vj of VN such that qj and vj are homologous and (vi, vj) ∈ EN, (ii) an insertion of a node vj of VN in the alignment graph GA, (iii) a deletion of qj. This is done for a given coloration of the network, and repeated for each coloration. Hereafter, we define an algorithm, inspired by QNet, which considers a query tree TQ and a PPI network GN and seeks an alignment (GA, σ). To deal with duplicated nodes (cf. the Graph2Tree algorithm), we pre-compute all possible assignments of the duplicated nodes of TQ. More precisely, for each q of F and for all q′ of d(q), one assigns σ(q′) to each v of VN. We then compute for each assignment A an alignment with respect to A. We denote this step BestConstraintAlignment (details omitted due to space constraints). The difficulty is to construct the best alignment by dynamic programming, with respect to A. As done in QNet, we use a set SC of k + Nins colors (as needed by the color-coding), which are used when a node is matched or inserted. Moreover, in order to deal with potentially duplicated nodes in TQ, we have to use a multi-set of colors (i.e., the colors in this set can appear more than once), rather than a classical set as in QNet. Indeed, every node in d(q) such that q ∈ F must use the same color. Algorithm 3 may be summarized as follows. Perform log(1/ε)·e^k random colorations of the PPI network GN to ensure optimality with a probability of at least 1 − ε. A coloration consists in assigning a random color of SC to each node of VN. Then, for each coloration, we build all possible valid assignments A of the duplicated nodes. An assignment A is valid if no two non-homologous nodes are matched in A. For each such assignment A, we compute the best alignment according to A with Algorithm BestConstraintAlignment. We keep the best score of these trials, and get the corresponding alignment by a classic backtracking technique.
1   Function PADA1(TQ, GN, h, threshold)
2   begin
3       BestGA ← ∅;
4       BestScore ← −∞;
5       for (i = 0; i < log(1/ε)·e^k; i++) do
6           randomly colorize GN with k + Nins colors;
7           foreach valid assignment A do
8               GA ← BestConstraintAlignment(GN, TQ, A, h, threshold);
9               if score(GA) > BestScore then
10                  BestGA ← GA;
11                  BestScore ← score(GA);
12              end
13          end
14      end
15      return BestGA;
16  end

Algorithm 3. Sketch of the PADA1 algorithm to align a query graph to a network
Let us analyze the complexity of PADA1. The whole complexity depends essentially on lines 5 to 12. Let us consider the complexity of one iteration (we have log(1/ε)·e^k iterations). The random coloration can be done in O(n), where n = |VN|. There are n^{|F|} possible assignments. The complexity of BestConstraintAlignment is 2^{O(k+Nins)}·m as in QNet, where k is the size of the graph query and m = |EN|, since our modifications are essentially additional tests which are done in constant time. Let us note that the complexity of Graph2Tree is negligible compared to the overall complexity of Algorithm PADA1. Indeed, the complexity of Algorithm Graph2Tree only depends on the query size k, with k ≪ n. Therefore, on the whole, the complexity of PADA1 is O(n^{|F|} · 2^{O(k+Nins)} · m) time. Observe that the time complexity does not depend on the total number of duplicated nodes but on the size of F.
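To make the color-coding ingredient concrete, here is a minimal Python sketch of the underlying primitive for simple paths [1,20] (not PADA1 itself): color subsets are tracked as bitmasks, and the coloring is repeated log(1/ε)·e^k times as in the text. Names and data layout are illustrative assumptions.

import math, random

def colorful_path_exists(adj, k, colors):
    # dp[v] = set of color-subset bitmasks of colorful paths ending at v
    dp = {v: {1 << colors[v]} for v in adj}
    for _ in range(k - 1):
        nxt = {v: set() for v in adj}
        for u in adj:
            for v in adj[u]:
                bit = 1 << colors[v]
                for mask in dp[u]:
                    if not mask & bit:   # extend only if v's color is unused
                        nxt[v].add(mask | bit)
        dp = nxt
    full = (1 << k) - 1
    return any(full in masks for masks in dp.values())

def find_path_of_length_k(adj, k, eps=0.01):
    # repeating the random coloring log(1/eps)*e^k times succeeds with
    # probability at least 1 - eps
    trials = math.ceil(math.log(1 / eps) * math.e ** k)
    for _ in range(trials):
        colors = {v: random.randrange(k) for v in adj}
        if colorful_path_exists(adj, k, colors):
            return True
    return False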
3 Experimental Results
According to the authors of QNet, one may query a PPI network by running an O(2^{O(k)} n^{t+1}) time algorithm log(1/ε)·e^k times, where t is the treewidth of the query. Thus, the difference between the two algorithms is mainly a question of t + 1 versus |F| (i.e., the size of the set of families of duplicated nodes computed by Algorithm Graph2Tree). These two parameters are not easily comparable, except for trivial cases. However, we have performed some experimental tests to compare these two parameters on random graphs. Figure 3 suggests that the parameter |F| is usually smaller for moderate-size graphs (i.e., those graphs for which PADA1 is still practicable). Observe however that there are graphs with treewidth smaller than |F|, and hence no definitive conclusion can be drawn.
[Figure 3: plot titled ‘Algorithms Comparison’; x-axis: Graph size (4 to 16); y-axis: Value (1.5 to 4.5); two series: QNet (Treewidth+1) and PADA1 (|F|).]
Fig. 3. Comparison between QNet (i.e., the treewidth+1 value) and PADA1 (i.e., the size of F computed after running the Graph2Tree algorithm). The method is as follows: for each different size of graph, we take the average treewidth and |F| values over 30 000 connected graphs, randomly constructed with the NetworkX library (http://networkx.lanl.gov/). Treewidth is computed with the exact algorithm provided by http://www.treewidth.com/, while the size of F is computed with our Graph2Tree algorithm.
In practice, our upper bound is largely overestimated. Indeed, each element of F must be assigned to a different node of the network, so there are not n possibilities for each element of F. The number of executions of BestConstraintAlignment is n!/(n − |F|)!, the number of such combinations. Moreover, we only consider valid assignments, and there are only a few such assignments. Indeed, a protein is, on average, homologous to dozens of proteins, which is far less than the number of proteins in a classical PPI network (e.g., n ≈ 5,000 for the yeast). For example, if |F| = 3 and if each protein represented by an element of F is homologous to ten proteins in the PPI network, then the number of assignments will not be n^3 but only 10^3. Here, the running time is largely less than the worst-case time complexity. Therefore, and not surprisingly, the BLAST threshold used to determine whether a protein is homologous to another has a huge impact on the running time of the algorithm. Finally, observe that in QNet, for a given treewidth, the query graph can be very different. For example, in the resulting tree decomposition of the graph, there is no limit on the number of bags of size t. Furthermore, in a given bag, the topology is arbitrary (e.g., a clique), potentially requiring an exhaustive enumeration upper-bounded by n^{t+1}. Therefore, the treewidth value does not indicate how many times an exhaustive enumeration has to be done.
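The counting argument above can be checked with a few lines (illustrative only):

import math

def worst_case_assignments(n, f):
    # n! / (n - |F|)!: ordered choices of distinct network nodes
    # for the |F| duplicated-node families
    return math.factorial(n) // math.factorial(n - f)

def homology_limited_assignments(homolog_counts):
    # with homology filtering, only the per-family homolog counts matter,
    # e.g. [10, 10, 10] -> 1000 instead of n**3
    return math.prod(homolog_counts)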
Fig. 4. A result sample of our algorithm. a) A human MAPK query, taken from [14], with three cycles. b) The alignment graph given by our algorithm in the fly PPI network. Dashed lines denote the BLAST homology scores between the two proteins. Our algorithm retrieves a query graph in another network. As in QNet [7], it seems that there is some conservation between these two species.
We would have liked to compare our algorithm to QNet in practice, but, unfortunately, their version querying graphs is not yet implemented. Comparing our algorithm for simple tree queries with QNet would not make sense since PADA1 is not optimized for these special cases. In order to validate our algorithm, we performed the experimental tests on real data proposed by QNet [7]. In our experiments, the data for the PPI networks of the fly and the yeast were obtained from the DIP database (http://dip.doe-mbi.ucla.edu/) [24]. The yeast network contains 4,738 proteins and 15,147 interactions, whereas the fly network contains 7,481 proteins and 26,201 interactions. The first experiment consists in retrieving trees. To do so, the authors of QNet randomly extracted tree queries of size 5 to 9 from the yeast network and tried to retrieve them in this network. Each query is modified with at most two insertions or deletions. We also successfully retrieved these queries. The second experiment was performed across species. The Mitogen-Activated Protein Kinase (MAPK) pathways are a collection of signal transduction pathways. According to [6], they have a critical function in the cellular response to extracellular stimuli. They are known to be conserved across different species. We obtained the human MAPK queries from the KEGG database [14] and tried to retrieve them in the fly network, as done in QNet. While QNet uses only trees, we were able to query graphs. The results were satisfying since we retrieved them, with few or no modifications. Figure 4 shows a sample of our results on real data. This suggests a potential conservation of patterns across species. The BLAST threshold has a deep impact on the running time, as does |F|. Moreover, we could probably speed up the running time by using the Hüffner et al. technique [13], which basically consists in increasing the number of colors used during the coloration step.
4 Conclusion
In this paper, we have tried to improve our understanding of PPI networks by developing a tool called PADA1 (available upon request) to query graphs in
PPI networks. The time complexity of this algorithm is n^{|F|} · 2^{O(k)}, where n is the size of the PPI network, k is the size of the query, and |F| is the minimum number of nodes which have to be duplicated to transform the query graph into a tree (solving the Feedback Vertex Set problem). This is the main difference from QNet of Dost et al. [7], which uses the treewidth of the query (an unimplemented algorithm). We have performed some tests on real data and have retrieved known paths in the yeast PPI network. Moreover, we have retrieved known human paths in the fly PPI network. The time complexity of our algorithm depends on the number of nodes which have to be duplicated in the graph query, which in turn depends on the initial topology of the query graph. Obtaining more information about the topology of the queries is of particular interest in this context. Future work includes using this information to predict average time complexity.
References
1. Alon, N., Yuster, R., Zwick, U.: Color coding. Journal of the ACM 42(4), 844–856 (1995)
2. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. Journal of Molecular Biology 215(3), 403–410 (1990)
3. Arnborg, S., Corneil, D.G., Proskurowski, A.: Complexity of finding embeddings in a k-tree. Journal on Algebraic and Discrete Methods 8(2), 277–284 (1987)
4. Bodlaender, H.L.: A tourist guide through treewidth. Acta Cybernetica 11, 1–23 (1993)
5. Bodlaender, H.L.: A cubic kernel for feedback vertex set. In: Thomas, W., Weil, P. (eds.) STACS 2007. LNCS, vol. 4393, pp. 320–331. Springer, Heidelberg (2007)
6. Dent, P., Yacoub, A., Fisher, P.B., Hagan, M.P., Grant, S.: MAPK pathways in radiation responses. Oncogene 22, 5885–5896 (2003)
7. Dost, B., Shlomi, T., Gupta, N., Ruppin, E., Bafna, V., Sharan, R.: QNet: A Tool for Querying Protein Interaction Networks. In: Speed, T., Huang, H. (eds.) RECOMB 2007. LNCS (LNBI), vol. 4453, pp. 1–15. Springer, Heidelberg (2007)
8. Downey, R., Fellows, M.: Parameterized Complexity. Springer, Heidelberg (1999)
9. Garey, M.R., Johnson, D.S.: Computers and Intractability: a guide to the theory of NP-completeness. W.H. Freeman, San Francisco (1979)
10. Gavin, A.C., Boshe, M., et al.: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 414(6868), 141–147 (2002)
11. Guo, J., Gramm, J., Hüffner, F., Niedermeier, R., Wernicke, S.: Compression-based fixed-parameter algorithms for feedback vertex set and edge bipartization. Journal of Computer and System Sciences 72(8), 1386–1396 (2006)
12. Ho, Y., Gruhler, A., et al.: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415(6868), 180–183 (2002)
13. Hüffner, F., Wernicke, S., Zichner, T.: Algorithm Engineering For Color-Coding To Facilitate Signaling Pathway Detection. In: Proceedings of the 5th Asia-Pacific Bioinformatics Conference. Imperial College Press (2007)
14. Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., Hattori, M.: The KEGG resource for deciphering the genome. Nucleic Acids Research 32, 277–280 (2004)
15. Karp, R.M.: Reducibility among combinatorial problems. In: Thatcher, J.W., Miller, R.E. (eds.) Complexity of computer computations, pp. 85–103. Plenum Press, New York (1972)
16. Kelley, B.P., Sharan, R., Karp, R.M., Sittler, T., Root, D.E., Stockwell, B.R., Ideker, T.: Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proceedings of the National Academy of Sciences 100(20), 11394–11399 (2003)
17. Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D., Yeates, T.O.: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. PNAS 96(8), 4285–4288 (1999)
18. Pinter, R.Y., Rokhlenko, O., Yeger-Lotem, E., Ziv-Ukelson, M.: Alignment of metabolic pathways. Bioinformatics 21(16), 3401–3408 (2005)
19. Reguly, T., Breitkreutz, A., Boucher, L., Breitkreutz, B.J., Hon, G.C., Myers, C.L., Parsons, A., Friesen, H., Oughtred, R., Tong, A., et al.: Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae. Journal of Biology (2006)
20. Scott, J., Ideker, T., Karp, R.M., Sharan, R.: Efficient algorithms for detecting signaling pathways in protein interaction networks. Journal of Computational Biology 13, 133–144 (2006)
21. Shlomi, T., Segal, D., Ruppin, E., Sharan, R.: QPath: a method for querying pathways in a protein-protein interaction network. BMC Bioinformatics 7, 199 (2006)
22. Thomassé, S.: A quadratic kernel for feedback vertex set. In: Proceedings of SODA 2009 (to appear)
23. Uetz, P., Giot, L., et al.: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403(6770), 623–627 (2000)
24. Xenarios, I., Salwinski, L., Duan, X.J., Higney, P., Kim, S.M., Eisenberg, D.: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Research 30(1), 303 (2002)
Integrative Approach for Combining TNFα-NFκB Mathematical Model to a Protein Interaction Connectivity Map

Mahesh Visvanathan1, Bernhard Pfeifer2, Christian Baumgartner2, Bernhard Tilg2, and Gerald Henry Lushington1

1 Bioinformatics Core Facility, University of Kansas, Lawrence, KS 66047
[email protected]
2 University for Health Sciences, Medical Informatics and Technology (UMIT), Hall in Tyrol, Austria
Abstract. We have investigated different mathematical models for signaling pathways and built a new pathway model for TNFα-NFκB signaling using an integrative analytical approach. This integrative approach consists of a knowledgebase, a model designing/visualization environment, and a simulation environment. In particular, our new TNFα-NFκB signaling pathway model was developed within this approach based on literature studies, ordinary differential equations, and a detailed protein-protein interaction connectivity map. Using the most detailed mathematical model as a base model, three new relevant proteins (TRAF1, FLIP, and MEKK3) were identified and included in our new model. Our results show that this integrative approach offers the most detailed and consistent mathematical description of TNFα-NFκB signaling and further increases the understanding of the TNFα-NFκB signaling pathway. Keywords: TNFα-mediated NFκB signaling pathway, protein-protein interaction, mathematical model.
1 Introduction

Interactions of molecules are essential for almost all cellular functions. Genes and proteins seldom carry out their functions in isolation; they operate through a number of interactions with other biomolecules. Molecular interactions in biological pathways and networks are highly dynamic and may be controlled by feedback loops and feed-forward regulation mechanisms, as well as by dependence on other cellular hierarchies. These properties make experimental elucidation and computational analysis of pathways extremely challenging. Mathematical modeling is becoming increasingly important as a tool to capture molecular interactions and dynamics from high-throughput experiments. Biological pathways and networks are often represented graphically. Ordinary differential equations (ODEs) have been commonly used to describe the kinetics of association and dissociation among molecules in chemical or biochemical reactions.
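As a minimal illustration of such ODE kinetics, consider a single reversible binding reaction A + B ⇌ C under mass-action kinetics. The rate constants and initial concentrations in the Python sketch below are hypothetical; real pathway models of the kind discussed in this paper couple dozens of such equations.

import numpy as np
from scipy.integrate import odeint

k_on, k_off = 0.5, 0.1                # hypothetical association/dissociation rates

def rhs(y, t):
    a, b, c = y
    flux = k_on * a * b - k_off * c   # net mass-action flux of A + B <-> C
    return [-flux, -flux, flux]

t = np.linspace(0.0, 50.0, 500)
trajectory = odeint(rhs, [1.0, 0.8, 0.0], t)  # columns: [A], [B], [C]
print(trajectory[-1])                          # near-equilibrium concentrations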
Tumor necrosis factor-alpha (TNFα) is a cytokine involved in systemic inflammation and is a member of a group of cytokines that all stimulate the acute-phase reaction. Dysregulation and over-production of TNFα have been implicated in the pathogenesis of a wide spectrum of human diseases, e.g., sepsis, diabetes, cancer, osteoporosis, multiple sclerosis, and Crohn's disease. Eliminating TNFα with a specific monoclonal antibody, e.g., infliximab, has had dramatic effects on the phenotype of these diseases, but also severe side effects. To find specific ways to block or partially block TNFα actions with the fewest side effects, we need to learn more about the signaling pathways of TNFα. Some research efforts have shifted from intercellular to intracellular signaling in order to increase the knowledge about the cell proteins involved and to get a better understanding of the molecular dynamics of the TNFα-related pathways, especially the TNFα-NFκB signaling pathway. Biological cartoons and analytical pathway models have been constructed for the TNFα-NFκB signaling pathway. These models can be used for computer simulation studies in order to gain insights for setting up various experimental scenarios; a better understanding of the diseases makes drug development more targeted, with minimal side effects [1, 31]. Reliable information about the proteins and molecular interactions of the signaling pathway is needed to improve the structure of a pathway model. It is known that several protein purification methods are not capable of detecting protein activities at natural concentrations, and that protein over-expression can cause non-natural protein complex building [20, 25], which could introduce faults into the modeling. Although a significant amount of biological and biochemical literature is now available in this research field, some studies examine only small parts of the pathway, sometimes with methods relying on over-expression. In this paper, we introduce an integrative approach to analyze an experimental TNFα-NFκB signaling pathway model. We built an integrative database that includes mathematical modeling data and literature information as well as biological data. In particular, the modeling data consists of kinetic constants, rate equations, and initial concentrations for building mathematical ODE models. The biological data includes descriptions of proteins, protein-protein interactions, and other information about signaling pathways. All information can be retrieved and visualized within our integrative computational framework. The paper is organized as follows: Section 2 provides our methodology and delineates the modeling of the TNFα-NFκB signaling pathway based on ODE models and the literature. Section 3 describes the results obtained from our extended TNFα-NFκB signaling pathway model. Section 4 gives our discussion and conclusion of this work.
2 Methodology

2.1 Integrating Heterogeneous Information for Modeling

The immunosuppression associated with neutralization of TNFα through infliximab results in serious adverse effects, e.g., systemic tuberculosis, allergic granulomatosis of the lung, or mild leucopenia in patients with active ankylosing spondylitis [19].
Because of these clinical adverse effects, additional information on the mechanisms of the action of infliximab is needed. In order to increase understanding of the mechanism of TNFα, to find more specific drug targets, to minimize adverse effects, and to maintain the duration of the treatment, the focus of research has been shifting from intercellular signals, such as TNFα, to intracellular signals, such as the signal transduction pathway from the membrane receptors of TNFα to the transcription factors AP1 and NFκB and to apoptosis. An exact description of the TNFα pathway is no easy task. Various publications on this topic exist, but none of them provides a comprehensive assessment of the interactions of all the proteins believed to participate in the pathway. Evidence for the uncertainty about the pathway structure is provided by the numerous different biological cartoons dealing with TNFα and NFκB [5, 8, 10, 11, 21, 22, 25, 30]; some of these cartoons use quite different proteins. A recently developed method, the TAP (tandem affinity purification) tag strategy [9, 13], seems to overcome some of the problems and opens a new door for large-scale in vitro experiments. Further problems have arisen from the fact that many experiments were based on protein over-expression, which can produce unphysiological complex building; this problem is also solved by the TAP tag strategy [5]. The development of a precise mathematical model is a very difficult task. As the biological knowledge and the experimental data used are insufficient, various assumptions have to be made concerning the kinetic parameters and the concentrations of the proteins involved. The first step in producing a mathematical model is to develop a detailed qualitative model, or cartoon, outlining the participating proteins after collecting biological and experimental data. This qualitative model/cartoon then has to be translated into a quantitative mathematical model [10]. There are two principal ways to model the kinetics of biochemical reactions: a deterministic formulation based on nonlinear ordinary differential equations (ODEs) for large numbers of molecules, and a stochastic formulation rooted in the exponential distribution law. Deterministic models have been commonly used, as they can be easily applied with existing off-the-shelf software, while stochastic models are currently getting more attention and are being further developed to capture the inherent randomness of molecular interactions [26, 27]. A pathway cartoon comprises only qualitative information about the biochemical pathway of interest (e.g., the TNFα-NFκB signaling pathway in our work). The proteins in the cartoon are placed in the region of the cell where they are situated, e.g., the receptor is at the cell membrane, while interacting proteins that may be located in the cell nucleus are far away. Information about protein interactions and complex building is encoded in the biological cartoons. Such cartoons do not explain molecular dynamics quantitatively, but they provide a schematic visual overview of the overall dynamics and information processing in cellular systems. Mathematical models of a biochemical pathway can show the chronology of protein complex association and dissociation, with respect to the initial concentrations of the proteins and the kinetic parameters that control the simulation and analysis of these models [11, 12, 13, 14, 19].
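As a counterpart to the deterministic sketch in the introduction, the stochastic formulation can be illustrated with a minimal direct-method Gillespie simulation of the same reversible binding reaction A + B ⇌ C; the molecule counts, volume, and rate constants below are again hypothetical.

import random

def gillespie(a, b, c, k_on, k_off, t_end, volume=100.0):
    # Direct-method stochastic simulation on molecule counts.
    t = 0.0
    while t < t_end:
        p_bind = k_on * a * b / volume   # bimolecular propensity scales with 1/V
        p_unbind = k_off * c
        total = p_bind + p_unbind
        if total == 0.0:
            break
        t += random.expovariate(total)   # exponentially distributed waiting time
        if random.random() * total < p_bind:
            a, b, c = a - 1, b - 1, c + 1
        else:
            a, b, c = a + 1, b + 1, c - 1
    return a, b, c

print(gillespie(a=100, b=80, c=0, k_on=0.5, k_off=0.1, t_end=50.0))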
Protein-protein interaction connectivity maps also provide important information on which proteins are co-purified with distinct target proteins.
Comparing and integrating the pathway mathematical model with the protein-protein interaction connectivity maps is a challenge. Based on various cartoons of the TNFα-NFκB signaling pathway, we identified the graphical dynamics of protein complex building. A graphical presentation of a qualitative pathway model of the TNFα-NFκB signaling pathway is shown in Figure 1.

2.2 Architecture of the Integrative Framework

Our integrative framework was designed as the 3-tier architecture shown in Figure 2. It consists of a Java-based pathway designing-visualization environment and a simulation environment in the upper application tier, a Java Database Connectivity-Open Database Connectivity (JDBC-ODBC) bridge in the middle tier, and a relational data management system in the back-end database tier. Within this framework, we first designed the TNFα-NFκB pathway model in the designing-visualization environment, which allows inclusion of mathematical modeling data, simulation data, and biological data from our integrated knowledgebase in the database tier. Then, the designed TNFα-NFκB pathway model was exported in an XML format from the designing environment to the simulation environment. The organizational structure of the database tier was designed to represent different levels of data. The entities of the knowledgebase contain information grouped into three categories: molecular components, reactions, and pathways. These entities inherit both the biological and the modeling information concerning the specified pathways.
Fig. 1. The TNFα-NFκB pathway model that was developed within our framework.
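The paper does not specify the XML schema used to pass models from the designing environment to the simulation environment; the Python sketch below merely illustrates the idea with a made-up minimal format, and every tag and attribute name is an assumption.

import xml.etree.ElementTree as ET

def export_pathway(name, species, reactions, path):
    # species: id -> initial concentration;
    # reactions: id -> (reactant ids, product ids, rate equation string).
    root = ET.Element("pathway", name=name)
    for sid, conc in species.items():
        ET.SubElement(root, "species", id=sid, initialConcentration=str(conc))
    for rid, (reactants, products, rate) in reactions.items():
        r = ET.SubElement(root, "reaction", id=rid, rateEquation=rate)
        for sp in reactants:
            ET.SubElement(r, "reactant", species=sp)
        for sp in products:
            ET.SubElement(r, "product", species=sp)
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

export_pathway("TNFa-NFkB (toy fragment)",
               {"TNFa": 1.0, "TNFR1": 0.5, "TNFa_TNFR1": 0.0},
               {"r1": (["TNFa", "TNFR1"], ["TNFa_TNFR1"], "k1*TNFa*TNFR1")},
               "pathway.xml")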
Fig. 2. The framework was designed as a 3-tier architecture that includes a Java-based pathway designing, simulation, and visualization environment as the upper application tier, a JDBC-ODBC middle tier, and an integrative database as the back-end tier.
Specifically, the knowledgebase incorporates biological knowledge about components, reactions, and pathways from three external online protein databases, the Biomolecular Interaction Network Database (BIND) [2, 12], the Database of Interacting Proteins (DIP) [22], and the Munich Information Center for Protein Sequences (MIPS) protein-protein interaction database [18], as well as from internal experimental verifications and literature studies. The mathematical modeling knowledge includes kinetic constants, rate equations, and initial concentrations, and is related to the components and reactions of the pathway under investigation (e.g., the TNFα-NFκB signaling pathway in our work). In the general deterministic formulation, mathematical models of pathways use differential equations relying on fundamental assumptions. For example, based on biological observations, a signal transduction system is assumed to behave as a slowly varying non-linear system (as a function of time) during a reaction period. Further, it is assumed that a cell keeps the concentration of each signaling protein constant before and after each signaling event; that is, the concentrations of these proteins return to steady state after the reaction. With these molecular kinetics assumptions, the ordinary differential equations for the mathematical models are derived and stored for simulation of the signaling pathway of interest. Through our integrative analysis, we have derived a new ODE model that also incorporates protein-protein interaction connectivity map information for the TNFα-NFκB signaling pathway. Simulations and analyses were done using a graphical user interface that was designed and developed using MATLAB®. More details are provided in the results section.
3 Results

For signaling pathway modeling, the following problems occur:

• It is not always clear which proteins should be used to set up a stable model basis that can easily be extended.
• What is the chronology of protein interactions, and which proteins collectively constitute a complex?
• How fast and how long do the proteins interact (i.e., what are the kinetic parameters)?
These problems are major considerations when modeling the TNFα-NFκB signaling pathway within our framework. The mathematical models related to the TNFα pathway described by Cho [6, 7], Schoeberl [24], and Ihekwaba [14] contain similar proteins, and the general dynamics of complex building in them is almost identical. Among these models, we identified the model by Cho (we refer to it as Model A) as the best available model for further improvement in order to address the pathway modeling problems mentioned previously. Several other possible extensions/improvements of Model A were also identified depending on research foci, and we concluded that three additional proteins, TRAF1, FLIP, and MEKK3, should be introduced into Model A to improve correspondence with experimental observations. The main focus of our work was to model the TNFα-NFκB signaling pathway starting from the TNFα receptor to the transcription factor AP1, rather than going into other levels. By including these three new proteins, we derived an initial extended pathway model using our integrated analytical framework. Table 1 shows a quantified summary justifying the inclusion of TRAF1, FLIP, and MEKK3, which were identified in the existing models but not in the interaction map; hence we incorporated these components into our new mathematical model. It is worth mentioning that TRAF1 and FLIP were also considered in the mathematical models by Schoeberl and Ihekwaba, and MEKK3 has been intensively discussed [24, 14]. All proteins contained in the cell proliferation module of the extended pathway model were identified and compared, among other means, via the TAP tag strategy and the protein-protein interaction connectivity map [5]. We noted that some basic interactions that were modeled could not be found in the protein-protein interaction connectivity map (i.e., TNFR1-TRADD, TRADD-FADD, TRADD-RIP, RIP-Caspase8 and RIP-IKK). Since cIAP has not been further examined, the interaction with effector Caspases was not modeled. The analysis resulted in a final ODE pathway model, which we refer to as Model B. In Model B, several single proteins such as TNFR1, TRADD, RIP1, TRAF2, IKK, IκB and NFκB were identified in the protein-protein interaction connectivity map (Figure 4). The connectivity map comprises no complexes, only individual proteins. IKK is represented by its subunits IKKα, IKKβ and IKKγ, colored lilac, and IκB is represented by its subunits IκBα, IκBβ and IκBε, colored orange. NFκB's family members (monomers) are colored red, and their precursors are mentioned in the connectivity map, i.e., NFκB1 (p50; precursor: p105), NFκB2 (p52; precursor: p100), p65 (RelA), c-Rel
Fig. 3. The new TNFα-NFκB pathway model (Model B) includes three new proteins: MEKK3 (m32), FLIP (m33) and TRAF1 (m35). Circles (labeled m plus a number) represent the protein concentration states, and directed arrows represent dynamic relations (i.e., rate equations). The protein complex TNF/TNFR1/TRADD/RIP1/TRAF2/MEKK3/IKK, which connects to NFκB, is colored red. The new protein complexes are colored magenta, e.g., Caspase8/FLIP (m34), TRAF1/Caspase8* (m36) and TRAF2/TRAF1c (m38).
(Rel), and RelB. All seven of these proteins are present in both Model B and the connectivity map. With limited experimental data, further modeling improvements are very difficult. The incorporated biological and modeling knowledge can be retrieved from the knowledgebase and integrated into the extended TNFα-NFκB signaling pathway model within our framework. We performed a systematic examination of the TNFα-NFκB signaling pathway by simulating different scenarios to analyze the sensitivity of Model B with respect to possible changes of the kinetic parameters. The extended TNFα-NFκB signaling pathway Model B, now including all three new components, namely the proteins MEKK3, FLIP and TRAF1, is indeed quite a stable model. TRAF1 and Caspase8* are
released reversibly from the protein complexes, and TRAF1 is irreversibly cleaved, yielding TRAF1c and Caspase8*. A comparison of the protein concentrations was also performed by plotting, side by side, the concentrations of the components of the existing mathematical models, i.e., the model by Cho [7] (Model A) and the models by Schoeberl [24] and Ihekwaba [14], and of the new Model B. Components that show extraordinary differences were examined via further literature investigation. As a result, our extended TNFα-NFκB signaling pathway Model B appears to be the most detailed and consistent mathematical model with protein-protein interaction connectivity information incorporated in it.
Fig. 4. The protein-protein interaction connectivity map (Bouwmeester et al. 2004). The proteins that are colored are modeled in our new TNFα-NFκB signaling pathway model B.
Table 1. Quantifiable summary of the mathematical models validated in relation to TNF

Existing Mathematical Models   Total Number of Components   Common Components between the Models   Total Components Mapping to TNF Interaction Map
Schöberl EGF Model             94                           2 (TRAF1, FLIP)                        3
Ihekwaba MAPK Model            14                           3 (TRAF1, FLIP, MEKK3)                 6
Cho TNF Model                  46                           2 (TRAF1, FLIP)                        3
Bonizzi NFkB Model             39                           4 (TRAF1, FLIP, MEKK3, NFkB)           16
Chung TGF Model                35                           2 (MEKK3, FLIP)                        3
TNF New Model (Model B)        81                           4 (TRAF1, FLIP, MEKK3, NFkB)           31
4 Discussion

Complex time-dependent interactions of intracellular proteins, as a function of protein concentrations as well as other kinetic factors, cause a high degree of variability in biological functions. To shed light on complex intracellular interaction systems, we first have to identify the proteins involved and then model their interactions in an appropriate manner. TNFα is a cytokine that is implicated in the pathogenesis of various human diseases, and it has therefore been the subject of intense investigations aimed at better understanding the exact signal flow. The objective of this work was to build a new TNFα-NFκB signaling pathway model that allows integration of a protein-protein interaction connectivity map. The existing mathematical models of the TNFα-NFκB signaling pathway contain similar proteins and an identical concept underlying the complex building. Bouwmeester et al. [5] published a connectivity map giving a detailed view of the protein-protein interactions of the TNFα-NFκB signaling pathway at physiological protein concentrations; this map was used as a preliminary guide for our extended mathematical model and employed as a point of comparison. Mathematical models by Cho [7], Schoeberl [24] and Ihekwaba [14] are available for the TNFα-NFκB signaling pathway. Improving upon the existing pathway models is very difficult and challenging. The decision to use the mathematical model by Cho (i.e., Model A) as our base model and to extend it was based on our impression that it is the most detailed mathematical model available today for TNFα-NFκB signaling, covering the pathway from the TNFα receptor up to the transcription factor AP1. The Schoeberl [24] and Ihekwaba [14] models describe signaling from the IKK and NFκB level onwards, and Ihekwaba's model concentrates more on details of modeling IKK interactions than on the TNFα-NFκB signaling as a whole.
Having selected Model A for extension, we found that this model contained several inconsistencies regarding the "recycling" of particular proteins after protein complex dissociation for the TNFα-NFκB signaling pathway as a whole. Cho et al. did not clearly justify all of the choices underlying their model construction. We identified some inconsistencies and correspondingly revised the model so that it corresponds more directly with the information evident from the protein-protein interaction connectivity map. The resulting extended Model B can serve as a more consistent mathematical description of the TNFα-NFκB signaling pathway. Our overall integrative analytical approach should serve as a reasonable point of comparison for future models as further modeling improvements arise when new or more accurate experimental data become available; in essence, the knowledgebase of our integrative framework will grow accordingly. Three new proteins, TRAF1, FLIP, and MEKK3, were included in the extended Model B. TRAF1 and FLIP were already considered in the Schoeberl model, and MEKK3 has also been intensively discussed in the TNFα-NFκB signaling pathway literature; the importance of this protein has also been pointed out by Bouwmeester et al. [5]. The exact localization of MEKK3 in the signaling pathway and the protein interaction network is not fully clear at the moment. That is why its actual position in the extended pathway model has to be regarded as a first approximation.
5 Conclusion

One of the main goals of this work was to use the information of the protein-protein interaction connectivity map to build an improved pathway model. In practice, we found that several protein interactions mentioned in the literature could not be found in the connectivity map provided by Bouwmeester et al. [5], especially interactions situated in close proximity to the cell membrane. In this regard, the Bouwmeester connectivity map just outlines which proteins are part of the TNFα-NFκB signaling pathway, but does not exactly specify all the interactions occurring between them. To construct a valid mathematical model, it is necessary to have a complete view of the interactions in the signaling pathway. If all of the important interactions were revealed, this might require changing the general model structure, as the connectivity map suggests several alternative inhibiting and activating signal flow paths that have not been modeled yet. Such limitations of experimental data also make it more challenging to develop a pathway model focusing specifically on the TNFα-NFκB signaling pathway: information from several sources has to be combined and adapted to fit into one pathway model. Hence, this task makes combining models both more challenging and more interesting. Stochastic mathematical modeling is also one of our research interests: we are currently extending a Bayesian probabilistic graphical model framework [30] for the TNFα-NFκB signaling pathway, and we have ongoing efforts to validate the structure of the TNFα-NFκB signaling pathway by incorporating information from new experimental analysis and simulation studies. Specifically, the iterative interplay between experimental analysis and modeling strategies, in which new experimental data are incorporated into the knowledgebase, yielding more informed pathway analysis models, will hopefully yield an increasingly physiologically realistic
model, and simulation studies performed on such improved signaling pathway models should enable increasingly sophisticated and accurate interpretations of the biological system from a global systemic view.
References

[1] Kitano, H.: Computational systems biology. Nature 420, 206–210 (2002)
[2] Alfarano, C., Andrade, C.E., Anthony, K., et al.: The Biomolecular Interaction Network Database and related tools 2005 update. Nucleic Acids Res. 33(Database issue), D418–D424 (2005)
[3] Barken, D., Wang, C.J., Kearns, J., et al.: Comment on "Oscillations in NF-kappaB signaling control the dynamics of gene expression". Science 308(5718), 52 (2005)
[4] Bonizzi, G., Karin, M.: The two NF-kappaB activation pathways and their role in innate and adaptive immunity. Trends Immunol. 25(6), 280–288 (2004)
[5] Bouwmeester, T., Bauch, A., Ruffner, H., et al.: A physical and functional map of the human TNF-alpha/NF-kappa B signal transduction pathway. Nat. Cell Biol. 6(2), 97–105 (2004)
[6] Cho, H., Shin, Y., Kolch, W., et al.: Experimental design in systems biology based on parameter sensitivity analysis with Monte Carlo simulation: a case study for the TNFα-mediated NF-κB signal transduction pathway. Simulation 12, 726–739 (2003a)
[7] Cho, K.H., Shin, S.Y., Lee, H.W., et al.: Investigations into the analysis and modeling of the TNF alpha-mediated NF-kappa B-signaling pathway. Genome Res. 13(11), 2413–2422 (2003b)
[8] Chung, J.Y., Lu, M., Yin, Q., et al.: Structural revelations of TRAF2 function in TNF receptor signaling pathway. Adv. Exp. Med. Biol. 597, 93–113 (2007)
[9] Cox, D.M., Du, M., Guo, X., et al.: Tandem affinity purification of protein complexes from mammalian cells. Biotechniques 33(2), 267–268 (2002)
[10] Dempsey, P.W., Doyle, S.E., He, J.Q., et al.: The signaling adaptors and pathways activated by TNF superfamily. Cytokine Growth Factor Rev. 14(3-4), 193–209 (2003)
[11] Dixit, V., Mak, T.W.: NF-kappaB signaling: many roads lead to Madrid. Cell 111(5), 615–619 (2002)
[12] Gilbert, D.: Biomolecular interaction network database. Brief Bioinform. 6(2), 194–198 (2005)
[13] Gregan, J., Riedel, C.G., Petronczki, M., et al.: Tandem affinity purification of functional TAP-tagged proteins from human cells. Nat. Protoc. 2(5), 1145–1151 (2007)
[14] Ihekwaba, A.E., Broomhead, D.S., Grimley, R.L., et al.: Sensitivity analysis of parameters controlling oscillatory signalling in the NF-kappaB pathway: the roles of IKK and IkappaBalpha. Syst. Biol. (Stevenage) 1(1), 93–103 (2004)
[15] Micheau, O., Tschopp, J.: Induction of TNF receptor I-mediated apoptosis via two sequential signaling complexes. Cell 114(2), 181–190 (2003)
[16] Min, J.K., Kim, Y.M., Kim, S.W., et al.: TNF-related activation-induced cytokine enhances leukocyte adhesiveness: induction of ICAM-1 and VCAM-1 via TNF receptor-associated factor and protein kinase C-dependent NF-kappaB activation in endothelial cells. J. Immunol. 175(1), 531–540 (2005)
[17] Nelson, D.E., Ihekwaba, A.E., Elliott, M., et al.: Oscillations in NF-kappaB signaling control the dynamics of gene expression. Science 306(5696), 704–708 (2004)
[18] Pagel, P., Kovac, S., Oesterheld, M., et al.: The MIPS mammalian protein-protein interaction database. Bioinformatics 21(6), 832–834 (2005)
[19] Phair, R.D.: Development of kinetic models in the nonlinear world of molecular cell biology. Metabolism 46(12), 1489–1495 (1997)
[20] Pomerantz, J.L., Baltimore, D.: Two pathways to NF-kappaB. Mol. Cell 10(4), 693–695 (2002)
[21] Rivkin, E., Cullinan, E.B., Tres, L.L., et al.: A protein associated with the manchette during rat spermiogenesis is encoded by a gene of the TBP-1-like subfamily with highly conserved ATPase and protease domains. Mol. Reprod. Dev. 48(1), 77–89 (1997)
[22] Salwinski, L., Miller, C.S., Smith, A.J., Pettit, F.K., Bowie, J.U., Eisenberg, D.: The Database of Interacting Proteins: 2004 update. Nucleic Acids Res. 32(Database issue), D449–D451 (2004)
[23] Schlosser, P.M.: Experimental design for parameter estimation through sensitivity analysis. J. Toxicol. Environ. Health 43(4), 495–530 (1994)
[24] Schoeberl, B., Gilles, E.D., Scheurich, P.: A mathematical vision of TNF receptor interaction. In: Proceedings of the International Congress of Systems Biology, pp. 158–167 (2001)
[25] Su, C.G., Lichtenstein, G.R.: Influence of immunogenicity on the long-term efficacy of infliximab in Crohn's disease. Gastroenterology 125(5), 1544–1546 (2003)
[26] Swaffield, J.C., Melcher, K., Johnston, S.A.: A highly conserved ATPase protein as a mediator between acidic activation domains and the TATA-binding protein. Nature 374(6517), 88–91 (1995)
[27] Vilimas, T., Mascarenhas, J., Palomero, T., et al.: Targeting the NF-kappaB signaling pathway in Notch1-induced T-cell leukemia. Nat. Med. 13(1), 70–77 (2007)
[28] Wang, C.Y., Mayo, M.W., Korneluk, R.G., et al.: NF-kappaB antiapoptosis: induction of TRAF1 and TRAF2 and c-IAP1 and c-IAP2 to suppress caspase-8 activation. Science 281(5383), 1680–1683 (1998)
[29] Wolkenhauer, O., Cho, K.H.: Analysis and modeling of signal transduction pathways in systems biology. Biochem. Soc. Trans. 31(Pt 6), 1503–1509 (2003)
[30] Wang, J., Cheung, L.W., Delabie, J.: New probabilistic graphical models for genetic regulatory networks studies. J. Biomed. Inform. 38(6), 443–455 (2005)
[31] Yao, J., Duan, L., Fan, M., et al.: NF-kappaB signaling pathway is involved in growth inhibition, G2/M arrest and apoptosis induced by Trichostatin A in human tongue carcinoma cells. Pharmacol. Res. 54(6), 406–413 (2006)
[32] You, L.: Toward computational systems biology. Cell Biochem. Biophys. 40(2), 167–184 (2004)
Hierarchical Organization of Functional Modules in Weighted Protein Interaction Networks Using Clustering Coefficient
Min Li1, Jianxin Wang1,⋆, Jianer Chen1,2, and Yi Pan1,3

1 School of Information Science and Engineering, Central South University, Changsha 410083, P. R. China
2 Department of Computer Science, Texas A&M University, College Station, TX 77843, USA
3 Department of Computer Science, Georgia State University, Atlanta, GA 30302-4110, USA
Abstract. With advances in the technologies for predicting protein interactions, huge data sets portrayed as networks have become available. Several graph clustering approaches have been proposed to detect functional modules from such networks. However, all methods for predicting protein interactions are known to yield a non-negligible amount of false positives, and most graph clustering algorithms are difficult to apply to networks with many false positives. We extend the protein interaction network from an unweighted graph to a weighted graph and propose an algorithm for hierarchical clustering in the weighted graph. The proposed algorithm HC-Wpin is applied to the protein interaction network of S.cerevisiae, and the identified modules are validated by GO annotations. Many significant functional modules are detected, most of which correspond to known complexes. Moreover, our algorithm HC-Wpin is faster and more accurate than previous algorithms. The program is available at http://bioinfo.csu.edu.cn/limin/HC-Wpin.

Keywords: Protein interaction network, clustering, functional module.
1 Introduction
High-throughput methods, such as yeast two-hybrid and mass spectrometry, have led to the emergence of large protein-protein interaction datasets [1–6]. These protein-protein interactions can be naturally represented in the form of networks and provide useful insights into functional associations between proteins [7, 8]. A wide range of graph clustering algorithms have been developed for identifying functional modules from such protein interaction networks.
This research was supported in part by the National Basic Research 973 Program of China No. 2008CB317107, the National Natural Science Foundation of China under Grant No. 60773111, the Program for New Century Excellent Talents in University No. NCET-05-0683, and the Program for Changjiang Scholars and Innovative Research Team in University No. IRT0661.
⋆ Corresponding author.
Recently, a hierarchical model of modular networks has been introduced, and several hierarchical clustering approaches have been applied to identify functional modules [9–12]. In general, hierarchical clustering approaches represent a protein interaction network as a hierarchy by means of a tree. According to how the tree is constructed, hierarchical clustering approaches can be classified into two groups: top-down approaches and bottom-up approaches. The top-down approaches start from one cluster containing all vertices and recursively divide it into several dissimilar sub-clusters. A typical example is the betweenness-centrality-based algorithm proposed by Girvan and Newman [9]. In contrast, the bottom-up approaches start from single-vertex clusters and iteratively merge similar clusters; the MoNet algorithm [12] is an example of a bottom-up approach. However, hierarchical clustering approaches are known to be sensitive to noisy data [8]. Up to now, no method of predicting protein-protein interactions can avoid yielding a non-negligible amount of noisy data (false positives) [13]. Thus, conventional hierarchical clustering approaches are difficult to apply directly to networks with false positives. A series of density-based clustering approaches have been proposed to identify densely connected regions in protein interaction networks; these are somewhat robust to noisy data. An extreme example is the maximum clique algorithm [14], which detects fully connected subnetworks. However, mining only fully connected subnetworks is too strict for real biological networks, and a variety of alternative density functions have therefore been proposed to detect dense subnetworks [15–19]. Still, the density-based approaches neglect many peripheral proteins that connect to the core protein clusters with few links, even though these peripheral proteins may represent true interactions that have been experimentally verified [12]. As Barabási and Oltvai have pointed out, density-based clustering approaches are not able to partition biological networks, whose degree distributions are typically power-law [20]. In addition, biologically meaningful functional modules that do not have highly connected topologies are ignored by these approaches [12]. In this paper, we extend the protein interaction network to a weighted graph and develop an algorithm, named HC-Wpin, for hierarchical clustering in the weighted graph. Algorithm HC-Wpin is applied to the weighted protein interaction network of S.cerevisiae, and the identified modules are validated by GO annotations (including Biological Process, Molecular Function, and Cellular Component). The experimental results show that the identified modules are statistically significant in terms of all three types of GO annotations. Compared to previous competing algorithms, our algorithm HC-Wpin is faster and more accurate, and it can be used on even larger protein interaction networks.
2 Methods
Edge Clustering Coefficient In Weighted Protein Interaction Network

A weighted protein interaction network can be represented as a weighted undirected graph G = (V, E), where V is a set of vertices and E is a set of edges
between the vertices. Each edge (u, v) is assigned a weight w(u, v), which represents the probability of this interaction being a true positive. The clustering coefficient was first proposed to describe a property of a vertex in a network: the clustering coefficient of a vertex is the ratio of the number of connections in the neighborhood of the vertex to the number of connections if the neighborhood were fully connected [21]. Roughly speaking, the clustering coefficient tells how well connected the neighborhood of the vertex is. Recently, Radicchi et al. [22] generalized the clustering coefficient of a vertex to an edge, defining it as the number of triangles to which a given edge belongs, divided by the number of triangles that might potentially include it. This definition is not feasible when the network has few triangles. In our previous studies, we redefined the clustering coefficient of an edge by counting common neighbors instead of triangles [23]. However, none of these definitions of the clustering coefficient in unweighted protein interaction networks takes edge reliability into account. Here, we redefine the clustering coefficient of an edge in the weighted graph. Let N_u be the set of neighbors of vertex u and N_v be the set of neighbors of vertex v. Then the clustering coefficient of an edge (u, v) in a weighted graph G is defined as:

CC_{u,v} = \frac{\sum_{k \in I_{u,v}} w(u,k) \cdot \sum_{k \in I_{u,v}} w(v,k)}{\sum_{s \in N_u} w(u,s) \cdot \sum_{t \in N_v} w(v,t)}   (1)

where I_{u,v} denotes the set of common vertices of N_u and N_v (i.e., I_{u,v} = N_u ∩ N_v). Two vertices of an edge with a larger clustering coefficient are more likely to lie in the same module.
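As a concrete reading of formula (1), the following minimal Python sketch computes the weighted edge clustering coefficient; the dict-of-dicts representation of the weighted network is an illustrative choice of ours, not part of the published implementation.

def edge_clustering_coefficient(w, u, v):
    # w is a dict of dicts: w[u][v] is the confidence weight of edge (u, v).
    common = set(w[u]) & set(w[v])        # I_{u,v} = N_u ∩ N_v
    numerator = (sum(w[u][k] for k in common) *
                 sum(w[v][k] for k in common))
    denominator = sum(w[u].values()) * sum(w[v].values())
    return numerator / denominator if denominator else 0.0

# Toy weighted network; the weights are made-up confidence scores.
w = {"a": {"b": 0.9, "c": 0.8},
     "b": {"a": 0.9, "c": 0.7, "d": 0.4},
     "c": {"a": 0.8, "b": 0.7},
     "d": {"b": 0.4}}
print(edge_clustering_coefficient(w, "a", "b"))   # shared neighbor: c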
Quantitative Definition of Modules In Weighted Protein Interaction Network

Generally, functional modules in protein interaction networks are loosely understood as highly connected subgraphs that have more internal edges than external edges. Several module definitions have been proposed, such as the strong module and the weak module [22]. For an unweighted protein interaction network with many false positives, these definitions may generate false predictions. To avoid the effect of false positive interactions in protein interaction networks, we define a functional module on the weighted graph. For a weighted graph G = (V, E), the degree of a vertex v, denoted d_w(v), is the sum of the weights of the edges connecting v:

d_w(v) = \sum_{u \in N_v; (u,v) \in E} w(u, v).   (2)
For a vertex v in a subgraph H ⊆ G, its in-degree, denoted d_w^{in}(H, v), is the sum of the weights of the edges connecting vertex v to other vertices belonging to H, and its out-degree, denoted d_w^{out}(H, v), is the sum of the weights of the edges connecting vertex v to vertices in the rest of the graph G. d_w^{in}(H, v) and d_w^{out}(H, v) are given by formula (3) and formula (4), respectively.
d_w^{in}(H, v) = \sum_{u \in H; (u,v) \in E} w(u, v).   (3)

d_w^{out}(H, v) = \sum_{u \notin H; (u,v) \in E} w(u, v).   (4)
It is clear that the degree d_w(v) of a vertex v equals the sum of d_w^{in}(H, v) and d_w^{out}(H, v).

Definition 1. Given a weighted undirected graph G = (V, E, W) and a threshold λ, a subgraph H ⊆ G is a λ-module if

\sum_{v \in H} d_w^{in}(H, v) > λ \sum_{v \in H} d_w^{out}(H, v)   (5)
where λ is a parameter determined by the user. By changing the value of the parameter λ, we can obtain different modules in the weighted protein interaction network.

Hierarchical Clustering

Based on the definitions of edge clustering coefficient and λ-module in weighted protein interaction networks, we propose a novel hierarchical clustering algorithm, HC-Wpin. The full description of algorithm HC-Wpin is shown in Figure 1. In Figure 1, C_i denotes a cluster, and V(C_i) and E(C_i) denote the sets of vertices and edges in the cluster C_i, respectively. Let D^{in}(C_i) be the sum of the in-degrees of the vertices in C_i, and D^{out}(C_i) be the sum of the out-degrees of the vertices in C_i. For a vertex v, L(v) indicates which cluster it is in. The input to algorithm HC-Wpin is a weighted undirected graph G(V, E). First, all vertices in the graph G are initialized as singleton clusters. Then, the clustering coefficient of each edge in the graph G is calculated. We enqueue all the edges into a queue S_q in non-increasing order of their clustering coefficients. The higher the clustering coefficient of an edge, the more likely its two vertices are inside the same module. By gradually adding the edges in the queue S_q to clusters, algorithm HC-Wpin finally assembles the singleton clusters into λ-modules. In the end, the λ-modules consisting of s or more proteins are output; the parameter s controls the minimum size of the output functional modules. Let n and m denote the number of vertices and edges in a weighted protein interaction network, respectively, and let k be the average number of neighbors of the vertices, i.e., k = (1/n) \sum_{v \in V} |N_v|. Then, the complexity of calculating all the edge clustering coefficients is O(k²m) and the complexity of the iterative merging is O(m). Thus, the total computational complexity of algorithm HC-Wpin is O(k²m). In general, k is very small and can be considered a constant.
Algorithm HC-Wpin
input: a weighted graph G = (V, E), parameters λ and s;
output: identified modules;
1. for each vertex v_i ∈ V do V(C_i) = {v_i}; E(C_i) = ∅;
2. for each edge (u, v) ∈ E do compute its clustering coefficient;
3. sort all edges into queue S_q in non-increasing order of clustering coefficient;
4. while S_q ≠ ∅ do {
     e(u, v) ← S_q;
     if L(u) = L(v) then
       i = L(u); E(C_i) = E(C_i) ∪ {e(u, v)};
     else
       i = L(u); j = L(v);
       if D^{in}(C_i)/D^{out}(C_i) ≤ λ or D^{in}(C_j)/D^{out}(C_j) ≤ λ then
         V(C_i) = V(C_i) ∪ V(C_j); E(C_i) = E(C_i) ∪ E(C_j); C_j = {∅, ∅};
     S_q = S_q − {e(u, v)}; }
5. for i = 1 to |V| do if |V(C_i)| ≥ s then output C_i;
Fig. 1. The description of algorithm HC-Wpin
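The pseudocode of Figure 1 can be rendered compactly in Python; the sketch below reuses the edge_clustering_coefficient function sketched earlier and favors readability over the O(k²m) bookkeeping of the actual implementation (cluster in- and out-degrees are recomputed on demand rather than maintained incrementally, and vertex ids are assumed orderable).

def hc_wpin(w, lam=1.0, s=3):
    # Start from singleton clusters; merge along edges in non-increasing
    # order of edge clustering coefficient while at least one of the two
    # touched clusters is not yet a λ-module (fails D_in/D_out > λ).
    label = {v: v for v in w}                 # L(v): cluster label of vertex v
    members = {v: {v} for v in w}             # label -> set of member vertices

    def not_yet_module(c):
        d_in = d_out = 0.0
        for x in members[c]:
            for y, wt in w[x].items():
                if label[y] == c:
                    d_in += wt                # counted once per endpoint, as in (5)
                else:
                    d_out += wt
        return d_out > 0 and d_in / d_out <= lam

    edges = {(u, v) for u in w for v in w[u] if u < v}
    for u, v in sorted(edges, reverse=True,
                       key=lambda e: edge_clustering_coefficient(w, *e)):
        i, j = label[u], label[v]
        if i != j and (not_yet_module(i) or not_yet_module(j)):
            for x in members[j]:              # merge C_j into C_i
                label[x] = i
            members[i] |= members.pop(j)
    return [m for m in members.values() if len(m) >= s]

Running hc_wpin over a range of λ values (e.g., from 1.0 to 3.0 in steps of 0.5, as in the experiments below) exposes the hierarchical organization of the modules: a larger λ lets merging continue longer and yields larger modules.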
3 Experiments and Results
Identification of λ-Modules In The Network of S.cerevisiae

The original unweighted protein interaction network of S.cerevisiae, consisting of 4,726 proteins and 15,166 interactions, was downloaded from the DIP database [24]. To construct the weighted protein interaction network, we assigned confidence scores to these interactions using the logistic-regression-based scheme employed in [25, 26]. Roughly speaking, the confidence score is computed from the experimental evidence, which includes the type of experiments in which the interaction is observed and the number of observations in each experimental type. We applied algorithm HC-Wpin to the weighted protein interaction network and obtained five output sets of modules by changing the value of parameter λ from 1.0 to 3.0 in increments of 0.5. Table 1 illustrates the effect of parameter λ on clustering. In Table 1, Max.size represents the size of the largest module, and Avg.size represents the average size of all the identified modules.

Table 1. The effect of λ on clustering

Parameter   Modules   Max.size   Avg.size
λ = 1.0     145       79         9.24
λ = 1.5     132       125        10.54
λ = 2.0     117       263        12.25
λ = 2.5     91        982        16.89
λ = 3.0     77        1192       20.82
Fig. 2. An example of hierarchical modules generated by algorithm HC-Wpin with different values of parameter λ. All the identified modules listed in this figure are available from Additional file 1.
As shown in Table 1, the number of identified modules decreases with increasing λ, while the average size of the identified modules and the size of the biggest module increase. Bigger modules are generated by HC-Wpin when a larger value of λ is used, because a larger λ allows more merging of clusters during the agglomerative process. Figure 2 illustrates how small modules are iteratively merged as λ increases. As shown in Figure 2, the modules #19, #41, #54, #58 and #76 generated by HC-Wpin with λ=1.0 are merged into a larger module, #65, generated with λ=1.5. When λ=2.0, module #65 and the other nine modules generated with λ=1.5 are merged into a larger module, #55, which consists of 255 proteins. When λ=2.5, all 24 modules generated with λ=2.0 are merged into a still larger module consisting of 982 proteins. By changing the values of parameter λ, we can obtain the hierarchical organization of functional modules in the protein interaction network. In the following subsection we evaluate the significance of the identified modules by using the SGD GO Term Finder (http://www.yeastgenome.org/). The evaluation results show that the hierarchical organization of modules corresponds approximately to the hierarchical structure of GO annotations.

Statistical Assessment of the Identified Modules

To test whether the identified modules are significant, we validate them using all three types of GO terms: Biological Process, Molecular Function, and Cellular
Component. For each identified module, the P-value from the hypergeometric distribution is calculated based on the three types of GO annotations. A cutoff parameter is used to differentiate significant groups from insignificant ones: if an identified module is associated with a P-value larger than the cutoff, it is considered insignificant. We use the recommended cutoff of 0.05 for all our validations.

First, we evaluate the significance of the hierarchical modules shown in Figure 2. As pointed out by Gavin et al. and Krogan et al., larger complexes are composed, sometimes transiently, of smaller subcomplexes [3, 6]. We obtain the similar result that smaller functional modules are hierarchically organized into larger functional modules, which corresponds approximately to the hierarchical structure of GO annotations. For example, the five modules #38, #64, #65, #69 and #72 generated with λ=1.5 are merged into a larger module, #55, generated with λ=2.0. Correspondingly, the Biological Process annotation of module #55 is the common ancestor of the annotations of the five smaller modules #38, #64, #65, #69 and #72 in the hierarchical structure of GO annotations, as shown in Additional file 2. A similar correspondence between the hierarchical modules and the hierarchical structure of GO annotations is obtained both for Molecular Function and for Cellular Component annotations.

Next, we evaluate all the identified modules generated by HC-Wpin. Taking the modules identified with λ=1.0 as an example, 130 out of 145 identified modules are validated to be significant with Biological Process annotations; their lowest P-values range from 4.93E-02 to 5.82E-73. For Molecular Function annotations, 115 identified modules are validated to be significant, with lowest P-values ranging from 4.70E-02 to 2.50E-52. The module with the lowest P-value of 2.50E-52 is composed of 33 members; of these 33 proteins, more than 75% have the function "RNA polymerase activity". For Cellular Component annotations, the lowest P-value of all the identified modules is 1.51E-68. The module with this lowest P-value is composed of 45 proteins, of which 37 belong to the known complex "small nuclear ribonucleoprotein complex".

A series of small identified modules match known complexes in the network of S.cerevisiae perfectly. For example, module #10, consisting of 3 proteins (EFB1, TEF4, TEF1), is exactly the "eukaryotic translation elongation factor 1 complex"; module #136, consisting of 3 proteins (POL3, POL31, POL32), is exactly the "delta DNA polymerase complex"; module #33, consisting of 4 proteins (APL3, APL1, APS2, APM4), is exactly the "AP-2 adaptor complex"; module #90, consisting of 4 proteins (SEC66, SEC72, SEC63, SEC62), is exactly the "endoplasmic reticulum Sec complex"; and module #131, consisting of 5 proteins (VPS29, PEP8, VPS35, VPS5, VPS17), is exactly the "retromer complex".

Some identified modules have similar P-values when validated against different types of GO annotations. For example, the 4 proteins (SEN34, SEN2, SEN15, SEN54) of module #134 are exactly the members of the network of S.cerevisiae that share the molecular function "endoribonuclease activity, producing 3'-phosphomonoesters", participate in the same biological process "RNA splicing, via endonucleolytic cleavage and ligation", and belong to the same cellular component "tRNA-intron endonuclease complex".
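For reference, the hypergeometric enrichment test described above can be sketched in a few lines with SciPy. The network-wide annotated-protein count K below is a made-up illustrative number; N, n, and k echo the small nuclear ribonucleoprotein example.

from scipy.stats import hypergeom

def enrichment_pvalue(N, K, n, k):
    # P(X >= k): at least k term-annotated proteins in a module of size n,
    # drawn from a network of N proteins of which K carry the annotation.
    return hypergeom.sf(k - 1, N, K, n)

# 45-protein module, 37 annotated; K=90 annotated proteins network-wide
# is assumed for illustration only.
print(enrichment_pvalue(N=4726, K=90, n=45, k=37))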
There are also a number of identified modules whose lowest P-values differ greatly across the types of GO terms used for validation. For example, module #60 is composed of 4 proteins (APM3, APL6, APS3, APL5). For Biological Process annotations, it has a lowest P-value of 4.55E-09: the 4 proteins, together with 18 other proteins, participate in the process "Golgi to vacuole transport". For Molecular Function annotations, 3 of the 4 proteins (APM3, APL6, APS3) are directly annotated to the root term "molecular function unknown", so no P-value result is obtained. However, for Cellular Component annotations it is exactly the known complex "AP-3 adaptor complex", with a lowest P-value of 6.70E-13. The detailed annotations of one type of GO term may thus give clues for studying another type. The above analyses show that our algorithm HC-Wpin can not only identify significant functional modules but also detect significant functional modules in hierarchy.

Comparison with Other Methods

To evaluate the effectiveness of our algorithm HC-Wpin, we compare it with several previous state-of-the-art algorithms: the MoNet algorithm [12] and the FAG-EC algorithm [23] as hierarchical clustering approaches, the MCODE algorithm and the DPClus algorithm as density-based methods, and the STM algorithm [27]. The values of the parameters in each algorithm are selected from those recommended by their authors. The accuracy of each algorithm is calculated and shown in Table 2. The accuracy of an algorithm is the average f-measure of the significant modules generated by it. The f-measure of an identified module is defined as the harmonic mean of its recall and precision:

f\text{-}measure = \frac{2 \cdot recall \cdot precision}{recall + precision}   (6)

recall = \frac{|M \cap F_i|}{|F_i|}   (7)

precision = \frac{|M \cap F_i|}{|M|}   (8)
where F_i is the functional category mapped to module M. The proteins in functional category F_i are considered true predictions, the proteins in module M are considered positive predictions, and the common proteins of F_i and M are considered true-positive predictions. Recall is the fraction of the true-positive predictions out of all the true predictions, and precision is the fraction of the true-positive predictions out of all the positive predictions [8].

As shown in Table 2, the accuracy of our algorithm HC-Wpin is much higher than that of the other four algorithms FAG-EC, MCODE, DPClus, and STM for all the validations with Biological Process (abbreviated B.P.), Molecular Function (abbreviated M.F.), and Cellular Component (abbreviated C.C.).
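Formulas (6)-(8) translate directly into code; a small sketch using Python sets:

def f_measure(module, category):
    # module: set of proteins in M; category: set of proteins in F_i.
    tp = len(module & category)          # true-positive predictions
    if tp == 0:
        return 0.0
    recall = tp / len(category)
    precision = tp / len(module)
    return 2 * recall * precision / (recall + precision)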
Table 2. Comparison of the accuracy of algorithm HC-Wpin and other previous algorithms

Algorithms   Modules (size ≥ 3)   Average Size   Maximum Size   Accuracy (B.P. / M.F. / C.C.)   Parameters
HC-Wpin      145                  9.24           79             0.34 / 0.29 / 0.50              λ = 1.0
MoNet        4                    837.50         3336           – / – / –                       S = 1
FAG-EC       326                  8.52           237            0.27 / 0.20 / 0.40              λ = 1.0
MCODE        59                   74.78          555            0.29 / 0.24 / 0.39              fluff = 0.1; VWP = 0.2
DPClus       236                  4.02           13             0.26 / 0.19 / 0.35              CPin = 0.5; Din = 0.9
STM          10                   467.80         4647           0.21 / 0.10 / 0.01              Merge = 1.0
[Figure 3: three panels, (a) Biological Process, (b) Molecular Function, (c) Cellular Component, each plotting log(P-value) for the significant modules identified by HC-Wpin, MoNet, FAG-EC, MCODE, DPClus, and STM.]
Fig. 3. Comparison of P-value distribution of significant modules generated by HC-Wpin and other algorithms. The x axis represents the number of significant modules and the y axis represents the log(P-value) for each corresponding module.
modules. Only one module is validated to be significant for all the three type of GO annotations. Thus, we does not list the accuracy of MoNet. The false clustering of MoNet is mostly caused by the miscalculation of betweenness. The false positive interactions can yield the incorrect shortest paths in a network and the incorrect shortest paths cause miscalculation of betweenness. Figure 3 (a), (b), and (c) illustrate the P-value distributions of the significant modules generated by all these algorithms. Since the network has a lower probability to produce the module by chance, the module is more significant with lower P-value. As observed from Figure 3, the 30 best significant modules identified by our algorithm HC-Wpin are consistently more significant than those generated by other algorithms. The comparison results in Table 2 and Figure 3 show that our algorithm HC-Wpin outperforms the other previous algorithms. Efficiency Analysis All experiments in this paper are implemented on a Linux server with 4 Intel Xeon 3.6GHz CPU and 4GByte RAM. Table 3 illustrates a comparison of the running time of our algorithm HC-Wpin and the other five algorithms for identifying functional modules. The network of S.cerevisiae (4726 proteins and 15166 interactions) and the other four networks with different confidence level obtained from [28] are used as the test data. We call the network of S.cerevisiae
Table 3. Comparison of the running time of algorithm HC-Wpin and other algorithms
Algorithms   Y2k (988, 2455)   Y11k (2401, 11000)   Y45k (4687, 45000)   Y78k (5321, 78390)   NSc (4726, 15166)
HC-Wpin      0.1               0.6                  5.6                  9.8                  0.7
FAG-EC       0.1               0.6                  5.6                  9.8                  0.7
MCODE        0.2               7.2                  1037.2               4129.8               480.2
DPClus       0.3               32.6                 1116.4               5638.6               1194.7
MoNet        2.0               88.2                 6593.8               9516.2               2852.4
STM          1.0               7944.3               62360.2              140174.8             27073.2

(Running time in seconds; each network is labeled with its number of proteins and interactions.)
We call the network of S. cerevisiae NSc and name the other four networks Y2k, Y11k, Y45k, and Y78k, respectively, according to the number of edges they contain. From Table 3, we can see that the running times of MoNet, STM, MCODE, and DPClus increase sharply with the size of the network. It takes more than 4,000 seconds for MCODE and DPClus, about 10,000 seconds for MoNet, and more than 100,000 seconds for STM to identify modules from the network consisting of 5,321 proteins and 78,390 interactions. In contrast, the running time of HC-Wpin and FAG-EC on the same network remains small, less than 10 seconds. As one can see, algorithm HC-Wpin is extremely fast, hundreds of times faster than MCODE, DPClus, MoNet, and STM. As protein-protein interaction data continue to accumulate, algorithm HC-Wpin can be applied to even larger protein interaction networks.
4 Conclusions
In previous studies, protein interaction networks have generally been represented as unweighted graphs. As is well known, protein interaction networks inevitably contain false positives. Thus, by clustering directly on unweighted graphs with many false positives, most previous graph clustering approaches may generate a number of false predictions. In this paper, we extend the protein interaction network from an unweighted graph to a weighted graph and develop a fast hierarchical clustering algorithm, HC-Wpin, to identify functional modules from the weighted graph. The reliability of interactions is measured by the logistic-regression-based scheme of [25, 26]. By changing the value of parameter λ, we can identify the functional modules in a hierarchy. We use all three types of GO Terms to validate the identified modules of S. cerevisiae. Many significant functional modules are detected, most of which correspond closely to known complexes. For most cases, a value of λ between 1.0 and 3.0 is recommended: select a small value of λ to obtain small modules, and a relatively large value of λ to obtain modules consisting of more proteins. We also compared the performance of our algorithm HC-Wpin with that of five other algorithms: MoNet, FAG-EC, MCODE, DPClus, and STM. Unexpectedly,
giant modules are generated by MoNet and STM, a consequence of their sensitivity to false positives. Although MCODE and DPClus are somewhat robust to false positives, they are not adept at identifying hierarchically distributed functional modules. The quantitative comparison of accuracy reveals that algorithm HC-Wpin outperforms the other five algorithms. Another strength of our algorithm HC-Wpin is its efficiency: it is very fast and can be applied to even larger protein interaction networks of other higher-level organisms.

Acknowledgments. The authors would like to thank F. Luo and his colleagues for sharing their program MoNet, and W. Hwang and his colleagues for sharing the source code of STM. The authors are also thankful to M. Altaf-Ul-Amin and his colleagues for sharing the tool DPClus, and to G.D. Bader and C.W. Hogue for making MCODE publicly available. The authors also thank T. Shlomi for providing and discussing the data.
Additional Files

Additional file 1 — Example for hierarchical organization of functional modules. This file contains the hierarchical modules shown in Figure 2 and is available at http://bioinfo.csu.edu.cn/limin/HC-Wpin/Additional file 1.txt.

Additional file 2 — Supplemental Figure 1. This file contains a supplemental Figure 1, which shows the reduction of the hierarchical structure of biological process annotations for the hierarchical modules shown in Figure 2. This file is available at http://bioinfo.csu.edu.cn/limin/HC-Wpin/Additional file 2.pdf.
References
1. Uetz, P., et al.: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403, 623–627 (2000)
2. Gavin, A.C., et al.: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415(6868), 141–147 (2002)
3. Gavin, A.C., et al.: Proteome survey reveals modularity of the yeast cell machinery. Nature 440(7084), 631–636 (2006)
4. Ho, Y., et al.: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415(6868), 180–183 (2002)
5. Krogan, N.J., et al.: High-definition macromolecular composition of yeast RNA-processing complexes. Molecular Cell 13, 225–239 (2004)
6. Krogan, N.J., et al.: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440(7084), 637–643 (2006)
7. Hartwell, L.H., Hopfield, J.J., Leibler, S., Murray, A.W.: From molecular to modular cell biology. Nature 402, c47–c52 (1999)
8. Cho, Y.R., Hwang, W., Ramanathan, M., Zhang, A.D.: Semantic integration to identify overlapping functional modules in protein interaction networks. BMC Bioinformatics 8, 265 (2007)
9. Girvan, M., Newman, M.E.: Community structure in social and biological networks. Proc. Natl. Acad. Sci. 99, 7821–7826 (2002)
10. Rives, A.W., Galitski, T.: Modular organization of cellular networks. Proc. Natl. Acad. Sci. 100, 1128–1133 (2003)
11. Ravasz, E., et al.: Hierarchical organization of modularity in metabolic networks. Science 297, 1551–1555 (2002)
12. Luo, F., Yang, Y.F., Chen, C.F., Chang, R., Zhou, J.Z.: Modular organization of protein interaction networks. Bioinformatics 23(2), 207–214 (2007)
13. Brohee, S., van Helden, J.: Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics 7, 488 (2006)
14. Spirin, V., Mirny, L.A.: Protein complexes and functional modules in molecular networks. Proc. Natl. Acad. Sci., 12123–12128 (2003)
15. Bu, D., et al.: Topological structure analysis of the protein-protein interaction networks in budding yeast. Nucleic Acids Research 31(9), 2443–2450 (2003)
16. Brun, C., et al.: Clustering proteins from interaction networks for the prediction of cellular functions. BMC Bioinformatics 7, 488 (2004)
17. Bader, G.D., Hogue, C.W.: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4, 2 (2003)
18. Altaf-Ul-Amin, M., Shinbo, Y., Mihara, K., Kurokawa, K., Kanaya, S.: Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinformatics 7, 207 (2006)
19. Palla, G., Derenyi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community structure of complex networks in nature and society. Nature 435, 814–818 (2005)
20. Barabási, A.L., Oltvai, Z.N.: Network biology: understanding the cell's functional organization. Nature Reviews: Genetics 5, 101–114 (2004)
21. Friedel, C., Zimmer, R.: Inferring topology from clustering coefficients in protein-protein interaction networks. BMC Bioinformatics 7, 519 (2006)
22. Radicchi, F., et al.: Defining and identifying communities in networks. Proc. Natl. Acad. Sci. 101, 2658–2663 (2004)
23. Li, M., Wang, J.X., Chen, J.E.: A fast agglomerate algorithm for mining functional modules in protein interaction networks. In: Peng, Y., Zhang, Y. (eds.) Proceedings of the First International Conference on BioMedical Engineering and Informatics, Hainan, China, May 27-30, pp. 3–7 (2008)
24. Xenarios, I., et al.: DIP: the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 30, 303–305 (2002)
25. Sharan, R., et al.: Conserved patterns of protein interaction in multiple species. Proc. Natl. Acad. Sci. 102(6), 1974–1979 (2005)
26. Shlomi, T., Segal, D., Ruppin, E., Sharan, R.: QPath: a method for querying pathways in a protein-protein interaction network. BMC Bioinformatics 7, 199 (2006)
27. Hwang, W., Cho, Y.R., Zhang, A., Ramanathan, M.: A novel functional module detection algorithm for protein-protein interaction networks. Algorithms for Molecular Biology 12, 1–24 (2006)
28. von Mering, C., et al.: Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417(6887), 399–403 (2002)
Bioinformatics Challenges in Translational Research (Invited Keynote Talk)

Nicholas F. Tsinoremas
University of Miami Center for Computational Science
University of Miami, Miami, FL 33143
[email protected]
The basic goal of translational research, from a bioinformatics perspective, is to relate data acquired from basic research to an outcome in a patient. The relation of data may follow the drug discovery process, where biochemical and protein structure data are linked all the way through to data collected during the clinical trial process and onward. The relation of data can also be associated with better patient care through techniques like data mining, where data from the research environment are combined with data from the clinical environment and used for hypothesis generation and testing, and potential outcomes are implemented. Translational research also requires collaboration between various clinical investigators, physicians, scientists, and teams, creating a need for secure data sharing. Inherent to the nature of translational research is the integration of data from multiple systems. Data used for research reside in EMRs, LIMSs, CTMSs, and other source systems. For example, at the University of Miami, the Miller School and its affiliated institutes (e.g., Jackson Hospital) have established a number of information systems to support various operational needs. These current systems include Velos (for clinical trials management), Cerner (the Jackson EMR), MetaDatach (UMH), and IDX. The Miller School is also in the process of implementing an EPIC EMR system. We are developing a system to address the needs and challenges outlined above. It is an integration infrastructure to support translational research, but it may also be applied to other data sharing and integration needs throughout UM. The system is currently referred to as UTRIX (UM Translational Research Information eXchange). More specifically, UTRIX features a utility data storage environment (FUSE), a service-oriented architecture (SOA), an organization currently referred to as the "honest broker" (HB) to control access to data, and standard tools and educational programs to support data analysis. FUSE (Flexible Utility Storage Environment) is intended to meet the data storage needs described above. The SOA and HB together address the challenges posed by data access and authorization. FUSE, the SOA, and the HB together provide a context in which to make available tools and educational programs that enable the analysis and advanced mining of research data.
Untangling Tanglegrams: Comparing Trees by Their Drawings

Balaji Venkatachalam¹, Jim Apple¹, Katherine St. John², and Dan Gusfield¹

¹ Department of Computer Science, UC Davis
{balaji,apple,gusfield}@cs.ucdavis.edu
² Department of Mathematics and Computer Science, Lehman College, and the Graduate Center, City University of New York
[email protected]
Abstract. A tanglegram is a pair of trees on the same set of leaves with matching leaves in the two trees joined by an edge. Tanglegrams are widely used in biology – to compare evolutionary histories of host and parasite species and to analyze genes of species in the same geographical area. We consider optimization problems in tanglegram drawings. We show a linear time algorithm to decide if a tanglegram admits a planar embedding by a reduction to the planar graph drawing problem. This problem was considered by Fernau, Kaufmann, and Poths (FSTTCS 2005). Our reduction method provides a simpler proof and helps to solve a conjecture they posed, showing a fixed-parameter tractable algorithm for minimizing the number of crossings over all d-ary trees. For the case where one tree is fixed, we show an O(n log n) algorithm to determine the drawing of the second tree that minimizes the number of crossings. This improves the bound from earlier methods. We introduce a new optimization criterion using Spearman's footrule optimization and give an O(n²) algorithm. We also show integer programming formulations to quickly obtain tanglegram drawings that minimize the two optimization measures discussed. We prove lower bounds on the maximum gap between the optimal solution and the heuristic of Dwyer and Schreiber (Austral. Symp. on Info. Vis. 2004) to minimize crossings.
1 Introduction

Determining the evolutionary history, or the phylogeny, of a set of species is an important problem in biology. Often represented as trees, phylogenies are used for determining ancestral species, designing vaccines, and drug discovery [27]. The popular criteria for reconstructing an optimal tree – maximum parsimony and maximum likelihood – are NP-hard [13, 24], so heuristic methods (e.g., [17, 26]) are used that can yield many possible trees. Comparing these trees, as well as those generated on multiple genes, or for co-evolving species, is a necessary task for data analysis [22]. A visual way to compare two trees is via a tanglegram, which shows the spatial relationship among the leaves. Roughly, a tanglegram consists of two trees with additional edges linking pairs of corresponding leaves (see Fig. 1 and Sect. 2).
This research was partially supported by NSF grants SEI-BIO 0513910, SEI-SBE 0513660, CCF-0515378, and IIS-0803564.
Fig. 1. A tanglegram from Charleston and Perkins [6]: phylogenetic trees for lizards in the Caribbean tropics and strains of malaria found there ([6], p. 86), joined by dashed lines that represent the parasite-host relationship. The crossing number is 7, and the footrule distance is 10. This is not optimal; an alternative layout which interchanges the children of nodes c and d improves these to 4 and 6, respectively. The optimal drawings have crossing number 1 and distance 2, respectively.
Tanglegrams are widely used in biology, including to compare evolutionary histories of host and parasite species and to analyze genes of species in the same geographical area [10, 29]. The number of edge crossings in a tanglegram serves as a good measure of the extent of horizontal gene transfer, which has been inferred by viewing single layouts of tanglegrams [5, pp. 204-206]. Drawings with fewer crossings or with matching leaves close together are more useful in biological analysis. We focus on two natural measures of complexity that are used for comparing permutations: the crossing number (or Kendall tau) and Spearman's footrule distance [7]. The former measures the number of times edges between the leaves cross; the latter, the sum of the distances between leaf pairs. Both are widely used, including in ranking search results on the web and in voting systems [11, 8]. We focus on the complexity of these ranking problems and give efficient algorithms for drawing tanglegrams.

Crossing minimization in tanglegrams has parallels to crossing minimization in graphs [14, 19]. Computing the minimum number of crossings in a graph is NP-complete [14]. However, it can be verified in linear time that a graph has a planar drawing (with zero crossings) [16, 25]. Analogously, crossing minimization in tanglegrams is NP-complete, while the special case of planarity can be decided in linear time. Fernau et al. [12] showed this by a reduction to the upward flow problem [2]. Independently, Lozano et al. [21] gave a simple dynamic-programming-based solution that produces a planar drawing in O(n²) time. In recent work, Buchin et al. [4] showed approximation results and a fixed-parameter tractable algorithm for complete tanglegrams (where every leaf has the same depth). We do not use this restriction, and the results in this paper hold for arbitrary trees. Nöllenburg et al. show some experimental results [4] and discuss an integer quadratic program for the crossing problem [23]. Bansal et al. [1] define a generalized tanglegram that allows multiple edges between leaves in the two trees.
Our results: The case where only one tree is mutable is called the one-tree crossing minimization (OTCM) problem and has been studied for balanced trees in [9]. For arbitrary trees, Fernau et al. [12] showed an O(n log² n) solution, while Bansal et al. [1] show an O(n log² n / log log n) solution. We provide an algorithm that improves the time bound to O(n log n) (Sect. 3.1). Previous work on tanglegrams is limited to crossing minimization. We borrow Spearman's footrule distance function to use as an optimization criterion here, and we show an O(n²) solution for the one-tree fixed case (Sect. 3.2). We provide a method that has a simple intuition and allows us to use well-studied solutions of graph drawing problems. Further, it leads to a simple fixed-parameter tractable (FPT) algorithm. We show a linear time algorithm for planarity testing by a reduction to the planar graph drawing problem (Sect. 4.1). We can also use the fixed-parameter algorithm for minimizing crossing numbers in graphs [19] to improve the running time of the FPT algorithm of Fernau et al. [12] for crossing minimization in binary trees and answer their conjecture for d-ary trees for d > 2 (Sect. 4.2). For the praxis of tanglegram drawing, we show integer programming formulations to obtain tanglegram drawings that minimize the two optimization measures discussed (Sect. 5). We also show a lower bound on the worst-case behavior of the heuristic of [9].
2 Preliminaries

We define tanglegrams and their drawings following [10, 12]. Let L(T) denote the leaves of a tree T. A linear order < on L(T) is called suitable if T can be embedded into the plane such that L(T) is mapped onto a straight line in the order given by <. A tanglegram (T1, T2; M) is given by a pair of rooted binary trees (T1, T2) with a perfect matching M ⊆ L(T1) × L(T2). In this paper we consider trees with n leaves labeled [n] = {1, ..., n}, with M matching leaves with identical labels. A drawing of (T1, T2; M) is given by two suitable linear orders <1 and <2 on L(T1) and L(T2), respectively. We call a drawing proper if it is realized by planar embeddings of T1 and T2 such that:

1. L(T1) and L(T2) lie on two parallel lines L1 and L2;
2. all nodes of Ti lie within the half-plane bounded by L3−i not containing Li;
3. every node is farther from the line than its children.

Let cr(T1, T2, M, <1, <2) denote the number of crossings in the drawing of (T1, T2; M) given by linear orders <1 and <2. Note that by the definition only matching edges may cross and that the number of crossings is independent of the chosen realization. It is easy to see that a pair of edges cross at most once. We consider two optimization criteria for drawing a tanglegram. The first is minimizing the number of crossings in the drawing; that is, for a given tanglegram (T1, T2; M), we want

min_{<1, <2} cr(T1, T2, M, <1, <2).
Since the number of crossings can be changed by flipping the children of an internal node, the problem is to determine the order of the children at each internal node that minimizes the number of crossings.
The second criterion is based on the distance between the leaves in the orderings. Given a drawing (T1, T2, M, <1, <2), let πi be the permutation on the leaves induced by <i; the footrule distance of the drawing is D(T1, T2, M, <1, <2) = Σ_{l∈[n]} |π1(l) − π2(l)|.
Again, the optimization problem is to obtain the drawing that minimizes the distance. Let d be a distance measure on tanglegram drawings. We define d(T1, T2, M, <1, ·) to be the minimal value of d(T1, T2, M, <1, <2) over all suitable linear orders <2 on L(T2). Similarly, d(T1, T2, M, ·, ·) is defined to be the minimal value of d(T1, T2, M, <1, <2) over all suitable linear orders <1 and <2 on L(T1) and L(T2), respectively. We define the following two natural problems for crossings in tanglegrams:

One-Tree Crossing Minimization (OTCM)
INSTANCE: A tanglegram (T1, T2; M) with a suitable linear order <1 on L(T1).
RESULT: A <2 with cr(T1, T2, M, <1, <2) minimal.

Two-Tree Crossing Minimization (TTCM)
INSTANCE: A tanglegram (T1, T2; M) and a parameter k.
QUESTION: Is cr(T1, T2, M, ·, ·) ≤ k?

One- and two-tree footrule distance minimization problems are defined analogously.
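For concreteness, a small Python sketch (ours, not the paper's) that evaluates both measures directly from two leaf-position maps, using the quadratic definitions:

```python
from itertools import combinations

def crossings(pos1, pos2):
    """Crossing number: leaf pairs whose relative order differs
    between the two suitable linear orders."""
    return sum(1 for a, b in combinations(pos1, 2)
               if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0)

def footrule(pos1, pos2):
    """Spearman's footrule: sum of each leaf's position displacement."""
    return sum(abs(pos1[l] - pos2[l]) for l in pos1)

# pos1/pos2 map each leaf label to its position on L1/L2, e.g.:
pos1 = {'a': 0, 'b': 1, 'c': 2, 'd': 3}
pos2 = {'a': 1, 'b': 0, 'c': 3, 'd': 2}
print(crossings(pos1, pos2), footrule(pos1, pos2))  # 2 4
```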
3 One-Tree Optimization Problems

For one-tree minimization problems, we assume, w.l.o.g., that all tree labels are in [n], that M is the identity matching, and that <1 is simply <[n].

3.1 One-Tree Crossing Minimization

We give an algorithm for one-tree crossing minimization with running time O(n log n). As in [9, 12], we exploit the optimal substructure property of the problem and recursively work on the subtrees. Our results are due to a novel use of efficient data structures to maintain lists of the subtrees' leaves. To calculate the optimal layout at any internal node v, we analyze the child subtrees to calculate which of the two available layouts is better. This is sufficient since:

Lemma 1. Let <2 be an optimal suitable linear order on L(T2). Then for every subtree S of T2, <2 is an optimal suitable linear order for L(S).

Proof. Assume not. Then there is some subtree S of T2 for which another suitable linear order on L(S) incurs fewer crossings than the order induced by <2. Since the leaves of S occupy consecutive positions in any suitable order, substituting this better order for the induced one does not affect crossings involving leaves outside S, and so yields a suitable linear order on L(T2) with fewer crossings than <2, contradicting the optimality of <2.
Theorem 2. OTCM can be solved in O(n log n) time.

Proof. Any suitable order on L(T2) can be constructed by choosing, for each non-leaf node in T2, one of the two possible orders of its children. At each node, we choose an ordering recursively, starting from the nodes closest to the line L2. For each internal node, we not only decide the optimal order for its children, we also construct a 2-3 finger tree, an ordered search tree with fast split and append operations [18]. The finger trees at siblings will be used to decide the ordering for their shared parent. The base case for our induction is simply the leaves. These require no layout decision and can be made into a singleton finger tree of size 1 in constant time [18]. At every internal node v we construct a finger tree holding the leaf labels of its descendants, ordered by <1. Since v is farther from L2 than either of its children, induction allows us to assume each child already has a finger tree associated with it. The method for constructing a finger tree and the layout choice at v is shown in Algorithm 1.

Algorithm 1. Container merging for the minimal-crossing single-tree problem. The inputs p, q, and the output result are finger trees sorted according to <1.

1:  count ← 0
2:  result ← ⟨⟩
3:  while |q| > 0 do
4:      (qh, q) ← head/tail(q)    // pops the first element of q
5:      (r, p) ← split(p, qh)     // splits p, removes elements less than qh into r
6:      result ← result ++ r      // appends elements smaller than qh
7:      result ← result ++ qh
8:      count ← count + |p|       // the number of crossings for qh
9:  end while
10: result ← result ++ p
11: return (count, result)
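As a concrete illustration of Algorithm 1's logic, a Python sketch (ours) using plain sorted lists in place of 2-3 finger trees — it returns the same counts but without the logarithmic split/append bounds:

```python
import bisect

def merge_count(p, q):
    """Merge two lists of leaf labels, each sorted by <1, counting the
    crossings incurred for the layout with node(p)'s leaves first."""
    count, result = 0, []
    for qh in q:
        cut = bisect.bisect_left(p, qh)  # elements of p smaller than qh
        result.extend(p[:cut])           # append r (line 6)
        result.append(qh)                # line 7
        p = p[cut:]
        count += len(p)                  # line 8: remaining p elements cross qh
    result.extend(p)                     # line 10
    return count, result

# Trying both argument orders reveals which child layout is better:
# cost_lr, merged = merge_count(left_leaves, right_leaves)
# cost_rl, _      = merge_count(right_leaves, left_leaves)
```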
The algorithm takes as input two finger trees p and q corresponding to the two child nodes (node(p) and node(q)). The trees are merged according to the usual merge procedure on finger trees, while maintaining a count of the number of crossings incurred. Switching the order of the arguments in the algorithm reveals the number of crossings if the layout of the two child nodes is switched, from which we can determine the optimal layout for these nodes.

Complexity. Kaplan and Tarjan [18] describe split and append (++) operations on 2-3 finger trees. The operation (tL, tR) ← split(t, v) takes O(log(min(|tL|, |tR|))) time, and t1 ++ t2 takes O(log(min(|t1|, |t2|))) time. Therefore, the head/tail split on line 4 and the append on line 7 take only O(1) time. The values |p| and |q| can be computed in O(1) time as shown by Hinze and Paterson [15], who maintain the trees with size information. The call to split in line 5 takes time proportional to the logarithm of the smaller of {|r|, |p\r|}. Taking d_i as the size of r in the loop iteration when qh is the i-th element of q, the total time taken in line 5 is no more than Σ_{i=0}^{|q|} α log d_i, where Σ_{i=0}^{|q|} d_i ≤ |p|. This applies to lines 6 and 10 as well; these lines append to result all of p, in |q| + 1 pieces. The total complexity is bounded by the shared complexity of lines 5, 6 and 10. Since the sum of the logarithms is maximized when all the d_i are equal [18], the complexity is thus O(Σ_{i≤|q|} log d_i) = O(|q| log(|p|/|q|)). The total time to calculate the optimal layout at a node with n descendant leaves is given by the recurrence T(n) = T(l) + T(r) + O(l log(r/l)), where l and r are the numbers of leaves in the left and right subtrees and l + r = n. Using induction, assume that T(m) ∈ O(m log m) for all m < n. Then

T(n) = O(l log l) + O(r log r) + O(l log(r/l)) = O((l + r) log r) = O(n log n).

3.2 One-Tree Distance Minimization

The crossing minimization problem has the optimal substructure property, i.e., a configuration that minimizes the number of crossings of a subtree is also a configuration that minimizes the crossings in any optimal solution. Therefore, once the value of an optimal configuration of a subtree is computed, we can reuse the configuration irrespective of where the subtree appears in the final layout of the solution. However, in the footrule distance minimization problem, the optimal configuration of a subtree depends on the position of the subtree. An optimal configuration for one position need not be an optimal solution for all positions. See Fig. 2 for an example.

Fig. 2. One-tree footrule distance minimization with respect to the identity permutation 1, ..., 7. Consider the configuration of the subtree rooted at x. In the left figure, the configuration of this subtree is optimal and contributes 4 to the overall distance. If the same layout were to be used at position 1, shown in the right figure, it would contribute 8 to the footrule distance value. However, the optimal configuration at that position has footrule distance 4, as shown in the right figure.

Nonetheless, we can find an O(n²) algorithm using dynamic programming. For a leaf labeled i at position j, the footrule distance is |i − j|. Consider an internal node v with children u and w having c1 and c2 leaves in their subtrees, respectively. The optimal solution for the subtree rooted at v with its leaves starting at position i, D(v, i), is obtained either by drawing u on the left with leaves from i through i + c1 − 1 and w on the right with leaves from i + c1 through i + c1 + c2 − 1, or in the opposite order. We choose the ordering that minimizes the value:

D(v, i) = min{D(u, i) + D(w, i + c1), D(w, i) + D(u, i + c2)}.
The optimal solution for the tree is D(root, 1). The correctness of the algorithm is straightforward. The algorithm runs in O(n²) time, since there are n − 1 internal nodes and for each node we do a constant amount of computation at each of at most n positions.
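A compact Python realization of this recurrence (our sketch; a tree is a nested tuple of leaf labels, and positions and labels both start at 1):

```python
from functools import lru_cache

# A tree is either a leaf label (an int) or a pair (left, right).
@lru_cache(maxsize=None)
def leaves(t):
    return 1 if isinstance(t, int) else leaves(t[0]) + leaves(t[1])

@lru_cache(maxsize=None)
def D(v, i):
    """Minimum footrule cost of subtree v when its leaves occupy
    positions i, i+1, ..."""
    if isinstance(v, int):               # leaf labeled v placed at position i
        return abs(v - i)
    u, w = v
    c1, c2 = leaves(u), leaves(w)
    return min(D(u, i) + D(w, i + c1),   # u drawn left of w
               D(w, i) + D(u, i + c2))   # w drawn left of u

print(D(((1, 2), (3, (4, 5))), 1))  # 0: the identity layout is optimal here
```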
4 A Reduction Method for Two-Tree Crossing Problems

When both leaf orderings in a tanglegram are allowed to change, the complexity of crossing minimization increases greatly. While the case where one tree is fixed (OTCM) is solvable in polynomial time, the TTCM problem, where the ordering of both trees is mutable, is NP-hard [12]. However, the special case of checking whether a tanglegram has a drawing with zero crossings (planarity testing) can be solved in linear time [12].

4.1 Two-Tree Tanglegram Planarity

We first define a natural extension of tanglegrams for planarity testing.

Definition 3. An augmented tanglegram is a tanglegram with the roots of the two trees joined by an edge. This edge is called the augmented edge.

Lemma 4. A tanglegram has a proper drawing with zero crossings iff the augmented tanglegram has a planar drawing.

Proof. The "if" direction of the lemma is straightforward. For the other direction, consider a planar drawing of the augmented tanglegram. If the drawing of the augmented tanglegram is proper, removing the augmenting edge gives us a proper planar drawing. If the drawing of the augmented tanglegram is not proper, we need to show a way to rearrange the edges of this drawing to produce a proper drawing. To do so, first contract the internal edges of the two trees except for the two edges out of each root. During the contracting process, shown in Fig. 3, no new planar regions are produced. Regions that are bounded between the internal edges of one tree, the edges connecting the leaves, and the internal edges of the other tree vanish when the internal edges are contracted (see Fig. 3). We call the resulting graph the reduced graph and label the root and its two children r1, u1, v1, respectively, in one tree and r2, u2, v2 in the other. There are four possible edges between {u1, v1} and {u2, v2}. We call these edges between the two trees super-edges. Each of these super-edges represents the union (merger) of some of the regions. We claim that at most three of these edges exist. If all four edges existed, then together with the augmented edge (r1, r2) they would form K3,3 (see Fig. 3), contradicting the planarity of the original drawing. Without loss of generality, let the three super-edges be (u1, u2), (v1, v2), and (u2, v1). Any drawing of the reduced graph with these edges can be redrawn to a proper drawing of the reduced graph (as in the example in Fig. 3). Rearranging the super-edges is equivalent to rearranging the edges and regions of the original graph. Now expanding the edges in the reverse order of contraction gives us a proper drawing for the tanglegram.
Fig. 3. (a) & (b): Contraction process: After contracting the dashed internal edges, planar region 2 vanishes. The new edge can be thought of as containing the region 2 within it, and is called a super-edge. (c) & (d): Avoiding K3,3 minor: There are at most 3 edges between pairs (u1 , v1 ) and (u2 , v2 ). (d) is not proper. The edges can be rearranged to form (c).
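Lemma 4 turns the tanglegram question into ordinary graph planarity, so an off-the-shelf planarity tester suffices. A sketch using NetworkX's check_planarity (our illustration — not the linear-time routine of [16, 25]; the function and argument names are assumptions):

```python
import networkx as nx

def tanglegram_is_planar(tree1_edges, tree2_edges, matching, root1, root2):
    """tree*_edges: parent-child edges of each tree (node names disjoint
    between the two trees); matching: the leaf pairs (l1, l2) of M."""
    G = nx.Graph()
    G.add_edges_from(tree1_edges)
    G.add_edges_from(tree2_edges)
    G.add_edges_from(matching)    # inter-tree matching edges
    G.add_edge(root1, root2)      # the augmented edge of Definition 3
    is_planar, _ = nx.check_planarity(G)
    return is_planar
```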
The idea of the proof can be extended to an algorithm that generates a proper drawing, by setting a convention for left and right children and remembering the left-right order of the children during edge contraction. The order of some edges might be reversed in rearranging the super-edges. Finally, the order information is used recursively during the edge expansion to obtain a proper drawing.

Theorem 5. Deciding if a tanglegram admits a planar drawing can be done in linear time.

Proof. Apply the linear planar graph drawing algorithm [16, 25] to the augmented tanglegram. Follow the previous lemma to obtain a proper drawing.

4.2 Fixed Parameter Tractability of TTCM

The two-tree problem (TTCM), when restricted to binary trees, is fixed-parameter tractable with parameter k, the number of crossings, as shown by Fernau et al. [12]. Their proof relies on the trees being binary and achieves the result through a complicated analysis of quadruples of leaves. They conjecture difficulty for d-ary trees for d > 2. We use our reduction method to utilize the elegant work of Kawarabayashi and Reed [19]. We give a simple proof that resolves the conjecture of [12], showing that TTCM is fixed-parameter tractable over the class of all finite trees:

Theorem 6. TTCM is fixed parameter tractable over the class of all finite trees with parameter k, the number of crossings. The algorithm takes time quadratic in n.

As in the planar drawing problem, we create an augmented tanglegram and use the FPT algorithm for crossing minimization in graphs from [19]. As in the planarity case, we want to disallow crossings with internal edges. To achieve this, we add n duplicate edges around each internal edge and the augmented edge. If two internal edges cross, there will be n² crossings, which is more than the number of crossings in the sought proper drawing. Similarly, anything but the proper drawing of the edges connecting the leaves will increase the number of crossings. This ensures a proper drawing. The proof of Theorem 6 is in [28].
5 Integer Programming Solutions

Integer Linear Programming (ILP) is one of the standard approaches for obtaining fast solutions to hard problems, as it provides provably optimal solutions. Though the run time is not polynomially bounded, ILPs are fast in many practical settings and are often better than provably efficient methods. We describe ILP formulations for the two-tree optimization problems considered in this paper.

5.1 Crossing Minimization

The formulation for crossing minimization is based on the following intuition: if leaf i is to the left of leaf j in both trees, then the edges connecting the i's and the j's do not cross; the edges cross if there is an inversion in the order. To realize this, we introduce binary variables x_{i,j} for all leaf pairs (i, j) such that i < j. x_{i,j} is set to 1 iff i appears before j in the linear order. For every internal node k we introduce a variable y_k. Let c1 and c2 be the two children of k. y_k = 1 if c1, c2 are to the left and right, respectively, and y_k = 0 otherwise. For leaves i in the subtree below c1 and j in the subtree below c2, if i < j then x_{i,j} = 1 ⇐⇒ y_k = 1, so x_{i,j} = y_k. If j < i then y_k = 1 − x_{i,j}. Analogously, for the second tree we define these constraints over variables x′_{i,j} and y′_k. If i is to the left (or right) of j in the drawing of both trees in the tanglegram, then there is no crossing; i and j cross only when the order is reversed. That is, i, j cross iff x_{i,j} ≠ x′_{i,j}. We let z_{i,j} = x_{i,j} ⊕ x′_{i,j}. We can rewrite the XOR as linear inequalities; one standard linearization for binary variables is z_{i,j} ≥ x_{i,j} − x′_{i,j}, z_{i,j} ≥ x′_{i,j} − x_{i,j}, z_{i,j} ≤ x_{i,j} + x′_{i,j}, and z_{i,j} ≤ 2 − x_{i,j} − x′_{i,j}. The objective function for minimizing the number of crossings is therefore min Σ_{i<j} z_{i,j}.

5.2 Distance Minimization

We describe two different formulations for the distance minimization problem. The first formulation is based on the dynamic programming idea used in the one-tree distance minimization problem. The second uses the simple fact that the order of the children of an internal node determines the relation between the leaves in its two subtrees.

Dynamic programming version. For a vertex k we set a binary variable y_{k,p} = 1 when the subtree beneath it is placed starting at position p. For instance, y_{root,1} = 1 always. If k is an internal node, let i and j be its children with l and r leaves in the subtrees below them. y_{k,p} = 1 implies that node i is placed at position p or p + r. This implication is written as the inequality y_{i,p} + y_{i,p+r} ≥ y_{k,p}. Similarly, y_{j,p} + y_{j,p+l} ≥ y_{k,p}. Both i and j cannot be the left (or right) child of k simultaneously, so y_{j,p} + y_{i,p} ≤ 1. Every leaf must occur exactly once: for every leaf l, Σ_{r∈[n]} y_{l,r} = 1. Every position must have exactly one leaf: for every r ∈ [n], Σ_{l∈leaves} y_{l,r} = 1. We use variables y′ and similar inequalities for the second tree. A binary variable z_{l,r,r′} = 1 only when leaf l is present at positions r and r′ in the two trees, respectively; z_{l,r,r′} then contributes |r − r′| to the distance value. Therefore, the objective function is min Σ_{leaf l} Σ_{r∈[n]} Σ_{r′∈[n]} |r − r′| z_{l,r,r′}.
Table 1. Running time of ILP solutions: average time, in secs, is averaged over 30 runs

Crossing problems:
Input size   Time    variance
10           0.02    0.01
20           0.32    0.17
30           2.03    0.54
40           7.79    1.7
50           20.87   3.64

Distance problems:
Input size   Distance version      Dynamic programming
             Time      variance    Time       variance
6            0.12      0.04        0.41       0.25
10           16.87     19.21       36.34      18.69
11           75.93     110.80      99.04      56.06
12           182.10    245.75      324.36     211.48
15           781.88    1171.95     8663.02    6208.82
Distance version. Consider an internal node i with m leaves in its subtree and let its two children be c1, c2. Let j, k be leaves in the subtrees of c1 and c2, respectively. Let x_j denote the position of leaf j in the linear order [n]. Introduce a binary variable y_i for each internal node i to model the choice of c1 or c2 being the left child: y_i = 1 when c1 is the left child (and j is to the left of k), and the opposite is implied by y_i = 0.

y_i = 1 ⇐⇒ −(m − 1) ≤ x_j − x_k ≤ −1    (1)
y_i = 0 ⇐⇒ 1 ≤ x_j − x_k ≤ m − 1        (2)
These implications are written as the following inequalities: x_j − x_k + 1 ≤ m(1 − y_i) and x_j − x_k + m·y_i ≥ 1. Next we need to ensure that 1 ≤ x_j ≤ n for all leaves j and that all x_j's are unique. The uniqueness constraints can be written in a number of ways; we model them as a matching problem. It has been observed in the ILP literature that the vertices of the matching polytope are all lattice points, and therefore the ILP software need not apply further reduction techniques [20]. As usual, we define similar inequalities on variables x′_i and y′_i for the constraints on the second tree. Finally, the optimization criterion is min Σ_i |x_i − x′_i|.

5.3 Timing

To generate a random tree, we take a random subset of [n]; this is the set of leaves of the left subtree of the root, and the remaining elements are the leaves of the right subtree. We recurse on these subsets to generate the random tree, and we take two such trees to form a random tanglegram. We executed the ILP formulations of the problems using CPLEX 10 on a Pentium IV 3 GHz dual-core desktop machine with 2 GB of RAM. The data shown in Table 1 are obtained by averaging the running time over thirty runs each for problems of various sizes. The crossing minimization problem is solved very quickly. The distance version is slower in comparison, though relatively fast for small datasets; it is about three times faster than the dynamic programming version. We see in our examples that most of the executions run in less than about half of the reported mean time. About 10% of the cases take much longer, leading to increased variance. In most of these cases CPLEX obtains the optimal solution quickly, or finds a solution very close to the optimal solution very soon, but takes much longer to make minor improvements or to prove that there is no better solution.
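The tree generator described above is easy to reproduce; a sketch (ours):

```python
import random

def random_tree(taxa):
    """Recursively split the taxon set into a random nonempty
    left/right pair, as described in Sect. 5.3."""
    if len(taxa) == 1:
        return taxa[0]
    k = random.randint(1, len(taxa) - 1)      # nonempty on both sides
    left = set(random.sample(sorted(taxa), k))
    right = [t for t in taxa if t not in left]
    return (random_tree(sorted(left)), random_tree(right))

def random_tanglegram(n):
    taxa = list(range(1, n + 1))
    return random_tree(taxa), random_tree(taxa)
```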
6 Dwyer and Schreiber's Seesaw Heuristic

Though [4] shows that, assuming the Unique Games Conjecture, there is no constant-factor approximation algorithm for TTCM, Dwyer and Schreiber [9] present a heuristic for two-tree crossing minimization that iteratively solves OTCM for each tree. The idea is to fix <2 and solve OTCM on T1, then fix <1 and solve OTCM on T2. They found that this yielded a good solution after ten or fewer iterations. We call this "seesawing".

Theorem 7. For any N, there is an n > N and a tanglegram of size n for which the optimal drawing produced by seesawing has Ω(n²) more crossings than an optimal drawing.

We call a drawing that cannot be improved by seesawing seesaw-optimal. We prove the theorem (in [28]) by finding one tanglegram that has a seesaw-optimal drawing that is inferior to its optimal drawing. By iteratively replacing the leaves with copies of the drawing, we create a chain of seesaw-optimal drawings with a quadratically increasing number of crossings, while the optimal crossing number stays small. From this we obtain planar tanglegrams of arbitrarily large size with seesaw-optimal drawings having Ω(n²) crossings.
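In outline, seesawing alternates one-tree solves until no further improvement; `solve_otcm` below is a hypothetical OTCM solver (e.g., the O(n log n) method of Sect. 3.1) returning a crossing count and the re-optimized leaf order:

```python
def seesaw(T1, T2, order1, solve_otcm, max_iters=10):
    """Alternately fix one tree's leaf order and re-optimize the other;
    Dwyer and Schreiber report convergence within about ten rounds."""
    best = None
    order2 = None
    for _ in range(max_iters):
        cost, order2 = solve_otcm(T2, order1)  # T1's order held fixed
        cost, order1 = solve_otcm(T1, order2)  # T2's order held fixed
        if best is not None and cost >= best:
            break                              # reached a seesaw-optimal drawing
        best = cost
    return order1, order2, best
```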
7 Conclusion and Open Problems

We have shown several significantly faster algorithms for tanglegram drawing, including for the planar, k-crossing, and one-tree optimization problems. We have also introduced the footrule distance measure for tanglegrams and shown an efficient one-tree drawing algorithm. We conjecture that the two-tree distance minimization problem is NP-complete. Future work includes improving drawing heuristics for tanglegrams under the distance measure. Our ILP solution for the crossing measure is efficient, but the ILP solution for the distance problem is slower and can perhaps be improved. It also remains to explore the seesaw method for the distance heuristic, though we have shown that in the crossing case it can exceed the optimal solution by Ω(n²). For the one-tree problem, though the distance between permutations can be computed in linear time (while counting crossings takes Ω(n log n)), distance seems the harder measure to optimize.
References
[1] Bansal, M.S., Chang, W.-C., Eulenstein, O., Fernández-Baca, D.: Generalized binary tanglegrams: Algorithms and applications. In: BiCoB (2009)
[2] Bertolazzi, P., Battista, G.D., Mannino, C., Tamassia, R.: Optimal upward planarity testing of single-source digraphs. SIAM J. Comput. 27(1), 132–169 (1998)
[3] Biedl, T.C., Brandenburg, F.-J., Deng, X.: Crossings and permutations. In: Healy, P., Nikolov, N.S. (eds.) GD 2005. LNCS, vol. 3843, pp. 1–12. Springer, Heidelberg (2006)
[4] Buchin, K., Buchin, M., Byrka, J., Nöllenburg, M., Okamoto, Y., Silveira, R.I., Wolff, A.: Drawing (complete) binary tanglegrams: Hardness, approximation, fixed-parameter tractability. In: Graph Drawing. Springer, Heidelberg (2008)
[5] Burt, A., Trivers, R.: Genes in Conflict. Belknap Harvard Press (2006)
[6] Charleston, M., Perkins, S.: Lizards, malaria, and jungles in the Caribbean. In: Page, R. (ed.) Tangled Trees: Phylogeny, Cospeciation, and Coevolution, pp. 65–92. University of Chicago Press, Chicago (2003)
[7] Diaconis, P., Graham, R.L.: Spearman's footrule as a measure of disarray. Journal of the Royal Statistical Society, Series B (Methodological) 39(2), 262–268 (1977)
[8] Dwork, C., Kumar, R., Naor, M., Sivakumar, D.: Rank aggregation methods for the web. In: WWW, pp. 613–622 (2001)
[9] Dwyer, T., Schreiber, F.: Optimal leaf ordering for two and a half dimensional phylogenetic tree visualisation. In: Australasian Symp. on Info. Vis., pp. 109–115 (2004)
[10] Page, R.D.M. (ed.): Tangled Trees: Phylogeny, Cospeciation, and Coevolution. University of Chicago Press, Chicago (2002)
[11] Fagin, R., Kumar, R., Sivakumar, D.: Comparing top k lists. In: SODA, pp. 28–36 (2003)
[12] Fernau, H., Kaufmann, M., Poths, M.: Comparing trees via crossing minimization. In: Ramanujam, R., Sen, S. (eds.) FSTTCS 2005. LNCS, vol. 3821, pp. 457–469. Springer, Heidelberg (2005)
[13] Foulds, L.R., Graham, R.L.: The Steiner problem in phylogeny is NP-complete. Adv. in Appl. Math. 3(1), 43–49 (1982)
[14] Garey, M., Johnson, D.S.: Crossing number is NP-complete. SIAM Journal on Algebraic and Discrete Methods 4, 312–316 (1983)
[15] Hinze, R., Paterson, R.: Finger trees: A simple general-purpose data structure. Journal of Functional Programming 16(2), 197–217 (2006)
[16] Hopcroft, J.E., Tarjan, R.E.: Efficient planarity testing. J. ACM 21(4), 549–568 (1974)
[17] Huelsenbeck, J.P., Ronquist, F.: MrBayes: Bayesian inference of phylogeny (2001)
[18] Kaplan, H., Tarjan, R.E.: Purely functional representations of catenable sorted lists. In: STOC 1996, pp. 202–211. ACM, New York (1996)
[19] Kawarabayashi, K., Reed, B.: Computing crossing number in linear time. In: STOC, pp. 382–390 (2007)
[20] Lee, J.: All-different polytopes. Journal of Combin. Optim. 6(3), 335–352 (2002)
[21] Lozano, A., Pinter, R.Y., Rokhlenko, O., Valiente, G., Ziv-Ukelson, M.: Seeded tree alignment and planar tanglegram layout. In: Giancarlo, R., Hannenhalli, S. (eds.) WABI 2007. LNCS (LNBI), vol. 4645, pp. 98–110. Springer, Heidelberg (2007)
[22] Hillis, D.M., Heath, T., John, K.S.: Analysis and visualization of tree space. Systematic Biology 3, 471–482 (2005)
[23] Nöllenburg, M., Holten, D., Völker, M., Wolff, A.: Drawing binary tanglegrams: An experimental evaluation. In: ALENEX, pp. 106–119. SIAM, Philadelphia (2009)
[24] Roch, S.: A short proof that phylogenetic tree reconstruction by maximum likelihood is hard. IEEE/ACM Trans. Comp. Biol. and Bioinf. 3(1), 92–94 (2006)
[25] Shih, W.K., Hsu, W.-L.: A new planarity test. Theor. Comput. Sci. 223(1-2), 179–191 (1999)
[26] Swofford, D.L.: PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods), Version 4. Sinauer Associates, Sunderland, Massachusetts (2002)
[27] Swofford, D.L., Olsen, G.J., Waddell, P.J., Hillis, D.M.: Phylogenetic inference. In: Molecular Systematics, 2nd edn., pp. 407–514. Sinauer (1996)
[28] Venkatachalam, B., Apple, J., St. John, K., Gusfield, D.: Untangling tanglegrams: Comparing trees by their drawings. Technical Report CSE-2009-1, UC Davis, Computer Science Department (2009)
[29] Wan Zainon, W.N., Calder, P.: Visualising phylogenetic trees. In: Piekarski, W. (ed.) Seventh Australasian User Interface Conference (AUIC 2006), Hobart, Australia. CRPIT, vol. 50, pp. 145–152. ACS (2006)
An Experimental Analysis of Consensus Tree Algorithms for Large-Scale Tree Collections

Seung-Jin Sul and Tiffani L. Williams
Department of Computer Science and Engineering
Texas A&M University
{sulsj,tlw}@cs.tamu.edu
Abstract. Consensus trees are a popular approach for summarizing the shared evolutionary relationships in a collection of trees. Many popular techniques such as Bayesian analyses produce results that can contain tens of thousands of trees to summarize. We develop a fast consensus algorithm called HashCS to construct large-scale consensus trees. We perform an extensive empirical study comparing the performance of several consensus tree algorithms implemented in widely used phylogenetic software such as PAUP* and MrBayes. Our collections of biological and artificial trees range from 128 to 16,384 trees on 128 to 1,024 taxa. Experimental results show that our HashCS approach is up to 100 times faster than MrBayes and up to 9 times faster than PAUP*. Fast consensus algorithms such as HashCS can be used in a variety of ways, such as in real-time to detect whether a phylogenetic search has converged.
1 Introduction
Given a collection of organisms (or taxa), the objective of a phylogenetic analysis is to produce an evolutionary tree describing the genealogical relationships between the taxa. Phylogenetic methods (such as Bayesian analyses) for reconstructing an evolutionary tree can easily produce tens of thousands of potential trees, which must be summarized in order to understand the evolutionary relationships among the taxa. Moreover, large tree collections can also be produced by bootstrap tests on phylogenies to assess the uncertainty of a phylogenetic estimate. Currently, biologists use popular phylogenetic software packages such as PAUP* [1] and MrBayes [2] to summarize their large tree collections into a single consensus tree (see Figure 1). In this paper, we study whether current consensus tree implementations can accommodate the growing requirements of larger phylogenetic analyses, such as those necessary for building the Tree of Life, the grand challenge problem in phylogenetics. The novelty of our work consists of the following: (1) developing a fast algorithm to construct large-scale consensus trees and (2) performing an extensive, empirical investigation to analyze consensus tree performance. As trees increase in size (number of taxa) and potentially in number (size of collection), fast approaches that can handle such sets of trees will be needed.
[Figure 1: four input trees T1–T4 on the taxa A–E, with internal edges (bipartitions) labeled B1–B8, together with the resulting majority and strict consensus trees.]
Fig. 1. Overview of the consensus techniques on four trees of interest. Bipartitions (or internal edges) in a tree are labeled Bi , where i ranges from 1 to 8.
Our experiments compare the performance of our HashCS algorithm to the consensus tree implementations in MrBayes [2], PAUP* [1], and Phylip [3]. We also compare our work to Day's algorithm [4], a theoretically optimal approach for building strict consensus trees. We obtained a diverse collection of large biological and artificial trees to assess the performance of the consensus tree implementations. The biological trees were gathered from Bayesian analyses performed by life scientists on two molecular datasets consisting of 150 taxa (desert algae and green plants) [5] and 567 taxa (angiosperms) [6]. To complement our experiments using biological trees, we developed a generator to produce artificial trees in order to predict the performance of the consensus tree implementations on a more diverse collection of trees. Our results clearly show that HashCS is the best algorithm for computing large-scale consensus trees. HashCS is up to 100 and 1.8 times faster than the modules in MrBayes and PAUP*, respectively, for computing consensus trees on our biological tree collections. Given that HashCS and PAUP* are substantially faster than the other approaches, we look at their performance on a diverse collection of artificial trees with varying degrees of similarity among them. On these collections, the speedup of HashCS over PAUP* ranges from 2 to 9. Although HashCS outperforms PAUP*, the gap between the running times widens (or closes) in relation to the similarity of the evolutionary relationships (or bipartitions) among the trees. Currently, consensus tree algorithms are used exclusively at the end of a phylogenetic analysis, which implies the search has converged in tree space, resulting in the search returning highly similar trees. However, it is conceivable, given the fast running times of HashCS (e.g., constructing the strict consensus of over 16,000 trees on 567 taxa requires around 40 seconds),
that consensus trees could be constructed in real time repeatedly during a phylogenetic search. Under this scenario, the additional information provided by a consensus tree could be used to detect whether a search has converged. Consensus algorithms whose running times are minimally impacted by the diversity of trees they may encounter are quite desirable. Thus, fast algorithms such as HashCS allow scientists to process their data in new and exciting ways.
2 Background

2.1 Evolutionary Trees and Their Bipartitions
In a phylogenetic tree, modern organisms (or taxa) are placed at the leaves and ancestral organisms occupy internal nodes, with the edges of the tree denoting evolutionary relationships. It is useful to represent phylogenies in terms of their bipartitions. Removing an edge e from a tree separates the leaves on one side from the leaves on the other. The division of the leaves into two subsets is the bipartition Be associated with edge e. In Figure 1, tree T1 has two bipartitions: AB|CDE and ABC|DE. An evolutionary tree is uniquely and completely defined by its set of O(n) bipartitions, where n is the number of taxa. A binary tree has exactly n − 3 non-trivial (or internal) bipartitions. For each tree in the collection of input trees, we find all of its bipartitions (internal edges) by performing a postorder traversal. In order to process the bipartitions, we need some way to store them in the computer's internal memory. An intuitive bitstring representation requires n bits, one for each taxon. The first bit is labeled by the first taxon name, the second bit by the second taxon, and so on. We can represent all of the taxa on one side of the tree with the bit '0' and the remaining taxa on the other side with the bit '1'. Consider the bipartition AB|CDE from tree T1 in Figure 1. This bipartition is represented as 11000: taxa on the same side of the bipartition as taxon A receive a '1'.
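For example, the bitstring can be produced directly from the taxon ordering (a sketch; the set passed in is the side of the bipartition containing the first taxon):

```python
def bipartition_bits(side_with_first_taxon, taxa):
    """Return the n-bit string: '1' for taxa on taxon A's side."""
    return ''.join('1' if t in side_with_first_taxon else '0'
                   for t in taxa)

print(bipartition_bits({'A', 'B'}, ['A', 'B', 'C', 'D', 'E']))  # 11000
```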
2.2 Consensus Trees
Consensus trees summarize the information in a collection of trees into a single output tree. We consider the most popular consensus approaches: strict and majority trees. The strict consensus tree contains the bipartitions that appear in all of the input trees. To appear in the majority tree, a bipartition must appear in more than half of the input trees. Oftentimes, a consensus tree will not be binary, since there will be bipartitions that are not shared across the tree collection. One way to measure the quality of a consensus tree is to compute its resolution rate, which represents the percentage of the tree that is binary. The resolution rate of a tree T is b/(n − 3), where b is the number of bipartitions in the tree T and n − 3 is the number of possible resolved bipartitions. The resolution rate of a tree T varies between 0% (a star) and 100% (a completely resolved binary tree).
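Given each tree's bipartition set (e.g., as bitstrings), both consensus bipartition sets fall out of a single frequency count; a sketch (ours — turning the surviving bipartitions back into a tree is a separate step):

```python
from collections import Counter

def consensus_bipartitions(trees_bipartitions):
    """trees_bipartitions: one set of hashable bipartitions per tree."""
    t = len(trees_bipartitions)
    freq = Counter(b for tree in trees_bipartitions for b in tree)
    strict = {b for b, c in freq.items() if c == t}       # in all t trees
    majority = {b for b, c in freq.items() if c > t / 2}  # in a majority
    return strict, majority
```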
3 Comparison with Previous Work
A number of different techniques have been developed to summarize tree collections by building consensus trees. Traditional approaches include Day's algorithm [4], MrBayes, PAUP*, and Phylip, and our results demonstrate that HashCS outperforms them in practice. Our HashCS algorithm is motivated by the work of Amenta, Clark, and St. John [7], who develop an optimal O(nt) algorithm for computing the majority consensus tree, where n is the number of taxa and t is the number of trees. Unfortunately, there appears to be no publicly available implementation of their algorithm that we could obtain. Moreover, in their paper, Amenta et al. do not provide any empirical evidence of their algorithm's running time on biological or artificial tree collections. HashCS is a fast implementation of Amenta et al.'s algorithm. However, there are several key differences between the approaches. First, HashCS requires a single traversal of the collection of t trees to construct a consensus tree; Amenta et al.'s algorithm requires two traversals of the t trees, where the second traversal is an involved procedure to construct the majority tree from the nodes in the hash table. Secondly, unlike Amenta et al.'s approach, HashCS does not insert all bipartitions into the hash table. Finally, Amenta et al. explicitly detect double collisions (which can result in an incorrect consensus tree) in the hash table, but our experiments show that this step is unnecessary; thus, HashCS does not explicitly check for double collisions. The Texas Analysis of Symbolic Phylogenetic Information (TASPI) system is a recently proposed technique to compute consensus trees [8], [9]. One of the novelties of TASPI is that it incorporates a new format for compactly storing and retrieving phylogenetic trees. Experimental results on several collections of maximum parsimony trees show that the TASPI system outperforms PAUP* [1] and TNT [10] in constructing consensus trees. An implementation of the TASPI system does not appear to be available for experimental comparison in this paper. However, since it was demonstrated clearly that TASPI and PAUP* are much faster than TNT's consensus algorithm, we do not explore the performance of TNT in this paper. A comparison of Phylip and MrBayes consensus methods was not included in the experiments with TASPI; hence, we include those techniques in our study.
4 Our HashCS Algorithm
Building consensus trees using a hash table consists of two major steps. First, the hash table is populated with the bipartitions from the collection of trees (see Figure 2). Each tree Ti's bipartitions are fed through two hashing functions, h1 and h2, to determine where each should reside in the hash table. The bipartition is stored in the hash table, and a frequency counter associated with that bipartition is updated. Once all bipartitions are inserted into the hash table, a consensus tree is constructed based on the values of the bipartition frequency counters.
[Figure 2: bipartitions B1–B8 from trees T1–T4 are fed through h1 into a hash table with entries 0 to m1 − 1; each hash record stores a bipartition ID and its frequency, and a Type 1 collision is shown where two bipartitions share a hash table location.]
Fig. 2. Overview of the HashCS algorithm. Bipartitions are from Figure 1. That is, B1 and B2 define tree T1 , B3 and B4 are from tree T2 , etc. The implicit representation of each bipartition is fed to the hash functions h1 and h2 . The shaded value in each hash record contains the bipartition ID (or h2 value) whereas the unshaded value shows the frequency of that bipartition.
4.1 Step 1: Populating the Hash Table
Hashing functions h1 and h2. Similarly to Amenta et al. [7], we define two universal hashing functions. Our first hash function is defined as $h_1(B) = \sum_i b_i r_i \bmod m_1$, and the second is defined similarly as $h_2(B) = \sum_i b_i s_i \bmod m_2$. Here m1 is the number of entries (or locations) in the hash table, and m2 is the largest bipartition ID (BID) that can be given to a bipartition. That is, instead of storing the n-bitstring, a shortened version of it (represented by the BID) is stored in the hash table. HashCS requires two sets of random integers ri and si drawn from the intervals (0, ..., m1 − 1) and (0, ..., m2 − 1), respectively, and bi denotes the ith bit of the n-bitstring representation of the bipartition B.
Implicit bipartitions. For faster performance, we avoid sending the n-bitstring representation of each bipartition B to our hashing functions. Instead, we use an implicit bipartition representation to compute the hash functions quickly. An implicit bipartition is simply an integer value (instead of an n-bitstring) that represents the bipartition. Consider an internal node B whose bipartition is represented by an n-bitstring, and let the two children of this node have n-bitstring representations Bleft and Bright, which represent disjoint sets of taxa. Then the hash value for our h1 hash function is
$$h_1(B) = \Big(\sum_{i \in B_{left}} b_i r_i \bmod m_1\Big) + \Big(\sum_{i \in B_{right}} b_i r_i \bmod m_1\Big). \qquad (1)$$
We can use Equation 1 to compute implicit bipartitions, which replace the need for bitstrings in our hashing functions. Computing h2 works similarly.
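A sketch of how Equation 1 allows a single postorder traversal to hash every bipartition without materializing bitstrings; the node fields (taxon, left, right) are assumed for illustration and are not taken from the HashCS source:

def implicit_h1(node, r, m1):
    """Postorder sketch of Equation (1): each node's hash value is the sum
    of its children's hash values mod m1, so the hash of every bipartition
    is obtained in one traversal without building n-bitstrings.
    Leaves carry `taxon` (an index into r); internal nodes have left/right."""
    if node.left is None:                  # leaf: contributes r_i for taxon i
        return r[node.taxon] % m1
    h = (implicit_h1(node.left, r, m1) +
         implicit_h1(node.right, r, m1)) % m1
    node.h1 = h                            # hash of this node's bipartition
    return h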
Inserting bipartitions into the hash table. HashCS uses two different policies for inserting bipartitions into the hash table. For the strict consensus tree, a bipartition must appear in all t trees in the tree collection. Since the first tree in the collection determines the possible set of strict consensus bipartitions, only the first tree's bipartitions are inserted into the hash table. If these bipartitions are found in the remaining trees, their frequency counters in the hash table are incremented. For the last tree, the n-bitstring representation of each bipartition is computed along with the implicit representation. The n-bitstring representation of a bipartition is stored in an array A if its frequency count is t. Let x be the number of n-bitstrings in array A; x is the number of bipartitions that will compose the strict consensus tree. The array A of n-bitstring representations is used to build the consensus tree in Step 2 of the algorithm. To construct the majority consensus tree, all unique bipartitions are inserted into the hash table until tree t/2 + 1 is read. At this point, the n-bitstrings are computed along with the implicit bipartitions. Once a node's bipartition frequency has reached t/2 + 1, it is invalidated so that its n-bitstring representation does not appear multiple times in the array A. As in the strict consensus case, during Step 2 of the algorithm the array A of majority n-bitstrings is used to build the majority tree.
Collision types and their probability. A consequence of using hash functions is that distinct bipartitions may end up residing in the same location in the hash table. Such an event is a collision, and there are two types to consider. Type 1 collisions result from two different bipartitions Bi and Bj (i.e., Bi ≠ Bj) residing in the same location in the hash table; that is, h1(Bi) = h1(Bj). Type 2 (or double) collisions are serious and require a restart of the algorithm; otherwise, the resulting output will be incorrect. Suppose that Bi ≠ Bj. A Type 2 collision occurs when two different bipartitions Bi and Bj hash to the same location in the hash table and the bipartition IDs (BIDs) associated with them are also the same; that is, h1(Bi) = h1(Bj) and h2(Bi) = h2(Bj). The probability of HashCS restarting because of a double collision among any pair of the bipartitions is O(1/c). Given that we can make c arbitrarily large, we do not explicitly check for Type 2 collisions. As a result, HashCS has a theoretical error rate of O(1/c). In practice, however, the error rate of HashCS is much lower. In our experiments, varying c from 1 to 10,000 and running HashCS at least 100 times for each c value, the algorithm produced the correct consensus tree every time, for an overall error rate of 0%.
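The strict-consensus insertion policy could look roughly as follows; keying records on the (h1, h2) pair mirrors how HashCS stores the BID in each record, so Type 1 collisions are resolved while Type 2 collisions would merge silently (a sketch with hypothetical helper names):

def strict_consensus_bipartitions(trees, t, h1, h2):
    """Strict policy sketch: only the first tree's bipartitions enter the
    table; later trees only increment existing counters. A record is keyed
    by (h1, h2), i.e. location plus BID, as in a HashCS hash record."""
    table = {}
    for tree_index, tree in enumerate(trees):
        for b in tree:
            key = (h1(b), h2(b))
            if tree_index == 0:
                table[key] = [b, 1]        # store bipartition and counter
            elif key in table:
                table[key][1] += 1
    # keep exactly the bipartitions seen in all t trees (array A in the text)
    return [b for b, count in table.values() if count == t]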
4.2
Step 2: Constructing the Consensus Tree
Once all the bipartitions in the tree collection have been processed, we can build the consensus tree. In our approach, the consensus tree is initially a star tree on n taxa. Bipartitions from the array A, which contains the x n-bitstrings created in Step 1, are added to refine the consensus tree based on the number of 1's in their n-bitstring representation (the number of 0's could have been used as well). The more 1's in the n-bitstring representation,
the more taxa are grouped together by this bipartition. A star tree has an n-bitstring representation consisting of all 1's. During the collection of n-bitstrings in Step 1, a count of the number of 1's was stored for each bipartition. In Step 2, these counts are sorted in decreasing order, so that the bipartition that groups together the most taxa appears first in the sorted list and the bipartition that groups together the fewest taxa appears last. For each bipartition B in the sorted list, a new internal node is created to further refine the taxa in the consensus tree built so far. To do this, the bipartition B is scanned to put the taxa into two groups: taxa with '0' bits in the n-bitstring representation of B compose one group, and those with '1' bits compose the other. The taxa in the current consensus tree indicated by the '1' bits become children of the new internal node. The above process repeats until all bipartitions in the sorted list have been added to the consensus tree.
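A compact sketch of Step 2, under the assumption that the bipartitions in A are mutually compatible (they are, since they all belong to one consensus tree); bitstrings are modeled as 0/1 lists:

def build_consensus(n, bitstrings):
    """Step 2 sketch: refine a star tree on n taxa. Bitstrings (0/1 lists)
    are processed in decreasing order of their number of 1 bits, so the
    bipartition grouping the most taxa creates its internal node first."""
    subtree = {i: ("leaf", i) for i in range(n)}   # taxon -> current subtree
    for bits in sorted(bitstrings, key=lambda b: -sum(b)):
        group = [i for i in range(n) if bits[i]]
        # distinct current subtrees of the grouped taxa become children
        children = list({id(subtree[i]): subtree[i] for i in group}.values())
        node = ("node", children)
        for i in group:
            subtree[i] = node
    roots = list({id(s): s for s in subtree.values()}.values())
    return ("root", roots)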
4.3
Analysis
Our analysis assumes that the number of trees, t, is much greater than the number of taxa, n. We believe this assumption is especially valid for trees obtained from a Bayesian analysis, which can sample trees from runs consisting of well over a million generations. Using our assumption that t >> n, Step 1 requires O(nt) time. A similar analysis for the HashCS majority algorithm also yields O(nt) time for the first step. Step 2 requires O(nx) time to construct the consensus tree from the n-bitstring bipartitions, where x is the number of bipartitions in the consensus tree. In the worst case, x = n − 3, the maximum number of internal edges in an unrooted binary tree. Thus, the overall running time for HashCS is O(nt), which is optimal since any consensus algorithm must process the O(nt) total bipartitions in the input trees.
5
Our Collection of Evolutionary Trees
5.1
Biological Trees
The large biological tree collections used in this study were obtained from two recent Bayesian analyses.
– 20,000 trees obtained from a Bayesian analysis of an alignment of 150 taxa (23 desert taxa and 127 others from freshwater, marine, and soil habitats) with 1,651 aligned sites [5]. The majority and strict consensus resolution rates for these 20,000 trees are 87% and 34%, respectively.
– 33,306 trees obtained from an analysis of a three-gene, 567-taxon (560 angiosperms, seven outgroups) dataset with 4,621 aligned characters [6]. The majority and strict consensus resolution rates for these 33,306 trees are 93% and 51%, respectively.
In our experiments, for each number of taxa, n, we created different tree set sizes, t, to test the scalability of the algorithms. Dividing the original tree collection into smaller tree sets simulates using higher burn-in rates (i.e., higher
burn-in rates mean the consensus tree is composed of fewer trees) as well as shorter Bayesian runs (fewer generations produce fewer trees). Hence, the entire collection of trees is divided into smaller sets, where t is 128, 256, 512, . . . , 16,384 trees. Thus, for each (n, t) pair, t trees with n taxa were randomly sampled without replacement from the appropriate tree collection. For each (n, t) pair, we repeated the above sampling process five times. Our experimental results show the average running time for each (n, t) pair.
5.2
Artificial Trees
We use artificial tree collections to predict how the performance of the algorithms scales across a wider range of trees than is available in our biological tree collections. To generate a collection of t trees, each consisting of n taxa, we first generate a random Yule-model tree on n taxa, called T100%. Each of these trees is binary and has (n − 3) bipartitions. Next, we remove r% of the bipartitions from T100% to create an r% strict or majority consensus tree, Cr%. If Cr% represents a majority tree, then we assign a random number in the interval (50, 100] to each bipartition in Cr%; this value represents the percentage of the t trees that contain that bipartition. For example, consider the 50% resolution majority tree in Figure 1. The bipartition AB|CDE appears in 75% of the input trees; hence, three (T1, T2, and T3) of the four trees are selected randomly to contain the bipartition AB|CDE. If Cr% represents a strict tree, then each consensus bipartition appears 100% of the time. Once all of the bipartitions from the consensus tree Cr% have been distributed, each of the t trees in the collection is constructed. For each tree Ti, where 1 ≤ i ≤ t, we construct Ti with the consensus bipartitions that have been distributed to it. If tree Ti is a multifurcating tree (i.e., its resolution rate is less than 100%), then it is randomly made into a binary tree; that is, the internal nodes of Ti that are not binary are randomly resolved so that they become binary. These randomly resolved (non-consensus) bipartitions are then distributed to the remaining p trees, where 1 < p ≤ 0.50t and i < p ≤ t if Cr% is a majority tree, and 1 < p < t and i < p ≤ t if Cr% is a strict tree. We distribute the non-consensus bipartitions to the remaining trees in order to increase the amount of sharing among the non-consensus bipartitions in the tree collection. The above process is repeated until all t trees are constructed.
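For illustration, a simple way to generate a random Yule-model tree is to split a randomly chosen extant lineage until n leaves exist; this sketch is only one plausible reading of the generator described above, not the authors' code:

import random

def yule_tree(n):
    """Sketch: grow a Yule-model tree by repeatedly splitting a randomly
    chosen current leaf until n leaves exist, then label leaves randomly.
    Nodes are mutable lists: ["leaf", taxon] or ["node", left, right]."""
    leaves = [["leaf", None], ["leaf", None]]
    root = ["node", leaves[0], leaves[1]]
    while len(leaves) < n:
        parent = leaves.pop(random.randrange(len(leaves)))  # speciating lineage
        left, right = ["leaf", None], ["leaf", None]
        parent[:] = ["node", left, right]   # turn the chosen leaf into a node
        leaves += [left, right]
    for leaf, taxon in zip(leaves, random.sample(range(n), n)):
        leaf[1] = taxon
    return root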
5.3
Experimental Validation, Implementations and Platform
All consensus implementations produce the same majority and strict consensus trees on our tree collections. Although PAUP*, Phylip, and MrBayes have methods for inferring evolutionary trees, all experiments were done using only the consensus tree modules in these programs; no phylogenetic searches were conducted in our experiments. All experiments were run on an Intel Pentium platform with a 3.0 GHz processor and a total of 2 GB of memory, under the Linux operating system (Red Hat 2.6.22.14-17.fc6). HashCS and Day's algorithm were written in C++ and compiled with gcc 4.1.2 with the -O3 compiler option.
Fig. 3. Running time of the strict consensus tree algorithms on our collection of biological trees: (a) 150 taxa; (b) 567 taxa. The scale of the y-axis is different for all the plots.
PAUP* is commercially available software; we used version 4.0b10 in our experiments. We also compared our approach to the consensus tree implementations in Phylip (ver. 3.65) and MrBayes (ver. 3.1.2), which are freely available.
6
Results
6.1
Biological Tree Collections
First, we consider the running time of the consensus tree implementations for computing the strict consensus tree; Figure 3 shows the results. Overall, HashCS is the fastest implementation for computing strict consensus trees. MrBayes is the slowest approach, requiring 1.1 hours compared to 38.3 seconds for HashCS on the largest dataset (567 taxa, 16,384 trees); HashCS is 1.8 times faster than PAUP* on the largest tree collection. In our plots, we show two performance results for PAUP*. PAUP*(strict) is the running time of the algorithm when using the strict command in the Nexus file. PAUP*(MJ=100) computes the strict consensus tree using the majority option with the percentage set to 100%. Surprisingly, using the strict option takes considerably more time to compute the same tree than using the majority option; other researchers have observed this behavior as well [8], [9]. In any case, PAUP*(MJ=100) is the second-fastest performer behind HashCS. The running times for constructing majority trees are similar and are not shown due to space limitations.
6.2
Artificial Tree Collections
The previous figures clearly demonstrate that MrBayes, Phylip, and Day's algorithm are not competitive with HashCS and PAUP*, so we do not consider them further in this paper. For our artificial tree collections, the number of taxa, n, varies from 128 to 1,024, and the number of trees, t, is 16,384. Our results on biological trees show that the speedup of HashCS is not impacted by t; instead, it is impacted by the number of taxa, n, so we fix t at 16,384 trees.
Fig. 4. Speedup comparison of HashCS over PAUP* on artificial and biological trees consisting of 150 and 567 taxa: (a) majority consensus; (b) strict consensus. The scale of the y-axis is different for all the plots.
Fig. 5. Speedup of the consensus tree algorithms on 16,384 artificial trees of varying taxa sizes and consensus tree resolution rates: (a) 50% resolution; (b) 75% resolution; (c) 100% resolution. The scale of the y-axis is different for all the plots.
To test the predictive capability of our model, Figure 4 shows the speedup of HashCS over PAUP* on biological and artificial trees of 150 and 567 taxa. The resolution rates of the consensus trees used to create the artificial trees are the same rates given in Section 5.1 for the biological trees. For example, in Figure 4(a), the majority resolution rate for both the biological and artificial trees on 567 taxa is 93%. The speedup obtained on our artificial trees is slightly lower than the speedup obtained on our biological trees; by using artificial trees in our experiments, we are therefore not overestimating the performance of the algorithms. Thus, Figure 4 provides evidence that our artificial tree collections are valid for predicting the performance of HashCS and PAUP* on larger and more diverse collections of trees. Figure 5 shows the resulting speedup of HashCS over PAUP*. Since PAUP*'s running time varies as a function of the distribution of the input trees, the speedup varies as well. PAUP* performs worst when constructing a 0% or 25% consensus tree, resulting in HashCS being up to 300 and 60 times faster than PAUP* for n = 1,024, respectively (not shown). At 50% consensus resolution, HashCS is up to 9 times faster than PAUP*. The speedup gap closes
as the 16,384 input trees become more similar. Hence, when the consensus tree is 100% resolved, HashCS is up to 2.3 times faster than PAUP*. Furthermore, the plots show that as the resolution decreases and the number of taxa increases, PAUP*'s majority algorithm is much slower than its strict consensus algorithm. In HashCS, constructing strict and majority trees requires about the same time.
7
Conclusions
In this paper, we performed an extensive empirical analysis of five different consensus tree implementations (HashCS, PAUP*, Day, Phylip, and MrBayes) on a diverse collection of phylogenetic trees. From life scientists, we obtained two collections consisting of 20,000 and 33,306 biological trees, which were the result of Bayesian analyses on 150- and 567-taxon molecular datasets, respectively. To predict how the consensus tree implementations would perform on even larger trees with greater numbers of taxa, we developed a model to generate collections of large-scale artificial trees. Our experimental results show that our HashCS implementation is the fastest approach for building large-scale consensus trees. The performance of HashCS relative to the other consensus tree implementations is not impacted by the number of trees in a collection; instead, the biggest speedup gains occur with increasing numbers of taxa. Hence, as trees increase in size, even more performance gains can be expected from HashCS. Furthermore, our artificial tree collections predict that HashCS will not be impacted by the amount of bipartition sharing among the trees in the collection. Fast algorithms improve the speed of exploration and thereby facilitate new discoveries. Hence, fast algorithms can be used in all aspects of reconstructing a phylogenetic tree, even if a given step is not currently a bottleneck. For the tree collections studied here, constructing the resulting consensus trees requires very little time, especially when compared to the phylogenetic analyses that produced our biological trees. Depending on the needs and patience of a phylogenetic researcher, a typical phylogenetic search can take a few hours to several months. Moreover, the time required to prepare the phylogenetic data (e.g., collecting organisms in the field, extracting molecular data in the lab, performing a multiple sequence alignment on the data) can take years, which dominates the time required for a typical phylogenetic search. Finally, our results indicate that HashCS could be used for more than post-processing trees. For example, it is difficult to determine whether a phylogenetic heuristic such as MrBayes has converged. With fast consensus algorithms, one can take a collection of trees sampled from tree space and construct their strict or majority tree; if the consensus tree has not changed significantly over some sufficient amount of time, then the search has potentially reached a local optimum and could be terminated. The resulting consensus resolution rates can vary significantly depending on the sampling of the trees in tree space; hence, HashCS is a good consensus approach to use, given that its fast performance is not impacted by the degree of bipartition sharing among
the trees. Finally, faster algorithms make it feasible to study the increasingly large tree collections produced by large-scale phylogenetic analyses (such as those geared toward building the Tree of Life) in a reasonable amount of time.
Acknowledgments
This work was funded by the National Science Foundation under grants DEB-0629849 and IIS-0713618. The authors would like to thank Nick Pattengale, Eric Gottlieb, and Bernard Moret for providing the code for Day's algorithm, and Matthew Gitzendanner, Paul Lewis, and David Soltis for providing the Bayesian tree collections used in this paper.
References
1. Swofford, D.L.: PAUP*: Phylogenetic Analysis Using Parsimony (and Other Methods), Version 4.0. Sinauer Associates, Sunderland, Massachusetts (2002)
2. Ronquist, F., Huelsenbeck, J.P.: MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19(12), 1572–1574 (2003)
3. Felsenstein, J.: Phylogenetic Inference Package (PHYLIP), version 3.2. Cladistics 5, 164–166 (1989)
4. Day, W.H.E.: Optimal algorithms for comparing trees with labeled leaves. Journal of Classification 2, 7–28 (1985)
5. Lewis, L.A., Lewis, P.O.: Unearthing the molecular phylodiversity of desert soil green algae (Chlorophyta). Syst. Biol. 54(6), 936–947 (2005)
6. Soltis, D.E., Gitzendanner, M.A., Soltis, P.S.: A 567-taxon data set for angiosperms: The challenges posed by Bayesian analyses of large data sets. Int. J. Plant Sci. 168(2), 137–157 (2007)
7. Amenta, N., Clarke, F., St. John, K.: A linear-time majority tree algorithm. In: Benson, G., Page, R.D.M. (eds.) WABI 2003. LNCS (LNBI), vol. 2812, pp. 216–227. Springer, Heidelberg (2003)
8. Boyer, R.S., Hunt Jr., W.A., Nelesen, S.: A compressed format for collections of phylogenetic trees and improved consensus performance. In: Casadio, R., Myers, G. (eds.) WABI 2005. LNCS (LNBI), vol. 3692, pp. 353–364. Springer, Heidelberg (2005)
9. Hunt Jr., W.A., Nelesen, S.M.: Phylogenetic trees in ACL2. In: Proc. 6th Int'l Conf. on ACL2 Theorem Prover and its Applications (ACL2 2006), pp. 99–102. ACM, New York (2006)
10. Goloboff, P.: Analyzing large data sets in reasonable times: solutions for composite optima. Cladistics 15, 415–428 (1999)
Counting Faces in Split Networks Lichen Bao and Sergey Bereg Department of Computer Science Erik Jonsson School of Engineering & Computer Science The University of Texas at Dallas Richardson, TX 75080, USA {lxb042000,besp}@utdallas.edu
Abstract. SplitsTree is a popular program for inferring and visualizing various phylogenetic networks including split networks. Split networks are useful for realizing metrics that are linear combinations of split metrics. We show that the realization is not unique in some cases and design an algorithm for computing split networks with minimum number of faces. We also prove that the minimum number of faces in a split network is equal to the number of pairs of incompatible splits.
1
Introduction
As genomic data has increased dramatically in recent years, evolutionary biologists need reliable data structures and fast computing tools to figure out the relationships, or phylogeny, of different species. Traditionally, phylogenetic trees are used for their simplicity and structure, but trees cannot represent recombinations, recurrent and back mutations, horizontal gene transfers, and other biological phenomena. Phylogenetic networks generalize phylogenetic trees and can represent more general phylogenies [9,10,12,14]. Minimizing the number of cycles is an important issue in optimizing phylogenetic networks; it has been addressed in a recent study of galled trees and blobbed trees [7,8]. Recombinations correspond to the cycles in phylogenetic networks such as galled trees or blobbed trees. Finding an evolutionary history for a set of sampled sequences with the minimum number of recombination events is a computationally very challenging problem [11]; the problem of minimizing the number of recombinations in a phylogenetic network constructed from binary DNA sequences is NP-hard [15]. Recently, we [2] studied the problem of minimizing the number of cycles and maximizing the fit value. There are several methods for phylogenetic network construction [3,6,8,12], and most of them have been incorporated into phylogenetic construction programs. SplitsTree is one of the most popular programs of this kind for inferring and visualizing various phylogenetic networks, including split networks. Split networks are useful for realizing metrics that are linear combinations of split metrics. A split metric corresponds to a split of the taxa X, which is a partition of X into two sets. Two splits are called compatible if one of the four split sets is a subset of another split set; otherwise they are called incompatible. As in [9], we
assume that taxa are in a circular order and the given splits are circular. We define a split network as any network realizing splits using ladders. There can be many split networks realizing the same set of splits. The main result of this paper is the following theorem. Theorem 1 (Minimum Number of Faces in Split Networks). Let X be a set of n taxa, let S be a set of m circular splits on X, and let t be the number of incompatible pairs in S. (i) The minimum number of faces in a split network realizing S is exactly t. (ii) A split network realizing S with the minimum number of faces can be computed in O(mn + mt) time.
2
Problem Description and Definitions
Circular order and circular split system. For a given set of taxa X = {x1, . . . , xn}, a split is a pair (A, B) of subsets of X such that A ≠ ∅, B ≠ ∅, A ∪ B = X, and A ∩ B = ∅. A set (system) Σ of splits is called circular [1] if there is an order of taxa x1, x2, . . . , xn such that for any split s ∈ Σ, there exist p and q with 1 < p < q ≤ n such that s = (A, X − A), where A = {xp, xp+1, . . . , xq}. Such an order of taxa is called circular. The algorithm for constructing split networks [9] computes a circular system of splits, which is then used to build a planar network. Two splits A = (A1, A2) and B = (B1, B2) are called compatible if one of the four intersections A1 ∩ B1, A1 ∩ B2, A2 ∩ B1, or A2 ∩ B2 is empty [4]; otherwise, A and B are called incompatible. A split system Σ is called compatible if any two splits from Σ are compatible; otherwise Σ is called incompatible. Compatible split systems correspond to phylogenetic trees and can be computed, for example, by the Neighbor-Joining method [13] from a given set of taxa and a distance matrix. We define a minimal cut for a split in planar graphs. Let G = (V, E) be a connected plane graph and let X be the set of taxa corresponding to the vertices on the outer face of G. Let C ⊆ E be a cut for G such that G − C consists of two connected subgraphs G1 = (V1, E1) and G2 = (V2, E2). If s = (X1, X2) is a split of X such that X1 ⊆ V1, X2 ⊆ V2, and X1 ∪ X2 = X, then C is called a cut for split s. Furthermore, if G − C + e is connected for any edge e ∈ C, then C is called a minimal cut for split s. In a graph G, the minimal cut for a split need not be unique: in the example shown in Figure 1, the minimal cut for the split ({1, 2, 6, 7}, {3, 4, 5}) could be either {e1, e2} or {e3, e4, e5}. Based on the definition of minimal cuts, we define split networks. Given a set of taxa X and a circular system of splits S, a plane graph G = (V, E) is called a split network for X if (i) X corresponds to vertices on the outer face of G, and (ii) E is the union of minimal cuts for the splits in S. We call the minimal cuts for splits in split networks ladders. The following theorem (proof omitted) allows one to construct different split networks using any order of splits and different ways to insert ladders.
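The compatibility test above translates directly into code; the sketch below also counts the incompatible pairs t, which by Theorem 1 is the minimum number of faces (splits are modeled as frozensets giving one side of the partition):

def compatible(a_side, b_side, taxa):
    """Splits (A1, A2) and (B1, B2) are compatible iff one of the four
    intersections A1∩B1, A1∩B2, A2∩B1, A2∩B2 is empty. Each split is
    given by one side (a frozenset); the other side is its complement."""
    a1, b1 = a_side, b_side
    a2, b2 = taxa - a1, taxa - b1
    return any(not (x & y) for x in (a1, a2) for y in (b1, b2))

def count_incompatible_pairs(splits, taxa):
    """t = the number of incompatible pairs: by Theorem 1, the minimum
    number of faces in a split network realizing these splits."""
    return sum(1 for i in range(len(splits)) for j in range(i + 1, len(splits))
               if not compatible(splits[i], splits[j], taxa))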
Fig. 1. Both {e1, e2} and {e3, e4, e5} are minimal cuts for the split ({1, 2, 6, 7}, {3, 4, 5})
Theorem 2. Let N be a split network for split set Σ and let s = ({xi, . . . , xj}, {xj+1, . . . , xi−1}) be a new split. Let a be any vertex on the boundary path from xj to xj+1 and let b be any vertex on the boundary path from xi−1 to xi. A split network for Σ + s can be found as follows: take any shortest path P from a to b, double it, cut N along P, and connect the corresponding vertices. The number of faces in the computed split network can vary; see Fig. 2 for an example. We address the problem of constructing a split network with the minimum number of faces for a given set of circular splits S and taxa set X.
Fig. 2. Two split networks for the same set of splits. (a) Split network with 3 faces, which is minimum by Theorem 1 since there are 3 incompatible pairs of splits. (b) Split network with 4 faces.
Depending on the construction, i.e., on how the shortest path is chosen and doubled, the number of cycles varies. However, the minimum number of cycles needed for the final split network depends only on the set of circular splits. In other words, we can find an algorithm that constructs a split network for X based on S with the minimum number of cycles.
There can be many ways of adding a ladder according to Theorem 2, and a random shortest path may result in a different number of faces. Consider the example in Figure 3. Suppose that we want to add a new split ({1, 2, 9, 10, 11, 12}, {3, 4, 5, 6, 7, 8}). If we take the path a . . . cc′d′, we generate four new faces; if we take the path b . . . d, we add only two new faces.
Fig. 3. Paths a . . . cc′d′ and b . . . d are both boundary-to-boundary paths for the split ({1, 2, 9, 10, 11, 12}, {3, 4, 5, 6, 7, 8})
How can we find the path that results in the smallest total number of faces? We show that the algorithm from Theorem 2 can be used with an arbitrary order of splits, but the path should be chosen carefully.
3
Network Construction
Here we describe an algorithm for constructing the split network. Given a set of taxa X and a set of circular splits S for it, we do the following to construct the split network.
1. Start the network with a star whose edges correspond to all the trivial splits from S.
2. Take any order of the non-trivial splits of S and insert these splits one by one, using the iteration in step 3, to generate the final split network.
3. Let s = ({xi, . . . , xj}, {xj+1, . . . , xi−1}) be the split to be added to network N. Suppose xi−1, xi, xj, xj+1 is the clockwise order of these four points along the circle. Take the path clockwise along the boundary from xj to xj+1 as P1 and the path clockwise along the boundary from xi−1 to xi as P2. Make a virtual vertex r1 connected to all the intermediate vertices of P1 with virtual edges and another virtual vertex r2 connected to all the intermediate vertices of P2 with virtual edges. Then run the breadth-first search (BFS) algorithm to find the shortest path from r1 to r2. Delete the virtual edges (the first edge and the last one) of the shortest path to produce P, and remove all the virtual edges from the graph. Cut N along P into two sub-graphs and keep the path in
each of them by doubling path P. Add edges to connect the corresponding vertices of the two paths. Figure 4 illustrates the algorithm. For example, split s5 = ({1, 2, 7}, {3, 4, 5, 6}) is inserted into the network in (e), producing the network in (f), as follows. In the third step of the algorithm, two paths P1 = 2 . . . a . . . 3 and P2 = 6b7 are found (here xi = 7 and xj = 2). Then the path P = acb is found; P is the shortest path from a vertex in P1 to a vertex in P2. Cut the network in (e) along P = acb by doubling P: make a copy P′ = a′c′b′ of P, reconnect the edges on the two sides of P using paths P and P′, and add edges between the corresponding vertices of P and P′ (edges (a, a′), (b, b′), (c, c′)).
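The virtual vertices r1 and r2 are equivalent to seeding a multi-source BFS from P1 and stopping at the first vertex of P2; a sketch of that equivalent formulation (adjacency given as a dict) follows:

from collections import deque

def shortest_ladder_path(adj, P1, P2):
    """Step 3 sketch without explicit virtual vertices: a BFS seeded with
    all intermediate vertices of P1 and stopped at the first vertex of P2
    finds the same shortest path P as BFS from r1 to r2 would.
    `adj` maps each vertex to an iterable of its neighbours."""
    targets, seen = set(P2), set(P1)
    queue = deque((v, [v]) for v in P1)
    while queue:
        v, path = queue.popleft()
        if v in targets:
            return path                    # the path P to cut N along
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append((w, path + [w]))
    return None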
Fig. 4. (a) Starting from trivial splits. (b) Insert split s1 = ({1, 2, 3}, {4, 5, 6, 7}). (c) Insert split s2 = ({1, 2, 6, 7}, {3, 4, 5}). (d) Insert split s3 = ({1, 2}, {3, 4, 5, 6, 7}). (e) Insert split s4 = ({1, 2, 3, 4, 5}, {6, 7}). (f) Insert split s5 = ({1, 2, 7}, {3, 4, 5, 6}).
The ladder we insert in each iteration of step 3 of the algorithm is short in the following sense. Let L be the ladder corresponding to split s, and let π1 and π2 be the two parallel paths of L, with |π1| = |π2|. As before, P1 and P2 are the two boundary paths from step 3 of the algorithm. If π1 is a shortest path between two vertices v1 ∈ P1 and v2 ∈ P2, then L is short. Clearly, immediately after step 3 the new ladder is short. Note that the old ladders may change after this iteration; we prove that they remain short in the following theorem. Theorem 3. The shortest path found in each iteration of step 3 of the algorithm above is composed of all and only the edges corresponding to the splits that are incompatible with split s.
Proof. We prove the theorem by induction, together with the property that all ladders are short. The base case: in the first step we have just a star, and it is easy to verify that the claim holds. For the inductive step, we prove the following statement: if, at the beginning of an iteration, all ladders are short, then (a) the shortest path found in this iteration is composed of all and only the edges corresponding to the splits that are incompatible with split s, and (b) the ladders after this iteration are short. We prove (a) first, using the notation N, s, P1, P2 from the algorithm. Suppose s = ({xi, . . . , xj}, {xj+1, . . . , xi−1}) is the split that we want to add to N, and let P be the shortest path from a vertex on the boundary path from xj to xj+1 to a vertex on the boundary path from xi−1 to xi. We show that the edges of P correspond to all and only the splits in N that are incompatible with s. 1. For any split in N that is incompatible with s, the corresponding ladder has one end on the boundary from xi to xj and the other on the boundary from xj+1 to xi−1. Since P connects P1 and P2, it crosses the ladder. Therefore, every split in N that is incompatible with s has a corresponding edge in P.
Fig. 5. P has edge cd corresponding to a split scd which is compatible with s
2. We prove that every edge in P corresponds to a split incompatible with s. Suppose, on the contrary, that P contains an edge cd corresponding to a split scd that is compatible with s; see Fig. 5. Suppose P = a . . . cd . . . b, where a is on P1 and b is on P2. Let ef and gh be the two edges of N that are the two ends of the ladder for scd. Since the ladder for scd is short, both paths e . . . c . . . g and f . . . d . . . h are shortest paths. Then the paths a . . . c, g . . . c, and h . . . d have the same length, and the path h . . . d . . . b is shorter by one than path P. This contradicts the fact that P is the shortest path from P1 to P2. Thus, (a) holds. Now we prove (b). Let L′, for split s′, be any ladder at the end of the iteration. If s′ = s, then L′ = L is short. Suppose s′ ≠ s. If P does not cross L′, then L′ does not change when we insert L, and ladder L′ is still short since distances cannot decrease with the insertion of L. Suppose P crosses L′; then by property (a),
Fig. 6. (a) Network before the iteration. (b) Network after the iteration.
s and s′ are incompatible; see Fig. 6. Let N be the network before the iteration and N′ the one after. Let π1′ and π2′ be the parallel paths of L′. We prove that L′ is short in N′. Consider any boundary-to-boundary path π′ corresponding to s′; since s and s′ are incompatible, π′ crosses L in N′. The lengths of π′ and π1′ are shorter by one in network N. Since π1′ is the shortest in N, it is still the shortest in N′. Thus, (b) holds too.
Proof of Theorem 1. We show (i) first. Let t be the number of incompatible pairs of splits. Every incompatible pair of splits corresponds to a face which is the intersection of two ladders, and clearly different pairs correspond to different faces. The lower bound on the number of faces in a split network follows. On the other hand, our algorithm constructs a split network N with t faces. Indeed, let $a_i$, i = 1, 2, . . . , m, be the number of splits from {s1, s2, . . . , si−1} incompatible with si; by Theorem 3, the insertion of si creates exactly $a_i$ new faces, so the number of faces in N is $\sum_{i=1}^{m} a_i = t$. To prove (ii), it remains to prove the running time bound. Consider the insertion of the ith split si. By the above argument, the number of faces in the current network is $\sum_{k=1}^{i-1} a_k$. Since every face is a quadrilateral, the number of vertices and the number of edges in the network are bounded by $n + O(\sum_{k=1}^{i-1} a_k)$. Then BFS takes at most $n + O(\sum_{k=1}^{i-1} a_k)$ time, and the total time is
$$O\Big(\sum_{i=1}^{m}\Big(n + \sum_{k=1}^{i-1} a_k\Big)\Big) = O(mn + mt).$$
Theorem 1 follows.
4
Experiments and Results
4.1
Different Shortest Path and Face Movement
Here we show some examples with multiple shortest paths in some of the iterations of step 3 of the algorithm. This shows that our algorithm can give different networks with the same number of faces, which equals the number of incompatible
Fig. 7. (a) Split network to add split ({1, 2, 3, 10, 11, 12}, {4, 5, 6, 7, 8, 9}). (b) Use the left-most path to insert the ladder. (c) Use the right-most path to insert the ladder. (d) Use some path in the middle to insert the ladder.
split pairs. In Figure 7(a), within the area with fat edges, any shortest path can be chosen from the top vertex to the bottom vertex to insert the ladder for split ({1, 2, 3, 10, 11, 12}, {4, 5, 6, 7, 8, 9}). Figures 7(b), (c), and (d) show different choices of the path. The example in Figure 7 also shows that existing faces can be placed differently in the resulting network by using different shortest paths. See Figure 8 for another example; it shows a ladder for one split in gray. There are 10 faces above the ladder in Figure 8(a) and 5 faces below it, whereas in Figure 8(b) there are 5 faces above the ladder and 10 below. Both networks in Figure 8 correspond to the same set of splits, and Figure 8 gives a symmetric case (one network can be obtained from the other by rotating 180°). Figure 9 shows a stronger contrast in the faces separated by the ladder: there are 10 faces above the ladder but none below in Fig. 9(a), while in (b) there are 5 faces above the ladder and 5 below. Since the weights of splits can be relatively large, some faces in a split network can be large. In Figure 10, the gray face (which could be even larger) is located differently relative to the path connecting the two clusters of taxa, a1, a2, a3, a4 and b1, b2, b3, b4. All the face movements and shortest path choices shown in the above examples demonstrate the possibilities for different topologies for a set of taxa and given circular splits. These networks have the same number of faces and the same fit values, but they represent different evolutionary pathways.
Fig. 8. (a) 10 faces above the ladder and 5 faces below it. (b) 5 faces above the ladder and 10 faces below it.
Fig. 9. (a) 10 faces above the ladder and 0 faces below it. (b) 5 faces above the ladder and 5 faces below it.
Fig. 10. (a) The gray face is above the shortest path for the two clusters. (b) The gray face is below the shortest path.
4.2
Dusky Dolphins Example
This example is taken from an article by Cassens et al. [5] that studies the phylogeography of dusky dolphins (Lagenorhynchus obscurus) and compares a number of different network methods. There are 33 taxa corresponding to different haplotypes seen in 124 individuals sampled off Peru, Argentina, and Southwest Africa. These haplotypes are obtained from variable positions in the DNA sequences of the full mitochondrial cytochrome b gene. A recent paper [9] reanalyzed these data using methods from SplitsTree 4, including the neighbor-joining tree, the confidence network, the consensus network, the split decomposition network, and the neighbor-net network. Figure 11 shows the weighted network generated by the neighbor-net method of SplitsTree 4. We applied our algorithm and found several split networks; Figure 12 shows parts of two different split networks, where Figure 12(a) corresponds to the split network produced by SplitsTree 4.
Fig. 11. Neighbor-net network for dusky dolphins constructed by SplitsTree 4
Fig. 12. (a) Part of the network for dusky dolphins by SplitsTree 4: gray faces are large in the corresponding weighted network. (b) Gray faces move away from the boundary by choosing a different shortest path for the split.
All weights are equal in these figures. Two faces f1 and f2 correspond to two large faces in Fig. 11 along the path from A18 to A8.2. Our network is shown in Fig. 12(b), where faces f1 and f2 are in different locations. The sizes of the two faces will therefore differ, and the layout of the corresponding weighted network will be very different.
5
Conclusions
We studied the problem of finding different split networks minimizing the number of faces. We proved that the minimum number of faces equals the number of incompatible pairs of splits and designed an algorithm for constructing such networks. Future directions. Recently, we [2] proposed a new kind of phylogenetic network called a CS-network, with one of the faces being a k-gon, k ≥ 3. We will work on face counting in clustered split networks. We will also explore new applications of our split networks in biology: because our algorithm can construct several split networks, they may be useful for other problems in evolutionary biology, for example, identifying reassortment in influenza viruses.
References
1. Bandelt, H., Dress, A.: A canonical decomposition theory for metrics on a finite set. Advances in Mathematics 92, 47–105 (1992)
2. Bao, L., Bereg, S.: Clustered SplitsNetworks. In: Yang, B., Du, D.-Z., Wang, C.A. (eds.) COCOA 2008. LNCS, vol. 5165, pp. 469–478. Springer, Heidelberg (2008)
3. Bryant, D., Moulton, V.: NeighborNet: An agglomerative method for the construction of planar phylogenetic networks. In: Guigó, R., Gusfield, D. (eds.) WABI 2002. LNCS, vol. 2452, pp. 375–391. Springer, Heidelberg (2002)
4. Buneman, P.: The recovery of trees from measures of dissimilarity. In: Mathematics in the Archaeological and Historical Sciences, pp. 387–395 (1971)
5. Cassens, I., Van Waerebeek, K., Best, P.B., Crespo, E.A., Reyes, J., Milinkovitch, M.C.: The phylogeography of dusky dolphins (Lagenorhynchus obscurus): a critical examination of network methods and rooting procedures. Mol. Ecol. 12(7), 1781–1792 (2003)
6. Dress, A., Huson, D.: Constructing splits graphs. IEEE/ACM Transactions on Computational Biology and Bioinformatics 1, 109–115 (2004)
7. Gusfield, D., Bansal, V.: A fundamental decomposition theory for phylogenetic networks and incompatible characters. In: Miyano, S., Mesirov, J., Kasif, S., Istrail, S., Pevzner, P.A., Waterman, M. (eds.) RECOMB 2005. LNCS (LNBI), vol. 3500, pp. 217–232. Springer, Heidelberg (2005)
8. Gusfield, D., Eddhu, S., Langley, C.: The fine structure of galls in phylogenetic networks. INFORMS Journal on Computing 16, 459–469 (2004)
9. Huson, D.H., Bryant, D.: Application of phylogenetic networks in evolutionary studies. Molecular Biology and Evolution 23(2), 254–267 (2006)
10. Linder, C.R., Warnow, T.: Overview of phylogeny reconstruction. In: Aluru, S. (ed.) Handbook of Computational Biology. CRC Computer and Information Science Series. Chapman & Hall, Boca Raton (2005)
11. Lyngsø, R.B., Song, Y.S., Hein, J.: Minimum recombination histories by branch and bound. In: Casadio, R., Myers, G. (eds.) WABI 2005. LNCS (LNBI), vol. 3692, pp. 239–250. Springer, Heidelberg (2005)
12. Moret, B.M.E., Nakhleh, L., Warnow, T., Linder, C.R., Tholse, A., Padolina, A., Sun, J., Timme, R.E.: Phylogenetic networks: Modeling, reconstructibility, and accuracy. IEEE/ACM Trans. Comput. Biology Bioinformatics 1(1), 13–23 (2004)
13. Saitou, N., Nei, M.: The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4, 406–425 (1987)
14. Song, Y.S., Wu, Y., Gusfield, D.: Efficient computation of close lower and upper bounds on the minimum number of recombinations in biological sequence evolution. Bioinformatics 21, i413–i422 (2005)
15. Zahid, M.A.H., Mittal, A., Joshi, R.C.: A pattern recognition-based approach for phylogenetic network construction with constrained recombination. Pattern Recognition 39(12), 387–395 (2006)
Relationship between Amino Acids Sequences and Protein Structures: Folding Patterns and Sequence Patterns Alexander Kister Department of Health Informatics, SHRP, University of Medicine and Dentistry of New Jersey, Newark, NJ, 07107,USA
Abstract. Two crucial problems of protein folding are considered here. First, we consider the hypothesis that all proteins with an identical SSS, regardless of the degree of sequence identity among their sequences, share a common sequence pattern. To find conserved positions and create a sequence pattern, a new algorithm for structure-based multiple sequence alignment was developed. An essential feature of the algorithm is that the alignment is based on residues that form hydrogen bond contacts between strands in protein structures. It was shown that SSS-specific sequence patterns have very high sensitivity for identifying protein structure and can be used for SSS prediction without any prior structural information. Second, we consider the rules by which secondary structure elements (beta strands) come together into a supersecondary structure (SSS), i.e., folding patterns. Knowledge of these patterns, which describe the spatial arrangement of strands, will likely prove useful in protein structure prediction. Keywords: Beta-sandwich proteins, supersecondary structure, protein folding, protein structure classification, protein structure prediction, sequence alignment.
1 Introduction
The present research addresses two crucial problems in bioinformatics: 1) How to derive a sequence pattern specific for a given 3D structure. A sequence pattern may be stated as a template of sequences, which describes certain positions in sequences, the set of possible residues at these positions, and the distances (numbers of undefined residues) between these positions. Two questions should be discussed in this context: how to derive the sequence pattern, and does it exhibit both high sensitivity and high specificity; in other words, does the sequence pattern detect sequences in all or most of the proteins with a given 3D structure? A fundamental principle that governs the sequence-structure relationship of proteins states that all information about the native structure of a protein is encoded in its amino acid sequence [1-2]. This implies that similar sequences encode similar structures. The idea that sequence similarity translates into structural similarity underlies most modern high-accuracy algorithms of structure prediction (see recent reviews [3-4]). It was shown that the structure of a protein whose sequence has over 30% identity to a known structure can often be predicted with an accuracy equivalent to a low-resolution
X-ray structure [5]. This very important observation provides the threshold for structure prediction and indicates that a relatively small number of residues in a sequence are critical to structure formation [6]. In this research, the sequence-supersecondary structure relationship is examined. The reason for focusing on the relation between primary sequence and SSS, rather than the usually considered relationship between sequence and tertiary structure, is that the definition of SSS identity is much more rigorous than the semi-quantitative notion of 3D structure similarity. For example, beta-sandwich proteins are said to have identical SSS if they have the same number of strands and the same order (arrangement) of strands in each of their two beta sheets. It is important to note that proteins with identical SSS may differ markedly in the number and composition of residues within strands and loops, so the similarity of the sequences of these proteins may be below the 'twilight zone'. This research evaluates the hypothesis that proteins with an identical supersecondary structure (SSS) share a unique set of SSS-determining residues. The residues at conserved positions will be referred to as 'SSS-determining residues', because they are presumed to be critical to SSS formation. To prove the hypothesis of the uniqueness of SSS-determining residue sets, it is necessary to demonstrate that even markedly dissimilar sequences with the same SSS share the same SSS-determining residues, and that this set of residues is not present in sequences with a different SSS. For comparison of dissimilar sequences that share the same SSS, a novel supersecondary structure-based multi-sequence alignment algorithm was developed [7-8]. There are two main features that distinguish the proposed structure-based method from traditional alignment techniques: 1. The units of alignment are strands and loops, rather than whole sequences; 2. The basis of alignment of strands is the residues that form inter-strand hydrogen bonds. The H-bonded residues play the most important role in supersecondary structure formation, because they determine the arrangement of strands in beta sheets. This 'hydrogen bond residues'-based alignment allows one to compare sequences with very low similarity (below the 'twilight zone'), which would not have been possible using extant alignment techniques, and to identify the positions that are conserved across all sequences with the same SSS. For analysis of the correlation between amino acid sequences and SSS, a group of sandwich-like proteins (SP) was selected. Proteins in this group are very diverse and belong to different protein families with very low sequence similarities, but they have the same SSS. The results of the sequence alignment of the proteins using the novel algorithm revealed that up to 25% of positions in the sequences are conserved positions occupied by either exclusively hydrophobic or exclusively hydrophilic residues. It was shown that residues at the conserved positions form an SSS-specific sequence pattern. The pattern has very high sensitivity and specificity for identifying protein structure and can be used for protein classification as an amino acid 'tag' that brands a sequence as having a particular SSS. As such, the SSS sequence pattern can be used for SSS prediction without any prior structural information. An important conclusion follows from the results of this work: the protein sequence/structure relationship cuts both ways.
That is, the sequence of amino acids determines the folding pattern, and the structure defines key positions in sequences and the residues at these positions. The second main problem addressed in this paper is uncovering the main regularities in the arrangement of secondary structural units in the SSS. For over 30 years, researchers have
looked at how strands and helices assemble into SSS [9-12]. In this research we describe a representation of SPs in terms of a novel supersecondary structural unit, termed a strandon. A strandon is defined as a set of sequentially consecutive strands that are connected by hydrogen bonds in beta sheets, and thus are close in space as well. The advantage of looking at protein structure as an arrangement of strandons is that it enables us to find specific folding patterns of strand packing in sandwich-like protein structures. These patterns sharply delimit the number of possible structural permutations of strandons and strands in space, and these restrictive patterns help us to understand why only a limited number of different protein folds are observed.
2 Sequence Patterns. Methods and Results
2.1 Study Material
Beta-sandwich proteins are a large group of beta proteins. In the 3D classifications of protein structures, the SCOP and CATH databases, structures of SP are defined as two beta sheets packed face-to-face [13-14]. The supersecondary structures of these proteins can be rigorously defined by specifying the number and order (arrangement) of strands in each of their two beta sheets (Fig. 1); SPs with the same number and order of strands in each beta sheet have identical SSS. Proteins with the SSS shown in Fig. 1 belong to 3 superfamilies and 3 families (see the legend of Table 1). According to the supersecondary structure classification of proteins in the SSS database, there are 418 proteins with the given SSS. Sequences from different families are strongly dissimilar.
I: 1–2–5–4 II: 7–6–3
Fig. 1. Schematic representation of the arrangement of strands and strandons in the SSS. The numbers designate the strands that make up sheets I and II.
For example, alignment of two sequences, 1ncw
and 1f42, using the standard global alignment program EMBOSS/Needle, shows 13.5% identity and 17.6% similarity, and alignment of the sequences of structures 1f42 and 1oke shows 2% identity and 3.5% similarity. Alignment by the BLASTP program reports no significant similarity for these sequences; these widely used alignment algorithms are thus not applicable to the analysis of sequences with very low sequence similarity. Therefore, for comparison of sequences of beta proteins that share the same SSS, a novel supersecondary structure-based multi-sequence alignment algorithm was developed. The collection of these distantly related amino acid sequences from different protein families with different biological functions is the best object for uncovering conserved positions crucial for SSS formation: the single common property of these proteins is the same SSS, which allows us to avoid common conserved positions specific for biological functions.
2.2 Algorithm for Sequence Alignment of Proteins with the Same SSS, but Widely Dissimilar Sequences
The goal of this investigation is to identify, through alignment, the key residues responsible for formation of a particular SSS. However, it is known that the most popular heuristic methods of sequence alignment, such as Needle, BLAST, and HMM-based methods, work best with closely related sequences; when it comes to very distantly related sequences, the alignment may have little or no biological significance. Therefore, for comparison of sequences of beta proteins that share the same SSS, a novel supersecondary structure-based multi-sequence alignment algorithm was developed. The essential feature of the algorithm is that the alignment procedure is performed separately for each strand and loop.
Two rules for the alignment of residues in strands. The alignment of corresponding strands is based on the alignment of residues that form hydrogen bond contacts between strands in beta sheets.
Rule 1. If the main chain atoms of residue a and residue a' form an H-bond in one protein, and residue b forms a contact with residue b' in another, then if a and b are assigned the same position index, a' and b', too, will have the same position index. This rule can be illustrated with the example of structures A and B shown in Fig. 2. Consider residue a1 in strand 1 of structure A, which forms an inter-strand hydrogen bond with residue a'1 in strand 2. There is an analogous pair of residues in structure B, residues b1 and b'1, which form a hydrogen bond contact between strands 1 and 2. If we align residue a1 with residue b1, then Rule 1 dictates that residues a'1 and b'1 also be aligned with each other.
Rule 2. No gaps are allowed within strands. In other words, consecutive residues in each strand are always assigned consecutive position indices.
From the two rules above, it follows that if residue a1 is aligned with residue b1, then the immediately downstream residues a2 and a3 in strand 1 of structure A must be aligned with residues b2 and b3 in strand 1 of structure B (Fig. 2). Likewise, residues a5 and a'3 in strand 2 of structure A must be aligned with residues b8 and b'3 in structure B. Thus, after the initial alignment of a pair of H-bond-forming residues and systematically invoking the two rules, one can unambiguously align all residues in a beta sheet. Once these structurally important residues have been aligned, all other residues are assigned their position indices in the sequence depending on how far 'downstream' of the nearest H-bonded residue they are. It is clear from the above discussion that the alignment of residues depends on the initial choice of the H-bonded residues that serve as a 'nucleus' of the alignment. Usually strands are connected by 2-4 hydrogen bonds in a beta sheet, so the total number of possible variants is quite limited - just 24 variants per beta sheet. All these possible variants of strand alignment need to be considered; the best variant of alignment is the one with the maximal number of conserved positions, each of which is occupied by similar residues across all sequences with the same SSS.
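Rules 1 and 2 define a propagation process: one seed pair of H-bonded residues determines the alignment of the whole sheet. A sketch follows, with residues addressed as (strand, offset) pairs and the H-bond maps assumed as inputs (a hypothetical representation, not the paper's implementation):

def align_sheet(hbonds_a, hbonds_b, strand_len_a, strand_len_b, seed):
    """Propagate one seed alignment across a beta sheet. Residues are
    (strand, offset) pairs; hbonds_* map a residue to its H-bond partner;
    strand_len_* give each strand's length. Returns residue-pair alignment."""
    pairs, stack = {}, [seed]
    while stack:
        a, b = stack.pop()
        (sa, ia), (sb, ib) = a, b
        if a in pairs:
            continue
        if not (0 <= ia < strand_len_a[sa] and 0 <= ib < strand_len_b[sb]):
            continue                       # ran off the end of a strand
        pairs[a] = b
        for d in (-1, 1):                  # Rule 2: no gaps within strands
            stack.append(((sa, ia + d), (sb, ib + d)))
        if a in hbonds_a and b in hbonds_b:
            stack.append((hbonds_a[a], hbonds_b[b]))   # Rule 1
    return pairs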
Fig. 2. Beta sheets with three anti-parallel strands in structures A and B. The strands are schematically shown by arrows; the hydrogen bonds between residues are indicated by dotted lines between strands.
Alignment of residues in loops. The multiple sequence alignment is performed independently for each loop. First, all sequences that correspond to the loops between strands 1 and 2 are aligned among themselves; then the same procedure is followed for the loops between strands 2 and 3, and so forth. Because the conformation and residue content of loops may be very variable in different proteins, no structural data are used in loop alignment. In this work, the multiple sequence alignment of loops was carried out by hand to generate gaps in sequences.
2.3 Optimal Sequence Alignment
The precise definition of what constitutes 'residue similarity' for purposes of alignment varies among different investigators. In this research, the hydrophobicity or hydrophilicity of residues was chosen as the criterion of residue similarity. A position was classified as 'conserved hydrophobic' or 'conserved hydrophilic' if all residues found at this position (with one exception allowed) belong either to the hydrophobic (V, I, L, M, F, W, and C) or to the hydrophilic (Q, N, E, D, R, K, H, T, S, G, and P) group of residues. From our observation of the SP sequences, two residues, A and Y, were found with roughly equal frequency both in hydrophobic conserved positions in strands and in hydrophilic conserved positions in loops. These two residues were therefore considered as either hydrophobic or hydrophilic, in strands and loops respectively, for the purposes of identifying conserved positions in sandwich-like proteins.
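The classification rule for conserved positions can be stated as a short function; the one-exception allowance and the special treatment of A and Y follow the text above (a sketch, assuming one alignment column per call):

HYDROPHOBIC = set("VILMFWC")
HYDROPHILIC = set("QNEDRKHTSGP")
AMBIGUOUS = set("AY")        # counted as fitting either group (see text)

def classify_position(column):
    """Classify one alignment column: conserved hydrophobic/hydrophilic if
    all residues, with at most one exception, belong to one group."""
    def exceptions(group):
        return sum(1 for r in column if r not in group and r not in AMBIGUOUS)
    if exceptions(HYDROPHOBIC) <= 1:
        return "conserved hydrophobic"
    if exceptions(HYDROPHILIC) <= 1:
        return "conserved hydrophilic"
    return "not conserved"

# e.g. classify_position("VVILAVAWFF") -> "conserved hydrophobic"
# (this column is the first strand-1 position of Table 1)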
2.4 SSS-Determining Residues

Selection of sequences for the alignment procedure makes use of the hierarchical SCOP protein classification. The smallest unit of protein families in SCOP is the "cluster". Sequences in different clusters may diverge considerably even within one family; therefore, the multiple sequence alignment was carried out with one randomly selected sequence per cluster. According to the supersecondary structure classification, the 607 proteins with the given SSS fall into 13 clusters. For the SSS-based sequence alignment, 10 representative sequences from different clusters were selected (the "training" data set). The alignment revealed 26 conserved positions, 12 of them hydrophilic and 13 hydrophobic (Table 1). Residues at these conserved positions form an SSS-determining pattern (Fig. 3). The syntax of the patterns is nearly identical to that of PROSITE patterns. Each conserved position can be occupied by any residue from the set shown in square brackets. Non-conserved positions between two consecutive conserved positions are marked by "X". If the number of non-conserved positions between two consecutive conserved positions varies across sequences, the minimum and maximum distances from one conserved position to the next are shown in parentheses.

Table 1. Sequence alignment of proteins

PDB code   chain   start   end
1kcr       H       117     218
1m7d       B       114     213
2fbj       H       119     220
1ow0       A       242     342
1fp5       A       336     438
1hxm       A       121     206
1c16       A       181     276
1svb       -       303     395
1oke       A       298     398
1f42       A       88      211

(The aligned residue columns for strands 1-7 and the intervening loops are not reproduced here.)
The PDB code of each protein is given in the first column, the chain name in the second column ('chain'), and the beginning and end positions of each sequence in a domain in the third and fourth columns ('start' and 'end'), respectively. The sequences of the longer loops are not shown in their entirety; the skipped segments of loops are indicated by ellipses ("..."). The spaces within the sequences separate secondary structure units (strands and loops). The conserved hydrophobic and hydrophilic positions are shown in gray. Boxed residues at the conserved positions are the singular exceptions (non-hydrophilic residues at hydrophilic positions, or vice versa). Per the SCOP database, structures 1C5C, 1NCW, 1KCR, 1M7D, 1IK3, 1FBJ, 1OW0, 1FP5, 1HXM, and 1C16 belong to the Immunoglobulin superfamily, family C1-set domains; structures 1SVB and 1OKE to the E-set domains superfamily, family Class II viral
fusion proteins C-terminal domain; and 1F42 to the Fibronectin superfamily, family Fibronectin type III.

A set of SSS-determining residues has high specificity and sensitivity if scanning a large number of diverse proteins with it detects all or almost all of the proteins with the given SSS while giving no or very few false positive results. In this work, 50,577 very diverse sequences of proteins with known 3D structures collected in the SCOP database were scanned using the set of SSS-determining residues. The EMBOSS preg program, which matches a regular expression against protein sequences, was used [15]. The program rapidly scans all analyzed sequences and reliably identifies which of them contain the set of SSS-determining residues. The search picked up 340 true positives, 78 false negatives, and 8 false positives. The next stage of the analysis was to allow a single mismatch in any one of the 26 conserved positions; this yielded 75 additional true positive sequences and 17 additional false positives. In total, 415 true positive sequences out of 418 were found, with 25 false positives (sensitivity and specificity are thus both close to 100%). The remaining 3 proteins that were not identified in the search had 2 or 3 mismatching positions. This result suggests a very important conclusion: substitution of a hydrophilic for a hydrophobic residue, or vice versa, is allowed at one or two conserved positions at the most. (A sketch of such a pattern scan is given after Fig. 3.)

Strand 1, Loop, Strand 2, Loop:
  [STK][O≠CM]  (4,14)X  [B][GS][B]  (1,9)X [LIVF]X[CMVW]X[O≠CM]  (1,4)X [PGS]X[PGRKD]
Strand 3, Loop, Strand 4, Loop:
  (0,4)X [VMICL]X[VILF][STKNP][WLIVF]  (2,5)X [GS][ASGKE]  (2,8)X [VFMILA]  (4,12)X [SGP]
Strand 5, Loop, Strand 6:
  (6,10)X [VLMI][VLYAC]  [PTGEQS] (0,2)X [SATGP]  (2,12)X [YIVF]X[CIV]
Loop, Strand 7:
  (0,4)X [PDEGK]  (3,7)X [O≠CM]2X [KENTS]
Fig. 3. SSS-determining residues in the sequence pattern
Residues that may occupy the conserved positions are shown in square brackets. The hydrophobic residues V, L, I, M, F, Y, W, A, and C are designated by 'O'; the hydrophilic residues K, R, E, D, Q, N, H, S, T, G, and P are designated by 'B'; the polar residues K, R, E, D, Q, and N are designated by 'Z'. The '≠' sign means 'except for the following residues'; for example, O≠MW means all hydrophobic residues except M and W. The expression '(d,r)X' indicates that the minimum number of residues between two consecutive conserved positions is d and the maximum is r.
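In the spirit of the preg scan described above, a pattern of this kind can be matched with an ordinary regular-expression engine. The fragment below is a hedged illustration covering only the first few conserved positions of Fig. 3; it is not the full pattern used in the study, and Python's re module merely stands in for EMBOSS preg.

    import re

    HYDROPHILIC = "QNEDRKHTSGP"          # the 'B' class

    # Strand 1 and the following positions: [STK][O≠CM] (4,14)X [B][GS][B]
    fragment = re.compile(
        "[STK]"                          # first conserved position
        "[VLIFYWA]"                      # O≠CM: hydrophobic except C and M
        ".{4,14}"                        # (4,14)X: 4-14 non-conserved positions
        "[" + HYDROPHILIC + "][GS][" + HYDROPHILIC + "]"
    )

    def has_fragment(seq):
        return fragment.search(seq) is not None

Allowing a single mismatch, as in the second stage of the analysis, can be implemented by generating 26 relaxed variants of the pattern, one per conserved position.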
3 Folding Pattern

The folding pattern is considered here as the set of rules pertaining to the organization of the secondary structure elements (strands) in the SSS. To draw conclusions about secondary and supersecondary structural features, an analysis of about 8000 structures from 177 families in the SCOP database was performed. The two steps in the analysis of the
Fig. 4. Arrangement of strands and strandons in a protein domain. a) Determination of the SSS motif: arrangement of strands; b) determination of the SSS supermotif: arrangement of strandons. Strandons are shown by ovals.
proteins are as follows. The first step is the construction of SSS motifs: determination of the arrangement of strands in a domain (Fig. 4a). The second step is the construction of SSS supermotifs: determination of the arrangement of strandons in the two main sheets of SPs. A strandon is defined as a set of sequentially consecutive strands that are connected by hydrogen bonds in a beta sheet (Fig. 4b). For example, the two consecutive Strands 4 and 5 form a strandon because they are H-bonded to each other. Strand 3, which is next to Strand 4 in the sequence, is not part of this strandon because it has no H-bonds with Strand 4. Analogously, Strand 6 is not hydrogen bonded to Strand 5, so Strand 6 is not included in the strandon. By definition, a single strand that is not hydrogen bonded to either of its sequential neighbors constitutes a strandon of its own; this is the case for Strands 1, 2, and 3 in Figure 4b. Describing an assembly in terms of strands and strandons exposes the underlying regularities in SSS and enables us to construct a novel classification scheme for SPs based on motifs and supermotifs. Protein structures with an identical number and arrangement of strands in the two main beta sheets fall into one motif. Motifs with an identical number and arrangement of strandons form one supermotif. This hierarchical classification (SSS supermotif – SSS motif – structure) is set forth in the new SSS database (http://binfs.umdnj.edu/sssdb/). The SSS database contains 38 different supermotifs and 185 different motifs, which describe all of the analyzed structures.

The rule of supermotifs. Analysis of the supermotifs in the SSS database led to the discovery of a pattern in the ordering of the strandons in the two main sheets. This pattern is called the rule of supermotifs and is represented in Fig. 5. We showed that the arrangement of strandons in 95% of all SP structures obeys the rule of supermotifs. The rule dramatically restricts the number of permissible arrangements of strandons; it turns out that the great diversity of sandwich proteins is described by a very small number of supermotifs. Our analysis shows that if N, the total number of strandons in a structure, is odd, then the number of possible arrangements that follow the rule of supermotifs is 2N. If N is even, the pattern of supermotifs has an inherent symmetry and there are N/2 possible arrangements that correspond to the rule of supermotifs.
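Grouping strands into strandons is a single linear pass over the strand order. The sketch below assumes a predicate hbonded(i, j) (an assumed interface, not from the paper) reporting whether strands i and j share hydrogen bonds within a beta sheet.

    def strandons(n_strands, hbonded):
        """Partition strands 1..n_strands into strandons: maximal runs of
        sequentially consecutive strands whose neighbours are H-bonded."""
        groups, current = [], [1]
        for s in range(2, n_strands + 1):
            if hbonded(s - 1, s):      # consecutive and H-bonded: same strandon
                current.append(s)
            else:                      # otherwise strand s starts a new strandon
                groups.append(current)
                current = [s]
        groups.append(current)
        return groups

    # For Fig. 4b (Strands 4 and 5 H-bonded; Strands 1, 2, 3, and 6 not bonded
    # to their sequence neighbours) the result begins [[1], [2], [3], [4, 5], [6], ...]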
Fig. 5. The rule of supermotifs: the pattern of strandon arrangement in the two beta sheets
Each strandon is depicted by a box. The top line shows the arrangement of strandons in Sheet A, and the bottom line the arrangement of strandons in Sheet B. Dashes symbolize hydrogen bonds between strands in neighboring strandons. The edge strandon K can be selected on either side of a supermotif.

The rule of motifs. The analysis of the arrangement of strands within supermotifs and strandons reveals a strong correlation between the location of a strandon in the supermotif and the ordering of strands within the strandon. This observation leads us to formulate the rule of motifs: in strandon K, the strand with the highest number is located at the edge of the sheet. For two edge strandons K and K+1 in different sheets, or for any two neighboring strandons in the same sheet, the strand numbers in these two strandons increase in opposite directions (see the direction of the arrows in Figure 6). This means that the order of strands in a given strandon is the same for all proteins described by the same supermotif.
Fig. 6. Two supermotifs with 4 strandons. The left edge strandon in Sheet A is strandon K. The arrows '→' and '←' point towards the highest-numbered strand in each strandon. (a) A supermotif where K = IV, with two possible motifs for the given supermotif. (b) A supermotif where K = I, with two possible motifs for the given supermotif.
In 82% of the SPs analyzed for the SSS database, the arrangement of strands in all strandons satisfies the rule of motifs. In 12% of the analyzed structures, an 'incorrect' ordering of strands was found in only one strandon, and in 6% of the structures an exception was found in two or more strandons.
4 Discussion

This paper deals with two steps in the analysis of the relationship between the primary sequences and supersecondary structures of beta-sandwich proteins. The first step is the determination of specific sequence characteristics common to all proteins with a given SSS. It is shown that the SSS motif examined in this work is described by a unique set of conserved hydrophobic and hydrophilic positions, whose residues are decisive for formation of the respective SSS. The finding that approximately 30% of all positions in a sequence are conserved in proteins with the same supersecondary structure but widely dissimilar sequences and functions supports the hypothesis that a small number of key residues largely determine the SSS. The residues at the conserved positions are referred to as 'SSS-determining residues', as their presence is almost always required for a sequence to adopt the SSS that corresponds to that residue set. Residues in sequences may thus be conceptually divided into two groups: a relatively small set of SSS-determining residues, and a larger group of all other residues, whose structural role is supportive. Mutation of SSS-determining residues is generally limited to residues with the same chemical and physical properties, either hydrophilic or hydrophobic. By contrast, mutations at the supportive positions are much more variable, and interchanges of a hydrophobic for a hydrophilic amino acid, and vice versa, are common. The discovery of conserved positions in widely dissimilar sequences was made possible by the introduction of a novel, powerful structure-based method of sequence alignment. It is worth emphasizing that the proposed algorithm for structure-based alignment of strands does not require any manual intervention and lends itself to being completely computerized.

The second step is uncovering common structural properties of SPs. We described here a novel approach to SP structural classification, based solely on the analysis of arrangements of strands and strandons. It is important to note that because the classification does not take sequence similarity into account, proteins grouped together can have very different amino acid sequences. In this classification, proteins grouped in one motif often belong to different folds and different superfamilies in the SCOP database. The advantage of viewing a protein structure as an arrangement of strandons (or supermotifs) is that it enables us to find specific patterns of strand packing in sandwich-like protein structures: the rule of supermotifs and the rule of motifs. These rules sharply delimit the number of possible spatial arrangements of strands and lead to the conclusion that very few arrangements are compatible with the 'sandwich architecture'. These restrictive rules help us understand why only a limited number of different protein folds is observed. The patterns of spatial arrangement of secondary and supersecondary structure elements in SPs described here will help pave the way towards the goal of predicting protein structure from amino acid sequence.
Acknowledgements I thank Dr. I. Gelfand for very useful discussions and Mrs. M. Goldman for continuous encouragement of the research project.
References

1. Sela, M., White Jr., F.H., Anfinsen, C.B.: Reductive cleavage of disulfide bridges in ribonuclease. Science 125, 691–692 (1957)
2. Anfinsen, C.: Principles that govern the folding of protein chains. Science 181, 223–230 (1973)
3. Xiang, Z.: Advances in homology protein structure modeling. Curr. Protein Pept. Sci. 7, 217–227 (2006)
4. Dalton, J., Jackson, R.: An evaluation of automated homology modelling methods at low target template sequence similarity. Bioinformatics 23, 1901–1908 (2007)
5. Ginalski, K.: Comparative modeling for protein structure prediction. Curr. Opin. Struct. Biol. 16, 172–177 (2006)
6. Pugalenthi, G., Tang, K., Suganthan, P.N., Chakrabarti, C.: Identification of structurally conserved residues of proteins in absence of structural homologs using neural network ensemble. Bioinformatics 25, 204–210 (2008)
7. Kister, A.E., Fokas, A.S., Papatheodorou, T.S., Gelfand, I.M.: Strict rules determine arrangements of strands in sandwich proteins. Proc. Natl. Acad. Sci. USA 103, 4107–4110 (2006)
8. Chiang, Y.S., Gelfand, T.I., Kister, A.E., Gelfand, I.M.: New classification of supersecondary structures of sandwich-like proteins uncovers strict patterns of strand assemblage. Proteins 68(4), 915–921 (2007)
9. Levitt, M., Chothia, C.: Structural patterns in globular proteins. Nature 261, 552–558 (1976)
10. Sternberg, M.J.E., Thornton, J.M.: On the conformation of proteins: the handedness of the β-strand–α-helix–β-strand unit. J. Mol. Biol. 105, 367–382 (1976)
11. Cohen, F.E., Sternberg, M.J.E., Taylor, W.R.: Analysis of the tertiary structure of protein beta-sheet sandwiches. J. Mol. Biol. 148, 253–272 (1981)
12. Michalopoulos, I., Torrance, G.M., Gilbert, D.R., Westhead, D.R.: TOPS: an enhanced database of protein structural topology. Nucleic Acids Res. 32, D251–D254 (2004)
13. Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C.: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540 (1995)
14. Orengo, C.A., Michie, A.D., Jones, D.T., Swindells, M.B., Thornton, J.M.: CATH: a hierarchic classification of protein domain structures. Structure 5, 1093–1108 (1997)
15. Rice, P., Longden, I., Bleasby, A.: EMBOSS: the European Molecular Biology Open Software Suite. Trends in Genetics 16, 276–277 (2000)
Improved Algorithms for Parsing ESLTAGs: A Grammatical Model Suitable for RNA Pseudoknots Sanguthevar Rajasekaran, Sahar Al Seesi, and Reda Ammar Department of Computer Science, University of Connecticut, Storrs, CT 06269-2155 {rajasek,sahar,reda}@engr.uconn.edu
Abstract. Formal grammars have been employed in biology to solve various important problems. In particular, grammars have been used to model and predict RNA structures. Two such grammars are Simple Linear Tree Adjoining Grammars (SLTAGs) and Extended SLTAGs (ESLTAGs). The performance of techniques that employ grammatical formalisms critically depends on the efficiency of the underlying parsing algorithms. In this paper we present efficient algorithms for parsing SLTAGs and ESLTAGs. Our algorithm for SLTAG parsing takes O(min{m, n^4}) time and O(min{m, n^4}) space, where m is the number of entries that will ever be made in the matrix M (that is normally used by TAG parsing algorithms). Our algorithm for ESLTAG parsing takes O(n · min{m, n^4}) time and O(min{m, n^4}) space. We show that these algorithms perform better in practice than the algorithms of Uemura, et al. [19].
1 Introduction

Researchers studying RNA structures have lately been giving special attention to pseudoknot structures. Pseudoknots have functional importance since they appear, for example, in viral genome RNAs [11] and tmRNA [21]. Grammatical formalisms have been used to solve many problems in biology, including the structural analysis of RNA/DNA and proteins. Sakakibara, et al. [16] used context-free grammars for RNA structure prediction. Context-free grammars (CFGs) have been found to be insufficiently expressive and hence incapable of handling pseudoknot structures; for example, CFGs cannot handle the crossing dependencies that pseudoknots exhibit. Thus researchers have employed more expressive grammars. One such grammar is the Tree Adjoining Grammar (TAG) introduced by Joshi, Levy, and Takahashi [8]. TAGs are strictly more expressive than CFGs: for instance, {a^n b^n c^n | n ≥ 0} can be generated with a TAG, but this language is not context free. Uemura, et al. [19] have proposed two subclasses of TAGs called Simple Linear TAGs (SLTAGs) and Extended Simple Linear TAGs (ESLTAGs). Based on these two subclasses, they defined RNA modeling TAGs (TAG_RNAs) capable of handling pseudoknot structures. TAG_RNAs consist of five main types of trees, each responsible for modeling a certain structural unit or branching within an RNA structure. They discussed how the RNA folding problem can be solved using TAG_RNAs coupled with a minimum free energy scoring function to predict RNA structures, and demonstrated the effectiveness of their approach with experimental results.
Based on the models defined in Uemura, et al. [19], Matsui, Sato, and Sakakibara [9] defined Pair Stochastic TAGs (PSTAGs). PSTAGs are defined on the derived trees of TAGs. In their work, they used dynamic programming to fold an RNA sequence by aligning it with the derived TAG tree of an RNA sequence with a known structure. They tested their approach on RNA sequences from Rfam [5] and Pseudobase [3]. Al Seesi, Rajasekaran, and Ammar [1] presented an inference algorithm for TAG_RNAs. This inference algorithm accepts a positive training set with structural information and infers a TAG_RNA for the family of RNAs represented by the training set. They used their inference engine as the core of an RNA structure identification framework: an inferred grammar is fed with the positive sample and also a negative sample to generate a score for each sequence in both samples. The scoring function used is the number of base pairs in the structure. Then, using dynamic programming, a score threshold function is inferred for the grammar under consideration. The inference of the grammar and of its score threshold function constitutes the training phase. The inferred grammar, coupled with its score threshold function, can then be used to check whether a given input sequence follows the structure defined by the grammar. The proposed framework was tested on RNA sequences from Rfam [5], Pseudobase [3], and the tmRNA database [21]. In [2], Al Seesi, Rajasekaran, and Ammar extended the above framework to address the RNA folding problem, with structure identification as a first step in folding. The identification framework includes a set of grammars that are inferred during the training phase. An input RNA sequence can be checked against one or more of these grammars; if it is identified as having the structure represented by any of them, then a set of base pairs representing the structure is generated.

All grammar-based frameworks critically depend on the underlying parsing algorithms. Uemura, et al. [19] have proposed algorithms for parsing SLTAGs and ESLTAGs. In this paper we present more efficient algorithms for parsing these two subclasses. We restrict our discussion to the problem of language recognition, since such an algorithm can also be used for retrieving a parse; we use the terms 'parsing' and 'recognition' interchangeably.

The problem of parsing TAGs has been studied extensively. The first polynomial time algorithm was proposed by Vijayashanker and Joshi [20]. This algorithm takes O(n^6) time on any input string of length n. This time bound assumes that the size of the grammar is O(1), an assumption made in most of the literature on parsing. The algorithm of [20] is based on the Cocke-Kasami-Younger (CKY) algorithm for CFL parsing. The algorithm of Schabes and Joshi [18] is based on Earley's CFL parsing algorithm and runs in O(n^6) time as well. Several other sequential and parallel algorithms can also be found in the literature (see e.g., [6], [7], [10], [12], [13], and [17]). It was an open question since 1986 [20] whether Tree Adjoining Language (TAL) parsing can be done in o(n^6) time. This was answered in the affirmative in [14]. The algorithm of [14] takes O(n^3 M(n)) time, where M(n) is the time needed to multiply two n × n matrices. Since Coppersmith and Winograd [4] have shown that M(n) = O(n^{2.376}), the run time of the algorithm of [14] is indeed o(n^6). Subsequently it was shown in [15] that TAL parsing can be done in O(M(n^2)) time. These are nice theoretical results,
and it is not clear how they will perform in practice (given that they employ the matrix multiplication algorithm of [4], which has not been tested in practice against the straightforward cubic-time matrix multiplication algorithm or Strassen's algorithm).
2 Definition of TAGs

A TAG is a 5-tuple G = (N, Σ ∪ {ε}, I, A, S), where N is a finite set of nonterminal symbols, Σ is a finite set of terminal symbols disjoint from N, ε is the empty terminal string not in Σ, I is a finite set of labelled initial trees, A is a finite set of labelled auxiliary trees, and S ∈ N is the distinguished start symbol. Uemura, et al. [19] refer to initial trees as center trees and to auxiliary trees as adjunct trees. The initial and auxiliary trees of a TAG are its elementary trees. All the internal nodes of elementary trees are labelled by nonterminal symbols. Every initial tree is labelled at the root by the start symbol S and has leaf nodes labelled by symbols from Σ ∪ {ε}. An auxiliary tree has both its root and exactly one leaf node (called the foot node) labelled by the same nonterminal symbol. All other leaf nodes are labelled by symbols from Σ ∪ {ε}. An operation called adjunction composes trees of the grammar as follows: let α be a tree containing some internal node labelled X, and let β be an auxiliary tree whose root is also labelled by X. Adjoining β into α results in the tree α′ (see Figure 1). The formalism also supports constrained adjunction, selective adjunction, and obligatory adjunction; see e.g. [20].
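To make the adjunction operation concrete, here is a minimal Python sketch on labelled trees. The Node class and the copy-and-splice approach are assumptions for illustration; this is not the representation used in [20] or [19].

    import copy

    class Node:
        def __init__(self, label, children=None, is_foot=False):
            self.label = label
            self.children = children or []
            self.is_foot = is_foot

    def find_foot(t):
        if t.is_foot:
            return t
        for c in t.children:
            f = find_foot(c)
            if f is not None:
                return f
        return None

    def adjoin(node, beta):
        """Adjoin a copy of auxiliary tree beta at `node` of some tree alpha.
        The subtree currently under `node` is re-attached at beta's foot node,
        and beta's children become the new children of `node`."""
        assert beta.label == node.label   # root of beta must match node's label
        b = copy.deepcopy(beta)           # each adjunction uses a fresh copy
        foot = find_foot(b)
        foot.children = node.children     # the old subtree hangs off the foot
        foot.is_foot = False
        node.children = b.children        # beta's material replaces node's

Adjoining constraints would be enforced by checking, before the splice, that beta belongs to the set of trees permitted at `node`.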
3 ESLTAGs

Predicting RNA secondary structures is an important problem in biology. In particular, the problem of identifying pseudoknots, a common type of secondary structure, has been studied extensively by many researchers. Uemura, et al. [19] have proposed a solution for this problem based on TAGs. They have come up with two subclasses of TAGs called Simple Linear TAGs (SLTAGs) and Extended Simple Linear TAGs (ESLTAGs) suitable for representing RNA secondary structures. They have also proposed an O(n^4) time algorithm for parsing SLTAGs and an O(n^5) time algorithm for parsing ESLTAGs. The TAL parsing algorithm of Rajasekaran [14] takes O(n^3 M(n)) time for general TAGs, where M(n) is the time needed to multiply two n × n matrices. Note that the matrix multiplication algorithm of Coppersmith and Winograd [4] implies that M(n) = O(n^{2.376}). The algorithm of Rajasekaran and Yooseph [15] takes O(M(n^2)) time for general TAGs. Thus it follows that ESLTAGs can be parsed in o(n^5) time. Though these are nice theoretical results, it is not clear how they will perform in practice. (Note that no implementation of the matrix multiplication algorithm of [4] is known.) In this paper our focus is on adapting some of the techniques of [14] (without using the matrix multiplication reduction) to improve on the practical performance of the algorithms given in [19]. Uemura, et al. [19] first define a grammar called Simple Linear TAG (SLTAG) and then extend it to define the TAG subclass called Extended Simple Linear TAG (ESLTAG). A Simple Linear TAG is specified as G = (C, A), where C is the set of
Fig. 1. Adjunction operation
Fig. 2. Yield Partitioning
center trees and A is the set of adjunct trees; the center trees and the adjunct trees constitute the elementary trees. G has a set of associated adjoining constraints. For each node q of every elementary tree one of the following constraints applies: 1) selective adjoining: only members of a given set T ⊆ A can be adjoined at q; 2) null adjoining: no adjoining can be done at q; 3) obligatory adjoining: a member of a given set T ⊆ A must be adjoined at q. A node q is said to be inactive if the adjoining constraint of q is null adjoining; the node q is active otherwise. A center tree is simple linear if only one node in the tree is active, all the other nodes being inactive. An adjunct tree is simple linear if only one node in the tree is active, the other nodes are inactive, and the active node is on the backbone of the tree. (The backbone of an adjunct tree is simply the path from the root to the foot node.) A Simple Linear TAG (SLTAG) is a TAG all of whose elementary trees are simple linear.

SLTAGs can model H-type pseudoknots, the most common type of pseudoknot: more than 70% of the pseudoknots in Pseudobase [3] are H-type. The general pattern for an H-type pseudoknot is x1 w1 x2 w2 x̄1^r w3 x̄2^r. SLTAGs can also model more complex types of pseudoknots, such as the LL-type pseudoknot, which has the pattern x1 w1 x2 x3 w2 x̄3^r x̄1^r w3 x̄2^r.

Let Y(α) stand for the yield of the tree α. For any adjunct tree α, Y(α) can be partitioned into four parts according to the relative position to the adjoining node: LU(α), LD(α), RD(α), and RU(α), as in Figure 2. An SLTAG G = (C, A) is said to be in normal form if 1) Y(α) = λ for every center tree α; and 2) |LU(α)| + |LD(α)| + |RD(α)| + |RU(α)| = 1 for each adjunct tree α, i.e., there is exactly one terminal symbol in the yield of α. One can show that for every SLTAG there is an equivalent SLTAG in normal form.

An Extended SLTAG is a TAG where each center tree is simple linear and every adjunct tree is either simple linear or semi-simple linear. An adjunct tree is semi-simple linear if it has two active nodes, one on the backbone and the other elsewhere. An ESLTAG is in normal form if: 1) Y(α) = λ for every center tree α; 2) |LU(α)| + |LD(α)| + |RD(α)| + |RU(α)| = 1 for each adjunct tree α that is simple linear; and 3) Y(α) = X (X ∈ N) for each adjunct tree that is semi-simple linear. It can be shown that for every ESLTAG there is an equivalent ESLTAG in normal form. ESLTAGs can be used to model structures where branching is needed; semi-simple trees model the branching. For example, the concatenation of two pseudoknots, or the concatenation of a hairpin to a pseudoknot, cannot be modeled by an SLTAG and requires an ESLTAG. Branching can also occur within the pseudoknot structure
itself, as in the HH-type pseudoknot. The general pattern for an HH-type pseudoknot can be represented as x1 x2 w1 x̄2^r x3 w2 x4 x̄3^r x̄1^r w3 x̄4^r. An example of an HH-type pseudoknot is the HCV IRES pseudoknot in Pseudobase [3].

Uemura, et al. [19] have presented an O(n^4) parsing algorithm for SLTAGs. They have also presented an O(n^5) time algorithm for ESLTAGs. In this paper we show how to devise a parsing algorithm for ESLTAGs that runs in time O(mn), where m is the number of entries that will ever be filled in the matrix M that the authors use. In fact this matrix M (of size (n+1)^4) is used by most of the TAG parsing algorithms that have been proposed in the literature. In the worst case the value of m could be as much as (n+1)^4, but in practice m is much smaller, and hence the proposed algorithm should be of practical interest.

3.1 An O(n^4) Time Algorithm for SLTAGs

The O(n^4) time algorithm of [19] uses a matrix M of size (n+1)^4. This matrix is filled (in some specific order) taking O(1) time per entry, and hence the algorithm takes a total of O(n^4) time. To fill each entry an adjoin operation is performed. This adjoin operation can be performed in O(1) time since there is only one terminal symbol in the yield of any adjunct tree. In particular, the algorithm will place α ∈ A into M(i, j, k, l) if the following conditions hold: 1) there is a tree β ∈ M(i + |LU(α)|, j − |LD(α)|, k + |RD(α)|, l − |RU(α)|) such that β can be adjoined to α; 2) LU(α) = a_{i+1} ··· a_{i+|LU(α)|}; 3) LD(α) = a_{j−|LD(α)|+1} ··· a_j; 4) RD(α) = a_{k+1} ··· a_{k+|RD(α)|}; and 5) RU(α) = a_{l−|RU(α)|+1} ··· a_l. (A sketch of this placement test is given at the end of this section.) Once all the entries in M have been filled, it is easy to check whether the given input string is in the language.

3.2 An O(n^5) Time Algorithm for ESLTAGs

In the O(n^5) time algorithm of [19], simple linear trees are processed as in the O(n^4) time algorithm. Semi-simple linear trees are handled differently. There are four cases based on where the active node (other than the one on the backbone) is located in the tree (see Figure 3). In addition to the matrix M, the algorithm also uses an (n+1) × (n+1) array B. Entries in this array are pairs of the type (α, p), where α is an adjunct tree and p is an integer. The pair (α, p) will be in B(i, j) iff there exists a derived tree whose yield is a_{i+1} a_{i+2} ··· a_p X a_{p+1} a_{p+2} ··· a_j, for some nonterminal symbol X. If the semi-simple linear tree α is of type A, it will be placed in M(i, j, k, l) if there exist a β ∈ M(r, j, k, l) and a β′ ∈ B(i, r) (with i ≤ r ≤ j) such that β is adjoinable at n0 and β′ is adjoinable at n1. There are similar strategies for semi-simple linear trees of types B, C, and D. These basic adjoining operations have to be performed for each possible value of i, j, k, l, and r. Since each of these indices takes on n + 1 possible values, the run time of the algorithm is O(n^5).
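The placement test of Section 3.1 can be expressed compactly. The following sketch makes several assumptions for illustration (the input stored 1-indexed in a list a with a[0] unused, each adjunct tree carrying its four yield parts as strings, M a nested table of tree collections, and a predicate adjoinable(beta, alpha)); it is not the implementation of [19].

    def can_place(alpha, i, j, k, l, M, a, adjoinable):
        """Decide whether adjunct tree alpha may be placed into M(i, j, k, l)."""
        LU, LD, RD, RU = alpha.LU, alpha.LD, alpha.RD, alpha.RU
        inner = M[i + len(LU)][j - len(LD)][k + len(RD)][l - len(RU)]
        return (any(adjoinable(beta, alpha) for beta in inner)
                and "".join(a[i + 1 : i + len(LU) + 1]) == LU
                and "".join(a[j - len(LD) + 1 : j + 1]) == LD
                and "".join(a[k + 1 : k + len(RD) + 1]) == RD
                and "".join(a[l - len(RU) + 1 : l + 1]) == RU)

In normal form exactly one of the four yield parts is nonempty and of length one, so all the string comparisons above take O(1) time.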
4 New Algorithms

In this section we show how the performance of the algorithms of [19] can be improved. We adapt some of the techniques of [14]. The algorithms of Section 3 are M-centric
Fig. 3. Semi-simple linear trees
(where M is a matrix of size (n+1)^4). Independent of the elementary trees in the given grammar, these algorithms step through each entry of the matrix to check whether any trees can be placed in that entry. This means that even if no tree will ever be placed in a matrix entry, some work is done for it. To overcome this undesirable time expense, Rajasekaran [14] proposed the employment of tree-centric (or grammar-centric) algorithms. The idea can be explained with the example of a context-free grammar (CFG). Let G = (N, T, P, S) be a CFG in Chomsky normal form. Any production in P takes the form A → BC or A → a, where A, B, and C are nonterminals and a is a terminal symbol. The parsing algorithm of [14] runs in stages. In each stage every production is processed to grow larger and larger parse trees. Each nonterminal is associated with a list of tuples at any time: LIST(A) holds tuples (i, j) such that A derives a_{i+1} a_{i+2} ... a_j (for any nonterminal A). In any stage, A → BC is processed as follows: scan through the elements in LIST(B) and LIST(C) looking for matches. If (i, k) is in LIST(B), we check whether (k, j) is in LIST(C) for some j. If so, the tuple (i, j) is inserted into LIST(A) (if it is not already there). There are n stages in the algorithm, and a single production can be processed in O(n^3) time. The following data structures are employed: 1) for each nonterminal A, an array X_A[1:n] of lists, where X_A[i] is the list of all tuples from LIST(A) whose first item is i (1 ≤ i ≤ n); and 2) an n × n matrix M such that M(i, j) is a list of nonterminals that derive a_{i+1} a_{i+2} ... a_j. There can be O(n^2) entries in LIST(B). For each entry (i, k) in this list, at most n items in LIST(C) have to be examined. As a result, the total time needed to process a production is O(n^3).
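A minimal sketch of one stage of this tuple-list scheme for a single production follows; it already incorporates the exact-length refinement described in the next paragraph, and the data layout (a set of (i, j) spans per nonterminal) is an assumption for illustration.

    def process_production(A, B, C, LIST, ell):
        """One stage-ell pass over production A -> B C: a span (i, k) of B can
        only combine with the span (k, i + ell) of C if the combined span is
        to have length exactly ell.  LIST maps each nonterminal to a set of
        (i, j) spans it derives."""
        for (i, k) in list(LIST[B]):
            if (k, i + ell) in LIST[C]:
                LIST[A].add((i, i + ell))

Because the partner span (k, i + ell) is unique, the membership test is a single expected-O(1) set lookup; this is what brings the cost of a stage down from O(n^3) to O(n^2).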
By induction one can show that at the end of stage ℓ (1 ≤ ℓ ≤ n), the algorithm has computed all the nonterminals that span any input segment of length ℓ or less. This implies that the above algorithm takes n stages, accounting for a total of O(n^4) time. The run time of each stage can be reduced to O(n^2) by ensuring that in stage ℓ (1 ≤ ℓ ≤ n) we only deal with tuples from LIST(B) and LIST(C) whose combination derives an input segment of length exactly ℓ. For example, if (i, k) is a tuple in LIST(B), the only tuple in LIST(C) we should look for is (k, i + ℓ); this can be done in O(1) time. With this modification, the entire algorithm takes O(n^3) time. Also, note that if m is the number of elements that will ever be stored in the matrix M, then the run time of the above algorithm is no more than O(mn).

4.1 A Parsing Algorithm for SLTAGs

We can extend the above ideas to SLTAGs. With each adjunct tree α a list LIST(α) is associated. LIST(α) holds quadruples: if (i, j, k, l) ∈ LIST(α), the string a_{i+1} ··· a_j X a_{k+1} ··· a_l can be derived from α, where X is a nonterminal. There are n stages in the algorithm. In each stage we process the elementary trees to grow larger and larger parse (derived) trees; in particular, LIST(α) is updated for every adjunct tree α. In the case of SLTAGs, trees grow in size with the help of adjoin operations. Consider a derived tree γ: when an adjunct tree α is adjoined to γ, the yield increases by exactly one. Thus we can grow the trees in a systematic manner. In particular, in stage ℓ (for 1 ≤ ℓ ≤ n − 1), we only consider derived trees of yield ℓ and perform adjoin operations on them to get trees of yield ℓ + 1. Since there can be at most O(n^3) trees whose yield is of size ℓ, and since each adjoin operation can be performed in O(1) time, each stage of the algorithm takes O(n^3) time and the entire algorithm runs in time O(n^4). Also, the run time can be specified as O(min{m, n^4}), where m is the number of entries that will ever be made in the matrix M in the algorithm of [19]. For simplicity of notation, from here on we use O(m) to denote O(min{m, n^4}). A pseudocode follows.

Algorithm SLTAG Parse
/* This algorithm takes as input an SLTAG G and an input a_1 a_2 ... a_n and returns 'YES' if the input string is in L(G) and 'NO' otherwise. */
1    Initialize LIST(γ) for each adjunct tree γ;
2    Specifically, (i, j, k, l) will be added to LIST(γ) if
3      i + |L(γ)| ≤ k, L(γ) = a_{i+1} a_{i+2} ··· a_{i+|L(γ)|}, and R(γ) = a_{k+1} a_{k+2} ··· a_{k+|R(γ)|};
4    for ℓ := 2 to n do
5      for each adjunct tree γ do
6        L′(γ) := ∅;
7        for each adjunct tree β that is adjoinable to γ do
8          for each (i, j, k, l) ∈ LIST(β) do
9            if LU(γ) = a_{i−|LU(γ)|+1} ··· a_i, LD(γ) = a_{j+1} ··· a_{j+|LD(γ)|},
10              RD(γ) = a_{k−|RD(γ)|+1} ··· a_k, and RU(γ) = a_{l+1} ··· a_{l+|RU(γ)|},
11            then add (i − |LU(γ)|, j + |LD(γ)|, k − |RD(γ)|, l + |RU(γ)|) to L′(γ);
12          end for
13        end for
14      end for
15      for each adjunct tree γ do
16        LIST(γ) := L′(γ);
17        Sort LIST(γ);
18      end for
19   end for
20   if there exist an adjunct tree γ, an index i, and a center tree α such that
21     (0, i, i, n) ∈ LIST(γ) and γ is adjoinable to α
22   then return YES else return NO
Theorem 1. Algorithm SLTAG Parse works correctly.

Proof. This theorem is proven by showing that if an adjunct tree γ derives the string a_i a_{i+1} ··· a_j X a_k a_{k+1} ··· a_l and if (j − i) + (l − k) = ℓ, then LIST(γ) will have the quadruple (i, j, k, l) at the end of stage ℓ − 1. The proof of this statement is by induction on the stage number ℓ. Lines 1-3 correspond to stage 1. Each adjunct tree can be thought of as a derived tree with yield 1. Clearly, at the end of stage 1, for every adjunct tree a quadruple (i, j, k, l) is generated with (j − i) + (l − k) = 1. Thus the base case holds. Assume that the statement holds for all derived trees with yield ≤ ℓ. We can prove it for yield ℓ + 1 as follows. Let α be a derived tree with a yield of size ℓ + 1. The tree α has been obtained with a sequence of adjoin operations; this sequence corresponds to a sequence of adjunct trees γ_1, γ_2, ..., γ_{ℓ+1}. Consider the derived tree obtained by adjoining γ_{ℓ+1} to γ_ℓ; call this derived tree α_ℓ. The yield of this derived tree is 2, so at the end of stage 2 there will be a quadruple in LIST(γ_ℓ) corresponding to α_ℓ. In stage 3, the algorithm realizes that γ_ℓ can be adjoined to γ_{ℓ−1}, and LIST(γ_ℓ) has a quadruple corresponding to α_ℓ. Let α_{ℓ−1} be the derived tree obtained by adjoining α_ℓ to γ_{ℓ−1}; its yield is 3. A quadruple corresponding to α_{ℓ−1} will be added to L′(γ_{ℓ−1}) in the for loop of line 5 in stage 3. Continuing in this fashion, let α_2 be the derived tree corresponding to the sequence γ_2, γ_3, ..., γ_ℓ, γ_{ℓ+1}. Since the yield of α_2 is ℓ, by the induction hypothesis, at the end of stage ℓ there will be a quadruple corresponding to α_2 in LIST(γ_2). In stage ℓ + 1, the algorithm realizes that γ_2 can be adjoined to γ_1, and LIST(γ_2) has a quadruple corresponding to α_2. Thus, the for loop of line 5 will create a quadruple corresponding to α and add it to L′(γ_1). This completes the proof.

Theorem 2. Algorithm SLTAG Parse takes O(m) time and O(m) space, assuming that |G| = O(1).

Proof. The initialization done in Lines 1-3 takes O(n^2) time. Note that LIST(γ) for any adjunct tree γ in any stage of the algorithm will be of size O(n^3), since the quadruples in this list correspond to derived trees with a specific yield.
The for loop of Line 8 takes O(|LIST(β)|) time. Summing this over all the adjunct trees and all the n stages, the total time for this loop is O(m). Sorting in Line 17 is done in lexicographic order; this sorting helps to eliminate duplicates, and it can be done using radix sort in time O(|LIST(γ)|). Thus the total sorting time in the entire algorithm is O(m). Lines 20-22 take O(n) time.

In any algorithm that makes use of the matrix M, each entry of M holds only distinct elements. If m is the number of nonempty matrix entries, then the amount of time spent by such an algorithm on these m entries may be more than m; this can happen, for example, if for an adjunct tree γ the same quadruple gets generated multiple times. In fact, Algorithm SLTAG Parse provides an upper bound on this time. Note that the size of L′(γ) is upper bounded by |∪_β LIST(β)|, where the union is over all β adjoinable to γ. In other words, the number of quadruples (including duplicates) generated in any stage is no more than the number of distinct quadruples generated in the previous stage. This implies that the run time of the algorithm is O(m), where m is the total number of distinct quadruples generated in the algorithm.

4.2 A Parsing Algorithm for ESLTAGs

For ESLTAGs we can employ a similar strategy. As in the algorithm of [19], we also keep an (n+1) × (n+1) matrix B. Each entry in this matrix is a list of adjunct trees: the tree γ will be in B(i, j) if the string a_{i+1} a_{i+2} ··· a_q X a_{q+1} ··· a_j can be derived from γ (for some q, X being any nonterminal symbol). Simple linear trees can be processed in exactly the same manner as in Algorithm SLTAG Parse. Semi-simple linear trees have to be processed differently, and initialization is done only for simple linear trees. A semi-simple linear tree can be categorized into type A, type B, type C, or type D (see Figure 3) depending on where the adjoining nodes are located in the tree [19]. The algorithm of [19] takes O(n^5) time to process a semi-simple linear tree. In our algorithm there are n stages, and a semi-simple linear tree is processed in O(n^4) (O(m), to be more precise) time per stage. In stage ℓ (1 ≤ ℓ ≤ n) we are only interested in derived trees whose yield is ℓ. Each adjunct tree γ has an associated list LIST(γ) as in Algorithm SLTAG Parse. A pseudocode follows.

Algorithm ESLTAG Parse
/* This algorithm takes as input an ESLTAG G and an input a_1 a_2 ... a_n and returns 'YES' if the input string is in L(G) and 'NO' otherwise. */
1    Initialize LIST(γ) for each simple linear adjunct tree γ;
2    Specifically, (i, j, k, l) will be added to LIST(γ) if
3      i + |L(γ)| ≤ k, L(γ) = a_{i+1} a_{i+2} ··· a_{i+|L(γ)|}, and R(γ) = a_{k+1} a_{k+2} ··· a_{k+|R(γ)|};
4    for ℓ := 2 to n do
5      for each adjunct tree γ do
6        L′(γ) := ∅;
7      for each simple linear adjunct tree γ do
8        for each adjunct tree β that is adjoinable to γ do
9          for each (i, j, k, l) ∈ LIST(β) do
10           if LU(γ) = a_{i−|LU(γ)|+1} ··· a_i, LD(γ) = a_{j+1} ··· a_{j+|LD(γ)|},
11              RD(γ) = a_{k−|RD(γ)|+1} ··· a_k, and RU(γ) = a_{l+1} ··· a_{l+|RU(γ)|},
12           then add (i′, j′, k′, l′) to L′(γ) where i′ = i − |LU(γ)|,
13              j′ = j + |LD(γ)|, k′ = k − |RD(γ)|, and l′ = l + |RU(γ)|;
14           if j′ = k′ then add γ to B(i′, l′);
15         end for
16       end for
17     end for
18     for each semi-simple linear adjunct tree γ of type A do
19       for each β that is adjoinable at n0 do
20         for each (r, j, k, l) ∈ LIST(β) do
21           Let q := ℓ − [(j − r) + (l − k)];
22           for each β′ ∈ B(r − q, r) that is adjoinable at n1 do
23             Add (r − q, j, k, l) to L′(γ);
24             if j = k then add γ to B(r − q, l);
25           end for
26         end for
27       end for
28     end for
29     for each semi-simple linear adjunct tree γ of type B do
30       for each β that is adjoinable at n0 do
31         for each (i, p, k, l) ∈ LIST(β) do
32           Let q := ℓ − [(p − i) + (l − k)];
33           for each β′ ∈ B(p, p + q) that is adjoinable at n1 do
34             Add (i, p + q, k, l) to L′(γ);
35             if k = (p + q) then add γ to B(i, l);
36           end for
37         end for
38       end for
39     end for
40     for each semi-simple linear adjunct tree γ of type C do
41       for each β that is adjoinable at n0 do
42         for each (i, j, k, p) ∈ LIST(β) do
43           Let q := ℓ − [(j − i) + (p − k)];
44           for each β′ ∈ B(p, p + q) that is adjoinable at n1 do
45             Add (i, j, k, p + q) to L′(γ);
46             if j = k then add γ to B(i, p + q);
47           end for
48         end for
49       end for
50     end for
51     for each semi-simple linear adjunct tree γ of type D do
52       for each β that is adjoinable at n0 do
53         for each (i, j, r, l) ∈ LIST(β) do
54           Let q := ℓ − [(j − i) + (l − r)];
55           for each β′ ∈ B(r − q, r) that is adjoinable at n1 do
56             Add (i, j, r − q, l) to L′(γ);
57             if j = (r − q) then add γ to B(i, l);
58           end for
59         end for
60       end for
61     end for
62     for each adjunct tree γ do
63       Sort L′(γ) to eliminate duplicates;
64       Add the distinct quadruples of L′(γ) to LIST(γ);
65     end for
66   end for
67   if there exist an adjunct tree γ, an index i, and a center tree α such that
68     (0, i, i, n) ∈ LIST(γ) and γ is adjoinable to α
69   then return YES else return NO
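The duplicate-elimination steps (line 17 of SLTAG Parse and lines 63-64 above) can be sketched as follows. The paper specifies radix sort for the lexicographic ordering; Python's built-in comparison sort is substituted here purely for brevity.

    def dedup_sorted(quads):
        """Sort a list of quadruples lexicographically and drop adjacent duplicates."""
        quads.sort()
        out = []
        for q in quads:
            if not out or out[-1] != q:
                out.append(q)
        return out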
Theorem 3. Algorithm ESLTAG Parse works correctly.

Proof. The proof is similar to that of Theorem 1.
Theorem 4. Algorithm ESLTAG Parse runs in time O(mn) using O(m) space, assuming that the grammar size is O(1).

Proof. The initialization in lines 1-3 takes O(n^2) time. Note that LIST(γ) for any adjunct tree γ in any stage of the algorithm will be of size O(n^4) (O(m), to be more precise). The for loop of line 7 takes O(m) time, since we only spend O(1) time for each quadruple in LIST(β). The for loop of Line 18 also takes O(m) time, since we only spend O(1) time for each quadruple in LIST(β); this is made possible by the fact that in stage ℓ we are only interested in derived trees whose yield is of size ℓ. The for loops of Lines 29, 40, and 51 are similar and hence take O(m) time each as well. Sorting in Line 63 is done in lexicographic order; this helps to eliminate duplicates and can be done using radix sort in time O(|L′(γ)|). The total time spent in sorting in any stage is upper bounded by a constant multiple of the total number of distinct quadruples generated up to the end of the previous stage; therefore, the total sorting time in any stage is O(m), and the total sorting time in the entire algorithm is O(mn). Lines 67-69 take O(n) time. Therefore, the run time of the algorithm is O(mn).
Table 1. Performance results for Uemura's SLTAG parser and our SLTAG parser. Time is measured in min:sec, and m is the total number of tuples generated by our SLTAG parser.

Length   Uemura's Time   Rajasekaran's Time   m
33       3               0.11                 90,369
46       8               0.5                  413,309
59       12:11           1.3                  1,037,164
72       -               5.5                  3,689,342
133      -               1:08.9               47,660,675
187      -               7:27.9               200,261,407
lengths. The reported m, which is considerably lower than O(n^4), represents the total number of tuples generated by our SLTAG parser over all trees; however, at any iteration ℓ, only the tuples generated in iterations ℓ and ℓ + 1 are stored in memory. Experiments were conducted on an Intel Pentium M 1.7 GHz processor with 1 GB RAM. The grammar size in all experiments was 88 trees. We could not test the time performance of Uemura's parser for sequences of length > 70 nc due to its memory requirements.
5 Conclusions

In this paper we have presented efficient algorithms for parsing SLTAGs and ESLTAGs. Our algorithm for SLTAG parsing takes O(m) time and O(m) space, where m is the number of entries that will ever be made in the matrix M (that is normally used by TAG parsing algorithms). Our algorithm for ESLTAG parsing takes O(mn) time and O(m) space. Our experimental results for the SLTAG parser demonstrate the performance advantage of the tree-centric approach over the matrix-centric approach of [19].
Acknowledgements This work has been supported in part by the following grants: NSF 0326155, NSF 0829916 and NIH 1R01GM079689-01A1.
References

1. Al Seesi, S., Rajasekaran, S., Ammar, R.: Pseudoknot identification through learning TAG_RNAs. In: Chetty, M., Ngom, A., Ahmad, S. (eds.) PRIB 2008. LNCS (LNBI), vol. 5265, pp. 132–143. Springer, Heidelberg (2008)
2. Al Seesi, S., Rajasekaran, S., Ammar, R.: RNA pseudoknot folding through inference and identification using TAG_RNAs. In: International Conference on Bioinformatics and Computational Biology, BiCob 2009 (2009) (to appear)
3. van Batenburg, F.H.D., Gultyaev, A.P., Pleij, C.W.A., Ng, J., Oliehoek, J.: Pseudobase: a database with RNA pseudoknots. Nucl. Acids Res. 28(1), 201–204 (2000)
4. Coppersmith, D., Winograd, S.: Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation 9, 251–280 (1990); also in: Proc. 19th Annual ACM Symposium on Theory of Computing, pp. 1–6 (1987)
5. Griffiths-Jones, S., Moxon, S., Marshall, M., Khanna, A., Eddy, S.R., Bateman, A.: Rfam: annotating non-coding RNAs in complete genomes. Nucl. Acids Res. 33, D121–D124 (2005)
6. Guan, Y., Hotz, G.: An O(n^5) recognition algorithm for coupled parenthesis rewriting systems. In: Proc. TAG+ Workshop. University of Pennsylvania, Philadelphia (1992)
7. Harbusch, K.: An efficient parsing algorithm for tree adjoining grammars. In: Proc. 28th Meeting of the Association for Computational Linguistics, Pittsburgh, pp. 284–291 (1990)
8. Joshi, A.K., Levy, L.S., Takahashi, M.: Tree adjunct grammars. Journal of Computer and System Sciences 10(1), 136–163 (1975)
9. Matsui, H., Sato, K., Sakakibara, Y.: Pair stochastic tree adjoining grammars for aligning and predicting pseudoknot RNA structures. Bioinformatics 21(11), 2611–2617 (2005)
10. Nurkkala, T., Kumar, V.: A parallel parsing algorithm for natural language using tree adjoining grammar. In: Proc. 8th International Parallel Processing Symposium (1994)
11. Paillart, J.C., Skripkin, E., Ehresmann, B., Ehresmann, C., Marquet, R.: In vitro evidence for a long range pseudoknot in the 5'-untranslated and matrix coding regions of HIV-1 genomic RNA. J. Biol. Chem. 277, 5995–6004 (2002)
12. Palis, M., Shende, S., Wei, D.S.L.: An optimal linear time parallel parser for tree adjoining languages. SIAM Journal on Computing 19(1), 1–31 (1990)
13. Partee, B.H., Ter Meulen, A., Wall, R.E.: Studies in Linguistics and Philosophy, vol. 30. Kluwer Academic Publishers, Dordrecht (1990)
14. Rajasekaran, S.: TAL parsing in o(n^6) time. SIAM Journal on Computing 25(4), 862–873 (1996)
15. Rajasekaran, S., Yooseph, S.: TAL parsing in O(M(n^2)) time. Journal of Computer and System Sciences 56(1), 83–89 (1998)
16. Sakakibara, Y., Brown, M., Hughey, R., Mian, I.S., Sjolander, K., Underwood, R.C., Haussler, D.: Stochastic context-free grammars for tRNA modeling. Nucl. Acids Res. 22, 5112–5120 (1994)
17. Satta, G.: Tree adjoining grammar parsing and Boolean matrix multiplication. In: Proc. 32nd Meeting of the Association for Computational Linguistics (1994)
18. Schabes, Y., Joshi, A.K.: An Earley-type parsing algorithm for tree adjoining grammars. In: Proc. 26th Meeting of the Association for Computational Linguistics, pp. 258–269 (1988)
19. Uemura, Y., Hasegawa, A., Kobayashi, S., Yokomori, T.: Tree adjoining grammars for RNA structure prediction. Theoretical Computer Science 210, 277–303 (1999)
20. Vijayashanker, K., Joshi, A.K.: Some computational properties of tree adjoining grammars. In: Proc. 23rd Meeting of the Association for Computational Linguistics, pp. 82–93 (1985)
21. Williams, K.P.: The tmRNA website: invasion by an intron. Nucl. Acids Res. 30(1), 179–182 (2002)
Efficient Algorithms for Self Assembling Triangular and Other Nano Structures Vamsi Kundeti and Sanguthevar Rajasekaran Department of Computer Science and Engineering University of Connecticut Storrs, CT 06269, USA {vamsik,rajasek}@engr.uconn.edu
Abstract. Nano fabrication with biomolecular/DNA self assembly is a promising area of research. Building nano structures with self assembly is both efficient and inexpensive. Winfree [1] formalized a two dimensional (2D) tile assembly model based on Wang's tiling technique. Algorithms with an optimal tile complexity of Θ(log(N)/log(log(N))) were proposed earlier to uniquely self assemble an N × N square (with a temperature of α = 2) in this model. However, efficient constructions to assemble arbitrary shapes are not known and have remained open. In this paper we present self assembling algorithms to assemble a triangle of base 2N − 1 (units) and height N with a tile complexity of Θ(log(N)). We also describe how this framework can be used to construct other shapes.
1 Introduction
Biomolecular nano technology is an emerging and promising field. Fabrication of nano structures using a bottom-up approach such as biomolecular self assembly has proven to be a scalable and efficient technique (see [2], [3]). Self assembly is a spontaneous self organization of small nano shapes into complex nano structures. The idea of self assembly was conceived by chemists during the synthesis of lipid polymers. Initially, proteins were used in the self assembly process; however, proteins were soon replaced by DNA strands. Seeman [4] showed that using DNA strands results in more rigid nano structures than protein-based ones. Seeman used double crossovers to build the DNA tiles, which act as bricks in the self assembly process. These double crossover DNA tiles consist of two crossovers within two DNA double helices. The tiles have four pads and are rectangular in shape. The pads can be encoded with single DNA strands, so that tiles with complementary strands abut together during the self assembly process. Rothemund, et al. [5] also demonstrated the feasibility of using DNA tiles: they conducted an experiment performing unmediated self assembly of a fractal pattern (the Sierpinski triangle). It should be noted, however, that this self assembly process was not terminal and would not generate the required pattern uniquely. The study of DNA self assembly is motivated by the impact it can have on human health and nanoelectronics. Recently, gold nanoparticles have been
used to deliver multiple cancer treatment drugs in a controlled fashion (see [6]). During this process of drug delivery at the nano level it is often necessary to arrange the nano particles in some pattern, such as an array. Ordering these gold nano particles was an extremely difficult task for scientists until the DNA self assembly process aided them in achieving it (see [7]). In nanoelectronics, DNA self assembly helps in the surface treatment of the dielectric layer by appropriate patterning to improve organic transistor performance (see [8]). DNA-templated nanowires and heterojunction diodes with good rectification properties were mass-fabricated efficiently from copper, palladium, and platinum (see [9]). Rothemund, et al. [10] formalized the theory behind self assembly of DNA tiles using Wang's [11] tiling technique. Rothemund, et al. [10] also introduced tile complexity and gave several constructions to self assemble an N × N square with a tile complexity of O(log(N)). Later, Adleman [12] gave an optimal construction with a tile complexity of O(log(N)/log(log(N))) to self assemble an N × N square. Several variations (see e.g., [13], [14], [15]) of the original tile assembly model have since been studied. However, the fundamental problem of efficient tile set construction for arbitrary shapes in the original model has remained open until now. In this paper we give efficient tile set constructions to self assemble a triangle and related shapes uniquely. Our main result is a tile set which can self assemble a triangle (at a temperature of α = 2) with O(N) base and height, with a tile complexity of O(log(N)). The organization of this paper is as follows: we define the original tile assembly model in Section 2, and in Section 3 we give our self assembly construction.
2 Tile Assembly Model
A Wang tile is a unit square with colors/symbols attached to each side. Two tiles can be placed adjacent to each other only if the sides along which they meet have the same color. Figure 1 shows a Wang tiling covering a 3 × 3 square. In a valid Wang tiling, we make use of a tile set T, which is the set of all unique tiles used in the tiling. The computational power of Wang tiling is the same as that of a Turing machine. We now define the tile assembly model formally. Let Σ be an alphabet (color set). Then a (Wang) tile t over Σ is a 4-tuple (σN, σS, σE, σW) ∈ Σ⁴, where σN is the color attached to the top (north) side of the tile; similarly, the colors σS, σE, σW are attached to the bottom (south), right (east) and left (west) sides of t (see Figure 1). The tiles in this model cannot be rotated (e.g., (σN, σE, σS, σW) ≠ (σN, σS, σE, σW)). We define a strength function g : Σ → R on the symbols/colors in Σ. The reason for defining a strength function g is as follows. In practice we need the nano structures built using self assembly to be stable. To guarantee stability we must ensure that every tile in the tiling has enough bond strength; this bond strength threshold is defined by the parameter α. A tile receives its bond strength from every tile adjacent to it. Any tile has at least one and at most four adjacent tiles. Each bond depends on the color and has a different strength
Fig. 1. A Wang tiling covering a 3 × 3 square
Fig. 2. A self assembly process with α = 2 (legend: bonds have strength 1 or 2; α = 2 is the threshold bond strength)
indicated by the function g. Bond strengths are additive. A new tile can enter the self assembly process only if it can receive a total bond strength of at least α. For the self assembly process in Figure 2, α = 2. The new tile shown in the figure can enter the self assembly because it receives a total bond strength of 2, one each from the yellow and green bonds; note that g(yellow) = 1 and g(green) = 1. Let T ∪ {φ} be the set of tiles (including an empty tile φ). Then a valid Wang tiling of a region using tiles from T is represented by a tile placement function M : N × N → T ∪ {φ}. If a location (x, y) on the grid is not covered by any of
the unit tiles from T then M(x, y) = φ. The self assembly process starts with a seed tile s placed at some standard location such as (0, 0), and continues to grow (by adding one tile at a time) around the seed tile. The self assembly proceeds as long as there is at least one tile t ∈ T which can gather a bond strength of at least α from the partial assembly; otherwise the self assembly stops and becomes terminal. There may be several ways to grow the assembly from the seed tile s, depending on which tile we choose from T. Irrespective of the number of ways to grow the partial assembly, if the terminal assembly is unique, we say that the tile set T produces the corresponding structure uniquely and is terminal. Throughout this discussion we are interested in building tile sets that are terminal and produce the corresponding structure uniquely when α = 2. The number of unique tiles in a tile set T is defined as the tile complexity.
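To make the attachment rule concrete, the following sketch (ours, not from the paper; the tuple encoding of tiles and all names are illustrative) implements the α-threshold test that decides whether a tile may join a partial assembly:

```python
from typing import Dict, Tuple

# A Wang tile: colors on its north, south, east, west sides (no rotation).
Tile = Tuple[str, str, str, str]            # (north, south, east, west)
Grid = Dict[Tuple[int, int], Tile]          # partial assembly: (x, y) -> tile

def bond_strength(tile: Tile, pos: Tuple[int, int], grid: Grid,
                  g: Dict[str, int]) -> int:
    """Total strength a candidate tile would receive at pos from its
    (at most four) neighbours; each matching abutting pair of sides
    with color c contributes g(c)."""
    x, y = pos
    n, s, e, w = tile
    total = 0
    nb = grid.get((x, y + 1))
    if nb is not None and nb[1] == n:       # neighbour's south vs our north
        total += g.get(n, 0)
    nb = grid.get((x, y - 1))
    if nb is not None and nb[0] == s:       # neighbour's north vs our south
        total += g.get(s, 0)
    nb = grid.get((x + 1, y))
    if nb is not None and nb[3] == e:       # neighbour's west vs our east
        total += g.get(e, 0)
    nb = grid.get((x - 1, y))
    if nb is not None and nb[2] == w:       # neighbour's east vs our west
        total += g.get(w, 0)
    return total

def can_attach(tile: Tile, pos: Tuple[int, int], grid: Grid,
               g: Dict[str, int], alpha: int = 2) -> bool:
    """A tile may enter the assembly only if it gathers strength >= alpha."""
    return pos not in grid and bond_strength(tile, pos, grid, g) >= alpha
```

A terminal assembly is reached when can_attach is false for every tile in T at every empty position adjacent to the partial assembly.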
3 Tile Sets for Self Assembling Triangles
We introduce our tile set constructions in stages. To introduce our ideas, we begin by presenting a tile set of tile complexity Θ(N) to self assemble a triangle whose height is N (units) and whose base is 2N − 1 units. We call this tile set TileSet1. After that we present a tile set which can self assemble a triangle of base and height Θ(N) with a tile complexity of Θ(log(N)). We call this tile set TileSet2.
3.1 Construction of TileSet1
The idea behind the construction of TileSet1 is to generate a double staircase pattern. Winfree [1] used a single staircase pattern in his construction to self assemble an N × N square. Staircase patterns are generated using colors having a strength of 2. Figure 3 illustrates our construction. In this figure the solid black dots and solid gold dots each have a strength of 2; all other dots have a strength of 1. The letter symbols (B, C, D, E) also have a strength of 2 each. We use row-major order to refer to the tiles in TileSet1. For example, the seed tile is referred to as t1. The other two tiles in the first row are named t2 and t3, respectively. The tiles in the second row are named t4, t5, and t6, respectively, from left to right; and so on. The self assembly process starts with the seed tile t1 (the green colored tile in Figure 4). Once the seed tile t1 is in place, the only way for the self assembly to proceed is to use tile t2, which has the symbol B on its top. Since B has a strength of 2, tile t2 has enough strength to attach to t1. All the sides of tile t2 have a strength of 2, so there are three ways for the self assembly to proceed. Consider the growth along the sides of tile t2. Since the right side of t2 has a gold dot of strength 2, tile t9 can be attached on the right side of t2. After this, t7 can be attached to the right, since the black dot has a strength of 2. At this stage the only way to progress the self assembly is by using tile t3, since it can attach to tile t2 at the bottom via the symbol C, which has a strength of 2. After this, tile t8 can be attached to the left of t2. Although
Fig. 3. A self assembly of a triangle with height 5 and base 9, with α = 2. All bonds have strength 1 except the solid black bond and the bonds between letter symbols, which have strength 2. The green colored tile is the seed tile.
Fig. 4. Progress of the self assembly process with tile set TileSet1
it does not have any colors/symbols with strength 2, t8 receives a strength of 2 by making two strength-1 bonds, one on the top and one on the left (see Figure 4). This process continues, and finally the self assembly becomes terminal as in Figure 3. The construction generalizes to build a triangle with height N and base 2N − 1. The tile set consists of exactly N + 5 tiles, giving a tile complexity of O(N).
Fig. 5. Tile set TileSet2, grouped into Set 1 through Set 5
Theorem 1. The construction based on TileSet1 can self assemble a triangle with height N and base 2N − 1 with a tile complexity of O(N).

Proof. We prove this by induction. The case N = 5 forms the basis, and it is true from Figure 3; as we can see, the self assembly is terminal and unique for α = 2. Let the hypothesis be true for N = k (i.e., we need k + 5 tiles in the tile set). The only way to extend this is to add a tile whose top carries a symbol matching the bottom letter (e.g., in Figure 3 the bottom letter is F), similar to t2, t3, t4. Thus we need to add just one extra tile to the tile set to produce a triangle with height k + 1 and base 2(k + 1) − 1. This proves that the tile complexity of TileSet1 is O(N).
4 Construction of Tile Set TileSet2
In this section we present a tile set TileSet2 which can self assemble a triangle with Θ(N) height and Θ(N) base with a tile complexity of Θ(log(N)). The key idea behind this construction is to use self assembling binary counters as efficiently as possible. There are four major components in our construction. The first component is the binary counter. The second component is the double staircase described in the previous section. The third component is an upper triangle with height log(N)/2 and base log(N) − 1. The final component is a connector which combines these independent components. See Figure 6 for an
Fig. 6. The four major components in building the triangle with TileSet2: the counter component, the staircase components, the upper triangle component, and the connector component
overview of the components involved in our construction. Self assembly of binary counters is already known and was also used by Winfree to generate an N × N square. The difference in our construction is how we use the overflow bit from the counter to seed the double staircase pattern. We introduce our ideas by describing how the construction self assembles a triangle of base 19 and height 10. To do this we use tiles from the tile set shown in Figure 5. We have grouped the tiles into five sets. Tiles in Sets 3, 4 and 5 do not depend on the dimensions of the triangle we wish to self assemble and hence are of constant size; the tiles in Sets 1 and 2 do depend on the dimensions. As in the previous section, we name the tiles using row-major order: tile tij refers to the tile from Set i with row-major index j (e.g., t11 refers to the first tile in Set 1 in row-major order, which is the green colored seed tile). The self assembly process first builds the binary counter component; then the upper triangle and the double staircase are assembled concurrently.
4.1 Self Assembling a Binary Counter Component
The self assembly process starts with the seed tile t11. This seed tile has a blue dot on the top and a symbol A on the left, both of which have a strength of 2. It is easy to see how the bottom row is self assembled using tiles from Set 1. Note that after this the bottom row of the counter cannot expand further towards the left or towards the top, because the white dot and light magenta dots of tile t14 are
Fig. 7. Self assembly process of the binary counter component using TileSet2
of strength 1 (see Figure 7(a)). The only way for the self assembly to proceed further is via the blue dot, which has a strength of 2. Tile t56 has a blue dot at the bottom and thus attaches to t11 on the top (see Figure 7(b)). Next, the tiles from Set 3 play a vital role in building the binary counter. We can think of each tile in Set 3 as a 1-bit counter with two inputs (a carry-in bit and an input bit) and two outputs (a carry-out bit and an output bit). The carry-in, input, carry-out and output bits are encoded on the right, bottom, left and top sides of the tile, respectively (e.g., tile t32 encodes input bit = 0, carry-in bit = 1, output bit = 1 and carry-out bit = 0). Tile t56 starts the counter by introducing a carry of 1. Although all the bonds of the tiles in Set 3 are of strength 1, these tiles receive a strength of 1 from the bottom and a strength of 1 from the left, and hence take part in progressing the self assembly from right to left (see Figure 7(c)). Also observe that tile t54 takes the leftmost position on this row, as the upper side of tile t14 contains a light magenta dot. At this stage (Figure 7(c)) the only way to progress the self assembly is with tile t51, which has a red dot of strength 2 at the bottom. So tile t51 attaches to t54 on the top. This stage of the self assembly is called the copy stage, since tile t51 merely copies the bit pattern from bottom to top using tiles t52 and t53 (see Figure 7(e)). Also note that tile t55 is placed at the rightmost corner. At this stage (Figure 7(e)) the only way to progress is with tile t56. As described previously, the process continues in a spiral fashion (see Figure 7(g)) until the counter overflows, which happens when the carry bit of the most significant tile is 1 (see Figure 7(g)). At this point
Fig. 8. Row of the connector component, which seeds the double staircase patterns and the upper triangle
in time, the only tile which can fit in is t57, and as a result the counter finally terminates. In the next section we see how to make use of the overflow tile to start the double staircase.
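The 1-bit counter behaviour encoded by the Set 3 tiles is that of a half-adder, as the following sketch shows (our illustration; the tile names follow the text, but the procedural encoding is an assumption):

```python
def counter_tile(input_bit: int, carry_in: int) -> tuple:
    """A Set 3 tile as a half-adder: input bit on the bottom, carry-in on
    the right, output bit on the top, carry-out on the left."""
    output = input_bit ^ carry_in
    carry_out = input_bit & carry_in
    return output, carry_out

def increment_row(bits: list) -> tuple:
    """One counter row of the assembly: add 1 to a little-endian bit row,
    as happens when tile t56 injects a carry of 1 on the right."""
    carry = 1                      # seeded by t56
    out = []
    for b in bits:                 # right-to-left growth, LSB first
        o, carry = counter_tile(b, carry)
        out.append(o)
    return out, carry              # a final carry of 1 is the overflow
                                   # that lets t57 terminate the counter

# increment_row([1, 1, 0]) -> ([0, 0, 1], 0), i.e., 3 -> 4
```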
4.2 Self Assembly of the Double Staircase and Upper Triangular Components
In the previous section we saw that tile t57 terminates the binary counter (see Figure 7(g)). There is only one way to progress the self assembly from this stage, since tile t57 has a cyan dot of strength 2 on the top. This allows tile t21 to attach to tile t57 on the top. The tiles in Set 2 act as connectors (see Figure 6) between the staircase patterns and the upper triangle. After tile t21 is placed, since the symbols on the left and right sides of the tiles in Set 2 are unique and each is of strength 2, they assemble as a row from left to right (see Figure 8). This row seeds the staircase on the left corner, since it has a black dot of strength 2 on the left corner; it seeds a staircase on the right corner as well, since it has a golden dot on the right corner. Finally, since tile t25 has a symbol A of strength 2 on the top, it also acts as the seed for the upper triangle. The upper triangle is then self assembled in the same way as with TileSet1. Finally, the double staircase patterns and the upper triangle self assemble and become terminal, producing the final nano structure in Figure 9.

Theorem 2. The construction based on TileSet2 can self assemble a triangle with height N and base 2N − 1 with a tile complexity of Θ(log(N)).

Proof. From our previous discussion, we can see that only the tiles in Set 1 and Set 2 contribute to the tile complexity; their size depends on the dimensions of the triangle which we want to self assemble. It is clear that the number of bits we
Fig. 9. The final terminal and unique self assembly of the triangle with TileSet2 (the letters A–D mark strength-2 bonds)
choose to use in our binary counter is what determines the tile complexity. Suppose we choose a binary counter of t bits. We will need t unique tiles in Set 1 and t/2 unique tiles in Set 2 (to seed the upper triangle), i.e., t + t/2 tiles in total. Thus our tile complexity is 3t/2 + k, where k is the total (constant) size of Sets 3, 4, and 5. The maximum height of the triangle which we can build using the t-bit binary counter is 2 × 2^t + t/2. Therefore, for a given N we choose t such that N = 2 × 2^t + t/2, which means that t = Θ(log(N)).
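As a small numeric illustration of this choice of t (our sketch, not part of the paper), one can scan for the smallest counter width satisfying the height bound:

```python
def counter_bits_for_height(N: int) -> int:
    """Smallest t with 2 * 2**t + t/2 >= N; since the left-hand side
    roughly doubles with each increment of t, t = Theta(log N)."""
    t = 1
    while 2 * 2 ** t + t / 2 < N:
        t += 1
    return t

# counter_bits_for_height(10) == 3; the tile complexity is then 3t/2 + k
```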
5 Extending Our Construction for More General Shapes
The ideas presented here form a starting point for creating tile sets for more general shapes. We saw in the previous section that we can seed a t-bit counter to grow upwards; similar ideas can be applied to build a self assembling t-bit counter that grows downwards. By doing this we can flip the triangle along its base and create a tile set which can self assemble a rhombus, within the same tile complexity of Θ(log N) (where N is the length of the diagonal of the rhombus). Other shapes such as the pentagon and the hexagon can be self assembled by combining our triangle tile set with that of a square. In our self assembled triangles the width of each row, starting from the base, decreases by two tiles (one from the left and one from the right, symmetrically). We can further modify the tile set so that a constant number of rows have the same width; this can help us self assemble shapes that are closer and closer approximations to a circle.
6 Conclusions
Efficient methods which can self assemble arbitrary shapes are essential for nano fabrication. Efficient constructions were given earlier to self assemble squares. In this paper, for the first time, we show that we can self assemble a triangle of height N (units) and base 2N − 1 (units) with a tile complexity of Θ(log(N)) when α = 2.

Acknowledgements. This work has been supported in part by the following grants: NSF 0326155, NSF 0829916 and NIH 1R01GM079689-01A1.
References
1. Soloveichik, D., Winfree, E.: Complexity of self-assembled shapes. SIAM Journal on Computing 36(6), 1544–1569 (2007)
2. LaBean, T.H., Winfree, E., Reif, J.H.: Experimental progress in computation by self-assembly of DNA tilings. Proc. DNA Based Computers 54, 123–140 (2000)
3. Winfree, E., Liu, F., Wenzler, L.A., Seeman, N.C.: Design and self-assembly of two-dimensional DNA crystals. Nature 394(6693), 539–544 (1998)
4. Seeman, N.C.: Nucleic acid junctions and lattices. Journal of Theoretical Biology 99(2), 237–247 (1982)
5. Rothemund, P.W.K., Papadakis, N., Winfree, E.: Algorithmic self-assembly of DNA Sierpinski triangles. PLoS Biology 2(12) (2004)
6. Paciotti, G.F., Myer, L.: Colloidal gold: a novel nanoparticle vector for tumor directed drug delivery. Drug Delivery: Journal of Delivery and Targeting of Therapeutic Agents 11(3), 169–183 (2008)
7. Chhabra, R., Sharma, J., Liu, Y., Yan, H.: Patterning metallic nanoparticles by DNA scaffolds. Advances in Experimental Medicine and Biology 620(10), 107–116 (2004)
8. Ando, M., Kawasaki, M.: Self-aligned self-assembly process for fabricating organic thin-film transistors. Applied Physics Letters 85(10), 1849–1851 (2004)
9. Zimmler, M.A., Stichtenoth, D.: Scalable fabrication of nanowire photonic and electronic circuits using spin-on glass. Nano Letters 8(6), 1895–1899 (2008)
10. Rothemund, P.W.K., Winfree, E.: Program-size complexity of self-assembled squares. In: ACM Symposium on Theory of Computing (STOC), pp. 459–468 (2000)
11. Wang, H.: An unsolvable problem on dominoes. Technical Report BL-30 (1962)
12. Adleman, L., Cheng, Q., Goel, A., Huang, M.: Running time and program size for self-assembled squares. In: Annual ACM Symposium on Theory of Computing, pp. 740–748 (2001)
13. Aggarwal, G., Cheng, Q., Goldwasser, M.H., Kao, M., De Espanes, P.M., Schweller, R.T.: Complexities for generalized models of self-assembly. SIAM Journal on Computing 34(6), 1493–1515 (2005)
14. Kao, M., Schweller, R.: Randomized self-assembly for approximate shapes. In: Aceto, L., Damgård, I., Goldberg, L.A., Halldórsson, M.M., Ingólfsdóttir, A., Walukiewicz, I. (eds.) ICALP 2008, Part I. LNCS, vol. 5125, pp. 370–384. Springer, Heidelberg (2008)
15. Kao, M., Schweller, R.: Reducing tile complexity for self-assembly through temperature programming. In: Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 571–580 (2006)
Motif Construction from High-Throughput SELEX Data (Invited Keynote Talk)
Esko Ukkonen
Helsinki Institute for Information Technology and Department of Computer Science, P.O. Box 68 (Gustaf Hällströmin katu 2b), FIN-00014 University of Helsinki
[email protected]
Systematic evolution of ligands by exponential enrichment (SELEX) is an in vitro selection–amplification approach that can be used to produce generations of samples of sequences that are affine to some transcription factor. High–affinity sequences are enriched in this process. Combined with high–throughput sequencing, SELEX yields large sets of affine sequences from which binding affinity models such as positional weight matrices (PWMs) can be estimated. The talk will describe some recent developments in identifying various affinity models from SELEX data.
Supported by the Academy of Finland under grant 7523004 (Algorithmic Data Analysis).
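As background only (this sketch is ours, not from the talk), a positional weight matrix can be estimated from a set of equal-length enriched sequences as a log-odds matrix against a uniform background:

```python
import math
from collections import Counter

def estimate_pwm(sequences, pseudocount=1.0):
    """Log-odds PWM (vs. a uniform 0.25 background) from equal-length
    binding-site sequences, e.g. reads enriched by SELEX."""
    L = len(sequences[0])
    assert all(len(s) == L for s in sequences)
    pwm = []
    for i in range(L):
        counts = Counter(s[i] for s in sequences)
        total = len(sequences) + 4 * pseudocount
        pwm.append({b: math.log(((counts[b] + pseudocount) / total) / 0.25)
                    for b in "ACGT"})
    return pwm

def pwm_score(pwm, site):
    """Additive log-odds score of a candidate site under the PWM."""
    return sum(col[b] for col, b in zip(pwm, site))
```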
Rearrangement Phylogeny of Genomes in Contig Form
Adriana Muñoz¹ and David Sankoff²
¹ School of Information Technology and Engineering, University of Ottawa
² Department of Mathematics and Statistics, University of Ottawa
Abstract. There has been a trend in increasing phylogenetic coverage for genome sequencing while decreasing the sequencing coverage for each genome. With lower coverage, there is an increasing number of genomes being published in contig form. Rearrangement algorithms, including gene order-based phylogenetic tools, require whole genome data on gene order, segment order, or some other marker order. Items whose chromosomal location is unknown cannot be part of the input. The question we address here is, for gene order-based phylogenetic analysis, how can we use rearrangement algorithms to handle genomes available in contig form only? Our suggestion is to use the contigs directly in the rearrangement algorithms as if they were chromosomes, while making a number of corrections, e.g., we correct for the number of extra fusion/fission operations required to make contigs comparable to full assemblies. We model the relationship between contig number and genomic distance, and estimate the parameters of this model using insect genome data. With this model, we can then reconstruct the phylogeny based on genomic distance and numbers of contigs.
1 Introduction
While the increasing pace of genome sequencing is adding phylogenetic breadth to the inventory of species available for comparative genomics, the sequencing coverage of many of these species is not sufficient to produce completely assembled genomes. Instead, the published and archived data remain in contig form, not necessarily associated with chromosomal scaffolds, and there are often no resources allocated to further polishing. The price paid for increasing phylogenetic coverage in genome sequencing is thus decreased sequencing coverage for each genome; with lower coverage, more genomes are being published in contig form. While such data may be adequate for many types of comparative genomic studies, they are not directly usable as input to genome rearrangement algorithms. These algorithms require whole genome data, i.e., complete representations of each chromosome in terms of gene order, conserved segment order, or some other marker order, in order to calculate the rearrangement distance d between two genomes. Items whose chromosomal location is unknown cannot be part of the input. The present paper deals with gene order-based phylogeny. The question we ask here is: is there any way to use genome rearrangement algorithms to compare
genomes available in contig form only? One elegant answer was provided by Gaul and Blanchette [7] for the comparison of two genomes. Their method constructs a number of intermediate structures before actually comparing the genomes. Since we will be using distance matrix methods for phylogenetic analysis, the Gaul and Blanchette procedure is largely irrelevant here; we need distances, not the detailed reconstruction of the structures used in calculating the distance. For these purposes, involving more than two genomes, our suggestion is to use the contigs directly in the rearrangement algorithms as if they were chromosomes. This introduces a number of biases, such as increasing the distance to accommodate the count of extra fusion/fission operations necessary to compare genomes with different numbers of chromosomes. This bias and other problems, with rearrangement distances in general and with contig-based distances in particular, must be corrected during the construction of a distance matrix to be input into a phylogenetic analysis. We apply our methods to data originating mostly in the 12-genome Drosophila project [5]. We compare ten Drosophila genomes with two other dipteran genomes and two outgroup insect genomes. We discuss this data in Section 2. In Section 3, we model the behaviour of the genomic distance as a function of evolutionary time, and discuss how to invert this function in order to infer elapsed time. In Section 4 we study the case where one of the two genomes being compared is fully assembled and the other is in contig form. Simulations are used to understand the consequences for evolutionary time inference of using incomplete assemblies. The ideas developed there are then extended, in Section 5, to the more complex case where both genomes are fragmented into contigs. We then construct a matrix of corrected evolutionary divergence times between all pairs of genomes in the database and carry out a phylogenetic analysis of the fourteen genomes, in Section 6. Finally, in the Conclusion, we suggest a simplifying hypothesis for further mathematical and empirical work on the contig problem.
2 The Data
One of the difficulties in using gene order rearrangement algorithms is the lack of curated gene order databases for the higher eukaryotes with sequenced genomes. Because the gene identification and homology identification had already been done in Ref. [5], we use a carefully constructed inventory of neighbouring gene pairs (NGPs) in ten Drosophila species and four outgroup insects, rather than raw contig data. A.J. Bhutkar provided us with a file listing all NGPs and the genomes in which they appear. By the time of writing, the assembly of these genomes has progressed, but for our purposes, i.e., to show how to handle genomes in contig form, the original data set is preferred. We abstracted best-judgement divergence times among the genomes from a number of somewhat contradictory recent publications [8,10,11], as summarized in Figure 1. Bhutkar et al. [2,3] have already used the NGP data for a phylogenetic analysis of Drosophila, inferring phylogenies, rearrangements and synteny blocks, but our
Fig. 1. Phylogeny of Drosophila and outgroups abstracted from the literature, with divergence times (scale: 250 to 0 million years)

Table 1. Number of contigs constructed for each genome

species (abbreviation)        genes  contigs
D. melanogaster (Dmel)        8867   6
D. sechellia (Dsec)           8851   66
D. yakuba (Dyak)              8809   30
D. erecta (Dere)              8866   9
D. ananassae (Dana)           8844   40
D. pseudoobscura (Dpse)       8778   51
D. persimilis (Dper)          8779   87
D. virilis (Dvir)             8855   32
D. mojavensis (Dmoja)         8853   14
D. grimshawi (Dgri)           8801   35
Anopheles gambiae (Anoph)     6168   6
Aedes aegypti (Aedes)         6318   869
Apis mellifera (Apis)         4898   702
Tribolium castaneum (Trib)    5647   89
use of the NGP here is different. It is simply to reconstruct the gene orders in the contigs; we wish to create a data set for testing our method for gene order-based phylogenetics from genomes in contig form. For each genome, we constructed contigs by amalgamating overlapping NGPs. Whenever we arrived at a gene in only one NGP in a genome, this terminated a contig. Our reconstruction then does not necessarily correspond completely to the original contigs in the 12-genome Drosophila sequencing project [5], but this has little importance for our work – how the genomes are fragmented into contigs,
and into how many, is a methodological question that depends on laboratory resources and techniques and has nothing directly to do with how the genome has evolved. (Both contig ends and rearrangement breakpoints may be enriched for duplicated sequence, but this indirect connection has no consequence for the problem we are attacking). Table 1 gives the number of contigs reconstructed for each genome. Note that the reconstructions of D. melanogaster, D. erecta and An. gambiae reflect the complete, or almost complete, assembly of these genomes.
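The contig reconstruction just described amounts to walking maximal paths in the graph whose edges are a genome's NGPs; a gene occurring in only one NGP (degree 1) terminates a contig. A minimal sketch (ours; it assumes the NGPs of a genome form simple paths, i.e., each gene has at most two distinct neighbours):

```python
from collections import defaultdict

def contigs_from_ngps(ngps):
    """Amalgamate neighbouring gene pairs (NGPs) into contigs by walking
    maximal paths in the adjacency graph."""
    adj = defaultdict(set)
    for a, b in ngps:
        adj[a].add(b)
        adj[b].add(a)
    seen, contigs = set(), []
    for start in [g for g in adj if len(adj[g]) == 1]:   # path endpoints
        if start in seen:
            continue
        path, prev, cur = [start], None, start
        seen.add(start)
        while True:
            nxt = [g for g in adj[cur] if g != prev]
            if not nxt:
                break
            prev, cur = cur, nxt[0]
            path.append(cur)
            seen.add(cur)
        contigs.append(path)
    return contigs

# contigs_from_ngps([("a", "b"), ("b", "c"), ("x", "y")])
#   -> [["a", "b", "c"], ["x", "y"]] (contig order may vary)
```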
3 Genomic Distance and Evolutionary Time
We assume familiarity with the classical genetics notions of inversion, transposition and reciprocal translocation of chromosome segments, as well as chromosomal fission and fusion. These are formalized in papers such as those by Tesler [12], Yancopoulos et al. [14], and Bergeron et al. [1]. Briefly, representing a chromosome as a string of genes h1 · · · hl, where a pair of successive genes hu hu+1 is termed an adjacency, we can illustrate:
• an inversion (implying change of sign, i.e., change of strand) of a chromosomal segment: h1 · · · hu · · · hv · · · hm → h1 · · · −hv · · · −hu · · · hm, disrupting the two adjacencies hu−1 hu and hv hv+1;
• a transposition of a chromosomal segment: h1 · · · hu · · · hv · · · hw · · · hm → h1 · · · hu−1 hv · · · hw hu · · · hv−1 hw+1 · · · hm, disrupting the three adjacencies hu−1 hu, hv−1 hv and hw hw+1;
• a reciprocal translocation between two chromosomes: h1 · · · hu · · · hl, k1 · · · kv · · · km → h1 · · · kv · · · km, k1 · · · hu · · · hl, disrupting the two adjacencies hu−1 hu and kv−1 kv;
• a chromosome fission: h1 · · · hv · · · hl → h1 · · · hv, hv+1 · · · hl, disrupting the adjacency hv hv+1; and
• the fusion of two chromosomes: h1 · · · hl, k1 · · · km → h1 · · · hl k1 · · · km.
The genomic distance is the minimum number of operations of these types (or of some specified subset of types) required to transform one of the genomes being compared into the other. The authors mentioned above also provide rapid algorithms for deriving the distance, given genomes composed of ordered chromosomes represented by the same n genes, markers or segments in the two genomes, assuming the strandedness, or reading direction, of each gene is known.
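These operations are straightforward to state on signed gene lists; the sketch below (our illustration, using 0-based list indices rather than the paper's subscripts) mirrors the definitions above:

```python
def inversion(chrom, u, v):
    """Reverse chrom[u..v] and flip signs (change of strand)."""
    return chrom[:u] + [-g for g in reversed(chrom[u:v + 1])] + chrom[v + 1:]

def translocation(c1, c2, u, v):
    """Reciprocal translocation exchanging the suffixes starting at u and v."""
    return c1[:u] + c2[v:], c2[:v] + c1[u:]

def fission(chrom, u):
    """Break the adjacency between positions u and u+1."""
    return chrom[:u + 1], chrom[u + 1:]

def fusion(c1, c2):
    """Concatenate two chromosomes."""
    return c1 + c2

# Chromosomes are lists of signed genes, e.g. [1, -3, 2].
# inversion([1, 2, 3, 4], 1, 2) -> [1, -3, -2, 4]
```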
Even assuming that rearrangements occur at a relatively constant rate over time and are randomly positioned in the genomes, we have no simple, exact probability relationship between the actual number τ of rearrangements after a certain time t has elapsed and the number of rearrangements d inferred by applying the genomic distance algorithms to compare the initial and the derived genomes [4,6,13]. We can, however, model the proportion of adjacencies that will be disrupted versus the proportion that will remain intact after τ random rearrangements. For each of the adjacencies in the original genome, the probability that it remains undisrupted after τ rearrangements is (1 − λ/n)^τ, or approximately e^{−λτ/n}, where λ depends on the proportions of the various kinds of rearrangements in the model. Thus the number of disrupted adjacencies will be approximately n(1 − e^{−λτ/n}). Now, we can expect at the τ-th step that the increase in d will also be closely connected to the proportion of the adjacencies between genes that have not been created, i.e., have never been disrupted, by the previous τ − 1 rearrangements: if the τ-th rearrangement only disrupts adjacencies created in previous steps, it is quite likely that the inference algorithm will suggest an optimal evolutionary history requiring no more rearrangements than were required after the (τ − 1)-st step. Then, though we do not know the precise probability law of d, we can hypothesize as a first approximation

E(d) ≈ n(1 − e^{−λτ/n}),   (1)
where n is the number of ordered genes or markers in both genomes, and λ in this case is a constant close to 1, since we know that d ≈ τ for small τ and that d/n → 1 as τ → ∞. Then, if we knew λ, we could estimate τ using

τ̂ = −(n/λ) log(1 − d/n).   (2)

In fact, the relationship between the actual and inferred numbers of rearrangements (not shown) deviates considerably from the one-parameter model in Eq. 1, both for small and large τ. Combinatorial effects result in E(d) < τ even for very small values of τ, and the approach to the asymptote E(d)/n → 1 is faster than Eq. 1 would suggest. We thus have recourse to a two-parameter model by adding a quadratic correction to the linear term in the exponent, so that the model becomes

E(d) ≈ n(1 − e^{−λ1 τ/n − λ2 (τ/n)²}),   (3)

in which case the estimate of τ becomes

τ̂ = (n / 2λ2) [ −λ1 + sqrt( λ1² − 4λ2 log(1 − d/n) ) ].   (4)
This analysis resembles the "empirical" approach of Ref. [13] to the relationship between d and τ, which also makes use of two parameters, except that our starting point is the intuitive development leading to Eq. 1 at the beginning of this section, whereas Ref. [13] takes a purely curve-fitting approach from the outset. To estimate the parameters λ1 and λ2, we simulate pairs of genomes with n = 8867, the maximum number of genes used in our Drosophila melanogaster comparisons, and up to τ = 9000 random rearrangements to derive one genome from the other. We assume the rearrangements are almost exclusively inversions (around 99.8%), reflecting the evolutionary history of Drosophila. We use a DCJ
Fig. 2. Predicted (curve) and observed (dots) values of the genomic distance d, and inferred (open dots) values of τ̂ versus true (diagonal line) values of τ
algorithm [14,1] to calculate d from the genomes. This is repeated 100 times, and d averaged, to estimate E(d). Figure 2 shows the relationship between τ and both E(d) and τˆ, using the values λ1 = 0.846 and λ2 = 0.576, found by a least sum of squares criterion applied to the set of τ and τˆ values. The way τ and d are normalized means that the parameters should not be very sensitive to n, though we do not study this here, since the experimental genomes are of comparable sizes.
4 The Effect of Genome Fragmentation
Consider one completely assembled genome B and another, A, in contig form only. The basic idea is that if we treat each contig as a chromosome, a rearrangement algorithm will automatically carry out a number of “fusions” to assemble the χA contigs in A into a small number of inferred chromosomes equal to the number χB in B, in calculating d. At the same time it will find other rearrangements, but we know that the fusions can be separated out as an initial step without changing the total number of rearrangements required. Furthermore, we know exactly how many fusions are required, namely the difference between the number of contigs in A and the number of chromosomes in B. (The optimal scenario will never require both fusions and fissions.)
Thus, when we use a rearrangement algorithm to compare a genome A in contig form with an assembled genome B, obtaining a preliminary distance d′, it may seem appropriate to correct this to

d = d′ − |χA − χB|.   (5)
The absolute value signs accommodate the rare case where χA < χB. Since the rearrangement distance can be achieved by doing all the translocations and fusions first, before all the inversions, the correction |χA − χB| is a fixed value and does not depend on the details of the rearrangement scenario, of which there may be many for a particular data set. If this whole line of argument were universally valid, we could simply substitute correction Eq. 5 into Eq. 2 or 4 to estimate τ. In reality, this correction is only appropriate for small values of τ (e.g., τ < 0.1n). For larger values, the apparent rearrangement distance d′ based on contigs is inflated by less than |χA − χB| over one based on the correctly assembled genomes. The fragmentation of the genome into contigs allows the algorithm, in effect, to compare more similar, albeit incorrect, assemblies. This effect was previously noted in Ref. [9]. To circumvent it, we should only remove a proportion α of |χA − χB| from d′. How large a proportion? To answer this, we undertook a series of simulations, starting from an initial genome B containing 8867 genes in χB = 6 chromosomes, generating 100 rearranged genomes, each through τ random rearrangements applied to B to produce a new genome, and each then fragmented into χA contigs. This was repeated for a range of values of τ and χA. The average results for d′ are summarized on the left of Figure 3. First, the linearity of the response to increasing χA is clear, at least in the range studied (χA < 1000), indicating that Eq. 5 should be replaced by

d = d′ − α(τ)|χA − χB|,   (6)
where α(τ) is a decreasing function of the number of rearrangements τ. This decrease is not linear; for practical purposes, we can fit α(τ) with a quadratic function. Also, as we already know from Eq. 3 and Figure 2, d/τ is a decreasing function of τ. The dependence of α and d/τ on τ, as derived from the simulations, is shown on the right of Figure 3. Given d′, then, we can solve Eqs. 3 and 6 simultaneously to find τ and d, since n, λ1, λ2, χA and χB are known, as is the dependence of α on τ. In practice, this can be done by successive iteration of Eqs. 4 and 6, which converges rapidly, initializing with, for example, τ0 = d′. Applying this to the comparison of the completely assembled D. melanogaster genome with each of the other 13 genomes, and to the comparison of the completely assembled Anopheles gambiae genome with each of the other 13 genomes, gives the results on the left of Figure 4. The high degree of scatter at higher divergence times reflects both the uncertainty of the divergence dates and the inhomogeneity of rearrangement rates, both between the fruitfly and mosquito families within the dipteran order and among the three orders in the class Insecta represented in these data.
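The successive iteration of Eqs. 4 and 6 can be sketched as follows, reusing estimate_tau from the previous sketch together with the quadratic fit for α(τ) given in the caption of Figure 3 (our code; a fixed iteration count stands in for a proper convergence test):

```python
def alpha(tau):
    """Quadratic fit to the simulated fusion-correction factor (Figure 3)."""
    x = tau / 1000.0
    return 1 - 0.0276 * x - 0.0063 * x ** 2

def correct_one_fragmented(d_prime, chi_A, chi_B, n, iters=50):
    """Recover (tau, d) from the contig-based distance d' when genome A
    is in contig form and genome B is fully assembled (Eqs. 4 and 6)."""
    tau = d_prime                  # initialization suggested in the text
    for _ in range(iters):
        d = d_prime - alpha(tau) * abs(chi_A - chi_B)
        tau = estimate_tau(d, n)   # Eq. 4
    return tau, d
```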
Fig. 3. For genomes generated by τ = 1000, 3000, 6000 or 8000 rearrangements, broken into χ = 100, 200, . . . , 1000 contigs: (left) the relationship between the uncorrected genomic distance d′ and χ, with trend lines d′ = 0.96χ + 899 (τ = 1000), d′ = 0.87χ + 2612 (τ = 3000), d′ = 0.60χ + 5030 (τ = 6000) and d′ = 0.38χ + 6293 (τ = 8000); (right) the parameters α and d of the linear dependence d′ = αχ + d, as a function of τ. The dotted line represents predicted behaviour based on Eq. 3; the solid line represents the quadratic fit α(τ) = 1 − 0.0276(τ/1000) − 0.0063(τ/1000)².
Fig. 4. (left) Comparison of D. melanogaster with 13 other genomes (solid dots). Comparison of Anopheles gambiae with 13 other genomes (open dots). Divergence, in total number of genome rearrangements, estimated from genomic distances through Eqs 3 and 6, compared to divergence times abstracted from the literature. Line represents least squares fit to all points. (right) Pairwise comparison of all pairs of 14 genomes, as discussed in Section 5. Line represents least squares fit.
5 The Case of Both Genomes in Contig Form
When we compare two incompletely assembled genomes A and B, we may still wish to remove some quantity depending on χA and χB from d′ to account for the fusions (and/or fissions), but this is not as easy to analyze, for two reasons. One is that we are not comparing a fragmented genome to a complete genome, so we can no longer consider this correction as a way of using the assembled genome as a guide for reconstructing the fragmented genome, simultaneously with the distance calculation. The second problem is that there is no obvious way, within the formula, of combining (adding, multiplying, . . . ) the number of contigs in one genome with the number in the other. This reflects the lack of intuition on how the contigs increase the distance (because of artificial fusions and fissions) on one hand, and how they decrease it (by multiplying the number of economical but false rearrangements) on the other. These reasons lessen the intuitive appeal of the kind of correction we used in the previous section. Nevertheless, we can try to find an appropriate correction using the same simulation approach as in the previous sections. We simulated 50 runs each of two genomes of size n = 8867 separated by τ = 1000, 3000, 6000 and 8000 random rearrangements as before, but with both genomes independently and randomly fragmented into χ = 100, 200, 400, 600 or 800 contigs, i.e., 5 + C(5, 2) = 15 pairs of contig configurations for each degree of rearrangement. We applied the DCJ algorithm and calculated the mean d′ for each configuration. The results are summarized in Figure 5. We observe on the left of Figure 5 that for fixed τ and χA, the response of d′ to increasing χB is systematically linear. This is clear up to τ = 6000 and only starts to break down for τ = 8000 and χA ≥ 600, where examination of the data on an expanded scale shows that d′ actually decreases somewhat initially, then increases, as χB increases (not discernible in Figure 5). The linear rate of increase of d′, plotted as β(τ, χA) on the right of the figure, is the same as the α(τ) in Figure 3 for low values of χA. In fact, d′ shows the same linear increase as a function of χA + χB up to moderate values of this sum, as in Figure 3, depending on τ, after which the rate of increase drops off somewhat. As in the case of only one genome fragmented into contigs, studied in Section 4, we can infer d and τ from observed values of d′ by solving Eq. 4 simultaneously with

d = d′ − α(τ)χA − β(τ, χA)χB,   (7)
where β(τ, χA ) = α(τ ) − (.00027 − .00003τ )χA , and where the coefficient of χA is estimated by a least squares fit to the slopes of the four trend lines in Figure 6 (right). Plotting the inferred values of τ against values extracted from the literature produced the results on the right of Figure 4.
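The same iteration extends to Eq. 7 (again our code; the text does not state the units of τ inside the fitted coefficients of β, so we assume τ is taken in thousands of rearrangements, consistent with the fit for α(τ)):

```python
def beta(tau, chi_A):
    # Eq. 7 coefficient; ASSUMPTION: tau in thousands, matching alpha(tau).
    return alpha(tau) - (0.00027 - 0.00003 * (tau / 1000.0)) * chi_A

def correct_both_fragmented(d_prime, chi_A, chi_B, n, iters=50):
    """Recover (tau, d) when both genomes are in contig form (Eqs. 4 and 7)."""
    tau = d_prime
    for _ in range(iters):
        d = d_prime - alpha(tau) * chi_A - beta(tau, chi_A) * chi_B
        tau = estimate_tau(d, n)
    return tau, d
```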
Fig. 5. For genomes A and B separated by τ = 1000, 3000, 6000 or 8000 random rearrangements, broken into χ = 100, 200, 400 or 800 contigs: (left) the relationship between the uncorrected genomic distance d′, χA and χB, with trend lines for each χA connecting the values of d′ for a range of χB; (right) the coefficient β of the linear dependence of d′ on χB in the left-hand diagram, as a function of χA. Dotted segments connect β(τ, 100) to β(τ, 0) = α(τ) from Figure 3. The dashed line is the linear trend line, not taking account of β(τ, 0).
Fig. 6. Dependence of d′ on the total number of contigs in the two genomes, for τ = 1000 (left) and τ = 6000 (right). Straight lines represent d′ = d + α(τ)(χA + χB), where α(τ) is as in Figure 3.
6 Phylogeny
If we input the inferred pairwise values of τ into a neighbour-joining routine, we produce the phylogeny in Figure 7. When this is compared to Figure 1, the only structural difference is at one node, where we see D. sechellia branching off just before D. melanogaster rather than the two branching off together as sister groups. More striking is the long branch leading to the Drosophila group, suggesting a rapid rate of evolution at the moment of divergence from the other Diptera. Note that using the uncorrected matrix of d′ as input to neighbour joining does not show this rate effect as clearly as τ does, and also introduces other structural errors into the phylogeny.
Fig. 7. Neighbour-joining phylogeny based on matrix of inferred number of rearrangements τ
7 Conclusion
We have developed a principled approach to correcting genome rearrangement distance when comparing genomes in contig form. Its features include:
– A model for the τ–d relationship motivated by intuitive connections between genomic distance and adjacency disruption.
– A reasoned procedure for subtracting artificial fusions and fissions due to the fragmentation of one or both of the genomes into contigs.
– The discovery and quantitative characterization of the linear relation between the uncorrected distance and the number of contigs, when one or both of the genomes are fragmented into contigs. These linearities hold for a wide range of τ, up to 6000 for genomes of size around n = 9000, and up to χ = 1000 contigs.
– Improved phylogenetic reconstruction for a data set of 14 insect genomes. We recovered a tree that accurately reflects almost all the phylogenetic information extracted from the literature, and pinpointed a period of evolutionary acceleration on one lineage.
As argued in Section 3, the values of the parameters λ1 and λ2 are not likely to be very sensitive to n, especially for n in the thousands, since the model relates the normalized variables τ/n and d/n. Nor should they depend on details of the rearrangement model, such as the number of chromosomes or the proportions of different types of rearrangement, assuming the latter are naturally weighted as in the double-cut-and-join framework. This stability reassures us that our methods should be widely applicable beyond the Drosophila data we have used, but it only partly mitigates the main shortcoming of this and other models such as that in Ref. [13], namely that they are not analytically derived. Thus the mathematical foundation of probability models and statistical analyses of genomic problems like the one addressed here would benefit more from advances like those in Ref. [6] than from further characterization of empirical models such as Eq. 3. For example, if we knew the probability law of d under random rearrangements, or even its expectation, we could most easily investigate the following hypothesis, suggested by our results on the linearity of the dependence of d on χ: the imposition of a contig structure has the same effect on d as adding further rearrangements. In a continuous approximation of the τ–d relationship,

dE(d)/dχ = dE(d)/dτ = α(τ).   (8)
If this could be verified, analytically or, failing that, empirically, it would make for an elegant framework for our results.
Acknowledgements We thank Arjun Bhutkar for providing the NGP files with pair occurrence tabulated by species. We also thank Chunfang Zheng for guidance in using her programs for DCJ distance and rearrangement simulations. Research funded in part by NSERC. DS holds the Canada Research Chair in Mathematical Genomics.
References
1. Bergeron, A., Mixtacki, J., Stoye, J.: A unifying view of genome rearrangements. In: Bücher, P., Moret, B.M.E. (eds.) WABI 2006. LNCS (LNBI), vol. 4175, pp. 163–173. Springer, Heidelberg (2006)
2. Bhutkar, A., Gelbart, W.M., Smith, T.: Inferring genome-scale rearrangement phylogeny and ancestral gene order: a Drosophila case study. Genome Biology 8, R236 (2007)
3. Bhutkar, A., Schaeffer, S.W., Russo, S.M., Xu, M., Smith, T.F., Gelbart, W.: Chromosomal rearrangement inferred from comparisons of 12 Drosophila genomes. Genetics 179, 1657–1680 (2008)
4. Dalevi, D., Eriksen, N.: Expected gene-order distances and model selection in bacteria. Bioinformatics 24, 1332–1338 (2008)
5. Drosophila 12 Genomes Consortium: Evolution of genes and genomes on the Drosophila phylogeny. Nature 450, 203–218 (2007)
6. Eriksen, N., Hultman, A.: Estimating the expected reversal distance after a fixed number of reversals. Advances in Applied Mathematics 32, 439–453 (2004)
7. Gaul, E., Blanchette, M.: Ordering partially assembled genomes using gene arrangements. In: Bourque, G., El-Mabrouk, N. (eds.) RECOMB-CG 2006. LNCS (LNBI), vol. 4205, pp. 113–128. Springer, Heidelberg (2006)
8. Krzywinski, J., Grushko, O.G., Besansky, N.: Analysis of the complete mitochondrial DNA from Anopheles funestus: an improved dipteran mitochondrial genome annotation and a temporal dimension of mosquito evolution. Molecular Phylogenetics and Evolution 39, 417–423 (2006)
9. Sankoff, D., Zheng, C., Wall, P.K., dePamphilis, C.W., Leebens-Mack, J., Albert, V.A.: Internal validation of ancestral gene order reconstruction in angiosperm phylogeny. In: Vialette, S., Nelson, C. (eds.) RECOMB-CG 2008. LNCS, vol. 5267, pp. 252–264. Springer, Heidelberg (2008)
10. Savard, J., Tautz, D., Richards, S., Weinstock, G.M., Gibbs, R.A., Werren, J.H., Tettelin, H., Lercher, M.J.: Phylogenomic analysis reveals bees and wasps (Hymenoptera) at the base of the radiation of holometabolous insects. Genome Research 16, 1334–1338 (2006)
11. Severson, D.W., DeBruyn, B., Lovin, D.D., Brown, S.E., Knudson, D.L., Morlais, I.: Comparative genome analysis of the yellow fever mosquito Aedes aegypti with Drosophila melanogaster and the malaria vector mosquito Anopheles gambiae. Journal of Heredity 95, 103–113 (2004)
12. Tesler, G.: Efficient algorithms for multichromosomal genome rearrangements. Journal of Computer and System Sciences 65, 587–609 (2002)
13. Wang, L.-S., Warnow, T.: Distance-based genome rearrangement phylogeny. In: Gascuel, O. (ed.) Mathematics of Evolution and Phylogeny, ch. 13, pp. 353–383. Oxford (2005)
14. Yancopoulos, S., Attie, O., Friedberg, R.: Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics 21, 3340–3346 (2005)
Prediction of Contiguous Regions in the Amniote Ancestral Genome
Aïda Ouangraoua¹, Frédéric Boyer², Andrew McPherson¹, Éric Tannier³, and Cedric Chauve¹
¹
Department of Mathematics, Simon Fraser University, Burnaby (BC), Canada {aouangra,awm3,cchauve}@sfu.ca
² Institut de Recherches en Technologies et Sciences pour le Vivant; Laboratoire Biologie, Informatique et Mathématiques; CEA Grenoble, F-38000 Grenoble, France
[email protected]
³ INRIA Rhône-Alpes; Université de Lyon; Université Lyon 1; CNRS, UMR5558, Laboratoire de Biométrie et Biologie Évolutive, F-69622, Villeurbanne, France
[email protected]
Abstract. We investigate the problem of inferring contiguous ancestral regions (CARs) of the genome of the last common ancestor of all extant amniotes, based on the currently sequenced and assembled amniote genomes as ingroups and three teleost fish genomes as outgroups. We combine a methodological framework using conserved syntenies computed from whole genome alignments of amniote species with double conserved syntenies (DCS) computed using gene families from amniote and fish genomes, to take into account the whole genome duplication that occurred in the teleost lineage. From these comparisons, ancestral genome segments are computed using techniques inspired by physical mapping. Due to the difficulty caused by the whole genome duplication and the large evolutionary distance to the closest assembled outgroup, very few reconstructions of the amniote ancestral genome have been published. Ours is the first that is founded on a simple and formal methodological framework; we show its good stability, and its CARs cover large regions of the human and chicken genomes.
1 Introduction
The reconstruction of ancestral karyotypes and gene orders from homologies between extant species can help in understanding the large-scale evolutionary mutations that differentiate present-day genomes. It has been approached using cytogenetics methods, recently applied to mammalian genomes [24]. Beyond this evolutionary distance, homologies are less visible, and it is only with the recent availability of sequenced and assembled genomes that bioinformatics methods can predict the past of chromosomes. These methods address the problem at a much higher resolution, although with far fewer available genomes. The first results have been obtained on mammalian genomes [4,16,20], and several reviews have been published [9,19], analyzing the divergences with earlier
cytogenetics results [10,5,22]. These methods can be divided into model-based methods, which compute complete evolutionary scenarios [4,20], and model-free approaches, which do not consider a precise rearrangement model and which are used by cytogeneticians and currently receive a lot of attention from computational biology (see [16,6] and references therein). The application of such methods to more ancient genomes comes up against the difficulty of handling duplications and losses as evolutionary events. Yet the teleost fish genomes have undergone a whole genome duplication (WGD) [12,19] at an early stage of their evolution, and are currently the only available genomes that may serve as an outgroup to reconstruct amniote or tetrapod ancestral genomes. Two recent methods have been developed to reconstruct these ancestral genomes [14,21], and they predict very divergent syntenic associations. Hence, while the reconstruction of the ancestral mammalian genome now seems close to a relative consensus, the reconstruction of the amniote ancestral genome looks like the next bottleneck on the way towards the ancestral proto-vertebrate genome. In [6], a general model-free framework was introduced for the reconstruction of "Contiguous Ancestral Regions", or CARs (the terminology is borrowed from Ma et al. [16]), in an ancestral genome. It is inspired by genome physical mapping techniques, and roughly consists of two phases. Given a set of "genomic markers", which are sets of orthologous positions in the ingroup genomes: (1) detect "ancestral syntenies", which are sets of genomic markers that are believed to have been contiguous in the ancestral genome, and (2) order the genomic markers into a set of "Contiguous Ancestral Regions" in which the ancestral syntenies are respected, discarding some of them if the whole set is not compatible with the formation of linear CARs. This second phase relies on combinatorial tools such as PQ-trees, which were introduced in computational biology for physical mapping of genomes. Indeed, our problem consists in mapping markers onto ancestral chromosomes. This framework was applied in [6] for the reconstruction of contiguous ancestral regions of mammalian ancestors (ferungulates and boreoeutheria), and was shown to be very stable under different parameters for the computation of syntenies. Our goal here is to apply this framework to compute CARs of the ancestral amniote genome. The method used to infer mammalian CARs needs to be extended to handle two main issues. First, the closest currently available sequenced and assembled outgroups are the teleost fishes, whose evolutionary distance to the ingroups (mammals and birds) is considerable. Hence it is impossible to define a high-coverage set of genomic markers that appear once in each genome of this study. Moreover, the whole genome duplication, followed by massive gene losses and intensive rearrangements, makes ancestral syntenies inaccessible by a classical comparison between amniote and fish genomes. We handle these issues by using (1) genomic markers obtained from whole-genome alignments within amniote assembled genomes (chicken and mammals), and (2) gene families to compare amniote genomes with fish genomes (teleost fishes) to construct ancestral syntenies. We rely on the Double Conserved Synteny (DCS) principle introduced by Kellis et al. [13] and Dietrich et al. [7], since then often
used to detect syntenies in a WGD context [12,21], and systematized by Van de Peer [23]. Although the principle is well known, the detection of such syntenies poses a non-trivial methodological problem, and formal descriptions of the expected signal, adapted to the highly rearranged fish genomes, are lacking. It is a contribution of this paper to propose a formal definition of DCSs. We obtain a family of ancestral syntenies and group them into Contiguous Ancestral Regions of the proto-amniote genome. This set can contain some "conflicting signal", which means that no linear ordering of the genomic markers can account for all the ancestral syntenies. While the ancestral syntenies computed from mammalian genomes presented very little conflict [6], the greater evolutionary distance mixes up the signal and we have to cope with more conflicting ancestral syntenies. That is why we produce sets of CARs with different sets of parameters used to compute the DCS, and propose both a set of CARs obtained with stringent parameters and a set of consensus CARs obtained from several parameter values. We compare our results with two other studies [14,21] that proposed a configuration of the amniote ancestral genome. These two present contradictory results, low coverage of the extant genomes by the reconstructed ancestral one, and no validation of the methods. We try to make significant progress in these directions: our CARs are more numerous than in the previous studies, but they present a good coverage of the extant genomes. As in the previous studies, we find a proto-amniote genome that shows more similarity to the chicken genome than to mammalian genomes. In the following, we first describe the method, following the framework of [6] and focusing on the novelty we introduce, which is the possibility of integrating duplicated syntenies in this framework; the definition and computation of the duplicated syntenies is discussed. Then we describe the CARs we obtain in detail, comparing them to previous studies and showing some convergences and differences. Eventually we study the soundness of the proposed CARs by running the method under different sets of parameters, to understand its behavior, its stability and the confidence we may place in the proposed CARs.
2 Data and Methods

2.1 Overview
We consider a dataset containing eleven amniote genomes (human, chimpanzee, orangutan, macaca, mouse, rat, dog, cow, horse, opossum, chicken) and three teleost fish genomes (tetraodon, stickleback, medaka) used as outgroups. We then proceed with the following steps.

1. We compute a set of "genomic markers" that are unique and universal in the amniote genomes (i.e., each marker appears exactly once in each of the eleven genomes), using the available whole-genome alignments.
2. A first set of ancestral syntenies is generated by computing common intervals (as in [6]) of genomic markers between all pairs of amniote species whose evolutionary path goes through the amniote ancestor (here, chicken against every mammal).
3. A second set of ancestral syntenies is generated by computing "double conserved syntenies" (DCS) between each amniote and the three teleosts, using sets of gene families. This provides the coordinates, on an amniote genome, of a genomic segment which is likely to descend from a segment of the ancestral osteichthyes genome (the ancestor of teleost fishes and amniotes), and is then likely to have been present as a segment of the ancestral amniote genome. These coordinates provide a set of genomic markers (those which intersect the DCS) which is taken as the ancestral synteny.
4. We weight the ancestral syntenies from both sets according to their conservation pattern in the considered species tree.
5. We select from the ancestral syntenies a maximum-weight subset that is compatible with the formation of linear CARs. This assumes that the lowest-weight conflicting ancestral syntenies are the most likely false positives. This phase relies on the "consecutive ones problem", widely used in physical mapping.

The final result is a combinatorial structure (a PQ-tree) that compactly represents the whole set of solutions to the consecutive ones problem. The children of the root of the PQ-tree are the amniote CARs.

2.2 Genomes and Markers
We used data retrieved from the Compara database of Ensembl [8], comparing human (hg18), chimpanzee (CHIMP2.1), orangutan (PPYG2), macaca (Mmul 1), mouse (mm9), rat (RGSC 3.4), dog (CanFam2.0), cow (Btau 4.0), opossum (monDom5), platypus (Ornithorhynchus anatinus-5.0), chicken (WASHUC2), tetraodon (TETRAODON8.0), medaka (HdrR) and stickleback (BROAD S1). For the phylogenetic relationships between the amniote species, we considered the species tree used by Compara to compute whole-genome alignments (see http://www.cecm.sfu.ca/~cchauve/SUPP/ISBRA09/ for data and results).

We construct a set of genomic markers from the Pecan 12-amniote vertebrates multiple alignments available in release 50 of Compara, with the human genome as a reference. From the orthologous seeds defined by the multiple alignments on the 11 fully assembled genomes, we keep only the ones that have a minimum size (100 bp in the human reference genome). Two seeds are joined if they are less than 100Kb apart in all amniotes where they occur. A genomic marker is then an inclusion-wise maximal set of linked seeds which spans more than 100Kb in all genomes and such that its seeds cover at least 50% of its total span. In this way we expect to obtain a good set of orthologous markers, removing uncertain homologies as well as paralogies. We obtain 1101 non-overlapping genomic markers, spanning 797Mb of the human genome (26% of its size) and 308Mb (29%) of the chicken genome.
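As an illustration of the seed-chaining step, here is a minimal sketch in Python; the data layout (a dictionary mapping each genome to its seed intervals) and all identifiers are our own assumptions, and the 50% seed-coverage filter is omitted for brevity.

MAX_GAP = 100_000     # two seeds are joined if closer than 100 kb everywhere
MIN_SPAN = 100_000    # a marker must span more than 100 kb in all genomes

def gap(a, b):
    """Distance between two intervals (0 if they overlap)."""
    (s1, e1), (s2, e2) = sorted((a, b))
    return max(0, s2 - e1)

def build_markers(occ):
    """occ: dict genome -> dict seed_id -> (start, end); for simplicity we
    assume every seed occurs in every genome."""
    seeds = sorted(next(iter(occ.values())))
    parent = {s: s for s in seeds}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    # Join seeds that are adjacent in some genome and close in all genomes.
    for genome, pos in occ.items():
        order = sorted(pos, key=lambda s: pos[s])
        for s, t in zip(order, order[1:]):
            if all(gap(p[s], p[t]) < MAX_GAP for p in occ.values()):
                parent[find(s)] = find(t)

    # Group chained seeds into candidate markers and keep the long ones.
    groups = {}
    for s in seeds:
        groups.setdefault(find(s), []).append(s)
    markers = []
    for group in groups.values():
        spans = [max(p[s][1] for s in group) - min(p[s][0] for s in group)
                 for p in occ.values()]
        if min(spans) > MIN_SPAN:
            markers.append(sorted(group))
    return markers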
2.3 Ancestral Syntenies
As described in Chauve et al. [6], ancestral synteny detection in the absence of large duplications (in mammalian and bird genomes) may be performed by
the detection of groups of genomic markers that are contiguous in at least two genomes whose evolutionary path goes through the desired ancestor. But the WGD that is believed to have occurred in the lineage of the teleost genomes requires a more sophisticated treatment when a fish genome is involved in the comparison.

In the absence of large duplications: comparing two amniote genomes. First, when comparing the chicken genome to each mammalian genome, which are not separated by a whole-genome duplication event, an ancestral synteny is the set of genomic markers intersecting (a) an inclusion-wise maximal segment of the chicken genome that contains the same genomic markers as a segment of a mammalian genome (a maximal common interval), or (b) a segment of the chicken genome that contains only two markers that are also consecutive in a mammalian genome (an adjacency). We include adjacencies to compensate for the fact that no order information is associated with common intervals. The result of this process is a family of sets of markers, each of which covers amniote genomic segments that are believed to be ancestral.

In the presence of a WGD in the outgroups. For comparisons between an amniote genome and a fish genome, the method described above cannot be applied, due to the numerous losses and intra-chromosomal rearrangements that shuffled the teleost fish chromosomes after the WGD. Indeed, massive gene losses often follow a WGD, and two paralogous segments may present little similarity. It is possible to detect this paralogy indirectly by comparing the two segments with their common ortholog in a non-duplicated genome. This principle is called the pivot method, and it is now a classical way to detect chromosome segment homologies in a WGD context [13,7,12,23] (see Figure 1). Although this principle is well known, no methodological discussion of the exact signal that should be detected using fish genomes has been published (due to the highly rearranged fish chromosomes, the approaches described in [23] are not efficient in this case). So in the present method we propose a definition of a DCS. The proportion of conflicting signal in the whole set of DCS suggests that there is still room for improving the precision of this definition.
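Returning to the comparison of two amniote genomes described above, the detection of adjacencies and maximal common intervals can be sketched as follows (ours, not the authors' implementation); the two genomes are given as orderings of the same marker set, and a quadratic-time scan is enough for illustration, although faster algorithms exist.

def common_intervals(a, b):
    """All index pairs (i, j) such that the markers a[i..j] are contiguous in b."""
    pos_b = {m: i for i, m in enumerate(b)}
    found = []
    for i in range(len(a)):
        lo = hi = pos_b[a[i]]
        for j in range(i + 1, len(a)):
            lo = min(lo, pos_b[a[j]])
            hi = max(hi, pos_b[a[j]])
            if hi - lo == j - i:          # same marker content, contiguous in b
                found.append((i, j))
    return found

def maximal_common_intervals(a, b):
    ivs = common_intervals(a, b)
    return [x for x in ivs
            if not any(y != x and y[0] <= x[0] and x[1] <= y[1] for y in ivs)]

def adjacencies(a, b):
    """Unordered marker pairs consecutive in both genomes."""
    pairs_b = {frozenset(p) for p in zip(b, b[1:])}
    return [set(p) for p in zip(a, a[1:]) if frozenset(p) in pairs_b]

chicken = [1, 2, 3, 4, 5]
mammal  = [3, 2, 1, 5, 4]
print(maximal_common_intervals(chicken, mammal))  # [(0, 4)]: the whole marker set
print(adjacencies(chicken, mammal))               # [{1, 2}, {2, 3}, {4, 5}]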
Fig. 1. A double synteny: an amniote chromosome (in the middle) is homologous to two fish segments (top and bottom), though the paralogy between these two segments is not detectable through direct similarity (few genes are conserved in two copies). Clusters are highly rearranged: this justifies the detection method, which does not consider the order of the genes within a cluster.
We use a set of orthologous gene families constructed from the orthologies between genes of amniotes and fishes available in the Ensembl Compara database [8] (we use release 51, which is based on the same genome assemblies as release 50, with improved gene annotation, but does not contain the Pecan alignments). When comparing an amniote genome to a fish genome, we use only genes that are annotated with coordinates on a chromosome in both species. A double conserved synteny (DCS) is a segment S of the amniote genome that is orthologous to two paralogous segments S1 and S2 on two different chromosomes of the fish genome, i.e., that satisfies the two following criteria.

1. S contains at least 20 genes, and at least minprop percent of its genes having fish orthologs have them on S1 or S2 (we choose minprop = 95% and test the sensitivity to this parameter);
2. there is a minimum number of alternations (4 for this study), along the genes of S, between those which have an ortholog on S1 and those which have an ortholog on S2.

These conditions, inspired by the comparison between the tetraodon and human genomes in [12], were designed to retrieve genome segments whose gene content exhibits a clear signal for originating from a single pre-WGD genome segment. We considered all comparisons between one of the three teleost genomes and each amniote. We obtained a list of amniote genome segments orthologous to two segments on two chromosomes of a fish, showing a signal for a "double synteny". The maximal set of genomic markers intersecting such a segment defines the corresponding ancestral synteny.

We also included ancestral syntenies derived from mammalian ancestral segments showing a DCS signal with fishes. These are sets of genomic markers intersecting maximal common intervals between two mammalian genomes whose evolutionary path goes through the boreoeutherian ancestor, and which are in addition included in a DCS in at least one of the two considered genomes. These segments are refined DCS, with a stronger conservation signal, as they are predicted to be ancestral by two methods, in the boreoeutherian ancestral genome and in the osteichthyes ancestral genome.

We obtain 2745 ancestral syntenies containing more than one genomic marker. These syntenies are then weighted according to the conservation pattern they present in the phylogeny, using the formula described in [6], which accounts for the species tree and its branch lengths.
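The DCS test itself can be sketched as follows; the data access (a list giving, for each gene of the amniote segment in genomic order, the fish chromosome of its ortholog, or None when there is none) is our own assumption, while the three criteria follow the text.

from collections import Counter

def is_dcs(fish_chrom, min_genes=20, minprop=0.95, min_alt=4):
    """fish_chrom: fish chromosome of the ortholog of each gene along the
    amniote segment, in genomic order; None when the gene has no ortholog."""
    if len(fish_chrom) < min_genes:
        return False
    with_orth = [c for c in fish_chrom if c is not None]
    counts = Counter(with_orth)
    if len(counts) < 2:                   # need two candidate fish chromosomes
        return False
    (c1, n1), (c2, n2) = counts.most_common(2)
    if n1 + n2 < minprop * len(with_orth):
        return False
    # Count alternations between the two candidate post-WGD chromosomes.
    trace = [c for c in with_orth if c in (c1, c2)]
    alternations = sum(1 for x, y in zip(trace, trace[1:]) if x != y)
    return alternations >= min_alt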
2.4 Assembling Ancestral Syntenies
The output of the phase described above is a set L of n genomic markers and a family S = {S1, . . . , Sm} of m subsets of L, where each subset is a set of genomic markers that are believed to be contiguous in the ancestral genome of interest. Following [6], we use the approach traditionally applied to physical mapping problems [1]. It is based on the consecutive ones property (C1P) and PQ-trees. We encode S by an m × n 0/1 matrix M where row i represents Si as
follows: M[i, j] = 1 if marker j belongs to Si, and 0 otherwise. Ordering the markers into CARs consists of finding a permutation of the columns of M such that, in each row, all 1 entries are consecutive (also called a C1P ordering for M). Finding such an order of the columns of M is not always possible, in particular if there are false positives in S, that is, groups of markers that were not contiguous in the ancestral genome. Moreover, if there exists a C1P ordering of the columns of M, there are often several possible orderings that make all 1s consecutive on each row, and these represent different ancestral genome architectures.
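For illustration only, the matrix encoding and a brute-force C1P test can be sketched as follows; the enumeration of column permutations is exponential and meant for toy examples, whereas the paper relies on PQ-trees for an efficient treatment.

from itertools import permutations

def to_matrix(markers, syntenies):
    return [[1 if m in s else 0 for m in markers] for s in syntenies]

def row_is_consecutive(row):
    ones = [j for j, v in enumerate(row) if v == 1]
    return not ones or ones[-1] - ones[0] == len(ones) - 1

def c1p_orderings(markers, syntenies):
    """Yield every column order making all 1s consecutive on each row."""
    mat = to_matrix(markers, syntenies)
    for order in permutations(range(len(markers))):
        if all(row_is_consecutive([row[j] for j in order]) for row in mat):
            yield [markers[j] for j in order]

markers = [1, 2, 3, 4]
syntenies = [{1, 2}, {2, 3}, {1, 2, 3}]
print(next(c1p_orderings(markers, syntenies)))  # e.g. [1, 2, 3, 4]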
Fig. 2. (a) A matrix M with the consecutive ones property. (b) A PQ-tree T(M). (c) An equivalent representation of T(M) that highlights all ancestral genome architectures corresponding to C1P orderings for M: each row corresponds to a chromosomal segment represented by a child of the root, two glued blocks have to be adjacent in any ancestral genome architecture, and sets of blocks that float in the same box have to be consecutive in any genome architecture, but their order is not constrained. Here, for example, we see three ancestral chromosomal segments; the second one contains markers 5 to 8, with the only constraint that markers 6 and 7 are adjacent. Hence 5 6 7 8 is a possible order for this segment, but not 5 6 8 7. All 13824 possible C1P orderings (possible ancestral orderings) are visible in this representation, which we use to present the amniote CARs in Figure 3.
If M is not C1P, then we know that some sets of markers in S are false positives and were not contiguous in the ancestral genome. Following [16,6], we clear ambiguities by computing a maximum-weight subset of S that is C1P, using a branch-and-bound algorithm described in [6] that finds an exact solution. Then, given this C1P subset of S, all C1P orderings can be represented in a compact way using the PQ-tree of the resulting matrix M′, denoted T(M′), which contains three kinds of nodes: leaves (labeled by L), P-nodes and Q-nodes. Computing T(M′) can be done efficiently [17]. See Figure 2 for an illustration. T(M′) encodes in a compact way all possible C1P orderings of the columns of M′, and hence all genome architectures we can deduce from S: the root of T(M′) is a P-node, and the children of the root represent CARs, where Q-nodes describe orderings that are fixed up to a reversal, while P-nodes other than the root describe subsets of markers that have to be contiguous but for which there is no information to fix a relative order (see Figure 2 for an illustration). Two markers that are consecutive children of a Q-node are said to define an adjacency.
3 Results and Discussion
3.1 An Ancestral Amniote Genome Architecture

We applied the described method to propose a genomic architecture of the ancestral species of all amniotes, using the data presented in the previous section. Of the 2745 ancestral syntenies, 372 had to be removed during the optimization phase in order for the C1P problem to have a solution. This resulted in an ancestral amniote genome architecture composed of 79 CARs, 63 of them containing more than one genomic marker. In these 63 CARs, 983 of the 1101 genomic markers are included in adjacencies, which indicates that there is little ambiguity due to P-nodes in the PQ-tree, although more than in the boreoeutherian CARs of [6].
Fig. 3. The PQ-tree of amniote CARs, with their correspondence in the chicken genome. All ancestral architectures are represented, in the format described in Figure 2, with only a few chromosomal segments in which the order of the markers is not fixed.
These CARs define 282 segments that are strictly colinear with segments of the chicken genome and cover 75% of the chicken genome. Similarly, these
Table 1. Recovered syntenic associations between chicken chromosomes, for three different methods. Numbers refer to pieces of chicken chromosomes. CARs which contain markers from only one chicken chromosome are not mentioned here.

Kohn et al. [14]:      2-9-16, 1-24, 5-10, 2-12, 22-Z, 4-22, 18-27-19, 13-17-Z, 21-26-23-32, 3-14, 8-18, 1-7, 1-14-18
Nakatani et al. [21]:  2-9, 17-Z, 18-27, 21-26, 1-8, 4-20
Present method:        17-Z, 9-19
CARs define 225 segments that are colinear with the human genome and cover 67% of this genome. Although these numbers are smaller than for the reconstruction of mammalian ancestors described in [6], they are much larger than those of the amniote CARs inferred in [21]. The 51 CARs which span more than 1 Mb of the chicken genome are illustrated in Fig. 3, with their correspondence with the chicken chromosomes.

Chicken syntenic associations. A "syntenic association" is the presence in a single CAR of genomic markers from two different chromosomes of an extant amniote. Here we may observe several chicken syntenic associations and compare them with other published methods [14,21]. These are, to our knowledge, the only two methods that have led to a proposed architecture of the amniote genome. These propositions are very divergent: not only does the number of chromosomes vary between 18 [14] and 26 [21], but the observed syntenic associations between chicken chromosomes are not always compatible. Of the 13 syntenic associations found by Kohn et al. [14] and the 6 found by Nakatani et al. [21], only two are common (Z-17 and 2-9). The reason is probably the absence of a formal framework, a gap that we attempt to fill here. We give a summary of the differences in Table 1, together with the syntenic associations we find in this study. We find one of the common syntenic associations (17-Z), plus two associations from Kohn et al. [14] and none additional from Nakatani et al. [21].

3.2 Stability of the Method and Sensitivity to Parameters
The advantage of using a general framework for ancestral genome reconstruction is the possibility, to a certain extent, of assessing the quality and robustness of the results. Every pair of adjacent markers in the CARs has a simple support: the existence of an ancestral synteny that contains both markers. So every adjacency in the CARs may be examined using the data, independently of the methodology. The choices we have to make are the parameters used in the computation of the ancestral syntenies. When two amniotes are compared, the construction of ancestral syntenies requires no parameter, and its stability has been assessed in [6]. When an amniote genome and a fish genome are compared, we rely on a novel and less tried-and-tested method, namely the computation of DCS. The principle itself is well known and widely employed, but we are not aware of any formal
study of a reliable implementation of this principle. The two parameters that are used to construct the DCS are inclusive: strengthening both parameters will improve the specificity, and any DCS found for a set of parameters will also be found with less stringent parameters. The question is which parameters are stringent enough to ensure a good specificity. We believe that an amniote segment with at least 20 genes, covering 95% of the genes annotated in this segment, is sufficient evidence for a double-orthology signal, and this is confirmed by the comparison with the map of [12], built from the same principle with visual expertise. The optimization step is also a possible source of instability of the method, especially as, from our experiments with the stringent criterion minprop = 95%, at least 10% of the possible ancestral syntenies are false positives or result from convergent evolution, and have to be discarded during the optimization phase.

To assess these two possible sources of instability, we inferred DCS with the following values of the parameter minprop: 80%, 83%, 86%, 89%, 92% and 95%. To measure stability, we considered the set of adjacencies (pairs of markers that are consecutive children of a Q-node) defined by each set of CARs. An adjacency is said to be conserved between two sets of CARs if it is found in both sets. An adjacency of a given set of CARs is said to be weakly conserved in another set of CARs if it is absent from the latter but the two markers that define it belong to the same CAR. We show in Table 2 the characteristics of the sets of CARs we computed.

Table 2. Characteristics of the sets of CARs computed with DCS obtained for several values of minprop. Discarded syntenies are the ancestral syntenies discarded during the optimization phase. Long CARs are CARs that cover at least 1Mb of the chicken genome. Conserved and weakly conserved adjacencies are with respect to the previous value of minprop.

minprop  Ancestral  Discarded  CARs  Long  Adjacencies  Conserved  Weakly
         syntenies  syntenies        CARs               adj.       conserved adj.
80       4054       1182       25    20    1062         -          -
83       3691       973        37    24    1044         1021       17
86       3328       779        47    34    1033         1012       15
89       3093       626        49    35    1034         1011       13
92       2956       491        70    46    1006         979        24
95       2745       372        79    63    983          961        16
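The conservation measures of Table 2 can be sketched as follows, assuming CARs are given as ordered lists of marker identifiers (the data layout is our assumption).

def adjacency_set(cars):
    return {frozenset(p) for car in cars for p in zip(car, car[1:])}

def conservation(cars_a, cars_b):
    """Classify the adjacencies of cars_a with respect to cars_b."""
    adj_b = adjacency_set(cars_b)
    car_of = {m: i for i, car in enumerate(cars_b) for m in car}
    conserved, weak, lost = 0, 0, 0
    for adj in adjacency_set(cars_a):
        x, y = tuple(adj)
        if adj in adj_b:
            conserved += 1
        elif car_of.get(x) is not None and car_of.get(x) == car_of.get(y):
            weak += 1          # same CAR in cars_b, but not adjacent there
        else:
            lost += 1
    return conserved, weak, lost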
Increasing minprop clearly results in fewer false positive ancestral syntenies, as the ratio of discarded syntenies drops from 29% to 14%. However, the results obtained are very stable in terms of adjacencies, as most adjacencies of a given set of CARs are consistent with the adjacencies of the previous set of CARs. The increase in the number of CARs is expected, as fewer ancestral syntenies are obtained with more stringent parameters for computing the DCS. This seems to indicate that, on this particular dataset, the optimization phase behaves very consistently across the various sets of DCS and seems to conserve a subset of DCS
that lead to very similar sets of CARs. Note also that the proportion of genomic markers that do not belong to adjacencies (i.e., that are children of a P-node) is low, which indicates that the proposed ancestral genome architectures contain very few segments in which the order of the markers is not fixed. Additional results regarding the support of each adjacency by the different types of ancestral syntenies – common intervals between ingroups, DCS, and mammalian syntenies included in DCS – are available on the companion website. They show that with these data, which contain only one species on one branch from the ancestor (birds and reptiles), DCS are fundamental to detect and support a significant number of adjacencies: 112 adjacencies are not supported by any chicken–mammalian ancestral synteny.
4 Conclusion
We extended a general framework for reconstructing ancestral genome architectures [6] in order to handle WGD events, and we applied our method to reconstruct the architecture of the ancestral amniote genome. While we paid a lot of attention to the specificity of the method, so as not to infer doubtful ancestral syntenies, its sensitivity is not sufficient to provide the exact set of chromosomes of the amniote ancestor. The small number of genomes used, as well as the fact that the chicken is the only available genome among birds and reptiles and has several very small chromosomes, is another reason why there are certainly more CARs than ancestral chromosomes. The definitive amniote ancestral genome is still an open problem, but with this general, simple and formal method, some of its characteristics are accessible and valuable for further studies.

From a methodological point of view, several avenues can be explored. One issue is the fact that the C1P framework requires that each genomic marker appears exactly once in the ancestral genome architecture. This forced us to define our genomic markers from whole-genome alignments; but already at the level of the amniotes, these markers span only around 30% of the chicken genome. Another approach could have been to use genes as genomic markers. However, in order to apply the C1P framework, this would require computing the gene content of the ancestral amniote genome using the gene tree/species tree reconciliation approach. This is still a hard problem, which is very sensitive, for example, to errors in the computed gene trees [11]. The other main issue is the computation of the DCS. The results we obtain with several values of the parameter that defines DCS clearly show that many of the DCS we compute are probably not ancestral syntenies. They are very likely to intersect or contain genome segments that really originate from an ancestral amniote genome segment, but due to the lack of flexibility of the C1P framework, they are considered as false positives. Indeed, an amniote segment that only overlaps, even over a large part, a complete segment derived from the amniote common ancestor will induce a conflict with respect to the C1P framework. At the same time, without DCS, a significant number of adjacencies are not supported, which makes DCS instrumental in our method. This motivates the problem, open until now, of designing and studying formal methods for the reliable and precise detection of DCS.
Acknowledgments. Aïda Ouangraoua is funded by the ANR BRASERO (ANR-06-BLANC-0045) and SFU. Eric Tannier is funded by the ANR (ANR-08-GENO-003-01 and NT053 45205) and by the CNRS. Cedric Chauve is funded by NSERC and SFU.
References

1. Alizadeh, F., et al.: Physical mapping of chromosomes using unique probes. J. Comp. Biol. 2, 159–184 (1995)
2. Bergeron, A., Chauve, C., Gingras, Y.: Formal models of gene clusters. In: Zelikovsky, A., Mandoiu, I. (eds.) Bioinformatics Algorithms: Techniques and Applications. Wiley Series on Bioinformatics, pp. 177–202. Wiley Interscience, Hoboken (2008)
3. Booth, K.S., Lueker, G.S.: Testing for the consecutive ones property, interval graphs, and graph planarity using PQ-tree algorithms. J. Comput. System Sci. 13, 335–379 (1976)
4. Bourque, G., Pevzner, P.A., Tesler, G.: Reconstructing the genomic architecture of ancestral mammals: lessons from human, mouse and rat genomes. Genome Res. 14, 507–516 (2004)
5. Bourque, G., Tesler, G., Pevzner, P.A.: The convergence of cytogenetics and rearrangement-based models for ancestral genome reconstruction. Genome Res. 16, 311–313 (2006)
6. Chauve, C., Tannier, E.: A methodological framework for the reconstruction of contiguous regions of ancestral genomes and its application to mammalian genomes. PLoS Comput. Biol. 4, e1000234 (2008)
7. Dietrich, F.S., et al.: The Ashbya gossypii genome as a tool for mapping the ancient Saccharomyces cerevisiae genome. Science 304, 304–307 (2004)
8. Hubbard, T.J.P., et al.: Ensembl 2007. Nucl. Acids Res. 35, D610–D617 (2007)
9. Faraut, T.: Addressing chromosome evolution in the whole-genome sequence era. Chromosome Res. 16, 5–16 (2008)
10. Froenicke, L., et al.: Are molecular cytogenetics and bioinformatics suggesting diverging models of ancestral mammalian genomes? Genome Res. 16, 306–310 (2006)
11. Hahn, M.W.: Bias in phylogenetic tree reconciliation methods: implications for vertebrate genome evolution. Genome Biol. 8, R141 (2007)
12. Jaillon, O., et al.: Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature 431, 946–957 (2004)
13. Kellis, M., Birren, B.W., Lander, E.S.: Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 428, 617–624 (2004)
14. Kohn, M., et al.: Reconstruction of a 450-My-old ancestral vertebrate protokaryotype. Trends Genet. 22, 203–210 (2006)
15. Ma, J., et al.: DUPCAR: Reconstructing contiguous ancestral regions with duplications. J. Comput. Biol. 15, 1007–1027 (2008)
16. Ma, J., Haussler, D., Miller, W., et al.: Reconstructing contiguous regions of an ancestral genome. Genome Res. 16, 1557–1565 (2006)
17. McConnell, R.M.: A certifying algorithm for the consecutive-ones property. In: SODA 2004, pp. 761–770 (2004)
18. Meidanis, J., Porto, O., Telles, G.P.: On the consecutive ones property. Discrete Appl. Math. 88, 325–354 (1998)
19. Muffato, M., Roest Crollius, H.: Paleogenomics, or the recovery of lost genomes from the mist of times. BioEssays 30, 122–134 (2008)
20. Murphy, W.J., et al.: Dynamics of mammalian chromosome evolution inferred from multispecies comparative maps. Science 309, 613–617 (2005)
21. Nakatani, Y., Takeda, H., Morishita, S.: Reconstruction of the vertebrate ancestral genome reveals dynamic genome reorganization in early vertebrates. Genome Res. 17, 1254–1265 (2007)
22. Rocchi, M., Archidiacono, N., Stanyon, R.: Ancestral genome reconstruction: An integrated, multi-disciplinary approach is needed. Genome Res. 16, 1441 (2006)
23. Van de Peer, Y.: Computational approaches to unveiling ancient genome duplications. Nat. Rev. Genet. 5, 752–763 (2004)
24. Wienberg, J.: The evolution of eutherian chromosomes. Curr. Opin. Genet. Dev. 14, 657–666 (2004)
Pure Parsimony Xor Haplotyping

Paola Bonizzoni¹, Gianluca Della Vedova², Riccardo Dondi³, Yuri Pirola¹, and Romeo Rizzi⁴

¹ DISCo, Univ. Milano-Bicocca, {bonizzoni,pirola}@disco.unimib.it
² Dip. Statistica, Univ. Milano-Bicocca, [email protected]
³ Dip. Scienze dei Linguaggi, della Comunicazione e degli Studi Culturali, Univ. Bergamo, [email protected]
⁴ DIMI, Univ. Udine, [email protected]
Abstract. The haplotype resolution from xor-genotype data has recently been formulated as a new model for genetic studies [1]. Xor-genotype data is a cheaply obtainable type of data that distinguishes heterozygous from homozygous sites without identifying the homozygous alleles. In this paper we propose a formulation based on a well-known model used in haplotype inference: pure parsimony. We exhibit exact solutions of the problem by providing polynomial-time algorithms for some restricted cases and a fixed-parameter algorithm for the general case. These results are based on some interesting combinatorial properties of a graph representation of the solutions. Moreover, we propose a heuristic and produce an experimental analysis showing that it scales to real-world instances taken from the HapMap project.
1 Introduction
In this paper we investigate a computational problem arising in genetic studies of diploid organisms. In such organisms (which include all vertebrates), all chromosomes are in two copies, one inherited from the mother and one from the father. Since the two copies of a chromosome are almost identical except at specific variant positions called Single Nucleotide Polymorphisms (SNPs), the differences between the copies can be represented by a sequence of sites, each one bearing a specific value called an allele. In almost all cases, for each site at most two different alleles are present in the population, one of which is called major and the other minor. The sequence of alleles along a chromosome is called a haplotype, while a genotype is the sequence of unordered pairs of alleles that appear at each site of the two copies
Partially supported by FAR 2008 grant “Computational models for phylogenetic analysis of gene variations”. Partially supported by the MIUR Project “Mathematical aspects and emerging applications of automata and formal languages”. (2007)
of the chromosome. Haplotype data are crucial in genetic population studies. The current technology for determining the two haplotypes of an individual is too expensive to be used in genetic studies of a population. Fortunately, it is much cheaper to determine the xor-genotype of each individual, that is, the set of sites for which the individual is heterozygous, i.e., bears both a major and a minor allele. The term xor-genotype derives from the fact that a site is reported in the genotype if and only if the two alleles at the site are different. Thus a xor-genotype lists only heterozygous sites, while excluding sites bearing identical alleles, called homozygous sites.

The problem of reconstructing the haplotypes resolving a given set of xor-genotypes naturally arises and represents an interesting case of the process of inferring haplotypes from general genotypes (phasing). Polynomial-time algorithms for the problem have been developed [1,2] in the framework of the Perfect Phylogeny model. This model was originally proposed by Gusfield [10] to solve the phasing problem. In this paper, we investigate the problem under the parsimonious principle that asks for a smallest set of haplotypes resolving all input xor-genotypes: this problem will be called Pure Parsimony Xor Haplotyping (PPXH) and is formalized as follows. Let Σ be a set of characters (also called sites). Then a xor-genotype (or simply a genotype) x is a non-empty subset of Σ, and a haplotype h is a possibly empty subset of Σ. Given two distinct haplotypes h1, h2, the pair (h1, h2) resolves the xor-genotype x iff x = h1 ⊕ h2, where ⊕ denotes the symmetric difference of h1 and h2, i.e., the set of characters that are present in exactly one of the two haplotypes. Since ⊕ is associative and commutative, by a slight abuse of language, given a set S = {s1, . . . , sn} of subsets of Σ we denote by ⊕(S) the expression s1 ⊕ s2 ⊕ · · · ⊕ sn. A set H of haplotypes resolves a set X of xor-genotypes if for each genotype x ∈ X there exists a pair of haplotypes in H that resolves x. We are now able to formally introduce the problem studied in this paper.

Problem 1. Pure Parsimony Xor Haplotyping (PPXH). The instance of the problem is a set X of xor-genotypes, and the goal is to compute a smallest set H of haplotypes resolving X.

The pure parsimony model has been investigated as an approach to the phasing problem over general genotypes (i.e., where the alleles of each homozygous site are specified) [11]. There is a rich literature in this area. In particular, the APX-hardness of the problem [12] and the lack of good approximation guarantees have led many researchers to the development of methods based on linear programming techniques [5]. Indeed, the best known approximation algorithm yields an approximation guarantee of 2^(k−1), where k is the maximum number of heterozygous sites appearing in a genotype [12]. Restricted cases of the problem with polynomial-time solutions have been shown in [16]. In this paper we investigate the PPXH problem by devising exact solutions, either via fixed-parameter tractability or via polynomial-time algorithms for some restricted instances of the problem. First we show that finding an optimal solution of the PPXH problem corresponds to building
a specific graph representation of the xor-genotypes, called a xor-graph. The notion of xor-graph is crucial in the study of the PPXH problem, since most of the results presented in this paper rely on combinatorial properties of the xor-graph. Afterwards, we design two polynomial-time solutions for restricted instances of the PPXH problem. Subsequently, we design a fixed-parameter algorithm with O(mn + 2^(k²)km) running time, where k is the size of the optimum solution. Moreover, we provide a k-approximation algorithm, where k is the maximum number of occurrences of a character in the set of input genotypes. Finally, we propose a heuristic for the general problem and an experimental analysis on real and artificial datasets.
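To fix notation, the basic resolution check of Problem 1 can be written in a few lines of Python, representing haplotypes and xor-genotypes as frozensets of characters (this representation is our choice, not the authors').

def resolves(h1, h2, x):
    """(h1, h2) resolves x iff x is the symmetric difference h1 xor h2."""
    return h1 != h2 and h1 ^ h2 == x

def is_feasible(haplotypes, genotypes):
    """Does the haplotype set resolve every xor-genotype?"""
    hs = list(haplotypes)
    return all(any(resolves(hs[i], hs[j], x)
                   for i in range(len(hs)) for j in range(i + 1, len(hs)))
               for x in genotypes)

h0 = frozenset()                      # the null haplotype
H = [h0, frozenset({'a'}), frozenset({'a', 'b', 'c'})]
X = [frozenset({'a'}), frozenset({'b', 'c'}), frozenset({'a', 'b', 'c'})]
print(is_feasible(H, X))              # True: each genotype is a pairwise xor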
2 Basic Properties
A fundamental idea used in our paper is a graph representation of a feasible solution. More precisely, given a set X of xor-genotypes, the representation of a set H of haplotypes resolving X is the graph G = (H, E), called the xor-graph associated with H, where the edges of G are labeled by a bijective function λ : E → X such that, for each edge e = (hi, hj), λ(e) = hi ⊕ hj. The labeling λ is generalized to a set S by defining λ(S) = {λ(s) | s ∈ S}. We call a xor-graph associated with an optimal solution for X an optimal xor-graph for X. In this section we state some basic combinatorial properties of xor-graphs that will be used to prove the main results of the paper. Let A be a subset of Σ, and let B be a set of xor-genotypes or haplotypes. We identify a distinguished haplotype, called the null haplotype and denoted by h0, which corresponds to the empty set. Cycles of a xor-graph satisfy the following property.

Lemma 1. Let X be a set of xor-genotypes, let G be a xor-graph associated with a set of haplotypes resolving X, and let C be the edge set of a cycle of G. Then ⊕(λ(C)) is equal to the empty set.

The above property of cycles is sufficient for the construction of a set of haplotypes resolving a set of genotypes. Indeed, let X be an instance of PPXH and let G = (V, E) be a graph whose edges are bijectively labeled by a function λ : E → X such that ⊕(λ(C)) = ∅ for each cycle C of the graph; then it is immediate to compute a feasible solution H from G with |H| ≤ |V|. We associate with each vertex of G a haplotype as follows. Perform a depth-first visit of G and associate the null haplotype with the first visited vertex of each connected component of G. When visiting a new vertex v of G there must exist an edge e = (v, w) such that a haplotype h_w has been previously assigned to w; then the haplotype h_w ⊕ λ(e) is associated with v. Lemma 1 and our construction guarantee that H is actually a feasible solution for X. The following results justify our attention to connected xor-graphs and their cuts.

Lemma 2. Let X be a set of xor-genotypes and let G be a xor-graph associated with a set H of haplotypes resolving X. Let α be any character of Σ. Then the set A of edges of G whose label contains α is a cut of G.
Proof. Let Hα be the subset of H whose haplotypes contain the character α, and let H̄α = H \ Hα. Let E′ be the set of edges of G with one endpoint in Hα and one in H̄α (clearly E′ is a cut of G). Notice that E′ is exactly the set of edges connecting a haplotype containing α to a haplotype not containing α, therefore E′ = A.

Lemma 3. Let X be a set of xor-genotypes, and let G = (H, E) be an optimal xor-graph for X. Then G is connected.

Proof. Assume to the contrary that G has at least two connected components C1 and C2, and let a1, a2 be two vertices from C1 and C2, respectively. Construct the set H′ from H by replacing each haplotype h ∈ C1 by h ⊕ a1 and each haplotype h ∈ C2 by h ⊕ a2. Since C1 and C2 are not connected, the set of genotypes resolved by H′ is equal to that resolved by H. But both a1 and a2 are replaced by the null haplotype in H′, therefore H′ is strictly smaller than H, contradicting the minimality of H.

Instances and solutions of the PPXH problem can be represented by binary matrices. More precisely, we can have a genotype matrix associated with a set of xor-genotypes and a haplotype matrix associated with a set of haplotypes. In both matrices each column is uniquely identified by a character in Σ, while the rows of a genotype matrix (respectively, haplotype matrix) correspond to the genotypes (resp., haplotypes). Given an ordering σ1, . . . , σ|Σ| of the character set, the entry in the i-th row and j-th column of a genotype matrix (respectively, haplotype matrix) is 1 if σj belongs to the i-th genotype (respectively, i-th haplotype), and 0 otherwise. In the following we identify the rows of a genotype (or haplotype) matrix with the corresponding genotypes (or haplotypes). Given a matrix M, we denote by M[·, A] (respectively, M[B, ·]) the submatrix of M induced by the set A of columns (respectively, the set B of rows). Given a genotype or haplotype matrix M over Σ, we say that a subset Σ1 of Σ is a dependent set of characters in M if there exists a non-empty subset Σ1′ of Σ1 such that, for each row i, ⊕_{σ∈Σ1′} M[i, σ] = 0; otherwise Σ1 is called independent. While solving the PPXH problem, we can restrict our attention to a maximal independent subset of characters, as stated in the following lemma.

Lemma 4. Let X be a xor-genotype matrix and H a haplotype matrix over the same character set Σ. Let Σ′ be a maximal independent subset of Σ in X. Then H resolves X if and only if H[·, Σ′] resolves X[·, Σ′].

Proof. The only-if part is obviously true because H[·, Σ′] and X[·, Σ′] are two submatrices of, respectively, H and X. The if part can be proved by constructing a feasible solution H for X from the smaller solution H′ = H[·, Σ′] for X′ = X[·, Σ′]. For each character α ∈ Σ \ Σ′, since Σ′ ∪ {α} is dependent, there exists a non-empty subset Σα of Σ′ such that for each genotype x, X[x, α] = ⊕_{σ∈Σα} X′[x, σ]. Set, for each haplotype h, the entry H[h, α] to ⊕_{σ∈Σα} H′[h, σ]. We now have to prove that H resolves X. Since H′ resolves X′, it suffices to prove that for each character α ∈ Σ \ Σ′, H[h1, α] ⊕ H[h2, α] = X[x, α] for some pair of haplotypes h1, h2.
We already know that for each genotype x′ of X′ there is a pair (h1, h2) of haplotypes in H′ that resolves x′. Notice that X[x, α] = ⊕_{σ∈Σα} X′[x, σ] since Σ′ is a maximal subset of independent characters of X. Since H′ resolves X′, ⊕_{σ∈Σα} X′[x, σ] = ⊕_{σ∈Σα} (H′[h1, σ] ⊕ H′[h2, σ]). Moreover, by the associativity of ⊕, ⊕_{σ∈Σα}(H′[h1, σ] ⊕ H′[h2, σ]) = (⊕_{σ∈Σα} H′[h1, σ]) ⊕ (⊕_{σ∈Σα} H′[h2, σ]). Finally, by our construction of the columns of H corresponding to characters in Σ \ Σ′, (⊕_{σ∈Σα} H′[h1, σ]) ⊕ (⊕_{σ∈Σα} H′[h2, σ]) = H[h1, α] ⊕ H[h2, α], hence completing the proof.

Notice that, given an n × m xor-genotype matrix X, a maximal subset of independent characters of X can be extracted by applying the Gauss elimination algorithm to X, which requires O(nm²) time. Observe also that the proof of Lemma 4 shows how to efficiently compute a solution H for X from a solution H[·, Σ′] for X[·, Σ′].

We can introduce another simplification of the instance which can be performed efficiently. It affects the construction of the xor-graph and allows an efficient reconstruction of an optimal xor-graph for the general instance, given a xor-graph for the reduced or simplified instance.

Lemma 5. Let X be an instance of PPXH, and let α be a character of X such that there exists exactly one genotype x ∈ X with α ∈ x. Then there exists an optimal xor-graph G for X such that there is a vertex v of G with exactly one edge e incident on v and λ(e) = x.

Proof. Let G be an optimal xor-graph for X. Since α appears in only one genotype of X, there is exactly one edge e of G such that α ∈ λ(e). By Lemma 2, removing e from G results in a bipartition {Hα, H̄α}, where Hα consists of the haplotypes containing α. Let v ∈ Hα and w ∈ H̄α be the two endpoints of the edge labeled by x, and let D be the set of vertices of Hα adjacent to v. Change each haplotype h ∈ Hα \ {v} to h ⊕ v ⊕ w, obtaining a new optimal xor-graph G1 with edge set E′ := (E \ {(v, d) | d ∈ D}) ∪ {(w, d) | d ∈ D}. Indeed, each genotype d ⊕ v that labels in G an edge between v and a vertex d ∈ D labels in G1 a new edge connecting w to d. The optimality of G1 is trivial, as no new vertices have been introduced. Note that, by our construction of {Hα, H̄α}, in G1 only v contains α, thus proving the lemma.

An instance X of PPXH is called reduced if (i) X consists of only one genotype, or the two following conditions are satisfied: (ii.a) the set of characters of X is independent, and (ii.b) each character appears in at least two genotypes. Lemmas 4 and 5 justify the assumption, made in the rest of the paper, that all instances are reduced, as the reduction process can be performed efficiently and we can easily compute a solution of the original instance from a solution of the reduced instance. Moreover, the reduction process leads us to an important lower bound on the size of the optimum.

Lemma 6. Let X be a reduced genotype matrix having n rows and m columns. Then any haplotype matrix H resolving X has at least m + 1 rows.
Proof. Let G be a xor-graph for X. By Lemma 2, each character α induces a cut in G. Each cut can be represented as an n-long binary vector cα in which cα[i] is equal to 1 if and only if the genotype xi belongs to the cut. Clearly, such a vector is precisely the column vector corresponding to character α in the matrix X. Thus, since the characters are independent, the family of the cuts (represented as binary vectors) induced by the set of characters is also linearly independent. By Theorem 1.9.6 of [6], every connected graph with m independent cuts has at least m + 1 vertices.

As a consequence, in a reduced xor-genotype matrix the number of rows is greater than or equal to the number of columns. In fact, in any matrix the number of linearly independent columns is equal to the number of linearly independent rows and is clearly bounded by the minimum of the number of columns and the number of rows.
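As a concrete illustration of the haplotype-labeling construction described after Lemma 1, here is a minimal sketch; the graph encoding (an edge list with frozenset labels) is our own assumption.

def label_haplotypes(vertices, edges):
    """edges: list of (u, v, genotype) with genotype a frozenset."""
    adj = {v: [] for v in vertices}
    for u, v, g in edges:
        adj[u].append((v, g))
        adj[v].append((u, g))
    hap = {}
    for root in vertices:               # one visit per connected component
        if root in hap:
            continue
        hap[root] = frozenset()         # the null haplotype h0
        stack = [root]
        while stack:
            u = stack.pop()
            for v, g in adj[u]:
                if v not in hap:
                    hap[v] = hap[u] ^ g # haplotype of v = haplotype of u xor label
                    stack.append(v)
    return hap

edges = [('u', 'v', frozenset({'a'})), ('v', 'w', frozenset({'a', 'b'}))]
print(label_haplotypes(['u', 'v', 'w'], edges))
# {'u': frozenset(), 'v': frozenset({'a'}), 'w': frozenset({'b'})}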
3 Polytime Algorithms for Restricted Instances
In this section we investigate two restrictions of the PPXH problem, obtained by bounding the number of characters that can appear in each genotype and the number of genotypes in which a character occurs. These restrictions are summarized by the following formulation.

Problem 2. Constrained Pure Parsimony Xor Haplotyping (PPXH(p, q)). The instance consists of a set X of xor-genotypes, where each xor-genotype x ∈ X contains at most p characters, and each character appears in at most q xor-genotypes. The goal is to compute a minimum-cardinality set H of haplotypes that resolves X.

We use the symbol ∞ when one of the parameters p or q is not bounded.

A polynomial-time algorithm for PPXH(∞, 2). The structure of the cycles in a xor-graph characterizes the solutions of the PPXH(∞, 2) problem, as stated in the following lemma, whose proof is omitted.

Lemma 7. Let X be a reduced instance of PPXH(∞, 2), let G be an optimal xor-graph for X, and let e be an edge of G. Then e belongs to exactly one simple cycle of G.

Since the optimal xor-graph consists of a set of edge-disjoint cycles (Lemma 7) and since it must be connected (Lemma 3), the size of the optimum solution is equal to |X| + 1 − |C|, where |X| is the number of genotypes (that is, of edges of the graph) and |C| is the number of simple cycles of the graph. An algorithm solving PPXH(∞, 2) therefore consists of building the optimal xor-graph, which in this case means computing a most refined partition of the genotypes X into non-empty sets X1, . . . , Xk with ⊕Xi = ∅ for each Xi. Each element Xi of the partition, being a subset of genotypes whose sum is equal to ∅, will correspond to a cycle of the optimal xor-graph. In the following we show an algorithm that computes such a partition
in polynomial time. The algorithm iteratively builds an auxiliary graph whose vertices are the input genotypes X and in which two genotypes are adjacent if and only if they share a common character. Since ⊕Xi = ∅ for each set of the partition of X, for each character c either none or both of the genotypes containing c must belong to Xi, which implies that, given two genotypes that are adjacent in the auxiliary graph, either none or both of them must be in Xi. This fact immediately implies that we can obtain an optimal xor-graph composed of the edge-disjoint cycles corresponding to the connected components of the auxiliary graph; a sketch in code is given at the end of this subsection.

A polynomial-time algorithm for PPXH(2, ∞). For simplicity's sake we assume that the instance of the problem is a genotype matrix X and the desired output is a haplotype matrix H. The algorithm is based on Lemmas 4 and 6. We first compute a largest set Σ′ of independent characters of X, and for each character α ∈ Σ \ Σ′ we determine the subset Σα of Σ′ such that X[·, α] = ⊕_{σ∈Σα} X[·, σ]. Notice that this step can be carried out by a simple application of the Gauss elimination algorithm. Let X′ be the submatrix X[·, Σ′]. An optimal solution of the instance X′ is the matrix H′ containing |Σ′| + 1 rows, where the i-th row of H′, for 1 ≤ i ≤ |Σ′|, consists of all zeroes except for the i-th column (where it contains 1), and the last row contains only zeroes. Clearly H′ resolves X′: each row of X′ contains at most two 1s, as the same property holds for X, therefore for each row r of X′ there are two rows of H′ resolving r. The optimality of this solution is a direct consequence of Lemma 6. A feasible solution H for X can then be easily computed from H′, as shown in the proof of Lemma 4.
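Here is the promised sketch of the PPXH(∞, 2) algorithm, using union-find to compute the connected components of the auxiliary graph (the identifiers are ours).

def cycle_partition(genotypes):
    """genotypes: list of frozensets; each character occurs in <= 2 of them."""
    parent = list(range(len(genotypes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    seen = {}                        # character -> first genotype containing it
    for i, x in enumerate(genotypes):
        for c in x:
            if c in seen:
                parent[find(i)] = find(seen[c])
            else:
                seen[c] = i
    groups = {}
    for i in range(len(genotypes)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

X = [frozenset('ab'), frozenset('bc'), frozenset('ca'),
     frozenset('de'), frozenset('ef'), frozenset('fd')]
parts = cycle_partition(X)
print(parts)                         # [[0, 1, 2], [3, 4, 5]]: two cycles
print(len(X) + 1 - len(parts))       # optimum size |X| + 1 - |C| = 5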
4 Fixed-Parameter Tractability of PPXH
The unrestricted PPXH problem is fixed-parameter tractable (where the parameter is the size of the optimum), as can be seen by a simple observation. Let H be a set of haplotypes and let X be a set of genotypes resolved by H; in the following we denote |X| by n. Since H can resolve at most C(|H|, 2) genotypes, we have n ≤ C(|H|, 2), hence |H| ≥ √(2n). In other words, if k is the size of a minimum-cardinality set of haplotypes resolving X, then n ∈ O(k²). The number of possible graphs with at most n + 1 vertices and exactly n edges is no more than 2^(2n log₂(n+1)) = (n + 1)^(2n), which, by our previous observation, is O(k^(4k²)), i.e., a function depending only on k. The time needed to check whether one such graph is a xor-graph for X is clearly polynomial in n, and thus we immediately derive a fixed-parameter algorithm to find an optimal xor-graph for X. The time complexity of this algorithm is well beyond what is deemed acceptable in practice; therefore we propose a more efficient algorithm based on the matrix representation of genotypes and haplotypes. In the following we assume that the genotype matrix X is reduced, that X has n rows and m independent columns, and that we are looking for a haplotype
matrix H with at most k distinct rows that resolves X. In the naïve approach, testing whether a haplotype matrix resolves a genotype matrix requires O(k²nm) time, because each pair of haplotypes has to be considered and each resulting genotype has to be searched for in the genotype matrix. Our basic idea, instead, is to enumerate all the haplotype matrices by changing only one haplotype at a time, so that only k − 1 new pairs of haplotypes must be considered when testing whether H resolves X. We use Gray codes [13] to visit all the haplotype matrices in such a way that each pair of consecutive matrices differs by a single entry and, thus, by a single haplotype. More precisely, we enumerate all k × m matrices by generating all km-long bit vectors, where the bits from position (i − 1)m + 1 to position im of a km-long vector give the i-th row of the matrix (for 1 ≤ i ≤ k). The fastest known algorithm for computing the next vector of a Gray code requires constant time per invocation [3]. Observe that the naïve algorithm requires O(nm) time to test whether there is a genotype in X resolved by a given pair of haplotypes. By representing the set of row vectors of X as a binary trie [8], the time required to get the index of the row containing an m-long binary vector is reduced to O(m). The details of the fixed-parameter algorithm are given in Algorithm 1, where we also use some additional data structures: the array ResolvedByHowMany, which associates with each genotype the number of pairs of haplotypes resolving it, and ListResolvedG, which associates with each haplotype h a list of the relevant pairs of haplotypes in which h is involved. The elements of the lists in ListResolvedG are triples (h1, h2, x), where (h1, h2) is a pair of haplotypes resolving x. Notice that the outermost foreach loop iterates 2^(km) times, while the inner for loop iterates k times; each iteration of the latter consists of a lookup in a trie (which can be done in O(m) time) and updates to some arrays and lists. Since each operation on these data structures requires constant time and each list can contain at most k elements, each iteration of the outermost loop requires O(km) time, resulting in an overall O(nm + 2^(k²)km) time complexity, since m ≤ k.
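The Gray-code bookkeeping that drives Algorithm 1 can be illustrated in a few lines; note that the bit-to-row mapping below is zero-based, a slight departure from the one-based convention used in the text.

def gray(i):
    return i ^ (i >> 1)              # the i-th reflected Gray code

def changed_bit(i):
    """Position of the single bit in which gray(i-1) and gray(i) differ."""
    return (i & -i).bit_length() - 1

k, m = 2, 2                          # a 2 x 2 haplotype matrix: 4-bit vectors
for i in range(1, 1 << (k * m)):
    bit = changed_bit(i)
    row, col = divmod(bit, m)        # which haplotype row was changed
    assert gray(i - 1) ^ gray(i) == 1 << bit
    print(f"step {i}: flip H[{row}][{col}]")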
5 An Approximation Algorithm
We present a simple approximation algorithm with factor k, where k is the maximum number of xor-genotypes in which a character appears, for a reduced instance X of PPXH. Initially the set H of haplotypes computed by the algorithm contains only the null haplotype. While the set of genotypes is not empty, pick a character α that appears in at least one genotype, move to H all genotypes containing α, and remove from X all genotypes that are resolved by a pair of haplotypes in H. Clearly the final set H of haplotypes resolves the set X of genotypes. The proposed algorithm returns a solution at most k times larger than the optimum, which, by Lemma 6, is at least |Σ| + 1. Indeed, our algorithm starts with a solution H containing only the null haplotype and at each iteration adds at most k haplotypes to H, as k is the maximum number of genotypes containing any character. Since there can be at most |Σ| iterations, |H| ≤ k|Σ| + 1.
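A direct transcription of this greedy procedure (ours; the quadratic resolution re-check is kept naïve for brevity):

def approx_ppxh(genotypes):
    X = set(genotypes)
    H = {frozenset()}                       # start from the null haplotype
    while X:
        alpha = next(iter(next(iter(X))))   # any character of any genotype
        H |= {x for x in X if alpha in x}   # genotypes with alpha become haplotypes
        X = {x for x in X
             if not any(h1 ^ h2 == x for h1 in H for h2 in H if h1 != h2)}
    return H

X = [frozenset('ab'), frozenset('a'), frozenset('b')]
print(sorted(map(set, approx_ppxh(X)), key=len))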
Algorithm 1. A fixed-parameter algorithm for PPXH
Data: A genotype matrix X defined over a set of m independent characters, and an integer k.
Result: a set H of at most k haplotypes resolving X if it exists, No otherwise.

if C(k, 2) < n or k < m then return No;
if k > n then return X ∪ {h0};
Build a trie T that stores the xor-genotypes contained in X;
Let ListResolvedG be an array of k initially empty lists;
ResolvedByHowMany ← (0, 0, . . . , 0); TotalResolvedG ← 0;
foreach binary matrix H in Gray code order do
    if H is the matrix containing only zeroes then continue to the next matrix;
    ChangedRow ← index of the row changed from the previous iteration;
    /* Update state of xor-genotypes resolved by the changed haplotype */
    foreach entry (h1, h2, x) of ListResolvedG[ChangedRow] do
        Remove (h1, h2, x) from ListResolvedG[h1] and ListResolvedG[h2];
        ResolvedByHowMany[x] ← ResolvedByHowMany[x] − 1;
        if ResolvedByHowMany[x] = 0 then TotalResolvedG ← TotalResolvedG − 1;
    /* Look for genotypes resolved by the new haplotype */
    for r ← 1 to k do
        if the lookup of the vector H[r, ·] ⊕ H[ChangedRow, ·] in T returns an index l then
            if ResolvedByHowMany[l] = 0 then TotalResolvedG ← TotalResolvedG + 1;
            ResolvedByHowMany[l] ← ResolvedByHowMany[l] + 1;
            Add (r, ChangedRow, l) to ListResolvedG[r] and to ListResolvedG[ChangedRow];
    if TotalResolvedG = n then /* all genotypes are resolved */
        Remove from H all duplicate rows;
        return H;
return No;
6 Solving PPXH by a Heuristic Method
In this section we propose a heuristic algorithm to build a near-optimal xor-graph for an input genotype matrix X. Observe that an optimal xor-graph for X is a graph with the minimum number of vertices whose edges are uniquely labeled by the genotypes of X. By Lemma 1, a cycle of the xor-graph consists of a subset X′ of the input genotypes such that ⊕X′ = ∅: we call such a subset X′ a candidate cycle. The basic idea that guides our heuristic is first to select a subset of the candidate cycles of X and then to build a labeled graph (a xor-graph) in which the selected candidate cycles are actual cycles. The procedure then iterates over the genotypes that are not yet realized in the xor-graph. If we have selected a set of candidate cycles which are fundamental cycles of a graph G, then we can compute such a graph by solving the Graph Realization (GR)
problem [15] on those cycles. In fact, the Graph Realization problem consists of building a graph given its fundamental cycles. Recall that the set C of fundamental cycles of a graph G with respect to a fixed spanning tree T of G is defined as C := {the unique cycle of T ∪ {e} | e ∈ E(G) \ E(T)} (see e.g. [6], p. 26). The Graph Realization problem can be formally stated as follows [15]. Given two disjoint sets T and C, the input of the GR problem is a family F of subsets of T ∪ C such that (i) for each set Fi of the family F, Fi ∩ C = {ci}, and (ii) for each pair of subsets Fi and Fj of F, Fi ∩ Fj ∩ C = ∅. The GR problem consists of finding a labeled graph G = (V, E) (if such a graph exists) which realizes F, that is, there is a bijection between the set T and a spanning tree of G, and the elements of each set Fi label exactly the edge set of a (simple) cycle of G. Notice that almost-linear-time algorithms exist for this problem [4,9], which, with hindsight, have inspired our heuristic. We denote by G(F) a graph realization of a family of sets F.

The heuristic transforms an r × c genotype matrix X into an instance of GR in the following two main steps. In a first step, the set T is defined as a subset T = {x_{j1}, . . . , x_{jc}} of linearly independent input genotypes of X. This means that any other input genotype x_i can be expressed as a linear combination α_i(1)x_{j1} ⊕ · · · ⊕ α_i(c)x_{jc} of the genotypes in T. Then C = {c1, . . . , c_{r−c}} is defined as the set of genotypes not in T. In a second step, the family of subsets of T ∪ C forming an instance of GR is built from the sets Fi = {ci} ∪ {x_{jl} ∈ T | α_i(l) = 1}. Informally, Fi consists of ci and the unique subset Pi ⊆ T such that ⊕Pi = ci. An immediate consequence of this definition is that ⊕Fi = ∅, therefore each Fi is a candidate cycle. Computing the set T from X is simply a matter of running the Gauss elimination algorithm on Xᵀ, and the family F can be easily inferred by computing the coefficients α_i(1), . . . , α_i(c) for all ci ∈ C. Indeed, the Gauss elimination procedure applied to the matrix Xᵀ results in a matrix R whose first c columns form the identity matrix, while the other columns are the vectors of linear combination coefficients.

Now, let us detail the construction of the family F forming an instance of GR. The heuristic starts by defining F as an empty family and iteratively adds a candidate cycle Fi to F if and only if the resulting family admits a graph realization. Clearly, this approach ends with a maximal subset of candidate cycles that admits a graph realization. The two steps of the heuristic described above are then recursively iterated on the set of xor-genotypes of X that do not label an edge of the computed graph realization.

Let n and m be, respectively, the number of xor-genotypes and of sites. The time complexity of the heuristic is determined by the time complexity of the Gauss elimination algorithm, which requires O(n²m) time because it is called on the matrix Xᵀ, and of the Graph Realization algorithm, whose time complexity is O(α(n, m)nm), where α is the inverse Ackermann function. Notice that the Graph Realization algorithm is repeated at most n times in order to compute a maximal subfamily F, hence requires O(α(n, m)n²m) time. Finally, there is at least one xor-genotype of X that labels an edge of the graph realization, hence the total number of iterations is at most n, leading to an overall time complexity of O(α(n, m)n³m).
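The first step of the heuristic, Gauss elimination over GF(2) with bookkeeping of the linear-combination coefficients, can be sketched as follows; encoding the genotypes as integer bitmasks is our own choice.

def build_gr_instance(rows):
    """rows: genotypes as integer bitmasks; returns (T, family) where each
    family entry (i, P_i) means genotype i equals the xor of the T-rows P_i."""
    basis = {}                        # pivot bit -> (vector, frozenset of T indices)
    T, family = [], []
    for i, row in enumerate(rows):
        acc, combo = row, frozenset()
        while acc:
            p = acc.bit_length() - 1  # current highest set bit
            if p not in basis:
                break
            vec, used = basis[p]
            acc ^= vec                # invariant: acc = rows[i] xor (rows over combo)
            combo = combo ^ used
        if acc:                       # independent: i joins the spanning set T
            basis[acc.bit_length() - 1] = (acc, combo | {i})
            T.append(i)
        else:                         # dependent: rows[i] = xor of the T-rows in combo
            family.append((i, sorted(combo)))
    return T, family

rows = [0b011, 0b110, 0b101, 0b010]   # characters as bits; row 2 = row 0 xor row 1
print(build_gr_instance(rows))        # ([0, 1, 3], [(2, [0, 1])])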
Table 1. Results on random instances. For each choice of the first three columns, 10 random instances were generated and the average size of the result obtained by the heuristic is reported in column avg. result. The last column (ratio) reports the ratio between the fourth and the second column. The smaller the ratio, the better the solution computed.

number of  number of   number of   avg.    avg.
genotypes  haplotypes  characters  result  ratio
100        50          50          50      1
100        50          33          79.2    1.58
100        50          66          50      1
100        33          50          33      1
100        33          33          33.7    1.02
100        33          66          33      1
100        66          50          69.7    1.05
100        66          33          87.2    1.32
100        66          66          63      0.95
200        70          70          70.4    1
200        70          66          74.2    1.06
200        70          133         70      1
200        66          70          66      1
200        66          66          66      1
200        66          133         66      1
200        133         70          186.4   1.4
200        133         66          187.2   1.4
200        133         133         126     0.94
300        86          86          87      1.01
300        86          100         86      1
300        86          200         86      1
300        100         86          131.2   1.31
300        100         100         100.1   1
300        100         200         100     1
300        200         86          283     1.41
300        200         100         282.4   1.41
300        200         200         191     0.95
400        100         100         100.4   1
400        100         133         100     1
400        100         266         100     1
400        133         100         193.7   1.45
400        133         133         133     1
400        133         266         133     1
400        266         100         383.4   1.44
400        266         133         380.7   1.43
400        266         266         250     0.93
total number of iterations is at most n, leading to an overall time complexity of O(α(n, m)n³m).

We have implemented our heuristic as a C program and have experimentally analyzed its behavior. Our analysis consists of two parts. The first part was performed on a set of random synthetic instances with the goal of determining the quality of the results produced, while the second part was performed on some real-world instances in order to determine whether the implementation can be successfully applied to large instances. Our random instances are created by first generating a set of haplotypes, where each haplotype is a random fixed-length binary sequence, and then generating xor-genotypes, each obtained from a random pair of haplotypes. Since the outcome of our heuristic can be influenced by the order of the input genotypes, for each instance we created ten random permutations of the genotypes and chose the smallest set of haplotypes computed over all those permutations.

Table 1 reports the average size of the solutions computed by our heuristic over 10 instances for each choice of the three parameters, namely the number of input genotypes, the number of haplotypes, and the number of characters. The main indicator of the quality of the results computed by the heuristic is the ratio between the size of the computed solution and the number of original haplotypes (notice that the latter is an upper bound on the size of the optimal solution). The results in Table 1 show that the average ratio is never larger than 1.58 and quite often very close to 1. Moreover, the heuristic seems to produce better solutions when the optimal xor-graph is dense, that is, when the number of genotypes is large relative to the number of original haplotypes.
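For concreteness, the Gauss-elimination step described above can be made explicit. The following Python fragment is our own illustrative sketch, not the authors' C implementation: it partitions the input xor-genotypes into the linearly independent set T and the dependent set C, and records for each dependent genotype the subset Pi of T whose XOR reproduces it, so that each candidate cycle is Fi = {ci} ∪ Pi.

```python
def split_basis(genotypes):
    # Gauss elimination over GF(2).  Each genotype is a 0/1 list; we keep
    # a reduced basis, the pivot column of each basis vector, and the set
    # of input rows whose XOR equals that basis vector.
    basis, pivots, members = [], [], []
    T, C, coeffs = [], [], {}
    for i, g in enumerate(genotypes):
        v, combo = list(g), set()
        for b, p, m in zip(basis, pivots, members):
            if v[p]:                          # eliminate b's pivot from v
                v = [x ^ y for x, y in zip(v, b)]
                combo ^= m                    # track the T-rows used so far
        if any(v):                            # independent: row i joins T
            basis.append(v)
            pivots.append(v.index(1))
            members.append(combo ^ {i})
            T.append(i)
        else:                                 # dependent: row i joins C
            C.append(i)
            coeffs[i] = combo                 # P_i: XOR of these T-rows is g
    return T, C, coeffs
```

Each Fi can then be tested for addition to the family F by one call to a Graph Realization routine, as described above.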
To validate the feasibility of applying our heuristic to large instances, we produced some instances from the Phase I dataset of the HapMap project [14] (release 2005-06 16c.1). A set of xor-genotypes was produced from the data for each population in the dataset (discarding non-biallelic sites and non-autosomal chromosomes). Those instances vary from 44 genotypes and 184,604 sites to 90 genotypes and 91,812 sites. On all those instances our heuristic never required more than 5.027 seconds on a standard PC with 2GB of memory under Ubuntu Linux 8.10, clearly establishing that the heuristic can be successfully used on large real-world instances.
References

1. Barzuza, T., Beckmann, J.S., Shamir, R., Pe'er, I.: Computational problems in perfect phylogeny haplotyping: Xor-genotypes and tag SNPs. In: Sahinalp, S.C., Muthukrishnan, S.M., Dogrusoz, U. (eds.) CPM 2004. LNCS, vol. 3109, pp. 14–31. Springer, Heidelberg (2004)
2. Barzuza, T., Beckmann, J.S., Shamir, R., Pe'er, I.: Computational problems in perfect phylogeny haplotyping: Typing without calling the allele. IEEE/ACM Trans. on Comput. Biol. and Bioinform. 5(1), 101–109 (2008)
3. Bitner, J.R., Ehrlich, G., Reingold, E.M.: Efficient generation of the binary reflected Gray code and its applications. Comm. of the ACM 19(9), 517–521 (1976)
4. Bixby, R.E., Wagner, D.K.: An almost linear-time algorithm for graph realization. Mathematics of Operations Research 13, 99–123 (1988)
5. Brown, D.G., Harrower, I.M.: Integer programming approaches to haplotype inference by pure parsimony. IEEE/ACM Trans. on Comput. Biol. and Bioinform. 3(2), 141–154 (2006)
6. Diestel, R.: Graph Theory, 3rd edn. Graduate Texts in Mathematics, vol. 173. Springer, Heidelberg (2005)
7. Downey, R., Fellows, M.: Parameterized Complexity. Springer, Heidelberg (1999)
8. Fredkin, E.: Trie memory. Comm. of the ACM 3(9), 490–499 (1960)
9. Fujishige, S.: An efficient PQ-graph algorithm for solving the graph realization problem. J. of Computer and System Sciences 21, 63–86 (1980)
10. Gusfield, D.: Haplotyping as perfect phylogeny: Conceptual framework and efficient solutions. In: Proc. 6th RECOMB, pp. 166–175 (2002)
11. Gusfield, D.: Haplotype inference by pure parsimony. In: Baeza-Yates, R., Chávez, E., Crochemore, M. (eds.) CPM 2003. LNCS, vol. 2676, pp. 144–155. Springer, Heidelberg (2003)
12. Lancia, G., Pinotti, M.C., Rizzi, R.: Haplotyping populations by pure parsimony: Complexity of exact and approximation algorithms. INFORMS J. Comp. 16(4), 348–359 (2004)
13. Savage, C.: A survey of combinatorial Gray codes. SIAM Rev. 39(4), 605–629 (1997)
14. The International HapMap Consortium: A haplotype map of the human genome. Nature 437(7063), 1299–1320 (2005)
15. Tutte, W.T.: An algorithm for determining whether a given binary matroid is graphic. Proc. of the American Mathematical Society 11(6), 905–917 (1960)
16. van Iersel, L., Keijsper, J., Kelk, S., Stougie, L.: Shorelines of islands of tractability: Algorithms for parsimony and minimum perfect phylogeny haplotyping problems. IEEE/ACM Trans. on Comput. Biol. and Bioinform. 5(2), 301–312 (2008)
A Decomposition of the Pure Parsimony Haplotyping Problem

A. Holder and T. Langley

Rose-Hulman Institute of Technology, Terre Haute, IN 47803
Abstract. We partially order a collection of genotypes so that we can represent the NP-Hard problem of inferring the least number of haplotypes in terms of substructures we call g-lattices. This representation allows us to prove that the problem can be solved efficiently if the genotypes partition into chains with a certain structure. Even without the specified structure, the decomposition shows how to separate the underlying integer programming model into smaller models.
1 Introduction
The pure parsimony problem is to infer a maximally parsimonious collection of genetic donations that can combine to form a new population's diversity over portions of the chromosome. The problem was presented in 1990 by Clark in [1], although not in terms of an optimization problem. Gusfield posed the question as a combinatorial optimization problem in [2], and it was further suggested to the mathematical programming community in [3]. The problem has received significant attention as an integer program (IP), with the first model being proposed in [4]. Although this model's size grows exponentially in the number of heterozygous positions in the genotypes, it tends, but is certainly not guaranteed, to solve efficiently as long as the problem is within memory limitations. Several authors have suggested alternative, polynomial-size integer programs [5,6,7,8]. Although the problem is APX-Hard [8], the case in which each genotype has no more than two heterozygous positions is solvable in polynomial time [9]. The supportive literature is large and growing, and we point interested readers to the bibliography in [10] and the more recent work in [6].

Our objectives are twofold. First, we provide a polynomial bound on the pure parsimony problem and establish conditions under which this bound is indeed the optimal solution; hence, we identify a sub-class of problems that is solvable in polynomial time. This result is independent of the number of heterozygous positions in the genotypes. The underlying mathematics is based on a partial ordering of the genotypes that partitions them into a collection of substructures that we term g-lattices. If each g-lattice is a chain, that is, if each g-lattice is linearly ordered, then the top elements are used to decide whether or not the problem decouples into smaller problems whose solutions are easy to calculate and aggregate to form the overall solution. If the problem does not decouple into chains, then the g-lattice decomposition is used to heuristically solve the
problem. The fact that the general problem is APX-Hard supports such tactics, especially in light of the recent growth in genotypic information [11]. Our results show that on average we can find a polynomial-time solution of size 1.53 times the number of genotypes, and we can further reduce this to 1.09 times the number of genotypes if we solve the smaller IPs for each g-lattice. In comparison, the minimum solution calculated by the model in [4] was 0.55 times the number of genotypes, but this calculation required a 239% increase in solution time.

This article continues with an introduction to our notation and a formal statement of the pure parsimony problem. This is followed by Section 3, in which we discuss the decomposition imposed by the partial ordering. Our polynomial upper bound is established there. We describe our heuristic based on the g-lattice decomposition in Section 4. Our numerical results are presented in Section 5, which is followed by a conclusion that discusses future directions.
2 Notation and Problem Statement
Diploid organisms, such as humans, receive half of their genetic code from each parent. The vast majority of each genome is the same across a population, but the locations that differ provide the diversity observed in the population. These locations are called Single Nucleotide Polymorphisms (SNPs), and a sequence of these is called a genotype. The parental sequences that combine to give the genotype are called haplotypes. If the haplotypes agree at a SNP, then the SNP is homozygous. Otherwise, the SNP is heterozygous.

We follow the notation in [5] for representing haplotypes and genotypes. Haplotype locations have one of two possible states, denoted by 0 or 1. The child's genotype is the direct sum of two sequences built over these values. So, if one haplotype is h = (1, 1, 0, 0) and another is h′ = (0, 1, 1, 0), then the resulting genotype is h + h′ = (1, 1, 0, 0) + (0, 1, 1, 0) = (1, 2, 1, 0) = g. Therefore, in a genotype, a 0 or 2 represents a homozygous SNP, while a 1 represents a heterozygous SNP (we note that this is not the standard notation in the biological literature, wherein 2 typically represents a heterozygous SNP). In general, we consider a collection of m genotypes constructed over the alphabet {0, 1, 2}. We further assume that each genotype contains n SNPs, so each genotype g is in {0, 1, 2}^n. Similarly, each haplotype is in {0, 1}^n. We use superscripts to distinguish genotypes (or haplotypes) and subscripts to denote location within a genotype or haplotype. So, for example, g^i_j represents SNP location j in genotype g^i.

In the presence of heterozygous SNPs, the haplotypes that can mate to form a genotype are not uniquely defined. In particular, the pair (0, 1, 0, 0) and (1, 1, 1, 0) could also have formed the genotype in the preceding example. This means that inferring the haplotypes requires additional assumptions, and one such assumption is that of parsimony, which favors small collections of haplotypes. The problem of inferring a most parsimonious solution is called the
pure parsimony problem, and this is the problem we consider. Other objectives, such as the construction of a perfect phylogeny, are also common [12,13,14]. We say that haplotypes h and h′ mate to form genotype g if h + h′ = g. In this case we also say that h (or h′) resolves g. Two genotypes are incompatible if there is no haplotype that can resolve both. A set of haplotypes H resolves a set of genotypes G if every genotype in G has a pair of mates in H. In this terminology, the pure parsimony problem is to find a smallest set of haplotypes that resolves the known set of genotypes. We refer to such a set as a minimum or optimal solution for the set of genotypes.
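To make these definitions concrete, here is a small sketch (our own toy code over the {0, 1, 2} encoding; the authors' implementation uses Octave and Perl, as discussed in Section 5):

```python
def mates(h1, h2):
    # Two haplotypes mate to form the genotype given by entrywise sums,
    # so a 0 or 2 entry is homozygous and a 1 entry is heterozygous.
    return tuple(a + b for a, b in zip(h1, h2))

def resolves(h, g):
    # h resolves g iff the required mate g - h is itself a 0/1 haplotype.
    return all(gi - hi in (0, 1) for hi, gi in zip(h, g))

# The example from the text: (1,1,0,0) and (0,1,1,0) mate to (1,2,1,0),
# and the alternative pair (0,1,0,0), (1,1,1,0) forms the same genotype.
assert mates((1, 1, 0, 0), (0, 1, 1, 0)) == (1, 2, 1, 0)
assert resolves((0, 1, 0, 0), (1, 2, 1, 0))
```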
3 A Polynomial-Time Upper Bound Based on Ordering Genotypes
The first observation that ordering genotypes can provide a closed-form solution to the pure parsimony problem is found in [15], where it is shown that if m genotypes, each with at least one heterozygous SNP, form a chain under a partial ordering, then an optimal solution contains m + 1 haplotypes. The goal of this section is to apply similar methods to structures other than chains. This leads to a polynomial-time upper bound on solution size.

Following [15], we partially order {0, 1, 2} with ⪯ defined by 0 ⪯ 1, 2 ⪯ 1 and a ⪯ a for each a in {0, 1, 2}. This leaves 0 and 2 incomparable. Similarly, we order {0, 1, 2}^n with componentwise comparisons. If G is a collection of genotypes, we call each connected component of the Hasse diagram of G under ⪯ a g-lattice. We use this term (rather than just lattice) to reflect the fact that a g-lattice need not have a greatest or least element, and so may not be a lattice in the usual sense. Figure 1 illustrates a collection of genotypes that divides into three g-lattices under ⪯.

If a set of genotypes G forms a chain under this ordering, then it was shown in [15] that a minimum solution has size |G| + 1 if the minimal element in the chain has at least one heterozygous SNP. A trivial modification of the proof reduces this by one if the minimal element of the chain has no heterozygous SNPs.

Theorem 1. Suppose G is a collection of m genotypes that form a chain under ⪯. Then a minimum solution to G has size m + 1 if the minimal element in the chain has at least one heterozygous SNP. Otherwise a minimum solution has size m.

Proof. The key to the proof in [15] is the following fact. Suppose g^1, g^2 and g^3 are distinct genotypes with g^1 ⪯ g^2 ⪯ g^3. Also suppose that h^1 and h^2 are haplotypes able to resolve g^1 and g^2, respectively. Then h^1 and h^2 cannot mate to form g^3. This follows since the ordering forces a SNP location i where g^1_i and g^2_i must both be 0 or 2, while g^3_i = 1. (For example, this occurs in the fourth SNP in the three-element chain in Figure 1.) So h^1_i = h^2_i, and therefore h^1_i + h^2_i ≠ 1 = g^3_i.
Fig. 1. A collection of genotypes divided into three g-lattices by the ordering ⪯
So to construct a minimum solution when G is a chain, we start by choosing haplotypes that mate to form the minimal element. We then must include at least one new haplotype for each additional genotype in the chain. The fact that one is enough is a consequence of Lemma 2 below. The proof in [15] considers only the case in which the minimal element has a heterozygous SNP, so that two distinct haplotypes are needed to form the minimal element and a minimum solution contains m + 1 haplotypes. In the case with no heterozygous SNPs, the minimal element is formed by adding a single haplotype to itself, reducing the count to m.

Unfortunately, this method of proof does not extend to structures other than chains. For example, consider the right-most g-lattice in Figure 1. Four haplotypes suffice to resolve the lowest three genotypes, as we may write

(1, 0, 0, 1, 1, 1) + (0, 0, 0, 1, 1, 1) = (1, 0, 0, 2, 2, 2)
(1, 0, 0, 1, 1, 1) + (0, 0, 0, 1, 0, 1) = (1, 0, 0, 2, 1, 2)
(0, 0, 0, 1, 1, 1) + (1, 0, 0, 0, 1, 1) = (1, 0, 0, 1, 2, 2).

This is, in fact, a minimum solution for this set of genotypes, as the first SNP in each is heterozygous. No set of three haplotypes can resolve a set of three genotypes that share a heterozygous SNP, since at that location two of the haplotypes would necessarily take the same value, while the third would take the other. If the two haplotypes with the same value were to mate, the resulting genotype would have a homozygous SNP in the location in question. Now, two of the same four haplotypes above can mate to form the next higher genotype in the g-lattice:

(0, 0, 0, 1, 0, 1) + (1, 0, 0, 0, 1, 1) = (1, 0, 0, 1, 1, 2).
Therefore, this set of four haplotypes is also a minimum solution for the lowest four genotypes in the g-lattice. So unlike when the g-lattice is a chain, adding a new genotype does not necessarily increase the size of a minimum solution, as two haplotypes, each able to resolve a different genotype, can mate to form a third genotype that is above both of the original two. This complicates the process of finding minimum solutions. However, the construction of the chain solution does lead to the following general upper bound.

Theorem 2. Let G be a collection of m genotypes and suppose that q minimal elements of G have at least one heterozygous SNP. Then no more than m + q haplotypes are needed to resolve G.

The proof uses the following two results.

Lemma 1. Let g^1 and g^2 be genotypes with g^1 ⪯ g^2. Then any haplotype that can resolve g^1 can also resolve g^2.

Proof. Suppose g^1 = h^0 + h^1 and g^1 ⪯ g^2. We construct a haplotype h^2 such that g^2 = h^0 + h^2. In particular, define

h^2_i = h^1_i if g^1_i = g^2_i, and h^2_i = 1 − h^1_i if g^1_i ≠ g^2_i.

Then, if g^1_i = g^2_i we have g^2_i = h^0_i + h^1_i = h^0_i + h^2_i. If g^1_i ≠ g^2_i, then g^2_i = 1 and g^1_i is 0 or 2 since g^1 ⪯ g^2. So h^0_i = h^1_i and g^2_i = 1 = h^0_i + (1 − h^1_i) = h^0_i + h^2_i. Therefore g^2 = h^0 + h^2.

Lemma 2. Suppose G = {g^1, g^2, . . . , g^k} is a collection of genotypes such that g^1 ⪯ g^i for 2 ≤ i ≤ k. Then, if g^1 has a heterozygous SNP, no more than k + 1 haplotypes are needed to resolve G. If g^1 has no heterozygous SNPs, then no more than k haplotypes are needed to resolve G.

Proof. Suppose g^1 = h^0 + h^1. Then by Lemma 1, there exist h^i, 2 ≤ i ≤ k, such that g^i = h^0 + h^i. So the collection H = {h^0, h^1, . . . , h^k} resolves G. If g^1 has a heterozygous SNP, then h^0 ≠ h^1, so |H| = k + 1. Otherwise h^0 = h^1 and |H| = k.

Proof of Theorem 2. Let G be a collection of m genotypes. Suppose {g^1, g^2, . . . , g^l} is the set of minimal elements of G and suppose without loss of generality that {g^1, g^2, . . . , g^q} is the set of minimal elements with at least one heterozygous SNP. Choose a partition {G_1, G_2, . . . , G_l} of G in such a way that g^i is the least element of G_i for all i, that is, g^i ⪯ g for all g in G_i. By Lemma 2, G_i can be resolved with |G_i| + 1 haplotypes for 1 ≤ i ≤ q and with |G_i| haplotypes for q + 1 ≤ i ≤ l. So G can be resolved with no more than

Σ_{i=1}^{q} (|G_i| + 1) + Σ_{i=q+1}^{l} |G_i| = m + q

haplotypes.
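The proof of Lemma 1 is constructive, and the construction is simple enough to state as code; a minimal sketch with names of our own choosing:

```python
def lift_resolver(g1, h1, g2):
    # Lemma 1: given g1 = h0 + h1 and g1 ⪯ g2, build h2 with g2 = h0 + h2.
    # Where the genotypes agree, copy h1; where they differ, g2 is
    # heterozygous while g1 is homozygous, so flipping h1's bit makes
    # the corresponding sum equal 1.
    return tuple(b if a == c else 1 - b for a, b, c in zip(g1, h1, g2))

# (2,0,0,0) = h0 + h1 with h0 = h1 = (1,0,0,0), and (2,0,0,0) ⪯ (2,1,0,0):
h2 = lift_resolver((2, 0, 0, 0), (1, 0, 0, 0), (2, 1, 0, 0))
assert tuple(a + b for a, b in zip((1, 0, 0, 0), h2)) == (2, 1, 0, 0)
```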
We refer to the approximate solution implied by Theorem 2 as the gl-solution to the pure parsimony problem. The minimal elements of G under ⪯ can be calculated as follows. Select a genotype g and compare it to the remaining m − 1 genotypes componentwise. If we find a different genotype g′ such that g′ ⪯ g, then g is not a minimal element. Otherwise, g is a minimal element. So, identifying the minimal elements requires no more than m²n comparisons, which establishes the following result.

Theorem 3. If G is a collection of m genotypes, each consisting of n SNPs, the complexity of calculating the minimal elements is at worst O(m²n).

Theorems 2 and 3 establish a polynomial-time upper bound on a solution to the pure parsimony problem. Theorem 1 says that this bound is optimal when G is a chain under ⪯. But how far from optimal can the bound be in general? The smallest possible solution to the pure parsimony problem on m genotypes is min{k : k(k − 1)/2 ≥ m}. As an example, consider the set

G = {(2, 2, 1, 1), (2, 1, 2, 1), (2, 1, 1, 2), (1, 2, 2, 1), (1, 2, 1, 2), (1, 1, 2, 2)},

which is optimally and uniquely resolved by

H = {(0, 1, 1, 1), (1, 0, 1, 1), (1, 1, 0, 1), (1, 1, 1, 0)}.

Notice that the elements of G are pairwise incomparable, and hence form six single-element chains. Each element contains a heterozygous SNP, so the gl-solution has size 2 · 6 = 12, the largest possible solution for a set of six genotypes. Extending this example, we see that for any integer q, there is a G of size q(q − 1)/2 whose unique minimum solution has size q, but whose gl-solution has size 2 · q(q − 1)/2 = q(q − 1). Since the minimum solution is as small as possible, the gl-solution is capable of achieving the worst possible error. However, this is a contrived example, and we will analyze how this bound performs on real biological data in Section 5.
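The scan in Theorem 3 is direct to implement. The following Python sketch (ours, purely for illustration) encodes the order ⪯ and the computation of minimal elements:

```python
def leq_site(a, b):
    # The site order ⪯: 0 ⪯ 1, 2 ⪯ 1, and a ⪯ a; 0 and 2 are incomparable.
    return a == b or b == 1

def leq(g1, g2):
    # Componentwise extension of ⪯ to genotypes in {0,1,2}^n.
    return all(leq_site(a, b) for a, b in zip(g1, g2))

def minimal_elements(G):
    # A genotype is minimal if no other genotype lies below it; for m
    # genotypes of n SNPs this is O(m^2 n), matching Theorem 3.
    return [g for g in G if not any(h != g and leq(h, g) for h in G)]
```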
4 Developing a Heuristic Based on g-Lattice Decompositions
In this section, we leverage the g-lattice decomposition of G to find optimal solutions to a special case and to develop an algorithm for approximating solutions by decomposing the general IP into smaller IPs.

Theorem 4. Let G be a collection of genotypes such that any two maximal elements from different g-lattices of G are incompatible. Then the size of a minimum solution to G is the sum of the sizes of minimum solutions to the g-lattices of G.

Proof. Suppose g^1 and g^2 are maximal elements of two disjoint g-lattices of G. If g^1 and g^2 are incompatible, there exists a SNP location i where g^1_i and g^2_i are both homozygous and are not equal. Without loss of generality, suppose g^1_i = 0 and g^2_i = 2. Then any genotype g with g ⪯ g^1 must have a 0 at SNP i. Similarly, any genotype g′ with g′ ⪯ g^2 must have a 2 at SNP i. So g and g′
are incompatible. As an example, the second and third g-lattices in Figure 1 have such an incompatibility in the third SNP. So if the maximal elements of different g-lattices are pairwise incompatible, so are all pairs of elements from different g-lattices. Therefore the sets of haplotypes resolving the g-lattices must be disjoint.

An immediate corollary gives an optimal solution when G decomposes into incompatible chains.

Corollary 1. Suppose G is a collection of m genotypes that decomposes into chains, q of which have minimal elements with at least one heterozygous SNP. Then if the maximal elements of the chains are pairwise incompatible, a minimum solution to G contains m + q haplotypes.

The proof follows directly from Theorems 1 and 4, and combining this with Theorem 3, we establish the following.

Corollary 2. If a collection of genotypes decomposes into chains with pairwise incompatible maximal elements, then the complexity of calculating a solution to the pure parsimony problem is no worse than O(m²n).

To design an algorithm that heuristically solves the problem in the event that the maximal elements of distinct g-lattices are not pairwise incompatible, we follow the approach suggested by Theorem 4; that is, we find solutions to each g-lattice individually. To implement the decomposition into g-lattices, we first determine the minimal elements of G. For each minimal element, we find the set of elements greater than that element. We then merge any of these sets that intersect, giving us the g-lattices. We also record the maximal elements to determine whether the optimality condition in Theorem 4 is met. Our approach reduces the gl-solution by solving what are hopefully smaller IPs, but it is not in general polynomial, since each IP is itself a pure parsimony problem. To improve the run time, we also bound the size of the IPs by setting a threshold on lattice size. For g-lattices larger than the threshold, we use the gl-solution instead of solving the IP. We refer to our algorithm's estimate of the optimal solution as the mgl-solution. Numerical results based on this algorithm follow in the next section.
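The decomposition just described can be sketched by reusing leq and minimal_elements from the previous fragment; again this is our own illustration (genotypes are assumed to be hashable tuples), not the authors' Octave code:

```python
def g_lattices(G):
    # For each minimal element, take its up-set; up-sets that intersect
    # lie in one connected component of the Hasse diagram, so merge them.
    upsets = [set(g for g in G if leq(m, g)) for m in minimal_elements(G)]
    lattices = []
    for s in upsets:
        merged, rest = set(s), []
        for lat in lattices:
            if lat & merged:
                merged |= lat      # absorb every overlapping class
            else:
                rest.append(lat)
        lattices = rest + [merged]
    return lattices

def incompatible(g1, g2):
    # Theorem 4's condition: some SNP is homozygous in both genotypes
    # with opposite values (a 0 against a 2), so no haplotype resolves both.
    return any({a, b} == {0, 2} for a, b in zip(g1, g2))
```

Checking incompatible on the maximal elements of each pair of lattices then decides whether the optimality condition of Theorem 4 holds.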
5 Numerical Results
Early numerical work on the pure parsimony problem was often accomplished with simulated data, which was somewhat precocious in light of the HapMap project [11], which catalogs the genotypes of several individuals across numerous populations (see www.hapmap.org). Most recent computational work is based on these growing databases [5,6], and all of our numerical work was done on chromosome 10 in the 2008-03 databases over the CHB (Han Chinese in Beijing, China) and YRI (Yoruban in Ibadan, Nigeria) populations. The CHB population has 45 individuals and 211,862 SNPs, and the YRI population has 90 individuals and 204,146 SNPs.
Our computing environment was a laptop running Linux, with 3GiB of memory and dual 2.6 GHz processors. Freeware was used throughout, which readily makes the computations reproducible. The algorithm that partitions the genotypes into g-lattices was written in Octave. We used the IP model in [4] and adapted Gusfield's Perl script minthap.pl (posted at wwwcsif.cs.ucdavis.edu/~gusfield/) so that it exported models native to lp_solve (see lpsolve.sourceforge.net/5.5/), which was used with default settings to solve all IPs. A time limitation of 900 seconds was imposed on all IP solves. All code can be downloaded at holderfamily.dot5hosting.com/aholder/research.

Our experimental design was based on a series of 50 solves with a varying number of consecutive SNPs. The data is not perfect, and several SNPs have an undetermined value for some individual. Undetermined SNPs were ignored and not included in the count of consecutive SNPs. For example, for both databases we solved 50 problems with 10 consecutive SNPs, 50 problems with 20 consecutive SNPs, ..., and 50 problems with 100 consecutive SNPs. The first step of each solve was to locate the next collection of consecutive SNPs and identify the unique genotypes (several individuals could share a common genotype). All calculations were done on unique genotypes. Problems with 10, 20 and 30 SNPs were solved in three ways:

1. to optimality, unless the IP time limitation was invoked,
2. by the mgl-solution, with the threshold for using the gl-solution instead of solving the IP set at 25 genotypes, and
3. by the gl-solution.

Tables 1 and 2 detail the solution characteristics for the 10, 20, 30 and 40 SNP cases. IP problems with more than 30 SNPs routinely grew beyond our computational abilities, which is likely because the IP models grow exponentially in the number of heterozygous SNPs. An important direction for future work is to replace the IP model with one of the polynomially sized IP formulations. For the 40 SNP cases we failed to compute the minimum solutions but were able to calculate the mgl-solutions. We forwent IP solutions in any form if there were more than 50 SNPs. However, the gl-solutions were calculated in a few moments for all cases. See Table 3 for solution information.

An observation about the cases with a larger number of SNPs is that the gl-solution tends toward the upper bound of 2m. This is not surprising since the probability of pairwise incompatibility grows as the number of SNPs increases. Although none of these instances were guaranteed to be optimal, we did calculate the gl-solution for the CHB data set with all 211,862 SNPs and with all undetermined SNPs interpreted as heterozygous. This ensures that the maximal elements of each g-lattice have the fewest incompatibilities with the maximal elements from other g-lattices. In this case, the problem did decompose into 45 incompatible single-element chains, which proves that the pure parsimony solution is 2m = 90 haplotypes no matter how the undetermined SNPs are decided. Again, this is not a surprising result, but it does support the observation that the gl-solution should tend to 2m as the number of SNPs increases.
Table 1. Average solution information for the CHB database for the cases of 10, 20, 30 and 40 SNPs. Time is in seconds and only records the IP solution time. The column labeled "opt" indicates the number of mgl-solutions out of the 50 that were guaranteed to be optimal (none of the gl-solutions were optimal). The average solution of the pure parsimony problems is in the column labeled "min."

n  | m     | gl-sol | mgl-sol | opt | time   | min   | time
10 | 9.42  | 10.96  | 7.08    | 28  | 0.01   | 5.74  | 0.01
20 | 14.32 | 18.36  | 13.08   | 12  | 1.61   | 9.14  | 20.83
30 | 19.10 | 25.68  | 20.26   | 1   | 117.36 | 10.48 | 161.97
40 | 25.18 | 36.32  | 31.50   | 1   | 395.96 | -     | -
Table 2. Average solution information for the YRI database for the cases of 10, 20, 30 and 40 SNPs. Time is in seconds and only records the IP solution time. The column labeled "opt" indicates the number of mgl-solutions out of the 50 that were guaranteed to be optimal (none of the gl-solutions were optimal). The average solution of the pure parsimony problems is in the column labeled "min." The * (**) indicates that 2 (23) problems were unable to solve within the time restriction; these problems are not included in the average.

n  | m     | gl-sol | mgl-sol | opt | time   | min     | time
10 | 23.68 | 26.68  | 20.62   | 20  | 0.01   | 10.62   | 0.02
20 | 45.42 | 59.84  | 50.74   | 0   | 33.84  | 22.81*  | 45.70
30 | 59.60 | 87.42  | 79.82   | 0   | 305.87 | 32.85** | 513.93
40 | 69.78 | 108.02 | 100.20  | 0   | 983.77 | -       | -
Table 3. Average solution information for the CHB and YRI databases for the cases of 50 through 100 SNPs

SNPs | CHB m | CHB gl-sol | YRI m | YRI gl-sol
50   | 28.40 | 41.66      | 73.54 | 119.92
60   | 30.02 | 45.00      | 79.98 | 136.50
70   | 32.78 | 50.66      | 83.16 | 146.92
80   | 34.02 | 53.58      | 84.86 | 153.24
90   | 36.14 | 57.90      | 85.62 | 157.74
100  | 37.18 | 60.84      | 86.38 | 161.00

6 Conclusions
Although the pure parsimony problem is generally APX-Hard, we have identified a sub-class of problems solvable in polynomial time. The algorithm used to compute this solution gives a polynomial bound on the general problem, and the mathematical insights support a reduction of this bound by decomposing the problem into disjoint g-lattices. While the gl-solution was not found to be optimal in any of our test cases, the mgl-solution was in a number of cases. So on real data, the problem's decomposition makes sense in some cases.
There are many avenues to consider beyond this work. First, the IP formulation should be changed to one whose size grows polynomially. The promising results in [6] show that this could lead to much improved solution times. Second, the gl- and mgl-solutions may be useful beyond the goal of pure parsimony. In particular, there may be biological insights into the g-lattice structure that support its use. Third, the g-lattice partition might be useful in guiding a branch-and-bound/price procedure, which could improve solution time. Fourth, we suspect that the structure of the g-lattices indicates whether or not a problem is computationally difficult. Fifth, wider-scale numerical work should be conducted to assess the appropriateness of these techniques over a spectrum of populations.

Acknowledgments. The authors thank David Rader for his support and insightful conversations, and the referees for many helpful suggestions.
References

1. Clark, A.G.: Inference of haplotypes from PCR-amplified samples of diploid populations. Molecular Biology and Evolution 7(2), 111–122 (1990)
2. Gusfield, D.: Inference of haplotypes from samples of diploid populations: Complexity and algorithms. Journal of Computational Biology 8(3), 305–324 (2001)
3. Greenberg, H., Hart, W.E., Lancia, G.: Opportunities for combinatorial optimization in computational biology. INFORMS Journal on Computing 16, 221–231 (2004)
4. Gusfield, D.: Haplotype inference by pure parsimony. In: Baeza-Yates, R., Chávez, E., Crochemore, M. (eds.) CPM 2003. LNCS, vol. 2676, pp. 144–155. Springer, Heidelberg (2003)
5. Brown, D.G., Harrower, I.M.: Integer programming approaches to haplotype inference by pure parsimony. IEEE/ACM Transactions on Computational Biology and Bioinformatics 3, 141–154 (2006)
6. Catanzaro, D., Godi, A., Labbé, M.: A class representative model for pure parsimony haplotyping. INFORMS Journal on Computing (2008) (to appear)
7. Halldórsson, B.V., Bafna, V., Edwards, N., Lippert, R., Yooseph, S., Istrail, S.: A survey of computational methods for determining haplotypes. In: Istrail, S., Waterman, M.S., Clark, A. (eds.) DIMACS/RECOMB Satellite Workshop 2002. LNCS (LNBI), vol. 2983, pp. 26–47. Springer, Heidelberg (2004)
8. Lancia, G., Pinotti, M., Rizzi, R.: Haplotyping populations by pure parsimony: Complexity of exact and approximation algorithms. INFORMS Journal on Computing 16(4), 348–359 (2004)
9. Lancia, G., Rizzi, R.: A polynomial case of the parsimony haplotyping problem. Operations Research Letters 34, 289–295 (2006)
10. Gusfield, D., Orzack, S.H.: Haplotype inference. In: Aluru, S. (ed.) Handbook of Computational Molecular Biology. Chapman and Hall/CRC, Boca Raton (2006)
11. The International HapMap Consortium: Integrating ethics and science in the International HapMap Project. Nature Reviews Genetics 5, 467–475 (2004), www.hapmap.org
12. Gusfield, D.: Haplotyping as perfect phylogeny: Conceptual framework and efficient solutions. In: Proceedings of RECOMB 2002: The Sixth Annual International Conference on Computational Biology, pp. 166–175 (2002)
13. Chung, R.H., Gusfield, D.: Perfect phylogeny haplotyper: Haplotype inferral using a tree model. Bioinformatics 19(6), 780–781 (2003)
14. Ding, Z., Filkov, V., Gusfield, D.: A linear-time algorithm for perfect phylogeny haplotyping. Journal of Computational Biology 13, 522–553 (2006)
15. Blain, P., Davis, C., Holder, A., Silva, J., Vinzant, C.: Diversity graphs. In: Butenko, S., Chaovalitwongse, W., Pardalos, P. (eds.) Clustering Challenges in Biological Networks. World Scientific, Singapore (2008)
Exact Computation of Coalescent Likelihood under the Infinite Sites Model

Yufeng Wu

Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, U.S.A.
[email protected]
Abstract. Coalescent likelihood is the probability of observing the given population sequences under the coalescent model. Computation of coalescent likelihood under the infinite sites model is a classic problem in coalescent theory. Existing methods are based on either importance sampling or Markov chain Monte Carlo. In this paper, we develop a simple method that can compute the exact coalescent likelihood for many datasets of moderate size, including a real biological dataset whose likelihood was previously thought to be difficult to compute exactly. Simulations demonstrate that the practical range of exact coalescent likelihood computation is significantly larger than what was previously believed.
1 Introduction
Originally developed by Kingman [11], coalescent theory has been a major research subject in population genetics. Coalescent theory provides stochastic genealogical models and can be applied to answer many biological questions. Coalescent theory has become even more important now because it is one of the major analytical tools for population variation data analysis. Such data is currently being generated rapidly by high-throughput genotyping and resequencing technologies.

There are different kinds of genetic data that coalescent theory can be used to analyze. An important type of genetic marker is the single nucleotide polymorphism (SNP). SNP data is the focus of this paper, although the method presented herein can be applied to other types of genetic data. A SNP is a single nucleotide site where exactly two (of four) different nucleotides occur in a large percentage of the population. Thus, we simply use 0/1 to represent the nucleotide at a SNP site. A haplotype (also called a SNP sequence in this paper) is a binary vector of m SNPs (listed as s1, . . . , sm from left to right). The set of haplotypes sampled from a population is denoted by D, where D has n haplotypes (rows) and m sites (columns). Throughout the paper, we assume at most one mutation occurred at any SNP site during the evolution of the haplotypes, which is supported by the standard "infinite sites model" in population genetics [19]. This is particularly justified for populations that originated not very long
ago, where the time-scale of interest is short enough that two mutations at any single site are unlikely.

Coalescent theory has been shown to be very useful in simulating population sequences (e.g. [10]). However, applying coalescent theory to inference problems can be challenging. Answering inference problems requires the development of efficient and accurate computational methods based on coalescent theory. Unfortunately, many problems in coalescent theory are computationally difficult. This article focuses on one such problem: compute the likelihood P(D) of population haplotypes D for a given mutation rate. Assuming that recombination is absent, the only parameter of the coalescent likelihood is the mutation rate. The commonly used scaled population mutation rate is θ = 4Nμ, where N is the effective population size and μ is the mutation rate per haplotype per generation. We assume the entire genomic region under study has the same scaled mutation rate θ.

Under the coalescent model, P(D) can be viewed as a summation of probability over all possible coalescent genealogies. Intuitively, there are many possible gene genealogies that can generate D. Some genealogies are more likely and others are less likely. P(D) is the sum of the probabilities of all the genealogies that derive D. For a given coalescent genealogy (including both topology and timing), it is straightforward to compute P(D) for a given mutation rate. See [17] for some concrete examples. The main difficulty of computing P(D) is that the number of possible genealogies that derive D is often very large. Thus, computing P(D) can be challenging under many genetic models.

Computing P(D) under the coalescent model has been actively studied. Here we assume that only site mutations are considered. That is, other genetic processes such as recombination, migration and selection are neglected. A milestone in population genetics is the discovery by Ewens [3] that P(D) has a closed-form formula under the infinite alleles model. No closed-form formula for P(D) is known for the infinite sites model. When there is no recombination, a perfect phylogeny can be constructed using the efficient algorithm by Gusfield [7], assuming the infinite sites model of mutation. The inferred perfect phylogeny is useful because every coalescent genealogy must be consistent with the phylogeny. Moreover, the perfect phylogeny is unique up to some flexibility in arranging the site mutations along a single tree branch. However, the perfect phylogeny misses important aspects of the genealogy. First, the perfect phylogeny may contain some internal nodes with more than two descendants. Second, and more importantly, the timing information is missing from the perfect phylogeny. The timing information is critical in computing the coalescent likelihood. Thus, although the perfect phylogeny is unique, the number of possible genealogies can still be very large. See Figure 1 for an illustration.

In a seminal paper, Griffiths and Tavaré [4] presented an importance sampling method to compute the coalescent likelihood under the infinite sites model. In 2000, Stephens and Donnelly [15] made significant progress in the importance sampling approach, which leads to a significant reduction of variance in the likelihood computation. Alternatively, Kuhner, et al. [12] applied Markov chain Monte
(a) A genealogy. (b) The perfect phylogeny.
Fig. 1. An example of a coalescent genealogy and the inferred perfect phylogeny for seven haplotypes and five sites. The filled dots represent the mutation events (and the numbers to the right are the sites). Figure 1(a) is an example of a coalescent genealogy. Figure 1(b) is the perfect phylogeny inferred from the input haplotypes. The numbers below the tips are the multiplicities of the lineages. A lineage can be represented by the list of site mutations up to the root, with an extra 0 appended to the end. For example, the middle lineage can be represented as [3 2 0]. Note that there are multiple consistent genealogies for this phylogeny.
Carlo to compute the coalescent likelihood. New methods are still being developed for this likelihood computation problem. A recent paper [9] proposed a new importance sampling method to compute P(D) under the infinite sites model. One should note that all these methods are stochastic and none is guaranteed to compute the exact P(D).

It is well known that P(D) can be computed exactly by solving a set of recursions, which were developed by Ethier, Griffiths and Tavaré in several papers between 1987 and 1995 [2,4]. This recursion (which we call the Ethier-Griffiths-Tavaré recursion or simply the EGT recursion) is central to our method. In fact, Griffiths developed a program called ptree before the release of the program Genetree [1,4]. Program ptree can solve the EGT recursion exactly. However, program ptree was known to work for relatively small data (say with 15-20 sequences), while importance sampling methods can handle much larger data. This may be one reason that program ptree has largely been replaced by program Genetree. The common belief is that solving the EGT recursion exactly is usually hard. For example, it was stated in [8] that it may not be practical to find an exact solution to the EGT recursion when the sum of the number of sites and the number of sequences exceeds 30. The computational difficulty of solving the EGT recursion exactly is often listed as the motivation to develop stochastic methods (see, e.g. [9]).

In this paper, we present an exact method for computing the coalescent likelihood P(D) for a given set of population haplotypes D under the infinite sites model of mutations. Only mutation and coalescent events are considered. Other genetic processes such as recombination are not considered. Unlike
existing statistical methods [4,12,15,9], our method is exact and fully deterministic. Our method is based on the well-known EGT recursion, which is also the foundation of the importance sampling method implemented in the program Genetree [1,4]. Our contribution is to provide a time- and memory-efficient implementation that solves the EGT recursion exactly. Although the practical range of our method is still limited, we show that our method can be used to analyze relatively large data, including one dataset that was originally analyzed by Griffiths and Tavaré [5] using importance sampling. The likelihood of this data was once thought to be difficult to compute exactly [14]. We show in this paper that computing the exact coalescent likelihood is still challenging but feasible for a wide range of data D. We show through extensive simulations that the practical range of our method is significantly larger than what was previously believed. We also give a brief performance comparison of existing importance sampling methods using the exact method.
2 Method
We now describe our method for the exact likelihood computation under the infinite sites model. The basic idea is very simple: we solve the Ethier-Griffiths-Tavaré (EGT) recursion exactly using a dynamic programming approach. First, we briefly review some basic concepts of coalescent theory that are important to the current work. Then, we present our method for solving the EGT recursion exactly.

2.1 The Ethier-Griffiths-Tavaré Recursion
For ease of exposition, we assume that the ancestral state of the genealogy is known to be all zero (i.e. the genealogy is rooted and the root is all-zero). It is easy to extend to the unrooted case (as explained in Section 3), and our implementation works with both rooted and unrooted cases. At any given time, going backwards in time, there are two types of possible events in the genealogy:

1. Coalescent. Two identical sequences coalesce into a single sequence.
2. Mutation. A sequence changes the state at a site (which has not mutated before) from the derived state to the ancestral state.

An important concept in the EGT recursion is the ancestral configuration (AC). An AC is essentially a multiset of sequences that exist in the genealogy at a particular point in time. At the present time, the AC is composed of all the input sequences in D. As we trace backwards in time, we may observe different ACs due to the coalescent or mutation events. Following the notation in Wakeley [17], we use the pair (X, v) to represent an AC. Here, X is a set of haplotypes and v is the multiplicity (i.e. haplotype count) vector for the haplotypes in X. Each haplotype in X is written as an ordered list of sites that appear on the path from the haplotype to the root of the genealogy. At the end of each haplotype in X, we append a special site 0. Site 0 can be thought of as labeling a dummy edge out
of the root. See Figure 1 for an illustration. We let p(X, v) be the likelihood of the AC for a given θ. At the root of the genealogy, we have X = [0] and v = [1], and p([0], [1]) = 1.0. Moreover, we let Sk be an operator that deletes the first element of the kth haplotype of X, and we write Sk X for the new haplotype set after Sk is applied to X. This operator is used to represent the effect of mutations on X. We let Rk be an operator that deletes the kth item from a set or vector; for example, Rk X is the new haplotype set after the kth haplotype is removed from X. Finally, we let ek be the vector whose kth entry is 1 and all other entries are 0.

We are now ready to give the EGT recursion [2]. This recursion is defined on a finite number of ACs and makes the exact computation of the coalescent likelihood feasible. The recursion is based on three operators on the ACs, which correspond to the coalescent event and the two types of mutation events in the genealogy. We say a haplotype is mutatable in (X, v) if its haplotype count is 1 and its first site appears in no other haplotype of X. We write vk for the multiplicity of the kth haplotype in X. We let the set A contain the indices of mutatable haplotypes which remain distinct after deleting their first site. We let the set B contain the indices of all the mutatable haplotypes. The set Ck contains the indices of haplotypes that match the new haplotype Xk after Xk is mutated at its first site. We now have the EGT recursion (see [16,8,17] for more information):
\[
p(X, v) = \frac{n-1}{\theta+n-1} \sum_{k : v_k \geq 2} \frac{v_k(v_k-1)}{n(n-1)} \, p(X, v - e_k)
+ \frac{\theta}{\theta+n-1} \sum_{k \in A} \frac{1}{n} \, p(S_k X, v)
+ \frac{\theta}{\theta+n-1} \sum_{k \in B} \sum_{j \in C_k} \frac{1}{n} \, p(R_k X, R_k(v + e_j)).
\]
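Each backward move therefore multiplies the probability of the current AC by a simple coefficient. A minimal sketch of these weights (our own naming, not the paper's implementation):

```python
def coalesce_weight(theta, n, vk):
    # First term: merge two of the vk identical copies of haplotype k.
    return (n - 1) / (theta + n - 1) * vk * (vk - 1) / (n * (n - 1))

def mutate_weight(theta, n):
    # Second and third terms: remove the first site of a mutatable
    # haplotype; both share the coefficient (theta/(theta+n-1)) * (1/n).
    return theta / (theta + n - 1) / n
```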
2.2 Solving the Ethier-Griffiths-Tavaré Recursion Efficiently
The main difficulty in calculating the exact P(D) is that the number of ancestral configurations can be very large for data of even moderate size [14]. Therefore, any exact method based on the EGT recursion will eventually become impractical as the data size grows, unless there are some clever ways of reducing the size of the recursion (e.g. merging ACs to get a much smaller recursion). However, as shown in this paper, this infeasibility issue may not be as bad as previously believed: a well-designed and well-implemented method can solve the exact EGT recursion on a modern desktop computer for simulated or real biological data that were previously thought to be difficult.

The key idea of our method is to solve the EGT recursion using dynamic programming. That is, we look at the genealogical history forward in time. This is different from most coalescent computation approaches (e.g. [4,15]), which look backwards in time. As we show below, when solving the EGT recursion under the infinite sites model, looking forward in time is more efficient in both running time and memory.
Recall that the genealogy is tightly constrained by the perfect phylogeny constructed from the input sequences. The perfect phylogeny actually provides all the information we need about what genealogical events (forward in time) are possible for a given AC. Thus we base our computation on the perfect phylogeny, and proceed in stages. For n input haplotypes and m sites, there are n + m stages: 0, 1, . . . , n + m − 1. We start our computation for stage 0 from the root of the phylogeny. At stage 0, there is only a single AC0 = ([0], [1]). An AC is said to be in stage i if the sum of the number of coalescent events and mutation events from the AC backwards to AC0 is equal to i. It is straightforward to find which stage a given AC is in: the stage number is equal to the number of haplotypes plus the number of segregating sites minus one. Note that grouping the ACs by the total number of events is natural and is known previously (see e.g. [8]). Such grouping helps to develop a practical and efficient method. At stage i, we maintain a list of all the ACs at stage i. For each ACi,k = (X, v) in this list, we keep the following information:

– p(X, v): the probability of ACi,k.
– A list of genealogical events (called active events) that can occur in order to move to stage i + 1 from this AC.

The set of active events can be thought of as a snapshot of the genealogical events specified by the perfect phylogeny. For every internal node of the phylogeny whose out-degree is k, there are k − 1 coalescent events associated with the node. For each mutation along a phylogeny branch, there is one mutation event associated with the mutation. We assign a unique label (say e1, e2, . . .) to each of these genealogical events. At the root, the AC has a single active event: the coalescent event e1. A new AC is created for the next stage when one of the active events in the current AC occurs. For an AC, it is straightforward to find the ACs that can be reached from it, together with their lists of active events. To see this, examine each active event in the current AC. The active events of a new AC after event e occurs are the union of the active events of the current AC and the set of newly feasible events En(e), minus e itself. If e is a coalescent event, En(e) contains the mutation events at the outgoing branches from the internal node in the phylogeny where the coalescent event occurs, and a coalescent event next to e at the same node in the phylogeny (if there are more coalescent events possible besides e). If e is a mutation event, En(e) contains a single event: either (a) the mutation event right below e if e is not the last mutation event along a branch in the phylogeny, or (b) the coalescent event at the internal node right below where e occurs if e is the last mutation event along a branch. Note that we can reduce the number of ACs by fixing the order of mutations along the same branch in the perfect phylogeny. This has been used in program Genetree [1,4].

It helps to provide an example of how the ACs are maintained in the computation. In Figure 1, consider AC1 = ([0], [2]), which is the single AC at stage 1. The set of active events of AC1 contains three mutation events, at sites 1, 2 and 4 respectively, and a coalescent event at the root of the phylogeny. Now suppose we create AC2,1 for stage 2 by performing the mutation event at site 2. In this new AC, the mutation event at site 3 becomes feasible. The active
events of AC2,1 are three mutation events, at sites 1, 3 and 4 respectively, and a coalescent event at the root of the phylogeny.

The computation proceeds in stages. Suppose we have generated the complete list of ancestral configurations ACi,k(X, v) at stage i. We then generate the list of ACi+1,k′(X′, v′) for stage i + 1 from the list of ACi,k(X, v). We also set p(X′, v′) to be the product of p(X, v) and the transition probability. The transition probability depends on the type of the event, which is given in the EGT recursion (see Section 2.1). When a generated AC is already in the list, we merge the two by simply adding the probability of the new AC to the one already in the list, and then discard the newly generated AC.

Since the number of ACs to consider is very large, our implementation is designed to scan through the ACs as quickly as possible. A more serious problem is memory, since the space needed to store such a large number of ACs can be very large. Looking forward in time allows us to process each AC just once. A main benefit of this dynamic programming approach is the saving of memory. When an AC has been processed and the new ACs for the next stage have been generated from it, we can discard this AC. This simple technique is similar to what has been previously used in memory-efficient pairwise sequence alignment. Moreover, we only keep the minimum amount of information in each AC. Note that such savings of time and space are not easily achievable for a method searching backwards in time. Finally, our method can handle the likelihood computation on a grid of θ values in a very simple way: we just keep a list of probability values for each AC, one for each different θ. This allows a much faster computation when the maximum likelihood estimate of θ is desired.
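The stage-by-stage computation can be summarized by the following schematic sketch (our abstraction of the C++ implementation; the function next_configs, which uses the perfect phylogeny to enumerate the active events of an AC and returns the successor ACs with their transition probabilities, is assumed rather than shown):

```python
def forward_likelihood(root_ac, num_stages, next_configs):
    # Dynamic programming over stages; num_stages is n + m - 1 for n
    # haplotypes and m sites.  ACs must be hashable so that the same
    # configuration reached along different histories is merged by
    # summing probabilities, as described above.
    stage = {root_ac: 1.0}
    for _ in range(num_stages):
        nxt = {}
        for ac, p in stage.items():
            for succ, trans in next_configs(ac):
                nxt[succ] = nxt.get(succ, 0.0) + p * trans
        stage = nxt        # the previous stage is discarded: memory savings
    # After the last stage, a single AC remains: the full input data D.
    [(_, likelihood)] = stage.items()
    return likelihood
```

To evaluate the likelihood on a grid of θ values, the scalar probability per AC is simply replaced by a vector with one entry per θ.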
3 Results
We implement the exact likelihood computation method in C++. To test the effectiveness of this method, we use it to compute the exact likelihood for both simulated and real biological data. We also use our method to obtain maximum likelihood estimates (MLEs) of θ. The experiments were performed on a 3192 MHz Intel Xeon workstation with 16 GB of memory.

3.1 Simulated Data
We first test our method on simulated data. We use Hudson's program MS to generate datasets with several different settings for the number of haplotypes and the scaled mutation rate θ. We simulate 20, 30, 40 or 50 haplotypes, and θ = 1, 3 or 5. For each setting, we generate 100 datasets. We assume the root haplotype is fixed to be the all-0 haplotype (which is what program MS uses). The simulation results are listed in Table 1. When θ is small, sometimes there are no segregating sites; we treat such cases as computable. The results in Table 1 indicate that exact likelihood computation is feasible for many datasets of moderate size. As expected, the running time and the number of
Table 1. Exact likelihood computation on simulated data. The average MLE of θ, running time (in seconds), and number of ACs are listed (for the datasets where MLE computation is feasible). Some datasets are too large to compute the exact likelihood, so we also report the percentage of data where exact computation is feasible.

#rows | θ | %res. | Ave. MLE | Ave. time | Ave. #ACs
20 | 1 | 100 | 1.12 | < 0.1 s | 906
20 | 3 | 100 | 2.94 | 0.7 s   | 10,039
20 | 5 | 100 | 5.04 | 7.7 s   | 64,566
30 | 1 | 100 | 1.01 | 0.2 s   | 3571
30 | 3 | 100 | 2.92 | 34.5 s  | 231,252
30 | 5 | 99  | 4.93 | 319 s   | 1,580,873
40 | 1 | 100 | 1.06 | 1.6 s   | 15,543
40 | 3 | 99  | 2.88 | 280 s   | 1,277,251
40 | 5 | 94  | 4.78 | 2886 s  | 11,229,671
50 | 1 | 100 | 1.01 | 30.3 s  | 147,789
50 | 3 | 95  | 2.85 | 2425 s  | 8,044,357
50 | 5 | 74  | 4.43 | 9816 s  | 26,395,143
ACs increase when the number of rows and θ increase. Exact computation starts to become more difficult when the number of haplotypes is over 50 and θ is over 5.0. Nevertheless, our results show that exact computation can be practical for a range of data that was previously deemed infeasible. For example, Hein, et al. [8] state that exact computation is feasible only for moderately sized data, where "moderately sized would be the number of segregating sites plus the number of haplotypes is less than thirty". Our experimental results suggest that significantly larger datasets allow exact computation of P(D). Also, Hobolth, et al. [9] tested their importance sampling method on simulated data with 50, 75 and 100 rows and θ = 1, 3 and 5. As shown in Table 1, many such datasets allow exact computation.

3.2 A Real Biological Dataset
Fig. 2. Perfect phylogeny for the mtDNA data of Ward, et al. [18]. The numbers at the tree tips are the multiplicities of the lineages.

We now demonstrate the capability of our method on a real biological dataset. This is a mitochondrial dataset, originally collected in [18] and analyzed by Griffiths and Tavaré in [5]. The original data contains 63 mtDNA sequences. To apply their importance sampling method based on the infinite sites model, Griffiths and Tavaré chose a subset of mtDNA sequences by removing sequences in which mutations seem to have occurred more than once at some sites. The remaining 55 mtDNA sequences (with 18 segregating sites) are compatible with the
infinite sites model. The perfect phylogeny constructed from this data is shown in Figure 2, drawn with Griffiths' program Genetree. Song, et al. [14] suggested that exact computation of P(D) for this data may be impractical. We now show that, although the running time is somewhat long and the amount of memory needed is large, the exact likelihood of this data can be computed with a moderate amount of computing resources. Since the ancestral sequence of this mtDNA data was not known, Griffiths and Tavaré treated it as a root-unknown problem. They simply compute the likelihood for each possible rooted phylogeny that is compatible with the unrooted phylogeny, and then sum the likelihoods over all the rooted phylogenies. We take the same approach here. See Table 2 for the likelihood and the running time for each of the possible rooted phylogenies at θ* = 4.8, which is the MLE of θ.

Table 2. Detailed analysis of the mtDNA data in [18]. For each rooted tree, the number of ACs, the likelihood for θ = 4.8, and the running time are listed. Also, if θ = 4.8 does not maximize the probability for the rooted tree, we give the θ that does; we use '-' if 4.8 does indeed maximize the probability. The root is listed as the set of sites whose values differ from the root in Figure 2 ("none" denotes the root of Figure 2 itself). Time is in hours (h) and minutes (m). The likelihood values are scaled by a factor of 10^20.
#AC Likelihood Time Best θ Root #AC Likelihood Time Best θ 218,486,135 8.99 21 h, 13 m 8 156,785,599 0.0254 16 h, 48 m 137,267,493 0.150 17 h, 7 m 4.7 7,15 68,633,748 0.0231 8 h, 12 m 4.6 217,678,497 4.75 25 h, 40 m 4.7 14,4 216,989,956 2.86 25 h, 35 m 4.7 216,301,428 0.799 25 h, 52 m 14,4,6,5 105,395,284 0.000321 12 h, 6 m 188,058,258 0.00392 22 h, 44 m 14,4,6,1,10 107,461,865 0.0000118 12 h, 37 m 218,359,460 1.353 25 h, 7 m 4.9 18,2 97,623,432 0.00210 11 h, 20 m 4.9 167,291,541 0.0702 21 h, 7 m 18,11,12 125,468,660 0.01380 16 h, 11 m 4.7 41,822,889 0.0000242 4 h, 34 m 4.7 18,16 212,086,076 0.0272 25 h, 16 m 4.9 165,530,601 0.000132 17 h, 15 m 4.9 3 168,748,161 0.255 19 h, 27 m 4.7 126,561,124 0.0429 15 h, 32 m 4.6
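To make the root-unknown computation concrete, a minimal sketch follows (ours, not code from the paper; the θ grid and function name are illustrative): per-root likelihood curves are summed at each θ, and the MLE is read off the combined curve.

```python
import numpy as np

def root_unknown_mle(rooted_likelihoods, thetas):
    """Combine per-root likelihood curves for a root-unknown dataset.

    rooted_likelihoods: array of shape (n_roots, n_thetas), where row r
    holds L_r(theta) for one rooting of the perfect phylogeny.
    thetas: the grid of theta values at which the curves were evaluated.
    Returns (total likelihood curve, MLE of theta).
    """
    total = np.sum(rooted_likelihoods, axis=0)  # sum over all rootings
    return total, thetas[np.argmax(total)]

# Toy example: three hypothetical roots evaluated on a theta grid.
thetas = np.arange(4.0, 6.0, 0.1)
curves = np.vstack([np.exp(-(thetas - m) ** 2) for m in (4.7, 4.8, 4.9)])
total, mle = root_unknown_mle(curves, thetas)
print(f"MLE of theta: {mle:.1f}")
```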
As shown in Table 2, both the running time and the likelihood value can vary considerably across root sequences. However, the MLE of θ obtained by picking just one of the roots does not differ much: all such estimates fall within [4.6, 4.9]. Also, the running time for each root is well correlated with the total number of ACs; thus, it may be useful to first compute the total number of ACs in order to see whether exact likelihood computation is feasible. There are 2.96 × 10^9 ACs with all the roots combined, and it takes about 360 hours to compute the full likelihood of the data. This indicates that exact likelihood computation under the infinite sites model is still challenging but feasible for moderately sized data. Note that the total number of ACs is slightly smaller than that in [14]. This is due to a different way of handling multiple mutations along a single tree branch: our method fixes the order of these mutations, while all possible orders are considered in [14].
The likelihood curve for the entire data (i.e., the sum of the likelihoods of all 19 roots) is shown in Figure 3. Figure 3 confirms that θ* = 4.8 is the MLE of θ, as found by Griffiths and Tavaré [5].

Identifying the ancestral lineage. We can validate one of the ancestral inference solutions in [5]. Griffiths and Tavaré infer the likely ancestral lineage by picking the rooted tree with the largest likelihood and choosing the corresponding root as the inferred ancestral lineage. They concluded that the all-zero lineage (i.e., the root in Figure 2) is the most likely ancestral lineage. From Table 2, we can draw the same conclusion, although there are a few other ancestral lineages (e.g., the lineage with exactly one 1 value, at site 14) that give smaller but still comparable likelihood values.

Importance sampling methods. Hobolth, et al. [9] compare the likelihood curves generated by three importance sampling methods: Griffiths and Tavaré [4] (GT), Stephens and Donnelly [15] (SD), and their own method (HUW). It appears that they only consider the single phylogeny shown in Figure 2, instead of the unrooted case. The plot of the three likelihood curves is given in Figure 4 of [9]. To give a rough comparison of the three likelihood curves with the exact likelihood curve in Figure 3, we simply compare the maximum likelihood values: the maximum likelihood values of GT, SD, and HUW are 7.85, 9.23, and 8.75 respectively, while the exact maximum likelihood is 8.99 (all values scaled by a factor of 10^20). It can be seen that both SD and HUW appear to be more accurate than GT. It is also shown in [9] that HUW has smaller variance than SD, and SD has smaller variance than GT. Note that more validation is needed to further compare these importance sampling methods.
Fig. 3. The MLE of θ for the mtDNA data in [18] is 4.8, which matches the result in [5]. The plotted values are the log-likelihoods for θ from 4.0 to 5.9, in increments of 0.1.
4 Discussion
In this paper, we show that the exact likelihood of haplotypes evolving under the infinite sites model can be computed for many datasets for which such computation was previously considered infeasible. Our method is exact and fully deterministic. Statistical methods, e.g., the importance sampling method implemented in Genetree, on the other hand, usually have (sometimes large) variance in their estimates. Some importance sampling methods may have smaller variance, but the accuracy of the estimated likelihood remains an issue. Nevertheless, exact computation becomes increasingly difficult as the data size grows. Currently, no known method can compute the exact likelihood for large data (with 100 haplotypes and θ = 10, for example). Statistical methods (e.g., [4, 12, 15, 9]) are much faster and can handle larger datasets than the exact method. Moreover, our results in Section 3 suggest that these faster statistical methods can be quite accurate. We also note that the program Genetree has more functionality, such as allowing input sequences from a subdivided population. So what is the use of the new exact method? We believe that there are two applications.

1. Exact likelihood computation for moderately sized data can be useful. Although slower than statistical methods, the exact likelihood can be computed for many datasets on a modern computer. We expect the type of computer on which the exact method can run for such data to be accessible to many users, and the situation will improve with the continuing rapid development of computing technology. Biological data is often noisy and may not be fully consistent with the model assumed by analysis methods; moreover, the model itself can be controversial. Stochastic methods, although powerful, can introduce another source of error. Exact computation eliminates this additional uncertainty introduced by the analysis method, which can ease concerns about the analysis and help the study of the underlying biological processes.

2. Exact methods can also help to compare and justify existing and new statistical methods. Comparing against the exact likelihood is natural and was used in [4] when developing new statistical methods. We believe that, with more powerful exact methods (such as the method developed here) as the reference, one can compare different methods and develop new methods by testing on larger simulated and real biological datasets. An initial comparison of several statistical methods with the exact method is given in Section 3; our exact method can be useful in a more thorough comparison of these methods. The lack of good exact methods often poses problems when one wants to compare different statistical methods. As an example, there is no known good exact method that can compute the exact coalescent likelihood with recombination unless the data is tiny [13]. This makes it difficult to compare the relative performance of different methods for the coalescent with recombination [6]. Thus, it is valuable to develop effective exact methods for the purpose of comparison and validation.
Funding and Acknowledgment. This work is supported by the National Science Foundation [IIS-0803440]. I am also supported by the Research Foundation of the University of Connecticut. I thank Yun S. Song for useful discussions.
References

1. Bahlo, M., Griffiths, R.C.: Inference from gene trees in a subdivided population. Theoretical Population Biology 57, 79–95 (2000)
2. Ethier, S.N., Griffiths, R.C.: The infinitely-many-sites model as a measure valued diffusion. Annals of Probability 15, 515–545 (1987)
3. Ewens, W.J.: The sampling theory of selectively neutral alleles. Theor. Popul. Biol. 3, 87–112 (1972)
4. Griffiths, R.C., Tavaré, S.: Simulating probability distributions in the coalescent. Theor. Popul. Biol. 46, 131–159 (1994)
5. Griffiths, R.C., Tavaré, S.: Ancestral inference in population genetics. Statistical Science 9, 307–319 (1994)
6. Griffiths, R.C., Jenkins, P.A., Song, Y.S.: Importance sampling and the two-locus model with subdivided population structure. Adv. Appl. Prob. 40, 473–500 (2008)
7. Gusfield, D.: Efficient algorithms for inferring evolutionary history. Networks 21, 19–28 (1991)
8. Hein, J., Schierup, M., Wiuf, C.: Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory. Oxford University Press, Oxford (2005)
9. Hobolth, A., Uyenoyama, M.K., Wiuf, C.: Importance sampling for the infinite sites model. Stat. Appl. Genet. Mol. Biol. 7, Article 32 (2008)
10. Hudson, R.: Generating samples under the Wright-Fisher neutral model of genetic variation. Bioinformatics 18(2), 337–338 (2002)
11. Kingman, J.F.C.: The coalescent. Stochast. Process. Appl. 13, 235–248 (1982)
12. Kuhner, M.K., Yamato, J., Felsenstein, J.: Estimating effective population size and mutation rate from sequence data using Metropolis-Hastings sampling. Genetics 140, 1421–1430 (1995)
13. Lyngso, R., Song, Y.S., Hein, J.: Accurate computation of likelihoods in the coalescent with recombination via parsimony. In: Vingron, M., Wong, L. (eds.) RECOMB 2008. LNCS (LNBI), vol. 4955, pp. 463–477. Springer, Heidelberg (2008)
14. Song, Y.S., Lyngsoe, R., Hein, J.: Counting all possible ancestral configurations of sample sequences in population genetics. IEEE/ACM Transactions on Computational Biology and Bioinformatics 3, 239–251 (2006)
15. Stephens, M., Donnelly, P.: Inference in molecular population genetics. J. R. Stat. Soc. B 62, 605–655 (2000)
16. Tavaré, S.: Ancestral inference in population genetics. In: Lectures on Probability Theory and Statistics. Springer, Heidelberg (2004)
17. Wakeley, J.: Coalescent Theory: An Introduction. Roberts and Company Publishers, Greenwood Village (2008)
18. Ward, R.H., Frazier, B.L., Dew, K., Paabo, S.: Extensive mitochondrial diversity within a single Amerindian tribe. Proc. Natl. Acad. Sci. USA 88, 8720–8724 (1991)
19. Watterson, G.A.: On the number of segregating sites in genetical models without recombination. Theoretical Population Biology 7, 256–276 (1975)
Imputation-Based Local Ancestry Inference in Admixed Populations

Bogdan Paşaniuc¹, Justin Kennedy², and Ion Măndoiu²

¹ International Computer Science Institute, Berkeley, CA
[email protected]
² CSE Department, University of Connecticut, Storrs, CT
{jlk02019,ion}@engr.uconn.edu
Abstract. Accurate inference of local ancestry from whole-genome genetic variation data is critical for understanding the history of admixed human populations and detecting SNPs associated with disease via admixture mapping. Although several existing methods achieve high accuracy when inferring local ancestry for individuals resulting from the admixture of genetically distant ancestral populations (e.g., African-Americans), ancestry inference in the case when ancestral populations are closely related remains challenging. Surprisingly, methods based on the analysis of allele frequencies at unlinked SNP loci currently outperform methods based on haplotype analysis, despite the latter methods seemingly receiving more detailed information about the genetic makeup of ancestral populations. In this paper we propose a novel method for imputation-based local ancestry inference that exploits ancestral haplotype information more effectively than previous haplotype-based methods. Our method uses the ancestral haplotypes to impute genotypes at all typed SNP loci (temporarily marking each SNP genotype as missing) under each possible local ancestry. We then assign to each locus the local ancestry that yields the highest imputation accuracy, as estimated within a neighborhood of the locus. Experiments on simulated data show that imputation-based ancestry assignment is competitive with the best existing methods in the case of distant ancestral populations, and yields a significant improvement for closely related ancestral populations. Further demonstrating the synergy between imputation and ancestry inference, we also give results showing that the accuracy of untyped SNP genotype imputation in admixed individuals improves significantly when using estimates of local ancestry. The open source C++ code of our method, released under the GNU General Public License, is available for download at http://dna.engr.uconn.edu/software/GEDI-ADMX/.
1 Introduction
Rapid advances in SNP genotyping technologies have enabled the collection of large amounts of population genotype data, accelerating the discovery of genes associated with common human diseases. Admixture mapping has recently
emerged as a powerful method for detecting risk factors for diseases that differ in prevalence across populations [12]. This type of mapping relies on genotyping hundreds of thousands of single nucleotide polymorphisms (SNPs) across the genome in a population of recently admixed individuals, and is based on the assumption that near a disease-associated locus there will be an enhanced ancestry content from the population with higher disease prevalence. Therefore, a critical step in admixture mapping is to obtain accurate estimates of local ancestry around each genomic locus.

Several methods have been developed for addressing the local ancestry inference problem. Most of these methods use a detailed model of the data in the form of a hidden Markov model, e.g., SABER [19], SWITCH [13], and HAPAA [18], but differ in the exact structure of the model and the procedures used for estimating model parameters. A second class of methods estimates the ancestry structure using a window-based framework and aggregates the results for each SNP using a majority vote: LAMP [14] uses an assumption of no recent recombination events within each window to estimate the ancestries, while WINPOP [9] employs a more refined model of recombination events coupled with an adaptive window size computation to achieve increased accuracy. Local ancestry inference methods also differ in the type of information used to make local ancestry inferences. Surprisingly, methods that do not model the linkage disequilibrium (LD) structure between SNPs currently outperform methods that model the LD information extracted from ancestral population haplotypes.

The main contribution of this paper is a novel method for imputation-based local ancestry inference that more effectively exploits LD information. Our method uses factorial HMMs trained on ancestral haplotypes to impute genotypes at all typed SNP loci (temporarily marking each SNP genotype as missing) under each possible local ancestry. We then assign to each locus the local ancestry that yields the highest imputation accuracy, as assessed using a weighted-voting scheme based on multiple SNP windows centered on the locus of interest. Preliminary experiments on simulated admixed populations generated starting from the four HapMap panels [22] show that imputation-based ancestry inference has accuracy competitive with the best existing methods in the case of distant ancestral populations, and is significantly more accurate for closely related ancestral populations. We also give results showing that the accuracy of untyped SNP genotype imputation in admixed individuals improves significantly when taking into account estimates of local ancestry.
2 Methods
In this work we consider the inference of locus-specific ancestry in recently admixed populations. We assume that for each admixed individual we are given the genotypes at a dense set of autosomal SNP loci, and we seek to infer the two ancestral populations of origin at each genotyped locus. For simplicity we consider only bi-allelic SNPs. For every SNP locus, we denote the major and minor alleles by 0 and 1. A SNP genotype is encoded as the number of minor alleles at the corresponding locus, i.e., 0 and 2 encode the homozygous major and minor genotypes, while 1 denotes a heterozygous genotype.
Fig. 1. Factorial HMM model for a multilocus SNP genotype (G_1, ..., G_n) over an n-locus window within which one haplotype is inherited from ancestral population P^k and the other from ancestral population P^l. For every locus i, F_i^k and H_i^k denote the founder haplotype, respectively the allele observed on the haplotype originating from population P^k; similarly, F_i^l and H_i^l denote the founder haplotype and observed allele for the haplotype originating from population P^l.
2.1 Genotype Imputation within Windows with Known Local Ancestry
Various forms of left-to-right HMM models of haplotype diversity in a homogeneous population have been successfully used for numerous genetic data analysis problems, including SNP genotype error detection [4], genotype phasing [11, 15], testing for disease association [6, 16], and imputation of untyped SNP genotypes [5, 7, 8, 15]. In this section we extend the imputation model in [5] to the case of individuals with known mixed local ancestry. Specifically, we assume that, over the set of SNPs considered, the individual has one haplotype inherited from ancestral population P^k and the other inherited from ancestral population P^l, where P^k and P^l are known (not necessarily distinct) populations. Multilocus SNP genotypes of individuals with such mixed ancestry are modeled statistically using a factorial HMM (F-HMM) [3], referred to as M_kl and graphically represented in Figure 1. At the core of the model are two left-to-right HMMs representing haplotype frequencies for the two ancestral populations (dotted boxes in Figure 1). Under these models, a haplotype from population P^j, j ∈ {k, l}, is viewed as a mosaic formed as a result of historical recombination among a set of K_j founder haplotypes, where K_j is a population-specific parameter (unless specified otherwise, we used K_j = 7 in our experiments). Formally, for each SNP locus i ∈ {1, ..., n}, we let G_i ∈ {0, 1, 2} be a random variable representing the genotype at locus i, H_i^j ∈ {0, 1} be a random variable representing the allele inherited from population P^j at locus i, and F_i^j ∈ {1, ..., K_j} be a random variable denoting the founder haplotype from
which H_i^j originates. Values taken by these random variables are denoted by the corresponding lowercase letters (e.g., g_i, h_i^j, f_i^j). The model postulates that, for each j ∈ {k, l}, the F_i^j, i = 1, ..., n, form the states of a first-order HMM with emissions H_i^j. We set P(g_i | h_i^k, h_i^l) to 1 if g_i = h_i^k + h_i^l and to 0 otherwise. Model training is completed by separately estimating the probabilities P(f_1^j), P(f_{i+1}^j | f_i^j), and P(h_i^j | f_i^j) using the classical Baum-Welch algorithm [1], based on haplotypes inferred from a panel representing each ancestral population P^j, j ∈ {k, l}. The parameters of the two left-to-right HMMs can alternatively be estimated directly from unphased genotype data using an EM algorithm similar to those in [6, 11].

Let g = (g_1, ..., g_n) be the multilocus genotype of a mixed-ancestry individual and let g_{-i} = (g_1, ..., g_{i-1}, g_{i+1}, ..., g_n). If the individual's SNP genotype at locus i is unknown, it can be imputed based on the model M_kl by maximizing over g' ∈ {0, 1, 2}

P_{M_kl}(G_i = g' | g_{-i}) \propto P_{M_kl}(g[g_i \leftarrow g'])    (1)

where g[g_i ← g'] = (g_1, ..., g_{i-1}, g', g_{i+1}, ..., g_n). The ancestry inference method described in Section 2.2 temporarily marks as missing and imputes each SNP genotype, and thus requires computing the probabilities (1) for all n SNP loci. This computation can be done efficiently using a forward-backward algorithm, as described below.

For every i ∈ {1, ..., n}, f_i^k ∈ {1, ..., K_k}, and f_i^l ∈ {1, ..., K_l}, we let F^i_{f_i^k, f_i^l} = P_{M_kl}(g_1, ..., g_{i-1}, f_i^k, f_i^l), which we refer to as the forward probability associated with the partial multilocus genotype (g_1, ..., g_{i-1}) and the pair of founder states (f_i^k, f_i^l) at locus i. The forward probabilities can be computed using the recurrences

F^1_{f_1^k, f_1^l} = P(f_1^k) P(f_1^l)    (2)

F^i_{f_i^k, f_i^l} = \sum_{f_{i-1}^k = 1}^{K_k} \sum_{f_{i-1}^l = 1}^{K_l} F^{i-1}_{f_{i-1}^k, f_{i-1}^l} E^{i-1}_{f_{i-1}^k, f_{i-1}^l}(g_{i-1}) P(f_i^k | f_{i-1}^k) P(f_i^l | f_{i-1}^l)
                = \sum_{f_{i-1}^k = 1}^{K_k} P(f_i^k | f_{i-1}^k) \sum_{f_{i-1}^l = 1}^{K_l} F^{i-1}_{f_{i-1}^k, f_{i-1}^l} E^{i-1}_{f_{i-1}^k, f_{i-1}^l}(g_{i-1}) P(f_i^l | f_{i-1}^l)    (3)

where

E^i_{f_i^k, f_i^l}(g_i) = \sum_{h_i^k, h_i^l \in \{0,1\},\ h_i^k + h_i^l = g_i} P(h_i^k | f_i^k) P(h_i^l | f_i^l)    (4)

The innermost sum in (3) is independent of f_i^k, so its repeated computation can be avoided by replacing (3) with

C^i_{f_{i-1}^k, f_i^l} = \sum_{f_{i-1}^l = 1}^{K_l} F^{i-1}_{f_{i-1}^k, f_{i-1}^l} E^{i-1}_{f_{i-1}^k, f_{i-1}^l}(g_{i-1}) P(f_i^l | f_{i-1}^l)    (5)

F^i_{f_i^k, f_i^l} = \sum_{f_{i-1}^k = 1}^{K_k} P(f_i^k | f_{i-1}^k) C^i_{f_{i-1}^k, f_i^l}    (6)

Using recurrences (2), (5), and (6), all forward probabilities can be computed in O(nK^3) time, where n is the number of SNP loci and K = max{K_k, K_l}.

Backward probabilities B^i_{f_i^k, f_i^l} = P_{M_kl}(g_{i+1}, ..., g_n | f_i^k, f_i^l) can be computed in O(nK^3) time using similar recurrences:

B^n_{f_n^k, f_n^l} = 1

D^i_{f_{i+1}^k, f_i^l} = \sum_{f_{i+1}^l = 1}^{K_l} B^{i+1}_{f_{i+1}^k, f_{i+1}^l} E^{i+1}_{f_{i+1}^k, f_{i+1}^l}(g_{i+1}) P(f_{i+1}^l | f_i^l)

B^i_{f_i^k, f_i^l} = \sum_{f_{i+1}^k = 1}^{K_k} P(f_{i+1}^k | f_i^k) D^i_{f_{i+1}^k, f_i^l}

After computing forward and backward probabilities, the posterior SNP genotype probabilities (1) can be evaluated in O(K^2) time per SNP locus by observing that

P_{M_kl}(g[g_i \leftarrow g']) = \sum_{f_i^k = 1}^{K_k} \sum_{f_i^l = 1}^{K_l} F^i_{f_i^k, f_i^l} E^i_{f_i^k, f_i^l}(g') B^i_{f_i^k, f_i^l}    (7)

Thus, the total time for computing all posterior SNP genotype probabilities is O(nK^3).
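The recurrences above map directly onto dynamic programming over the K_k × K_l grid of founder pairs. The following sketch is ours, not GEDI-ADMX code: it assumes the trained HMM parameters are available as dense NumPy arrays (initial, transition, and allele-emission probabilities per population) and computes the forward/backward tables via the factored recurrences (2), (5), (6) and the posteriors of equation (7).

```python
import numpy as np

def emission(emit_k, emit_l, g):
    """E^i_{a,b}(g) = sum over h_k + h_l = g of P(h_k|a) P(h_l|b).

    emit_k: (Kk, 2) allele emission probs for population k at one locus;
    emit_l: (Kl, 2); returns a (Kk, Kl) matrix."""
    if g == 0:
        return np.outer(emit_k[:, 0], emit_l[:, 0])
    if g == 1:
        return (np.outer(emit_k[:, 0], emit_l[:, 1])
                + np.outer(emit_k[:, 1], emit_l[:, 0]))
    return np.outer(emit_k[:, 1], emit_l[:, 1])

def genotype_posteriors(g, init_k, trans_k, emit_k, init_l, trans_l, emit_l):
    """Leave-one-out posteriors P(G_i = g' | g_-i) under model M_kl.

    g: length-n integer genotype vector with entries in {0, 1, 2}.
    init_*: (K,), trans_*: (n-1, K, K) row-stochastic, emit_*: (n, K, 2).
    Returns an (n, 3) array of posteriors, one row per locus."""
    n = len(g)
    # Forward pass, recurrences (2), (5), (6): F[i] has shape (Kk, Kl).
    F = [np.outer(init_k, init_l)]
    for i in range(1, n):
        C = (F[i-1] * emission(emit_k[i-1], emit_l[i-1], g[i-1])) @ trans_l[i-1]
        F.append(trans_k[i-1].T @ C)
    # Backward pass: B[n-1] = 1, then the two factored sums per locus.
    B = [None] * n
    B[n-1] = np.ones_like(F[0])
    for i in range(n - 2, -1, -1):
        D = (B[i+1] * emission(emit_k[i+1], emit_l[i+1], g[i+1])) @ trans_l[i].T
        B[i] = trans_k[i] @ D
    # Equation (7), normalized over the three candidate genotypes.
    post = np.empty((n, 3))
    for i in range(n):
        for gp in range(3):
            post[i, gp] = np.sum(F[i] * emission(emit_k[i], emit_l[i], gp) * B[i])
        post[i] /= post[i].sum()
    return post
```

For genome-scale n these raw probabilities underflow, so a production implementation would rescale each table per locus or work in log space; the structure of the recursion is unchanged.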
2.2 Local Ancestry Inference
Consider an individual coming from an admixture of (a subset of) N ancestral populations P^1, ..., P^N. As in previous works [19, 14, 13, 18, 9], we view the local ancestry at a locus as an unordered pair of (not necessarily distinct) ancestral populations. The set of possible local ancestries is denoted by A = {kl | 1 ≤ k ≤ l ≤ N}. Our local ancestry inference method is based on two observations: (1) for individuals from recently admixed populations, the local ancestry of a SNP locus is typically shared with a large number of neighboring loci; and (2) the accuracy of SNP genotype imputation within such a neighborhood is typically higher when using the factorial HMM model M_kl corresponding to the correct local ancestry than when using a mis-specified model. These observations suggest the algorithm in Figure 2, which infers local ancestry based on imputation accuracy within windows centered at each SNP locus. More precisely, the algorithm assigns to each SNP locus i the local ancestry that maximizes the average posterior probability of the true SNP genotypes over a window of up to 2w+1 SNPs centered at i (w SNPs downstream and w SNPs upstream of i). Step 1 of the algorithm requires training N left-to-right HMMs based on haplotype data using the Baum-Welch algorithm, which takes O(nK^2) time per iteration and typically converges in a small number of iterations.
Input: multilocus genotype g = (g_1, ..., g_n), window half-size w, and reference haplotypes for ancestral populations P^1, ..., P^N
Output: inferred local ancestries â_i ∈ A for each i = 1, ..., n

1. Train HMM models for each ancestral population and combine them to form factorial HMM models M_kl for every kl ∈ A
2. For each locus i, compute posterior SNP genotype probabilities (Equation 1) under each local ancestry model M_kl
3. For each locus i = 1, ..., n,

   \hat{a}_i \leftarrow \arg\max_{kl \in A} \frac{1}{|W_i|} \sum_{j \in W_i} P_{M_kl}(G_j = g_j | g_{-j})    (8)

   where W_i = {max{1, i − w}, ..., min{n, i + w}}
Fig. 2. Single-window imputation-based ancestry inference algorithm
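Given the per-locus posteriors computed under each model M_kl, Step 3 reduces to a windowed average followed by an argmax. A minimal sketch follows (ours, not GEDI-ADMX code); it assumes `post` maps each ancestry pair to an (n, 3) posterior array such as the one produced by the forward-backward sketch in Section 2.1, and that g is an integer NumPy array. The prefix sums mirror the O(1)-per-window trick described in the text.

```python
import numpy as np

def single_window_ancestry(g, post, w):
    """Assign each locus the ancestry maximizing the average posterior
    of the true genotypes over a window of up to 2w+1 SNPs (Equation 8).

    g: length-n integer genotype vector; post: dict mapping an ancestry
    pair (k, l) to an (n, 3) posterior array; w: window half-size."""
    n = len(g)
    models = list(post)
    # score[m][j] = P_Mkl(G_j = g_j | g_-j) for model m at locus j
    score = {m: post[m][np.arange(n), g] for m in models}
    # prefix sums give O(1) window averages per locus
    prefix = {m: np.concatenate(([0.0], np.cumsum(score[m]))) for m in models}
    calls = []
    for i in range(n):
        lo, hi = max(0, i - w), min(n - 1, i + w)
        avg = {m: (prefix[m][hi + 1] - prefix[m][lo]) / (hi - lo + 1)
               for m in models}
        calls.append(max(avg, key=avg.get))  # argmax over ancestry models
    return calls
```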
As described in Section 2.1, Step 2 of the algorithm is implemented in O(nK^3) time for each local ancestry model M_kl. Once the posterior SNP genotype probabilities are computed in Step 2, the window averages required in Step 3 for each local ancestry model M_kl can be computed in O(1) time per window, after precomputing in O(n) time the sums of posterior probabilities for all prefix sets {1, ..., i}. Thus, since the number of possible ancestry models is |A| = O(N^2), the algorithm requires O(nK^3 N^2) time overall.

As previously observed for other window-based methods of local ancestry inference [14, 9], optimal window size selection plays a significant role in the overall estimation accuracy. Window-based methods must balance two conflicting requirements: on one hand, small window sizes may not provide enough information to accurately differentiate between the |A| possible local ancestries (particularly when ancestral populations are closely related); on the other hand, large window sizes lead to more frequent violations of the assumption that local ancestry is uniform within each window. In the case of imputation-based ancestry inference we obtained good results using a multi-window approach: for each SNP genotype g_i we run the algorithm of Figure 2 for all w ∈ {100, 200, ..., 1500} and aggregate the results over all windows using a simple weighted voting scheme. Specifically, within each window we assign to each ancestry model M_kl a weight obtained by dividing its average posterior probability of the true genotypes, (1/|W_i|) \sum_{j \in W_i} P_{M_kl}(G_j = g_j | g_{-j}), by the sum of the averages achieved by all local ancestry models, and we select for each locus the model with the maximum sum of weights over all windows. Preliminary experiments (see Figure 4 and Table 1) suggest that the multi-window strategy yields an average accuracy that is very close to (and, for some admixed populations, better than) the maximum average accuracy achieved by running the single-window algorithm with any window size from the above set.
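A sketch of this weighted voting scheme, under the same assumptions as the single-window sketch above (the reduced window set is only to keep the example small; the paper uses w ∈ {100, 200, ..., 1500}):

```python
import numpy as np

def multi_window_ancestry(g, post, windows=(100, 200, 300)):
    """Aggregate single-window calls by weighted voting: within each
    window, each model's weight is its average true-genotype posterior
    divided by the sum over all models; each locus then takes the model
    with the largest total weight across windows. Illustrative sketch."""
    n = len(g)
    models = list(post)
    score = {m: post[m][np.arange(n), g] for m in models}
    prefix = {m: np.concatenate(([0.0], np.cumsum(score[m]))) for m in models}
    votes = {m: np.zeros(n) for m in models}
    for w in windows:
        for i in range(n):
            lo, hi = max(0, i - w), min(n - 1, i + w)
            avg = {m: (prefix[m][hi + 1] - prefix[m][lo]) / (hi - lo + 1)
                   for m in models}
            total = sum(avg.values())
            for m in models:
                votes[m][i] += avg[m] / total  # normalized weight per window
    stacked = np.array([votes[m] for m in models])
    return [models[j] for j in np.argmax(stacked, axis=0)]
```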
Fig. 3. Percentage of imputation errors (solid line) and unimputed genotypes (dashed line) at varying cutoff thresholds on posterior imputation probability for the WTCCC 1958 birth cohort dataset
3 Experimental Results
In this section we present preliminary results comparing our approach to several state-of-the-art methods for local ancestry inference. We begin with results demonstrating the accuracy of imputation based on the factorial HMM model. In a second set of experiments, we compare our imputation-based algorithm to existing methods for local ancestry inference on admixture datasets simulated starting from the four populations represented in HapMap [22]. Finally, we present results demonstrating the benefit of incorporating accurate local ancestry estimates when performing genotype imputation for admixed individuals.
3.1 SNP Genotype Imputation in Homogeneous Populations
To assess the accuracy achieved when imputing missing SNP genotypes based on the factorial HMM model described in Section 2.1, we used the 1,444 individuals of the 1958 birth cohort of the Wellcome Trust Case Control Consortium (WTCCC) [2]. For this homogeneous population, imputation was performed using the GEDI package [5], based on a factorial model consisting of two identical left-to-right HMMs trained on CEU panel haplotypes from HapMap. SNP genotype imputation for admixed populations is further discussed in Section 3.3. The individuals in the 1958 birth cohort were genotyped using the Affymetrix 500K GeneChip Assay. We masked as untyped, and then imputed, 1% of the SNPs on chromosome 22. We measured the error rate as the percentage of erroneously recovered genotypes out of the total number of masked genotypes. Since
the model provides the posterior probability for each imputed SNP genotype, one can obtain different tradeoffs between the error rate and the percentage of imputed genotypes by varying the cutoff threshold on the posterior imputation probability. Figure 3 plots the achievable tradeoffs. For example, using a cutoff threshold of 0.95, HMM-based imputation has an error rate of 1.7%, with 24% of the genotypes left unimputed.
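A tradeoff curve of this kind can be traced by sweeping the cutoff threshold over the posteriors of the masked genotypes; a small sketch (ours; variable names are illustrative and the arrays are assumed to be NumPy):

```python
import numpy as np

def threshold_tradeoff(post, truth, thresholds):
    """For each cutoff t, call a masked genotype only if its maximum
    posterior is >= t; report (error rate among calls, fraction uncalled).

    post: (n, 3) posteriors at the masked loci;
    truth: length-n integer array of true genotypes."""
    best = post.argmax(axis=1)   # most likely genotype per locus
    conf = post.max(axis=1)      # its posterior probability
    out = []
    for t in thresholds:
        called = conf >= t
        uncalled = 1.0 - called.mean()
        errors = (best[called] != truth[called]).mean() if called.any() else 0.0
        out.append((t, errors, uncalled))
    return out
```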
3.2 Inference of Local Ancestry in Admixed Populations
The method described in Section 2.2 was implemented in an extension of the GEDI software package [5], referred to as GEDI-ADMX. We compared GEDI-ADMX to several local ancestry inference methods capable of handling genome-wide data. Three of the competing methods (SABER [19], SWITCH [13], and HAPAA [18]) are HMM-based, while the other two (LAMP [14] and WINPOP [9]) perform window-based estimation based on genotype data at a set of unlinked SNPs. When comparing various methods for ancestry inference, one needs to take into account the fact that different methods use different types of information to make ancestry predictions. LAMP, WINPOP and SWITCH only require information about ancestral allele frequencies, while the other methods require the ancestral genotypes. In addition, HAPAA and GEDI-ADMX use additional information about ancestral haplotypes. Some of the methods also require the number of generations since the admixture process started. In general, we provided each method with the maximum amount of information about the admixture process (e.g., the number of generations g or the admixture ratio α) that it could take into account. Although these parameters can be estimated from genotype data when needed [20], we note that GEDI-ADMX does not require any additional parameters besides the ancestral haplotypes.

Experiments were performed on simulated admixtures using as ancestral populations the four HapMap [22] panels: Yoruba people from Ibadan, Nigeria (YRI), Japanese from the Tokyo area (JPT), Han Chinese from Beijing (CHB), and Utah residents with northern European ancestry (CEU). We simulated admixtures for each of the YRI-CEU, CEU-JPT, and JPT-CHB pairs of populations as follows: we started the simulation by joining a random set of α × n individuals from the first population and (1 − α) × n individuals from the second population. Within the merged panel we simulated g generations of random mating with a mutation and recombination rate of 10^{-8} per base pair per generation. We used only the 38,864 SNPs located on Chromosome 1 found on the Affymetrix 500K GeneChip Assay. For these simulations we used n = 2000, g = 7, and α = 0.2, as this roughly corresponds to the admixture history of the African American population [21, 17, 10]. Our simulations result in an admixed population with known local ancestry. Each of the evaluated methods infers an ancestry estimate for every SNP genotype; we measure the accuracy as the fraction of SNP genotypes for which the correct ancestry is inferred.

Effect of window size on the local ancestry estimates. Figure 4 plots the accuracy of the local ancestry predictions of GEDI-ADMX on the HapMap admixtures for different window sizes.
Fig. 4. Accuracy of local ancestry estimates obtained by GEDI-ADMX on the three HapMap admixtures using a single window of varying size
As expected, the accuracy initially increases with window size for all three datasets, since more information is available to differentiate between ancestry models. However, very large window sizes lead to more violations of the assumption of uniform ancestry within each window, overshadowing these initial benefits. As previously reported for other window-based methods [14, 9], we also notice that the best window size for the three datasets is correlated with the genetic distance between the ancestral populations: closer ancestral populations require longer windows for accurate predictions. Finally, we notice that the combined multi-window approach described in Section 2.2 achieves accuracy close to that of the best window size for the YRI-CEU and CEU-JPT admixtures, and better than that of any single window size for the JPT-CHB admixture (see Table 1). All remaining results were obtained using the multi-window approach.

Effect of the number of founders on local ancestry inference accuracy and runtime scalability. An important parameter of the HMM models used to represent the LD in ancestral populations is the number of founder haplotypes K. As discussed in Section 2.2, the runtime of the algorithm grows asymptotically with the cube of K, which renders the use of very large values of K impractical. Using very large values of K may also be problematic when the number of training haplotypes is limited, due to model overfitting. On the other hand, HMMs with very few founder haplotypes have a limited ability to capture LD patterns in the ancestral populations, which is expected to lead to poor accuracy. To assess these potentially complex tradeoffs between runtime and accuracy, we ran GEDI-ADMX on the CEU-JPT dataset using, for both ancestral populations, a number of founder haplotypes K varied between 1 and 10. The accuracy and runtime achieved by GEDI-ADMX for each value of K are plotted in Figure 5.
Fig. 5. GEDI-ADMX accuracy (solid line) and runtime (dashed line) for varying values of the number K of HMM founder haplotypes on the CEU-JPT dataset, consisting of n = 38,864 SNPs on Chromosome 1
Since for K = 1 our HMM model degenerates into a simple multinomial i.i.d. model that captures allele frequencies at each SNP but completely ignores LD, it is not surprising that ancestry inference accuracy is relatively poor (about 78%). For K = 2 accuracy improves significantly (to 93.5%), as the model is now able to represent pairwise LD between adjacent SNPs. As K is further increased, the model can capture more of the longer-range LD, leading to further accuracy improvements. However, the improvements in accuracy diminish quickly, with only a 1% accuracy improvement achieved when increasing K from 3 to 10. Although for small values of K lower-order terms make the runtime growth in Figure 5 appear sub-cubic, the asymptotic cubic growth is already apparent for the largest tested values of K. For the remaining experiments we used K = 7, since this setting achieves a good tradeoff between runtime and accuracy.

Comparison with other methods. Table 1 presents the accuracies achieved by the six compared methods on the three simulated HapMap admixtures. We note that GEDI-ADMX achieves accuracy similar to the best performing methods on the YRI-CEU and CEU-JPT admixtures, while yielding a significant improvement in accuracy on the JPT-CHB dataset. Indeed, on the JPT-CHB admixture our method achieves an accuracy of 94.0%, an increase of more than 11% over the second best performing method, WINPOP. Table 1 also reports an upper bound on the maximum accuracy that can be obtained by methods that do not model the linkage disequilibrium (LD) between SNPs, computed as described in [9]. Notably, GEDI-ADMX accuracy on the JPT-CHB dataset exceeds the upper bound for methods that do not model LD. This underscores the importance of exploiting ancestral haplotypes when performing local ancestry inference for admixtures of closely related populations.
Table 1. Percentage of correctly recovered SNP ancestries on three HapMap admixtures with α = 0.2

Method               YRI-CEU   CEU-JPT   JPT-CHB
SABER                89.4      85.2      68.2
HAPAA                93.7      88.2      72.0
SWITCH               97.8      94.8      74.8
LAMP                 94.8      93.0      65.8
WINPOP               98.0      95.9      82.8
Upper bound (no LD)  99.9      99.6      91.9
GEDI-ADMX            97.5      96.5      94.0

Table 2. Imputation error rate, in percent, on three HapMap simulated admixtures with α = 0.5

Method           YRI-CEU   CEU-JPT   JPT-CHB
GEDI-1-Pop Avg.  12.79     6.67      3.81
GEDI-2-Pop       7.31      3.90      3.02
GEDI-ADMX        4.34      2.81      2.74
3.3 SNP Genotype Imputation in Admixed Populations
In this section we present results that further demonstrate the synergy between SNP genotype imputation and local ancestry inference in admixed populations. More specifically, we focus on assessing the utility of inferring locus-specific ancestries when performing imputation of genotypes at untyped SNPs. For this experiment we generated three admixtures, corresponding to the YRI-CEU, CEU-JPT and JPT-CHB pairs of HapMap populations, using the same simulation procedure as described in Section 3.2, with parameters n = 2000, α = 0.5 and g = 10. We randomly chose 10% of the SNPs as untyped and masked them in all the individuals in the admixture. We first ran GEDI-ADMX using the unmasked SNP genotypes to infer local ancestries as described in Section 2.2. We then imputed the masked genotypes using the model in Section 2.1, based on the ancestry inferred for the adjacent unmasked SNPs. We measured the error rate of the imputation procedure as the percentage of genotypes inferred erroneously (using no cutoff threshold on the posterior imputation probability). To establish a baseline for the comparison, we also performed imputation using the GEDI package [5], based on a factorial model similar to that in Section 2.1, except that it consists of two identical left-to-right HMMs trained on either (1) panel haplotypes for only one of the ancestral populations (GEDI-1-Pop), or (2) a haplotype list obtained by merging the panel haplotypes of the two ancestral populations (GEDI-2-Pop). Table 2 shows the imputation accuracy achieved by the three compared methods. As expected, there is a large decrease in error rate when switching from using only one panel of ancestral haplotypes to using the combined panel consisting of haplotypes from both populations. Performing imputation based on the local
ancestry inferred by GEDI-ADMX yields further improvements in accuracy. Accuracy gains are largest when the ancestral populations are distant (e.g., YRI-CEU).
4 Discussion
In this paper we propose a novel algorithm for imputation-based local ancestry inference. Experiments on simulated data show that our method exploits ancestral haplotype information more effectively than previous methods, yielding consistently accurate estimates of local ancestry for a variety of admixed populations. Indeed, our method is competitive with the best existing methods in the case of admixtures of two distant ancestral populations, and is significantly more accurate than previous methods for admixtures of closely related populations, such as the JPT and CHB populations from HapMap. We also show that accurate local ancestry estimates lead to improved accuracy of untyped SNP genotype imputation for admixed individuals. In ongoing work we are exploring methods that iteratively alternate between rounds of imputation-based ancestry inference and ancestry-based imputation for further improvements in accuracy. We are also conducting experiments to characterize the accuracy of our imputation-based local ancestry inference method in the case of admixtures of more than two ancestral populations.
Acknowledgments. BP was supported by NSF grant 0713254. IM and JK were supported in part by NSF CAREER award IIS-0546457 and NSF award DBI-0543365. This study makes use of data generated by the Wellcome Trust Case Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. Funding for the project was provided by the Wellcome Trust under award 076113.
References

1. Baum, L.E., Petrie, T., Soules, G., Weiss, N.: A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist. 41, 164–171 (1970)
2. The Wellcome Trust Case Control Consortium: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678 (2007)
3. Ghahramani, Z., Jordan, M.I.: Factorial hidden Markov models. Mach. Learn. 29(2-3), 245–273 (1997)
4. Kennedy, J., Măndoiu, I.I., Paşaniuc, B.: Genotype error detection using hidden Markov models of haplotype diversity. Journal of Computational Biology 15(9), 1155–1171 (2008)
5. Kennedy, J., Paşaniuc, B., Măndoiu, I.I.: GEDI: Genotype error detection and imputation using hidden Markov models of haplotype diversity (manuscript in preparation), http://dna.engr.uconn.edu/software/gedi/
6. Kimmel, G., Shamir, R.: A block-free hidden Markov model for genotypes and its application to disease association. Journal of Computational Biology 12, 1243–1260 (2005)
7. Li, Y., Abecasis, G.R.: Mach 1.0: Rapid haplotype reconstruction and missing genotype inference. American Journal of Human Genetics 79, 2290 (2006)
8. Marchini, J., Spencer, C., Teo, Y.Y., Donnelly, P.: A Bayesian hierarchical mixture model for genotype calling in a multi-cohort study (2007) (in preparation)
9. Paşaniuc, B., Sankararaman, S., Kimmel, G., Halperin, E.: Inference of locus-specific ancestry in closely related populations (under review)
10. Parra, E.J., Marcini, A., Akey, J., Martinson, J., Batzer, M.A., Cooper, R., Forrester, T., Allison, D.B., Deka, R., Ferrell, R.E., et al.: Estimating African American admixture proportions by use of population-specific alleles. Am. J. Hum. Genet. 63(6), 1839–1851 (1998)
11. Rastas, P., Koivisto, M., Mannila, H., Ukkonen, E.: Phasing genotypes using a hidden Markov model. In: Măndoiu, I.I., Zelikovsky, A. (eds.) Bioinformatics Algorithms: Techniques and Applications, pp. 355–372. Wiley, Chichester (2008)
12. Reich, D., Patterson, N.: Will admixture mapping work to find disease genes? Philos. Trans. R. Soc. Lond. B Biol. Sci. 360, 1605–1607 (2005)
13. Sankararaman, S., Kimmel, G., Halperin, E., Jordan, M.I.: On the inference of ancestries in admixed populations. Genome Research 18, 668–675 (2008)
14. Sankararaman, S., Sridhar, S., Kimmel, G., Halperin, E.: Estimating local ancestry in admixed populations. American Journal of Human Genetics 82(2), 290–303 (2008)
15. Scheet, P., Stephens, M.: A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. American Journal of Human Genetics 78, 629–644 (2006)
16. Schwartz, R.: Algorithms for association study design using a generalized model of haplotype conservation. In: Proc. CSB, pp. 90–97 (2004)
17. Smith, M.W., Patterson, N., Lautenberger, J.A., Truelove, A.L., McDonald, G.J., Waliszewska, A., Kessing, B.D., Malasky, M.J., Scafe, C., Le, E., et al.: A high-density admixture map for disease gene discovery in African Americans. Am. J. Hum. Genet. 74(5), 1001–1013 (2004)
18. Sundquist, A., Fratkin, E., Do, C.B., Batzoglou, S.: Effect of genetic divergence in identifying ancestral origin using HAPAA. Genome Research 18(4), 676–682 (2008)
19. Tang, H., Coram, M., Wang, P., Zhu, X., Risch, N.: Reconstructing genetic ancestry blocks in admixed individuals. Am. J. Hum. Genet. 79, 1–12 (2006)
20. Tang, H., Peng, J., Wang, P., Risch, N.J.: Estimation of individual admixture: analytical and study design considerations. Genetic Epidemiology 28, 289–301 (2005)
21. Tian, C., Hinds, D.A., Shigeta, R., Kittles, R., Ballinger, D.G., Seldin, M.F.: A genomewide single-nucleotide-polymorphism panel with high ancestry information for African American admixture mapping. Am. J. Hum. Genet. 79, 640–649 (2006)
22. http://www.hapmap.org/
Interpreting Population Sequencing Data
(Invited Keynote Talk)

Shamil R. Sunyaev

Division of Genetics, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115
[email protected]
Rapid advances in sequencing technology are resulting in the accumulation of massive data on individual human DNA sequences. Interpretation of these data presents a number of statistical and computational challenges and demands the development of new computational methods. Sequencing of phenotyped clinical populations is widely anticipated to replace genotyping in studies aiming to find genes underlying human complex diseases. These sequencing studies will identify a multitude of previously unknown, predominantly rare, human allelic variants. Obviously, the statistical power to detect association of human phenotypes with rare variants is greatly reduced. At the same time, the analysis of a very large number of DNA variants will require a very strict multiple-testing correction. One possible strategy to overcome these complications is to combine multiple rare variants in genes or pathways and consider the genes (pathways) as the units of the association test. The success of this strategy was demonstrated in a number of candidate-gene-based studies. We investigated whether this strategy can be extended to de novo discovery of genes underlying human traits. We used computer simulations based on a population genetics model which incorporated historic changes in the population size and natural selection. The model was informed by existing deep resequencing data. We showed that genes meaningfully influencing a human phenotype can be identified from whole-"exonome" sequence data, although large sample sizes would be needed to achieve substantial power. The application of new statistical approaches taking into account the allele frequency distribution and computational predictions of the functional effects of individual mutations will increase the power of these studies.

Due to continuously decreasing cost, sequencing now also finds application in clinical genetic diagnostics. Genetic diagnostics is increasingly important in guiding therapeutic intervention and providing counseling to family members of patients with monogenic and oligogenic diseases. However, interpretation of the results of diagnostic sequencing remains a challenge because for the majority of newly detected allelic variants there are no data on pathogenicity. These variants are described in clinical reports as "variants of unknown significance" (VUS).
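As a purely generic illustration of the collapsing strategy sketched above (not taken from the talk; function and variable names are ours), rare variants mapped to a gene can be pooled into a single carrier indicator and the gene tested as a unit, e.g., with a Fisher exact test:

```python
import numpy as np
from scipy.stats import fisher_exact

def gene_burden_test(genotypes, is_case):
    """CAST-style collapsing test, illustrative sketch only.

    genotypes: (n_individuals, n_rare_variants) minor-allele counts for
    the rare variants mapped to one gene; is_case: boolean phenotype."""
    carrier = (genotypes > 0).any(axis=1)  # carries any rare allele in the gene
    table = [[np.sum(carrier & is_case), np.sum(~carrier & is_case)],
             [np.sum(carrier & ~is_case), np.sum(~carrier & ~is_case)]]
    return fisher_exact(table)[1]  # p-value for the gene as the test unit
```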
A frequently used strategy to assess the pathogenicity of mutations is to genotype large cohorts of unaffected, unrelated controls. New systematic resequencing datasets help quantify levels of rare polymorphism in the human population and ultimately resolve whether a missense mutation present in a patient and absent in hundreds of controls should be considered functional at a stringent level of statistical significance. We attempted to answer this question by means of direct analysis of resequencing data and by computer simulations with parameters of demographic history and strength of natural selection estimated from large systematic resequencing datasets. Our results suggest that the absence of a mutation in hundreds and even thousands of control subjects cannot be considered a reliable indication of functional significance, even for fully penetrant mutations. We also analyzed the applicability of computational methods for predicting the effect of mutations in the setting of a clinical genetic diagnostics lab. We developed and extensively tested a new computational method based on comparative sequence analysis and analysis of protein 3D structure. We created an automated multiple sequence alignment pipeline. Further, we employed a number of machine learning techniques to select the best set of features and to generate prediction rules. These developments resulted in greatly increased sensitivity and specificity of predictions, as evident from tests on several datasets. Analysis of mutations identified by diagnostic sequencing in cardiomyopathy patients demonstrated the method's ability to accurately predict the effect of approximately 50% of VUS.
Modeling and Visualizing Heterogeneity of Spatial Patterns of Protein-DNA Interaction from High-Density Chromatin Precipitation Mapping Data

Juntao Li, Fajrian Yunus, Zhu Lei, Majid Eshaghi, Jianhua Liu, and R. Krishna Murthy Karuturi∗

Genome Institute of Singapore, A-STAR, 60 Biopolis St, S138672, Republic of Singapore
{lij9,zhul,eshaghim,lijh,karuturikm}@gis.a-star.edu.sg
Abstract. Chromatin precipitation technologies such as ChIP and FAIRE, combined with microarray or sequencing technologies, have facilitated the identification of DNA-protein interaction sites in the genome of interest. At some loci, the interaction of a nuclear protein with DNA may be distributed instead of localized. As high-density mapping of interaction sites has recently become feasible, we developed Regional Piecewise Linear Modeling (RPLM) to model the locus-dependent distribution of the DNA-protein interaction signal, which reveals the heterogeneity of the roles of a protein. It requires fewer parameters to tune than similar methods such as MPeak and JBD. We also incorporate a procedure into RPLM, using SOM, to visualize the interaction locus certainty. The application of our approach to ChIP-chip datasets of two transcription factors has revealed the differences in their roles and shown the major difference between localized and distributed interactions.

Software: Available upon request.

Keywords: ChIP-chip, DNA-protein Interaction, Microarray.
1 Introduction

Protein-DNA interaction [5] is a fundamental transcriptional regulatory mechanism. Several nuclear proteins such as transcription factors and histones interact with DNA to achieve appropriate gene regulation. The roles of a nuclear protein may include stabilizing a protein complex by being part of it, recruiting RNA polymerase, DNA elongation, transcription initiation, transcription termination, chromatin structural alterations, etc. Disruption of such interaction patterns may lead to severe flaws in the functioning of cells, which might lead to diseases such as cancer.
∗ Corresponding author.
Hence, mapping and characterizing DNA-protein interaction sites/loci will provide important information on transcriptional regulation, complementing other omics data. The first step in obtaining such data is to isolate the DNA interacting with the protein of interest from the non-interacting DNA using ChIP or FAIRE technologies. Upon obtaining the appropriate isolated DNA fragment samples, their content is analyzed using either microarray technology or sequencing technology.

Several computational methods have been proposed to analyze chromatin precipitation mapping data and identify the interacting sites on DNA. For example, MPeak [10] fits a mixture of triangular basis functions to model the binding or interaction data; it is suitable only for single and direct binding events. Joint Binding De-convolution (JBD) [8] uses a probabilistic graphical model to improve the spatial resolution of the identification of transcription factor binding sites, but it requires the DNA fragment length distribution, which may not always be available. Sliding-window binomial testing [7] was used to identify regions enriched for binding signal and to split multi-peak regions into individual peak regions.
Further, we demonstrate the biological significance of different types of enrichment signal footprints.
2 Global Piece-wise Linear Modeling via Regional Piece-wise Linear Modeling (RPLM)

Global piece-wise linear modeling (GPLM) involves identifying an optimal set of approximation points. It is computationally expensive for interaction mapping data involving chromosomes of sizes ranging from a few million bases to hundreds of millions of bases at a resolution < 50 bp. Moreover, a straightforward trade-off between squared error and model complexity (number of approximation points) may not reveal biologically meaningful patterns and regions of occupancy, and thereby may not serve our purpose of visualization. Hence, we have devised the Regional Piece-wise Linear Modeling (RPLM) method, a computationally efficient strategy to generate a global piece-wise linear model of biological relevance. RPLM follows a two-step approach: (1) identifying distinguishably different regions of interaction, called putative enrichment footprints; and (2) piece-wise linear modeling of the whole interaction mapping data starting from the putative enrichment footprints.

Before we fit RPLM, the data is smoothed and median centered. Formally, let f_C(x) be the enrichment function mapping the interaction strength of the protein at location x on chromosome C after median centering. Let g_C(x) be the function obtained after repeated smoothing of f_C(x) for k rounds using a rectangular kernel of bandwidth 2w+1. Note that x is described in terms of its probe number on chromosome C in the case of arrays, and its discretized location in the case of sequencing data.

2.1 Identifying Putative Enrichment Footprints

A typical chromatin precipitation experiment is aimed at finding enrichment of chromatin regions interacting with the protein of interest, except in FAIRE; however, the FAIRE signal may be inverted to find enrichment of interacting regions. Hence, in practice, we are interested in the positive signal from the mapping data. This naturally leads us to split the enrichment signal on a chromosome at zero-crossing points (ZCPs), following the definitions below:

Definition 1: y is a zero-crossing point (ZCP) of g_C(x) if and only if g_C(y-1) × g_C(y+1) ≤ 0 and |g_C(y-1)| + |g_C(y+1)| > 0, and y is in the domain of g_C(x).

Definition 2: Let A_C^0 = {x_1, x_2, ..., x_{Z_C} | x_i < x_{i+1}} be the set of Z_C linearly ordered zero-crossing points on chromosome C. Then R_i^C, the enrichment signal between two consecutive zero-crossing points x_i and x_{i+1} on chromosome C, is called a zero-crossing footprint; x_i and x_{i+1} are the lower and upper approximation points of R_i^C.

The zero-crossing footprints may encompass long regions of the genome and cover many reasonably closely located distinct interaction regions. Hence, we further split them into one or more so-called putative enrichment footprints using our Recursive Footprint Splitting (RFS) algorithm, as described in Pseudocode 1; a small illustrative sketch of the smoothing and zero-crossing scan is given first.
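The following minimal sketch (ours, not the authors' implementation; the default w and k are illustrative) shows how g_C(x) and the zero-crossing points of Definition 1 might be computed:

```python
import numpy as np

def smooth(f, w=2, k=3):
    """g_C: median-center f_C, then apply a rectangular kernel of
    bandwidth 2w+1, repeated k times."""
    g = np.asarray(f, dtype=float) - np.median(f)
    kernel = np.ones(2 * w + 1) / (2 * w + 1)
    for _ in range(k):
        g = np.convolve(g, kernel, mode="same")
    return g

def zero_crossing_points(g):
    """Definition 1: positions y with g(y-1) * g(y+1) <= 0 and
    |g(y-1)| + |g(y+1)| > 0 (interior points only)."""
    y = np.arange(1, len(g) - 1)
    mask = (g[y - 1] * g[y + 1] <= 0) & (np.abs(g[y - 1]) + np.abs(g[y + 1]) > 0)
    return y[mask]
```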
RFS Algorithm
Input:  Zero-crossing footprints H = {RiC | i = 1, 2, …, ZC−1} and a base reference 0 < r < 1
Output: Putative enrichment footprints E = {EjC | j = 1, 2, …, EC−1}

Function RFS(H, r) {
  E = Ø
  If (H = Ø) Then Return(E)
  For Each h ∈ H {
    m = max_{x ∈ [Lh, Uh]} gC(x)     // Lh and Uh are the lower and upper approximation points of h
    b = r × m                        // 0 < b < m
    Yh  = {y | y ∈ [Lh, Uh], (gC(y) − b)×(gC(y+1) − b) ≤ 0 and |gC(y) − b| + |gC(y+1) − b| > 0, gC(y) > 0}
    Ynh = {y = [Lh', Uh'] | y ⊆ h, where Lh' and Uh' are the closest left and right ZCPs or valley points to lh and uh, respectively}
    If (|Yh| < 3)                    // the footprint h is not split; see the following lemmas
      Then E = E ∪ {h}; H = H − {h}
    Else H = H ∪ {y = [Li, Ui] | y ∈ Ynh, gC(y) > 0 for y ∈ [Li, Ui]}
    If (H = Ø) Return(E)
  }
  Return(E)
}

Pseudo-code 1: Pseudo-code of the RFS algorithm to identify putative enrichment footprints.

Lemma 1: If K(x) is a continuous function with ∞ > m ≥ K(x) > 0 for Lk < x < Uk, and K(Lk), K(Uk) ≤ b < m, then |{x | K(x) = b}| ≥ 2.

Corollary 1: h (∈ H) will be split into two or more regions at reference b if and only if |Yh| > 2.
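To make Definitions 1 and 2 and the split criterion of Corollary 1 concrete, the following is a minimal Python sketch; the function names and the NumPy representation of gC are our own, not part of the original method description:

```python
import numpy as np

def zero_crossing_points(g):
    """Definition 1: indices y with g[y-1]*g[y+1] <= 0 and |g[y-1]| + |g[y+1]| > 0."""
    y = np.arange(1, len(g) - 1)
    mask = (g[y - 1] * g[y + 1] <= 0) & (np.abs(g[y - 1]) + np.abs(g[y + 1]) > 0)
    return y[mask]

def reference_crossings(g, lo, hi, r):
    """The set Y_h of RFS: positive positions in [lo, hi] where g crosses b = r * max(g)."""
    b = r * g[lo:hi + 1].max()
    y = np.arange(lo, hi)
    mask = ((g[y] - b) * (g[y + 1] - b) <= 0) \
           & (np.abs(g[y] - b) + np.abs(g[y + 1] - b) > 0) \
           & (g[y] > 0)
    return y[mask]

# By Corollary 1, a zero-crossing footprint [lo, hi] is split at reference b
# if and only if len(reference_crossings(g, lo, hi, r)) > 2.
```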
2.2 Regional Piece-wise Linear Modeling (RPLM) and Augmenting Approximation Points

Before proceeding with piece-wise linear modeling (RPLM) of the regions, we start with an initial set of approximation points that includes the ZCPs. In other words, we initialize the approximation points for a region in two steps: (1) zero-crossing points are approximation points, i.e., ACi = A0Ci; and (2) peak points and valley points are approximation points, as defined below.

Definition 3: Let ΔgC(x) = gC(x+1) − gC(x). Then y is a valley point of gC(x) if and only if ΔgC(y−1) < 0, ΔgC(y+1) > 0, and y is in the domain of gC(x). Similarly, y is a peak point of gC(x) if and only if ΔgC(y−1) > 0, ΔgC(y+1) < 0, gC(y) > 0, and y is in the domain of gC(x).

Definition 4: y is an approximation point of gC(x) if ΔgC(y−1)×ΔgC(y+1) < 0, gC(y) > 0, and y is in the domain of gC(x); i.e., all peak and valley points are approximation points.

Hence the approximation points are ACi = A0Ci ∪ {peak points} ∪ {valley points}. Each putative enrichment footprint ECi is now defined by a set of initialized approximation points ACi. The RPLM algorithm augments ACi by approximating ECi with a piece-wise linear model as follows. Let x and y be two consecutive approximation points in ACi on ECi, i.e., x, y ∈ ACi and no z ∈ (x, y) is in ACi. Let L(z | x, y) denote the line segment joining (x, gC(x)) and (y, gC(y)) for x ≤ z ≤ y:
L(z | x, y) = gC(x) + [(gC(y) − gC(x)) / (y − x)] · (z − x),   x ≤ z ≤ y
Let q (x < q < y) be the point of maximum difference between L(z | x, y) and gC(z), i.e.,

q = argmax_{z ∈ [x, y]} |gC(z) − L(z | x, y)|
Let PL(z | x, q, y) denotes the piece-wise linear approximation of gC(z) for x ≤ z ≤ y using approximation points x, q and y. It is defined as
⎧L( z | x, q) ⎪ PL( z | x, q, y) = ⎨ ⎪L( z | q, y) ⎩
for
z ∈ [ x, q]
for
z ∈ [ q, y ]
Now we have to determine whether PL(z | x, q, y) is a significantly better approximation of gC(z) than L(z | x, y) for x ≤ z ≤ y, in order to accept q as an approximation point of ECi. To do so, we compare the errors of the two approximations and test the significance of the difference between their errors using a classical F-test as follows:

eL = Σ_{z=x}^{y} (gC(z) − L(z | x, y))²

ePL = Σ_{z=x}^{q} (gC(z) − L(z | x, q))² + Σ_{z=q}^{y} (gC(z) − L(z | q, y))²

FPL = [(eL − ePL) / 2] / [ePL / (y − x − 3)] ~ F(2, y−x−3)
FPL follows a central F-distribution with 2 and (y−x−3) degrees of freedom under the null hypothesis that there is no difference between eL and ePL. The p-value is obtained using this null distribution. If the p-value is < 0.05, then q is added to ACi, and the procedure is repeated until no more approximation points are added, as illustrated in Fig. 1.
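For illustration, here is a hedged Python sketch of the recursive augmentation step just described; gC is assumed to be given as a NumPy array over the footprint, and the function names and the 0.05 default are ours:

```python
import numpy as np
from scipy import stats

def line_segment(g, x, y):
    """Values of L(z | x, y), the line joining (x, g[x]) and (y, g[y])."""
    z = np.arange(x, y + 1)
    return g[x] + (g[y] - g[x]) / (y - x) * (z - x)

def augment(g, x, y, points, alpha=0.05):
    """Recursively add approximation points between x and y using the F-test."""
    if y - x < 4:                        # need y - x - 3 >= 1 denominator df
        return
    L = line_segment(g, x, y)
    q = x + 1 + int(np.argmax(np.abs(g[x + 1:y] - L[1:-1])))  # interior maximizer
    e_l = np.sum((g[x:y + 1] - L) ** 2)
    e_pl = np.sum((g[x:q + 1] - line_segment(g, x, q)) ** 2) \
         + np.sum((g[q:y + 1] - line_segment(g, q, y)) ** 2)
    if e_pl == 0:                        # perfect piece-wise fit: accept q outright
        p = 0.0
    else:
        f = ((e_l - e_pl) / 2) / (e_pl / (y - x - 3))
        p = stats.f.sf(f, 2, y - x - 3)  # upper tail of F(2, y-x-3)
    if p < alpha:                        # q becomes an approximation point; recurse
        points.add(q)
        augment(g, x, q, points, alpha)
        augment(g, q, y, points, alpha)
```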
Fig. 1. Illustration of applying RPLM to augment ACi to {P1,P5,P4,P3,P2}. (A) The initial approximation points are ACi = {P1,P3,P2}. (B) The piece-wise linear modeling between P1 and P3 added approximation point P4; P5 (grey) is rejected. (C) Similarly, P5 (black) was added between P1 and P4.
Then ECi is approximated by ACi as follows:

PLM(z | z1, z2, …, zi) = L(z | z1, z2)            for z ∈ [z1, z2]
                       = PLM(z | z2, z3, …, zi)   for z ∈ [z2, zi]

2.3 Quantifying Significance of Putative Enrichment Footprints

To find true enrichment footprints, we assess the statistical significance of the non-zeroness of each putative enrichment footprint. A putative enrichment footprint ECi with approximation points ACi (= {z1, z2, …, zi}, where z1 < z2 < … < zi) is approximated by the piece-wise linear model and tested for statistical significance as follows:

e1 = Σ_{z=x}^{y} (fC(z) − PLM(z | ACi))²   and   e0 = Σ_{z=x}^{y} (fC(z))²

The significance of the difference between e0 and e1 is tested using the F-statistic FPLM:

FPLM = [(e0 − e1) / (2|ACi| − 1)] / [e1 / (y − x − 2|ACi| + 3)] ~ F(2|ACi|−1, y−x−2|ACi|+3)
where |ACi| is the cardinality (number of elements) of the set ACi. FPLM follows a central F-distribution with (2|ACi|−1) and (y−x−2|ACi|+3) degrees of freedom under the null hypothesis that there is no difference between e0 and e1. The p-value is obtained using this null distribution. We compute the p-value of each putative enrichment footprint. The fold change may be obtained as the maximum of gC(x) in the footprint region. Note that we used the original enrichment function fC(z) instead of gC(z), as we have to consider the noise present in fC(z) to assess the statistical significance of the putative enrichment footprints.
The putative enrichment footprints selected based on certain fold-change and p-value thresholds are called enrichment footprints.

2.4 Visualizing Spatial Distribution Patterns of Protein-DNA Interaction Signals

The above piece-wise linear modeling provides information on putative enrichment footprints along with their p-values for being chromatin enriched, and a quantification of the interaction. This completes only half of the required analysis. We are also interested in clustering the spatial distribution patterns of the interaction of the protein of interest with DNA and visualizing them. There are a few difficulties in achieving this task: (1) different enrichment footprints have different lengths; (2) the relative magnitudes of the various approximation points (valleys, peaks, and saddle points) differ between enrichment footprints; and (3) the relative order and positioning of the approximation points differ between enrichment footprints. To cluster and visualize the patterns of interaction, we developed the following 3-step procedure: (1) Enrichment Footprint Equalization (EFE); (2) classifying equalized footprints based on the number of peak points present; and (3) applying Self Organizing Map (SOM) clustering on each class and visualizing the result using heat maps. Of the three steps, EFE is crucial: each enrichment footprint is represented by a vector of dimension B that contains information on the nature of the approximation points and their relative positioning on the respective enrichment footprint. The procedure is as follows (see the sketch after this list):

1. Indicate each approximation point by the negative of the sign of the second derivative, i.e., let Δ²gC(y) = ΔgC(y+1) − ΔgC(y). An approximation point at y is denoted by '+1' if Δ²gC(y−1)×Δ²gC(y+1) < 0 and, similarly, by '−1' if Δ²gC(y−1)×Δ²gC(y+1) > 0.
2. Split the enrichment footprint into B equally spaced bins. In each bin, sum the values, assigned in step 1, of all approximation points in the bin. In other words, each bin gets '1, 0, -1', representing near-peak, none, or near-valley approximation points.
3. Smooth the vector using any small-bandwidth kernel (a bandwidth of 1 is enough in practice).
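The sketch below (our own naming; the kernel in step 3 is one arbitrary small-bandwidth choice) illustrates the EFE procedure for a single footprint:

```python
import numpy as np

def equalize_footprint(g, approx_points, lo, hi, B=50):
    """Represent the footprint [lo, hi] as a length-B vector of near-peak/valley marks."""
    d2 = np.diff(g, 2)                 # d2[y] = second difference of g at y
    vec = np.zeros(B)
    for y in approx_points:
        if y - 1 < 0 or y + 1 >= len(d2):
            continue
        # step 1: +1 near a peak (sign change of the 2nd difference), -1 near a valley
        mark = 1.0 if d2[y - 1] * d2[y + 1] < 0 else -1.0
        # step 2: accumulate the mark in the bin holding y's relative position
        b = min(int((y - lo) / (hi - lo + 1) * B), B - 1)
        vec[b] += mark
    # step 3: smooth with a small rectangular kernel (bandwidth 1)
    return np.convolve(vec, np.ones(3) / 3.0, mode='same')
```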
3 Analysis of Interaction Patterns of TBP1 and CDC10 in S. pombe Using RPLM

To demonstrate the efficacy of our approach, we present an analysis of the spatial distribution of the interaction sites of two important transcription factors, TBP1 and CDC10. TBP1 is the TATA-binding protein and plays a key role in the transcriptional machinery [6], whereas CDC10 plays an important role in cell division cycle regulation; its function is associated with the formation of the so-called MBF complex [3] in the G1 phase of the cell division cycle. The data was obtained from Majid et al. [submitted for publication] and will be made publicly available soon.
Majid et al. used NimbleGen very high-density tiling arrays (~380K probes covering ~12Mb of the whole S. pombe genome). The probes are 50-mers and overlap by 30bp. Each dataset was analyzed separately using smoothing parameters w = 9 and k = 3.

3.1 Enrichment Footprint Pattern Clustering and Visualization Reveals Several Interactions Are Not Completely Localized but Spatially Distributed in Both TBP1 and CDC10

Contrary to the assumption made by several methods, only ~85% of interactions in the TBP1 experiment and ~74% of interactions in the CDC10 experiment are reasonably localized (1 or 2 peaks), as summarized in Table 1. In other words, a significant fraction of interactions (~15% in the TBP1 experiment and ~26% in the CDC10 experiment) appear to be dynamic and spatially distributed on DNA within windows ranging from 500bp to 2Kbp, as they exhibit multiple approximation points (number of peak points > 2). The heat maps of the clustering of the equalized vectors of the enrichment footprints of the TBP1 and CDC10 experiments are shown in Figures 3 and 2, respectively. Though single-peak and two-peak enrichment footprints may be thought to indicate static interactions, the SOM clustering shown in Figures 2 and 3 reveals a significantly different picture. It reveals that several single-peak enrichment footprints, ~50%, are not centered but skewed either to the left or to the right, indicating that they are only partially static or localized interactions dragged in the direction of the skew. The remaining ~50% of the single-peak enrichment footprints have centered peaks, indicating that the interaction with DNA at these sites is completely static or highly localized. The two-peak enrichment footprints show close localization of two potentially fully or partially static localized interactions; it is difficult to distinguish completely localized interactions from partially localized ones in two-peak enrichment footprints. But the multi-peak footprints (3 or more peaks) demonstrate that the interactions are dynamic and spatially distributed. The peaks in these footprints may be the combined effect of varying localization tendencies of the interactions and variations in regional enrichment measurements introduced by the microarray technology.

Table 1. The heterogeneity of interaction localization behaviour of CDC10 & TBP1
Peaks   TBP1   TBP1 (%)   CDC10   CDC10 (%)
1       1246   64.69%     511     50.44%
2        389   20.20%     235     23.20%
3        163    8.46%     121     11.94%
4+       128    6.65%     146     14.41%
Total   1926   100%       1013    100%
Table 2. Contingency table showing the association of the TBP1 and CDC10 transcription factors with more localized and distributed interactions, respectively. The p-value using Fisher's Exact Test is ~4×10^-13.

Type of Interactions                         TBP1   CDC10
Localized (1 or 2 peak footprints)           1635   746
Distributed (3 and more peak footprints)      291   267
Apart from revealing the heterogeneity of the interactions of a particular protein with DNA, our method also allows us to compare the nature of the interactions of different proteins. For example, the difference between TBP1 interactions and those of CDC10 is twofold: (1) the number of interaction footprints and their distribution; and (2) finer differences in the interaction footprints. The number of enrichment footprints in TBP1 (1926) is nearly double that of CDC10 (1013), indicating the higher and more general role of TBP1 compared to CDC10. TBP1 interactions are more static and localized than those of CDC10, as only 15% of TBP1 interactions are spatially distributed versus 26% for CDC10; the p-value of the association is highly significant, approximately 4×10^-13 (see Table 2). The first finer difference between TBP1 and CDC10 interactions is shown by several strong positive second-derivative approximation points (green) on both the left and right sides of several single-peak enrichment footprints for TBP1, which are absent in CDC10. This sharp drop of the enrichment signal from the peak indicates a higher localization tendency of TBP1 interactions as opposed to CDC10. The other finer difference appears in the multi-peak enrichment footprints: in contrast to the CDC10 interaction patterns, one of the peaks at the extreme positions of the footprints appears to be much sharper and of relatively higher value in TBP1. This may indicate that, though the interaction of TBP1 is not completely localized, it is better localized than that of CDC10.

3.2 Localized Footprints Are Better Enriched with Motifs Compared to Distributed Footprints

To further test the biological difference between localized and distributed enrichment footprints, we tested for the occurrence of the respective motifs. We searched for the TBP1-related (TATA box [11]) and CDC10-related (MCB [12]) motifs in localized and distributed enrichment footprints that overlap with the promoter regions of genes, taking orientation into account. We then calculated the motif enrichment percentages of localized and distributed enrichment footprints as the total number of motif occurrences over the total number of peaks in each type of footprint. The average motif occurrence in localized enrichment footprints is much higher than in distributed enrichment footprints (p-value < 0.001 by Fisher's exact test). The difference indicates that the interaction between protein and DNA is driven by different regulatory factors in the localized and distributed footprints: localized footprints arise largely from motif-driven, direct interaction, whereas distributed footprints arise from indirect interaction through distributed, tethered binding of the protein to the DNA.
Fig. 2. CDC10 interaction pattern clustering and visualization. (A) Heat map (red for peaks, green for valleys, and black for 0) of the clustering of the distribution of approximation points in the equalized enrichment footprints. (B)-(D) Example single-peak footprints showing left-peaked, centered, and right-peaked enrichment footprints. (B) and (D) are slightly spread interactions, whereas (C) corresponds to completely localized interactions. (E) A typical 2-peak enrichment footprint. Typical multi-peak enrichment footprints are shown in (F) and (G); both show highly uniformly distributed interactions.
Fig. 3. TBP1 interaction pattern clustering and visualization. (A) Heat map (red for peaks, green for valleys, and black for 0) of the clustering of the distribution of approximation points in the equalized enrichment footprints. (B)-(D) Example single-peak footprints showing left-peaked, centered, and right-peaked enrichment footprints. (B) and (D) are slightly spread interactions, whereas (C) corresponds to completely localized interactions. (B) and (D) further demonstrate the sharp fall of the enrichment signal from their peaks, in contrast to that of CDC10. (E) A typical 2-peak enrichment footprint. Typical multi-peak enrichment footprints are shown in (F) and (G); both show highly uniformly distributed interactions.
Table 3. Average fraction of peaks with motifs, i.e., the ratio of the number of motifs in the footprints to the number of peaks in the footprints

Footprint Type          CDC10            TBP1
Localized               23% (223/981)    41% (754/1824)
Distributed             18% (175/947)    28% (281/1001)
P-value of Difference   4×10^-6          10^-12
4 Discussion

We have presented a first-of-its-kind approach that permits direct assessment of the genome-wide behavior and heterogeneity of the spatial distribution of protein-DNA interactions from high-density maps of chromatin precipitation samples. This is a significant step in protein-DNA interaction studies, as direct and accurate genome-scale estimation of interaction patterns and their heterogeneity at various loci of the genome has not been performed previously. In this paper, we presented the RPLM algorithm to fit a global piece-wise linear model in a computationally efficient manner under reasonable and realistic assumptions. The RPLM algorithm requires fewer parameters to tune, and the result is easy to comprehend. We have also presented a method to visualize the complex distribution patterns of DNA-protein interactions. We have demonstrated that RPLM together with SOM-based visualization can easily reveal the complex patterns of the spatial distribution of interactions.

The application of our analysis to the high-density ChIP-chip data of TBP1 (TATA-binding protein) and CDC10 (MBF complex protein) has revealed interesting heterogeneous spatial distribution patterns of interactions. Both transcription factors appear to interact with DNA in a locus-specific manner. Only 75-85% of interactions are reasonably localized, and the remaining 15-25% show a more uniform spatial distribution. But even among the static interactions, ~50% show some level of distributed interaction, as revealed by the non-centrality of the peak in single-peak enrichment footprints. Though both transcription factors exhibit highly distributed interactions, the level of uniformity appears to be slightly different: CDC10 interactions appear to be more uniformly distributed, whereas TBP1 interactions show a higher level of static behavior. We have further shown the disparity in motif frequency between localized and distributed footprints. Without the presented approach, it would be difficult to reveal such patterns of interaction, and the minute differences between different transcription factors, in a single glance. It clearly revealed that finding just the most strongly interacting locations or sites is not enough to understand the roles of different transcription factors. The approach we presented is unique in its way and applicable to diverse high-density mapping data. For example, some histone modifications show mostly flat enrichment footprints, whereas others are heavily localized. Instead of using different methods to analyze different datasets, such as HMMs for flat footprints and peak finding algorithms for peaked footprints, we use a single method which can reveal the different patterns and their proportions.
Though our method is effective in identifying different patterns of interaction, we still have to improve the method in terms of the representation of approximation points and enrichment footprints. The simple {+1, 0, -1} representation is not sufficient, as it does not differentiate true peaks/valleys from other saddle points. Moreover, the vectorization of enrichment footprints should be improved to separately cluster multi-peak footprints of CDC10 and TBP1 if they are combined for clustering, as they show different localization behavior; currently, this is possible only by visualizing the actual footprints in each cluster. We are working on all these aspects of our algorithm for future improvements.
Acknowledgments

We thank A-STAR and the Genome Institute of Singapore for their funding and continual support. We thank Edison T. Liu and Neil D. Clarke for their valuable feedback. We also thank our colleagues Galih and Joon Hong for their support.
References

1. Barski, A., Cuddapah, S., Cui, K., Roh, T.Y., Schones, D.E., Wang, Z., Wei, G., Chepelev, I., Zhao, K.: High resolution profiling of histone methylations in the human genome. Cell 129(4), 823–837 (2007)
2. Buck, M.J., Lieb, J.D.: ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics 83(3), 347–348 (2004)
3. Chu, Z., Li, J., Eshaghi, M., Peng, X., Karuturi, R.K.M., Liu, J.H.: Modulation of cell cycle specific gene expressions at the onset of S-phase arrest contributes to the robust DNA replication checkpoint response in fission yeast. Mol. Biol. Cell 18(5), 1756–1767 (2007)
4. Johnson, D.S., Mortazavi, A., Myers, R.M., Wold, B.: Genome-wide mapping of in vivo protein-DNA interactions. Science 316(5830), 1441–1442 (2007)
5. Jones, S., van Heyningen, P., Berman, H.M., Thornton, J.M.: Protein-DNA interactions: A structural analysis. J. Mol. Biol. 287(5), 877–896 (1999)
6. Lee, T.I., Young, R.A.: Transcription of eukaryotic protein-coding genes. Annu. Rev. Genet. 34, 77–137 (2000)
7. Li, J., Zhu, L., Eshaghi, M., Liu, J., Karuturi, R.K.M.: Genome-wide ChIP-chip Tiling Array Data Analysis in Fission Yeast. In: RECOMB 2008 (2008)
8. Qi, Y., Rolfe, A., MacIsaac, K.D., Gerber, G.K., Pokholok, D., Zeitlinger, J., Danford, T., Dowell, R.D., Fraenkel, E., Jaakkola, T.S., Young, R.A., Gifford, D.K.: High-resolution computational models of genome binding events. Nat. Biotechnol. 24(10), 1293 (2006)
9. Wei, C.L., Wu, Q., Vega, V.B., et al.: A global map of p53 transcription-factor binding sites in the human genome. Cell 124(1), 21–23 (2006)
10. Zheng, M., Barrera, L.O., Ren, B., Wu, Y.N.: ChIP-chip: data, model, and analysis. Biometrics 63(3), 787–796 (2007)
11. Basehoar, A.D., Zanton, S.J., Pugh, B.F.: Identification and Distinct Regulation of Yeast TATA Box-Containing Genes. Cell 116, 699–709 (2004)
12. Rustici, G., Mata, J., Kivinen, K., Lió, P., Penkett, C.J., Burns, G., Hayles, J., Brazma, A., Nurse, P., Bähler, J.: Periodic gene expression program of the fission yeast cell cycle. Nat. Genet. 36, 809–817 (2004)
A Linear-Time Algorithm for Analyzing Array CGH Data Using Log Ratio Triangulation Matthew Hayes and Jing Li Case Western Reserve University, Cleveland, OH 44106 USA
[email protected]
Abstract. DNA copy number is the number of replicates of a contiguous segment of DNA on the genome. Copy number alteration (CNA) is a genetic abnormality in which the number of these segments differs from the normal copy number, which is two for human chromosomal DNA. The association of CNA with cancer has led to a proliferation of research into algorithmic methods for detecting these regions of genetic abnormality. We propose a linear-time algorithm to identify chromosomal change points using array comparative genomic hybridization (aCGH) data. This method treats log-2 ratio values as points in a triangle and segments the genome into regions of equal copy number by exploiting the properties of log-2 ratio values often seen at segment boundaries. Applying our method to real and simulated aCGH datasets shows that the triangulation method is fast and is robust for data with low to moderate noise levels.
1 Introduction
For a given contiguous segment of chromosomal DNA, the copy number of the segment is the number of its replicates on the chromosome. In humans, the normal copy number is two, with one copy inherited from each parent. However, the phenomenon known as copy number alteration (CNA) can occur, which leads to abnormal copy numbers (i.e., where the number of segment replicates is not 2). Copy number alteration can occur because of genetic inheritance, but can also be the result of cancer progression. The studies performed in [1,2,3] provide a clear picture of the relationship between copy number aberrations and different forms of cancer. The correlation between CNA and cancer necessitates the ability to locate CNA regions on the genome. Accurately locating these regions would aid the researcher in locating relevant genes associated with a particular type of cancer, and it would also help in classifying tumor cells [4,5].
1.1 Locating CNA Regions: The aCGH Platform
Array comparative genomic hybridization (aCGH) is a microarray technology designed to provide genome-wide screening of DNA copy number profiles.
Corresponding author.
Using aCGH, it is possible to measure DNA copy number for thousands of DNA clones along the chromosome [6]. Using data from the aCGH platform, regions of equal copy number can be computationally determined through segmentation, in which the genome is partitioned into regions of distinct copy number. The problem of segmentation is then to find copy number transition points, or "breakpoints": locations that define the boundaries between regions of different copy number. For a given clone in an aCGH experiment, its copy number in a test sample is compared to its copy number in a reference sample. The samples here are cells, and the normalized log-2 ratio of the copy numbers for the test and reference samples is recorded for each clone. An example graph of log-2 ratio values is shown in Fig. 1. However, a significant problem with aCGH experiments is made clear in this figure: the data is very noisy. There are various experimental factors that contribute to noisy log-2 ratio measurements, so it can be expected that this aspect of aCGH data will continue to be problematic.
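For intuition, with copy number 3 in a pure test sample and the normal 2 in the reference, the ideal log-2 ratio is log2(3/2) ≈ 0.585, and a heterozygous deletion gives log2(1/2) = −1; a trivial helper (ours, for illustration only):

```python
import math

def ideal_log2_ratio(test_copy_number, reference_copy_number=2.0):
    """Noise-free log-2 ratio of test vs. reference copy number for one clone."""
    return math.log2(test_copy_number / reference_copy_number)

# ideal_log2_ratio(3) ~= 0.585 (single-copy gain); ideal_log2_ratio(1) == -1.0
```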
Fig. 1. Subset of aCGH data from Coriell cell line [7]. Each point represents the log-2 ratio value of one clone.
1.2 Our Proposed Method: Log-2 Ratio Triangulation

We propose a linear-time algorithm to locate chromosomal breakpoints in array CGH data by triangulating¹ log-2 ratio values, which enables us to exploit the properties of triangles created from those values. Our results show that this method is not only fast, but is robust against aCGH datasets with low to moderate noise levels.
2 Related Work

Locating regions of distinct copy number can be generalized to the problem of detecting change points, which are specific locations in a numeric set where the local distribution changes. The Circular Binary Segmentation (CBS) algorithm of Olshen et al. finds change points in array CGH data by partitioning a chromosome until a partition is found that maximizes a t-statistic [8].
¹ Note that we define triangulation as the task of forming a triangle from 3 points in two-dimensional space.
In two comparison papers that evaluate aCGH analysis algorithms [9,10], the CBS algorithm consistently returned good results, but was computationally slow. A change to the CBS algorithm [11] sought to address this issue, and while the speed improvements were sound, the algorithm still produced a significant amount of overhead for certain instances of their datasets. The permutation test performed by the CBS algorithm results in a running time of O(n²), where n is the number of clones. Other methods, such as the hidden Markov model (HMM) methods of Guha [12] and Fridlyand [13], could require quadratic time because the algorithms associated with HMMs (i.e., Viterbi, Baum-Welch, forward-backward) all require at least Θ(n) time in the number of observations [14], though the running times of these methods may depend on the number and connectivity of the hidden states in the model. The HMM-based methods differ from segmentation algorithms such as CBS because the HMM methods generally assign clones to states, where the states are "gain" or "loss" of a specific number of copies. Dynamic programming based approaches, such as that proposed by Daruwala et al. [15], could potentially have high asymptotic running times because of the time required to fill in the tabular information used by the algorithm.
3 Log-2 Ratio Triangulation
The idea of the triangulation method is to exploit the behavior of log-2 ratio values that is often seen at change point locations. By forming triangles from every three consecutive ratio values on the chromosome, we seek to computationally compare the triangles formed in segments of constant copy number to triangles formed at the segment boundaries. To distinguish between these two types of triangles, we define two score functions S and G that assign values to each triangle. The S-score for a triangle τ is defined by the following equation:

S(τ) := F · cH · T°   (1)
where F is the value in radians of the largest interior angle of τ, T is the triangle's tilt off the x-axis in degrees, H is the height of τ, and c is a user-specified parameter whose effects are explored in Section 4.² Figure 2 provides an illustrative example of the T, F, and H dimensions and how they are determined from log-2 ratio values. After calculating the value in (1), S(τ) is then given as input to the following function G(S), which we refer to as the G-score:

G(S(τ)) := arctan(γ(S(τ) − d))   (2)
In (2), the variable γ is the estimated noise dispersion from the segment means of all log-2 ratios for the given input set of clones. The variable d is a user-specified constant that specifies the x-intercept of the arctangent function.
² We specify the tilt dimension in degrees so that it produces a larger value to aid in separating triangles at breakpoints from triangles in constant copy number segments.
Fig. 2. Illustrative example of T, F, and H for a given triangle. The points x1, x2, and x3 are log-2 ratio values in the aCGH data, and triangles are created for every three consecutive log-2 ratio values.
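The following Python sketch computes F, T, and H from three consecutive (position, log-2 ratio) points and combines them as in (1); note that reading (1) as the product F · cH · T, with T the tilt of the base x1-x3, is our reconstruction of the formula and of Fig. 2, not a verbatim transcription:

```python
import math

def s_score(p1, p2, p3, c=2.0):
    """S-score of the triangle on points p_i = (position, log-2 ratio)."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    side_a, side_b, side_c = dist(p2, p3), dist(p1, p3), dist(p1, p2)
    # largest interior angle F (radians) via the law of cosines
    F = 0.0
    for opp, s1, s2 in ((side_a, side_b, side_c),
                        (side_b, side_a, side_c),
                        (side_c, side_a, side_b)):
        cos_val = (s1 ** 2 + s2 ** 2 - opp ** 2) / (2 * s1 * s2)
        F = max(F, math.acos(max(-1.0, min(1.0, cos_val))))
    # tilt T of the base p1-p3 off the x-axis, in degrees
    T = abs(math.degrees(math.atan2(p3[1] - p1[1], p3[0] - p1[0])))
    # height H of p2 over the base: twice the triangle area divided by the base
    twice_area = abs((p3[0] - p1[0]) * (p2[1] - p1[1])
                     - (p2[0] - p1[0]) * (p3[1] - p1[1]))
    H = twice_area / side_b if side_b > 0 else 0.0
    return F * c * H * T
```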
The effects of this parameter are studied further in Section 4. We calculate γ by first performing median filtering on the input log-2 ratios (with a window size of 20); for each original ratio value, we take the difference between that value and its corresponding filtered value. We then sum the differences (i.e., their absolute values) of all clones to their filtered values and divide the resulting sum by the number of clones in the input set. The result is a value that serves as a measure of the noise level of the input data; such a value is important because array CGH datasets vary in the levels of noise they contain.³ The γ parameter also affects the curvature of the sigmoid and the rate at which the upper and lower asymptotes are reached. A larger γ value gives a steeper curve with a larger rate of change toward the lower and upper asymptotes. This effect is desirable because γ varies directly with noise, and noise varies directly with S-score values. Thus, a noisy dataset will have triangles with larger S-scores, and we want our sigmoid G function to account for these larger values. The idea of the score functions is not only to emphasize the importance of the T, F, and H dimensions, but to exploit the properties of triangles that are formed at the breakpoint locations. These triangles tend to have larger heights, greater tilt, and larger maximum interior angles than other triangles. Since these triangles are formed from log-2 ratio values in the data, we are essentially exploiting the noisiness of the data and projecting the related properties onto properties of the associated triangles. Equations (1) and (2) also allow our method to numerically separate triangles seen at copy number transition points from triangles obtained from clones with the same copy number. For our purposes, we want to place smaller triangles with low S-score values in the lower asymptote region of the G-score curve, while triangles with higher S-scores should be placed in the upper asymptote region.
³ We do not refer to this value as the estimated noise standard deviation because we do not take the mean of the values in the window, nor do we square the difference between the observed and expected values. We observed that such a method significantly misestimates the true noise standard deviation.
We chose a sigmoid as our G function because, beyond a certain point, we do not want scores to increase: a large score value for a single triangle would hinder our algorithm's ability to locate change points indexed by smaller triangles.
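A short sketch of the γ estimate and the G-score of (2) follows (function names ours; SciPy's median filter needs an odd window, so 21 stands in for the paper's window of 20):

```python
import numpy as np
from scipy.signal import medfilt

def estimate_gamma(log_ratios, window=21):
    """Mean absolute deviation of the log-2 ratios from their median-filtered values."""
    filtered = medfilt(log_ratios, kernel_size=window)
    return float(np.mean(np.abs(log_ratios - filtered)))

def g_score(s, gamma, d):
    """Sigmoid G-score of Equation (2): arctan(gamma * (S - d))."""
    return np.arctan(gamma * (s - d))
```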
4 The Algorithm
The Triangulator algorithm is shown in Fig. 3. The algorithm clearly runs in O(n) time, which is an improvement over many methods that address this problem. It first reads in an array of clones A ordered by genomic position; each element in A contains information about the clone's genomic location and its log-2 ratio. The genomic locations of the clones are temporarily stored in an array P so that we can "normalize" the distances between the clones, ensuring that all triangles formed are of the same width. The original genomic position values are indexed by the value of the new position, so at termination, the original positions are used in estimating change point locations. For every three consecutive clones, the log-2 ratios of the clones are used to create a triangle. This step is shown on line 8. The Triangulate function simply creates a triangle from the three clones given as input, where the first, second, and third parameters correspond to the first, second, and third points in the triangle, respectively, ordered by their position on the x-axis. The G-score of each triangle is stored in the array M, and on line 11 the actual triangles are stored in the array T. Moreover, each triangle is indexed by its first point on the x-axis, as shown on line 9. This indexing facilitates the process of associating triangles with genomic locations. Lines 14 through 17 identify an initial set of triangles that are candidates for breakpoint locations. For each triangle in the array, its G-score is compared to the mean G-score of all triangles in T. If the G-score for a given triangle is not within one standard deviation of the mean of all G-scores, then the triangle is a candidate for a true breakpoint. All such triangles are stored in the array Ta in increasing order of genomic position. Lines 18 through 33 are the part of the algorithm that determines whether the aberrant triangles in Ta are indexed at true change points or simply at noisy points. The index of the kth aberrant triangle is stored in e at line 20. The inner for-loop at line 25 loops between the index values of consecutive aberrant triangles. In this step, the segment of the chromosome between the indexed aberrant triangles is analyzed by triangulating the 1st point of the triangle at T[e] with the 2nd and 3rd points of the triangle indexed by p. The score value of the resulting triangle is stored in the set R, and this process is repeated for all points in the current segment. Referring to lines 31 through 33, if the aberrant triangle Ta[k] is within 1 standard deviation of the mean of the score values in R, the clone indexed by Ta[k] is identified as a change point and is added to the set C.
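Because Fig. 3 reproduces the pseudocode only as a figure, the outline below restates the steps just described in Python under our own naming; it is a sketch of the control flow, not the authors' implementation, and it delegates triangle scoring to callables like those sketched in Section 3:

```python
import numpy as np

def triangulator(score_of, score_between, n):
    """Outline of the Triangulator over n clones.

    score_of(i): G-score of the triangle on clones i, i+1, i+2.
    score_between(e, p): G-score of the triangle joining the 1st point of
    triangle e with the 2nd and 3rd points of triangle p (both callables
    are assumed to be supplied by the caller).
    """
    scores = np.array([score_of(i) for i in range(n - 2)])
    mu, sd = scores.mean(), scores.std()
    # lines 14-17: triangles whose G-score is > 1 std. dev. from the mean
    aberrant = [i for i in range(n - 2) if abs(scores[i] - mu) > sd]

    change_points = []
    for k, e in enumerate(aberrant):                      # lines 18-33
        nxt = aberrant[k + 1] if k + 1 < len(aberrant) else n - 2
        seg = [score_between(e, p) for p in range(e + 1, nxt)]
        # keep e if its score is within 1 std. dev. of the segment scores R
        if seg and abs(scores[e] - np.mean(seg)) <= np.std(seg):
            change_points.append(e)
    return change_points
```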
Fig. 3. CGH Triangulator Algorithm
5 Experiments

5.1 Study on Real Dataset: Coriell Cell Line
We first tested our algorithm on an array CGH dataset from the Coriell Institute that was first published in [7]. This dataset is annotated with true copy number state information, which allowed for a quantitative evaluation of the accuracy of our segmentation method. Moreover, it is important to have this information for real data so that we can make predictions as to the performance of our method on real datasets.

Design and Methods. To measure the accuracy of predicted breakpoints, we measured the sensitivity (SE) as the proportion of true breakpoints identified to all real breakpoints, and the false discovery rate (FDR) as the proportion of falsely predicted breakpoints to all predicted breakpoints. We also ran a Matlab implementation of the CBS algorithm and compared the performance of the two programs with regard to accuracy and computation time. For the user parameters c and d, referenced in Section 3, we ran the experiment for c = {2, 3} and, for each value of c, we chose d = {100, 200, 300, 400, 500, 600, 700, 800, 900, 1000}. For c > 3, we noticed that the method returned a high number of false positives, and for c = 1, a low number of true positives was returned. Thus, we focused our attention only on two values of c.

Results and Discussion. The results from this study are provided in Tables 1 and 2, and the results of the CBS algorithm on this dataset are given in Table 3. The tables suggest that the results are generally robust against different choices of the parameters c and d, though the proportion of false positives is higher for the c = 3 experiment. This is especially true for lower values of d when c = 3, because the S-score values are higher, which causes the G-score function to lose effectiveness in separating normal and aberrant triangles. We note, however, that the number of false positives decreased as d grew larger. In both sets of results, we observed that as d grows larger, the FDR decreases before the method begins to lose sensitivity. Although not reported in Table 2, we observed that for c = 3, the value of d can increase to approximately 10000 before the method begins to lose sensitivity.⁴ For c = 2, the drop in sensitivity occurs much sooner. As suggested in [9], we used a variable w to represent the maximum allowed localization error of clones, so a breakpoint was correctly identified if it was within w clones of the true breakpoint. For both values of c, the triangulation method generally outperforms CBS with regard to FDR, while SE is never less than 0.80 for w = 1. For w = 0, the Triangulator algorithm is clearly more precise than the CBS algorithm on this dataset. The sensitivity of both algorithms improved when we increased w from 0 to 1, mainly because the data does not have sharp breakpoints in some cases. For our algorithm, reasons for loss of sensitivity include lack of breakpoint "sharpness" and segments that were of length 1.
⁴ For c = 3 and d = 10000, SE = 0.86, FDR = 0.22.
Our method relies on breakpoints that are defined by large, abnormal triangles. But on occasion, the breakpoints are defined by smaller triangles whose points are gradually increasing. There is also a loss of sensitivity with segments of length 1, because the method treats them as single, noisy data points. We observed that CBS will also miss segments of length 1. As previously stated, our reported FDR is generally smaller than that of CBS, which is a desirable feature. In regard to time, our study confirmed our assumption that the Triangulator algorithm would be much faster than the CBS algorithm. In each case, our algorithm segmented the 15 cell lines in under 11 seconds, while the CBS algorithm needed over 25 minutes to segment these data.

Table 1. For c = 2. Results of Triangulator algorithm on Coriell data. Each test run required approximately 11 seconds.

d      SE, FDR (w = 0)   SE, FDR (w = 1)
100    0.77, 0.58        0.82, 0.55
200    0.80, 0.375       0.86, 0.32
300    0.80, 0.31        0.86, 0.25
400    0.77, 0.32        0.84, 0.26
500    0.72, 0.36        0.81, 0.26
600    0.70, 0.34        0.80, 0.25
700    0.70, 0.30        0.80, 0.20
800    0.70, 0.32        0.80, 0.24
900    0.70, 0.26        0.80, 0.16
1000   0.70, 0.32        0.80, 0.16

Table 2. For c = 3. Results of Triangulator algorithm on Coriell data. Each test run required approximately 11 seconds.

d      SE, FDR (w = 0)   SE, FDR (w = 1)
100    0.77, 0.80        0.82, 0.55
200    0.82, 0.71        0.89, 0.68
300    0.80, 0.65        0.86, 0.65
400    0.80, 0.60        0.86, 0.57
500    0.80, 0.59        0.86, 0.55
600    0.80, 0.57        0.80, 0.54
700    0.80, 0.51        0.86, 0.46
800    0.80, 0.49        0.86, 0.44
900    0.82, 0.47        0.86, 0.44
1000   0.82, 0.45        0.86, 0.41
Table 3. Results of the CBS algorithm on the Coriell dataset. Results are consistent with those in [8].

Program   SE, FDR (w = 0)   SE, FDR (w = 1)   Time
CBS       0.600, 0.769      0.900, 0.654      > 25m0s
5.2 Study on Simulated Dataset
The second part of our study involved the application of our algorithm to synthetic aCGH data. The simulation model used was that of Willenbrock et al. [9]. In this paper, they assume that not all cells in a sampled culture will be affected by copy number changes, but that only a proportion of cells will be diseased. Thus, they assume that the expected log-2 ratio of a clone is given by the following equation:

ratio := log2[(c · Pt + 2 · (1 − Pt)) / 2]   (3)
where c is the copy number state (0,1,2,3,4,5) and Pt is the proportion of tumorous cells typically seen in biopsies, sampled from a uniform distribution on [0.3,0.7]. After adding Gaussian noise sampled from Normal(0, σ), where σ is sampled from Uniform(0.1, 0.2), the final value of the clone's log ratio is determined.

Design and Methods. As in the study involving real aCGH data, we measured the sensitivity and false discovery rates of breakpoint prediction accuracy using the simulated data created from the model in [9]. However, we not only wanted to measure the accuracy of predicted breakpoints, but also to determine the level of noise that would cause our algorithm to give undesirable results. We therefore altered the simulation model by assigning different values to the parameters of the uniform distribution used for the σ variable. The σ parameter is the standard deviation of the Gaussian noise added to the log-2 ratios. By restricting the values of these parameters at each step, we determine the suboptimal noise level by measuring the sensitivity and false discovery rate of predicted breakpoints. For this part of the experiment, we measured sensitivity and FDR for parameter (σ) ranges of [0.0,0.05], [0.0,0.1], [0.05,0.1], and [0.1,0.15]. For each of these parameter settings, we generated 500 samples with 20 chromosomes, each with 100 clones. This is similar to the experiment performed in [9]. As in the study on the Coriell data, we calculated SE and FDR for w = 0,1. We did not redo this experiment using CBS because the comparison paper [9] had already done an extensive study of CBS on the simulated data. For this part of the study, we chose values of c = {2,3} and d = {100,200,300,400,500}.

Results and Discussion. The results of this study are given in Tables 4 and 5. As seen in the tables, the method performs well on data with low to moderate noise levels, but loses accuracy as the amount of noise increases. Increasing w from 0 to 1 causes a slight increase in sensitivity, but a sharper decline in the number of false positives. This improvement from w = 0 to w = 1 is consistent with the results from the Coriell data, in which the results also improved when the offset value increased. It should be noted that the CBS algorithm gave more consistent results even with a noise level higher than those specified in our study (i.e., σ ∈ [0.1, 0.2]). Our method would likely be best suited for microarray experiments in which the researcher can expect low to moderate noise levels. An example of real data with such noise levels is the Coriell data, which we observed had a noise variance of around 0.1. Furthermore, we used a simulation model that assumed that a sample from a cell line contained only a portion of cells affected by copy number changes. As a result, the log-2 ratio values calculated for each clone were smaller than they would be under the assumption that all cells in a culture were affected by copy number changes. Ultimately, if the data is very noisy, then triangles within noisy segments will be very large. Moreover, breakpoints will not be sharply defined by single, oblong triangles, but may instead be defined by multiple triangles that are indistinguishable from other triangles in the data. This lack of sharply defined breakpoints will certainly cause the method to lose sensitivity.
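For reference, a sketch of generating one clone's log-2 ratio under the model of (3); the variable names are ours, and as noted in the comment, the paper draws σ once per sample rather than per clone:

```python
import numpy as np

def simulate_log_ratio(c, rng, sigma=None):
    """Simulated log-2 ratio of one clone with copy number state c, per Eq. (3)."""
    p_t = rng.uniform(0.3, 0.7)                 # proportion of tumorous cells
    if sigma is None:                           # in the paper, sigma is sampled
        sigma = rng.uniform(0.1, 0.2)           # once per sample, not per clone
    ratio = np.log2((c * p_t + 2 * (1 - p_t)) / 2)
    return ratio + rng.normal(0.0, sigma)

# e.g., a single-copy gain: simulate_log_ratio(3, np.random.default_rng(0))
```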
Table 4. For c = 2. Results of Study on Simulated Data. Each test run required approximately 5m30s.

d     Noise       SE, FDR (w=0)   SE, FDR (w=1)
100   0.0, 0.05   0.80, 0.02      0.82, 0.005
100   0.0, 0.1    0.77, 0.07      0.80, 0.03
100   0.05, 0.1   0.70, 0.17      0.76, 0.09
100   0.1, 0.15   0.55, 0.72      0.70, 0.64
200   0.0, 0.05   0.75, 0.03      0.77, 0.006
200   0.0, 0.1    0.72, 0.07      0.75, 0.02
200   0.05, 0.1   0.65, 0.13      0.72, 0.05
200   0.1, 0.15   0.54, 0.64      0.67, 0.54
300   0.0, 0.05   0.73, 0.03      0.74, 0.005
300   0.0, 0.1    0.69, 0.06      0.72, 0.02
300   0.05, 0.1   0.65, 0.12      0.69, 0.03
300   0.1, 0.15   0.52, 0.59      0.65, 0.49
400   0.0, 0.05   0.71, 0.02      0.73, 0.006
400   0.0, 0.1    0.67, 0.07      0.71, 0.02
400   0.05, 0.1   0.61, 0.12      0.67, 0.03
400   0.1, 0.15   0.52, 0.56      0.64, 0.45
500   0.0, 0.05   0.70, 0.03      0.715, 0.005
500   0.0, 0.1    0.65, 0.07      0.69, 0.01
500   0.05, 0.1   0.60, 0.11      0.66, 0.02
500   0.1, 0.15   0.51, 0.53      0.63, 0.42
Table 5. For c = 3. Results of Study on Simulated Data. Each test run required approximately 5m30s.

d     Noise       SE, FDR (w=0)   SE, FDR (w=1)
100   0.0, 0.05   0.91, 0.02      0.92, 0.006
100   0.0, 0.1    0.86, 0.16      0.90, 0.12
100   0.05, 0.1   0.78, 0.35      0.86, 0.28
100   0.1, 0.15   0.56, 0.86      0.76, 0.80
200   0.0, 0.05   0.88, 0.02      0.90, 0.005
200   0.0, 0.1    0.84, 0.12      0.88, 0.08
200   0.05, 0.1   0.76, 0.26      0.83, 0.18
200   0.1, 0.15   0.56, 0.86      0.76, 0.80
300   0.0, 0.05   0.86, 0.02      0.88, 0.005
300   0.0, 0.1    0.82, 0.10      0.85, 0.06
300   0.05, 0.1   0.74, 0.21      0.82, 0.13
300   0.1, 0.15   0.56, 0.80      0.74, 0.73
400   0.0, 0.05   0.85, 0.02      0.86, 0.005
400   0.0, 0.1    0.82, 0.09      0.85, 0.05
400   0.05, 0.1   0.74, 0.20      0.81, 0.11
400   0.1, 0.15   0.56, 0.78      0.73, 0.71
500   0.0, 0.05   0.83, 0.02      0.85, 0.005
500   0.0, 0.1    0.80, 0.08      0.84, 0.04
500   0.05, 0.1   0.73, 0.18      0.80, 0.10
500   0.1, 0.15   0.56, 0.76      0.72, 0.70
The simulated dataset appears to be more sensitive to the choice of c than the Coriell dataset, though for both values of c the method generally has a low FDR. Regarding sensitivity, lower values of d increase the number of true positives, and the number of true positives increases significantly for c = 3. For both values of c, the proportion of false positives was clearly higher in the Coriell dataset than in the simulated dataset. This difference can be attributed to differences between the two datasets, such as the number of copy number states, the noise standard deviation, the copy number state distribution, and the assumption of a proportion of affected cells.⁵ These differences between the datasets likely account for the differences seen in the results, so it is important to know how these properties may affect results in microarray experiments. Our implementation of the Triangulator algorithm required approximately 5m30s to process 500 simulated cell lines on a 3 GHz, 15 GB RAM server. Conversely, the Matlab implementation of CBS required approximately 6h40min on a 2.4 GHz, 3 GB machine. Although the Triangulator algorithm was implemented on a more powerful machine, the difference in computation time is still vast. More importantly, to identify copy number alterations at finer resolutions, researchers have investigated approaches using newer technologies such as SNP genotyping platforms [16,17] and next-generation sequencing techniques [18]. The number of SNPs from SNP chips and the number of reads generated from massively parallel sequencing machines are several orders of magnitude greater than the number of clones in aCGH-based experiments.
⁵ The simulation model used in our experiments does not assume that all cells in a culture are affected by copy number changes.
To develop approaches for such datasets, it is important that they are fast and effective. Our linear algorithm has the potential to handle large datasets from these newer technologies.
6 Conclusion and Future Work
We have presented a linear-time algorithm for locating copy number breakpoints in array CGH data. Our results show that the method works best on data where the noise levels are not high. As a result, a researcher must have an idea of what kinds of noise levels to expect for her microarray experiments before employing such a method. Future work will focus on increasing the robustness of the method against large amounts of noise. Possible ways of addressing this issue would be to triangulate every two or three log-2 ratios in addition to triangulating consecutive values. By taking the union of the results of each triangulation, the result should be more robust against breakpoints that are not sharply defined by a single triangle. Another direction is to investigate new score functions, or to explore alternative parameters for our existing score functions. More detailed analysis of false positives and false negatives may guide us to define new score functions that better distinguish true breakpoints from false ones. We will also investigate the applicability of our approach on datasets that are based on newer technologies such as those generated by massive parallel sequencing machines.
Acknowledgments

We would like to thank Xiaolin Yin for his help in performing our experiments. This work is supported in part by NIH/NLM (grant LM008991), NIH/NCRR (grant RR03655), NSF (grant CRI0551603) and a start-up fund from Case Western Reserve University.
References

1. Veltman, J.A., Fridlyand, J., Pejavar, S., Olshen, A., Korkola, J., DeVries, S., Carroll, P., Kuo, W., Pinkel, D., Albertson, D., Cordon-Cardo, C., Jain, A., Waldman, F.: Array-based comparative genomic hybridization for genome-wide screening of DNA copy number in bladder tumors. Cancer Res. 63, 2872–2880 (2003)
2. Whang-Peng, J., Kao-Shan, C., Lee, E., Bunn, P., Carney, D., Gazdar, A., Minna, J.: Specific chromosome defect associated with human small cell lung cancer; deletion 3p(14-23). Science 215, 181–182 (1982)
3. de Leeuw, R., Davies, J., Rosenwald, A., Bebb, G., Gascoyne, D., Dyer, M., Staudt, L., Martinez-Climent, J., Lam, W.: Comprehensive whole genome array CGH profiling of mantle cell lymphoma model genomes. Hum. Mol. Genet. 13(17), 1827–1837 (2004)
4. Fridlyand, J., Snijders, A., Ylstra, B., Li, H., Olshen, A., Segraves, R., Dairkee, S., Tokuyasu, T., Ljung, B., Jain, A., McLennan, J., Ziegler, J., Chin, K., DeVries, S., Feiler, H., Gray, J., Waldman, F., Pinkel, D., Albertson, D.: Breast tumor copy number aberration phenotypes and genomic instability. BMC Cancer 6, 96 (2006)
5. Wang, Y., Makedon, F., Pearlman, J.: Tumor classification based on DNA copy number aberrations determined using SNP arrays. Oncology Reports 15, 1057–1061 (2006)
6. Pinkel, D., Albertson, D.G.: Array comparative genomic hybridization and its applications in cancer. Nat. Genet. 37(suppl.), S11–S17 (2005)
7. Snijders, A., Nowak, N., Segraves, R., Blackwood, S., Brown, N., Conroy, J., Hamilton, G., Hindle, A., Huey, B., Kimura, K., Law, S., Myambo, K., Palmer, J., Ylstra, B., Yue, J., Gray, J., Jain, A., Pinkel, D., Albertson, D.: Assembly of microarrays for genome-wide measurement of DNA copy number. Nat. Genet. 3, 263–264 (2001)
8. Olshen, A., Venkatraman, E.: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5, 557–572 (2004)
9. Willenbrock, H., Fridlyand, J.: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics 21(22), 4084–4091 (2005)
10. Lai, W., Johnson, M., Kucherlapati, R., Park, P.: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 21, 3763–3770 (2005)
11. Venkatraman, E., Olshen, A.: A Faster Circular Binary Segmentation Algorithm for the Analysis of Array CGH Data. Bioinformatics 23(6), 657–663 (2007)
12. Guha, S., Li, Y., Neuberg, D.: Bayesian Hidden Markov Modeling of Array CGH Data. Harvard University Biostatistics Working Paper Series, Working Paper 24 (2006)
13. Fridlyand, J., Snijders, A., Pinkel, D., Albertson, D., Jain, A.: Hidden Markov models approach to the analysis of array CGH data. J. Multivar. Anal. 90, 132–153 (2004)
14. Durbin, R., et al.: Biological Sequence Analysis. Cambridge University Press, Cambridge (1999)
15. Daruwala, R., Rudra, A., Ostrer, H., Lucito, R., Wigler, M., Mishra, B.: A versatile statistical analysis algorithm to detect genome copy number variation. Proc. Natl. Acad. Sci. 101(46), 16292–16297 (2004)
16. Redon, R., Ishikawa, S., Fitch, K.R., Feuk, L., Perry, G.H., Andrews, T.D., Fiegler, H., Shapero, M.H., Carson, A.R., Chen, W., Cho, E.K., Dallaire, S., Freeman, J.L., Gonzalez, J.R., Gratacos, M., Huang, J., Kalaitzopoulos, D., Komura, D., MacDonald, J.R., Marshall, C.R., Mei, R., Montgomery, L., Nishimura, K., Okamura, K., Shen, F., Somerville, M.J., Tchinda, J., Valsesia, A., Woodwark, C., Yang, F., Zhang, J., Zerjal, T., Armengol, L., Conrad, D.F., Estivill, X., Tyler-Smith, C., Carter, N.P., Aburatani, H., Lee, C., Jones, K.W., Scherer, S.W., Hurles, M.E.: Global variation in copy number in the human genome. Nature 444(7118), 444–454 (2006)
17. Carter, N.P.: Methods and strategies for analyzing copy number variation using DNA microarrays. Nat. Genet. 39(suppl. 7), S16–S21 (2007)
18. Chiang, D.Y., Getz, G., Jaffe, D.B., O'Kelly, M.J., Zhao, X., Carter, S.L., Russ, C., Nusbaum, C., Meyerson, M., Lander, E.S.: High-resolution mapping of copy-number alterations with massively parallel sequencing. Nat. Methods 6(1), 99–103 (2009)
Mining of cis-Regulatory Motifs Associated with Tissue-Specific Alternative Splicing Jihye Kim, Sihui Zhao, Brian E. Howard, and Steffen Heber Bioinformatics Research Center, North Carolina State University, Raleigh, NC, 27606, U.S.A. {jkim4,szhao3,behoward,sheber}@ncsu.edu
Abstract. Alternative splicing (AS) is an important post-transcriptional mechanism that can increase protein diversity and affect mRNA stability and translation efficiency. Many studies targeting the regulation of alternative splicing have focused on individual motifs; however, little is known about how such motifs work in concert. In this paper, we use distribution-based quantitative association rule mining to find combinatorial cis-regulatory motifs and to investigate the effect of motif pairs. We also show that motifs that occur in motif pairs typically occur in clusters. Keywords: Alternative splicing, cis-regulatory motifs, association rule mining, quantitative association rule mining.
1 Introduction

Alternative splicing (Fig. 1) plays an important role in the generation of protein diversity and subcellular localization, interacting with other regulatory processes such as transcription and signal transduction [1]. Furthermore, alternative splicing (AS) is also known to be involved in a variety of diseases including familial isolated GH deficiency type II (IGHD II), Frasier syndrome, and myotonic dystrophy; see [2, 3] for a detailed review. It is estimated that up to 70% of human genes are alternatively spliced [4]. This percentage might even increase if one takes into account that AS events often occur only in specific tissues or developmental stages [5]. The mechanism and regulation of splicing is a complex process, affected by multiple factors, including exon size, trans-factors (e.g. SR proteins), and cis-regulatory motifs [6]. Often, cis-regulatory motifs are found near the corresponding splice site [7], where they act either as binding sites for trans-regulatory factors, or by forming loop structures on pre-mRNA [8]. These motifs can be further categorized according to their location and function as exonic splicing enhancers, intronic splicing enhancers, exonic splicing silencers, or intronic splicing silencers. Multiple studies have searched for splicing signals using experimental and computational methods [9-11]. For example, Yeo and colleagues [5] showed that cis-regulatory elements might influence the amount and the type of alternative splicing. Researchers have also analyzed the frequencies of k-mers in exon and intron regions
neighboring splice sites [1, 5, 9]. Other studies have analyzed splice sites using a Gibbs sampling algorithm [12], or used a support vector machine to identify regulatory elements [10]. Recently, Burge's group suggested a statistical method to find interacting pairs of cis-regulatory elements by detecting co-occurring conserved intron motifs [13]. In this paper, we use quantitative association rule mining to discover individual exonic/intronic sequence motifs, and motif pairs, that influence AS in mouse tissues. In the Methods Section we describe the dataset used and our rule mining algorithm. In the Results Section we describe the statistically significant rule set that was inferred. Finally, we discuss our validation strategy and the biological significance of our results.
Fig. 1. Alternative splicing (exon skipping). Rectangles represent exons in the gene sequence. Two mRNA isoforms and their corresponding proteins are generated. Exon 2 (marked in grey) can be either included or skipped.
2 Methods

2.1 Datasets

In a seminal paper, Pan and colleagues measured alternative splicing patterns of thousands of mouse genes using a custom microarray [14]. This work resulted in the publication of exon skipping rates for 3126 alternatively spliced exons from 2647 genes in 10 tissues. To estimate these exon skipping rates, Pan and colleagues employed a generative model for the Alternative Splicing Array Platform (GenASAP) [15]. In this model, the exon skipping rate is defined as the observed expression value of the transcript isoform lacking the cassette exon divided by the total expression of both isoforms. For example, Fig. 2 shows the exon skipping rates of the BG046833 gene in 10 tissues. In our research, we used Pan's dataset to search for association rules linking possible regulatory motifs to differential exon skipping rates. First, we retrieved 3126 whole-length transcripts from NCBI using GenBank [16] identifiers provided by Pan and colleagues. We trimmed the polyA tails using the trimest program from the EMBOSS package [17], and then mapped the transcripts onto the mouse genome (Build 36 v.1, released in May 2006) via BLAT [18]. Only
Fig. 2. (A) Probe design in Pan’s quantitative microarray platform. The dark grey rectangle represents the alternatively skipped cassette exon; light grey rectangles are up and downstream exons. Six probes (C1, A, C2, C1-A, A-C2, C1-C2) are either within the exon or intron or on the boundary. (B) Exon skipping rate of the BG046833 gene in 10 mouse tissues.
transcripts which aligned with more than 95% identity over the whole transcript were used in our experiments. After alignment, each transcript can have more than one partial match to the genome sequence (called "blocks" in BLAT), indicating potential exons. Blocks separated by less than 5bp were merged. We compared Pan's cassette exons and their neighboring constitutive exons with the corresponding set of alignment blocks. Only genes where the exon borders differed by less than 5bp from the corresponding block borders were retained for our study, resulting in a total of 2565 alternatively spliced pre-mRNA sequences.
2.2 Quantitative Association Rules
Association rule mining (ARM) is a tool used to find interesting relationships or associations among a set of items (see ref. [19]). An association rule is an expression of the form
X => Y, where X and Y are disjoint sets of items. Given a database D of transactions, where a transaction T ∈ D is a set of items, a rule indicates that whenever T contains X, then T probably contains Y as well. The support of an association rule is the number of transactions for which X and Y both appear together; a rule's confidence is the support divided by the number of transactions for which X appears, regardless of Y. Given a user-defined minimum support and confidence, the goal of ARM is to find all association rules that satisfy these thresholds. The majority of ARM applications have dealt with databases of categorical items (e.g., market basket databases), and many algorithms for dealing with this type of data have been suggested [19-24]. However, for many practical problems, there are also important quantitative attributes that are measured on a numerical scale. Since quantitative attributes, in general, cannot be treated as categorical ones, it is necessary to define quantitative association rules and corresponding rule mining algorithms. Several recent papers have addressed this problem [25-27]. One of the most popular methods is based on discretization (also called binning) of quantitative attributes. This method allows rules to be described in terms of numeric intervals, for example, X ∈ [20, 30] => Y ∈ [5, 10]. Although this method is straightforward, its results are often sensitive to the choice of bin size. Aumann and Lindell [26] proposed an alternative approach that overcomes the challenge of choosing the "correct" bin size. Their method represents the distribution of continuous data using standard statistical measures such as mean and variance. Under this framework, a quantitative association rule is an association between a subset of a database (left-hand side of a rule) and its "extraordinary" behavior (right-hand side of the rule). For example, the quantitative rule {A, B} => {mean(X) = 68.7} relates categorical items A and B to X, a quantitative attribute. This rule is interesting if it reveals that a group containing A and B has a significantly different average value for X than the rest of the data.
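To make the support/confidence and mean-difference definitions above concrete, here is a minimal sketch in Python. It is illustrative only: transactions are modeled as plain item sets (the paper's heptamer rules additionally use occurrence counts, which would require multisets), and all function names are our own.

```python
from statistics import mean

def support(transactions, itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(transactions, lhs, rhs):
    """Support of lhs union rhs, divided by the support of lhs alone."""
    return support(transactions, lhs | rhs) / support(transactions, lhs)

def mean_difference(records, lhs, attribute):
    """Aumann-Lindell-style quantitative rule: difference between the mean
    of a numeric attribute inside the subset selected by the rule's
    left-hand side and its mean over the rest of the database.
    records: list of (item_set, {attribute: value}) pairs."""
    inside = [vals[attribute] for items, vals in records if lhs <= items]
    outside = [vals[attribute] for items, vals in records if not lhs <= items]
    return mean(inside) - mean(outside)
```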
2.3 Heptamer Association Rules
The goal of our study is to apply quantitative association rule mining to find sequence motifs associated with tissue-specific alternative splicing. We searched for interesting rules of the form "a set of heptamer(s) => exon skipping rate", where a set of heptamer(s) from seven exon/intron regions are categorical attributes and the exon skipping rate is a quantitative attribute. An "interesting" rule indicates that genes which include a specific set of heptamer(s) are likely to show an extraordinary exon skipping rate in one or several tissue(s) as compared to the remaining genes. After testing k-mers with k ranging from 5 to 9, we chose heptamers because of their superior performance, and because they are capable of detecting binding sites of splicing factors such as SR proteins [5, 9, 28]. We defined seven regions around each alternatively spliced exon. Since it is assumed that the majority of cis-regulatory elements involved in splicing are found close to splice sites [7, 29], we restricted our analysis to 200 base pairs flanking the
splice sites. Under this framework, each gene corresponds to a transaction. Each transaction contains as items the counts for all occurrences of all possible heptamers in each of the 7 different gene regions. Figure 3 shows how the transaction database can be represented in tabular form. Each row corresponds to the transaction for a single gene. The table contains columns for each possible heptamer/region combination, for a total of 4^7 × 7 = 114,688 columns. Also included in the table are 10 additional columns containing the exon skipping rates for the various tissues. In Figure 3 (and throughout the text), the heptamers from a given region are prefixed with the region number; for example, the heptamer GGCAGAT from region 4 is designated by 4_GGCAGAT.
Fig. 3. For each alternatively spliced exon (grey box) we define seven regions (1-7) in the corresponding genomic sequence. The heptamer composition of each region is analyzed separately, and the corresponding heptamer counts are stored in an occurrence table.
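A sketch of how such an occurrence table can be built, assuming a helper that has already cut out the seven regions of a gene as plain DNA strings (the region-extraction step itself depends on the BLAT alignments described above and is not shown):

```python
from collections import Counter

K = 7  # heptamers

def heptamer_items(region_seqs):
    """region_seqs: dict mapping region number (1-7) to its DNA string.
    Returns a Counter keyed by items such as '4_GGCAGAT', giving the
    occurrence count of every heptamer in every region of one gene."""
    items = Counter()
    for region, seq in region_seqs.items():
        for i in range(len(seq) - K + 1):
            items[f"{region}_{seq[i:i + K]}"] += 1
    return items
```

One such Counter per gene corresponds to one row of the table in Fig. 3; the 10 skipping-rate columns are attached separately.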
2.3.1 Algorithm
To identify sequence motifs associated with changes in exon skipping rates, we used an adaptation of Aumann and Lindell's method. The algorithm for finding heptamer association rules follows three steps, outlined below:
1. Find all "frequent" heptamer sets, where a heptamer set is called frequent if its support is greater than a user-defined minimum support threshold.
2. For each frequent heptamer set and tissue type, compute the mean exon skipping rates for genes having the heptamer set, and genes lacking the heptamer set.
3. Identify and report "interesting" association rules using a t-test of the skipping rates computed in step 2. Association rules are considered interesting if the exon skipping rate is significantly different depending on whether the heptamer set on the left-hand side of the rule is found in the gene.
We computed frequent heptamer sets (which include both location and sequence information, as shown in Figure 3) based on the Apriori algorithm [19]. To efficiently compute frequent heptamer sets containing multiple heptamers, we used an itemset inclusion lattice, as described in [30]. The lattice, G = (V, E), is composed of nodes of frequent heptamer sets with edges showing parent/children relationships (Fig. 4). An item superset cannot be frequent if any of its subsets is not frequent. For example, in Fig. 4, the heptamer set {A, B} is frequent, and all of its subsets, {A} and {B}, are frequent as well. Also, {A, B, C} cannot be a frequent set because one of its subsets, {A, C}, is not frequent. To identify interesting rules, we used a standard t-test. Let D denote the full gene set; let T_A denote the set of genes which include a given frequent heptamer set A, which occurs on the left-hand side of a rule for some tissue t; and let D − T_A denote the remaining genes. Given the heptamer set A, we first compared μ_(T_A, t), the mean exon skipping rate in tissue t for genes containing this heptamer set, with μ_(D−T_A, t), the mean exon skipping rate of genes lacking it. The rule was reported if the corresponding null hypothesis was rejected at an alpha level of 0.05, after applying Bonferroni's method to adjust for multiple testing.
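The rule-selection step can be sketched as follows; this is our own illustrative rendering (using SciPy's two-sample t-test), not the authors' code, and it assumes the frequent heptamer sets have already been mined, e.g., with an Apriori implementation:

```python
from scipy.stats import ttest_ind

def interesting_rules(genes, frequent_sets, tissues, alpha=0.05):
    """genes: list of (item_set, {tissue: skipping_rate}) pairs.
    Returns (heptamer_set, tissue, mean_difference) triples for rules
    whose Bonferroni-adjusted p-value falls below alpha."""
    n_tests = len(frequent_sets) * len(tissues)  # Bonferroni denominator
    rules = []
    for hset in frequent_sets:
        for tissue in tissues:
            with_h = [r[tissue] for items, r in genes if hset <= items]
            without_h = [r[tissue] for items, r in genes if not hset <= items]
            t_stat, p = ttest_ind(with_h, without_h)
            if p * n_tests < alpha:  # equivalent to testing at alpha/n_tests
                diff = (sum(with_h) / len(with_h)
                        - sum(without_h) / len(without_h))
                rules.append((hset, tissue, diff))
    return rules
```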
Fig. 4. Lattice of frequent heptamer sets. Each node corresponds to a frequent heptamer set and contains mean exon skipping rates for each tissue. If {A, C} is not a frequent heptamer set, its superset {A, B, C} is not a frequent heptamer set. We compare each frequent node with the root node.
We noticed that the heptamer items for rules often overlapped, and could be simplified by substitution with a single, longer sequence motif on the left-hand side of the rule. To identify such cases, we analyzed the overlap and distance patterns of heptamer items involved in complex rules. If an overlapping heptamer pair exceeded the support threshold we replaced the heptamer pair by a single, larger sequence item and updated the rule correspondingly.
3 Results
We computed all association rules for minimum support values ranging from 20 to 70 in increments of 5, corresponding to 2.72% to 0.77% of the whole dataset. Based on previous experience, we assumed that 20 genes is the smallest number that can safely support sequence motifs as candidates for binding sites; we then increased the minimum support threshold and extracted the corresponding interesting rules until we could no longer find any interesting rules. In total, we mined 97 interesting rules, of which 3 contain multiple heptamers. There are 59 different heptamer sets and 71 individual frequent heptamers in the left-hand sides of the rules. The rules found for exon skipping rates in spleen tissue are listed in Table 1 of the Appendix. All rules extracted are statistically significant after correcting for multiple testing. We performed a permutation experiment to estimate the number of rules obtained from a randomized data set. To do so, we shuffled gene sequences and exon skipping rates and then re-ran our algorithm. This procedure was repeated 100 times. Using the same minimum supports, we found that the mean number of simple rules obtained from the randomized data sets was 14.7, compared to the 97 rules we found in the original database. Furthermore, we were unable to extract any complex rules using the randomized data sets. In general, the number of reported rules decreases with increasing minimum support, but some rules were especially robust. Several heptamer sets in region 4 (cassette exon) are commonly found for a wide range of minimum support values. For example, a rule with left-hand side GCTGGAG was reported for all tested support values in association rules describing exon skipping in brain, intestine, kidney, liver, lung, muscle and salivary tissue. This heptamer overlaps with the 5' end of a potential SC35 binding site. It has been shown that this binding site is crucial for the correct splicing of exon 5 of muscle-specific cardiac troponin T transcripts [31]. The complex rules we uncovered included two complex rules having heptamers from different regions. In brain tissue, the rule {6_TTTAAAA, 3_TTATTTT} => {meandiff(Brain) = -20.216} indicates that genes with both TTTAAAA in the downstream intron and TTATTTT in the upstream intron show, on average, a 20.216% lower exon skipping rate in brain compared to the other genes (Fig. 5A). Interestingly, neither of these heptamers is included in a simple rule in any of the tissues. The other complex rule with two heptamers from different regions was found in spleen: {2_TTTCTCT, 3_TTTCTCT} => {meandiff(Spleen) = 32.536}. This rule indicates that genes with two TTTCTCTs in the upstream intron show, on average, a 32.536% higher exon skipping rate in spleen compared to the rest of the genes (Fig. 5B). The third complex rule also occurred in spleen, and contained two heptamers from the same regulatory region: {3_AAAATAT, 3_TTTGTTT} => {meandiff(Spleen) = 24.253}. The heptamers corresponding to complex rules were, on average, repeated with higher multiplicity within their genes than heptamers from simple rules (Fig. 6). In genes with two or more heptamer occurrences, heptamers from complex rules occurred in greater numbers than heptamers from simple rules, regardless of whether the heptamers were from the same region (p-value of 0.067) or from all regions (p-value of 0.009).
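The permutation experiment can be sketched as below, reusing the hypothetical interesting_rules helper from the previous sketch. Instead of shuffling the sequences themselves, this version shuffles the pairing between a gene's heptamer content and its skipping-rate profile, which has the same effect on the rule statistics while preserving both marginal distributions:

```python
import random

def permutation_rule_counts(genes, frequent_sets, tissues, n_perm=100):
    """Count how many 'interesting' rules survive in randomized copies of
    the database, as a null reference for the observed rule count."""
    item_sets = [g[0] for g in genes]
    rate_profiles = [g[1] for g in genes]
    counts = []
    for _ in range(n_perm):
        shuffled = random.sample(rate_profiles, len(rate_profiles))
        randomized = list(zip(item_sets, shuffled))
        counts.append(len(interesting_rules(randomized, frequent_sets, tissues)))
    return counts
```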
Fig. 5. Exon skipping rates of two complex rules. (A) {6_TTTAAAA, 3_TTATTTT} => {meandiff(Brain) = -20.216} (B) {2_TTTCTCT, 3_TTTCTCT} => {meandiff(Spleen) = 32.536}. Genes with only one of the heptamers do not show a significant difference in mean exon skipping rate, while genes with both heptamers show a significantly different mean exon skipping rate in both cases.
Fig. 6. Number of genes with two or more heptamer repeats from simple and complex rules
We also compared the motif conservation scores of heptamers from simple and complex rules using PhastCons [32] scores stored in the UCSC Genome Browser. PhastCons fits a phylo-HMM to the data using maximum likelihood, and then predicts conserved elements based on this model [32]. Half of the heptamers from complex rules are significantly more conserved than random heptamers (p-values < 0.05), and a third of the heptamers from simple rules are significantly more conserved than random heptamers (p-values < 0.05; data not shown).
Fig. 7. Exon skipping rates in 10 tissues. Black bars represent the mean exon skipping rate for genes with a frequent heptamer. Gray bars represent the mean exon skipping rate for all genes.
Finally, to further validate our motif predictions, we assessed the overlap of our predictions with known AS regulatory sequence motifs stored in AEDB [33]. Among all heptamers involved in simple and complex rules, 43% occur within enhancer/silencer sequences from AEDB. This is a significantly (p-value = 0.017) higher percentage than we observed for a randomly selected set of heptamers of equal size.
4 Conclusion and Discussion
We have applied distribution-based quantitative association rule mining to discover putative cis-regulatory motifs and motif combinations in alternatively spliced genes. Quantitative association rule mining provides a convenient framework for the systematic investigation of sequence motifs involved in the regulation of AS. Using the t-test and Bonferroni's multiple testing correction, we identified several statistically significant associations between sequence motifs and tissue-specific exon skipping rates. We found 94 simple rules containing 1 sequence motif in the antecedent, and 3 complex rules which contain 2 sequence motifs in the antecedent. Among the complex rules, 2 rules contain heptamer pairs from different regions of the pre-mRNA sequence. None of the heptamers from a complex rule is also found in a simple rule. We hypothesize that these heptamer pairs correspond to factors which have to co-occur in order to influence alternative splicing. An approach which only targets individual motif candidates would have overlooked these motifs. Many heptamer sets are found in multiple tissues, even when using high support thresholds. For example, the heptamer TGTGGAG in cassette exons appears in rules
describing heart, intestine, and muscle expression. Genes including this heptamer show lower exon skipping rates in all tissues (Fig. 7). In addition, two very similar heptamers, 4_GCTGGAG and 4_TGTGAAG, appear in rules which also correspond to a reduction in exon skipping rate. We speculate that the heptamers TGTGGAG, GCTGGAG and TGTGAAG might correspond to a single degenerate cis-regulatory element associated with a reduction of exon skipping. Among all 59 heptamer sets, 16 heptamer sets are found in two or more rules affecting exon skipping in different tissues. On the other hand, some heptamers only affect the exon skipping rate in a single tissue. For example, the rule {1_GCCAAAG} => {meandiff(Spleen) = -18.186} occurs only in spleen, with a support of 29 genes. The genes with this heptamer show significantly (p-value = 0.040) lower exon skipping in spleen (Fig. 7). We hypothesize that this heptamer motif increases exon inclusion specifically in spleen. This work has demonstrated that distribution-based quantitative association rule mining is a viable approach for discovering putative complex regulatory motifs for AS. In addition, comparison with known regulatory motifs stored in AEDB [33] shows a significant enrichment of our heptamer set. Thus, we hypothesize that our motif catalog provides a promising list of candidates for subsequent experimental validation.
References
1. Brudno, M., et al.: Computational analysis of candidate intron regulatory elements for tissue-specific alternative pre-mRNA splicing. Nucleic Acids Res. 29(11), 2338–2348 (2001)
2. Faustino, N.A., Cooper, T.A.: Pre-mRNA splicing and human disease. Genes Dev. 17(4), 419–437 (2003)
3. Garcia-Blanco, M.A., et al.: Alternative splicing in disease and therapy. Nat. Biotechnol. 22(5), 535–546 (2004)
4. Ladd, A.N., Cooper, T.A.: Finding signals that regulate alternative splicing in the post-genomic era. Genome Biol. 3(11), reviews0008 (2002)
5. Yeo, G., et al.: Variation in alternative splicing across human tissues. Genome Biol. 5(10), R74 (2004)
6. Burge, C.B., et al.: Splicing of precursors to mRNAs by the spliceosomes. In: Gesteland, R.F., Cech, T., Atkins, J.F. (eds.) The RNA World, 2nd edn., pp. 525–560. Cold Spring Harbor Laboratory Press, Plainview (1999)
7. Akerman, M., et al.: Alternative splicing regulation at tandem 3' splice sites. Nucleic Acids Res. 34(1), 23–31 (2006)
8. Yun, L., Harold, R.G.: Evidence for the regulation of alternative splicing via complementary DNA sequence repeats. Bioinformatics 21(8), 1358–1364 (2005)
9. Fairbrother, W.G., et al.: Predictive identification of exonic splicing enhancers in human genes. Science 297, 1007–1013 (2002)
10. Zhang, X.H., et al.: Dichotomous splicing signals in exon flanks. Genome Res. 15(6), 768–779 (2005)
11. Famulok, M., Szostak, J.W.: Selection of Functional RNA and DNA Molecules from Randomized Sequences. In: Eckstein, F., Lilley, D.M.J. (eds.) Nucleic Acids and Molecular Biology, vol. 7, p. 271. Springer, Heidelberg (1993)
12. Stamm, S., et al.: An alternative-exon database and its statistical analysis. DNA Cell Biol. 19(12), 739–756 (2000)
13. Friedman, B.A., et al.: Ab initio identification of functionally interacting pairs of cis-regulatory elements. Genome Res. 18, 1643–1651 (2008)
14. Pan, Q., et al.: Revealing global regulatory features of mammalian alternative splicing using a quantitative microarray platform. Mol. Cell 16(6), 929–941 (2004)
15. Shai, O., et al.: Inferring global levels of alternative splicing isoforms using a generative model of microarray data. Bioinformatics 22(5), 606–613 (2006)
16. Benson, D.A., et al.: GenBank. Nucleic Acids Res. 34, D16–D20 (2006)
17. Rice, P., et al.: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 16(6), 276–277 (2000)
18. Kent, W.J.: BLAT - the BLAST-like alignment tool. Genome Res. 12(4), 656–664 (2002)
19. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules. In: Proc. of the 20th Int'l Conference on Very Large Databases (1994)
20. Park, J.S., et al.: An Effective Hash-Based Algorithm for Mining Association Rules. In: Proc. of the ACM SIGMOD Int'l Conference on Management of Data (1995)
21. Cheung, D.: A Fast Distributed Algorithm for Mining Association Rules. In: Proc. 4th Int'l Conference on Parallel and Distributed Information Systems. IEEE Computer Soc. Press, Los Alamitos (1996)
22. Agrawal, R., et al.: Parallel Mining of Association Rules. IEEE Transactions on Knowledge and Data Engineering 8(6) (1996)
23. Han, E.H., Karypis, G., Kumar, V.: Scalable parallel data mining for association rules. In: ACM SIGMOD Conference on Management of Data (1997)
24. Zaki, M.J.: Parallel Data Mining for Association Rules on Shared-Memory Multi-Processors. In: Proc. Supercomputing 1996. IEEE Computer Soc. Press, Los Alamitos (1996)
25. Fukuda, T., et al.: Mining optimized association rules for numeric attributes. In: Proceedings of the 15th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Montreal, Quebec, Canada, pp. 182–191 (1996)
26. Aumann, Y., Lindell, Y.: A statistical theory for quantitative association rules. In: KDD 1999, pp. 261–270 (1999)
27. Brin, S., et al.: Mining optimized gain rules for numeric attributes. In: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 324–333 (1999)
28. Voelker, R.B., Berglund, J.A.: A comprehensive computational characterization of conserved mammalian intronic sequence reveals conserved motifs associated with constitutive and alternative splicing. Genome Res. 17, 1023–1033 (2007)
29. Grabowski, P.J., et al.: Exon silencing by UAGG motifs in response to neuronal excitation. PLoS Biol. 5(2), e3 (2007)
30. Zaki, M., et al.: An Efficient Algorithm for Closed Itemset Mining. In: 2nd SIAM International Conference on Data Mining (2002)
31. Hodges, D., et al.: The role of evolutionarily conserved sequences in alternative splicing at the 3' end of Drosophila melanogaster Myosin heavy chain RNA. Genetics 151, 263–276 (1999)
32. Siepel, A., et al.: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15(8), 1034–1050 (2005)
33. Stamm, S., et al.: An alternative exon database (AEDB) and its statistical analysis. DNA Cell Biol. 19, 739–756 (2000)
Appendix

Table 1. Motif association rules found for spleen tissue exon skipping rates. Each entry gives the heptamer set, the p-value, and the mean difference in exon skipping rate. (The p-value of the complex rule {3_AAAATAT, 3_TTTGTTT} is not legible in the source; its mean difference is taken from the Results section.)

minsupp 20: 4_TGCTGGA, 0.000, -18.461; {2_TTTCTCT, 3_TTTCTCT}, 0.023, +32.536; {3_AAAATAT, 3_TTTGTTT}, -, +24.253; 3_TTGTTTT, 0.002, -24.879
minsupp 25: 4_TGCTGGA, 0.000, -18.461; 4_GCTGCTG, 0.043, -12.906; 1_GCCAAAG, 0.040, -18.186
minsupp 30: 4_TGCTGGA, 0.000, -18.461; 4_GCTGCTG, 0.031, -12.906
minsupp 35: 4_GCTGCTG, 0.022, -12.906; 4_TGCTGGA, 0.000, -18.461
minsupp 40: 4_GCTGCTG, 0.016, -12.906; 4_TGCTGGA, 0.000, -18.461; 4_GCTGGAG, 0.046, -11.911; 6_GCAGCTG, 0.038, -15.559; 2_CGCGCGG, 0.042, +18.532
minsupp 45: 4_TGCTGGA, 0.000, -18.461; 4_GCTGCTG, 0.012, -12.906; 4_GCTGGAG, 0.034, -11.911; 6_GCAGCTG, 0.028, -15.559; 2_CGCGCGG, 0.031, +18.532
minsupp 50: 4_TGCTGGA, 0.000, -18.461; 4_GCTGCTG, 0.009, -12.906; 4_GCTGGAG, 0.025, -11.911; 6_GCAGCTG, 0.020, -15.559
minsupp 55: 4_TGCTGGA, 0.000, -18.461; 4_GCTGCTG, 0.006, -12.906; 4_GCTGGAG, 0.018, -11.911
minsupp 60: 4_TGCTGGA, 0.000, -18.461; 4_GCTGCTG, 0.005, -12.906; 4_GCTGGAG, 0.014, -11.911
minsupp 65: 4_GCTGGAG, 0.010, -11.911; 4_TGCTGGA, 0.000, -18.461; 4_GCTGCTG, 0.004, -12.906
minsupp 70: 4_GCTGGAG, 0.008, -11.911; 4_GCTGCTG, 0.003, -12.906
Analysis of Cis-Regulatory Motifs in Cassette Exons by Incorporating Exon Skipping Rates

Sihui Zhao, Jihye Kim, and Steffen Heber

Bioinformatics Research Center, North Carolina State University, Raleigh, NC 27606, USA
{szhao3,jkim4,sheber}@ncsu.edu
Abstract. Identification of cis-regulatory motifs has long been a hotspot in the study of alternative splicing. We propose a two-step approach: we first identify k-mer seed motifs by testing for enrichment and significant differences in exon skipping rate, then a local stochastic search is applied to refine the seed motifs. Our approach is especially suitable to discover short and degenerate motifs. We applied our method to a dataset of CNS-specific cassette exons in mouse and discovered 15 motifs. Two of these motifs are highly similar to validated motifs, Nova and hnRNP A1 binding sites. Four motifs show positional bias relative to the splice sites. Our study provides a dictionary of sequence motifs involved in the regulation of alternative splicing in CNS tissues, and a novel tool to detect such motifs. Keywords: alternative splicing, motif discovery, exon skipping rate.
1 Introduction
Alternative splicing (AS) can produce multiple mature mRNA isoforms from one single gene, contributing essentially to protein diversity [1, 2]. AS has a pivotal role in many biological processes and is also involved in the regulation of gene expression. About 50% of all mutations in exons causing human disease affect splicing [3]. Cis-regulatory motifs, in combination with other factors such as splice sites, exon size, etc., play an important role in the regulation of alternative splicing [1]. Often, cis-regulatory motifs are degenerate short sequence elements, acting as binding sites for trans-regulatory factors or pairing to form loop structures [4]. According to their location and function, they are categorized into exonic splicing enhancers/silencers (ESE/ESS) or intronic splicing enhancers/silencers (ISE/ISS). Several computational approaches to identify cis-regulatory motifs in (alternative) splicing have been reported [5, 6, 7]. Most of them find statistically overrepresented k-mers (5-10 basepairs) by enumerative approaches. However, k-mers have limited flexibility to represent AS motifs due to the higher degeneracy of these motifs [1]. Although adding biological features may improve the quality of the predictions (see ref. [8] for a comprehensive review), only a few approaches for transcription
factor binding sites (TFBSs) have been implemented. Bussemaker and colleagues identified k-mers with a significant effect on mRNA abundance in a linear regression experiment [9]. Conlon and colleagues identified candidate TFBSs using MDScan in combination with model selection in linear regression [10]. Smith and colleagues analyzed TFBSs via weighted log-odds scores using ChIP-chip data [11]. This paper provides a novel approach to discover cis-regulatory motifs involved in the regulation of alternative splicing which makes use of exon skipping rate measurements. Such measurements are becoming more and more abundant with the growing use of microarray platforms targeting AS. It is believed that DNA motifs involved in alternative splicing are generally short and highly degenerate [1]. A simulation study demonstrates that our approach is especially suitable for detecting such motifs. We applied our method to CNS-specific (central nervous system) cassette exon data in mouse [12], and compared the predicted motifs with the Conserved Domain Database [13] to ensure that no protein domains are wrongly reported as motifs in coding regions.
2 Materials and Methods
2.1 Motif Discovery Algorithm
A two-step motif discovery approach which incorporates exon skipping rate information is implemented; see Box 1. In step one, we identify seed motifs. We exhaustively check all possible k-mers with k ranging between 4 and 7. k-mers which occur in less than 10% of the gene sequences are removed. For the remaining k-mers we compare the skipping rates between sequences with and without the k-mer via a t-test, and eliminate k-mers which do not show a significant (p-value < 0.005) skipping rate difference. To remove similar k-mer seeds, we sort the k-mers based on their significance level. We select the k-mer with the highest significance, store this k-mer, and eliminate all k-mers that match it in three or more basepairs. This choice controls the average random match between k-mers to be about 1 and also allows flexibility. We repeat this procedure until no more k-mers are left. In step two, we use a stochastic search to explore the neighborhood of each seed in the sequences, to either get a more flexible representation or extend/shrink the motif using IUPAC symbols. Motifs made from IUPAC symbols correspond to discrete PWMs. They provide a flexible motif representation while simultaneously keeping the search space limited. Starting from a motif seed, we extend, shrink or modify one motif position to get a new candidate motif. Log-odds scores for each sliding window in the sequences are computed based on the position weight matrix which corresponds to the candidate motif. We then use the skipping rates as weights in the following objective function:

f(S, Y, M) = Σ_{i=1..n} y_i Σ_{j=1..|S_i|−|w_ij|+1} z · log( Pr(w_ij | M) / Pr(w_ij | B) )    (1)
Here, S denotes a set of sequences S_1, S_2, …, S_n, and Y is the vector of associated skipping rates. M and B indicate whether a string is a motif or from the background, respectively. w_ij is the string which corresponds to the j-th sliding window in sequence i, log(Pr(w_ij | M) / Pr(w_ij | B)) is its log-odds score, and z is an indicator variable for the case that the log-odds score is greater than 0. This objective function is a modified version of the one used by Smith and colleagues [11]. It calculates the weighted sum of log-odds scores. For each string in the j-th sliding window on sequence i, we compute the log-odds score. The indicator variable z discards strings with scores less than 0. We then sum the scores over all sequences and all strings, weighted by the skipping rates. We accept a motif change if the objective function improves. We repeat this procedure until no improvement can be achieved for any change.

Box 1. Workflow of the motif discovery approach
Step 1: Find seed motifs
  Generate all k-mers
  Remove k-mers occurring in less than 10% of the sequences
  Identify k-mers which significantly affect skipping rates
  Remove similar seeds:
    Repeat
      Record the k-mer with the highest significance of skipping rate difference
      Eliminate k-mers with more than 3bp match
    until no more k-mers
Step 2: Motif refinement via local stochastic search
  Repeat
    Extend, eliminate or change one position using IUPAC symbols
    Compute the position weight matrix from the corresponding k-mers
    Calculate the objective function
    Accept and update the motif if the objective function improves
  until no further improvement
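A minimal sketch of objective function (1), assuming a PWM represented as a list of per-column base-probability dictionaries and a uniform background of 0.25 per nucleotide (the paper does not state its background model, so that is our assumption):

```python
import math

def objective(sequences, skipping_rates, pwm, background=0.25):
    """Weighted sum of positive window log-odds scores, formula (1)."""
    k = len(pwm)  # motif width
    total = 0.0
    for seq, y in zip(sequences, skipping_rates):
        for j in range(len(seq) - k + 1):
            window = seq[j:j + k]
            # 1e-9 floor guards against log(0) for bases the motif excludes
            score = sum(math.log(pwm[pos].get(base, 1e-9) / background)
                        for pos, base in enumerate(window))
            if score > 0:  # the indicator variable z
                total += y * score  # weight by the skipping rate
    return total
```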
2.2 Simulation Study
We tested our approach in a simulation study. We simulated DNA sequences, implanted artificial motifs, generated corresponding exon skipping rates, and evaluated the performance of our algorithm. Each analyzed sequence set consists of 100 simulated sequences, ranging from 100 to 300 basepairs, generated using a uniform nucleotide distribution. We implanted motifs in 50 sequences with different combinations of parameters: motif width (4, 6, 8 and 10 basepairs) and motif conservation, measured by information content (1.15, 0.95, 0.75, 0.55 ± 0.025 per column). We generated two scenarios for exon skipping rates: a non-overlap group with skipping rates 0.80±0.15 ([0.65, 0.95]) for sequences with motifs implanted and 0.20±0.15 ([0.05, 0.35]) for sequences without motifs, and an overlap group with skipping rates 0.60±0.15 and
0.40±0.15, respectively. We performed 100 simulations for each combination of parameters and calculated the mean and standard deviation of the distance between the predicted and implanted motifs. To evaluate the performance of our approach, we also analyzed the artificial sequences with skipping rates greater than 0.5 by MEME [14] for comparison. This mimics the more traditional way of selecting sequences by skipping rates and then doing motif discovery.
2.3 Comparison of Position Weight Matrices
The column metric is based on Euclidean distance and is a modified version of the one used by Tsai and colleagues [15]. Let C1 = (p_1A, p_1C, p_1G, p_1T)^T and C2 = (p_2A, p_2C, p_2G, p_2T)^T be two columns in matrices 1 and 2, where p is the probability of a nucleotide. The distance between two columns is calculated as

d(C1, C2) = sqrt( (1/2) · Σ_{j ∈ {A,C,G,T}} (p_1j − p_2j)^2 )    (2)
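A sketch of the column metric (2), together with the gapless PWM alignment described in the next paragraph: overhanging columns are scored against the background distribution, and the PWM distance is the best score over all offsets. The dictionary column representation and the uniform background are our own assumptions.

```python
import math

BACKGROUND = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}

def col_dist(c1, c2):
    """Column distance of formula (2); columns are base->probability dicts."""
    return math.sqrt(0.5 * sum((c1[b] - c2[b]) ** 2 for b in 'ACGT'))

def pwm_dist(m1, m2):
    """Global, gapless alignment of two PWMs with end gaps allowed;
    overhangs are penalized by their distance to the background."""
    best = float('inf')
    for off in range(-len(m2) + 1, len(m1)):  # slide m2 along m1
        score = 0.0
        for i in range(min(0, off), max(len(m1), off + len(m2))):
            a = m1[i] if 0 <= i < len(m1) else BACKGROUND
            b = m2[i - off] if 0 <= i - off < len(m2) else BACKGROUND
            score += col_dist(a, b)
        best = min(best, score)
    return best
```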
Two PWMs are globally aligned without internal gaps, while end gaps are allowed. The distance between a column and the background nucleotide distribution is used as the penalty for overhangs. We define the distance between two PWMs as the score of the optimal global alignment.
2.4 Collection of Sequence Data
We analyzed the mouse skipped exons (the most common type of alternative splicing) from Fagnani and colleagues [12]. A quantitative microarray platform created by the same lab was used to measure the skipping rate (%ASex, the expression of the isoform without the skipped exon divided by the total expression of both) [16]. The full-length transcripts were downloaded from NCBI. We used BLAT to map each transcript to the mouse genomic sequences [17]. We did motif discovery in seven regions: the up-stream exon, the 5' and 3' ends of the up-stream intron (up to 200 basepairs), the cassette exon, the 5' and 3' ends of the down-stream intron (up to 200 basepairs), and the down-stream exon. Since the skipping rate is a ratio, we transform the associated skipping rates by a logit transformation.
2.5 Positional Bias of Motif Occurrences
We use the Kolmogorov-Smirnov (K-S) statistic to test for non-uniform occurrences of the predicted motifs within the sequences used for motif discovery, rather than within the whole introns. The K-S statistic is defined as
D = sup_x |F(x) − F0(x)|    (3)

where F(x) and F0(x) are the motif and background distributions.
We use a Monte Carlo approach to estimate the empirical distribution of the K-S statistic. Let L1, L2, …, Ln be the lengths of the sequences from which a motif is predicted. We randomly generate one motif position for each sequence, assuming that the motif can occur at any position with equal probability, and repeat this 1000 times. Similarly, we generate a large number of random motif positions to obtain a reference background distribution. We then calculate the K-S statistic between each set of randomly generated positions and the reference. The p-value is the number of K-S statistics larger than or equal to the observed one, divided by 1000.
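An illustrative Monte Carlo implementation using SciPy's two-sample K-S test; the function and variable names are ours, and the reference sample size (100 draws per sequence) is an arbitrary choice standing in for the "large number" mentioned above:

```python
import random
from scipy.stats import ks_2samp

def positional_bias_pvalue(observed_positions, seq_lengths, n_mc=1000):
    """Monte Carlo p-value for non-uniform motif placement."""
    # large uniform reference sample (the background distribution)
    reference = [random.randrange(L) for L in seq_lengths for _ in range(100)]
    d_obs = ks_2samp(observed_positions, reference).statistic
    hits = 0
    for _ in range(n_mc):
        # one uniform random position per sequence under the null
        simulated = [random.randrange(L) for L in seq_lengths]
        if ks_2samp(simulated, reference).statistic >= d_obs:
            hits += 1
    return hits / n_mc
```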
3 Results
3.1 Motif Re-discovery in Simulated Sequence Data
We compared our approach to the approach combining sequence grouping by expression and MEME analysis in simulated sequence data, using parameter settings similar to Smith and colleagues [18].

Table 1. Mean distances (normalized by the width of the implanted motifs) between the re-discovered and implanted motifs using different methods (IC: information content per column). The standard deviation is given in parentheses. The numbers in the upper-right part of the table (bold in the original) show the better performance of our approach in the prediction of short and degenerate motifs.

Width (bp)  Skipping rate  IC 1.15: New / MEME            IC 0.95: New / MEME            IC 0.75: New / MEME            IC 0.55: New / MEME
4           overlap        0.386 (0.190) / 1.687 (0.448)  0.524 (0.147) / 1.588 (0.716)  0.540 (0.108) / 1.953 (0.563)  0.578 (0.069) / 1.855 (0.618)
4           non-overlap    0.410 (0.172) / 1.937 (0.586)  0.527 (0.140) / 1.559 (0.789)  0.574 (0.091) / 1.837 (0.548)  0.577 (0.080) / 1.805 (0.492)
6           overlap        0.236 (0.051) / 0.633 (0.591)  0.300 (0.100) / 1.104 (0.622)  0.357 (0.093) / 1.252 (0.430)  0.431 (0.061) / 1.294 (0.366)
6           non-overlap    0.243 (0.037) / 0.561 (0.565)  0.313 (0.087) / 1.155 (0.433)  0.398 (0.101) / 1.302 (0.409)  0.410 (0.061) / 1.317 (0.420)
8           overlap        0.240 (0.048) / 0.102 (0.026)  0.282 (0.073) / 0.581 (0.472)  0.341 (0.069) / 0.892 (0.355)  0.354 (0.051) / 0.991 (0.300)
8           non-overlap    0.240 (0.051) / 0.112 (0.043)  0.285 (0.062) / 0.666 (0.514)  0.345 (0.072) / 0.892 (0.318)  0.371 (0.047) / 0.941 (0.292)
10          overlap        0.224 (0.071) / 0.074 (0.014)  0.284 (0.066) / 0.147 (0.144)  0.331 (0.056) / 0.892 (0.334)  0.336 (0.039) / 0.779 (0.218)
10          non-overlap    0.245 (0.063) / 0.075 (0.016)  0.282 (0.051) / 0.122 (0.071)  0.331 (0.049) / 0.892 (0.294)  0.333 (0.036) / 0.730 (0.223)
We combined all artificial sequences with skipping rates greater than 0.5 and used MEME for motif re-discovery. This procedure is similar to the traditional approach of selecting sequences by similar expression profiles and doing motif discovery thereafter. A skipping rate of 0.5 is the expected mean in the simulated data. In the non-overlap group, MEME only analyzed sequences containing an implanted motif. In the overlap group, the chance of also including sequences without a motif into the MEME analysis is about 16%. This corresponds to the (more realistic) case that the starting sequence set is not perfect. To compare the performance between our approach and MEME, we calculated the distance between the implanted and re-discovered motifs using formula (2). Table 1 gives the mean and standard deviation of the distance (normalized by the width of the implanted motifs) in 100 replicates under different combinations of parameters. Generally, the mean distance increases with decreasing motif conservation (information content). Similarly, the mean distance decreases with increasing motif width. MEME, as a well-known and mature motif discovery tool, performs very well in the re-discovery of longer and more conserved motifs (in the bottom-left corner of Table 1). Under these conditions, MEME performs similarly to or slightly better than our approach. However, when analyzing shorter and more degenerate motifs, our approach outperforms MEME. Overall, our approach performs consistently in all combinations of settings, even when the implanted motifs are only 4 basepairs long. Although short width and degeneracy also deteriorate the performance of our approach, the magnitude is much smaller than for MEME.
3.2 Motif Discovery in CNS-Specific Exon Skipping Events
We applied our motif discovery approach to the DNA sequences of seven regions of 75 CNS-specific exon skipping events from Fagnani and colleagues [12]. By incorporating logit-transformed skipping rates, we identified 15 candidate motifs (see Table 2) with widths between five and ten basepairs. To investigate the effect of a motif on the exon skipping rate, we performed a regression of the motif log-odds scores against the exon skipping rates using the model y = a + b × log-odds + e. We first determined a more accurate and stringent motif cutoff by shuffling the sequences and calculating the log-odds scores for all possible sliding windows. The 99.5% quantile of all the log-odds scores was used as the cutoff, to make sure that random motif matches occur less than once per sequence. We then calculated the sum of log-odds scores of the matches to the predicted motifs in each sequence. Eleven predicted motifs show a significant effect (α = 5%) on the skipping rates in a simple linear regression for each motif separately. Two motifs have a significance level between 0.05 and 0.1. The corresponding R2 values are between 6% and 23% for each significant motif. This is slightly better than or comparable to a similar approach searching for TFBSs [9, 10]. The occurrences and log-odds score sums of several predicted motifs are strongly correlated, even though the motifs were predicted independently in different regions. The scores of motif 9 (CNYGK) and motif 14 (VYCAK) show a positive correlation (Spearman's correlation coefficient 0.50 with p-value 0, see Fig. 1). The p-value for
the Spearman's correlation coefficient is calculated by shuffling the motif scores relative to the sequences 1000 times and computing the chance of obtaining more extreme values. Both motifs act as intronic splicing enhancers. More interestingly, motif 9 and motif 14 both reside in the down-stream intron, but motif 9 in the 5' end and motif 14 in the 3' end. This might suggest a cooperative role of these two motifs. Motif 11 (CHDCNBHB) and motif 14 are also positively correlated. They both reside in the 3' end of the down-stream intron and act as intronic splicing enhancers. The Spearman's correlation coefficient between them is 0.41, with a p-value equal to 0. Negative correlation between predicted motifs is also observed (see Fig. 1). Motifs 11 and 14 are both negatively correlated with motif 4 (HWKATTWWTD), with Spearman's correlation coefficients -0.37 and -0.33 (p-values 0 and 0.002). The former two motifs are intronic enhancers, while the latter acts as a silencer in the up-stream intron.
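The shuffling-based p-value for these correlations can be sketched as follows (illustrative only; scores_a and scores_b stand for the per-gene log-odds score sums of two motifs over the same sequence set):

```python
import random
from scipy.stats import spearmanr

def spearman_perm_pvalue(scores_a, scores_b, n_perm=1000):
    """Permutation p-value for the Spearman correlation between the
    score profiles of two motifs across the same genes."""
    rho_obs = spearmanr(scores_a, scores_b).correlation
    shuffled = list(scores_b)
    extreme = 0
    for _ in range(n_perm):
        random.shuffle(shuffled)  # break the gene-wise pairing
        if abs(spearmanr(scores_a, shuffled).correlation) >= abs(rho_obs):
            extreme += 1
    return extreme / n_perm  # can be 0, as reported in the text
```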
Fig. 1. Correlation between the predicted motifs. The colors encode the correlation between motifs. Red indicates a positive correlation and blue negative correlation. The diagonal, the correlation of each motif to itself, is in black.
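The log-odds cutoff and per-motif regression described in the text above can likewise be sketched; the 1e-9 probability floor and the uniform background are our assumptions, and all function names are hypothetical:

```python
import math
import random
from scipy import stats

def log_odds(window, pwm, background=0.25):
    """Log-odds score of one window against a PWM (list of column dicts)."""
    return sum(math.log(pwm[i].get(b, 1e-9) / background)
               for i, b in enumerate(window))

def motif_cutoff(sequences, pwm, quantile=0.995):
    """99.5% quantile of window log-odds scores on shuffled sequences,
    so that random motif matches stay below about one per sequence."""
    k = len(pwm)
    scores = []
    for seq in sequences:
        shuffled = ''.join(random.sample(seq, len(seq)))
        scores.extend(log_odds(shuffled[j:j + k], pwm)
                      for j in range(len(shuffled) - k + 1))
    scores.sort()
    return scores[int(quantile * (len(scores) - 1))]

def motif_effect(score_sums, skipping_rates):
    """Simple linear regression y = a + b*score + e; returns (R^2, p)."""
    res = stats.linregress(score_sums, skipping_rates)
    return res.rvalue ** 2, res.pvalue
```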
In summary, we identified 15 candidate motifs. Three of them occur in exons: one in the up-stream exon, one in the cassette exon, and one in the down-stream exon. The remaining motifs occur in introns. The candidate motifs show different correlations with the skipping rates. Seven have a positive correlation with the skipping rate, indicating that they are splicing silencers. The remaining eight are negatively correlated and might be splicing enhancers.
3.3 Positional Bias of the Discovered Motifs
DNA motifs involved in (alternative) splicing are usually located in the vicinity of the 5' or 3' splice sites. Therefore, positional bias is often used as a validation of predicted AS motifs [19, 20]. We used the Kolmogorov-Smirnov (K-S) statistic and a Monte Carlo method to test for positional bias within the sequences used for motif discovery. Because our sequences have different lengths, the background will not be a flat line (see Figure 2).
Using a False Discovery Rate cutoff of 0.1, we found four motifs showing positional bias relative to the splice sites. Motifs 6, 12 and 13 show a preference for occurring close to the 3' splice site; all of them are TC-rich and located in the 3' end of the flanking introns. This might indicate a similarity between these motifs and polypyrimidine tract binding sites. Motif 10 is located in the 5' end of the down-stream intron and has a positional bias toward the 5' splice site.
3.4 Comparison to Protein Domains
Genes containing the same active protein domains may be co-regulated and show similar splicing patterns in certain tissues or developmental stages; the corresponding domain-coding sequences could thus erroneously be regarded as splicing motifs correlated with skipping rates. Therefore, we checked the Conserved Domain Database (CDD) to verify that the predicted motifs do not coincide with known protein domains. We retrieved protein sequences for each gene and searched the CDD at NCBI [13]. Using the default e-value cutoff of 0.01, we did not find any significant overlap between our motifs and protein domains of the CDD.
Fig. 2. Positional bias of the predicted motifs 6 (left) and 12 (right). The grey histogram shows the observed positions of the predicted motif, while the white histogram gives the background distribution under the uniform assumption. The background distribution is normalized to have the same total number of occurrences as the predicted motifs.
3.5 Predicted Motifs with Matches to Known AS Motifs
Compared to transcription factor binding sites, relatively little information about AS motifs is known and validated. However, some AS motifs have been reported [21, 22, 23], and by comparing our predictions to these motifs we found several close matches. Motif 14 (see Table 2) was discovered in the 3' end of the down-stream intron. It is negatively correlated with the exon skipping rate, suggesting a role as an intronic splicing enhancer. The motif has a consensus of VYCAK, or [ACG][CT]CA[GT], which is similar to the binding sites of Nova. Nova is a neuron-specific alternative splicing factor which affects spliceosome assembly and the removal of introns, and thus regulates the
Table 2. Predicted motifs described in the Results section. ESE/ESS: exonic splicing enhancer/silencer, ISE/ISS: intronic splicing enhancer/silencer. IUPAC symbols: R: A/G, Y: C/T, M: A/C, K: G/T, S: C/G, W: A/T, B: C/G/T, D: A/G/T, H: A/C/T, V: A/C/G and N: A/C/G/T. (The sequence logos of the original table are omitted here.)

Motif  Width  Consensus   Region                      P-value  R2     Function
2      8      BBCASGGK    Up-stream intron, 5' end    2.7E-4   0.17   ISS
9      5      CNYGK       Down-stream intron, 5' end  9.6E-4   0.14   ISE
11     8      CHDCNBHB    Down-stream intron, 3' end  4.3E-3   0.11   ISE
12     5      CTKTH       Down-stream intron, 3' end  0.075    0.043  ISE
13     7      VTBKBCY     Down-stream intron, 3' end  1.2E-4   0.18   ISE
14     5      VYCAK       Down-stream intron, 3' end  3.8E-3   0.11   ISE
inclusion of exons [22, 24]. The binding site of Nova is YCAY, which is almost identical to VYCAK except for the last position (Y partially matches K, because Y is C or T and K is G or T). The location of a Nova binding site affects its biological function. If YCAY occurs within or immediately up-stream of AS exons, Nova acts as a silencer by blocking the binding of U1 snRNP. However, when YCAY occurs in the down-stream intron, Nova becomes an intronic enhancer which boosts the inclusion of AS exons [22]. The location and biological function in the latter case coincide with our motif 14. Most (61/75) of the sequences in our dataset contain motif 14. About half of the sequences have multiple copies (2-4). Not only the strength of the motif but also the number of copies affects exon skipping. In the simple linear regression, the contribution of motif 14 to the variation of skipping rates is 11% (p-value 3.8E-3). The enhancing effect of Nova is believed to be caused by controlling the removal of introns harboring YCAY. It is interesting that the log-odds scores of motif 9 are positively correlated with those of motif 14, indicating the co-occurrence of these two motifs in the same intron sequences (motif 9 at the 5' end and motif 14 at the 3' end). This, along with the same enhancing effects of both motifs, may suggest a cooperative role during intron removal. Motif 2 (see Table 2) has a consensus sequence of BBCASGGK, which is similar to the binding site of heterogeneous nuclear ribonucleoprotein A1 (hnRNP A1). The binding sites of hnRNP A1 are either TAGGGT or CAGG[GA]T [21, 23], while, starting from position three, motif 2 is CA[GC]GG[GT]. The hnRNP family counteracts the effect of SR proteins in splicing regulation. hnRNP A1 can act as a splicing silencer by affecting the selection of splice sites [21, 23]. Motif 2 is positively correlated with the exon skipping rate, suggesting a role as a splicing silencer. The motif occurs in the 5' end of the up-stream intron. Unlike motif 14, which occurs in most of the 75 CNS genes with multiple copies, only thirteen genes contain this motif, most with a single copy. The corresponding R2 value of motif 2 in a linear regression is 0.17, with a p-value of 2.7E-4.
4 Conclusion and Discussion
In this paper, we describe a two-step approach to predict regulatory motifs involved in alternative splicing. Our approach makes essential use of exon skipping rate measurements to overcome the inaccuracy caused by short motifs with a high degree of degeneracy. We start by identifying motif seeds that are both enriched and have a significant effect on the skipping rate. Subsequently, the motif seeds are extended and refined by a local stochastic search. Our algorithm optimizes an objective function which combines both skipping rates and motif log-odds scores. For finding short and less conserved motifs, a simulation study indicates that our approach performs better than an approach that groups sequences by expression data and uses MEME for motif discovery. Besides the distance comparison, we also compared the accuracy of both approaches. We fixed the number of top motifs that MEME reported, so that MEME reported the same number of motifs as our approach. Using the 5% quantile of the distance between random motifs as a cutoff, the correct-prediction rate of our approach is greater by 24% on average (p-value 0.00008 by Wilcoxon signed-rank test; data not shown). This makes our method especially suitable for AS motif discovery, where
motifs are typically short and highly degenerate [1]. MEME shows a similar or better performance for longer and more conserved motifs. One possible explanation for this behavior is that our algorithm restricts the search space to discrete IUPAC symbols, and cannot refine the corresponding PWMs beyond this resolution. We suggest testing this hypothesis in future work by implementing a more flexible stochastic search procedure. Compared to other k-mer based motif finders, e.g., YMF [25] and Weeder [26], our approach refines the k-mer seeds by association with skipping rates and thus shows more flexibility, particularly in motif length and in the degenerate symbols allowed. We applied our approach to a dataset containing expression information for 75 exon skipping events in multiple CNS tissues. We identified in total fifteen motifs located in cassette exons as well as in flanking exons and introns. The comparison of the predicted motifs to the Conserved Domain Database does not provide any evidence that any of our predicted motifs could be explained by conserved protein domains. Each single motif accounts for between 6 and 20% of the variation of the skipping rate. Four motifs show positional bias relative to the splice sites. Two of our predicted motifs show high sequence similarity to known and well-investigated AS motifs - the Nova and hnRNP A1 binding sites. The observed effects on exon skipping rates and the positions of our motifs also coincide with those of Nova and hnRNP A1. We find co-occurrences of several predicted motifs, suggesting possible cooperative roles of these motifs in exon skipping. Interestingly, we find a so far unreported motif (motif 9) co-occurring with Nova binding sites. This suggests that Nova might have a more complex role in the CNS-specific splicing code than expected before. Originally, Fagnani and colleagues performed an ab initio motif discovery study by comparing k-mer usage, as well as searching for experimentally validated motifs in the sequence dataset [12]. We compared our motif set with the consensus sequences of the motifs reported in Fagnani's study. Similar to their results, we also identified C/U-rich motifs involved in CNS-specific AS. In addition, we discovered binding sites which were not found in Fagnani's ab initio motif discovery, e.g., the Nova binding site. Ten of our predictions may be novel findings. We hypothesize that this might be attributed to our motif discovery approach, which makes essential use of exon skipping rate information. Based on the evidence presented above, we believe our motifs are part of a complex CNS-specific splicing code, and they are promising candidates for future validation experiments.
References
1. Ladd, A.N., Cooper, T.A.: Finding Signals that Regulate Alternative Splicing in the Post-Genomic Era. Genome Biol. 3, reviews0008 (2002)
2. Matlin, A.J., Clark, F., Smith, C.W.: Understanding Alternative Splicing: Towards a Cellular Code. Nat. Rev. Mol. Cell Biol. 6, 386–398 (2005)
3. Blencowe, B.J.: Alternative Splicing: New Insights from Global Analyses. Cell 126, 37–47 (2006)
4. Miriami, E., Margalit, H., Sperling, R.: Conserved Sequence Elements Associated with Exon Skipping. Nucleic Acids Res. 31, 1974–1983 (2003)
5. Brudno, M., Gelfand, M.S., Spengler, S., Zorn, M., Dubchak, I., Conboy, J.G.: Computational Analysis of Candidate Intron Regulatory Elements for Tissue-Specific Alternative Pre-mRNA Splicing. Nucleic Acids Res. 29, 2338–2348 (2001)
6. Fairbrother, W.G., Yeh, R.F., Sharp, P.A., Burge, C.B.: Predictive Identification of Exonic Splicing Enhancers in Human Genes. Science 297, 1007–1013 (2002)
7. Zhang, X.H., Chasin, L.A.: Computational Definition of Sequence Motifs Governing Constitutive Exon Splicing. Genes Dev. 18, 1241–1250 (2004)
8. Vingron, M., Brazma, A., Coulson, R., van Helden, J., Manke, T., Palin, K., Sand, O., Ukkonen, E.: Integrating Sequence, Evolution and Functional Genomics in Regulatory Genomics. Genome Biol. 10, 202 (2009)
9. Bussemaker, H.J., Li, H., Siggia, E.D.: Regulatory Element Detection using Correlation with Expression. Nat. Genet. 27, 167–171 (2001)
10. Conlon, E.M., Liu, X.S., Lieb, J.D., Liu, J.S.: Integrating Regulatory Motif Discovery and Genome-Wide Expression Analysis. Proc. Natl. Acad. Sci. U.S.A. 100, 3339–3344 (2003)
11. Smith, A.D., Sumazin, P., Das, D., Zhang, M.Q.: Mining ChIP-Chip Data for Transcription Factor and Cofactor Binding Sites. Bioinformatics 21(suppl. 1), i403–i412 (2005)
12. Fagnani, M., Barash, Y., Ip, J., Misquitta, C., Pan, Q., Saltzman, A., Shai, O., Lee, L., Rozenhek, A., Mohammad, N., Willaime-Morawek, S., Babak, T., Zhang, W., Hughes, T., van der Kooy, D., Frey, B., Blencowe, B.: Functional Coordination of Alternative Splicing in the Mammalian Central Nervous System. Genome Biol. 8, R108 (2007)
13. Marchler-Bauer, A., Bryant, S.H.: CD-Search: Protein Domain Annotations on the Fly. Nucleic Acids Res. 32, W327–W331 (2004)
14. Bailey, T.L., Elkan, C.: Fitting a Mixture Model by Expectation Maximization to Discover Motifs in Biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2, 28–36 (1994)
15. Tsai, H.K., Huang, G.T., Chou, M.Y., Lu, H.H., Li, W.H.: Method for Identifying Transcription Factor Binding Sites in Yeast. Bioinformatics 22, 1675–1681 (2006)
16. Pan, Q., Shai, O., Misquitta, C., Zhang, W., Saltzman, A.L., Mohammad, N., Babak, T., Siu, H., Hughes, T.R., Morris, Q.D., Frey, B.J., Blencowe, B.J.: Revealing Global Regulatory Features of Mammalian Alternative Splicing using a Quantitative Microarray Platform. Mol. Cell 16, 929–941 (2004)
17. Kent, W.J.: BLAT - the BLAST-Like Alignment Tool. Genome Res. 12, 656–664 (2002)
18. Smith, A.D., Sumazin, P., Zhang, M.Q.: Identifying Tissue-Selective Transcription Factor Binding Sites in Vertebrate Promoters. Proc. Natl. Acad. Sci. U.S.A. 102, 1560–1565 (2005)
19. McCullough, A.J., Berget, S.M.: G Triplets Located Throughout a Class of Small Vertebrate Introns Enforce Intron Borders and Regulate Splice Site Selection. Mol. Cell. Biol. 17, 4562–4571 (1997)
20. Yeo, G., Holste, D., Kreiman, G., Burge, C.: Variation in Alternative Splicing Across Human Tissues. Genome Biol. 5, R74 (2004)
21. Burd, C.G., Dreyfuss, G.: RNA Binding Specificity of hnRNP A1: Significance of hnRNP A1 High-Affinity Binding Sites in Pre-mRNA Splicing. EMBO J. 13, 1197–1204 (1994)
22. Ule, J., Stefani, G., Mele, A., Ruggiu, M., Wang, X., Taneri, B., Gaasterland, T., Blencowe, B.J., Darnell, R.B.: An RNA Map Predicting Nova-Dependent Splicing Regulation. Nature 444, 580–586 (2006)
23. Zhao, X., Rush, M., Schwartz, S.: Identification of an hnRNP A1-Dependent Splicing Silencer in the Human Papillomavirus Type 16 L1 Coding Region that Prevents Premature Expression of the Late L1 Gene. J. Virol. 78, 10888–10905 (2004)
24. Ule, J., Jensen, K.B., Ruggiu, M., Mele, A., Ule, A., Darnell, R.B.: CLIP Identifies Nova-Regulated RNA Networks in the Brain. Science 302, 1212–1215 (2003)
25. Sinha, S., Tompa, M.: YMF: A Program for Discovery of Novel Transcription Factor Binding Sites by Statistical Overrepresentation. Nucleic Acids Res. 31, 3586–3588 (2003)
26. Pavesi, G., Mereghetti, P., Mauri, G., Pesole, G.: Weeder Web: Discovery of Transcription Factor Binding Sites in a Set of Sequences from Co-Regulated Genes. Nucleic Acids Res. 32, W199–W203 (2004)
A Class of Evolution-Based Kernels for Protein Homology Analysis: A Generalization of the PAM Model

Valentina Sulimova (1), Vadim Mottl (2), Boris Mirkin (3), Ilya Muchnik (4), and Casimir Kulikowski (4)

(1) Tula State University, Lenin Ave. 92, 300600 Tula, Russia, [email protected]
(2) Computing Center of the Russian Academy of Sciences, Vavilov St. 40, 119333 Moscow, Russia, [email protected]
(3) Birkbeck College, University of London, Malet Street, London, WC1E 7HX, UK, [email protected]
(4) Rutgers University, New Brunswick, New Jersey 08903, USA, [email protected], [email protected]
Abstract. There are two desirable properties that a pair-wise similarity measure between amino acid sequences should possess in order to produce good performance in protein homology analysis. First is the presence of kernel properties, which allow using popular and well-performing computational tools designed for linear spaces, like SVM and k-means. Second, it is very important to take into account the common evolutionary descent of homologous proteins. However, none of the existing similarity measures possesses both of these properties at once. In this paper, we propose a simple probabilistic evolution model of amino acid sequences that is built as a straightforward generalization of the PAM evolution model of single amino acids. This model produces a class of kernel functions, each of which is computed as the likelihood of the hypothesis that both sequences are results of two independent evolutionary transformations of a hidden common ancestor under some specific assumptions on the evolution mechanism. The proposed class of kernels is rather wide and contains as particular subclasses not only the family of J.-P. Vert's local alignment kernels, whose algebraic structure was introduced without any evolutionary motivation, but also some other families of local and global kernels. We demonstrate, via k-means clustering of a set of amino acid sequences from the VIDA database, that the global kernel can be useful in bringing together otherwise very different protein families.
Keywords: Protein homology analysis, evolution modeling, amino acid sequence alignment, evolutionary kernel function, kernel-derived clusters.
1 Introduction
Protein homology is understood as sequence similarity based on recent common ancestry and similar function. The concept of homology is one of the most important in
proteomics. However, its operational meaning remains rather obscure and not well represented by computationally sound concepts and tools, and the practical definition of protein homologous families remains subject to significant manual curation. In the absence of structural folding data for the overwhelming majority of proteins, the development of protein homologous families mostly relies on protein amino acid sequence data, under the assumption that homologous proteins should have similar sequences. To fully automate the process, a reliable and computationally feasible solution to two principal issues is needed: (a) measuring similarity between protein sequences and (b) clustering proteins into similarity groups.

The most popular approach to assessing the pair-wise similarity of amino acid sequences, which was developed as a means of detecting evolutionary relationships between them, is based on the notions of global or local alignment. The original idea was to determine a similarity measure through finding the best correspondence between successive amino acids in two sequences with possible gaps, which is computed via a version of the dynamic programming procedure: respectively, the global Needleman-Wunsch [1] or the local Smith-Waterman algorithm [2]. In the latter case, the fast heuristics BLAST, PSI-BLAST [3] and FASTA [4] are usually applied instead of dynamic programming to accelerate calculations. However, these approaches do not perform well for remotely homologous proteins [5, 6, 7]. We suppose that this shortcoming arises from the absence of two very important properties: (a) classical optimal-alignment-based similarity measures are not based on a biologically defined evolution mechanism, and (b) they are not kernel functions, i.e., they do not enable the use of such powerful and convenient tools as SVM and k-means clustering developed for linear spaces.

Multiple investigations have explored the idea of endowing similarity measures with these desirable properties. In particular, a number of kernel functions on the set of amino acid and DNA sequences were introduced in [8, 9, 10, 11, 12, 13]. However, practically all of them remain motivated by purely algebraic considerations, even J.-P. Vert's LA-kernels [8], which average the similarity measures produced by all feasible local alignments. The families of kernels studied in [14, 15], in spite of their probabilistic nature, are also not based on an explicitly formulated model of evolution.

A separate line of investigation has been aimed at forming models of protein evolution. In [16, 17, 18, 19], a number of different models were proposed, but the similarity measures resulting from them, which are based on the notion of statistical alignment and take into account the most likely ways of evolution [16] or all possible ways [17, 18, 19], nevertheless do not possess all the kernel properties. Besides, methods of this kind have very high computational complexity [18] or do not guarantee the mathematical correctness of the similarity measure as the probability of two independent transformations of the protein pair under comparison from the same unknown ancestor [17].

In this paper, we propose a simple probabilistic model of evolution of amino acid sequences that is built as a rather straightforward generalization of Margaret Dayhoff's Point Accepted Mutation (PAM) model developed for single amino acids [20].
The respective pair-wise sequence similarity measure has the strong mathematical meaning of the likelihood of the hypothesis that the two sequences are the results of two independent evolutionary transformations of some hidden sequence considered as
their common ancestor. Each similarity measure of the proposed class possesses all the properties of a kernel function; in particular, the matrix of its values computed for any finite set of amino acid sequences is at least positive semidefinite. By its algebraic structure, this class of kernels is an essential generalization of that of local alignment kernels [8] and embraces not only the local but also the global principle of sequence comparison. A kernel function defined on a set of entities of arbitrary kind embeds it into a hypothetical linear space, with the role of the inner product played by the kernel itself. Thus, any linear methods become applicable to the set of amino acid sequences once we have managed to measure a pair-wise relation between them by a kernel function. The proposed class of evolutionary kernels is verified via clustering a given set of amino acid sequences that contains several known groups of homologous proteins from the VIDA database. For this purpose, we modified the k-means method of clustering, one of the most popular linear methods, for the kernel-based protein representation. It turned out that the subclass of global kernels demonstrated almost complete coincidence of the clustering result with the true homologous groups in the protein set under processing, in contrast to local alignment kernels and similarity measures based on finding the optimal alignment, which could not bring together different protein families.
2 Evolution-Based Principle of Comparing Amino Acid Sequences

2.1 Similarity of Amino Acids

Measuring the similarity of amino acid sequences must inevitably be based on measuring the similarity of the amino acids forming them. The most commonly adopted similarity measure involves the family of PAM substitution matrices derived by Margaret Dayhoff [20] from a probabilistic model of evolution. Another popular family of substitution matrices was introduced by Steven and Jorja Henikoff and called BLOSUM (BLOcks SUbstitution Matrices) [21]. These matrices directly calculate frequencies of appearance of different amino acids at the same positions in an extracted block of similar fragments of sequences, requiring no knowledge of phylogeny but only the results of the alignment. However, it is shown in [22] that the family of BLOSUM substitution matrices can be explained in terms of Dayhoff's evolutionary model, as was done for PAM.

The main mathematical notion of Dayhoff's PAM evolution model at a single point of a protein sequence is that of a Markov chain over the alphabet of 20 amino acids $A = \{\alpha_1, \dots, \alpha_{20}\}$. The model is defined by a matrix of conditional probabilities $\Psi = (\psi(\alpha_j \mid \alpha_i),\ i, j = 1, \dots, 20)$ that amino acid $\alpha_i$ will be substituted by amino acid $\alpha_j$ at the next step of evolution (mutation probability matrix). It is assumed that the Markov chain of evolution is ergodic and reversible, i.e., there exists the final probability distribution $\xi(\alpha_j)$

$$\sum_{\alpha_i \in A} \xi(\alpha_i)\,\psi(\alpha_j \mid \alpha_i) = \xi(\alpha_j), \qquad (1)$$

and the equality
$$\xi(\alpha_i)\,\psi(\alpha_j \mid \alpha_i) = \xi(\alpha_j)\,\psi(\alpha_i \mid \alpha_j) \qquad (2)$$
holds true for all amino acids. When estimating the matrix $\Psi$ by aligning a set of very similar sequences, it was conventionally accepted that $1 - \sum_{i=1}^{20} \xi(\alpha_i)\,\psi(\alpha_i \mid \alpha_i) = 0.01$, i.e., 1% of amino acids would change at a single step of evolution. This mutation probability matrix is called PAM 1 and associated with the evolutionary distance 1. Margaret Dayhoff also considered the $s$-fold thinned-out Markov chain, i.e., the chain derived from the original one by taking away $s-1$ of every $s$ elements, which is defined by the matrix $\Psi^{[s]} = (\psi^{[s]}(\alpha_j \mid \alpha_i),\ i, j = 1, \dots, 20) = \underbrace{\Psi \cdots \Psi}_{s}$, corresponding to the evolutionary distance $s$ and having the same final probability distribution $\xi(\alpha_j)$. The most popular mutation probability matrix is PAM 250, with $s = 250$. It is easy to prove [22] that for any $s$ the similarity measure

$$\mu^{[s]}(\alpha_i, \alpha_j) = \psi^{[s]}(\alpha_j \mid \alpha_i)\,\xi(\alpha_i) = \psi^{[s]}(\alpha_i \mid \alpha_j)\,\xi(\alpha_j), \qquad (3)$$

as well as its normalized version

$$\tilde{\mu}^{[s]}(\alpha_i, \alpha_j) = \frac{\mu^{[s]}(\alpha_i, \alpha_j)}{\xi(\alpha_i)\,\xi(\alpha_j)} = \frac{\psi^{[s]}(\alpha_j \mid \alpha_i)}{\xi(\alpha_j)} = \frac{\psi^{[s]}(\alpha_i \mid \alpha_j)}{\xi(\alpha_i)}, \qquad (4)$$

are kernel functions, i.e., they form positive definite matrices on the set of amino acids. The PAM scoring matrices are traditionally represented in the log-odds form $\pi^{[s]}(\alpha_i, \alpha_j) = 10 \log_{10} \tilde{\mu}^{[s]}(\alpha_i, \alpha_j)$. This logarithmic representation is convenient from the viewpoint of scaling similarity values but deprives $\tilde{\mu}^{[s]}(\alpha_i, \alpha_j)$ of the original kernel-function properties.
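As an illustration, the construction of the kernel (3)-(4) from a PAM 1 matrix can be sketched as follows; the inputs psi1 and xi are assumed to be supplied from Dayhoff's published data, and the positive (semi)definiteness of the resulting matrix can be checked numerically:

```python
import numpy as np

def pam_kernel(psi1: np.ndarray, xi: np.ndarray, s: int = 250):
    """Sketch of Eqs. (3)-(4). psi1[i, j] is assumed to hold the PAM 1
    mutation probability psi(alpha_j | alpha_i); xi holds the final
    (stationary) distribution xi(alpha_i). Both come from Dayhoff's data."""
    # The s-fold thinned-out chain: Psi^[s] is the s-th matrix power of Psi.
    psi_s = np.linalg.matrix_power(psi1, s)
    # Eq. (3): mu^[s](a_i, a_j) = psi^[s](a_j | a_i) * xi(a_i)
    mu = psi_s * xi[:, None]
    # Eq. (4): normalized version
    mu_tilde = mu / np.outer(xi, xi)
    return mu, mu_tilde

# The kernel property can be verified by checking that the eigenvalues of
# the (symmetrized, to absorb rounding noise) matrix are non-negative:
#   mu, _ = pam_kernel(psi1, xi)
#   assert np.linalg.eigvalsh((mu + mu.T) / 2).min() > -1e-12
```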
2.2 The Main Idea of Comparing Two Amino Acid Sequences

Let $\Omega$ be the set of all finite amino acid sequences $\omega = (\omega_t,\ t = 1, \dots, N)$, $\omega_t \in A$. We shall also use the notation $\Omega_n = \{\omega = (\omega_t,\ t = 1, \dots, N),\ \omega_t \in A,\ N = n\}$ for the set of all sequences having the fixed length $n$, and $\Omega_{\geq n} = \{\omega = (\omega_t,\ t = 1, \dots, N),\ \omega_t \in A,\ N \geq n\}$ for the set of sequences that are not shorter than $n$. Let us, further, consider a random sequence $\vartheta = (\vartheta_i \in A,\ i = 1, \dots, n) \in \Omega_n \subseteq \Omega$ of random length $n$, such that the pair $(n, \vartheta)$ is jointly defined by a pair of probability distributions $(r(n),\ n = 0, 1, 2, \dots)$ and $(p_n(\vartheta),\ \vartheta \in \Omega_n)$. It appears natural to evaluate the similarity of two amino acid sequences $\omega', \omega'' \in \Omega_{\geq n}$ by computing the probability of the hypothesis that they originate from the same random ancestor $\vartheta \in \Omega_n$ as results of two independent branches of evolution defined by a known random transformation $(\varphi_n(\omega \mid \vartheta),\ \omega \in \Omega_{\geq n},\ \vartheta \in \Omega_n)$:
$$K(\omega', \omega'') = \sum_{n=0}^{\infty} r(n) \sum_{\vartheta \in \Omega_n} p_n(\vartheta)\,\varphi_n(\omega' \mid \vartheta)\,\varphi_n(\omega'' \mid \vartheta). \qquad (5)$$
Theorem 1. Any choice of the distribution $(r(n),\ n = 0, 1, 2, \dots)$ and of the families of conditional distributions $(\varphi_n(\omega \mid \vartheta),\ \omega \in \Omega_{\geq n},\ \vartheta \in \Omega_n)$ and $(p_n(\vartheta),\ \vartheta \in \Omega_n)$ leads to the fact that (5) is a kernel function on the set of all amino acid sequences.

The proof of Theorem 1 is based on checking whether Mercer's conditions [23] are met.

2.3 The Model of Random Evolutionary Transformation of an Amino Acid Sequence into Another One
We consider here only two-step random transformations $\varphi_n(\omega \mid \vartheta)$ of the ancestor sequence $\vartheta \in \Omega_n$ into a resulting sequence $\omega \in \Omega_{\geq n}$.

Step 1. Random choice of an $n$-length structure $v = (v_1, \dots, v_n)$ of the transformation $\vartheta \to \omega$, with a distribution $q_n(v)$ defined on the set of all $n$-length structures $v \in V_n$, $\sum_{v \in V_n} q_n(v) = 1$, where the increasing sequence of natural numbers $1 \leq v_1 < v_2 < \dots < v_n$ indicates the positions of the resulting sequence into which the respective elements of the ancestor $\vartheta = (\vartheta_1, \dots, \vartheta_n)$ will be transformed. So, for any given sequence $\omega = (\omega_1, \dots, \omega_N)$ of length $N \geq n$, a specific structure $v = (v_1, \dots, v_n)$ explicitly defines, first, a subsequence $\omega_v = (\omega_{v_i},\ i = 1, \dots, n)$ of elements, which we shall name the key subsequence, and, second, the additional subsequence $\bar{\omega}_v = (\omega_t,\ t \neq v_i,\ i = 1, \dots, n)$, such that $\omega = \omega_v \cup \bar{\omega}_v$.
[Figure 1 depicts an ancestor $\vartheta$ of length $n = 5$ transformed by deletions, substitutions, and insertions into two sequences $\omega'$ and $\omega''$, with their key subsequences and additional subsequences marked at positions $v'_1, \dots, v'_n$ and $v''_1, \dots, v''_n$.]

Fig. 1. The structure of random transformation of symbolic sequences
Step 2. For each structure $v \in V_n$, a structure-dependent random transformation $\eta_n(\omega \mid \vartheta, v) \geq 0$ is assumed to be defined such that $\eta_n(\omega \mid \vartheta, v) = 0$ if $\omega \notin \Omega_{\geq v_n}$, i.e., $\sum_{\omega \in \Omega_{\geq v_n}} \eta_n(\omega \mid \vartheta, v) = 1$. As a result, the distribution $\varphi_n(\omega \mid \vartheta)$ is the mixture

$$\varphi_n(\omega \mid \vartheta) = \sum_{v \in V_n} q_n(v)\,\eta_n(\omega \mid \vartheta, v). \qquad (6)$$
In this work, we make some additional assumptions on the distributions forming the class of kernel functions (5).

Conditional independence of the key-subsequence elements. The symbols of the key subsequence are randomly generated from those of the original sequence in accordance with Dayhoff's mutation probabilities $\psi^{[s]}(\alpha_j \mid \alpha_i)$ on the set of amino acids, with a conventional evolution step $s$ (Section 2.1):

$$\eta_n(\omega_v \mid \vartheta, v) = \prod_{i=1}^{n} \psi^{[s]}(\omega_{v_i} \mid \vartheta_i). \qquad (7)$$

Independence of the ancestor-sequence elements. The original sequence is formed by independent symbols chosen in accordance with Dayhoff's final probabilities $\xi(\alpha_i)$ on the set of amino acids:

$$p_n(\vartheta) = p_n(\vartheta_1, \dots, \vartheta_n) = \prod_{i=1}^{n} \xi(\vartheta_i). \qquad (8)$$
Completely random runaway of the resulting sequence. The runaway $\tau$ of the length $N_\omega = v_n + \tau$ over $v_n$ in the transformation structure $v = (v_1, \dots, v_n)$ is determined through the "completely random" length of the final part of the additional subsequence $(\omega_{v_n}, \dots, \omega_{N_\omega})$. So, the distribution of $\tau$ is considered as an improper "almost uniform" distribution

$$z(\tau) \to 0, \qquad \sum_{\tau=0}^{\infty} z(\tau) = 1. \qquad (9)$$
Independence of the additional subsequence from the key one. It is assumed that

$$\eta_n(\omega \mid \vartheta, v) = \eta_n(\omega_v \mid \vartheta, v)\,\eta(\bar{\omega}_v). \qquad (10)$$

The notation $\eta(\bar{\omega}_v)$ instead of $\eta(\bar{\omega}_v \mid \vartheta, v)$ means that, first, there is no dependence on the original sequence $\vartheta$ and, second, if the symbolic compositions of different subsequences coincide, $\bar{\omega}_{v'} = \bar{\omega}_{v''}$, then $\eta(\bar{\omega}_{v'}) = \eta(\bar{\omega}_{v''})$.
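To make assumptions (7)-(10) concrete, the following sketch samples a descendant from an ancestor under this two-step model. The distributions $q_n(v)$ and $\eta(\bar{\omega}_v)$ are left unspecified by the model at this point, so the geometric gap parameter q_gap used below is purely an illustrative assumption:

```python
import random

def evolve(theta, psi_s, xi, q_gap=0.2):
    """Illustrative sampler for the two-step transformation of Section 2.3.
    psi_s[a] is assumed to map an ancestor symbol a to a dict of mutation
    probabilities psi^[s](. | a); xi is a dict of final probabilities.
    q_gap is an assumed geometric parameter for insertion-gap lengths."""
    def draw(dist):
        r, acc = random.random(), 0.0
        for sym, p in dist.items():
            acc += p
            if r <= acc:
                return sym
        return sym  # guard against floating-point rounding

    omega = []
    for anc in theta:
        # Insertions before the next key position (additional subsequence),
        # its symbols drawn independently from xi, cf. Eq. (10).
        while random.random() < q_gap:
            omega.append(draw(xi))
        # Key position: mutate the ancestor symbol, cf. Eq. (7).
        omega.append(draw(psi_s[anc]))
    return "".join(omega)
```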
2.4 The General Kernel Structure

After these assumptions, it remains only to choose the family of distributions $q_n(v)$ over the set of $n$-length structures $v = (v_1, \dots, v_n) \in V_n$ and the family $\eta(\bar{\omega}_v)$ that determines the additional subsequence. Below, in Section 3, we shall see that these two choices not only complete the definition of the class of kernels but also essentially affect its properties. However, the assumptions already made allow representing the class of kernels (5) in a more structurally explicit form.

Any pair of $n$-length structures $v' = (v'_1, \dots, v'_n) \in V_n$ and $v'' = (v''_1, \dots, v''_n) \in V_n$ of transformations $\vartheta \to \omega'$ and $\vartheta \to \omega''$, $\vartheta \in \Omega_n$, $\omega', \omega'' \in \Omega_{\geq n}$, defines a pair-wise alignment of the two sequences (Figure 1):

$$w = (v'_w, v''_w) = \bigl[ (v'_{1,w}, v''_{1,w}), \dots, (v'_{n,w}, v''_{n,w}) \bigr].$$

We shall call it the pair-wise alignment of order $n$ ($n$-order alignment) because exactly $n$ pairs of amino acids will be immediately compared. The set of all $n$-order pair-wise alignments is $W_n = V_n \times V_n$, and the distribution $q_n(v)$ on $V_n$ defines the distribution $q_n(w) = q_n(v'_w)\,q_n(v''_w)$. Vice versa, any $n$-order pair-wise alignment $w$ defines
a pair of $n$-length transformation structures $(v'_w, v''_w)$. It should be noticed that not all pair-wise alignments are feasible for a pair of sequences $\omega'$ and $\omega''$ of lengths $N'$ and $N''$, but only those of them which satisfy the conditions $v'_n(w) \leq N'$ and $v''_n(w) \leq N''$. The set of all such pair-wise alignments of the sequences $\omega'$ and $\omega''$ will be denoted as $w \in W_{nN'N''} \subset W_n$.

For each pair-wise alignment $w \in W_{nN'N''}$, we define a real-valued symmetric function over all pairs of sequences $\omega' \in \Omega_{\geq N'}$ and $\omega'' \in \Omega_{\geq N''}$:

$$K_n(\omega', \omega'' \mid w) = K_n(\omega'_{v'_w}, \omega''_{v''_w}) = \prod_{i=1}^{n} \mu^{[2s]}\bigl(\omega'_{v'_{w,i}}, \omega''_{v''_{w,i}}\bigr), \qquad (11)$$

where $\mu^{[2s]}(\omega'_{v'_{w,i}}, \omega''_{v''_{w,i}})$ is the kernel (3) on the set of amino acids for the doubled evolution step $2s$ with respect to the step $s$ taken in the one-side model (7). Since any product of kernels remains a kernel, the function $K_n(\omega', \omega'' \mid w)$ is also a kernel. We shall call it the alignment-dependent key kernel of order $n$. Further, we define the alignment-dependent additional kernel as

$$\bar{K}_n(\omega', \omega'' \mid w) = \bar{K}_n(\bar{\omega}'_{v'_w}, \bar{\omega}''_{v''_w}) = \eta(\bar{\omega}'_{v'_w})\,\eta(\bar{\omega}''_{v''_w}). \qquad (12)$$
Theorem 2. Under assumptions (7)-(10) and notations (11) and (12), the kernel (5) is representable as

$$K(\omega', \omega'') = \sum_{n=0}^{\infty} r(n) \sum_{w \in W_{nN'N''}} q_n(w)\,K_n(\omega', \omega'' \mid w)\,\bar{K}_n(\omega', \omega'' \mid w). \qquad (13)$$

Proof. Elementary substitution of assumptions (7)-(10) into (5) yields (13) for the function

$$K_n(\omega', \omega'' \mid w) = \sum_{\vartheta \in \Omega_n} \prod_{i=1}^{n} \xi(\vartheta_i)\,\psi^{[s]}(\omega'_{v'_{w,i}} \mid \vartheta_i)\,\psi^{[s]}(\omega''_{v''_{w,i}} \mid \vartheta_i).$$
Its equivalence to (11) immediately follows from Dayhoff's main assumptions on the ergodicity (1) and reversibility (2) of the PAM Markov chain.

Thus, we have come to the class of kernels on the set of amino acid sequences

$$K(\omega', \omega'') = \sum_{n=0}^{\infty} r(n) \sum_{w \in W_{nN'N''}} q_n(w)\,\bar{K}_n(\omega', \omega'' \mid w) \prod_{i=1}^{n} \mu^{[2s]}\bigl(\omega'_{v'_{w,i}}, \omega''_{v''_{w,i}}\bigr), \qquad (14)$$

where the distributions $(r(n),\ n = 0, 1, 2, \dots)$, $(q_n(w),\ w \in W_n)$ and $\eta(\bar{\omega}_v)$ over all symbolic sequences of any length $\bar{\omega}_v \in \Omega$ are not defined as yet. To develop a policy for choosing these remaining elements of the kernel construction, it is necessary to clarify their influence on the result of sequence comparison. The distribution $(r(n),\ n = 0, 1, 2, \dots)$ is meant to express some assumption on the length of the hidden ancestor, i.e., the number of pairs of amino acid positions in $\omega'$ and $\omega''$ which will be taken into account when comparing these sequences. The similarity of the given sequences from the viewpoint of the presence of some "almost common" subsequence of length $n$ in them is measured just by the key kernel (11), whereas the choice of the additional kernel (12), defined by $\eta(\bar{\omega}_v)$, gives the possibility to dilute
this assessment, if desirable, by involving other, unpaired elements in the comparison. The role of the distribution $(q_n(w),\ w \in W_n)$ is, actually, to regulate whether, how, and to what extent the gaps between paired positions affect the comparison.
3 Some Particular Kinds of Kernels

3.1 Kernels of Fixed and Unfixed Order

In particular, if there is no reason to constrain the tentative length of the hypothetical "almost common" subsequence, the distribution $(r(n),\ n = 0, 1, 2, \dots)$ should be taken as an improper "almost uniform" distribution $r(n) \to \mathrm{const} \cong 0$, $\sum_{n=0}^{\infty} r(n) = 1$. In this extreme case, the respective term falls out of (14) completely, and we obtain key-length-indifferent kernels, which we call kernels of absolutely unfixed order:

$$K(\omega', \omega'') = \sum_{n=0}^{\infty} \sum_{w \in W_{nN'N''}} q_n(w)\,\bar{K}_n(\omega', \omega'' \mid w) \prod_{i=1}^{n} \mu^{[2s]}\bigl(\omega'_{v'_{w,i}}, \omega''_{v''_{w,i}}\bigr). \qquad (15)$$
At the other extreme, if it is desirable to strictly preset the length of the key subsequence, this distribution should turn into an absolutely concentrated one: $r(n) = 1$ and $r(k) = 0$ for any $k \neq n$. Kernels of this kind are called here kernels of a fixed order:

$$K_n(\omega', \omega'') = \sum_{w \in W_{nN'N''}} q_n(w)\,\bar{K}_n(\omega', \omega'' \mid w) \prod_{i=1}^{n} \mu^{[2s]}\bigl(\omega'_{v'_{w,i}}, \omega''_{v''_{w,i}}\bigr). \qquad (16)$$
3.2 Local and Global Kernels
As a rule, the preference for one or another kind of alignment is expressed by the mathematical assumption that the likelihood $q_n(w) = q_n(v')\,q_n(v'')$ depends only on the lengths of the gaps at the left $(v_1 - 1)$, in the middle $((v_2 - v_1), \dots, (v_n - v_{n-1}))$, and at the right $(N_\omega - v_n)$ of each of the two sequences. It is the usual practice to assume that the random lengths of the gaps are a priori independent, and each of them has a probability distribution monotonically diminishing as the length grows:

$$g_i(d_i \mid a, b) \propto \begin{cases} 1, & d_i = 1, \\ \exp\bigl[-\beta(a + b\,d_i)\bigr], & d_i > 1, \end{cases} \qquad d_0 = v_1,\ d_i = v_i - v_{i-1},\ \text{or}\ d_{n+1} = N_\omega - v_n. \qquad (17)$$

If $a = 0$, the "cost" of two gaps $d_i$ and $d_j$ is the same as that of one gap of the summed-up length $d_i + d_j$. Otherwise, if $a > 0$, one long gap is considered more preferable than two short gaps of the same total length. However, the distributions may be taken, if required, as position-dependent, i.e., different for different $i$.

The kernel function is said to be local if only the middle parts of the two sequences participate in the comparison. In this case, we define $(q_n(v),\ v \in V_n)$, for instance, by putting the improper joint distribution as the product of the identical single distributions (17) within the range of the key part and "absolutely random" ones beyond it:
$q_n(v) \propto \prod_{i=2}^{n} g(v_i - v_{i-1} \mid a, b)$, and the distribution of the additional subsequences is taken as the completely improper one $\eta(\bar{\omega}_v) = \mathrm{const} = 1$, which defines neither the lengths nor the compositions of the additional subsequences.

From the inverse point of view, the additional parts of the two sequences should be compared with the same attention as the key ones when judging sequence similarity. In this case, the a priori models of both gaps and additional symbols should be extended onto the entire lengths of the sequences, for instance, as

$$\begin{cases} q_n(v) = g(v_1 \mid a, b) \Bigl( \prod_{i=2}^{n} g(v_i - v_{i-1} \mid a, b) \Bigr) g(N_\omega - v_n), \\ \eta(\bar{\omega}_v) = \prod_{1 \leq t < v_n,\ t \neq v_i} \xi(\omega_t). \end{cases} \qquad (18)$$
The a priori distributions (18) define the family of global kernels.
4 Kernel Computation: A Slight Modification of the Algorithm for Local Alignment Kernels

It should be noticed that the properties of the proposed evolution model allow expressing the initial kernel (5) in the absolutely equivalent but essentially simpler form (14), which does not contain the sum over all possible hidden ancestor sequences (see Theorem 2). This circumstance makes possible a sufficiently simple and quick computation of this kernel, with complexity $O(|\omega'|\,|\omega''|)$.

Despite the fact that the local alignment kernels proposed by J.-P. Vert and his colleagues in [8] for the classification of biological sequences are not motivated by any explicitly formulated evolution model, they fall, by their algebraic structure, into the class considered here. More exactly, the local alignment kernels belong to the family of local kernels of absolutely unfixed order. So, the dynamic-programming algorithm described in [8] computes a kernel of this kind. The other particular cases of the proposed class of evolution-based kernels, namely, global kernels of absolutely unfixed order and local and global kernels of fixed order, require a slight modification of the algorithm [8]. In particular, the global kernel of absolutely unfixed order can be computed using (17) by the recurrent expressions

$$M_{i,j} = \mu^{[2s]}(\omega'_i, \omega''_j)\bigl( M_{i-1,j-1} + e^{-b} X_{i-1,j-1} + e^{-b} Y_{i-1,j-1} + e^{-2b} Z_{i-1,j-1} + e^{-g(i-1 \mid a,b) - g(j-1 \mid a,b)} \bigr),$$
$$X_{i,j} = e^{-a} M_{i-1,j} + e^{-b} X_{i-1,j}, \qquad Y_{i,j} = e^{-a} M_{i,j-1} + e^{-b} Y_{i,j-1},$$
$$Z_{i,j} = e^{-a}\bigl( e^{-a} M_{i-1,j-1} + X_{i,j-1} + Y_{i-1,j} \bigr) + e^{-2b} Z_{i-1,j-1},$$

starting with $M_{i,0} = M_{0,j} = 0$, $X_{i,0} = X_{0,j} = 0$, $Y_{i,0} = Y_{0,j} = 0$, $Z_{i,0} = Z_{0,j} = 0$. The resulting value of the global kernel of absolutely unfixed order is given by the formula

$$K(\omega', \omega'') = M_{|\omega'|,|\omega''|} + e^{-b}\bigl( X_{|\omega'|,|\omega''|} + Y_{|\omega'|,|\omega''|} + e^{-b} Z_{|\omega'|,|\omega''|} \bigr).$$
5 Kernel-Based Clustering of Proteins

The task of clustering is to partition the given set of amino acid sequences $\Omega^* = \{\omega_j,\ j = 1, \dots, M\}$ into $k$ disjoint subsets $\Omega^* = \Omega^*_1 \cup \Omega^*_2 \cup \dots \cup \Omega^*_k$, $\Omega^*_i \cap \Omega^*_l = \emptyset$, $i, l = 1, \dots, k$, $i \neq l$, each of which consists of similar sequences with respect to the accepted similarity measure. We use the well-known $k$-means method [24, 25], adapted by us for the case when the similarity is measured by a kernel function.

Any kernel $K(\omega', \omega'')$ defined on the given set of amino acid sequences $\Omega^*$ embeds it into a hypothetical linear space $\tilde{\Omega}^* \supset \Omega^*$ with the Euclidean metric

$$\rho^2(\omega', \omega'') = K(\omega', \omega') + K(\omega'', \omega'') - 2 K(\omega', \omega''), \qquad \omega', \omega'' \in \tilde{\Omega}^*. \qquad (20)$$
The $k$-means iteration procedure consists in the implementation of the following two steps at each $(s+1)$-th iteration:

1) finding $k$ fixed abstract class centers $\vartheta_1^{s+1}, \dots, \vartheta_k^{s+1} \in \tilde{\Omega}^*$ on the basis of the known partition $\{\Omega^{*(s)}_i,\ i = 1, \dots, k\}$ from the previous iteration, by the rule $\vartheta_i^{s+1} = \arg\min_{\vartheta \in \tilde{\Omega}^*} \sum_{\omega_l \in \Omega^{*(s)}_i} \rho^2(\omega_l, \vartheta)$, which, with respect to (20) and the properties of the Frechet differential in the linear space $\tilde{\Omega}^*$, leads to the explicit expression

$$\vartheta_i^{s+1} = \frac{1}{|\Omega^{*(s)}_i|} \sum_{\omega_l \in \Omega^{*(s)}_i} \omega_l; \qquad (21)$$
2) finding the new partition defined by these centers:

$$\Omega^{*(s+1)}_i = \Bigl\{ \omega_j \in \Omega^* : \rho^2(\omega_j, \vartheta_i^{s+1}) = \min_{l = 1, \dots, k} \rho^2(\omega_j, \vartheta_l^{s+1}) \Bigr\}, \qquad i = 1, \dots, k. \qquad (22)$$
In accordance with (20), and by virtue of the linearity property of the inner product $K(\omega', \omega'')$ in the linear space $\tilde{\Omega}^*$, we have

$$\rho^2(\omega_j, \vartheta_i^{s+1}) = K(\omega_j, \omega_j) + \frac{1}{|\Omega^{*(s)}_i|^2} \sum_{\omega_l \in \Omega^{*(s)}_i} \sum_{\omega_m \in \Omega^{*(s)}_i} K(\omega_l, \omega_m) - \frac{2}{|\Omega^{*(s)}_i|} \sum_{\omega_l \in \Omega^{*(s)}_i} K(\omega_j, \omega_l).$$
Substitution of the obtained formula into (22) allows avoiding explicit computation of the abstract centers and, thus, avoiding step 1. To start the $k$-means algorithm, we apply the procedure of finding "anomalous patterns" proposed in [25], which automatically identifies the number and compositions of the initial clusters.
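A minimal sketch of this kernel k-means scheme, with an externally supplied initial assignment standing in for the "anomalous patterns" initialization of [25], might look as follows:

```python
import numpy as np

def kernel_kmeans(K, labels, n_iter=100):
    """Kernel k-means following Eqs. (20)-(22): assignments are updated
    from the Gram matrix K alone, so the abstract centers (21) are never
    computed explicitly. `labels` is an initial cluster assignment, e.g.
    produced by the "anomalous patterns" procedure of [25]."""
    labels = np.asarray(labels)
    M = K.shape[0]
    for _ in range(n_iter):
        new_labels = np.empty(M, dtype=int)
        for j in range(M):
            best, best_d = -1, np.inf
            for i in np.unique(labels):
                members = np.flatnonzero(labels == i)
                # rho^2(omega_j, center_i) via the kernel expansion above
                d = (K[j, j]
                     + K[np.ix_(members, members)].sum() / len(members) ** 2
                     - 2.0 * K[j, members].sum() / len(members))
                if d < best_d:
                    best, best_d = i, d
            new_labels[j] = best
        if np.array_equal(new_labels, labels):
            break  # partition is stable
        labels = new_labels
    return labels
```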
6 Protein Homology Analysis. Data, Experiments and Results

One of the particular subclasses of the class of evolution-model-based kernels proposed in this paper is that of so-called local kernels of absolutely unfixed order (Section 3), which coincides with the family of local alignment kernels of J.-P. Vert and his colleagues [8]. It is shown in [8] that kernels of this kind essentially outperform the protein similarity measures based on finding the optimal alignment, as well as other non-evolutionary kernels, in detecting remote homology of proteins. The aim of the experiment presented in this section is to demonstrate that there exists at least one subclass in our class of kernels, namely that of global kernels of absolutely unfixed order (Section 3), that can be useful in bringing together proteins from indubitably the same homologous
group which appear, nevertheless, as very different from the viewpoint of other similarity measures. The data set consists of 233 membrane glycoproteins comprising 8 herpesvirus Homology Protein Families (HPF), divided in the VIDA database into three subsets according to their function. The structure of the data set and the results of its clustering are shown in Figure 2. In the experiments, we tested four different similarity measures: (a) the PSI-BLAST tool [3], (b) the Needleman-Wunsch algorithm [1], (c) the local alignment kernel of absolutely unfixed order [8], and (d) the global kernel of absolutely unfixed order. The traditional measures (a) and (b) are based on the optimal local and global alignment, respectively. The kernels (c) and (d) considered in this paper are based on multiple alignments. For methods (a)-(c), we used the parameter values recommended by the authors.
[Figure 2 compares the desired classification (class 1: 109 proteins, glycoprotein H, HPF 12, 42, 531; class 2: 76 proteins, glycoprotein L, HPF 47, 50, 114, 296; class 3: 48 proteins, glycoprotein M, HPF 20) with the clusterings produced by PSI-BLAST, Needleman-Wunsch, the local kernel, and the global kernel.]

Fig. 2. Results of clustering the set of 233 membrane glycoproteins
For each of these similarity measures, we solved the problem of clustering the 233 proteins into an unfixed number of clusters. In all cases, the number of initial classes $k$ was identified on the basis of the procedure of finding "anomalous patterns" [25]. For cases (a) and (b), we used the standard dissimilarity-based $k$-centers algorithm of clustering, in which some real object plays the role of the approximate center of each class, whereas for cases (c) and (d) the kernel-based $k$-means procedure was applied, in which the center of the respective class is represented by the arithmetic mean of the objects forming it (21), in accordance with the linear operations induced by the kernel in $\tilde{\Omega}^*$. The results of clustering presented in Figure 2 show that only the global kernel of completely unspecified order yields a clustering which practically coincides with the actual structure of the three homology groups of proteins. In fact, the global kernel correctly identifies the similarity between otherwise dissimilar homologous protein families that bear the same function in the organisms under consideration.
This is an obvious example of the superiority of the global kernel. However, it should be noticed that this superiority is not absolute. For a number of classes, the other particular cases of the proposed evolutionary kernel demonstrate higher performance than the global kernel. In particular, we found that the global kernel, like any global similarity measure, is very sensitive to the lengths of the proteins and cannot detect the similarity of sequences of very different lengths. At the same time, it allows comparing whole sequences instead of their parts and, as a result, detecting similarity in some cases where local kernels fail.
7 Conclusions

In this paper, we have proposed a simple probabilistic model of amino acid sequence evolution, which is built as a straightforward generalization of the PAM evolution model developed by Margaret Dayhoff for single amino acids. The respective pair-wise sequence similarity measure possesses the properties of a kernel function computed as the likelihood of the hypothesis that both sequences are results of two independent evolutionary transformations of some hidden common ancestor. Under some particular assumptions on the model of protein evolution, the proposed kernel has the same structure as the well-known local alignment kernel introduced by J.-P. Vert [8]. So, on the one hand, we have found a probabilistic justification of Vert's local alignment kernels and, on the other, proposed an essential generalization of them, which embraces not only the local but also the global principle of sequence comparison, as well as a number of other particular cases specified by the choice of the parameters in the evolution model. We also show that the proposed evolution-based pair-wise similarity measure can be useful in the analysis of some difficult distant-homology sets of proteins and can help in computationally resolving situations in which other measures may fail. The particular subclass of fixed-order kernels, which are based on alignments with only a fixed number of substitutions, attracts special interest. Kernels of this kind may be very useful when a priori information on the length of the unknown ancestor sequence is available.
Acknowledgments

This work was supported by the Russian Foundation for Basic Research, Grants 08-01-12023 and 08-01-00695-а, and by INTAS grant YSF 06-1000014-6563 to V. Sulimova for her visit to Birkbeck College in 2007-2008. The authors are grateful to Dr. P. Kellam of the Department of Virology, UCL, London, for making the VIDA database contents available for the analysis and advising us on substantive matters. The authors are indebted to the anonymous referees, whose multiple comments helped us to improve the presentation.
References

1. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino-acid sequence of two proteins. J. of Molecular Biology 48, 443–453 (1970)
2. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. of Mol. Biol. 147, 195–197 (1981)
3. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.: Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research 25, 3389–3402 (1997)
4. Pearson, W.R.: Flexible sequence similarity searching with the FASTA3 program package. Methods Mol. Biol. 132, 185–219 (2000)
5. Mirkin, B., Camargo, R., Fenner, T., Loizou, G., Kellam, P.: Aggregating homologous protein families in evolutionary reconstructions of herpesviruses. In: Ashlock, D. (ed.) Proceedings of the IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pp. 255–262 (2006)
6. Rocha, J., Rossello, F., Segura, J.: The Universal Similarity Metric does not detect domain similarity. Technical Report, Quantitative Methods, Q-bio QM (2006), http://arxiv.org/abs/q-bio/0603007
7. Vinga, S., Almeida, J.: Alignment-free sequence comparison – A review. Bioinformatics 19, 513–523 (2003)
8. Vert, J.-P., Saigo, H., Akutsu, T.: Local alignment kernels for biological sequences. In: Scholkopf, B., Tsuda, K., Vert, J.-P. (eds.) Kernel Methods in Computational Biology. MIT Press, Cambridge (2004)
9. Schölkopf, B., Smola, A.J.: Learning with Kernels. MIT Press, Cambridge (2002)
10. Rangwala, H., Karypis, G.: Profile-based direct kernels for remote homology detection and fold recognition. Bioinformatics 21(23), 4239–4247 (2005)
11. Qiu, J., Hue, M., Ben-Hur, A., Vert, J.-P., Noble, W.S.: A structural alignment kernel for protein structures. Bioinformatics 23(9), 1090–1098 (2007)
12. Sun, L., Ji, S., Ye, J.: Adaptive diffusion kernel learning from biological networks for protein function prediction. BMC Bioinformatics 9, 162 (2008)
13. Leslie, C.S., Eskin, E., Cohen, A., Weston, J., Noble, W.S.: Mismatch string kernels for discriminative protein classification. Bioinformatics 20(4), 467–476 (2004)
14. Haussler, D.: Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, UC Santa Cruz (1999)
15. Cuturi, M., Vert, J.-P.: A mutual information kernel for sequences. In: Proc. of IEEE Int. Joint Conference on Neural Networks, vol. 3, pp. 1905–1910 (2004)
16. Thorne, J.L., Kishino, H., Felsenstein, J.: An evolutionary model for maximum likelihood alignment of DNA sequences. Journal of Molecular Evolution 33, 114–124 (1991)
17. Miklos, I., Lunter, G.A., Holmes, I.: A "long indel" model for evolutionary sequence alignment. Molecular Biology and Evolution 21(3), 529–540 (2004)
18. Miklos, I., Novak, A., Satija, R., Lyngso, R., Hein, J.: Stochastic models of sequence evolution including insertion-deletion events. Statistical Methods in Medical Research 29 (2008)
19. Metzler, D.: Statistical alignment based on fragment insertion and deletion models. Bioinformatics 19, 490–499 (2003)
20. Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C.: A model of evolutionary change in proteins. Atlas of Protein Sequences and Structures 5(suppl. 3), 345–352 (1978)
21. Henikoff, S., Henikoff, J.: Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89(22), 10915–10919 (1992)
22. Sulimova, V., Mottl, V., Kulikowski, C., Muchnik, I.: Probabilistic evolutionary model for substitution matrices of PAM and BLOSUM families. DIMACS Technical Report 2008-16, Rutgers University, 17 p. (2008), ftp://dimacs.rutgers.edu/pub/dimacs/TechicalReports/TechReports/2008/2008-16.pdf
23. Mercer, J.: Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. R. Soc. London A 209, 415–446 (1909)
24. Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data. Wiley, New York (1990)
25. Mirkin, B.: Clustering for Data Mining: A Data Recovery Approach. Chapman and Hall/CRC, Boca Raton (2005)
Irreplaceable Amino Acids and Reduced Alphabets in Short-Term and Directed Protein Evolution

Miguel A. Jiménez-Montaño 2 and Matthew He 1

1 Division of Math, Science, and Technology, Nova Southeastern University, Ft. Lauderdale, USA
[email protected]
2 Facultad de Física e Inteligencia Artificial, Universidad Veracruzana, Xalapa, 91000 Veracruz, México
[email protected]
Abstract. In this paper we extend the definition of codon volatility to reduced amino acid alphabets in order to characterize mutations that conserve physical-chemical properties. We also define the average relative changeability of amino acids in terms of single-base codon self-substitution frequencies (identities). These frequencies are taken from an empirical codon substitution matrix [14]. It is shown that this index splits the amino acids into two groups: replaceable and irreplaceable. The same grouping is obtained from the size/complexity index introduced by Dufton [32]. Also, a 71% agreement is obtained with residues in mutually persistently conserved (MPC) positions [31]. These positions play a key role in fold and functional determination. The residual 29% can be readily explained. 75% of the residues with the highest rank according to MPC positions have the highest probability of causing disease if mutated.

Keywords: Codon volatility, amino acid changeability, genetic mutations, reduced alphabet.
1 Introduction

The understanding of short-term protein evolution is important for studying virus variability [1], somatic hypermutation [2], in vitro exploration of protein sequence space [3], enzyme and drug design by directed evolution [4, 5], and human genetic mutations [6], among other important problems in molecular biology and biotechnology. The aims, relevant questions, and research methods employed in long- and short-term protein evolution are substantially different. The former is mainly oriented to the construction of phylogenetic trees and intended to understand the development of life on Earth as a result of historical and natural evolution. The latter is mainly concerned with the probabilistic prediction of future changes, in order to deal with such problems as viral and bacterial evolution, the design of new proteins and drugs aided by directed evolution in vitro, and the causes of genetic disease.
Current models to describe protein sequence evolution employ Markov processes at the amino acid and codon levels [7]. They are of two types: empirical models, which do not explicitly consider the biological factors that shape protein evolution, and mechanistic models, which are built taking into account such features as codon usage, transition/transversion bias, selective pressures, and the structure of the genetic code. However, these factors are introduced through a set of parameters with different emphasis, which are not free of arbitrary assumptions; see, e.g., [8] and references therein.

To approach the fundamental problem of disentangling mutational and selective effects in evolution [9], it is necessary to employ a codon-level model. These models make it possible to separate mutational biases in the DNA from selective constraints on the protein, and to distinguish between codons encoding the same amino acid and those that do not. They differentiate silent mutations, consisting of identities and synonymous changes that maintain the encoded amino acid, from non-synonymous changes, which alter the encoded amino acid.

In our model, we start from the fact that the state space for the Markov process at the codon level has a fixed topological structure. The nucleotide triplets that comprise the genetic code can be plotted on the corners of a six-dimensional hypercube [10, 11, 12, 13]. Different regions of the cube tend to share traits relevant to amino acid function, which in some way encode the syntax of the genetic language. As we shall see, because the neighbors of a codon are invariant, they play a universal role in the fixation of a mutation. Besides the influence of the genetic code, empirical codon substitution matrices include variable factors which pertain to the use of the genetic language, such as codon usage, transition/transversion bias, and selective pressures. Recently, Gonnet and co-workers [14] proposed a matrix that represents a conservative approach, describing codon substitution probabilities for a particular set of evolutionary distances [8]. This implies a Ka/Ks ratio of 1, as well as the same transition-transversion bias for all genes, as pointed out in [15].

Since we are interested only in short-term evolution, out of the 190 possible interchanges among the 20 amino acids we consider only the 75 that can be obtained by single-base substitutions. Therefore, we introduce a reduced empirical matrix (REM), setting to zero all entries corresponding to more than one nucleotide change in the original matrix [14] and normalizing the resulting matrix. In this way, we take into account the local structure of the genetic code around each codon. In one step, a codon can change in nine different ways and generally can have from zero to three synonymous changes and from six to nine non-synonymous changes, except in the cases of six-fold degeneracy, such as serine, leucine, and arginine. We analyze all possible one-step changes for all 61 codons, excepting the three stop codons. Additional supplemental data files are available at http://www.uv.mx/ajimenez/#articles under the title of this paper.

The proponents of the empirical matrix [14] showed that codon substitution matrices are preferable to amino acid substitution matrices for PAM distances smaller than 50. Therefore, the matrix we selected seems appropriate for our analysis. This empirical matrix was constructed from 5 metazoan genomes.
Thus, it most accurately describes the evolution of these species; nonetheless, it appears to have a more general range of applicability. (A computational sketch of the REM construction is given below.)
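A minimal sketch of the REM construction just described, assuming the empirical matrix of [14] is available as a dictionary P of pairwise codon substitution frequencies:

```python
from itertools import product

BASES = "ACGT"
STOPS = {"TAA", "TAG", "TGA"}
CODONS = ["".join(c) for c in product(BASES, repeat=3)
          if "".join(c) not in STOPS]  # the 61 sense codons

def hamming(c1, c2):
    return sum(x != y for x, y in zip(c1, c2))

def reduced_empirical_matrix(P):
    """Build the REM: entries of P (assumed: {(codon1, codon2): frequency})
    whose codons differ in more than one base are set to zero, and each
    row is renormalized to sum to 1."""
    rem = {}
    for c1 in CODONS:
        row = {c2: P.get((c1, c2), 0.0)
               for c2 in CODONS if hamming(c1, c2) <= 1}
        total = sum(row.values())
        for c2, v in row.items():
            rem[(c1, c2)] = v / total if total > 0 else 0.0
    return rem
```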
2 Codon Volatility and Reduced Alphabets

For each of the 61 sense codons, the volatility of a codon [16] is defined as the sum, over all one-point neighboring codons $c_i$, of the distances between the corresponding amino acids. In general,

$$\nu(c) = \sum_{i} d\bigl(aa(c), aa(c_i)\bigr). \qquad (1)$$

Thus, by definition, the volatility requires us to choose a metric, $d$, that quantifies the distance between amino acids. The Hamming metric equals zero or one, depending on whether two amino acids are identical. We define the distance between any amino acid and a stop codon as zero. In [17] the definition of volatility was slightly modified, dividing by $q$, the number of non-stop single-base mutation neighbors. We get

$$\nu(c) = \frac{1}{q} \sum_{i} d\bigl(aa(c), aa(c_i)\bigr). \qquad (2)$$

With the Hamming metric, $\nu_H(c)$ gives an a priori probability for an amino acid-changing mutation. For example, let us consider the case of two arginine codons under the Hamming metric. CGC (R) has the following neighbors: CGG (R), CGT (R), CGA (R), CAC (H), AGC (S), TGC (C), GGC (G), CTC (L), and CCC (P), with volatility $\nu_H(\mathrm{CGC}) = 6/9 = 0.66$. Meanwhile, AGA (R) has the following neighbors: AAA (K), ACA (T), ATA (I), GGA (G), CGA (R), TGA (stop), AGG (R), AGT (S), AGC (S), with volatility $\nu_H(\mathrm{AGA}) = 6/8 = 0.75$. Thus, different codons of the same amino acid may have different propensities to change the encoded amino acid [13, 16].

Besides the Hamming metric, one may use another distance to take into account selected physical-chemical properties of the amino acids: for example, the Euclidean distance of Grantham [18], which weighs differences between amino acids according to hydrophobicity, volume, and composition, or the Miyata distance [19], which retains only the first two properties. Both metrics are different ways to quantify the degree to which a random point mutation will change the stereochemical properties of the corresponding amino acid [17]. Another possibility, which we propose here, is the Hamming metric in the space of a reduced amino acid alphabet [20, 21, 22, 23]. For example, we take the common binary alphabet [24] that splits the amino acids into:
• Hydrophobic: {1 = C, L, I, M, V, F, Y, W}, and
• Non-hydrophobic: {0 = P, A, G, S, T, Q, N, E, D, H, K, R}.
In this case, $d(aa(c_i), aa(c_j)) = 0$ if the amino acids are in the same category, and 1 otherwise. The so-called "buriability parameter" [25] provides a quantitative measure of the driving force for the burial of a residue and for protein stability. Recently, Zhou and Zhou [25] showed that the observed large buriability gap between hydrophobic and
hydrophilic residues is responsible for the burial of hydrophobic residues in soluble proteins. The residues in group 1 of our categorization have the 8 highest values of buriability [25]. For the codons in the above example we get $\nu^{(2)}(\mathrm{CGC}) = 2/9 = 0.22$, while $\nu^{(2)}(\mathrm{AGA}) = 1/8 = 0.125$, where the superscript refers to the number of letters of the reduced alphabet (see Table 1 below; a computational sketch of these volatilities follows the table). As we shall see in the next section, the character of neighboring amino acids in the genetic code [10-12] plays an important role in the fixation of amino acids during short-term evolution, as well as in the definition of amino acid groupings.

Table 1. Substitution probabilities from the CGC and AGA codons, according to the empirical matrix [14], normalized to single-base substitutions (REM)
From R CGC            dH  d(2)     From R AGA            dH  d(2)
R   CGC   0.433       0   0        R   AGA   0.444       0   0
R   CGG   0.174       0   0        R   AGG   0.243       0   0
R   CGT   0.162       0   0        K   AAA   0.142       1   0
R   CGA   0.131       0   0        R   CGA   0.106       0   0
H   CAC   0.0443      1   0        G   GGA   0.0183      1   0
S2  AGC   0.0173      1   0        S2  AGC   0.0154      1   0
C   TGC   0.0132      1   1        S2  AGT   0.0135      1   0
G   GGC   0.0086      1   0        T   ACA   0.0127      1   0
L   CTC   0.0080      1   1        I   ATA   0.0039      1   1
P   CCC   0.0070      1   0

For the definitions of dH[ai, aj] and d(2)[ai, aj], see text. S2 denotes Ser (S) codons with the AG doublet.
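The volatilities quoted above and the distance columns of Table 1 can be reproduced with a short script; the genetic-code encoding below is the standard one, and the two metrics are those defined in this section:

```python
BASES = "TCAG"
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> amino acid ('*' marks stop codons).
CODE = {b1 + b2 + b3: AA[16 * i + 4 * j + k]
        for i, b1 in enumerate(BASES)
        for j, b2 in enumerate(BASES)
        for k, b3 in enumerate(BASES)}

HYDROPHOBIC = set("CLIMVFYW")  # group 1 of the binary alphabet above

def neighbors(codon):
    """All one-point mutants of a codon (excluding the codon itself)."""
    for pos in range(3):
        for b in BASES:
            if b != codon[pos]:
                yield codon[:pos] + b + codon[pos + 1:]

def volatility(codon, metric):
    """Eq. (2): mean distance to the non-stop single-base neighbors."""
    aa = CODE[codon]
    dists = [metric(aa, CODE[c]) for c in neighbors(codon) if CODE[c] != "*"]
    return sum(dists) / len(dists)

d_hamming = lambda a1, a2: int(a1 != a2)
d_binary = lambda a1, a2: int((a1 in HYDROPHOBIC) != (a2 in HYDROPHOBIC))

print(volatility("CGC", d_hamming))  # 6/9 = 0.66...
print(volatility("CGC", d_binary))   # 2/9 = 0.22...
print(volatility("AGA", d_hamming))  # 6/8 = 0.75
print(volatility("AGA", d_binary))   # 1/8 = 0.125
```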
Let us now consider the case of the amino acids with a single codon, methionine (M) and tryptophan (W), which obviously have the greatest volatility. On the one hand, ATG (M) has the following neighbors: CTG (L), GTG (V), TTG (L), ATT (I), ATA (I), ATC (I), AAG (K), ACG (T), AGG (R), with $\nu_H(\mathrm{ATG}) = 9/9 = 1$ but reduced-alphabet volatility $\nu^{(2)}(\mathrm{ATG}) = 1/3$. Therefore, 66.6% of the mutations maintain the hydrophobic character. On the other hand, TGG (W) has the following neighbors (without stop codons): CGG (R), TTG (L), AGG (R), TGC (C), TGT (C), GGG (G), TCG (S), with $\nu_H(\mathrm{TGG}) = 7/7 = 1$ but reduced-alphabet volatility $\nu^{(2)}(\mathrm{TGG}) = 4/7$. In this case only 42.8% of mutations conserve hydrophobicity (see Table 2 below). We see that the nature of its neighbors in the genetic code has an important influence on the low variability of W. This fact is confirmed experimentally in the example discussed in the next section.
Table 2. Substitution probabilities for ATG and TGG, according to the empirical matrix [14], normalized to single-base substitutions (REM)

M ATG                 dH  d(2)     W TGG                 dH  d(2)
M   ATG   0.744       0   0        W   TGG   0.972       0   0
L   CTG   0.0703      1   0        R   CGG   0.0055      1   1
V   GTG   0.0486      1   0        L   TTG   0.0054      1   0
L   TTG   0.0311      1   0        R   AGG   0.0048      1   1
I   ATT   0.0276      1   0        C   TGC   0.0040      1   0
I   ATA   0.0267      1   0        C   TGT   0.0039      1   0
I   ATC   0.0267      1   0        G   GGG   0.0033      1   1
K   AAG   0.0108      1   1        S   TCG   0.0011      1   1
T   ACG   0.0091      1   1
R   AGG   0.0044      1   1

For the definitions of dH[ai, aj] and d(2)[ai, aj], see text.
3 Amino Acid Conservativeness

For soluble proteins, we are interested in separating those positions that are conserved as a result of non-divergence from a common ancestor from those that are conserved for structural/functional reasons. The extent to which an amino acid residue can be replaced depends on its structural and functional role within a protein. Many positions in a protein sequence have no critical structural role. Only a very small number of residues play a key role in fold determination. The core residues are mostly hydrophobic, but their identity is not crucial. At some codon sites non-synonymous mutations are not tolerated due to strong purifying selection, while at other sites only a particular subset of non-synonymous mutations, those that preserve the function of the protein by preserving its overall structure, are allowed. However, in short-term evolution there are favorable mutations that do not occur because they would require two or three base changes. The original amino acid residue remains invariant, yet not because of its very specific physicochemical properties; it only appears irreplaceable as a consequence of the fact that similar amino acids are not reachable by single-base substitutions. As pointed out long ago by Miyata, Miyazawa, and Yasunaga [19]: "Amino acids separated by two or three codon position differences are unlikely to interchange even if they are chemically similar".
For example, in their study of in vitro scanning saturation mutagenesis of an antibody binding pocket, Burks and co-workers [26] obtained comprehensive information on the functional significance and information content of a given residue of an antibody by saturation mutagenesis, in which all 19 amino acid substitutions are examined. From the results in Fig. 2 of their paper, displaying histograms of the ELISA data for the different mutant proteins binding to digoxin, digitoxin, digoxigenin, and ouabain, it is clear that for the three heavy-chain aromatic residues that make substantial van der Waals contacts with bound digoxin, H:Tyr-33, H:Tyr-50, and H:Trp-100, conservative changes to any of the other aromatic amino acids largely retained the ELISA signal, and there was little effect on specificity. In frame A of the mentioned figure, it is shown that the wild type, H:Tyr-33, could be improved by Trp-33; however, this mutation has not been observed in natural antibodies. This example perfectly illustrates the point we want to make: in long-term evolution, in which evolutionary change between codons varying in 2 or 3 nucleotides is allowed, it is appropriate to consider the substitution group (W, Y, F). However, in short-term and directed evolution, (W) constitutes a separate group. Although W could be a good replacement for Y, the substitution does not occur because it is not an elementary amino acid change: it needs 2 nucleotide changes.

As mentioned above, the empirical codon substitution matrix [14] gives the overall codon exchangeability, i.e., it includes both mutational and selective effects. It is found that the mean rate of amino acid-changing substitutions is about 15% (SD = 5.35), while silent fixed mutations amount to the remaining 85% (see Figure 1 below).
Fig. 1. Amino acid-changing substitutions
This dominance of silent substitutions over amino acid-changing substitutions is the result of the short term and of the necessity of preserving the essential genetic information coding for the protein. Although the codons of methionine (M) and tryptophan (W) have the highest volatility (1), since any mutation is an amino acid-changing mutation, their fates in evolution are completely different. While about 25% of the substitutions from ATG (M) become fixed (the highest rate of all amino acids), less than 3% of the substitutions from TGG (W) become fixed (the lowest rate of all amino acids). As just explained, this is due to both mutational and selective factors. While methionine is frequently replaced by its hydrophobic neighboring amino acids, leucine, valine, and isoleucine, in the interior of proteins, tryptophan is rarely replaced by its single-base neighbors.
4 Amino Acid Irreplaceability and Exchangeability Measures

The reduced empirical matrix (REM) includes mutational and selective effects. The selection pressures are hidden in the identities, that is, the codon self-substitutions. We employ these self-substitutions to estimate the amino acid average degree of irreplaceability (AADI) with respect to a given reference amino acid (see Table 3 below).

Table 3. Relative changeabilities of amino acids and rankings with various irreplaceability and changeability indices
Amino acid a,b  ACII   AADI(W) c (%)  AADI(K)  V      AMD             MPC  EX       S/C Score
W (1)           7.80   100            4.44     W      W, M            W    W        W
C (2)           5.06   65             2.88     M      C, Y, H         C    F        M
R (3)           3.88   50             2.21     C      F, Q, K         G    G        H
Y (4)           3.72   48             2.12     Y      N, D, E         Y    L        C
H (5)           3.65   47             2.08     H      I               F    I        Y
M (6)           3.40   44             1.94     Q, F   S               H    R, Y, D  R
F (7)           2.96   38             1.69     K      R               P    N        F
S (8)           2.75   35             1.57     N      L               D    V        Q
I (9)           2.62   34             1.49     D      P, T, G, V, A   I    C        E
N (10)          2.55   33             1.45     E                      L    T        N
P (11)          2.39   31             1.36     I                      N    H        D
G (12)          2.34   30             1.33     S                      V    M        P
L (13)          2.31   30             1.32     R                      M    A        K
T (14)          2.28   29             1.30     P                      E    P        T
Q (15)          2.24   29             1.27     L                      R    S        S
V (16)          2.13   27             1.21     G                      T    E        I
A (17)          2.02   26             1.15     T                      A    Q        L
D (18)          1.96   25             1.12     V                      K    K        V
K (19)          1.76   23             1.00     A                      Q             A
E (20)          1.41   18             0.80                            S             G

a Ranks are given in parentheses. b Ranked by ACII. c Mean = 38%; SD = 18%.
ACII = Average Codon Identity Index; AADI(W) = amino acid average degree of irreplaceability, with respect to tryptophan, in %; AADI(K) = amino acid average degree of irreplaceability, with respect to lysine; V = amino acid value [27]; AMD = average mutational deterioration [30]; MPC = mutually persistently conserved [31]; EX = exchangeability of amino acids, with respect to lysine [9]; S/C Score = size/complexity score [32]. The columns V, AMD, MPC, EX, and S/C list the amino acids in the order of the respective ranking (tied groups are comma-separated; the row alignment of these columns with the ACII ranks is approximate, reconstructed from the flattened original). The 7 irreplaceable amino acids according to our index are the first seven rows (W, C, R, Y, H, M, F).
First, we define the codon identity index (CII) as the identity frequency of the codon, normalized by the codon frequency in the database. Then, the AADI index of a given amino acid $a_m$ with respect to a fixed reference amino acid $a_r$, AADI($a_m$), is simply the average of the CII over all codons belonging to $a_m$, divided by the CII of $a_r$. As a first reference amino acid we take lysine (K), since it is the most readily replaced amino acid [9]. Therefore, by definition, AADI(K) = 1.00. Alternatively, we can also take tryptophan (W) as the reference amino acid (see the legend of Table 3). A main result of this paper is that, according to our measure, the seven most irreplaceable amino acids are, in order: W, C, R, Y, H, M, and F.

It is interesting to compare this ranking of amino acids with other measures of amino acid replaceability (Table 3). Thirty years ago, Volkenstein [27] introduced an amino acid-based index of replaceability of a codon, or codon value (q), in terms of the functional similarity of amino acids [28]. This last quantity was built upon the mutual replaceability of amino acid residues in isofunctional proteins, from the then limited database collected by Dayhoff [29]. Averaging q over all jointly degenerate codons, he defined a corresponding value for the amino acid degree of irreplaceability, or amino acid value (V). Six out of seven of his irreplaceable amino acids coincide with ours (Table 3), a remarkable agreement of 87%. Another measure, mutational deterioration (MD), was introduced by Luo [30]. This index takes into account the transition/transversion mutational bias and the fact that a mutation of the third letter of a codon differs from mutations of the first and second ones. Luo then defines the average mutational deterioration (AMD) of an amino acid as the average of MD taken over the codons corresponding to that amino acid. The ranking of amino acids by AMD is consistent with the ranking by V and with our ranking.
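A sketch of the CII and AADI computations as just defined, assuming the REM identities and database codon frequencies are available as dictionaries:

```python
def cii(codon, P, freq):
    # Codon identity index: self-substitution frequency of the codon,
    # normalized by the codon's frequency in the database.
    return P[(codon, codon)] / freq[codon]

def aadi(aa, P, freq, codons_of, ref="K"):
    """AADI of amino acid `aa` with respect to reference `ref`:
    the mean CII over the codons of `aa`, divided by the corresponding
    mean for the reference. (Averaging the reference's CII over its own
    codons is an assumption; the paper states 'the CII of a_r'.)"""
    mean_cii = lambda a: (sum(cii(c, P, freq) for c in codons_of[a])
                          / len(codons_of[a]))
    return mean_cii(aa) / mean_cii(ref)

# By construction, aadi("K", P, freq, codons_of) == 1.0 when ref="K".
```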
Recently, Friedberg and Margalit [31] determined the over-represented amino acids in mutually persistently conserved (MPC) positions. These authors identified positions, for each protein in a database, that show residue conservation within both close and distant family members. They called these positions "persistently conserved." Then, they determined "mutually" persistently conserved (MPC) positions: those structurally aligned positions in a protein pair that are persistently conserved in both pair mates. It is well known that a very small number of residues play a key role in fold determination, enabling different sequences that contain these key residues at appropriate positions to assume similar folds. From a global analysis of proteins from many different folds, Friedberg and Margalit [31] succeeded in separating those positions that are conserved as a result of non-divergence from a common ancestor from those that are conserved for structural or functional reasons. They found that the first seven residues in mutually persistently conserved positions are, in order: W, C, G, Y, F, H, P. This is an agreement of 71% with the most irreplaceable amino acids found with our AADI index (Table 3). It is interesting that the residues that do not agree can be simply explained. On the one hand, in our ranking occur arginine (R) and methionine (M). They are not MPC because their codons have single-base neighbors that codify similar residues. Arginine (R) is generally substituted by lysine (K). In globular proteins, the core residues need to be hydrophobic, but their identity is not crucial; thus, M can be substituted by the other small hydrophobic residues L, V, and I. On the other hand, the two MPC residues that do not occur in the first 7 places of our ranking, glycine (G) and proline (P), have four codons each. Therefore, synonymous mutations diminish the number of identities, lowering the value of AADI accordingly.

A completely different measure of amino acid exchangeability was suggested by Yampolsky and Stoltzfus [9]. These authors suggested the employment of experimental exchanges that provide independent data on amino acid exchange effects with no obvious risk of confounding mutational effects. Their measure is defined as follows: "EX: The 'experimental exchangeability from i to j', or EXij, is the mean activity of variants with an exchange from amino acid i to amino acid j". In this case the agreement with our measure, based on protein evolution, is poor: only four out of the seven highest-ranked amino acids coincide. The absence of G, L, and I in our ranking is explained as before. However, in their tabulation of exchanges by source and destination (Table 2 in [9]), the following amino acids have the lowest exchange counts as source: C, W, M, F, H, Y, an agreement of 85.5% with our ranking.
5 Relative Size/Complexity Measure

For comparison, in the last column of Table 3 the amino acids are ranked according to the Size/Complexity Score introduced by Dufton [32]. Although there are some minor differences in the order, there is a remarkable 100% agreement with the first seven residues appearing in our ranking (column 1). One of the simplest ways that an ad hoc size/complexity value based on structure alone can be calculated is to derive a score for each side chain based upon the frequency of occurrence of its component features amongst the 20 residues as a whole. Hence, the atomic composition of each side chain is used to derive an overall score: the larger the total, the more unusual the composition, the bigger the side chain, and the more complicated the synthesis, the
higher the complexity. Therefore, a numerical expression for the size/complexity is approximated by quantities such as side chain volume, bulkiness (i.e., the ratio of side chain volume to length), or formula weight. Thus, the irreplaceability of an amino acid is highly influenced by these properties. Five of the seven irreplaceable amino acids (W, Y, F, H, R) are also the most complex according to the shortest description of the structural formula [33]. Of course, the changeability of a particular residue depends on the structural and functional context. The rankings in Table 3 are valid for water-soluble globular proteins. Different rankings are obtained for transmembrane and non-transmembrane regions of membrane proteins [34]. According to Tourasse and Li [35], histidine (H), which is the most changeable residue in the non-transmembrane regions of the proteins they analysed, is well conserved in segments that cross the membrane. Nonetheless, they mention that some amino acids have similar changeability in different contexts. Although not included in our Table 3 for lack of space, the rankings of the relative changeability of the 20 amino acids computed by Jones et al. [34] and by Tourasse and Li are given in Table 1 of [35]. These authors take cysteine (C) as reference. In their rankings, the seven least changeable amino acids are, respectively: W, C, Y, G, F, L, P [34], and C, W, Y, F, L, G, K [35]. In both cases, there is an agreement of 57% with our ranking. The absence of G, L, P, and K from our seven irreplaceable amino acids is explained in terms of synonymous substitutions, as before. We also compared our irreplaceable amino acids with the mutational spectrum of human genetic disease [6]. According to the results in Fig. 4(b) of [6], on the relative probability that a random mutation at different amino acids will cause a genetic disease, tryptophan (W) and cysteine (C) have the highest probability of causing disease, followed by G, R, L, Y, H, and P. Again, this is an agreement of 57% with our ranking of irreplaceable amino acids, with the same differences explained above. From our comparison of amino acid exchangeability measures (Table 3) and the results above, we can arrive at the following consensus set of 7 irreplaceable amino acids: W, C, G, P, R, H, and Y, followed closely by F and L. Five of them have a high Size/Complexity Score [32], and the small ones, G and P, have special roles in conserving protein stability: mutations at Gly (G), which is frequently present at turns of alpha-helices, might have a negative impact on protein structural stability [6], and Pro (P) is a helix-breaker. It has been suggested that the dominant mechanism by which disease mutations damage protein function is a decrease in protein stability [36], as opposed to mutations of active-site residues (usually located on the protein surface) [6]. This result is confirmed by comparing the 8 residues with the highest MPC values [31] in Table 3 with the 8 residues with the highest probability of causing disease: there is an agreement of 75%.
6 Discussion and Conclusions

Our approach to short-term protein evolution, like the proposal of Yampolsky and Stoltzfus [9], emphasizes the role of mutations and the structure of the genetic code. As shown by Sander and co-workers [6], the overall amino acid mutational spectrum of human genetic disease mainly reflects the mutability of the genetic code. This line of attack differs from the more conventional approaches aimed at detecting positive
selection, defined as a significant excess of non-silent over silent nucleotide substitutions [37]. We considered a different application of volatility, introduced originally to estimate selection pressures on proteins on the basis of their synonymous codon usage [17]. By applying the Hamming distance to reduced alphabets, we extended the volatility concept to a discrete characterization of amino acid-changing mutations. This extension proved to be useful for understanding the influence of neighboring codons in the genetic code [10-13] on the conservativeness of amino acids in short-term evolution. A main finding of this investigation is the identification of irreplaceable amino acids and their characterization with a size/complexity measure. Our amino acid average degree of irreplaceability is consistent with other measures of amino acid changeability. The distinguished role that these amino acids play in the conservation of protein structure and function makes them important targets in the identification of key residues in directed evolution studies, the design of drugs and proteins, and the identification of mutations causing disease. The amino acid variability observed in the evolution of influenza A hemagglutinin is dominated by single-base mutations [1]; therefore, the reported results have already been applied to the study of viral evolution [38]. Another possible application is to human genetic mutations, because most of them are caused by single-nucleotide changes [39].
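To make the volatility idea concrete, the following Python sketch computes a plain codon volatility under single-base (Hamming distance 1) changes using the standard genetic code. This is only an illustration of the underlying idea, not the AADI computation over reduced alphabets described above; it assumes Biopython is available, and the helper names are ours.

```python
from Bio.Data import CodonTable  # assumes Biopython is installed

code = CodonTable.unambiguous_dna_by_id[1]   # standard genetic code
codon_to_aa = dict(code.forward_table)       # 61 sense codons -> amino acid

def volatility(codon):
    """Fraction of single-base (Hamming distance 1) neighbors of a sense
    codon that encode a *different* amino acid; stop codons are skipped."""
    aa = codon_to_aa[codon]
    neighbors = [codon[:i] + b + codon[i + 1:]
                 for i in range(3) for b in "ACGT" if b != codon[i]]
    coding = [n for n in neighbors if n in codon_to_aa]
    return sum(codon_to_aa[n] != aa for n in coding) / len(coding)

# Average volatility per amino acid over its synonymous codons: amino acids
# with many synonymous codons (e.g. G, P) tend to score lower, in line with
# the synonymous-substitution argument made above.
by_aa = {}
for codon, aa in codon_to_aa.items():
    by_aa.setdefault(aa, []).append(volatility(codon))
for aa, vals in sorted(by_aa.items(), key=lambda kv: sum(kv[1]) / len(kv[1])):
    print(aa, round(sum(vals) / len(vals), 3))
```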
Acknowledgements

The authors would like to thank Q.F.B. Antero Ramos-Fernández, from the Dept. of Physics and A.I., Universidad Veracruzana, for his computational assistance. This article was written during a sabbatical of M.A.J-M at Nova Southeastern University, supported by CONACYT, Project 81484; Sistema Nacional de Investigadores; PROMEP, Project UV-CA-197, MEXICO; and Universidad Veracruzana.
References

1. Shih, A.C.-C., Hsiao, T.-C., Ho, M.-S., Li, W.-H.: Simultaneous amino acid substitutions at antigenic sites drive influenza A hemagglutinin evolution. Proc. Natl. Acad. Sci. USA 104(15), 6283–6288 (2007)
2. Clark, L.A., Ganesan, S., Papp, S., van Vlijmen, H.W.T.: Trends in Antibody Sequence Changes during the Somatic Hypermutation Process. The Journal of Immunology 177, 333–340 (2006)
3. Keefe, A.D., Szostak, J.W.: Functional proteins from a random-sequence library. Nature 410, 715–718 (2001)
4. Arnold, F.H.: Design by Directed Evolution. Accounts of Chemical Research 31(3), 125–131 (1998)
5. Orencia, M.C., Yoon, J.S., Ness, J.E., Stemmer, W.P.C., Stevens, R.C.: Predicting the emergence of antibiotic resistance by directed evolution and structural analysis. Nature Structural Biology 8(3), 238–242 (2001)
6. Vitkup, D., Sander, C., Church, G.M.: The amino-acid mutational spectrum of human genetic disease. Genome Biology 4, R72 (2003)
7. Liò, P., Goldman, N.: Models of molecular evolution and phylogeny. Genome Res. 8, 1233–1244 (1998)
8. Kosiol, C., Holmes, I., Goldman, N.: An empirical codon model for protein sequence evolution. Mol. Biol. Evol. 24(7), 1464–1479 (2007)
9. Yampolsky, L.Y., Stoltzfus, A.: The exchangeability of amino acids in proteins. Genetics 170, 1459–1472 (2005)
10. Jiménez-Montaño, M.A., de la Mora-Basáñez, R., Pöschel, T.: The Hypercube Structure of the Genetic Code Explains Conservative and Non-Conservative Amino Acid Substitutions in Vivo and in Vitro. BioSystems 39, 117–125 (1996)
11. Karasev, V.A., Soronkin, S.G.: Topological structure of the genetic code. Russian Journal of Genetics 33, 622–628 (1997)
12. He, M.X., Petoukhov, S.V., Ricci, P.E.: Genetic code, Hamming distance and stochastic matrices. Bull. Math. Biology 66(5), 1405–1421 (2004)
13. Hershberg, U., Shlomchik, M.J.: Differences in potential for amino acid change after mutation reveals distinct strategies for κ and λ light-chain variation. Proc. Natl. Acad. Sci. USA 103(43), 15963–15968 (2006)
14. Schneider, A., Cannarozzi, G.M., Gonnet, G.H.: Empirical codon substitution matrix. BMC Bioinformatics 6, 134 (2005)
15. Doron-Faigenboim, A., Pupko, T.: A combined empirical and mechanistic codon model. Mol. Biol. Evol. 24(2), 388–397 (2007)
16. Plotkin, J.B., Dushoff, J.: Codon bias and frequency-dependent selection on the hemagglutinin epitopes of influenza A virus. Proc. Natl. Acad. Sci. USA 100(12), 7152–7157 (2003)
17. Plotkin, J.B., Dushoff, J., Fraser, H.B.: Detecting selection using a single genome sequence of M. tuberculosis and P. falciparum. Nature 428, 942–945 (2004)
18. Grantham, R.: Amino Acid Difference Formula to Help Explain Protein Evolution. Science 185(4154), 862–864 (1974)
19. Miyata, T., Miyazawa, S., Yasunaga, T.: Two types of amino acid substitutions in protein evolution. J. Mol. Evol. 12, 219–236 (1979)
20. Cannata, N., Toppo, S., Romualdi, C., Valle, G.: Simplifying amino acid alphabets by means of a branch and bound algorithm and substitution matrices. Bioinformatics 18, 1102–1108 (2002)
21. Murphy, L.R., Wallqvist, A., Levy, R.M.: Simplified amino acid alphabets for protein fold recognition and implications for folding. Protein Eng. 13(3), 149–152 (2000)
22. Fan, K., Wang, W.: What is the minimum number of letters required to fold a protein? J. Mol. Biol. 328, 921–926 (2003)
23. Albatineh, A., Razeghifard, R.: Clustering Amino Acids Using Maximum Clusters Similarity. In: Doble, M., Loging, W., Malone, J., Tseng, V.S.-M. (eds.) Proc. 2008 International Conference on Bioinformatics, Computational Biology, Genomics and Chemoinformatics (BCBGC 2008), pp. 87–92. ISRST, USA (2008)
24. Jiménez-Montaño, M.A.: On the syntactic structure of protein sequences and the concept of grammar complexity. Bull. Math. Biol. 46(4), 641–659 (1984)
25. Zhou, H., Zhou, Y.: Quantifying the effect of burial of amino acid residues on protein stability. PROTEINS: Structure, Function, and Bioinformatics 54, 315–322 (2004)
26. Burks, E.A., Chen, G., Georgiou, G., Iverson, B.L.: In vitro scanning saturation mutagenesis of an antibody binding pocket. Proc. Natl. Acad. Sci. USA 94, 412–417 (1997)
27. Volkenstein, M.V.: Mutations and the value of information. J. Theor. Biol. 80, 155–169 (1979)
28. Bachinsky, A., Ratner, V.: Biomed. Zs. 18, 53 (1976) (in Russian)
29. Dayhoff, M. (ed.): Atlas of Protein Sequence and Structure. Nat. Biomed. Res. Found. (1972)
30. Luo, L.F.: The degeneracy rule of genetic code. Origins of Life and Evolution of the Biosphere 18, 65–70 (1988)
31. Friedberg, I., Margalit, H.: Persistently conserved positions in structurally similar, sequence dissimilar proteins: Roles in preserving protein fold and function. Protein Science 11, 350–360 (2002)
32. Dufton, M.J.: Genetic code synonym quotas and amino acid complexity: Cutting the cost of proteins? J. Theor. Biol. 187, 165–173 (1997)
33. Papentin, F.: On order and complexity. II. Application to chemical and biochemical structures. J. Theor. Biol. 95(2), 225–245 (1982)
34. Jones, D.T., Taylor, W.R., Thornton, J.: The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8, 275–282 (1992)
35. Tourasse, N.J., Li, W.-H.: Selective constraints, amino acid composition, and the rate of protein evolution. Mol. Biol. Evol. 17(4), 656–664 (2000)
36. Wang, Z., Moult, J.: SNPs, protein structure, and disease. Hum. Mutat. 17, 263–270 (2001)
37. Li, W.-H., Wu, C.-I., Luo, C.-C.: A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol. Biol. Evol. 2, 150–174 (1985)
38. Jiménez-Montaño, M.A., Ramos-Fernández, A.: An empirical method to identify positively selected sites in antigenic evolution. In: Argüello-Astorga, G.R., González, R.A., Méndez Salinas, E. (eds.) e-Proc. V National Congress of Virology. Sociedad Mexicana de Bioquímica, Mexico (2007)
39. Cargill, M., Altshuler, D., Ireland, J., Sklar, P., Ardlie, K., Patil, N., Shaw, N., Lane, C.R., Lim, E.P., Kalyanaraman, N., et al.: Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat. Genet. 22, 231–238 (1999)
A One-Class Classification Approach for Protein Sequences and Structures

András Bánhalmi1, Róbert Busa-Fekete1,2, and Balázs Kégl2

1 Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and University of Szeged, Aradi vértanúk tere 1., H-6720 Szeged, Hungary
{banhalmi,busarobi}@inf.u-szeged.hu
2 LAL, University of Paris-Sud, CNRS, 91898 Orsay, France
[email protected]
Abstract. The One-Class Classification (OCC) approach is based on the assumption that samples are available only from a target class in the training phase. OCC methods have been applied with success to problems where the classes are very different in size. As class-imbalance problems are typical in protein classification tasks, we were interested in testing one-class classification algorithms for the detection of distant similarities in protein sequences and structures. We found that the OCC approach brought about a small improvement in classification performance compared to binary classifiers (SVM, ANN, Random Forest). More importantly, there is a substantial (50- to 100-fold) improvement in the training time. OCCs may provide an especially useful alternative for processing those protein groups where discriminative classifiers cannot be easily trained.

Keywords: One-class classification, Protein classification, ROC analysis.
1 Introduction
The classification of proteins (domain types, structural classes, protein families) is a key issue in genome annotation. The simplest methods of protein classification are based on pairwise comparisons; more advanced approaches use generative models of the positive class like Hidden Markov Models (HMMs), while more recent methods like Support Vector Machines (SVMs) are based on discriminative models in which the positive and negative classes are both used in the training phase. However, the known protein groups have some typical properties that make the application of classification algorithms difficult or impractical. First, protein classes are very heterogeneous in most of their characteristics (such as the number of known members, protein size, within-group similarities, separation from other groups). Second, a large proportion of known protein groups have only one or two known members. Third, the classes are imbalanced, there being many more negative examples than positive ones. The training of support vector machines is difficult on such data, and generative models like the popular HMM need manually curated multiple alignments that require a substantial human overhead. These points also pose problems to the recent machine learning approaches that use new input space representations or similarity measures.
One-class classification (OCC) approaches have been successfully used in various fields where only positive examples are available for training. Application examples include image retrieval [1], the fault detection of machinery [2], automated currency validation [3], bioacoustic monitoring [4], document classification [5], and spam filtering [6]. While discriminative models try to establish a decision surface between the positive class and the other class(es), OCC methods try to draw a closed decision surface around the positive class that surrounds the majority of the positive training instances (see Figure 1). In this scenario, the negative examples are outliers, and the relevant methods are referred to as outlier detection or novelty detection in different fields. The area of OCC includes several algorithms, such as generative probability density estimation methods (Gaussian Mixture Models (GMM) [7,8], the Parzen estimator [9]), reconstruction methods (k-means [8], autoencoder neural networks [10]), and boundary estimators (k-centers [11], SVDD [12,13,14], NNDD [15]).
Fig. 1. The difference between the discriminative and OC classification methods
The aim of this paper is to test the performance of OCC algorithms in protein classification tasks, using standardized datasets developed for benchmarking machine learning algorithms. The classification tasks were selected in such a way that the members of a new protein family (test set) had to be detected based on other known members of a protein superfamily (training set), using measures of sequence similarity (BLAST [16], Smith-Waterman [17]) or structure similarity (DALI [18], PRIDE [19]). We carried out a ROC analysis for the characterization of classifier performance and found that OCC methods provide a slight improvement with respect to discriminative methods (SVM [20], ANN [8], Random Forest [21]). On the other hand, they require 50 to 100 times less training time, which makes them promising candidates for large-scale applications.
2 Protein Classification Datasets
We used two different classification benchmark datasets taken from the Protein Classification Benchmark Database [22].
2.1 The COG Dataset
This dataset is a subset of the COG database of functionally annotated orthologous sequence clusters [23]. In the COG database, each COG cluster contains functionally related orthologous sequences belonging to unicellular organisms, including archaea, bacteria, and unicellular eukaryotes. For a given COG group, the positive test set included the sequences from three unicellular eukaryotic genomes, while the positive training set was compiled from the rest of the sequences in the group. Of the over 5,665 COGs, we selected 117 that contained at least 8 eukaryotic sequences (positive test group) and 16 additional prokaryotic sequences (positive training group). This dataset contained 17,973 sequences. The negative training/test sets were obtained by randomly assigning sequences from the remaining COGs to the two groups. This yielded 117 classification tasks, but we used only a subset of them, discarding those on which the nearest-neighbor classifier already achieved an AUC of 1.0; this left 13 of the 117 tasks. The COGs used as positive classes in these learning tasks are: COG0406, COG0526, COG0631, COG0695, COG0697, COG0699, COG0814, COG0842, COG0847, COG1310, COG1752, COG2036, COG2801.

2.2 The SCOP40 Dataset
The evaluation of classification performance was tested on a sequence dataset designed to test distant protein similarities [24]. This set consists of 4,352 protein domain sequences (whose lengths range from 20 to 994 amino acids) selected from the SCOP database [25]. The sequences of this dataset belong to 55 superfamilies, which were divided into training sets and test sets in such a way that the test set consisted of members of a protein family that was not represented in the training set, i.e., there was a low degree of sequence similarity and no guaranteed evolutionary relationship between the two sets.

2.3 Sequence Comparison Algorithms
The protein comparison datasets were taken from the Protein Classification Benchmark website [22]. In our experiments, protein sequence comparison was performed with version 2.2.4 of the BLAST program [16] using a cutoff score of 50, or with the Smith-Waterman algorithm [17] as implemented in MATLAB. The BLOSUM 62 matrix [26] was used in each case. Protein structure comparison was carried out with DaliLite [18] and with PRIDE2 [19] using default parameters.

2.4 Data Representation
The machine learning algorithms can accept only fixed-length, real-valued input vectors, such as a kernel representation [24] in which each protein X is represented by a feature vector FX = (fx1, fx2, ..., fxn), where n is the total number of proteins in the training set and fxi is a similarity/distance score, such as the BLAST score, between X and the ith sequence in the training set. Here we
used a more compact representation, where each sequence was represented by its average similarity score to each of the superfamilies represented in the training set [27,28]. After the aggregation step, the length of the training vectors became equal to the number of superfamilies. Thus, in the case of the SCOP40 dataset we had 24-dimensional vectors instead of 1357. In the case of the COG dataset [23], we used the average similarity score on the various COGs as an aggregate similarity measure, which resulted in 117 dimensions instead of the original 17,947. We also found that this aggregation does not greatly affect the classification performance.
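A minimal sketch of this aggregation step, assuming the raw pairwise similarity scores are already available as a matrix; the function and argument names are illustrative, not from the paper's code.

```python
import numpy as np

def aggregate_features(sim, train_groups):
    """Collapse an (n_samples x n_train) similarity matrix into per-group
    average similarities: column j of the result is the mean similarity of
    each sample to the training sequences of the j-th superfamily (or COG).
    `train_groups[i]` is the group label of the i-th training sequence."""
    groups = sorted(set(train_groups))
    idx = {g: [i for i, lbl in enumerate(train_groups) if lbl == g]
           for g in groups}
    agg = np.stack([sim[:, idx[g]].mean(axis=1) for g in groups], axis=1)
    return agg, groups
```

For the SCOP40 data this would map the 1357-dimensional kernel representation to 24 superfamily averages, as described above.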
3 Methods

3.1 Data Description Methods
In the following we will give a brief description of the Data Description (or One-Class Classification) methods used in our experiments. The methods which estimate the probability density give a probability value p(x) for a test data sample, and this value can be used for ranking the test samples. Other methods supply distance values d(x), and the negatives of these distances can be used for scoring.

Gaussian Data Description: The Gaussian Data Description [29] seeks to directly approximate the class-conditional probability distribution corresponding to a class with a multidimensional Gaussian density function:

$p(x) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}}\; e^{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)}$
Here D denotes the dimension of the vector space, and μ and Σ represent the mean and covariance, respectively. The parameters can be computed directly from the training data. The main computational effort is the inversion of the covariance matrix. When the covariance matrix is singular, the following regularization step is necessary before the inversion: $\Sigma' = \Sigma + rI$, where $r \in \mathbb{R}^{+}$ is a small regularization parameter and $I$ is the $D$-dimensional identity matrix.

Parzen Data Description: This method [15] estimates the class-conditional probability density of the training data via a sum of kernel functions centered on the training examples. The most commonly employed kernel function is the Gaussian function with a zero mean and a variance of one:

$p(x) = \frac{1}{N}\sum_{i=1}^{N} \varphi(x, x_i, hI),$
where I is the unit covariance matrix, the xi values represent the training data samples, and N stands for the number of data samples. The parameter h is
the bandwidth (a smoothing parameter), which can be fixed beforehand, or an optimal value can be found using a maximum likelihood method.

Nearest Neighbor Data Description: Here the score of a test vector is expressed by the distance between the test vector and its nearest neighbor NN(x) in the training set:

$\varepsilon(x) = -\|x - NN(x)\|$

K-means Data Description: This method belongs to the family of so-called reconstruction methods, because it minimizes a reconstruction error on the training set. The score of the test data samples can also be expressed using this reconstruction error function. For the k-means data description method, the following reconstruction error is applied [15]:

$\varepsilon(x_i) = \min_k \|x_i - \mu_k\|^2, \qquad \varepsilon = \sum_{i=1}^{N} \varepsilon(x_i),$

where the $x_i$ represent the training data samples, $N$ denotes the number of samples, and $\mu_k$ is one of the mean values to be optimized. After the training phase, the distance score for a test vector $z$ is calculated via the following formula:

$d(z) = \min_k \|z - \mu_k\|$
K-centers Data Description: The only difference between this method and the K-means Data Description is the objective function to be minimized [7]. Here the objective function is the maximal radius of the hyperspheres embedding the clusters:

$\varepsilon(x_i) = \min_k \|x_i - \mu_k\|^2, \qquad \varepsilon = \max_i \varepsilon(x_i)$

After the training phase, the distance score for a test vector $z$ is calculated by using the formula:

$d(z) = \min_k \|z - \mu_k\|$
Self-Organizing Map (SOM): This method is similar to the k-means procedure in the sense that reference vectors corresponding to a training set are iteratively updated until convergence. Denoting the k-th reference vector by w_k (it plays a role similar to μ_k above), for each training data sample x the closest reference vector w_k is determined. After this, the closest reference vector is altered so as to be closer to the actual point x. The difference between this method and other similar (so-called Learning Vector Quantization) methods is that each reference vector corresponds to a grid point in a low-dimensional (1-3 dimensional) space.
In the learning phase, not only the reference vector w_k closest to x is updated, but also some of the vectors whose grid points are close to the grid point corresponding to w_k. In this way, after convergence, the reference vectors corresponding to nearby grid points will be close to each other, so the low-dimensional grid tries to preserve the topological properties of the input space [30]. The test mechanism is the same as for the previous methods; that is, the distance score for a test vector z is calculated via the formula:

$d(z) = \min_k \|z - w_k\|$
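To make the scoring rules above concrete, here is a compact NumPy/SciPy sketch of the test-phase scoring for four of these data description methods (prototype training, e.g. by k-means or a SOM, is omitted). It is an illustration under our own naming conventions, not the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

class GaussianDD:
    """Gaussian data description: rank test points by the log-density of a
    single Gaussian fitted to the positive class, using the Sigma + r*I
    regularization mentioned above."""
    def fit(self, X, r=0.05):
        self.mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False) + r * np.eye(X.shape[1])
        self.inv = np.linalg.inv(cov)
        return self
    def score(self, Z):
        d = Z - self.mu  # log-density up to an additive constant
        return -0.5 * np.einsum('ij,jk,ik->i', d, self.inv, d)

def parzen_score(Z, X, h=0.5):
    """Parzen DD: mean of Gaussian kernels (bandwidth h) centered on X."""
    return np.exp(-cdist(Z, X, 'sqeuclidean') / (2 * h ** 2)).mean(axis=1)

def nn_score(Z, X):
    """NNDD: negative distance to the nearest training example."""
    return -cdist(Z, X).min(axis=1)

def prototype_score(Z, centers):
    """K-means / k-centers / SOM scoring: negative distance to the closest
    prototype (cluster mean, hypersphere center, or reference vector)."""
    return -cdist(Z, centers).min(axis=1)
```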
Support Vector Data Description: This approach finds the minimum enclosing ball (hypersphere) of the positive training data; by applying the 'kernel trick', the minimum enclosing ball is found in the kernel space. Various kernel functions can be used, but in our experiments the radial basis kernel was applied. Though we do not give a detailed description of the SVDD method here, the interested reader can consult [13].

Counter-Example Generation-Based Data Description: Here the problem of one-class classification (or ranking) is transformed into a two-class problem: a simple algorithm automatically generates artificial counter-examples using just the positive data, in such a way that the generated counter-examples lie outside the region of the positives at a predefined distance from them. Afterwards, traditional discriminative classification methods can be used to separate the positive and the artificial negative examples [31]. New methods for the negative-example generation and for the two-class classification problem were introduced in [32]. Here we use the original counter-example generation method, but for separating the positive examples from the generated negative ones we apply both the ν-SVM (Support Vector Machine) and the RBN (Radial Basis Network) classifiers.

3.2 Reference Binary Classifiers
The implementation of the Artificial Neural Network (ANN) and Random Forest (RF) was done using WEKA, an open-source Java package for machine learning [33]. The class-conditional probability estimation for the test elements was performed in the same way as implemented in WEKA. The data representation employed by these models was the same as the one we used for the OCCs. In addition, we used the SVMLight program [34], a Support Vector Machine implementation. The main advantage of the latter implementation is that it can handle sparse data representations as well.
4 Performance Evaluation
The evaluation was carried out via standard receiver operating characteristic (ROC) analysis [35,36], which provides a measure of both the sensitivity and the
Table 1. Comparison of the AUC performance of different classifiers on the SCOP40 dataset. The comparison was carried out using the aggregate feature representation described in Section 3.1. Each cell lists AUC / AUC50.

                          BLAST             SW                DALI              PRIDE
One-class classifiers
  NNDD                    0.8980 / 0.7315   0.8831 / 0.6640   0.8845 / 0.6622   0.8109 / 0.4104
  CE-OC(SVM)              0.9213 / 0.8043   0.9558 / 0.8227   0.9527 / 0.8457   0.8831 / 0.6071
  CE-OC(RBN)              0.9147 / 0.7913   0.9185 / 0.7267   0.9398 / 0.8317   0.8965 / 0.6098
  K-center                0.9184 / 0.7945   0.9197 / 0.7327   0.9834 / 0.8686   0.8704 / 0.5766
  K-means                 0.9206 / 0.7914   0.9301 / 0.7755   0.9857 / 0.9058   0.8849 / 0.6655
  SVDD                    0.9200 / 0.7709   0.9421 / 0.7840   0.9895 / 0.9402   0.9030 / 0.6581
  Parzen                  0.9048 / 0.7574   0.9384 / 0.7589   0.9825 / 0.8747   0.8947 / 0.5863
  SOM                     0.9184 / 0.7630   0.9362 / 0.7190   0.9827 / 0.8685   0.8963 / 0.6291
  Gauss                   0.9185 / 0.7962   0.9387 / 0.8036   0.9912 / 0.9526   0.8581 / 0.6585
Binary classifiers
  ANN                     0.8907 / 0.7912   0.9383 / 0.8231   0.9932 / 0.9795   0.9222 / 0.7686
  SVM                     0.8857 / 0.7878   0.9441 / 0.8195   0.9108 / 0.8434   0.9171 / 0.7326
  RForest                 0.8082 / 0.6332   0.8884 / 0.6984   0.9948 / 0.9678   0.8988 / 0.6472
specificity of the classification based on a ranking of the objects to be classified [37]. The ROC curve plots the sensitivity against the specificity. The integral of this curve is the "area under curve" or AUC value, which is equal to the probability that the score of a randomly chosen positive example is higher than that of a randomly chosen negative one [38,39]. AUC = 1.0 corresponds to perfect ranking, while a random ranking has an AUC = 0.5 value on average [35]. In order to limit the effect of class imbalance and to have datasets of a manageable size, it is customary to truncate the toplist so as to include only a limited number of negative samples [37]. The resulting measures are the so-called ROCn values (e.g., ROC10, ROC50). In our experiments we used both the full AUC and the AUC50. Another important question in classification tasks is how to choose the parameters of the different machine learning models. For the reference multi-class methods (like SVM or ANN), the parameters were chosen according to the references in the Protein Classification Benchmark [22]. For one-class models, which have one or more parameters influencing the performance of the ranking
Table 2. Comparison of the AUC performance of different classifiers on the SCOP40 dataset, carried out without using any aggregated features. This table allows a comparison between the results obtained with aggregated and non-aggregated features. Each cell lists AUC / AUC50.

                          BLAST             SW                DALI              PRIDE
One-class classifiers
  NNDD                    0.7403 / 0.5076   0.7102 / 0.5076   0.7521 / 0.5494   0.8027 / 0.4330
  CE-OC(SVM)              0.8393 / 0.5686   0.9600 / 0.7911   0.9527 / 0.8457   0.9081 / 0.6826
  CE-OC(RBN)              0.8283 / 0.4942   0.9673 / 0.8327   0.9398 / 0.8317   0.8988 / 0.6729
  K-center                0.7943 / 0.5762   0.9299 / 0.6862   0.9469 / 0.7835   0.8710 / 0.6198
  K-means                 0.7758 / 0.4575   0.9547 / 0.7595   0.9443 / 0.8171   0.8877 / 0.6067
  SVDD                    0.8405 / 0.6015   0.9583 / 0.7828   0.9695 / 0.9028   0.8993 / 0.6571
  Parzen                  0.8949 / 0.6846   0.9405 / 0.7103   0.8386 / 0.6208   0.8361 / 0.6660
  SOM                     0.8014 / 0.5111   0.9511 / 0.7589   0.9440 / 0.7912   0.9019 / 0.6738
  Gauss                   0.8266 / 0.4789   0.9628 / 0.8134   0.9499 / 0.7200   0.8790 / 0.6524
Binary classifiers
  ANN                     0.7054 / 0.5278   0.7896 / 0.5336   0.9543 / 0.7319   0.9253 / 0.7098
  SVM                     0.8854 / 0.7593   0.9437 / 0.8237   0.9886 / 0.9822   0.9389 / 0.8344
  RForest                 0.6521 / 0.6132   0.8599 / 0.7122   0.9944 / 0.9800   0.8945 / 0.7561
(or classification) problem, in most cases the default values of the parameters proved to be the best ones; cross-validation was not applied because of the very small number of positive training examples. Only two exceptions should be mentioned: when the Gauss DD was tested, the regularization matrix had a 0.05 multiplier, and the width of the Parzen window was set to 0.5. All the training datasets were normalized to the [−1, 1] interval.
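As a sketch, the AUC and its truncated variant can be computed directly from ranked scores. The truncation shown keeps only the top-scoring negatives, which is one common reading of the truncated-toplist AUC50 described above; function and argument names are ours.

```python
import numpy as np

def truncated_auc(pos_scores, neg_scores, n_neg=None):
    """AUC as P(score of a random positive > score of a random negative),
    with ties counted as 1/2. If n_neg is given (e.g. 50), only the
    top-scoring n_neg negatives are kept (truncated toplist)."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.sort(np.asarray(neg_scores, dtype=float))[::-1]
    if n_neg is not None:
        neg = neg[:n_neg]
    gt = (pos[:, None] > neg[None, :]).sum()
    eq = (pos[:, None] == neg[None, :]).sum()
    return (gt + 0.5 * eq) / (len(pos) * len(neg))
```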
5 Results and Discussion
Table 1 shows the performance of different classifiers on the SCOP40 database using BLAST, Smith-Waterman, DALI and PRIDE similarities taken from the Protein Classification Benchmark dataset (PCB) [22]. Here the AUC values were determined for the 55 classification tasks specified in PCB, and the average of the 55 AUC values is given in Table 1. The results show that OCCs can achieve
Table 3. Comparison of the AUC performance of different classifiers on selected classification tasks taken from the COG dataset. The comparison was carried out using the aggregate feature representation described in Section 3.1. Each cell lists AUC / AUC50.

                          BLAST             SW
One-class classifiers
  NNDD                    0.9669 / 0.9423   0.8027 / 0.4330
  CE-OC(SVM)              0.9801 / 0.7219   0.9739 / 0.8073
  CE-OC(RBN)              0.9810 / 0.7326   0.9744 / 0.7476
  K-center                0.9975 / 0.9512   0.9824 / 0.8136
  K-means                 0.9977 / 0.9493   0.9790 / 0.8195
  SVDD                    0.9817 / 0.9520   0.9715 / 0.7833
  Parzen                  0.9044 / 0.7637   0.8696 / 0.6877
  SOM                     0.9975 / 0.9483   0.9829 / 0.7778
  Gauss                   0.9978 / 0.9471   0.9742 / 0.9382
Binary classifiers
  ANN                     0.9763 / 0.7838   0.9627 / 0.8178
  SVM                     0.9885 / 0.9802   0.9752 / 0.9394
  RForest                 0.8558 / 0.7849   0.8173 / 0.9437
a modest but consistent improvement compared to the binary classifiers, both in AUC and in AUC50. The COG dataset consists of groups of orthologues which were generated by a clustering method based on pairwise BLAST sequence comparison. The resulting orthologue clusters (COGs) are compact, and even a simple classifier like 1NN can achieve an AUC of 1.00 on the majority of the groups. In order to get a "difficult" subset, we picked 13 COG groups (14,939 sequences) with a 1NN AUC score below 0.95. The results we obtained on this subset are shown in Table 3. These results are consistent with the SCOP results, where we found that OCCs outperform the binary classifiers. Typical training times are shown in Table 4. Here it is apparent that some OCCs have a 50 to 100 times smaller training time than the binary classifiers. Looking at our results in more detail, our first observation concerns what is referred to as the 'no free lunch theorem' in machine learning [7]: there is no classifier that is the best for all problems. The OC methods as well as
Table 4. Typical training times (in seconds) for the classifiers used in this study

Task                     a.118.1.   b.40.4.   c.2.1.
+train                   21         47        103
-train                   664        642       604
OCC methods
  CE-OC (SVM)            6.95       14.32     33.67
  CE-OC (RBN)            7.17       16.45     71.15
  K-center               0.26       0.29      0.35
  K-means                0.12       0.18      0.23
  SVDD                   0.14       0.51      0.76
  Parzen                 0.14       0.21      0.31
  SOM                    16.12      34.64     79.87
  Gauss                  0.48       0.51      0.56
Binary methods
  ANN                    > 15m      > 15m     > 15m
  SVM                    14.71      14.66     15.75
  Random Forest          61.89      54.16     64.15
the discriminative methods have a high diversity in their performance when using different feature sets on different tasks. It is interesting to note that on the DALI features, discriminative methods (ANN, RF) perform better than OCC methods (CE, Gauss, SVDD, SOM), while on SW features the situation is just the reverse. It is also seen that the relative ranking of the methods is different on the SCOP40 and COG datasets; nevertheless, the OCC methods perform better in both cases. Among the binary classifiers tested, ANN apparently gave the most robust results. Next, we should mention that the findings reported here were obtained using a compact, aggregated feature representation. An analysis was also carried out on the complete, non-aggregated feature set; the results, given in Table 2, show the same tendencies as those reported above (only the results on SCOP40 are reported here; in the case of COG, similar results were obtained). The results show a very large improvement in classification performance when aggregated features, i.e., a very effective dimension reduction, were used for our class-imbalanced and high-dimensional problems.
6 Conclusions
Based on the above comparisons, we may conclude that one-class classifiers can provide a viable alternative to binary classifiers in protein classification tasks. They do not require multiple alignment and can be easily incorporated into multiple classifier systems. As they require short training times, they can be especially useful for large-scale applications, and may provide a solution for the protein groups that binary classifiers cannot handle.
Acknowledgements

The authors thank Sándor Pongor for helpful suggestions. This research was supported by the French National Research Agency. This work was also supported in part by the NKTH grant of the National Technology Programme 2008 (project codename AALAMSRK NTP OM-00192/2008) of the Hungarian government.
References

1. Chen, Y., Zhou, X.S., Huang, T.S.: One-class SVM for learning in image retrieval. In: Proc. 2001 International Conference on Image Processing, vol. 1, pp. 34–37 (2001)
2. Shin, H.J., Eom, D.-H., Kim, S.-S.: One-class support vector machines: an application in machine fault detection and classification. Comput. Ind. Eng. 48(2), 395–408 (2005)
3. He, C., Girolami, M., Ross, G.: Employing optimised combinations of one-class classifiers for automated currency validation. Pattern Recognition 37, 1085–1096 (2004)
4. Sachs, A., Thiel, C., Schwenker, F.: One-class support-vector machines for the classification of bioacoustic time series. ICGST International Journal on Artificial Intelligence and Machine Learning (AIML) 6(4), 29–34 (2006)
5. Manevitz, L.M., Yousef, M.: One-class SVMs for document classification. Journal of Machine Learning Research 2, 139–154 (2001)
6. Sahami, M., Dumais, S., Heckerman, D., Horvitz, E.: A Bayesian approach to filtering junk E-mail. In: Learning for Text Categorization: Papers from the 1998 Workshop, Madison, Wisconsin, AAAI Technical Report WS-98-05 (1998)
7. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. John Wiley and Sons, New York (2001)
8. Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press, Oxford (1995)
9. Parzen, E.: On the estimation of a probability density function and mode. Annals of Mathematical Statistics 33, 1065–1076 (1962)
10. Japkowicz, N., Myers, C., Gluck, M.A.: A novelty detection approach to classification. In: IJCAI, pp. 518–523 (1995)
11. Ypma, A., Duin, R.: Support objects for domain approximation (1998)
12. Schölkopf, B., Platt, J.C., Shawe-Taylor, J., Smola, A.J., Williamson, R.C.: Estimating the support of a high-dimensional distribution. Neural Computation 13(7), 1443–1471 (2001)
13. Tax, D.M.J., Duin, R.P.W.: Support vector domain description. Pattern Recogn. Lett. 20(11-13), 1191–1199 (1999)
14. Tax, D.M.J., Duin, R.P.W.: Support vector data description. Mach. Learn. 54(1), 45–66 (2004)
15. Tax, D.M.J.: One-class classification: Concept-learning in the absence of counter-examples. Ph.D. thesis, Delft University of Technology (2001)
16. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410 (1990)
17. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. Journal of Molecular Biology 147(1), 195–197 (1981)
18. Holm, L., Park, J.: DaliLite workbench for protein structure comparison. Bioinformatics 16, 566–567 (2000)
19. Vlahovicek, K., Gaspari, Z., Pongor, S.: Efficient recognition of folds in protein 3D structures by the improved PRIDE algorithm. Bioinformatics 21, 3322–3323 (2005)
20. Vapnik, V.N.: Statistical Learning Theory. John Wiley and Sons, Chichester (1998)
21. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
22. Sonego, P., Pacurar, M., Dhir, S., Kertész-Farkas, A., Kocsor, A., Gáspári, Z., Leunissen, A.M., Pongor, S.: A protein classification benchmark collection for machine learning. Nucleic Acids Research 35(suppl. 1), D232–D236 (2007)
23. Tatusov, R.L., Fedorova, N.D., Jackson, J.D., Jacobs, A.R., Kiryutin, B., Koonin, E.V., Krylov, D.M., Mazumder, R., Mekhedov, S.L., Nikolskaya, A.N., Rao, B.S., Smirnov, S., Sverdlov, A.V., Vasudevan, S., Wolf, Y.I., Yin, J.J., Natale, D.A.: The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4 (September 2003)
24. Liao, L., Noble, W.S.: Combining pairwise sequence similarity and support vector machines for remote protein homology detection. In: RECOMB 2002: Proceedings of the Sixth Annual International Conference on Computational Biology, pp. 225–232. ACM Press, New York (2002)
25. Andreeva, A., Howorth, D., Brenner, S.E., Hubbard, T.J., Chothia, C., Murzin, A.G.: SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res. 32 (Database issue) (January 2004)
26. Henikoff, S., Henikoff, J.G.: Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89(22), 10915–10919 (1992)
27. Vlahovicek, K., Kajan, L., Agoston, V., Pongor, S.: The SBASE domain sequence resource, release 12: prediction of protein domain-architecture using support vector machines. Nucleic Acids Research 33(suppl. 1), 223 (2005)
28. Murvai, J., Vlahovicek, K., Szepesvári, C., Pongor, S.: Prediction of protein functional domains from sequences using artificial neural networks. Genome Res. 11, 1410–1417 (2001)
29. Paalanen, P.: Bayesian classification using Gaussian mixture model and EM estimation: Implementations and comparisons. Technical report, Department of Information Technology, Lappeenranta University of Technology, Lappeenranta (2004)
30. Allinson, N.M., Yin, H.: Self-organising maps for pattern recognition. In: Oja, E., Kaski, S. (eds.) Kohonen Maps, pp. 111–120. Elsevier, Amsterdam (1999)
31. Bánhalmi, A., Kocsor, A., Busa-Fekete, R.: Counter-example generation-based one-class classification. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 543–550. Springer, Heidelberg (2007)
32. Bánhalmi, A.: One-class classification methods via automatic counter-example generation. In: AIAP 2008: Proceedings of the 26th IASTED International Multi-Conference, Anaheim, CA, USA. ACTA Press (2008)
33. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, San Francisco (2005)
34. Joachims, T.: Making large-scale support vector machine learning practical. MIT Press, Cambridge (1998)
35. Egan, J.P.: Signal Detection Theory and ROC Analysis. Academic Press, New York (1975)
36. Sonego, P., Kocsor, A., Pongor, S.: ROC analysis: applications to the classification of biological sequences and 3D structures. Brief. Bioinform. (January 2008)
37. Gribskov, M., Robinson, N.: Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching (1996)
38. Cortes, C., Mohri, M.: AUC optimization vs. error rate minimization (2004)
39. Ingleby, J.D.: Signal detection theory and psychophysics. Journal of Sound Vibration 5, 519–521 (1967)
Prediction and Classification of Real and Pseudo MicroRNA Precursors via Data Fuzzification and Fuzzy Decision Trees

Na'el Abu-halaweh and Robert Harrison
Computer Science Department, Georgia State University, Atlanta, GA 30303, USA
Abstract. MicroRNAs (miRNAs) are short non-coding RNA molecules that play a significant role in post-transcriptional gene regulation. Although hundreds of miRNAs have been identified, recent studies indicate that more remain to be discovered. Identifying novel miRNAs remains a very important aspect of understanding their biological roles. Computational methods can complement experimental approaches and can play an important role in identifying miRNA candidates for further experimental validation. Most computational approaches utilize features extracted from miRNA precursor (pre-miRNA) sequences and/or their secondary structures to detect miRNAs. A key characteristic of pre-miRNAs is their hairpin structure. In this paper, fuzzy decision trees are applied to the prediction and classification of real and pseudo pre-miRNAs. In our model, a number of features that encode local and global characteristics of the pre-miRNA sequence and structure are used. A fuzzy model of the extracted features was constructed, and the fuzzified data was then fed into a fuzzy decision tree induction algorithm. Our experimental results showed that our method achieved better accuracy than other machine-learning-based computational approaches. Analyzing the results revealed that one of the features, the ratio of sequence length to number of basepairs, is very critical to the classification and identification of pre-miRNAs.

Keywords: miRNA, microRNA, pre-miRNA, microRNA precursors.
1 Introduction

MicroRNAs are short non-coding RNAs that play important roles in gene regulation by targeting mRNAs for cleavage or translational inhibition [1] [2] [3] [4] [5] [6]. The formation of mature miRNAs starts in the nucleus [4] [7] [13], where longer RNA molecules known as primary miRNAs (pri-miRNAs) are processed into hairpin structures known as miRNA precursors (pre-miRNAs) by the nuclear RNase III Drosha [13]. The pre-miRNAs are then transported to the cytoplasm by Exportin-5 [1] [8] [13]. In the cytoplasm, another RNase III, Dicer, cuts these pre-miRNAs to release the ~22 nt mature miRNAs [1] [7].
In the past few years, hundreds of miRNAs have been identified [2]; however, recent studies have shown that many more miRNAs remain undiscovered. For example, in [14] the authors argued that the number of miRNAs encoded in a mammalian genome is close to a thousand. This indicates that the number and the identity of miRNAs are still open questions and that much more work is needed to uncover and understand their biology. Experimental approaches to identifying new miRNAs are inefficient, time consuming, and expensive [4]. In spite of this, hundreds of miRNAs have been cloned since the discovery of the first miRNA. However, only abundant miRNAs can be easily detected by PCR or northern blot, due to limitations in those techniques [7]. To overcome the disadvantages of the experimental approaches, computational methods have been introduced. Computational methods can complement experimental approaches [3] and can play an important role in identifying miRNA candidates for further experimental validation [4]. Since pre-miRNAs and their secondary structures encode more discriminative information than their corresponding mature miRNAs, most computational approaches utilize features extracted from pre-miRNA sequences and/or their secondary structures to detect miRNAs [2] [4]. For example, MiRscan, a computational tool for miRNA prediction, utilizes the observation that known miRNAs are derived from phylogenetically conserved stem-loop precursor RNAs [1]. Generally, computational techniques for identifying miRNAs can be grouped into two main categories [4]: gradually hierarchical methods and directly discriminative ones. Gradually hierarchical methods combine comparative genomics information with pre-miRNA sequence and/or secondary structure features to predict new miRNAs. A typical idea is to use comparative genomics to filter out most of the hairpin structures that are not conserved in related species. This filtering step makes these methods unable to identify new miRNAs with no known close homologues; furthermore, some true pre-miRNAs may be excluded during early steps [4]. Examples of such tools include miRscan, miRSeeker, miRAlign, miRCheck, and miRFinder. On the other hand, discriminative identification methods utilize machine learning techniques to build a classifier. The classifier is trained using positive and negative pre-miRNA samples: positive samples are extracted from true pre-miRNA sequences and/or their secondary structures, whereas negative samples are extracted from sequence segments with similar stem-loop structures or from other known RNAs such as mRNA. Our approach falls into this category. In the past few years, a number of algorithms have been proposed for detecting miRNAs. In [1] a set of local structure-sequence features and support vector machines (SVM) are used to classify real and pseudo pre-miRNAs. In [2] a method for identifying clustered miRNAs was proposed. In [4] a string kernel and support vector machines are used to identify miRNAs. In [5] the authors combined features consisting of the local structure-sequence features, the minimum free energy of the secondary structure, and the P-value of a randomization test, and then applied random forests to classify real and pseudo pre-miRNAs. In [7] a computational tool that uses pre-miRNA sequence and structure alignment to identify miRNAs was introduced. In [9] a computational tool called miRseeker, used to identify miRNAs in Drosophila, is introduced.
In this paper, we combine the local pre-miRNA features defined by the triplet elements introduced in [1] and the global pre-miRNA features introduced in [6] to construct a fuzzy model of pre-miRNA sequence and secondary-structure data; we then use the fuzzy-decision-tree tool we developed [20] to classify and predict real and pseudo pre-miRNAs. Our fuzzy-decision-tree tool was found to be significantly more accurate than other single-decision-tree tools and competitive with random forest approaches [20]. The rest of the paper is organized as follows: Section 2 reviews the encoding schemes used and the data fuzzification model; it also reviews fuzzy decision trees and the inference method used to decide on test data, and introduces our algorithm. Section 3 summarizes our experimental results, Section 4 provides a brief discussion of the experimental results, and Section 5 concludes the paper.
2 Method

In this section, we review the encoding procedure of the pre-miRNA sequences and their secondary structures into a set of features that capture their local and global characteristics. The encoding scheme for local characteristics was proposed in [1] and that for the global characteristics in [6]; combining both schemes results in a total of forty-three features. We then discuss the data fuzzification procedure, review fuzzy decision trees and the inference method used to draw the final classification/prediction decision for a given input sequence, and finally introduce our algorithm.

2.1 Encoding the Pre-miRNA Hairpin Structures

As mentioned earlier, the pre-miRNA sequences and their secondary structures are encoded into 43 features, including 11 global features and 32 local ones. The global features were proposed in [6]; these include the symmetric difference, number of basepairs, GC content, length-to-basepair ratio, sequence length, length of the central loop, free energy per nucleotide, bulge size, tail length, and number of tails. Please refer to [6] for complete definitions of these features and how they are derived from the hairpin structure. The 32 local features are defined by the triplet elements introduced in [1]. A triplet is defined by one nucleotide and the secondary structure of its -1, 0, +1 positions. Since there are four nucleotides (A, C, G, U) and only two possible secondary-structure states, match '(' and mismatch '.', there are 4 × 2^3 = 32 possible triplet elements. Their count values are used as the 32 local features.

2.2 Data Fuzzification

Fuzzy decision trees can operate only on fuzzy data. The advantage of a fuzzy decision tree is that it is robust with respect to overfitting the data due to spurious precision. If the input data is not in a fuzzy form, it should be fuzzified; the fuzzification reflects the estimated uncertainty in the data. Generally, data fuzzification is achieved by dividing each feature's range into a number of fuzzy sets with a certain membership function. Common fuzzy membership functions include triangular, trapezoidal
and Gaussian. The data fuzzification parameters can be defined by experts, by dividing the range of the feature evenly into a number of fuzzy sets, or by using clustering. In this paper, since the datasets are not in a fuzzy form, we fuzzify the input data by dividing the range of all features evenly into an equal number of fuzzy sets. Refer to the experiments and results section for more details.

2.3 Fuzzy Decision Trees

Decision trees are one of the most popular applications used in machine learning. An advantage of decision trees is their clarity in representing classification information [16] [17]. The ID3 algorithm [21] forms the basis of many decision-tree-based algorithms and applications; an example of a machine-learning application based on this algorithm is C4.5 [22]. However, decision trees based on ID3 are very sensitive to small changes in feature values. The fuzzy ID3 algorithm is an extension of ID3 that integrates ideas from fuzzy set theory and fuzzy logic with ID3 to overcome spurious precision in data and the sensitivity to small changes in attribute values. Several versions of the fuzzy ID3 algorithm exist [18] [19]; however, their approach to building the fuzzy decision tree is similar to that of ID3. Given M features in a dataset, the algorithm recursively partitions the data into a number of fuzzy subsets based on a selected feature. The splitting feature is selected such that a certain information-theoretic measure is maximized; common measures include the Gini index, information gain, and classification ambiguity. A feature can be used only once along a path from the root node to any leaf node in the decision tree. The recursive splitting of the data continues until all the data at a tree node belong to the same class, all features have been used and no more are available, or certain thresholds have been met. Thresholds used include: the number of objects (data items or records) at a tree node is less than a given threshold, or the ratio of membership of a class at a tree node is higher than a given threshold. The former is called the leaf decision threshold and the latter the fuzziness control threshold. Unlike their crisp ancestors, where an object can propagate down only a single branch into a single tree node [20], fuzzy decision trees allow a data item or object to fall down multiple branches into one or more tree nodes. Similar to inference in crisp decision trees, inference in fuzzy decision trees starts at the root node; however, an object may fall into several leaf nodes, with possibly conflicting decisions. Several inference methods have been proposed; examples include selecting the node with the greatest firing threshold, and summing up the membership values of each class in the leaf nodes and assigning the object the class with the greater weight. Generally, a fuzzy tree induction process consists of four steps [20]: fuzzifying the training data, building a fuzzy decision tree, converting the fuzzy decision tree into a set of fuzzy rules, and applying the fuzzy rules for classification or prediction (inference). Fuzzy decision trees are more robust and more immune to noise, measurement errors, and data uncertainties than their crisp ancestors and other machine learning approaches. In addition, they can represent classification information in a more human-friendly form.
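The even partitioning into triangular fuzzy sets described in Section 2.2 can be sketched as follows; this is our illustration under assumed names, not the authors' tool.

```python
import numpy as np

def triangular_memberships(x, lo, hi, n_sets):
    """Divide the range [lo, hi] evenly into n_sets overlapping triangular
    fuzzy sets and return the membership of value x in each of them. For
    interior points the memberships sum to 1."""
    centers = np.linspace(lo, hi, n_sets)   # one peak per fuzzy set
    width = (hi - lo) / (n_sets - 1)        # peak-to-peak spacing
    return np.maximum(0.0, 1.0 - np.abs(x - centers) / width)

# Example: membership of a feature value in 4 fuzzy sets over [0, 1]
print(triangular_memberships(0.4, 0.0, 1.0, 4))  # -> [0. , 0.8, 0.2, 0. ]
```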
2.4 Fuzzy Decision Tree Parameter Optimization

In the fuzzy ID3 algorithm, two parameters need to be optimized: the fuzziness control threshold r and the leaf decision threshold n. Several values of both thresholds were tested and the optimal ones were used. Refer to the experiments and results section for more details.

2.5 Performance Evaluation Measures

To evaluate the performance of our method, several experiments were conducted. After each experiment the following values were calculated:
• True positives (TP): Positive samples predicted as positive.
• False positives (FP): Negative samples predicted as positive.
• True negatives (TN): Negative samples predicted as negative.
• False negatives (FN): Positive samples predicted as negative.
The above values were used to calculate three statistical measures: accuracy, sensitivity, and specificity. Equations (1), (2), and (3) define these three measures:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)

Sensitivity = TP / (TP + FN)    (2)

Specificity = TN / (FP + TN)    (3)
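For completeness, Eqs. (1)-(3) translate directly into code:

```python
def evaluate(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity as defined in Eqs. (1)-(3)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)
    return accuracy, sensitivity, specificity
```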
2.6 Algorithm

In this section, we discuss our approach to miRNA classification and prediction. After extracting the 43 features from the data using the procedures introduced in [1] and [6], the extracted feature values are fed into our fuzzy decision tree induction tool; the tool supports only triangular and trapezoidal fuzzy membership functions. All features' ranges are divided evenly into an equal number of fuzzy sets. After the data fuzzification step, a fuzzy decision tree is constructed using a version of the fuzzy ID3 algorithm [20] that utilizes both classification ambiguity and information gain to select the split variable. The constructed tree is then converted into a set of fuzzy rules. To do inference, the testing dataset is fuzzified using the same fuzzy parameters as those of the training dataset, and the generated fuzzy rules are applied to the fuzzified testing dataset. Since each of the generated fuzzy rules includes partial membership information of the object firing the rule in each of the classes in the training dataset, the partial membership information of the fired rules is recorded. For example, the resulting decision of applying a certain fuzzy rule may be 0.1 pre-miRNA and 0.9 background. To make the final decision on a certain sequence, the weights of the partial class memberships assigned by all fired rules are added, and the class with the larger weight is taken as the final decision.
328
N. Abu-halaweh and R. Harrison
Fig. 1. New model for microRNA prediction. In the figure, the datasets contain the 43 features.
Example the sum of the weights assigned by all fired rules may be 0.2 pre-miRNA and 0.8 background; in this case the final decision will be background. Note that our application allows the user to get a weighted decision, for example, 0.95 pre-miRNA and 0.05 background, from now on, we will refer to this inference method as inference-by-partial-membership. Fig. 1 shows the follow of the suggested method. Following is a summary of the steps of our algorithm: 1. 2.
1. Extract the 43 features from the training and testing datasets using the encoding procedures proposed in [1] and [6].
2. Build a fuzzy model to be used to fuzzify the training dataset; this includes specifying the number of fuzzy sets per feature and the fuzzy membership function.
3. Use the fuzzy model built in step 2 to fuzzify the training dataset.
4. Apply the fuzzy ID3 algorithm to the fuzzified training dataset to build a fuzzy decision tree.
5. Convert the fuzzy decision tree generated in step 4 into a set of fuzzy rules.
6. Use the fuzzy model built in step 2 to fuzzify the testing dataset.
7. Apply the fuzzy rules to each sequence in the testing dataset to classify it.
8. Use the inference-by-partial-membership method to make the final classification decision.
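A minimal sketch of the inference-by-partial-membership step (step 8). The rule representation, a mapping from class to weight for every fired rule, is our assumption; the paper's tool may store this differently.

from collections import defaultdict

def infer_by_partial_membership(fired_rules):
    # Sum the partial class memberships contributed by all fired rules
    # and return the winning class together with the weighted decision.
    totals = defaultdict(float)
    for rule in fired_rules:
        for cls, weight in rule.items():
            totals[cls] += weight
    return max(totals, key=totals.get), dict(totals)

# Example from the text: summed weights of 0.2 pre-miRNA vs. 0.8
# background yield the final decision "background".
label, weights = infer_by_partial_membership(
    [{"pre-miRNA": 0.1, "background": 0.5},
     {"pre-miRNA": 0.1, "background": 0.3}])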
3 Experiments and Results

3.1 Datasets

To test the performance of our approach, we used the same datasets used in the literature [1] [6] and compared our results with theirs. The datasets are described briefly in Table 1. Datasets 1 through 5 are from human, and datasets 6 through 16 are from other species. The encoded data was downloaded from [15]. For the preprocessing steps applied to the raw data, the reader is referred to [1] and [6].

Table 1. Summary of datasets [6]

 #  Dataset                   Number of Samples  Class
 1  TR-C                      163/168            Pre-miRNAs/background
 2  TE-C1                     30                 Pre-miRNAs
 3  TE-C2                     1000               Background
 4  Conserved-Hairpin (T3)    2444               Background
 5  Updated (T4)              39                 Pre-miRNAs
 6  Mus musculus              36                 Pre-miRNAs
 7  Rattus norvegicus         25                 Pre-miRNAs
 8  Gallus gallus             13                 Pre-miRNAs
 9  Danio rerio               6                  Pre-miRNAs
10  Caenorhabditis briggsae   73                 Pre-miRNAs
11  Caenorhabditis elegans    110                Pre-miRNAs
12  Drosophila pseudoobscura  71                 Pre-miRNAs
13  Drosophila melanogaster   71                 Pre-miRNAs
14  Oryza sativa              96                 Pre-miRNAs
15  Arabidopsis thaliana      75                 Pre-miRNAs
16  Epstein-Barr virus        5                  Pre-miRNAs
3.2 Parameter Optimization and Fuzzy Model Selection

To optimize the parameters of the fuzzy ID3 algorithm and to select a fuzzy membership function, the training dataset was divided into 10 folds: 9 folds were used for training and the 10th fold for validation. The fuzziness control threshold was varied between 0.85 and 1, and the number-of-objects-at-a-node threshold (the leaf decision threshold) was varied between 1 and 200. Triangular and trapezoidal fuzzy membership functions were both tested; for this purpose, each feature's range was divided into an equal number of fuzzy sets, with the number of fuzzy sets ranging from 2 to 15. For each combination of the three parameters (fuzzy model, i.e., number of fuzzy sets and triangular or trapezoidal membership function; leaf decision threshold; and fuzziness control threshold), the experiment was executed 10 times, changing the fold used for validation each time. The results were averaged and the parameters that maximized the accuracy were selected. Based on these experiments, the optimal number of fuzzy sets was determined to be 4, 8, or 12, and triangular and trapezoidal membership functions achieved comparable accuracies. In all the experiments reported below we therefore tested 4, 8, and 12 fuzzy sets with both triangular and trapezoidal membership functions; due to space limitations we report only the models that achieved the best results.
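The search just described can be sketched as a plain grid search; build_tree and accuracy below are hypothetical stand-ins for the authors' fuzzy ID3 tool, and the step sizes of the grid are our assumption.

import itertools
import numpy as np

def tune(folds, build_tree, accuracy):
    # 10-fold cross-validated grid search over the fuzzy model and the
    # two fuzzy ID3 thresholds (Section 3.2).  "folds" is a list of 10
    # disjoint subsets of the training data.
    grid = itertools.product(
        range(2, 16),                          # fuzzy sets per feature
        ("triangular", "trapezoidal"),         # membership function shape
        np.arange(0.85, 1.0 + 1e-9, 0.01),     # fuzziness control threshold r
        range(1, 201, 10),                     # leaf decision threshold n
    )
    best = None
    for n_sets, shape, r, n_leaf in grid:
        scores = []
        for i, validation in enumerate(folds):
            training = [x for j, f in enumerate(folds) if j != i for x in f]
            tree = build_tree(training, n_sets, shape, r, n_leaf)
            scores.append(accuracy(tree, validation))
        mean = float(np.mean(scores))
        if best is None or mean > best[0]:
            best = (mean, n_sets, shape, r, n_leaf)
    return best  # best mean accuracy and the parameters that achieved it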
3.3 Experimental Results

We performed four different experiments. In the first experiment, dataset 1 (TR-C) was used for training and datasets 2 through 16 for testing; this is the same training dataset used in [1] and [6]. The range of each feature was divided evenly into 4 fuzzy sets, with triangular membership functions arranged so that adjacent fuzzy sets cross each other at membership 0.5. Our approach achieved a 93.6% overall accuracy, 82.62% sensitivity, and 95.64% specificity. Table 2 compares the overall accuracy of the fuzzy decision tree (FDT) approach with the other machine-learning approaches used in the literature [1] [6] and posted on [15], as well as with random forest (RF). Note that we ran only RF and FDT ourselves; the other entries in all tables were obtained from [6].

Table 2. Fuzzy decision trees vs. other machine-learning approaches: overall accuracy on datasets 2 through 16
          SVM   C4.5  kNN   Ripper  Tri-SVM [1]  RF    FDT
Accuracy  93.1  88.9  88.9  87.2    89.1         90.7  93.6
In the second experiment we again used dataset 1 (TR-C) for training, with the same fuzzy model as in the first experiment, but tested only on datasets 2 through 5; in other words, we trained on human data and tested on human data. Our approach achieved a 95.64% overall accuracy, 95.65% sensitivity, and 95.64% specificity. Table 3 compares the overall accuracy, sensitivity, and specificity achieved by our approach on human data with those of the other machine-learning approaches from the literature [1] [6], in addition to random forest. Table 4 shows the accuracy achieved by our method on the individual human datasets (datasets 2 through 5).

Table 3. Fuzzy decision trees vs. other machine-learning approaches on human data, datasets 2 through 5

             SVM   C4.5  kNN   Ripper  Tri-SVM [1]  RF    FDT
Sensitivity  94.2  94.2  94.2  94.2    92.8         95.6  95.6
Specificity  93.3  89.7  88.8  87.0    88.7         90.9  95.6
Accuracy     93.3  89.8  88.9  87.1    90.9         91.0  95.6
Table 4. Accuracy achieved by fuzzy decision trees on the individual human datasets (2 through 5)

 #  Dataset                 Number of Samples  Accuracy
 2  TE-C1                   30                 100
 3  TE-C2                   1000               97.4
 4  Conserved-Hairpin (T3)  2444               94.9
 5  Updated (T4)            39                 92.3
In the third experiment, we used dataset 1 (TR-C) for training and datasets 6 through 16 for testing (see Table 1); in other words, we trained on human data and tested on data from other species. Here we used a new fuzzy model in which the range of each feature is divided evenly into eight fuzzy sets, with trapezoidal membership functions arranged so that adjacent fuzzy sets cross each other at membership 0.5. Our approach achieved a 95.52% overall accuracy and 95.52% sensitivity; since the testing dataset contains no negative samples, specificity cannot be calculated and the sensitivity equals the accuracy. Table 5 compares the overall accuracy and sensitivity achieved by our approach on the other-species data with those of the other machine-learning approaches from the literature [1] [6], in addition to random forest. Table 6 shows the accuracy achieved by our method on the individual species datasets (datasets 6 through 16).

Table 5. Fuzzy decision trees vs. other machine-learning approaches on the other-species datasets, datasets 6 through 16

             SVM   C4.5  kNN   Ripper  Tri-SVM [1]  RF    FDT
Sensitivity  91.7  83.3  87.8  87.8    90.9         89.2  95.5
Accuracy     91.7  83.3  87.8  87.8    90.9         89.2  95.5
Table 6. Accuracy achieved by fuzzy decision trees on the individual species datasets (6 through 16)

 #  Dataset                   Number of Samples  Accuracy
 6  Mus musculus              36                 94.4
 7  Rattus norvegicus         25                 84.0
 8  Gallus gallus             13                 92.3
 9  Danio rerio               6                  83.3
10  Caenorhabditis briggsae   73                 95.9
11  Caenorhabditis elegans    110                93.6
12  Drosophila pseudoobscura  71                 97.2
13  Drosophila melanogaster   71                 95.8
14  Oryza sativa              96                 100
15  Arabidopsis thaliana      75                 100
16  Epstein-Barr virus        5                  40.0
In the fourth experiment, we used datasets 6 through 16 (the other-species datasets) together with the human dataset 3 (TE-C2) for training, and we used dataset 1 (TR-C), dataset 2 (TE-C1), dataset 4 (Conserved-Hairpin), and dataset 5 (Updated) for testing; in other words, we trained on pre-miRNA data from other species plus background samples from human, and tested on human data. Here we used a new fuzzy model in which the range of each feature is divided evenly into twelve fuzzy sets, with trapezoidal membership functions arranged so that adjacent fuzzy sets cross each other at membership 0.5. Our approach achieved a 96.2% overall accuracy, 91.8% sensitivity, and 96.6% specificity.

To further validate our method and evaluate its effectiveness, we tested it on the most recent release of the Rfam database, Release 12.0 [23] [24] [25] [26] [27].
Since the Rfam dataset contains only positive samples, it was combined with the background sequences from the TE-C2 dataset (1000 sequences), the T3 dataset (2444 sequences), and the TR-C dataset (168 sequences). The resulting dataset was then divided randomly into three folds such that the ratio of negative to positive samples in each fold matched that of the entire dataset. Two of the folds were used for training and the third for testing, and the experiment was repeated three times, changing the testing fold each time (3-fold cross-validation). Averaged over the folds, our method achieved an accuracy of 92.28%, a sensitivity of 93.92%, and a specificity of 88.88%. A sketch of this stratified split appears below.
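A minimal sketch of such a stratified split, using scikit-learn's StratifiedKFold (our choice of library); fit and score are stand-ins for the fuzzy decision tree tool.

import numpy as np
from sklearn.model_selection import StratifiedKFold

def stratified_three_fold(X, y, fit, score):
    # X: the 43-feature matrix; y: 1 for pre-miRNA, 0 for background.
    # StratifiedKFold keeps the negative/positive ratio of every fold
    # equal to that of the whole dataset, as described above.
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    results = []
    for train_idx, test_idx in skf.split(X, y):
        model = fit(X[train_idx], y[train_idx])
        results.append(score(model, X[test_idx], y[test_idx]))
    # Average accuracy/sensitivity/specificity over the three folds.
    return np.mean(results, axis=0)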
4 Discussion

Examining all the fuzzy decision trees produced during the experiments, we noticed that the length-basepair ratio (the length of the sequence divided by the number of base pairs) is the dominant feature in the data. Using only this feature, our method, when trained on dataset 1 and tested on all other datasets, achieves a 92.1% overall accuracy, 65.7% sensitivity, and 97.0% specificity. Investigating the values taken by this feature, we found that if the length-basepair ratio is greater than 3.9, the sequence does not represent a pre-miRNA. If the ratio is less than or equal to 3.9, the sequence may represent either a pre-miRNA or a background sequence, depending on its fuzzy membership values in each of the three overlapping fuzzy sets spanning the range 0 to 3.9; moreover, within this range, the likelihood that the sequence represents a pre-miRNA increases as the ratio decreases. A further study on Release 12.0 of the Rfam database [23] [24] [25] [26] [27] showed that 95.86% of the pre-miRNA sequences in that database have a length-basepair ratio of at most 3.9.

To further analyze the significance of the length-basepair ratio for classification accuracy, we removed all other features from the dataset and repeated the first, second, and third experiments. In the first experiment the accuracy fell from 93.6% to 85.3%; in the second experiment our method achieved an accuracy of 96.64%, a sensitivity of 76.81%, and a specificity of 97.0%; and in the third experiment it achieved an overall accuracy and sensitivity of only 64.4%. These results indicate that, using the length-basepair ratio alone, our method can already exclude many of the negative samples. Since the length-basepair ratio is a function of the sequence length and the number of base pairs, we expected that removing either of those two features would not affect the results significantly. Indeed, removing the sequence length feature and repeating the first experiment left our results unchanged, while removing both features caused the accuracy in the first experiment to fall only from 93.6% to 93.3%. Thus, as long as the length-basepair ratio and the number of base pairs are included in the dataset, removing the sequence length feature has no effect on the results.
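The single-feature screen implied by this discussion is easy to state in code; the sketch below (ours) hard-codes the 3.9 cutoff observed in the trees.

def length_basepair_ratio(seq_len, n_basepairs):
    # Sequence length divided by the number of base pairs in the
    # predicted hairpin structure: the dominant feature found above.
    return seq_len / n_basepairs

def quick_screen(seq_len, n_basepairs, cutoff=3.9):
    # Ratios above the cutoff were never observed for pre-miRNAs in
    # the trees we examined; below it, both classes remain possible.
    if length_basepair_ratio(seq_len, n_basepairs) > cutoff:
        return "background"
    return "possible pre-miRNA"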
Finally, we noticed that the data fuzzification model plays an important role in determining the performance of our method. We believe that we can significantly improve our results by carefully designing the data fuzzification model; we hypothesize that the more accurately the fuzzification model reflects the uncertainty or ambiguity of the data, the more effective the fuzzy decision tree will be as a decision tool.
5 Conclusion

In this paper we presented a new method to classify and predict pre-miRNA sequences. We used the same sequence-encoding techniques introduced in [1] and [6]; the encoded features are used to build a fuzzy model of the sequence data, and the fuzzy decision tree induction method is then used to classify positive and negative pre-miRNA samples. Our approach achieved better results than other machine-learning methods found in the literature. We attribute these results to the nature of fuzzy decision trees and to the fuzzy model used to fuzzify the data. The data fuzzification model plays a critical role in determining the performance of our method; we believe that fuzzification models more effective than those presented in this work exist, and further research is needed to investigate them. Finally, we believe that important information is lost during the data encoding and preprocessing steps; we therefore expect that designing a fuzzy string kernel capable of processing the raw sequence data will result in a significant improvement in the performance of our method.
References

1. Xue, C., Li, F., He, T., Liu, G., Li, Y., Zhang, X.: Classification of Real and Pseudo MicroRNA Precursors Using Local Structure-Sequence Features and Support Vector Machine. BMC Bioinformatics 6(1), 310 (2005)
2. Sewer, A., Paul, N., Landgraf, P., Aravin, A., Pfeffer, S., Brownstein, M., Tuschl, T., van Nimwegen, E., Zavolan, M.: Identification of Clustered MicroRNAs Using an Ab Initio Prediction Method. BMC Bioinformatics 6(1), 267 (2005)
3. Yoon, S., De Micheli, G.: Computational Identification of MicroRNAs and Their Targets. Birth Defects Research 78, 118–128 (2006)
4. Xu, J., Li, F., Sun, Q.: Identification of MicroRNA Precursors with Support Vector Machine and String Kernel. Genomics, Proteomics & Bioinformatics 6(2), 121–128 (2008)
5. Jiang, P., Wu, H., Wang, W., Ma, W., Sun, X., Lu, Z.: MiPred: Classification of Real and Pseudo MicroRNA Precursors Using Random Forest Prediction Model with Combined Features. Nucleic Acids Res. 35, W339–W344 (2007)
6. Zheng, Y., Hsu, W., Lee, M.L., Wong, L.: Exploring Essential Attributes for Detecting MicroRNA Precursors from Background Sequences. In: 32nd International Conference on Very Large Data Bases Workshop on Data Mining in Bioinformatics, Seoul, Korea (2006)
7. Wang, X., Zhang, J., Li, F., Gu, J., He, T., Zhang, X., Li, Y.: MicroRNA Identification Based on Sequence and Structure Alignment. Bioinformatics 21, 3610–3614 (2005)
8. Jones-Rhoades, M., Bartel, D.: Computational Identification of Plant MicroRNAs and Their Targets, Including a Stress-Induced miRNA. Mol. Cell 14(6), 787–799 (2004)
9. Lai, E., Tomancak, P., Williams, R., Rubin, G.: Computational Identification of Drosophila MicroRNA Genes. Genome Biol. 4(7), R42 (2003)
10. Ambros, V., Bartel, B., Bartel, D.: A Uniform System for MicroRNA Annotation. RNA 9(3), 277–279 (2003)
11. Gordon, L., Chervonenkis, A., Gammerman, A., Shahmuradov, I., Solovyev, V.: Sequence Alignment Kernel for Recognition of Promoter Regions. Bioinformatics 19(15), 1964–1971 (2003)
12. Bartel, D.: MicroRNAs: Genomics, Biogenesis, Mechanism, and Function. Cell 116, 281–297 (2004)
13. Ambion: miRNA Research Guide, http://www.ambion.com/miRNA
14. Berezikov, E., Guryev, V., van de Belt, J., Wienholds, E., Plasterk, R.H., Cuppen, E.: Phylogenetic Shadowing and Computational Identification of Human MicroRNA Genes. Cell 120, 21–24 (2005)
15. Zheng, Y., Hsu, W., Lee, M.L., Wong, L.: Exploring Essential Attributes for Detecting MicroRNA Precursors from Background Sequences, http://www.comp.nus.edu.sg/~wongls/projects/miRNA/suppl-info/vldb2006.htm
16. Janikow, C.: Exemplar Learning in Fuzzy Decision Trees. In: 5th IEEE International Conference on Fuzzy Systems, New Orleans, vol. 2, pp. 1500–1505 (1996)
17. Lee, K., Lee, J., Lee-Kwang, H.: A Fuzzy Decision Tree Induction Method for Fuzzy Data. In: IEEE Conference on Fuzzy Systems, FUZZ-IEEE 1999, Seoul, vol. 1, pp. 16–25 (1999)
18. Umano, M., Okamoto, H., Hatono, I., Tamura, H., Kawachi, F., Umedzu, S., Kinoshita, J.: Fuzzy Decision Trees by Fuzzy ID3 Algorithm and Its Application to Diagnosis Systems. In: 3rd IEEE Conference on Fuzzy Systems, Orlando, vol. 3, pp. 2113–2118 (1994)
19. Yuan, Y., Shaw, M.: Induction of Fuzzy Decision Trees. Fuzzy Sets and Systems 69(2), 125–139 (1995)
20. Abu-halaweh, N., Harrison, R.: Practical Fuzzy Decision Trees. In: IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2009), Nashville (2009) (to appear)
21. Quinlan, J.R.: Induction of Decision Trees. Machine Learning 1, 81–106 (1986)
22. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993)
23. Griffiths-Jones, S., Saini, H., van Dongen, S.: miRBase: Tools for MicroRNA Genomics. Nucleic Acids Res. 36 (Database Issue), D154–D158 (2008)
24. Griffiths-Jones, S., Grocock, R.J., van Dongen, S., Bateman, A., Enright, A.: miRBase: MicroRNA Sequences, Targets and Gene Nomenclature. Nucleic Acids Res. 34 (Database Issue), D140–D144 (2006)
25. Griffiths-Jones, S.: The MicroRNA Registry. Nucleic Acids Res. 32 (Database Issue), D109–D111 (2004)
26. Ambros, V., Bartel, B., Bartel, D.P., Carrington, J.C., Chen, X., Dreyfuss, G., Griffiths-Jones, S., Marshall, M., Ruvkun, G., Tuschl, T.: A Uniform System for MicroRNA Annotation. RNA 9(3), 277–279 (2003)
27. Rfam Release 12.0: ftp://ftp.sanger.ac.uk/pub/mirbase/sequences/CURRENT/hairpin.fa.gz
Author Index
Abu-halaweh, Na’el 323
Al Seesi, Sahar 135
Altun, Gulsah 28
Ammar, Reda 135
Apple, Jim 88
Astrovskaya, Irina 28
Bánhalmi, András 310
Bao, Lichen 112
Baumgartner, Christian 63
Bereg, Sergey 112
Blin, Guillaume 52
Bonizzoni, Paola 186
Boyer, Frédéric 173
Busa-Fekete, Róbert 310
Chauve, Cedric 173
Chen, Jianer 75
Dai, Yang 40
Della Vedova, Gianluca 186
Dondi, Riccardo 186
Eshaghi, Majid 236
Gelfand, Mikhail S. 1
Gremalschi, Stefan 28
Gusfield, Dan 88
Harrison, Robert 323
Hayes, Matthew 248
He, Matthew 297
Heber, Steffen 18, 260, 272
Holder, Allen 198
Holec, Matěj 5
Howard, Brian E. 18, 260
Jiménez-Montaño, Miguel A. 297
Karuturi, R. Krishna Murthy 236
Kazakov, Alexei E. 1
Kégl, Balázs 310
Kennedy, Justin 221
Kim, Jihye 260, 272
Kister, Alexander 124
Kléma, Jiří 5
Korostelev, Yuri D. 1
Kulikowski, Casimir 284
Kundeti, Vamsi 148
Laikova, Olga N. 1
Langley, Thomas 198
Larsen, Peter 40
Lei, Zhu 236
Li, Jing 248
Li, Juntao 236
Li, Min 75
Liu, Jianhua 236
Lushington, Gerald Henry 63
Măndoiu, Ion 221
McPherson, Andrew 173
Mirkin, Boris 284
Mironov, Andrei A. 1
Mottl, Vadim 284
Muchnik, Ilya 284
Muñoz, Adriana 160
Ouangraoua, Aïda 173
Paşaniuc, Bogdan 221
Pan, Yi 75
Pfeifer, Bernhard 63
Pirola, Yuri 186
Rajasekaran, Sanguthevar 135, 148
Rakhmaninova, Alexandra B. 1
Ravcheev, Dmitry A. 1
Rizzi, Romeo 186
Rodionov, Dmitry A. 1
Sankoff, David 160
Sick, Beate 18
Sikora, Florian 52
St. John, Katherine 88
Sul, Seung-Jin 100
Sulimova, Valentina 284
Sunyaev, Shamil R. 234
Tannier, Éric 173
Tilg, Bernhard 63
Tolar, Jakub 5
Tsinoremas, Nicholas F. 87
Ukkonen, Esko 159
Venkatachalam, Balaji 88
Vialette, Stéphane 52
Visvanathan, Mahesh 63
Vitreschak, Alexei G. 1
Wang, Jianxin 75
Williams, Tiffani L. 100
Wu, Yufeng 209
Yunus, Fajrian 236
Železný, Filip 5
Zelikovsky, Alexander 28
Zhao, Sihui 260, 272