Lecture Notes in Bioinformatics
6832
Edited by S. Istrail, P. Pevzner, and M. Waterman Editorial Board: A. Apostolico S. Brunak M. Gelfand T. Lengauer S. Miyano G. Myers M.-F. Sagot D. Sankoff R. Shamir T. Speed M. Vingron W. Wong
Subseries of Lecture Notes in Computer Science
Osmar Norberto de Souza Guilherme P. Telles Mathew J. Palakal (Eds.)
Advances in Bioinformatics and Computational Biology 6th Brazilian Symposium on Bioinformatics, BSB 2011 Brasilia, Brazil, August 10-12, 2011 Proceedings
13
Series Editors Sorin Istrail, Brown University, Providence, RI, USA Pavel Pevzner, University of California, San Diego, CA, USA Michael Waterman, University of Southern California, Los Angeles, CA, USA Volume Editors Osmar Norberto de Souza Faculdade de Informática - PUCRS Avenida Ipiranga 6681, Prédio 32 - Sala 608, 90619-900, Porto Alegre-RS, Brazil E-mail:
[email protected] Guilherme P. Telles Instituto de Computação - Unicamp Avenida Albert Einstein, 1251, 13083-852, Campinas-SP, Brazil E-mail:
[email protected] Mathew J. Palakal Indiana University Purdue University Indianapolis 535 W. Michigan Str. IT 477, Indianapolis, IN 46202, USA E-mail:
[email protected]
ISSN 0302-9743 e-ISSN 1611-3349 ISBN 978-3-642-22824-7 e-ISBN 978-3-642-22825-4 DOI 10.1007/978-3-642-22825-4 Springer Heidelberg Dordrecht London New York Library of Congress Control Number: 2011932980 CR Subject Classification (1998): J.3, I.2, F.1, H.2.8, I.5, H.3 LNCS Sublibrary: SL 8 – Bioinformatics
© Springer-Verlag Berlin Heidelberg 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Preface
This volume contains the full papers and extended abstracts selected by peer review for presentation at the 6th Brazilian Symposium on Bioinformatics, BSB 2011, held in Bras´ılia, Distrito Federal, Brazil, during August 10–12, 2011. The first three meetings of this series, which took place in 2002, 2003, and 2004, were called WOB (Workshop on Bioinformatics). In 2005, the conference name was changed from WOB to its current name BSB to reflect the increase both in the quality of the contributions and in the interest of the scientific community for the meeting. Since then the BSB proceedings have been published as a special issue of the subseries Lecture Notes in Bioinformatics (volumes 3594/2005, 4643/2007, 5167/2008, 5676/2009, and 6268/2010) of the Lecture Notes in Computer Science series by Springer. BSB topics of interest embrace many areas of bioinformatics, ranging from theoretical aspects of problems in bioinformatics to applications in molecular biology, biochemistry, genetics, and associated subjects. The Program Committee Chairs are very thankful to the authors of all submitted papers, and especially to the Program Committee and additional reviewers for their careful work in helping to select the eight full papers and four extended abstracts to build up this proceedings volume. The members of BSB Organizing Committees also thank the keynote speakers, Mathew J. Palakal, Jo˜ ao Setubal, Peter F. Stadler and Sergio Verjovski-Almeida, for their participation in the event. The conference was sponsored by the Brazilian Computer Society (SBC), with further support from the Brazilian sponsoring agencies CNPq and CAPES. Special thanks go to Springer for their continued support by agreeing to publish the BSB 2011 proceedings volume and to the CEBioComp/SBC Steering Committee. Last but not least, our sincerest thanks to all who have supported the Brazilian Symposium on Bioinformatics along all these years. In 2011, we are in debt to the indispensable close help and assistance of Maria Em´ılia Telles Walter. Mia, it did make a difference. Our gratitude also goes to Marcelo de Macedo Br´ıgido and Maria Em´ılia Telles Walter and their local team at UnB: Tain´ a Raiol, Daniel Saad Nunes, Halian Vilela, Felipe Lessa, Filipe Lima, Maria Beatriz W. Costa, Paulo Alvarez, and Tulio Conrado da Silva. August 2011
Osmar Norberto de Souza Guilherme P. Telles Mathew J. Palakal
Organization
BSB 2011 was promoted by the Brazilian Computer Society (SBC) and was organized by the Institute of Biology and by the Institute of Exact Sciences of the University of Bras´ılia (UnB), Bras´ılia-DF, Brazil.
Conference Chairs Marcelo de Macedo Br´ıgido Maria Em´ılia M.T. Walter
Laboratory of Molecular Biology, Institute of Biology, UnB, Brazil Departament of Computer Science, Institute of Exact Sciences, UnB, Brazil
Program Chairs Osmar Norberto de Souza Guilherme P. Telles Mathew J. Palakal
Faculty of Informatics–PUCRS, Brazil Institute of Computing–Unicamp, Brazil Indiana University Purdue University Indianapolis, USA
Steering Comittee Marcelo de Macedo Br´ıgido Andr´e C.P.L.F. de Carvalho Carlos E. Ferreira Katia Guimar˜ aes Sergio Lifschitz Osmar Norberto de Souza Maria Em´ılia M.T. Walter
University of Bras´ılia, Brazil University of S˜ao Paulo–S˜ ao Carlos, Brazil University of S˜ ao Paulo, Brazil Federal University of Pernambuco, Brazil Pontifical Catholic University of Rio de Janeiro, Brazil Pontifical Catholic University of Rio Grande do Sul, Brazil University of Bras´ılia, Brazil
Program Committee Nalvo F. Almeida Ana L´ ucia C. Bazzan Marcelo Brigido Mario Caccamo Andr´e C.P.L.F. de Carvalho Ivan G. Costa Mark Craven
Federal University of Mato Grosso do Sul, Brazil Federal University of Rio Grande do Sul, Brazil University of Bras´ılia, Brazil TGAC-BBSRC, UK University of S˜ao Paulo–S˜ ao Carlos, Brazil Federal University of Pernambuco, Brazil University of Wisconsin, USA
VIII
Organization
Alberto D´ avila Zanoni Dias Alan Durham Fazel Famili Katia S. Guimar˜ aes Ronaldo Hashimoto Paul Horton Maricel Kann Paula Kuser-Falc˜ ao Alair Pereira do Lago Ney Lemke Sergio Lifschitz Ion Mandoiu Natalia F. Martins Wellington Martins Anna Panchenko F´ abio Passeti Duncan Ruiz Alexander Schliep Jo˜ao Setubal Marcilio M.C.P. de Souto Rainer Spang Peter F. Stadler Jerzy Tiuryn Maria Emilia M.T. Walter Alex Zelikovsky
Fiocruz, Brazil University of Campinas, Brazil University of S˜ ao Paulo, Brazil Institute for Information Technology, Canada Federal University of Pernambuco, Brazil University of S˜ ao Paulo, Brazil AIST, Japan UMBC, USA Embrapa, Brazil University of S˜ ao Paulo, Brazil IBB-Unesp, Brazil Pontifical Catholic University of Rio de Janeiro, Brazil University of Connecticut, USA Embrapa, Brazil Federal University of Goi´ as, Brazil NIH, USA INCA, Brazil Pontifical Catholic University of Rio Grande do Sul, Brazil Rutgers University, USA Virginia Bioinformatics Institute, USA Federal University of Pernambuco, Brazil University of Regensburg, Germany University of Leipzig, Germany University of Warsaw, Poland University of Bras´ılia, Brazil Georgia State University, USA
Additional Reviewers Said Adi Aleteia P.F. de Araujo Francisco Eloi Ara´ ujo Katti Faceli Mileidy Gonzalez Carlos Higa Maristela T. de Holanda Renato P. Ishii
Federal University of Mato Grosso do Sul, Brazil University of Bras´ılia, Brazil Federal University of Mato Grosso do Sul, Brazil Federal University of S˜ao Carlos, Brazil NIH, USA University of S˜ ao Paulo, Brazil University of Bras´ılia, Brazil Federal University of Mato Grosso do Sul, Brazil
Organization
David Martins-Jr. Edson Matsubara Mariana Mendoza Roberto Riga John Spouge David Swarbreck Manoj Tyagi Michel Yamagishi
Federal University of ABC, Brazil Federal University of Mato Grosso do Sul, Brazil Federal University of Rio Grande do Sul, Brazil Embrapa, Brazil NIH, USA The Genome Analysis Center, UK NCBI/NIH, USA Embrapa, Brazil
Sponsoring Institutions Conselho Nacional de Desenvolvimento Cient´ıfico e Tecnol´ ogico (CNPq) Coordena¸c˜ao de Aperfei¸coamento de Pessoal de N´ıvel Superior (CAPES) Brazilian Computer Society (SBC) University of Bras´ılia (UnB)
IX
Table of Contents
Full Papers MicroRNA or Not MicroRNA? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . David Langenberger, Sebastian Bartschat, Jana Hertel, Steve Hoffmann, Hakim Tafer, and Peter F. Stadler Hierarchical Multilabel Protein Function Prediction Using Local Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ricardo Cerri and Andr´e C.P.L.F. de Carvalho Identifying Significant Features in HIV Sequence to Predict Patients’ Response to Therapies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Samuel Evangelista de Lima Oliveira, Luiz Henrique de Campos Merschmann, and Leoneide Erica Maduro Bouillet
1
10
18
Gene Prediction by Multiple Spliced Alignment . . . . . . . . . . . . . . . . . . . . . . Rodrigo Mitsuo Kishi, Ronaldo Fiorilo dos Santos, and Said Sadique Adi
26
A New Algorithm for Sparse Suffix Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gustavo Akio T. Sacomoto and Alair Pereira do Lago
34
Analysis and Implementation of Sorting by Transpositions Using Permutation Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Marcelo P. Lopes, Marilia D.V. Braga, Celina M.H. de Figueiredo, Rodrigo de A. Hausen, and Luis Antonio B. Kowada Improved Gene Expression Clustering with the Parameter-Free PKNNG Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ariel E. Bay´ a and Pablo M. Granitto Efficiently Querying Protein Sequences with the Proteinus Index . . . . . . . Felipe Alves da Louza, Ricardo Rodrigues Ciferri, and Cristina Dutra de Aguiar Ciferri
42
50 58
Extended Abstracts SciPhy: A Cloud-Based Workflow for Phylogenetic Analysis of Drug Targets in Protozoan Genomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kary A.C.S. Oca˜ na, Daniel de Oliveira, Eduardo Ogasawara, Alberto M.R. D´ avila, Alexandre A.B. Lima, and Marta Mattoso
66
XII
Table of Contents
A Conceptual Model for Transcriptome High-Throughput Sequencing Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ruben Cruz Huacarpuma, Maristela Holanda, and Maria Emilia Walter A Conceptual Many Tasks Computing Architecture to Execute Molecular Docking Simulations of a Fully-Flexible Receptor Model . . . . . Renata De Paris, F´ abio A. Frantz, Osmar Norberto de Souza, and Duncan D. Ruiz
71
75
kGC: Finding Groups of Homologous Genes across Multiple Genomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Guilherme P. Telles, Nalvo F. Almeida, Marcelo M. Brigido, Paulo Antonio Alvarez, and Maria Emilia Walter
79
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
83
MicroRNA or Not MicroRNA? David Langenberger1−3, Sebastian Bartschat1, Jana Hertel1,2 , Steve Hoffmann3,2 , Hakim Tafer1,2 , and Peter F. Stadler1−8 1 Bioinformatics Group, Department of Computer Science Interdisciplinary Center for Bioinformatics, University of Leipzig, H¨ artelstraße 16-18, D-04107 Leipzig, Germany 3 LIFE – Leipzig Research Center for Civilization Diseases, University of Leipzig, Germany 4 Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany 5 Fraunhofer Institut f¨ ur Zelltherapie und Immunologie, Perlickstraße 1, D-04103 Leipzig, Germany 6 Department of Theoretical Chemistry University of Vienna, W¨ ahringerstraße 17, A-1090 Wien, Austria Center for non-coding RNA in Technology and Health, University of Copenhagen, Grønneg˚ ardsvej 3, DK-1870 Frederiksberg C, Denmark 8 Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501, USA 2
7
Abstract. The avalanche of next generation sequencing data has led to a rapid increase of annotated microRNAs in the last few years. Many of them are specific to individual species or rather narrow clades. A closer inspection of the current version of miRBase shows that dozens of entries conflict with other ncRNAs, in particular snoRNAs. With few exceptions, these cases show little similarities to canonical microRNAs, however, and thus they should be considered as mis-annotations.
1
Introduction
MicroRNAs and small nucleolar RNAs are thought of distinct classes of ncRNAs with very different functions. While microRNAs are matured to ∼ 20nt sequences that direct post-transcriptional gene silencing, snoRNAs canonically guide, in their complete form, the chemical modification of mostly rRNAs and snRNAs [1]. On the other hand, high-throughput sequencing studies revealed that snoRNAs are a prolific source of sequence fragments of microRNA size [2,3,4,5], termed sdRNAs. At least some of these snoRNA-derived small RNAs, similar to microRNAs, interact human Argonaut and affect gene expression [6]. Recently, efficient gene silencing has been demonstrated for 11 small RNAs derived from box C/D sno-miRNA [5]. Similar short RNAs, in a few cases with validated functions in gene silencing, are also produced from most other wellknown structured RNAs including Y RNA [4,7], vault RNAs [8,9,10], snRNAs [4], and tRNAs [11,12,13,14]. Recent work [15], furthermore, cast doubt on the microRNA nature of several short RNA products that likely originate from the 3’-end of matured tRNAs since they include the post-transcriptionally append O. Norberto de Souza, G.P. Telles, and M.J. Palakal (Eds.): BSB 2011, LNBI 6832, pp. 1–9, 2011. c Springer-Verlag Berlin Heidelberg 2011
2
D. Langenberger et al.
CCA tail. The large numbers of CCA-tagged reads from nearly all tRNAs, which are abundant in deep sequencing data, supports a tRNA-origin of a few annotated “microRNAs”. Canonical microRNAs are generated from a quite specific processing pathway [16]: a polymerase II transcript, the primary miRNA precursor (pri-miRNA) is cropped by the Drosha-DGCR8 complex, also known as Microprocessor. The resulting pre-microRNA hairpin uses the exportin-5 pathway to reach the cytoplasm, where it is cleaved to generate the mature miRNA. Early reports [17] of pre-microRNAs originating from pol-III transcription have recently been refuted [18]. A survey of human pol-III transcription [19], furthermore, recovered no annotated microRNA except two mis-annotations: a vault RNA (hsa-mir-886) and the Y5 RNA (hsa-mir-1975). Mirtrons, on the other hand, are short introns forming stable hairpin structures [20,21,22,23]. Both ends of mirtrons are defined by the splice sites. A related, mirtron-like source of small RNAs requires both splicing and exosome-mediated trimming to extract the pre-microRNA hairpin [24,25]. In this case only one end of the precursor hairpin is defined by the splicing reaction. The production of small RNAs from these intronic precursors is independent of Drosha [22,25]. A recent review [16] lists several additional esoteric pathways, including at least two of them independent of both Drosha and Dicer. The similarity between H/ACA snoRNAs and microRNAs has been noticed in several computational studies. For example, [26] reports twenty miRNA precursors that show significant similarity to H/ACA snoRNAs; five of these (miR-151, miR-605, mir-664, miR-215 and miR-140) even bind to dyskerin, a component of the H/ACA snoRNP. Some microRNAs, furthermore, are known to be predominantly localized in the nucleolus [27] emphasizing their snoRNA-like features. This may suggest that a subset of microRNA precursors may have evolved from snoRNAs [26]. The production of small RNAs from snoRNAs, on the other hand, is independent of Drosha [6,3,5], and in some cases Drosha even inhibits sdRNA formation [3], suggesting that the snoRNAs and (canonical) microRNAs are in general clearly distinguished entities. Here we investigate systematically the conflicts in annotation between microRNAs in miRBase [28] and other classes of ncRNAs as defined by a variety of other databases. Since most of the conflicts, not surprisingly, concern overlaps of microRNA and snoRNA assignments, we focus in particular on these cases.
2
Conflicts of microRNA and snoRNA Annotation
In order to determine to what extent the microRNA annotation of miRBase conflicts with non-coding RNA annotation stored in other databases, we retrieved the mature miR sequences from miRBase (v. 16) and compared them against Rfam [30] (v. 10.0) using the mapping tool segemehl [31]. We found 38 mature miRNAs mapping perfectly to other annotated ncRNAs. Stringently requiring exact hits of sequences from the same species and collapsing overlaps observed in more than one species left 26 examples. In addition, a few previously known
MicroRNA or Not MicroRNA?
3
Table 1. Overlap of annotation as microRNA and other ncRNA classes. Classification probabilities for microRNAs and snoRNAs are listed. SVM refers to the analysis described in the text. miRBase
PMID
RNA class
gga-miR-3528 19891781 SNORA17 cfa-miR-1836 18392026 SNORA20 bta-miR-2427 19633723 SNORA25 hsa-miR-664 * ¶ 20413612 SNORA36 mmu-miR-1940 ¶ 18849523 SNORA47 bta-miR-2311 19633723 SNORA61 mdo-miR-1543 17965199 SNORA74 hsa-miR-1248 * ¶ 18285502 SNORA81 hsa-mir-3651 20483914 SNORA84 hsa-miR-1291 ¶ 18285502 SNORA34 tgu-miR-2989 § 20360741 SNORD74 mmu-miR-3096 20413612 SNORD93 gga-miR-3535 19891781 SNORD20 gga-miR-3538 19891781 SNORD83B gga-miR-1454 § 18430245 SNORD100 hsa-mir-3647 20483914 SNORD111B hsa-mir-3653 20483914 SNORD125 hsa-miR-1201 † SNORD126 miR-1843 * ¶ 20413612 SCARNA3b oan-miR-1354 18463306 snoU85 bta-miR-2424 19633723 SCARNA10 mmu-miR-3069 20413612 SCARNA13 oan-miR-1348 § 18463306 SCARNA15 gga-miR-3540 § 19891781 SCARNA15 bta-miR-1940 § 19633723 SCARNA4 dre-miR-735 16698962 Y RNA [29] hsa-mir-1975 † Y5 RNA [7] dre-miR-733 16698962 vault RNA [8] mir-866 † VTRNA2 [8] mmu-miR-699 † RNase MRP 1 hsa-miR-1246 18285502 U2 hsa-mir-1274 18285502 tRNA-Lys [15] hsa-mir-1280 18285502 tRNA-Leu [15] mir-720 16582102 tRNA-Thr [15] hsa-mir-1308 † tRNA-Gly [15] mmu-miR-1937b 18849523 tRNA-Pro [15] hsa-mir-151 — hsa-mir-215 — hsa-mir-140 — hsa-mir-605 —
RNA- snoReport SVM micro H/ACA C/D 0.99 0.94 0 0 0.99 0 0 0.82 0 0.64 0.86 0 0.10 – 0.97 0 – 0.60 0 0.99 0 0 0.99 0 0 0.38 0 0.99 0 0.05 0.81 0.97 0 0.89 0.09 0 0.96 0.04 0 0.99 0.82 0 0.98 0.18 0 0.99 0.08 0 0.91 0.04 0 0.99 0.02 0.02 0 0.99 0.25 0 0 0.99 0.42 0.08 0 0 0.09 0 0 0.01 0.98 0.99 0.09 0 0.96 – 0.99 0 0.01 0.99 0 0 0.76 0.91 – 0 0 – 0 0 – 0 0 – – 0.05 0 0 0 0 0 0.99 0 0 0 0 0 – 0 0 0.78 0.99 0 0 0.94 0.99 0 0 0.33 0.99 0 0 0.84 0.93 0 0 0.62
† indicates miRBase entries that have been removed in the most release(s) because their source has been convicingly identified as another class of ncRNAs. § reported sno-miRs from [5] in human. ¶ also discussed in [26]. * overlap of microRNA and snoRNA annotation in multiple species. 1 the mature hsa-miR-1246 maps both to the U2 snRNA and a degraded hairpin deriving from a MLT1M ERVL-MaLR element.
cases from the literature have been included in Table 1. Most of the overlaps concern snoRNAs. In some cases, these “mature microRNAs” have length of 24 or larger, i.e., outside the range observed for canonical microRNAs. It is important in this context to recall the common practice of annotating microRNAs. Experimental evidence is almost always only available for the mature microRNA. After mapping the mature sequence to genome, putative precursor hairpins are then assigned based solely on computational secondary structure predictions of the surrounding genomic DNA sequence. In many of the cases
4
D. Langenberger et al.
GPL6541
HACA 84 O.anatinus gga-mir-1454
gga-SNORD100
oan-mir-1348
Fig. 1. Examples of putative microRNAs that are most likely mis-annotated. L.h.s.: SNORD100 is a well-conserved box C/D snoRNA, while “mir-1454” would be specific to chicken. R.h.s.: The annotated platypus mir-1348 precursor sequence is located in a putative precursor hairpin whose 5’ side is not conserved at all. The, alternative explanation, the 5’ side of the 2nd hairpin of the box H/ACA snoRNA SCARNA15, on the other hand, is highly conserved.
listed in Tab. 1 we observe that the annotated precursor hairpins only partially overlap alternative annotations, while the short RNA may arise from either of the conflicting putative precursors. Crucially, annotations as snoRNAs or other RNA classes are often supported by direct evidence, such as cloning and sequencing or Northern blots, which are lacking for the putative pre-microRNA. One possibility to distinguish evolutionarily conserved microRNAs from evolutionarily conserved other ncRNAs, say snoRNAs, is to evaluate the patterns of sequence conservation. Clearly, this can be conclusive only in those cases where both the mature microRNA and the snoRNA map to unique positions in the genome. Otherwise, the microRNA might, e.g., derive from a paralog of the conserved functional snoRNA locus. A likely example for the latter is tgu-miR-2995, reported as a species-specific microRNA in the zebrafinch. It derives from a degraded paralog of SNORA54 rather than the syntenically conserved, complete, and presumably functional copy of SNORA54. The NAPIL4 gene harbors a single complete copy of SNORA54 in both chicken and zebrafinch. In addition, both tgu-miR-2995 and a part of the SNORA54 sequence also map to different, more 5’, intron of the same gene in the zebrafinch only. This mechanism for generating novel microRNAs is consistent with two well established facts: snoRNAs are known to behave like retro-elements in many genomes [32,33], and several well-documented microRNAs arose by exaptation from repetitive elements [34]. It is a possible explanation for the origin of the snoRNA-like microRNAs in [26]. In many cases, however, the putative microRNA is rather poorly conserved and there is little or no conservation for the precursor hairpin, while at the same time the alternative annotation as a snoRNA or other ncRNA features a deep phylogenetic conservation. UCSC Genome Browser representations of two examples are shown in Fig. 1. Although there is clear block of short RNAs for chicken mir-1454, the sequence conservation is extremely poor and there is no signal for a matching miR*. Thus, if mir-1454 is indeed a microRNA, it is almost certainly specific to chicken. On the other hand, SNORD100 is conserved at least across vertebrates. Since there is no paralog of the snoRNA in the chicken genome, it is parsimonious to assume the short reads interpreted as mir-1454 constitute an sdRNA deriving from a box C/D snoRNA precursor. Another example of this
MicroRNA or Not MicroRNA?
5
140 1291
3651 1246 3647
664
605 1201 215, 1248
3653
0
frequency 20 40
60
151
−2
−1 0 1 snoRNA-like decision value miRNA-like
2
Fig. 2. Histogram of the SVM decision values for all 1,048 miRNAs annotated in miRBase v16. The positions of the 12 putatively mis-annotated miRNAs are indicated. Only four of these (mir-605, mir140, mir-1291 and mir-151) are unambiguously classified as miRNAs, of these, only mir-1291 overlaps a known snoRNA. The remaining ones show conservation patterns and structural features similar to snoRNA hairpins.
kind is SCARNA15, a box H/ACA snoRNA, for which a platypus mir-1348 was annotated, Fig. 1B. The 3’arm of the microRNA containing the annotated mature sequence overlaps the 5’arm of the second hairpin of the H/ACA snoRNA. The stem loop structure of the putative pre-microRNA untypically shows two larger interior loops, while the putative snoRNA shows a perfect double stem loop pattern with perfect conservation of both the H and ACA boxes. Again, the detailed inspection of the locus suggests that it should be considered as a conserved snoRNA rather than a microRNA. In addition to manual inspection, we applied the class-specific gene finders RNAmicro [35] and snoReport [36] to assess the overlaps of microRNA and snoRNA annotations of Table 1. The possible classifications are (1) microRNA but not snoRNA, (2) vice versa, (3) both classes predicted with high probability and (4) no classification as microRNA or snoRNA at all. As expected, the majority falls into the classes (1) or (2). There are only three candidates for case (3). Neither class is assigned in cases where the putative microRNA precursor hairpin is not conserved in related species so that RNAmicro cannot be used, and snoReport fails to recognize a box H/ACA or box C/D snoRNA structure. The main advantage for classifying microRNAs with RNAmicro is the use of comparative information. Thus, stem loop structures of annotated microRNAs that look characteristically at a first glance are nevertheless not classified as microRNA if the conservation pattern is not as expected for typical microRNAs. Applying snoReport to those sequences (extended if necessary) almost always yields good snoRNA classification. A nice example is the overlapping annotation of mir-1940 and SNORA26 in mouse. While the secondary structure of the annotated miRNA in mouse and related species is a nicely conserved stem loop the underlying conservation pattern is not miRNA-like (constantly high at the mature and mature-star part and low in the hairpin loop region). This is the reason for the low prediction probability (p = 0.000017) of the RNAmicro SVM.
6
D. Langenberger et al.
The clear occurrence and conservation of the H and ACA box, their distances to each other and the hairpin-hinge-hairpin-tail secondary structure prediction pattern, however, yields a high classification probability (p = 0.97 for box H/ACA snoRNA) of the snoReport SVM. Most box H/ACA snoRNAs consist of two hairpins. We ask here whether hairpins that give rise to large amounts of small reads are more “microRNAlike” than hairpins of H/ACA snoRNAs that are no prolific sources of short RNA products. Hence, we employ an SVM classifier that is trained from two disjoint sets of hairpins: (1) The bona fide evolutionary conserved microRNA precursor compiled in [37], which contains neither repeat-derived microRNAs nor lineage-specific ones. (2) A H/ACA hairpin set consisting of those hairpins of H/ACA snoRNAs that show very low levels of short RNA production. To determine small RNA molecules originating from these loci we used mapped reads from different developmental stages of the human brain (GSE18012). Using principal component analysis, we selected the following features for the final SVM classifier: the mean pairing probability of all nucleotides, the number of bound bases, the GC content, the longest paired region, the energy z-score of the precursor and its flanking region, the number of asymmetric bulges of the miRNA arms, as well as the conservation of the arms and the loop of the hairpin. The libSVM classifier was trained and executed in R, using the e1071 package. Repeatedly using a randomly chosen third of the positive and negative loci to train the SVM resulted in a positive predictive value of 0.94, a sensitivity of 0.88, and a specificity of 0.87. We apply the SVM to the complete set of microRNAs, including the putative mis-annotations. The classification results are compiled in Tab. 1. Overall, two thirds of the annotated microRNAs with conflicting annotation are not recognized as “microRNA-like” by this approach, supporting our view that these sequence do not constitute true microRNAs.
3
Discussion
In addition to an increasing number of transcripts with multiple processing products and multiple functions, an increasingly diverse universe of small RNAs has been described. Small RNAs are produced by a wide variety of mechanisms, they originate from a broad array of source transcripts, and they exert a broad range of biological functions. This begs the question what exactly should be considered as a microRNA as opposed to the many other types of small RNAs. The most inclusive definition, favored in at least part of the literature, encompasses any short RNA that is incorporated in an Argonaute complex. This point of view had led to the inclusion in miRBase of significant number of small RNAs that are originating from snoRNAs, snRNAs, tRNAs, and other structured RNAs. We systematically searched for such cases and investigated to what extent the ambiguities in the annotation can be decided. We found that short RNAs can often be recognized as products of well-know structured ncRNAs other than microRNAs, leaving also the annotated putative pre-microRNA hairpin doubtful at best.
MicroRNA or Not MicroRNA?
7
Although the definition of “microRNA” at first glance may seem to be a purely semantic issue, it has important consequences in practice, since it determines what is included in databases such as miRBase. This in turn determines, e.g., what is used in practice as training sets for machine learning approaches. In the case of microRNAs, for which typically the precursor hairpins are utilized, one unknowingly works with contaminated datasets when “microRNAs” are included that are not produced in the canonical way. The inclusion of mitrons and other non-canonical precursors, for instance, precludes the identification of features associated with Drosha processing. A more stringent curation of microRNAs and an explicit annotation of the source of the short RNAs would be highly desirable. Funding. DFG, FP-7 project QUANTOMICS, European Structure Fund, and the Free State of Saxony.
References 1. Terns, M.P., Terns, R.M.: Small nucleolar RNAs: versatile trans-acting molecules of ancient evolutionary origin. Gene Expr. 10, 17–39 (2002) 2. Kawaji, H., Nakamura, M., Takahashi, Y., Sandelin, A., Katayama, S., Email, F.S., Daub, C., Kai, C., Jun Kawai, J., Yasuda, J., Carninci, P., Hayashizaki, Y.: Hidden layers of human small RNAs. BMC Genomics 9, 157 (2008) 3. Taft, R.J., Glazov, E.A., Lassmann, T., Hayashizaki, Y., Carninci, P., Mattick, J.S.: Small RNAs derived from snoRNAs. RNA 15, 1233–1240 (2009) 4. Langenberger, D., Bermudez-Santana, C., Stadler, P.F., Hoffmann, S.: Identification and classification of small RNAs in transcriptome sequence data. In: Pac. Symp. Biocomput., vol. 15, pp. 80–87 (2010) 5. Brameier, M., Herwig, A., Reinhardt, R., Walter, L., Gruber, J.: Human box C/D snoRNAs with miRNA like functions: expanding the range of regulatory RNAs. Nucleic Acids Res. 39, 675–686 (2011) 6. Ender, C., Krek, A., Friedl¨ander, M.R., Beitzinger, M., Weinmann, L., Chen, W., Pfeffer, S., Rajewsky, N., Meister, G.: A human snoRNA with microRNA-like functions. Mol. Cell 32, 519–528 (2008) 7. Meiri, E., Levy, A., Benjamin, H., Ben-David, M., Cohen, L., Dov, A., Dromi, N., Elyakim, E., Yerushalmi, N., Zion, O., Lithwick-Yanai, G., Sitbon, E.: Discovery of microRNAs and other small RNAs in solid tumors. Nucleic Acids Res. 38, 6234– 6246 (2010) 8. Stadler, P.F., Chen, J.J.L., Hackerm¨ uller, J., Hoffmann, S., Horn, F., Khaitovich, P., Kretzschmar, A.K., Mosig, A., Prohaska, S.J., Qi, X., Schutt, K., Ullmann, K.: Evolution of vault RNAs. Mol. Biol. Evol. 26, 1975–1991 (2009) 9. Persson, H., Kvist, A., Vallon-Christersson, J., Medstrand, P., Borg, A., Rovira, C.: The non-coding RNA of the multidrug resistance-linked vault particle encodes multiple regulatory small RNAs. Nat. Cell Biol. 11, 1268–1271 (2009) 10. Mosig, A., Stadler, P.F.: Evolution of vault RNAs. In: N.N. (ed.) Encyclopedia of Life Sciences. Wiley-Blackwell, Hoboken, NJ (2011), doi:10.1002/9780470015902.a0022883 11. Lee, Y.S., Shibata, Y., Malhotra, A., Dutta, A.: A novel class of small RNAs: tRNA-derived RNA fragments (tRFs). Genes Dev. 23, 2639–2649 (2009)
8
D. Langenberger et al.
12. Cole, C., Sobala, A., Lu, C., Thatcher, S.R., Bowman, A., Brown, J.W., Green, P.J., Barton, G.J., Hutvagner, G.: Filtering of deep sequencing data reveals the existence of abundant Dicer-dependent small RNAs derived from tRNAs. RNA 15, 2147–2160 (2009) 13. Haussecker, D., Huang, Y., Lau, A., Parameswaran, P., Fire, A.Z., Kay, M.A.: Human tRNA-derived small RNAs in the global regulation of RNA silencing. RNA 16, 673–695 14. Findeiß, S., Langenberger, D., Stadler, P.F., Hoffmann, S.: Traces of posttranscriptional RNA modifications in deep sequencing data. Biol. Chem. 392, 305– 313 (2011) 15. Schopman, N.C.T., Heynen, S., Haasnoot, J., Berkhout, B.: A miRNA-tRNA mixup: tRNA origin of proposed miRNA. RNA Biology 7, 573–576 (2010) 16. Miyoshi, K., Miyoshi, T., Siomi, H.: Many ways to generate microRNA-like small RNAs: non-canonical pathways for microRNA production. Mol. Genet. Genomics 284, 95–103 (2010) 17. Borchert, G.M., Lanier, W., Davidson, B.L.: RNA polymerase III transcribes human microRNAs. Nat. Struct. Mol. Biol. 13, 1097–1101 (2006) 18. Bortolin-Cavaill´e, M.L., Dance, M., Weber, M., Cavaill´e, J.: C19MC microRNAs are processed from introns of large Pol-II, non-protein-coding transcripts. Nucleic Acids Res. 37, 3464–3473 (2009) 19. Canella, D., Praz, V., Reina, J.H., Cousin, P., Hernandez, N.: Defining the RNA polymerase III transcriptome: Genome-wide localization of the RNA polymerase III transcription machinery in human cells. Genome Res. 20, 710–721 (2010) 20. Berezikov, E., Chung, W.J., Willis, J., Cuppen, E., Lai, E.C.: Mammalian mirtron genes. Mol. Cell 28, 328–336 (2007) 21. Okamura, K., Hagen, J.W., Duan, H., Tyler, D.M., Lai, E.C.: The mirtron pathway generates microRNA-class regulatory RNAs in Drosophila. Cell 130, 89–100 (2007) 22. Ruby, G.J., Jan, C.H., Bartell, D.P.: Intronic microRNA precursors that bypass Drosha processing. Nature 48, 83–86 (2007) 23. Chung, W.J., Agius, P., Westholm, J.O., Chen, M., Okamura, K., Robine, N., Leslie, C.S., Lai, E.C.: Computational and experimental identification of mirtrons in Drosophila melanogaster and Caenorhabditis elegans. Genome Res. 21, 286–300 (2011) 24. Flynt, A.S., Greimann, J.C., Chung, W.J., Lima, C.D., Lai, E.C.: MicroRNA biogenesis via splicing and exosome-mediated trimming in Drosophila. Mol. Cell 38, 900–907 (2010) 25. Chong, M.M.W., Zhang, G., Cheloufi, S., Neubert, T.A., Hannon, G.J., Littman, D.R.: Canonical and alternate functions of the microRNA biogenesis machinery. Genes Dev. 24, 1951–1960 (2010) 26. Scott, M.S., Avolio, F., Ono, M., Lamond, A.I., Barton, G.J.: Human miRNA precursors with box H/ACA snoRNA features. PLoS Comput. Biol. 5, e100050 (2009) 27. Politz, J.C., Hogan, E.M., Pederson, T.: MicroRNAs with a nucleolar location. RNA 15, 1705–1715 (2009) 28. Griffiths-Jones, S.: The microRNA Registry. Nucleic Acids Res. 32, D109–D111 (2004) 29. Mosig, A., Guofeng, M., Stadler, B.M.R., Stadler, P.F.: Evolution of the vertebrate Y RNA cluster. Th. Biosci. 126, 9–14 (2007) 30. Griffiths-Jones, S., Bateman, A., Marshall, M., Khanna, A., Eddy, S.R.: Rfam: an RNA family database. Nucleic Acids Res. 31, 439–441 (2003)
MicroRNA or Not MicroRNA?
9
31. Hoffmann, S., Otto, C., Kurtz, S., Sharma, C., Khaitovich, P., Vogel, J., Stadler, P.F., Hackerm¨ uller, J.: Fast mapping of short sequences with mismatches, insertions and deletions using index structures. PLoS Comp. Biol. 5, e1000502 (2009) 32. Weber, M.J.: Mammalian small nucleolar RNAs are mobile genetic elements. PLoS Genet. 2, e205 (2006) 33. Schmitz, J., Zemann, A., Churakov, G., Kuhl, H., Gr¨ utzner, F., Reinhardt, R., Brosius, J.: Retroposed SNOfall–a mammalian-wide comparison of platypus snoRNAs. Genome Res. 18, 1005–1010 (2008) 34. Smalheiser, N.R., Torvik, V.I.: Mammalian microRNAs derived from genomic repeats. Trends Genet. 21, 322–326 (2005) 35. Hertel, J., Stadler, P.F.: Hairpins in a haystack: Recognizing microRNA precursors in comparative genomics data. Bioinformatics 22, e197–e202 (2006) 36. Hertel, J., Hofacker, I.L., Stadler, P.F.: snoReport: Computational identification of snoRNAs with unknown targets. Bioinformatics 24, 158–164 (2008) 37. Hertel, J., Lindemeyer, M., Missal, K., Fried, C., Tanzer, A., Flamm, C., Hofacker, I.L., Stadler, P.F.: Students of Bioinformatics Computer Labs 2004 & 2005: The expansion of the metazoan microRNA repertoire. BMC Genomics 7, 15 (2006)
Hierarchical Multilabel Protein Function Prediction Using Local Neural Networks Ricardo Cerri and Andr´e C.P.L.F. de Carvalho Department of Computer Science, University of S˜ ao Paulo at S˜ ao Carlos Avenida Trabalhador S˜ ao-carlense – 400 – Centro P.O. Box: 668 – CEP: 13560-970 – S˜ ao Carlos – SP – Brazil {cerri,andre}@icmc.usp.br
Abstract. Protein function predictions are usually treated as classification problems where each function is regarded as a class label. However, different from conventional classification problems, they have some specificities that make the classification task more complex. First, the problem classes (protein functions) are usually hierarchically structured, with superclasses and subclasses. Second, proteins can be simultaneously assigned to more than one class in each hierarchical level, i.e., a protein can be classified into two or more paths of the hierarchical structure. This classification task is named hierarchical multilabel classification, and several methods have been proposed to deal with it. These methods, however, either transform the original problem into a set of simpler problems, loosing important information in this process, or employ complex internal mechanisms. Additionally, current methods have problems dealing with a high number of classes and also when predicting classes located in the deeper hierarchical levels, because the datasets become very sparse as their hierarchies are traversed toward the leaves. This paper investigates the use of local artificial neural networks for hierarchical multilabel classification, particularly protein function prediction. The proposed method was compared with state-of-the-art methods using several protein function prediction datasets. The experimental results suggest that artificial neural networks constitute a promising alternative to deal with hierarchical multilabel classification problems. Keywords: Bioinformatics, Hierarchical, Multilabel, Classification, Neural Networks, Protein Function Prediction.
1
Introduction
In most of the protein function prediction problems described in the literature, a classifier assigns an instance xi (protein) to just one class, and the classes (functions) involved in the problem have the same hierarchical level. However, there are more complex classification problems where one or more classes can be divided into subclasses or grouped into superclasses, forming a hierarchical structure, like a tree or a directed acyclic graph (DAG). Additionally, an instance can be assigned to more than one class in the same hierarchical level. These problems O. Norberto de Souza, G.P. Telles, and M.J. Palakal (Eds.): BSB 2011, LNBI 6832, pp. 10–17, 2011. c Springer-Verlag Berlin Heidelberg 2011
Hierarchical Multilabel Protein Function Prediction
11
are known as Hierarchical Multilabel Classification (HMC) problems. In HMC problems, given a space of instances X, the training process looks for a function able to map each instance xi into a set of classes, respecting the constraints of the hierarchical structure, and optimizing some quality criterion. For such, two main approaches are used, named local, which trains a set of local classifiers, and global, which induces only one classifier to solve the classification problem [22]. This paper presents a local method where one Multi-Layer Perceptron (MLP) network using the back-propagation algorithm [20] is associated with each hierarchical level and responsible for the predictions made in that level. The predictions of a level are then used as inputs for another MLP in the next level. The proposed method is compared with state-of-the-art decision trees for HMC problems, showing competitive predictive accuracy with other local methods.
2
Related Work
This section briefly reviews some works proposed in the literature to deal with HMC problems related to protein function prediction. The work of [3] proposed a local based method with a hierarchy of SVM [24] classifiers, evaluated with a dataset of proteins structured according to the hierarchy of biological processes of the Gene Ontology (GO) [2]. Local classifiers were also used in an ensemble method proposed by [23], which used a dataset with proteins annotated according to the FunCat scheme developed by MIPS [17]. Local based classifiers were also investigated by [10] using a hierarchy of neural networks, but differently from the method proposed in this work, more than one multiclass MLP was used in each hierarchical level, while here, just one multiclass MLP is used in each level. The works [8,26] also used neural networks, but global based methods applied to the classification of documents. In [25], three decision trees based methods were compared: a global method named Clus-HMC and two other local methods named Clus-HSC and Clus-SC. A global based method extending the C4.5 algorithm [19], named HMC4.5, was proposed by [11]. In [1], the authors investigated a global method, named MHCAIS, based on Artificial Immune Systems [9]. They applied this method to find rules for protein function prediction, where the proteins were described according to the GO hierarchy. Another bioinspired method, named hmAnt-Miner [18], uses swarm intelligence [14] for hierarchical multilabel protein function prediction.
3
Proposed Method
The proposed method, named here HMC-LMLP (Hierarchical Multilabel Classification with Local Multi-Layer Perceptron), incrementally trains a local MLP network for each hierarchical level, using the back-propagation algorithm. First, a MLP network is trained for the first hierarchical level. This network uses one input layer, one hidden layer and one output layer (classes). After training this network, a new network with one hidden layer and one output layer is created for the second hierarchical level. The output nodes of the previous network become
12
R. Cerri and A.C.P.L.F. de Carvalho
the input layer to this new network, which is responsible for the predictions in the second level. This procedure is repeated for all hierarchical levels. Figure 1 illustrates the MLP network architecture of HMC-LMLP considering a two level hierarchy. The network is fully connected. When the network is being trained for a hierarchical level, the synaptic weights of the layers corresponding to the previous levels are not adjusted, because their adjustment have already occurred in the earlier training phases of the network, corresponding to the shallower levels. In the test phase, to classify a new instance, a threshold value is associated with each output layer of the network in the hierarchy. Each output (class) whose value is higher or equal than its threshold receives the value 1, assigning the class to the instance, and 0 otherwise.
Fig. 1. Architecture of HMC-LMLP for a two level hierarchy Outputs of the MLP responsible for the predictions in the first level are given as inputs to another hidden layer of the MLP responsible for the predictions in the second level
After the classification of new instances, a post-processing phase corrects inconsistencies that may have occurred during the classification. Inconsistencies can occur when a class is predicted, but not its superclass. The post-processing phase removes classes that are predicted without the prediction of their superclasses, ensuring that only consistent predictions are made. Following the hierarchical classification algorithm’s taxonomy proposed by [22], HMC-LMLP can be categorized as a hierarchical multilabel classifier with non-mandatory leaf node prediction. It can be applied to both DAG and tree hierarchical structures, and uses one local MLP classifier per hierarchical level.
4
Materials and Methods
Ten datasets related to protein functions of the Saccharomyces cerevisiae organism were used in the experiments. The datasets are freely available at http://
Hierarchical Multilabel Protein Function Prediction
13
Table 1. Datasets’ characteristics: number of attributes (|A|), number of classes (|C|), total number of instances (Total) and number of multilabel instances (Multilabel) Dataset
|A|
|C|
Total
Cellcycle Church Derisi Eisen Expr Gasch1 Gasch2 Pheno Seq Spo
77 27 63 79 551 173 52 69 478 80
499 499 499 461 499 499 499 455 499 499
1628 1630 1608 1058 1639 1634 1639 656 1701 1600
Training Multilabel
Total
1323 1322 1309 900 1328 1325 1328 537 1344 1301
Valid Multilabel
848 844 842 529 849 846 849 353 879 837
673 670 671 441 674 672 674 283 679 666
Total 1281 1281 1275 837 1291 1284 1291 582 1339 1266
Test Multilabel 1059 1057 1055 719 1064 1059 1064 480 1079 1047
www.cs.kuleuven.be/~dtai/clus/hmcdatasets.html, and are related to Bioinformatics data such as sequence statistics, phenotype and expression. They are organized in a hierarchy structured as a tree, according to the FunCat scheme of classification developed by MIPS. Table 1 shows the main characteristics of the training, valid and test datasets used. The performance of HMC-LMLP was compared with three state-of-the-art decision trees for HMC problems presented in [25]: Clus-HMC, global method that induces a single decision tree for the whole set of classes; Clus-HSC, a local method which explores the hierarchical relationships to build a decision tree for each hierarchical node; and Clus-SC, a local method which builds a binary decision tree for each class of the hierarchy. These methods are based on the concept of Predictive Clustering Trees (PCT) [5]. The evaluation used Precision-Recall curves (P R curves), which reflect the precision of a classifier as a function of its recall, and give a more informative picture of an algorithm’s performance when dealing with highly skewed datasets [12], which is the case of HMC problems. The hierarchical precision (hP ) and hierarchical recall (hR) measures (Equations 1 and 2) used to construct the P R curves consider that an instance belongs not only to a class, but also to all ancestor classes of this class [16]. Thus, given an instance (xi , Ci ), with xi belonging to the space X of instances, Ci the set of its predicted classes, and Ci the set of its real classes, Ci and Ci can be extended to contain their corresponding i = ancestor classes as: C ck ∈Ci Ancestors(ck ) and Ci = cl ∈Ci Ancestors(cl ), where Ancestors(ck ) denotes the set of ancestors of the class ck . hP =
i |Ci ∩ Ci | |Ci |
(1)
i
hR =
i ∩ C | |C i i |Ci |
i
(2)
To obtain a P R curve, different threshold values are applied to the outputs of the methods in order to obtain different hP and hR values. The outputs of
14
R. Cerri and A.C.P.L.F. de Carvalho
the methods are represented by vectors of real values, where each value is the pertinence degree of a class to a given instance. For each threshold, a point in the P R curve is obtained, and final curves are then constructed by the interpolation of these points, according to [12]. The areas under these curves (AU (P RC)) are then approximated by summing the trapezoidal areas between each point. These areas are then used to compare the performances of the methods. The higher the AU (P RC) of a method, the better is its predictive performance. Statistical tests were used to verify the statistical significance of the results with a confidence level of 95%. The tests employed were the well-known Friedman and Nemenyi, recommended for comparisons involving many datasets and several classifiers [13]. As in [25], 2/3 of each dataset were used for training and validation of the algorithms, and 1/3 for test.
5
Experiments and Discussion
The HMC-LMLP method was executed with the number of neurons of each hidden layer equals to 50% of the number of neurons of the corresponding input layer. As each MLP is composed by three layers (input, hidden and output), the learning rate values 0.2 and 0.1 were used in the hidden and output layers, respectively. In the same manner, the momentum constant values 0.1 and 0.05 were used in each of these layers. No attempt was made to tune these parameter values. The training process lasted a maximum period of 100 epochs, and, at every 10 epochs, P R curves were calculated over the validation dataset. The model which obtained the best performance on the validation dataset was then evaluated in the test dataset. For each dataset, the algorithm was executed 10 times, each time initializing the synaptic weights randomly. The averages of the AU (P RC) obtained in the individual executions were then calculated. The Clus-HMC, Clus-HSC and Clus-SC methods were executed one time each, with their default configurations, just like described in [25]. Table 2 shows their results and the results obtained by HMC-LMLP, together with the corresponding standard deviation, and number of training epochs needed to obtain these results. It is worth noting that, although HMC-LMLP does not produce classification rules (good characteristic of the PCT based methods), it is interesting to investigate the performance of traditional MLPs for the HMC task, specially because a MLP network can be considered naturally a multilabel classifier, as their output layers can simultaneously predict more than one class. As can be observed in Table 2, HMC-LMLP obtained competitive predictive performances in comparison to the other local methods (Clus-HSC and Clus-SC) in all datasets, except Expr and Pheno. It performed better than CLus-HSC in 80% of the cases, and better than Clus-SC also in 80% of the cases. These results, however, were not statistically significant according to the statistical tests. The poor predictive performance in the Pheno dataset may be because all its attributes are nominal, and a pre-processing phase was necessary to transform the nominal attributes to binary values, which increased the number of attributes and may have harmed the performance of HMC-LMLP. The pre-processing phase
Hierarchical Multilabel Protein Function Prediction
15
Table 2. The mean of the AU (P RC) obtained by each method. For HMC-LMLP, it is also shown the corresponding standard deviation and number of training epochs. Dataset
HMC-LMLP
Cellcycle Church Derisi Eisen Expr Gasch1 Gasch2 Pheno Seq Spo
0.144 0.140 0.138 0.173 0.070 0.133 0.127 0.085 0.114 0.139
(±0.009) (±0.006) (±0.008) (±0.009) (±0.012) (±0.018) (±0.012) (±0.009) (±0.009) (±0.006)
20 20 20 40 90 20 30 70 40 50
Clus-HMC
Clus-HSC
Clus-SC
0.172 0.170 0.175 0.204 0.210 0.205 0.195 0.160 0.211 0.186
0.111 0.131 0.094 0.127 0.127 0.106 0.121 0.152 0.091 0.103
0.106 0.128 0.089 0.132 0.123 0.104 0.119 0.149 0.095 0.098
applied to the Pheno dataset considerably increased its number of attributes from 69 to 276. The high number of attributes may also be the reason for its inferior performance in the Expr dataset. Maybe, for these datasets, a higher number of hidden neurons is necessary for a better classification performance. The results also showed that competitive performances in comparison with the local methods could be obtained with a relative small number of hidden neurons (50% of the number of inputs). The exception was observed only for the datasets Expr and Pheno, where a higher number of hidden neurons seems to be necessary. This may be, again, due the higher number of attributes in these datasets, as can be observed in Table 1. The number of attributes, of course, may not be the only reason for this poor performance. As can be observed in Table 1, the Seq dataset also has many attributes and HMC-LMLP performed better than the other local methods. Thus, further investigation considering hierarchical and multilabel characteristics of the datasets is needed to find out specific reasons for these performances. It also could be observed that a large number of epochs was not necessary to achieve the best performances of the algorithm. In the majority of the datasets, the best results were obtained in the first 50 epochs. After that, it was observed that the performances of the method start to decrease, probably because the network starts to overfit the data. Only in Expr and Pheno datasets, the best performances of the method were achieved after 90 and 70 epochs, respectively. Looking at the results obtained by Clus-HMC, it is possible to see that it performed better than all the other methods. The statistical tests, however, suggested no statistically significant differences between the results of Clus-HMC and HMC-LMLP. In fact, statistically significant differences were found only among the results of Clus-HMC, Clus-HSC and Clus-SC. Despite the statistical results, the predictive performance of HMC-LMLP can be seen as promising, specially considering that PCT based methods have been investigated for more than 10 years, with early applications in [5,6] and more recently in [25,4,7,21], and also because no attempt to tune the MLP parameter values was made. Additionally, simple MLPs with back-propagation were used, without any modifications to cope with the multilabel problem.
16
R. Cerri and A.C.P.L.F. de Carvalho
A characteristic that needs to be observed in HMC-LMLP, which is a characteristic of local based methods, is its tendency to predict more classes at the first hierarchical levels. Most of the classes predicted by the method are located in the two first levels of the hierarchy. Classes located in deeper levels can be predicted depending on the threshold values chosen. Usually, lower threshold values result on higher recall values, which is a characteristic of predictions at deeper levels, while high threshold values result in higher precision values, which is a characteristic of predictions at levels closer to the root.
6
Conclusion and Future Works
This paper presented a new local based method to cope with HMC problems. It uses one local MLP network trained with the back-propagation algorithm in each hierarchical level, being each MLP responsible for the predictions in its respective level. Experiments in the protein function prediction domain showed that the proposed method can achieve competitive predictive accuracy when compared with the state-of-the-art methods based on the local approach. As future works, the authors plan to investigate global methods based on neural networks, as the global approach has already shown to produce simpler classifiers [25]. Experiments to fine tune MLP and back-propagation parameter values can be performed, and error functions specially proposed for multilabel problems [27,15] can be used, which can improve the classification performance of the method. Hierarchies structured as DAGs should also be used in future experiments. Acknowledgments. The authors would like to thank the Brazilian research councils CAPES, CNPq and FAPESP for their financial support, and the Katholieke Universiteit of Leuven’s Machine Learning Research Group for the Clus programs and datasets.
References 1. Alves, R., Delgado, M., Freitas, A.: Knowledge discovery with artificial immune systems for hierarchical multi-label classification of protein functions. In: International Conference on Fuzzy Systems, pp. 2097–2104 (2010) 2. Ashburner, M., et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29 (2000) 3. Barutcuoglu, Z., Schapire, R.E., Troyanskaya, O.G.: Hierarchical multi-label prediction of gene function. Bioinformatics 22, 830–836 (2006) 4. Blockeel, H., Bruynooghe, M., Dzeroski, S., Ramon, J., Struyf, J.: Hierarchical multi-classification. In: W. on Multi-Relational Data Mining, pp. 21–35 (2002) 5. Blockeel, H., De Raedt, L., Ramon, J.: Top-down induction of clustering trees. In: International Conference on Machine Learning, pp. 55–63 (1998) 6. Blockeel, H., Dzeroski, S., Grbovic, J.: Simultaneous prediction of multiple chemical ˙ parameters of river water quality with tilde. In: Zytkow, J.M., Rauch, J. (eds.) PKDD 1999. LNCS (LNAI), vol. 1704, pp. 32–40. Springer, Heidelberg (1999)
Hierarchical Multilabel Protein Function Prediction
17
7. Blockeel, H., Schietgat, L., Struyf, J., Dzeroski, S., Clare, A.: Decision trees for hierarchical multilabel classification: A case study in functional genomics. In: F¨ urnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) PKDD 2006. LNCS (LNAI), vol. 4213, pp. 18–29. Springer, Heidelberg (2006) 8. Cai, L., Hofmann, T.: Exploiting known taxonomies in learning overlapping concepts. In: International Joint Conference on Artifical Intelligence, pp. 714–719 (2007) 9. de Castro, L.N., Timmis, J.I.: Artificial Immune Systems: A New Computational Intelligence Approach. Springer, London (2002) 10. Cerri, R., Carvalho, A.C.P.L.F.: Hierarchical multilabel classification using topdown label combination and artificial neural networks. In: Brazilian Symposium on Artificial Neural Networks, pp. 253–258 (2010) 11. Clare, A., King, R.D.: Predicting gene function in saccharomyces cerevisiae. Bioinformatics 19, 42–49 (2003) 12. Davis, J., Goadrich, M.: The relationship between precision-recall and roc curves. In: International Conference on Machine Learning, pp. 233–240 (2006) 13. Demˇsar, J.: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, 1–30 (2006) 14. Dorigo, M.: Optimization, Learning and Natural Algorithms. Ph.D. thesis, Dipartimento di Elettronica, Politecnico di Milano, IT (1992) 15. Grodzicki, R., Ma´ ndziuk, J., Wang, L.: Improved multilabel classification with neural networks. In: International Conference on Parallel Problem Solving from Nature, pp. 409–416 (2008) 16. Kiritchenko, S., Matwin, S., Famili, A.F.: Hierarchical text categorization as a tool of associating genes with gene ontology codes. In: European Workshop on Data Mining and Text Mining in Bioinformatics, pp. 30–34 (2004) 17. Mewes, H.W., et al.: Mips: a database for genomes and protein sequences. Nucleic Acids Res. 30, 31–34 (2002) 18. Otero, F., Freitas, A., Johnson, C.: A hierarchical multi-label classification ant colony algorithm for protein function prediction. Memetic Computing 2, 165–181 (2010) 19. Quinlan, J.R.: C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc., San Francisco (1993) 20. Rumelhart, D.E., McClelland, J.L.: Parallel distributed processing: explorations in the microstructure of cognition, vol. 1. MIT Press, Cambridge (1986) 21. Schietgat, L., Vens, C., Struyf, J., Blockeel, H., Kocev, D., Dzeroski, S.: Predicting gene function using hierarchical multi-label decision tree ensembles. BMC Bioinformatics 11, 2 (2010) 22. Silla, C., Freitas, A.: A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery 22, 31–72 (2010) 23. Valentini, G.: True path rule hierarchical ensembles. In: International Workshop on Multiple Classifier Systems, pp. 232–241 (2009) 24. Vapnik, V.N.: The Nature of Statistical Learning Theory (Information Science and Statistics). Springer-Verlag New York, Inc., Heidelberg (1999) 25. Vens, C., Struyf, J., Schietgat, L., Dˇzeroski, S., Blockeel, H.: Decision trees for hierarchical multi-label classification. Machine Learning 73, 185–214 (2008) 26. Woolam, C., Khan, L.: Multi-label large margin hierarchical perceptron. International Journal of Data Mining, Modelling and Management 1, 5–22 (2008) 27. Zhang, M.L., Zhou, Z.H.: Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering 18, 1338–1351 (2006)
Identifying Significant Features in HIV Sequence to Predict Patients’ Response to Therapies Samuel Evangelista de Lima Oliveira, Luiz Henrique de Campos Merschmann, and Leoneide Erica Maduro Bouillet Federal University of Ouro Preto (UFOP) - Ouro Preto/MG - Brazil
[email protected],
[email protected],
[email protected]
Abstract. The Human Immunodeficiency Virus (HIV) is a retrovirus that attacks the human immune system reducing its effectiveness. Combinations of antiretroviral drugs are used to treat the infection by HIV. However, the high mutation rate in the HIV virus makes it resistant to some antiretroviral drugs and leads to treatment failure. Nowadays, there are computational methods based on machine learning that try to predict the patients’ response to therapies. In this bioinformatics study we deal with data preprocessing techniques to find significant features in HIV sequences that can be interesting for the prediction of patients’ short-term progression. Experiments were conducted trough four classification methods using datasets composed by different sets of attributes. Classifiers trained with a dataset including solely viral load, CD4+ cell counts and information about mutations in the viral genome achieved accuracies ranging from 50.29% to 63.87%. Nevertheless, the addition of attributes (antiretroviral drug resistance levels, HIV subtype, epitope occurrence and others) in the dataset has improved the accuracy of the classifiers in almost all tests executed in this work, indicating its relevance to the prediction task discussed here.
1
Introduction
The Human Immunodeficiency Virus (HIV) is a retrovirus that causes the Acquired Immunodeficiency Syndrome (AIDS) [5]. The HIV attacks the human immune system, mainly the cells T CD4+, progressively reducing its effectiveness. The POL gene is one of the nine genes of HIV. This gene encodes essential enzymes to the HIV life cycle: reverse transcriptase (RT), integrase and protease (PR). The protein RT uses the viral RNA to transcribe the DNA molecule.The integrase is responsible for fusing the transcribed viral DNA into the host genome. Finally, the protease cleaves the amino acid sequence coded by HIV into functional units to generate a new viral particle. The antiretrovirals are drugs aimed to stop the HIV life cycle by preventing one of these proteins of fulfill its role.
This work was supported in part by a UFOP MSc scholarship and in part by CNPq research grant 307711/2010-2.
O. Norberto de Souza, G.P. Telles, and M.J. Palakal (Eds.): BSB 2011, LNBI 6832, pp. 18–25, 2011. c Springer-Verlag Berlin Heidelberg 2011
Identifying Significant Features in HIV Sequence
19
The reverse transcriptase has no proofreading ability, which explains the high mutation rate in the HIV virus. These mutations can make the virus resistant to some antiretrovirals, and then reduce its effectiveness. When a single antiretroviral therapy is used, there exists the likelihood that viral phenotype resistance can emerge [1]. A combination of antiretroviral therapies is an alternative to dealing with this problem. The Highly Active Antiretroviral Therapies (HAART) is a therapy composed of multiple anti-HIV drugs aimed to inhibit more than one stage of the HIV life cycle, in such a way that the virus takes a much longer time to become resistant to the therapy [4]. A lot of efforts have been made by researchers to understand the interaction of the virus with the host [15,17], to develop more effective drugs [9] or to help in the choice of antiretroviral therapies [18]. Nowadays, there are in vitro resistance assays and computational methods based on machine learning that try to predict the patient’s response to therapies. In this work, we deal with data preprocessing techniques to find significant features in HIV sequences that can be interesting for the prediction of patient’s short-term progression. The goal is to identify a set of attributes using only the sequence of the human immunodeficiency virus genome and not requiring additional information obtained from in vitro experiments. Previous works on quantitative prediction of patient’s response to anti-HIV drugs [11,12,14] have employed similar computational techniques adopted here, but have not analyzed the importance of attributes such as levels of antiretroviral drug resistance, the length of RT and PR sequences, similarity of the RT and PR to their reference sequences, HIV subtype, and epitope occurrence in the prediction task.
2
Our Approach
We used a dataset taken from the Kaggle competition website1 . It is a real dataset which contains records of 1,692 patients who had only recently contracted HIV-1 and had not been treated before. Each record is composed by the patient identification, the nucleotide sequence of the Reverse Transcriptase (RT), the nucleotide sequence of the Protease (PR), and two common clinical indicators used to determine the health status of an HIV-1 infected individual: Viral Load and CD4+ cell counts. In addition, there is an attribute that indicates whether the patient improved after sixteen weeks of therapy, i.e., each patient is assigned to the responder or non-responder group based on his response to therapy. The dataset contains 552 patients of responder class and other 1140 patients of non-responder class. Also, it is important to mention that the nucleotide sequence of the Protease is not available for 80 patients. In this work we present the results of studies using different classifiers to predict patients’ response to therapies from their HIV-1 viral sequences and their viral load and CD4+ count at the beginning of therapy. We proposed such tests with various classifiers to address the following questions: can classifiers 1
http://www.kaggle.com/c/hivprogression
20
S.E.L. Oliveira, L.H.C. Merschmann, and L.E.M. Bouillet
trained with information about mutations in the viral genome accurately predict the patients’ response to therapies? Does the addition of resistance levels to antiretroviral drugs, the length of RT and PR sequences, similarity of the RT and PR to their reference sequences, HIV subtype, and epitope information significantly enhance the accuracy of the classifiers? To answer these questions the following methodology was adopted. From the original dataset we generated three different datasets varying the number of attributes. The first dataset contains the following predictive attributes: HIV protease and reverse transcriptase mutations, viral load and CD4+ cell counts. We used the Stanford Sierra web service2 to obtain the HIV protease and reverse transcriptase mutations. The Sierra web service accepts user-submitted PR and RT sequences and returns information such as the prevalence of PR and RT mutations, the virus subtype, the length of the sequences, the similarity to a reference sequence, the levels of resistance to anti-HIV drugs and others. After, additional predictive attributes were added to the first dataset to form other two enlarged datasets. The second dataset is composed by the same attribute set of the first dataset and the resistance levels to antiretroviral drugs attributes. In addition to the attributes of the second dataset, the third dataset contains the length of RT and PR sequence, similarity of RT and PR to their reference sequences, HIV subtype and the QIYQEPFKNLK epitope attribute, which indicates whether this epitope is contained in the HIV sequence. An epitope is the part of an antigen that is recognized by the immune system. Using the three generated datasets, we conducted experiments with four traditional classifiers (Support Vector Machine, decision tree, Random Forests and naive Bayes) commonly used in bioinformatics researches [2]. Our goal is to evaluate if: (a) the classifiers trained with datasets including solely viral load, CD4+ cell counts and information about mutations in the viral genome could accurately predict the patients’ response to therapies, and (b) the additional attributes (resistance levels to antiretroviral drugs, length of RT and PR sequence, similarity of RT and PR to their reference sequences, HIV subtype and epitope) could improve the accuracy of the classifiers. As the generated datasets contained a large number of attributes, we also conducted classification experiments after a data preprocessing step to reduce the number of attributes in the generated datasets. This data preprocessing step aims at identifying relevant attributes for the classification task. It tries to remove from datasets attributes that do not contribute to, or even reduces, the predictive accuracy of classification algorithms. Two attribute selection techniques, known as Correlation-based Feature Selection [8] and Consistency-based Feature Selection [13], were chosen to reduce the number of attributes of datasets. In addition to evaluating the classification performance by the reduced dataset, the attribute selection process allowed us to observe if the attributes added to form the second and third datasets were selected by the attribute selection techniques adopted, indicating their relevance to the classification task. 2
http://hivdb.stanford.edu/pages/webservices/
Identifying Significant Features in HIV Sequence
3
21
Computational Experiments
The computational experiments were designed to perform the attribute relevance analysis for the task of prediction of patients’ response to therapies. The only difference among the nine datasets used in our experiments is the number of attributes. The first three datasets were obtained from data available on the Kaggle website3 and using the Stanford Sierra web service. These datasets, named D1 , D2 and D3 , are composed by the following predictive attributes: 1) D1 : HIV mutations, viral load and CD4+ cell counts; 2) HIV mutations, viral load, CD4+ cell counts and antiretroviral drug resistance levels; 3) HIV mutations, viral load, CD4+ cell counts, resistance levels to antiretroviral drugs, length of RT and PR sequence, similarity of RT and PR considering their reference sequences, HIV subtype and epitope. Other six datasets were generated from the attribute selection process. Applying the Consistency-based Feature Selection technique on the datasets D1 , D2 and D3 , we obtained the datasets D1cons , D2cons and D3cons , respectively. Finally, applying the Correlation-based Feature Selection technique on the datasets D1 , D2 and D3 , we achieved the datasets D1corr , D2corr and D3corr , respectively. The experiments involving Correlation-based Feature Selection and Consistency-based Feature Selection were carried out using the algorithms implemented in the Weka tool [7] (version 3.6). The number of attributes selected by Correlation based Feature Selection and Consistency-based Feature Selection techniques was automatically defined by their search method. The characteristics of the datasets are shown in the Table 1. The datasets names are listed in the first column and the number of attributes for different features of HIV sequence are reported from second to fourth columns. The total number of attributes is shown in the last column. Table 1. Characteristics of the datasets HIV Antiretroviral Epitope, HIV subtype, Total number mutation drug resistance length and similarity of attributes Datasets attributes level attributes attributes D1 381 384 D2 381 38 422 D3 381 38 6 428 D1cons 62 65 D2cons 29 11 43 D3cons 15 9 3 30 D1corr 44 46 D2corr 37 2 41 D3corr 34 1 1 38
3
http://www.kaggle.com/c/hivprogression/Data
22
S.E.L. Oliveira, L.H.C. Merschmann, and L.E.M. Bouillet
The classification experiments were conducted using four techniques: decision tree, naive Bayes, Random Forests and Support Vector Machines (SVM). The experiments involving decision tree, naive Bayes, Random Forests and SVM were carried out using the algorithms ADTree [6], NaiveBayes [10], RandomForest [3] and SMO [16], respectively, implemented in the Weka tool [7]. In order to estimate the accuracy of the classifiers , the datasets were partitioned into two independent sets, a training set and a test set. From the 1,692 records of patients. 1,000 were allocated to the training set, and the remaining was allocated to the test set. Experiments were conducted using the following parameter values. For the RandomForest algorithm, the parameter numFeatures, used in random selection of attributes, was varied from 6 to 10, and the parameter numTrees, which defines the number of trees to be generated, was varied from 10 to 500. The ADTree algorithm was executed varying the parameter which determines the number of boosting iterations, called numOfBoostingIterations, from 2 to 500. In addition, the parameter searchPath, which defines the type of search to be performed when building the tree, was set to Expand all the paths and Expand the best z-pure path. The NaiveBayes and SMO algorithms were executed using default parameter values. In this study, for the RandomForest and ADTree algorithms, we presented the results obtained from the combination of parameter values that produced the best predictive accuracy performance. The experimental results for the datasets D1 , D2 and D3 are shown in Figure 1. We can observe that, considering the four classifiers, the predictive accuracy for the dataset D1 ranged from 55.92% to 63.87%. However, for all classifiers, the datasets D2 and D3 reached better predictive accuracies when compared with those obtained for the dataset D1 , indicating that the addition of resistance levels to antiretroviral drugs, HIV subtype and epitope attributes contributes to enhance the accuracy of the classifiers. Figure 2 presents the accuracy results for all classifiers using datasets with attributes selected by the Consistency-based Feature Selection technique. Again, for all classifiers, the D2cons and D3cons augmented datasets had better performance of classification than the dataset D1cons . Also, it is important to mention
Fig. 1. Accuracy for the datasets D1 , D2 and D3
Identifying Significant Features in HIV Sequence
23
Fig. 2. Accuracy for the datasets D1cons , D2cons and D3cons
that 25.58% of the attributes selected from dataset D2 and 40% of the attributes selected from dataset D3 are related to resistance levels to antiretroviral drugs (in D2cons and D3cons ), epitope occurrence, length of the RT and PR sequence, similarity and HIV subtype (only in D3cons ). Therefore, these results confirm the importance of the attributes added in the dataset D1 for the prediction task discussed in this work. When compared with the results showed in Figure 1, we can note that for Random Forests classifier the predictive accuracy results for the D2cons and D3cons reduced datasets were superior to those presented by the original datasets (without attribute selection). As for the remaining classifiers the results achieved by the reduced datasets do not always outperform those obtained by the original datasets. Figure 3 presents the predictive accuracy results for the datasets with attributes selected by Correlation-based Feature Selection technique. In this case, only 4.87% of the attributes selected from dataset D2 and 5.26% of the attributes selected from the dataset D3 are related to resistance levels to antiretroviral drugs (in D2corr and D3corr ), epitope occurrence, length of the RT and PR
Fig. 3. Accuracy for the datasets D1corr , D2corr and D3corr
24
S.E.L. Oliveira, L.H.C. Merschmann, and L.E.M. Bouillet
sequence, similarity to RT and PR to their reference sequence and HIV subtype (only in D3corr ). Random Forests and naive Bayes classifiers had the same behavior reported in the previous analysis, that is, the D2corr and D3corr augmented datasets had better classification performance than the dataset D1corr . However, for SVM and decision tree classifiers this behavior was not verified. This leads us to believe that for the last classifiers, the small number of additional attributes was not sufficient to improve the predictive accuracies for the datasets D2corr and D3corr .
4
Conclusions
In this work we carried out studies using different classifiers to predict HIV patients’ response to therapies from datasets composed by different attribute sets extracted from RT and PR sequences. Our first goal was to evaluate the predictive accuracy obtained from classifiers trained with a dataset including solely viral load, CD4+ cell counts and information about mutations in the viral genome. The experiments conducted using four traditional classifiers showed that the accuracy obtained using this dataset ranged from 50.29% to 63.87%. The drug resistant HIV is a major public health concern. The task of learning classifiers using the attributes identified in this work, when applied in the antiretroviral therapy monitoring, may orientate the physicians to choose the best antiretroviral therapy, helping to reduce morbidity and mortality among patients infected with HIV. After, additional attributes (resistance levels to antiretroviral drugs, length of PR and RT sequences, HIV subtype, similarity of the RT and PR to their reference sequences, and epitope occurrence) were included in the previous dataset in order to analyze the significance of these attributes to the prediction task discussed in this study. Our experimental results have shown that, for all classifiers, the addition of such attributes improved the accuracy. Due to the large size (number of attributes) of our datasets, we carried out a data preprocessing step to reduce the number of attributes. This data preprocessing step aimed at identifying relevant attributes for the classification task. The experiments using the reduced datasets, in general, presented the same behavior previously described, that is, the augmented datasets achieved the best results. Therefore, the results achieved in the conducted experiments indicate that attributes such as resistance levels to antiretroviral drugs, length of PR and RT sequences, HIV subtype, similarity of the RT and PR to their reference sequences, and epitope occurrence are relevant to the classification problem considered in this work. It was shown in [19] that Eukariotic Linear Motifs (ELMs), found in RT sequence, are highly predictive of patients’ response to therapy involving different combinations of antiretroviral drugs. As future work we plan to include these ELMs in our datasets.
Identifying Significant Features in HIV Sequence
25
References 1. Andrew, R., David, P., Crandall, K.A., Holmes, E.C.: The causes and consequences of HIV evolution. Nature Reviews Genetics 5(1), 52–61 (2004) 2. Bhaskar, H., Hoyle, D., Singh, S.: Machine learning in bioinformatics: A brief survey and recommendations for practitioners. Computers in Biology and Medicine 36(10), 1104–1125 (2006) 3. Breiman, L.: Random forests. Machine Learning 45, 5–32 (2001) 4. Deeks, S.: Treatment of antiretroviral-drug-resistant HIV-1 infection. The Lancet 362(9400), 2002–2011 (2003) 5. Frankel, A.D., Young, J.A.T.: HIV-1: Fifteen proteins and an rna. Annual Review of Biochemistry 67(1), 1–25 (1998) 6. Freund, Y.: The alternating decision tree learning algorithm. In: Machine Learning: Proceedings of the Sixteenth International Conference, pp. 124–133. Morgan Kaufmann, San Francisco (1999) 7. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The weka data mining software: an update. SIGKDD Explor. Newsl. 11, 10–18 8. Hall, M.A.: Correlation-based feature selection for machine learning. Tech. rep. (1999) 9. Havlir, D., Richman, D.: Viral dynamics of HIV: implications for drug development and therapeutic strategies. Annals of Internal Medicine 124(11), 984 (1996) 10. John, G., Langley, P.: Estimating continuous distributions in bayesian classifiers. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 338–345. Morgan Kaufmann, San Francisco (1995) 11. Larder, B., Wang, D., Revell, A., Montaner, J., Harrigan, R., De Wolf, F., Lange, J., Wegner, S., Ruiz, L., P´erez-El´ıas, M., et al.: The development of artificial neural networks to predict virological response to combination HIV therapy. Antiviral Therapy 12(1), 15 (2007) 12. Lin, R.S., Rhee, S.Y., Shafer, R.W., Das, A.K.: Prediction of HIV mutation changes based on treatment history. American Medical Informatics Association (2006) 13. Liu, H., Setiono, R.: A probabilistic approach to feature selection - a filter solution. In: Proc. of Int. Conf. on Machine Learning, pp. 319–327. Morgan Kaufmann, San Francisco (1996) 14. Nanni, L., Lumini, A.: MppS: An ensemble of support vector machine based on multiple physicochemical properties of amino acids. Neurocomputing 69(13-15), 1688–1690 (2006) 15. Pinney, J.W., Dickerson, J.E., Fu, W., Sanders-Beer, B.E., Ptak, R.G., Robertson, D.L.: HIV-host interactions: a map of viral perturbation of the host system. AIDS 23(5) (March 2009) 16. Platt, J.C.: Fast training of support vector machines using sequential minimal optimization, pp. 185–208. MIT Press, Cambridge (1999) 17. Ptak, R.G., Fu, W., Sanders-Beer, B.E., Dickerson, J.E., Pinney, J.W., Robertson, D.L., Rozanov, M.N., Katz, K.S., Maglott, D.R., Pruitt, K.D., Dieffenbach, C.W.: Cataloguing the hiv type 1 human protein interaction network. AIDS Res. Hum. Retroviruses 24(12), 1497–1502 (2008) 18. Rosen-Zvi, M., Altmann, A., Prosperi, M., Aharoni, E., Neuvirth, H., S¨ onnerborg, A., Sch¨ ulter, E., Struck, D., Peres, Y., Incardona, F., Kaiser, R., Zazzi, M., Lengauer, T.: Selecting anti-HIV therapies based on a variety of genomic and clinical factors. Bioinformatics 24, i399–i406 (2008) 19. Dampier, W., Perry Evans, L.U., Tozeren, A.: Host sequence motifs shared by HIV predict response to antiretroviral therapy. BMC Med. Genomics 47 (2009)
Gene Prediction by Multiple Spliced Alignment Rodrigo Mitsuo Kishi, Ronaldo Fiorilo dos Santos, and Said Sadique Adi Faculdade de Computa¸c˜ ao (FACOM) Universidade Federal de Mato Grosso do Sul (UFMS) CP 549, 79070-900 Campo Grande-MS, Brazil
[email protected] [email protected] [email protected] http://www.facom.ufms.br
Abstract. With recent advances in sequencing technologies, a huge amount of DNA sequences become available year after year. In order to obtain useful information on these sequences, we need to process them in search of biologically meaningful regions. The genes are amongst the most important regions of a genome and the task of locating them in a DNA of interest is called the gene prediction problem. This problem can be addressed in several ways, and one of the most promising methods relies on homology information between the genomic DNA and previous annotated sequences (proteins, cDNAs and ESTs). In this paper we generalize a traditional formulation of the gene prediction problem and use this new formulation in the development of three gene identification tools. All these programs were tested on a benchmark of 240 human genomic sequences and the results obtained compare favorably with those achieved by other gene prediction tools available in the literature. Keywords: Multiple Spliced Alignment, Gene Prediction, Comparative Genomics.
1
Introduction
With the great number of genomic DNA sequences available, we are now faced with the challenging task of processing a huge amount of raw data in order to provide useful information about a number of organisms of interest. A first step towards this direction is to identify the meaningful regions of unannotated DNA sequences. In this work, we are interested in identifying the protein-coding regions of a genomic sequence, the so-called genes. This problem has been known as the Gene Prediction Problem. As the genes encode the information needed for a proper development of all living organisms, the identification of these regions in a DNA of interest has an undeniable practical importance. Unfortunately, this task is difficult, especially
During the development of this work, the authors were supported by FUNDECT (Proc. No. 23/200.221/2007 and 23/200.184/2007).
O. Norberto de Souza, G.P. Telles, and M.J. Palakal (Eds.): BSB 2011, LNBI 6832, pp. 26–33, 2011. c Springer-Verlag Berlin Heidelberg 2011
Gene Prediction by Multiple Spliced Alignment
27
in eukaryotic genomes, due to a number of reasons. First of all, the genes of most eukaryotic organisms are neither continuous nor contiguous. They are separated by long stretches of intergenic DNA and their coding fragments, named exons, are interrupted by non-coding ones, the introns. Furthermore, there is not an evident pattern of bases that distinguishes the genes from the intergenic regions of a genomic sequence. Finally, the DNA to be processed is usually very long, and may include overlapping genes. Since 1980 a number of gene prediction methods have been developed. One of the most promising methods, called similarity-based method, makes use of information regarding the homology between a DNA of interest and a fully annotated transcript sequence (cDNAs, ESTs or proteins) in order to find its genes. In the last years, a new type of similarity-based method has been developed. Based on the supposed high rate of conservation related to the coding regions, the so-called comparative-based methods try to find the genes of a given DNA sequence by aligning it with other piece of genomic DNA. For a review on the similarity-based and other gene prediction methods see [9]. Most existing gene prediction tools try to accomplish their task in two main steps. The first one consists in finding candidate exons of the input genomic sequence. The second step, referred to as the exon assembly problem [5], is that of selecting the best ordered subset of non overlapping candidate exons that span the searched gene. In the context of similarity based gene prediction, the exon assembly problem can be rewritten as the following combinatorial puzzle: given a DNA sequence and a transcript sequence, find the exons in the DNA whose concatenation/assembly best fit the transcript. This problem has been studied by a number of authors, giving raise to several similarity based gene prediction tools ([4], [11], [3], [16], [17]). Despite the good results achieved by these programs, they suffer from mispredictions ([2], [10]) when the target gene shows a low level of similarity to previously annotated transcript sequences. We present in this work three new heuristics for the gene prediction problem. All of them are based on a generalization of a combinatorial optimization problem that models the exon assembly problem. The proposed generalization considers more than one transcript sequence in expecting better predictions of the genes in a DNA of interest. Here, our main assumption is that the use of several transcript sequences provides more evidences about the location of the target gene than a single sequence does.
2
Spliced Alignment Problem
Gelfand et al. suggest in [4] a combinatorial approach to the exon assembly problem based on an optimization problem called spliced alignment problem. To a better understanding of this problem, consider the following. Let S = s1 . . . sn be a string and B1 = si . . . sj and B2 = sk . . . sl be substrings of S. We denote the fact that B1 ends before B2 in S (j < k) by B1 ≺ B2 . A sequence Γ = (B1 , . . . , Bp ) of substrings of S is a chain if B1 ≺ B2 . . . ≺ Bp . We denote the concatenation of strings from the chain Γ by Γ ∗ = B1 ∗ B2 ∗ . . . ∗ Bp .
28
R.M. Kishi, R.F. dos Santos, and S.S. Adi
Finally, given two strings S and T , scorew (S, T ) denotes the score of an optimal alignment between S and T under the scoring function w. With these definitions in mind, the spliced alignment problem can be stated as follows: Spliced Alignment Problem (SAP): given a sequence S = s1 . . . sn called subject sequence, a sequence T = t1 . . . tm called target sequence, a set B = {B1 , . . . , Bb } of substrings of S called blocks and a scoring function w, find a chain Γ of strings from B such that scorew (Γ ∗ , T ) is maximum among all chains of blocks from B. The relation between the SAP and the exon assembly problem becomes clear by taking S as a DNA sequence, T as a transcript sequence and the blocks of B as potential exons of S. With these associations in mind, a solution to the SAP will correspond to a chain of exons of the DNA that best fit the transcript sequence. In case we have a transcript sequence evolutionary related to the gene being searched for in S, a solution to the SAP will probably correspond to the correct exon-intron structure of this gene. 2.1
A Solution to the Spliced Alignment Problem
In [4], the spliced alignment problem is solved by means of a dynamic programming algorithm that, in polynomial time, explores all possible block assemblies and finds the one with the highest similarity score to the target sequence. This algorithm is the core of a gene prediction tool called Procrustes. To a better understanding of the spliced alignment algorithm proposed by Gelfand et al., let Bk = sf . . . si . . . sl be a substring of S that spans the position i of this sequence. Define i-prefix of Bk as Bk (i) = sf . . . si . For a block Bk = sf . . . sl , let first(k) = f , last(k) = l and size(k) = l − f + 1. Let Γ = {B1 , . . . , Bk , . . . , Bt } be a chain such that some block Bk contains position i. Finally, define Γ ∗ (i) as the string B1 ∗ B2 ∗ . . . ∗ Bk (i). Given these definitions, let: M [i, j, k] =
all chains
Γ
max containing block
Bk
score(Γ ∗ (i), T (j)).
M [i, j, k] can be computed by an algorithm based on Recurrence 1, where B(i) = {k : last(k) < i} is the set of blocks ending strictly before position i in S. ⎧ if i = first(k) : ⎪ ⎪ ⎪ ⎪ M [i − 1, j − 1, k] + w(si , tj ) ⎪ ⎪ ⎪ ⎪ M [i − 1, j, k] + w(si , − ) ⎪ ⎨ M [i, j, k] = max if i = first(k) : ⎪ ⎪ maxl∈B(first(k)) M [last(l), j − 1, l) + w(si , tj ) ⎪ ⎪ ⎪ ⎪ maxl∈B(first(k)) M [last(l), j, l) + w(si , − ) ⎪ ⎪ ⎪ ⎩ M [i, j − 1, k] + w( − , tj )
(1)
Gene Prediction by Multiple Spliced Alignment
29
After computing the three dimensional matrix M [i, j, k], the score of an optimal spliced alignment can be found as maxk M [last(k), m, k]. A naive implementation of Recurrence 1 runs in time O(mnc + mb2 ), where b c = n1 Σk=1 size(k) is the coverage of the subject sequence by blocks, and requires O(mnb) space. In [4], it is shown that with some modifications the algorithm can be implemented to run in O(mnc + mb) time using quadratic space.
3
Multiple Spliced Alignment Problem
A direct generalization of the spliced alignment problem involves multiple target sequences and can be defined as follows: Multiple Spliced Alignment Problem (MSAP): given a subject sequence S = s1 . . . sn , a set T = {T1 , T2 , . . . , Tk } of target sequences, a set B = {B1 , . . . , Bb } of blocks (substrings) of S and a scoring function w, find a k chain Γ of strings from B such that i=1 scorew (Γ ∗ , Ti ) is maximum among all chains of blocks from B. Observe that, differently from the SAP, a solution to the MSAP will correspond to a chain of exons that best fit a set of transcript sequences (and not just one transcript sequence). Given this multiple sources of evidence, it is reasonable to assume that the probability of a solution to the MSAP be related with the searched gene is higher when compared with a solution to the SAP. Unfortunately, no efficient algorithm is known for the MSAP. In fact, the MSAP is NP-complete. This can be proved by a reduction from a restricted version of the Median String Problem to the MSAP. In the median string problem we are interested in, given a set W of sequences over an alphabet Σ, find a sequence over Σ that minimizes the sum of the edit distances to the sequences in W . In [12], Nicolas and Rivals show that the median string problem is NPcomplete under the Levenshtein distance even if Σ is binary. The main idea of the proposed reduction is to create an instance of the MSAP whose set T is equal to W . The scoring function w can be defined as w(α, β) = 0 if α = β and w(α, β) = −1 otherwise. The subject sequence S can be constructed through a concatenation of the characters of Σ and a further concatenation of 2y copies of the resulting sequence, where y is the length of the largest sequence in W . Finally, the set B of blocks is constructed by associating a block B for each symbol of S. Given the fact that the similarity between two sequences under the proposed scoring function w is equal to the Levenshtein distance multiplied by -1 and that the length of the median string is always smaller than 2y, a solution to the MSAP will correspond to a solution to the median string problem. More details about this reduction can be found in [7].
4
Heuristics for the Multiple Spliced Alignment Problem
Given the fact that the MSAP is NP-complete, we developed three heuristics for it. All of them are based on the solution proposed by Gelfand et al. for the SAP.
30
R.M. Kishi, R.F. dos Santos, and S.S. Adi
The first heuristic (H1) has two main steps. In the first step, the similarity between all pairs of sequences in T is computed searching for one whose sum of similarity to all the remaining sequences in T is maximum. This sequence is called center sequence. In the second step, a spliced alignment between S and the center sequence is computed. The complexity of this heuristic is O(k2 m2 + mnc + mb2 ). The second heuristic (H2) has three main steps. The first one computes an approximation to the optimal multiple alignment of the sequences in T based on the center star method due to Gusfield [6]. Given this multiple alignment, a consensus sequence is then constructed. Finally, a spliced alignment between the sequence S and the consensus sequence is computed in the third step of the heuristic. The complexity of this heuristic is O(k 2 m2 + mnc + mb2 ). The third heuristic (H3) starts by running the spliced alignment algorithm k times, one for each target sequence. At the end of this step, the blocks that appear in more than half of the computed solutions are chosen as the output of the heuristic. The complexity of this heuristic is O(kmnc + kmb2 ).
5
Test Results
To assess the applicability of the proposed heuristics in the gene prediction task, they were implemented (in ANSI-C programming language) and tested in a sample of human DNAs taken from the chromosome sequences annotated by the ENCODE (ENCyclopedia of DNA elements) project [13]. Our programs take as input a DNA sequence and a set of cDNA sequences and return the locations of the predicted exons in the DNA. Our benchmark includes all the annotated protein-coding genes with typical structure and no alternative splicing. To each sequence obtained, we left 1000bp of intergenic region at their 5 and 3 ends. The target sequences were taken from the HomoloGene Database [14] and correspond to sequences of cDNAs evolutionary related to the target gene. Finally, the set of blocks for each instance was constructed by means of a HMM-based algorithm that finds the potential exons of a given DNA sequence. This algorithm is part of a gene prediction tool called Genscan [1]. Detailed information about each instance can be found at http://www.facom.ufms.br/~said/multiplespliced/appendixtable.html. To evaluate the gene prediction accuracy of our programs, we made use of two measures commonly used in the evaluation of gene prediction tools, namely, specificity and sensitivity [2]. At the nucleotide level, the specificity (Spn ) measures the proportion of predicted coding nucleotides that are really coding. The sensitivity at the nucleotide level (Snn ) measures the proportion of really coding nucleotides that have been correctly predicted as coding. The quantity approximate correlation (AC), that evaluates the correlation between the predicted and real coding nucleotides, has been introduced to summarize sensitivity and specificity at the nucleotide level in a single measure. Similarly, the specificity at the exon level (Spe ) measures the proportion of predicted exons that have been correctly predicted and the sensitivity at the exon level (Sne ) the proportion
Gene Prediction by Multiple Spliced Alignment
31
of real exons that have been correctly predicted. At the exon level, the average Av = (Spe + Sne )/2 is used to summarize sensitivity and specificity at the exon level in a single measure For a better analysis of our approach, we performed a comparison between the results reported by the three heuristics and the results of other six similaritybased gene prediction programs: Augustus+ [15], EST2Genome [11], GeneSeqer [16], Procrustes [4], Sim4 [3] and Spidey [17]. Differently from the proposed heuristics, these programs consider only one cDNA sequence in the gene finding task. In our tests, they were executed taking as target the cDNA sequence more similar to the searched gene. The average values of the specificity and sensitivity achieved by our heuristics as well as by the six similarity-based gene prediction tools are shown in Table 1. Table 1. Average values of Sp , Sn , AC and Av achieved by the evaluated tools Tools H1 H2 H3 Procrustes
Spn 0.96 0.91 0.96 0.94
Snn 0.95 0.96 0.93 0.94
AC 0.95 0.92 0.94 0.93
Spe 0.80 0.71 0.83 0.77
Sne 0.85 0.82 0.85 0.82
Av 0.82 0.77 0.84 0.80
Tools Augustus+ EST2Genome GeneSeqer Sim4 Spidey
Spn 0.89 0.97 0.91 0.96 0.96
Snn 0.88 0.96 0.84 0.92 0.92
AC 0.84 0.96 0.85 0.93 0.93
Spe 0.74 0.78 0.64 0.73 0.68
Sne 0.73 0.78 0.67 0.73 0.68
Av 0.74 0.78 0.65 0.73 0.68
As can be seen, all the three heuristics perform well on the proposed benchmark with H1 performing a little bit better at the nucleotide level and H3 a little better at the exon level than the other ones. About this third heuristic, it is important to note that it performs better than all the other gene prediction tools at the exon level. In fact, we can say that H3 was the program that achieved the best results given its better accuracy at the exon level and an accuracy at the nucleotide level that compares with that of the other programs. By analyzing the instances where the heuristics presented the worst results, we can see that the main reason of most underpredictions is the lack of some input blocks corresponding to real exons of the target gene. This can be seen in Figure 1, where the last exon of TOMM6 sequence was missed by all the heuristics. Additionally, H3 can also fail when the set of potential blocks includes exons that overlap some real exon but are a little bit longer or shorter than it. This can be seen in Figure 1, where the exon of coordinates 5819..5945 (second one) of the sequence ZNF418 was missed by the third heuristic because one of the two spliced alignment solutions suggests the block of coordinates 5816..5945 and the other suggests the block of coordinates 5819..5945. Finally, H2 can also suffer from overpredictions when one of the cDNA sequences includes a prefix and/or a suffix that is not present in the others sequences. In this case, the consensus sequence will include these regions, inducing the overpredictions of some exons at the start and/or end of the gene. This can be seen in Figure 1, where an exon of coordinates 886..1107 (first one) was erroneously predicted by H2 in IRF1 sequence.
32
R.M. Kishi, R.F. dos Santos, and S.S. Adi 0
1100
2200
3300
4400
5500
6600
7700
8800
9900
11000
7700
8800
9900
11000
TOMM6 Real
TOMM6 H1,H2,H3
ZNF418 Real
ZNF418 H3
IRF1 Real
IRF1 H2
0
1100
2200
3300
4400
5500
6600
Fig. 1. Example os mispredictions presented by the heuristics
Despite the aforementioned drawbacks, the values in Table 1 shows that, with exception of the second heuristic, the first and third ones outperform all the other gene prediction tools at the exon level, including Procrustes, that implements the spliced alignment algorithm. These values lead us to conclude that the use of more than one transcript sequence can in fact improve the sensitivity and specificity of gene prediction.
6
Discussion
With the advent of new sequencing technologies, the number of genomic sequences is increasing at an unprecedent rate. This fact leads to a constant need for efficient tools able to transform raw data into biologically meaningful information. In this context, one of the main tasks is to identify the coding regions of a genomic sequence. Despite the number of gene predictions tool available, the task of gene finding remains an open problem and an interesting subject of research. In this work, we studied a generalization of a well known formulation of the exon assembly problem, called multiple spliced alignment problem, and used the related results in the development of three comparative-based heuristics to the gene prediction task. In order to assess how adequate is the proposed formulation and the related heuristics, they were implemented and the resulting programs were tested and
Gene Prediction by Multiple Spliced Alignment
33
compared with other six comparative-based gene prediction tools by means of a benchmark including 240 human genomic sequences. The values of specificity and sensitivity at both nucleotide and exon levels show that two of the proposed heuristics outperform the other gene prediction tools. This suggests that the results of eukaryotic gene prediction can be ameliorated by using two or more cDNA sequences.
References 1. Burge, C., Karlin, S.: Prediction of Complete Gene Structures in Human Genomic DNA. Journal of Molecular Biology 268(1), 78–94 (1997) 2. Burset, M., Guig´ o, R.: Evaluation of Gene Structure Prediction Programs. Genomics 34(3), 353–367 (1996) 3. Florea, L., Hartzell, G., Zhang, Z., Rubin, G., Miller, W.: A Computer Program for Aligning a cDNA Sequence with a Genomic Sequence. Genome Research 8(9), 967–974 (1998) 4. Gelfand, M.S., Mironov, A.A., Pevzner, P.A.: Gene Recognition via Spliced Sequence Alignment. Proc. Natl. Acad. Sci (USA) 93(17), 9061–9066 (1996) 5. Guig´ o, R.: Assembling Genes from Predicted Exons in Linear Time with Dynamic Programming. Journal of Computational Biology 5(4), 681–702 (1998) 6. Gusfield, D.: Efficient Methods for Multiple Sequence Alignment with Guaranteed Error Bounds. Bulletin of Mathematical Biology 5(1), 141–154 (1993) 7. Kishi, R.M.: Identifica¸c˜ ao de Genes e o Problema do Alinhamento Spliced M´ ultiplo. Master Thesis. Faculdade de Computao¸c˜ ao/UFMS (December 2010) 8. Knight, J., Myers, E.W.: Super-pattern and Matching. Algorithmica 13, 211–243 (1995) 9. Majoros, W.H.: Methods for Computational Gene Prediction. Cambridge University Press, Cambridge (2007) 10. Math´e, C., Sagot, M.-F., Schiex, T., Rouz´e, P.: Current Methods of Gene Prediction, their Strengths and Weaknesses. Nucleic Acids Research 30(19), 4103–4117 (2002) 11. Mott, R.: EST Genome: a Program to Align Spliced DNA Sequences to Unspliced Genomic DNA. Comput. Appl. Biosci. 13(4), 477–478 (1997) 12. Nicolas, F., Rivals, E.: Hardness Results for the Center and Median String Problems Under the Weighted and Unweighted Edit Distances. Journal of Discrete Algorithms 3(2-4), 390–415 (2004) 13. The ENCODE Project Consortium: The ENCODE (Encyclopedia of DNA Elements) Project. Science 306(5696), 636–640 (2004) 14. Sayers, E.W., et al.: Database Resources of the National Center for Biotechnology Information. Nucleic Acids Research 37(suppl. 1), D5–D15 (2008) 15. Stanke, M., Sch¨ offmman, O., Morgenstern, B., Waack, S.: Gene Prediction with a Hidden Markov Model and a New Intron Submodel. Bioinformatics 19(suppl. 2), II215–II225 (2003) 16. Usuka, J., Zhu, W., Brendel, V.: Optimal Spliced Alignment of Homologous cDNA to a Genomic DNA Template. Bioinformatics 16(3), 203–211 (2000) 17. Wheelan, S.J., Church, D.M., Ostell, J.M.: Spidey: a Tool for mRNA-to-genomic Alignments. Genome Research 11(11), 1952–1957 (2001)
A New Algorithm for Sparse Suffix Trees Gustavo Akio T. Sacomoto1, and Alair Pereira do Lago2 1
2
Universit´e Lyon 1, CNRS, UMR, Laboratoire de Biom´etrie et Biologie Evolutive, Villeurbanne, France and INRIA Rhˆ one-Alpes, Montbonnot Saint-Martin, France
[email protected] Universidade de S˜ ao Paulo, Instituto de Matem´ atica e Estat´ıstica, S˜ ao Paulo, Brazil
[email protected]
Abstract. The sparse suffix trees are suffix trees in which some suffixes are omitted. There are many applications for these structures, ranging from bioinformatics to natural language processing. In many cases, they can replace the complete suffix tree, without decreasing the performance, but with expressive gains in memory consumption. Here, we present a new optimal algorithm to build the sparse suffix tree. That algorithm is simple enough to be implemented and to obtain good results in practice.
1
Introduction
The sparse suffix tree (also called word suffix tree in some works) has the same structure that a suffix tree, however instead of representing all suffixes of a given word, it represents just a subset of them. This structure is particularly adequate to index natural language texts. In these texts, generally, we are interested just in the suffixes that start in the beginning of a word. For example, in the phrase “The quick brown fox” the suffixes “The quick brown fox”, “quick brown fox” and “brown fox” are of interest, but “he quick brown fox” and “uick brown fox” are not. Quite often, when searching in English, we are not interested in a subsequence (of letters), but in a subsequence of words. The applications of sparse suffix trees are not restrict to natural language indexing. In bioformatics, the sparse suffix tree is used in the core of the algorithm to find the shortest pair of patterns that do not occurs close to each other in a sequence [4]. Which can be used to select good adapter-primers in order to increase sensitivity of multiplexed PCR assays. Karkkainen et al. [6] have a work that is the base for another application of the sparse suffix trees. They present an algorithm that efficiently searches for a pattern P in the whole text T using just a sparse suffix tree of T . Since the sparse suffix tree uses less memory than whole suffix tree, this approach is an alternative when indexing large amount of data and the memory consumption of the regular suffix tree is a problem. Which is often the case of genomic sequences databases. In this work, we present a new algorithm that runs in time linear in the word length and uses memory proportional to the number of suffixes represented. This
Corresponding author.
O. Norberto de Souza, G.P. Telles, and M.J. Palakal (Eds.): BSB 2011, LNBI 6832, pp. 34–41, 2011. c Springer-Verlag Berlin Heidelberg 2011
A New Algorithm for Sparse Suffix Trees
35
algorithm is based on McCreight’s algorithm [8] in the form that was presented in [7]. There are two algorithms with these same bounds for memory and time. The first one, presented by Andersson et al. [1] is not a practical alternative, it is very complex and has a very high time constant. On the other hand, the algorithm of Inenaga et al. [5] based on Ukkonen’s algorithm, can be used in practice.
2
Definitions
Let w be a word (or string) over the finite alphabet A. The length of w is denoted by |w|. The same notation also is used for set cardinality, for instance the size of the alphabet is |A|. The set of all possible words over A, including the empty word 1, is given by A∗ and the set A∗ \ {1} is denoted by A+ . The ith letter of w is denote by w[i]. The set of all suffixes (prefixes) of w, including the empty word 1 and the whole word w, is denoted Suf(w) (Pref(w)). Let u ∈ Pref(w), we denote by u−1 w the word obtained by the removal of the prefix u of w, i.e. the suffix of w with length |w| − |u|. Let W be a subset of A∗ . The set all prefixes of all words in W are denoted by Pref(W ) (Suf(W )). The word x is a bifurcation of W if x ∈ Pref(W ) and there are distinct letters a, b ∈ A such that xa, xb ∈ Pref(W ). A bifurcation is a maximal proper prefix between a pair of words in W . The set of all bifurcations of W is denoted by Bifurc(W ). Following [2,7], we define the A+ -tree T = (V, E, λ) as a rooted tree with nonempty labeling in the edges λ : E → A+ , such that two distinct edges leaving the same node have labels which starts with two different letters. The label of node v ∈ V is defined as the concatenation of the labels of the edges in the unique path from the root to v. The special case where all the labels have length one is a trie. Finally, in Definition 1 we present the Ukkonen’s tree of W , a very special case of A+ -tree. Definition 1 (Ukkonen’s tree). Given a set of words W , the Ukkonen’s tree of W is the A+ -tree T = (V, E, λ) where: V = W ∪ Bifurc(W ) ∪ {1}, E = {(u, v) ∈ V × V | u = max|x| {x ∈ V | x is a proper prefix of v}}, λ:E −→ A+ (u, v) −→ u−1 v. The proof that the Ukkonen’s tree is in fact an A+ -tree, along with more detailed exposition of the properties Ukkonen’s tree, can be found in [9,7]. The suffix tree of W is the Ukkonen’s tree of Suf(W ). Finally, a formal definition of the sparse suffix tree can also be given in terms of the Ukkonen’s tree, as the following. Definition 2 (Sparse Suffix Tree). Let w be a word and W a subset of Suf(W ), the set of suffixes of W . We define the sparse suffix tree of W as the Ukkonen’s tree of W .
36
G.A.T. Sacomoto and A.P. do Lago
In the rest of the work it is assumed that w is a word over the finite alphabet A, with length n = |w|, and W is a subset, with cardinality m = |W |, of Suf(w) that we are interested.
3
The Set of Factors
In this section is presented a characterization of the set W in terms of a factorization of w. Since W ⊆ Suf(w), the word w can be uniquely written in the form w = f1 f2 . . . fm , such that any suffix in W is of the form w(i, m) = fi fi+1 . . . fm for some i ∈ [1, m]. The factors fi are precisely the factors w(i, m)w(i + 1, m)−1 . For example, if w = abbbaa and W = {abbbaa, bbaa, aa} the set of factors is {ab, bb, aa}. If w = abbbaa and W = Suf(w), the factors are {a, b} ⊆ A. Hypothesis 1 The set of factors F = ∪m i=1 {fi } is prefix free, i.e. if fi ∈ Pref(fj ), then fi = fj . This hypothesis is necessary for the suffix links, given in Definition 3, to be unique and also in the proof of Lemma 1, which is the basis for the algorithm. This same assumption is made in the other avaible algorithms [1,5]. It is important to notice that this restriction does not reduce the generality of the algorithm described here. Given an arbitrary set F it is always possible to transform it in a prefix free set by concatenating a special character # ∈ / A in each fi ∈ F , # can be seen as a separator.
4
A Naive Algorithm
In this section we describe a naive algorithm that builds the sparse suffix tree in time O(n2 ) and space O(m), which is also the base for our optimal algorithm. It uses an iterative approach, inserting one suffix at a time and keeping at every step the sparse suffix tree of the set of suffixes already processed. The order in which the suffixes are added to the tree is given next. For i ∈ [0, m], we define the sequence of sets Wi , and the sequence of sparse suffix trees Ti in the following way {1}, if i = 0, Wi = Wi−1 ∪ {w(i, m)}, if i > 0, Ti = sparse suffix tree of Wi . For i ∈ [1, m], we also define the head hi as the longest prefix in the set Pref(w(i, m)) ∩ Pref(Wi−1 ); the tail ti = h−1 i w(i, m); the place of hi in Ti−1 as the ordered pair (ui , xi ), such that: hi = ui xi ; ui is a vertex and either xi = 1 or its a non-empty proper prefix of the label of an edge that leaves ui to a vertex vi . In order to update Ti−1 to obtain Ti , the algorithm, using the function Spell (Algorithm 1), walks down from the root of Ti−1 to find the place of the longest prefix of w(i, m) in Ti−1 , i.e. hi . The place (ui , xi ) of hi could be in a vertex (xi = 1) or in the middle of an edge (xi = 1) (see Fig. 1). If it is the latter, we
A New Algorithm for Sparse Suffix Trees
37
ui
ui = hi
xi hi
ti ti w(i, m) (a)
vi
w(i, m) (b)
Fig. 1. These figures illustrate the tree Ti−1 after the update to Ti , for the two possible cases: (a) The place (ui , xi ) of hi is in a vertex; (b) The place (ui , xi ) is in an edge
break the edge in xi and add a vertex with label hi in that place. Finally, we add an edge with label ti from the vertex ui and the new leave w(i, m). It is not hard to see that this algorithm has running to time to find mtime proportional m the place of each hi , which is bounded by i=1 hi ≤ i=1 w(i, m) = O(n2 ).
5
Suffix Link
Definition 3 (Suffix Link). Let u and v be two vertices of T, if there is fi ∈ F such that λ(u) = fi λ(v), then the suffix link of u is v. It is not explicit in Definition 3, but each vertex has at most one suffix link, this is a consequence of the hypothesis that F is prefix free. If there are two distinct vertices v and v such that λ(u) = fi λ(v ) = fj λ(v ), we also have that v and v have distinct lengths, assuming |λ(v )| > |λ(v )|, then fj is a proper prefix of fi , contradicting Hypothesis 1. Moreover, it is easy to see that if the suffix link of a vertex v removes a prefix fi , then the suffix link of any ancestral with label length at least |fi | must remove the same prefix fi . −1 Lemma 1. Suppose that i ∈ [3, m] and |hi−1 | ≥ |fi−1 |. Then, the word fi−1 hi−1 −1 is a prefix of hi . Furthermore, if this prefix is proper fi−1 hi−1 is a vertex of Ti−1 .
This lemma1 guarantees that if |hi−1 | ≥ |fi−1 |, then there are two possibilities −1 for the suffix link of hi−1 either it is already present in Ti−1 (fi−1 hi−1 is a proper −1 prefix of hi ); or it is hi (fi−1 hi−1 = hi ), which will be inserted in the iteration i. A simple induction, using this lemma as the inductive step, give us that if u ∈ Bifurc(Wi−2 ) then the suffix link of u is present in Ti−1 . In other words, if u is an internal node of Ti−2 , then its suffix link is present in Ti−1 . Besides that, all the leaves of Ti−2 also have their suffix links present. This proves the next corollary. Corollary 1. Let u be a vertex of Ti−1 already present in Ti−2 . Suppose that there is some j ∈ [1, m] such that fj ∈ Pref(u). Then fj−1 u ∈ Ti−1 , i.e. the suffix link of u is present in Ti−1 . 1
The proof is omited due to space restrictions, but it can be found in [9].
38
6
G.A.T. Sacomoto and A.P. do Lago
The Algorithm
In this section we present an algorithm that builds the sparse suffix tree for any set of suffixes, such that the corresponding set of factors satisfies Hypothesis 1. The algorithm runs in time linear on n, the length of w, and space linear on the number o suffixes m. It is worth notice that this is an optimal algorithm, both in space and time, since it is necessary to read the whole word and represent all m suffixes. Finally, a detailed pseudo-code is given in Algorithm 3. The algorithm uses the same strategy as the McCreight’s algorithm [8], inserting one suffix at each iteration, in the order of decreasing lengths. It uses two auxiliar functions Spell(l, r, T) and ReSpell(l, r, T). Both walk down the tree T, from the vertex r, looking for l , the longest prefix of l such that λ(r)l is a word of T. The difference between them is that ReSpell assumes that l = l, i.e. the whole word l is going to be spelled in the way down. By assuming that, the function needs only to compare the first letter of each edge label, therefore the running time is proportional to the number of edges visited, while Spell has a time proportional to the length of the spelled prefix l . Whenever possible, the algorithm uses ReSpell instead of Spell.
Algorithm 1. Function Spell Spell(l, r, T) 1 l is spelled from r in T; 2 returns longest prefix of λ(r)l in T, its place and associated edge. 3 u←r 4 s←l 5 while |s| > 0 and ∃e = (u, v) | λ(e) = λ(u)−1 λ(v) ∈ Pref(s) do 6 s ← λ(e)−1 s 7 u←v 8 if |s| > 0 and ∃e = (u, v) | s[1] ∈ Pref(λ(e)) then 9 x ← the longest common prefix of λ(e) and s 10 λ(u)x is proper prefix of λ(v) 11 else x←1 12 v←u 13 return (λ(u)x, u, x, v)
Based on Lemma 1, at iteration i, in order to find the position to insert −1 the vertex hi , the algorithm avoids spelling the prefix fi−1 hi−1 , using whenever possible, the suffix link of the vertex hi−1 or its father. The main difference with McCreight’s algorithm is at this point, when the suffix tree contains all the suffixes, the suffix link of the father of hi−1 is always present. In the sparse case, this is not true anymore, there are cases where none of the suffix links are present and it is necessary to process the whole hi , without eliminating any prefix. However, we show in the analysis that despite that, the algorithm remains linear.
A New Algorithm for Sparse Suffix Trees
39
Algorithm 2. Function ReSpell ReSpell(l, r, T) 1 l is spelled from r in T; 2 returns the longest prefix of λ(r)l in T, its place and associated edge; 3 important restriction: λ(r)l is word of T. 4 u←r 5 s←l 6 x←1 7 while |s| > 0 do 8 (u, v) ← edge e | s[1] ∈ Pref(λ(e)) λ(e) = λ(u)−1 λ(v) 9 if |s| ≥ |λ(e)| then 10 s ← λ(e)−1 s 11 u←v 12 else x←s 13 s←1 14 return (λ(u)x, u, x, v)
Theorem 1. The Algorithm 3 builds the sparse suffix tree of the set W ⊆ Suf(w) in time O(n) and space O(m), in the worst case. Proof. At every iteration the algorithm keeps only the current tree Ti and its suffix links, at most one per vertex. Since the number of vertices is limited by the number of suffixes, the space used in the worst case in O(m). The running time is dominated by the functions Spell and ReSpell. We analyze them separately. At each iteration i the calls of Spell are made in order to spell ti−1 or −1 −1 fi−1 hi−1 ti−1 (whenever this happens, |fi−1 hi−1 ti−1 | < |ti−1 |), in both cases the length of the spelled prefix is no greater than |ti−1 t−1 i |. Therefore, the total number of letters spelled is bounded by m i=1
|ti−1 t−1 i | = |t0 | − |tm | = n − |tm | ≤ n.
Now, we bound the number of edges visited by ReSpell. Let ki be the number of edges visited by the function at iteration i and φ(i) the length2 of the suffix of the word w that still going to be spelled by Spell or ReSpell on further iterations (observe that φ(i) ≤ |xi ti |). At iteration i + 1 the maximum number of edges that we can walk up tree is |fi | + 1. This is the situation of the call of ReSpell in line 12: one edge from hi to its father ui ; and |fi | edges from ui to the root, since when this occurs we have that |ui | < |fi | (line 11) and the node label is a upper bound for the tree depth. So, in the worst case, at iteration i at least ki − (|fi | + 1) edges of the ki visited by ReSpell are not going be visited in further iterations. The edge labels are non-empty, so at least a prefix of length ki − (|fi | + 1) was definitely spelled at iteration i, which gives the bound φ(i + 1) ≤ φ(i) − 2
For ease of exposition we define φ(i) = 0 for i > m.
40
G.A.T. Sacomoto and A.P. do Lago
Algorithm 3. Sparse suffix tree construction algorithm SparseSuffixTree(w, F ) 1 F : set of factors such that w = f1 f2 . . . fm 2 initialize T (T0 ) as the tree with just the root 1 3 for i from 1 to m do 4 T = Ti−1 5 if i = 1 then 6 (hi , ui , xi , vi ) ← Spell(w, 1, T) 7 elseif |hi−1 | < |fi−1 | then −1 8 (hi , ui , xi , vi ) ← Spell(fi−1 hi−1 ti−1 , 1, T) 9 elseif xi−1 = 1 then 10 (hi , ui , xi , vi ) ← Spell(ti−1 , SufLink(hi−1 ), T) 11 else if |ui−1 | < |fi−1 | then −1 12 (h, u, x, v) ← ReSpell(fi−1 ui−1 xi−1 , 1, T) 13 else (h, u, x, v) ← ReSpell(xi−1 , SufLink(ui−1 ), T) 14 if x = 1 then 15 (hi , ui , xi , vi ) ← Spell(ti−1 , u, T) 16 SufLink(hi−1 ) ← u 17 else (hi , ui , xi , vi ) ← (h, u, x, v) 18 hi will be added at this iteration 19 SufLink(hi−1 ) ← hi 20 ti ← h−1 i w(i, m) 21 Updates T from Ti−1 to Ti 22 if |xi | > 0 then 23 add the new vertex hi 24 remove the edge (ui , vi ) 25 add the edge from ui to hi with label xi 26 add the edge from hi to vi with label h−1 i vi 27 if |ti | > 0 then 28 add the new leave w(i, m) 29 add the edge from hi to w(i, m) with label ti 30 SufLink(w(i − 1, m)) ← w(i, m) 31 return T
(ki − (|fi | + 1)). Rearranging and summing over all iterations, we obtain that the number of edges visited by ReSpell is bounded by m i=1
ki ≤ φ(1) − φ(m + 1) +
m
(|fi | + 1) = 2n + m ≤ 3n.
i=1
It is interesting to observe that when the suffix link of a vertex or its father are present the algorithm behaves exactly like McCreight’s, but otherwise it behaves like the naive algorithm, searching the position to insert hi using Spell from the root. Which has complexity proportional m to the sum of the length of the inserted words, for the set W it would be i=1 w(i, m) = O(n2 ). Intuitively, the reason why the algorithm remains linear, is a consequence of Lemma 1. According to
A New Algorithm for Sparse Suffix Trees
41
it the only vertexes hi that does not have the suffix links present are the ones such that |hi | < |fi |. Therefore, the algorithm builds the tree using the naive strategy for the set F = {fi |1 ≤ i ≤ m}, but the sum of the lengths of this set is m i=1 |fi | = n.
7
Conclusion
In this work we presented a new optimal algorithm to build the sparse suffix trees, which is an elegant improvement of the naive algorithm based on the McCreight’s strategy for the complete suffix trees [8]. More importantly, it is not interesting just from a theoretical point of view, although no experimental tests were done here, our algorithm is simple enough to be implemented and used in practice. Finally, based on Kurtz et al. [2], for the same implementations conditions, our algorithm is expected to perform better than Inenaga’s algorithm [5].
References 1. Andersson, A., Jesper, N., Swanson, L.K.: Suffix trees on words. Algorithmica 23, 102–115 (1999) 2. Giegerich, R., Kurtz, S.: From Ukkonen to McCreight and Weiner: A unifying view of linear-time suffix tree construction. Algorithmica 19(3), 331–353 (1997) 3. Gusfield, D.: Algorithms on strings, trees, and sequences. Cambridge University Press, Cambridge (1997), computer science and computational biology 4. Inenaga, S., Kivioja, T., M¨ akinen, V.: Finding missing patterns. In: Jonassen, I., Kim, J. (eds.) WABI 2004. LNCS (LNBI), vol. 3240, pp. 463–474. Springer, Heidelberg (2004) 5. Inenaga, S., Takeda, M.: On-line linear-time construction of word suffix trees. In: Lewenstein, M., Valiente, G. (eds.) CPM 2006. LNCS, vol. 4009, pp. 60–71. Springer, Heidelberg (2006) 6. K¨ arkk¨ ainen, J., Ukkonen, E.: Sparse suffix trees. In: Cai, J.-Y., Wong, C.K. (eds.) COCOON 1996. LNCS, vol. 1090, pp. 219–230. Springer, Heidelberg (1996) 7. do Lago, A.P., Simon, I.: T´ opicos em Algor´ıtmos sobre Sequˆencias. In: IMPA (2003) 8. McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976) 9. Sacomoto, G.A.T.: Ukkonen’s Tree: Combinatorial Caracterization and Applications. Master’s thesis, Universidade de S˜ ao Paulo (2011) 10. Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995), http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.751
Analysis and Implementation of Sorting by Transpositions Using Permutation Trees Marcelo P. Lopes1 , Marilia D.V. Braga2, Celina M.H. de Figueiredo1 , Rodrigo de A. Hausen3 , and Luis Antonio B. Kowada4 1
2
Universidade Federal do Rio de Janeiro Instituto Nacional de Metrologia, Normaliza¸c˜ ao e Qualidade Industrial 3 Universidade de S˜ ao Paulo 4 Universidade Federal Fluminense
Abstract. Feng and Zhu defined the permutation tree structure to achieve a running time of O(n log n) for Hartman and Shamir’s 1.5approximation algorithm for sorting genomes by transpositions. The present work describes the first available implementation of this algorithm using permutation trees. Our analysis of Hartman and Shamir’s algorithm also leads to a modified 1.5-approximation algorithm that achieves in most cases a shorter sequence of transpositions that sorts a given permutation. Although our modified algorithm has a worst-case running time of O(n2 log n), in our experiments using real genomic data the running time is still very small. Keywords: comparative genomics, genome rearrangement, sorting by transpositions, approximation algorithm, permutation tree.
1
Introduction
In genome rearrangements, the usual approach for studying the similarity of chromosomes is to compare the gene orders of the chromosomes instead of directly comparing their DNA sequences [14]. In this context, a transposition is an operation in which a block of genes is “cut” from a chromosome and “pasted” elsewhere in the same chromosome [1]. A possible biological explanation for this rearrangement is the duplication of a block of genes, followed by the deletion of the original block [2]. The transposition distance is then the problem of finding the minimum number of transpositions necessary to transform a chromosome into another one. Determining the transposition distance of permutations is a notoriously challenging problem; it has been recently proven to be NP-hard [3], and tight bounds for the transposition distance of permutations are not available in general, but only for some classes of permutations [5,11,10]. Therefore, approximation algorithms [5,9] seem to be the only viable alternatives for estimating the transposition distance between permutations in general. To this date, the fastest implementation of Hartman and Shamir’s 1.5-approx√ imation algorithm runs in O(n3/2 log n) time [8]. The present paper proposes O. Norberto de Souza, G.P. Telles, and M.J. Palakal (Eds.): BSB 2011, LNBI 6832, pp. 42–49, 2011. c Springer-Verlag Berlin Heidelberg 2011
Sorting by Transpositions Using Permutation Trees
43
the first available implementation of Hartman and Shamir’s 1.5-approximation algorithm [9] using Feng and Zhu’s permutation trees [6]; our proposed implementation achieves the time upper bound of O(n log n) claimed in [6]. Additionally, we propose an improvement to Hartman and Shamir’s algorithm that achieves in most cases a shorter sequence of sorting transpositions and in worst case runs in O(n2 log n) time. Although the theoretical complexity is higher, in our experiments with real genomic data the performance of the improved algorithm is still reasonable.
2
Sorting by Transpositions
As a transposition only moves genes across one chromosome, but never between chromosomes, we will focus our attention on the problem of comparing unichromosomal genomes. A chromosome with n genes is modeled by a permutation π = [π1 π2 . . . πn ], where each element πi , which represents a gene, is an integer in the range 1 . . . n, and no element is repeated. The size of a permutation is the number of elements in it; therefore, the size of π is |π| = n. A transposition t = (i, j, k), such that 1 ≤ i ≤ j ≤ k ≤ n + 1, is the operation that, applied to a permutation π = [π1 . . . πi−1 πi . . . πj−1 πj . . . πk−1 πk . . . πn ], results in the permutation: πt = [π1 . . . πi−1 πj . . . πk−1 πi . . . πj−1 πk . . . πn ]. Notice that a transposition t = (i, j, k) moves the block πj . . . πk−1 to the left of the block πi . . . πj−1 . We can also generalize this notation to a sequence of transpositions t1 t2 . . . tq . Thus, πt1 t2 . . . tq denotes the application of t1 onto π, followed by the application of t2 onto πt1 and so on. Given two permutations, the problem of finding a sequence of transpositions that sorts one permutation into the other is called sorting by transpositions. In the study of this problem, without loss of generality, we can consider only the case in which a permutation π is sorted to the identity permutation ι[n] = [1 2 . . . n]. We may then find a sequence t1 t2 . . . tq such that πt1 t2 . . . tq = ι[n] . The minimum number of transpositions required to sort a permutation π is its transposition distance, denoted by dt (π). The problem of determining the transposition distance has been recently proven to be NP-hard [3], and the best known approximation ratio is 1.375 [5]. In the present study, we have implemented algorithms whose approximation ratio is 1.5.
3
1.5-Approximation Algorithms
When studying the problem of sorting by transpositions, it is useful to extend a permutation π, by adding two elements π0 = 0 and πn+1 = n + 1. Bafna and Pevzner [1] presented a 1.5-approximation algorithm, based on the following structure.
44
M.P. Lopes et al.
The breakpoint graph [1] of a permutation π, denoted by G(π), is a graph whose vertex set is {0, −1, +1, . . . , −n, +n, −(n + 1)}, and its edge set is partitioned into two, A and B – gray and black edges respectively – such that A = {(+i, −(i + 1)); 0 ≤ i ≤ n} and B = {(−πi+1 , +πi ); 0 ≤ i ≤ n}. The black edges represent the consecutive elements in the extended permutation of π, whereas the gray edges represent the consecutive elements in the extended permutation of ι[n] . Since the degree of every vertex in G(π) is 2, this graph can be partitioned into disjoint cycles, and each one of them alternates black and gray edges. The length of a cycle in G(π) is the number of black (or gray) edges in it. A cycle in G(π) is odd if it has an odd length, and even otherwise. The number of odd cycles in G(π) is denoted by codd (π), and plays an important role in estimating the transposition distance. A transposition t = (i, j, k) is said to affect a cycle c in G(π) if c has at least one of the following black edges: (−πi , +πi−1 ), (−πj , +πj−1 ) or (−πk , +πk−1 ). The graph G(πt) is almost identical to G(π), up to the removal of the edges (−πi , +πi−1 ), (−πj , +πj−1 ), (−πk , +πk−1 ) and the addition of the black edges (−πj , +πi−1 ), (−πi , +πk−1 ), (−πk , +πj−1 ) (see [1] for more details). The effect of removing and adding black edges may change the number of odd cycles, as seen in Theorem 1. Theorem 1. [1] A transposition t applied to a permutation may only increase the number of odd cycles by 2, decrease by 2, or it does not change the number of odd cycles. That is, codd (πt) − codd (π) ∈ {−2, 0, +2}. A transposition that increases the number of odd cycles by 2, decreases by 2, or does not change the number of odd cycles is named, respectively, a 2-move, a (−2)-move or a 0-move. A cycle is said to be oriented if there is a 2-move that affects it. If no 2-move affects a cycle, then it is unoriented. Theorem 2. [1] If there is no 2-move for π, then it is possible to find a sequence of transpositions t1 t2 t3 such that t1 is a 0-move for π, t2 is a 2-move for πt1 and t3 is a 2-move for πt1 t2 . Such a sequence is called a (0, 2, 2)-move. Notice that the maximum number of cycles in the breakpoint graph of a permutation of n elements is n + 1; the only permutation with n elements whose graph has n + 1 cycles is ι[n] . Therefore, if one manages to find a sequence of 1 (n + 1 − codd (π)) 2-moves that sorts a permutation π, this sequence is optimal, 2 accounting for the lower bound in Theorem 3. The upper bound is achieved by successive applications of (0, 2, 2)-moves. Theorem 3. [1] A permutation π of n elements satisfies 1 3 n + 1 − codd (π) ≤ dt (π) ≤ n + 1 − codd (π) . 2 4 As a consequence of Theorems 2 and 3, an algorithm that finds (0, 2, 2)-moves to sort a permutation has an approximation ratio of 3/4 = 1.5. 1/2
Sorting by Transpositions Using Permutation Trees
45
Simple Permutations If every cycle of a permutation has length either 1, 2 or 3, then this permutation is said to be simple. It is possible [12] to create a simple permutation π ˆ , with n ˆ elements, from a permutation π, such that the bounds in Theorem 3 for π ˆ are preserved with respect to π: n ˆ + 1 − codd (ˆ π ) = n + 1 − codd (π). A sequence of transpositions that sorts π ˆ can be transformed into a sequence of transpositions that sorts π with the same number of transpositions [12]. Since there are less edges in each cycle in a simple permutation, it is easier to determine which edges must be removed in a 2-move. This is the principle of the simpler 1.5-approximation to√the transposition distance given by Hartman and Shamir, that runs in O(n3/2 log n) time [9]. The Permutation Tree The permutation tree [6] is a binary balanced tree that represents a permutation. Let π = [π1 π2 . . . πn ] be a permutation. The corresponding permutation tree has n leaves, labeled π1 , π2 , . . . πn ; every node represents an interval of consecutive elements πi , πi+1 , . . . , πk−1 , with i < k, and is labeled by the maximum number in the interval. Therefore, the root of the tree is labeled with n. Furthermore, the left child of a node represents the interval πi , . . . , πj−1 and the right child represents πj , . . . , πk , with i < j < k. Feng and Zhu provide algorithms to build a permutation tree in O(n) time, to join two permutation trees into one in O(h), where h is the height difference between the trees, and to split a permutation tree into two in O(log n). These operations allow one to find and to apply a transposition to a permutation π, updating the tree, in O(log n) time. The overall complexity of Hartman and Shamir’s [9] algorithm is then reduced to O(n log n). Improvement to Reduce the Number of Transpositions Hartman and Shamir’s algorithm traverses the list of 2- and 3-cycles of the simple permutation, sorting these cycles in the order in which they are visited. Consequently, if the visited cycle is unoriented, a (0, 2, 2)-move is performed. This algorithm visits each cycle only once, but may use a (0, 2, 2)-move when a 2-move on an oriented cycle that appears later in the list is also possible. In other words, it does not perform all possible 2-moves before applying a (0, 2, 2)-move. However, if we skip the unoriented cycles and apply first all possible 2-moves, we may use less (0, 2, 2)-moves than in Hartman and Shamir’s algorithm. This happens because, depending on how the cycles of the breakpoint graph intersect, a 2-move may transform an unoriented into an oriented cycle. Observe that Theorem 2 guarantees that this improvement also leads to an 1.5-approximation. Furthermore, although the worst-case running time is O(n2 log n), when each cycle is visited O(n) times, in practice the running time is still very small, as seen in Sec. 5.
46
M.P. Lopes et al.
4
Implementation
This section summarizes some aspects of our implementation, which has been made in Java 1.6 and is available at http://grafos.cos.ufrj.br/bioinfo/. Obtaining Simple Permutations We will first describe how to create a simple permutation from a given permutation π. In a nutshell, for each cycle c of length ≥ 4 in G(π), our implementation inserts a value in π (respectively two vertices in G(π)) that splits c into one cycle c of length 3 and one cycle c of length − 3 + 1. If − 3 + 1 ≥ 4, we repeat this procedure for c , until only cycles of lengths 1, 2 and 3 remain. These operations can be executed in linear time, as we will show next, with the help of an example. Remember that only gray (or only black) edges account for the length of a cycle. A cycle can be represented by its nonnegative vertices, listed in the circular order obtained by a traversal of this cycle starting in vertex 0 or in any vertex +i (for 1 ≤ i ≤ |π|) and following its black edge. Consider the permutation π = [7 6 5 4 3 2 1], whose breakpoint graph has two cycles of length 4. One of these cycles is c1 = (0 +6 +4 +2). In order to split this cycle, we need to insert an element in π after 4, so that the first three vertices of c1 are separated into a cycle c1 , and the inserted element forms a new cycle c1 along with the fourth vertex. For the sake of clarity, the inserted elements will be represented by rational numbers, so that the previous elements do not have to be renumbered. Since the new element must end the cycle in vertex 0, it has to be a number between 0 and 1. Thus, if we insert 0.5 after 4, we get the permutation [7 6 5 4 0.5 3 2 1], splitting c1 into c1 = (0 +6 +4), of length 3, and c1 = (+0.5 +2), of length 2. We still have another cycle of length 4, that is c2 = (+7 +5 +3 +1). Analogously, we can insert 7.5 after 3 and obtain [7 6 5 4 0.5 3 7.5 2 1]. This step splits c2 into c2 = (+7 +5 +3) and c2 = (+7.5 +1). Transpositions on a Permutation Tree After transforming the permutation into a simple permutation, we proceed to build the permutation tree. Every node of the tree is represented by a structure that has the following attributes, as defined by Feng and Zhu: value, the label of the node, which is an integer that corresponds to an element in the permutation, L and R, references to the left and right children, parent reference to the parent node, and the height H, being 0 for a leaf and max{H(L(v)), H(R(v))} + 1 for every other node v. We also added two more attributes to facilitate our implementation: parent, a reference to the parent node, nextNodeCycle and previousNodeCycle, references to the previous and next element in the cycle respectively, according to the order provided by the procedure next. The procedures Build, Split and Join are implemented according to Feng and Zhu’s definitions, with some additional operations to account for the attributes that we have defined. The procedure Build(π) receives a permutation π and returns a permutation tree; Split(T, i) receives a permutation tree T
Sorting by Transpositions Using Permutation Trees
47
and an integer i, and returns two permutation trees T1 and T2 , that represent the permutations [π1 π2 . . . πi−1 ] and [πi πi+1 . . . πn ], with a suitable reordering of its elements to keep the same adjacencies and breakpoints; Join(T1, T2 ) receives two permutation trees T1 , T2 , representing permutations [π1 π2 . . . πi−1 ] and [πi πi+1 . . . πn ], and returns a permutation tree T that represents [π1 π2 . . . πi−1 πi . . . πn ], with a suitable reordering of its elements to keep the same adjacencies and breakpoints. A transposition t = (i, j, k) is performed on a permutation tree T by the following procedure, which runs in O(log n) time: PROCEDURE transposition(T , i, j, k) (T1 , T2 ) := Split(T , i) (T3 , T4 ) := Split(T2, j − i) (T5 , T6 ) := Split(T4, k − j) RETURN Join(Join(Join(T1, T5 ), T3 ), T6 ) The sequence of transpositions applied to the simple permutation returned by the algorithm can be mimicked to sort the original permutation, by simply omitting the inserted values.
5
Results
We selected pairs of microorganisms to test our implementation of the proposed modification to the algorithm of Hartman and Shamir with Feng and Zhu’s permutation tree. All sixty compared pairs had results similar to the ones presented in Table 1. The permutations were obtained by a pairwise comparison of protein homologs in the NCBI database using the Gene Plot service (http://www.ncbi.nlm.nih.gov/sutils/geneplot.cgi). For each comparison, we have: a) selected two similar species, b) removed the duplicated proteins in each organism, c) created a permutation associated to the pairwise comparison. Lu and Yang [13] found a class of permutations with transposition distance 1 greater than or equal to 17 n + 33 , where n is the size of permutation. These are 33 the farthest permutations known to date; for this reason, they were selected as a benchmark to test our implementation. These permutations range in size from 1000 to 10000 elements. For these permutations, the distance returned by both implementations is the same: 1.235 greater than the lower bound. The average running time of our implementations, for this family, is shown in Figure 1.
6
Discussion
In this work, we give an implementation of Hartman and Shamir’s 1.5-approximation algorithm for sorting genomes by transpositions, and an improved implementation, that performs all possible 2-moves first. The experimental results in Table 1 show that, in practice, the time of computation using permutation trees is small (less than three seconds) for genomes with thousands of genes.
48
M.P. Lopes et al.
Table 1. Microorganism comparison. Algorithm 1 - original HS algorithm. Algorithm 2 - Improved HS algorithm. NCG - Number of Common Genes, NT - Number of transpositions obtained by each program, time - average time in seconds in an Intel Dual Core 2 GHz Computer. Organism 1
Organism 2
B. bacilliformis KC583
B. henselae str.Houston-1
B. adolescentis ATCC15703 B. longum NCC2705 Azoarcus sp. BH72
A. sp. EbN1
B. fragilis NCTC9343
B. thetaioaomicron VPI-5482
B. melitensis 16M
B. suis 1330
Arthobacter aurescens TCI A. sp. FB24 B. mallei ATCC23344
B. bronchiseptica RB50
B. parapertussis 12822
B. cenocepacia AU1054
B. cepacia AMMD
NT1/NT2 383/381 581/553 1052/950 921/846 2129/2129 629/559 1672/1609 303/303 2136/2106
time1/time2 0,031/0,063 0,078/0,11 0,297/0,453 0,25/0,359 0,806/1,609 0,187/0,109 0,875/1,129 0,037/0,047 1,828/2,515
B. cenocepacia AU1054
NCG 999 1196 2208 2542 2845 3074 3311 4077 5087
Fig. 1. Running time of both implementations (1 - original HS algorithm. 2 - Improved HS algorithm) for Lu and Yang permutations
On the other hand, the number of transpositions found with the improved version was smaller in most cases (and never larger than the original version). Figure 1 shows that, for the permutations analyzed so far, the running time of the improved version is reasonable (less than 20 seconds for sizes up to 10,000). The implemented algorithms have an approximation ratio of 1.5, while the best known ratio for sorting by transpositions is 1.375, using Elias and Hartman’s algorithm [5]. Dias and Dias [4] implemented Elias and Hartman’s algorithm and also modified it with some heuristics, preserving the same approximation ratio of 1.375. The goal of Dias and Dias’s implementation is the development of an algorithm that, in practice, returns a smaller number of transpositions, irrespective of the theoretical complexity. According to the authors, for instances with thousands of elements, their algorithm can take several days [4]. We plan on comparing the results of our implementation with Dias and Dias’s implementation if we manage to obtain the distances returned by their algorithm.
Sorting by Transpositions Using Permutation Trees
49
Recently, the use of permutation trees in Elias and Hartman’s algorithm was claimed to also achieve a running time of O(n log n) [7], but no implementation was given. Providing this implementation would be a nice extension to our work. Acknowledgement. We thank Maria E. Walter from Univ. de Bras´ılia and Philipe Carvalho from Univ. Federal Fluminense for helping us to select the microorganisms used in our experiments. This research was partially supported by the Brazilian research agencies CNPq (grants UNIVERSAL 472144/09-0 and PROMETRO 563087/10-2) and FAPERJ (grant PRONEX E-26/110.550/10).
References 1. Bafna, V., Pevzner, P.A.: Sorting by transpositions. SIAM J. Disc. Math. 11(2), 224–240 (1998) 2. Boore, J.L.: The duplication/random loss model for gene rearrangement exemplified by mitochondrial genomes of deuterostome animals. In: Sankoff, D., Nadeau, J.H. (eds.) Comparative Genomics, pp. 133–148. Kluwer Academic Publishers, Dordrecht (2000) 3. Bulteau, L., Fertin, G., Rusu, I.: Sorting by transpositions is difficult. arXiv: 1011.1157v1 [cs.DS] (November 2010) 4. Dias, U., Dias, Z.: An improved 1.375-approximation algorithm for the transposition distance problem. In: Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology, BCB 2010, pp. 334–337. ACM, New York (2010) 5. Elias, I., Hartman, T.: A 1.375-approximation algorithm for sorting by transpositions. IEEE/ACM Trans. Comput. Biol. and Bioinformatics 3(4), 369–379 (2006) 6. Feng, J., Zhu, D.: Faster algorithms for sorting by transpositions and sorting by block interchanges. ACM Trans. Algorithms 3 (August 2007) 7. Firoz, J., Hasan, M., Khan, A., Rahman, M.: The 1.375 approximation algorithm for sorting by transpositions can run in O(n log n) time. In: Rahman, M., Fujita, S. (eds.) WALCOM 2010. LNCS, vol. 5942, pp. 161–166. Springer, Heidelberg (2010) 8. Hartman, T.: Combinatorial algorithms for genome rearrangements and DNA oligonucleotide arrays. Ph.D. thesis, Weizmann Institute of Science (2004) 9. Hartman, T., Shamir, R.: A simpler and faster 1.5-approximation algorithm for sorting by transpositions. Inf. Comput. 204(2), 275–290 (2006) 10. Hausen, R.A., Faria, L., Figueiredo, C.M.H., Kowada, L.A.B.: Unitary toric classes, the reality and desire diagram, and sorting by transpositions. SIAM J. Disc. Math. 24(3), 792–807 (2010) 11. Kowada, L., de, A., Hausen, R., de Figueiredo, C.: Bounds on the transposition distance for lonely permutations. In: Ferreira, C., Miyano, S., Stadler, P. (eds.) BSB 2010. LNCS, vol. 6268, pp. 35–46. Springer, Heidelberg (2010) 12. Lin, G., Xue, G.: Signed genome rearrangement by reversals and transpositions: models and approximations. Theoret. Comput. Sci. 259(1-2), 513–531 (2001) 13. Lu, L., Yang, Y.: A lower bound on the transposition diameter. SIAM J. Disc. Math. 24(4), 1242–1249 (2010) 14. Sankoff, D., Leduc, G., Antoine, N., Paquin, B., Lang, B.F., Cedergren, R.: Gene order comparisons for phylogenetic inference: evolution of the mitochondrial genome. Proc. Natl. Acad. Sci. 89(14), 6575–6579 (1992)
Improved Gene Expression Clustering with the Parameter-Free PKNNG Metric Ariel E. Bay´ a and Pablo M. Granitto CIFASIS, French Argentine International Center for Information and Systems Sciences, UPCAM (France) / UNR–CONICET (Argentina) Bv. 27 de Febrero 210 Bis, 2000 Rosario, Argentina {baya,granitto}@cifasis-conicet.gov.ar http://www.cifasis-conicet.gov.ar/ Abstract. In this work we introduce a modification to an automatic non-supervised rule to select the parameters of a previously presented graph-based metric. This rule maximizes a clustering quality index providing the best possible solution from a clustering quality point of view. We apply our parameter-free PKNNG metric on gene expression data to show that the best quality solutions are also the ones that are more related to the biological classes. Finally, we compare our parameter-free metric with a group of state-of-the-art clustering algorithms. Our results indicate that our parameter-free metric performs as well as the state-ofthe-art clustering methods. Keywords: Distance Measure, Clustering, Manifolds, Gene Expression Data.
1
Introduction
A common non-supervised analysis of gene expression data is to cluster the samples (tissues/diseases/patients). The goal in this situation is to find groups of samples sharing similar gene expression patterns. Pioneering works in this area are Golub et al. [1] and Alizadeh et al. [2], both aimed at the discovery of new cancer subtypes. It is clear that clustering samples is very different to clustering genes. In this case the analysis is made over so-called “wide datasets”, with a few samples measured over thousands of features (genes). Dealing with high dimensional spaces is a known challenge for clustering procedures. In this work we address this particular problem, i.e., the grouping of samples in the high-dimensional space created by their genetic profiles. Clustering problems can be analyzed in three separated stages: i) measuring the similarity of the objects under analysis, ii) grouping the objects according to these similarities, and iii) evaluating the “goodness” of the clustering solution. In a previous work [3] we introduced the new PKNNG metric that can be used to obtain better estimation of distances on high dimensional data. We showed that, in the second stage, these distances can be used effectively by almost any clustering algorithm, even by one not specifically designed to solve problems in high dimensional spaces. O. Norberto de Souza, G.P. Telles, and M.J. Palakal (Eds.): BSB 2011, LNBI 6832, pp. 50–57, 2011. c Springer-Verlag Berlin Heidelberg 2011
Parameter-Free PKNNG Metric
51
The PKNNG metric is based on the use of a k-nearest-neighbours (k-nn) graph of the samples, with a low k-value that forms several subgraphs. These subgraphs are connected via highly penalized edges to form a connected graph. For a detailed description of the metric we refer the reader to [3]. Although in most cases there is a range of values of k that leads to good clustering solutions, we showed in a previous work [4] with this metric that the particular value of k that produces the best solution is dependent on the settings of the problems, and that there is a simple method, based on a clustering quality measure, to select the value of k that produces the best results for a given clustering problem. This automatic non-supervised rule to select the value of k transforms PKNNG into an effective and parameter-free metric. In this work we introduce a simple modification to the PKNNG metric, the use of a fixed scale parameter µ, that leads to better selections of the optimal k value. Then, we apply for the first time this parameter-free PKNNG metric to the clustering of gene-expression data and discuss its efficacy. To give context to the validity of our method, we also compare it with stateof-the-art clustering methods. The Spectral Clustering algorithm [5] has recently received increasing attention and is considered to be effective for arbitrary manifolds using an appropriate kernel. The Evidence Accumulation Clustering algorithm (EAC) method of Fred and Jain [6] is based on the innovative idea of producing several different clustering solutions and extract from them the information to produce the final clustering. The main idea is that if a pair of points is usually clustered together, under different conditions, then they could be assigned to the same cluster with high confidence. Recently, Kim et al. [7] have presented the Multi-K clustering method, which is based in the same general idea of ensemble clustering. We also considered a fourth clustering method in this comparison, Model-Based Clustering (MBC) [8], which showed good results in gene-expression problems [9]. In Section 2 of this paper we review our previous strategy to select the value of k in the PKNNG metric using the Silhouette index, and introduce the use of a fixed µ. Then, in Section 3 we analize the performance of the new rule on gene-expression datasets and compare its performance with state-of-the-art clustering algorithms. Finally, in Section 4 we present a summary of our work.
2
An Automatic Non-supervised Rule
In this section we describe and analyse the rule for the selection of the value of k that leads to the optimal clustering solution. We assume that the instances are grouped forming tight and well separated clusters but over arbitrary highdimensional paths. Also, we assume that the PKNNG metric has the power to correctly decode the biological information in the high-dimensional gene expression data to group instances effectively, but that its performance depends on k.
52
2.1
A.E. Bay´ a and P.M. Granitto
Clustering Quality Measure
There are many functions used to measure clustering quality in the literature, each of them based on a given concept of what a cluster is. We choose the Average Silhouette index as measure of quality in our method, because it favors tight and well separated clusters. The Silhouette definition for a single point was proposed b(i)−a(i) by Rousseeuw [10] as: s(i) = max(b(i),a(i)) , where a(i) is the average distance from object i to all other objects in the same cluster and b(i) is the minimum of the average distance of object i to all objects in each one of the other clusters. From the definition it is easy to see that −1 < s(i) < 1. A s(i) value close to 1 means that point i was correctly clustered, while a s(i) value near (−1) means that point i should belong to other cluster. The average silhouette (AS from here on) is calculated over all points. This index has been previously used as an accurate indicator of the goodness of a clustering solution [11], in particular in the evaluation of solutions with different number of clusters. We proposed in a previous work [4] to use AS as a measure of the quality of the metric produced by PKNNG when using different k values. 2.2
Scale Factor
Our objective is to use the AS index to find which is the best value of k to represent our data. Then, given a range of values of k, we can cluster the data (using a fixed number of clusters) to retain the k that maximizes the AS index. But, in this setting, the AS depends on two different factors. When we change the parameter k we obtain a varying number of components, because as the number of neighbours raises, the number of components diminishes. These components are later connected via penalized weights to obtain the PKNNG graph. In the definition of the PKNNG metric [3] we proposed to penalize edges using: w = d ed/µ , where d refers to the distance between a pair of vertices and µ is a scale factor. We proposed to set µ as the mean value of the distribution of distances obtained from the k-neighbours, a value that is dependent on the value of k. Then, a change in k will affect the value of µ, which in turn would also change the value of some edges. AS measures both effects at the same time, the change in the structure of the graph and the change in the penalization µ. Simulation on artificial datasets (not included due to lack of space) show that this behavior constitutes a problem. To solve this issue we need to use a new scale factor µ that it is also derived from the data but it is invariant with k. To this end we propose here to set µ as the mean distance to the first neighbor. This idea was validated on artificial dataset and the obtained results where very positive (these results where also not included due to lack of space).
3 3.1
Experimental Evaluation Settings
In this section we present experiments on six real gene expression datasets. The first five datasets from Table 1 describe genetic profiles, where the task at hand is
Parameter-Free PKNNG Metric
53
to cluster samples. In this case instances represent a general phenotype (different tissues or diseases). In the remaining dataset the objective is to cluster a group of genes of Yeast that are defined based on their functional properties. In all six cases we have the true classes of the data available, which we only use to establish the quality of our automatic non-supervised rule to select k and to compare the results of our automatic PKNNG metric combined with simple clustering algorithms against a group of state-of-the-art clustering methods. We combined PKNNG with three simple, well-known and widely used clustering algorithms: PAM and Hierarchical Clustering (HC) with average linkage rule (HC-Av) and complete linkage rule (HC-Co). We used the Euclidean metric as base metric for PKNNG. Following Monti et al. [12], we normalized all datasets by row to mean zero and standard deviation one (for each sample we normalized genes for the first five datasets and experimental conditions for the last one). Then, to report our results, we used a sub-sample and cluster approach. From each dataset we generated 100 random samples with 95% of the data and then we applied PKNNG combined with a clustering algorithm. We also applied this same setting to the state-of-the-art algorithms used for comparison. In this analysis we always used the correct number of clusters. We did not validate nor analyse this number because it is not relevant to the goal of this work. In a first analysis we evaluated the performance of the modified PKNNG metric combined with simple algorithms, comparing its performance with the same algorithms combined with the base metric. Next, to give a context to our results, we made a comparison with state-of-theart methods. We selected four clustering algorithms to compare with our proposal of using a simple clustering method as PAM or HC with the new PKNNG metric. The methods, already discussed in the Introduction, are Spectral clustering, EAC, Multi-K and MBC. Table 1. Six real gene expression Dataset publicly available. We detail the number of examples (n), the number of variables (p) and the class distribution.
Dataset AML-ALL [1] (ALB) Alizadeh [2] (ALI) Leukemia [13] (LEU) Novartis Tissue [14] (BCLP) CNS Tumors [15] (CNS) Yeast [16] (Y)
3.2
n Classes 38 3 (11-8-19) 62 3 (42-9-11) 248 6 (43-27-15-79-20-64) 103 4 (26-26-28-23) 48 5 (10-8-10-10-4) 208 4 (41-121-35-11)
p 999 1000 985 1000 1000 79
Results
In this subsection we present the results of our modified non-supervised rule applied to the gene expression data. Each row of Figure 1 presents one of the datasets from Table 1 and each column is one of the clustering algorithms selected
54
A.E. Bay´ a and P.M. Granitto
to be combined with the PKNNG metric, namely HC- Average Linkage, HCComplete Linkage, and PAM. In all cases we study the same range of k [2-7]. This interval contains the best possible solution in all cases for all the tested algorithms. For each sub-sample we record its maximum AS index and the corresponding k value. We then mark the k where the maximum AS occurred as best solution for that particular subsample and finally use that value of k to calculate the cRand Index [17]. Based on this procedure we generated two boxplots, the green one which is the result of measuring AS index to every sub-sampled solution in the interval [2-7] and the black boxplot that informs the cRand values. There is an extra value at the end of the X axis labeled as (A) that reports the values for both the maximum AS index (colour green) and the corresponding value of cRand (colour black) for a given maximum AS. If we follow the green plot it is easy to see that in (A) we obviously have the maximum AS for all runs. The interesting result is that the cRand value in (A) is always maximum, which indicates that we are obtaining better solutions than any particular k value, which indicates that our selection rule for k under a fixed µ is optimal. Figure 2 shows the results of the comparison with state-of-the-art methods. In the figure we included the results of clustering with PAM, HC-AL and HCCL using the Euclidean base metric. These results where included to show that we are obtaining a better representation of the data than the one we get from only applying the base metric. The figure also shows two interesting facts: i) when we compare the results in all datasets when using PKNNG combined with PAM, HC-Av and HC-Co we can see similar good performances with all clustering algorithms and, ii) an analysis of all the methods in the figure and all datasets show that PKNNG combined with the selected clustering methods perform similarly well as the selected state-of-the-art methods.
4
Concluding Remarks
In this work we proposed a simple modification of the PKNNG metric, the use of a fixed µ. Using this modified metric and the AS measure of quality, we showed that we can select automatically the number of neighbors k in the PKNNG metric, leading to better clustering results. With this modification, the PKNNG metric becomes a parameter-free method, with optimal clustering results and the ease-of-use of such methods. Also, we applied the modified metric for the first time to high-dimensional gene-expression data. In this context, we compared the PKNNG metric combined with a group of simple and commonly used clustering algorithms against two groups of clustering methods. First, we compared simple algorithms with the Euclidean metric and with the improved PKNNG metric, showing the clear advantage of using PKNNG. Second, we made another comparison with four state-of-the-art clustering algorithms, finding that our improved metric combined to a simple and commonly used clustering method is comparable to other state-of-the-art methods.
2
3
4
5
6
7
A
1.00 0
0.25
0.50
0.75 0.50 0.25 0
0
0.25
0.50
0.75
cRand Distribution AS Distribution
2
3
4
5
6
7
A
5
6
7
A
6
7
A
A
A
7
A
4
5
6
7
A
6
7
A
5
6
7
A
0.75 7
A
2
3
4
5
6
7
A
0
0.25
0.50
0.75
1.00
(o) BCLP HC-Co
1.00 A
4
0.50 6
0.50 7
3
1.00 5
0.25 6
A
0.50 4
0 5
7
0.25 3
0.75
1.00 0.75 0.50 0.25
4
(p) Y PAM
2
(n) BCLP HC-Av
0
3
6
0 2
(m) BCLP PAM
2
5
0.75
1.00 0.50 0.25 0 5
4
(l) LEU HC-Co
0.75
1.00 0.75 0.50 0.25
4
3
(i) CNS HC-Co
0.25 3
(k) LEU HC-Av
0
3
2
0 2
(j) LEU PAM
2
A
1.00 6
1.00 7
7
0.75 5
0.50 6
6
0.25 4
0.25 5
5
0 3
(h) CNS HC-Av
0 4
4
1.00
2
0.75
1.00 0.75 0.50 0.25
3
3
0.50
0.75 0.50 7
0 2
2
(f) ALI HC-Co
0.25 6
A
1.00 5
0 5
7
0.50 4
1.00
1.00 0.75 0.50 0.25
4
(g) CNS PAM
6
0.25 3
(e) ALI HC-Av
0
3
5
0 2
(d) ALI PAM
2
4
0.75
1.00 0.75 0.25 0 4
3
(c) ALB HC-Co
0.50
0.75 0.50 0.25 0
3
2
(b) ALB HC-Av
1.00
(a) ALB PAM
2
55
0.75
1.00
1.00
Parameter-Free PKNNG Metric
2
3
4
5
6
(q) Y HC-Av
7
A
2
3
4
5
6
7
A
(r) Y HC-Co
Fig. 1. As k varies the different panels [a-r] show how the AS and cRand indexes evolve. Each row of the figure shows a different dataset and each column represents a different clustering algorithm. The Y axis of each panel is in the interval [0 − 1]. The black curve represents the cRand index while the green curve is the AS index. The X axis presents different values of k used to generate the k-nn graph.
A.E. Bay´ a and P.M. Granitto
0.75 0.25
0.50
PAM Euc. HC−Av Euc. HC−Cl Euc. PAM Pknng HC−Av Pknng HC−Cl Pknng MBC Spectral EAC−av Multi−K
0
CRand Distribution
1.00
56
ALB
ALI
CNS
LEU
BCLP
Y
Fig. 2. Comparison of different clustering methods using Eclidean Distane as Base Metric. The Y axis shows the cRand index distribution for the different clustering methods. The X axis presents the 6 gene expressions Datasets. For each Dataset we present a group of 10 bars to compare quality of solutions between the clustering algorithms.
We plan to extend the evaluation of the PKNNG metric to an extended set of gene-expression datasets.
Availability An R package with the metric can be downloaded at: http://www.cifasis-conicet.gov.ar/granitto/nng_0.0.1.tgz. Acknowledgments. We thank partial support to this work from ANPCyT (PICT 237/2008).
References 1. Golub, T.R., Slonim, D.K., et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439), 531–537 (1999) 2. Alizadeh, A.A., Eisen, M.B., et al.: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403(6769), 503–511 (2000)
Parameter-Free PKNNG Metric
57
3. Bay´ a, A.E., Granitto, P.M.: Clustering gene expression data with a penalized graph-based metric. BMC Bioinformatics 12, 2 (2011), http://www.biomedcentral.com/1471-2105/12/2 4. Bay´ a, A.E., Granitto, P.M.: Improved graph-based metrics for clustering highdimensional datasets. In: Kuri-Morales, A., Simari, G.R. (eds.) IBERAMIA 2010. LNCS (LNAI), vol. 6433, pp. 184–193. Springer, Heidelberg (2010) 5. Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: Analysis and an algorithm, Vancouver, British Columbia, Canada, December 3-8. Advances in Neural Information Processing Systems, vol. 14, pp. 849–856. MIT Press, Cambridge (2001) 6. Fred, A., Jain, A.K.: Combining Multiple Clusterings Using Evidence Accumulation. IEEE Trans. Pattern Anal. Mach. Intell. 27(6), 835–850 (2005) 7. Kim, E.-K., Kim, S.-Y., et al.: MULTI-K: accurate classification of microarray subtypes using ensemble k-means clustering. BMC Bioinformatics 10(1) (2009) 8. Yeung, K., Fraley, C., et al.: Model-based clustering and data transformations for gene expression data. Bioinformatics 17(10), 977–987 (2001) 9. Thalamuthu, A., Mukhopadhyay, I., et al.: Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics 22(19), 2405 (2006) 10. Rousseeuw, P.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987) 11. Kaufman, L., Rousseeuw, P.: Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, New York (1990) 12. Monti, S., Tamayo, P., et al.: Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data. Machine Learning 52(1-2), 91–118 (2003) 13. Yeoh, E.-J., Ross, M.E., Shurtleff, S.A., et al.: Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell 1, 133–143 (2002) 14. Su, A.I., Cooke, M.P., et al.: Large-scale Analysis of the Human and Mouse Transcriptomes. Proceedings of the National Academy of Sciences USA 99, 4465–4470 (2002) 15. Pomeroy, S., Tamayo, P., et al.: Gene Expression-Based Classification and Outcome Prediction of Central Nervous System Embryonal Tumors. Nature 415, 436–442 (2002) 16. Eisen, M.B., Spellman, P.T., et al.: Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences USA 95, 14863–14868 (1998) 17. Milligan, G., Cooper, M.: A study of comparability of external criteria for hierarchical cluster analysis. Multivariate Behavioral Research 21, 441–458 (1986)
Efficiently Querying Protein Sequences with the Proteinus Index Felipe Alves da Louza1 , Ricardo Rodrigues Ciferri2 , and Cristina Dutra de Aguiar Ciferri1 1
2
Department of Computer Science, University of S˜ ao Paulo, 13.560-970, S˜ ao Carlos, SP, Brasil {louza,cdac}@icmc.usp.br Department of Computer Science, Federal University of S˜ ao Carlos, 13.565-905, S˜ ao Carlos, SP, Brasil
[email protected]
Abstract. Finding similarities in protein sequences is a core problem in bioinformatics. It represents the first step in the functional characterization of novel protein sequences, and is also employed in protein evolution studies and for predicting biological structure. In this paper, we propose Proteinus, a new index aimed at similarity search of protein sequences. Proteinus is characterized by using a reduced amino acid alphabet to represent protein sequences and also by providing a persistent storage of the index on disk, as well as by allowing the execution of range queries. Performance tests with real-world protein sequences showed that the Proteinus index was very efficient. Compared with the BLASTP tool, Proteinus provided an impressive performance gain from 45% up to 93% for range query processing. Keywords: Protein Sequences, Similarity Search, Index Structure.
1
Introduction
Protein sequence databases (PSDs) store data related to the primary structure of proteins and their amino acids chains, which are composed of strings that represent the 20 molecules present in proteins (i.e. the amino acid alphabet p Σ20 = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}) [3]. With the technological advances in molecular biology, the amount of protein sequences collected and available for analysis has increased very fast. Finding similarities in protein sequences is a core problem in bioinformatics. It represents the first step in the functional characterization of novel protein sequences, and is also employed in protein evolution studies and for predicting biological structure. In the literature, tools widely used to search for similarities in protein sequences are BLASTP, PSI-BLAST and PHI-BLAST [1], which are characterized by performing a sequential scan in the whole database to answer similarity queries. Another research area improves the performance of similarity search by using indices, which are data structures specially designed to reduce the search O. Norberto de Souza, G.P. Telles, and M.J. Palakal (Eds.): BSB 2011, LNBI 6832, pp. 58–65, 2011. c Springer-Verlag Berlin Heidelberg 2011
Efficiently Querying Protein Sequences with the Proteinus Index
59
space, leading the search to portions of the database where the stored sequences probably have higher similarity with the query. In this context, indices based on approximations can help to provide quick search for protein sequences in PSDs. An important type of query frequently used to search for similarities in protein sequences is the range query. Consider that S = {s1 , s2 , .., sd } is a PSD composed of d protein sequences, sq represents a protein sequence query, and d(si , sq ) represents the distance between a protein sequence si stored in S and sq . Given a query radius rq , the range query retrieves every protein sequence si ∈ S that satisfies the condition d(si , sq ) ≤ rq . An example is: “Select the protein sequences that are similar to the sequence HIY W GKED of the Vaccinia virus organism by up to five similarity units”. In this paper, we focus on the efficient processing of range queries over PSDs. We propose Proteinus (acronym for protein sequence similarity search), a new protein sequence persistent index based on approximations. To index protein sequences, Proteinus introduces two major characteristics, as described as follows. – It uses a reduced amino acid alphabet. To this end, we apply the reduction technique proposed by Li et al [7], which preserves the maximum information on the original protein sequence and brings along the weights used for calculating protein distances. – It is aimed at the persistent storage of the index on disk. To this end, Proteinus extends a previous data structure proposed by our research group, called nsP-index [4], which is an efficient disk-based index that uses approximations to search for nucleotide sequences in biological databases. This paper is organized as follows. Section 2 surveys related work, Section 3 details the background, Section 4 describes the proposed Proteinus index, Section 5 discusses the experimental results, and Section 6 concludes the paper.
2
Related Work
In the literature, there are a number of approaches that have been proposed for indexing sequences, but they differ from our work on their purpose. On the one hand, approaches aimed at the protein domain also focus on investigating threedimensional structures [10,8]. Despite the importance of this area of research, similarity search in protein sequences is also a core challenge, as highlighted in Section 1. The Proteinus index that we propose in this paper is aimed at similarity search in protein sequences, aiming at avoiding a sequential scan in the whole PSD, such as that performed by the BLASTP tool. On the other hand, most of the approaches are aimed at indexing nucleotide sequences [5,9,4,2], which are composed of strings that represent the nitrogen bases adenine (A), cytosine (C), guanine (G), and thymine (T). Differently from these approaches, Proteinus aims at indexing protein sequences. In detail, the related approaches do not focus on the amino acid alphabet, which is bigger than the nucleotide alphabet. Also, PSDs are very different from nucleotide sequences databases, as they store a greater number of protein sequences whose sizes are
60
F.A. Louza, R.R. Ciferri, and C.D.A. Ciferri
usually much smaller than the sizes of nucleotides. An average sequence protein contains 353 amino acids1 , but some are much smaller (i.e. from 30 to 40 amino acids). Furthermore, it is necessary to use a distance function different from the Edit Distance to calculate the similarity among protein sequences, which must be based on the PAM (Percent Accepted Mutation) or the BLOSUM (BLOcks SUbstitution Matrix) matrices [6]. Thus, these related approaches can not be applied to index protein sequences without addressing these issues, which are tackled by our proposed Proteinus index.
3
Background
3.1
Similarity Search in Protein Sequence Databases
The distance function most used to compare similarity between two sequences s1 and s2 is the Edit Distance, which is defined as the minimum number of edit operations (i.e. insert, delete and replace) required to transform s1 into s2 . Due to the characteristics of protein sequences, the Edit Distance function used to compare them also must consider weights related to amino acid exchanges, which are obtained by replacing matrices, such as the PAM and the BLOSUM. The PAM and the BLOSUM focus on the complete amino acid alphabet, which p is composed of the 20 molecules present in proteins, i.e. Σ20 = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}. However, in this paper, we base our proposed Proteinus index in a reduced amino acid alphabet. This idea was first introduced in [9], but this work described its proposal only analytically, as neither details of the index data structure or an algorithm for range queries were presented. Also, according to Li et al [7], using a reduced alphabet in protein sequence comparisons avoids the need of using a replacing matrix, as detailed as follows. The amino acid alphabet can be reduced to a new alphabet Σkp , which contains k symbols, by grouping amino acids based on their physical and chemical characteristics into k disjoint subsets containing similar amino acids. Li et al p proposed and compared several reductions on the alphabet Σ20 by seeking to maximize a similarity score derived from the BLOSUM62 matrix. They conp cluded that a reduced alphabet Σ10 , which is composed by 10 groups, preserves the maximum information on the original protein sequence and brings along the weights used for calculating protein distances, making the information in the reduced protein closer to that consisting of 20 amino acids. Therefore, we base our proposal on this fixed alphabet, which is described in Equation 1. p Σ10 = {C , F , M , I , G , P , A , N , Q , R }, such that
(1)
C = {C}, F = {F, Y, W }, M = {M, L}, I = {I, V }, G = {G}, P = {P }, A = {A, T, S}, N = {N, H}, Q = {Q, E, D}, R = {R, K} 1
http://ca.expasy.org/sprot/relnotes/relstat.html
Efficiently Querying Protein Sequences with the Proteinus Index
3.2
61
The nsP-Index
The nsP-index [4] extends the MRS-index [5] to be a persistent index aimed at similarity search of nucleotide sequences. Let Sn = {s1 , s2 , .., sx } be a database composed of x nucleotide sequences and w = 2a be the length of the shortest query sequence. The nsP-index’s data structure stores a set of elements Ti,j on disk, where a ≤ i ≤ b (b = a + L − 1), and 1 ≤ j ≤ x. The value of a is computed from w, while the parameter L is the maximum number of resolution levels in the index. Each element Ti,j indexes the j th sequence with resolution 2i . In order to obtain Ti,j , a window of length 2i is slid on the sequence sj , starting from its leftmost point and ending at its rightmost point. For each placement of the window, a subsequence of sj is mapped into a point using a wavelet-based method, and this point is encompassed by a minimum bounding rectangle (MBR). The c consecutive points are covered using the same MBR of the current disk page. When this page is full, it is written on disk, and a new MBR is created to cover the next c subsequences. Note that the nsP-index uses disk pages of fixed sizes, and that an MBR stores only two points that delimit it, as well as the initial position of its first subsequence. The nsP-Index organizes MBRs and elements Ti,j on disk pages. An element Ti,j is stored in k pages that, altogether, store all the MBRs of Ti,j . Each disk page stores MBRs related to only one element Ti,j . Also, disk pages related to Ti,j are stored sequentially in the index file, thus improving the search for MBRs of Ti,j by avoiding disk accesses. In fact, the sequence of the written MBRs follows closely the order they are scanned in the nsP-index’s range query. Finally, to reduce the number of disk accesses in range query processing, the nsP-index uses a buffer-pool, which is a main memory cache used for the temporary storage of disk pages related to elements Ti,j resident on disk.
4
The Proposed Proteinus Index
In this section, we describe Proteinus (acronym for protein sequence similarity search), a persistent and approximation-based index to process protein sequence similarity queries over PSDs. Figure 1 depicts the principles used by Proteinus. p A protein sequence represented by the complete amino acid alphabet Σ20 is extracted from the PSD (Figure 1a), transformed into a protein sequence reprep (Figure 1b), and then indexed sented by the reduced amino acid alphabet Σ10 by Proteinus according to its index organization (Figure 1c). Proteinus inherits all the characteristics of the nsP-index. Therefore, details about the data structure and the building and querying algorithms are presented in [4]. As the Proteinus extends the nsP-index to carry on protein sequences, in this section we only focus on the extensions introduced by Proteinus. The proposed Proteinus index focuses on five tasks, as described as follows. 1. It extends the index data structure to handle the protein sequence alphabet. 2. It defines a mapping function to transform a subsequence sj into a multidimensional point.
62
F.A. Louza, R.R. Ciferri, and C.D.A. Ciferri
Fig. 1. Overview of Proteinus’s principles
3. It defines an approximated distance function among the transformed points, which is a lower bound of the Edit Distance function. 4. It adapts the index data structure to handle missing elements Ti,j . 5. It provides a range query routine to support the new characteristics of the index data structure. To accomplish task 1, we designed the Proteinus’s data structure to support p the reduced alphabet Σ10 = {C , F , M , I , G , P , A , N , Q , R } described in Equation 1, instead of using the nucleotide alphabet Σ4n = {A, C, G, T } as the nsP-index does. It is worth to say that we used a strategy to design the Proteinus’s data structure that maintained the same requirements as the nsP-index to store the multidimensional points. Proteinus decreases the storage requirements through the use of a shorter data type to represent a protein sequence, since protein sequences are much smaller than nucleotide sequences. Regarding tasks 2 and 3, the mapping and the approximated distance functions should be easily computable to not affect the Proteinus index’s performance. Also, the distance function should take into account the weights provided by replacing matrices, i.e. should be a lower bound of the Edit Distance function. Using the wavelet-based method inherited from the nsP-index together with the p alphabet Σ10 described in Equation 1, we accomplished tasks 2 and 3, as the reducing process brings along the weights used for calculating the Edit Distance function as well as incorporates the similarities among amino acids. Task 4 is required since the protein sequence sizes greatly vary. Thus, not all the sequences can be indexed by Proteinus in all the resolutions. To this end, we extended the Proteinus’s data structure to allow for the manipulation of missing elements Ti,j , as represented by the black boxes in Figure 1c. When a sequence sj ∈ P SD is shorter than a window 2i , an empty element Ti,j is generated. In this case, the Proteinus’s manager Buffer stores in its metadata that this empty element is missing and does not persist the related page on disk. To accomplish task 5, we extended the nsP-index’s range query algorithm to support the manipulation of missing elements Ti,j in the proposed Proteinus’s data structure as follows. When a query sequence sq is larger than a sequence sj ∈ P SD, our search algorithm bypass sj in similarity comparisons. Also, the use of approximations, such as MBRs, introduces loss of accuracy in the exact
Efficiently Querying Protein Sequences with the Proteinus Index
63
representation of protein sequences. Thus, approximation-based indices should provide both a filter phase that identifies candidates that are possible answers and a refinement phase that further analyzes these candidates to determine which ones are correct answers. To this end, we also developed a simple post-processing routine to eliminate false answers, which calculate the edit distance between sq and each candidate. As a result, the use of the refinement phase enables the Proteinus index to accurately answer range queries.
5
Performance Tests
5.1
Workbench
The advantages of the Proteinus index were investigated through performance tests that compared this proposed index with the BLASTP tool, taking into account the elapsed time in seconds to build the index and to process range queries. Although Proteinus is not a complete alignment strategy as the BLASTP, we extended Proteinus with a refinement phase to allow comparisons. In the tests, we used real-world protein sequences obtained from Uniprot2 and created several PSDs containing increasing volumes of protein sequences (i.e. 4 GB, 6 GB, 8 GB and 10 GB). We performed range queries by submitting 4 different sizes of queries to each PSD. In detail, we used the sizes of 128, 256, 512 and 1,024 amino acids. For each query size, we issued 5 different protein sequences and took the average of the measures. We used the error ε = 0.005 to calculate the query radius r, such that ε = r/|q| and |q| is the query size. Proteinus was implemented in C++. We used 4 resolutions with window sizes of 32, 64, 128 and 256 amino acids, and set the MBR capacity to 80 subsequences. The buffer-pool size was set to 200 MB and the disk page size was set to 672 bytes, as we found experimentally that these values were adequate for indexing protein sequences. Also, we implemented the Smith-Waterman algorithm [6] to calculate the edit distance between sq and each candidate in the Proteinus’s refinement phase to process range queries. The experiments were conducted on a computer with an Intel Core i7 2.67 GHz processor, 12 GB of main memory, 2 SATA 1 TB hard disks and Linux Ubuntu 9.04. We employed the BLASTP version 1:2.2.21.20090809-1. 5.2
Performance Results
Figure 2a shows the time spent by Proteinus to build the indices and the BLASTP to create its data structure for each database. Regarding the BLASTP, this time refers to the execution of the command formatdb. The time spent by Proteinus was higher (43% on average) than that spent by the BLASTP because Proteinus performs a better organization of the protein sequences, aiming at providing best query performance results. An advantage of Proteinus is that its building cost can be compensated over time. Since Proteinus is a persistent 2
http://www.uniprot.org/
64
F.A. Louza, R.R. Ciferri, and C.D.A. Ciferri
Fig. 2. The Proteinus index versus the BLASTP tool
Fig. 3. The Proteinus index: costs of the filter and the refinement phases
index, it does not need to be rebuilt after each system shutdown, and therefore the cost of query processing quickly surpasses the cost of building Proteinus. Also, Proteinus proved to be scalable in the construction of different PSD sizes, since it showed a linear growth in response to increasing data volumes. Regarding the range query processing, Proteinus provided an impressive performance gain that ranged from 45% up to 93% over the BLASTP, as shown in Figure 2b. In fact, the larger the dataset, the better the performance gain of Proteinus. Also, Proteinus proved to be scalable also in the range query processing, since it showed a linear growth in response to increasing volumes of data (continuous line). Conversely, the time spent by the BLASTP grew up very fast as the data volume increased (dashed line). These results corroborate the fact that our index’s organization is specially designed to handle protein sequences. Figure 3 shows the percentage of the total elapsed time spent by Proteinus to process the filter and the refinement phases separately. As can be noted, the larger the dataset, the smaller the impact of the refinement phase’s cost. This is due the fact that the range queries returned a small number of candidates, and this number remained almost the same as the data volume increased. We can conclude that Proteinus provided a good filter phase, which highly reduced the costly refinement phase.
Efficiently Querying Protein Sequences with the Proteinus Index
6
65
Conclusions and Future Work
In this paper, we proposed Proteinus, a new index based on approximations to process range queries over protein sequence databases. The Proteinus index transforms protein sequences represented by the complete amino acid alphabet into equivalent protein sequences represented by a reduced amino acid alphabet, and then indexes these transformed sequences according to its index organization, persistently storing its data structure on disk. Performance tests with real-world protein sequences showed that Proteinus was very efficient, providing an impressive performance gain that ranged from 45% up to 93% over the BLASTP tool for range query processing. The proposed Proteinus index is able to use any reduced amino acid alphabet to transform protein sequences. Therefore, we are currently investigating the influence of different reduced amino acid alphabets on the performance of our index structure. Also, Proteinus can be used as an efficient filter phase to select a reduced set of protein sequences, which will be further analyzed by any postprocessing tool. We also plan to investigate this issue. Acknowledgments. This work has been supported by the following Brazilian research agencies: FAPESP, CNPq, CAPES, INEP and FINEP.
References 1. Altschul, S.F., Madden, T.L., Sch¨ affer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389–3402 (1997) 2. Barsky, M., Stege, U., Thomo, A.: A survey of practical algorithms for suffix tree construction in external memory. Softw. Pract. Exper. 40(11), 965–988 (2010) 3. Baxevanis, A.D., Ouellette, B.F.F.: Bioinformatics: a practical guide to the analysis of genes and proteins. Wiley-Interscience, Hoboken (2005) 4. Ciferri, R.R., Ciferri, C.D.A., Car´elo, C.C.M., Traina Jr., C.: nsP-index: a robust and persistent index for nucleotide sequences. In: ADBIS, pp. 28–41 (2008) 5. Kahveci, T., Singh, A.K.: Efficient index structures for string databases. In: VLDB, pp. 351–360 (2001) 6. Korf, I., Yandell, M., Bedell, J.: BLAST. O’Reilly, Sebastopol (2003) 7. Li, T., Fan, K., Wang, J., Wang, W.: Reduction of protein sequence complexity by residue grouping. Protein Eng. 16(5), 323–330 (2003) 8. Novosad, T., Sn´ asel, V., Abraham, A., Yang, J.Y.: Searching protein 3-D structures for optimal structure alignment using intelligent algorithms and data structures. IEEE TITB 14(6), 1378–1386 (2010) ¨ urk, O.: Feature Extraction and Similarity-based Analysis for Proteome and 9. Ozt¨ Genome Databases. Ph.D. thesis, The Ohio State University (2007) 10. Shibuya, T.: Geometric suffix tree: indexing protein 3-D structures. J. ACM 57(3), article 15 (2010)
SciPhy: A Cloud-Based Workflow for Phylogenetic Analysis of Drug Targets in Protozoan Genomes* Kary A.C.S. Ocaña1, Daniel de Oliveira1, Eduardo Ogasawara1,2, Alberto M.R. Dávila3, Alexandre A.B. Lima1, and Marta Mattoso1 1
Computer Science, COPPE, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil 2 Federal Center of Technological Education, Rio de Janeiro, Brazil 3 Laboratory of Computational and Systems Biology, IOC, FIOCRUZ, Rio de Janeiro, Brazil {kary,danielc,ogasawara,assis,marta}@cos.ufrj.br,
[email protected]
Abstract. Bioinformatics experiments are rapidly evolving with genomic projects that analyze large amounts of data. This fact demands high performance computation and opens up for exploring new approaches to provide better control and performance when running experiments, including Phylogeny/Phylogenomics. We designed a phylogenetic scientific workflow, named SciPhy, to construct phylogenetic trees from a set of drug target enzymes found in protozoan genomes. Our contribution is the development, implementation and test of SciPhy in public cloud computing environments. SciPhy can be used in other Bioinformatics experiments to control a systematic execution with high performance while producing provenance data. Keywords: Phylogeny, Protozoa, Scientific Workflow, Cloud computing.
1 Introduction Protozoan species are microscopic unicellular eukaryotes, some of them being pathogenic and causing severe illnesses in humans and animals. In a set of ten diseases defined as research priorities by the World Health Organization’s Special Program for Research and Training in Tropical Diseases (http://www.who.int/tdr), four of them are caused by protozoan parasites (malaria, leishmaniasis, Chagas disease, and African Trypanosomiasis). Genomics and proteomics have created a new paradigm for drug, vaccine and diagnostic discovery process. Bioinformatics plays a key role in exploiting biological data to identify potential drug targets and to gain insights about molecular mechanisms that underlie diseases [1]. One strategy used in genomics, which can be extrapolated for drug target discovery/identification, involves the analysis of the function of sequences of determined organism. Here, the unknown sequences can often be accurately inferred by identifying other sequences homologous to them that were properly annotated. Given a query set of new sequences or genes, there are at least four approaches to infer similarity and occasionally homology: (i) local sequence *
This work was partially sponsored by CAPES, FAPERJ and CNPq.
O. Norberto de Souza, G.P. Telles, and M.J. Palakal (Eds.): BSB 2011, LNBI 6832, pp. 66–70, 2011. © Springer-Verlag Berlin Heidelberg 2011
SciPhy: A Cloud-Based Workflow for Phylogenetic Analysis of Drug Targets
67
comparisons (Blast), (ii) motif/signature analysis, (iii) profile-based similarity (PSSM and HMM), and (iv) phylogeny/phylogenomics. All these approaches are computationally intensive and high performance computing (HPC) solutions. Also, most of the existing approaches for Phylogeny and Phylogenomics are based on individual scripts or independent applications, with results that may become out of control, especially when collecting large provenance data related to each script analysis. To address these problems, these scripts can be modeled as scientific workflows (e.g. a workflow for phylogenetic analysis), which is an abstraction that allows the structured composition of programs as a flow of activities. This flow can be further orchestrated by Scientific Workflow Management Systems (SWfMS). W.A.T.E.R.S [2] is an initiative that uses scientific workflows. W.A.T.E.R.S adds more control to the experiments but is not capable of executing workflows in parallel, thus keeping the performance problem. In this paper, we propose SciPhy, which uses a scientific workflow with a middleware to support HPC providing provenance gathering mechanisms. SciPhy was designed to construct phylogenetic trees and it was executed using, as input, a set of potential drug target enzymes found in protozoan genomes. We have implemented and executed SciPhy to construct the phylogenetic trees for the PDTD enzymes targeting: (i) the identification of potential drug target enzymes in all protozoan genomes available in RefSeq (http://www.ncbi.nlm.nih.gov/RefSeq/) and ProtozoaDB [3] databases; and (ii) the inference of phylogenetic relationships of the potential drug target enzymes found in protozoan genomes. PDTD – Potential Drug Target Database (http://www.dddc.ac.cn/pdtd/) is a comprehensive database of drug targets focusing on those with known 3D-structures. We have used PDTD enzymes since they enable the analysis of phylogenetic relationships of potential drug target enzymes found in sequenced/annotated protozoan genomes. The remainder of this paper is organized as follows. Section 2 presents the SciPhy workflow while Section 3 shows the implementation of this workflow using SciCumulus. Section 4 discusses experimental results and Section 5 concludes.
2 SciPhy: A Phylogenetic-Based Scientific Workflow This section presents the specification of the phylogenetic analysis workflow. The four general activities of SciPhy are presented in Fig.1. The first activity executes SciHmm workflow to obtain best hmmpfam hits using HMMER3. The protein sequences related to each 17 enzyme groups are obtained from UniProt. Each set is aligned using MAFFT v5.861. Then, Hidden Markov Models profiles are constructed (hmmbuild) and they are queried (hmmsearch) against all protozoan genomes (137 genomes) available in Refseq (13-Jan-2011) with e-value “1e-20” as cut-off. Then, SciPhy executes the next three activities. In the second activity, individual multiple alignments of the best hmmsearch hits of each 17 drug target enzymes are built with MAFFT v5.861. Each individual alignment is tested at the third activity to find the best evolutionary model using Modelgenerator 0.85 and both of them (alignment and evolutionary model) are used in the fourth activity to generate the individual phylogenetic trees using RAxML-7.2.8-ALPHA with 100 bootstrap replicates.
68
K.A.C.S. Ocaña et al.
The PDTD enzymes are (sum of all 17 enzymes, because for two of them – A, B chain enzymes – were included): 6-phosphogluconate Dehydrogenase, DeltaAminolevulinic Acid Dehydratase, Falcipain-2, Dihydrofolate Reductase, Dihydroorotate Dehydrogenase, Glutathione S-transferase, Glycerol-3-phosphate dehydrogenase, Hypoxanthine-Guanine Phosphoribosyltransferase, Tubulin, Inosine-5'-monophosphate Dehydrogenase, L-lactate Dehydrogenase, Protein farnesyl transferase, Purine Nucleoside Phosphorylase and Trypanothione Reductase.
Fig. 1. SciPhy: phylogenetic scientific workflow
3 SciPhy Workflow Implementation and Execution During the course of a phylogenetic analysis there may be many preliminary analyses to be performed, causing the execution of the same workflow many times and these executions cannot be processed in a feasible time not even by a small cluster. To produce results in a reasonable time, scientists improve the performance of the workflow by applying parallel programming and specialized environments. Computing clouds [4] have recently emerged as an alternative environment where resources are obtained on demand. Using clouds scientists are not required to assemble computational infra-structure or even configure many pieces of software. To benefit from it, we chose the SWfMS VisTrails (http://www.vistrails.org/) and SciCumulus [5]. SciCumulus is a middleware that is coupled to a SWfMS that distributes and controls the execution of activities in clouds. It executes activities on a set of virtual machines in clouds such as Amazon EC2 (http://aws.amazon.com/ec2/). SciCumulus also stores provenance data. In this paper, it was deployed and executed on top of Amazon EC2. The execution of the SciPhy workflow produced a total of 17 phylogenetic trees (unpublished data). Due to space restriction, we explain one of the PDTD enzymes used in this study that was the 6-phosphogluconate Dehydrogenase, decarboxylating (PGD), which is the third enzyme of the pentose phosphate pathway, considered as a
SciPhy: A Cloud-Based Workflow for Phylogenetic Analysis of Drug Targets
69
potential drug target for Human African Trypanosomiasis. While previous PGD studies were focused mainly on Trypanosoma, Trichomonas and Leishmania, we present an exploratory analysis (in 137 protozoan genomes) that shows the evolutionary and phylogenetic relationships of PGD. Since PGD was found in several other protozoan species, by exploring more protozoan genomes, it might be possible to: (i) infer the evolutionary life of the candidate drug target enzymes in protozoan, (ii) extrapolate and better understand the biological mechanisms of these enzymes in all protozoan species (those better studied and those not yet explored), and (iii) open new alternatives for searching of drugs for (human or other hosts) pathogenic protozoan. The protozoan PGD phylogeny shows more representative monophyletic clades formed by the following genus: Trichomonas, Leishmania, Trypanosoma, Theileria, and Plasmodium (with bootstrap values above 80). Other representative taxa were: Phaeodactylum, Thalassiosira, Toxoplasma, and Giardia. Finally, there were taxa annotated as hypothetical, putative or predicted belonging to Monosiga, Thalassiosira, and Plasmodium. This exploratory analysis compared 17 pHMMs, against 137 genomes (taxids), and the total number of sequences associated to these 137 genomes is 386,611 sequences. It means that SciCumulus performed 17 × 386,611 = 6,572,387 comparisons. Based on these 6,572,387, HMMER3 produced 17 files containing the best hits (SciHmm). For each one of the 17 files we executed SciPhy workflow for producing the trees. The tree generation (17 trees) was distributed into the multi-virtual machine cloud environment which leads to 76.05 hours of total processing time when using 8 cores.
4 Conclusions Phylogenetics and phylogenomics pipelines are required in bioinformatics to explore genomes and to infer evolutionary and phylogenetics relationships of a set of sequences/genes/enzymes of interest. When thousands of executions are necessary to manipulate large datasets, a parallel pipeline systematic controlled execution is mandatory. We proposed a scientific workflow for phylogenetic analysis that was executed in parallel using a public cloud. Scientific workflows allow for the following benefits: (i) to extract provenance of the experiments, (ii) to guarantee that the experiments are properly finished and the data was stored, and (iii) to benefit from the processing power of clouds (on demand use). Our experiments using different large quantities of data showed the advantages of all these three points. As future work, we intend to use SciPhy for better exploration of protozoan genomes for searching new candidate drug target enzymes.
References [1] Jiang, Z., Zhou, Y.: Using bioinformatics for drug target identification from the genome. American Journal of Pharmacogenomics: Genomics-Related Research in Drug Development and Clinical Practice 5(6), 387–396 (2005)
70
K.A.C.S. Ocaña et al.
[2] Hartman, A.L., Riddle, S., McPhillips, T., Ludäscher, B., Eisen, J.A.: Introducing W.A.T.E.R.S.: a Workflow for the Alignment, Taxonomy, and Ecology of Ribosomal Sequences. BMC Bioinformatics 11(1), 317 (2010) [3] Dávila, A.M.R., Mendes, P.N., Wagner, G., Tschoeke, D.A., Cuadrat, R.R.C., Liberman, F., Matos, L., Satake, T., Ocaña, K.A.C.S., et al.: ProtozoaDB: dynamic visualization and exploration of protozoan genomes. Nucleic Acids Research 36(Database issue), D547– D552 (2008) [4] Vaquero, L.M., Rodero-Merino, L., Caceres, J., Lindner, M.: A break in the clouds: towards a cloud definition. SIGCOMM Comput. Commun. Rev. 39(1), 50–55 (2009) [5] Oliveira, D., Ogasawara, E., Baião, F., Mattoso, M.: SciCumulus: A Lightweigth Cloud Middleware to Explore Many Task Computing Paradigm in Scientific Workflows. In: Proc. CLOUD 2010, Miami, FL, pp. 378–385 (2010)
A Conceptual Model for Transcriptome High-Throughput Sequencing Pipeline Ruben Cruz Huacarpuma, Maristela Holanda, and Maria Emilia Walter Department of Computer Science, University of Brasilia, Brazil
Abstract. In recent years, high-throughput sequencers have been generating enourmous volumes of data in hundreds of genome projects around the world. Besides being stored, the original data are transformed through multiple analysis that are realized in a computational pipeline. This poses important problems for treating these highly complex data. In this context, a model to represent, organize and guarantee accessibility, correctness and understandability to these data is essential to support the work of the biologists involved in a transcriptome project. Different formats of data, terminologies, file structures and ontologies turn data management very difficult. In this work, we propose a conceptual model for the different phases of a transcriptome high-throughput sequencing pipeline in order to represent and manage data.
1
Introduction
Since DNA double helix discovery in 1953, advances in molecular biology have been remarkable. In recent years hundreds of genome and transcriptome sequencing projects, using high-throughput technologies, have been generating an enourmous volume of data [1]. Representing, managing and storing these data is very difficult, due to different formats, terminologies, file structures and ontologies. Original DNA or RNA fragments, as generated by sequencers, are transformed through a pipeline which realizes multiple analysis, such as assessing data quality, filtering sequencing errors, storing sequences from external databases, searching biological functionalities, and storing results produced by many different programs. In this context, a model to represent, organize and guarantee accessibility, correctness and understandability to data is essential to support the work of the biologists involved in a transcriptome project. In this work, we propose a conceptual model for the different phases of a transcriptome sequencing pipeline, in order to represent and manage original (untreated) and produced (processed) data. We work with information generated by a high-throughput sequencer, using the object-oriented paradigm [2], based on UML notation. O. Norberto de Souza, G.P. Telles, and M.J. Palakal (Eds.): BSB 2011, LNBI 6832, pp. 71–74, 2011. c Springer-Verlag Berlin Heidelberg 2011
72
R.C. Huacarpuma, M. Holanda, and M.E. Walter
Table 1. Conceptual model comparison. Second column means how difficult is to use the corresponding model. Aproach Difficulty Conceptual Model for Genomic InforOOM Less mation [8] Conceptual Model for Bioinformatics [9] ERM-OOM Less Concepts of Modeling bioinformatics ERM-EERM More or less data improving its use in query integration [5] Modeling genomic data with attribute OOM-DOM High types, balancing stability and manutenability [10] BioConceptual [11] OOM More or less
2
Implement. PEOT PEOT Any Related MSDB MCK
Object oriented framework
Related Work
Different data models [3,4,5], such as entity-relationship model (ERM), enhanced entity relationship model (EERM) based on ERM with some additional features, and object-oriented model (OOM) are used to create biological data models for bioinformatics. OOM is appropriate to represent highly complex data, since it uses ordered data types, which solves some dependency constraints [6,7]. Paton et al. [8] integrate genetic and phenotypic data using OOM, besides describing eukaryotic genomic and transcriptomic data, functions and experiments. Bornberg-Bauer and Paton [9] present two conceptual models using ERM to represent general biological concepts and use OOM to represent genomic and proteomic data using UML. Ji [5] uses EERM to define a model that can be mapped to a relational database and includes characteristics to represent DNA sequence spatial structure. Busch and Wedemann [10] propose an object model which can be dynamically modified as it is executed. Bioconceptual Model [11] is an extended object-oriented model conceived to project a data scheme for biological applications, to facilitate the implementation, and to be comprehensive for biologists. Table 1 shows a comparison among these models. The main characteristic of these works is that they model biological data. Instead, our approach focus on the process of transcriptome high-throughput sequencing pipelines, besides biological data.
3
A Conceptual Model for a Transcriptome Pipeline
The main feature of our approach is the definition of a conceptual model for each one of the four phases of a high-throughput transcriptome sequencing pipeline: filtering, mapping, assembly and annotation (Fig. 1). This kind of pipeline treats short-read sequences (SRS) having from 60 to 100 bp which is the length sequence range of SRS generated by the Illumina sequencer.
A Model for Transcriptome High-Throughput Sequencing Pipeline
73
Fig. 1. A conceptual model for each one of a four phases high-throughput transcriptome sequencing pipeline: filtering, mapping, assembly and annotation
In the filtering phase, the objective is to filter quality and complexity of each SRS generated by the sequencer. Low quality and highly complex (repeated characters such as polyA) SRS are removed. The output data of this phase has the same type as the input data (SRS). In the mapping phase, SRS are mapped into a reference genome. This phase is necessary because SRS are too short to be directly assembled. In the assembly phase, the contiguously mapped SRS are assembled and a consensus (sequence representing grouped SRS) is generated. Finally, in the annotation phase, a function is assigned to each consensus of the mapped sequences (fragment mapping entity). Function prediction usually is done by comparing consensus sequences to a genome database (database entity). Besides, according to the project objectives, other analysis can be done, such as identifying non-coding RNAs, among others. The proposed conceptual model for SRS transcriptome high-throughput sequencing pipeline is different from others found in the literature [5,8,9,10,11], since we focus on the pipeline process and not only on the biological data. Developing a model for each phase facilitates the understanding of the data types and clearly defines which are the input and output data involved in each phase, which allows to more efficient data managing.
74
4
R.C. Huacarpuma, M. Holanda, and M.E. Walter
Conclusions
In this work, we first presented a comparison among models representing data generated in genome projects. These approaches were defined to represent biological data, regardless their use in projects. Instead, we proposed a conceptual model to represent the different types of input and output data involved in each one of the different phases of a transcriptome high-throughput sequencing pipeline. So, our conceptual model refers to the pipeline process, and we claim that this facilitates the management and manipulation of data in a transcriptome project. This conceptual model will be tested soon in a transcriptome high-throughput sequencing pipeline that will use the Illumina sequencer. Besides, provenance can be included in this model, so that it will allow to trace the information produced on each phase.
References 1. R¨ ohm, U., Blakeley, J.: Data Management for High-Throughput Genomics. In: CDIR 2009, Asilomar, CA, USA, vol. 5667, pp. 97–111 (2009) 2. Catell, R.: ODMG-93: The Object Database Standard. Morgan Kauffmann, San Francisco (1994) 3. Silberchatz, A., Korth, H.F., Sudarshan, S.: Database System Concepts, 6th edn. McGraw Hill, New York (2010) 4. Elmasri, R., Navathe, S.: Fundamentals of Database Systems, 4th edn. Pearson Addison Wesley, London (2005) 5. Ji, F.: Enhaced Bioinformatics Data Modeling Concepts and their use in Querying and Integration. PhD Thesis. Faculty of the Graduate School of The University of Texas, Arlinton (2008) 6. Shah, A., Ahsan, S., Jaffer, A.: Temporal Object-Oriented System (TOS) for Modeling Biological Data. Journal of American Science 5(3), 63–73 (2009) 7. Aberer, K.: The Use of Object-Oriented Datamodels for Biomolecular Database. OOCNS, Heidelberg, Germany (1995), http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.8025 8. Paton, N., Khan, S., Hayes, A., Moussouni, F., Brass, A., Eilbeck, K., Goble, C., Hubbard, S., Oliver, S.: Conceptual modeling of genomic information. Bioinformatics 16(6), 548–557 (2000) 9. Bornberg-Bauer, E., Paton, N.: Conceptual data modeling for bioinformatics. Briefing in Bioinformatics 3(2), 166–180 (2002) 10. Busch, N., Wedemann, G.: Modeling genomic data with type attributes, balancing stability and maintainability. BMC Bioinformatics 10(97), 1471–2105 (2009) 11. Macedo, J.F., Porto, F., Lifschitz, S., Picouet, P.: A Conceptual Data Model Language for the Molecular Biology Domain. In: Twentieth IEEE International Symposium on Computer-Based Medical Systems (CBMS 2007), pp. 231–236 (2007)
A Conceptual Many Tasks Computing Architecture to Execute Molecular Docking Simulations of a Fully-Flexible Receptor Model Renata De Paris, Fábio A. Frantz, Osmar Norberto de Souza, and Duncan D. Ruiz Programa de Pós-Graduação em Ciência da Computação, Faculdade de Informática, PUCRS, Av. Ipiranga, 6681 – Prédio 32, Sala 602, 90619-900, Porto Alegre, RS, Brasil {renata.paris,fabio.frantz}@acad.pucrs.br, {osmar.norberto,duncan.ruiz}@pucrs.br
Abstract. Molecular docking simulations of Fully-flexible Protein Receptor (FFR) models are coming of age. However, it demands High Performance Computing (HPC) for its executions and generates huge amounts of data that needs to be analyzed. Distributed HPC systems and scientific workflows are routinely applied to execute compute intensive tasks and to manage data flow. In this work we propose a conceptual Many Tasks Computing architecture to execute molecular docking simulations of FFR models to small molecules in distributed HPC environments. Keywords: many tasks computing, fully-flexible receptor, molecular docking.
1 Introduction Large scale scientific experiments have an ever increasing demand for high performance distributed computing. This typical scenario is found in bioinformatics, which needs to perform computer modeling and simulation on data varying from DNA sequence to protein structure to protein-ligand interactions. It produces sets of data flow that are processed by an iterative sequence of tasks, software or services [1]. Rational Drug Design (RDD) constitutes one of the earliest medical applications of bioinformatics [2]. RDD aims to transform biologically active compounds into suitable drugs [3]. In silico molecular docking (MDock) simulation is one the main steps of RDD. It is used to identify and optimize drug candidates by computationally examining and modeling molecular interactions between ligands and a target protein or receptor [3]. The best ligand orientation and conformation inside the binding pocket is computed in terms of Free Energy of Bind (FEB) by a software, for instance the AutoDock4.2 [4]. In order to mimic the natural, in vitro and in vivo, behavior of ligands and receptors, their plasticity or flexibility should be treated in an explicit manner [5]. A major approach to incorporate the explicit flexibility of receptors in MDock is by means of snapshots derived from a molecular dynamics simulations [6] trajectory of the receptor (reviewed by [7]). The resulting receptor model is called a FullyFlexible Receptor (FFR) model. Organizing and handling the execution and analysis O. Norberto de Souza, G.P. Telles, and M.J. Palakal (Eds.): BSB 2011, LNBI 6832, pp. 75–78, 2011. © Springer-Verlag Berlin Heidelberg 2011
76
R. De Paris et al.
of MDock simulations of FFR models and flexible ligands are not trivial tasks. The dimension of the FFR model can become a limiting step because, instead of performing docking simulations in a single, rigid receptor conformation, we must do it for all n conformations that make up the FFR model [5]. n can vary from hundreds to thousands to millions of conformations. Therefore, the high computing cost involved in using FFR models to perform practical virtual screening of thousands or millions of ligands may turn it unfeasible. For this reason, we have been developing methods [5] to simplify or reduce the FFR model dimension. We dubbed this simpler representation of a FFR model a Reduced Fully-Flexible Receptor (RFFR) model. A RFFR model is achieved by eliminating redundancy in the FFR model through clustering its set of conformations, thus generating subsets which should contain the most promising conformations [5]. At present, to develop a RFFR, we still need to perform thousands, and in the near future, this number should grow to hundreds of thousands of MDock simulations of a particular target receptor modeled as a FFR model. However, the sequential execution of docking simulations of FFR models by software, such as AutoDock4.2 [4], is computationally very expensive [5], hence demanding days, weeks, or even months of CPU time. As a result, to make use of parallel processing is paramount to enhance the performance of bioinformatics workflows used for automating and accelerating MDock simulations [5,8]. In this study we present a conceptual architecture to assist in the high performance massively parallel execution of MDock simulations of subsets of conformations of FFR models. Future MDock experiments, with different ligands, will not use all the conformations that make up a FFR model, but instead, only those which are significantly more promising [5,9]. The architecture is intended to model both the Many Tasks Computing (MTC) and the Scientific Workflow (SW) environments.
2 A Framework to Model Multiple Environments The framework is composed of two environments for building a RFFR model from a FFR one. The SW environment manages the MDock simulations. The MTC environment performs the dynamic scheduling of various tasks in a distributed HPC environment. The workflow model (Fig. 1 below) shows the handling of tasks for MDock simulations with the software AutoDock4.2. Before submitting the FFR model to the SW environment, its constituent snapshots are grouped together by data mining techniques [5]. Then, the SW environment prepares each group to be executed in the MTC environment. After receiving the execution results, the SW applies the Self-adaptive Multiple Instances (P-SaMI) [9] to each group of snapshots. P-SaMI uses the values obtained from the Results Concatenation (Fig. 1) at runtime and applies some criteria to establish priorities for each individual group of snapshots. To build a RFFR model, only the groups of snapshots which present the most promising conformations [5] are executed in parallel by the MTC environment. Thus, from the P-SaMI analyzes, each group of snapshots receives a priority and a status to order the job queue to be processed by the HPC environment or to discard the non promising groups of snapshots. The MTC environment is responsible for executing AutoGrid and AutoDock for each individual snapshot. It uses
A Conceptual Many Tasks Computing Architecture
77
a heuristic function to: (1) dynamically schedule tasks based on the priority of each snapshot’s group and (2) discard groups with unpromising status.
Fig. 1. Framework to model the SW and MTC environments for the execution of MDock simulations with AutoDock4.2. Except the tasks executed by the MTC environment, highlighted by the rectangle, all the other tasks are performed by the SW environment.
3 Conceptual Architecture to Integrate the SW and MTC Environments The conceptual architecture (Fig. 2) was adapted from the middleware in [1]. The SW and MTC environments manage different tasks, including the communication between them. In the SW Environment, Prepare Files executes tasks 1, 2, 3, and 4 presented in Fig. 1 above. The SW Layer has three components: (1) Uploader: prepares the snapshot files for execution by the MTC environment. (2) Data Analyzer: analyzes and updates the priorities and status of different snapshot’s group through the PSaMI. (3) Downloader: retrieves data from Data Analyzer. Setup
MTC Environment Pre-Processing
SW Prepare Files Environment SW Layer
Uploader
Snap-
Properties MTC Layer
Data Analyzer
Task Manage Processing Dispatcher Monitor
Downloader
HPC Cluster
Post-Processing Results’s Concatenation
Results
SWfMS Storage
Control
Data
Fig. 2. Conceptual architecture to execute MDock simulations of FFR models. The components’ labels identify their functions.
78
R. De Paris et al.
In the MTC environment the MTC Layer is composed of three steps. (1) PreProcessing: receives files and prepares the tasks for execution in the HPC cluster. It has three components. (i) Snapshot: stores the snapshots. (ii) Properties: stores priorities and status of the snapshots. (iii) Task Manager: reads components (i) and (ii) and generates a data repository that stores the job queue to be processed. (2) Processing: gets the job queue to be executed by the HPC, and sends the results to the Results’ Component. (3) Post-Processing: stores the results of the MDock simulations and prepares files to send back to the SW Environment. The HPC Cluster Component executes tasks 5 and 6 (Fig. 1) on the snapshots from the Pre-Processing step.
4 Final Considerations This paper introduced a conceptual architecture to parallelize and orchestrate MDock simulations of a FFR model, and, from it, to develop a tool to build a RFFR model. This tool, a middleware, should be able to deliver, control and monitor multiple task’s execution in heterogeneous and distributed HPC environments. Acknowledgements. This project is supported by grants from CNPq to ONS and DDAR. ONS is a CNPq Research Fellow. RDP is supported by a CNPq M.Sc. scholarship. FAF is supported by HP-PROFACC M.Sc. scholarship.
References 1. Coutinho, F., Ogasawara, E., de Oliveira, D., Braganholo, V., Lima, A.A.B., Dávila, A.M.R., Mattoso, M.: Data Parallelism in Bioinformatics Workflows Using Hydra. In: 19th ACM International Symposium on High Performance Distributed Computing, New York, pp. 507–515 (2010) 2. Luscombe, N.M., Greenbaum, D., Gerstein, M.: What is Bioinformatics? A Proposed Definition and Overview of the Field. Methods Inf. Med. 30, 346–358 (2001) 3. Kapetanovic, I.M.: Computer-aided drug discovery and development (CADD): In silico-chemico-biological approach. Chem. Biol. Interact. 17, 165–176 (2008) 4. Morris, G.M., Huey, R., Lindstrom, W., Sanner, M.F., Belew, R.K., Goodsell, D.S., Olson, A.J.: AutoDock4 and AutoDockTools4: Automated docking with selective receptor flexibility. J. Comput. Chem. 30, 2785–2791 (2009) 5. Machado, K.S., Winck, A.T., Ruiz, D.D.A., Norberto de Souza, O.: Mining flexible-receptor docking experiments to select promising protein receptor snapshots. BMC Genomics 11, 1–10 (2010) 6. Lin, J.H., Perryman, A.L., Schames, J.R., McCammom, J.A.: The relaxed complex method: Accommodating receptor flexibility for drug design with an improved scoring scheme. Biopolymers 68, 47–62 (2003) 7. Alonso, H., Bliznyuk, A.A., Gready, J.E.: Combining docking, molecular dynamic simulations in drug design. Med. Res. Rev. 26, 531–568 (2006) 8. Machado, K.S., Schroeder, E.K., Ruiz, D.D.A., Norberto de Souza, O.: Automating Molecular Docking with Explicit Receptor Flexibility Using Scientifc Workflow. In: Sagot, M.-F., Walter, M.E.M.T. (eds.) BSB 2007. LNCS (LNBI), vol. 4643, pp. 1–11. Springer, Heidelberg (2007) 9. Hübler, P.N.: P-SaMI: self-adapting multiple instances – a data pattern to scientific workflows (in portuguese: P-MIA: padrão de múltiplas instâncias autoadaptáveis – um padrão de dados para workflows científicos). PhD Thesis. PPGCC – PUCRS. Porto Alegre, RS, Brasil (2010)
kGC: Finding Groups of Homologous Genes across Multiple Genomes Guilherme P. Telles1 , Nalvo F. Almeida2 , Marcelo M. Brigido3 , Paulo Antonio Alvarez4 , and Maria Emilia Walter4 1
Institute of Computing, University of Campinas, Campinas, Brazil College of Computing, Federal University of Mato Grosso do Sul, Campo Grande, Brazil Molecular Biology Laboratory, Institute of Biology, University of Brasilia, Brazil 4 Department of Computer Science, University of Brasilia, Brasilia, Brazil
[email protected],
[email protected],
[email protected],
[email protected],
[email protected] 2
3
Abstract. We present a simple method to obtain groups of homologous genes across multiple (k) organisms, called kGC. It takes all-againstall BLASTP comparisons as input and produces groups of homologous sequences as output. The algorithm is based on the identification of maximal cliques in graphs of sequences and paralogous groups. We have used our method on six Actinobacterial complete genomes and investigated the Pfam classification of the homologous groups with respect to the results produced by OrthoMCL. Although kGC is simpler, it presented similar results with respect to Pfam classification in reasonable time.
1
Introduction
Comparative genomics is an important tool in large-scale bioinformatics analysis which allows infering functions of biological sequences based on similarity to sequences of other genomes with known function. The rationale is that strong similarities among genes of different genomes indicate that they are homologous and could perform the same cellular activity. Some methods identify orthology relationships by phylogenetic trees [9]. Others based on sequence comparisons are faster and present good results [12,8]. Some combine phylogeny and comparative genomics [6]. OrthoMCL [10] is a broadly used method for finding orthologous groups across multiple genomes. MultiParanoid [2] uses InParanoid [11] results to group proteins across multiple species. Chen et al. [7] used Latent Class Analysis to assess various methods and observed a trade-off between sensitivity and specificity in orthology detection, with BLAST-based methods characterized by high sensitivity, and tree-based methods by high specificity. Among the analyzed methods, inParanoid and OrthoMCL have shown the best overall balance for sensitivity and specificity. This paper presents kGC, a method that generalizes a previous strategy [4] for three genomes, enabling the construction of groups of homologous genes among multiple genomes simultaneously, based on the identification of maximal cliques O. Norberto de Souza, G.P. Telles, and M.J. Palakal (Eds.): BSB 2011, LNBI 6832, pp. 79–82, 2011. c Springer-Verlag Berlin Heidelberg 2011
80
G.P. Telles et al.
in a graph of sequence similarities. Experiments on six bacterial genomes has shown that kGC produces results with quality, measured using Pfam [1] models, comparable to those found by OrthoMCL. The method is simple, with a small number of parameters, and has reasonable running time.
2
The kGC Method
Given k genomes, where each genome is a set of gene sequences, the input for kGC is the result of all-against-all BLASTP. kGC output is a collection of groups formed by similar sequences. The algorithm first produces groups of similar sequences in every genome and then searches for cliques in k-partite graphs to form groups called blocks. Each block may contain sequences from the same genome (putative paralogs) and sequences from different genomes (putative orthologs). Groups in a genome are found by part of the EGG method [3], that uses two simple graphs. In graph G = (V, E), each vertex v represents a gene gv , and an edge (u, v) ∈ E if there is a BLASTP alignment of gu and gv with e-value less than or equal to some threshold Iev and covers at least Icov % of gu and gv . Graph G = (V, E ) is analogously defined with different thresholds Iev and Icov . The algorithm proceeds in three steps. In the first step, it finds maximal cliques in G, representing sets of similar sequences. In the second step, the algorithm tries to aggregate sequences to the cliques to avoid loosing strongly connected subgraphs that are not maximal cliques, but that are groups of highly similar sequences. A sequence gv will be an aggregate to clique C if gv ∈ C and a vertex u ∈ C exists such that (v, u) ∈ E . Although the condition to belong to a group is relaxed, the thresholds Iev and Icov may be more stringent, allowing to keep the consistency of groups. In the third step, the method removes the redundancy generated in the second step, choosing among all groups that contain an aggregated vertex v the one presenting the highest average BLAST score. Cliques are searched for in two k-partite graphs S (of sequences) and G (of groups). The k-partite graph of sequences S is a simple graph whose vertices are sequences in the k genomes and there is an edge (su , sv ) between two sequences if they belong to different genomes and there is a BLASTP alignment of su and sv with e-value less than or equal to some threshold Aev and covers at least Acov % of su and sv . The k-partite graph of groups G is a simple graph whose vertices are groups of similar sequences in every genome and there is an edge (gu , gv ) between two groups if they belong to different genomes and there is at least one edge (su , sv ) in S such that su ∈ gu and sv ∈ gv . The search iterates from k to 2. The i-th iteration has two major steps. During the search in S, all cliques with size i in S are found and added to a list Li , sorted by non-increasing order of average coverage. Li is sequentially processed. For position j in Li , let C = {s1 , s2 , . . . , si } be the clique of vertices. Each vertex in C belongs to a group gsi (possibly a singleton) of similar sequences in one genome. A block b is built as the union of gs1 , gs2 , . . . , gsi and reported. No element of b is further considered. During the search in G, all cliques of size i in G are found and added to a list Li , initially empty, sorted by non-increasing
kGC: Finding Groups of Homologous Genes across Multiple Genomes
81
order of average coverage. Li is sequentially processed. For position j in Li , let C = {g1 , g2 , . . . , gi } be the clique of vertices. A block b is built as the union of g1 , g2 , . . . , gi and reported. Again, no element of b is further considered.
3
Results and Discussion
We have implemented kGC in Java. Cliques are found using the Bron-Kerbosch algorithm [5], with small changes to accelerate the search, such as demanding that a vertex from the smallest genome is always in a clique and bounding the clique size by the number of partitions. Vertices and edges are removed as blocks are reported. The output may be presented in html, through a page that allows selecting the desired genomes presenting homologous blocks. To test our method, we have chosen six complete genomes of Actinobacteria, available in January 2011 in GenBank: S. avermitilis MA 4680 (7676 genes), S. bingchenggensis BCW 1 (10022 genes), S. coelicolor A3 2 (8153 genes), S. griseus NBRC 13350 (7136 genes), S. scabiei 87 22 (8746 genes), S. roseum DSM 43021 (8975 genes). To validate kGC, we first computed the Pfam [1] model for each protein. Among the 50,708 genes in our dataset, 37,093 (73.15%) had a Pfam model. Given a block f identified by kGC, let pf be the most frequent Pfam model in f , nf be the number of proteins in f with Pfam model pf , and mf be the number of proteins in f with any Pfam model. To each block, define its score by score(f ) = nf /mf . The final score is given by the summation of all family scores, considering only blocks with at least one protein with Pfam model, divided by this number of blocks. Iev
Iev
10−7 10−8 10−9 10−10 10−11 10−12 10−13 10−14 10−15 10−16 10−17 10−18 10−19 10−20 10−21 10−22
10−12 10−13 10−14 10−15 10−16 10−17 10−18 10−19 10−20 10−21 10−22 10−23 10−24 10−25 10−26 10−27
Aev blocks with Pfam 10−5 6,908 5,011 10−6 6,941 5,044 10−7 6,945 5,063 10−8 7,025 5,142 10−9 7,058 5,190 10−10 7,073 5,215 10−11 7,102 5,266 10−12 7,122 5,307 10−13 7,141 5,321 10−14 7,160 5,369 10−15 7,210 5,438 10−16 7,218 5,468 10−17 7,242 5,504 10−18 7,271 5,551 10−19 7,260 5,575 10−20 7,275 5,605
final score 0.910 0.912 0.913 0.914 0.916 0.918 0.920 0.921 0.924 0.925 0.927 0.927 0.928 0.930 0.931 0.933
time min. 71 66 149 323 I cov Icov Acov blocks with final time 350 Pfam score min. 118 0.50 0.70 0.50 6,734 4,854 0.900 514 33 0.55 0.75 0.55 6,815 4,938 0.907 207 16 0.60 0.80 0.60 6,945 5,063 0.913 149 10 0.65 0.85 0.65 7,124 5,241 0.922 50 9 0.70 0.90 0.70 7,324 5,452 0.927 12 8 8 (b) kGC results for varying coverages, −9 −14 , and 7 with Iev = 10 , Iev = 10 −7 A = 10 . ev 7 7 6
(a) kGC results for varying e-values, with Icov = 0.6, Icov = 0.8 and Acov = 0.6.
82
G.P. Telles et al.
In the following tables we show the results of executing kGC in the six Actinobacterial genomes, for varying e-values and coverages. comparing closely re lated genomes, and Iev and Iev have been chosen to avoid the bias that can be caused by homologs inside a genome. OrthoMCL identified 9,793 families (7,694 with at least one protein with Pfam model), with final score 0.939, and took slightly less than 2 hours, on the same machine that executed kGC. Note that as the number of edges allowed in the graphs decreases, the cohesion and the score of the remaining groups with respect to Pfam families increases. kGC also produced fewer groups when compared to OrthoMCL. A drawback of kGC is the search for maximal cliques. Although the graph is bipartite, the algorithm may not scale very well. Our experiments on 6 genomes and 50,000 sequences run in reasonable time. Improvements may come from changing branch-and-bound to heuristics, but it will diminish precision. The estimate provided by Pfam is preliminary, in the sense that strongly related genes with no Pfam model may be formed without contributing to the score. Further analysis may reveal other features of the kGC approach.
References 1. Bateman, A., Coin, L., Durbin, R., Finn, R.D.: The Pfam protein families database. Nucleic Acids Res. 32, D138–D141 (2004) 2. Alexeyenko, A., Tamas, I., et al.: Automatic clustering of orthologs and inparalogs shared by multiple proteomes. Bioinform. 22, e9–e15 (2006) 3. Almeida, N.F.: Tools for genome comparison. PhD thesis, Instituto de Computa¸ca ˜o–Unicamp (2002) (in Portuguese) 4. Anjos, D.A.S., Zerlotini, G., et al.: A method for inferring biological functions using homologous genes among three genomes. In: Sagot, M.-F., Walter, M.E.M.T. (eds.) BSB 2007. LNCS (LNBI), vol. 4643, pp. 69–80. Springer, Heidelberg (2007) 5. Bron, C., Kerbosch, J.: Algorithm 457: finding all cliques of an undirected graph. Comm. of the ACM 16, 575–577 (1973) 6. Cannon, S.B., Young, N.B.: Orthoparamap: Distinguishing orthologs from paralogs by integrating comparative genome data and gene phylogenies. BioMed Central Bioinform. 4, 35 (2003) 7. Chen, F., Mackey, A.J., et al.: Assessing performance of orthology detection strategies applied to eukaryotic genomes. PLoS ONE 2(4), e383 (2007) 8. Kellis, M., Patterson, N., Birren, B., Berger, B., Lander, E.S.: Methods in comparative genomics: genome correspondence, gene identification and motif discovery. Bioinform. 11(2-3), 319–355 (2004) 9. Lee, Y., Sultana, R., et al.: Cross-referencing eukaryotic genomes: TIGR orthologous gene alignments (TOGA). Genome Res. 12(3), 493–502 (2002) 10. Li, L., Stoeckert Jr., C.J., Roos, D.S.: OrthoMCL: Identification of ortholog groups for eukaryotic genomes. Genome Res. 13(9), 2178–2189 (2003) 11. Remm, M., Storm, C.E., Sonnhammer, E.: Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. Journal of Molecular Biology 314, 1041–1052 (2001) 12. Tatusov, R.L., Fedorova, N.D., et al.: The COG database: an updated version includes eukaryotes. BMC Bioinform. 4, 41 (2003)
Author Index
Adi, Said Sadique 26 Almeida, Nalvo F. 79 Alvarez, Paulo Antonio
Holanda, Maristela 71 Huacarpuma, Ruben Cruz
Kishi, Rodrigo Mitsuo 26 Kowada, Luis Antonio B. 42
Bartschat, Sebastian 1 Bay´ a, Ariel E. 50 Bouillet, Leoneide Erica Maduro Braga, Marilia D.V. 42 Brigido, Marcelo M. 79
18
Cerri, Ricardo 10 Ciferri, Cristina Dutra de Aguiar Ciferri, Ricardo Rodrigues 58
58
Langenberger, David 1 Lima, Alexandre A.B. 66 Lopes, Marcelo P. 42 Mattoso, Marta 66 Merschmann, Luiz Henrique de Campos 18
da Louza, Felipe Alves 58 D´ avila, Alberto M.R. 66 de Carvalho, Andr´e C.P.L.F. 10 de Figueiredo, Celina M.H. 42 de Oliveira, Daniel 66 De Paris, Renata 75 do Lago, Alair Pereira 34 dos Santos, Ronaldo Fiorilo 26
Norberto de Souza, Osmar
Frantz, F´ abio A.
Sacomoto, Gustavo Akio T. Stadler, Peter F. 1
Granitto, Pablo M.
71
79
75
75
Oca˜ na, Kary A.C.S. 66 Ogasawara, Eduardo 66 Oliveira, Samuel Evangelista de Lima 18 Ruiz, Duncan D.
75
50
Hausen, Rodrigo de A. Hertel, Jana 1 Hoffmann, Steve 1
42
Tafer, Hakim 1 Telles, Guilherme P. Walter, Maria Emilia
79 71, 79
34