Lecture Notes in Bioinformatics
4645
Edited by S. Istrail, P. Pevzner, and M. Waterman

Editorial Board: A. Apostolico, S. Brunak, M. Gelfand, T. Lengauer, S. Miyano, G. Myers, M.-F. Sagot, D. Sankoff, R. Shamir, T. Speed, M. Vingron, W. Wong
Subseries of Lecture Notes in Computer Science
Raffaele Giancarlo
Sridhar Hannenhalli (Eds.)
Algorithms in Bioinformatics
7th International Workshop, WABI 2007
Philadelphia, PA, USA, September 8-9, 2007
Proceedings
Series Editors
Sorin Istrail, Brown University, Providence, RI, USA
Pavel Pevzner, University of California, San Diego, CA, USA
Michael Waterman, University of Southern California, Los Angeles, CA, USA

Volume Editors

Raffaele Giancarlo
Università degli Studi di Palermo, Department of Mathematics
via Archirafi 34, 90123 Palermo, Italy
E-mail: [email protected]

Sridhar Hannenhalli
University of Pennsylvania, Penn Center for Bioinformatics and Department of Genetics
1409 Blockley Hall, 423 Guardian Drive, Philadelphia, PA 19104-6021, USA
E-mail: [email protected]
Library of Congress Control Number: 2007932232
CR Subject Classification (1998): F.1, F.2.2, E.1, G.1-3, J.3
LNCS Sublibrary: SL 8 – Bioinformatics
ISSN: 0302-9743
ISBN-10: 3-540-74125-9 Springer Berlin Heidelberg New York
ISBN-13: 978-3-540-74125-1 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media
springer.com

© Springer-Verlag Berlin Heidelberg 2007
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
SPIN: 12103256 06/3180 543210
Preface
We are very pleased to present the proceedings of the Seventh Workshop on Algorithms in Bioinformatics (WABI 2007), which took place in Philadelphia, September 8–9, 2007, under the auspices of the International Society for Computational Biology (ISCB), the European Association for Theoretical Computer Science (EATCS), the Penn Genomics Institute and the Penn Center for Bioinformatics.

The Workshop on Algorithms in Bioinformatics covers research in all aspects of algorithmic work in bioinformatics. The emphasis is on discrete algorithms that address important problems in molecular biology, that are founded on sound models, that are computationally efficient, and that have been implemented and tested in simulations and on real datasets. The goal is to present recent research results, including significant work-in-progress, and to identify and explore directions of future research. Specific topics of interest include, but are not limited to:

– Exact, approximate, and machine-learning algorithms for genomics, sequence analysis, gene and signal recognition, alignment, molecular evolution, polymorphisms and population genetics, protein and RNA structure determination or prediction, gene expression and gene networks, proteomics, functional genomics, and drug design.
– Methods, software and dataset repositories for development and testing of such algorithms and their underlying models.
– High-performance approaches to computationally hard problems in bioinformatics, particularly optimization problems.

A major goal of the workshop is to bring together researchers in areas spanning the range from abstract algorithm design to biological dataset analysis, so as to enable a dialogue between application specialists and algorithm designers, mediated by algorithm engineers and high-performance computing specialists. We believe that such a dialogue is necessary for the progress of computational biology, inasmuch as application specialists cannot analyze their datasets without fast and robust algorithms and, conversely, algorithm designers cannot produce useful algorithms without being conversant with the problems faced by biologists.

Part of this mix has been achieved for all seven WABI events. For six of them, WABI was collocated with the European Symposium on Algorithms (ESA), along with other occasional conferences or workshops, so as to form the interdisciplinary scientific meeting known as ALGO. As agreed by the WABI and ALGO Steering Committees, starting this year WABI will be part of ALGO only every two years, alternating between Europe and other continents.

We received 133 submissions in response to our call for WABI 2007 and were able to accept 37 of them, ranging from mathematical tools to experimental studies of approximation algorithms and reports on significant computational analyses. Numerous biological problems were dealt with, including genetic mapping,
sequence alignment and sequence analysis, phylogeny, comparative genomics, and protein structure. Both machine-learning and combinatorial optimization approaches to algorithmic problems in bioinformatics were represented. We want to thank all authors for submitting their work to the workshop and all presenters and attendees for their participation.

We were particularly fortunate in enlisting the help of a very distinguished panel of researchers for our Program Committee, which undoubtedly accounts for the large number of submissions and the high quality of the presentations. Our sincere thanks go to all:

Piotr Berman, Penn. State U., USA
Mathieu Blanchette, McGill U., Canada
Paola Bonizzoni, U. Milano-Bicocca, Italy
Philipp Bücher, EPFL, Switzerland
Rita Casadio, U. Bologna, Italy
Maxime Crochemore, U. Marne-la-Vallée, France
Nadia El-Mabrouk, U. Montréal, Canada
Liliana Florea, George Washington U., USA
Olivier Gascuel, LIRMM-CNRS, France
David Gilbert, U. Glasgow, UK
Concettina Guerra, U. Padova, Italy & Georgia Tech, USA
Roderic Guigó, CRG, U. Barcelona, Spain
Daniel Huson, U. Tübingen, Germany
Shane Jensen, U. Penn., USA
Jens Lagergren, KTH Stockholm, Sweden
Arthur Lesk, Penn. State U., USA
Ming Li, U. Waterloo, Canada
Stefano Lonardi, UC Riverside, USA
Webb Miller, Penn. State U., USA
Satoru Miyano, Tokyo U., Japan
Bernard Moret, EPFL, Switzerland
Burkhard Morgenstern, U. Göttingen, Germany
Gene Myers, HHMI Janelia Farms, USA
Uwe Ohler, Duke U., USA
Laxmi Parida, IBM T.J. Watson Research Center, USA
Kunsoo Park, Seoul National U., S. Korea
Graziano Pesole, U. Bari, Italy
Ron Pinter, Technion, Israel
Cinzia Pizzi, INRIA, France
Knut Reinert, Freie U. Berlin, Germany
Mikhail Roytberg, Russian Academy of Sciences, Russia
Marie France Sagot, INRIA, France
David Sankoff, U. Ottawa, Canada
Roded Sharan, Tel-Aviv U., Israel
Adam Siepel, Cornell U., USA
Mona Singh, Princeton U., USA
Saurabh Sinha, UIUC, USA
Steven Skiena, SUNY Stony Brook, USA
Peter Stadler, U. Leipzig, Germany
Jens Stoye, U. Bielefeld, Germany
Granger Sutton, J. Craig Venter Institute, USA
Anna Tramontano, U. Roma "La Sapienza", Italy
Olga Troyanskaya, Princeton U., USA
Alfonso Valencia, U. Autonoma, Spain
Gabriel Valiente, Tech U. Catalonia, Spain
Li-San Wang, U. Penn., USA
Lusheng Wang, City U. Hong Kong, Hong Kong
Haim Wolfson, Tel-Aviv U., Israel

We would also like to thank Alessandra Gabriele, Giusuè Lo Bosco and Cesare Valenti, all of the University of Palermo, for providing assistance in assembling this volume. Last but not least, we thank Junhyong Kim and his colleagues Stephen Fisher and Li-San Wang, all at U. Penn, for doing a superb job of organizing the first edition of the conference in the USA and for the continuous technical support during all phases of the conference. We hope that you will consider contributing to future WABI events, through a submission or by participating in the workshop.

September 2007
Raffaele Giancarlo
Sridhar Hannenhalli
Organization
The WABI 2007 Program Committee gratefully acknowledges the valuable input received from the following external reviewers:

Edo Airoldi, J.A. Amgarten Quitzau, Lars Arvestad, Marie-Pierre Béal, Vincent Berry, Enrique Blanco, Guillaume Blin, Serdar Bozdag, Kajia Cao, Ildefonso Cases, Robert Castelo, Cedric Chauve, Giovanni Ciriello, Jordi Cortadella, Gianluca Della Vedova, Pietro Di Lena, Riccardo Dondi, Iakes Ezkurdia, Piero Fariselli, Alfredo Ferro, Oxana Galzitskaia, Claudio Garutti, Stoyan Georgiev, Robert Giegerich, Osvaldo Graña, Clemens Gröpl, Roderic Guigo i Serra, Bjarni Halldorsson, Michael Hallett, Sylvie Hamel, Elena Harris, Robert Harris, M. Helmer-Citterich, Matthew Hibbs, Curtis Huttenhower, Seiya Imoto, Yuval Inbar, Dmitry Ivankov, Katharina Jahn, Jieun Jeong, Tao Jiang, Raya Khanin, Jong Kim, Gunnar W. Klau, Tobias Kloepper, Arun Konagurthu, Mathieu Lajoie, Florian Leitner, Gonzalo Lopez, Antoni Lozano, Bill Majoros, Mohamed Manal, Florian Markowetz, Pier Luigi Martelli, David Martin, Efrat Mashiach, Jon McAuliffe, Julia Mixtacki, Chad Myers, Luay Nakhleh, Heiko Neuweger, Giulio Pavesi, Ernesto Picardi, M. Sohel Rahman, Sven Rahmann, Vincent Ranwez, Christian Rausch, Antonio Rausell, Daniel Richter, Romeo Rizzi, Jairo Rocha, Allen Rodrigo, Oleg Rokhlenko, Ivan Rossi, Bengt Sennblad, Maria Serna, Maxim Shatsky, Tomer Shlomi, Michael Shmoish, A. Shulman-Peleg, Jijun Tang, Ali Tofigh, Vladimir Vacic, Marco Vassura, Stéphane Vialette, Jordi Villa i Freixa, Robert Warren, Tobias Wittkop, Stefan Wolfsheimer, Yonghui Wu, Joseph Wun-Tat Chan, Nir Yosef
Table of Contents
Shotgun Protein Sequencing (Keynote) .......................... 1
Pavel A. Pevzner

Locality Kernels for Protein Classification .......................... 2
Evgeni Tsivtsivadze, Jorma Boberg, and Tapio Salakoski

When Less Is More: Improving Classification of Protein Families with a Minimal Set of Global Features .......................... 12
Roy Varshavsky, Menachem Fromer, Amit Man, and Michal Linial

Fault Tolerance for Large Scale Protein 3D Reconstruction from Contact Maps .......................... 25
Marco Vassura, Luciano Margara, Pietro Di Lena, Filippo Medri, Piero Fariselli, and Rita Casadio

Bringing Folding Pathways into Strand Pairing Prediction .......................... 38
Jieun K. Jeong, Piotr Berman, and Teresa M. Przytycka

A Fast and Accurate Heuristic for the Single Individual SNP Haplotyping Problem with Many Gaps, High Reading Error Rate and Low Coverage .......................... 49
Loredana M. Genovese, Filippo Geraci, and Marco Pellegrini

Two Birds, One Stone: Selecting Functionally Informative Tag SNPs for Disease Association Studies .......................... 61
Phil Hyoun Lee and Hagit Shatkay

Genotype Error Detection Using Hidden Markov Models of Haplotype Diversity .......................... 73
Justin Kennedy, Ion Măndoiu, and Bogdan Paşaniuc

Haplotype Inference Via Hierarchical Genotype Parsing .......................... 85
Pasi Rastas and Esko Ukkonen

Seeded Tree Alignment and Planar Tanglegram Layout .......................... 98
Antoni Lozano, Ron Y. Pinter, Oleg Rokhlenko, Gabriel Valiente, and Michal Ziv-Ukelson

Inferring Models of Rearrangements, Recombinations, and Horizontal Transfers by the Minimum Evolution Criterion (Extended Abstract) .......................... 111
Hadas Birin, Zohar Gal-Or, Isaac Elias, and Tamir Tuller

An Ω(n²/log n) Speed-Up of TBR Heuristics for the Gene-Duplication Problem .......................... 124
Mukul S. Bansal and Oliver Eulenstein

Incremental Discovery of Irredundant Motif Bases in Time O(|Σ|n² log n) (Extended Abstract) .......................... 136
Alberto Apostolico and Claudia Tagliacollo

A Graph Clustering Approach to Weak Motif Recognition .......................... 149
Christina Boucher, Daniel G. Brown, and Paul Church

Informative Motifs in Protein Family Alignments .......................... 161
Hatice Gulcin Ozer and William C. Ray

Topology Independent Protein Structural Alignment .......................... 171
Joe Dundas, T.A. Binkowski, Bhaskar DasGupta, and Jie Liang

Generalized Pattern Search and Mesh Adaptive Direct Search Algorithms for Protein Structure Prediction .......................... 183
Giuseppe Nicosia and Giovanni Stracquadanio

Alignment-Free Local Structural Search by Writhe Decomposition .......................... 194
Degui Zhi, Maxim Shatsky, and Steven E. Brenner

Defining and Computing Optimum RMSD for Gapped Multiple Structure Alignment .......................... 196
Xueyi Wang and Jack Snoeyink

Using Protein Domains to Improve the Accuracy of Ab Initio Gene Finding .......................... 208
Mihaela Pertea and Steven L. Salzberg

Genomic Signatures in De Bruijn Chains .......................... 216
Lenwood S. Heath and Amrita Pati

Fast Kernel Methods for SVM Sequence Classifiers .......................... 228
Pavel Kuksa and Vladimir Pavlovic

On-Line Viterbi Algorithm for Analysis of Long Biological Sequences .......................... 240
Rastislav Šrámek, Broňa Brejová, and Tomáš Vinař

Predicting Protein Folding Kinetics Via Temporal Logic Model Checking (Extended Abstract) .......................... 252
Christopher James Langmead and Sumit Kumar Jha

Efficient Algorithms to Explore Conformation Spaces of Flexible Protein Loops .......................... 265
Ankur Dhanik, Peggy Yao, Nathan Marz, Ryan Propper, Charles Kou, Guanfeng Liu, Henry van den Bedem, and Jean-Claude Latombe

Algorithms for the Extraction of Synteny Blocks from Comparative Maps .......................... 277
Vicky Choi, Chunfang Zheng, Qian Zhu, and David Sankoff

Computability of Models for Sequence Assembly .......................... 289
Paul Medvedev, Konstantinos Georgiou, Gene Myers, and Michael Brudno

Fast Algorithms for Selecting Specific siRNA in Complete mRNA Data .......................... 302
Jaime Davila, Sudha Balla, and Sanguthevar Rajasekaran

RNA Folding Including Pseudoknots: A New Parameterized Algorithm and Improved Upper Bound .......................... 310
Chunmei Liu, Yinglei Song, and Louis Shapiro

HFold: RNA Pseudoknotted Secondary Structure Prediction Using Hierarchical Folding .......................... 323
Hosna Jabbari, Anne Condon, Ana Pop, Cristina Pop, and Yinglei Zhao

Homology Search with Fragmented Nucleic Acid Sequence Patterns .......................... 335
Axel Mosig, Julian J.-L. Chen, and Peter F. Stadler

Fast Computation of Good Multiple Spaced Seeds .......................... 346
Lucian Ilie and Silvana Ilie

Inverse Sequence Alignment from Partial Examples .......................... 359
Eagu Kim and John Kececioglu

Novel Approaches in Psychiatric Genomics (Keynote) .......................... 371
Maja Bucan

The Point Placement Problem on a Line – Improved Bounds for Pairwise Distance Queries .......................... 372
Francis Y.L. Chin, Henry C.M. Leung, W.K. Sung, and S.M. Yiu

Efficient Computational Design of Tiling Arrays Using a Shortest Path Approach .......................... 383
Alexander Schliep and Roland Krause

Efficient and Accurate Construction of Genetic Linkage Maps from Noisy and Missing Genotyping Data .......................... 395
Yonghui Wu, Prasanna Bhat, Timothy J. Close, and Stefano Lonardi

A Novel Method for Signal Transduction Network Inference from Indirect Experimental Evidence .......................... 407
Réka Albert, Bhaskar DasGupta, Riccardo Dondi, Sema Kachalo, Eduardo Sontag, Alexander Zelikovsky, and Kelly Westbrooks

Composing Globally Consistent Pathway Parameter Estimates Through Belief Propagation .......................... 420
Geoffrey Koh, Lisa Tucker-Kellogg, David Hsu, and P.S. Thiagarajan

Author Index .......................... 431
Shotgun Protein Sequencing

Pavel A. Pevzner

Ronald R. Taylor Professor of Computer Science, University of California, San Diego, La Jolla, CA 92093
Abstract. Despite significant advances in the identification of known proteins, the analysis of unknown proteins by tandem mass spectrometry (MS/MS) still remains a challenging open problem. Although Klaus Biemann recognized the potential of mass spectrometry for sequencing of unknown proteins in the 1980s, low-throughput Edman degradation followed by cloning still remains the main method to sequence unknown proteins. The automated spectral interpretation has been limited by a focus on individual spectra and has not capitalized on the information contained in spectra of overlapping peptides. Indeed, the powerful Shotgun DNA Sequencing strategies have not been extended to protein sequencing yet. We demonstrate, for the first time, the feasibility of Shotgun Protein Sequencing of protein mixtures and validate this approach by generating highly accurate de novo reconstructions of various proteins in western diamondback rattlesnake venom. We further argue that Shotgun Protein Sequencing has the potential to overcome the limitations of current protein sequencing approaches and thus catalyze the otherwise impractical applications of proteomics methodologies in studies of unknown proteins. We further describe applications of this technique to analyzing proteins that are not directly inscribed in DNA sequences (like antibodies and fusion proteins in cancer). This is a joint work with Nuno Bandeira (UCSD) and Karl Clauser (Broad).
Locality Kernels for Protein Classification

Evgeni Tsivtsivadze, Jorma Boberg, and Tapio Salakoski

Turku Centre for Computer Science (TUCS), Department of Information Technology, University of Turku, Joukahaisenkatu 3-5 B, FIN-20520 Turku, Finland
[email protected]
Abstract. We propose kernels that take advantage of local correlations in sequential data and present their application to the protein classification problem. Our locality kernels measure protein sequence similarities within a small window constructed around matching amino acids. The kernels incorporate positional information of the amino acids inside the window and allow a range of position dependent similarity evaluations. We use these kernels with regularized least-squares algorithm (RLS) for protein classification on the SCOP database. Our experiments demonstrate that the locality kernels perform significantly better than the spectrum and the mismatch kernels. When used together with RLS, performance of the locality kernels is comparable with some state-of-the-art methods of protein classification and remote homology detection.
1 Introduction
One important task in computational biology is inference of the structure and function of the protein encoded in the genome. The similarity of protein sequences may imply structural and functional similarity. The task of detecting these similarities can be formalized as a classification problem that treats proteins as a set of labeled examples which are in the positive class if they belong to the same family and in the negative class otherwise. Recently, the applicability of this discriminative approach for detecting remote protein homologies has been demonstrated by several studies. For example, Jaakkola et al. [1] show that by combining a discriminative learning algorithm with the Fisher kernel for extraction of the relevant features it is possible to achieve good performance in protein family recognition. Liao and Noble [2] further improve the results presented in [1] by proposing a combination of pairwise sequence similarity feature vectors with the Support Vector Machine (SVM) algorithm. Their algorithm, called SVM-pairwise, performs significantly better than several other baseline methods such as SVM-Fisher, PSI-BLAST and profile HMMs. The methods described in [1] and [2] use an expensive step of generating vector valued features for protein discrimination problems, which increases the computational time of the algorithm. The idea of using a simple kernel function that can be efficiently computed and does not depend on any generative model or separate preprocessing step is considered by Leslie et al. in [3]. They show that
simple sequence-based kernel functions perform surprisingly well compared to other computationally expensive approaches. In this study, we address the problem of protein sequence classification using the RLS algorithm with locality kernels similar to the one we proposed in [4]. The features used by the locality kernels represent sequences contained in a small window constructed around matching amino acids in the compared proteins. The kernels make use of a range of similarity evaluations within the windows, namely:

– position insensitive matching: amino acids that match are taken into account irrespective of their position;
– position sensitive matching: amino acids that match but have different positions are penalized;
– strict matching: only amino acids that match and have the same positions are taken into account.

By incorporating information about the relevance of local correlations and the positions of amino acids in the sequence into the kernel function, we demonstrate significantly better performance in protein classification on the Structural Classification of Proteins (SCOP) database [5] than that of the spectrum and the mismatch kernels [3,6,7]. Previously, we have shown that the locality-convolution kernel [4] can be successfully applied to the parse ranking task in natural language processing. The similarity of the data representation in the cases of biological sequences and text, as well as the results obtained in this study, suggests that locality kernels can be applied to tasks where local correlations and positional information within the sequence might be important.

The paper is organized as follows. In Section 2, we present an overview of the RLS algorithm. In Section 3, we define the notions of locality window and positional matching, and present the locality kernels. In Section 5, we evaluate the applicability of the locality kernels for the task of protein classification and compare their performance with the spectrum and the mismatch kernels. We conclude this paper in Section 6.
2 Regularized Least-Squares Algorithm
Let $\{(x_1, y_1), \ldots, (x_t, y_t)\}$, where $x_i = (x_1, \ldots, x_n)^T$, $x_i \in S$ and $y_i \in \{0, 1\}$, be the set of training examples. The target output value $y_i$ is a label which is 0, indicating that $x_i$ does not belong to the class, or 1 otherwise. The target output value is predicted by the regularized least-squares (RLS) algorithm [8,9]. We denote the matrix whose rows are $x_1^T, \ldots, x_t^T$ by $X$ and the vector of output labels by $y = (y_1, \ldots, y_t)^T$. The RLS algorithm corresponds to solving the following optimization problem:

$$\min_{w} \sum_{i=1}^{t} (y_i - f(x_i))^2 + \lambda \|w\|^2, \qquad (1)$$

where $f : S \to \mathbb{R}$, $w \in \mathbb{R}^n$ is a vector of parameters such that $f(x) = \langle w, x \rangle$, and $\lambda \in \mathbb{R}_+$ is a regularization parameter that controls the trade-off between fitting the training set accurately and finding the smallest norm for the function $f$.
Rewriting (1) in matrix form and taking the derivative with respect to $w$, we obtain

$$w = (X^T X + \lambda I)^{-1} X^T y, \qquad (2)$$

where $I$ denotes the identity matrix of dimension $n \times n$. In (2) we must perform a matrix inverse in the dimension of the feature space, that is, $n \times n$. However, if the number of features is much larger than the number of training data points, a more efficient way is to perform the inverse in the dimension of the training examples. In that case, following [9], we present (2) as a linear combination of the training data points:

$$w = \sum_{i=1}^{t} a_i x_i, \qquad (3)$$

where

$$a = (K + \lambda I)^{-1} y \qquad (4)$$

and $K_{ij} = k(x_i, x_j)$ is a kernel matrix that contains the pairwise similarities of data points computed by a kernel function $k : S \times S \to \mathbb{R}$. Finally, we predict the output of a new data point as follows:

$$f(x) = \langle w, x \rangle = y^T (K + \lambda I)^{-1} k, \qquad (5)$$

where $k_i = k(x_i, x)$. Kernel functions are similarity measures of data points in the input space $S$, and they correspond to an inner product in a feature space $H$ to which the input space data points are mapped. Kernel functions are defined as $k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$, where $\Phi : S \to H$. Next we formulate the locality kernel functions that are used with the RLS algorithm for the protein classification task.
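The closed-form solution (4)-(5) is simple to implement once a kernel matrix is available. The following minimal sketch (our own illustration in Python/NumPy, not the authors' implementation; the toy data are made up) trains RLS on a precomputed kernel and scores a new point:

```python
import numpy as np

def rls_train(K, y, lam):
    # Solve (4): a = (K + lambda*I)^{-1} y, with K the t x t kernel matrix.
    t = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(t), y)

def rls_predict(a, k_new):
    # Predict via (5): f(x) = sum_i a_i k(x_i, x), with k_new[i] = k(x_i, x).
    return k_new @ a

# Toy example with a linear kernel on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))          # 20 training points, 5 features
y = (X[:, 0] > 0).astype(float)       # binary 0/1 labels
K = X @ X.T                           # linear kernel matrix
a = rls_train(K, y, lam=1.0)
x_new = rng.normal(size=5)
print(rls_predict(a, X @ x_new))      # real-valued score; threshold to classify
```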
3 Locality Kernels
There are three key properties of the locality kernels that make them applicable to the task of remote homology detection in proteins. Firstly, the features used by these kernels contain amino acids that are extracted in the order of their appearance in the protein sequence. Secondly, local correlations within the protein sequence are taken into account by constructing a small window around the matching amino acids. Finally, positional information of the amino acids contained within the window is used for similarity evaluation. Let us consider proteins $p$, $q$ and let $p = (p_1, \ldots, p_{|p|})$ and $q = (q_1, \ldots, q_{|q|})$ be their amino acid sequences. The similarity of $p$ and $q$ is obtained with the kernel

$$k(p, q) = \sum_{i=1}^{|p|} \sum_{j=1}^{|q|} \kappa(i, j). \qquad (6)$$
By defining $\kappa$ in the general formulation (6), we obtain different similarity functions between proteins. If we set $\kappa(i, j) = \delta(p_i, q_j)$, where

$$\delta(x, y) = \begin{cases} 0, & \text{if } x \neq y \\ 1, & \text{if } x = y \end{cases}$$

then (6) equals the number of matching amino acids irrespective of their position in the two sequences. To take into account local correlations within a sequence, we construct small windows of length $2w + 1$ around the matching amino acids. In addition, we define a real-valued $(2w + 1) \times (2w + 1)$ matrix $P$ that we use in the formulation of $\kappa$. The positional matrix $P$ stores information about the relevance of particular positions in the compared windows for the similarity evaluation task (see [10] for a related approach). Entries of $P$ contain real-valued coefficients that are defined for all possible position pairs within two windows. Below we propose several ways of selecting an appropriate $P$ for the task in question. Let us consider the following kernel function:

$$\kappa(i, j) = \delta(p_i, q_j) \sum_{h,l=-w}^{w} [P]_{h,l} \, \delta(p_{i+h}, q_{j+l}). \qquad (7)$$

Note that the rows and the columns of the positional matrix $P$ are indexed from $-w$ to $w$. Furthermore, we consider amino acids as mismatched when the indices $i + h$ and $j + l$ are not valid, e.g. $i + h < 1$ or $i + h > |p|$. When we set $P = A$, where $A$ is a matrix whose elements are all ones, we get a $\kappa$ that counts the matching amino acids irrespective of their positions in the two windows. As another alternative, we can construct a function that requires the positions of matching amino acids to be exactly the same. This is obtained with $P = I$, where $I$ denotes the identity matrix. Furthermore, when $P$ is a diagonal matrix whose elements are weights increasing from the boundary to the center of the window, we obtain a kernel that is related to the locality improved kernel proposed in [11]. However, if we do not require strict position matching, but rather penalize matches that have a different position within the windows, we can use a positional similarity matrix whose off-diagonal elements are nonzero and smaller than the diagonal elements. We obtain such a matrix, for example, by

$$[P]_{h,l} = e^{-\frac{(h-l)^2}{2\theta^2}}, \qquad (8)$$

where $\theta \geq 0$ is a parameter. The choice of an appropriate $\kappa$ is a matter closely related to the domain of the study. In Section 5 we show that the positional information captured with (7) is useful and improves classification performance. Using (7) with different positional matrices in (6) yields the kernels that we call the locality kernels. Due to the kernel closure properties and the positive semidefiniteness of the matrix $P$, the locality kernels are indeed valid kernel functions. Our kernels could be considered within the more general convolution
framework described by Haussler [12]. From this point of view, we can distinguish between “structures” and “different decompositions” constructed by our kernels. Informally, we are enumerating all the substructures representing pairs of windows built around the matching amino acids in the proteins and calculating their similarity.
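As a concrete illustration of (6)-(8), the sketch below computes the position sensitive locality kernel directly from the definitions. This is our own naive Python rendering (not the authors' code); the quadruple loop is kept deliberately simple and runs in O(|p||q|w²) time:

```python
import numpy as np

def positional_matrix(w, theta):
    # Gaussian positional matrix (8): [P]_{h,l} = exp(-(h-l)^2 / (2 theta^2)).
    idx = np.arange(-w, w + 1)
    H, L = np.meshgrid(idx, idx, indexing="ij")
    return np.exp(-((H - L) ** 2) / (2.0 * theta ** 2))

def locality_kernel(p, q, w=1, theta=0.9):
    # Kernel (6) with kappa from (7); out-of-range positions count as mismatches.
    P = positional_matrix(w, theta)
    total = 0.0
    for i in range(len(p)):
        for j in range(len(q)):
            if p[i] != q[j]:
                continue                      # delta(p_i, q_j) = 0
            for h in range(-w, w + 1):
                for l in range(-w, w + 1):
                    if (0 <= i + h < len(p) and 0 <= j + l < len(q)
                            and p[i + h] == q[j + l]):
                        total += P[h + w, l + w]
    return total

print(locality_kernel("MKTAY", "MKSAY"))
```

Setting theta very small recovers strict matching (P approaches I), while a large theta approaches the position insensitive case (P approaches A).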
4 Spectrum and Mismatch Kernels
The spectrum kernel introduced in [3] (see also [9]) is a very efficient kernel for sequence similarity estimation. It compares two sequences by counting the common contiguous subsequences of length $v$ that are contained in both of them. Thus, the spectrum kernel can be considered as an inner product between vectors containing frequencies of the matching subsequences. For consistency, we present the spectrum and the mismatch kernels within the framework already described for the locality kernels. For a detailed feature map of these kernels, we refer to [7]. The spectrum kernel is obtained by using

$$\kappa(i, j) = \prod_{l=0}^{v-1} \delta(p_{i+l}, q_{j+l}) \qquad (9)$$

in (6). Leslie et al. [6] also proposed a more sensitive kernel function called the mismatch kernel. The intuition behind this approach is that the similarity between two sequences is large if they share many similar subsequences. By restricting the number of mismatches between the subsequences of length $v$ to $m$, the $(v, m)$-mismatch kernel is obtained by using

$$\kappa(i, j) = \begin{cases} 0, & \text{if } \sum_{l=0}^{v-1} \delta(p_{i+l}, q_{j+l}) < v - m \\ 1, & \text{otherwise} \end{cases} \qquad (10)$$

in (6). The spectrum kernel (9) is a special case of the mismatch kernel with $m = 0$. Again, we consider amino acids as mismatched in (9) and (10) when the indices $i + l$ and $j + l$ are not valid, that is, $i + l > |p|$ or $j + l > |q|$.
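A direct transcription of (9) and (10) into the same framework (again our own sketch, not the original implementation) makes the relationship between the two kernels explicit:

```python
def spectrum_kappa(p, q, i, j, v):
    # kappa of (9): 1 iff the length-v substrings starting at i and j match.
    if i + v > len(p) or j + v > len(q):
        return 0
    return int(all(p[i + l] == q[j + l] for l in range(v)))

def mismatch_kappa(p, q, i, j, v, m):
    # kappa of (10): 1 iff the substrings differ in at most m positions.
    if i + v > len(p) or j + v > len(q):
        return 0
    matches = sum(p[i + l] == q[j + l] for l in range(v))
    return int(matches >= v - m)

def kernel(p, q, kappa, **params):
    # Generic sum (6) over all starting-position pairs.
    return sum(kappa(p, q, i, j, **params)
               for i in range(len(p)) for j in range(len(q)))

print(kernel("MKTAYIA", "MKSAYIA", spectrum_kappa, v=3))
print(kernel("MKTAYIA", "MKSAYIA", mismatch_kappa, v=6, m=1))
```

With m = 0 the mismatch condition collapses to an exact match of all v positions, which is precisely (9).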
5 Experiments
The experiments to evaluate performance of RLS with the locality kernels, the spectrum kernel, and the (v, m)-mismatch kernel are conducted on the SCOP [5] database. The aim is to classify protein domains into SCOP-superfamilies. We follow the experimental setup and use the dataset described in [2]. For each family, the protein domains within the family are considered positive test examples, and protein domains outside the family but within the same superfamily are considered as positive training examples. Negative examples are taken from outside of the positive sequences’ fold and are randomly split into training and
testing sets in the same ratio as positive examples. By this setup, we may simulate remote homology detection, because protein sequences belonging to different families but to the same superfamily are considered to be remote homologs in SCOP. To measure performance of the methods, we use receiver operating characteristics (ROC) scores. The ROC score is the normalized area under a curve (AUC) that represents true positives as a function of false positives for varying classification thresholds [13,14]. When obtaining perfect classification, the ROC score is 1, and random classification yields a score of 0.5.

In Table 1, we present the best found parameters for the locality kernels with different positional matrices P, the spectrum and the (v, m)-mismatch kernels. The best found size of the window for the locality kernel is three (w = 1). The spectrum kernel has a parameter v corresponding to the size of the subsequence, and the mismatch kernel uses v and m, where m is the maximum number of allowed mismatches. The best found parameters for the spectrum and the mismatch kernels correspond to the ones reported in [3,6]. The RLS algorithm has the regularization parameter λ that controls the trade-off between the minimization of the training error and the complexity of the regression function. The results reported below are obtained with the best found combination of the parameters for every method.

The main results of the experiments are summarized in Figure 1. Each curve corresponds to RLS with a specific kernel function for remote homology detection. Higher curves reflect more accurate classification performance. Each plotted data point represents the number of the families that have a ROC score higher than the corresponding value. We observe that RLS with the position sensitive locality kernel with positional matrix (8) performs significantly better (p < 0.05) than RLS with the spectrum or the mismatch kernels. We evaluate the statistical significance of the performance differences using the Wilcoxon signed-ranks test. The locality kernel using positional matrix P = I and a small window slightly loses to the position sensitive locality kernel with matrix (8) in performance, whereas the position insensitive locality kernel performs worst of all. Therefore, we do not present these results in Figure 1. We also observe that for the few families that are classified with high scores by all kernels the mismatch kernel is the best; however, for the rest of the families the locality kernel outperforms both the spectrum and the mismatch kernel. In Figures 2 and 3 we give a more detailed performance comparison of the locality, the spectrum and the mismatch kernels. Clearly, the classification

Table 1. The best found parameters used for conducting the experiments

Kernel  Positional matrix                    Best parameters                  Figures
(7)     P = A                                w = 1                            -
(7)     [P]_{h,l} = e^{-(h-l)^2/(2θ^2)}      w = 1, θ = 0.9                   1, 2 and 3
(7)     P = I                                w = 1                            -
(9)     -                                    v = 3                            1 and 2
(10)    -                                    m = 1, v = 6 and m = 2, v = 8    1 and 3
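As an aside, the ROC score (AUC) used throughout this section can be computed directly from a classifier's real-valued outputs via the rank-sum formulation; the snippet below is our own illustrative sketch, not the authors' evaluation code:

```python
import numpy as np

def roc_score(y_true, scores):
    # AUC as the probability that a random positive outscores a random
    # negative (Mann-Whitney statistic); ties count as half.
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = ((pos[:, None] > neg[None, :]).sum()
            + 0.5 * (pos[:, None] == neg[None, :]).sum())
    return wins / (len(pos) * len(neg))

y = np.array([1, 1, 0, 0, 1, 0])
s = np.array([0.9, 0.8, 0.7, 0.2, 0.6, 0.1])
print(roc_score(y, s))  # 0.888...; 1.0 is perfect, 0.5 is random
```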
[Figure 1 plot omitted: number of families (0-60, y-axis) versus ROC score (0.1-1, x-axis), with curves for the (3)-spectrum kernel, the position sensitive locality kernel, the (8,2)-mismatch kernel, and the (6,1)-mismatch kernel.]
Fig. 1. Performance comparison of RLS with the locality (position sensitive), the spectrum (subsequences of length 3) and the mismatch (subsequences of length 6 and 8, and number of mismatches 1 and 2, respectively) kernels for remote homology detection using 54 families of the SCOP database. Each data point on the curve represents the number of the families having higher ROC score for the method. 1 0.9 0.8
(3)-Spectrum kernel
0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Locality kernel (position sensitive)
Fig. 2. Family-by-family performance comparison of RLS with the spectrum (subsequences of length 3) and the locality (position sensitive) kernels. The coordinates of each point are ROC scores obtained for one SCOP family.
[Figure 3 scatter plot omitted: (8,2)-mismatch kernel ROC score (y-axis) versus position sensitive locality kernel ROC score (x-axis), both on a 0-1 scale.]
Fig. 3. Family-by-family performance comparison of RLS with the mismatch (subsequences of length 8, number of mismatches 2) and the locality (position sensitive) kernels. The coordinates of each point are ROC scores obtained for one SCOP family.
performance when using the position sensitive locality kernel is better than that of the spectrum and the mismatch kernels. In addition to the conducted experiments, we evaluated performance of the blended spectrum kernel [9], that is all subsequences of sizes from one to v are simultaneously compared, when measuring similarities between the proteins. However, performance of the blended spectrum kernel is not notably better than that of the spectrum kernel and its computation requires more time.
6 Conclusions
In this study, we propose kernels that take advantage of local correlations and positional information in sequential data and present their application to the protein classification problem. The locality kernels measure the protein similarities within a small window constructed around matching amino acids in both sequences. These kernels make use of the range of similarity evaluations within the windows, namely position insensitive matching, position sensitive matching, and strict matching. We demonstrate that RLS with our locality kernels performs significantly better than RLS with the spectrum or the mismatch kernels in recognition of previously unseen families from the SCOP database. Throughout our experiments we observe that the locality kernels incorporating positional information
perform better than the locality kernels that are insensitive to the positions of the amino acids within the windows containing protein subsequences. Although we do not conduct experiments to compare the performance of RLS with the locality kernels to other algorithms, by examining the results reported in [2,3,15] we may suggest that our method performs comparably with some state-of-the-art algorithms used for remote homology detection and protein classification. Moreover, our simple method does not require the expensive step of generating vector valued features used in algorithms such as SVM-pairwise or SVM-Fisher. In the future we plan to cast the classification problem of protein sequences as a bipartite ranking task, and we aim to obtain better classification performance by maximizing AUC instead of minimizing the least squares error.
Acknowledgments. We would like to thank CSC, the Finnish IT Center for Science, for providing us with computing resources.
References

1. Jaakkola, T., Diekhans, M., Haussler, D.: A discriminative framework for detecting remote protein homologies. Journal of Computational Biology 7, 95–114 (2000)
2. Liao, L., Noble, W.S.: Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships. Journal of Computational Biology 10, 857–868 (2003)
3. Leslie, C., Eskin, E., Noble, W.S.: The spectrum kernel: A string kernel for SVM protein classification. In: Pacific Symposium on Biocomputing, pp. 566–575 (2002)
4. Tsivtsivadze, E., Pahikkala, T., Boberg, J., Salakoski, T.: Locality-convolution kernel and its application to dependency parse ranking. In: Ali, M., Dapoigny, R. (eds.) IEA/AIE 2006. LNCS (LNAI), vol. 4031, pp. 610–618. Springer, Heidelberg (2006)
5. Hubbard, T.J.P., Murzin, A.G., Brenner, S.E., Chothia, C.: SCOP: a structural classification of proteins database. Nucleic Acids Research 25, 236–239 (1997)
6. Leslie, C.S., Eskin, E., Cohen, A., Weston, J., Noble, W.S.: Mismatch string kernels for discriminative protein classification. Bioinformatics 20, 467–476 (2004)
7. Leslie, C., Kuang, R.: Fast string kernels using inexact matching for protein sequences. J. Mach. Learn. Res. 5, 1435–1455 (2004)
8. Poggio, T., Smale, S.: The mathematics of learning: Dealing with data. Amer. Math. Soc. Notice 50, 537–544 (2003)
9. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, New York, USA (2004)
10. Pahikkala, T., Pyysalo, S., Ginter, F., Boberg, J., Järvinen, J., Salakoski, T.: Kernels incorporating word positional information in natural language disambiguation tasks. In: Russell, I., Markov, Z. (eds.) Proceedings of the Eighteenth International Florida Artificial Intelligence Research Society Conference, Menlo Park, Ca., pp. 442–447. AAAI Press, Stanford, California, USA (2005)
11. Zien, A., Ratsch, G., Mika, S., Scholkopf, B., Lengauer, T., Muller, K.-R.: Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics 16, 799–807 (2000)
12. Haussler, D.: Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, UC Santa Cruz (1999)
13. Gribskov, M., Robinson, N.L.: Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Computers & Chemistry 20, 25–33 (1996)
14. Fawcett, T.: ROC graphs: Notes and practical considerations for data mining researchers. Technical Report HPL-2003-4, HP Labs (2003)
15. Kuang, R., Ie, E., Wang, K., Wang, K., Siddiqi, M., Freund, Y., Leslie, C.: Profile-based string kernels for remote homology detection and motif extraction. J. Bioinform. Comput. Biol. 3, 527–550 (2005)
When Less Is More: Improving Classification of Protein Families with a Minimal Set of Global Features

Roy Varshavsky¹*, Menachem Fromer¹, Amit Man¹, and Michal Linial²

¹ School of Computer Science and Engineering, The Hebrew University of Jerusalem
² Department of Biological Chemistry, The Hebrew University of Jerusalem
[email protected]
Abstract. Sequence-derived structural and physicochemical features have been used to develop models for predicting protein families. Here, we test the hypothesis that high-level functional groups of proteins may be classified by a very small set of global features directly extracted from sequence alone. To test this, we represent each protein using a small number of normalized global sequence features and classify them into functional groups, using support vector machines (SVM). Furthermore, the contribution of specific subsets of features to the classification quality is thoroughly investigated. The representation of proteins using global features provides effective information for protein family classification, with comparable results to those obtained by representation using local sequence alignment scores. Furthermore, a combination of global and local sequence features significantly improves classification performance. Keywords and Abbreviations: Support Vector Machines (SVM), Feature Selection, Olfactory Receptor, Porins protein family.
1 Introduction
Protein classification is a central task in computational biology. A routinely-used principle in classification relies on a distance measure between protein sequences, as obtained by the Smith-Waterman local alignment algorithm or by one of a large number of heuristic search methods such as BLAST, PSI-BLAST [1], search by HMM [2, 3] models and by profile-profile search [4, 5]. These methods are typically based on matching subsequences, i.e. local sequence features. Despite the observed strength of these methods, many functional assignments for proteins fail to be detected by such local sequence-based methods [6], thus
* Corresponding author. We thank David Horn for advising and guiding R.V., and Nati Linial, Elon Portugaly and Yaniv Loewenstein for fruitful discussions. R.V. and M.F. are supported by the Sudarsky Center for Computational Biology of the Hebrew University of Jerusalem. This research was partially supported by a grant from the Israel Ministry of Defense. Supplementary Data and Code: www.protonet.cs.huji.ac.il/sequence_features.
yielding a larger than desired fraction of false negatives, especially at more coarse-grained (higher) levels of protein classification hierarchies. The shortcomings of the methods outlined above are partly derived from the fact that there exist many proteins that share very low sequence similarity and are thus considered to be in the ”twilight zone”, but nonetheless share strong structural similarity that reflects their homology [7]. Short proteins represent another set of proteins that often fail to be classified by their sequence similarity due to their low statistical significance scores [8]. Finally, for many proteins the sequence similarity methods fail in detecting related sequences and as a result, a large fraction of singletons are reported within the protein space [9]. An additional confounding factor is that, in practice, the large number of protein sequences currently available imposes a computational challenge for the protein family classification problem. Currently, > 4.5 million sequences are stored in the UniProt database, and this collection is expected to grow [10]. A reduction to 3 and to 1.5 million sequences is achieved by UniRef90 and UniRef50, respectively (i.e., no two sequences are permitted to share more than 90% or 50% identity, respectively). Since even such vast reductions in redundancy yield very large quantities of sequences, the power of the ubiquitously used local sequence similarity methods are severely strained. Similarly, each new multi-cellular eukaryotic genome sequenced introduces thousands of new sequences that wait for functional assignments, again burdening the local sequence similarity algorithms. To address the challenges in large-scale functional assignment, a complementary line of research has used a spectrum of sequence features ranging from amino acid (aa) composition to the appearance of short sequence motifs [11]. Besides perhaps improving upon the results of local-based methods, this research is expected to provide information for classification of more distantly related protein families, where local-based methods may often fail. One such attempt was presented by SVM-Prot [12]. The classification system was trained from representative proteins for ∼50 functional families extracted from Pfam [13]. Using a large number of features and an SVM classifier, high success in separating these protein families was reported. A different approach was carried out in [14], where a mixture of probabilistic decision trees for direct prediction of protein functions was applied. In [14], the proteins are represented by hundreds of features, including secondary structure assignment and structural-based information. Despite their success, these approaches do not always allow for interpretations and inferences based on the full interplay among features. In addition, the large set of features used could inadvertently conceal the fact that the prediction task is easier than it seems: it may be sufficient to consider only a small set of global features. While it may seem overly ambitious to expect the task of protein family classification to succeed based only on a small set of sequence features, similar features were successfully applied for restricted, but related, tasks. Successful examples include distinguishing membranous and globular proteins, separating sub-cellular localization, [15], determination of topology for multi-pass proteins [16], and even prediction of protein quaternary structure [17].
Herein we assume a minimalist feature-based approach, which for reductionism-based motivations does not take into account secondary or tertiary structure information, even when reliable predictions are available. Moreover, we ignore features derived from short motifs that are currently known to be associated with specific protein families, functions, or subcellular localizations. We thus address the following questions regarding a small set of easily extracted global sequence features: (i) Does there exist a small (minimal) set of features that provides high-quality protein family characterization? (ii) Is the information conveyed by global features redundant or, rather, complementary to that provided by the local features? (iii) And, more generally, are there some biological insights that predict the prototypical successes and failures of feature-based classifications? To define the minimal set of features sufficient for functional classification, we: (i) test the capacity of predetermined, small subsets of features, and (ii) incorporate machine learning tools (specifically, feature selection) to automatically determine those features. Feature selection is a fundamental component in large-scale data analysis as a preprocessing step. In general, preprocessing involves some operation on the feature space intended to reduce the dimensionality. In feature selection, only a particular subset of features is chosen and used in subsequent computational tasks. There are two major classes of feature selection strategies: filters and wrappers. Filter methods rank and choose the features according to some criterion (e.g., data separation). Wrapper methods optimize an objective function through the selection of features. For a comprehensive survey, see [18]. Herein, we apply one filter and two wrappers to the data.
2 Data and Methods

2.1 Data
As a test case, we consider 10 large protein groups that represent the known diversity of cellular processes and functions. Protein sequences and annotations were retrieved from the UniProt 8.1 database [10]. In order to avoid redundancy, we used the UniRef50 database [10]. Groups were selected based on Gene Ontology (GO) assignments [19], such that their sizes would range from 300–1000 proteins each. In total, 5,471 proteins are included in the analysis (Table 1).

2.2 Preprocessing
We compare two alternative representations of these ∼5,500 proteins: either according to local sequence similarities, or according to global sequence features:

1. Local Sequence Similarities. All pairs of proteins were aligned using the Smith-Waterman (SW) local alignment algorithm [20]. Since the SW score is strongly dependent on protein length, the raw scores matrix was transformed to a matrix of normalized scaled scores, based on the percentile binning of scores in each column. As a result, the range of values in the scaled matrix is [0, 1]. Note that the column-by-column transformation yields an asymmetric matrix. (A sketch of this transformation is given below.)
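A minimal sketch of the column-wise percentile transformation just described (our own illustration; sw_scores stands in for the all-against-all Smith-Waterman score matrix, which we fake with random numbers here):

```python
import numpy as np

def percentile_scale_columns(scores):
    # Replace each entry by its percentile rank within its own column,
    # rescaling every column to [0, 1]. Because each column is ranked
    # independently, the result is in general asymmetric.
    t = scores.shape[0]
    ranks = scores.argsort(axis=0).argsort(axis=0)  # per-column ranks 0..t-1
    return ranks / (t - 1)

rng = np.random.default_rng(1)
sw_scores = rng.integers(10, 500, size=(6, 6)).astype(float)  # stand-in
scaled = percentile_scale_columns(sw_scores)
print(scaled.min(), scaled.max())  # 0.0 1.0
```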
Table 1. Representative set of 10 groups derived from the GO systems: cellular component (CC), molecular function (MF) and biological process (BP)

Group  Type  GO Term name                      GO ID  Group Size (UniRef50)
1      CC    Nucleosome                        786    319
2      MF    Olfactory receptor activity       4984   478
3      CC    Vacuole                           5773   533
4      CC    Microtubule                       5874   913
5      CC    Plasma membrane                   5886   781
6      BP    Tricarboxylic acid cycle          6099   476
7      BP    DNA unwinding during replication  6268   520
8      CC    Thylakoid                         9579   448
9      MF    Porin activity                    15288  644
10     CC    Myosin complex                    16459  359
Total                                                 5471
2. Global Sequence Features. Extracting the features: Only features that are "global" and can be applied to proteins with minimal biological pre-knowledge are included (e.g., the calculated isoelectric point of a protein). Biologically known signatures such as localization signals were not included. In summary, for each protein, 5 major attribute types (for a total of 70 features) are analyzed:

Amino acid composition [AAC] (20 features).

Amino acid grouped compositions [AAG] (11 features, see Table 3, Supplementary Data).

Post-translational modifications [PTM] (14 features, see Table 4, Supplementary Data). The PTM signatures are treated as regular expressions. Such patterns have been extracted from the Prosite database [21]. Only PTMs that are highly abundant in the database are included.

Biophysical properties of the full sequence [PHYS] (5 features):
(a) Length - the number of amino acids in the sequence
(b) Molecular weight [22]
(c) Predicted pI [22]
(d) Instability factor: based on the observation that the frequency of occurrence of certain dipeptides is significantly different in unstable proteins as compared to stable ones [23]
(e) 'Gravy' hydrophobicity index [24]

Amino acid enrichment [RICH] (20 features). We sampled an overlapping window of 20 aa in size, from the beginning of the sequence to the end. For each such window, the frequency of a certain aa was counted if it occurs at least 5 times its frequency in the UniProtKB database.
16
R. Varshavsky et al.
background level of a randomly selected set of approximately 40K proteins from the UniProtKB database. For each of the 70 features the percentile bins of the background were computed. Each feature was transformed according to its percentile, yielding values in the range [0, 1]. We also applied the scaling using a background set of the 5,500 proteins in our set (Table 1) and the results were practically identical to that of the randomly selected background set. 2.3
Classification
Firstly, the 10 groups were randomly partitioned into 3 subsets (groups 1-4, 57, and 8-10), where it was attempted to separate each group of proteins from the other groups in its subset. The classification algorithm chosen for the task was SVM (linear kernel, one-against-all classification), which has been proven to be very efficient for this type of task (e.g. [12, 11]). For each dataset in every representation used, the following procedures were applied: 1. Random selection of the train (80%) and test (20%) sets. 2. Use the train set: train and validate SVM (5-fold Cross validation). 3. Apply the resulting classifier to the test set, for prediction and assessment. In order to reduce bias toward extreme train-test partitions, procedures 1-3 (which we refer to as the classification block ) were repeated 5 times (which we refer to as the classification compound ). 2.4
Feature Selection
We consider two strategies for selection of the global sequence features, applying the classification compound for each. Note that the selections and wrappings are applied only to the train set. – Selection based on a-priori knowledge. The original (scaled) dataset is partitioned according to the 5 different feature categories: AAC (20), AAG (11), PTM (14), PHYS (5) and RICH (20). – Supervised feature selection methods. Here, various approaches are applied: 1. Single-wise selection (GREEDY) – a filter method: the 70 features in the train set are ranked according to their t-test separability criterion – the first 10 features are selected. 2. Forward Filtering (FF) – a wrapper method, which starts out with 0 features and adds the most contributing feature to the predictive score (Jaccard, see below) of the train set. Feature addition is continued until no improvement in the score is achieved. 3. Backward Elimination (BE) – a wrapper method, which starts out with all features and removes the least contributing feature to the predictive score (Jaccard, see below) of the train set. Feature removal is continued until no improvement in the score is achieved.
When Less Is More: Improving Classification of Protein Families
2.5
17
Evaluation
For each classification block, TP, TN, FP, and FN counts are recorded (where TP, TN, FP, and FN denote the number of true positive, true negative, false positive, and false negative outcomes, respectively (detailed tables of all values appear in the Supplementary Data). We have applied the strict Jaccard score (J-score) that combines precision (specificity) and recall (sensitivity), but does not take into account the TN. The J-score is defined as: J = TP/(TP+FP+FN).
3
Results
In order to demonstrate both the strengths and limitations of the framework, we describe the results for two example groups. Detailing both computational and biological aspects, we demonstrate different scenarios that directly derive from the groups’ characterization (for the remaining 8 groups, see Supplementary Data); we then discuss the overall patterns, suggest a unique feature combination platform and draw some conclusions. We analyzed large sets of proteins based on their GO annotations. For representative sets, we ensured that their sizes (at a level of lower than 50% identity for any pair in the set) ranged from 300-1000 and that, overall, they represent a broad range of functionality of enzymes, membranous components (olfactory and transporters), cytoskeletal elements (myosin) and compartment-based annotations (i.e. vacuole). 3.1
Olfactory Receptor Activity Proteins
The first group we consider is the olfactory receptor activity proteins, consisting of ∼500 proteins (3,900 proteins in UniProtKB), which are cell surface receptors that recognize chemical compounds (odorants). Odorant binding to its cognate receptor leads to membrane depolarization, activating a signaling cascade. Could we gain any insight into the group, by revisiting the features selected to separate it from the other groups tested? Here, the FF approach performs almost as well as using all features (0.89 and 0.91, respectively, Fig. 1). Only 8 features are chosen by FF: AAG (hydrophilic), AAC (G), RICH (Y), PHYS (instability), AAC (T), AAG (sulfur-containing), AAC (V), and AAG (helix-redundant aa). The most powerful feature selected under the FF protocol marks the hydrophilic nature of this protein group. Even though the olfactory receptors are characterized by their seven membrane-transversing helices, the hydrophobic nature of these helices was not among the separating features. On the other hand, the leading feature chosen was the hydrophilic signal of the molecule, derived from the region of the protein facing the aqueous environment on either side of the membrane (protein loops and tails). In an effort to characterize motifs that specify the olfactory receptors, 10 short motifs were determined, and they were all found to reside in the loops and tails of the proteins [25]. Similarly, 5 short PSSM motifs were used to characterize this family by BLOCKS [26]. Again, four of them are indeed in the hydrophilic segments of the proteins.
Fig. 1. J-score results of SVM classification, for various protein representations, of the olfactory receptor activity group. Bars are of all 70 global features (All: black), the 5 different feature types (AAC, AAG, PTM, PHYS and RICH: gray), and the 3 automated feature selection schemes (GREEDY, FF and BE: blue). As a reference, a random classification of the dataset is shown (100 iterations, RAND: white).
Other features yielded by FF include the frequencies of glycine (G) and threonine (T). Also among the features that contributed to the separation is the richness of tyrosine (Y). It has been noted that tyrosine is quite abundant, and specifically a short sequence 'MAYDRY' (tyrosine at positions 3 and 6) is conserved among most of the olfactory receptors in the group [27]. This short sequence leads to significant enrichment over the entire tested set. The rest of the selected features are cysteine (C) and methionine (M) (grouped as sulfur-containing aa), valine (V), and the helix-redundant amino acid group. The fact that this group of transmembrane proteins was distinguished from the other groups through the use of helix-redundant amino acids is not completely surprising, since the proteins' membrane-spanning segments are composed of alpha helices. This detailed example illustrates that the selection of the most informative features (8 features in this case) covers diverse but complementary properties of the proteins.
3.2 Porin Proteins
The other group we discuss is that of the bacterial porins, consisting of about 650 proteins (3,500 proteins in UniProtKB) that are localized to the outer membrane of Gram-negative bacteria but also found in plastids and mitochondria [28]. Among the major outer membrane proteins in bacteria, porins form large channels that allow the diffusion of small hydrophilic molecules (<1000 daltons). Classification results for the porin proteins group are displayed in Fig. 2. Classification quality reaches a J-score of ∼0.75. The global feature methods outperform the local feature method (J-score ∼0.66). Interestingly, FF requires only three features for successful classification (J-score 0.68): AAC (G), AAC (I), and AAG (aromatic). To evaluate the relative contribution of each of these features, we applied the classification compound using either the first 1, 2 or 3 features. The results (Fig. 3) show that the first feature by itself has a strong
Fig. 2. Results of SVM classification, for various protein representations, of the porin activity proteins (notations, axes and colors are as in Fig. 1)
Fig. 3. The contribution of the first 3 features, selected by the FF method, to the classification quality of the porins group. The results are of random classification (white), classification using the single most, two most, and three most contributing features (AAC (G), AAC (I), and AAG (aromatic), gray), and all 70 features (black).
classification capability, with only marginal contributions from the following two. The remaining 67 features have only a negligible contribution.
3.3 Group Size, Selection Method and Success
In order to estimate which protein families are best characterized by global features and which methods are preferred, we applied several analyses. We computed the number of selected features in BE and FF. For the 10 groups of proteins presented here, the average number of features eliminated in the BE protocol is 5.4, and for FF an average of 5 features were selected. The extreme cases for FF are the 3 features of the porin group and the 8 features of the olfactory protein group. These numbers and the average success in classification show no correlation with the number of proteins in the group (not shown). Next, we compare the various selection methods. The scores for the selection methods are displayed in Table 2. As shown, the selection method that yields
Table 2. Average and standard deviation of the classification scores, according to the various selection methods

Selection Method   Number of Features   Average J-score   J-score StDev
All                70                   0.67              0.126
AAC                20                   0.63              0.149
AAG                11                   0.57              0.188
PTM                14                   0.45              0.171
PHYS               5                    0.52              0.185
RICH               20                   0.45              0.148
GREEDY             10                   0.26              0.150
FF                 5                    0.56              0.163
BE                 64.6                 0.65              0.126
the highest scores is BE, followed by AAC (average J-scores 0.65 and 0.63, respectively). Not surprisingly, however, these are also the ones that retain high numbers of features (64.6 and 20, respectively). Nevertheless, it is noteworthy that the FF method yields a relatively high average score (J-score 0.56), although it uses as few as 5 features on average. Another observation is that the more features selected, the lower the standard deviation of the J-score; this suggests that selection methods that use more features are more stable in their quality. For some of the groups classified, a large number of the original features are essential to reach maximal performance, while in other cases only a few features are sufficient for good separability. For example, as observed above, very few features are required to separate the porin group (only 3 features). Finally, we are unable to find any specific subset of features that consistently dominates the entire set; the chosen ones range from AAC (e.g., in vacuole proteins) and AAG (the nucleosome group) to others, but only rarely include the PTMs. This last observation seems to indicate that these signatures do not predict functional protein groupings, perhaps because identical modifications are often performed on differently functioning proteins [29]. The biophysical and enrichment features (25 features) are also rarely selected by the FF or BE protocols.
3.4 Global vs. Local Features
As can be discerned from Fig. 4 (top), a representation of proteins using global features compares favorably with local comparison-based features (SW): classification using the global features (all or partial) yields superior results in 6 of the 10 groups. Also shown is that classification using only a subset of features, as obtained by the BE and FF methods, yields good results. The classification performance obtained with global feature representations varies across the different groups tested. Some protein groups could not be classified with high precision (e.g., tricarboxylic acid cycle), while in other groups a very small set of features was found sufficient (e.g., porin activity). Nonetheless, using all 70 global features provided a very successful classification for all groups.
Fig. 4. Top: SVM results for the protein groups: local sequence similarities (SW: stripes), all 70 global features (All Features: black) and the best feature selection scheme (Best FS: gray). Bottom: Combination of both representations (SW + features: green), local sequence similarities (SW: stripes) and a random classification (RAND: white).
3.5 Combining Local with Global Features
Since both feature sets (SW and global) were transformed and scaled to a common representation (see Methods), it is possible to combine them into a unified dataset. This was performed in the following way: assuming that the N proteins are described by M global features, the feature dataset matrix is [NxM] and the SW one is [NxN]. Combining the matrices is performed simply by appending them, resulting in an [Nx(M+N)] matrix. Fig. 4 (bottom) demonstrates that this naive combination of global and local features significantly improves the classification quality compared to relying on either of them alone (paired t-test, p < 0.001 and p < 0.05, respectively). This suggests that the two representations contain complementary information. Thus it would seem that combining these features is an effective practice that should be adopted for large-scale functional protein classification.
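In code, the combination amounts to a column-wise concatenation; a minimal sketch (our illustration, assuming both matrices are already scaled to a common range, as described in the Methods):

import numpy as np

def combine_representations(features: np.ndarray, sw: np.ndarray) -> np.ndarray:
    """Append the [N x N] SW similarity matrix to the [N x M] global-feature
    matrix, yielding the unified [N x (M + N)] representation."""
    assert features.shape[0] == sw.shape[0] == sw.shape[1]
    return np.hstack([features, sw])

# Illustrative shapes only: 500 proteins, 70 global features
unified = combine_representations(np.random.rand(500, 70), np.random.rand(500, 500))
print(unified.shape)  # (500, 570)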
4 Discussion
In this study we show that a characterization of protein families can be obtained by relying on a small set of global features that, in some cases, can be further reduced. In previous studies, in which much richer feature sets were used [11, 12], the comparison with local features (SW) showed lower success rates. We hypothesize that the high-quality results described here are due to the small number of features that describe the data. This small size may facilitate the training and predictive capabilities of the classifier and, as a result, improve the classification. We attempted to determine which global features and feature selection algorithms perform best in the task of protein function prediction. There is no one feature set that performed this task equally well for all groups, since only some
groups seem "easy" to predict in that they require few features to characterize them well. Nevertheless, when a given group was found to be "easy", it was usually discovered by the FF method (or by using one of the predefined classes of features). On the other hand, single-wise feature selection (GREEDY) was prone to over-fitting and inferior to methods that consider the interplay between features and attempt to separate the training set in a holistic fashion (FF and BE). Therefore, it would seem wise to avoid such greedy methods that select features independently. In summary, we have observed that the use of global sequence features compares well with the use of local features in functional protein classification. Since the calculation of such global features is much faster (theoretically and in practice) than the computation of local sequence alignments for all pairs of proteins to be compared, in future work we plan to assess the protein function classification problem using global features on a much larger scale (from the GO resource). In addition, since we have also shown that the combination of local and global sequence features succeeds better than either method alone, it is certainly worthwhile for large-scale prediction algorithms to incorporate both protein representations. For computationally heavier methods that already use local sequence information (local alignment algorithms), the assimilation of global sequence properties as described here could be done at minimal overhead, yielding stronger prediction algorithms with little or no increase in computing time. The scheme presented here was also applied to protein sets of major biological importance and to a 10-fold larger set (not shown). Success in separating kinases (serine-threonine, tyrosine and uncharacterized), as well as nuclear proteins of DNA biosynthesis from RNA biosynthesis proteins, suggests that, at a coarse level of classification, protein groups may be characterized by a very minimal set of global features. On the other hand, substantial improvement was achieved for proteins that are often misclassified by sequence similarity, such as snake toxins and cytokines.
References
[1] Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389–3402 (1997)
[2] Scheeff, E.D., Bourne, P.E.: Application of protein structure alignments to iterated hidden Markov model protocols for structure prediction. BMC Bioinformatics 7, 410 (2006)
[3] Portugaly, E., Harel, A., Linial, N., Linial, M.: EVEREST: automatic identification and classification of protein domains in all protein sequences. BMC Bioinformatics 7, 277 (2006)
[4] Gribskov, M., McLachlan, A.D., Eisenberg, D.: Profile analysis: detection of distantly related proteins. PNAS 84(13), 4355–4358 (1987)
[5] Yona, G., Levitt, M.: Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. J. Mol. Biol. 315(5), 1257–1275 (2002)
[6] Levitt, M., Gerstein, M.: A unified statistical framework for sequence comparison and structure comparison. PNAS 95(11), 5913–5920 (1998)
[7] Rost, B.: TOPITS: threading one-dimensional predictions into three-dimensional structures. Proc. Int. Conf. Intell. Syst. Mol. Biol. 3, 314–321 (1995)
[8] Frith, M.C., et al.: The abundance of short proteins in the mammalian proteome. PLoS Genet. 2(4), e52 (2006)
[9] Friedberg, I., Kaplan, T., Margalit, H.: Glimmers in the midnight zone: characterization of aligned identical residues in sequence-dissimilar proteins sharing a common fold. Proc. Int. Conf. Intell. Syst. Mol. Biol. 8, 162–170 (2000)
[10] Wu, C.H., Apweiler, R., Bairoch, A., Natale, D.A., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M.J., Mazumder, R., O'Donovan, C., Redaschi, N., Suzek, B.: The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 34(Database issue), D187–D191 (2006)
[11] Kunik, V., Solan, Z., Edelman, S., Ruppin, E., Horn, D.: Motif extraction and protein classification. In: IEEE Computational Systems Bioinformatics Conference (CSB'05), pp. 80–85. IEEE Computer Society Press, Los Alamitos (2005)
[12] Cai, C.Z., Han, L.Y., Ji, Z.L., Chen, X., Chen, Y.Z.: SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res. 31(13), 3692–3697 (2003)
[13] Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S.R., Griffiths-Jones, S., Howe, K.L., Marshall, M., Sonnhammer, E.L.: The Pfam protein families database. Nucleic Acids Res. 30(1), 276–280 (2002)
[14] Syed, U., Yona, G.: Using a mixture of probabilistic decision trees for direct prediction of protein function. In: Proceedings of RECOMB, pp. 224–234 (2003)
[15] Chou, K.C.: Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 21(1), 10–19 (2005)
[16] Kahsay, R.Y., Gao, G., Liao, L.: An improved hidden Markov model for transmembrane protein detection and topology prediction and its applications to complete genomes. Bioinformatics 21(9), 1853–1858 (2005)
[17] Chou, K.C., Cai, Y.D.: Predicting protein quaternary structure by pseudo amino acid composition. Proteins 53(2), 282–289 (2003)
[18] Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal of Machine Learning Research 3, 1157–1182 (2003)
[19] Camon, E., Barrell, D., Lee, V., Dimmer, E., Apweiler, R.: The Gene Ontology Annotation (GOA) database – an integrated resource of GO annotations to the UniProt Knowledgebase. In Silico Biol. 4(1), 5–6 (2004)
[20] Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147(1), 195–197 (1981)
[21] Hulo, N., et al.: The PROSITE database. Nucleic Acids Res. 34(Database issue), D227–D230 (2006)
[22] Gasteiger, E., Gattiker, A., Hoogland, C., Ivanyi, I., Appel, R.D., Bairoch, A.: ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res. 31(13), 3784–3788 (2003)
[23] Jones, D.T.: Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292(2), 195–202 (1999)
[24] Eichacker, L.A., Granvogl, B., Mirus, O., Muller, B.C., Miess, C., Schleiff, E.: Hiding behind hydrophobicity: transmembrane segments in mass spectrometry. J. Biol. Chem. 279(49), 50915–50922 (2004)
[25] Skoufos, E.: Conserved sequence motifs of olfactory receptor-like proteins may participate in upstream and downstream signal transduction. Receptors Channels 6(5), 401–413 (1999)
[26] Henikoff, J.G., et al.: Increased coverage of protein families with the Blocks database servers. Nucleic Acids Res. 28(1), 228–230 (2000)
[27] Conticello, S.G., Pilpel, Y., Glusman, G., Fainzilber, M.: Position-specific codon conservation in hypervariable gene families. Trends Genet. 16(2), 57–59 (2000)
[28] Paulsen, I.T., Park, J.H., Choi, P.S., Saier, M.H.: A family of Gram-negative bacterial outer membrane factors that function in the export of proteins, carbohydrates, drugs and heavy metals from Gram-negative bacteria. FEMS Microbiology Letters 156(1), 1–8 (1997)
[29] Chakrabarti, S., Lanczycki, C.J.: Analysis and prediction of functionally important sites in proteins. Protein Sci. 16(1), 4–13 (2007)
Fault Tolerance for Large Scale Protein 3D Reconstruction from Contact Maps

Marco Vassura¹, Luciano Margara¹, Pietro Di Lena¹, Filippo Medri¹, Piero Fariselli², and Rita Casadio²

¹ Computer Science Department, University of Bologna, Italy
[email protected]
² Biocomputing Group, Department of Biology, University of Bologna, Italy
[email protected]
http://vassura.web.cs.unibo.it/cmap23derr/
Abstract. In this paper we describe FT-COMAR, an algorithm that improves the fault tolerance of our previously described heuristic algorithm for protein reconstruction, COMAR (Contact Map Reconstruction) [10]. COMAR can reconstruct the three-dimensional (3D) structure of a protein from its native contact map with 100% efficiency when tested on 1760 proteins from different structural classes. Here we test the performance of COMAR on native contact maps perturbed with random errors. This is done in order to simulate possible scenarios of reconstruction from predicted (and therefore highly noisy) contact maps. From our analysis we find that our algorithm produces better reconstructions from blurred contact maps when contacts are underpredicted than when they are overpredicted. Moreover, we modify the algorithm into FT-COMAR (Fault Tolerant COMAR) in order to use it with incomplete contact maps. FT-COMAR can ignore up to 75% of the contact map and still recover from the remaining 25% of the entries a 3D structure whose root mean square deviation (RMSD) from the native one is less than 4 Å. Our results indicate that the quality, more than the quantity, of predicted contacts is relevant to protein 3D reconstruction, and that hints about "unsafe" areas in predicted contact maps can be useful for improving reconstruction quality. To this end, we implement a very simple filtering procedure to detect unsafe areas in contact maps and show that with it, in the presence of errors, the performance of the algorithm can be significantly improved. Furthermore, we show that both COMAR and FT-COMAR outperform a previous state-of-the-art algorithm for the same task [13].
1 Introduction
One of the yet-unsolved problems in structural bioinformatics is ab initio Protein Structure Prediction (PSP), i.e., determining the three-dimensional (3D) structure (tertiary structure) of a protein from its one-dimensional chain of amino acid residues (primary structure) [9]. Predicting the tertiary structure of a protein directly from its primary structure is a complex problem. A typical alternative approach is to identify a set of sub-problems, such as the prediction
of protein secondary structure, solvent accessibility and/or residue contacts, and to search for specific solutions. Among the different possibilities, the prediction of contact maps of proteins starting from the protein chain is particularly promising, since even a partial solution can significantly help the prediction of the protein structure [6]. A contact map of a given protein 3D structure is a two-dimensional symmetric binary matrix M such that M[i, j] = 1 iff the Euclidean distance between amino acids i and j is less than or equal to a pre-assigned threshold t. The general problem of computing a set of 3D coordinates consistent with a given contact map has been shown to be NP-hard [5]. A series of heuristic algorithms have been developed to solve the problem. Galaktionov and Marshall [7] reconstructed the structures of five small proteins by adopting information relative to the residue coordination numbers. Other approaches rely on steepest descent with inequality distance constraints [4] and on an algorithm that minimizes a continuous cost function embodying constraints associated with contact and angle maps [11], respectively. On average these methods reconstruct protein structures without completely satisfying the contact map, in the sense that the reconstructed structures may have contact maps that slightly differ from the native ones. Vendruscolo et al. [12,13] described a method based on simulated annealing with the contact map as a target potential. They achieved an average RMSD of 2.5 Å on some 20 protein structures, and their method is considered the state-of-the-art solution. In [10] we proposed COMAR, a heuristic algorithm that finds a set of 3D coordinates consistent with a given native contact map. Our algorithm was tested on a non-redundant data set consisting of 1760 proteins, and for the whole data set it is always able to produce 3D coordinates consistent with the native contact maps (computed adopting contact thresholds ranging from 7 to 18 Å [10]). Moreover, the algorithm shows good reconstruction performance in terms of RMSD and outperforms, to our knowledge, all other reconstruction techniques documented in the literature so far [10]. Performance analysis of our algorithm shows that there exist native contact maps for which there are numerous different possible structures consistent with them. In general, the reconstruction quality is better for contact maps with thresholds between 10 and 18 Å, suggesting that contact maps of higher threshold are more informative than those of lower threshold. However, despite the good performance, the algorithm cannot be directly used in the context of protein structure prediction. This is to some extent a consequence of the poor performance of contact map predictors in predicting the physical contact map of proteins. The previous version of our algorithm was tested on native contact maps [10]. However, contact map predictions are highly blurred and typically noisy, and can produce non-physical contact maps, i.e., maps not consistent with any set of 3D coordinates. In this paper we analyze and improve the fault tolerance of COMAR for protein reconstruction. For the purpose of this investigation we introduce three different classes of random errors: general errors, errors on contacts (that is, errors on 1-entries of contact maps) and errors on non-contacts (that is, errors on 0-entries of contact maps).
We perform extensive tests of the reconstruction quality of our algorithm on a set of 120 non-redundant protein chains and compare the reconstruction performance in terms
of RMSD on the three classes of errors introduced. Our analysis shows that in general the reconstruction quality decreases with the length of the protein and that our algorithm largely tolerates errors on contacts. In particular, the experimental results show that the reconstruction quality for contact maps with 50% errors on contacts is comparable to that for contact maps with 1% errors on non-contacts. That is, our algorithm is much more tolerant to under-prediction than to over-prediction of contacts. We further tested this hypothesis by performing an analysis on incomplete contact maps with an improved version of our algorithm, called FT-COMAR (Fault Tolerant COMAR). Experimental tests show that FT-COMAR can ignore up to 75% of the contact map and still obtain a protein 3D structure whose RMSD from the native one is less than 4 Å. Furthermore, the reconstruction quality is then independent of protein length. This suggests that, to improve protein reconstruction from contact maps, contact map prediction should put much more emphasis on quality than on quantity. A simple way to improve the quality of the reconstruction is to pre-process the contact map in order to detect unsafe contact regions. This filtering pre-processing indicates to FT-COMAR which areas of the contact map have to be ignored. In this paper we compare pre-processing based on a perfect filtering procedure (one that eliminates all the wrong contacts and non-contacts, labeling them as undetermined) with a simple, basic real filter based on common-neighbors information in the contact map. As expected, the perfect filter gives the upper limit of reconstruction efficiency. However, our analysis shows that even with the simple basic filter the reconstruction quality is overall better than with COMAR and, furthermore, the results are independent of the length of the protein for errors below 8%. To conclude, we compare the performance of our algorithms with the results of the state-of-the-art reconstruction algorithm [13] and find that both COMAR and FT-COMAR achieve better reconstruction quality.
2 Protein Structure Reconstruction from Contact Maps
In this paper we adopt the widely used Cα representation of the protein backbone, where residues are considered as unique entities. The contact map of a given protein is a binary symmetric matrix CM such that CM[i, j] = 1 iff the Euclidean distance between residues i and j is less than or equal to a pre-assigned threshold t (Fig. 1a, area above diagonal). Typical values of t considered in the literature vary between 7 and 12 Å. As we showed in [10], higher threshold values allow better reconstruction, and in this work we adopt t = 12 Å. An introduction to the reconstruction of protein structures from contact maps can be found in [3]. To measure the similarity between two 3D protein structures, described by coordinate sets C, C′ ∈ R^(3×n), we use the Root Mean Square Deviation (RMSD); it is defined as the smallest distance D_k = sqrt( (1/n) · Σ_{i=1..n} ‖C′[i] − C_k[i]‖² ), where C_k ∈ R^(3×n) is obtained by rotating and translating the coordinate set C.
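For illustration, a contact map in this sense can be computed directly from the Cα coordinates; the sketch below is ours (not part of COMAR) and uses the paper's threshold t = 12 Å:

import numpy as np

def contact_map(coords: np.ndarray, t: float = 12.0) -> np.ndarray:
    """Binary symmetric contact map: CM[i, j] = 1 iff the Euclidean distance
    between residues i and j is <= t. coords has shape (n, 3), one Ca per residue."""
    diff = coords[:, None, :] - coords[None, :, :]   # pairwise displacements
    dist = np.sqrt((diff ** 2).sum(axis=-1))         # Euclidean distance matrix
    return (dist <= t).astype(np.int8)

cm = contact_map(np.random.rand(100, 3) * 50.0)
assert (cm == cm.T).all()                            # symmetry check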
2.1 Description of COMAR and FT-COMAR
COMAR (Contact Map Reconstruction) finds a set of 3D coordinates consistent with a given native contact map [10]. COMAR consists of two phases (see the pseudocode below). In the first phase it generates an initial set of 3D coordinates C ∈ R^(3×n), while in the second phase it iteratively refines the set of coordinates by applying a correction/perturbation procedure to C. The refinement is applied until the set of coordinates is consistent with the given contact map or until a control parameter ε becomes 0. The control parameter initially has a positive value and is decremented after every fixed number of refinement steps. If the parameter reaches 0 and a correct solution has still not been found, a new initial random solution is generated and the refinement process starts over again.

COMAR(CM ∈ {0,1}^(n×n), t ∈ N)
 1: while coordinates set C is not correct do
      // First phase: initial solution generation
 2:   C ← RANDOM-PREDICT(CM, t)
      // Second phase: refinement
 3:   C ← CORRECT(CM, C, t)
 4:   set ε to a strictly positive value
 5:   while coordinates set C is not consistent with CM and ε > 0 do
 6:     C ← PERTURBATE(CM, C, t, ε)
 7:     C ← CORRECT(CM, C, t)
 8:     decrement ε slightly
 9: return C

Extended tests on native contact maps and a detailed description of the algorithm can be found in [10]. To test the reliability of our reconstruction technique on faulty contact maps we need to modify the termination condition of COMAR: in this paper the algorithm always stops after the first run of the main cycle, i.e., the while loop of the first line is executed just once. This modification is necessary since a faulty contact map may be non-physical, i.e., there may be no 3D structure consistent with it, and the termination condition of our original algorithm (COMAR line 1) would make the procedure run forever when applied to a non-physical contact map.

FT-COMAR(CM ∈ {-1,0,1}^(n×n), t ∈ N)
      // Pre-processing phase: error filtering
 1: CM′ ← FILTER(CM)
      // First phase: initial solution generation
 2: C ← FT-RANDOM-PREDICT(CM′, t)
      // Second phase: refinement
 3: C ← FT-CORRECT(CM′, C, t)
 4: set ε to a strictly positive value
 5: while coordinates set C is not consistent with CM′ and ε > 0 do
 6:   C ← FT-PERTURBATE(CM′, C, t, ε)
 7:   C ← FT-CORRECT(CM′, C, t)
 8:   decrement ε slightly
 9: return C

To reconstruct partial and blurred contact maps we developed FT-COMAR (Fault Tolerant COMAR), a simple improvement of COMAR. FT-COMAR can work on incomplete contact maps, i.e., contact maps with some unknown entries: FT-RANDOM-PREDICT, FT-CORRECT and FT-PERTURBATE are simple modifications of RANDOM-PREDICT, CORRECT and PERTURBATE that do not consider unknown entries during the processing. Moreover, to deal with blurred contact maps, the reconstruction phase of FT-COMAR is preceded by a pre-processing of the contact map (FILTER) in order to detect (and then mark as unknown) unsafe entries. FT-COMAR is general enough to accept any type of filtering procedure. In this work we analyze the performance of FT-COMAR adopting a perfect FILTER, i.e., one able to detect and mark as unknown exactly all faulty entries of the contact map, and a basic real filtering algorithm (Sect. 3).
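A minimal sketch (ours, not the authors' code) of the consistency test that the FT- refinement loop relies on, extended to skip unknown (−1) entries:

import numpy as np

def consistent(cm: np.ndarray, coords: np.ndarray, t: float = 12.0) -> bool:
    """True iff coords realize every *known* entry of cm; entries equal to -1
    (unknown or filtered out) impose no constraint."""
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            if cm[i, j] == -1:
                continue                                   # unknown: skipped
            in_contact = np.linalg.norm(coords[i] - coords[j]) <= t
            if in_contact != bool(cm[i, j]):
                return False
    return True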
3 Experimental Results
Data Set. We selected from SCOP [2] release 1.67 the proteins with X-ray structures in the PDB, with resolution <2.5 Å and without missing internal residues. We removed sequence redundancy using BLAST [1], ending up with a dataset of 1760 protein chains with pairwise sequence similarity lower than 25%. Among these we selected 120 proteins, distributed (not uniformly) between lengths of 50 and 1100 residues. To avoid contact maps for which we know there are very different possible structures consistent with them [10], we chose proteins whose 3D structure can be reconstructed by COMAR up to a 1 Å RMSD distance from the native structure. The distribution of the resulting protein set according to the SCOP structural classes is: 8 all Alpha, 20 all Beta, 58 Alpha/Beta and 14 Alpha+Beta in the mono-domain; 3 Multi-{B,C,D} and 17 Other consist of multi-domain proteins; in total, 100 proteins are mono-domain and 20 proteins are multi-domain¹.

Error Generation and Test Configuration. To study how protein 3D structure can be reconstructed with our algorithm from faulty contact maps we introduce three classes of random errors:
• Err. Errors are generated by flipping the entries of randomly chosen rows and columns of the contact map (Fig. 1a, area below diagonal). To introduce x% errors we generate x errors for every 100 pairs of residues, that is, (x/100) · n(n−1)/2 total errors.
• Err-0 (designed to preserve contacts). Errors are generated as before, but an entry of the contact map is flipped only if it is not a contact (Fig. 1b, below diagonal). Here x% errors means (x/100) · (n(n−1)/2 − #contacts) total errors.
¹ The complete list is available at http://vassura.web.cs.unibo.it/protlist120.tgz
Fig. 1. Contact map of the Asn102 mutant of trypsin (PDB code: 1trmA). The contact map is computed with a threshold of 12 Å: gray areas are contacts, white areas are non-contacts and black areas are errors. (a) Above diagonal: native map, 24753 pairs of residues, 3595 contacts, 21158 non-contacts, and no errors. (a) Below diagonal: Err 5%, that is, (5% of 24753 =) 1237 random errors. (b) Above diagonal: Err-1 5%, that is, (5% of 3595 =) 179 random errors on contacts. (b) Below diagonal: Err-0 5%, that is, (5% of 21158 =) 1057 random errors on non-contacts. The protein is also the test case shown in [13].
• Err-1 (designed to preserve non-contacts). An entry of the contact map is flipped only if it is a contact (Fig. 1b, above diagonal). Here x% errors means (x/100) · #contacts total errors.

In our testing, for each protein contact map and for each percentage of error considered, we generate 100 different faulty contact maps. Thus, having 120 proteins in our set, we perform 12000 tests for each percentage of error. Accordingly, our test results must always be read as average values over the 100 different instances we generate. All test runs were executed on personal computers equipped with an Intel Pentium 4 processor with a clock rate of 2.8 GHz and 1 GB of RAM. Times reported are Unix CPU times, measured using the time() C library function. The heuristic is freely available for testing on the web².

Structure Reconstruction from Faulty Contact Maps. We now show experimental results on the behavior of COMAR with faulty contact maps. We perform tests by introducing from 1% up to 10% random errors of class Err. The average RMSD of the reconstructions from these faulty contact maps is shown in Fig. 2.
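Under the above definitions, the three perturbation classes can be generated as in the following sketch (our reading of the definitions, not the authors' test harness; cm is an integer 0/1 matrix):

import numpy as np

def perturb(cm: np.ndarray, x: float, kind: str = "Err",
            rng: np.random.Generator = np.random.default_rng()) -> np.ndarray:
    """Flip x% of the eligible upper-triangle entries of cm, symmetrically.
    'Err':   all pairs eligible  -> (x/100) * n(n-1)/2 flips;
    'Err-0': only non-contacts   -> (x/100) * #non-contacts flips;
    'Err-1': only contacts       -> (x/100) * #contacts flips."""
    noisy = cm.copy()
    iu, ju = np.triu_indices(len(cm), k=1)
    if kind == "Err-0":
        eligible = np.flatnonzero(cm[iu, ju] == 0)
    elif kind == "Err-1":
        eligible = np.flatnonzero(cm[iu, ju] == 1)
    else:
        eligible = np.arange(len(iu))
    picks = rng.choice(eligible, size=int(x / 100 * len(eligible)), replace=False)
    noisy[iu[picks], ju[picks]] ^= 1      # flip the chosen entries...
    noisy[ju[picks], iu[picks]] ^= 1      # ...and keep the map symmetric
    return noisy

With x = 5 on the 1trmA map, the flip counts produced this way (1237, 179 and 1057 for Err, Err-1 and Err-0, respectively) match those quoted in the caption of Fig. 1.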
² At the following URL: http://vassura.web.cs.unibo.it/cmap23derr/
Fig. 2. Reconstruction quality (RMSD) as a function of the number of residues in the protein (Size) and of the percentage of random errors on the total pairs of residues (Err%). Better reconstruction has darker colors (12000 contact maps are analyzed).
The results indicate that the quality of the protein 3D structure reconstruction depends on protein size: proteins with fewer than 150 residues are reconstructed with an RMSD (from the native structure) of less than 5 Å even when 10% random errors are introduced. For proteins with a number of residues between 150 and 400, the quality of the reconstruction decreases as errors increase, but the average RMSD still remains below 5 Å for small percentages of errors. For proteins with more than 400 residues our algorithm shows poor performance (RMSD > 5 Å) even for error rates as small as 1%. Note that the absolute number of errors corresponding to the same percentage increases with size: for example, 10% random errors for a protein of size 100 means 495 errors, while 1% random errors for a protein of size 400 means 798 errors. In Fig. 3 we show how the reconstruction quality varies across the different SCOP categories when we introduce 5% random errors, with the aim of highlighting whether some categories can be reconstructed better than others. As shown in Fig. 2, the mean RMSD from the native structure increases proportionally to protein size, with some exceptions. The most notable exception is the CDK4/6 inhibitory protein p18INK4c (1ihb chain A; size 156), which is in the SCOP Alpha+Beta category. It appears (Fig. 3) that exceptions to the length-dependent behavior of the reconstruction quality are rare and distributed among SCOP categories, so that it cannot be concluded that one SCOP category is more difficult to reconstruct from faulty contact maps than another. We next analyze how different types of errors influence the quality of reconstruction. In particular, in Fig. 4, we compare the performance of COMAR on the three classes of errors Err, Err-0 (errors on non-contacts) and Err-1 (errors on contacts) introduced earlier in this section. As shown in Fig. 4, on average COMAR copes better with Err-1 errors than with Err-0 errors. For example, contact maps with 50% errors on contacts are reconstructed with the same quality as contact maps having 1% errors on non-contacts (which means about 10% extra contacts).
Fig. 3. Reconstruction quality (RMSD) with an error Err 5% as a function of the protein length (Size) clustered according to SCOP categories (the number of contact maps is as in Fig. 2)
Fig. 4. Average RMSD to the native structure of structures reconstructed from contact maps, as a function of the percentage of errors with respect to (wrt) each error class: Err refers to random errors, Err-1 refers to errors on contacts and Err-0 refers to errors on non-contacts (the number of contact maps is as in Fig. 2).
Improving the Reconstruction from Faulty Contact Maps. Our tests give some clues on how the quality of contact map prediction could influence the reconstruction phase. This is much more evident if we analyze the reconstruction quality of FT-COMAR on faulty contact maps assuming a perfect filtering procedure, i.e., a procedure able to detect all errors in faulty contact maps. To test this approach we generate random incomplete contact maps by randomly choosing a column and a row of the contact map and marking that entry, corresponding to a detected error, as not safe (not to be considered during the reconstruction routine). As shown in Fig. 5, FT-COMAR with perfect filtering can skip up to 75% of the contact map area and still compute a reconstructed 3D structure with an RMSD < 4 Å from the native structure. Furthermore, this reconstruction quality is independent of the protein size. This somewhat unexpected result is due to the fact that FT-COMAR does not consider skipped entries in the refinement phase (see Sect. 2.1 for the description of the algorithm). In this way FT-COMAR does not use wrong information during the refinement phase, avoiding the propagation of errors. The drawback is that this holds only if the remaining entries of the contact map are correct, i.e., only in the presence of a perfect filtering. As shown in Fig. 6, even if
Fig. 5. Reconstruction quality (RMSD) as a function of the number of residues in the protein chain (Size) and of the percentage of randomly skipped pairs on the total pairs of residues (see legend). Lower percentages of Skip have darker colors (the number of contact maps is as in Fig. 2).
Fig. 6. Reconstruction quality (RMSD) as a function of protein length (Size) when 25% of the input contact map is skipped. Increasing percentages of random errors (Err) on the remaining 75% of the map are shown (see legend). Lower percentages of Err have darker colors (the number of contact maps is as in Fig. 2).
we skip only 25% of the entries, the reconstruction quality decreases rapidly as errors on the remaining 75% of the map increase. Note also that in this case the reconstruction quality depends on the length of the protein. We can interpret these results as evidence that the quality of the reconstruction is influenced negatively more by the erroneous prediction of some contacts than by ignoring a substantial subset of contacts during the reconstruction.

Error Filter Preprocessing with FT-COMAR. The experimental results of the previous paragraphs show that we can reconstruct the 3D structure of a protein much more reliably if we are able to predict which areas of the contact map are unsafe. This suggests that prediction quality is more important than the quantity of contacts predicted: for instance, comparing Fig. 2 and Fig. 5 it is evident that it is better to predict 25% of the contact map with no errors than 100% of the contact map with 5% errors. This holds especially for proteins with a high number of residues. At the present time there is no way to predict contact maps with high reliability. Labeling unsafe contact map areas therefore seems an alternative way to find possible solutions. There are various properties that can be used to test the "safeness" of contact map areas, from physical constraints to graph properties. Here we propose a simple filtering procedure based on the so-called common-neighbors property, namely the
number of common contacts of two nodes in the undirected graph (contact map), and we analyze how this procedure improves the prediction of our algorithm on faulty contact maps. The common-neighbors property roughly states that two residues i, j are in contact if and only if they share a high number of neighbors, i.e., there is a high number of residues which are close to both i and j. Experimentally, in our dataset of 1760 non-redundant protein chains only 6% of the residue pairs which are in contact share fewer than 10 neighbors, and just 0.7% of the residue pairs which are not in contact share more than 18 neighbors. Thus our common-neighbors filtering procedure skips the entry for residues i, j if:
– C[i, j] = 1 (i and j are in contact) and i, j share fewer than 10 neighbors, i.e., residue i is in contact with fewer than 10 residues which are also in contact with residue j;
– C[i, j] = 0 (i and j are not in contact) and i, j share more than 18 neighbors, i.e., residue i is in contact with more than 18 residues which are also in contact with residue j.

Results for the reconstruction quality using FT-COMAR with the simple filter described above are shown in Fig. 7. We note that for percentages of errors below 8% the reconstruction quality is independent of the protein length, as in Fig. 5. This means that the filter skips faulty areas large enough to avoid their negative influence on the whole reconstruction. When errors are over 16% the reconstruction quality decreases with increasing protein length. To avoid this behavior, a better adjustment of the filtering parameters (based on the number of expected contacts, or other types of filtering procedures) should be considered. Nevertheless, the overall reconstruction quality with this simple/basic filter is significantly improved, as emerges from the comparison of Fig. 2 and Fig. 7. We remark also that our algorithms run within minutes, allowing them to be used for a large-scale number of predictions. The reconstruction times of FT-COMAR for our 120-protein data set are shown in Fig. 8.
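In matrix terms the shared-neighbor counts are one multiplication away, so the filter just described can be sketched compactly (thresholds 10 and 18 as in the text; the −1 marker follows FT-COMAR's convention; our illustration, not the authors' code):

import numpy as np

def common_neighbors_filter(cm: np.ndarray, lo: int = 10, hi: int = 18) -> np.ndarray:
    """Mark as unknown (-1) every entry that disagrees with the common-neighbors
    evidence: contacts whose residues share fewer than `lo` neighbors, and
    non-contacts whose residues share more than `hi` neighbors."""
    adj = cm.copy()
    np.fill_diagonal(adj, 0)            # self-contacts do not count as neighbors
    shared = adj @ adj                  # shared[i, j] = #common neighbors of i, j
    filtered = cm.astype(np.int8)
    filtered[(cm == 1) & (shared < lo)] = -1
    filtered[(cm == 0) & (shared > hi)] = -1
    np.fill_diagonal(filtered, np.diagonal(cm))   # leave the diagonal untouched
    return filtered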
Fig. 7. Reconstruction quality (RMSD) of FT-COMAR as a function of protein length (Size). Lower percentages of random errors (Err%) on the whole contact map are shown with darker colors (the number of contact maps is as in Fig. 2).
Comparison with Previous Work. In Fig. 9 our target is the protein 1trm chain A, which allows a comparison with the previous state-of-the-art reconstruction algorithm of Vendruscolo et al. [13]. The reconstruction quality is shown as a function of the number of random errors introduced. Both with COMAR and with FT-COMAR (with
Fig. 8. Average FT-COMAR reconstruction times in seconds for our 120-protein data set as a function of protein length, for four percentages of random errors (see legend) (the number of contact maps is as in Fig. 2).
Fig. 9. Average reconstruction quality (RMSD) for the protein 1trm (chain A, 223 residues) as a function of the number of random errors included in the native contact map. Vend refers to the performances described in [13]. 1000 errors are approximately 4% of the number of pairs of residues.
the filtering procedure already described) we obtain better reconstruction quality. To compare this result with the other tests described in this work, it should be considered that 1000 errors are approximately 4% of the total number of residue pairs and 4000 errors are approximately 16%.
4 Conclusions and Perspectives
In this paper we develop FT-COMAR, an algorithm that improves the fault tolerance of our previously described heuristic algorithm (COMAR) for protein reconstruction [10]. We perform extensive tests of the reconstruction quality of COMAR on a set of 120 non-redundant protein chains and compare the reconstruction performance in terms of RMSD on three different classes of errors: general errors, errors on contacts (that is, errors on 1-entries of contact maps) and errors on non-contacts (that is, errors on 0-entries of contact maps). The experimental results show that the reconstruction quality for contact maps with 50% errors on contacts is comparable to that for contact maps with 1% errors on non-contacts. That is, COMAR is much more tolerant to errors on contacts than to errors on non-contacts. FT-COMAR can work on incomplete contact maps, i.e., contact maps with a set of unknown entries. We showed that FT-COMAR can ignore up to 75% of the contact map and still recover a 3D structure from
the remaining 25% of the entries with an RMSD from the native structure of less than 4 Å. Our conclusion is therefore that, in order to improve structure reconstruction from contact maps, more emphasis should be put on the quality than on the quantity of contact predictions. This is corroborated also by the better results obtained when a simple basic filter is implemented to detect unsafe (randomly perturbed) contact map areas. The very basic filtering algorithm we develop is based on the common-neighbors property, and its performance is tested against the reconstruction quality obtained with the unfiltered faulty contact maps. The reconstruction quality of FT-COMAR with this simple filtering procedure is overall better and, furthermore, turns out to be independent of the length of the protein for percentages of errors below 8%. We expect that, along these lines, other more complex filtering procedures will further improve the reconstruction efficiency.
Acknowledgements
We thank MIUR for the following grants: PNR 2001-2003 (FIRB art. 8) and PNR 2003 projects (FIRB art. 8) on Bioinformatics for Genomics and Proteomics and LIBI-Laboratorio Internazionale di BioInformatica, both delivered to RC. This work was also supported by the Biosapiens Network of Excellence, project no. LSHG-CT-2003-503265 (a grant of the European Union's VI Framework Programme).
References
1. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389–3402 (1997)
2. Andreeva, A., Howorth, D., Brenner, S.E., Hubbard, T.J., Chothia, C., Murzin, A.G.: SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res. 32(Database issue), D226–D229 (2004)
3. Bartoli, L., Capriotti, E., Fariselli, P., Martelli, P.L., Casadio, R.: The pros and cons of predicting protein contact maps
4. Bohr, J., et al.: Protein structures from distance inequalities. J. Mol. Biol. 231, 861–869 (1993)
5. Breu, H., Kirkpatrick, D.G.: Unit disk graph recognition is NP-hard. Computational Geometry 9, 3–24 (1998)
6. Fariselli, P., Olmea, O., Valencia, A., Casadio, R.: Progress in predicting inter-residue contacts of proteins with neural networks and correlated mutations. Proteins 45(5), 157–162 (2001)
7. Galaktionov, S.G., Marshall, G.R.: Properties of intraglobular contacts in proteins: an approach to prediction of tertiary structure. In: Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences, vol. V (Biotechnology Computing), January 4-7, 1994, pp. 326–335 (1994)
8. Havel, T.F.: Distance geometry: theory, algorithms, and chemical applications. In: Encyclopedia of Computational Chemistry (1998)
9. Lesk, A.: Introduction to Bioinformatics. Oxford University Press, Oxford (2006)
10. Vassura, M., Margara, L., Medri, F., Di Lena, P., Fariselli, P., Casadio, R.: Reconstruction of 3D structures from protein contact maps. In: ISBRA 2007: Proceedings of the Third International Symposium on Bioinformatics Research and Applications, Atlanta, May 2007. LNCS (LNBI), vol. 4463, pp. 578–589. Springer, Heidelberg (2007)
11. Pollastri, G., Vullo, A., Frasconi, P., Baldi, P.: Modular DAG-RNN architectures for assembling coarse protein structures. J. Comp. Biol. 13(3), 631–650 (2006)
12. Vendruscolo, M., Kussell, E., Domany, E.: Recovery of protein structure from contact maps. Folding and Design 2(5), 295–306 (1997)
13. Vendruscolo, M., Domany, E.: Protein folding using contact maps. Vitam. Horm. 58, 171–212 (2000)
Bringing Folding Pathways into Strand Pairing Prediction

Jieun K. Jeong¹٬², Piotr Berman¹, and Teresa M. Przytycka²

¹ Computer Science and Engineering Department, The Pennsylvania State University, University Park, PA 16802
² National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894
[email protected], [email protected], [email protected]
Abstract. The topology of β-sheets is defined by the pattern of hydrogen-bonded strand pairing. Therefore, predicting hydrogen-bonded strand partners is a fundamental step towards predicting β-sheet topology. In this work we report a new strand pairing algorithm. Our algorithm attempts to mimic elements of the folding process. Namely, in addition to ensuring that the predicted hydrogen-bonded strand pairs satisfy basic global consistency constraints, it takes into account hypothetical folding pathways. Consistent with this view, introducing hydrogen bonds between a pair of strands changes the probabilities of forming other strand pairs. We demonstrate that this approach provides an improvement over previously proposed algorithms.
1 Introduction
The prediction of protein structure from protein sequence is a long-held goal that would provide invaluable information regarding the function of individual proteins and the evolution of protein families. The increasing amount of sequence and structure data has made it possible to decouple the structure prediction problem from the problem of modeling the protein folding process. Indeed, significant progress has been achieved by bioinformatics approaches such as homology modeling, threading, and assembly from fragments [16]. At the same time, the fundamental problem of how a protein actually acquires its final folded state remains a subject of controversy. Can the successes and failures of computational methods shed some light on this issue? It is generally accepted that proteins fold to their global free energy minimum. Through his famous paradox, Levinthal made the important point that a protein cannot explore all conformational states in the search for the optimal conformation, and therefore a protein chain has to fold by following some directed process, or folding pathway [14]. One view that has been gathering support for nearly three decades is the concept of hierarchical protein
folding [1, 2, 6, 12, 13, 19]. Consequently, many structure prediction algorithms use a hierarchical approach in which the structure is assembled in a bottom-up fashion (e.g., smaller locally folded fragments are assembled into larger folded units [4, 9, 21]). Studies of β-sheet topology indicate that the way strands assemble into larger sheets may be quite complex. While about half of the hydrogen-bonded pairs of strands are adjacent in the sequence of strands along the chain, many are separated by a significant distance. How do pairs of strands that are distant in sequence find their hydrogen-bonded partners? In her classic 1977 paper, Richardson proposed a set of folding rules whereby consecutive β-strands grow into larger hydrogen-bonded structures in successive steps, and blocks of strands obtained in this way coalesce, provided they are consecutive in the chain. Richardson showed, by manual inspection, that 37 known strand topologies can be constructed using these rules. A smaller, more restricted set of folding rules was shown by Przytycka et al. [17] to be sufficient for 80% of fold families, while the proteins in more than 90% of families consist of at most three substructures that can be completely folded using the proposed rules. It is tempting to hypothesize that such procedures are related to actual folding pathways. If this hypothesis is correct, such folding rules should be helpful in the prediction of β-sheet topology in general, and of the pairing of β-strands in particular. The latter problem, despite many attempts, remains unsolved. Early work by Hubbard [7] has been followed by other studies directed towards understanding and predicting β-sheet topology [23, 8, 26, 25, 22, 20, 15]. In a recent work, Cheng and Baldi [5] addressed the strand pairing problem using a three-stage approach. In the first stage they compute, for the input protein sequence, the scores (estimated probabilities) of residue pairs as potential partners in a β-strand pairing. This computation is performed by a neural network whose input describes a window of size five around each residue and the distance between the two residues in the protein sequence. In the second stage the above pairwise scores are used to define alignment scores for pairs of strands, and for each pair a highest-scoring alignment is found with the use of dynamic programming. The alignment scores are used in the third and final stage to run a greedy selection algorithm. Cheng and Baldi reported 59% specificity and 54% sensitivity, which is significantly better than what is achieved by a naive algorithm predicting that all pairs of strands that are consecutive in the sequence form hydrogen-bonded partners in space. (The performance of such a naive algorithm was estimated to be 42% specificity and 50% sensitivity [5].) The important novelty of the approach of Cheng and Baldi when compared with previous methods (e.g., Hubbard [7], Zhu and Braun [26] and Steward and Thornton [22]) is that the prediction of residue pairs that are partners in strand pairing is not performed independently for each pair, but instead takes into
account a wider context, to wit, the information about 10 surrounding residues and the distance between them. The approach of Cheng and Baldi does not employ explicit folding rules, although some bias towards the formation of hairpins (and, in general, contacts between strands close in sequence) is included in the (learned) scoring function. On the other hand, the third stage of their algorithm is a very simple greedy algorithm, which raises a question: would a more elaborate approach increase the quality of prediction even further? What is more important – a better optimization method (e.g., as discussed by Berman and Jeong in [3]), or biological insight, in particular, the knowledge of folding rules? To address these questions, we investigated two new algorithms for predicting strand partners. To allow direct comparisons, we use the same scoring function as Cheng and Baldi. The objective of the first algorithm is very similar to the approach of Cheng and Baldi, but rather than using a two-stage greedy selection heuristic, it poses the problem as an integer linear programming optimization problem and solves it using the ILOG CPLEX™ package. The second approach is greedy, but it explicitly encourages two simple folding rules. This is achieved by dynamically increasing the scores of pairs of strands (as potential partners), depending on the pairs of strands predicted so far. In particular, we double the scores of pairs of strands whose pairing is consistent with one of the rules, based on the pairs that are already formed. Our rules are simple and biochemically justified (as we explain later in the paper). Both methods provided a noticeable improvement over the previous approach. Importantly, the more significant improvement was obtained with the approach that promotes folding rules. This is remarkable, since in the case of the integer linear program we are heuristically solving an NP-complete problem using about 100 times more time than the folding-rule promotion algorithm (almost the entire running time of the latter is consumed by the dynamic programming that computes an optimal pairing/alignment for each pair of strands). While the improvement, taken in absolute numbers, is not drastic (about 2.7% in specificity and 1% in sensitivity), one has to keep in mind that the improvement of Cheng and Baldi over a naive algorithm was only 4-5 times larger. In another perspective, without any new predictor or data source we decreased the number of false predictions by 10% while increasing the number of good predictions.
2 Methods
The common notions in the three algorithms considered here are:
– strand: an interval of residue numbers predicted to form a β-strand; we visualize a strand as a paper ribbon covered with numbered squares;
– contact: an adjacency of two strands, as in Fig. 1;
– contact energy: the sum, over the pairs of residues adjacent in a contact, of the scores returned by the Cheng–Baldi neural network (see the sketch below).
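For an ungapped placement of two strands at a fixed offset, the contact energy is just a sum of per-residue-pair scores. A toy sketch follows (our notation throughout: pair_score stands in for the neural-network output, and the gap handling of the actual scoring is omitted):

def contact_energy(pair_score, strand_a, strand_b, offset=0, antiparallel=False):
    """Sum the pair scores over residue pairs adjacent in an ungapped contact.
    pair_score(i, j) is the estimated probability that residues i, j pair up."""
    a = list(strand_a)
    b = list(strand_b)[::-1] if antiparallel else list(strand_b)
    pairs = zip(a[max(0, offset):], b[max(0, -offset):])
    return sum(pair_score(i, j) for i, j in pairs)

# Toy usage with a dummy scoring function:
print(contact_energy(lambda i, j: 1.0 / (1 + abs(i - j)),
                     range(59, 64), range(100, 105)))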
A solution returned by an algorithm is a collection of contacts that satisfies the following constraints:
– uniqueness: at most one contact between a single pair of strands;
– sidedness: contacts of a strand are on one of the two sides of that strand;
– overlap-free: contacts on the same side of a strand are not in contact with the same residue;
– direction-consistent: contacts on the same side of a strand are either all parallel or all anti-parallel.

Fig. 1. [Figure: strand ribbons with residue numbers 40-42, 59-63, 100-104 and 123-126, showing three contacts a, b and c of strand 59-63; the full layout could not be recovered from the extracted text.]

While these constraints are necessary, they allow for many impossible combinations of contacts. After some experimentation we added the constraint that a solution is cycle-free (as did Cheng and Baldi [5]). In the data set, among all 916 protein chains and ca. 9000 strands there were only 80 cycles. At the same time, without the prohibition of all cycles, our program was returning solutions with many cycles, ca. 99% of them wrong. Lastly, we disallowed contacts with score below 0.06 from further consideration. This caused the number of predicted contacts (true and false positives in Table 1) to roughly coincide with the number of actual contacts (true positives and false negatives).
ILP Formulation
We can view the strand pairing problem as an optimization problem: identify a solution with the maximum sum of contact energies. As shown in [5], this problem cannot be solved in polynomial time in the worst case. However, in almost all instances in the test set, an ILP solver found provably optimal solutions. While there are many ILP methods used for protein structure prediction (e.g., see [11,24,10]), none of them operated in our particular framework; instead, they were used in the context of all-atom models, threading, etc. In our formulation, a contact c has an upper strand and a lower strand, and a strand similarly has two sides, upper and lower. In Fig. 1, contacts b and c are in conflict as not overlap-free, while contacts a and c are in conflict as not direction-consistent. A contact is characterized by these parameters: upper strand, lower strand, parallel (or not), and the offset (relative shift of the strands). The score of c, $E(c)$, was computed using dynamic programming (we allowed a single gap of length 1 in the alignment). We kept only the contacts with the optimal offset values. For every possible contact c we introduced a variable $x_c$, and for every pair of strands i, j a variable $y_{i,j}$. The value of $x_c$ indicates whether contact c is in the solution ($x_c = 1$) or not ($x_c = 0$). Similarly, $y_{i,j} = 1$ means that strands i and j were paired, i.e., that we selected a contact that binds these two strands together. To formulate our ILP we introduce two classes of 0–1 vectors: $C^{i,j}$ such that $C^{i,j}_c = 1$ if and only if contact c binds strand i with strand j, and $\gamma(S)$ such that
$\gamma(S)_{i,j} = 1$ if and only if $\{i, j\} \subset S$. We also set conflict(c, d) to be true if there is a conflict between contact c and contact d. We wish to solve the following ILP:

$$\text{maximize } E \cdot x \text{ subject to}$$
$$C^{i,j} \cdot x \le y_{i,j} \quad \text{for } \{i, j\} \subset \{1, \dots, n\} \quad \text{(contact/pairing)}$$
$$x_c + x_d \le 1 \quad \text{for } c, d \text{ s.t. conflict}(c, d) \quad \text{(no-conflict)}$$
$$\gamma(S) \cdot y \le |S| - 1 \quad \text{for } S \subset \{1, \dots, n\} \quad \text{(cycle-free)}$$

This set of constraints is often too large as an input to an ILP solver: when the number of strands reaches 20, the number of cycle-free constraints reaches $10^6$, and for the largest protein domains, with more than 40 strands, it exceeds $10^{12}$. To avoid that problem, we start with a single cycle-free constraint with $S = \{1, \dots, n\}$ and run a row generation loop: we submit the ILP, we obtain a solution, and if it contains a cycle of strands we add a cycle-free constraint for its set of nodes. When the number of repetitions is too large (as happened in ca. 15% of the cases) we give up and return the solution of the greedy algorithm described below.
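As an illustration of such a row generation loop, here is a minimal, self-contained sketch using the open-source PuLP and networkx packages rather than CPLEX; the function name, the data layout, and the choice to start with no cycle-free constraints (adding one cut per violated cycle) are our assumptions, not the paper's code.

```python
import itertools
from collections import defaultdict
import networkx as nx
import pulp

def solve_strand_pairing(n_strands, contacts, scores, conflicts, max_rounds=50):
    """contacts: list of (i, j) strand pairs, one per candidate contact;
    scores: E(c) per contact index; conflicts: pairs (c, d) of contact
    indices that cannot coexist."""
    pairs = list(itertools.combinations(range(n_strands), 2))
    prob = pulp.LpProblem("strand_pairing", pulp.LpMaximize)
    x = [pulp.LpVariable(f"x_{c}", cat="Binary") for c in range(len(contacts))]
    y = {p: pulp.LpVariable(f"y_{p[0]}_{p[1]}", cat="Binary") for p in pairs}
    prob += pulp.lpSum(scores[c] * x[c] for c in range(len(contacts)))
    by_pair = defaultdict(list)
    for c, (i, j) in enumerate(contacts):
        by_pair[min(i, j), max(i, j)].append(c)
    for p, cs in by_pair.items():            # contact/pairing (and uniqueness)
        prob += pulp.lpSum(x[c] for c in cs) <= y[p]
    for c, d in conflicts:                   # no-conflict
        prob += x[c] + x[d] <= 1
    for _ in range(max_rounds):              # row generation loop
        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        g = nx.Graph()
        g.add_edges_from(p for p in pairs if (y[p].value() or 0) > 0.5)
        cycles = nx.cycle_basis(g)
        if not cycles:
            return [c for c in range(len(contacts)) if (x[c].value() or 0) > 0.5]
        for s in cycles:                     # one new cycle-free cut per cycle
            prob += pulp.lpSum(
                y[q] for q in itertools.combinations(sorted(s), 2)) <= len(s) - 1
    return None  # too many rounds: fall back to the greedy algorithm
```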
2.2 Greedy Algorithm with Pathway-Based Promotion
The greedy algorithm has the same goal as the ILP method, but it grows the solution set one contact at a time, always choosing the new contact with the maximum possible score (energy). On one hand, the initial choices may block subsequent choices and thus prevent the algorithm from finding a solution with the maximum score. On the other hand, the greedy algorithm is much more flexible in checking the consistency requirements, as they do not have to be formulated in the form of linear inequalities. In the preliminary stage of the algorithm, for each pair of strands we preselect the best parallel and the best anti-parallel contact, and we order the candidates according to their score. We consider candidates starting with the one with the largest score, and we never consider a candidate again. We represent contacts with unordered pairs of strands, which means that we do not declare which strand is the upper one and which one is lower. Otherwise we could get the following anomaly: we greedily choose contacts for pairs (1,2) and (3,4), and decide that, say, strands 1 and 3 are the upper ones. Then we cannot choose contact (1,4): if in the latter strand 1 is upper, we have a conflict with (1,2), and if strand 4 is lower, we have a conflict with (3,4). This representation makes it less obvious how to verify the constraints of sidedness, overlap-freeness, and direction-consistency. (Verifying the constraints of uniqueness and cycle-freeness, as well as the metric consistency described below, is straightforward.) The crucial observation is that in a consistent set we can decide which strand is upper and which strand is lower in every contact; we just may need to alter such a decision later. Therefore, when we test adding a new contact, we check whether the resulting set of contacts is two-colorable in the following sense:
– the color of a contact tells whether the upper strand has the lower number or the larger one;
– two contacts are connected if they share a strand (e.g., (i, j) and (k, j)) and either (a) one is parallel and one is anti-parallel, or (b) they share a residue of the common strand;
– two connected contacts are either required to have the same color, or to have different colors.
To illustrate, suppose that we consider contacts (2, 5) and (5, 7) and they share a residue on strand number 5. Then from top to bottom the strands have to be ordered (2, 5, 7) or (7, 5, 2), and thus the contacts have to have the same color. But if the second contact is (3, 5), the strands have to be ordered (2, 5, 3), and thus the contacts have to have different colors. Two-colorability is very easy to check. When we have a solution, it consists of several connected components of contacts, in terms of the connections between contacts that we have just described. The connected components of this graph correspond to rigid parts of the chain, and they can be mapped onto a grid in such a way that strands form rows and paired partners are adjacent in common columns. Such a layout allows us to form a conservative estimate of the minimal length of the coils that join the strands in the components. If such a coil is actually shorter, we disallow the candidate. As before, we disallow a candidate if it would create a cycle. Up to this point, the algorithm does not differ from that of Cheng and Baldi in a significant way. (Their notion of consistency, as exhibited by their program, is a bit different from the one described in their paper, but in the evaluation it was indistinguishable.) However, we have this new element: after selecting a consecutive contact, say between strands i and i + 1, we double the score of contacts between strand pairs (i, i + 2), (i − 1, i + 1), (i − 1, i + 2), and change their position within the ordering to reflect that. This rule explicitly promotes a folding pathway. It is actually a part of a more general rule, restricted here to the cases of the smallest separation between strands, and thus the most reliable scores. There are biophysical reasons for which the probability of hydrogen bonding between strands i and i + 2 (Fig. 2) is increased under the assumption that strands i and i + 1 are already hydrogen bonded: namely, strand i + 2 would stabilize the conformation already acquired by strands i and i + 1. The higher probability of bonding between strands i − 1 and i + 2 upon hydrogen bonding between i and i + 1 is in turn justified by the loss of entropy of the subchain separating strands i − 1 and i + 2, resulting from the hairpin formation. This rule can be extended to strands i − 2 and i + 3, but with the current scoring schema this had no effect on the results (see the Discussion section).

Fig. 2. Strands i − 1, i, i + 1, and i + 2 in the promotion rule (schematic).
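The two-colorability test reduces to consistency checking on a graph whose edges carry "same color" or "different color" labels; a minimal BFS sketch is given below, assuming the caller derives the (c, d, same) constraints from the connection rules above. The names are illustrative, not the paper's code.

```python
from collections import deque

def two_colorable(contacts, constraints):
    """contacts: list of contact ids.  constraints: (c, d, same) triples,
    where same=True forces equal colors and same=False different ones."""
    adj = {c: [] for c in contacts}
    for c, d, same in constraints:
        adj[c].append((d, same))
        adj[d].append((c, same))
    color = {}
    for start in contacts:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:               # BFS, propagating color parities
            c = queue.popleft()
            for d, same in adj[c]:
                want = color[c] if same else 1 - color[c]
                if d not in color:
                    color[d] = want
                    queue.append(d)
                elif color[d] != want:
                    return False   # inconsistent: reject the new contact
    return True
```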
Fig. 3. Example of how promotion may have good secondary effects. We show the table of pairwise scores for 2C-methyl-D-erythritol-2,4-cyclodiphosphate synthase (PDB id: 1iv1, chain a). The entries in the table are color-coded: purple codes the interval from 2/3 to 1, and each subsequent code (purple-blue, blue, blue-green, etc.) codes an interval decreased by a factor of 2/3 (white codes the remaining values down to zero). A black background codes the true contacts, purple ovals are the contacts found by Cheng & Baldi, and pink ovals are the contacts found by our version of greedy. After contact 2-3 was selected, contact 1-4 (between strand 1 and strand 7) was promoted over 1-2; once we got contacts 1-4-3-2, contact 1-2 was blocked by the cycle-free rule; moreover, 1-5 was blocked by 5-6 and 5-7, and thus 1-7 became the best available contact for 1 – as well as for 7.
3 Results
We used the data set of Cheng and Baldi (see [5], page 176), which consists of 916 protein chains that contain up to 45 β-strands. We also used the output of their program, which, given a sequence of amino acids (residues), returns (a) a sequence of secondary structure identifications (α-helix, β-strand, coil), and (b) for every pair of residues classified as β-strand, a pseudo-probability that these two residues face each other in a pairing of two β-strands. To evaluate the results we used their file of DSSP identifications of the correct secondary structure and the correct pairing of β-strand residues.
We defined the population of possible answers in two ways: pairs of β-strands as identified by predict_beta_fasta.sh, and as identified by DSSP. Given a pair of predicted (true) strands, we defined the pairing to be true (correctly predicted) if for at least one residue of one strand there was a residue in the other strand that was in a contact described by DSSP (predicted by the evaluated program). These two definitions yielded different numbers, but they registered roughly the same differences between the various programs, so our conclusions do not seem to depend on this somewhat arbitrary definition. We compare three programs: the three-stage program of Cheng and Baldi, the ILP optimizer, and our greedy algorithm with pathway-based promotion. The differences in the quality of predictions are very consistent across the various measures we use. We use $T$ and $F$ to indicate the numbers of true and false predictions, and $\oplus$ and $\ominus$ to indicate positive and negative predictions. To evaluate a set of predictions, we use the correlation coefficient as well as specificity/selectivity pairs:

$$\text{Correlation coefficient} = \frac{T_\oplus T_\ominus - F_\oplus F_\ominus}{\sqrt{(T_\oplus + F_\oplus)(T_\oplus + F_\ominus)(T_\ominus + F_\oplus)(T_\ominus + F_\ominus)}}, \qquad \text{Spe} = \frac{T_\oplus}{T_\oplus + F_\oplus}, \qquad \text{Sel} = \frac{T_\oplus}{T_\oplus + F_\ominus}$$
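For concreteness, these measures can be computed directly from the four counts; the snippet below is a plain restatement of the formulas above, checked against the ALL row of the promoted greedy in Table 1.

```python
from math import sqrt

def prediction_stats(tp, fn, fp, tn):
    """Specificity, selectivity (sensitivity) and correlation coefficient."""
    spe = tp / (tp + fp)
    sel = tp / (tp + fn)
    cc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return spe, sel, cc

# ALL row of the greedy with pathway-based promotion (Table 1):
print(prediction_stats(5089, 3083, 3035, 61821))  # ~ (0.626, 0.623, 0.577)
```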
The correlation coefficient was 0.555 for Cheng and Baldi's method, 0.567 for the ILP optimizer, and 0.577 for the greedy algorithm with pathway-based promotion.
4 Discussion and Conclusions
We considered two methods of predicting β-sheet pairing partners using the machine-learned scores for inter-residue contacts from [5]. In the first method, we computed an optimal set of pairs by solving an instance of an integer linear program. The fact that increasing the sum of scores improves the predictions suggests that these scores are indeed related to the energy of contacts. On the other hand, giving preference according to our rule leads to lower sums of scores, and yet it improves the specificity significantly without decreasing the sensitivity. This suggests that a local assembly may remain stable even when it is inconsistent with a conformational state that has the minimal energy. However, for contacts separated by more than 3 strands, the reliability of Cheng and Baldi's scores seems to decrease rather quickly, and more complete versions of our rules do not lead to further improvements. In the future, a more complete set of rules, based on the work of Richardson [18] and Przytycka et al. [17], should be added. However, more complete rules are also more ambiguous – the number of possible successive steps in the folding process goes up and we need to rely on pairwise predictors more, while at the same time their reliability goes down. Improving the pairwise prediction of contacts for more separated pairs of strands seems to be a necessary challenge before a qualitative improvement in ab initio
Table 1. Comparison of the results of the three tested algorithms on a set of 916 protein chains. Note that the discriminating power of the potential function quickly decreases as the separation grows, and the statistical quality measures are largely determined by contacts separated by up to three other strands.

greedy – Cheng & Baldi's version
separation  true pos.  false neg.  false pos.  true neg.  specificity  sensitivity  corr. coef.
ALL           5032       3140        3370        61563       0.599        0.616        0.557
0             3748        363        2136         3577       0.637        0.912        0.541
1              521        485         484         7418       0.518        0.518        0.457
2              407        523         355         6710       0.534        0.438        0.423
3              169        359         161         6412       0.512        0.320        0.368
4              100        276          89         5788       0.529        0.266        0.348
5               38        241          58         5130       0.396        0.136        0.209
6               29        195          32         4482       0.475        0.129        0.230
7               11        157          10         3891       0.524        0.065        0.175
8+               9        541          45        18155       0.167        0.016        0.044

ILP optimizer
ALL           5092       3080        3253        61603       0.610        0.623        0.568
0             3781        330        2084         3621       0.645        0.920        0.558
1              538        468         552         7342       0.494        0.535        0.449
2              427        503         317         6741       0.574        0.459        0.457
3              167        361         119         6447       0.584        0.316        0.398
4               94        282          72         5798       0.566        0.250        0.352
5               36        243          39         5143       0.480        0.129        0.230
6               30        194          10         4498       0.750        0.134        0.306
7               14        154          13         3883       0.519        0.083        0.196
8+               5        545          47        18130       0.096        0.009        0.021

greedy – with pathway-based promotion
ALL           5089       3083        3035        61821       0.626        0.623        0.577
0             3715        396        1733         3972       0.682        0.904        0.596
1              594        412         619         7275       0.490        0.590        0.473
2              472        458         385         6673       0.551        0.508        0.469
3              142        386         122         6444       0.538        0.269        0.347
4               81        295          68         5802       0.544        0.215        0.318
5               37        242          41         5141       0.474        0.133        0.231
6               30        194          10         4498       0.750        0.134        0.306
7               11        157          14         3882       0.440        0.065        0.158
8+               7        543          43        18134       0.140        0.013        0.034
prediction methods for the tertiary structure of proteins. At the same time, this task cannot be separated from the search for the best methods of using such predictors. An important implication of this work is the demonstration that a simple algorithm that takes folding rules into account works better than heavy-duty integer linear programming. This suggests that a future line of research should be the development of a folding-rule-dependent scoring function that would allow a richer set of folding rules to be explored.
Acknowledgments. The authors thank George D. Rose (JHU), Bonnie Berger (MIT), and Arthur M. Lesk (PSU) for insightful discussions. We also thank Jianlin Cheng for help in using their program. This work was supported in part by the Intramural Research Program of the National Institutes of Health, National Library of Medicine.
References
1. Baldwin, R.L., Rose, G.D.: Is protein folding hierarchic? II. Folding intermediates and transition states. Trends in Biochemical Sciences 24(2), 77–83 (1999)
2. Baldwin, R.L., Rose, G.D.: Is protein folding hierarchic? I. Local structure and peptide folding. Trends in Biochemical Sciences 24(1), 26–33 (1999)
3. Berman, P., Jeong, J.: Consistent sets of secondary structures in proteins, http://www.cse.psu.edu/~jijeong
4. Bystroff, C., Baker, D.: Prediction of local structure in proteins using a library of sequence-structure motifs. Journal of Molecular Biology 281(3), 565–577 (1998)
5. Cheng, J., Baldi, P.: Three-stage prediction of protein beta-sheets by neural networks, alignments and graph algorithms. Bioinformatics 21(suppl. 1), i75–i84 (2005)
6. Crippen, G.M.: The tree structural organization of proteins. Journal of Molecular Biology 126, 315–332 (1978)
7. Hubbard, T.J., Park, J.: Fold recognition and ab initio structure predictions using hidden Markov models and β-strand pair potentials. Proteins: Structure, Function, and Genetics 23(3), 398–402 (1995)
8. Hutchinson, E.G., Sessions, R.B., Thornton, J.M., Woolfson, D.N.: Determinants of strand register in antiparallel β-sheets of proteins. Protein Science 7(11), 2287–2300 (1998)
9. Inbar, Y., Benyamini, H., Nussinov, R., Wolfson, H.J.: Protein structure prediction via combinatorial assembly of sub-structural units. Bioinformatics 19(suppl. 1), 158–168 (2003)
10. Kingsford, C.L., Chazelle, B., Singh, M.: Solving and analyzing side-chain positioning problems using linear and integer programming. Bioinformatics 21(7), 1028–1036 (2004)
11. Klepeis, J.L., Floudas, C.A.: ASTRO-FOLD: A combinatorial and global optimization framework for ab initio prediction of three-dimensional structures of proteins from the amino acid sequence. Biophysical Journal 85, 2119–2146 (2003)
12. Kryshtafovych, A., Venclovas, C., Fidelis, K., Moult, J.: Protein folding: From the Levinthal paradox to structure prediction. Journal of Molecular Biology 293(2), 283–293 (1999)
13. Lesk, A.M., Rose, G.D.: Folding units in globular proteins. PNAS 78(7), 4304–4308 (1981)
14. Levinthal, C.: Are there pathways for protein folding? Journal de Chimie Physique et de Physico-Chimie Biologique 65, 44 (1968)
15. Menke, M., King, J., Berger, B., Cowen, L.: Wrap-and-pack: A new paradigm for beta structural motif recognition with application to recognizing beta trefoils. Journal of Computational Biology 12(6), 777–795 (2005)
16. Moult, J.: A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Current Opinion in Structural Biology 15(3), 285–289 (2005)
17. Przytycka, T.M., Srinivasan, R., Rose, G.D.: Recursive domains in proteins. Protein Science 11(2), 409–417 (2002)
18. Richardson, J.S.: β-Sheet topology and the relatedness of proteins. Nature 268(5620), 495–500 (1977)
19. Rose, G.D.: Hierarchic organization of domains in globular proteins. Journal of Molecular Biology 134(3), 447–470 (1979)
20. Ruczinski, I., Kooperberg, C., Bonneau, R., Baker, D.: Distributions of beta sheets in proteins with application to structure prediction. Proteins: Structure, Function, and Genetics 48(1), 85–97 (2002)
21. Srinivasan, R., Rose, G.D.: LINUS: A hierarchic procedure to predict the fold of a protein. Proteins: Structure, Function, and Genetics 22(2), 81–99 (1995)
22. Steward, R.E., Thornton, J.M.: Prediction of strand pairing in antiparallel and parallel β-sheets using information theory. Proteins: Structure, Function, and Genetics 48(2), 178–191 (2002)
23. Woolfson, D.N., Evans, P.A., Hutchinson, E.G., Thornton, J.M.: On the conformation of proteins: The handedness of the connection between parallel β-strands. Journal of Molecular Biology 110, 269–283 (1977)
24. Xu, J., Li, M., Kim, D., Xu, Y.: RAPTOR: Optimal protein threading by linear programming. Journal of Bioinformatics and Computational Biology 1(1), 85–117 (2003)
25. Zhang, C., Kim, S.-H.: The anatomy of protein β-sheet topology. Journal of Molecular Biology 299(4), 1075–1089 (2002)
26. Zhu, H., Braun, W.: Sequence specificity, statistical potentials, and three-dimensional structure prediction with self-correcting distance geometry calculations of beta-sheet formation in proteins. Protein Science 8(2), 326–342 (1999)
A Fast and Accurate Heuristic for the Single Individual SNP Haplotyping Problem with Many Gaps, High Reading Error Rate and Low Coverage Loredana M. Genovese, Filippo Geraci, and Marco Pellegrini Istituto di Informatica e Telematica del CNR, Via G. Moruzzi 1, 56100-Pisa (Italy)
[email protected], [email protected], [email protected]
Abstract. Single nucleotide polymorphism (SNP) is the most frequent form of DNA variation. The set of SNPs present in a chromosome (called the haplotype) is of interest in a wide area of applications in molecular biology and biomedicine, including diagnostics and medical therapy. In this paper we propose a new heuristic method for the problem of haplotype reconstruction for (portions of) a pair of homologous human chromosomes from a single individual (SIH). The problem is well known in the literature, and exact algorithms have been proposed for the case when no (or few) gaps are allowed in the input fragments. These algorithms, though exact and of polynomial complexity, are slow in practice. Therefore fast heuristics have been proposed. In this paper we describe a new heuristic method that is able to tackle the case of many gapped fragments and retains its effectiveness even when the input fragments have a high rate of reading errors (up to 20%) and low coverage (as low as 3). We test our method on real data from the HapMap Project.
1 Introduction
The single nucleotide polymorphism or SNP (pronounced "snip") is the most common variation in the human DNA. In fact, a recent study from 2001 has shown that the similarity among human DNA sequences is over 99%, and only a few bases (just 1.42M bases overall) are responsible for the variations in human phenotypes [12]. A SNP is a variation of a single nucleotide at a fixed point of the DNA sequence, within a bounded range of possible values. The sequence of SNPs in a specific chromosome (or a large portion of a chromosome) is generically called the haplotype. Since most cells in humans are diploid, each chromosome (except the X and Y chromosomes in males) comes in two almost identical copies, one inherited from the mother and one from the father. Thus the haplotype of a chromosome is fully described by two sequences of SNPs in the two copies of the chromosome. The Single Individual SNP Haplotype reconstruction problem is the problem of rebuilding the two strings forming the haplotype from a set of fragments obtained by shotgun sequencing of the chromosomes' DNA strands. The most important aspect of the problem is that with the current technology it is difficult and/or impractical to keep track of the association of the fragments with their chromosome; thus this association has to be reconstructed computationally, and it is a necessary preliminary phase to the actual fragment assembly that reconstructs the haplotype. Unlike the classical DNA fragment assembly problem, in
which the position and orientation of fragments is unknown, in the parental haplotype reconstruction problem the position of each fragment is fixed and known. Further aspects that render the problem difficult (and computationally interesting) are the following:
1) Reading errors. The complex nature of the biological/chemical/optical processes involved in shotgun sequencing implies that a non-negligible error probability is attached to each single SNP reading.
2) Coverage of fragments. Algorithms using fragments to reconstruct a string rely heavily on the fragments' overlaps and on the redundancy of information provided by several fragments covering the same SNP position, in order to perform in silico correction of reading errors. Thus a critical parameter of the input data is the minimum (or average) coverage of SNPs by fragments. This number is also related to the throughput of the sequencing equipment.
3) Gaps in fragments. Ideally each fragment covers consecutive SNP positions in the order of the SNPs of a chromosome. However, in practice we may have many fragments with gaps, due to several phenomena.
3.1) Ambiguous readings. In the reading of fragments it may happen that it is impossible to detect the value of a SNP with sufficient confidence. It is better to model this ambiguous case with a small gap rather than introduce spurious values.
3.2) Matepair sequences. Some shotgun sequencing methodologies produce pairs of fragments that are from the same chromosome, do not overlap, and whose distance is known up to a certain degree of precision. Matepair sequences are used to cope with the presence of repeated subsequences that complicate the reconstruction efforts. This extra information attached to the produced fragments can be considered logically equivalent to a single fragment with one gap.
Our contribution. In this paper we propose a heuristic algorithm for the SIH problem that is fast, handles gaps well, and is able to deal with high reading error rates and low fragment coverage. We demonstrate these properties via experiments on real human data from the HapMap project [5]. Advanced personalized medicine is one of the goals of current research trends, and in this area new genetic diagnostic methods are critical. It is thus important to support diagnostic technologies that can be used as much as possible in the field (closer to the patient, and far from the traditional high-tech labs). Away from the controlled environment of a lab it is likely that the current portable technology for sequencing will produce less reliable data. Moreover, if a real-time and high-throughput response is needed to care for the needs of many individuals in a short time span, one might not be able to guarantee a high coverage of the fragments and low reading error rates. Our algorithm is a step forward in the direction of efficiently extracting useful information even from low-quality data.
Formalization of the problem. From the computational point of view, the problem of haplotype reconstruction was defined in [7,13]. It can be easily described as follows: let $S = \{s_1, s_2, \dots, s_n\}$ be a set of SNPs (specific positions in a DNA string) and $F = \{f_1, f_2, \dots, f_m\}$ a set of DNA fragments. Each SNP can be covered by a certain number of fragments and can take only two values (the values of the haplotype in that position). The natural way of representing fragments is to store them in an $m \times n$ matrix $M$, called the SNP matrix. The element
$M_{i,j}$ contains the value of the SNP $s_j$ in the fragment $f_i$, or the special character − if that SNP is unspecified in the fragment. If the element $M_{i,j} = −$, we say that it is a gap or, equivalently, that the fragment $f_i$ contains a gap at position $j$. Let $f_i \in F$ and $1 \le a \le b \le n$ be such that $\forall k \in [a, b]$, $M_{i,k} \ne −$ and $\forall k \in [1, n] \setminus [a, b]$, $M_{i,k} = −$; then the fragment $f_i$ is called gapless. We say that $M$ is gapless if all its fragments are gapless. We say that two fragments $f_i$ and $f_j$ have a collision if the following condition is true: $\exists k \in [1, n]$ such that $M_{i,k} \ne M_{j,k} \wedge M_{i,k} \ne − \wedge M_{j,k} \ne −$. Given the matrix $M$, the conflict graph $G = (V, E)$ is defined as follows: for each row of $M$ there is a vertex labelled with the corresponding fragment $f_i$; if $f_i$ has a collision with $f_j$, we insert an edge between $V_i$ and $V_j$. If $M$ is error free, the conflict graph $G$ is bipartite, and in this case the haplotype reconstruction is easy to solve: the rows of $M$ can be split into two disjoint sets according to the bipartition of $G$. By construction of the graph $G$, the $i$-th character of all elements of a set induced by the bipartition has the same value or is a gap. Thus for each set we build a haplotype by simply choosing as the value for SNP $s_i$ the value of the $i$-th character (not equal to −). If $M$ is not error free, the graph $G$ may not be bipartite. The single individual haplotype reconstruction problem can then be reduced to one of the following problems [2]:
– Minimum Fragment Removal (MFR): determine a minimal number of fragments (rows of the matrix $M$) whose removal from the input set induces a bipartite graph.
– Minimum SNP Removal (MSR): determine a minimal number of SNPs (columns of the matrix $M$) whose removal from the input set induces a bipartite graph.
– Longest Haplotype Reconstruction (LHR): determine a set of fragments (rows of the matrix $M$) whose removal from the input set induces a bipartite graph such that the length of the induced haplotype is maximized.
– Minimum Error Correction (MEC): determine a minimal set of entries of the matrix $M$ whose correction to a different value induces a bipartite graph.
Our approach. We give a heuristic method for the minimum error correction problem (MEC), since we permit changing single matrix entries. It is a heuristic method since we have no guarantee of attaining the minimum, nor any guarantee on the approximation to the minimum that we can achieve. Note however that MEC is the hardest of the problems listed above. Our method is organized in phases (four phases) and is greedy in nature (making choices that are optimal in a local sense). In each phase we perform three tasks: 1) detect likely positions of errors, 2) allocate fragments to the two partially built haplotype strings, and 3) build partial haplotype strings, deciding ambiguous SNPs by majority. The difference among the phases is twofold: on one hand we can use the knowledge built up in the previous phases, and on the other hand, in passing from one phase to the next, we relax the conditions for the decisions to be taken regarding tasks 1), 2) and 3).
Organization of the paper. In Section 2 we review the state of the art for the SIH problem. In Section 3 we describe our algorithm. In Section 4 we describe the experiments and their results.
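The error-free case can be sketched in a few lines: build the conflict graph and two-color it. The helper below is our own illustration (it assumes M is a list of equal-length strings over the SNP alphabet, with '-' for gaps), not the authors' code.

```python
import networkx as nx

def split_error_free(M):
    """Bipartition the rows of a gap-coded SNP matrix M, assuming the
    conflict graph is bipartite (i.e., M is error free)."""
    m, n = len(M), len(M[0])
    g = nx.Graph()
    g.add_nodes_from(range(m))
    for i in range(m):
        for j in range(i + 1, m):
            if any(M[i][k] != M[j][k] and M[i][k] != '-' and M[j][k] != '-'
                   for k in range(n)):
                g.add_edge(i, j)   # fragments i and j collide
    colors = nx.algorithms.bipartite.color(g)  # raises if G is not bipartite
    return (sorted(v for v in colors if colors[v] == 0),
            sorted(v for v in colors if colors[v] == 1))
```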
2 State of the Art
SNP’s and Haplotypes have become recently a focus of research (See the HapMap project [5]) because of their potential for associating observable phenotypes (e.g. resilience to diseases, reactivity to drugs) to individual genetic profiles [15]. The technology for detecting the position of SNP’s in the human genome has been developed [9,12] and continues to be refined to produce more accurate SNP maps. Two large and active areas of research involving haplotypes are the determination of the genetic variability in a population (see surveys in [2,6]) starting from genotyping data, and the association of genetic variability with phenotypes. In this paper we discuss the problem of determining the haplotype of a single individual based on fragments from shotgun sequencing of his/her DNA which is known as the Single Individual SNP Haplotyping Problem (SIH)1 . This problem has been tackled both from a theoretical point of view [1,3,4,7,13] and from a more practical one [8,11,14]. Weighted versions of the problem are studied in [16]. The SIH problem is clearly not formally an input/output problem as defined usually in computer science2 , therefore precise complexity statements can be made only for the derived problems such as: MEC, LHR, MFR and MSR. MEC even with gapless fragments is NP-hard [3], and it is APX-hard for fragments with at most 1 gap [4]. There is an O(log n)-approximate polynomial time algorithm [11]. LHR with gapless fragments can be solved exactly in polynomial time [3]; it is NP-hard and APX-hard for fragments with at most 1 gap [4]. MFR is NP-hard for fragments with at most 1 gap, and MSR is NP-hard for fragments with at most 2 gaps [7]. If we have a bound k on the total number of gaps, for k constant, MFR and MSR are polynomially solvable [13]. In general MFR and MSR are APX-hard. The polynomial time algorithms proposed for the above problems are at least cubic (in the gapless case) therefore a faster heuristic method has been proposed in [11] that is based on an incremental construction. We improve upon [11] by giving a method that is as fast in practice and more accurate when the reading error rate increases and/or the fragment coverage decreases. Interestingly, even if exact polynomial algorithms are known for MFR on gapless input in [13], simulations reported in [11] show that the heuristic method of [11] achieves better accuracy in solving the original SIH problem. For this reason we take [11] as baseline algorithm even when dealing with fragments with gaps. Wang et al. [14] describe a Genetic Algorithm for this problem that in some reported experiments gives good performance for short haplotypes (about 100 SNPs). It is unclear how this method would performs on longer haplotypes and with lower coverage rate. We are not aware of any publicly available implementation of the methods described in [8,11,14,16], therefore we chose as baseline the method in [11] that is comparable to ours in terms of speed, and does not rely on any statistical model. As future work we plan a comparison of our method with the one in [14]. 1 2
¹ Also called the Haplotype Assembly Problem.
² SIH informally relates the output of the algorithm to an unknown DNA string whose "approximation" is the purpose of the algorithm. The formal input to the algorithm is a set of fragments that are related to the unknown string via physical error-prone processes. Thus there is no mathematically formalized relationship between the input and the criterion for evaluating the output of the algorithm.
3 Our Heuristic
The input to the problem is a set of fragments $F$ and a set of SNP positions $S$. The output is a pair of consensus strings $S_1$ and $S_2$. In the process of obtaining the consensus strings one has to decide to which string a fragment should be associated, whether any letter in a fragment should be modified, and finally decide by majority the output letter at any given position. Ideally one should strive for a minimal modification of the input letters. Note however that our quality metric is the reconstruction error, not the number of letters changed. We start by building the SNP matrix $M$ with $m$ rows and $n$ columns, where each row is a fragment. The element in position $M_{i,j}$ is the $j$-th SNP in fragment $f_i$, or − if it is a gap. Our heuristic builds the haplotype consensus with a pre-processing phase (phase 1) and three main phases (2–4):
Ph-1: we perform a statistical analysis of potential conflicts among pairs of columns in $M$;
Ph-2: we select a first group of columns with the highest possible confidence of being error-free and build an initial solution from them;
Ph-3: we select those columns that we are able to disambiguate using the solution obtained in the previous phase;
Ph-4: in the last phase we try to complete the solution using weaker conditions for assigning columns to the final solution.
In this section we give priority to an intuitive understanding of the several phases and steps, skipping some more formal details, to be expanded in the full paper.
First phase: preprocessing. For each column $i$ of $M$ we build a group $G_i$ containing a certain number of sets. Each set is initialized with the indexes of all the rows which have, in position $i$, a given character different from −. So $G_i$ can contain from 0 up to 4 sets (at most one for each base: a, c, g, t).
Observation 1. If $G_i$ has 0 sets, column $i$ is empty; in this case there is no data to reconstruct the haplotype for column $i$. If $G_i$ has just 1 set, all the characters in column $i$ are the same. If $G_i$ has more than 2 sets, column $i$ contains errors.
If $G_i$ contains three or four sets, we can suppose that the one or two smaller sets are due to errors. Unfortunately we can only detect the presence of errors; we do not have enough information to correct them. In this case we remove from the matrix $M$ the information about the possibly incorrect values and update $G_i$ accordingly. Note that in cases where $G_i$ contains a large set and two smaller ones of the same size, we cannot remove those sets, because we could likely be removing correct data. If we suppose a constant coverage of each locus by both haplotypes, then in the case where $G_i$ has two sets and one of them is much bigger than the other, we can suppose that locus to be homozygous and the data in the smaller set to be reading errors. Clearly in this case we can predict the right content of the matrix $M$ in these positions. After filtering out the above easy cases, we are left to deal with groups of two sets of non-negligible size. Given two groups $G_i = (S_{i,1}, S_{i,2})$ and $G_j = (S_{j,1}, S_{j,2})$ having exactly 2 sets and such that $i \ne j$, we call the conflict matrix the square matrix $E_{i,j}$ of order 2:

$$E_{i,j} = \begin{pmatrix} S_{i,1} \cap S_{j,1} & S_{i,1} \cap S_{j,2} \\ S_{i,2} \cap S_{j,1} & S_{i,2} \cap S_{j,2} \end{pmatrix}$$
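A direct transcription of this definition, with groups represented as pairs of Python sets, might look as follows; has_conflict encodes the "only one non-empty diagonal, of full rank" test described next. Both function names are hypothetical.

```python
def conflict_matrix(Gi, Gj):
    """2x2 matrix of row-set intersections between two 2-set groups.
    Gi, Gj: pairs (S1, S2) of sets of row indices."""
    (a1, a2), (b1, b2) = Gi, Gj
    return [[a1 & b1, a1 & b2],
            [a2 & b1, a2 & b2]]

def has_conflict(E):
    """No detectable error iff exactly one diagonal has both cells non-empty
    and the other diagonal is empty."""
    main = E[0][0] and E[1][1] and not (E[0][1] or E[1][0])
    anti = E[0][1] and E[1][0] and not (E[0][0] or E[1][1])
    return not (main or anti)
```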
When only one diagonal of $E$ has non-zero elements and $E$ is of full rank, there are no detectable errors. Otherwise we have a conflict between columns $i$ and $j$. The detected errors could be in one or both columns.
Observation 2. If $E_{i,j}$ has only one element equal to $\emptyset$, we can suppose that the corresponding diagonal element contains the reading errors, and its cardinality is the number of such errors. For example, if in $E_{i,j}$ only the element $S_{i,2} \cap S_{j,1}$ is empty, then there are $|S_{i,1} \cap S_{j,2}|$ errors, in at least one of the columns $i$ and $j$, in the rows whose indexes are in $S_{i,1} \cap S_{j,2}$.
The assumption that the elements in $S_{i,1} \cap S_{j,2}$ are the errors in $E_{i,j}$ becomes more plausible if its cardinality is significantly smaller than the others.
Observation 3. In the presence of errors in $E_{i,j}$ we cannot establish whether the error is in column $i$, in column $j$, or in both. We can locate the error if one of the following conditions holds:
– if $\forall k \ne j$, $E_{i,k}$ does not contain errors, then the error is likely in column $j$;
– if $\exists k$ such that $E_{i,j}$ has an error, $E_{j,k}$ has an error, and $E_{i,k}$ has no errors, then we deduce that the error is likely in column $j$.
In the case of the example in Observation 2, if also one of the conditions of Observation 3 holds, we deduce that the errors are in the rows $S_{i,1} \cap S_{j,2}$ of column $j$. So we can correct the error by removing from $M$ the incorrect values and updating $S_{j,2}$ by removing $S_{i,1} \cap S_{j,2}$. If none of the conditions in Observation 3 hold, we cannot discriminate between columns $i$ and $j$, so we remove the errors at the cost of a loss of information by assigning $S_{j,2} = S_{j,2} \setminus (S_{i,1} \cap S_{j,2})$ and $S_{i,1} = S_{i,1} \setminus (S_{i,1} \cap S_{j,2})$. We have observed empirically that the error-correcting criteria of the first phase are effective when the input has a very low reading error rate. As the error rate increases, the bulk of the disambiguation falls on the subsequent phases 2–4.
Second phase. The main goal of this phase is the selection of a set of pairs of groups with the highest possible probability of containing no inconsistencies, and the extraction from them of two sets of fragments that will be the core of the first (partial) solution.
Candidate list selection. The optimal set of candidate pairs to select is the one in which each group has no conflicts with any other group. Unfortunately, if the percentage of errors in $M$ is high, this set can be empty. Moreover, a correct group can be involved in a conflict with another group due to reading errors in the latter; this fact causes the removal of all the pairs in which that group appears. Higher coverage tends to amplify this bad effect on the size of the candidate set; in fact, the probability that a group with no errors has a conflict with a group with errors is proportional to the coverage. If the optimal candidate set of group pairs is empty, we must find the set with the highest confidence of being a good candidate set. First of all we compute the mean number of conflicts among pairs of groups. As the candidate set we pick all the pairs for which all the following conditions hold: a) both groups of the pair have two sets, b) the number of conflicts in which the pair's groups are involved is less than the mean, and c) the conflict matrix $E$ of the pair's groups is diagonal and of full rank.
Extraction of the initial core. From the candidate list obtained in the previous paragraph, we now build two disjoint sets of rows of $M$ that will be used as the core of the final solution. We build a series of chains of pairs in this way: the first pair of a new chain is the first unused pair of the candidate list; then we add a pair to the chain if at least one group of the pair is already in the chain, until no more pairs can be added. The procedure stops when all the pairs are in a chain. At the end we select the longest chain. The construction of the series of chains is straightforward. First of all, we sort the candidate pairs in lexicographic order and place them in a vector $L = [C_0, \dots, C_{|L|}]$. We also build a vector $V$ in which we store all the indexes $j \in [1, |L|]$ of $L$ such that the first elements of consecutive pairs are different, i.e., $C_j[0] \ne C_{j-1}[0]$. We set the first element of $V$ to the value 0 and the last element of $V$ to the value $|L|$. We also build a vector $v$ of size $m$ containing status flags: position $i$ is set to "to visit" if the $i$-th group does not appear in any chain, to "visited" if it appears in the chain we are building, and to "complete" if all the pairs containing the index $i$ have already been used. A new chain is built as follows:
1. Find an index $i$ such that the pair $C_i$ has not already been used, and set it as the first element of the chain.
2. All the elements of $L$ not yet used in the range $[V[i], V[i+1]-1]$ are added to the chain, if they exist. The vector $v$ is updated accordingly.
3. If there is an index $j$ such that $v[C_j[0]]$ is set to "visited", go to step (2) using $i$ such that $V[i] = j$. Otherwise search for a pair where $v[C_j[1]]$ is set to "visited" and go to step (2) using $i$ such that $V[i] = j$.
4. If $v$ has no element set to "visited", the chain is complete.
It is easy to see that the arbitrary choice of the first element does not influence the pairs that will fall into the chain, but only their order, which is not important in our heuristic. Chains have the following important property:
Property 1. If we consider groups in the same order in which they appear in a chain, one of the following conditions holds:
1. $S_{i,1} \cap S_{i+1,1} \ne \emptyset \wedge S_{i,2} \cap S_{i+1,2} \ne \emptyset \wedge S_{i,1} \cap S_{i+1,2} = \emptyset \wedge S_{i,2} \cap S_{i+1,1} = \emptyset$
2. $S_{i,1} \cap S_{i+1,2} \ne \emptyset \wedge S_{i,2} \cap S_{i+1,1} \ne \emptyset \wedge S_{i,1} \cap S_{i+1,1} = \emptyset \wedge S_{i,2} \cap S_{i+1,2} = \emptyset$
We are now ready to build a sort of "super-group" $G = (S_1, S_2)$, in which $S_1$ will be used to build the first haplotype consensus and $S_2$ the second. If $G_0$ is the first element of the longest chain, $S_1$ is initialized with the elements of $S_{0,1}$ and $S_2$ with the elements of $S_{0,2}$. Property 1 suggests a simple way to assign the sets of each considered group to a set $S_i$: if, for example, the elements of set $S_{i,1}$ have been assigned to $S_1$ and $S_{i,1} \cap S_{i+1,1} \ne \emptyset$ holds, the elements of set $S_{i+1,1}$ can also be assigned to $S_1$. All the groups whose sets are assigned to $G$ are marked as "used" and will not be considered in the next phases. If the considered columns of $M$ (remember that $G_i$ refers to column $i$ of $M$) have no errors, we have that $S_1 \cap S_2 = \emptyset$. Otherwise³ there are errors in the rows of $M$ whose indexes are in the intersection, in at least one of the columns considered. If there is an element $j$ in both the $S_i$'s and we do not remove it from one of these sets, the fragment $f_j$ would contribute to both haplotypes, which is incorrect. In order to choose to which haplotype
³ As before, high reading error rates reduce the efficacy of previous filtering steps.
to assign the fragment $f_j$, we simply count how many times $j$ appears in the sets assigned to $S_1$ and to $S_2$, and assign $j$ to the set with the highest number of assignments.
Third phase. If we succeed in partitioning all the rows of $M$, we are ready to build the final haplotype consensus using the method described at the end of this section. Experiments with high error rates show that at the end of the previous phase we are able to assign a large part of the rows of $M$, but not all of them, because we did not have enough information to unambiguously assign some fragments to a set $S_i$. In this phase we already have a partial solution that can give us more information, and we can use weaker conditions to assign elements of the groups to $G$. The first information we distill from the partial solution is an estimate of the mean ratio between the cardinalities of the sets of fragments belonging to the two haplotype strings. We compute this ratio only for those groups that were involved in the partial solution, because they have a higher probability of being correct with respect to the others. We can now safely assume that if the ratio between the cardinalities of the sets of an unused group is far enough from the mean, the locus represented by that group is homozygous, the elements in the smaller set of that group are all errors, and $M$ can be corrected accordingly. Considering $G$ as a group, we can build a vector of conflict matrices $E = E_1, \dots, E_n$, such that $E_i$ is the conflict matrix relative to $G$ and $G_i$. Note that these matrices are more informative than those of the previous phase, because they are representative of a greater part of the input and not only of two columns. In case of conflicts in $E_i$ we can say, with high probability, that the errors are in $G_i$ and not in $G$. This becomes more evident in the case of a matrix $E_i$ that has just one element equal to $\emptyset$ and whose value on the diagonal containing the $\emptyset$ is much smaller than the values on the other diagonal. A matrix of this form was discarded in the previous phase, because the error position was not predictable with enough confidence; here, instead, the information provided by $G$ gives us the ability to deduce the exact position of the errors in the $i$-th column of $M$ and correct them. The main goal of this phase is to add as many elements as possible to $G$, trying to correct some errors in $M$ to improve the haplotype consensus. The procedure acts as follows:
1. Let $\alpha = \emptyset$, $\beta = \emptyset$.
2. For all those groups $G_i$, $i \in [1, n]$, not yet marked, with 2 sets, and such that $E_i$ is diagonal and of full rank: if $S_{i,1} \cap S_1 \ne \emptyset$ and $S_{i,1} \cap S_2 = \emptyset$, add the elements of $S_{i,1}$ to $\alpha$ and those of $S_{i,2}$ to $\beta$. Otherwise, due to the fact that $E_i$ is diagonal, it must hold that $S_{i,2} \cap S_1 \ne \emptyset$ and $S_{i,2} \cap S_2 = \emptyset$; in this case simply add the elements of $S_{i,1}$ to $\beta$ and those of $S_{i,2}$ to $\alpha$. $G_i$ becomes marked.
3. If an element $j$ appears in both $\alpha$ and $\beta$, we simply count how many times $j$ is present in the sets assigned to $\alpha$ and to $\beta$, and assign $j$ to the set with the highest number of assignments.
4. Assign all the elements of $\alpha$ to $S_1$ and the elements of $\beta$ to $S_2$.
5. Recompute the conflict matrices for the groups that are still not marked and restart from step (1), until no more groups can be marked.
6. Correct the errors that can be detected in $M$ and restart from step (1), until no more groups can be marked.
Fourth phase. If at the end of phase three some groups are still not marked, there is no further weaker condition that we can use to add those groups to $G$ safely. The goal in this phase is not to add elements to $G$ one by one, but to build another super-group $G'$ from the remaining unmarked groups and merge it with $G$, if possible. This strategy relies on the fact that an aggregation of columns is more robust to errors than a single column. The choice to reuse the previous phases seems the most reasonable, but we must use weaker constraints. We cannot use the techniques of the second phase to initialize $G'$, because at the end it might not intersect $G$ (or the intersection could be too small). The problem of the intersection between $G$ and $G'$ is important: if all the sets of both have null intersection, there is no way to join $G$ and $G'$; and if the intersection is small, because of errors we may by mistake join each set of $G'$ with the wrong one of $G$. The safest way to initialize $G'$ is to select the unmarked group with the highest possible intersection with $G$. Analyzing the matrices $E_i$ from the previous phase for the unmarked groups, the one with the highest sum of the elements on a diagonal is the best candidate to initialize $G'$. After the initialization of $G'$, we can use the previous phase to add other elements. Here two constraints are relaxed: it is no longer necessary that the conflict matrices be of full rank, and detected errors in $M$ are not corrected; instead, the wrong data is simply removed from $M$. Let $a$ and $b$ be such that $|S'_a \cap S_1| > |S'_a \cap S_2|$ and $|S'_b \cap S_1| < |S'_b \cap S_2|$ and $a \ne b$; we assign to $S_1$ all the elements of $S'_a$ not in $S_2$, and to $S_2$ all the elements of $S'_b$ not in $S_1$.
Haplotype consensus. At the end of the previous phase, some fragments may still be assignable to both haplotype strings. They will be assigned a posteriori, after the process of consensus construction, to the most similar haplotype. We split $M$ into two sub-matrices: $M_1$, containing all the rows with indexes in $S_1$, and $M_2$, containing the rows with indexes in $S_2$. Naturally it is impossible to establish which of the parent's haplotypes is deduced from $S_1$ and which from $S_2$. We call the pivot of $M$ at position $i$ the element $Pv_i^M$ (different from a gap) that appears most frequently; if column $i$ of $M$ has no elements, its pivot will be a gap. The consensus haplotype induced from $S_1$ is the sequence in which the $i$-th element is $Pv_i^{M_1}$, and the consensus haplotype induced from $S_2$ is obtained in the same way from $M_2$.
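A minimal sketch of the pivot-based consensus, assuming the same string-matrix representation used earlier; the function name is our own.

```python
from collections import Counter

def consensus(M, rows):
    """Column-wise pivot (most frequent non-gap symbol) over the given rows."""
    haplotype = []
    for k in range(len(M[0])):
        counts = Counter(M[i][k] for i in rows if M[i][k] != '-')
        haplotype.append(counts.most_common(1)[0][0] if counts else '-')
    return ''.join(haplotype)
```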
4 Experiments
In our experiments we compared the following algorithms: A) our heuristic, as described in Section 3; B) our implementation of Fast Hare (F.H.), following the description in [11]; C) the trivial reconstruction algorithm by majority voting, which has the true fragment assignment as part of its input (Base). We implemented the algorithms in Python. Tests were run on an Intel(R) Pentium(R) D CPU 3.20GHz with 4GB of RAM, running Linux. All algorithms completed their task in less than 10 seconds for the data of the largest size considered (strings of 1000 SNPs).
Input data and fragment generation. In previous papers [7,11], experiments were based on SNP matrices obtained from the fragmentation of artificially generated haplotype data. The most common approach to the generation of the
SNP matrices was suggested in [10]. The recent research project HapMap [12] has produced a map of the human haplotypes that is now publicly available [5]. Thus we were able to generate the fragment matrices from real data instead of using synthetic input haplotypes. Using real data, the Hamming distance between the two haplotypes is not a free parameter of our choice in the generation of $M$. For the extraction of the SNP matrix from the haplotypes we were inspired by the approach suggested in [10], taking into account standard parameters of current technology for shotgun sequencing. The free parameters we set in our experiments are: (a) the length $l$ of the haplotype section to be reconstructed, (b) the coverage $c$ of each haplotype, and (c) the error rate $e$. Current technology for shotgun sequencing is able to manage fragments on the order of hundreds of bases. In Li et al. [8] the average distance in bp of two SNPs in the DNA sequence is quantified as 300 bp on average, and each fragment is of 650 bp. Each fragment covers a number of SNPs roughly in the range [3, 7]; thus we chose the length of each fragment in this range. Our generation schema for each experiment is as follows: we select the haplotype strings from a random chromosome among the human chromosomes numbered 1–22 (thus excluding the gender chromosomes); we take a contiguous substring of length $l$ from the first haplotype, starting from a random location, and its homologous substring from the second haplotype. As in [10], each such string is replicated $c$ times. Next, errors are inserted uniformly at random in the haplotype substrings with probability $e$. At this point the strings are split into fragments by iteratively selecting the next cut point at an integer distance from the previous one, chosen uniformly at random in the range [3, 7], starting from the first base. Note that the number of fragments is not determined a priori: it depends on the length $l$, on the coverage $c$, and on the distribution of the fragment lengths. Gaps come from two sources. Input SNP gaps are those present in the original HapMap data. Mate pairs are obtained as follows: random pairs of disjoint fragments belonging to the same haplotype string are mated into a single gapped fragment (at the end of this phase, globally 50% of the fragments are 1-gapped).
Outcome of the experiments. We investigated the performance of our algorithm in different settings, varying the input parameters. We chose three different lengths for the haplotypes: 100 bases as in [11], 500 bases as in [10], and 1000 bases. To test the effectiveness of the method we varied the coverage of each haplotype from 3 to 10, considering that in most reported experiments the coverage is about 5 [10]. To test the algorithms' robustness we used different levels of errors, from 0% to 20%. Each test was repeated 100 times, and Table 1 reports the mean number of errors in the reconstructed haplotypes with respect to the strings before the error implants.
Analysis of the experiments. In the absence of errors (but with gaps) our method was able to reconstruct the haplotypes exactly in all cases. The reconstruction error rate increases for all three methods as the reading error rate increases, and it decreases as the coverage increases. In order to give a synthetic view of the data in Table 1 we use the merit function $f$:

$$f = \begin{cases} 0 & \text{if } Our = FH \\[4pt] 1 - \dfrac{Our - B}{FH - B} & \text{if } Our < FH \\[4pt] -\left(1 - \dfrac{FH - B}{Our - B}\right) & \text{if } Our > FH \end{cases} \qquad (1)$$
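Equation (1) translates directly into code; the sketch below assumes the baseline is at least as accurate as both heuristics (so the denominators are positive), which holds for the entries of Table 1.

```python
def merit(our, fh, b):
    """Figure of merit of equation (1); our, fh, b are error counts."""
    if our == fh:
        return 0.0
    if our < fh:
        return 1.0 - (our - b) / (fh - b)
    return -(1.0 - (fh - b) / (our - b))

# e.g., l = 100, coverage 5, 20% errors (Table 1): in our favor
print(merit(7.77, 11.51, 5.71))  # ~ 0.645
```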
Table 1. Quality measurements on the compared algorithms: mean, over 100 runs, of the number of errors in the reconstructed haplotypes, for error rates in [0.0, 0.2], coverage in [3, 10], and haplotype lengths l = 100, 500, 1000.

l = 100
Err.  Alg.    cov 3   cov 5   cov 8   cov 10
0%    Base     0.00    0.00    0.00    0.00
      Our      0.00    0.00    0.00    0.00
      F.H.     0.00    0.00    0.00    0.04
5%    Base     0.97    0.10    0.00    0.00
      Our      0.97    0.15    0.07    0.02
      F.H.     1.26    0.18    0.19    0.03
10%   Base     4.05    0.75    0.03    0.01
      Our      5.39    0.88    0.44    0.03
      F.H.     9.32    1.54    0.41    0.02
15%   Base     9.78    2.26    0.34    0.08
      Our     12.21    2.83    0.43    0.25
      F.H.    18.41    3.40    1.55    0.83
20%   Base    15.13    5.71    1.19    0.35
      Our     20.44    7.77    2.16    0.93
      F.H.    32.63   11.51    3.40    1.68

l = 500
Err.  Alg.    cov 3   cov 5   cov 8   cov 10
0%    Base     0.00    0.00    0.00    0.00
      Our      0.00    0.00    0.00    0.00
      F.H.     0.72    0.40    0.04    0.27
5%    Base     4.81    0.47    0.01    0.00
      Our      5.60    0.57    0.01    0.04
      F.H.    11.79    0.95    0.03    0.03
10%   Base    21.14    3.74    0.24    0.03
      Our     26.45    4.28    0.33    0.07
      F.H.    45.52    5.91    0.43    0.07
15%   Base    47.71   12.57    1.49    0.38
      Our     66.55   25.70    2.21    0.96
      F.H.   102.60   25.34    2.65    0.88
20%   Base    80.97   27.90    5.04    1.68
      Our    120.53   52.38   10.17    4.74
      F.H.   224.46   64.14   12.32    4.16

l = 1000
Err.  Alg.    cov 3   cov 5   cov 8   cov 10
0%    Base     0.00    0.00    0.00    0.00
      Our      0.00    0.00    0.00    0.00
      F.H.     8.94    1.79    2.24    0.04
5%    Base    11.60    0.99    0.04    0.01
      Our     14.95    1.59    0.13    0.03
      F.H.    21.24    2.97    0.26    0.55
10%   Base    43.38    7.85    0.60    0.13
      Our     60.87    9.95    2.59    0.29
      F.H.   123.92   15.17    1.43    0.46
15%   Base    95.40   25.42    2.66    0.77
      Our    134.74   35.61    4.59    2.46
      F.H.   268.09   58.81    4.63    1.59
20%   Base   159.74   56.90   10.86    3.53
      Our    220.52   94.26   23.12   13.54
      F.H.   469.54  150.18   22.21   11.05
where $Our$ is the error count of our algorithm, $FH$ is the error count of Fast Hare, and $B$ is the error count of the baseline algorithm. Note that when $Our$ and $FH$ tie, $f$ has value zero. When $Our$ is better than $FH$, $f$ assumes a value in the range $[0, 1]$: the higher the absolute value, the better our algorithm is w.r.t. Fast Hare. Symmetrically, when Fast Hare is better than our algorithm, $f$ assumes values in the range $[-1, 0]$: the higher the absolute value, the better Fast Hare is w.r.t. our algorithm. This indicator is almost always in our favor (see Fig. 1). The figure of merit $f$ gives an idea of the quality ratio of FH and our method w.r.t. the baseline. There are 10 cases out of 60 in which FH has a better ratio; this happens mostly with high coverage (8 or 10). However, in these cases the quality difference is always rather small: less than 0.03 bases
Fig. 1. Figure of merit (see equation 1) for the experiments in Table 1, for l = 100, 500, and 1000 bases and coverage 3, 5, 8, 10, as a function of the error rate. Values above 0 indicate better relative performance of our method over Fast Hare; values below 0 indicate better relative performance of Fast Hare.
over 100 bases, less than 0.58 over 500 bases, and less than 2.49 over 1000 bases. Conversely, when our method has a better quality ratio at lower coverage, the absolute difference in reconstruction errors is often large as well.
References
1. Bafna, V., Istrail, S., Lancia, G., Rizzi, R.: Polynomial and APX-hard cases of the individual haplotyping problem. Theor. Comput. Sci. 335(1), 109–125 (2005)
2. Bonizzoni, P., Della Vedova, G., Dondi, R., Li, J.: The haplotyping problem: an overview of computational models and solutions. J. Comput. Sci. Technol. 18(6), 675–688 (2003)
3. Cilibrasi, R., van Iersel, L., Kelk, S., Tromp, J.: On the complexity of several haplotyping problems. In: Casadio, R., Myers, G. (eds.) WABI 2005. LNCS (LNBI), vol. 3692, pp. 128–139. Springer, Heidelberg (2005)
4. Cilibrasi, R., van Iersel, L., Kelk, S., Tromp, J.: On the complexity of the single individual SNP haplotyping problem. Algorithmica (in print, 2007)
5. The International HapMap Consortium: A haplotype map of the human genome. Nature 437, 1299–1320 (2005), http://snp.cshl.org
6. Gusfield, D., Orzack, S.H.: Haplotype inference. In: CRC Handbook on Bioinformatics, ch. 1, pp. 1–25. CRC Press, Boca Raton, USA (2005)
7. Lancia, G., Bafna, V., Istrail, S., Lippert, R., Schwartz, R.: SNPs problems, complexity, and algorithms. In: Meyer auf der Heide, F. (ed.) ESA 2001. LNCS, vol. 2161, pp. 182–193. Springer, Heidelberg (2001)
8. Li, L., Kim, J.H., Waterman, M.S.: Haplotype reconstruction from SNP alignment. In: Proceedings of the Seventh Annual International Conference on Computational Molecular Biology, pp. 207–216. ACM Press, New York (2003)
9. Matukumalli, L.K., Grefenstette, J.J., Hyten, D.L., Choi, I.-Y., Cregan, P.B., Van Tassell, C.P.: Application of machine learning in SNP discovery. BMC Bioinformatics 7, 4 (2006)
10. Myers, G.: A dataset generator for whole genome shotgun sequencing. In: Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pp. 202–210. AAAI Press, Stanford, California, USA (1999)
11. Panconesi, A., Sozio, M.: Fast Hare: A fast heuristic for single individual SNP haplotype reconstruction. In: Jonassen, I., Kim, J. (eds.) WABI 2004. LNCS (LNBI), vol. 3240, pp. 266–277. Springer, Heidelberg (2004)
12. Sachidanandam, R., et al.: A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409, 928–933 (2001)
13. Rizzi, R., Bafna, V., Istrail, S., Lancia, G.: Practical algorithms and fixed-parameter tractability for the single individual SNP haplotyping problem. In: Guigó, R., Gusfield, D. (eds.) WABI 2002. LNCS, vol. 2452, pp. 29–43. Springer, Heidelberg (2002)
14. Wang, R.-S., Wu, L.-Y., Li, Z.-P., Zhang, X.-S.: Haplotype reconstruction from SNP fragments by minimum error correction. Bioinformatics 21(10), 2456–2462 (2005)
15. Weiner, M.P., Hudson, T.J.: Introduction to SNPs: discovery of markers for disease. Biotechniques Suppl. (2002)
16. Zhao, Y.-Y., Wu, L.-Y., Zhang, J.-H., Wang, R.-S., Zhang, X.-S.: Haplotype assembly from aligned weighted SNP fragments. Computational Biology and Chemistry 29(4), 281–287 (2005)
Two Birds, One Stone: Selecting Functionally Informative Tag SNPs for Disease Association Studies
Phil Hyoun Lee and Hagit Shatkay
Computational Biology and Machine Learning Lab, School of Computing, Queen's University, Kingston, ON, Canada
{lee,shatkay}@cs.queensu.ca
Abstract. Selecting an informative subset of SNPs, generally referred to as tag SNPs, to genotype and analyze is considered to be an essential step toward effective disease association studies. However, while the selected informative tag SNPs may characterize the allele information of a target genomic region, they are not necessarily the ones directly associated with disease or with functional impairment. To address this limitation, we present a first integrative SNP selection system that simultaneously identifies SNPs that are both informative and carry a deleterious functional effect – which in turn means that they are likely to be directly associated with disease. We formulate the problem of selecting functionally informative tag SNPs as a multi-objective optimization problem and present a heuristic algorithm for addressing it. We also present the system we developed for assessing the functional significance of SNPs. To evaluate our system, we compare it to other state-of-the-art SNP selection systems, which conduct both information-based tag SNP selection and function-based SNP selection, but do so in two separate consecutive steps. Using 14 datasets, based on disease-related genes curated by the OMIM database, we show that our system consistently improves upon current systems.
1 Introduction
Identifying single nucleotide polymorphisms¹ (SNPs) that are involved in complex common diseases, such as cancer, is a major challenge in current molecular epidemiology. Due to their genome-wide prevalence, knowledge of such SNPs is expected to be essential for unraveling the genetic etiology of human diseases, and thus, for enabling timely diagnosis, treatment, and, ultimately, prevention of disease. However, genotyping² and analyzing all the SNPs on the human genome [2] is practically infeasible, as the number of SNPs is estimated at over ten million [3]. Thus, selecting a subset of SNPs that is sufficiently informative to conduct disease-gene association but still small enough to reduce the genotyping and analysis overhead, a process known as tag SNP selection, is a key step toward effective association studies.
¹ A single nucleotide polymorphism (SNP) is the substitution of a single nucleotide at a certain position on the genome [1].
² Genotyping is the biomolecular process of identifying the nucleotide of a genetic variation [1].
A variety of measures and algorithms have been proposed for tag SNP selection, and their utility has been empirically demonstrated by simulation studies or by association studies. Yet, while the selected informative tag SNPs may effectively characterize the allele information of a target genomic region, they are not necessarily the ones directly associated with disease or with functional impairment. Given this limitation, SNPs with deleterious functional effects have drawn recent attention [4,5]. Typically, SNPs occurring in functional genomic regions are more likely to cause functional distortion and, as such, more likely to underlie disease-causing variations [2,6]. As of yet, methods for the selection of informative tag SNPs do not take into account the functional significance of SNPs; similarly, methods for identifying disease-related SNPs do not attempt to capture the allele information of the complete target locus³.
The identification of informative tag SNPs and of functionally significant SNPs can be viewed as two distinct optimization problems with possibly conflicting objectives. Consequently, current systems that try to support both information-based tag SNP selection and function-based SNP selection [7,8] address each selection problem independently. That is, they separately conduct tag SNP selection and function-based SNP selection, and combine the two selected sets as a last step. A major shortcoming of such systems is that the number of selected SNPs can be much larger than necessary. Moreover, the functional SNPs selected may not be predictive of the other SNPs in the locus, while the predictive SNPs selected may have no relation to disease.
To address this limitation, we propose an integrative SNP selection system that simultaneously identifies SNPs that are both informative and carry a deleterious functional effect, which in turn means that they are likely to be disease-related. We formulate SNP selection as a multi-objective optimization problem, to which we refer as functionally informative tag SNP selection. We define a single objective function, incorporating both allelic information and functional significance of SNPs, and present a heuristic selection algorithm that we show, through a comparative study, to improve upon other state-of-the-art systems. To our knowledge, the idea of combining the two notions of SNP selection, the function-based and the information-based, into a single optimized selection process is new, and was not attempted before.
In Sec. 2, we formulate the problem of functionally informative tag SNP selection, and introduce the basic notations that are used throughout the paper. Section 3 describes our functional-significance assessment process and our heuristic algorithm for selecting functionally informative SNPs. Section 4 reports the results from a comparative study. Section 5 summarizes our findings and outlines future directions.
2 Functionally Informative Tag SNP Selection
We are concerned with identifying a set of SNPs associated with a given disease. The relevant target locus on the genome can be as large as a whole chromosome or as small as a part of a gene. Disease association studies typically involve the following steps: 1) chromosome samples are obtained from cases bearing the disease and from controls (people not bearing the disease); 2) the allele information for all the SNPs on the target locus is obtained (genotyped) from the chromosome samples; 3) a subset of SNPs that is most associated with the disease phenotype⁴ is identified. However, in practice, due to experimental cost and time, not all the SNPs on the target locus can be genotyped or analyzed. We thus need to select a subset of at most k SNPs on the target locus (where k is a pre-specified number) whose allele information is as informative as that of the whole set of SNPs, while including those SNPs that are most functionally significant. We refer to the problem as functionally informative tag SNP selection.
³ A locus is the chromosomal location of the target region for biomolecular experiments [1].
Before we formulate and address this problem, we introduce here the basic notations that are used throughout this paper. Suppose that our target locus contains p consecutive SNPs. Each SNP can be represented as a discrete random variable, X_j (j = 1, ..., p), whose possible values are the 4 nucleotides, {a, g, c, t}. For each value x ∈ {a, g, c, t}, there is a probability Pr(X_j = x) that X_j is assigned the nucleotide x. Let V = {X_1, ..., X_p} denote the set of random variables corresponding to the p SNPs. We are given a haplotype⁵ dataset, D, containing the allele information of n haplotypes, each of which consists of the p SNPs in V. The set D can be viewed as an n by p matrix; each row, D_{i+}, in D corresponds to the allele information of the p SNPs comprising haplotype h_i, while each column, D_{+j}, corresponds to the allele information of SNP X_j in each of the n haplotypes. We denote by D_{ij} the allele information of the j-th SNP in the i-th haplotype. To formally address functional significance of SNPs, we denote by e_j the functional significance score for each SNP X_j in V, and define E = {e_1, ..., e_p} to be the set of scores for all the SNPs. We further discuss how these values can be obtained in Sec. 3.1.
For a subset of SNPs, T ⊂ V, we define an objective function, f(T|D, E), to reflect both the allele information carried by the SNPs in T about the remaining SNPs in V − T, and the functional significance of the SNPs in T. The problem of functionally informative tag SNP selection can then be stated as follows:
Problem: Functionally Informative Tag SNP Selection
Input: A set of SNPs, V; a maximum number of SNPs to select, k; a haplotype dataset, D; a set of functional significance scores, E.
Output: A set of SNPs T = argmax_{T ⊂ V, |T| ≤ k} f(T|D, E).
That is, to select a subset of functionally informative tag SNPs, we need to find among all possible subsets of the original SNPs in the set V, an optimal subset of SNPs, T, of size ≤ k, based on the objective function f(T|D, E). Our first task is to define the objective function, f(T|D, E). To do so, we first introduce two simpler objective functions, denoted by f1(T|D) and f2(T|E); the former measures the allelic information, while the latter measures the functional significance of a SNP set T.
Definition 1. Information-based Objective. Given a set of k SNPs, T = {X_{t1}, ..., X_{tk}}, and a dataset D of n haplotypes, we define an information-based objective function, f1(T|D), as:
f1(T|D) = (1/(np)) Σ_{j=1}^{p} Σ_{i=1}^{n} I(X_j, T, D_{i+})
⁴ A phenotype is the physical, observed manifestation of a genetic trait [1].
⁵ A haplotype is a set of consecutive SNPs present on one chromosome [1].
where
I(X_j, T, D_{i+}) = 1 if D_{ij} = argmax_{x ∈ {a,g,c,t}} Pr(X_j = x | X_{t1} = D_{it1}, ..., X_{tk} = D_{itk}), and 0 otherwise.
The function I returns 1 if the allele of the j-th SNP in the i-th haplotype (i.e., D_{ij}) is correctly predicted based on the allele information of the SNPs in T. We note that, by using the conditional probability expression, the allele of D_{ij} is predicted as the one that is most likely to occur given the allele information of predictive tag SNPs in T.⁶ Otherwise, the function I returns 0. To summarize, the allelic information provided by a SNP set, T, with respect to a given haplotype dataset D, is measured by the average proportion of the correctly predicted alleles of each SNP, X_j, given the allele information of the SNPs in T.
This information-based objective function, f1(T|D), was introduced in our previous work [9], and is based on the prediction-based tag SNP selection approach [10,11], which aims to select a subset of SNPs (i.e., tag SNPs) that can best predict the alleles of the remaining, unselected SNPs (i.e., tagged SNPs). This approach is appealing since: (1) it does not require prior block partitioning [12]; (2) it tends to select a small number of SNPs [13]; and (3) it works well even for genomic regions with low linkage disequilibrium⁷ [9]. An in-depth discussion and survey of information-based tag SNP selection approaches is given elsewhere [14,15].
Definition 2. Function-based Objective. Given a set of k SNPs, T ⊂ V, and a set of functional significance scores, E = {e_1, ..., e_p}, we define a function-based objective function, f2(T|E), as:
f2(T|E) = (Σ_{j=1}^{p} e_j · I_T(X_j)) / (Σ_{j=1}^{p} e_j), where I_T(X_j) = 1 if X_j ∈ T, and 0 otherwise.
In other words, the functional significance of a SNP set T is the normalized sum of the functional significance of SNPs in T. We note that, for the vast majority of SNPs, no experimental evidence is yet available to substantiate their functional significance [2]. We thus define and evaluate the functional significance of SNPs using a large variety of bioinformatics tools for function-assessment. The details of our assessment procedure are described in Sec. 3.1. Based on the two functions defined above, we next define a single objective function, f(T|D, E), incorporating allelic information and functional significance.
Definition 3. Functionally Informative Objective Function. Given a set of k SNPs, T ⊂ V, a haplotype dataset, D, a functional significance score set, E = {e_1, ..., e_p}, and a parameter value, α (0 ≤ α ≤ 1), we define a functionally informative (FI) objective function, f(T|D, E), as:
f(T|D, E) = α · f1(T|D) + (1 − α) · f2(T|E).
⁶ Note that for any SNP X_{tl} ∈ T, I(X_{tl}, T, D_{i+}) is by definition always 1.
⁷ Linkage disequilibrium (LD) refers to the non-random association of SNPs [1].
The parameter α is a weighting factor, which allows us to adjust the importance of information-based selection with respect to that of functional significance. In the work described here, we assign an equal weight to the two criteria, that is, α = 0.5. We refer to the value assigned by this function to the subset of SNPs T as the FI-score of T. To summarize, we are looking for a subset of at most k SNPs, T, that is both functionally significant and likely to correctly predict the remaining SNPs in V − T. Bafna et al. [12] have previously shown that finding the k most informative tag SNPs is NP-hard. Based on this, we take it as a conjecture that the current problem is also NP-hard (the proof is beyond the scope of this paper). The next section introduces a function-assessment process and a heuristic algorithm to address the problem.
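To make Definitions 1-3 concrete, the following sketch estimates the conditional probabilities of Definition 1 empirically from the haplotype matrix D, by grouping haplotypes that share the same alleles on the tag SNPs and predicting the majority allele within each group. The representation of D as a list of allele sequences and the helper names are illustrative assumptions, not the authors' implementation.

    from collections import Counter

    def f1(D, T):
        # Definition 1: average fraction of alleles correctly predicted from the tags.
        # D: list of n haplotypes, each a length-p sequence over {a, g, c, t};
        # T: set of tag SNP indices. Ties in the argmax are broken arbitrarily.
        n, p = len(D), len(D[0])
        groups = {}
        for row in D:
            key = tuple(row[t] for t in sorted(T))  # alleles on the tag SNPs
            groups.setdefault(key, []).append(row)
        correct = 0
        for rows in groups.values():
            for j in range(p):
                # the argmax prediction is the most frequent allele within the group;
                # every haplotype carrying that allele is predicted correctly
                correct += Counter(r[j] for r in rows).most_common(1)[0][1]
        return correct / (n * p)

    def f2(E, T):
        # Definition 2: normalized functional significance of the selected SNPs
        return sum(E[j] for j in T) / sum(E)

    def fi_score(D, E, T, alpha=0.5):
        # Definition 3: weighted combination; alpha = 0.5 as in the paper
        return alpha * f1(D, T) + (1 - alpha) * f2(E, T)

Note that with T = ∅ all haplotypes fall into a single group, so each SNP is predicted by its major allele, which matches the behavior the algorithm relies on at its first iteration (Sec. 3.2).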
3 Models and Algorithms
Our SNP selection system involves two main steps: 1) assessing the functional significance, e_j, of SNPs, and 2) selecting a set of functionally informative tag SNPs, T. These are described next.
3.1 Assessing the Functional Significance of SNPs
Using a variety of existing, publicly available bioinformatics tools, we examine the deleterious effects of SNPs on the molecular function of their genomic region. In particular, we focus on the following three major categories of biological function:
– Protein Coding: SNPs in protein coding regions may cause an amino acid substitution (i.e., a missense mutation) or interfere with protein translation (i.e., a nonsense mutation).
– Splicing Regulation: SNPs in splicing regulatory regions may affect alternative splicing or result in exon skipping or intron retention.
– Transcriptional Regulation: SNPs in transcription regulatory regions (e.g., transcription factor binding sites, CpG islands, regulatory RNAs) can alter the affinity of the binding sites, and disrupt proper gene regulation.
We assess the functional significance of SNPs based on their location and possible deleterious effects along these three functional categories. Figure 1 illustrates the following assessment process: for each of the three categories, a SNP is separately assigned to one of three classes⁸: Class 1 indicates irrelevance to the biological function; Class 2 indicates that the SNP is relevant to the biological function, but is predicted to be benign or has no evidence of deleterious effects; Class 3 indicates that the SNP is likely to be deleterious. For example, SNPs outside a protein coding region are considered to be irrelevant to protein coding, and as such are assigned to Class 1 with respect to Protein Coding. Among the SNPs within a protein coding region, nonsense SNPs and some missense SNPs are predicted to have deleterious effects on protein coding, and are thus assigned to Class 3; the remaining SNPs within the protein coding region are assigned to Class 2.
⁸ Thus, a SNP is assigned three class labels; one label for each of the three functional categories.
[Figure 1 shows a flowchart in which each SNP passes through three parallel decision trees, one per functional category. Protein Coding: coding region? then nonsense/missense? then deleterious? (tools: PolyPhen, SIFT, SNPeffect, SNPs3D, LS-SNP; majority vote), yielding Class 1/2/3. Splicing Regulation: intronic splice site or exonic splicing regulator? then conserved? (tools: ESEfinder, RescueESE, ESRSearch, PESX; majority vote), yielding Class 1/2/3. Transcription Regulation: TF binding site or other regulatory region? then conserved? (tools: TFSearch, Consite, GoldenPath, HGMD, rSNP; majority vote), yielding Class 1/2/3.]
Fig. 1. Our functional significance assessment system
Similarly, the SNPs within a highly conserved splice regulatory region or transcriptional regulatory region are assumed to be deleterious with respect to the corresponding regulatory function [2], and are thus assigned to Class 3, while the SNPs within non-conserved regulatory regions are only relevant to the respective function, and are thus assigned to Class 2. To make a robust assessment, we use multiple bioinformatics tools that are based on different data, algorithms, or theory for examining each biological functional category. The tools PolyPhen [16], SIFT [17], SNPeffect [18], SNPs3D [19], and LS-SNP [20] are used to examine missense SNPs; ESEfinder [21], RescueESE [22], ESRSearch [23], and PESX [24] are used to identify the SNPs in exonic splice regions; TFSearch [25] and Consite [26] are used to identify transcriptional regulatory SNPs in promoter regions; the Ensembl [27], GoldenPath [28], and HGMD [29] databases are used to identify SNPs in other transcriptional regulatory regions (e.g., microRNA); and the Ensembl [27] database is used to identify nonsense SNPs and the SNPs in intronic splicing sites. The classes assigned to each SNP with respect to each functional category are decided by a majority vote of the integrated tools in the category. As a result, three class labels are assigned to each SNP, one for each of the three categories of biological function. To assign a single functional significance value to each SNP, we follow Bhatti et al. [2], and assign the highest class label along all three categories as the functional significance score, e_j, for the SNP X_j. For example, SNP rs4963 on gene ADD1 is assigned to Class 3 with respect to Protein Coding, Class 1 with respect to Splicing Regulation, and Class 1 with respect to Transcription Regulation. The functional significance score of SNP rs4963 is thus 3, because it is highly significant for the protein coding function.
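A compact sketch of this scoring scheme follows; the per-category tool outputs are assumed to be given as class labels in {1, 2, 3}, and the dictionary keys are illustrative.

    from collections import Counter

    def category_class(tool_votes):
        # majority vote over the class labels assigned by the tools of one category
        return Counter(tool_votes).most_common(1)[0][0]

    def significance_score(class_per_category):
        # following Bhatti et al. [2]: e_j is the highest class across the categories
        return max(class_per_category.values())

    # Example from the text: SNP rs4963 on gene ADD1
    e_rs4963 = significance_score(
        {"protein_coding": 3, "splicing_regulation": 1, "transcription_regulation": 1}
    )  # e_rs4963 == 3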
3.2 Selecting Functionally Informative Tag SNPs
Our selection algorithm takes an incremental, greedy approach. It starts with an empty tag SNP set, T, and iteratively adds one SNP to T until a maximum number, k, of SNPs are selected. Each greedy selection step identifies a SNP whose addition to T will result in the maximum increase in the value of the functionally informative objective function (FI-score) with respect to the current tagging set T. We first explain the basis for our greedy incremental selection process. Let T^(m) denote the set of m selected SNPs after the m-th iteration, where m = 0, ..., k and T^(0) = ∅. The FI-score of T^(m) was defined in Def. 3 as follows:
f(T^(m)|D, E) = α · f1(T^(m)|D) + (1 − α) · f2(T^(m)|E)
             = Σ_{j=1}^{p} [ α · (1/(np)) · Σ_{i=1}^{n} I(X_j, T^(m), D_{i+}) + (1 − α) · (e_j / Σ_{l=1}^{p} e_l) · I_{T^(m)}(X_j) ].
Note that the FI-score of T^(m) is the weighted sum of the allelic information of T^(m) and the functional significance of T^(m) for each SNP X_j (j = 1, ..., p). For simplicity, we denote the contribution of each SNP X_j to the FI-score of T^(m) as f_j(T^(m)|D, E), and refer to it as the FI-score of X_j with respect to T^(m). That is,
f_j(T^(m)|D, E) = α · (1/(np)) · Σ_{i=1}^{n} I(X_j, T^(m), D_{i+}) + (1 − α) · (e_j / Σ_{l=1}^{p} e_l) · I_{T^(m)}(X_j),
and
f(T^(m)|D, E) = Σ_{j=1}^{p} f_j(T^(m)|D, E).
In the next iteration, m + 1, we aim to select a SNP, X^(m+1), whose addition to T^(m) will maximally increase the FI-score. Using the FI-score of X_j with respect to T^(m), f_j(T^(m)|D, E), defined above, this goal can be stated as follows:
X^(m+1) = argmax_{X ∈ V − T^(m)} Σ_{j=1}^{p} [ f_j(T^(m) ∪ {X}|D, E) − f_j(T^(m)|D, E) ].
Our algorithm is outlined in Fig. 2. It starts with an empty set of tag SNPs, T, and computes the FI-score of each SNP with respect to the current set T. We note that although no SNP is currently selected, our algorithm can still predict the allele information of SNPs, and can thus lead to a different FI-score for each SNP. The reasoning is that in this initial case where T is empty, the posterior probability, Pr(X_j|T), shown in the definition of the function I within Def. 1, is simply the prior probability, Pr(X_j). That is, we always predict the alleles of X_j, D_{ij} (i = 1, ..., n), as the major allele of the SNP. This approach is taken because it maximizes the expected prediction accuracy when no other information is given. At each subsequent iteration, the SNP that leads to the maximum increase in the FI-score is selected and added to T. The FI-score for
Input: a set of SNPs, V; a maximum number of SNPs to select, k; a haplotype dataset, D; a set of functional significance scores, E.
Output: a set of tag SNPs, T.
m ← 0. T^(0) ← ∅.
For each SNP X_j ∈ V: FI_j ← f_j(T^(m)|D, E).
While m < k:
    For each t where X_t ∈ V − T^(m):
        Δ_t^(m) ← Σ_{j=1}^{p} [ f_j(T^(m) ∪ {X_t}|D, E) − FI_j ].
    X^(m+1) ← argmax_{X_t ∈ V − T^(m)} Δ_t^(m).
    T^(m+1) ← T^(m) ∪ {X^(m+1)}.
    For each X_j ∉ T^(m): FI_j ← f_j(T^(m+1)|D, E).
    m ← m + 1.
T ← T^(m).
Fig. 2. The incremental, greedy algorithm for selecting functionally informative tag SNPs
each SNP is updated based on the augmented set T and used in the next iteration. This procedure is repeated until the set T contains the pre-specified number of SNPs, k. The time complexity of each incremental greedy selection is O((p − m)² · n), where p − m is the number of SNPs that can be selected, and n is the number of haplotypes in a dataset D. As this iteration is repeated for m = 0 to m = k − 1, the overall complexity of our algorithm is O(k · n · p²).
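A direct transliteration of Fig. 2, reusing the fi_score helper sketched after Section 2; recomputing the full FI-score for each candidate is a simplification of the per-SNP bookkeeping (the FI_j values) used in the pseudocode, and k ≤ p is assumed.

    def select_tags(D, E, k, alpha=0.5):
        # incremental greedy selection: add the SNP with the largest FI-score gain
        p = len(D[0])
        T = set()
        while len(T) < k:
            current = fi_score(D, E, T, alpha)
            best_snp, best_gain = None, float("-inf")
            for t in range(p):
                if t in T:
                    continue
                gain = fi_score(D, E, T | {t}, alpha) - current
                if gain > best_gain:
                    best_snp, best_gain = t, gain
            T.add(best_snp)
        return T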
4 Experiments and Results
4.1 Experimental Setting
For evaluation, we have selected 14 genes that are involved in the etiology of common and complex diseases according to the OMIM database [30], and that have disease-related SNPs identified and recorded by the HapMap Project [31]. To identify the candidate genes, we scanned the OMIM database for several major common and complex diseases, including diabetes, cancer, hypertension, and heart disease. The retrieved genes were then scanned to find those that have SNPs with possible deleterious functional effects reported in the biomedical literature and that also have haplotype information available from the HapMap consortium [31]. From the genes satisfying these criteria, 14 were selected at random. Table 1 provides the genetic characteristics of the 14 genes and their associated diseases. The haplotype datasets of the 14 genes were downloaded from the HapMap project website [31]; the genomic location of each gene, including a 10k promoter region, was used to download the phased haplotype data (HapMap public release #20/phase II) for the CEU population.
Table 1. Summary of the 14 test datasets. Linkage disequilibrium (LD) is estimated by the multiallelic extension of Lewontin's LD, D′ [32]. The numbers of SNPs selected by TAMAL and by SNPselector are shown in the two rightmost columns.

Gene   Target Disease      Locus     LD (D′)  Total # of SNPs  # Selected (TAMAL)  # Selected (SNPselector)
ADD1   Hypertension        4p16.3    0.7718   60               16                  1
BRCA2  Breast Cancer       13q12.3   0.7657   106              28                  13
CMA1   Hypertension        14q11.2   0.8361   20               6                   4
ELAC2  Prostate Cancer     17p11     0.8336   35               13                  2
ERBB2  Prostate Cancer     17q21.1   0.8104   8                6                   1
F7     Heart Disease       13q34     0.8629   13               8                   5
HEXB   Mental Retardation  5q13      0.7371   51               10                  5
ITGB3  Heart Disease       17q21.32  0.6491   83               20                  8
LEPR   Diabetes            1p31      0.7048   245              46                  11
LTA    Heart Disease       6p21.3    0.7865   12               4                   2
MSH2   Colon Cancer        2p22-p21  0.8413   51               18                  4
NOS3   Alzheimer Disease   7q36      0.6183   16               7                   0
PTPRJ  Colon Cancer        11p11.2   0.7863   115              32                  7
TP53   Colon Cancer        17p13.1   0.7154   9                5                   2
We compare our system with two state-of-the-art SNP selection systems that support both tag SNP selection and function-based SNP selection: TAMAL [7] and SNPselector [8]. The two systems share the same goal as our system, namely, selecting a set of tag SNPs with significant effects on the molecular function of the genes, for association studies. However, they differ from our system in the assessment process for the functional significance of SNPs, the integrated bioinformatics tools, and the criteria used for selecting SNPs. Moreover, they conduct tag SNP selection and function-based SNP selection in two separate consecutive steps, while we address it as a single optimization problem. As evaluation measures, we use Halperin's prediction accuracy [11] and the FI-score introduced in Def. 3 (we note that the two systems to which we compare do not provide an evaluation measure). To compare the performance of the systems using the two measures, the SNP sets selected by each of the compared systems must include an equal number of SNPs. However, unlike our system, TAMAL and SNPselector do not allow the user to specify the number of selected SNPs, but rather calculate a subset of SNPs and provide it as their output. Thus, when they do not select the same number of SNPs for the same gene, they cannot be directly compared. Hence, for a fair comparison, we first apply each of the compared systems to each of the 14 test datasets, and then use our system on the same dataset to select the same number of SNPs as selected by the compared system. We then compute the two evaluation measures for the sets selected by each of the systems, and compare the resulting scores. The numbers of SNPs selected by TAMAL and SNPselector for the 14 tested genes are shown in Table 1. To ensure robustness of the results obtained from our system, we employ 10-fold cross validation 10 times, each using a randomized 10-way split of the n haplotypes. In all cases, the average performance is used in the comparison.
[Figure 3 contains four plots, each with a y-axis from 0 to 1 over the 14 tested genes: prediction accuracy (top row) and FI-score (bottom row), comparing Our System against TAMAL (left) and against SNPselector (right).]
(a) The prediction accuracy of the selected tag SNPs for each gene
(b) The FI-score of the selected tag SNPs for each gene
Fig. 3. The performance of our system and the compared systems for the 14 gene datasets
4.2 Results
Figure 3 shows the performance of our system compared with TAMAL (left) and with SNPselector (right). The X-axis represents the 14 genes in alphabetical order of their names, as listed in Table 1. In Fig. 3(a) (top), the Y-axis shows Halperin's prediction accuracy [11], and in Fig. 3(b) the Y-axis shows the FI-score for the selected SNP set of the corresponding gene. Our system (upper solid line with diamonds) consistently outperforms the other two systems, TAMAL and SNPselector (lower dotted line with rectangles), on both evaluation measures. The performance difference in all cases is statistically significant, as confirmed by the Wilcoxon rank-sum test (p-values are 1.144e-005 and 4.7e-003 with respect to the TAMAL system, and 1.7382e-005 and 5.6780e-004 with respect to the SNPselector system). We note that optimizing the FI-score when selecting SNPs does not compromise the predictive power of the SNPs selected by our system; that is, our selected SNPs still have high prediction accuracy according to Halperin's original measure, as demonstrated by Fig. 3(a).
5 Conclusions
We have presented a first integrative SNP selection system that simultaneously identifies SNPs that are both highly informative in terms of providing allele information for the target locus, and are of high functional significance. Our main contributions include the formulation of the problem of functionally informative tag SNP selection as a multi-objective optimization problem, a heuristic selection algorithm addressing the problem, and an assessment process for scoring the functional significance of SNPs. An empirical study over a set of 14 disease-associated genes shows that our system indeed improves upon current state-of-the-art systems. In the near future, we plan to apply a general computational approach, such as goal programming [33], for addressing the multi-objective optimization problem of selecting functionally informative tag SNPs. We also plan to apply a probabilistic approach to assess the functional significance of SNPs.
References
1. Hedrick, P.: Genetics of Populations, 3rd edn. Jones and Bartlett Publishers (2004)
2. Bhatti, P., Church, D., Rutter, J.L., Struewing, J.P., Sigurdson, A.J.: Candidate single nucleotide polymorphism selection using publicly available tools: a guide for epidemiologists. American Journal of Epidemiology 164, 794–804 (2006)
3. Sherry, S., Ward, M., Kholodov, M., Baker, J., Phan, L., Smigielski, E., Sirotkin, K.: dbSNP: the NCBI database of genetic variation. Nucleic Acids Research 29, 308–311 (2001)
4. Brunham, L.R., Singaraja, R.R., Pape, T.D., Kejariwal, A., Thomas, P.D., Hayden, M.R.: Accurate prediction of the functional significance of single nucleotide polymorphisms and mutations in the ABCA1 gene. PLoS Genetics 1, 739–747 (2005)
5. Rebbeck, T.R., Ambrosone, C.B., Bell, D.A., Chanock, S.J., Hayes, R.B., Kadlubar, F.F., Thomas, D.C.: SNPs, haplotypes, and cancer: applications in molecular epidemiology. Cancer Epidemiology, Biomarkers & Prevention 13, 681–687 (2004)
6. Conde, L., Vaquerizas, J.M., Ferrer-Costa, C., de la Cruz, X., Orozco, M., Dopazo, J.: PupasView: a visual tool for selecting suitable SNPs, with putative pathological effect in genes, for genotyping purposes. Nucleic Acids Research 33, W501–W505 (2005)
7. Hemminger, B.M., Saelim, B., Sullivan, P.F.: TAMAL: an integrated approach to choosing SNPs for genetic studies of human complex traits. Bioinformatics 22, 626–627 (2006)
8. Xu, H., Gregory, S.G., Hauser, E.R., Stenger, J.E., Pericak-Vance, M.A., Vance, J.M., Zuchner, S., Hauser, M.A.: SNPselector: a web tool for selecting SNPs for genetic association studies. Bioinformatics 21, 4181–4186 (2005)
9. Lee, P.H., Shatkay, H.: BNTagger: improved tagging SNP selection using Bayesian networks. Bioinformatics 22, e211–e219 (2006)
10. Sebastiani, P., Lazarus, R., Weiss, S.T., Kunkel, L.M., Kohane, I.S., Ramoni, M.F.: Minimal haplotype tagging. Proceedings of the National Academy of Sciences 100, 9900–9905 (2003)
11. Halperin, E., Kimmel, G., Shamir, R.: Tag SNP selection in genotype data for maximizing SNP prediction accuracy. Bioinformatics 21, i195–i203 (2005)
12. Bafna, V., Halldorsson, B.V., Schwartz, R., Clark, A.G., Istrail, S.: Haplotypes and informative SNP selection algorithms: don't block out information. In: Proceedings of the 7th International Conference on Computational Molecular Biology, pp. 19–26 (2003)
13. de Bakker, P., Graham, R.R., Altshuler, D., Henderson, B., Haiman, C.: Transferability of tag SNPs to capture common genetic variation in DNA repair genes across multiple populations. In: Proceedings of the Pacific Symposium on Biocomputing (2006)
14. Halldorsson, B.V., Istrail, S., Vega, F.D.L.: Optimal selection of SNP markers for disease association studies. Human Heredity 58(3-4), 190–202 (2004)
15. Lee, P.H.: Computational haplotype analysis: An overview of computational methods in genetic variation study. Technical Report 2006-512, Queen's University, Kingston, ON, Canada (2006), http://www.cs.queensu.ca/TechReports/Reports/2006-512.pdf
16. Ramensky, V., Sunyaev, S.: Human non-synonymous SNPs: server and survey. Nucleic Acids Research 30, 3894–3900 (2002)
17. Ng, P., Henikoff, S.: Predicting deleterious amino acid substitutions. Genome Research 11, 863–874 (2001)
18. Reumers, J., Schymkowitz, J., Ferkinghoff-Borg, J., Stricher, F., Serrano, L., Rousseau, F.: SNPeffect: a database mapping molecular phenotypic effects of human non-synonymous coding SNPs. Nucleic Acids Research 33, D527–D532 (2005)
19. Yue, P., Melamud, E., Moult, J.: SNPs3D: candidate gene and SNP selection for association studies. BMC Bioinformatics 7, 166 (2006)
20. Karchin, R., et al.: LS-SNP: large-scale annotation of coding non-synonymous SNPs based on multiple information sources. Bioinformatics 21, 2814–2820 (2005)
21. Cartegni, L., Wang, J., Zhu, Z., Zhang, M.Q., Krainer, A.R.: ESEfinder: a web resource to identify exonic splicing enhancers. Nucleic Acids Research 31, 3568–3571 (2003)
22. Yeo, G., Burge, C.B.: Variation in sequence and organization of splicing regulatory elements in vertebrate genes. Proceedings of the National Academy of Sciences 101(44), 15700–15705 (2004)
23. Fairbrother, W.G., Yeh, R.F., Sharp, P.A., Burge, C.B.: Predictive identification of exonic splicing enhancers in human genes. Science 297, 1007–1013 (2002)
24. Zhang et al.: Exon inclusion is dependent on predictable exonic splicing enhancers. Molecular and Cellular Biology 25(16), 7323–7332 (2005)
25. Akiyama, Y.: TFSEARCH: Searching Transcription Factor Binding Sites (1998), http://www.rwcp.or.jp/papia/
26. Sandelin, A., Wasserman, W.W., Lenhard, B.: ConSite: web-based prediction of regulatory elements using cross-species comparison. Nucleic Acids Research 32, W249–W252 (2004)
27. Hubbard, T.J.P., et al.: Ensembl. Nucleic Acids Research (Database issue) (2007)
28. Karolchik, D., et al.: The UCSC Genome Browser Database. Nucleic Acids Research 31(1), 51–54 (2003)
29. Krawczak, M., Thomas, N.S., Hundrieser, B., Mort, M., Wittig, M., Hampe, J., Cooper, D.N.: Single base-pair substitutions in exon-intron junctions of human genes: nature, distribution, and consequences for mRNA splicing. Human Mutation 28(2), 150–158 (2007)
30. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, and National Center for Biotechnology Information, National Library of Medicine: Online Mendelian Inheritance in Man, OMIM. http://www.ncbi.nlm.nih.gov/omim/
31. The International HapMap Consortium: The International HapMap Project. Nature 426, 789–796 (2003)
32. Hedrick, P.: Gametic disequilibrium measures: proceed with caution. Genetics 117, 331–341 (1987)
33. Lee, S.M.: Goal Programming for Decision Analysis. Auerbach, Philadelphia (1972)
Genotype Error Detection Using Hidden Markov Models of Haplotype Diversity
Justin Kennedy, Ion Măndoiu, and Bogdan Paşaniuc
CSE Department, University of Connecticut, Storrs, CT 06269
{jlk02019,ion,bogdan}@engr.uconn.edu
Abstract. The presence of genotyping errors can invalidate statistical tests for linkage and disease association, particularly for methods based on haplotype analysis. Becker et al. have recently proposed a simple likelihood ratio approach for detecting errors in trio genotype data. Under this approach, a SNP genotype is flagged as a potential error if the likelihood associated with the original trio genotype data increases by a multiplicative factor exceeding a user selected threshold when the SNP genotype under test is deleted. In this paper we give improved error detection methods using the likelihood ratio test approach in conjunction with likelihood functions that can be efficiently computed based on a Hidden Markov Model of haplotype diversity in the population under study. Experimental results on both simulated and real datasets show that proposed methods achieve significantly improved detection accuracy compared to previous methods with highly scalable running time.
1 Introduction
Despite recent advances in typing technologies and calling algorithms, significant error levels remain present in SNP genotype data (see [1] for a recent survey). A recent study of dbSNP genotype data [2] found that as much as 1.1% of about 20 million SNP genotypes typed multiple times have inconsistent calls, and are thus incorrect in at least one dataset. When genotype data is available for related individuals, some errors become detectable as Mendelian inconsistencies (MIs). However, a large proportion of errors (as much as 70% in mother-father-child trio genotype data [3,4]) remains undetected by Mendelian consistency analysis. Since even low error levels can lead to substantial losses in the statistical power of linkage and association studies [5,6,7], error detection remains a critical task in genetic data analysis. This task becomes particularly important in the context of association studies based on haplotypes instead of single locus markers, where error rates as low as 0.1% may invalidate some statistical tests for disease association [8].
An indirect approach to handling genotyping errors is to explicitly model them in downstream statistical analyses, see, e.g., [9,10]. While powerful, this approach often leads to complex statistical models and impractical runtimes for large datasets such as those generated by current large-scale association studies. A more practical approach is to perform genotype error detection as a separate
analysis step following genotype calling. SNP genotypes flagged as putative errors can then be excluded from downstream analyses or can be retyped when high quality genotype data is required. Error detection is currently implemented in all widely-used software packages for pedigree genotype data analysis, such as SimWalk2 [11] and Merlin [12], which detect Mendelian consistent errors by independently analyzing each pedigree and identifying loci of excessive recombination. Unfortunately, these methods are not appropriate for error detection in genotype data from unrelated individuals or small pedigrees such as mother-father-child trios, which require using population level linkage information.
In this paper we propose novel methods for genotype error detection extending the likelihood ratio error detection approach recently proposed by Becker et al. [13]. While we focus on detecting errors in trio genotype data, our proposed methods can be applied with minor modifications to genotype data coming from unrelated individuals and small pedigrees other than trios. Unlike Becker et al., who adopt a window-based approach and rely on creating a short list of frequent haplotypes within each window, we use a hidden Markov model (HMM) to represent frequencies of all haplotypes over the set of typed loci. Similar HMMs have been successfully used in recent works [14,15,16,17] for genotype phasing and disease association. Two limitations of previous uses of HMMs in this context have been the relatively slow (typically EM-based) training on genotype data, and the inability to exploit available pedigree information. We overcome these limitations by training our HMM on haplotypes inferred using the pedigree-aware ENT phasing algorithm of [18], based on entropy minimization.
Becker et al. [13] use the maximum phasing probability of a trio genotype as the likelihood function whose high sensitivity to single SNP genotype deletions signals potential errors. This probability is heuristically approximated by a computationally expensive search over quadruples of frequent haplotypes inferred for each window. When all haplotype frequencies are implicitly represented using an HMM, we show that computing the maximum trio phasing probability is in fact hard to approximate in polynomial time. Despite this result, we are able to significantly improve both detection accuracy and speed compared to [13] by using alternate likelihood functions such as the Viterbi probability and the total trio genotype probability. We show that these alternate likelihood functions can be efficiently computed for small pedigrees such as trios, with a worst-case runtime increasing linearly in the number of SNP loci and the number of trios. Further improvements in detection accuracy are obtained by combining likelihood ratios computed for different subsets of trio members. Empirical experiments show that this technique is very effective in reducing false positives within correctly typed SNP genotypes for which the same locus is mistyped in related individuals.
The rest of the paper is organized as follows. We introduce basic notations in Section 2 and describe the structure of the HMM used to represent haplotype frequencies in Section 3. Then, in Section 4 we present the likelihood ratio framework for error detection, and in Section 5 we describe three likelihood functions that can be efficiently computed using the HMM. Finally, we give experimental results assessing the error detection accuracy of our methods on both simulated
and real datasets in Section 6, and conclude with ongoing research directions in Section 7.
2 Preliminaries
We start by introducing the basic definitions and notations used throughout the paper. We denote the major and minor alleles at a SNP locus by 0 and 1. A SNP genotype represents the pair of alleles present in an individual at a SNP locus. Possible SNP genotype values are 0/1/2/?, where 0 and 1 denote homozygous genotypes for the major and minor alleles, 2 denotes the heterozygous genotype, and ? denotes missing data. SNP genotype g is said to be explained by an ordered pair of alleles (σ, σ′) ∈ {0, 1}² if g = ?, or σ = σ′ = g when g ∈ {0, 1}, or σ ≠ σ′ when g = 2. We denote by n the number of SNP loci typed in the population under study. A multi-locus genotype (or simply genotype) is a 0/1/2/? vector G of length n, while a haplotype is a 0/1 vector H of length n. An ordered pair (H, H′) of haplotypes explains multi-locus genotype G iff, for every i = 1, ..., n, the pair (H(i), H′(i)) explains G(i). A trio genotype T = (G_m, G_f, G_c) consists of multi-locus genotypes for the mother, father, and child of a nuclear family. An ordered 4-tuple (H1, H2, H3, H4) of haplotypes is said to explain a trio T = (G_m, G_f, G_c) iff (H1, H2) explains G_m, (H3, H4) explains G_f, and (H1, H3) explains G_c.
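These definitions translate directly into code; a minimal sketch follows, with genotypes encoded as sequences over {0, 1, 2, '?'} and haplotypes as 0/1 sequences (the helper names are ours, not the authors').

    def explains(sigma, sigma2, g):
        # does the ordered allele pair (sigma, sigma') explain SNP genotype g?
        if g == '?':
            return True
        if g in (0, 1):
            return sigma == sigma2 == g   # homozygous
        return sigma != sigma2            # g == 2: heterozygous

    def pair_explains(H1, H2, G):
        # (H1, H2) explains G iff it explains G locus by locus
        return all(explains(a, b, g) for a, b, g in zip(H1, H2, G))

    def tuple_explains(H, T):
        # (H1,H2,H3,H4) explains trio (Gm,Gf,Gc): H1,H2 for the mother,
        # H3,H4 for the father, and the transmitted pair H1,H3 for the child
        H1, H2, H3, H4 = H
        Gm, Gf, Gc = T
        return (pair_explains(H1, H2, Gm) and pair_explains(H3, H4, Gf)
                and pair_explains(H1, H3, Gc))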
3 Hidden Markov Model
The HMM used to represent haplotype frequencies has a similar structure to HMMs recently used in [14,15,16,17] (see Figure 1). This structure is fully determined by the number of SNP loci n and a user-specified number of founders K (typically a small constant; we used K = 7 in our experiments). Formally, the HMM is specified by a triple M = (Q, γ, ε), where Q is the set of states, γ is the transition probability function, and ε is the emission probability function. The set of states Q consists of disjoint sets Q_0 = {q⁰}, Q_1, Q_2, ..., Q_n, with |Q_1| = |Q_2| = ··· = |Q_n| = K, where q⁰ denotes the start state and Q_j, 1 ≤ j ≤ n, denotes the set of states corresponding to SNP locus j. The transition
[Figure 1 depicts the HMM trellis: the silent start state q⁰ fans out to the K states of Q_1, and each level Q_j is fully connected to the next level Q_{j+1}.]
Fig. 1. The structure of the Hidden Markov Model for n = 5 SNP loci and K = 4 founders
probability between two states a and b, γ(a, b), is non-zero only when a and b are in consecutive sets. The initial state q⁰ is a silent state, while every other state q emits allele σ ∈ {0, 1} with probability ε(q, σ). The probability with which M emits a haplotype H along a path π starting from q⁰ and ending at a state in Q_n is given by:
P(H, π|M) = γ(q⁰, π(1)) ε(π(1), H(1)) Π_{i=2}^{n} γ(π(i−1), π(i)) ε(π(i), H(i))   (1)
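Equation (1) amounts to one transition and one emission factor per locus. In the sketch below the HMM parameters are held in plain arrays: gamma0[k] for the start transitions, gamma[j][k1][k2] for transitions between consecutive levels, and eps[j][k][a] for emissions. This layout is an illustrative assumption, not the authors' data structure.

    def path_emission_prob(H, pi, gamma0, gamma, eps):
        # equation (1): probability that M emits haplotype H along state path pi
        # (states are indexed 0..K-1 within each level)
        prob = gamma0[pi[0]] * eps[0][pi[0]][H[0]]
        for i in range(1, len(H)):
            prob *= gamma[i - 1][pi[i - 1]][pi[i]] * eps[i][pi[i]][H[i]]
        return prob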
In [14,15], similar HMMs were trained from genotype data using variants of the EM algorithm. Since EM-based training is generally slow and cannot be easily modified to take advantage of phase information that can be inferred from available family relationships, we adopted the following two-step approach for training the HMM. First, we use a highly scalable algorithm based on entropy minimization [18] to infer haplotypes for all individuals in the sample. The phasing algorithm can handle genotypes related by arbitrary pedigrees, and has been shown to yield high phasing accuracy as measured by the so-called switching error. In the second step we use the classical Baum-Welch algorithm to train the HMM based on the inferred haplotypes.
4 Likelihood Ratio Approach to Error Detection
Our detection methods are based on the likelihood ratio approach of Becker et al. [13]. We call likelihood function any function L assigning non-negative real values to trio genotypes, with the further constraint that L is non-decreasing under data deletion. Let T = (G_m, G_f, G_c) denote a trio genotype, x ∈ {m, f, c} denote one of the individuals in the trio (mother, father, or child), and i denote one of the n SNP loci. The trio genotype T_(x,i) is obtained from T by marking SNP genotype G_x(i) as missing. The likelihood ratio of SNP genotype G_x(i) is defined as L(T_(x,i)) / L(T). Notice that, by L's monotony under data deletion, the likelihood ratio is always greater than or equal to 1. A SNP genotype G_x(i) will be flagged as a potential error whenever the corresponding likelihood ratio exceeds a user specified detection threshold t. The likelihood function used by Becker et al. [13] is the maximum trio phasing probability,
L(T) = max_{(H1,H2,H3,H4)} P(H1) P(H2) P(H3) P(H4)   (2)
where the above maximum is computed over all 4-tuples (H1, H2, H3, H4) of haplotypes that explain T. The use of maximum trio phasing probability as likelihood function is intuitively appealing, since one does not expect a large increase in this probability when a single SNP genotype is deleted. The computational complexity of computing the maximum trio phasing probability L(T) depends on the encoding used to represent haplotype frequencies. When all N = 2ⁿ haplotype frequencies are given explicitly, computing L(T)
can be trivially done in O(N⁴) time. Unfortunately, this representation can only be used for a small number n of SNP loci. To maintain practical running time, Becker et al. [13] adopted a heuristic approach that relies on creating a short list of haplotypes with frequency exceeding a certain threshold (computed using the FAMHAP software package [19]), followed by a pruned search over 4-tuples of haplotypes from the list. Due to the high computation cost of the search algorithm, the list of haplotypes must be kept very short (between 50 and 100 for the experiments reported in [13]), which makes the approach applicable only for short windows of consecutive SNP loci. This limits the amount of linkage information that could be used in error detection, explaining at least in part the high number of false positives observed in [13] within correctly typed SNP genotypes located in the neighborhood of SNP genotypes that are mistyped in the same individual.
The HMM described in the previous section provides a compact implicit representation of all haplotype frequencies that can be used for large numbers of SNP loci. The problem of computing L(T) based on the HMM is formalized as follows:
HMM-based maximum trio phasing probability: Given an HMM model M of haplotype diversity with n SNP loci and K founders, and a trio genotype T = (G_m, G_f, G_c), compute
L(T|M) = max_{(H1,H2,H3,H4)} P(H1|M) P(H2|M) P(H3|M) P(H4|M)   (3)
where the maximum is computed over all 4-tuples (H1, H2, H3, H4) of haplotypes that explain T. Computing P(H|M) for a given haplotype H can be easily done in O(nK) time by using a standard forward algorithm, and thus the probability of any given 4-tuple (H1, H2, H3, H4) that explains T can also be computed within the same time bound. Unfortunately, as stated in the following theorem, whose proof we omit due to space constraints, approximating the HMM-based maximum trio phasing probability is hard under some standard computational complexity assumption.¹
Theorem 1. L(T|M) cannot be approximated within a factor of O(n^{1/4−ε}) for any ε > 0, unless ZPP = NP.
In the next section we propose alternative likelihood functions that are efficiently computable based on an HMM model of haplotype diversity, even for very large numbers of SNP loci.
¹ A proof similar to that of Theorem 1 shows that, when haplotype frequencies are represented using an HMM, computing the maximum phasing probability for a single multi-locus genotype is hard to approximate within a factor of O(n^{1/2−ε}) for any ε > 0, unless ZPP = NP, thus solving a problem left open in [15].
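The forward computation of P(H|M) referenced above can be sketched as follows, using the same illustrative parameter arrays as before; it accumulates emission-weighted path probabilities level by level.

    def haplotype_prob(H, gamma0, gamma, eps):
        # standard forward algorithm for P(H|M); runtime linear in the number of loci
        K = len(gamma0)
        fwd = [gamma0[k] * eps[0][k][H[0]] for k in range(K)]
        for j in range(1, len(H)):
            fwd = [eps[j][k2][H[j]] * sum(fwd[k1] * gamma[j - 1][k1][k2]
                                          for k1 in range(K))
                   for k2 in range(K)]
        return sum(fwd)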
5 Efficiently Computable Likelihood Functions
In this section we consider three alternatives to the likelihood function used in [13], and describe efficient algorithms for computing them given an HMM model of haplotype diversity.
5.1 Viterbi Probability
The probability with which the HMM M emits four haplotypes (H1, H2, H3, H4) along a set of 4 paths (π1, π2, π3, π4) is obtained by a straightforward extension of (1). The first proposed likelihood function is the Viterbi probability, defined, for a given trio genotype T, as the maximum probability of emitting haplotypes that explain T along four HMM paths. Viterbi probability can be computed using a "4-path" extension of the classical Viterbi algorithm as follows. For every 4-tuple q = (q_1, q_2, q_3, q_4) ∈ Q_j⁴, let V_f(j; q) denote the maximum probability of emitting alleles that explain the first j SNP genotypes of trio T along a set of 4 paths ending at states (q_1, q_2, q_3, q_4) (we will refer to these values as the forward Viterbi values). Also, let Γ(q′, q) = γ(q′_1, q_1) γ(q′_2, q_2) γ(q′_3, q_3) γ(q′_4, q_4) be the probability of transition in M from the 4-tuple q′ ∈ Q_{j−1}⁴ to the 4-tuple q ∈ Q_j⁴. Then, V_f(0; (q⁰, q⁰, q⁰, q⁰)) = 1 and
V_f(j; q) = E(j; q) · max_{q′ ∈ Q_{j−1}⁴} { V_f(j − 1; q′) Γ(q′, q) }   (4)
Here, E(j; q) = max_{(σ1,σ2,σ3,σ4)} Π_{i=1}^{4} ε(q_i, σ_i), where the maximum is computed over all 4-tuples (σ1, σ2, σ3, σ4) that explain T's SNP genotypes at locus j. For a given trio genotype T, the Viterbi probability of T is given by V(T) = max_{q ∈ Q_n⁴} { V_f(n; q) }.
The time needed to compute forward Viterbi values with the above recurrences is O(nK⁸), where n denotes the number of SNP loci and K denotes the number of founders. Indeed, for each one of the O(K⁴) 4-tuples q ∈ Q_j⁴, computing the maximum in (4) takes O(K⁴) time. An O(K³) speed-up is achieved by computing, in order:
Pre_1(j; q′_1, q_2, q_3, q_4) = max_{q_1 ∈ Q_j} { V_f(j; (q_1, q_2, q_3, q_4)) γ(q_1, q′_1) }
Pre_2(j; q′_1, q′_2, q_3, q_4) = max_{q_2 ∈ Q_j} { Pre_1(j; (q′_1, q_2, q_3, q_4)) γ(q_2, q′_2) }
Pre_3(j; q′_1, q′_2, q′_3, q_4) = max_{q_3 ∈ Q_j} { Pre_2(j; (q′_1, q′_2, q_3, q_4)) γ(q_3, q′_3) }
V_f(j + 1; q′) = E(j + 1; q′) · max_{q_4 ∈ Q_j} { Pre_3(j; (q′_1, q′_2, q′_3, q_4)) γ(q_4, q′_4) }
for each SNP locus j = 1, ..., n and all 4-tuples (q′_1, q_2, q_3, q_4) ∈ Q_{j+1} × Q_j³, (q′_1, q′_2, q_3, q_4) ∈ Q_{j+1}² × Q_j², (q′_1, q′_2, q′_3, q_4) ∈ Q_{j+1}³ × Q_j, respectively q′ = (q′_1, q′_2, q′_3, q′_4) ∈ Q_{j+1}⁴. A similar speed-up idea was used in the context of single genotype phasing by Rastas et al. [15].
To apply the likelihood ratio test, we also need to compute Viterbi probabilities for trios with one of the SNP genotypes deleted. A naïve approach is to compute each of these probabilities from scratch using the above O(nK⁵) algorithm. However, this would result in a runtime that grows quadratically with
the number of SNPs. A more efficient algorithm is obtained by also computing backward Viterbi values V_b(j; q), defined as the maximum probability of emitting alleles that explain genotypes at SNP loci j + 1, ..., n of trio T along a set of 4 paths starting at the states of q ∈ Q_j⁴. Once forward and backward Viterbi values are available, the Viterbi probability of a modified trio can be computed in O(K⁵) time by using again the above speed-up idea, for an overall runtime of O(nK⁵) per trio.
5.2 Probability of Viterbi Haplotypes
The Viterbi algorithm described in the previous section yields, together with the 4 Viterbi paths, a 4-tuple of haplotypes which we refer to as the Viterbi haplotypes. Viterbi haplotypes for the original trio can be computed by a standard traceback algorithm. Similarly, Viterbi haplotypes corresponding to modified trios can be computed without increasing the asymptotic runtime via a bi-directional traceback. The second likelihood function that we considered is the probability of Viterbi haplotypes, which is obtained by multiplying the individual probabilities of the Viterbi haplotypes. The probability of each Viterbi haplotype can be computed using a standard forward algorithm in O(nK) time. Unfortunately, Viterbi paths for modified trios can be completely different from each other, and the probability of each of them must be computed from scratch by using the forward algorithm. This results in an overall runtime of O(nK⁵ + n²K) per trio.
5.3 Total Trio Genotype Probability
The third considered likelihood function is the total trio genotype probability, i.e., the total probability P(T) with which M emits any four haplotypes that explain T along any 4-tuple of paths. Using again the forward algorithm, P(T) can be computed as Σ_{q ∈ Q_n⁴} p(n; q), where p(0; (q⁰, q⁰, q⁰, q⁰)) = 1 and
p(j; q) = Σ_{q′ ∈ Q_{j−1}⁴} p(j − 1; q′) Γ(q′, q) · Σ_{(σ1,σ2,σ3,σ4)} Π_{i=1}^{4} ε(q_i, σ_i)   (5)
The second sum in the last equation is computed over all 4-tuples (σ1, σ2, σ3, σ4) that explain T's SNP genotypes at locus j. Using the speed-up techniques from Section 5.1, we obtain an overall runtime of O(nK⁵) per trio.
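A naive transliteration of recursion (5) is sketched below, enumerating all K⁴ state 4-tuples per locus (so O(nK⁸) overall, without the factorization speed-up of Section 5.1) and reusing the explains helper sketched in Section 2; the parameter layout follows the earlier sketches.

    from itertools import product

    def total_trio_prob(trio, gamma0, gamma, eps):
        # recursion (5): P(T) = sum over 4-tuples of end states of p(n; q)
        Gm, Gf, Gc = trio
        n, K = len(Gm), len(gamma0)

        def E(j, q):
            # sum of emission products over allele 4-tuples explaining locus j
            tot = 0.0
            for s in product((0, 1), repeat=4):
                if (explains(s[0], s[1], Gm[j]) and explains(s[2], s[3], Gf[j])
                        and explains(s[0], s[2], Gc[j])):
                    tot += (eps[j][q[0]][s[0]] * eps[j][q[1]][s[1]]
                            * eps[j][q[2]][s[2]] * eps[j][q[3]][s[3]])
            return tot

        states = list(product(range(K), repeat=4))
        p = {q: gamma0[q[0]] * gamma0[q[1]] * gamma0[q[2]] * gamma0[q[3]] * E(0, q)
             for q in states}
        for j in range(1, n):
            p = {q: E(j, q) * sum(p[q0]
                                  * gamma[j-1][q0[0]][q[0]] * gamma[j-1][q0[1]][q[1]]
                                  * gamma[j-1][q0[2]][q[2]] * gamma[j-1][q0[3]][q[3]]
                                  for q0 in states)
                 for q in states}
        return sum(p.values())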
6 Experimental Results
6.1 Experimental Setting
HMM-based genotype error detection algorithms using the three likelihood functions described in Section 5 were implemented in C++. Since the detection accuracy of the three likelihood functions is very similar, we report here accuracy results only for the total trio genotype probability.
We tested the performance of our methods on both synthetic datasets and a real dataset obtained from Becker et al. [13]. Synthetic datasets were generated following the methodology of [13]. We started from the real dataset in [13], which consists of 551 trios genotyped at 35 SNP loci spanning a region of 91,391 base pairs from chromosome 16. The FAMHAP software [19] was used to estimate the frequencies of the haplotypes present in the population. The 705 haplotypes that had positive FAMHAP-estimated frequencies were used to derive synthetic datasets with 551 trios as follows. For each trio, four haplotypes were picked by random sampling from the estimated haplotype frequency distribution. Two of these haplotypes were paired to form the mother genotype, and the other two were paired to form the father genotype. We created child genotypes by randomly picking from each parent a transmitted haplotype (assuming that no recombination is taking place). To make the datasets more realistic, missing data was inserted into the resulting genotypes by replicating the missing data patterns observed in the real dataset. Errors were inserted into the genotype data using the random allele model [20]. Under this model, we selected each (trio, SNP locus) pair with a probability of δ (δ was set to 1% in all our experiments). For each selected pair, we picked uniformly at random one of the non-missing alleles and flipped its value. Similar detection accuracy was obtained in experiments in which we simulated recombination rates of up to 0.01 between adjacent SNPs, and in experiments where errors were inserted using the random genotype, heterozygous-to-homozygous, and homozygous-to-heterozygous error models described in [20].
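The random allele error model can be simulated in a few lines; the genotype representation below (each individual's genotype as a per-locus list of two alleles, with None for missing) is an assumption made for illustration.

    import random

    def insert_errors(trios, num_loci, delta=0.01, rng=random):
        # trios: list of trios; trio[person][locus] = [a1, a2], a in {0, 1} or None
        for trio in trios:
            for locus in range(num_loci):
                if rng.random() >= delta:
                    continue  # this (trio, SNP locus) pair is not selected
                # flip one uniformly chosen non-missing allele among the 3 individuals
                slots = [(p, s) for p in range(3) for s in range(2)
                         if trio[p][locus][s] is not None]
                if slots:
                    p, s = rng.choice(slots)
                    trio[p][locus][s] ^= 1  # flip 0 <-> 1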
6.2 Results on Synthetic Datasets
Following standard practice, we first removed the trivially detected MI errors by marking child SNP genotypes involved in MIs as missing (similar results were obtained by marking all three SNP genotypes as missing). Figure 2 shows the distributions of log-likelihood ratios (computed using the total trio genotype probability as likelihood function) for error and non-error SNP genotypes in both parents and children. These results are based on averages over 10 synthetic instances of 551 trios typed at 35 SNP loci, with errors inserted using the random allele model with δ = 1%. It is known that there is an asymmetry in the amount of information gained from trio genotype data about children and parent haplotypes: while each of the two child haplotypes is constrained to be compatible with two genotypes, only one of the parent haplotypes has the same degree of constraint. This asymmetry was shown to make errors in children more likely to result in MIs [3,4]. As shown by the histograms in Figure 2, the asymmetry also results in a much sharper separation between errors and non-errors in children than in parents. Surprisingly, the histogram of log-likelihood ratios for non-error SNP genotypes in children has a significant peak between 3 and 4. Upon inspection, we found that these SNP genotypes are at loci for which the parents have inserted errors. A similar bias towards higher false positive rates in correctly typed SNP genotypes for which the same locus is mistyped in related individuals has been noted for
[Figure 2: log-likelihood ratio histograms, panels Parents-TRIOS, Children-TRIOS, Parents-COMBINED, and Children-COMBINED, each with ERR and NO_ERR series on a logarithmic count axis.]
Fig. 2. Histograms of log-likelihood ratios for parent (left) and child (right) SNP genotypes, computed based on trios (top) or by using the minimum of uno, duo, and trio log-likelihood ratios (bottom)
other pedigree-based error detection algorithms [21]. To reduce this bias, we propose a simple technique of combining multiple likelihood ratios computed for different subsets of trio members. Under this combined approach, henceforth referred to as TotalProb-Combined, for each SNP genotype we compute likelihood ratios using the total probability of (a) the trio genotype, (b) the duo genotypes formed by the parent-child pairs, and (c) the individual's genotype by itself. Likelihood ratios (b) and (c) can be computed without increasing the asymptotic running time via simple modifications of the algorithm in Section 5.3. A SNP genotype is then flagged as a potential error only if all of the above likelihood ratios exceed the detection threshold, as sketched below. To assess the accuracy of our error detection methods we use receiver operating characteristic (ROC) curves, i.e., plots of achievable sensitivity vs. false positive rates, where
– the sensitivity is defined as the ratio between the number of Mendelian consistent errors flagged by the algorithm and the total number of Mendelian consistent errors inserted; and
– the false positive rate is defined as the ratio between the number of non-errors flagged by the algorithm and the total number of non-errors.
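A minimal sketch of the combined flagging rule and the two ROC quantities, assuming the per-SNP log-likelihood ratios have already been computed (all function and variable names are illustrative):

    def combined_flags(llr_trio, llr_duo, llr_uno, threshold):
        # TotalProb-Combined: flag a SNP genotype only if the trio, duo,
        # and uno log-likelihood ratios all exceed the detection threshold.
        return [t > threshold and d > threshold and u > threshold
                for t, d, u in zip(llr_trio, llr_duo, llr_uno)]

    def roc_point(flags, is_error):
        # Sensitivity and false positive rate, as defined above;
        # is_error marks the Mendelian consistent errors inserted.
        tp = sum(1 for f, e in zip(flags, is_error) if f and e)
        fp = sum(1 for f, e in zip(flags, is_error) if f and not e)
        n_err = sum(is_error)
        return tp / n_err, fp / (len(is_error) - n_err)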
Figure 3 shows the ROC curves for TotalProb-Combined and for flagging algorithms that use single log-likelihood ratios computed from the total probability of uno/duo/trio genotypes. We also included ROC curves for two versions of the algorithm of [13], which test one SNP genotype at a time (FAMHAP-1) or simultaneously test the mother/father/child SNP genotypes at a locus (FAMHAP-3). The results show that simultaneous testing yields low detection accuracy, particularly in parents, and is therefore not advisable. The combined algorithm yields the best accuracy of all compared methods. The improvement over the trio-based version is most significant in parents, where, surprisingly, the uno and duo log-likelihood ratios appear to be more informative than the trio log-likelihood ratio.
6.3 Results on Real Data from [13]
For simplicity, in the previous section we used the same detection threshold for both children and parents. However, the histograms in Figure 2 suggest that better tradeoffs between sensitivity and false positive rate can be achieved by using differential detection thresholds. For the results on the real dataset from Becker et al. [13] (Table 1), we picked parent and child thresholds independently, by finding the minimum detection threshold that achieves a false positive rate of 0.1–1% under the log-likelihood ratio distributions of the simulated data. Unfortunately, for this dataset we do not know all existing genotyping errors. Becker et al. resequenced all trio members at 41 SNP loci flagged by their FAMHAP-3 method with a detection threshold of $10^4$. Of the 41 × 3 resequenced SNP genotypes, 26 (12 in children and 14 in parents) were identified as true errors, and 90 were confirmed as originally correct. The error status of the remaining 7 resequenced SNP genotypes is ambiguous due to missing calls in either the original or the resequencing data.
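The differential thresholds can be obtained, for instance, as empirical quantiles of the simulated non-error log-likelihood ratio distributions. A hedged sketch (the quantile-based selection is our reading of the procedure; names are illustrative):

    def pick_threshold(null_llrs, target_fp_rate):
        # Smallest detection threshold whose false positive rate on the
        # simulated non-error log-likelihood ratios does not exceed the
        # target, i.e. the (1 - target) empirical quantile.
        ranked = sorted(null_llrs)
        cut = int((1.0 - target_fp_rate) * len(ranked))
        return ranked[min(cut, len(ranked) - 1)]

Parent and child thresholds are then picked independently by applying this to the two simulated distributions.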
[Figure 3: ROC curves (sensitivity vs. FP rate) for parents (left) and children (right), comparing TotalProb-UNO, TotalProb-DUO, TotalProb-TRIO, TotalProb-COMBINED, FAMHAP-1, and FAMHAP-3.]
Fig. 3. Comparison with FAMHAP accuracy for parents (left) and children (right)
Table 1. Results of TotalProb-Combined on the Becker et al. dataset

              Total Signals      True Positives     False Positives    Unknown Signals
FP rate       1%    0.5%  0.1%   1%    0.5%  0.1%   1%    0.5%  0.1%   1%    0.5%  0.1%
Parents       218   127   69     9     9     8      1     0     0      208   118   61
Children      104   74    24     11    11    11     3     3     2      90    60    11
Total         322   201   93     20    20    19     4     3     2      298   178   72
The "True Positive" columns in Table 1 give the number of TotalProb-Combined flags among the 26 known errors, the "False Positive" columns give the number of flags among the 90 known non-errors, and the "Unknown Signals" columns give the number of flags among the 57,739 SNP genotypes for which the error status is not known (because resequencing was not performed or because of missing calls). With a predicted false positive rate of 0.1%, TotalProb-Combined detects 11 out of the 12 known errors in children and 8 out of the 14 known errors in parents, with only 2 false positives (both in children). TotalProb-Combined also flags 72 SNP genotypes with unknown error status, 61 of which are in parents. We conjecture that most of these are true typing errors missed by FAMHAP-3, which, as suggested by the simulation results in Figure 3, has very poor sensitivity to errors in parent genotypes. We also note that the number of Mendelian consistent errors in parents is expected to be more than twice as high as the number in children, due on one hand to the fact that there are twice as many parents as children, and on the other hand to the higher probability that errors in parents remain undetected as Mendelian inconsistencies [3,4].
7 Conclusions
In this paper we have proposed high-accuracy methods for detecting errors in trio genotype data based on Hidden Markov Models of haplotype diversity. The runtime of our methods scales linearly with the number of trios and SNP loci, making them appropriate for handling the datasets generated by current large-scale association studies. In ongoing work we are exploring the use of locus-dependent detection thresholds, methods for assigning p-values to error predictions, and iterative methods which use maximum likelihood to correct MIs and SNP genotypes flagged with a high detection threshold, and then recompute log-likelihoods to flag additional genotypes. Finally, we are exploring the integration of population-level haplotype frequency information with typing confidence scores for further improvements in error detection accuracy, particularly in the case of unrelated genotype data.
Acknowledgments. We would like to thank the authors of [13] for kindly providing us with the real dataset used in their paper. This work was supported in part by NSF CAREER award IIS-0546457 and NSF award DBI-0543365.
References 1. Pompanon, F., Bonin, A., Bellemain, E., Taberlet, P.: Genotyping errors: causes, consequences and solutions. Nat. Rev. Genet. 6, 847–859 (2005) 2. Zaitlen, N., Kang, H., Feolo, M., Sherry, S.T., Halperin, E., Eskin, E.: Inference and analysis of haplotypes from combined genotyping studies deposited in dbSNP. Genome Research 15, 1595–1600 (2005)
3. Douglas, J., Skol, A., Boehnke, M.: Probability of detection of genotyping errors and mutations as inheritance inconsistencies in nuclear-family data. AJHG 70, 487–495 (2002) 4. Gordon, D., Heath, S., Ott, J.: True pedigree errors more frequent than apparent errors for single nucleotide polymorphisms. Hum. Hered. 49, 65–70 (1999) 5. Ahn, K., Haynes, C., Kim, W., Fleur, R., Gordon, D., Finch, S.: The effects of SNP genotyping errors on the power of the Cochran-Armitage linear trend test for case/control association studies. Ann. Hum. Genet. 71, 249–261 (2007) 6. Abecasis, G., Cherny, S., Cardon, L.: The impact of genotyping error on family-based analysis of quantitative traits. Eur. J. Hum. Genet. 9, 130–134 (2001) 7. Cherny, S., Abecasis, G., Cookson, W., Sham, P., Cardon, L.: The effect of genotype and pedigree error on linkage analysis: Analysis of three asthma genome scans. Genet. Epidemiol. 21, S117–S122 (2001) 8. Knapp, M., Becker, T.: Impact of genotyping errors on type I error rate of the haplotype-sharing transmission/disequilibrium test (HS-TDT). Am. J. Hum. Genet. 74, 589–591 (2004) 9. Cheng, K.: Analysis of case-only studies accounting for genotyping error. Ann. Hum. Genet. 71, 238–248 (2007) 10. Liu, W., Yang, T., Zhao, W., Chase, G.: Accounting for genotyping errors in tagging SNP selection. Am. J. Hum. Genet. 71(4), 467–479 (2007) 11. Sobel, E., Papp, J., Lange, K.: Detection and integration of genotyping errors in statistical genetics. Am. J. Hum. Genet. 70, 496–508 (2002) 12. Abecasis, G., Cherny, S., Cookson, W., Cardon, L.: Merlin – rapid analysis of dense genetic maps using sparse gene flow trees. Nat. Genet. 30, 97–101 (2002) 13. Becker, T., Valentonyte, R., Croucher, P., Strauch, K., Schreiber, S., Hampe, J., Knapp, M.: Identification of probable genotyping errors by consideration of haplotypes. European Journal of Human Genetics 14, 450–458 (2006) 14. Kimmel, G., Shamir, R.: A block-free hidden Markov model for genotypes and its application to disease association. Journal of Computational Biology 12, 1243–1260 (2005) 15. Rastas, P., Koivisto, M., Mannila, H., Ukkonen, E.: Phasing genotypes using a hidden Markov model. In: Bioinformatics Algorithms: Techniques and Applications, Wiley, Chichester (to appear, preliminary version in Proc. WABI 2005) 16. Scheet, P., Stephens, M.: A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. American Journal of Human Genetics (to appear) 17. Schwartz, R.: Algorithms for association study design using a generalized model of haplotype conservation. In: Proc. CSB, pp. 90–97 (2004) 18. Gusev, A., Pașaniuc, B., Măndoiu, I.: Highly scalable genotype phasing by entropy minimization. IEEE Transactions on Computational Biology and Bioinformatics (to appear) 19. Becker, T., Knapp, M.: Maximum-likelihood estimation of haplotype frequencies in nuclear families. Genet. Epidemiol. 27, 21–32 (2004) 20. Douglas, J., Boehnke, M., Lange, K.: A multipoint method for detecting genotyping errors and mutations in sibling-pair linkage data. AJHG 66, 1287–1297 (2000) 21. Mukhopadhyay, N., Buxbaum, S., Weeks, D.: Comparative study of multipoint methods for genotype error detection. Hum. Hered. 58, 175–189 (2004)
Haplotype Inference Via Hierarchical Genotype Parsing

Pasi Rastas and Esko Ukkonen

Department of Computer Science and Helsinki Institute for Information Technology HIIT, P.O. Box 68, FIN-00014 University of Helsinki, Finland
[email protected]
Abstract. The within-species genetic variation due to recombinations leads to a mosaic-like structure of DNA. This structure can be modeled, e.g., by parsing sample sequences of current DNA with respect to a small number of founders. The founders represent the ancestral sequence material from which the sample was created in a sequence of recombination steps. This scenario has recently been successfully applied to developing probabilistic hidden Markov methods for haplotyping genotypic data. In this paper we introduce a combinatorial method for haplotyping that is based on a similar parsing idea. We formulate a polynomial-time parsing algorithm that finds a minimum cross-over parse in a simplified 'flat' parsing model that ignores the historical hierarchy of recombinations. The problem of constructing optimal founders that would give the minimum possible parse for given genotypic sequences is shown to be NP-hard. A heuristic locally-optimal algorithm is given for founder construction. Combined with flat parsing this already gives quite good haplotyping results. Improved haplotyping is obtained by using a hierarchical parsing that properly models the natural recombination process. For finding short hierarchical parses a greedy polynomial-time algorithm is given. Empirical haplotyping results on HapMap data are reported.
1 Introduction
Recombination is a major factor causing genetic variation between individuals of a population by combining mutations. In this paper we generalize and improve the combinatorial founder model [19] of recombinations. This model assumes that the current population evolved from a small number of 'founder' individuals; thus the current sequences are recombinations of these founder sequences. If visualized by giving each founder sequence a different color, the founders define a coloring of the current sequences, thus uncovering a mosaic-like structure. Figure 1 shows an example mosaic obtained by parsing 20 sequences ('Recombinants') with respect to four founder sequences. The term mosaic is also used in this sense in [13,20]. A key question with the founder model is how to find
Supported by the Academy of Finland under grant 211496 (From Data to Knowledge).
Fig. 1. Screenshot from the HaploVisual program that implements the parsing algorithms of [19]. The program is available at www.cs.helsinki.fi/u/prastas/haplovisual/
good founders. A natural parsimony criterion was used in [19]: find ancestral sequences that explain the given data with the fewest recombinations. This model is also studied in [22]. Our contribution is three-fold. First, we generalize the model from haplotypes to genotypes, and give a parsing algorithm that parses given genotype sequences into fragments that are taken from the haplotypic founders. This immediately suggests a phasing for the genotypes, as the parse is composed of two haplotype sequences, i.e., we have a haplotyping algorithm.¹ This type of parsing is flat in the sense that it ignores the historical order of recombinations. Then we provide some computational complexity results on finding a founder set for which the flat parse has the smallest possible number of recombinations: finding founders that are optimal in this sense is NP-hard in general but polynomial-time solvable if the number of founders is restricted to two. As finding optimal founders is hard, we develop a locally-optimal method for finding a good founder set. Finally, we improve the parsing model to include the hierarchical structure of recombinations. This leads to a novel parsing problem of finding a shortest hierarchical parse with respect to given founders. We propose a greedy polynomial-time parsing algorithm. The paper is concluded by reporting some haplotyping experiments on the HapMap data. It turns out that already the flat parsing with respect to the locally optimal set of founders gives reasonably good
¹ Essentially the same observation was independently made in [9].
haplotyping results as compared to the state-of-the-art probabilistic methods. The performance improves if the hierarchical parsing is applied. Our hierarchical parsing can be seen as a method for constructing a variant of the so-called ancestral recombination graph (ARG) that connects the founders to the given genotypes. An ARG is the most accurate model of the genealogy of sequences [3]. It is a directed graph whose nodes correspond to sequences, single edges correspond to mutations, and edges connecting nodes a and b to a single node stand for a recombination between a and b. All sequences in the ARG model evolve from a single root sequence. A natural parsimonious problem is to find an ARG with a minimum number of recombinations, assuming each allele mutates only once. The problem is known to be NP-hard in the case of a known root sequence [21]. Current state-of-the-art methods can produce ARGs for about 20 SNPs and 40 haploid sequences [12].
2 Genotypes, Haplotypes, and Recombination
A Single Nucleotide Polymorphism (SNP) is a single base-pair position in genomic DNA where different nucleotides, called alleles, occur in some population. In most SNPs only two alleles out of A, C, G and T are present. Most of the genetic variation between individuals is due to SNPs. Thus, when studying genetic diseases or factors, one often studies variations in certain SNPs. In humans and other diploid organisms, most cells contain two almost identical copies of each chromosome, one inherited from the organism's mother and the other from the father. These copies are called haplotypes. Thus, each individual has two haplotypes: maternal and paternal. Current practical laboratory methods determine for each SNP of an individual only the alleles, but give no information on which copy they are from. In this context, these sequences without copy information are called genotypes. Thus, a genotype is a sequence that gives at each SNP the alleles of the two haplotypes. For example, consider the case that the alleles at three SNPs are A-A, C-T and A-T. The possible haplotypes are therefore ACA and ATT, or ATA and ACT. If the alleles at an SNP are different we call the site heterozygous (the second and the third SNP in the example), and otherwise homozygous (the first SNP). If a genotype has k heterozygous sites then there are $2^{k-1}$ possible pairs of haplotypes for that genotype.
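The count $2^{k-1}$ can be checked by enumeration. A small, illustrative Python sketch using the 0/1/2 genotype coding introduced in Section 3 below:

    from itertools import product

    def compatible_pairs(genotype):
        # Enumerate the 2^(k-1) unordered haplotype pairs compatible with
        # a genotype coded 0/1 (homozygous) and 2 (heterozygous) per SNP.
        het = [i for i, g in enumerate(genotype) if g == 2]
        base = list(genotype)
        pairs = []
        # Fixing the phase of the first heterozygous site avoids counting
        # each unordered pair twice.
        for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
            phase = (0,) + bits
            h1, h2 = base[:], base[:]
            for site, b in zip(het, phase):
                h1[site], h2[site] = b, 1 - b
            pairs.append((tuple(h1), tuple(h2)))
        return pairs

    print(len(compatible_pairs([0, 2, 2])))  # k = 2 -> 2 pairs, as above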
Fig. 2. On the left-hand side are the haplotypes of the parents, and on the right-hand side are the outcomes for the child's haplotypes from the two recombinations illustrated in the middle
Without assumptions about the haplotypes or about the population, all of these possibilities are equally likely. For measuring haplotyping performance we use a commonly used metric called the switch distance [11]. We use the unnormalized version of this distance, i.e., the number of switches. It equals the number of phase changes in the inferred haplotypes that are needed to obtain the correct haplotypes. For example, assume that the correct haplotypes are AAAAA and TTTTT. Then a phasing solution ATTTT, TAAAA would score one switch, and a solution ATATA, TATAT four switches. Figure 2 shows how recombination in a meiosis combines maternal and paternal sequences. A child's haplotype inherited from one parent contains fragments from both haplotypes of that parent. An underlying assumption in this paper is that recombinations happen in an equal crossing-over fashion, i.e., in such a way that sequence fragments retain their locations in the resulting sequence [19].
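A direct implementation of the (unnormalized) switch distance, scanning heterozygous sites from left to right; this sketch assumes the sites are biallelic and both haplotype pairs are given in full:

    def switch_distance(inferred, truth):
        # Count phase switches needed to turn the inferred haplotype pair
        # into the true pair.
        (i1, i2), (t1, t2) = inferred, truth
        phase, switches = None, 0
        for a, b, c in zip(i1, i2, t1):
            if a == b:            # homozygous site: no phase information
                continue
            current = (a == c)    # inferred hap 1 matches true hap 1 here?
            if phase is not None and current != phase:
                switches += 1
            phase = current
        return switches

    # Example from the text: one switch.
    print(switch_distance(("ATTTT", "TAAAA"), ("AAAAA", "TTTTT")))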
3 Combinatorial Mosaic Model
In this section the founder models presented in [19,17,8,6] are generalized for genotypes and point mutations. We assume that our sequences of interest are from a finite alphabet $\Sigma$, for example $\Sigma = \{A, C, G, T\}$ for DNA or $\Sigma = \{0, 1\}$ for SNP haplotypes. If our sequences are over n markers, then a sequence can be described as a string of length n over the alphabet $\Sigma \cup \{-\}$, where the symbol − is used to indicate missing values. We want to analyze a set D of current sequences with respect to some small set of founder sequences F. The sequences in D and F are haplotypes over the same n markers. We denote |D| = m and |F| = K. The differences between D and F are due to point mutations and recombinations. Recombination is modeled as a process that builds a new haploid sequence by combining a prefix of one sequence with a suffix of another one [8]. Mutation is simply an event that changes one symbol of a certain sequence. A haplotype $h \in D$ has a simple parse (with no mutations) of cost k with respect to the founder sequences F if we can write h as a concatenation $f_{i_0} \cdot f_{i_1} \cdots f_{i_k}$ of nonempty substrings $f_{i_j}$, where each $f_{i_j}$ occurs in some $f \in F$ at the same position as it is used in the parse. Let c be a parameter that gives a relative weight for mutations as compared to recombinations. Then a haplotype h has a parse of cost $k + k'c$ with respect to F if there is a simple parse of cost k of some $h'$ and the Hamming distance between h and $h'$, $d(h, h')$, is $k'$. We say that a parse is optimal if it has the lowest possible cost. The score that we want to minimize is the sum of the optimal parse costs for the sequences in D. This score can be computed for each sequence independently, and it depends on the number of recombination and mutation events. Each recombination adds 1 to the score and each point mutation adds c > 0. By setting a high value for c, the parse is forced to use mutations only rarely, and by setting c to a small positive value, the parse is forced to use recombinations rarely.
Let $F = (F_1, \ldots, F_K) \subset \Sigma^n$, $|F| = K$, be a fixed founder set where each $F_a = F_{a1} \ldots F_{an}$. Then we can compute the minimum cost of a parse of a haplotype $h = h_1 \ldots h_n$ with respect to F by dynamic programming using the following formulas:

$S(0, a) = 0$
$S(i, a) = p_c(h_i, F_{ai}) + \min_{a'} \{ S(i-1, a') + I_{a' \neq a} \}$    (1)

for $a = 1, \ldots, K$ and $i = 1, \ldots, n$. Here $I_A$ is the indicator function of predicate A, and $p_c(S, S')$ is the cost of mutating symbol S to symbol $S'$, i.e., $p_c(S, S') = 0$ if $S = S'$ or $S = -$, and $p_c(S, S') = c$ otherwise.

The minimum score, $\mathrm{score}^c_F(h)$, is $\min_a S(n, a)$. The corresponding parse can be found using standard trace-back after each $S(i, a)$ has been computed. Direct evaluation of formula (1) would take space $O(nK)$ and time $O(nK^2)$. A more efficient evaluation is possible, however, as follows. By writing the minimization in the latter equation as $\min\{S(i-1, a),\, 1 + \min_{a'} S(i-1, a')\}$, we can see that by maintaining the value $\min_{a'} S(i-1, a')$ during the computation one can reduce the running time to $O(nK)$. Moreover, the space requirement can trivially be reduced to $O(K)$ if we do not need the corresponding parse. However, space $O(K)$ and time $O(nK)$ also suffice to obtain the parse. This is achieved by using a divide-and-conquer algorithm similar to Hirschberg's algorithm [5].

Theorem 1. The optimal parse and parsing score of a haplotype with respect to a given founder set F can be found in time $O(n|F|)$ and in space $O(|F|)$.

The score of a set of m haplotypes H is defined as the sum of the individual scores of $h \in H$, i.e., $\mathrm{score}^c_F(H) = \sum_{h \in H} \mathrm{score}^c_F(h)$. The running time to compute the score for a set H of m haplotypes is $O(mn|F|)$.
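A compact Python rendering of recurrence (1) with the running-minimum trick, giving the O(nK) evaluation of the score (a sketch; the haplotype and the founders are assumed to be strings over Sigma, with '-' for missing values):

    def parse_score(h, founders, c):
        # score^c_F(h) in O(n*K) time: S[a] holds S(i, a) for the current
        # column i, and 'best' is min over a' of S(i-1, a').
        S = [0.0] * len(founders)
        for i in range(len(h)):
            best = min(S)
            S = [(0.0 if h[i] == f[i] or h[i] == '-' else c)  # p_c(h_i, F_ai)
                 + min(S[a], 1.0 + best)   # continue founder a, or recombine
                 for a, f in enumerate(founders)]
        return min(S)

    print(parse_score("0011", ["0000", "1111"], c=2.0))  # one recombination: 1.0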
Let us next consider unphased sequences, i.e., the case where our dataset D consists of genotypes instead of haplotypes. We show that with some modifications to (1) we can parse genotype sequences as well. A similar algorithm appears in [9], while our formulation is from [14]. For reasons of clarity, we assume that our alphabet $\Sigma$ is $\{0, 1\}$. All results apply to a general alphabet too, but the notation would get too complicated. A genotype g is a string over the alphabet $\{0, 1, 2, -\}$, where 0 and 1 denote the two homozygous alleles and the value 2 denotes heterozygous alleles. Two haplotypes $h = h_1 \ldots h_n \in \{0,1\}^n$ and $h' = h'_1 \ldots h'_n \in \{0,1\}^n$ are compatible with a genotype $g = g_1 \ldots g_n \in \{0,1,2,-\}^n$, denoted $g = \gamma(h, h')$, if either ($g_i = h_i = h'_i$) or ($g_i = 2$ and $h_i \neq h'_i$) or ($g_i = -$) for each i, $1 \le i \le n$. Given a founder set F and a genotype g, we define the score of g, $\mathrm{score}^c_F(g)$, as $\min_{h,h' : \gamma(h,h') = g} \{\mathrm{score}^c_F(h) + \mathrm{score}^c_F(h')\}$. The score of a set of genotypes G is defined, as with haplotypes, as $\sum_{g \in G} \mathrm{score}^c_F(g)$. The score of g can be computed efficiently using dynamic programming as follows:

$S(0, a, b) = 0$
$S(i, a, b) = p_c(g_i, F_{ai}, F_{bi}) + \min_{a', b'} \{ S(i-1, a', b') + I_{a' \neq a} + I_{b' \neq b} \}$    (2)

for $a, b = 1, \ldots, K$ and $i = 1, \ldots, n$. Here $p_c(T, S, S')$ is the cost of mutating genotype T to $\gamma(S, S')$, i.e., $p_c(T, S, S') = 0$ if $T = \gamma(S, S')$ or $T = -$; $p_c(T, S, S') = 2c$ if $T \neq \gamma(S, S'')$ and $T \neq \gamma(S'', S')$ for all $S'' \in \Sigma$; and $p_c(T, S, S') = c$ otherwise.

The minimum score, $\mathrm{score}^c_F(g)$, is $\min_{a,b} S(n, a, b)$, and the parse can be recovered by trace-back. Direct evaluation of (2) would take $O(nK^4)$ time. Using a similar trick as earlier we can write the minimization as $\min(C_{00}, C_{01}, C_{10}, C_{11})$, where $C_{00} = S(i-1, a, b)$, $C_{01} = 1 + \min_{a'} S(i-1, a', b)$, $C_{10} = 1 + \min_{b'} S(i-1, a, b')$, and $C_{11} = 2 + \min_{a',b'} S(i-1, a', b')$. Now we can maintain the values $\min_{a'} S(i-1, a', b)$ and $\min_{b'} S(i-1, a, b')$ by using two arrays (actually one is enough) of size $O(K)$, indexed by a and b. These arrays can be computed in time $O(K^2)$ for column $i-1$. Further, we need to keep track of the single value $\min_{a',b'} S(i-1, a', b')$. By computing $S(i, a, b)$ in this way we get a time complexity of $O(nK^2)$. The space complexity can be reduced, similarly as in the case of haplotypes, to $O(K^2)$. The parse of a genotype also fixes its haplotypes, i.e., we can use this parse to infer haplotypes based on a founder set F.

Theorem 2. The optimal parse and parsing score of a genotype with respect to a given founder set F can be found in time $O(n|F|^2)$ and space $O(|F|^2)$. The parse suggests a phasing of the genotype by giving two haplotypes that are compatible with the genotype.
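Recurrence (2) with the row/column-minima trick can be sketched as follows (genotypes are lists over {0, 1, 2, None} with None for missing values; founders are 0/1 sequences; names are illustrative):

    def genotype_parse_score(g, founders, c):
        # score^c_F(g) in O(n*K^2) time, maintaining the minima over a',
        # over b', and over both, so each entry is evaluated in O(1).
        K = len(founders)
        gamma = lambda s, t: s if s == t else 2
        def p(gi, s, t):                        # p_c(T, S, S') of the text
            if gi is None or gi == gamma(s, t):
                return 0.0
            fixable = any(gi == gamma(x, t) or gi == gamma(s, x)
                          for x in (0, 1))      # one mutation suffices?
            return c if fixable else 2.0 * c
        S = [[0.0] * K for _ in range(K)]
        for i in range(len(g)):
            col_min = [min(S[a][b] for a in range(K)) for b in range(K)]
            row_min = [min(row) for row in S]
            best = min(row_min)
            S = [[p(g[i], founders[a][i], founders[b][i])
                  + min(S[a][b], 1 + row_min[a], 1 + col_min[b], 2 + best)
                  for b in range(K)] for a in range(K)]
        return min(min(row) for row in S)

    # Two recombinations suffice for this genotype: prints 2.0.
    print(genotype_parse_score([0, 2, 1], [(0, 0, 0), (1, 1, 1)], c=2.0))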
4 Hardness of Finding Founders
In this section we consider the complexity of the problem of finding a set F of K founder sequences that minimizes $\mathrm{score}^c_F(D)$. The decision version of the problem can be proven to be NP-complete.

Problem 1. Given haplotype or genotype data D and parameters K, c and T, is there a set F of K founder sequences such that the score of the data D, $\mathrm{score}^c_F(D)$, is at most T?

Theorem 3. Problem 1 is NP-complete when the data consists of haplotypes, i.e., $D \subset \{0,1,-\}^n$.
Proof. The problem is in NP because, given a founder set F, we can check whether it gives score ≤ T using the polynomial-time algorithm derived from (1). The problem is NP-hard because we can reduce the graph coloring problem to it. The graph coloring problem asks one to color the vertices of a graph with K colors such that there are no edges between vertices of the same color. The problem is NP-hard even for K = 3 [2]. Let G = (V, E) be a graph with n = |V| vertices. Let $H = \{h_1, \ldots, h_n\}$ be the corresponding set of haplotypes, represented as an $n \times n$ matrix $(H_{ij})$ where $H_{ij} = h_{ij}$. Graph G is coded into H by setting $H_{ij} = 1$ if $i = j$; $H_{ij} = 0$ if $(i, j) \in E$; and $H_{ij} = -$ otherwise.

In this coding, vertex i of the graph G corresponds to the haplotype $h_i$. Now there is a graph coloring of G with K colors if and only if there is a founder set of size K giving total parsing score 0 for H. "if": Let a founder set F give score 0, |F| = K. Then we can construct a parse where each $h \in H$ is parsed using exactly one $f \in F$. Now let $V_f \subseteq V$ be the set of all vertices i whose corresponding haplotypes $h_i \in H$ are parsed using f. Since each $h_i$ has a 1 at position i and no other haplotype has a 1 at that position, all haplotypes corresponding to the vertices $V_f \setminus \{i\}$ must have the symbol − at position $i \in V_f$. Thus, there cannot be any edges within $V_f$, and therefore we obtain a coloring of G with K colors by coloring each $V_f$ with its own unique color. "only if": Assume that G has a valid K-coloring. Let us define $V_j$ as the set of vertices colored with color j. Now we can construct a founder set F as the set of haplotypes $f_j$ with ones at positions $V_j$ and zeros elsewhere. Because the vertices in $V_j$ share a color, there cannot be any edges between them, so the corresponding haplotypes have only the values 1 and − at positions $i \in V_j$, and therefore we can parse all haplotypes H with score 0. In such a parse, each haplotype $h_i$ with $i \in V_j$ is parsed using founder $f_j$. □
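The coding used in the reduction is easy to state programmatically. A sketch (the vertex/edge representation is an assumption of ours):

    def coloring_instance(n, edges):
        # Haplotype matrix H of the reduction: H[i][i] = '1',
        # H[i][j] = '0' whenever (i, j) is an edge, and '-' elsewhere.
        H = [['-'] * n for _ in range(n)]
        for i in range(n):
            H[i][i] = '1'
        for i, j in edges:
            H[i][j] = H[j][i] = '0'
        return [''.join(row) for row in H]

    # A triangle is 3-chromatic, so 3 founders are needed for score 0:
    print(coloring_instance(3, [(0, 1), (0, 2), (1, 2)]))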
Theorem 4. Problem 1 is NP-complete when the data consists of genotypes, i.e., $D \subset \{0,1,2\}^n$.

Proof. The problem of finding a founder set with score 0 for genotype data (without missing values) is the Pure Parsimony problem from [4]. In the pure parsimony problem one asks whether there is a set of haplotypes F such that |F| ≤ K and F generates the input genotypes D. This problem is NP-complete [10]. We note that for pure parsimony to have a solution, the parameter K must be greater than $\sqrt{2|G|}$ [10]. □

Theorem 5. The optimization version of Problem 1 cannot be approximated in polynomial time within any factor, unless NP = P.

Proof. Had we a polynomial-time approximation algorithm with some fixed approximation factor α, we could solve Problem 1 in the case T = 0, for $D \subset \{0,1,2\}^n$ (genotypes) and $D \subset \{0,1,-\}^n$ (haplotypes). But by Theorems 3 and 4, these problems are NP-hard. □
Theorem 6. Problem 1 is NP-complete when the data $D \subset \{0,1\}^n$ consists of haplotypes and $c = \frac{1}{n|D|}$.

Proof. When c is $\frac{1}{n|D|}$, Problem 1 becomes a clustering problem. The problem can be stated as: is there a set of K binary vectors F such that $\sum_{h \in D} \min_{f \in F} d(f, h) \le T$, where d(f, h) is the Hamming distance of f and h? This problem is the complementary problem (Hamming distance instead of Hamming overlap) of the Hypercube segmentation problem, which is NP-complete even for K = 2 [7]. Thus, Problem 1 is also NP-complete. □
5 Heuristic Algorithm for Founder Construction
The simplest algorithm to find the optimal set of founders is to enumerate all founder sets F of a given size K, compute $\mathrm{score}^c_F(D)$ for each of them, and choose the best solution. The time complexity of this algorithm is proportional to $\binom{|\Sigma|^n}{K}$. This algorithm is not very practical, but could be improved by clever enumeration of the sets F (branch-and-bound). In some cases finding founder sets for haplotypes is easy. If we set the parameter c to ∞, we look for parses with the minimum number of recombinations. Then, if K = 2, there is a polynomial O(mn)-time algorithm for finding the optimal founder set [19], provided there are no missing values in the haplotypes. It is based on the fact that we can consider only haplotype columns on which both alleles are present, and each such column infers the correct partition into two classes. The optimal founder set can be obtained in this case as follows. Without loss of generality we can remove all columns that contain only a single value. We process the haplotypes from left to right and consider adjacent columns. The alleles of the first founder column can be set arbitrarily to 0 and 1. Now let us assume that the founders have been fixed up to column i and we proceed to column i+1. We count how many times the substrings 00, 01, 10 and 11 occur at position i in the haplotypes. We pick either 01 and 10, or 00 and 11, depending on which combination is more common. The founder column i+1 is determined from the picked substrings; the founders are set in the unique way such that the picked substrings occur at founder position i. Proceeding this way until column n we get a solution for K = 2; its optimality is easily shown, as between any two successive columns this procedure uses the minimum number of recombinations on two founders. Since the more common of the substring pairs 01 and 10, or 00 and 11, occurs in at least $\frac{1}{2}m$ haplotypes, at most $\frac{1}{2}m$ recombinations are needed between any two adjacent columns. Thus, an optimal solution can have at most $\frac{1}{2}m(n-1)$ recombinations (this is a tight bound). Note that the above method finds an optimal founder set of size 2 for genotypes as well [22], as we can compute the maximum number of occurrences of 00, 01, 10 and 11 also for genotype input. The possible substrings in genotypes are 00, 01, 10 and 11, as in haplotypes, and additionally 20, 21, 02, 12 and 22. The first four of the latter each correspond uniquely to two of 00, 01, 10 and 11; for 22 there are two possibilities, and we can choose the one minimizing the number of recombinations. However, this does not work when the sequences contain missing values.
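The K = 2 scan can be written compactly. A sketch for complete binary haplotypes (strings of '0'/'1' with no missing values; columns with a single value are assumed to have been removed, as above):

    def two_founders(haps):
        # Optimal founder pair for K = 2 and c = infinity, following the
        # left-to-right column scan of [19].
        m, n = len(haps), len(haps[0])
        f1, f2 = ['0'], ['1']        # the first founder column is arbitrary
        for i in range(1, n):
            # Count substrings 00/11 vs 01/10 across columns (i-1, i).
            same = sum(h[i - 1] == h[i] for h in haps)
            a, b = f1[-1], f2[-1]
            if 2 * same >= m:        # 00 and 11 are the more common pair
                f1.append(a); f2.append(b)
            else:                    # 01 and 10 win: founder columns swap
                f1.append(b); f2.append(a)
        return ''.join(f1), ''.join(f2)

    print(two_founders(["0011", "1100", "0010"]))  # -> ('0011', '1100')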
The algorithm just described works from left to right, assigning founder columns in a greedy fashion. Next we generalize this idea to an arbitrary number K of founders. The algorithm is the following. From left to right we construct the columns of F in a greedy manner. For column i we enumerate all $|\Sigma|^K$ possibilities and choose the one that minimizes the parsing score of the prefixes of the sequences in D up to column i. After the first pass we make repeated left-to-right passes until we have found a local optimum. In each pass the content of column i is reselected from the $|\Sigma|^K$ possibilities such that it minimizes the parsing score of the entire data D when F is kept fixed for all columns other than i. A single pass of this algorithm can be implemented in time $O(mnK|\Sigma|^K)$ (in time $O(mnK^2|\Sigma|^K)$) for a dataset of m haplotypes (genotypes) of length n. The algorithm finds the optimum when K = 2, c = ∞, and the sequences have no missing values. A similar algorithm was used in [15].
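One left-to-right pass of this local search, written against an abstract scoring function (for instance, a sum of parse_score calls from the sketch in Section 3); everything here is illustrative:

    from itertools import product

    def improve_pass(founders, data, score):
        # Reselect each founder column from all |Sigma|^K possibilities,
        # keeping the choice that minimizes score(F, data) with the other
        # columns fixed. Passes are repeated until none improves the score.
        K, n = len(founders), len(founders[0])
        F = [list(f) for f in founders]
        total = None
        for i in range(n):
            best_col, best_val = None, float('inf')
            for col in product('01', repeat=K):
                for a in range(K):
                    F[a][i] = col[a]
                val = score([''.join(f) for f in F], data)
                if val < best_val:
                    best_col, best_val = col, val
            for a in range(K):
                F[a][i] = best_col[a]
            total = best_val
        return [''.join(f) for f in F], total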
6 Hierarchical Parsing
The parsing scheme of Section 3 applies recombinations independently to each sequence to be parsed. There is no attempt to utilize the same recombination several times. The hierarchical parsing scheme aims at finding recombinations that are common to several sequences in the data. The structure of hierarchical parsing is simple: we start from some given founder set and add to it recombinants of the founders (including the recombinants added in earlier steps) such that the resulting process finally generates the data D. This process forms a tree-like history for the sequences D. Again, finding a shortest possible hierarchical parse is of interest, but it seems very difficult. We note that if we start from a single sequence (|F| = 1) and add mutations and recombinations in this fashion, we would construct an ARG. Instead of an exact algorithm, we use the following greedy heuristic: we try, in at most $(n-1)|F|(|F|-1)$ ways, to add a new founder f that is a recombinant of the founders F, and add to the founder set the one that minimizes $\mathrm{score}^c_{F \cup \{f\}}(D)$. By assigning the parameter c properly we can model errors or mutations in the sequences. If c is set to a low value, we should stop the greedy algorithm once the parses contain no recombinations. We take as the initial set of founders the ones that minimize $\mathrm{score}^c_F(D)$. The algorithm is the following: find a founder set $F_0$ minimizing $\mathrm{score}^c_{F_0}(D)$; set $F_{i+1} = F_i \cup \{f\}$, where f is the recombinant of $f_1 \in F_i$ and $f_2 \in F_i$ that minimizes $\mathrm{score}^c_{F_{i+1}}(D)$; repeat until $\mathrm{score}^c_{F_{i+1}}(D) = 0$. As each greedy step decreases the score by at least one, the total number of greedy steps cannot exceed $\mathrm{score}^c_{F_0}(D)$. On the other hand, there must be a haplotype with at least $\mathrm{score}^c_{F_0}(D)/m$ recombinations, where m is the number of haplotypes (2|D| in the case of genotypes). Thus we must take at least that many greedy steps.
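One greedy step of the hierarchical heuristic, again written against an abstract flat-scoring function (a sketch, not the authors' faster implementation discussed below):

    def greedy_step(founders, data, score):
        # Try every recombinant f = f1[:cut] + f2[cut:] of two distinct
        # current founders -- at most (n-1)|F|(|F|-1) candidates -- and
        # keep the one whose addition minimizes the flat parsing score.
        n = len(founders[0])
        best_f, best_val = None, float('inf')
        for f1 in founders:
            for f2 in founders:
                if f1 == f2:
                    continue
                for cut in range(1, n):
                    f = f1[:cut] + f2[cut:]
                    val = score(founders + [f], data)
                    if val < best_val:
                        best_f, best_val = f, val
        return founders + [best_f], best_val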
By taking the starting founders $F_0$ as the ones minimizing $\mathrm{score}^c_{F_0}(D)$, we minimize both the upper and the lower bound on the number of greedy steps. A trivial implementation of the greedy step would take time $O(mn^2|F_i|^3)$ in the case of haplotypes and $O(mn^2|F_i|^4)$ in the case of genotypes. Our implementation takes time $O(mn|F_i|^2)$ for haplotypes and $O(mn|F_i|^3)$ for genotypes. In practice, this algorithm becomes quite slow, as $|F_i|$ increases by one in every step.
7 Experimental Results
We used 220 datasets obtained from the HapMap data [18]. We selected data from two groups, abbreviated YRI (Yoruba) and CEU (Utah). For both of these groups unphased genotypes are available for 30 trios, resulting in a total of 120 known haplotypes. From each of the 22 chromosomes we chose 5 fragments covering 100 consecutive SNPs, starting from SNPs 1001, 2001, ..., 5001. We haplotyped 60 genotypes taken from the trios. We used c = 1000. We compared our phasing results against fastPHASE [16] with standard settings (fastPHASE). Unlike our method, fastPHASE builds its final solution by combining several haplotype predictions; therefore we also generated with fastPHASE a single-run solution based on 10 founders (clusters), which we call fastPHASE-10. With HIT [15] we generated a solution using 10 founders. We started the hierarchical algorithm with K = 3, 7, 10 initial founders, constructed by the algorithm of Section 5, and applied the greedy hierarchical parsing algorithm of Section 6 until each step decreased the flat parsing score of the data by only one; we can stop at that point, as the recombinants added from then on would each participate in only one parse, so there is no hierarchy. Our Java implementation took a couple of minutes to run on a single HapMap dataset on a standard desktop PC.

Table 1. Average values of switch distances for different algorithms and datasets
       Flat (K=3/7/10)   Hierarchical (K=3/7/10)   fastPHASE-10   fastPHASE   HIT
CEU    225/138/136       119/110/113               136            85          89
YRI    406/248/230       203/181/178               207            134         143
Table 1 gives the number of switches averaged over the CEU and YRI datasets. "Flat" is the parsing algorithm (2), and "Hierarchical" is the greedy hierarchical algorithm. The hierarchical method gives results comparable to fastPHASE-10, but fastPHASE and HIT are somewhat better. Figure 3 shows how the number of switches develops after each hierarchical step. The figure also shows the switch distances achieved by fastPHASE, fastPHASE-10, and HIT. Our implementation selects between equally good alternatives (columns of initial founders and added recombinants of the greedy parsing step) with equal probabilities. Sometimes the actual choices had a significant effect on the result.
[Figure 3 panels: switch distance vs. greedy iterations for CEU chr10 (1001–1100), YRI chr3 (3001–3100), CEU chr3 (1001–1100), and YRI chr7 (1001–1100), with curves for K = 3, 7, 10 and reference lines for fastPHASE, fastPHASE-10, and HIT; two lower panels plot switch distance against flat parsing score for CEU chr10 (1001–1100) and YRI chr3 (3001–3100).]
Fig. 3. The four upper panels illustrate, on four typical datasets, the behaviour of the switch distance after each greedy hierarchical step that adds a new recombinant to the founders. We used K = 3, 7, 10 initial founders (magenta, blue and black regular lines). The straight red dotted line shows fastPHASE's performance, the straight dash-dotted green line fastPHASE-10's, and the straight dashed black line HIT's. The two lower diagrams visualize the correlation between the flat parsing score and the switch distance during the greedy hierarchical steps.
References
1. Daly, M., Rioux, J., Schaffner, S., Hudson, T., Lander, E.: High-resolution haplotype structure in the human genome. Nature Genetics 29, 229–232 (2001)
2. Garey, M., Johnson, D.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, New York (1979)
3. Griffiths, R., Marjoram, P.: Ancestral inference from samples of DNA sequences with recombination. Journal of Computational Biology 3, 479–502 (1996)
4. Gusfield, D.: Haplotype inference by pure parsimony. Technical Report CSE-2003-2, Department of Computer Science, University of California (2003)
5. Hirschberg, D.S.: A linear space algorithm for computing maximal common subsequences. Comm. ACM 18, 341–343 (1975)
6. Kececioglu, J., Gusfield, D.: Reconstructing a history of recombinations from a set of sequences. Discrete Applied Mathematics 88, 239–260 (1998)
7. Kleinberg, J., Papadimitriou, C., Raghavan, P.: Segmentation problems. In: Proc. STOC '98, New York, USA, pp. 473–482. ACM Press, New York (1998)
8. Koivisto, M., Rastas, P., Ukkonen, E.: Recombination systems. In: Karhumäki, J., Maurer, H., Păun, G., Rozenberg, G. (eds.) Theory Is Forever. LNCS, vol. 3113, pp. 159–169. Springer, Heidelberg (2004)
9. Lajoie, M., El-Mabrouk, N.: Recovering haplotype structure through recombination and gene conversion. Bioinformatics 21(2), ii173–ii179 (2005)
10. Lancia, G., Pinotti, C., Rizzi, R.: Haplotyping populations: Complexity and approximations. Technical Report DIT-02-0080, Department of Information and Communication Technology, University of Trento (2002)
11. Lin, S., Cutler, D.J., Zwick, M.E., Chakravarti, A.: Haplotype inference in random population samples. American Journal of Human Genetics 71, 1129–1137 (2002)
12. Lyngsø, R., Song, Y., Hein, J.: Minimum recombination histories by branch and bound. In: Casadio, R., Myers, G. (eds.) WABI 2005. LNCS (LNBI), vol. 3692, pp. 239–250. Springer, Heidelberg (2005)
13. Pääbo, S.: The mosaic in our genome. Nature 421, 409–412 (2003)
14. Rastas, P.: Haplotyyppien määritys (Haplotype inference). Report C-2004-69 (M.Sc. thesis), Department of Computer Science, University of Helsinki (2004)
15. Rastas, P., Koivisto, M., Mannila, H., Ukkonen, E.: A hidden Markov technique for haplotype reconstruction. In: Casadio, R., Myers, G. (eds.) WABI 2005. LNCS (LNBI), vol. 3692, pp. 140–151. Springer, Heidelberg (2005)
16. Scheet, P., Stephens, M.: A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. American Journal of Human Genetics 78, 629–644 (2006)
17. Schwartz, R., Clark, A., Istrail, S.: Methods for inferring block-wise ancestral history from haploid sequences. In: Guigó, R., Gusfield, D. (eds.) WABI 2002. LNCS, vol. 2452, pp. 44–59. Springer, Heidelberg (2002)
18. The International HapMap Consortium: A haplotype map of the human genome. Nature 437, 1299–1320 (2005)
19. Ukkonen, E.: Finding founder sequences from a set of recombinants. In: Guigó, R., Gusfield, D. (eds.) WABI 2002. LNCS, vol. 2452, pp. 277–286. Springer, Heidelberg (2002)
20. Wade, C., Kulbokas, E., Kirby, A., Zody, M., Mullikin, J., Lander, E., Daly, M.: The mosaic structure of variation in the laboratory mouse genome. Nature 420, 574–578 (2002)
21. Wang, L., Zhang, K., Zhang, L.: Perfect phylogenetic networks with recombination. Journal of Computational Biology 8, 69–78 (2001)
22. Wu, Y., Gusfield, D.: Improved algorithms for inferring the minimum mosaic of a set of recombinants. In: Proc. CPM 2007, Springer, Heidelberg (to appear, 2007)
Seeded Tree Alignment and Planar Tanglegram Layout

Antoni Lozano^1, Ron Y. Pinter^2, Oleg Rokhlenko^2, Gabriel Valiente^3, and Michal Ziv-Ukelson^4

^1 Logic and Programming Research Group, Technical University of Catalonia, E-08034 Barcelona, Spain. [email protected]
^2 Department of Computer Science, Technion – Israel Institute of Technology, Haifa 32000, Israel. {pinter,olegro}@cs.technion.ac.il
^3 Algorithms, Bioinformatics, Complexity and Formal Methods Research Group, Technical University of Catalonia, E-08034 Barcelona, Spain. [email protected]
^4 School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel. [email protected]

Abstract. The optimal transformation of one tree into another by means of elementary edit operations is an important algorithmic problem that has several interesting applications to computational biology. We introduce a constrained form of this problem in which a partial mapping of a set of nodes in one tree to a corresponding set of nodes in the other tree is given, and present efficient algorithms for both ordered and unordered trees. Whereas ordered tree matching based on seeded nodes has applications in pattern matching of RNA structures, unordered tree matching based on seeded nodes has applications in co-speciation and phylogeny reconciliation. The latter involves the solution of the planar tanglegram layout problem, for which we give a polynomial-time algorithm.
1 Introduction
Matching and aligning trees is a recurrent problem in computational biology. Two prominent applications are the comparison of phylogenetic trees [2,12,15,19,22] and the comparison of RNA structures [8,9,20,24]. The specific problems defined and addressed in this paper are motivated by applications where densely seeded local tree alignments are sought. In what follows, we first describe an example motivated by evolutionary studies of RNase P RNAs and their target tRNAs; it is interesting as it demonstrates the need for seeded tree alignments for both ordered and unordered trees. Section 2 describes a general framework and the corresponding analysis for seeded tree alignment. Finally, in Section 3, an algorithm is presented which computes a planar layout for two unordered seeded trees, if such a layout exists. Ribonuclease P is the endoribonuclease responsible for the 5' maturation of tRNA precursors [5]. RNase P is a ribonucleoprotein in all organisms, but is
Fig. 1. The known secondary structures for two RNase P sequences and the corresponding coarse-grain trees. (left) E. coli RNase P, based on [5]; shaded elements represent conserved loci. (right) M. barkeri RNase P, obtained from the RNase P database.
Fig. 2. Seeded tree alignment for the E. coli versus M. barkeri RNase P secondary structures of Fig. 1. Dark vertices represent conserved loci; dotted lines represent alignment seeds.
best understood in Bacteria, in which the RNA component of the enzyme is by itself catalytically proficient in vitro (it is a ribozyme). The structure of bacterial RNase P RNA has been studied in detail, primarily using comparative methods [4,7,13,23]. Bacterial RNase P RNAs share a common core, and synthetic minimal RNase P RNAs consisting only of these core sequences and structures are catalytically proficient. Structural variation in RNase P RNA is dominated by variation in the presence or absence of helical elements and in the size of the distal regions of these helices. However, there is additional variation in the form of small differences in the lengths of helices, loops and joining regions. In terms of RNA secondary structure tree alignment, this means that the operations applied in transforming one tree to another consist of subtree deletions and insertions as well as homeomorphic node insertions and deletions in ordered rooted trees (see Fig. 1).
Fig. 3. Seeded phylogenetic unordered tree alignment, in the context of horizontal gene transfer prediction. (left) Tanglegram formed by connecting, via seed edges, the phylogenetic tree based on archaeal RNase P structures [6] with another phylogenetic tree based on archaeal rRNA [25]. The seed edges for Archaeoglobi and Methanococci, which are putatively involved in RNase P RNA horizontal transfer [6], were omitted. (right) Planar layout of the tanglegram.
Recently, sequences encoding RNase P RNAs of various genomes have been determined (see the RNase P database, http://www.mbio.ncsu.edu/RNaseP/). This broad sampling of RNase P RNAs allows some phylogenetic refinement of the secondary structure, and reveals patterns in the evolutionary variation of sequences and secondary structures. In [5], the extent and patterns of evolutionary variation in RNase P RNA sequence and structure were studied, in both bacterial and archaeal species, and it was shown that highly conserved bases are scattered throughout the sequence and secondary structure. A detailed description of the conserved loci is given in [5] and shown in Fig. 1. In terms of RNA secondary structure tree comparison, this means that in a biologically correct alignment of two RNase P trees, the nodes corresponding to the conserved loci should be mapped to each other ("alignment seeds"), as shown in Fig. 2. The need to align seeded tree pairs also arises in applications where the bioinformatics data is represented in the form of unordered trees. To demonstrate this, consider the example in Fig. 3, which illustrates the reconciliation of a phylogenetic tree based on archaeal RNase P structures with the phylogenetic tree based on archaeal rRNA structures. It is based on a detailed comparative analysis of archaeal RNase P RNA structures [6] from a wide range of archaeal species. The RNase P RNA sequences were rigorously aligned in a comparative analysis of secondary structure, providing an opportunity to compare phylogenetic relations derived from RNase P RNA sequences with those derived from small subunit ribosomal RNA sequences from the same group of organisms [10]. Although the RNase P RNA sequences generally recreate trees similar to those based on rRNA, a significant exception is the placement of the sequence from Archaeoglobus fulgidus. In rRNA-based trees, this genus lies on a branch distinct from the other major euryarchaeal groups, separating from the other groups at approximately the bifurcation between methanobacteria and halobacteria/methanomicrobia [10]. The A. fulgidus RNase P RNA, however, is clearly related in structure and sequence to those of Methanococcus. Trees constructed using parsimony (DNAPARS) and maximum likelihood (DNAML) methods [3]
agree on the placement of this sequence as a relative of Methanococcus, and this placement is robust. The most likely interpretations of the similarities between the RNase P RNAs of Methanococcus and A. fulgidus are that either (1) the ribosomal RNA-based trees are for some reason misleading, and A. fulgidus is specifically related to Methanococcus, or (2) the gene encoding RNase P RNA has been transferred laterally from one group to another. The above analysis can be formulated as a seeded unordered tree alignment, as follows (see Fig. 3). Connect each leaf of the RNase P RNA tree with the corresponding (same species) leaf of the ssu-rRNA tree, if such a leaf exists. Note that the layout of two unordered trees with additional edges forming a bijection among their leaves is called a tanglegram [14]. It is easy to see that the seeded unordered trees can be aligned if the input trees can be put in a non-crossing representation (in other words: if the tanglegram formed by the input trees together with the seed has a planar layout). Correspondingly, when formulating the problem raised by [6] as one of seeded unordered tree alignment: if the tanglegram formed by the two seeded RNA trees has a planar layout, and the two trees agree, then there is no basis for a lateral transfer hypothesis. In the above example, however, the tanglegram formed by the two RNA trees can be untangled, and the two trees can be aligned, after removing the seed edges corresponding to the two new neighbors (by RNase P RNA homology) Archaeoglobi and Methanococci. This supports the hypothesis of a lateral transfer of the gene encoding RNase P RNA from Archaeoglobi to Methanococci, or vice versa.
2 Tree Alignment Based on Seeded Nodes
Consider the constrained form of the tree matching problem in which the mapping of a subset of the nodes in one tree to a corresponding subset of the nodes in the other tree is given in advance. The initial node mapping is called the seed set of the matching.

Definition 1 (Mapping). A mapping M of a tree $T_1 = (V_1, E_1)$ to a tree $T_2 = (V_2, E_2)$ is a bijection $M \subseteq V_1 \times V_2$ such that for all $(v_1, v_2), (w_1, w_2) \in M$, it holds that $v_1$ is an ancestor of $w_1$ in $T_1$ if and only if $v_2$ is an ancestor of $w_2$ in $T_2$. A seed set S is a bijection $S \subseteq M \subseteq V_1 \times V_2$ such that S itself is also a mapping of $T_1$ to $T_2$.

Among all possible mappings, in this paper we deal with the commonly used least-common-ancestor (LCA) preserving ones.

Definition 2 (LCA-preserving mapping). Let $T_1 = (V_1, E_1)$ and $T_2 = (V_2, E_2)$ be trees, and let $M \subseteq V_1 \times V_2$ be a mapping of $T_1$ to $T_2$. M is LCA-preserving if the following condition holds: if $(x_1, x_2) \in M$ and $(y_1, y_2) \in M$ then $(\mathrm{lca}(x_1, y_1), \mathrm{lca}(x_2, y_2)) \in M$.

We next define a new tree alignment optimization problem over pairs of seeded trees, to be applied as a constrained form of a general, pre-selected tree alignment
algorithm. Therefore, let $TAA(T_1, T_2)$ denote a "black box" tree alignment algorithm, which applies a pre-selected tree alignment algorithm to an input consisting of two labeled trees $T_1$ and $T_2$. The class of tree alignment algorithms to which the seed constraint can actually be applied is discussed in the following section.

Definition 3 (Seeded tree alignment problem). Given two trees $T_1 = (V_1, E_1)$ and $T_2 = (V_2, E_2)$, a set of seeds $S \subseteq V_1 \times V_2$, and a predefined tree similarity measure SIM, where SIM(M) denotes a similarity score computed based on the pairs of nodes $(v_1, v_2) \in M$, the seeded tree alignment problem $STA(T_1, T_2, S, TAA)$ is to find a mapping $M \subseteq V_1 \times V_2$ such that $S \subseteq M$ and the alignment score SIM(M) is maximal under this constraint.

In this section, we show how to efficiently apply seeded alignment on top of existing tree alignment algorithms. We note that our results apply to LCA-preserving mappings (see Def. 2). This class of algorithms includes subtree isomorphism [11,18], subtree homeomorphism [1,16,17], and maximum common subtree (MCS) [21] finding algorithms. For the sake of clarity of presentation, we note that all these algorithms employ a dynamic programming table where entry [i, j] denotes the similarity score of subtree i of tree $T_1$ versus subtree j of tree $T_2$. Moreover, the time complexity of each of the above algorithms is computed by summing up the work invested in matching a subtree $t_u \in T_1$ with a subtree $t_v \in T_2$:

$O\left( \sum_{u=1}^{n} \sum_{v=1}^{m} c(u)^x\, c(v)^y\, f(c(u), c(v)) \right)$    (1)
where $|T_1| = n$, $|T_2| = m$, c(u) denotes the out-degree of $t_u$, c(v) denotes the out-degree of $t_v$, and f(c(u), c(v)) is a concave function that, along with the coefficients x and y, differs from one algorithm to another. For example, unordered subtree homeomorphism can be computed in $O(nm\sqrt{m})$ time using the top-down algorithm of [1], and the corresponding concave function is $\sqrt{m}$ (see Exmp. 1 for a complete analysis). In the discussion to follow, let the seeds contained in the initial seed set S be denoted primary seeds. Since we restrict our analysis to LCA-preserving mappings, the LCAs of the primary seeds also function as seeds, to be denoted secondary seeds (see Fig. 4 (left)). For the sake of simplicity of presentation we describe the seeded tree alignment algorithm for binary trees; extensions to non-binary trees are straightforward. Note that, given an LCA-preserving tree alignment algorithm, and given as input a planar-layout tanglegram of a pair of seeded trees that are to be aligned, the corresponding seeded tree alignment could immediately be derived by extending the applied node label similarity table as follows. For each seed $s = (u, v) \in S$ such that $u \in T_1$ and $v \in T_2$, relabel the seeded nodes to $u'$ and $v'$ respectively and add two new rows and two new columns to the label similarity table – one for node $u'$ and the other one for node $v'$. Then, the similarity score for entries $[u', v']$ and $[v', u']$ is set to the highest possible value of a node-to-node pairwise score, while all the remaining entries in
Fig. 4. An illustration of the dynamic programming table computed during the seeded matching algorithm. (left) The matched trees with a partitioning induced by seeds. (right) The corresponding DP table, divided into independent rectangles to be computed by an appropriate LCA-preserving mapping algorithm. The colored areas illustrate which parts of the DP table need to be computed. The lowest right-most corner of each colored rectangle holds the value for the roots of the corresponding compared subtrees. Each dashed rectangle corresponds to a secondary seed. Within each dashed rectangle, the single-cell components correspond to seeds, where the primary seeds are at the bottom-right and top-left (only for the subtrees framed by two primary seeds) corners, and the secondary seed is located between the two rectangles, each corresponding to one of the subtrees rooted at this secondary seed.
these two rows and columns are set to zero. This way we ensure that the above LCA-preserving tree alignment algorithms will match seeds as required; a sketch of this table extension is given below. That being said, in this section we show how to exploit the seeds to apply the tree alignment more efficiently, avoiding redundant work by restricting the computations to limited areas in the dynamic programming (DP) matrix. This "constrained-by-seeds" dynamic programming can be intuitively explained by following the example in Fig. 4. A regular, unconstrained application of the LCA-preserving algorithms mentioned above to the two trees in Fig. 4 (left) would require the computation of each and every entry in the DP table of Fig. 4 (right). The algorithm described below, however, will only compute the shaded rectangles along the diagonal of the table. Note that each primary seed corresponds to a single entry in the DP table whose score can be computed in an initialization step. Furthermore, each pair of consecutive seeds in S (according to a planar layout) defines a rectangle in the DP matrix, with a side of size at most k, that can be filled independently of the other rectangles; here k denotes the maximum gap between two consecutive seeds (in a planar layout). This is true for all entries except for the single entry in the rectangle which corresponds to a secondary seed, and whose computation depends on the availability of entries external to the rectangle. This availability, however, can be taken care of if the rectangles are processed by postorder traversal of the corresponding secondary seeds.
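The table extension described above can be sketched as follows (a hedged illustration: 'sim' is assumed to be a dict from label pairs to scores, and 'big' stands in for the highest possible node-to-node score):

    def apply_seeds(sim, seeds, big=10**9):
        # Force an LCA-preserving aligner to match each seeded pair:
        # each seed gets two fresh labels that score 'big' against each
        # other and zero against every other label.
        sim = dict(sim)
        labels = {l for pair in sim for l in pair}
        for idx, (u, v) in enumerate(seeds):
            lu, lv = ('seed', idx, 1), ('seed', idx, 2)  # labels for u, v
            for l in labels:
                sim[(lu, l)] = sim[(l, lu)] = 0
                sim[(lv, l)] = sim[(l, lv)] = 0
            sim[(lu, lv)] = sim[(lv, lu)] = big
            sim[(lu, lu)] = sim[(lv, lv)] = 0
            labels |= {lu, lv}
        return sim

The seeded nodes u and v are then relabeled with the fresh labels before the aligner is run.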
bounded by k², and thus there is an immediate O(nk) bound on the number of entries that need to be computed in the table (in comparison to O(n²) entries in the unconstrained tree alignment case). Furthermore, each application of TAA is given as input two subtrees with no more than k nodes. The time complexity of seeded LCA-preserving tree alignment algorithms is formally analyzed in Obs. 2 and demonstrated in Exmp. 1.
We refer the reader to Fig. 4 (left) for the following discussion. Consider the subtree obtained during a postorder traversal of T1, from node c to node d: note that all nodes located in the left part are colored green and all nodes located in the right part are colored blue. Correspondingly, in the subtree obtained during a postorder traversal of T2, from node c' to node d', all nodes located in the right part are colored green and all nodes located in the left part are colored blue. This correspondence of colors is explained by Obs. 1; before we state it we need the following definition.
Definition 4. For any tree T and nodes x, y ∈ T, let t_{x-y} denote the subtree consisting of all nodes found in a postorder traversal of T, starting from node x and ending in node y. Also, let left_{x-y} and right_{x-y} denote the left and the right subtrees of t_{x-y}, respectively. Note that both left_{x-y} and right_{x-y} are rooted at lca(x, y).
Observation 1. Let T1 = (V1, E1) and T2 = (V2, E2) be trees to be aligned and (x1 ∈ V1, x2 ∈ V2) and (y1 ∈ V1, y2 ∈ V2) be a pair of seeds such that x1 < y1 and x2 < y2 in the postorder traversal of T1 and T2, respectively. In an LCA-preserving mapping of T1 to T2, all nodes in left_{x1-y1} are mapped to nodes in right_{x2-y2}. Symmetrically, all nodes in right_{x1-y1} are mapped to nodes in left_{x2-y2}. Proof. By recursive invocation of Def. 2.
The seeded tree alignment algorithm starts by extending the seeds set S to include the secondary seeds. Next, it orders S such that all the seeds obey a planar layout, that is, there is no crossing between seeds. An algorithm to compute this layout, if such a layout exists, is given in Sect. 3. The resulting order partitions the target trees, according to Obs. 1, into exclusive subtree-pair intervals (see Fig. 4 (right)). The suggested algorithm processes these subtree pairs in postorder traversal of their roots (which are paired as secondary seeds). For each such interval, it retrieves the corresponding subtrees and feeds them as input to TAA.
Lemma 1. Let T1 = (V1, E1) and T2 = (V2, E2) be two trees to be aligned, and let S ⊆ V1 × V2 be a primary seeds set. Given an LCA-preserving tree alignment algorithm TAA and the corresponding score function SIM, the algorithm described above computes STA(T1, T2, S, TAA).
Restricting the computations to limited areas in the DP matrix results in a speedup of the applied tree comparison algorithms, as analyzed below.
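Before the formal analysis, the following minimal sketch (ours, not the authors' implementation) makes the rectangle decomposition concrete. It assumes the seeds are already given as pairs of postorder indices sorted in planar-layout order, and taa_fill is a hypothetical stand-in for the chosen LCA-preserving algorithm applied to one subtree pair:

```python
def seeded_alignment_rectangles(seeds, taa_fill):
    """Fill only the DP rectangles induced by consecutive seeds.

    seeds:    (u, v) postorder-index pairs, sorted so that both coordinates
              increase (i.e., a planar layout of the seeds, cf. Sect. 3).
    taa_fill: callback running the chosen LCA-preserving tree alignment
              algorithm on one rectangle [u_lo..u_hi] x [v_lo..v_hi] and
              returning the scores of the entries it computed.
    Only O(n * k) cells are touched instead of the full O(n^2) table.
    """
    scores = {}
    prev_u = prev_v = -1
    for u, v in seeds:  # increasing seed order = postorder of secondary seeds
        # The block strictly between two consecutive seeds is independent of
        # all other blocks (Obs. 1), except for the secondary-seed entry,
        # which taa_fill may read from already-computed rectangles in scores.
        scores.update(taa_fill(prev_u + 1, u, prev_v + 1, v, scores))
        prev_u, prev_v = u, v
    return scores
```

Processing the rectangles in increasing seed order corresponds to the postorder processing of secondary seeds required for the entries that depend on values outside their own rectangle.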
Lemma 2. The above framework for computing STA(T1, T2, S, TAA) yields a speedup of Ω((n/k)^{x+y-1} · f(n, n)/f(k, k)) over the time complexity of the corresponding, unseeded, tree alignment algorithm TAA(T1, T2).
Proof. Let f(c(u), c(v)) denote the concave function quantifying the work that a given (LCA-preserving, DP subtree-to-subtree based) tree-comparison algorithm TAA applies per alignment of a subtree t_u ∈ T1 with a subtree t_v ∈ T2, where c(u) denotes the out-degree of t_u and c(v) denotes the out-degree of t_v.
Observation 2. Σ_{u=1}^{k} c(u) = k and Σ_{v=1}^{k} c(v) = k.
Summing up the work over all node pairs, applying Obs. 2 to Eq. 1 we get:

O((n/k) Σ_{u=1}^{k} Σ_{v=1}^{k} c(u)^x c(v)^y f(c(u), c(v))) = O((n/k) k^x Σ_{v=1}^{k} c(v)^y f(k, k)) = O((n/k) k^x k^y f(k, k)) = O(n k^{x+y-1} f(k, k)).
This yields a speedup of Ω((n/k)^{x+y-1} · f(n, n)/f(k, k)) over the time complexity obtained by applying the corresponding unseeded version of the tree comparison algorithm.
Below we give an example of one such seeded tree alignment algorithm. Time complexities of the seeded versions of additional currently known LCA-preserving tree comparison algorithms will be given in the full version of this paper.
Example 1 (Top-down unordered subtree homeomorphism [1]). The algorithm for top-down unordered subtree isomorphism between trees T1 and T2 with |T1| = n1 and |T2| = n2 runs in O(n1 n2 √n2) time, since

O(Σ_{u=1}^{n1} Σ_{v=1}^{n2} c(u) c(v) √c(v)) = O(n1 n2 √n2).

When applied over a seeded tree matching, we get

O((n1/k) Σ_{u=1}^{k} Σ_{v=1}^{k} c(u) c(v) √c(v)) = O((n1/k) k Σ_{v=1}^{k} c(v) √c(v)) = O((n1/k) k · k√k) = O(n1 k√k).

Thus, if the compared trees are heavily seeded and k = O(1), then the algorithm runs in O(n1) time and the speedup factor is O(n2 √n2).
3
Planar Tanglegram Layout
A layout of two unordered trees with additional edges forming a bijection among their leaves is called a tanglegram [14]. These diagrams arise in host-parasite cospeciation studies and in the reconciliation of gene and species phylogenies.
Definition 5 (Tanglegram). A tanglegram is a triple (T1, T2, S) where T1 = (V1, E1) and T2 = (V2, E2) are unordered trees, and S ⊆ V1 × V2 is a seed, that is, a partial mapping of T1 to T2. A tanglegram is binary if both T1 and T2 are binary trees.
Given a tanglegram (T1, T2, S), we will be interested in finding a way to represent the two trees such that the seed does not create any crossings among the edges corresponding to seeds in that representation. We call such a representation a planar layout of the tanglegram. To define it formally, we first introduce the notion of an extension of a set (and a pair of sets) of nodes.
Definition 6 (Extension). Let T1 be an unordered tree, let X be an ordered set of nodes in T1, and let u ∈ X be a non-leaf of T1. Denote by X' the ordered set X where u has been replaced by its children in some particular ordering. Then, we call X' a one-step extension of X. We say that Z is an extension of X if there is a sequence of zero or more one-step extensions from X to Z. Let Y be an ordered set of nodes in an unordered tree T2. Then, we also say that (X', Y') is an extension of (X, Y) if X' is an extension of X and Y' is an extension of Y.
We are interested in extending the pair formed by the roots of the two trees of a tanglegram until there is no point in extending it further. The extensions are performed until no seed with seeded descendants can be found (for instance, seeded leaves satisfy this condition). In the following, we will call these nodes terminals.
Definition 7 (Planar layout). Let T1 and T2 be unordered trees with roots r1 and r2, respectively. A planar layout of a tanglegram (T1, T2, S) is a pair (x, y) with x = (x1, . . . , xn) and y = (y1, . . . , yn), such that: (x, y) is an extension of ((r1), (r2)), the nodes in x and y are terminals, and (xi, yi) ∈ S for every i with 1 ≤ i ≤ n.
Example 2. The tanglegram to the left has a planar layout, namely ((a, b, d, c), (a, b, d, c)), while the one to the right does not.
[Figure: two tanglegrams (T1, S, T2). In the left tanglegram, the leaves a, b, c, d of T1 are joined by S to the leaves d, b, a, c of T2, and a planar layout exists; the right tanglegram admits no planar layout.]
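As a quick sanity check of the no-crossing condition in Def. 7, here is a small sketch (not from the paper) that tests whether a candidate pair of terminal orders is crossing-free with respect to a seed set:

```python
def is_planar_layout(x, y, seeds):
    """x, y: candidate left/right orders of seeded terminals; seeds: set of
    (u, v) pairs. The layout is planar iff matched pairs appear in the same
    relative order on both sides, i.e. no two seed edges cross."""
    pos_y = {v: j for j, v in enumerate(y)}
    matched = []
    for u in x:
        for (su, sv) in seeds:
            if su == u:
                matched.append(pos_y[sv])
    # crossing-free <=> the right-side positions are increasing
    return all(a < b for a, b in zip(matched, matched[1:]))

# Example 2 with same-letter seeds: the left layout is planar.
S = {(u, u) for u in "abcd"}
print(is_planar_layout(list("abdc"), list("abdc"), S))  # True
print(is_planar_layout(list("abcd"), list("badc"), S))  # False (crossings)
```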
We next describe an algorithm for finding a planar layout of a binary tanglegram. The procedure Untangle computes the layout of a binary tanglegram by successive refinements of two lists, the ordered sets X and Y , which initially contain the roots of the trees. At each iteration of the loop, a node of one of the lists is “refined,” which means that a correct ordering of its children is found and
fixed for the rest of the algorithm. The loop stops when all the elements of the lists X and Y are terminal nodes of the trees; at this point, the planar layout (if it exists) is completed.
Before starting the main loop, the procedure Paths computes a table P of Boolean values which can be understood as an extension of the bijection S to all the nodes of the trees. In particular, for any node u in T1 and any node v in T2, P[u, v] is true if and only if the subtree of T1 rooted at u has a descendant u', the subtree of T2 rooted at v has a descendant v', and (u', v') ∈ S. Now we return to the main procedure.
Algorithm 1. Given a tanglegram (T1, T2, S), obtain a planar layout (X, Y) for (T1, T2, S). Let r1, r2 be the roots of T1, T2, respectively.
procedure Untangle(T1, T2, S)
  X, Y ← (r1), (r2)
  E ← {{r1, r2}}
  P ← Paths(T1, T2, S)
  while X ∪ Y contain some non-terminal node do
    u ← a non-terminal node of highest degree in (X ∪ Y, E)
    if u is in X then Refine(u, X, Y, E, P)
    else Refine(u, Y, X, E, P)
  return (X, Y)
In the refinement step, a node u in the graph (X ∪ Y, E) is substituted by its children u1, u2 in such a way that no edge crossing is introduced.
Algorithm 2. Given a partial planar layout (A ∪ B, E) and a node u, refine the planar layout by substituting u by its children and return A and E modified according to the refinement.
procedure Refine(u, A, B, E, P)
  u1, u2 ← children of u
  for every node v ∈ B such that {u, v} ∈ E do
    if P[u1, v] then add edge {u1, v} to E
    if P[u2, v] then add edge {u2, v} to E
    delete {u, v} from E
  if u1 is an isolated node in ({u1} ∪ B, E) then
    replace u by u2 in A
  else if u2 is an isolated node in ({u2} ∪ B, E) then
    replace u by u1 in A
  else if not Crossings(u1, u2, B, E) then
    replace u by the ordered set (u1, u2) in A
  else if not Crossings(u2, u1, B, E) then
    replace u by the ordered set (u2, u1) in A and flip clade u
  else reject
The above procedure selects an ordering of the nodes U = {u1, u2} such that, replacing u by U in A, the graph (A ∪ B, E) does not contain any edge crossings. We say that (A ∪ B, E) has an edge crossing if there are two nodes a1, a2 in A and two more nodes b1, b2 in B, appearing in this order in A and B, such that E contains the edges (a1, b2) and (a2, b1). Assuming (A ∪ B, E) does not already have any edge crossings before replacing u by U in A, this property is checked in the procedure Crossings with cost O(n), by just checking whether any edge adjacent to node u2 crosses the last (in the order given by B) edge adjacent to node u1.
Theorem 1. The procedure Untangle(T1, T2, S) computes a planar layout for (T1, T2, S) if there is one.
Proof (Sketch). The whole algorithm can be thought of as the computation of an extension of ((r1), (r2)), which becomes a planar layout at the end. Furthermore, if (X, Y) is promising (it can be extended to a planar layout) at the beginning of the main loop, then it is promising at the end. In order to ensure this invariant, the choice of a highest degree node is crucial: for degree at least 2, the ordering of u's refinement will be unique, while for degree 1, it can be argued that any ordering will be equally promising. The computation finishes when all nodes have degree 1, in which case we already have a planar layout.
Lemma 3. Algorithm 1 runs in O(n²) time and space.
Proof. Let T1 and T2 be unordered trees with |T1| = n1 and |T2| = n2, and let n = n1 + n2. The cost of Alg. 1 is dominated by the computation of the path matrix P, which takes O(n²) time and uses O(n²) additional space. Once P is available, the Refine procedure is called exactly once for each non-terminal node of the trees, and in each call the neighbors of the node in the graph (A ∪ B, E) are updated in O(max(n1, n2)) = O(n) time; the Crossings procedure also takes O(n) time. Therefore, the Untangle procedure runs in O(n²) time.
Note that, in practical applications, the local, or "all subtree versus subtree," version of seeded tree alignment is actually sought, in which case we iteratively run the described framework over all subtree pairs of T1 and T2. In this case, P is only constructed once, as a preprocessing step, in O(n²) time; then, for each local seeded subtree pair to be aligned, the processing work consists of untangling the corresponding tanglegram in O(n) time, using the table P which was already computed in the preprocessing stage, and then applying the seeded tree alignment algorithm. Since there are O(n²) subtree pairs to be processed, the bottleneck in practice is actually dictated by the time complexity of the seeded tree alignment, according to the density of the given seeds set and the pre-selected tree alignment algorithm TAA to be applied.
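For illustration, here is a minimal executable sketch (not the authors' code) of the Paths preprocessing. The table is filled by the recurrence P[u, v] = [(u, v) ∈ S] or P[child(u), v] or P[u, child(v)], which touches each node pair once and thus matches the O(n²) bound of Lemma 3:

```python
from functools import lru_cache

def paths_table(children1, children2, seeds):
    """children1/children2: dict node -> list of children (empty for leaves);
    seeds: set of (u, v) pairs. Returns the Boolean table P of this section:
    P[u, v] is True iff some seed joins a descendant of u (in T1) with a
    descendant of v (in T2); a node counts as its own descendant."""
    @lru_cache(maxsize=None)
    def P(u, v):
        return ((u, v) in seeds
                or any(P(c, v) for c in children1[u])
                or any(P(u, c) for c in children2[v]))
    return {(u, v): P(u, v) for u in children1 for v in children2}
```

With P in hand, Refine only needs constant-time lookups P[u1, v] and P[u2, v] when splitting the edges incident to u.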
Acknowledgements
AL was partially supported by Spanish CICYT projects TIN2004-04343 iDEAS and TIN2005-08832-C03-03 MOISES-BAR. GV was partially supported by Spanish CICYT project TIN2004-07925-C03-01 GRAMMARS and DGES project MTM2006-07773 COMGRIO. MZ was partially supported by an Eshkol grant of the Israeli Ministry of Science and Technology.
References
1. Chung, M.J.: O(n^{2.5}) time algorithms for the subgraph homeomorphism problem on trees. J. Algorithms 8, 106–112 (1987)
2. DasGupta, B., et al.: On distances between phylogenetic trees. In: Proc. 8th Annual ACM-SIAM Symp. Discrete Algorithms, pp. 427–436. ACM Press, New York (1997)
3. Felsenstein, J.: PHYLIP – Phylogeny Inference Package. Cladistics 5, 164–166 (1989)
4. Gardiner, K.J., Marsh, T.L., Pace, N.R.: Ion dependence of the Bacillus subtilis RNase P reaction. J. Biol. Chem. 260, 5415–5419 (1985)
5. Haas, E.S., Brown, J.W.: Evolutionary variation in bacterial RNase P RNAs. Nucleic Acids Res. 26, 4093–4099 (1998)
6. Harris, J.K., et al.: New insight into RNase P RNA structure from comparative analysis of the archaeal RNA. RNA 7, 220–232 (2001)
7. James, B.D., et al.: The secondary structure of ribonuclease P RNA, the catalytic element of a ribonucleoprotein enzyme. Cell 52, 19–26 (1988)
8. Jansson, J., Hieu, N.T., Sung, W.-K.: Local gapped subforest alignment and its application in finding RNA structural motifs. J. Comput. Biol. 13, 702–718 (2006)
9. Le, S.-Y., Nussinov, R., Maizel, J.V.: Tree graphs of RNA secondary structures and their comparisons. Comput. Biomed. Res. 22, 461–473 (1989)
10. Maidak, B.L., et al.: The RDP (Ribosomal Database Project) continues. Nucleic Acids Res. 28, 173–174 (2000)
11. Matula, D.W.: Subtree isomorphism in O(n^{5/2}). Ann. Discrete Math. 2, 91–106 (1978)
12. Nye, T.M., Lio, P., Gilks, W.R.: A novel algorithm and web-based tool for comparing two alternative phylogenetic trees. Bioinformatics 22, 117–119 (2006)
13. Pace, N.R., Brown, J.W.: Evolutionary perspective on the structure and function of ribonuclease P, a ribozyme. J. Bacteriol. 177, 1919–1928 (1995)
14. Page, R.D.M. (ed.): Tangled Trees: Phylogeny, Cospeciation, and Coevolution. The University of Chicago Press, Chicago (2002)
15. Page, R.D.M., Valiente, G.: An edit script for taxonomic classifications. BMC Bioinformatics 6, 208 (2005)
16. Pinter, R.Y., et al.: Approximate labelled subtree homeomorphism. In: Sahinalp, S.C., Muthukrishnan, S.M., Dogrusoz, U. (eds.) CPM 2004. LNCS, vol. 3109, pp. 59–73. Springer, Heidelberg (2004)
17. Reyner, S.W.: An analysis of a good algorithm for the subtree problem. SIAM J. Comput. 6, 730–732 (1977)
18. Shamir, R., Tsur, D.: Faster subtree isomorphism. J. Alg. 33, 267–280 (1999)
19. Shan, H., Herbert, K.G., Piel, W.H., Shasha, D., Wang, J.T.L.: A structure-based search engine for phylogenetic databases. In: SSDBM 2002, pp. 7–10. IEEE Computer Society Press, Los Alamitos (2002)
20. Shapiro, B.A., Zhang, K.: Comparing multiple RNA secondary structures using tree comparisons. Comput. Appl. Biosci. 6, 309–318 (1990)
21. Valiente, G.: Algorithms on Trees and Graphs. Springer, Heidelberg (2002)
22. Valiente, G.: A fast algorithmic technique for comparing large phylogenetic trees. In: Consens, M.P., Navarro, G. (eds.) SPIRE 2005. LNCS, vol. 3772, pp. 371–376. Springer, Heidelberg (2005)
23. Woese, C.R., Pace, N.R.: Probing RNA structure, function, and history by comparative analysis. In: Gesteland, R.F., Atkins, J.F. (eds.) The RNA World, pp. 91–117. Cold Spring Harbor Laboratory Press (1993)
24. Zhang, K., Wang, L., Ma, B.: Computing similarity between RNA structures. In: Crochemore, M., Paterson, M.S. (eds.) Combinatorial Pattern Matching. LNCS, vol. 1645, pp. 281–293. Springer, Heidelberg (1999)
25. Hugenholtz, P.: Exploring prokaryotic diversity in the genomic era. Genome Biol. 3, 1–8 (2002)
Inferring Models of Rearrangements, Recombinations, and Horizontal Transfers by the Minimum Evolution Criterion (Extended Abstract)
Hadas Birin¹, Zohar Gal-Or¹, Isaac Elias², and Tamir Tuller¹,*
¹ School of Computer Science, Tel Aviv University
{hadasbir,zohargal,tamirtul}@post.tau.ac.il
² School of Computer Science and Communication, KTH
[email protected]
Abstract. The evolution of viruses is very rapid and in addition to local point mutations (insertion, deletion, substitution) it also includes frequent recombinations, genome rearrangements, and horizontal transfer of genetic material. Evolutionary analysis of viral sequences is therefore a complicated matter for two main reasons: First, due to HGTs and recombinations, the right model of evolution is a network and not a tree. Second, due to genome rearrangements, an alignment of the input sequences is not guaranteed. Since contemporary methods for inferring phylogenetic networks require aligned sequences as input, they cannot deal with viral evolution. In this work we present the first computational approach which deals with both genome rearrangements and horizontal gene transfers and does not require a multiple alignment as input. We formalize a new set of computational problems which involve analyzing such complex models of evolution, investigate their computational complexity, and devise algorithms for solving them. Moreover, we demonstrate the viability of our methods on several synthetic datasets as well as biological datasets. Keywords: Phylogenetic network, horizontal gene transfer, genome rearrangements, recombinations, minimum evolution.
1
Introduction
Eukaryotes evolve largely through vertical lineal descent driven by local point mutations and genome rearrangements. Unlike the Eukaryotes, bacteria also acquire genetic material through the transfer of DNA segments across species boundaries—a process known as Horizontal Gene Transfer (HGT) [8]. In the presence of HGTs, the evolutionary history of a set of organisms is modelled by
* Corresponding author.
a phylogenetic network, which is a directed acyclic graph obtained by inferring a set of edges between pairs of edges in the organismal tree to model the horizontal transfer of genetic material [15] (see Figure 1). We call such a network a rooted phylogenetic network. In the case of viruses, the evolutionary model is even more complicated. The viral genomes are usually compact and evolve very rapidly by all the aforementioned mutations, in addition to a large number of recombinations [23] and rearrangements. Furthermore, in this case an organismal tree usually does not exist [23]; thus the right model is an unrooted tree with an additional small set of undirected edges (between pairs of edges in the initial tree). We call such networks unrooted phylogenetic networks.
There are many strategies and models for dealing with non-tree-like evolution; here we briefly describe some of them. For example, splits networks (see e.g. [13]) are graphical models that capture incompatibilities in the data due to various factors, not necessarily HGT or hybrid speciation. Some works describe phylogenetic networks as probabilistic models and use maximum likelihood for analyzing them [25,14], while others use maximum parsimony [15], or deal with the problem by a graph-theoretic approach of reconciling species and gene trees into phylogenetic networks [1]. None of the mentioned works deals with rearrangements.
In this work we devise a distance-based method for inferring evolution under complicated models that can involve substitutions, insertions, and deletions of single nucleotides, rearrangement, HGT, and recombination. We believe that in our case, where the models of evolution are complex, distance methods have advantages for three main reasons: First, sometimes the appropriate probabilistic model is not completely clear; thus, using ML is not feasible. Second, in our experience [14,15], ML and MP are usually more time consuming than distance methods, even when considering only complete HGT; if the models include HGTs together with rearrangements, these methods are not feasible. Finally, MP and ML require a multiple alignment as input, while we want to separate our method from this requirement. The multiple alignment problem is NP-hard [9], and to the best of our knowledge, at some stage of the processing most methods for inferring evolutionary networks require a multiple alignment. We believe that this requirement is problematic, especially with regard to complete viral genomes. Thus our method takes unaligned sequences as input.
Boc and Makarenkov suggested a distance-based method for detecting HGTs [4]; however, there are two main differences between this research and the work of Boc and Makarenkov: First, as opposed to their method, our models allow for rearrangements and recombination. Second, our models are sequence oriented (i.e. the input in our case is a set of sequences), while the approach of Boc and Makarenkov [4] requires distance matrices as input. Consequently, our work considers a more general and realistic setting.
Our methods are based on the following basic biological observations: 1) In phylogenetic networks each nucleotide evolves according to a tree (which may be different from the organismal tree) [12]. 2) Closely positioned nucleotides are
more likely to have evolved according to the same tree than distantly positioned nucleotides [23]. Therefore, our method infers different trees for different subsets of the sequences: it partitions the genomes into subsequences (each of at least a few dozen bp) and constrains the nucleotides in each subsequence to have the same evolution. Given an organismal tree and a set of sequences¹, our method finds families of homologous subsequences and reconstructs their evolutionary history by adding reticulation edges to the organismal tree while optimizing the minimum evolution criterion. This work does not handle gene duplication or deletion; dealing with these operations has been deferred to future work. However, as we demonstrate in this work, there are many interesting datasets that do not involve events such as duplication or deletion.
2
Definitions
Let T = (V, E) be a tree, where V and E are the tree nodes and tree edges, respectively, and let L(T) denote its leaf set. Further, let X be a set of taxa (species). Then T is a phylogenetic tree over X if there is a bijection between X and L(T). Henceforth, we identify the taxa with their associated leaves and denote the set of leaf-labels with [n] = {1, .., n}. A tree T is said to be rooted if the set of edges E is directed and there is a single distinguished internal vertex r with in-degree 0.
Let S = [s1, s2, s3, . . . , sn] be the sequences corresponding to the n taxa (note that these sequences may be of different lengths). A family over the set of sequences S is a set of sequences S' = [s'1, s'2, s'3, . . . , s'n], such that for all i, s'i is a subsequence of si. The definition of the ACS distance between two sequences appears in [26]. Let D(·, ·) denote a distance measure between pairs of sequences. In this paper D(·, ·) is either the cost of the pairwise alignment or the ACS distance. Two sequences, s1 and s2, are considered d-homologous with respect to a block length L if each sequence is longer than L and every window of length L in their pairwise alignment has evolutionary distance < d; we denote this property D_L(s1, s2) < d. A family of d-homologous subsequences is defined as a set of subsequences S' with the following property: for all s'1, s'2 ∈ S', D_L(s'1, s'2) < d. A non-overlapping set of families is a set of families such that in each sequence, subsequences from different families do not overlap. The subsequence s' ⊆ s that is part of the family f is denoted by f(s). We call the set of subsequences that are induced by a set of families a partitioning.
A rooted phylogenetic network N = N(T) = (V', E') over the taxa set X is derived from a rooted tree T by inferring reticulation edges between pairs of edges in T. That is, each reticulation edge is inferred by adding two new vertices on two edges of E and thereafter joining the two new vertices with the directed reticulation edge. A tree edge can take part in more than one reticulation event. In a similar way, an unrooted phylogenetic network is derived from an unrooted
¹ If the organismal tree is not part of the input, we estimate it from the input sequences.
tree by adding undirected edges to the tree. Each family f of subsequences is related to a subset of the reticulation edges, denoted by M(f), which describes the evolution of the family. If a family f evolves along the organismal tree then M(f) = ∅. A rooted phylogenetic network must satisfy additional temporal constraints, such as acyclicity [14,15]. Such temporal constraints do not exist in an unrooted network. Finally, we denote the set of all trees contained inside the network N (rooted or unrooted) by T(N). In the case of a rooted network, each such tree is obtained by the following two steps: (1) for each node of in-degree 2, remove one of the incoming edges, and then (2) for every node x of in-degree and out-degree 1, whose parent is u and whose child is v, remove node x and its two adjacent edges, and add a new edge from u to v. In the case of unrooted networks, a tree is obtained by removing an edge from each cycle, removing each node x with exactly two neighbors u and v together with the two edges incident to x, and adding a new undirected edge (u, v). In our setting, the tree T_f ∈ T(N) that includes exactly all the reticulation edges in M(f) describes the evolution of the family f.
In this work, we deal with the Minimum Evolution (ME) criterion [18]. It is known to be consistent when using the least-squares criterion [21] (as in this work), meaning that it converges to the correct tree for long enough sequences. In the case of evolutionary trees, the decision variant of the problem of finding the minimum evolution tree is defined as follows:
Problem 1. [7] Input: A set of n sequences, S, that induces a distance matrix B, and a real number e. Output: A tree, T, with total edge length less than e, where the edge lengths are least-squares estimated from B.
The sum of edge lengths of a tree T is the ME score of the tree; let E(T, S, D) denote the ME score for a tree T, with a set of sequences S corresponding to its leaves, when D is used as a distance measure between pairs of sequences. In our setting, we use the minimum evolution criterion to select the additional reticulation edges that best explain the evolution of each family of subsequences. That is, given a set of sequences S and a phylogenetic tree T, our goal is to find a set of non-overlapping families, a set of reticulation edges, and a mapping relating each family to a subset of the reticulation edges (i.e. one tree for each family). These are selected with the objective of minimizing the sum of minimum evolution scores over each family and its associated tree. If the set of families is F = [f1, f2, .., fh], the set of reticulation edges is H, the mapping is M, and the pairwise distance measure between sequences is D, then we denote this score by E(T, F, H, M, D) = Σ_i E(T_{f_i}, f_i, D).
Let s'1 and s'2 denote two subsequences of the sequence s. We say that s'1 precedes s'2 (s'1 ≺ s'2) if s'1 ends before s'2 begins. Under a non-rearrangement assumption, there is an order of the families, f1, f2, .., fh, such that in each sequence si: f1(si) ≺ f2(si) ≺ · · · ≺ fh(si) (see figure 1), but this assumption is not always justified.
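To make the D_L(·, ·) < d property concrete, here is a small illustrative sketch (not from the paper). It assumes the pairwise alignment has already been computed and is given as two equal-length rows, and dist stands in for whichever per-window measure is chosen (alignment cost or the ACS distance):

```python
def d_homologous(a1, a2, L, d, dist):
    """Check D_L(a1, a2) < d on a precomputed pairwise alignment:
    both sequences longer than L, and every length-L window of the
    alignment has evolutionary distance below d."""
    if len(a1) != len(a2) or len(a1) <= L:
        return False
    return all(dist(a1[i:i + L], a2[i:i + L]) < d
               for i in range(len(a1) - L + 1))
```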
Fig. 1. (a)–(c): A simple example of a HGT or recombination event between the ancestral black sequences in the ancestral taxa x and y. (a) A phylogenetic network with a single HGT event (the directed edge) which describes the evolution of the black family of subsequences. (b) The tree of the horizontally transferred family. (c) The underlying organismal (species) tree which describes the rest of the sequences. (d)–(e): Families. (d) Set of families under the non-rearrangement assumption. (e) Set of families without the non-rearrangement assumption. The white parts in the two cases denote families that evolve along the organismal tree T, while the colored families evolve along trees that are different from T.
Here we deal with three variants of the problem, each related to different assumptions about the input:
1. The first variant, Non Rearrangement Given Tree (NRGT), assumes an organismal tree and that subsequences have not been rearranged. An example of such input is a set of proteins and an organismal tree of bacteria.
2. The second variant, Rearrangement Given Tree (RGT), assumes an organismal tree and that subsequences may be rearranged. An example of such input is a set of genomes and an organismal tree of bacteria.
3. The third variant, Rearrangement No Tree (RNT), does not assume an organismal tree, and subsequences may be rearranged. An example of such input is a set of viral genomes.
The output for the first two variants is a set of homologous non-overlapping families, a set of reticulation edges, and a mapping from each family to the subset of reticulation edges that is related to it. In the third variant, the organismal tree is also part of the output.
3
Hardness Issues
In this section, we deal with the computational hardness of some variants of the problems that were mentioned in the previous section, and of other related problems. Roughly, the problem can be divided into two subproblems²: (i) dividing the set of input sequences into non-overlapping d-homologous families, and (ii) finding the best set of reticulation edges for each family. By the results presented in this section, it seems that these two problems are NP-hard. A related problem is Binary Minimum Common String Partitioning (BMCSP); we will use the hardness result for this problem to establish the hardness results for our problems. A minimum common partitioning of two binary strings s1 and s2 is given by the least number of blocks that s1 has to be cut into such that these blocks can be reconcatenated to form s2. Formally, BMCSP is defined as follows:
² In practice these two problems are not independent.
Problem 2 [BMCSP]. Input: Two binary strings, s1 and s2, and an integer B. Output: Can the sequence s2 be formed from the sequence s1 by cutting it into less than B subsequences and subsequently reconcatenating them?
The hardness of BMCSP can be proved by a reduction from the APX-complete problem 2-MCSP [11], which is defined as follows (due to lack of space, the full details of the proof are deferred to the full version of the paper):
Problem 3 [2-MCSP] [11]. Input: Two strings of integers, s1 and s2, where each integer appears exactly twice in each sequence, and an integer B. Output: Can the sequence s2 be formed from the sequence s1 by cutting it into less than B subsequences and subsequently reconcatenating them?
The hardness of the BMCSP problem implies the hardness of our problem, the decision variant of the RGT problem, which is defined as follows:
Problem 4 [RGT].³ Input: A set of binary sequences S, a phylogenetic tree T, two integers h and k, a real number c, and a distance measure between pairs of sequences, D. Question: Is there a set, F, of h non-overlapping families S'1, .., S'h (∀i S'i ⊂ S), a set, H, of k reticulation edges, and a mapping, M, from each family to a subset of H, such that the score E(T, F, H, M, D) ≤ c?
A reduction from BMCSP can show that RGT and RNT are hard even when there are 0 reticulation edges (details are deferred to the full version of the paper).
Theorem 1. RGT and RNT are NP-hard.
As mentioned, in this work we deal with the minimum evolution criterion (minimum evolution tree, or MET; see Problem 1). This problem is probably NP-hard for trees (it is still an open problem). It is easy to see that the NRGT problem (even if there is only one family) is NP-hard if Problem 1 is NP-hard (details are deferred to the full version of the paper).
Observation 1. NP-hardness of the problem MET implies NP-hardness of NRGT (when there is no rearrangement and the tree is given).
4
Algorithms and Parameters
In this section, we describe our method, Find Net-Families. As can be seen in Figure 2, this method consists of three stages, each of which solves a separate computational problem. As was just shown, most of the problems we deal with are NP-hard, and consequently the algorithms presented here are heuristics.
The input to Find Net-Families includes an organismal tree. This tree is either provided by the user or generated by computing a distance matrix with the ACS [26] method and then building the associated neighbor joining tree [22].
³ The problem RNT is defined in a similar way but the input does not include a tree T; the problem NRGT is defined in a similar way but the families must be in order.
Fig. 2. A sketch of the main algorithm, Find Net-Families, for finding families and a set of reticulation edges for each family. (A) The input includes a set of sequences; the organismal tree is either part of the input or is generated by the ACS method. (B) Stage 1: For all pairs of sequences, find the best partition such that each block is longer than L; details in Subsection 4.1. (C) Stage 2: By expanding one of the partitions from Stage 1, generate a set of non-overlapping families that cover all the input sequences; details in Subsection 4.2. (D) Stage 3: While improving the ME score, greedily add HGTs to each family that was found in Stage 2; update the set H if the solution includes HGTs that do not appear in H. This stage also includes small modifications of the families given the new set of HGTs; details in Subsection 4.3.
In the first stage we generate good partitions for each of the n(n-1)/2 pairs of input sequences. Thereafter, we greedily expand each of these partitionings to get a set of non-overlapping families that covers all of the input sequences. In the final stage, we improve the total minimum evolution score of the families by greedily adding reticulation edges to the organismal tree. In this final stage, we also adjust the boundaries of the subsequences representing the families. Due to running time considerations we used a parameter L that constrains the length of each subsequence in each family to be a multiple of L.⁴ The typical length of a gene is a few hundred nucleotides, and usually only complete genes are horizontally transferred [6]. In the case of partial HGT or recombination, the lengths have a similar order of magnitude [17,3] (very short horizontal transfers are hard, and sometimes impossible, to detect by any method). Indeed, using an L of a few hundred nucleotides gave good results (usually, changing L from one hundred to a few hundred does not change the results dramatically).
Finding d-Homologous Families in Two Sequences. Given two sequences s1 and s2 of length ℓ, our goal is to find d-homologous families where each block should be longer than L, and such that d is minimal. Namely, we wish to match each block in one sequence with exactly one as-similar-as-possible block in the other sequence. In this work we assume that there is one unique such matching. In practice, for large enough L (i.e. more than a few dozen characters) and when duplications are not present, this is indeed the case.
The procedure has two stages; in the first stage we search for common subsequences of length at least W, where W has to be tuned with respect to the input sequences. In general, too small a W (e.g. W = 2) is not specific enough, since we expect both non-homologous and homologous sequences to share common subsequences of length W. On the other hand, too large a W (e.g. W = 100) is also problematic, since even homologous sequences do not share such long subsequences. In practice, 12 < W < 18 gave good results for nucleotides, and 7 < W < 12 gave good results in the case of amino acids. Let S_i(s1, s2) denote the longest substring that starts at position i in s2 and appears in both sequences (we assume that |S_i(s1, s2)| = O(1)). In the first stage, we performed the following steps:
1. Generate the suffix array for s1.
⁴ In practice, the adjusting procedure allows lengths that differ from this constraint by up to 10%.
2. Scan s2; at each position i, find the longest subsequence that starts at that position and appears in both sequences, S_i(s1, s2).
3. If |S_i(s1, s2)| > W, keep the position and the length of the matching substring.
In our implementation we used the "lightweight suffix array" of [5,26], which is constructed in O(ℓ log ℓ) time. Step 2 of the algorithm above can, for each position i, be accomplished in O(log ℓ) time by performing a lexicographic binary search for the suffix of s2 starting at position i in the suffix array of s1. After the first stage we have a set of position-pairs for each common substring longer than W. In the second stage, we map each overlapping window of length L in the first sequence to the window in the second sequence which has the maximal sum of lengths of common substrings. We call each such match the core of a family f ∈ F. Finally, we greedily adjust the boundaries of each family by adding/removing small blocks at the ends of the windows while optimizing min_{F; s'(i), s'(j) ∈ F} D_L(s'(i), s'(j)), such that at the end of this stage the two strings have been partitioned into families that cover all of the sequences. The runtime complexity of this stage is O(ℓ²) for each pair of sequences. Thus the total runtime complexity for n sequences is O(ℓ² · n²).
Finding a Family of d-Homologous Subsequences. From the previous stage we have a d-homologous partitioning for each of the n(n-1)/2 pairs of sequences. In this stage the aim is to expand these pairwise matchings to families of d-homologous subsequences, with minimal d, that cover all the n input sequences. As mentioned before, we assume that each window of length close to L in a sequence has exactly one homologue in each of the other sequences, an assumption which is supported by our biological inputs. We examine the expansion of each of the n(n-1)/2 partitionings of pairs of sequences to a partitioning over all the n sequences. This is done by the following steps:
1. For each of the n(n-1)/2 partitionings of pairs of sequences:
   a. Start with one partitioning.
   b. The k-th (k ≤ n − 1) step: Greedily add another sequence to the partitioning of k − 1 sequences that was generated in the previous step. This is done by checking consecutive overlapping windows of length L and, for each family, choosing the non-overlapping window(s) (i.e. a subsequence of the new sequence) that includes the maximal sum of lengths of common subsequences appearing in the other members of this family in the k − 1 previous sequences.
2. Choose the expansion that minimizes min_{F; s'(i), s'(j) ∈ F} D_L(s'(i), s'(j)).
The runtime complexity of this stage is O(ℓ · n²) for each pair of sequences. Thus the total runtime complexity for n sequences is O(ℓ · n² · n²).
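The longest-match computation of steps 1–2 of the first stage can be sketched as follows; this is an illustrative stand-in (not the authors' code), using a plain sorted suffix list in place of the lightweight suffix array of [5,26]:

```python
import bisect

def longest_matches(s1, s2, W):
    """For every position i of s2, compute the length of the longest
    substring that starts at i and also occurs in s1; keep matches >= W.
    Building the sorted suffix list below copies slices and is only meant
    for illustration; a real suffix array avoids the extra space."""
    suffixes = sorted(s1[j:] for j in range(len(s1)))

    def lcp(a, b):  # length of the longest common prefix of a and b
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    hits = {}
    for i in range(len(s2)):
        q = s2[i:]
        j = bisect.bisect_left(suffixes, q)
        # the best match of q is against a lexicographic neighbor of q
        best = max((lcp(q, suffixes[nb]) for nb in (j - 1, j)
                    if 0 <= nb < len(suffixes)), default=0)
        if best >= W:
            hits[i] = best
    return hits
```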
Adding Reticulation Edges and Refining the Partitioning to Families. In this subsection we describe how to find the set of reticulation edges that is related to each family. In this stage we assume a given initial (organismal) tree and a set of d-homologous families. Each family induces a distance matrix. Our procedure greedily chooses one of the families and adds a new reticulation edge that is related to that family. In each such step the size of the set of reticulation edges that is related to one of the families is increased by one. We plot a graph of the improvement in the ME score after each such step. Such a graph can help biologists decide on the actual number of reticulation edges. As described in the simulation study, usually after adding the actual number of reticulation edges the improvement in the ME score is insignificant.
Given a tree topology (an organismal tree and a set of reticulation edges) and a set of sequences at its leaves (a family), we use least-squares estimation to calculate the edge lengths of the tree. This can be done in the time complexity of an n × n matrix inversion [22], i.e., less than O(n³). By using the more sophisticated method of [10], the least-squares estimation of the edge lengths of a given tree and distance matrix can be done in O(n²). After each stage of adding a reticulation edge we perform a stage of greedily adjusting the boundaries of the families (by extending or shrinking the boundaries of each subsequence in each family) while improving the ME criterion. Since each such stage improves the ME criterion, convergence to a local optimum is guaranteed. The time complexity of this stage is O((h² · f · n) · n²).
Total Time Complexity. Suppose the input includes n sequences of length ℓ, and the result includes h families, each with f reticulation edges. The total runtime complexity of our method is O(n² · ℓ² + ℓ · n² · n² + (h² · f · n) · n²) = O(n² · (ℓ² + n · h² · f + n² · ℓ)).
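The least-squares edge-length step can be illustrated with a short sketch (not the authors' implementation): build the standard path-incidence system over leaf pairs and solve it in the ordinary least-squares sense; the ME score is then the sum of the fitted edge lengths.

```python
import numpy as np

def me_score(parent, leaves, D):
    """Ordinary least-squares edge lengths for a fixed tree topology, and
    the resulting ME score (the sum of the fitted edge lengths).
    parent: dict mapping every non-root node to its parent (each non-root
            node identifies the edge to its parent); leaves: list of leaf
            nodes; D: dict of pairwise distances D[a, b] over leaf pairs."""
    edges = list(parent)
    col = {e: j for j, e in enumerate(edges)}

    def edges_to_root(v):  # set of edges on the path from v to the root
        path = set()
        while v in parent:
            path.add(v)
            v = parent[v]
        return path

    rows, dist = [], []
    for i, a in enumerate(leaves):
        for b in leaves[i + 1:]:
            row = np.zeros(len(edges))
            for e in edges_to_root(a) ^ edges_to_root(b):  # tree path a..b
                row[col[e]] = 1.0
            rows.append(row)
            dist.append(D[a, b])
    lengths, *_ = np.linalg.lstsq(np.array(rows), np.array(dist), rcond=None)
    return lengths.sum()
```

For a family assigned reticulation edges, the same routine would be applied to the induced tree T_f with the family's distance matrix supplying D.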
5
Experimental Results
For evaluating our methods we performed three tests. First, we applied our method to two biological datasets (bacterial rbcL proteins, and the plants' gene rps11) that underwent horizontal gene transfer. In the second test we simulated evolution that included HGT/recombination, rearrangements, and local point mutations, and our method was used for reconstructing the simulated evolution. Finally, we applied our method to two datasets of viral genomes. 5.1
Biological Inputs: Proteins and Genes
Proteins of Bacteria. The first input includes the rubisco gene rbcL of a group of 14 plastids, cyanobacteria, and proteobacteria, which were first analyzed by Delwiche and Palmer [6] (they and others suggest that it includes HGTs). This dataset consists of amino acid sequences; part of them are from Form I of rubisco, and the other six are from Form II of rubisco. We used exactly the same sequences that Delwiche and Palmer used in their paper. The species tree was based on information from the Ribosomal Database Project (http://rdp.life.uiuc.edu) and the work of [6]. We checked two distance matrices, PAM250 and BLOSUM62, both with gap penalty −8. Since this dataset includes a set of proteins, we constrained the families to be ordered (the NRGT problem). We checked various values of L and got similar results (due to lack of space, more details about the
results are deferred to the full version of this paper). We got similar results for the two distance matrices; this indicates that our method is robust to changes in the distance matrix. In general, our results support previous analyses of this dataset [6,4,15,14]. For example, we and previous methods discovered a reticulation edge between the α and β proteobacteria, and a reticulation edge between the proteobacteria and the plastid.
Genes of Plants. The second database includes the ribosomal protein gene rps11 of a group of 47 flowering plants, which was first analyzed by Bergthorsson et al. [3] (they and others suggest that this dataset includes partial HGT). The species tree was reconstructed based on various sources, including the work of [20] and [16]. We used exactly the same sequences that Bergthorsson et al. used in their paper. Due to lack of space, more details about the results are deferred to the full version of the paper. According to Bergthorsson et al., these species underwent chimeric HGT (i.e. partial HGT); this conjecture is supported by our results, which relate all the HGTs to the family in positions 150 through 300. In general, our HGTs suggest transfer of genetic material between Liliopsida and Dipsacales, Liliopsida and Papaveraceae, and Ranunculales and Dipsacales. The first two HGTs are similar to HGTs reported in previous works (for example, in [15]), while the third is new and suggests further biological research.
Simulated Data: Simulating HGT/Recombination, Rearrangements, and Local Point Mutations. Here we evaluate the accuracy of our method on simulated data. The data consists of sequences which have evolved through substitutions, insertions, deletions, and lateral transfers. We generated 20 data sets with 10 leaves and 20 data sets with 20 leaves using the following recipe (see figure 3). (1) The species tree was generated using a regular birth-death process from the Beep software package [2]. These trees are ultrametric with a root-to-leaf distance of 1. (2) Three transfer trees were independently created from the species tree by applying two random lateral transfers. Each transfer event was chosen to occur at time t ∈ [0, 1] with probability
P(t) = (# concurrent lineages at time t) / ∫₀¹ (# concurrent lineages at time r) dr,
i.e., the probability increases linearly with the number of concurrent lineages. Once t was selected, the transfer was selected uniformly at random from the possible transfer events at time t. (3) Species sequences, the sequences which have evolved according to the species tree, of expected length 4000 (similar to the typical length of a viral genome, which is a few kbp) were generated using the ROSE [24] software package. Each nucleotide evolved according to the Jukes-Cantor model with a substitution probability of 0.2 from the root down to any leaf. Moreover, at each nucleotide, insertions and deletions of up to 7 nucleotides⁵ occurred with probability 0.01 from the root down to any leaf. (4) Transfer blocks, the sequences which have evolved according to the transfer trees, of expected
⁵ The standard insertion and deletion functions in ROSE were used.
length 500 were generated using the same process as for the species sequences. (The typical length of a gene is a few hundred nucleotides, and usually complete genes are horizontally transferred [6]. In the case of partial HGT or recombination, the lengths are on the order of magnitude of at least half a gene [17,3], i.e. a few hundred bp. Thus, we transfer blocks of similar size.) (5) The combined sequences, the sequences containing both the species sequences and the transfer blocks, were created by inserting the transfer blocks uniformly at random into the species sequences such that no evolutionary block in the sequences was shorter than 500.
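For concreteness, step (2)'s lineage-weighted choice of a transfer time can be sketched as follows (an illustrative reimplementation, not the simulation code used in the paper); lineage_count is a hypothetical callback returning the number of lineages alive at a given time:

```python
import numpy as np

def sample_transfer_time(lineage_count, grid=1000, rng=None):
    """Draw t in [0, 1) with probability proportional to the number of
    concurrent lineages at time t, by discretizing the interval."""
    rng = rng or np.random.default_rng()
    ts = np.linspace(0.0, 1.0, grid, endpoint=False)
    w = np.array([lineage_count(t) for t in ts], dtype=float)
    return rng.choice(ts, p=w / w.sum())
```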
Fig. 3. Illustration of our simulation. We generated synthetic data by the following steps: (a) Generate a random tree. (b) Add three random reticulation edges to the tree. (c) Evolve sequences along the trees; most of the positions evolve according to the organismal tree, while three blocks evolve according to the organismal tree plus a subset of the reticulation edges. (d) Randomly rearrange the blocks in each of the leaves.
We ran our algorithm with 380 < L < 600 and with W = 15 (the results for 12 ≤ W ≤ 18 were similar). For each of the 20 datasets of each size, there are 3 blocks of length about 500 that were transferred (while the rest of the sequences evolved along the original tree). Thus, there were 7 families for each dataset, with a total of 140 families (for both the 10 and 20 leaf test sizes). Moreover, each transferred family had been affected by two HGT events. Thus, there was a total of 120 HGT events (for both the 10 and 20 leaf test sizes). The results were similar for the two dataset sizes, though the results for the 20 leaf datasets were slightly better. Due to lack of space we describe only the results of the 10 leaf datasets, while the results for the 20 leaf datasets are deferred to the full version of the paper. Out of the 140 families, our algorithm did not completely miss any family. Only four families were shifted: three by 300 positions and one by 200 positions. Our algorithm identified 93 of the total 120 HGT events. Four of the edges were identified but with reversed direction. Only 23 edges were different from the original edges, and in these cases the edges our algorithm found were very close to the original ones. According to our results, the accuracy of the algorithm improves as the number of leaves increases. One important goal of the method is its ability to infer the right number of HGT events. According to the results, our method performed very well in achieving this goal. Usually, after adding the correct number of reticulation edges, the improvement in the score is negligible. This is a major advantage compared to methods such as MP or ML with independent sites [15,14], where there is usually a less clear change in the slope of the score graph.
Genomes of viruses. Our last datasets include complete genomes of two RNA viruses: one of HIV, the other of Hepatitis C. We checked our method on these two typical inputs. The genomes were downloaded from [19], and each dataset included 10 genomes. We used our method to check whether the datasets include HGTs/recombinations and/or rearrangements. For the HIV dataset, our method found neither HGT events nor rearrangement events. In the case of Hepatitis C we found two possible reticulation edges that may suggest ancient recombination or horizontal gene transfer events. Due to lack of space, more details about the virus datasets and results are deferred to the full version of the paper.
6
Concluding Remarks and Further Research
In general, genomic material evolves through local point mutations (insertion, deletion, substitution), genome rearrangements, horizontal gene transfers, recombinations, duplications, and deletions. This work is a step towards developing a method for inferring evolution under all these types of operations, and it is mainly a proof of concept. We showed that our method, which is based on the ME criterion, is useful for inferring partial or complete HGT events, and can infer rearrangements together with HGTs or recombinations. One work on this new topic is clearly not enough for solving all the problems. Further research in this direction will include: extending the set of operations to include duplications, deletions, and inversions; developing a more sophisticated simulator of virus evolution; investigating the hardness of NRGT (in this work we proved the hardness of RGT and RNT); and improving the running time of our heuristic. We are currently aiming at using our approach for exploring the evolution of various groups of viruses and bacteria.
Acknowledgment We thank Prof. Benny Chor for helpful discussions. T.T. was supported by the Edmond J. Safra Bioinformatics program at Tel Aviv University.
References
1. Addario-Berry, L., Hallett, M., Lagergren, J.: Towards identifying lateral gene transfer events. In: PSB03, pp. 279–290 (2003)
2. Arvestad, L., Berglund, A., Lagergren, J., Sennblad, B.: Beep software package (2006)
3. Bergthorsson, U., Adams, K., Thomason, B., Palmer, J.: Widespread horizontal transfer of mitochondrial genes in flowering plants. Nature 424, 197–201 (2003)
4. Boc, A., Makarenkov, V.: New efficient algorithm for detection of horizontal gene transfer events. In: Benson, G., Page, R.D.M. (eds.) WABI 2003. LNCS (LNBI), vol. 2812, pp. 190–201. Springer, Heidelberg (2003)
5. Burkhardt, S., Kärkkäinen, J.: Fast lightweight suffix array construction and checking. In: Baeza-Yates, R.A., Chávez, E., Crochemore, M. (eds.) CPM 2003. LNCS, vol. 2676, pp. 55–69. Springer, Heidelberg (2003)
6. Delwiche, C., Palmer, J.: Rampant horizontal transfer and duplication of rubisco genes in eubacteria and plastids. Mol. Biol. Evol. 13(6) (1996)
7. Desper, R., Gascuel, O.: Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle. J. Comp. Biol. 9(5), 687–705 (2002)
8. Doolittle, W.F., Boucher, Y., Nesbo, C.L., Douady, C.J., Andersson, J.O., Roger, A.J.: How big is the iceberg of which organellar genes in nuclear genomes are but the tip? Phil. Trans. R. Soc. Lond. B Biol. Sci. 358, 39–57 (2003)
9. Elias, I.: Settling the intractability of multiple alignment. In: Ibaraki, T., Katoh, N., Ono, H. (eds.) ISAAC 2003. LNCS, vol. 2906, pp. 352–363. Springer, Heidelberg (2003)
10. Gascuel, O.: Concerning the NJ algorithm and its unweighted version UNJ (1997)
11. Goldstein, A., Kolman, P., Zheng, J.: Minimum common string partition problem: Hardness and approximations. In: Fleischer, R., Trippen, G. (eds.) ISAAC 2004. LNCS, vol. 3341, pp. 484–495. Springer, Heidelberg (2004)
12. Hein, J.: A heuristic method to reconstruct the history of sequences subject to recombination. J. Mol. Evol. 36, 396–405 (1993)
13. Huson, D.H., Bryant, D.: Application of phylogenetic networks in evolutionary studies. Mol. Biol. Evol. 23(2), 254–267 (2006)
14. Jin, G., Nakhleh, L., Snir, S., Tuller, T.: Maximum likelihood of phylogenetic networks. Bioinformatics 22(21), 2604–2611 (2006)
15. Jin, G., Nakhleh, L., Snir, S., Tuller, T.: Efficient parsimony-based methods for phylogenetic network reconstruction. Bioinformatics 23(2), 123–128 (2007)
16. Judd, W.S., Olmstead, R.G.: A survey of tricolpate (eudicot) phylogenetic relationships. American Journal of Botany 91, 1627–1644 (2004)
17. Kalinina, O., Norder, H., Magnius, L.O.: Full-length open reading frame of a recombinant hepatitis C virus strain from St Petersburg: proposed mechanism for its formation. J. Gen. Virol. 85, 1853–1857 (2004)
18. Kidd, K.K., Sgaramella-Zonta, L.A.: Phylogenetic analysis: concepts and methods. Am. J. Hum. Genet. 23(3), 235–252 (1971)
19. Kuiken, C., Yusim, K., Boykin, L., Richardson, R.: The Los Alamos HCV sequence database. Bioinformatics 21(3), 379–384 (2005)
20. Michelangeli, F.A., Davis, J.I., Stevenson, D.W.: Phylogenetic relationships among Poaceae and related families as inferred from morphology, inversions in the plastid genome, and sequence data from mitochondrial and plastid genomes. American Journal of Botany 90, 93–106 (2003)
21. Rzhetsky, A., Nei, M.: Theoretical foundation of the minimum-evolution method of phylogenetic inference. Mol. Biol. Evol. 10, 1073–1095 (1993)
22. Saitou, N., Nei, M.: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4(4), 406–425 (1987)
23. Sinkovics, J., Horvath, J., Horak, A.: The origin and evolution of viruses (a review). Acta Microbiol. Immunol. Hung. 45(3-4), 349–390 (1998)
24. Stoye, J., Evers, D., Meyer, F.: Rose: generating sequence families. Bioinformatics 14, 157–163 (1998)
25. Strimmer, K., Moulton, V.: Likelihood analysis of phylogenetic networks using directed graphical models. Mol. Biol. Evol. 17(6), 875–881 (2000)
26. Ulitsky, I., Burstein, D., Tuller, T., Chor, B.: The average common substring approach to phylogenomic reconstruction. J. Comp. Biol. 13(2), 336–350 (2006)
An Ω(n²/log n) Speed-Up of TBR Heuristics for the Gene-Duplication Problem
Mukul S. Bansal and Oliver Eulenstein
Department of Computer Science, Iowa State University, Ames, IA, USA
{bansal,oeulenst}@cs.iastate.edu
Abstract. The gene-duplication problem is to infer a species supertree from gene trees that are confounded by complex histories of gene duplications. This problem is NP-hard and thus requires efficient and effective heuristics. Existing heuristics perform a stepwise search of the tree space, where each step is guided by an exact solution to an instance of a local search problem. We improve on the time complexity of the local search problem by a factor of n²/log n, where n is the size of the resulting species supertree. Typically, several thousand instances of the local search problem are solved throughout a stepwise heuristic search. Hence, our improvement makes the gene-duplication problem much more tractable for large-scale phylogenetic analyses.
1
Introduction
An abundance of potential information for phylogenetic analyses is provided by the rapidly increasing amount of available genomic sequence information. Most phylogenetic analyses combine genomic sequences from presumably orthologous loci, or loci whose homology is the result of speciation, into gene trees. These analyses largely have to neglect the vast amounts of sequence information in which gene duplication generates gene trees that differ from the actual species tree. Phylogenetic information from such gene trees can be utilized through a species tree obtained by solving the gene-duplication problem [1]. This problem is a type of supertree problem, that is, assembling from a set of gene trees a supertree that contains all species found in at least one of the input trees. The decision version of the gene-duplication problem is NP-complete [2]. Existing heuristics aimed at solving the gene-duplication problem search the space of all possible supertrees, guided by a series of exact solutions to instances of a local search problem [3]. The gene-duplication problem has shown much potential for building phylogenetic species trees for snakes [4], vertebrates [5,6], Drosophila [7], and plants [8]. Yet, the computation time of the local search problems which are solved by existing heuristics has largely limited the size of such studies. Throughout the current section n denotes the number of leaves in the resulting species tree, and, for brevity in stating time complexities, gene trees and the resulting species tree are assumed to have similar sizes.
This research was supported in part by NSF grant no. 0334832.
We improve on the best existing solution for a particular local search problem, the TBR local search problem, by a factor of n^2/log n. Heuristics solving the TBR local search problem, TBR heuristics, were rarely applied in practice due to their inefficient running times. Our method greatly improves the speed of TBR-based heuristics for the gene-duplication problem and makes it possible to infer larger supertrees that were previously difficult, if not impossible, to compute. For convenience, we use the term "tree" to refer to a rooted and fully binary tree. The terms "leaf-gene" and "leaf-species" refer to a gene or species that is represented by a leaf of a gene or species tree, respectively, throughout this work unless otherwise stated. Previous Results: The gene-duplication problem is based on the Gene Duplication model from Goodman et al. [9]. In the following, we (i) describe the Gene Duplication model, (ii) formulate the gene-duplication problem, and (iii) describe a heuristic approach of choice [3] to solve the gene-duplication problem. Gene Duplication model: The Gene Duplication (GD) model [1,10,11,12,13,14,15,16] explains incompatibilities between a pair of "comparable" gene and species trees through gene duplications. A gene tree and a species tree are comparable if a sample mapping, called a leaf-mapping, exists that maps every leaf-gene to the leaf-species from which it was sampled. Figure 1 depicts an example. Gene tree G is inferred from the leaf-genes that were sampled from the leaf-species of the species tree described by the leaf-mapping. However, both trees describe incompatible evolutionary histories. The GD model explains such incompatibilities by reconciling the gene tree with postulated gene duplications. For example, in Figure 1 a reconciled gene tree R can be theoretically inferred from the species tree S by duplicating a gene x in species X into the copies x′ and x″ and letting both copies speciate according to the topology of S. In this case, the gene tree can be embedded into the reconciled tree. Thus, the gene tree can be reconciled by using the duplication of gene x to explain the incompatibility. The gene duplications that are necessary under the GD model to reconcile the gene tree can be described by the mapping M, which is an extension of the given leaf-mapping. M maps every gene in the gene tree to the most recent species in the species tree
Fig. 1. (a) Gene tree G and species tree S are comparable, as the mapping from the leaf-genes to the leaf-species indicates. M is the lca-mapping from G to S. (b) R is the reconciled tree for G and S. In species X of R, gene x duplicates into the genes x′ and x″. The solid lines in R represent the embedding of G into R.
that could have contained the gene (i.e., their least common ancestor). A gene in the gene tree is a (gene) duplication if it has a child with the same mapping under M. In Figure 1 gene h and its child t map under the mapping M to the same species X. The reconciliation cost for a gene tree and a comparable species tree is measured in the number of duplications in the gene tree induced by the species tree. The reconciliation cost for a given set of gene trees and a species tree is the sum of the reconciliation costs for every gene tree in the set and the species tree. The reconciliation cost is computable in linear time [13,17,18]. Gene-duplication problem and heuristic: The gene-duplication problem is to find, for a given set of gene trees, a comparable species tree with the minimum reconciliation cost. The decision variant of this problem and some of its characterizations are NP-complete [2,19], while some parameterizations are fixed-parameter tractable [20,21]. However, GeneTree [3], an implementation of a standard local search heuristic for the gene-duplication problem, was used to show that the gene-duplication problem can be an effective approach. Therefore, in practice, heuristics are commonly applied to solve the gene-duplication problem, even if they are unable to guarantee an optimal solution. While the local search heuristic for the gene-duplication problem performs reasonably well on smaller instances, it does not allow the computation of larger species supertrees. In this heuristic, a tree graph is defined for the given set of gene trees and some, typically symmetric, tree edit operation. The nodes in the tree graph are the species trees which are comparable with every given gene tree. An edge adjoins two nodes exactly if the corresponding trees can be transformed into each other by the tree edit operation. The reconciliation cost of a node in the graph is the reconciliation cost of the species tree represented by that node and the given gene trees. Given a starting node in the tree graph, the heuristic's task is to find a maximal-length path of steepest descent in the reconciliation cost of its nodes and to return the last node on such a path. This path is found by solving the local search problem for every node along the path. The local search problem is to find a node with the minimum reconciliation cost in the neighborhood (all adjacent nodes) of a given node. The neighborhood searched depends on the edit operation. Edit operations of interest are rooted subtree pruning and regrafting (SPR) [22,23,24] and rooted tree bisection and reconnection (TBR) [22,23,25]. We defer the definition of these operations to Section 2. The best known run times for the SPR and TBR local search problems are O(kn^2) [26] and O(kn^4) (naive solution) respectively, where k is the number of input gene trees. Our Contribution: The efficient solution for the SPR local search problem makes SPR-based heuristics suitable for large-scale phylogenetic analyses. Currently, TBR-based heuristics are not applicable for phylogenetic analyses because no efficient solution is known for the TBR local search problem. However, TBR-based heuristics are more desirable because they significantly extend the search space explored at each local search step. In particular, TBR heuristics search a neighborhood of Θ(n^3) nodes, including the Θ(n^2) nodes of the SPR neighborhood, at each local search step. Our contribution is an O(kn^2 log n) algorithm
for the TBR local search problem. This makes TBR heuristics almost as efficient as SPR heuristics for large-scale phylogenetic analyses.
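Before moving on to the formal definitions, the following Python sketch makes the GD model concrete: it computes the extension M of the leaf-mapping and the resulting number of duplications for one gene tree. The tree class, the helper names, and the naive ancestor-walk LCA are our own illustrative choices, not the paper's; the linear-time bound cited above instead relies on constant-time LCA queries [17,18]. Summing duplication_cost over all input gene trees gives the reconciliation cost that the heuristics minimize.

```python
# A minimal sketch (our notation): compute the lca-mapping M_{G,S} and
# the duplication count Delta(G, S) for one gene tree G and species tree S.
# The naive ancestor-walk LCA keeps the sketch short; a real implementation
# would use an O(1)-query LCA structure to meet the linear-time bound.

class Node:
    def __init__(self, label=None, children=()):
        self.label = label                 # leaf-gene / leaf-species name
        self.children = list(children)
        self.parent = None
        for c in self.children:
            c.parent = self

def leaves(t):
    return [t] if not t.children else [x for c in t.children for x in leaves(c)]

def postorder(t):
    for c in t.children:
        yield from postorder(c)
    yield t

def depth(v):
    d = 0
    while v.parent:
        v, d = v.parent, d + 1
    return d

def lca(u, v):
    du, dv = depth(u), depth(v)
    while du > dv: u, du = u.parent, du - 1
    while dv > du: v, dv = v.parent, dv - 1
    while u is not v: u, v = u.parent, v.parent
    return u

def duplication_cost(G, S):
    """Delta(G, S): number of duplication nodes of G induced by S."""
    species = {l.label: l for l in leaves(S)}   # the leaf-mapping L_{G,S}
    M, dups = {}, 0
    for g in postorder(G):
        if not g.children:
            M[g] = species[g.label]
        else:
            u, v = g.children               # trees are fully binary
            M[g] = lca(M[u], M[v])
            dups += any(M[c] is M[g] for c in g.children)
    return dups
```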
2 Basic Definitions, Notation, and Preliminaries
In this section we first introduce basic definitions and notation and then define preliminaries required for this work.
2.1 Basic Definitions and Notation
A tree T is a connected graph with no cycles, consisting of a node set V(T) and an edge set E(T). The nodes in V(T) of degree at most one are called leaves and denoted by Le(T). A node in V(T) that is not a leaf is called an internal node. T is rooted if it has exactly one distinguished node, called the root, which we denote by Ro(T). Let T be a rooted tree. For any pair of nodes x, y ∈ V(T) where y is on a path from Ro(T) to x we call (i) y an ancestor of x, and (ii) x a descendant of y. If {y, x} ∈ E(T) then we call y the parent of x, denoted by Pa(x), and we call x a child of y. We write (y, x) to denote the edge {y, x} where y = Pa(x). The set of all children of y is denoted by Ch(y). If two nodes in T have the same parent, they are called siblings. T is (fully) binary if every internal node has exactly two children. A subtree of T rooted at node x ∈ V(T), denoted by T_x, is the tree induced by x and all its descendants. The depth of a node x ∈ V(T) is the number of edges on the path from Ro(T) to x. The least common ancestor of a non-empty subset L ⊆ V(T), denoted by lca(L), is the common ancestor of all nodes in L with maximum depth.
2.2 The Gene Duplication Problem
We now introduce the definitions necessary to state the gene duplication problem. A species tree is a tree that depicts the evolutionary relationships of a set of species. Given a gene family for a set of species, a gene tree is a tree that depicts the evolutionary relationships among the sequences encoding only that gene family in the given species. Thus the nodes in a gene tree represent genes. In order to compare a gene tree G with a species tree S, a mapping from each gene g ∈ V(G) to the most recent species in S that could have contained g is required. Definition 1 (Mapping). The leaf-mapping L_{G,S}: Le(G) → Le(S) specifies the species L_{G,S}(g) from which gene g was sampled. The extension of L_{G,S} to M_{G,S}: V(G) → V(S) is the mapping where M_{G,S}(g) = L_{G,S}(g) if g ∈ Le(G), and M_{G,S}(g) = lca(M_{G,S}(Le(G_g))) otherwise. Definition 2 (Comparability). The trees G and S are comparable if there exists a leaf-mapping L_{G,S}. A set of gene trees G and S are comparable if each gene tree in G is comparable with S. Let G and S be comparable trees for the remainder of this section.
Definition 3 (Duplication). A node v ∈ V(G) is a (gene) duplication if M_{G,S}(v) = M_{G,S}(u) for some u ∈ Ch(v), and we define Dup(G, S) = {g ∈ V(G) : g is a duplication}. Definition 4 (Reconciliation cost). We define reconciliation costs for gene and species trees as follows: 1. Δ(G, S) = |Dup(G, S)| is the reconciliation cost from G to S. 2. Δ(G, S) = Σ_{G∈G} Δ(G, S) is the reconciliation cost from the set of gene trees G to S. 3. Let T be the set of species trees that are comparable with G. We define Δ(G) = min_{S∈T} Δ(G, S) to be the reconciliation cost of G. Problem 1 (Duplication). Instance: A set G of gene trees. Find: A species tree S* such that Δ(G, S*) = Δ(G).
2.3 Local Search Problems
Here we first provide definitions for the TBR [25] and SPR [24] edit operations and then formulate the related local search problems that were motivated in the Introduction. Definition 5 (RR operation). Let T be a tree and x ∈ V(T). RR(T, x) is defined to be the tree T if x = Ro(T). Otherwise, RR(T, x) is the tree that is obtained from T by (i) suppressing Ro(T), and (ii) subdividing the edge {Pa(x), x} by a new root node. We define the following extension: RR(T) = ∪_{x∈V(T)} {RR(T, x)}. Definition 6 (TBR operation). For technical reasons we first define for a tree T the planted tree P(T), that is, the tree obtained by adding an additional edge, called the root edge, {u, Ro(T)} to T. Let T be a tree, e = (u, v) ∈ E(T), and X, Y be the connected components that are obtained by removing edge e from T, where v ∈ X and u ∈ Y. We define TBR_T(v, x, y) for x ∈ X and y ∈ Y to be the tree that is obtained from P(T) by first removing edge e, then replacing the component X by RR(X, x), and then adjoining a new edge f between x′ = Ro(RR(X, x)) and Y as follows: 1. Create a new node y′ that subdivides the edge (Pa(y), y). 2. Adjoin the edge f between nodes x′ and y′. 3. Suppress the node u, and rename x′ as v and y′ as u. We say that the tree TBR_T(v, x, y) is obtained from T by a tree bisection and reconnection (TBR) operation that bisects the tree T into the components X, Y and reconnects them above the nodes x, y. We define the following extensions for the TBR operation: 1. TBR_T(v, x) = ∪_{y∈Y} {TBR_T(v, x, y)} 2. TBR_T(v) = ∪_{x∈X} TBR_T(v, x) 3. TBR_T = ∪_{(u,v)∈E(T)} TBR_T(v)
An SPR operation for a given tree T can be briefly described through the following three steps: (i) prune some subtree P from T, (ii) add a root edge to the remaining tree, (iii) regraft P into an edge of the remaining tree. For our purposes we define the SPR operation as a special case of the TBR operation. Definition 7 (SPR operation). Let T be a tree, e = (u, v) ∈ E(T), and X, Y be the connected components that are obtained by removing edge e from T, where v ∈ X and u ∈ Y. We define SPR_T(v, y) for y ∈ Y to be the tree TBR_T(v, v, y). We say that the tree SPR_T(v, y) is obtained from T by a subtree prune and regraft (SPR) operation that prunes subtree T_v and regrafts it above node y. We define the following extensions of the SPR operation: 1. SPR_T(v) = ∪_{y∈Y} {SPR_T(v, y)} 2. SPR_T = ∪_{(u,v)∈E(T)} SPR_T(v) Problem 2 (TBR-Scoring (TBR-S)). Instance: A set of gene trees G, and a comparable species tree S. Find: A tree T* ∈ TBR_S such that Δ(G, T*) = min_{T∈TBR_S} Δ(G, T). Problem 3 (TBR-Restricted Scoring (TBR-RS)). Instance: A triple (G, S, v), where G is a set of gene trees, S is a comparable species tree, and (u, v) ∈ E(S). Find: A tree T* ∈ TBR_S(v) such that Δ(G, T*) = min_{T∈TBR_S(v)} Δ(G, T). The problems SPR-Scoring (SPR-S) and SPR-Restricted Scoring (SPR-RS) are defined analogously to the problems TBR-S and TBR-RS, respectively. Throughout this paper we use the following terminology: (i) G is a set of gene trees, (ii) S denotes a comparable species tree, (iii) r = Ro(S), (iv) P denotes a proper (pruned) subtree of S, and (v) v = Ro(P).
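As an illustration of Definition 5, the sketch below enumerates the set RR(T) for a tree written as nested pairs (a leaf is a string, an internal node a 2-tuple). The representation and names are ours; since rooted trees are unordered, a canonical form is used to discard duplicate rootings (in particular, RR(T, x) = T when x is the root or one of its children).

```python
# A sketch of the extension RR(T) of Definition 5 on nested-pair trees
# (our representation). Rooted trees are treated as unordered, so a
# canonical form weeds out duplicate rootings.

def canon(t):
    if not isinstance(t, tuple):
        return t
    a, b = canon(t[0]), canon(t[1])
    return (a, b) if repr(a) <= repr(b) else (b, a)

def rootings(t):
    """All distinct trees RR(t, x), one per edge after suppressing Ro(t)."""
    seen = {}
    def add(tree):
        seen.setdefault(repr(canon(tree)), tree)
    def walk(sub, rest):
        add((sub, rest))        # new root subdividing sub's parent edge
        if isinstance(sub, tuple):
            a, b = sub
            walk(a, (b, rest))
            walk(b, (a, rest))
    if isinstance(t, tuple):
        walk(t[0], t[1])
        walk(t[1], t[0])
    else:
        add(t)
    return list(seen.values())

# rootings((('a','b'),'c')) yields the 3 rootings of this 3-leaf tree
```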
3 Solving the TBR-S Problem
In this section we study the TBR-S problem in more detail. First, we show how the algorithm developed by Bansal et al. [26] to solve the SPR-RS problem can be slightly modified to solve the TBR-S problem. This already improves the running time of the existing solution considerably. Second, we show how the inherent structure of the TBR-S problem can be used to further improve the running time. To do this, we define the "BestRooting" problem and show how an efficient solution for this problem leads to an efficient solution for the TBR-S problem.
3.1 Relating Scores of TBR and SPR Neighborhoods
The following algorithm, Alg-SPR-RS, is a brief restatement of the algorithm presented in [26] to solve the SPR-RS instance (G, S, v) efficiently. Algorithm Alg-SPR-RS 1. Prune P from S, regraft P above node r, and let (P) denote the resulting tree. Compute the reconciliation cost of (P).
2. Compute the difference between the reconciliation cost of each tree in SPR_S(v) and that of (P). This gives the reconciliation cost of each tree in SPR_S(v). Observe that SPR_S(v) = TBR_S(v, v). In fact, Alg-SPR-RS can be modified to efficiently compute the reconciliation costs of all trees in TBR_S(v, x) for any node x ∈ V(P). To do this, we simply modify Step 1 of Alg-SPR-RS as follows: 1. Prune P from S, re-root P to obtain P′ = RR(P, x), and regraft P′ above node r to obtain (P′). Compute the reconciliation cost of (P′). Note that this modification does not change the algorithm's complexity. Observation 1. The TBR-RS problem on (G, S, v) can be solved by computing the reconciliation cost of each tree in TBR_S(v, x), for all x ∈ V(P). The TBR-S problem in turn can be solved by solving the TBR-RS problem |V(S)| − 1 times. Let us assume, for convenience, similar gene tree and species tree sizes. It is known that the SPR-RS problem is solvable in O(kn) time [26], where k = |G|. Based on Observation 1 and the modification described above, the TBR-S problem can then be solved in O(kn^3) time. This already gives us a speed-up of Θ(n) over known algorithms for this problem. We will show how to solve the TBR-S problem in O(kn^2 log n) time. This gives a speed-up of Θ(n^2/log n) over existing algorithms. Also, it should be noted that the correctness and efficiency of our algorithm do not depend on the simplifying assumption of similar gene and species tree sizes. It is interesting to note that the size of the set TBR_S is Θ(n^3). Thus, for one gene tree, the time complexity of computing and enumerating the reconciliation costs of all trees in TBR_S is Ω(n^3). However, to solve the TBR-S problem one is only interested in finding a tree with the minimum reconciliation cost. This lets us solve the TBR-S problem in time that is sub-linear in the size of TBR_S, and obtain a time complexity of O(n^2 log n) for the TBR-S problem. In fact, after the initial O(n^2 log n) preprocessing step, our algorithm can output the reconciliation cost of any tree in TBR_S in O(1) time.
3.2 Relating TBR-RS with SPR-RS
To obtain our speed-up, we concentrate on improving the complexity of solving the TBR-RS problem. To do this, we take a closer look at Step 2 of Alg-SPR-RS. This part of the algorithm computes the difference in reconciliation cost between each tree in SPR_S(v) and the tree (P). To compute this difference, the algorithm considers only the leaf set of P, and not its topology. This means that the difference values would be the same if P were replaced by any tree P′ ∈ RR(P). Based on this observation, we have the following theorem. In the interest of brevity, this theorem is stated here without proof. Theorem 1. Let x′, x″ ∈ V(P), and y′, y″ ∈ V(S) \ (V(P) ∪ {r}). Let T1 = TBR_S(v, x′, y′), T2 = TBR_S(v, x′, y″), and T3 = TBR_S(v, x″, y′), T4 = TBR_S(v, x″, y″). Then, Δ(G, T1) − Δ(G, T2) = Δ(G, T3) − Δ(G, T4).
Corollary 1. To obtain the reconciliation cost of each tree in TBR_S(v), it is sufficient to compute the reconciliation cost of (P′) for each P′ ∈ RR(P), and then to perform Step 2 of Alg-SPR-RS starting with any one (P′), P′ ∈ RR(P). This is because the output of Step 2 of Alg-SPR-RS will be the same for all (P′) where P′ ∈ RR(P). To solve the TBR-RS problem it is sufficient to find one tree in TBR_S(v) with minimum reconciliation cost. Based on Alg-SPR-RS and Corollary 1 we have the following theorem. Theorem 2. Let T1 be a tree with minimum reconciliation cost in TBR_S(v). Consider a tree P′ ∈ RR(P) for which (P′) has minimum reconciliation cost, and let P′ = RR(P, x). Then, there exists a tree T2 ∈ TBR_S(v, x) such that Δ(G, T1) = Δ(G, T2). In other words, to obtain a solution for the TBR-RS problem on instance (G, S, v), it is sufficient to obtain the reconciliation costs of only the trees in TBR_S(v, x), where P′ = RR(P, x) is such that (P′) has the minimum reconciliation cost. Based on Corollary 1 and Theorem 2 we have the following corollary. Corollary 2. The minimum reconciliation cost of a tree in TBR_S(v) can be obtained by performing Step 2 of Alg-SPR-RS starting with (P′), where P′ ∈ RR(P) is such that (P′) has minimum reconciliation cost. Problem 4 (BestRooting (BR)). Instance: A set of gene trees G, a comparable species tree S, and a proper subtree P of S. Find: A tree P′ ∈ RR(P) for which Δ(G, (P′)) is minimum. Thus, based on Observation 1, Theorems 1 and 2, and Corollaries 1 and 2, an efficient solution to the BR problem leads naturally to an efficient solution for the TBR-S problem. The remainder of this paper deals mostly with our solution of the BR problem. In the next section we take a closer look at the BR problem and study some of its structural properties.
4 Structural Properties of the BR Problem
Our solution of the BR problem for a set of input gene trees involves computing the reconciliation cost of (P′), where P′ ∈ RR(P), for each gene tree separately, and then combining the results to obtain the final solution. The solution of the BR problem is then easily obtained by picking the P′ ∈ RR(P) for which the sum of the reconciliation costs over the gene trees is minimum. Therefore, in the remainder of this section we assume that there is only one input gene tree G for the BR problem. Thus, the problem to be solved is the following: Problem 5 (Rooting). Instance: A triple (G, S, P), where G is a gene tree, S a comparable species tree, and P a proper subtree of S. Find: The reconciliation cost Δ(G, (P′)) for each P′ ∈ RR(P).
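A brute-force rendering of the Rooting problem may help fix ideas. The sketch below (names and nested-pair representation are ours) forms (P′) by pairing each rooting P′ of P with the remainder of S and evaluates Δ(G, (P′)) directly, identifying species-tree nodes by their clusters (leaf sets); it is only a specification of the output that Alg-RCT of Section 5 computes far more efficiently.

```python
# A naive, runnable specification of the Rooting problem (our names):
# Delta(G, (P')) for each rooting P' of P, with (P') formed by pairing
# P' with 'rest', the tree left after pruning P from S.

def leafset(t):
    return frozenset([t]) if isinstance(t, str) else leafset(t[0]) | leafset(t[1])

def clusters(t, out=None):
    out = set() if out is None else out
    out.add(leafset(t))
    if isinstance(t, tuple):
        clusters(t[0], out)
        clusters(t[1], out)
    return out

def dup_cost(G, S):
    """Delta(G, S) via the smallest cluster of S containing each gene's leaves."""
    cl = clusters(S)
    def M(g):
        need = leafset(g)
        return min((c for c in cl if need <= c), key=len)
    def walk(g):
        if isinstance(g, str):
            return 0
        return int(any(M(c) == M(g) for c in g)) + walk(g[0]) + walk(g[1])
    return walk(G)

def rooting_costs(gene_trees, rest, P_rootings):
    """Map each rooting P' of P to Delta(G, (P')), the BR objective."""
    return {Pp: sum(dup_cost(G, (Pp, rest)) for G in gene_trees)
            for Pp in P_rootings}
```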
To solve the Rooting problem we first calculate the reconciliation cost of (P). As P is re-rooted to form P′, the duplication status of some of the nodes of G may change, which changes the reconciliation cost. We show how to efficiently compute the difference between the reconciliation cost of (P) and the reconciliation cost of (P′) for each P′ ∈ RR(P). To realize this strategy it is imperative to study the change in the duplication status of nodes in the gene tree as P is re-rooted step by step. Lemma 1. The duplication status of any node g ∈ G for which M_{G,S}(g) ∉ V(P) remains the same for each (P′), P′ ∈ RR(P). Thus, under our strategy, we only need to consider those nodes in G that map to a node in V(P) under M_{G,S}. These are the nodes that are responsible for any difference between the reconciliation costs of (P) and (P′), where P′ ∈ RR(P). Definition 8. An internal node g ∈ V(G) is relevant if M_{G,S}(g) ∈ V(P). For the remainder of this section let g ∈ V(G) be relevant, and Ch(g) = {g′, g″}. Lemma 2. If M_{G,(P′)}(g) = M_{G,(P′)}(g′) = M_{G,(P′)}(g″) for some P′ ∈ RR(P), then g remains a duplication under M_{G,(P″)} for every P″ ∈ RR(P). Lemma 3. Let a = M_{G,(P)}(g). The duplication status of g under M_{G,(P)} is preserved under M_{G,(P′)}, where P′ = RR(P, x), for x ∈ V(P) \ (V(P_a) \ {a}). Lemma 4. Suppose g is not a duplication under M_{G,(P)}. Let b = M_{G,(P)}(g′) and c = M_{G,(P)}(g″). Then g is a duplication under M_{G,(P′)}, where P′ = RR(P, x), for x ∈ (V(P_b) \ {b}) ∪ (V(P_c) \ {c}); and g is not a duplication under M_{G,(P′)} for any other P′. Lemma 5. Let a = M_{G,(P)}(g) = M_{G,(P)}(g′), and b = M_{G,(P)}(g″). Let α denote the node closest to b along the path from a to b in (P) such that there exists a node v ∈ G_g with M_{G,(P)}(v) ∈ P_α. Then, (i) g is not a duplication under M_{G,(P′)}, where P′ = RR(P, x) and x is a node along the path from α to b, but not α, in (P), and (ii) g is a duplication under M_{G,(P′)} for any other P′.
5 Description of the Algorithm
We first design an efficient algorithm, called RootingCostTree (Alg-RCT), which solves the Rooting problem. Based on the lemmas seen in Section 4, we then show how this algorithm fits into our algorithm for solving the TBR-S problem. Finally, we analyze the complexity of our algorithm for solving the TBR-S problem.
5.1 Algorithm Alg-RCT(G, S, P)
The input for Alg-RCT is the instance (G, S, P) of the Rooting problem. The first step in the algorithm is to obtain the tree (P). The output is a node-weighted version of the tree P, with weight function W: V(P) → N_0, where W(s) = Δ(G, (P′)) for P′ = RR(P, s).
Initialization: Construct (P) and initialize two counters g(s) and l(s) with 0 for each node s ∈ V(P). Then, compute M_{G,(P)}. Create two empty sets, "start" and "end", at each node in P. Partially updating the values for g and l: For each relevant node g do the following. If g is not a duplication under M_{G,(P)}, then set g(M_{G,(P)}(c)) ← g(M_{G,(P)}(c)) + 1 for each c ∈ Ch(g). If g is a duplication with a = M_{G,(P)}(g) = M_{G,(P)}(u) and b = M_{G,(P)}(v), for Ch(g) = {u, v} and b ≠ a, then add u to the "start" set of node a and to the "end" set of node b. Fully updating the values for g and l: We now update the l and g values for those nodes that satisfy the condition of Lemma 5. Let us call these nodes "special". Following the notation of Lemma 5, the goal is to find the node α ∈ P for each special node from G. In the interest of brevity we only give a high-level idea of the algorithm for this step. An in-order labeling of G lets us store the subtree G_g for any special node g ∈ V(G) as an interval. These intervals can be stored in an interval tree, so that stabbing queries can be performed efficiently. We traverse P in post-order and, for each node, say x, we keep track of those nodes from the gene tree that might have a descendant mapping to x and for which α can be deduced from x. This is done by making use of the "start" and "end" sets established in the previous step. This "currently active" set of nodes (intervals) is maintained dynamically in the interval tree. Suitably querying the interval tree allows us to obtain those special nodes for which the α nodes can be deduced easily from x. This step can be shown to run in time O((|V(P)| + |V(G)|) log(|V(P)| + |V(G)|)). Computing the node weights: The output tree is initialized to be a copy of P and its node weights are set to 0. Set d ← Δ(G, (P)). For each node s in a preorder traversal of the tree P, we calculate the weight of that node as follows: if s ∈ Ch(Ro(P)) then W(s) ← d; otherwise, set W(s) ← W(Pa(s)) + g(Pa(s)) − l(s). Note: The value g(s) represents the number of additional nodes from G that become duplications when P′ = RR(P, s) is re-rooted to form P″ = RR(P, t), t ∈ Ch(s). The value l(s) represents the number of nodes from G that lose their duplication status when P′ = RR(P, Pa(s)) is re-rooted to form P″ = RR(P, s).
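The preorder weight propagation just described is compact enough to state directly. The sketch below (our names; the counters g, l and the cost d carry toy values in the usage check) fills in W exactly by the rule above.

```python
# A runnable sketch of the last step of Alg-RCT: W(s) = d on the children
# of the root, and W(s) = W(Pa(s)) + g(Pa(s)) - l(s) elsewhere, filled in
# by a preorder traversal. The tree is a child-map; g, l, d stand for the
# quantities produced by the earlier steps (toy values here).

def propagate_weights(children, root, g, l, d):
    W = {}
    stack = [(c, root) for c in children.get(root, ())]
    while stack:                               # preorder traversal
        s, p = stack.pop()
        W[s] = d if p == root else W[p] + g[p] - l[s]
        stack.extend((c, s) for c in children.get(s, ()))
    return W

children = {'r': ['a', 'b'], 'b': ['c', 'd']}
g = {'r': 0, 'a': 0, 'b': 1, 'c': 0, 'd': 0}
l = {'a': 0, 'b': 0, 'c': 1, 'd': 0}
assert propagate_weights(children, 'r', g, l, d=3) == \
       {'a': 3, 'b': 3, 'c': 3, 'd': 4}
```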
5.2 Algorithm Alg-TBR(G, S, P)
This algorithm solves the TBR-S problem. The algorithm is as follows: we first use Algorithm Alg-RCT to solve the BR problem as shown in Section 4. A solution to the BR problem leads naturally to a solution for the TBR-S problem (see Observation 1, Theorems 1 and 2, and Corollaries 1 and 2).
5.3 Correctness and Complexity
To establish the correctness of our algorithm for the TBR-S problem, it is sufficient to show that the Rooting problem is correctly solved by Algorithm Alg-RCT. The correctness of algorithm Alg-RCT is based on Lemmas 1-5. For brevity, a detailed proof is omitted herein.
We first state the time complexity of Alg-RCT, and then derive the time complexity of algorithm Alg-TBR, which solves the TBR-S problem. Note that, to simplify our analysis, we assume that all G ∈ G have approximately the same size. The input for the BR problem is a gene tree G, a species tree S, and the pruned subtree P of S. Let n = |Le(S)| and k = |G|. Complexity of Alg-RCT(G, S, P): Let m = |Le(S)| + |Le(G)|. The overall time complexity of Alg-RCT(G, S, P) is bounded by O(m log m) (proof omitted for brevity). This implies that the complexity of the BR problem is O(km log m). Complexity of Alg-TBR(G, S, P): By Corollary 2, the time complexity of the TBR-RS problem is O(km) + O(km log m), which is O(km log m). The time complexity of Alg-TBR is thus O(n) × O(km log m), which is O(knm log m). The time complexity of the existing naive solution for the TBR-S problem is O(kn^3 m). Thus, our algorithm improves on the current solution by a factor of n^2/log m.
6 Outlook and Conclusion
Despite the inherent complexity of the duplication problem, it has been an effective approach for incorporating data from gene families into a phylogenetic inference [4, 5, 6, 7]. The duplication problem is typically approached by using local search heuristics. Among these, TBR heuristics are especially desirable for large-scale phylogenetic analyses, but current solutions have prohibitively large run times. Our algorithm offers a vast reduction in run time, which makes TBR heuristics applicable for such large-scale analyses. The ideas developed in this paper could possibly be applied to other problems related to the reconciliation of gene and species trees. For example, our solution for the rooting problem can be used to efficiently find an optimal rooting for any species tree, with respect to the given gene trees.
References 1. Guigó, R., Muchnik, I., Smith, T.F.: Reconstruction of ancient molecular phylogeny. Molecular Phylogenetics and Evolution 6(2), 189–213 (1996) 2. Ma, B., Li, M., Zhang, L.: On reconstructing species trees from gene trees in terms of duplications and losses. In: RECOMB, pp. 182–191 (1998) 3. Page, R.D.M.: GeneTree: comparing gene and species phylogenies using reconciled trees. Bioinformatics 14(9), 819–820 (1998) 4. Slowinski, J.B., Knight, A., Rooney, A.P.: Inferring species trees from gene trees: A phylogenetic analysis of the Elapidae (Serpentes) based on the amino acid sequences of venom proteins. Molecular Phylogenetics and Evolution 8, 349–362 (1997) 5. Page, R.D.M.: Extracting species trees from complex gene trees: reconciled trees and vertebrate phylogeny. Molecular Phylogenetics and Evolution 14, 89–106 (2000) 6. Cotton, J., Page, R.D.M.: Vertebrate phylogenomics: reconciled trees and gene duplications. In: Pacific Symposium on Biocomputing, pp. 536–547 (2002) 7. Cotton, J.A., Page, R.D.M.: Tangled tales from multiple markers: reconciling conflict between phylogenies to build molecular supertrees. In: Bininda-Emonds, O.R.P. (ed.) Phylogenetic Supertrees: Combining Information to Reveal the Tree of Life, pp. 107–125. Springer, Heidelberg (2004)
8. Sanderson, M.J., McMahon, M.M.: Inferring angiosperm phylogeny from EST data with widespread gene duplication. BMC Evolutionary Biology 7(suppl. 1), S3 (2007) 9. Goodman, M., Czelusniak, J., Moore, G.W., Romero-Herrera, A.E., Matsuda, G.: Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Systematic Zoology 28, 132–163 (1979) 10. Page, R.D.M.: Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas. Systematic Biology 43(1), 58–77 (1994) 11. Mirkin, B., Muchnik, I., Smith, T.F.: A biology consistent model for comparing molecular phylogenies. Journal of Computational Biology 2(4), 493–507 (1995) 12. Eulenstein, O.: Predictions of gene-duplications and their phylogenetic development. PhD thesis, University of Bonn, Germany, GMD Research Series No. 20/1998 (1998), ISSN: 1435-2699 13. Zhang, L.: On a Mirkin-Muchnik-Smith conjecture for comparing molecular phylogenies. Journal of Computational Biology 4(2), 177–187 (1997) 14. Chen, K., Durand, D., Farach-Colton, M.: Notung: a program for dating gene duplications and optimizing gene family trees. Journal of Computational Biology 7, 429–447 (2000) 15. Bonizzoni, P., Vedova, G.D., Dondi, R.: Reconciling gene trees to a species tree. In: Petreschi, R., Persiano, G., Silvestri, R. (eds.) CIAC 2003. LNCS, vol. 2653, Springer, Heidelberg (2003) 16. Górecki, P., Tiuryn, J.: On the structure of reconciliations. In: Lagergren, J. (ed.) Comparative Genomics. LNCS (LNBI), vol. 3388, Springer, Heidelberg (2005) 17. Bender, M.A., Farach-Colton, M.: The LCA problem revisited. In: Latin American Theoretical INformatics, pp. 88–94 (2000) 18. Harel, D., Tarjan, R.E.: Fast algorithms for finding nearest common ancestors. SIAM Journal on Computing 13(2), 338–355 (1984) 19. Fellows, M., Hallett, M., Korostensky, C., Stege, U.: Analogs & duals of the MAST problem for sequences & trees. In: Bilardi, G., Pietracaprina, A., Italiano, G.F., Pucci, G. (eds.) ESA 1998. LNCS, vol. 1461, pp. 103–114. Springer, Heidelberg (1998) 20. Stege, U.: Gene trees and species trees: The gene-duplication problem is fixed-parameter tractable. In: Proceedings of the 6th International Workshop on Algorithms and Data Structures (1999) 21. Hallett, M.T., Lagergren, J.: New algorithms for the duplication-loss model. In: RECOMB, pp. 138–146 (2000) 22. Swofford, D.L., Olsen, G.J.: Phylogeny reconstruction. In: Molecular Systematics, Sinauer Associates, pp. 411–501 (1996) 23. Allen, B.L., Steel, M.: Subtree transfer operations and their induced metrics on evolutionary trees. Annals of Combinatorics 5, 1–13 (2001) 24. Bordewich, M., Semple, C.: On the computational complexity of the rooted subtree prune and regraft distance. Annals of Combinatorics 8, 409–423 (2004) 25. Chen, D., Eulenstein, O., Fernández-Baca, D., Burleigh, J.G.: Improved heuristics for minimum-flip supertree construction. Evolutionary Bioinformatics (2006) 26. Bansal, M.S., Burleigh, J.G., Eulenstein, O., Wehe, A.: Heuristics for the gene-duplication problem: A Θ(n) speed-up for the local search. In: RECOMB, pp. 238–252 (2007)
Incremental Discovery of Irredundant Motif Bases in Time O(|Σ|n^2 log n) (Extended Abstract)
Alberto Apostolico and Claudia Tagliacollo
Georgia Institute of Technology and Università di Padova
[email protected]
1 Introduction
Compact bases formed by motifs called "irredundant" and capable of generating all other motifs in a sequence have been proposed in [8,10] and successfully tested in tasks of biosequence analysis and classification. Given a sequence s of n characters drawn from an alphabet Σ, the problem of extracting such a base from s had been previously solved in time O(n^2 log n log |Σ|) and O(|Σ|n^2 log^2 n log log n), respectively in [9] and [7], through resort to the FFT-based string searching by Fischer and Paterson [5]. More recently, a solution taking time O(|Σ|n^2) without resort to the FFT was also devised [4]. In the present paper, the problem is considered of extracting the bases of all suffixes of a string incrementally. In previous work [3], this task was accomplished in time O(n^3). A much faster incremental algorithm is described here, which takes time O(|Σ|n^2 log n). Although this algorithm, too, makes no use of the FFT, its performance is comparable to that exhibited by the previous FFT-based algorithms computing only one base. The implicit representation of a single base requires O(n) space, whence for finite alphabets the proposed solution is within a log n factor of optimality. The present paper assumes some familiarity with [2,3,4], to which the notation largely conforms. With '•' ∉ Σ denoting a don't-care character, a pattern is a string over Σ ∪ {•} containing at least one solid character. For characters σ1 and σ2, we write σ1 ⪯ σ2 if and only if σ1 is a don't care or σ1 = σ2. Given two patterns p1 and p2 with |p1| ≤ |p2|, p1 ⪯ p2 holds if p1[j] ⪯ p2[j], 1 ≤ j ≤ |p1|. We also say in this case that p1 is a sub-pattern of p2, and that p2 implies or extends p1. For example, let p1 = ab••e, p2 = ak••e and p3 = abc•e•g. Then p1 ⪯ p3, but p2 ⋠ p3.
Corresponding author. Dipartimento di Ingegneria dell'Informazione, Università di Padova, Padova, Italy, and College of Computing, Georgia Institute of Technology, 801 Atlantic Drive, Atlanta, GA 30318, USA. Work supported in part by the Italian Ministry of University and Research under the Bi-National Project FIRB RBIN04BYZ7, and by the Research Program of Georgia Tech. Performed in part while visiting IMS, National University of Singapore in 2006, and CAS-MPI, Shanghai in 2007, with support provided by those Institutes. Work performed in part while visiting the College of Computing of the Georgia Institute of Technology.
The relation ⪯ is clearly transitive. The operator "⊕" is further introduced, such that for σ1, σ2 ∈ Σ ∪ {•}, σ1 ⊕ σ2 = σ1 if σ1 = σ2, and σ1 ⊕ σ2 = • if σ1 ≠ σ2. A natural extension of "⊕" is also defined: given patterns p1 and p2, p1 ⊕ p2 is the pattern with (p1 ⊕ p2)[i] = p1[i] ⊕ p2[i] for all 1 ≤ i ≤ min{|p1|, |p2|}. Given the patterns p1, p2, the consensus of p1 and p2 is the pattern p = p1 ⊕ p2. Deleting all leading and trailing don't cares from p yields the meet of p1 and p2, denoted by [p1 ⊕ p2]. For instance, aac•tgcta ⊕ caact•cat = •a••t•c••, and [aac•tgcta ⊕ caact•cat] = a••t•c. Note that a meet may be the empty word. With suf_i denoting the suffix s_i s_{i+1} ... s_n of s, a pattern p is an autocorrelation of s if p is the meet of s and one of its suffixes, i.e., if p = [s ⊕ suf_i] for some 1 < i ≤ n. For a sequence s and positive integer k, k ≤ |s|, a k-motif of s is a pair (m, L_m), where m is a pattern such that |m| ≥ 1 and m[1], m[|m|] are solid characters, and L_m = (l_1, l_2, ..., l_q) with q ≥ k is the exhaustive list of the starting positions of all occurrences of m in s. Given a motif m, a sub-motif of m is any motif m′ that may be obtained from m by (i) changing one or more solid characters into don't cares, (ii) eliminating all resulting don't cares that precede the first remaining solid character or follow the last one, and finally (iii) updating L_m in order to produce the (possibly augmented) list L_{m′}. We also say that m is a condensation of any of its sub-motifs. We are interested in motifs for which any condensation would disrupt the list of occurrences. A motif with this property has been called maximal or saturated. Thus, a motif m is maximal or saturated if we cannot make it more specific while retaining the cardinality of the list L_m of its occurrences in s. A motif (m, L_m) is redundant if m and its location list L_m can be deduced from the other motifs without knowing the input string s. Trivially, every unsaturated motif is redundant. As it turns out, however, saturated motifs may be redundant, too. More formally: a saturated motif (m, L_m) is redundant if there exist saturated motifs (m_i, L_{m_i}), 1 ≤ i ≤ t, such that L_m = (L_{m_1} + d_1) ∪ (L_{m_2} + d_2) ∪ ... ∪ (L_{m_t} + d_t) with 0 ≤ d_j < |m_j|. Here and in the following, (L + d) is used to denote the list that is obtained by adding a uniform offset d to every element of L. For instance, the saturated motif m_1 = a•a is redundant in s = acacacacabaaba, since L_{m_1} = {1, 3, 5, 7, 9, 12} = (L_{m_2}) ∪ (L_{m_3}) ∪ (L_{m_4} + 1), where m_2 = acac, m_3 = aba and m_4 = ca•a. Saturated motifs enjoy some special properties. First (Property 1), if (m_1, L_{m_1}) and (m_2, L_{m_2}) are saturated motifs, then m_1 = m_2 ⇔ L_{m_1} = L_{m_2}. Whereas, given a generic pattern m, it is always possible to determine its occurrence list in any sequence s, with a saturated motif m it is possible in addition to retrieve the structure of m from the sole list L_m in s, simply by taking m = ⊕_{i∈L_m} suf_i. Moreover (Property 2), if (m_1, L_{m_1}), (m_2, L_{m_2}) are motifs of s, then m_1 ⪯ m_2 ⇔ L_{m_2} ⊆ L_{m_1}. Finally (Property 3), if (m, L_m) is a saturated motif of s, then for all L ⊆ L_m it is m ⪯ ⊕_{k∈L} suf_k. Let now suf_i(m) denote the i-th suffix of m. The occurrence at j of m_1 is covered by m_2 if m_1 ⪯ suf_i(m_2), j ∈ L_{m_2} + i − 1 for some suf_i(m_2). For instance, m_6 = aca•a with L_{m_6} = {1, 3, 5, 7} is covered at position 5 by m_2 = acacaca•a••a, L_{m_2} = {1, 3}. In fact, let m′ be the i-th suffix of m_2 with i = 5, that is, m′ = aca•a••a. Then 5 ∈ L_{m_2} + 4 and m_6 ≺ m′, which together lead to the conclusion that m_6 is covered at 5 by m_2. An alternate definition of the notion of coverage can be based solely on occurrence lists, since the occurrence at j of m_1 is covered by m_2 if there is i such that L_{m_2} + i ⊆ L_{m_1}, j ∈ L_{m_2} + i. In terms of our running example, we have: 5 ∈ L_{m_2} + 4 and L_{m_2} + 4 = {5, 7} ⊂ L_{m_6} = {1, 3, 5, 7}. A maximal motif that is not redundant is called an irredundant motif. Hence a saturated motif (m, L_m) is irredundant if the components of the pair (m, L_m) cannot be deduced from the union of a number of other saturated motifs. We use B_i to denote the set of irredundant motifs in suf_i. The set B_i is called the base for the motifs of suf_i. In particular, B is used to denote the base of s, which coincides with B_1. Formally, let M be the set of all saturated motifs of s. A set of saturated motifs B is called a base of M iff the following hold: (1) for each m ∈ B, m is irredundant with respect to B − {m}, and (2), with G(X) denoting the set of all the redundant maximal motifs generated by the set of motifs X, M = G(B). In general, |M| = Ω(2^n). However, the base of 2-motifs has size linear in |s|. This follows immediately from the known result (see, e.g., [3]): Theorem 1. Every irredundant motif is the meet of s and one of its suffixes.
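A few lines of Python make the operators above concrete; '.' stands in for the don't care '•' and all names are ours.

```python
# A small runnable sketch of the consensus, the meet, and the
# autocorrelations of Section 1 (our names; '.' denotes the don't care).

DC = '.'

def consensus(p1, p2):
    return ''.join(a if a == b else DC
                   for a, b in zip(p1, p2))      # character-wise "⊕"

def meet(p1, p2):
    return consensus(p1, p2).strip(DC)           # drop flanking don't cares

def autocorrelations(s):
    """The meets [s ⊕ suf_i] for 1 < i <= n (candidate base elements)."""
    return [meet(s, s[i:]) for i in range(1, len(s))]

assert meet('aac.tgcta', 'caact.cat') == 'a..t.c'   # the example above
```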
acacaca•a••a, Lm2 = {1, 3}. In fact, let m be ith suffix of m3 with i = 5, that is, m = aca•a••a. Then 5 ∈ Lm2 +4 and m6 ≺ m , which together lead to conclude that m6 is covered at 5 by m2 . An alternate definition of the notion of coverage can be based solely on occurrence lists, since the occurrence at j of m1 is covered by m2 if there is i such that Lm2 + i ⊆ Lm1 , j ∈ Lm2 + i. In terms of our running example, we have: 5 ∈ Lm2 + 4 and Lm2 + 4 = {5, 7} ⊂ Lm6 = {1, 3, 5, 7}. A maximal motif that is not redundant is called an irredundant motif. Hence a saturated motif (m, Lm ) is irredundant if the components of the pair (m, Lm ) cannot be deduced by the union of a number of other saturated motifs. We use Bi to denote the set of irredundant motifs in sufi . Set Bi is called the base for the motifs of sufi. In particular, B is used to denote the base of s, which coincides with B1 . Formally, let M be the set of all saturated motifs on s. A set of saturated motifs B is called a base of M iff the following hold: (1) for each m ∈ B, m is irredundant with respect to B − {m}, and, (2) let G(X ) be the set of all the redundant maximal motifs generated by the set of motifs X , then M = G(B). In general, |M| = Ω(2n ). However, the base of 2-motifs has size linear in |s|. This follows immediately from the known result (see, e.g., [3]): Theorem 1. Every irredundant motif is the meet of s and one of its suffixes. From now on and for the remainder of this paper, treatment will be restricted to 2-motifs. Recall now that in order for a motif to be irredundant it must have at least one occurrence that cannot be deduced from occurrences of other motifs. In [3], such an occurrence is called maximal and the motif is correspondingly said to be exposed at the corresponding position. Clearly, every motif with a maximal occurrence is saturated. However, not every saturated motif has a maximal occurrence. In fact, it is seen that the set of irredundant motifs is precisely the subset of saturated motifs with a maximal occurrence. We use Lmax to denote m the list of maximal occurrences of m. The following known definitions and properties (see, e.g., [3,4,9]) are listed for future reference. Definition 1. (Maximal occurrence) Let (m, Lm ) be a motif of s and j ∈ Lm . Position j is a maximal occurrence for m if for no d ≥ 0 and (m , Lm ) we have Lm ⊆ (Lm − d ) with (j − d ) ∈ Lm . Lemma 1. m ∈ B ⇔ |Lmax m | > 0. Lemma 2. If m ∈ B, then j ∈ Lmax ⇔ [s ⊕ suf(max{j,k}−min{j,k}) ] = m, ∀k ∈ m Lm . Lemma 3. m∈B |Lm | < 2n. Lemma 2 shows that in order to check whether a position i is a maximal occurrence for an assigned motif (m, Lm ), it suffices to check the condition [sufi ⊕ sufk ] = m, ∀k ∈ Lm . Also Lemma 3 [9], which poses a counter-intuitive linear bound on the cumulative size of the occurrence lists in a base, will play an important role in our construction.
2 The Incremental Management of Motif Occurrences
Any approach to the extraction of bases of irredundant motifs must solve the problem of finding the occurrences of the autocorrelations or meets of the input string s or of its suffixes. This evokes the notable variant of approximate string searching featuring don't cares (see, e.g., [1,6]), which admits of a classical O(n log m log |Σ|) time solution based on the FFT [5]. Such an FFT-based solution is the one adopted in [7,9], resulting in an overall time of O(n^2 log n log |Σ|). The incremental approach in [3] proceeds instead by computing those lists directly and for consecutively increasing suffixes of each autocorrelation. This produces the base for each suffix of s, at an overall cost of O(n^3) time. In [4], a fast test for meet occurrence is built on the observation that these patterns all come from the set of autocorrelations of the same string. In a nutshell, an occurrence of m = suf_i ⊕ suf_j at some position k in s induces strong interdependencies among the numbers of don't cares in the three patterns m = suf_i ⊕ suf_j, m′ = suf_i ⊕ suf_k and m″ = suf_j ⊕ suf_k. The specific structure of these relationships depends on whether the alphabet is binary or larger. For binary alphabets, with d_x denoting the number of don't cares in x and pref_i(x) the prefix of x of length i, the following holds. Lemma 4. [4] Let m = [suf_i ⊕ suf_j], m′ = pref_{|m|}(suf_i ⊕ suf_k) and m″ = pref_{|m|}(suf_j ⊕ suf_k). Then k ∈ L_m ⇔ d_m = d_{m′} + d_{m″}. Thus, following an O(n^2) preprocessing in which the number of don't cares in every suffix of each autocorrelation of s is counted, it is possible to answer in constant time whether any meet occurs at any position of s, just by checking the balance of don't cares. In generalizing to arbitrary alphabets, an O(|Σ|n^2) preprocessing of the input string s is required, in which the count is partitioned among the different symbols of Σ, whereby every don't care is accompanied by a "pedigree" specifying one out of 4 possible origins. Correspondingly, the test now takes O(|Σ|) time. We refer to [4] for details and summarize these findings in the following: Theorem 2. Let s be a string of n characters over an alphabet Σ, and m the meet of any two suffixes of s. Following an O(|Σ|n^2) time preprocessing of s, it is possible to decide for any assigned position k whether or not k is an occurrence of m in time O(|Σ|).
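The binary-alphabet test of Lemma 4 can be rendered in a few runnable lines (our names; the counts here are recomputed naively, whereas the algorithm precomputes them so that each query costs O(1)).

```python
# A runnable sketch of the binary-alphabet occurrence test of Lemma 4
# (our names, 0-based positions): whether k is an occurrence of
# m = [suf_i ⊕ suf_j] reduces to a don't-care count balance.

def dcount(s, i, j, length):
    """Don't cares in the first 'length' characters of suf_i ⊕ suf_j."""
    return sum(1 for a, b in zip(s[i:i + length], s[j:j + length]) if a != b)

def occurs(s, i, j, k, length):
    """Does pref_length(suf_i ⊕ suf_j) occur at position k? (binary Σ)"""
    d_m  = dcount(s, i, j, length)
    d_m1 = dcount(s, i, k, length)   # m'  = pref(suf_i ⊕ suf_k)
    d_m2 = dcount(s, j, k, length)   # m'' = pref(suf_j ⊕ suf_k)
    return d_m == d_m1 + d_m2        # the balance of Lemma 4

s = 'ababbaba'                       # binary alphabet {a, b}
m = [a if a == b else '.' for a, b in zip(s[0:4], s[2:6])]
direct = all(c == '.' or c == x for c, x in zip(m, s[4:8]))
assert occurs(s, 0, 2, 4, 4) == direct   # agrees with the definition
```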
We concentrate now on designing an algorithm that produces the bases of all suffixes of an input string s. Following an initial preparation, the algorithm proceeds incrementally on suffixes of increasing length, along the lines of a paradigm introduced in [3]. At the generic iteration n − i the algorithm builds the base B_i relative to suf_i. This base is formed in part by selecting the elements of B_{i+1} that are still irredundant in suf_i, and in part by identifying and discarding, from the set of new candidate motifs consisting of the meets of suf_i, those motifs that are covered by others. Since the elements in any base come from meets of some of the suffixes of s, a bottleneck for the procedure is represented by the need to compute the occurrences of all such meets. Before entering the details of our construction, it is worthwhile to examine more closely the challenge posed by the incremental management of such meets. Our algorithm must build the sets M_i = {[suf_i ⊕ suf_j], ∀j > i} and B_i ⊆ M_i as i goes from n − 1 down to 1 through the main cycle. For the generic iteration, this entails, in particular, updating the lists of occurrences of M_{i+1} = {[suf_{i+1} ⊕ suf_j], ∀j > i + 1} and B_{i+1} ⊆ M_{i+1} in order to produce those of M_i as well as B_i ⊆ M_i. As there are possibly O(n^2) occurrences to update at each of the n − 1 iterations, this task is a major potential source of inefficiency, even though it is not difficult to see that the lists do not need to be built from scratch at each iteration [3]. In fact, consider a generic motif m = [suf_i ⊕ suf_j] and let m′ = [suf_{i+d} ⊕ suf_{j+d}], m′ ∈ M_{i+d}, be the motif such that m = σ(•)^{d−1} m′ with σ = s[i]. Then the set of occurrences of m is determined by scanning the list of occurrences of m′ and verifying the condition s[i] = s[k − d] for every k ∈ L_{m′}. This is accomplished in constant time per update. Still, for any of the sets M under consideration, we have Σ_{m∈M} |L_m| = O(n^2). Thus, the method bears a cost of O(n^2) per iteration, O(n^3) in total. Our goal is to set up a more prudent organization of the data, leading to a global cost of O(|Σ|n^2 log n), amortized over all iterations. This seems counterintuitive, since there is no way around listing all occurrences in all lists in less than cubic space. However, we can take advantage of the dynamics undergone by our lists and make do with a partially implicit representation. In order to proceed, we need some preparatory developments. It is a crucial consequence of Theorem 2 that once the don't cares have been tallied for all suffixes of each meet of s (together with each don't care's "pedigree" information), then it takes only constant time to determine whether or not an arbitrary position k is an occurrence of m, m being the meet of an arbitrary pair of suffixes of s. Although the same could be done on the fly with no penalty, we will assume for simplicity that a trivial O(|Σ|n^2) pre-processing phase has already been performed to determine the number (and individual "pedigree") of don't cares in (every suffix of) each [s ⊕ suf_i] (i = 1, 2, ..., n), and concentrate on computing the occurrences of every pairwise suffix meet. Definition 2 (Earliest index). Let m be a meet of s and 1 ≤ j ≤ |s|, and let suf_k(m) indicate as usual the k-th suffix of m. The earliest index I_j^m of m at j is I_j^m = min{k : (j − |m| + k) ∈ L_{suf_k(m)}}. That is, starting at some occurrence j of the last solid character of m, the index I_j^m is k if suf_k(m) is the longest suffix of m ending at j (m[k] = s[j − |m| + k]). Consider now a generic meet m = [s ⊕ suf_i]. Knowing the earliest index relative to m at every position j of s, we also know that for l ≥ I_j^m, position j − |m| + l must be included in the list of occurrences of suf_l(m), whereas for l < I_j^m the position j − |m| + l is not an occurrence of suf_l(m). Lemma 5. Let m = [s ⊕ suf_i] and 1 ≤ j ≤ |m|. Computing I_j^m, the earliest index of j relative to m, takes time O(|Σ| log n). Proof. The computation is carried out by straightforward binary search for the longest matching suffix of m. At the generic step of the recursion, we check
whether j − |m| + k is an occurrence of suf_k(m); if this is the case, we proceed with the next longer suffix in the recursion, otherwise with the next shorter one. The cost of each step is that of determining whether a given position is an occurrence of a certain meet of two suffixes of s, which Theorem 2 affords in time O(|Σ|). Through the O(log n) steps, the computation of I_j^m thus takes O(|Σ| log n). This immediately yields: Corollary 1. Computing the earliest indices of all meets of s at all positions 1, 2, ..., |s| = n takes time O(|Σ|n^2 log n).
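The binary search in the proof of Lemma 5 looks as follows in runnable form (our names, 0-based indices); a direct occurrence check stands in for the don't-care-balance oracle of Theorem 2.

```python
# A runnable sketch of the binary search of Lemma 5 for the earliest
# index I_j^m. The naive occurs_at plays the role of the O(|Σ|)-time
# oracle; the search is valid because any suffix of an occurring
# pattern also occurs, aligned at the same ending position.

DC = '.'

def occurs_at(s, pat, pos):
    if pos < 0 or pos + len(pat) > len(s):
        return False
    return all(c == DC or c == s[pos + p] for p, c in enumerate(pat))

def earliest_index(s, m, j):
    """Smallest k (0-based) such that suf_k(m) occurs ending at j."""
    lo, hi, ans = 0, len(m) - 1, None
    while lo <= hi:
        k = (lo + hi) // 2
        if occurs_at(s, m[k:], j - (len(m) - k) + 1):
            ans, hi = k, k - 1     # this suffix occurs: try a longer one
        else:
            lo = k + 1             # too long: try a shorter suffix
    return ans
```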
Corollary 1 is a crucial handle for our speedup, which nevertheless requires a few additional observations. First, recall that the motifs that survive each round of updates are essentially a subset of the current version of M: their respective lists are sublists of the original ones. Upon updating, each surviving occurrence in a list will retain its original starting position, up to an offset which is uniform for all fellow survivors. By making the convention that the elements in a list are represented by their ending, rather than by their starting position, and by keeping track of lengths, we will never need to rename the survivors. Moreover, the occurrences that do not survive will never be readmitted to any list. Finally, because the base B_i comes only from meets of suf_i, at iteration n − i we only need, in addition to B_{i+1}, the lists in the set M_i = {[suf_i ⊕ suf_j], ∀j > i}. We will see next that, under our conventions, these lists can be made readily available throughout, at a total cost of O(|Σ|n^2 log n). In fact, a stronger construct can be established, whereby all sets M_i^j = {[suf_i ⊕ suf_k], ∀k ≠ i, k ≥ j} can be implicitly maintained with their individual lists throughout, for each j ≤ i, at an overall cost of O(|Σ|n^2 log n) instead of O(n^3). The remainder of the section is devoted to substantiating this claim. At the iteration for suf_i the list for the generic meet m of s will appear as partitioned into sections, as follows. The currently open section contains the ending positions in suf_i of occurrences of suffixes of m that still fall short of their respective earliest indices. The remaining sections are called closed and are assigned to the various lengths, as follows: the section assigned to length ℓ stores the ending positions, if any exist, of the occurrences in suf_i of suffixes of m of length ℓ that cannot be prolonged into occurrences of length > ℓ. A list is initialized as soon as the rightmost two replicas in s of the last solid character of its meet are found. Let these positions be, respectively, k and h > k. These two entries k and h are dubbed open and appended to the open list of name j = (h − k). New entries are added to the open list while longer and longer suffixes of the input string s are considered. At the iteration for suf_i, i is added to the open section of all the lists of meets having s[i] as their last character. At that point, a "sentinel" pointer is also issued from position i − |m| + I_i^m of s to this entry in the list. The role of each sentinel is to gain access to its corresponding entry when the latter "decays" at iteration k = i − |m| + I_i^m, as a consequence of the corresponding occurrence becoming "too short" to survive. At that point, the entry i is taken out of the open section of the list and moved to the closed section assigned to length |m| − I_i^m. In conclusion, the list assigned to m undergoes "refresh cycles", as longer and longer extensions provoke the defection of more and more entries from the open section to the closed length sublists, while new, shorter suffix occurrences are discovered and added to the open list. For each meet m, the list assigned to m is partitioned into sublists arranged in order of decreasing length, with the length of the open list set conventionally equal to n, and the items inside each sublist are in turn sorted in order of ascending position. The collective list will be referred to as the panpipes of m, after the ancient musical instrument it resembles. For any integer ℓ ≤ n − 1, tallying the current number of occurrences of suffixes of m not shorter than ℓ is like stabbing the set of degrading pipes with an orthogonal stick, striking at a height of ℓ from the base, and then counting how many pipes were hit. A standard balanced tree implementation of the list with its subsections supports each of the following in O(log n) time: – insertion of an element in the open section; – demotion of an element to the closed section of a given length; – line stabbing at any height, i.e., tallying the elements of a given minimum length. Theorem 3. Maintaining the panpipes of all distinct meets of s consecutively at suf_i, for i = n − 1, n − 2, ..., 1, takes overall time O(|Σ|n^2 log n). Proof. It takes time O(|Σ|n^2 log n) to determine I_j^m for all j's and all meets of s. Then, refer to the preceding description for the updates. Each one of the O(n) candidate occurrences of each of the O(n) meets is inserted in the open sublist exactly once, and then possibly moved from there to a specific length list, once and for all. This accounts for O(n^2) panpipes primitives in total, at an individual cost of O(log n) each, which yields a total complexity of O(|Σ|n^2 log n). Corollary 2. The sequence of sets M_i = {[suf_k ⊕ suf_j], ∀k ≥ i, j > k} (i = n − 2, n − 3, ..., 1), each with its occurrence lists and individual list cardinalities, can be consecutively generated one from the other in overall time O(|Σ|n^2 log n). Proof. Following all necessary preprocessing, and with m denoting the generic meet of s, just let suf_i(m) first "inherit" the whole list of suf_{i+1}(m) (that is, L_{suf_i(m)} = L_{suf_{i+1}(m)} − 1), and then use the sentinels at i to access and eliminate from that list all occurrences j − |m| + i such that I_j^m = i.
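A small stand-in for the panpipes may clarify the bookkeeping (our names): sorted Python lists replace the balanced tree, so updates cost O(n) here rather than O(log n), but the three primitives are exactly those listed above.

```python
# A sketch of the panpipes of a single meet m: an open section plus
# closed sections keyed by length, supporting insertion, demotion, and
# the line-stabbing tally. Sorted lists are a toy substitute for the
# balanced tree of the text.

import bisect

class Panpipes:
    def __init__(self):
        self.open = []            # ending positions, still extendable
        self.closed = {}          # length -> sorted ending positions

    def insert_open(self, pos):
        bisect.insort(self.open, pos)

    def demote(self, pos, length):
        """Move a decayed entry to the closed section of its final length."""
        self.open.remove(pos)
        bisect.insort(self.closed.setdefault(length, []), pos)

    def stab(self, ell):
        """Tally occurrences of suffixes of m of length >= ell."""
        n_open = len(self.open)   # open entries count at every height
        return n_open + sum(len(v) for L, v in self.closed.items() if L >= ell)

pp = Panpipes()
pp.insert_open(7); pp.insert_open(9)
pp.demote(7, length=3)
assert pp.stab(2) == 2 and pp.stab(4) == 1
```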
3 Computing the Bases of All Suffixes of a String
We are ready to detail the generic iteration of the algorithm. Iteration n − i will determine the base Bi for sufi , so that in particular the base of s itself will be available after n iterations. The input for this iteration is as follows:
– The set M_{i+1} of meets of suf_{i+1}, each with its individual occurrence list. – The base B_{i+1}, represented by the collection of patterns on Σ ∪ {•}, each with its list of occurrences in suf_{i+1}, with maximal occurrences tagged. The output of the iteration consists of M_i and B_i, in the same representation. Recall that at any time the collective size of all lists in any given set B_i is linear in n − i, by virtue of Lemma 3. This is not necessarily true of the collective size of the lists of M_i. However, each of these sets contains at most n − 1 meets. Each iteration of the main cycle consists of two phases: – Phase 1: extract from B_{i+1} the motifs that are still irredundant in suf_i; – Phase 2: identify all new irredundant motifs. We describe these two phases in succession.
3.1 Phase 1 - Computing B_{i+1} ∩ B_i
This phase consists of identifying the motifs of B_{i+1} that are still irredundant in suf_i. Two distinct events may lead a motif m in B_{i+1} to become redundant: 1. m is covered by a new motif discovered at the current iteration; 2. m is covered by the occurrence starting at i of some other element of B_{i+1}. It is convenient to single out from B_{i+1} the motifs that exhibit a new occurrence starting at i, and to handle them separately from the rest. This enables us to search for the motifs of B_{i+1} that are still irredundant in suf_i among motifs: – [1(a)] with an occurrence starting at i; – [1(b)] without an occurrence starting at i. These two cases differ on the basis of how a maximal occurrence is covered. If a motif m that becomes redundant in suf_i has an occurrence starting at i and a maximal occurrence j in suf_{i+1}, this means that m ≺ m′ for some m′ with j ∈ L_{m′}. In the second case, as will be seen later in detail, such a motif becomes redundant because some other motif extends its occurrence at j with a solid character σ = s[i], placed j − i positions to its left. Since the two phases operate on distinct sets of motifs (respectively, with and without an occurrence at i), they can be handled independently upon separating their respective inputs. Alternatively, the entire B_{i+1} is fed as input to Phase 1(a) and the output of this phase becomes the input of Phase 1(b). Whereas the preliminary separation reduces the input size for either phase, deciding for each motif of B_{i+1} whether or not it has an occurrence at i induces an extra cost of O(|Σ||B_{i+1}|). The second approach also requires some partitioning, but this can be performed on a smaller input in between phases or at the end. The approach described next uses preliminary partitioning. The second approach is left as an exercise.
Phase 1(a). Since we know the name (meet-id, list-head) and length of the motif in B_{i+1} to be checked, we can compute the position at which an occurrence at i would end, and then check (or compute from scratch) the earliest index of that position relative to the meet name. Therefore, separating from B_{i+1} the motifs with an occurrence at i takes at most O(|Σ||B_{i+1}|) time. With B_{i+1}^i denoting the subset of B_{i+1} containing such motifs, the goal is then that of determining B_{i+1}^i ∩ B_i. This set exhibits some important properties, which are derived next. Lemma 6. Let B^j be the set of irredundant motifs with an occurrence at j, and M^j the set of meets [suf_j ⊕ suf_k], ∀k ≠ j. Then B^j ⊆ M^j. Proof. Let m be an element of B^j. From Lemma 1, m must have at least one maximal occurrence k. If k = j then m = [suf_j ⊕ suf_l], ∀l ∈ L_m. If this is not the case, it follows from the maximality of the occurrence at k that m = [suf_k ⊕ suf_l], ∀l ∈ L_m, which holds in particular for l = j. Lemma 6 is useful when searching for irredundant motifs of which a specific occurrence is known, since it restricts the set of candidates to a linear-size subset of all pairwise suffix meets. In the present context, the lemma can be used to determine which ones among the old motifs having an occurrence at i conserve their irredundancy in suf_i. Corollary 3. Let, as earlier, M_i denote the set of meets [suf_i ⊕ suf_k], ∀k > i. Then B_{i+1}^i ∩ B_i ⊆ M_i.
Proof. From Lemma 6.
In order for a motif in Bi+1 to stay irredundant in the transition from sufi+1 to sufi, at least one of its maximal occurrences in sufi+1 must be preserved also in sufi.

Lemma 7. Let m ∈ Bi+1. Then m ∈ Bi ⇔ ∃k ≠ i : k ∈ L^max_m.

Proof. This holds clearly for a motif of Bi+1 with no occurrence at i, since irredundancy presupposes a maximal occurrence. Assume then a motif of Bi+1 having i as its sole maximal occurrence. Then m = [sufi ⊕ sufk], ∀k ∈ Lm. Let k ∈ Lm be a maximal occurrence of m in sufi+1. Since k ∈ Lm, we have m = [sufi ⊕ sufk], so that k is a maximal occurrence of m in sufi as well. Therefore, no motif of the old base can preserve its irredundancy by having i as its sole maximal occurrence. On the other hand, preserving irredundancy for an old motif does not necessarily require a maximal occurrence at i. These properties suggest that the redundancy of a motif m ∈ Bi+1 can be assessed by scanning the maximal occurrences of m and deciding which ones among them are still maximal in sufi. If the maximal occurrences of m are already known in sufi+1, all that is left is to check maximality with respect to the new occurrence at i.
Lemma 8. Let m ∈ Bi+1 with i ∈ Lm. Then m ∈ Bi ⇔ ∃k ∈ L^max_m such that m = [sufi ⊕ sufk].

Proof. Immediate from Lemma 7.
In conclusion, the bulk of Phase 1(a) consists of scanning the maximal occurrences of each m ∈ Bi+1 also occurring at i and determining whether at least one such occurrence stays maximal. Maximal occurrence j stays maximal in sufi iff [sufi ⊕ sufj] = m, a condition that can be tested by comparing the number of don't cares respectively in m and [sufi ⊕ sufj], given that m occurs at i and j. Alternatively, this condition can be checked by comparing the occurrence lists of m and [sufi ⊕ sufj]. In fact, since both i and j are in Lm, it must be that m ⪯ [sufi ⊕ sufj] and L[sufi⊕sufj] ⊆ Lm. Hence, in order to check whether [sufi ⊕ sufj] and m coincide it suffices to check the condition |L[sufi⊕sufj]| = |Lm|. Note that, as a by-product, either method inductively maintains knowledge of the maximal occurrences for all motifs in the sets B.

Lemma 9. Phase 1(a) takes time O(n).

Proof. For any j > i, we identify [sufi ⊕ sufj] from our knowledge of i, j, and of their difference. In fact, the latter identifies a specific meet of s. Let now m ∈ B^i_{i+1}. For every maximal occurrence j of m it takes constant time to compare don't cares or list sizes for m and [sufi ⊕ sufj]. By Lemma 3, the size of all lists in a base cumulates to less than 2n, whence the total number of occurrences that need to be checked is O(n).

Phase 1(b). Recall that the task of this phase is the identification of the motifs that stay irredundant in sufi among the elements of Bi+1 with no occurrence at i. The identification of these motifs is rather straightforward once it is observed that the only way in which such a motif m may become redundant in sufi is for it to be covered, in its maximal occurrences in sufi+1, by a motif m′ = σ(•)^d m with an occurrence at i.

Lemma 10. If m′ ∈ Mi covers m ∈ Bi+1, i ∉ Lm, then m′ = σ(•)^d m where σ = s[i] and d ≥ 0.

Proof. Occurrence j ∈ L^max_m loses maximality if ∃k ∈ Lm such that m ≺ [sufi ⊕ sufi+k−j], where it is assumed w.l.o.g. k > j. Since j is a maximal occurrence of m in sufi+1, then [sufk ⊕ sufj] = m and the only possibility is [sufi ⊕ sufi+k−j] = s[i](•)^d m, where d = j − i − 1.

The elimination from Bi+1 of the motifs m ∉ Bi without an occurrence at i is done by checking for every maximal occurrence of m whether it can be extended in such a way as to lose maximality. The procedure terminates as soon as an occurrence that stays maximal is met, or when all maximal occurrences have been obliterated.
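To make the Phase 1 tests concrete, the following Python sketch (ours, not the authors' implementation) renders the meet of two suffixes and the list-size test of Phase 1(a); it assumes the meet [sufi ⊕ sufj] is the position-wise agreement of the two suffixes with the don't care written '.', and a hypothetical dictionary occ mapping each pattern to its occurrence list.

```python
def meet(s, i, j):
    """Meet of suf_i and suf_j of string s: position-wise agreement, '.'
    standing for the don't care, trimmed so the pattern ends on a solid
    character (assumed convention)."""
    return "".join(a if a == b else "." for a, b in zip(s[i:], s[j:])).rstrip(".")

def stays_maximal(s, occ, m, i, j):
    """Occurrence j of motif m stays maximal in suf_i iff [suf_i (+) suf_j] = m;
    since i and j are both occurrences of m, comparing occurrence-list sizes
    suffices, as argued above."""
    return len(occ[meet(s, i, j)]) == len(occ[m])
```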
Lemma 11. Phase 1(b) takes time O(n).

Proof. Since each m to be checked is in Bi+1, the total number of motif occurrences whose possible extension into i needs to be checked is O(n). Checking an occurrence for extensibility is easily done in constant time.

3.2 Phase 2 - Identifying the New Irredundant Motifs
Recall that by new irredundant motifs we mean those elements of Bi that did not belong to Bi+1. Lemma 6 prescribes that these motifs are to be identified among the elements of Mi = {[sufi ⊕ sufj], ∀j > i}. Indeed, to be irredundant these motifs must have a single maximal occurrence at position i in case they already had multiple occurrences in sufi+1; otherwise, they must have precisely two occurrences, both of them maximal. Let then

B̂ = Mi − (Bi+1 ∩ Bi)

be the set of the candidate new irredundant motifs. Since i must be a maximal occurrence of any motif m ∈ B̂, we need to check which ones among old and new motifs with an occurrence at i can cover this occurrence of m. The way this is done is based on the following properties.

Lemma 12. Let m1, m2 ∈ B̂ and let j ≠ i, j ∈ Lm1 ∩ Lm2, |Lm1| < |Lm2|. Then m2 ∉ Bi.

Proof. Observe first that it is impossible for both motifs to be irredundant in sufi. In fact, since they do not belong to the old base they would have to have maximal occurrences at i. But if this holds for m1 then [sufi ⊕ sufk] = m1, ∀k ∈ Lm1, whence, in particular, [sufi ⊕ sufj] = m1. Likewise, it must be [sufi ⊕ sufk] = m2, ∀k ∈ Lm2, hence [sufi ⊕ sufj] = m2 = m1. Assume then that only m2 is irredundant. Then [sufi ⊕ sufk] = m2, ∀k ∈ Lm2, and thus [sufi ⊕ sufj] = m2. Since i, j ∈ Lm1 we have [sufi ⊕ sufj] ⪰ m1 and thus |Lm2| ≤ |Lm1|, which contradicts the hypothesis.

Lemma 13. Let mnew ∈ B̂, mold ∈ Bi+1 ∩ Bi, and j ≠ i, j ∈ Lmnew ∩ Lmold, |Lmold| < |Lmnew|. Then mnew ∉ Bi.

Proof. As was already argued, if mnew ∈ Bi then its occurrence at i must be maximal, hence [sufi ⊕ sufj] = mnew. We have then again mnew ⪰ mold, which generates the contradiction |Lmnew| ≤ |Lmold|.

In conclusion, in order to check for irredundancy of the elements of B̂, it must be checked for every such motif m whether it is covered by another motif of B̂ or by some old irredundant motif which is still irredundant in sufi. Let m1, m2, ..., ml (l ≤ n) be the motifs to verify. They all come in the form [sufi ⊕ sufk] for some k > i. Considering m1 = [sufi ⊕ sufk1] with Lm1 = {i, k1, k2, ..., kr}, the motifs that can possibly obliterate m1 are [sufi ⊕ sufj], j ∈ Lm1, j ≠ i, j ≠ k1. Taking m′ = [sufi ⊕ sufk2] as the first motif to be considered, we check the condition
|Lm1| ≥ |Lm′|. Note that, having chosen m′ as [sufi ⊕ sufk2] where both i and k2 are occurrences of m1, we must have |Lm1| ≥ |Lm′|. If |Lm1| = |Lm′|, then m1 = m′ and m′ is excluded from further analysis. If |Lm1| > |Lm′|, then m1 is obliterated by m′ and thus m1 must be eliminated since it is redundant. The procedure is repeated with the surviving motif until all redundant motifs have been eliminated.

Lemma 14. Phase 2 takes time O(n).

Proof. We need first to establish that the described approach is correct. Indeed, assume that the k-th iteration is handling the motif m with an occurrence at j, and that the pair m, m′ = [sufi ⊕ sufj] is checked, where m′ had already been considered during some previous iteration h < k. Two situations are possible:

1. m′ had been eliminated at iteration h. In this case, there must be a motif covering m′ at i, whence that motif will also cover the occurrence at i of m. Thus, m can be eliminated at the current iteration.
2. m′ has been previously checked but not eliminated. This means that m′ ∈ Bi. Since m and m′ share an occurrence other than i, it must be that m ∉ Bi, so that also in this case m can be eliminated at the current iteration.

Following each one of the comparisons, the procedure eliminates a distinct meet of sufi from further consideration. Since there are O(n) such meets, we also have O(n) iterations, each requiring constant time to compare the cardinalities of two lists.

Theorem 4. The irredundant motif bases of all suffixes of a string can be computed incrementally in time O(|Σ|n^2 log n).

Proof. By the preceding properties and discussion.
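As a minimal illustration of the Phase 2 elimination, here is a Python sketch (ours, with hypothetical data structures, and deliberately a simple quadratic version rather than the linear-time bookkeeping of Lemma 14) that applies Lemmas 12 and 13: among surviving candidates sharing an occurrence other than i, the one with the strictly larger occurrence list is discarded.

```python
def phase2_filter(candidates, occ, i):
    """candidates: meets [suf_i (+) suf_k] for k > i; occ[m] is the
    occurrence list of motif m."""
    survivors = list(candidates)
    changed = True
    while changed:
        changed = False
        for a in survivors:
            for b in survivors:
                # a covers b's occurrence at i when they share some j != i
                # and a has the strictly smaller occurrence list
                if a != b and (set(occ[a]) & set(occ[b])) - {i} \
                        and len(occ[a]) < len(occ[b]):
                    survivors.remove(b)
                    changed = True
                    break
            if changed:
                break
    return survivors
```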
4 Concluding Remarks
Several issues are still open, notable among them the existence of an optimal algorithm for general alphabets and of an optimal incremental algorithm for alphabets of constant or unbounded size.
References

1. Apostolico, A., Galil, Z. (eds.): Pattern Matching Algorithms. Oxford University Press, New York (1997)
2. Apostolico, A.: Pattern discovery and the algorithmics of surprise. In: Artificial Intelligence and Heuristic Methods for Bioinformatics, pp. 111–127 (2003)
3. Apostolico, A., Parida, L.: Incremental paradigms of motif discovery. Journal of Computational Biology 11(1), 15–25 (2004)
4. Apostolico, A., Tagliacollo, C.: Optimal extraction of irredundant motif bases. In: Lin, G. (ed.) Proceedings of COCOON 07. LNCS, vol. 4598, pp. 360–371. Springer, Heidelberg (2007)
5. Fischer, M.J., Paterson, M.S.: String matching and other products. In: Karp, R. (ed.) Proceedings of the SIAM-AMS Complexity of Computation, pp. 113–125. American Mathematical Society, Providence, R.I. (1974)
6. Navarro, G.: A guided tour to approximate string matching. ACM Computing Surveys 33(1), 31–88 (2001)
7. Pelfrêne, J., Abdeddaïm, S., Alexandre, J.: Extracting approximate patterns. Journal of Discrete Algorithms 3(2-4), 293–320 (2005)
8. Parida, L.: Algorithmic Techniques in Computational Genomics. PhD thesis, Department of Computer Science, New York University (1998)
9. Pisanti, N., Crochemore, M., Grossi, R., Sagot, M.-F.: Bases of motifs for generating repeated patterns with wild cards. IEEE/ACM Trans. Comput. Biol. Bioinformatics 2(1), 40–50 (2005)
10. Parida, L., Rigoutsos, I., Floratos, A., Platt, D., Gao, Y.: Pattern discovery on character sets and real-valued data: linear bound on irredundant motifs and an efficient polynomial time algorithm. In: Symposium on Discrete Algorithms, pp. 297–308 (2000)
A Graph Clustering Approach to Weak Motif Recognition

Christina Boucher, Daniel G. Brown, and Paul Church

D.R. Cheriton School of Computer Science, University of Waterloo
{caboucher,browndg,pchurch}@cs.uwaterloo.ca
Abstract. The aim of the motif recognition problem is to detect a set of mutually similar subsequences in a collection of biological sequences. Weak motif recognition is the case in which the motif instances are highly degenerate. Our new approach to this problem uses a weighted graph model and a heuristic that determines high weight subgraphs in polynomial time. Our experimental tests show impressive accuracy and efficiency. We give results that demonstrate a theoretical dichotomy between cliques in our graph that represent actual motifs and those that do not.
1 Introduction
Understanding the structure and function of genomic data remains an important biological and computational challenge. Motifs are short subsequences of genomic DNA responsible for controlling biological processes, such as gene expression. Motifs with the same function may not entirely match, due to mutation. The motif consensus of the instances is a sequence representing the shared pattern. Given a number of DNA sequences, motif recognition is the task of discovering motif instances in sequences without knowing their positions or pattern. This problem becomes increasingly difficult as the number of allowed mutations grows. Weak motif recognition addresses the difficult case when many degenerate positions are allowed. Many useful versions of motif recognition are NP-complete, and therefore are unlikely to have polynomial-time algorithms. Pevzner and Sze define the weak motif recognition problem concretely, illustrating the limitations of motif recognition programs. In 2000, most methods were capable of finding motifs of length 6 with no degeneration but failed to detect motif instances of length 15 with 4 degenerate positions in a random sample containing 20 sequences of length 600 [8]. Since this "challenge problem" was defined, many approaches have been developed to detect motifs with a relatively large number of degenerate positions. We describe a new approach for this problem, and provide theoretical and experimental results that support our novel motif recognition algorithm. Our algorithm, MCL-WMR, builds an edge-weighted graph model of the given motif recognition problem and uses a graph clustering algorithm to quickly determine which important subgraphs are to be searched for valid motifs. Synthetic data has
shown that MCL-WMR has competitive running time and accuracy. An added advantage of MCL-WMR is its ability to detect multiple motif instances. The efficiency of MCL-WMR lies in the use of the Markov Cluster algorithm (MCL) to quickly find dense subgraphs likely to contain a motif. These subproblems are then solved optimally via dynamic programming. Extracting important subgraphs is in the spirit of WINNOWER, the combinatorial algorithm created by Pevzner and Sze [8], which builds a similar graph model and eliminates spurious edges sequentially. Our algorithm eliminates complete subgraphs and hence avoids considering edges individually. One of the main contributions of the creation of MCL-WMR is the introduction of a novel model for motif recognition. Previous algorithms and programs search exhaustively or probabilistically on an unweighted graph or string. Due to the lack of information contained in these models, the required search demands extensive computation. By considering a weighted graph model, we narrow the search dramatically to easy problems. We argue that there exists a dichotomy between the weight of cliques corresponding to actual motifs and that of cliques that do not, and suggest this separation can be used to filter the data to be searched.
2 Previous Approaches to Weak Motif Recognition
The limitations of the existing motif recognition programs were first highlighted by Pevzner and Sze, who identified "challenge" problems in motif discovery [8]. We approach the problem from a similar combinatorial perspective and hence consider the following combinatorial formulation.

Definition 1. The (l, d)-motif problem: Let S = {S1, . . . , Sm} be a set of DNA sequences of length n, and M be the motif consensus, a fixed and unknown sequence of length l. Suppose that M is contained in each Si, corrupted with at most d substitutions, so their Hamming distance is at most d. The aim is to determine M and the location of the motif instance in each sequence.

The Hamming distance between two sequences si and sj is H(si, sj). The weak motif recognition problem is to find the motif instances when the number of degenerate positions d is large in relation to the motif length l; well-known weak motif recognition problems exist when the motif instances are (9, 2), (11, 3), (15, 4), and (18, 6), with 20 random DNA sequences, each 600 nucleotides long. Although the strength of the motif increases or decreases the inherent difficulty, varying the background sequence length is also important. As the number of sequences increases, the number of noisy l-mers increases: detection of the motif instances becomes increasingly difficult, and spurious motifs are more likely to occur. Also, as the sequence length grows, the number of near-motifs will also increase dramatically. Existing software programs developed for motif finding use either a heuristic or an enumeration approach. Heuristic methods attempt to maximize a score function representative of how likely a particular subsequence is a motif instance;
they are often unsatisfactory for weak motifs because they get trapped in local maxima. Pevzner and Sze developed WINNOWER and SP-STAR for weak motif recognition. WINNOWER creates a graph representation with a vertex for every occurring l-mer and an edge between all pairs of vertices that are at most 2d distance apart; spurious edges are deleted to reveal sets of vertices whose corresponding subsequences are possible motif instances [8]. Due to spurious edges, the running time is prohibitively large and grows immensely as motif strength weakens or subsequence length or number increases [5]. Sze et al. [10] extend the graph formulation of WINNOWER [8]: they formulate motif finding as finding cliques in k-partite graphs, with the additional requirement of a string s close to every motif instance. They hypothesize that this approach provides a better formulation to model motifs than using cliques alone; the use of k-partite graphs lends itself to being solved exactly and efficiently by a divide-and-conquer algorithm. Experimental results demonstrate that the approach is feasible on difficult motif finding problems of moderate size [10]. Buhler and Tompa [1] developed a heuristic algorithm called PROJECTION that projects every occurring l-mer onto a smaller space by hashing. The hash function is based on k of the l positions, selected at random when the algorithm begins. After the initialization step, a consensus is derived for each grouping of l-mers, and expectation maximization is used for refinement. PROJECTION does significantly better than other programs, but its accuracy is dependent on a user-defined input parameter [1]. As m becomes larger, PROJECTION recovers motif instances more slowly; hence, its running time and accuracy are very sensitive to changes in m. An obvious method to detect motif instances of length l is to enumerate all 4^l possible motif consensus sequences and, for each considered l-mer, calculate a significance value, count instances of it, or see whether it satisfies a requirement as in the (l, d) problem. These algorithms are guaranteed to find the best motif (or the most probable one, in the case of maximizing a likelihood function), but their running times become prohibitively slow for large degenerate motifs. To tackle more significant motif recognition problems, enumeration methods have been created that consider only oligomers which are present in the given data sets. SP-STAR, developed by Pevzner and Sze [8], does an enumerative search but only over the occurring data rather than the entire space of 4^l l-mers; however, we note that the number of sequences to be searched is approximately C(l, d) 3^d. SP-STAR was successful in finding (15,4)-motif instances in data sets containing 20 DNA sequences, each of which has maximum length 700, but failed to have reasonable accuracy when the sequence length exceeded 700 [8].
3 System and Methods
MCL-WMR involves three stages: graph construction, clique finding using graph clustering, and recovering the motif instances and their consensus. The construction of MCL-WMR is as follows: a reference sequence Sr is chosen randomly
from the data set, and for each l-length subsequence of Sr a subgraph is built by comparing that subsequence with all possible l-length subsequences in the data set S1, . . . , Sr−1, Sr+1, . . . , Sm. The entire graph G is the union of these subgraphs G1, . . . , Gn−l+1. We use MCL to generate subgraphs which contain vertices that are highly inter-related. From these clusters of vertices, we generate the positions of the possible motif instances and their corresponding motif consensus. The algorithm terminates when a motif is found. In order to increase the probability that a motif is found, we minimize searching subgraphs with a low probability of containing a motif; hence, the adjacency subgraphs are not simply clustered and searched in sequential order.

3.1 Graph Construction
In our graphical representation of the data set, each subsequence of length l is represented by a vertex, and the construction of our graph ensures that the motif instances represented by vertices in the graph are connected to each other and form a clique of size m (though the converse need not hold). The vertex set contains a vertex vi,j representing the l-length subsequence in sequence i starting at position j, for each i and j = 1, 2, . . . , n − l + 1. Each pair of vertices vi,j and vi′,j′, for i ≠ i′, is joined by an edge when the Hamming distance between the two represented subsequences is at most 2d. An edge between vertices at distance k has weight l − k for d < k ≤ 2d, or 10(l − k) for k ≤ d. This emphasizes subsequences at small distances. This graph is represented by a symmetric adjacency matrix, constructed in O(m^2(n − l)(n + l)) time. The graph is m-partite, so a clique of size m contains exactly one vertex from each sequence. We reduce the size of the instance being passed to MCL by considering subgraphs {G1, . . . , Gn−l+1}, where Gi is the subgraph induced by a reference vertex, denoted vR,i, and its neighbors (for some arbitrary choice of reference sequence R), instead of searching all of G at once. A small sketch of this construction follows.
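The following minimal Python sketch (ours, not the MCL-WMR code) implements the construction just described: one vertex per l-mer, edges only between different sequences, and the two-tier weighting.

```python
from itertools import combinations

def build_graph(seqs, l, d):
    """Vertices are (sequence, position) pairs for every l-mer.  Vertices
    from different sequences are joined when the Hamming distance k of
    their l-mers is at most 2d; the edge weight is l - k for d < k <= 2d
    and 10*(l - k) for k <= d, emphasizing small distances."""
    vertices = [(i, j) for i, s in enumerate(seqs) for j in range(len(s) - l + 1)]
    edges = {}
    for (i1, j1), (i2, j2) in combinations(vertices, 2):
        if i1 == i2:
            continue  # m-partite: no edges within one sequence
        k = sum(a != b for a, b in zip(seqs[i1][j1:j1 + l], seqs[i2][j2:j2 + l]))
        if k <= 2 * d:
            edges[((i1, j1), (i2, j2))] = 10 * (l - k) if k <= d else l - k
    return vertices, edges
```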
3.2 Using Clustering to Find Motifs
A clustering of a graph G(V, E) is a decomposition of V into subsets of highly intra-connected vertices. A good clustering of a graph is an approximation of a partitioning of the graph into cliques. A clique corresponding to a motif will exist in one of the subgraphs of G since each motif instance appears as a vertex in a clique of size m. We use MCL [11] to cluster the sets of vertices to determine subgraphs that are highly intra-connected with high-weight edges and scarcely inter-connected, and thus likely to correspond to a motif instance. MCL can handle large, undirected weighted graphs efficiently. The idea underlying the MCL algorithm is that dense subgraphs correspond to regions where the number of k-length paths is relatively large, for small k. Random walks of length k have higher probability for paths beginning and ending in the same dense region than for other paths.
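For orientation, here is a compact sketch of the MCL iteration (expansion followed by inflation) on a weighted adjacency matrix; this is a bare-bones rendering for illustration, not van Dongen's implementation [11], and the inflation parameter and convergence handling are simplified assumptions.

```python
import numpy as np

def mcl_clusters(adj, inflation=2.0, iters=60, tol=1e-6):
    """adj: symmetric nonnegative weight matrix (numpy array)."""
    n = len(adj)
    M = adj.astype(float) + np.eye(n)   # self-loops stabilize the flow
    M /= M.sum(axis=0)                  # make columns stochastic
    for _ in range(iters):
        M = M @ M                       # expansion: flow along 2-step paths
        M = M ** inflation              # inflation: boost strong currents
        M /= M.sum(axis=0)
    # clusters are read off the nonzero rows of the limit matrix
    clusters = []
    for row in M:
        members = set(np.nonzero(row > tol)[0])
        if members and members not in clusters:
            clusters.append(members)
    return clusters
```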
3.3 Recovering Motifs
MCL identifies dense high-scoring regions of the subgraph Gi ; we filter the subgraphs obtained from MCL to subgraphs that have high probability of containing
a motif. A clique in G that represents a motif instance must have size m and weight greater than or equal to (l − 2d)·C(m, 2), since each pair of vertices is adjacent. We filter out clusters that do not meet these criteria. Clusters that pass this test may contain multiple cliques formed by choosing different subsets of m cluster vertices, or possibly no cliques at all. We identify all ways of forming a clique from the cluster vertices by using the m-partite nature of the graph to explore all possible cliques with a depth-first search. As the number of cliques can be exponential in the cluster size in the worst case, this step becomes a bottleneck for problem sizes such as (18, 6), where MCL returns large clusters. For each clique, we test whether it represents a motif instance by attempting to build a motif consensus that has distance at most d to every vertex. We do this by building up a list of possible consensuses one character at a time, together with the number of mismatches to each vertex for each possibility. Once a candidate consensus has d + 1 mismatches to some vertex, it is discarded. Although the space of 4^l possible consensus strings is very large, in practice the list is pruned very rapidly on the (d + 1)st character, i.e., after reaching size 4^d.
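The build-and-prune consensus test admits a short sketch (ours, under the stated pruning rule): candidate consensuses are grown one character at a time and discarded as soon as they exceed d mismatches against any clique member.

```python
def find_consensus(members, l, d):
    """members: the l-mers of a candidate clique (one per sequence).
    Returns a consensus within Hamming distance d of every member, or None."""
    candidates = [("", [0] * len(members))]   # (prefix, per-member mismatches)
    for pos in range(l):
        nxt = []
        for prefix, miss in candidates:
            for c in "ACGT":
                miss2 = [m + (c != w[pos]) for m, w in zip(miss, members)]
                if max(miss2) <= d:           # prune at d + 1 mismatches
                    nxt.append((prefix + c, miss2))
        if not nxt:
            return None                       # no consensus: not a motif clique
        candidates = nxt
    return candidates[0][0]
```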
4 Analysis of Graph-Theoretic Model
To validate our weighted graph approach, we show the existence of a separation between the total weight of a clique corresponding to a motif and that of a clique that does not. We demonstrate theoretically that the total weight of a clique corresponding to a motif deviates from the mean with low probability. Empirical results support this, and also show that there exists some separation between cliques that can be extended to motifs and those that cannot.

4.1 Analysis of the Weight of a Clique Containing a Motif
Consider a clique C containing a motif. Define the weight of an edge to be l minus the Hamming distance between the sequences corresponding to the endpoints of the edge. Let W be the random variable for the sum of the \binom{m}{2} edge weights in C. Without loss of generality, let v1, v2, . . . , vm be the set of m vertices in C corresponding to sequences s1, . . . , sm. We seek E[W] and a tail bound on large deviations from the mean. Let Wi be the expected value of the random variable W given that the first i subsequences in C are known.

Theorem 1. The expected weight of a clique in G, which models a random (l, d)-motif recognition problem containing m sequences, is

E[W] = \binom{m}{2}\left(l - \frac{1}{\beta^2}\sum_{a=0}^{d}\sum_{b=0}^{d}\binom{l}{a}\binom{l}{b}\,3^{a+b}\left(a + b - \frac{4ba}{3l}\right)\right),

where \beta = \sum_{i=0}^{d}\binom{l}{i}3^i.
Proof. Given an (l, d) motif, we aim to compute the expected value of the clique's total weight, E[\sum_{i=1}^{m}\sum_{j=i+1}^{m}(l − H(si, sj))]. Let μe be E[H(si, sj)] for any pair
of sequences si and sj, where (vi, vj) is an edge in a clique that contains a motif and si and sj are unknown. Then

E[W] = E\left[\sum_{i<j}(l − H(s_i, s_j))\right] = \binom{m}{2}\,(l − μe).
We choose the m sequences uniformly from the \beta = \sum_{i=0}^{d}\binom{l}{i}3^i possible choices. Let αi denote the Hamming distance between vertex vi and the consensus S, so that Pr[H(S, si) = αi] = \binom{l}{\alpha_i}3^{\alpha_i}/\beta. The expected Hamming distance of an edge depends on the distances of the two subsequences from the consensus, so we break the expectation into pieces:

μe = \sum_{\alpha_i=0}^{d} Pr[H(S, s_i) = \alpha_i] \sum_{\alpha_j=0}^{d} Pr[H(S, s_j) = \alpha_j]\, E[H(s_i, s_j) \mid H(S, s_i) = \alpha_i, H(S, s_j) = \alpha_j]
   = \sum_{\alpha_i=0}^{d} \frac{\binom{l}{\alpha_i}3^{\alpha_i}}{\beta} \sum_{\alpha_j=0}^{d} \frac{\binom{l}{\alpha_j}3^{\alpha_j}}{\beta}\, E[H(s_i, s_j) \mid H(S, s_i) = \alpha_i, H(S, s_j) = \alpha_j].

The remaining problem is to compute the expected Hamming distance between si and sj, knowing that the strings consist of copies of S with a and b positions mutated, respectively. If a position was mutated in neither string it is a match; if a position was mutated in one string but not the other it is a mismatch; if a position was mutated in both strings, it is a match with probability 1/3. If si is fixed, sj consists of b mutations that each either hit one of the a mutated positions in si or one of the other l − a positions, sampled without replacement. The number that hit the a mutated positions in si follows a hypergeometric distribution with mean ba/l. If the number of hits to mutated positions is c, the expected total number of mismatches is: b − c positions that hit among the l − a non-mutated positions in si, a − c positions among the a mutated positions of si that were not hit, and (2/3)c mismatches from the hits among the mutated positions, for a total of (b − c) + (a − c) + (2/3)c = a + b − (4/3)c. Therefore, E[H(si, sj) | H(S, si) = a, H(S, sj) = b] = a + b − 4ba/(3l). Substituting, we have

E[W] = \binom{m}{2}\left(l − \frac{1}{\beta^2}\sum_{\alpha_i=0}^{d}\sum_{\alpha_j=0}^{d}\binom{l}{\alpha_i}3^{\alpha_i}\binom{l}{\alpha_j}3^{\alpha_j}\left(\alpha_i + \alpha_j − \frac{4\alpha_i\alpha_j}{3l}\right)\right).

We are able to easily bound the variance of W by first demonstrating that W0, W1, . . . , Wm is a martingale sequence and then applying Azuma's inequality to determine the probability of a specific deviation.

Theorem 2. Consider the (l, d)-motif recognition problem containing m sequences. Let W be the sum of the \binom{m}{2} edge weights in an arbitrary clique in G that contains a motif, and let μW be the expected weight of a clique in G that contains a motif. Then for any λ > 0,

Pr[|W − μW| ≥ λ] ≤ 2\exp\left(−\frac{\lambda^2}{2d^2(m + 1)}\right).
Proof. The mean of W has been previously derived; here we concentrate on proving the tail bound. Recall that Wi is the expected value of the random variable W given that the first i subsequences in C are known, and hence the distances between the consensus S and the first i vertices are known. Without loss of generality we choose a consensus S and let Fi be the σ-field generated by the random choice of subsequence i from the set of all subsequences at most distance d from the consensus, and hence αi randomly chosen with the same probability. It follows that Wi = E[W | Fi], since Wi denotes the conditional expectation of W knowing the first i subsequences. Therefore, W0, W1, . . . , Wm is a martingale sequence [6], with W0 = E[W] and Wm = W. We now focus on the value Wi − Wi−1. Let Δe,i be the change in the random variable representing the weight of an edge e from knowing the first i − 1 sequences to knowing the first i sequences. The value of Δe,i is non-zero only for edges where the sequence corresponding to one of the endpoints of that edge was previously not known and is now known. Each vertex in the clique is adjacent to m − 1 vertices; for i − 1 of these the corresponding sequences were known, and for m − i they were unknown. All other \binom{m}{2} − m + 1 remaining edges in the clique have no change. The expectation of the weight of an edge can change by at most d. Hence

|Wi − Wi−1| ≤ d(m − 1).

The random variables W0, W1, . . . , Wm form a martingale with W0 = E[W] and Wm = W, and |Wi − Wi−1| ≤ d(m − 1). Therefore, we can invoke Azuma's inequality to give us the following for any λ > 0:

Pr[|W − μW| ≥ λ] ≤ 2\exp\left(−\frac{\lambda^2}{2\sum_{i=0}^{m}d^2}\right) = 2\exp\left(−\frac{\lambda^2}{2d^2(m + 1)}\right).

We compare the theoretical tail bound with the distribution of values obtained from MCL-WMR; Figure 1 demonstrates that the distribution of the values of W approaches the normal distribution in the limit, with the mean value centered at 897. This corresponds to the theoretical mean of 900.1 calculated using the result from Theorem 1. The weight of cliques that do not represent valid motifs also appears to follow a normal distribution, but their mean weight is slightly lower, approximately 885. This result, shown in Figure 1, was determined by generating the weight of cliques that do not correspond to motifs in 100 random data sets; these cliques were discovered using MCL-WMR and so are likely not a uniform sample of such cliques. These results demonstrate a partial separation between the weight of cliques representing motifs and those that do not, which can be exploited to efficiently find dense subgraphs that are of interest. As highlighted in Figure 1, we use the weight to determine for which subgraphs a further search for valid motifs is necessary. Further, Figure 2 suggests that as the value of m increases this separation will become more apparent, since large deviations of the weight of cliques corresponding to motifs will occur with less probability, and the weight of the cliques will
Fig. 1. Distribution of the weight of cliques containing a motif consensus and the distribution of the weight of cliques not containing a motif consensus. The data for non-motif cliques was generated by running MCL-WMR 100 times, calculating the total weight of each clique, and generating a histogram of these values. The data is given for the (15,4) motif problem instance with m = 15.
become more centralized around the mean. Similar experimental tests were completed to demonstrate the relationship between the weight of spurious cliques when m = 15 and when m = 50; specifically, we ran MCL-WMR 100 times with n = 800, l = 15, and d = 4 and determined the cliques that did not correspond to valid motifs. We found no spurious cliques in the data sets when m = 50, agreeing with our intuition that very few spurious cliques occur randomly in the data set when m becomes large. We should further note our confidence in MCL-WMR being able to detect cliques–both spurious and those corresponding to motifs–due to its accuracy in detecting the embedded motifs (see Section 4.2 for details concerning these experimental tests). These results also suggest that when m is relatively large we can be more certain that any cliques found correspond to valid motifs; an attribute that should be further explored.
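Theorem 1's formula is easy to evaluate numerically; the short check below (ours, not the authors' code) reproduces the theoretical mean of about 900.1 quoted above for the (15, 4) problem with m = 15.

```python
from math import comb

def expected_clique_weight(l, d, m):
    """E[W] per Theorem 1: C(m, 2) times the expected edge weight."""
    beta = sum(comb(l, i) * 3**i for i in range(d + 1))
    mu = 0.0  # expected Hamming distance of one edge
    for a in range(d + 1):
        for b in range(d + 1):
            p = comb(l, a) * 3**a * comb(l, b) * 3**b / beta**2
            mu += p * (a + b - 4 * a * b / (3 * l))
    return comb(m, 2) * (l - mu)

print(round(expected_clique_weight(15, 4, 15), 1))  # -> 900.1
```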
4.2 Discussion of Complexity
A few interesting observations can be made regarding the complexity of the algorithm and the quality of its solutions. Finding cliques of maximum size in a given input graph is NP-complete and thus unlikely to be solved in polynomial time [3]. Further, the results of Chen et al. [2] show that, unless unlikely consequences occur in parameterized complexity theory, the problem of finding maximum-size cliques cannot be solved in n^{o(k)} time. Thus, finding cliques of a specific size k is not likely to be computationally feasible for graphs of significant size. The best known algorithm for finding cliques of size k in a given input graph runs in time O(m^{ck/3}), where c is the exponent on the time bound for multiplying two integer m × m matrices; the best known value for c is 2.38 [7]. The runtime for the straightforward algorithm that checks all size-k subsets is O(m^{k+2}), and it is the one most likely to be implemented in practice. The runtime of the algorithm of Yang and Rajapakse [12], a dynamic programming clique finding algorithm, is O(m(nA^2 + A^{n−1}p^{2n−5})), where A = m \sum_{i=0}^{2d} \binom{l}{i}(3/4)^i(1/4)^{l−i}, m is the length of each sequence and n is the number of sequences (in their notation). This runtime
Fig. 2. Distribution of average edge weights in cliques corresponding to actual motifs of size 15 and 50. The data is given for the (15, 4) motif problem with n = 600.
reflects the steep computational expense required to find cliques of a given size in an input graph. Similarly, the estimated runtime of the WINNOWER algorithm is O((mD)^4), where D is approximately 30 for the challenge problem [8]. The time required by MCL-WMR to find a solution is not affected by the length of the motif that is to be discovered, whereas this is true for many other methods. Rather, it is the weakness of the motif–that is, the probability of the pairwise motif similarity occurring randomly–that has the most impact on the complexity of the algorithm. An increased probability of high-weight cliques affects the runtime of MCL-WMR, since an exponential-time algorithm is required to determine whether a high-weight cluster or subgraph contains a motif instance. We can compare the computational complexity of these programs by considering the runtime of MCL-WMR's three sequential steps–that is, the computational time required to construct the graph, find all cliques of size m, and determine the motifs and consensus. MCL-WMR uses the MCL algorithm, which runs in time O(N^3) where N is the number of vertices in the input graph [11], to find dense subgraphs. Hence, the most computationally expensive step of MCL-WMR is the clique-finding algorithm that searches the dense subgraphs for cliques corresponding to valid motifs and increases in computation time with the number of vertices. Other graph-based methods for motif finding rely on enumeration methods to find dense subgraphs; for example, WINNOWER requires each edge to be checked, and the algorithm of Yang and Rajapakse uses dynamic programming on the complete graph.
5 Experimental Results
We tested MCL-WMR on synthetic problem instances generated according to the embedded (l, d)-motif model. We produce problem instances as follows: first we choose a random motif consensus of length l, and pick m occurrences of the motif
Table 1. Comparison of the performance on a range of (l, d)-motif problems with synthetic data, where n = 600 and m = 20. The average performance of MCL-WMR on the eight different problem instances, generated as specified, is given. Data for WINNOWER and SP-STAR is the average of eight random instances given by Pevzner and Sze [8], while PROJECTION is the average of 100 random problem instances with projection size 7 and bucket size 4, given by Buhler and Tompa [1].

 l   d  PROJECTION  SP-STAR  WINNOWER  MCL-WMR   Time
10   2     0.80       0.56     0.78      1.00       54 ± 10.8
11   2     0.94       0.84     0.90      1.00       30 ± 10.6
12   3     0.77       0.33     0.78      1.00      205 ± 11.0
13   3     0.94       0.92     0.92      1.00       65 ± 10.4
14   4     0.71       0.02     0.02      1.00      806 ± 71.3
15   4     0.93       0.73     0.92      1.00      220 ± 17.2
17   5     0.93       0.69     0.03      1.00      704 ± 67.2
18   6     N.A.       N.A.     N.A.      1.00    20605 ± 534.3
by randomly choosing d positions per occurrence and randomly mutating the base at each. Lastly, we construct m background sequences of length n and insert the generated motifs into random positions in the sequences. For each of the (l, d) combinations, 100 randomly generated sets of input sequences (n = 600 and m = 20) were generated. This generation corresponds to the "FM" model used in the challenge problem by Pevzner and Sze and in the results concerning PROJECTION by Buhler and Tompa. All empirical results were obtained on a desktop computer with a 2.0 GHz AMD Athlon 64-bit processor with 512 KB cache and 1 GB RAM, running Debian Linux. The time is the number of CPU seconds. One of the main advantages of MCL-WMR is the accuracy of its results even for hard problems. A metric referred to as the performance coefficient is used to gauge the accuracy of the algorithm; it is defined as |K ∩ P| / |K ∪ P|, where K is the set of nucleotide positions covered by the true motif instances and P is the set of nucleotide positions covered by the proposed motif instances. A performance coefficient of 0.75 or greater is acceptable for algorithms not guaranteeing exact accuracy; improved algorithms return results with coefficients between 0.9 and 0.95. Table 1 compares the performance of MCL-WMR with that of previous motif finding programs on sets of eight random problem instances. We give the average performance coefficient for MCL-WMR and the competing programs, the mean runtime, and the range of runtimes for each set of motif problem instances. For comparison, we give the performance coefficients for WINNOWER, SP-STAR and PROJECTION. The data for these algorithms was collected by Pevzner and Sze [8] and Buhler and Tompa [1]. Our program found the exact location of a motif instance every single time and hence the coefficient is 1; other programs typically were only approximate in discovering the motifs. The computation time of previous programs that find the exact solution becomes unacceptable as the motifs become degraded beyond the (15, 4) problem [9]. The main advantage of our tool is that the time required to solve the extremely difficult challenge problems–that is, the (17, 5) and (18, 6) problems–is substantially better than the running time of previous algorithms.
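The performance coefficient is straightforward to compute; the sketch below (ours) spells out the set-based definition, with K and P as the nucleotide positions covered by the planted and the predicted motif instances respectively.

```python
def performance_coefficient(known, predicted, l):
    """known, predicted: lists of (sequence_index, start) motif locations."""
    K = {(s, start + o) for s, start in known for o in range(l)}
    P = {(s, start + o) for s, start in predicted for o in range(l)}
    return len(K & P) / len(K | P)
```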
Table 2. Comparison of the time required to solve the (15, 4)-motif problem with 20 sequences of varying length, for MCL-WMR and PROJECTION; n denotes the sequence length, which varies from 600 to 2000. The running times are obtained by averaging the time to obtain a solution on 8 different instances of the problem. Data for PROJECTION was collected from King et al. [4].

  n    PROJECTION       MCL-WMR
 600     6.6 ± 1.0       50 ± 17.6
 800      27 ± 4        118 ± 39.9
1000      82 ± 25       228 ± 67.4
1200     250 ± 60.0     407 ± 78.8
1400   600.6 ± 140.0    706 ± 138.6
1600    1000 ± 200     1043.4 ± 80.51
1800    1435 ± 353.0   1652 ± 342.7
2000    1891 ± 600.0   2078 ± 432.2
The performance coefficient of MCL-WMR is greater than that of the previous algorithms in every line of Table 1. MCL-WMR correctly solved the planted (11, 2), (13, 3), (15, 4), (17, 5) and (18, 6) problems on all data sets–in these cases recovering the planted motif and any motif occurrences at least as strong as the planted motifs. WINNOWER, PROJECTION, and SP-STAR achieve acceptable performance on the (11, 2), (13, 3) and (15, 4) problem instances when the sequence length is at most 600 and the number of sequences is at most 20; however, all fail on the (18, 6) and (19, 6) problems, and WINNOWER and SP-STAR fail on the (16, 5) and (17, 5) problem instances. The performance of MCL-WMR is most evident on the more difficult planted (14, 4), (16, 5), (17, 5) and (18, 6) motif problems when compared to results from previous algorithms: WINNOWER and SP-STAR typically failed to find the planted motifs, and PROJECTION often failed to have acceptable performance on the more difficult cases of the challenge problem [1]; hence, MCL-WMR's performance substantially exceeded that of previous algorithms. We evaluated the performance of MCL-WMR on problem instances with longer background sequences–that is, problems where n grows beyond 600. As the length of the sequences increases, the number of randomly occurring l-mers increases; specifically, the increase in n increases the probability of high-weight cliques occurring. Due to the increase in noise, and hence difficulty in detecting true motifs, MCL-WMR recovers motifs more slowly. Our results are comparable to the results of PROJECTION: as can be seen in Table 2, MCL-WMR maintains its speed advantages as n increases. Considering the (15, 4) problem and fixing the number of sequences to be 20, the performance of WINNOWER breaks at length 700, and SP-STAR breaks when the length is 800 to 900. Table 2 demonstrates that MCL-WMR has running time comparable to that of PROJECTION for lengths above 1400, though in all cases somewhat higher; for the smallest lengths PROJECTION is clearly faster. We should further note that MCL-WMR achieves a performance coefficient of 1.0 whereas PROJECTION achieved a performance coefficient around 0.93.
6 Conclusion
We propose an efficient algorithm for motif recognition with the specific purpose of solving more difficult problems in which the motif signal is weak due to a large amount of degeneration. We demonstrate promising results on synthetic data. Specifically, we showed promising running time and accuracy for all challenge problems, with the most impressive improvement on the (14, 4), (17, 5) and (18, 6) problems. Previous algorithms lack accuracy, running times that scale with the length and number of sequences, and reasonable running times across all challenge problems. We have shown that a novel model for motif recognition can dramatically influence algorithmic ability and efficiency. By changing the graphical model to incorporate edge weights, we exploit theoretical results demonstrating the existence of a separation between the weights of cliques corresponding to valid motifs and the weights of those that do not, and obtain improved search techniques. Our theoretical work and empirical data show that a large percentage of the cliques corresponding to valid motifs have total weight in a narrow range. This helps us distinguish cliques containing valid motifs from spurious cliques. We expect interesting theoretical results lie within further study of this weighted graph model, along with further exploitation of theoretical results for the problem.
References

1. Buhler, J., Tompa, M.: Finding motifs using random projections. J. Comput. Biol. 9(3), 225–242 (2002)
2. Chen, J., Huang, X., Kanj, I.A., Xia, G.: Linear FPT reductions and computational lower bounds. In: Proc. Sym. on Theory of Comp., pp. 212–221 (2004)
3. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Co., New York, NY (1979)
4. King, J., Cheung, W., Hoos, H.H.: Neighbourhood thresholding for projection-based motif discovery. Bioinformatics (to appear)
5. Liang, S., Samanta, M.P., Biegel, B.A.: cWINNOWER algorithm for finding fuzzy DNA motifs. J. Bioinfo. Comput. Biol. 2(1), 47–60 (2004)
6. Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press, New York, NY (1995)
7. Niedermeier, R.: Invitation to Fixed-Parameter Algorithms. Habilitation thesis, Universität Tübingen (2002)
8. Pevzner, P., Sze, S.: Combinatorial approaches to finding subtle signals in DNA sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol. (ISMB00) 8, 344–354 (2000)
9. Styczynski, M.P., Jensen, K.L.: An extension and novel solution to the (l, d)-motif challenge problem. Gen. Info. 15(2), 63–71 (2004)
10. Sze, S., Lu, S., Chen, J.: Integrating sample-driven and pattern-driven approaches in motif finding. In: Jonassen, I., Kim, J. (eds.) WABI 2004. LNCS (LNBI), vol. 3240, pp. 438–449. Springer, Heidelberg (2004)
11. van Dongen, S.: Graph clustering by flow simulation. PhD thesis, University of Utrecht (May 2000)
12. Yang, X., Rajapakse, J.: Graphical approach to weak motif recognition. Gen. Info. 15(2), 52–62 (2004)
Informative Motifs in Protein Family Alignments

Hatice Gulcin Ozer 1,2 and William C. Ray 2,3

1 Biophysics Program, The Ohio State University
2 Columbus Children's Research Institute
3 Department of Pediatrics, The Ohio State University
700 Children's Drive, Columbus, OH 43205, USA
{ozer.9,ray.29}@osu.edu
Abstract. Consensus and sequence pattern analysis on family alignments are extensively used to identify new family members and to determine functionally and structurally important identities. Since these common approaches emphasize dominant characteristics of the family and assume residue identities are independent at each position, there is no way to describe residue preferences outside of the family consensus. In this study, we propose a novel, information-theoretic approach to detect motifs outside the consensus of a protein family alignment. Inspired by the classic Apriori algorithm, we implemented an algorithm that discovers frequent residue motifs that are high in information content and outside of the family consensus, which we call informative motifs. We observed that these informative motifs are mostly spatially localized and present distinctive features of various members of the family.

Availability: The source code is available upon request from the authors.

Keywords: Information content, motif, sequence alignment, protein structure.
1 Introduction

Multiple sequence alignments have been extensively studied to understand and describe the characteristics of biomolecule families for several decades. Consensus analysis, position-specific scoring matrices (1,2), hidden Markov models (HMMs) (3,4), profile HMMs (5), sequence logos (6) and sequence patterns are the most common approaches to represent and model biosequence families. Unfortunately, all of these approaches attempt to describe only the dominant characteristics of the biomolecule families. The structural and functional importance of highly conserved residues in a protein family alignment is indisputable. However, localized preferences outside of the consensus are also very important to understanding the characteristics of the family. In addition, most of the popular models assume independence of identities in the sequences. Due to this assumption, these methods can result in model descriptions where no instance of the model exists in the family. Although there are numerous studies addressing this drawback, generally they are specific to a region of a family or group of biomolecules, or incorporate prior biological knowledge, and they are not integrated into current popular modeling tools. Therefore, it is necessary to define a generalized method to identify variant motifs outside the consensus of a protein family.
Regular expressions are commonly employed to report sequence patterns that are common to all or most of the sequences in protein family alignments. This approach can be easily expanded to describe residue preferences outside the consensus. On account of the positional independence assumption, however, such patterns might not represent any family member at all. Although this method is good for understanding preferences at distinct positions, it cannot identify variant motifs across the family. Of the existing techniques, HMMs provide the best possibility for modeling positional dependencies, though practically HMMs are limited to near-neighbor dependencies. Long-range interdependencies make the approach intractable, and crossing dependencies are impossible to represent in the HMM paradigm. Additionally, while HMMs may accurately model certain features of an alignment, it is impractical to tease human understanding of the represented motifs back out of the HMM model. Correlated mutations have been extensively used to predict both intra- and intermolecular contacts. The underlying idea is to define the correlation between the residue conservation patterns of two columns in the sequence alignment(s). The measurements proposed for correlated mutations depend highly on residue conservation scoring matrices. There are numerous diverse and sophisticated studies to define both correlated mutation measurements (7) and residue conservation scoring (8). Although several correlated mutation measurements yield reasonable accuracy for predicting residue contacts for some families, general reviews point out that current methodologies of correlated mutations analysis are not suitable for large-scale residue contact prediction (7). MAVL/StickWRLD is an analysis and visualization system we developed to display and interpret positional dependencies discovered in nucleic acid (9) and amino acid (10) family alignments. In the analysis system, the expected number of sequences that should share identities at a particular pair of positions is calculated based on positional probabilities, and residuals are calculated based on the observed population of sequences actually sharing the residues. Correlated pairs of residues, based on these residuals, are visualized in the StickWRLD diagram. This approach differs from correlated mutation analysis by examining identity-wise correlations between the columns of the family alignment, for every possible identity combination. Ongoing research has shown that residue pairs displaying large residuals and high statistical significance are often the result of physical proximity. We are therefore continuing to develop new analysis techniques to enhance the performance and accuracy of the MAVL analysis engine, and to populate StickWRLD visualizations with additional data to assist the researcher. Mining frequent item sets is a key step in many data mining problems, such as association rule mining, sequential pattern mining and classification. Therefore efficient frequent item set generation is an active research area. The Apriori algorithm (11) is the most popular implementation of the frequent item set mining problem. This algorithm can also be applied to amino acid sequence alignments to discover frequently occurring residue patterns. Slight modifications of the algorithm allow us to discover motifs composed of residues that are not strongly conserved. In addition, an information theoretic approach can be employed to assess the significance of discovered motifs.
Shannon’s entropy (12) in information theory describes how much information there is in a signal or event. The position specific probabilities of amino acids in a family alignment can be utilized to compute
the information content of discovered motifs. The information content of a motif is at a minimum when either the motif is strongly conserved or all motifs are equiprobable. Therefore, we are interested in identifying the motifs that are high in information content, and we name such motifs informative motifs. In this paper, we introduce a novel algorithm to mine informative motifs outside the consensus of a protein family alignment by modifying the classic Apriori algorithm. This allows us to describe localized, distinctive features of various members of the family.
2 Methods

In a nutshell, we modified the classic Apriori algorithm to mine frequent residue motifs that are high in information content and outside the family consensus. The fundamental approach for frequent item set mining can be summarized as follows: Let I = {i1, i2, …, im} be a set of items and D = {t1, t2, …, tk} be a database of transactions, where each transaction T is a set of items such that T ⊆ I. Let X be a
set of items. A transaction T is said to contain X if and only if X ⊆ T. The support of an item set X is the percentage of transactions in which X occurs. X is frequent if the support of X is no less than a user-defined support threshold. Here, we are interested in finding the complete set of frequent item sets. Apriori is a breadth-first frequent item set mining algorithm. The basic idea of the Apriori algorithm is to generate candidate item sets of a particular size and then scan the database to count them, to see if they are large, i.e., if they satisfy the minimum support requirement. During scan i, the candidates of size i, Ci, are counted. Only those candidates that are large are used to generate candidates for the next pass. To generate candidates of size i + 1, joins are made of large item sets found in the previous pass. This process repeats until no new candidates are generated (13). In this study, we applied the Apriori algorithm to protein family alignments to detect frequent residue motifs. For this particular application, the transaction database refers to the family alignment, a transaction refers to a sequence in the alignment, and an item set refers to a residue motif that exists in at least one sequence. Since the order of residues, i.e., positions along the sequence, is important, each residue is subscripted with its position before applying the algorithm. In this analysis minimum support is a critical user-defined value. High values of minimum support may result in underestimation of rare but important patterns, and low values of minimum support result in overestimation of many patterns. Therefore, we decided to examine the information content of the candidate item sets based on position-specific probabilities of amino acids in the family alignment. Instead of minimum support we used a cutoff (explained later) on the information content of the item set to proceed to the next step in the algorithm. The information content of an item set X = {x1, x2, …, xK} with associated positions J = {j1, j2, …, jK} and size K is calculated as follows:
IC(X) = −\sum_{k=1}^{K} P(t_{j_k} = x_k) \log_2 P(t_{j_k} = x_k)
where
t_{j_k} is the residue at position j_k in the transaction database, i.e., the sequence family alignment. The relation of an amino acid's probability to the information content of that position is depicted in Figure 1. If a residue is highly conserved, i.e., it is part of the family consensus, its contribution to the motif's information content will be low. On the other hand, if a residue is rarely found at a position, again its contribution to the motif's information content will be low. In this sense, the informative motifs idea properly accommodates the aim of this study.
Fig. 1. The relation of probability to information content
As highlighted in Figure 1, information content larger than or equal to 0.5 represents amino acid probabilities between 0.25 and 0.5. If a residue's probability at a position is larger than 0.5, then it will be part of the consensus. If a residue's probability is less than 0.25, then it can be counted as infrequent. Thus, we take 0.5 as a cutoff on the information content of a single position. Since the information content is computed via an aggregate formula, a motif's information content will increase as its size increases. Therefore, we calculate the motif's per-residue information content by dividing the total information content by the motif's size. Our second modification to the Apriori algorithm is the elimination of residues that cause unnecessary computations. Since we want to extract motifs outside of the consensus, we eliminate consensus residues in the candidate generation step of the algorithm. Also, due to the cutoff applied on the information content, it is not possible to see rare residues in the final motifs; thus we eliminate rare residues at the same step too. To be cautious, we eliminated residues with probabilities larger than 0.7 and smaller than 0.1. Our last modification to the Apriori algorithm is at the candidate generation step. The normal Apriori algorithm generates candidates by creating cross-product joins of every current candidate with every other current candidate. This is extremely inefficient in a situation where most of the generated candidates do not exist in the actual data. Therefore we further modify the algorithm by extracting candidates from the actual sequence composition, limiting searches to only those candidates that are at least possible in the sequence set.
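The cutoffs above follow directly from the shape of the per-position term −p log2 p, as this small check (ours, not the authors' code) illustrates: the term reaches 0.5 exactly at p = 0.25 and p = 0.5 and falls off on either side.

```python
from math import log2

def ic_term(p):
    """Per-position contribution -p * log2(p) to the information content."""
    return -p * log2(p) if p > 0 else 0.0

for p in (0.05, 0.1, 0.25, 0.37, 0.5, 0.7, 0.9):
    print(f"p = {p:4.2f}  -p*log2(p) = {ic_term(p):.3f}")
# p = 0.25 and p = 0.50 both give 0.500; rarer or more conserved
# residues contribute less information.
```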
Fig. 2. The flow chart of the informative motif extraction algorithm. P(xi) denotes the position-specific probability of residue x at position i. IC(X) denotes the per-residue information content of item set X.
To sum up, we modified the Apriori approach to detect informative motifs in a protein family alignment. First, each residue in the family alignment is subscripted with its position. Residues with position-specific probabilities larger than 0.7 or smaller than 0.1 are eliminated, to avoid unnecessary computations. Then, each sequence with its remaining residues is recorded as a transaction in the transaction database that represents the family alignment. The first iteration of the algorithm starts with the generation of size-3 candidates based on the existing transactions. The candidates with average information content less than 0.5 are eliminated to obtain the size-3 motifs. Then candidates one size larger are generated, again based on the existing transactions, and the information content cutoff is applied to get the corresponding motifs. This is repeated until no new motifs are generated. Also, every time new motifs are recorded, their subsets among the smaller motifs are deleted. Finally, the item sets produced as a result of this modified Apriori approach are called informative motifs. Figure 2 depicts the flowchart of the proposed algorithm.
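As a compact reading aid, the sketch below mirrors the flow of Figure 2 under simplifying assumptions: transactions are frozensets of (position, residue) items, and information_content is the function from the earlier sketch. It is illustrative, not the authors' implementation:

from itertools import combinations

def informative_motifs(transactions, probs, cutoff=0.5):
    # Size-3 candidates are generated from the actual sequences rather than
    # by cross-product joins, so only candidates present in the data appear.
    candidates = {frozenset(c) for t in transactions for c in combinations(t, 3)}
    motifs, size = set(), 3
    while candidates:
        # Keep candidates whose per-residue information content passes the cutoff.
        kept = {c for c in candidates
                if information_content(sorted(c), probs) / size >= cutoff}
        # Every time new motifs are recorded, delete their subsets among
        # the smaller motifs found earlier.
        motifs = {m for m in motifs if not any(m < c for c in kept)} | kept
        # Grow candidates by one residue, again based only on the transactions.
        candidates = {c | {r} for c in kept for t in transactions
                      if c <= t for r in t - c}
        size += 1
    return motifs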
3 Discussion

We examined structural representations of discovered informative motifs for a number of protein families from the Pfam database (14). By examining numerous protein family alignments, we observed that discovered informative motifs mostly present spatially localized clusters of residues in the structure. Sometimes these motifs also reveal alternating preferences at certain positions of the family alignment. As an example, malic enzymes (malate oxidoreductases) catalyse the oxidative decarboxylation of malate to form pyruvate, a reaction important in a number of metabolic pathways; e.g., carbon dioxide released from the reaction may be used in sugar production during the Calvin cycle of photosynthesis (15). Two well-conserved regions of this enzyme are reported in the Pfam database as malic (accession no. PF00390) and Malic_M (accession no. PF03949). The Malic_M and malic families are composed of 49 and 69 sequences with sequence lengths of 324 and 199, respectively. It is not reasonable to represent such a large and diverse family with only one descriptive consensus. If we examine the family alignment even by eye, we will see that there are regions where a subset of sequences has different preferences over the rest. We applied the proposed informative motifs algorithm on these two family alignments, namely malic and Malic_M. Then we examined the coordinates of the member residues of the informative motifs for known PDB structures. We particularly compared the most distant members of the family, namely human malic enzyme (PDB ID: 1efl) and bacterial malic enzyme (PDB ID: 2a9f). We computed the average distance amongst the member residues of each motif. As shown in Figure 3, the average distance amongst motif residues is plotted against motif size. It is clearly seen that as the motif size increases, the average distance amongst the member residues decreases significantly. The average diameters of the malic and Malic_M domains are 45 Å and 50 Å respectively, while the average distance amongst the motif residues is at most around 10 Å in all cases. This suggests that informative motifs provide spatially localized patterns for this family. We also examined the distribution of the average distance amongst the residues of random motifs. We generated random motifs of sizes 3 to 20 (100 per motif size) for both the malic and Malic_M families. Then we plotted the average distances amongst the residues of these random motifs against the motif sizes (Figure 4). Interestingly, regardless of the motif size, the mean of the average distance amongst random motif residues stabilizes around 20 Å. This is twice the average distance amongst residues of informative motifs, confirming that the spatial localization of informative motifs is not a result of chance. Another interesting observation is that the informative motifs found in those two structures refer to quite different regions of the family; this might be further investigated to trace divergent evolution along the phylogeny. The Pfam ADK_lid (adenylate kinase, active site lid) domain (accession no. PF05191) is a neat example to address possible alternating preferences within a family alignment. Adenylate kinase presents a particular divergence in the active site lid: in gram-positive bacteria, residues in the lid domain have been mutated to cysteines, forming a structural homolog to zinc-finger domains (16). Although this divergence in the structure can easily be caught by eye as alternating patterns in the family alignment, common models used to describe protein families cannot detect such instances. Table 1 lists the informative motifs that are found in two different members of the family.
These motifs clearly demonstrate that these two structures have different preferences at family alignment positions 3, 6, 8, 24 and 27, by either having residues C.3, C.6, A.8, C.24, C.27 (as in PDB structure 1zip) or residues H.3, S.6, R.8, D.24, T.27 (as in PDB structure 2ak3).
Fig. 3. Informative Motifs. Average distance amongst motif residues is plotted against motif size for the Pfam malic family (A) and the Pfam Malic_M family (B). For both families, the right-hand side graph shows the results for informative motifs detected in human malic enzyme (PDB ID: 1efl) and the left-hand side graph shows the results for informative motifs detected in bacterial malic enzyme (PDB ID: 2a9f).
Fig. 4. Random Motifs. Average distance amongst residues of 100 random motifs is plotted against motif size for the Pfam malic family (A) and the Pfam Malic_M family (B) (♦). Averages over the 100 random motifs are also plotted (▪). For both families, the right-hand side graph shows the results for human malic enzyme (PDB ID: 1efl) and the left-hand side graph shows the results for bacterial malic enzyme (PDB ID: 2a9f).
Table 1. Informative motifs discovered in two different members of the ADK_lid family (PDB structures 2ak3 and 1zip) reveal alternative preferences at family alignment positions 3, 6, 8, 24 and 27.
Informative motifs for the ADK_lid family

PDB: 2ak3
Informative Motifs                 Average Distance Amongst Residues (Å)
D.24,T.27,E.30,E.36                9.8
P.4,S.6,R.8,I.23                   5.8
D.24,T.27,E.30,P.31,V.33           4.4
H.3,P.4,S.6,R.8,E.14               7.4
H.3,P.4,S.6,R.8,V.33               3.9
H.3,S.6,R.8,V.10,E.36              7
H.3,S.6,R.8,V.10,N.16              5.3
H.3,S.6,R.8,V.10,N.12,D.24         5.1
H.3,S.6,R.8,V.10,N.12,E.30         6.1
H.3,S.6,R.8,V.10,N.12,P.31         5.9
H.3,S.6,R.8,V.10,N.12,T.27         6.2
I.2,H.3,P.4,S.6,R.8,D.24           3.5
I.2,H.3,P.4,S.6,R.8,E.30           3.9
I.2,H.3,P.4,S.6,R.8,N.16           5.3
I.2,H.3,P.4,S.6,R.8,P.31           3.8
I.2,H.3,P.4,S.6,R.8,Q.34           4
I.2,H.3,P.4,S.6,R.8,T.27           4.3

PDB: 1zip
Informative Motifs                 Average Distance Amongst Residues (Å)
C.6,A.8,C.27                       4.3
I.2,C.3,L.13                       5.8
R.1,I.2,C.24                       3.9
R.1,I.2,C.3,C.27                   5.1
R.1,I.2,C.3,C.6                    3.6
R.1,I.2,C.3,E.31                   3.1
4 Conclusion

In this paper, we introduced a new algorithm to extract informative motifs from protein family alignments. Slight but sensible modifications of the classic Apriori algorithm allowed us to discover informative motifs that describe variant residue patterns existing in protein family alignments outside of the consensus. We studied numerous protein family alignments by examining structural representations of the discovered informative motifs. We observed that informative motifs discovered in a family alignment mostly present spatially localized clusters of residues in the structure and manifest alternating preferences amongst the members of the family.
References

1. Gribskov, M., McLachlan, A.D., Eisenberg, D.: Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. USA 84, 4355–4358 (1987)
2. Gribskov, M., Luthy, R., Eisenberg, D.: Profile analysis. Methods in Enzymology 183, 146 (1990)
3. Baldi, P., Chauvin, Y., Hunkapiller, T., McClure, M.A.: Hidden Markov models of biological primary sequence information. Proc. Natl. Acad. Sci. 91(3), 1059–1063 (1994)
4. Eddy, S., Mitchison, G., Durbin, R.: Maximum discrimination hidden Markov models of sequence consensus. J. Comput. Biol. 2, 9–23 (1995)
5. Krogh, A., Brown, M., Mian, I.S., Sjolander, K., Haussler, D.: Hidden Markov models in computational biology: Applications to protein modeling. J. Mol. Biol. 235, 1501–1531 (1994)
6. Schneider, T.D., Stephens, R.M.: Sequence logos: a new way to display consensus sequences. Nuc. Acids Res. 18(20), 6097–6100 (1990)
7. Halperin, I., Wolfson, H., Nussinov, R.: Correlated mutations: Advances and limitations. A study on fusion proteins and on the Cohesin-Dockerin family. Proteins 63, 832–845 (2006)
8. Valdar, W.S.J.: Scoring residue conservation. Proteins 48, 227–241 (2002)
9. Ray, W.C.: MAVL and StickWRLD: visually exploring relationships in nucleic acid sequence alignments. Nucleic Acids Res. 32, W59–W63 (2004)
10. Ray, W.C.: MAVL/StickWRLD for protein: visualizing protein sequence families to detect non-consensus features. Nucleic Acids Res. 33, W315–W319 (2005)
11. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: VLDB'94, Santiago, Chile, pp. 487–499 (1994)
12. Shannon, C.E.: A mathematical theory of communication. Bell Sys. Tech. J. 27, 379–423, 623–656 (1948)
13. Dunham, M.: Association Rules. In: Data Mining: Introductory and Advanced Topics, pp. 164–191. Prentice-Hall, Englewood Cliffs (2002)
14. Finn, R.D., Mistry, J., Schuster-Böckler, B., Griffiths-Jones, S., Hollich, V., Lassmann, T., Moxon, S., Marshall, M., Khanna, A., Durbin, R., Eddy, S.R., Sonnhammer, E.L.L., Bateman, A.: Pfam: clans, web tools and services. Nucleic Acids Research 34(Database issue), D247–D251 (2006)
15. Long, J.J., Wang, J.L., Berry, J.O.: Cloning and analysis of the C4 photosynthetic NAD-dependent malic enzyme of amaranth mitochondria. J. Biol. Chem. 269(4), 2827–2833 (1998)
16. Berry, M., Phillips Jr., G.N.: Crystal structures of Bacillus stearothermophilus adenylate kinase with bound Ap5A, Mg2+ Ap5A, and Mn2+ Ap5A reveal an intermediate lid position and six-coordinate octahedral geometry for bound Mg2+ and Mn2+. Prot. Str. Func. Gen. 32, 276–288 (1998)
Topology Independent Protein Structural Alignment

Joe Dundas¹, T.A. Binkowski¹, Bhaskar DasGupta², and Jie Liang¹

¹ Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607-7052
² Department of Computer Science, University of Illinois at Chicago, Chicago, Illinois 60607-7053
[email protected], [email protected]

Partially supported by NSF grant IIS-0346973. Corresponding author. Supported by NSF grants IIS-0346973, IIS-0612044 and DBI-0543365. Supported by NSF grant DBI-0133856 and NIH grants GM68958 and GM079804.
Abstract. Protein structural alignment is an indispensable tool used for many different studies in bioinformatics. Most structural alignment algorithms assume that the structural units of two similar proteins will align sequentially. This assumption may not hold for all similar proteins, and as a result, proteins with similar structure but with permuted sequence arrangement are often missed. We present a solution to the problem based on an approximation algorithm that finds a sequence-order independent structural alignment that is close to optimal. We first exhaustively fragment two proteins and calculate a novel similarity score between all possible aligned fragment pairs. We treat each aligned fragment pair as a vertex on a graph. Vertices are connected by an edge if there are intra-residue sequence conflicts. We regard the realignment of the fragment pairs as a special case of the maximum-weight independent set problem and solve this computationally intensive problem approximately by iteratively solving relaxations of an appropriate integer programming formulation. The resulting structural alignment is sequence-order independent. Our method is insensitive to gaps, insertions/deletions, and circular permutations.
1 Introduction
The classification of protein structures often depends on the topology of secondary structural elements. For example, the Structural Classification of Proteins (SCOP) classifies protein structures into common folds using the topological arrangement of secondary structural units [16]. Most protein structural alignment methods can reliably classify proteins into similar folds given that the structural units from each protein are in the same sequential order. However, the evolutionary possibility of proteins with different structural topology but with
similar spatial arrangement of their secondary structures poses a problem. One such possibility is the circular permutation. A circular permutation is an evolutionary event that results in the N and C termini transferring to a different position on a protein. Figure 1 shows a simplified example of circular permutation: there are three proteins, all consisting of three domains (A, B, and C). Although the spatial arrangement of the three domains is very similar, the ordering of the domains in the primary sequence has been circularly permuted. Lindqvist et al. (1997) observed the first natural occurrence of a circular permutation, between jackbean concanavalin A and favin. Although the jackbean-favin permutation was the result of post-translational ligation of the N and C termini and cleavage elsewhere in the chain, a circular permutation can arise from events at the gene level through gene duplication and exon shuffling.

Fig. 1. The cartoon illustration of three protein structures whose domains are similarly arranged in space but appear in different order in primary sequences. The locations of domains A, B, C in primary sequences are shown in a layout below each structure. Their orderings are related by circular permutation.

Permutation by duplication [18] is a widely accepted model where a gene first duplicates and fuses. After fusion, a new start codon is inserted into one gene copy while a new stop codon is inserted into the second copy. Peisajovich et al. demonstrated the evolutionary feasibility of permutation via duplication by creating functional intermediates at each step of the permutation by duplication model for DNA methyltransferases [17]. Identifying structurally similar proteins with different chain topologies, including circular permutation, can aid studies in homology modeling, protein folding, and protein design. An algorithm that can structurally align two proteins independent of their backbone topologies would therefore be an important tool. The biological implications of thermodynamically stable and biologically functional circular permutations, both natural and artificial, have resulted in much interest in detecting circular permutations in proteins [6, 20, 10, 12]. The more general problem of detecting non-topological structural similarities beyond circular permutation has received less attention; we refer to these as non-cyclic permutations from now on. Tabtiang et al. were able to create a thermodynamically stable and biologically functional non-cyclic permutation, indicating that non-cyclic permutations may be as important as circular permutations [21]. In this study, we present a novel method that detects spatially similar structures and can identify structures related by circular and more complex non-cyclic permutations. Detection of non-cyclic permutations is possible by virtue of a recursive combination of a local-ratio approach with a global linear-programming formulation.
This work incorporates several major improvements and new results over the short paper in [5]. First, the algorithm has been improved so that the number of aligned residues in an alignment is significantly increased, without compromise in RMSD values. Second, we have developed a new similarity score for a pair of aligned structures; it incorporates a correction for the alignment length and gives more reliable results. Third, we have developed a method to estimate the statistical significance of a structural alignment by calculating the p-value of a similarity score. Finally, the overall running time is significantly improved, and we are able to report results of a large-scale exhaustive search for circularly permuted proteins in the PDB database. This includes the discovery of three previously unknown circularly permuted proteins. In addition, we also report the discovery of a new non-cyclically permuted protein; to our knowledge, this is the first reported naturally occurring non-cyclic permutation between two structures. The rest of the paper is organized as follows. We first show that our algorithm is capable of finding known circular permutations with sensitivity and specificity. We then report the discovery of three new circular permutations and one example of a non-cyclic permutation that, to our knowledge, have not been reported in the literature. We conclude with remarks and discussions.
2 Method
In this study, we describe a new algorithm that can align two protein structures or substructures independent of the connectivity of their secondary structure elements. We first exhaustively fragment the two proteins separately. An approximation algorithm based on a fractional version of the local-ratio approach for scheduling split-interval graphs [3] is then used to search for the combination of peptide fragments from both structures that will optimize the global alignment of the two structures.

2.1 Basic Definitions and Notations
The following definitions and notations are used uniformly throughout the paper unless otherwise stated. Protein structures are denoted by $S_a, S_b, \ldots$. A substructure $\lambda^a_{i,k}$ of a protein structure $S_a$ is a continuous fragment, where $i$ is the residue index of the beginning of the substructure and $k$ is the length (number of residues) of the substructure. We will denote such a substructure simply by $\lambda^a$ if $i$ and $k$ are clear from the context or irrelevant. A residue $a_t \in S_a$ is part of a substructure $\lambda^a_{i,k}$ if $i \leq t \leq i+k-1$. $\Lambda^a$ is the set of all continuous substructures or fragments of protein structure $S_a$ under consideration in our algorithm. $\chi_{i,j,k}$ (or simply $\chi$ when the other parameters are understood from the context) denotes an ordered pair $(\lambda^a_{i,k}, \lambda^b_{j,k})$ of equal-length substructures of two protein structures $S_a$ and $S_b$. Two ordered pairs of substructures
$(\lambda^a_{i,k}, \lambda^b_{j,k})$ and $(\lambda^a_{i',k'}, \lambda^b_{j',k'})$ are called inconsistent if and only if at least one of the pairs of substructures $\{\lambda^a_{i,k}, \lambda^a_{i',k'}\}$ and $\{\lambda^b_{j,k}, \lambda^b_{j',k'}\}$ is not disjoint. We can now formalize our substructure similarity identification problem as follows. We call it the Basic Substructure Similarity Identification (BSSI$_{\Lambda,\sigma}$) problem. An instance of the problem is a set $\Lambda = \{\chi_{i,j,k} \mid i,j,k \in \mathbb{N}\} \subset \Lambda^a \times \Lambda^b$ of ordered pairs of equal-length substructures of $S_a$ and $S_b$ and a similarity function $\sigma : \Lambda \to \mathbb{R}^+$ mapping each pair of substructures to a positive similarity value. The goal is to find a set of substructure pairs $\{\chi_{i_1,j_1,k_1}, \chi_{i_2,j_2,k_2}, \ldots, \chi_{i_t,j_t,k_t}\}$ that are mutually consistent and maximize the total similarity of the selection, $\sum_{\ell=1}^{t} \sigma(\chi_{i_\ell,j_\ell,k_\ell})$.

2.2 An Algorithm Based on the Local-Ratio Approach
The BSSI$_{\Lambda,\sigma}$ problem is a special case of the well-known maximum weight independent set problem in graph theory. In fact, BSSI$_{\Lambda,\sigma}$ itself is MAX-SNP-hard even when all the substructures are restricted to have lengths at most 2 [3, Theorem 2.1]. Our approach is to adopt the approximation algorithm for scheduling split-interval graphs [3], which itself is based on a fractional version of the local-ratio approach. For any subset $\Delta \subseteq \Lambda$, the conflict graph $G_\Delta = (V_\Delta, E_\Delta)$ is the graph in which $V_\Delta = \{\chi \mid \chi \in \Delta\}$ and $E_\Delta = \{\{\chi, \chi'\} \mid \chi, \chi' \in \Delta$ and the pair $\{\chi, \chi'\}$ is not consistent$\}$.

Definition 1. The closed neighborhood $\mathrm{Nbr}_\Delta[\chi]$ of a vertex $\chi$ of $G_\Delta$ is $\{\chi' \mid \{\chi, \chi'\} \in E_\Delta\} \cup \{\chi\}$.

For an instance of BSSI$_{\Delta,\sigma}$ with $\Delta \subseteq \Lambda$ we introduce three types of indicator variables as follows. For every $\chi = (\lambda^a, \lambda^b) \in \Delta$, we introduce three indicator variables $x_\chi$, $y_\chi^{\lambda^a}$ and $y_\chi^{\lambda^b} \in \{0, 1\}$. $x_\chi$ indicates whether the substructure pair should be used ($x_\chi = 1$) or not ($x_\chi = 0$) in the final alignment. $y_\chi^{\lambda^a}$ and $y_\chi^{\lambda^b}$ are artificial selection variables for $\lambda^a$ and $\lambda^b$ that allow us to encode consistency of the selected substructures in a way that guarantees good approximation bounds. We initialize $\Delta = \Lambda$. Then, the following algorithm is executed:

1. Solve the following LP relaxation of a corresponding integer programming formulation of BSSI$_{\Delta,\sigma}$:
maximize $\sum_{\chi \in \Delta} \sigma(\chi) \cdot x_\chi$   (1)

subject to

$\sum_{a_t \in \lambda^a \in \Lambda^a} y_\chi^{\lambda^a} \leq 1 \quad \forall a_t \in S_a$   (2)

$\sum_{b_t \in \lambda^b \in \Lambda^b} y_\chi^{\lambda^b} \leq 1 \quad \forall b_t \in S_b$   (3)

$y_\chi^{\lambda^a} - x_\chi \geq 0 \quad \forall \chi \in \Delta$   (4)

$y_\chi^{\lambda^b} - x_\chi \geq 0 \quad \forall \chi \in \Delta$   (5)

$x_\chi,\, y_\chi^{\lambda^a},\, y_\chi^{\lambda^b} \geq 0 \quad \forall \chi \in \Delta$   (6)
2. For every vertex $\chi \in V_\Delta$ of $G_\Delta$, compute its local conflict number $\alpha_\chi = \sum_{\chi' \in \mathrm{Nbr}_\Delta[\chi]} x_{\chi'}$. Let $\chi_{\min}$ be the vertex with the minimum local conflict number. Define a new similarity function $\sigma_{new}(\chi) = \sigma(\chi)$ if $\chi \notin \mathrm{Nbr}_\Delta[\chi_{\min}]$, and $\sigma_{new}(\chi) = \sigma(\chi) - \sigma(\chi_{\min})$ otherwise.
3. Create $\Delta_{new} \subseteq \Delta$ by removing from $\Delta$ every substructure pair $\chi$ such that $\sigma_{new}(\chi) \leq 0$. Push each removed substructure pair onto a stack in arbitrary order.
4. If $\Delta_{new} \neq \emptyset$ then set $\Delta = \Delta_{new}$, $\sigma = \sigma_{new}$ and go to Step 1. Otherwise, go to Step 5.
5. Repeatedly pop the stack, adding the substructure pair to the alignment as long as the following conditions are met:
   - The substructure pair is consistent with all other substructure pairs that already exist in the selection.
   - The cRMSD of the alignment does not change by more than a threshold. This condition bridges the gap between optimizing a local similarity between substructures and optimizing the tertiary similarity of the alignment by guaranteeing that each substructure from a substructure pair is in the same spatial arrangement in the global alignment.

In the implementation, the graph $G_\Delta$ is considered implicitly via intersecting intervals, and the interval clique inequalities can be generated via a sweepline approach. The running time depends on the number of iterations needed to solve the LP formulations. Let LP$(n, m)$ denote the time taken to solve a linear programming problem on $n$ variables and $m$ inequalities. Then the worst-case running time of the above algorithm is $O(|\Lambda| \cdot \mathrm{LP}(3|\Lambda|, 5|\Lambda| + |\Lambda^a| + |\Lambda^b|))$. However, the worst-case time complexity arises under the excessively pessimistic assumption that each iteration removes exactly one vertex of $G_\Lambda$, namely $\chi_{\min}$ only, from consideration, which is unlikely to occur in practice, as our computational results show. A theoretical pessimistic estimate of the performance ratio of our algorithm can be obtained as follows. Let $\alpha$ be the maximum of all the $\alpha_{\chi_{\min}}$'s over all iterations. The proofs in [3] translate to the fact that the algorithm returns a solution whose total similarity is at least $1/\alpha$ times that of the optimum and, if the second condition of Step 5 is omitted, then $\alpha \leq 4$. The value of $\alpha$ even with that condition is much smaller than 4 in practice (e.g. $\alpha = 2.89$). Due to lack of space we provide the implementation details of our algorithmic approach in a full version of the paper. We just note here that the linear programming problem is solved using the BPMPD package [14] and, to improve computational efficiency, only the top-scoring 1200 substructure pairs are initially used in our algorithm.
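For concreteness, the following condensed sketch shows the shape of Steps 1-5 under simplifying assumptions: fragment pairs are (i, j, k) triples, the LP keeps only the x variables with per-residue packing constraints (the y variables of (2)-(6) are omitted), and the cRMSD-change test of Step 5 is skipped. All names are illustrative; this is not the BPMPD-based implementation described above.

from scipy.optimize import linprog

def consistent(p, q):
    """Pairs are consistent iff their fragments are disjoint in both proteins."""
    (i1, j1, k1), (i2, j2, k2) = p, q
    return (i1 + k1 <= i2 or i2 + k2 <= i1) and \
           (j1 + k1 <= j2 or j2 + k2 <= j1)

def local_ratio_select(pairs, scores, len_a, len_b):
    delta, score, stack = list(pairs), dict(zip(pairs, scores)), []
    while delta:
        # Step 1: LP relaxation -- maximize sum of score * x subject to
        # "fragments covering a residue contribute at most 1" per residue.
        rows, rhs = [], []
        for prot, length in ((0, len_a), (1, len_b)):
            for t in range(length):
                row = [1.0 if p[prot] <= t < p[prot] + p[2] else 0.0
                       for p in delta]
                if any(row):
                    rows.append(row); rhs.append(1.0)
        x = linprog([-score[p] for p in delta], A_ub=rows, b_ub=rhs,
                    bounds=(0.0, 1.0), method="highs").x
        # Step 2: vertex with minimum local conflict number alpha.
        idx = {p: t for t, p in enumerate(delta)}
        def alpha(p):
            return sum(x[idx[q]] for q in delta
                       if q == p or not consistent(p, q))
        p_min = min(delta, key=alpha)
        dec = score[p_min]  # capture before the update touches p_min itself
        for q in delta:     # local-ratio update over the closed neighborhood
            if q == p_min or not consistent(p_min, q):
                score[q] -= dec
        # Steps 3-4: stack non-positive pairs and iterate on the rest.
        stack += [p for p in delta if score[p] <= 0]
        delta = [p for p in delta if score[p] > 0]
    # Step 5: pop the stack, keeping pairs consistent with the selection.
    selection = []
    while stack:
        p = stack.pop()
        if all(consistent(p, q) for q in selection):
            selection.append(p)
    return selection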
3 Similarity Score σ
The similarity score $\sigma(\chi_{i,j,k})$ between two aligned substructures $\lambda^a_{i,k}$ and $\lambda^b_{j,k}$ is a weighted sum of a shape similarity measure derived from the cRMSD value,
which is then modified for the secondary structure content of the aligned substructure pairs, and a sequence composition score (SCS). Here cRMSD values are the coordinate root mean square distances, i.e., the square root of the mean of squares of Euclidean distances between the coordinates of corresponding $C_\alpha$ atoms.

cRMSD scaling by secondary structure content. We scale the cRMSD according to the secondary structure composition of the two substructures ($\lambda^a$ and $\lambda^b$) that compose the substructure pair $\chi$. We extracted 1,000 α-helices of length 4-7 (250 of each length) at random from protein structures contained in PDBSELECT 25% [8]. We exhaustively aligned helices of equal length and obtained the cRMSD distributions shown in Figure 2(a-d). We then exhaustively aligned equal-length β-strands (length 4-7) from a set of 1,000 (250 of each length) strands randomly extracted from protein structures in PDBSELECT 25% [8] and obtained the distributions shown in Figure 2(e-h). For each length, the mean cRMSD value of the strands is approximately two times larger than the mean cRMSD of the helices. Therefore, we introduce the following empirical scaling factor to modify the cRMSD of the aligned substructure pairs and remove bias due to different secondary structure content:

$$s(\lambda^a, \lambda^b) = \sum_{i=1}^{N} \frac{\delta(A_{a,i}, A_{b,i})}{N}, \qquad \delta(A_{a,i}, A_{b,i}) = \begin{cases} 2, & \text{if residues } A_{a,i} \text{ and } A_{b,i} \text{ are both helix,} \\ 1, & \text{otherwise,} \end{cases}$$

We use DSSP [11] to assign secondary structure to the residues of each protein.
Sequence composition. The score for sequence composition SCS is defined as $SCS = \sum_{i=1}^{k} B(A_{a,i}, A_{b,i})$, where $A_{a,i}$ and $A_{b,i}$ are the amino acid residue types at aligned position $i$. $B(A_{a,i}, A_{b,i})$ is the similarity score between $A_{a,i}$ and $A_{b,i}$ based on a modified BLOSUM50 matrix, in which a constant is added to all entries such that the smallest entry is 1.0.

Combined similarity score. The combined similarity score $\sigma(\chi)$ of two aligned substructures is calculated as follows:

$$\sigma(\chi_{i,j,k}) = \alpha\left[C - s(\lambda^a, \lambda^b) \cdot \frac{cRMSD}{k^2}\right] + SCS. \qquad (7)$$

In the current implementation, $\alpha$ and $C$ are empirically set to 100 and 2, respectively.

Similarity score for aligned molecules. The output of the above algorithm is a set of aligned substructure pairs $X = \{\chi_1, \chi_2, \ldots, \chi_m\}$ that maximizes Equation (1). The alignment $X$ of two structures is scored following Equation (7) by treating $X$ as a single discontinuous fragment pair:

$$\sigma(X) = \alpha\left[C - s(X) \cdot \frac{cRMSD}{N_X^2}\right] + SCS. \qquad (8)$$

In this case $k = N_X$, where $N_X$ is the total number of aligned residues.
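Equations (7) and (8) are straightforward to compute once the scaling factor and the shifted BLOSUM50 lookup are in hand. A minimal sketch, with assumed inputs (DSSP-style secondary-structure labels, a blosum_shifted dictionary of our own naming) and the α = 100, C = 2 values quoted above:

def scaling_factor(ss_a, ss_b):
    # delta = 2 when both residues are helix ('H'), 1 otherwise; s is the mean.
    return sum(2 if sa == 'H' and sb == 'H' else 1
               for sa, sb in zip(ss_a, ss_b)) / len(ss_a)

def similarity(crmsd, ss_a, ss_b, seq_a, seq_b, blosum_shifted,
               alpha=100.0, C=2.0):
    # Equation (7); for a whole alignment set k = N_X as in equation (8).
    k = len(seq_a)
    scs = sum(blosum_shifted[(a, b)] for a, b in zip(seq_a, seq_b))
    return alpha * (C - scaling_factor(ss_a, ss_b) * crmsd / k**2) + scs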
3.1 Statistical Significance
To investigate the effect that the size of the proteins being aligned has on our similarity score, we randomly aligned 200,000 protein pairs from PDBSELECT 25% [8].

Fig. 2. The cRMSD distributions of a) helices of length 4, b) helices of length 5, c) helices of length 6, d) helices of length 7, e) strands of length 4, f) strands of length 5, g) strands of length 6, and h) strands of length 7.

Fig. 3. a) Linear fit between the raw similarity score $\sigma(X)$ (equation 8) as a function of the geometric mean $\sqrt{N_a \cdot N_b}$ of the lengths of the two aligned proteins ($N_a$ and $N_b$ are the number of residues in the two protein structures $S_a$ and $S_b$). The linear regression line (grey line) has a slope of 0.314. b) Linear fit of the normalized similarity score $\tilde{\sigma}(X)$ (equation 9) as a function of the geometric mean of the lengths of the two aligned proteins. The linear regression line (grey line) has a slope of $-0.0004$.

Figure 3a shows the similarity score $\sigma(X)$ (equation 8) as a function of the geometric mean $\sqrt{N_a \cdot N_b}$ of the two aligned structure lengths, where $N_a$ and $N_b$ are the number of residues in $S_a$ and $S_b$, respectively. The regression line (grey line) has a slope of 0.314, indicating that $\sigma(X)$ is not ideal for determining the significance of an alignment because larger proteins produce higher similarity scores. This is corrected by a simple normalization scheme:
$$\tilde{\sigma}(X) = \frac{\sigma(X)}{N_X}, \qquad (9)$$

where $N_X$ is the number of equivalent residues in the alignment. Figure 3b shows the normalized similarity score as a function of the geometric mean of the aligned protein lengths. The regression line (grey line) has a negligible slope of $-4.0 \times 10^{-4}$. In addition, the distribution of the normalized score $\tilde{\sigma}(X)$ can be approximated by an extreme value distribution (EVD) (Figure 4). This allows us to compute the statistical significance of an alignment given its score [1, 4].
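In code terms, the p-value computation amounts to fitting an EVD to the background of normalized scores and taking the upper tail. A hedged sketch using scipy's right-skewed Gumbel distribution (the paper's α, β parameterization need not coincide with scipy's loc/scale, and the data below are a synthetic stand-in):

import numpy as np
from scipy.stats import gumbel_r

# Stand-in background: normalized scores from random alignments. In practice
# these would come from the 200,000 random PDBSELECT 25% alignments (Figure 4).
background = np.random.gumbel(loc=14.98, scale=3.89, size=200_000)
loc, scale = gumbel_r.fit(background)

observed = 30.0  # normalized similarity of a query alignment (illustrative)
p_value = gumbel_r.sf(observed, loc=loc, scale=scale)  # P(random >= observed)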
4 Results

4.1 Discovery of Novel Circular Permutation and Novel Non-cyclic Permutation
In the Appendix, we demonstrate the ability of our algorithm to detect circular permutations by examining known examples of circular permutations. The effectiveness of our method is also demonstrated by the discovery of previously unknown circular permutations. In an attempt to test our algorithm's ability to discover new circular permutations, we structurally aligned a subset of 3,336 structures from PDBSELECT 90% [8]. We first selected proteins from PDBSELECT90 (sequences have less than 90% identity) whose N and C termini were no further than 30 Å apart. From this subset of 3,336 proteins, we aligned two proteins if they met the following conditions: the difference in their lengths was no more than 75 residues, and they had approximately the same secondary structure content. To compare secondary structure content, we determined the percentage of the residues labelled as helix, strand, and other for each structure. Two structures were considered to have the same secondary structure content if the difference for each secondary structure label was less than 10%. Within the approximately 200,000 alignments, we found 426 candidate circular permutations. Of these circular permutations, 312 were symmetric proteins that can be aligned with or without a circular permutation. Of the 114 non-symmetric circular permutations, 112 were already known in the literature, and 3 are novel. We describe one novel circular permutation as well as one novel non-cyclic permutation in some detail. The newly discovered circular permutation between migration inhibition factor and arginine repressor, which involves an additional strand swapping, is described in the Appendix.

Fig. 4. The distribution of the normalized similarity scores obtained by aligning 200,000 pairs of proteins randomly selected from PDBSELECT 25% [8]. The distribution can be fit to an Extreme Value Distribution, with parameters α = 14.98 and β = 3.89.

Nucleoplasmin-core and auxin binding protein. The first novel circular permutation we found was between the nucleoplasmin-core protein in Xenopus laevis (PDB ID 1k5j, chain E) and the auxin binding protein in maize (PDB ID 1lrh, chain A, residues 37 through 127). The overall structural alignment between 1k5jE (Figure 5a, top) and 1lrhA (Figure 5a, bottom) has an RMSD value of 1.36 Å with an alignment length of 68 residues and a significant p-value of $2.7 \times 10^{-5}$ after Bonferroni correction. These proteins are related by a circular permutation.
Fig. 5. A new circular permutation discovered between nucleoplasmin-core (1k5j, chain E, top panel) and the fragment of residues 37-127 of auxin binding protein 1 (1lrh, chain A, bottom panel). a) These two proteins superimpose well spatially, with an RMSD value of 1.36 Å for an alignment length of 68 residues and a significant p-value of $2.7 \times 10^{-5}$ after Bonferroni correction. b) These proteins are related by a circular permutation. The short loop connecting strand 4 and strand 5 of nucleoplasmin-core (in rectangle, top) becomes disconnected in auxin binding protein 1. The N- and C-termini of nucleoplasmin-core (in ellipse, top) become connected in auxin binding protein 1 (in ellipse, bottom). For visualization, residues in the N-to-C direction before the cut in the nucleoplasmin-core protein are colored red, and residues after the cut are colored blue. c) The topology diagrams of these two proteins. In the original structure of nucleoplasmin-core, the electron density of the loop connecting strand 4 and strand 5 is missing.
The short loop connecting two antiparallel strands in the nucleoplasmin-core protein (in ellipse, top of Fig 5b) becomes disconnected in auxin binding protein 1 (in ellipse, bottom of Fig 5b), and the N- and C-termini of the nucleoplasmin-core protein (in square, top of Fig 5b) are connected in auxin binding protein 1 (square, bottom of Fig 5b). The novel circular permutation between aspartate racemase and type II 3-dehydrogenate dehydratase is described in detail in the Appendix.

Beyond Circular Permutation. The information that naturally occurring circular permutations carry about the folding mechanism of proteins has led to much interest in their detection. However, there has been little work on the detection of non-cyclically permuted proteins. As an example of this important class of topologically permuted proteins, Tabtiang et al. (2004) were able to artificially create a non-cyclic permutation of the Arc repressor that is thermodynamically stable, refolds on the sub-millisecond time scale, and binds operator DNA with nanomolar affinity [21]. This raises the question of whether these non-cyclic permutations can arise naturally. Here we report the discovery of a naturally occurring non-cyclic permutation between chain F of AML1/Core Binding Factor (AML1/CBF, PDB ID 1e50, Figure 6, top) and chain A of riboflavin synthase (PDB ID 1pkv, Figure 6a, bottom).
Fig. 6. A novel non-cyclic permutation discovered between AML1/Core Binding Factor (AML1/CBF, PDB ID 1e50, chain F, top) and riboflavin synthase (PDB ID 1pkv, chain A, bottom). a) These two proteins superimpose well spatially, with an RMSD of 1.23 Å and an alignment length of 42 residues, with a significant p-value of $2.8 \times 10^{-4}$ after Bonferroni correction. Aligned residues are colored blue. b) These proteins are related by multiple permutations. The steps to transform the topology of AML1/CBF (top) into that of riboflavin synthase (bottom) are as follows: c) Remove the loops connecting strand 1 to helix 2, strand 4 to strand 5, and strand 5 to strand 6; d) Connect the C-terminal end of strand 4 to the original N-terminus; e) Connect the C-terminal end of strand 5 to the N-terminal end of helix 2; f) Connect the original C-terminus to the N-terminal end of strand 5. The N-terminal end of strand 6 becomes the new N-terminus and the C-terminal end of strand 1 becomes the new C-terminus. We now have the topology diagram of riboflavin synthase.
The two structures align well, with an RMSD of 1.23 Å and an alignment length of 42 residues, and a significant p-value of $2.8 \times 10^{-4}$ after Bonferroni correction. The topology diagram of AML1/CBF (Figure 6b) can be transformed into the topology diagram of riboflavin synthase (Figure 6f) by the following steps. Remove the loops connecting strand 1 to helix 2, strand 4 to strand 5, and strand 5 to strand 6 (Figure 6c). Connect the C-terminal end of strand 4 to the original N-terminus (Figure 6d). Connect the C-terminal end of strand 5 to the N-terminal end of helix 2 (Figure 6e). Connect the original C-terminus to the N-terminal end of strand 5. The N-terminal end of strand 6 becomes the new N-terminus and the C-terminal end of strand 1 becomes the new C-terminus (Figure 6f).
5 Conclusion
The approximation algorithm introduced in this work can find good solutions for the problem of protein structure alignment. Furthermore, this algorithm can detect topological differences between two spatially similar protein structures. The alignment between MIF and the arginine repressor demonstrates our algorithm’s
ability to detect structural similarities even when spatial rearrangement of structural units has occurred. In addition, we report in this study the first example of a naturally occurring non-cyclically permuted protein, between AML1/Core Binding Factor chain F and riboflavin synthase chain A. In our method, the scoring function plays a pivotal role in detecting substructure similarity of proteins. We expect that future experimentation on optimizing the parameters used in our similarity scoring system can improve the detection of topologically independent structural alignments. In this study, we were able to fit our scoring system to an Extreme Value Distribution (EVD), which allowed us to perform an automated search for circularly permuted proteins. Although the p-value obtained from our EVD fit is sufficient for determining the biological significance of a structural alignment, the structural change between the macrophage migration inhibition factor and the C-terminal domain of the arginine repressor indicates a need for a similarity score that does not bias heavily towards the cRMSD measure when scoring circular permutations. Whether naturally occurring circular permutations are frequent events in the evolution of protein genes is currently an open question. Lindqvist et al. (1997) pointed out that when the primary sequences have diverged beyond recognition, circular permutations may still be found using structural methods [12]. In this study, we discovered three examples of novel circularly permuted protein structures and a non-cyclic permutation among 200,000 protein structural alignments for a set of 3,336 non-redundant proteins. This is an incomplete study, as we restricted it to proteins whose N- and C-termini were less than 30 Å apart. We plan to relax the N-to-C distance constraint and include more proteins in future work to expand the scope of the investigation.
References

1. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997)
2. Arora, S., Lund, C., Motwani, R., Sudan, M., Szegedy, M.: Proof verification and hardness of approximation problems. Journal of the ACM 45(3), 501–555 (1998)
3. Bar-Yehuda, R., Halldorsson, M.M., Naor, J., Shachnai, H., Shapira, I.: Scheduling split intervals. In: 14th ACM-SIAM SODA, pp. 732–741. ACM Press, New York (2002)
4. Binkowski, T.A., Adamian, L., Liang, J.: Inferring functional relationship of proteins from local sequence and spatial surface patterns. J. Mol. Biol. 332, 505–526 (2003)
5. Binkowski, T.A., DasGupta, B., Liang, J.: Order independent structural alignment of circularly permutated proteins. In: EMBS 2004, pp. 2781–2784 (2004)
6. Chen, L., Wu, L., Wang, Y., Zhang, S., Zhang, X.: Revealing divergent evolution, identifying circular permutations and detecting active-sites by protein structure comparison. BMC Struct. Biol. 6, 18 (2006)
7. Hermoso, J.A., Monterroso, B., Albert, A., Galan, B., Ahrazem, O., Garcia, P., Martinez-Ripoll, M., Garcia, J.L., Menendez, M.: Structural basis for selective recognition of pneumococcal cell wall by modular endolysin from phage Cp-1. Structure 11, 1239 (2003)
8. Hobohm, U., Sander, C.: Enlarged representative set of protein structures. Protein Science 3, 522 (1994)
9. Holm, L., Park, J.: DaliLite workbench for protein structure comparison. Bioinformatics 16, 566–567 (2000)
10. Jung, J., Lee, B.: Protein structure alignment using environmental profiles. Prot. Eng. 13(8), 535–543 (2000)
11. Kabsch, W., Sander, C.: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637 (1983)
12. Lindqvist, Y., Schneider, G.: Circular permutations of natural protein sequences: structural evidence. Curr. Opinions Struct. Biol. 7, 422–427 (1997)
13. Liu, L., Iwata, K., Yohada, M., Miki, K.: Structural insight into gene duplication, gene fusion and domain swapping in the evolution of PLP-independent amino acid racemases. FEBS Lett. 528, 114–118 (2002)
14. Meszaros, C.S.: Fast Cholesky factorization for interior point methods of linear programming. Comp. Math. Appl. 31, 49–51 (1996)
15. Mizuguchi, K., Deane, C.M., Blundell, T.L., Overington, J.P.: HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci. 7, 2469–2471 (1998)
16. Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C.: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540 (1995)
17. Peisajovich, S.G., Rockah, L., Tawfik, D.S.: Evolution of new protein topologies through multistep gene rearrangements. Nature Genetics 38, 168–173 (2006)
18. Ponting, C.P., Russell, R.B.: Swaposins: circular permutations within genes encoding saposin homologues. Trends Biochem Sci. 20, 179–180 (1995)
19. Suzuki, M., Takamura, Y., Maeno, M., Tochinai, S., Iyaguchi, D., Tanaka, I., Nishihira, J., Ishibashi, T.: Xenopus laevis Macrophage Migration Inhibitory Factor is essential for axis formation and neural development. J. Biol. Chem. 279, 21406–21414 (2004)
20. Szustakowski, J.D., Weng, Z.: Protein structure alignment using a genetic algorithm. Proteins: Structure, Function, and Genetics 38, 428–440 (2000)
21. Tabtiang, R.K., Cezairliyan, B.O., Grand, R.A., Cochrane, J.C., Sauer, R.T.: Consolidating critical binding determinants by noncyclic rearrangement of protein secondary structure. PNAS 7, 2305–2309 (2004)
22. Van Duyne, G.D., Ghosh, G., Maas, W.K., Sigler, P.B.: Structure of the oligomerization and L-arginine binding domain of the arginine repressor of Escherichia coli. J. Mol. Biol. 256, 377–391 (1996)
23. Zhu, J., Weng, Z.: FAST: A novel protein structure alignment algorithm. Proteins: Structure, Function, and Bioinformatics 58, 618–627 (2005)
Generalized Pattern Search and Mesh Adaptive Direct Search Algorithms for Protein Structure Prediction

Giuseppe Nicosia and Giovanni Stracquadanio

Department of Mathematics and Computer Science, University of Catania, Viale A. Doria 6, 95125 Catania, Italy
{nicosia,stracquadanio}@dmi.unict.it
Abstract. Proteins are the most important molecular entities of a living organism, and understanding their functions is an important task for treating diseases and synthesizing new drugs. It is widely known that the function of a protein is strictly related to its spatial conformation. To tackle this problem, we propose a new approach based on a class of pattern search algorithms that is largely used in the optimization of real-world applications. The obtained results are interesting in terms of the quality of the structures (RMSD-$C_\alpha$) and the energy values found.

Keywords: protein folding, pattern search algorithms, non-linear optimization.
1 Introduction
Proteins are molecules that play a variety of roles in a living organism: the presence, the absence, and the interactions of proteins are crucial for the healthy state of an organism. In this view, it is clear that understanding protein function is crucial. The function of a protein is strictly related to its three-dimensional structure: the coordinates of the atoms of a protein define its function and the way it can interact with other molecules and the solvent. From a high-level point of view, we can say that given the primary structure of a protein, we can infer its tertiary structure and hence its function. The great majority of computational methods are based on the well-known thermodynamics hypothesis, which postulates that the native state of a protein is the one with the lowest free energy under physiological conditions [1]. The free energy of a protein can be modeled as a function of the different interactions within the protein, which depend on the positions of its atoms. The set of atomic coordinates providing the minimum possible value of the free energy corresponds to the protein native conformation. According to Levinthal's paradox, an exhaustive search algorithm would take the present age of the Universe for a protein to explore all possible configurations and locate the one with the minimum energy [2]. Generally speaking, we can define a global optimization problem in this
form: $\min f(x),\ x \in \Omega \subseteq X$, where $X = [x^L, x^U] = \{x \in \mathbb{R}^n \mid x^L \leq x \leq x^U\}$ and $\Omega = \{x \in X \mid C(x)\}$; $f$ is the objective function and $C : X \to \mathbb{R}$ is the constraint function. In the area of optimization, many interesting results come from the usage of pattern search methods [3]: at this stage, Generalized Pattern Search [4] and Mesh Adaptive Direct Search [5], which are part of the Nonlinear Optimization for Mixed vAriables and Derivatives (NOMAD) algorithm, are some of the most powerful optimization algorithms, largely used on academic and practical problems [6]. In this work, we propose a new ab-initio method [7,8]: starting from a sequence of amino acids, we find the corresponding three-dimensional conformation using no information derived from similarity at the sequence or fold level, as in homology [9] and threading [10] modeling. In particular, we try to output the three-dimensional structure with the lowest possible energy value. In Section 2, we outline the GPS and MADS algorithms and introduce the NOMAD-PSP protein structure prediction tool based on adapted versions of GPS and MADS; in Section 3, we report the obtained experimental results; finally, in Section 4, we outline conclusions and future work.
2 The GPS and MADS Algorithms
In this section, we describe the optimization algorithms used, which belong to the class of direct search algorithms: Generalized Pattern Search (GPS) [4] and Mesh Adaptive Direct Search (MADS) [5]. Without assuming any smoothness, it is assumed that there is a convergent subsequence of the sequence $\{x_k\}$ of iterates produced by the algorithm: since $\{f(x_k)\}$ generated by the algorithm is nonincreasing, it converges to a finite limit if it is bounded below, and if $f$ is lower semicontinuous at any limit point $\bar{x}$ of the sequence of iterates, then $f(\bar{x}) \leq \liminf_k f(x_k) = \lim_k f(x_k)$. Generalized Pattern Search algorithms [11,4] address derivative-free unconstrained optimization of continuously differentiable functions using positive spanning directions, which Lewis and Torczon [4] later extended to bound-constrained optimization. The Mesh Adaptive Direct Search (MADS) algorithm for non-linear optimization extends the GPS algorithms by allowing local exploration in an asymptotically dense set of directions in the space of variables. These algorithms are iterative, and each iteration is divided into two phases: an optional Search and a local Poll. GPS and MADS share the same concepts for the Search phase; instead, they differ considerably in the Poll procedure.

The Search phase. The objective function is evaluated over a finite number of mesh points. Formally, we define a mesh as a discrete subset of $\mathbb{R}^n$ whose fineness is parameterized by the mesh size parameter $\Delta_h > 0$. The main task of the Search phase is finding a new point that has a lower objective function value than the best current solution, called the incumbent. When the incumbent is replaced, i.e., $f(x_{k+1}) < f(x_k)$, then $x_{k+1}$ is said to be an improved mesh point. When the Search step fails to provide an improved mesh point, the algorithm calls the Poll procedure.
The Poll phase. This second step consists of evaluating the barrier objective function at the neighboring mesh points to see if a lower function value can be found there. When the Poll fails to provide an improved mesh point, the current incumbent solution is said to be a local mesh optimizer. Successively, the algorithm refines the mesh via the mesh size parameter:

$$\Delta_{k+1} = \tau^{w_k} \Delta_k \qquad (1)$$

for $0 < \tau^{w_k} < 1$, where $\tau > 1$ is a real number that remains constant over all iterations, and $w_k \leq -1$ is an integer bounded below by the constant $w^- \leq -1$. When either the Search or Poll step produces an improved mesh point, the current iteration can stop and the mesh size parameter can be kept constant or increased according to equation (1), but with $\tau > 1$ and $w_k \geq 0$ an integer bounded above by $w^+ \geq 0$. Using the previous equation, it follows that for any $k \geq 0$ there exists an integer $r_k \in \mathbb{Z}$ such that $\Delta_{k+1} = \tau^{r_k} \Delta_0$. The basic element in the formal definition of a mesh is a set of positive spanning directions $D \subset \mathbb{R}^n$: in particular, nonnegative linear combinations of the elements of the set $D$ span $\mathbb{R}^n$. The directions can be chosen using any strategy, provided that each direction $d_j \in D$, $\forall j = 1, 2, \ldots, |D|$, is the product $G\bar{z}_j$ of a non-singular generating matrix $G \in \mathbb{R}^{n \times n}$ by an integer vector $\bar{z}_j \in \mathbb{Z}^n$; it is important to remember that the same matrix is used for all directions. We denote by $D$ the real $n \times |D|$ matrix, and similarly by $\bar{Z}$ the matrix whose columns are $\bar{z}_j$, $\forall j = 1, \ldots, |D|$: at this point we can define $D = G\bar{Z}$. When using the Poll, the mesh is centered around the current iterate $x_k \in \mathbb{R}^n$ and its fineness is parameterized through the mesh size parameter $\Delta_k$ as follows: $M_k = \{x_k + \Delta_k Dz : z \in \mathbb{Z}_+^{|D|}\}$, where $\mathbb{Z}_+$ is the set of nonnegative integers. At each iteration, some positive spanning matrix $D_k$ composed of the columns of $D$ is used to construct the Poll set, which is composed of the mesh points neighboring the current iterate $x_k$ in the directions of the columns of $D_k$: $\{x_k + \Delta_k d : d \in D_k\}$. MADS, instead, generates iterates on a tower of underlying meshes on the domain space and performs an adaptive search on the meshes, including controlling their refinement. The set of trial points considered during the Poll step is called a frame [12,13]. The frame is constructed using a current incumbent solution $x_k$, known as the frame center, and the poll and mesh size parameters $\Delta^p_k$ and $\Delta^m_k$, to obtain a positive spanning set of directions $D_k$: unlike GPS, the MADS set of directions $D_k$ is not a subset of $D$. Formally, at each iteration $k$, the MADS frame is defined to be the set $P_k = \{x_k + \Delta^m_k d : d \in D_k\}$, where $D_k$ is a positive spanning set such that $0 \notin D_k$ and each $d \in D_k$ can be written as a nonnegative integer combination of the directions in $D$; the distance from the frame center $x_k$ to a frame point $x_k + \Delta^m_k d \in P_k$ is bounded by a constant times the poll size parameter: $\Delta^m_k \|d\| \leq \Delta^p_k \max\{\|d'\| : d' \in D\}$; finally, the limits of the normalized sets $D_k$ are positive spanning sets. If the Poll step fails to generate an improved mesh point, then the frame is called a minimal frame, and the
frame center $x_k$ is said to be a minimal frame center, and this event leads to a mesh refinement. The refinement of the mesh size parameter $\Delta^m_{k+1}$ increases the mesh resolution, and therefore allows the evaluation of $f$ at trial points closer to the incumbent solution. Formally, given a fixed rational number $\tau > 1$ and two integers $w^- \leq -1$, $w^+ \geq 0$, the mesh size parameter is updated as follows: $\Delta^m_{k+1} = \tau^{w_k} \Delta^m_k$ for some $w_k \in \{0, 1, \ldots, w^+\}$ if an improved mesh point is found, and otherwise $w_k \in \{w^-, w^- + 1, \ldots, -1\}$. This point is shared between GPS and MADS. Moreover, MADS introduces the poll size parameter $\Delta^p_k \in \mathbb{R}^+$ for iteration $k$: this parameter sets the magnitude of the distance from the trial points generated by the Poll step to the current incumbent solution $x_k$. In GPS, a single parameter represents these quantities: $\Delta_k = \Delta^p_k = \Delta^m_k$. The strategy of MADS for updating $\Delta^p_k$ must be such that $\Delta^m_k \leq \Delta^p_k$ for all $k$; moreover, it must satisfy $\lim_{k \in K} \Delta^m_k = 0 \Leftrightarrow \lim_{k \in K} \Delta^p_k = 0$ for any infinite subset of indices $K$. Finally, the mesh at each iteration $k$ is defined by the equation $M_k = \bigcup_{x \in S_k} \{x + \Delta^m_k Dz : z \in \mathbb{N}^{n_D}\}$, where $S_k$ is the set of points where the objective function has been evaluated by the start of iteration $k$: the mesh is defined using a union because this ensures that all previously visited points lie on a mesh, and that new trial points can be selected around them using the directions $D$.

Algorithm 1. NOMAD Optimization Flow
 1: procedure NOMAD(f, x0)
 2:   D ← MakeSpanningSet()
 3:   M0 ← Mesh(R^n)
 4:   k ← 0
 5:   while ¬End do            ▷ Termination criterion not met
 6:     while f(x_{k+1}) ≥ f(x_k) do
 7:       x_{k+1} ← Search(M_k)
 8:     end while
 9:     x_{k+1} ← Poll(x_{k+1})
10:     UpdateParameter(Δ_{k+1})
11:     k ← k + 1
12:   end while
13:   return x_k
14: end procedure
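To make the Search/Poll mechanics concrete, here is a bare-bones GPS-style loop in the spirit of Algorithm 1, assuming the ±e_i coordinate directions as the positive spanning set and mesh halving on failed polls (τ = 2, w_k = −1 in equation (1)). It is a didactic sketch, not the NOMAD package:

import numpy as np

def pattern_search(f, x0, delta=1.0, delta_min=1e-6, max_evals=10_000):
    x = np.asarray(x0, dtype=float)
    fx, evals, n = f(x), 1, len(x)
    directions = [s * np.eye(n)[i] for i in range(n) for s in (+1.0, -1.0)]
    while delta > delta_min and evals < max_evals:
        improved = False
        for d in directions:                  # Poll around the incumbent
            trial = x + delta * d
            ft = f(trial); evals += 1
            if ft < fx:                       # improved mesh point
                x, fx, improved = trial, ft, True
                break
        if not improved:                      # minimal frame: refine the mesh
            delta *= 0.5
    return x, fx

# Toy usage: in NOMAD-PSP the objective would be the CHARMM energy of the
# conformation encoded by the torsion-angle vector.
best_x, best_f = pattern_search(lambda v: float(np.sum((v - 1.0) ** 2)),
                                x0=[0.0, 0.0])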
2.1 NOMAD-PSP
In this subsection, we present NOMAD-PSP: the NOMAD optimization tool for protein structure prediction. We report our main choices regarding the representation of solutions and the adopted free energy function.

Representation of candidate solutions. A nontrivial task that precedes the use of any search procedure to tackle the PSP problem is the selection of a good representation for the protein conformation. In the current work, we use an internal coordinates representation (torsion angles), which is the most widely used representation model for real proteins. Each residue type requires a fixed number of
torsion angles to fix the 3D coordinates of all atoms. Bond lengths and angles are fixed at their ideal values. In some simulations, all the ω torsion angles are fixed, so the degrees of freedom are the backbone and side-chain torsion angles (φ, ψ and $\chi_i$). As is known, the number of χ angles depends on the residue type, and they are defined in specific ranges derived from the backbone-independent rotamer library [14]. Side-chain constraint regions are of the form $[\mu - \sigma, \mu + \sigma]$, where μ and σ are the mean and the standard deviation of each side-chain torsion angle computed from the rotamer library. It is important to note that under these constraints the conformation is still highly flexible, and the structure can take on various shapes that are vastly different from the native shape.

Energy function. In order to evaluate the structure of a molecule we need to use a cost function. Sometimes called potential energy functions or force fields, these functions return a value for the energy based on the conformation of the molecule. As such, they provide information on which conformations of the molecule are better or worse: a lower energy value should represent a better conformation. Most typical energy functions have the form:

$$E(R) = \sum_{bonds} B(R) + \sum_{angles} A(R) + \sum_{torsions} T(R) + \sum_{non\text{-}bonded} N(R)$$
where R is the vector representing the conformation of the molecule, typically in Cartesian coordinates or in torsion angles. In this work we use the Chemistry at HARvard Macromolecular Mechanics – CHARMM (version 27) energy function, a popular all-atom force field used mainly for studying macromolecules [15,16].
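The representation and the energy function meet in the objective handed to the optimizer: a vector of torsion angles with box bounds, scored by a force-field call. A sketch under stated assumptions; energy_of_conformation is a hypothetical stand-in for a real CHARMM evaluation, not an actual API:

import numpy as np

def make_objective(sequence, energy_of_conformation):
    # x collects the free torsion angles (phi, psi, chi_i) in degrees; the
    # force field is assumed to rebuild Cartesian coordinates from them
    # using ideal bond lengths and angles.
    def objective(x):
        # energy_of_conformation: hypothetical force-field call (assumption)
        return energy_of_conformation(sequence, np.asarray(x, dtype=float))
    return objective

# Box bounds per variable: backbone ranges as in the settings of Section 3,
# side-chain chi_i in [mu - sigma, mu + sigma] from the rotamer library.
bounds = [(-180.0, -50.0),   # phi, e.g. as in one explored setting
          (-75.0, 175.0)]    # psi, extended per residue/angle as needed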
3 Results
In this section we show the performance of NOMAD-PSP on three well-known proteins: 1PLW (5 residues, 22 dihedral angles, 75 atoms), 2MLT (26 residues, 85 dihedral angles, 402 atoms), and 1ZDD (34 residues, 192 dihedral angles, 566 atoms). We are interested in studying the exploration and exploitation abilities of the GPS and MADS algorithms, so we have recorded the number of iterations (ITR), the number of consecutive failures (NCF) and the final mesh size (FMS) of each run.

3.1 Met-Enkephalin
Met-enkephalin (1PLW) is a very short peptide [17]. From an optimization point of view, 1PLW is a paradigmatic example of a multiple-minima problem: it is estimated to have more than $10^{11}$ locally optimal conformations. In recent years it has become the first test bed for the protein structure prediction problem. Due to its small number of dihedral angles, Met-enkephalin was extensively used to understand the effectiveness of GPS and MADS for PSP. First of all, we set up various sets of bounds for the dihedral angles: in [tab.1] we outline all the ranges, where RL means rotamer library [14]. The bounds of φ, ψ for the A1 and A2 settings are taken from [18], while in the A3 and A4 settings we set
Table 1. 1PLW: dihedral angle settings used in our experimental protocol

Setting  φ               ψ               ω      χ
A1       [−180°, −50°]   [−75°, 175°]    −180°  RL
A2       [−180°, −50°]   [−75°, 175°]    180°   RL
A3       [−180°, 180°]   [−180°, 180°]   −180°  RL
A4       [−180°, 180°]   [−180°, 180°]   180°   RL
Table 2. 1PLW: GPS results

Setting  Result  Energy (kcal/mol)  RMSD_all (Å)  RMSD_Cα (Å)  ITR   NCF     FMS
(A1,I1)  BE      -15.443            4.231         2.073        9508  58067   0
(A1,I1)  BRA     -9.732             2.778         1.657        2088  9829    0
(A1,I1)  BRC     -4.7277            4.223         1.344        6108  23158   0
(A1,I2)  BF      -13.9806           3.754         1.637        6058  234947  0
(A2,I1)  BE      -14.691            3.846         1.697        6196  230962  0
(A2,I1)  BRA     -14.691            3.846         1.697        6196  230962  0
(A2,I1)  BRC     5.278              3.963         1.345        6158  234189  0
(A2,I2)  BF      -13.9836           3.753         1.637        6058  234947  0
(A3,I1)  BE      -30.140            3.779         1.546        7261  178756  0
(A3,I1)  BRA     -15.766            3.265         1.436        6399  223116  0
(A3,I1)  BRC     -6.354             3.734         1.199        6941  188917  0
(A3,I2)  BF      1.049              3.360         1.400        6778  202700  0
(A4,I1)  BE      -25.986            3.807         1.682        8397  136927  0
(A4,I1)  BRA     -23.692            3.437         1.696        7812  138587  0
(A4,I1)  BRC     -18.801            3.944         1.243        6937  194448  0
(A4,I2)  BF      4.39−39            4.015         1.239        5800  160015  0
the full [−180°, 180°] ranges for φ and ψ. Moreover, for each setting we test two values of the improper angle ω, to study the effect on the quality of the solutions found. We study the impact of the starting point using two initialization methods: random initialization (I1) and centered-in-bounds initialization (I2), which is a common initialization procedure in direct search methods. For all the simulations on the 1PLW peptide we set the number of objective function evaluations to 25 × 10^4. In [tab.2,3] we report the best results obtained by the algorithms, labeled by the criterion they optimize: Best Energy (BE), Best RMSD_all (BRA), and Best RMSD_Cα (BRC); when we use random initialization, we report the best solution over five independent runs. By inspecting the tables, we obtain the minimum energy using the A3 setting; in terms of RMSD_all, instead, the A1 bounds give the best result and ensure a good speed of convergence: the algorithm requires only 2088 iterations, the minimum number of iterations in our experiments, and it also has the minimum number of consecutive failures, 9829. Moreover, if we analyze the results obtained using the A3 and A4 bounds, a problem with the energy function becomes clear: there is no direct correspondence between low energy values and low RMSD values. In [tab.4] we show the dihedral angles of the conformation with the minimum RMSD_Cα.
Table 3. 1PLW: MADS results

Setting  Result  Energy (kcal/mol)  RMSD_all (Å)  RMSD_Cα (Å)  ITR   NCF     FMS
(A1,I1)  BE      -11.265            4.056         1.680        6246  244523  0
(A1,I1)  BRA     -11.265            4.056         1.680        6246  244523  0
(A1,I1)  BRC     -11.265            4.056         1.680        6246  244523  0
(A1,I2)  BF      -13.528            3.856         1.694        6034  247837  0
(A2,I1)  BE      -4.8471            4.277         1.944        7002  244039  0
(A2,I1)  BRA     -4.8471            4.277         1.944        7002  244039  0
(A2,I1)  BRC     -4.8471            4.277         1.944        7002  244039  0
(A2,I2)  BF      -13.513            3.852         1.692        6127  247756  0
(A3,I1)  BE      -14.223            4.468         1.691        7109  168751  0
(A3,I1)  BRA     -13.513            3.852         1.692        6399  223116  0
(A3,I1)  BRC     -14.223            4.468         1.691        7109  168751  0
(A3,I2)  BF      84.582             4.666         2.467        5738  228050  0
(A4,I1)  BE      -21.103            3.785         1.596        6096  247918  0
(A4,I1)  BRA     -19.746            3.453         1.575        7812  138587  0
(A4,I1)  BRC     -14.633            3.488         1.277        6048  247867  0
(A4,I2)  BF      84.582             4.666         2.467        5800  160015  0

Table 4. 1PLW: the dihedral angles of the best conformation found

Residue  φ        ψ       ω     χ1       χ2      χ3
TYR      -65.99   82.66   -180  -67.30   89.99   –
GLY      -176.04  175     -180  –        –       –
GLY      -75.23   7.81    -180  –        –       –
PHE      -133.25  -34.76  -180  31.07    89.68   –
MET      -168.19  115.73  -180  -172.15  178.38  79.36
To better understand how the algorithm explores the solution space, we examined all the points evaluated by the algorithm with the best performance in terms of energy and RMSD_all. The analysis confirms that there are a great number of solutions with better RMSD values than the optimum found, but they have higher energy function values. Finally, we compare NOMAD-PSP with other algorithms from the literature [8]: I-PAES [19], Regal[20], and Lamarkian[20]. The comparison is done on RMSD_all, as reported in [tab.5]; as we can see, NOMAD-PSP clearly outperforms state-of-the-art algorithms in terms of energy and RMSD values.

3.2 Disulphide-Established Mini Protein Domain A
This protein (1ZDD) is a two-helix peptide of 34 residues [21]: it defines 192 dihedral angles, so it is an interesting test bed for understanding how a given folding algorithm works on a problem with a large number of variables. For this instance, the bounds on the dihedral angles are deduced from the prediction of the secondary structure: the secondary structure constraints were predicted
Table 5. 1PLW: comparison between NOMAD-PSP and other state-of-the-art folding algorithms

Algorithm  Energy Function  Energy (kcal/mol)  RMSD_all (Å)
NOMAD-PSP  CHARMM           -30.14             3.779
NOMAD-PSP  CHARMM           -9.73              2.778
REGAL      CHARMM           -22.01             3.23
Lamarkian  CHARMM           -28.35             3.33
I-Paes     CHARMM           -20.47             3.605
Table 6. 1ZDD: GPS and MADS performances

Algorithm  Energy (kcal/mol)  RMSD_all (Å)  RMSD_Cα (Å)
GPS        -1460.751          7.04          3.87
MADS       -1.066             13.815        13.486

Table 7. 1ZDD: secondary structures predicted by GPS

              Residues  RMSD_all (Å)  RMSD_Cα (Å)
1st α-helix   3-14      2.89          0.29
2nd α-helix   20-32     2.94          0.76
by the SCRATCH prediction server [22] (the ranges have been set as in [19]), and the side-chain dihedral angles are defined on the basis of the rotamer library. We use GPS and MADS, fixing the number of objective function evaluations to 10^6 and using random initialization. We measure the performance of the algorithms in terms of RMSD_all and RMSD_Cα with respect to the structure stored in the PDB (1ZDD): the results are reported in [tab.6]. From these experiments we can see that GPS clearly outperforms MADS, which fails to predict a feasible protein structure. Moreover, by inspecting all the evaluated points, we can say that the best point found by the algorithm is truly the best solution among all evaluated points: from this analysis, it seems clear that the 1ZDD protein has a smaller number of local minima than 1PLW, but it is still a quite difficult benchmark due to the huge number of dihedral angles. This protein defines two α-helices: we are interested in evaluating the quality of the secondary structures predicted by GPS by computing the RMSD on the sub-sequences of the protein that define the secondary structures. The results, shown in [tab.7], confirm that GPS correctly predicts the two α-helices, with RMSD_Cα < 1 Å for each secondary structure.

3.3 Melittin
Melittin (2MLT) is a peptide of 26 amino acids that has recently received a good deal of attention in computational protein folding studies because of the huge number of local minima present in its folding landscape. We focus on the membrane-bound portion of the protein, the first 20 amino acids, as
Table 8. 2MLT: GPS and MADS performances

Algorithm  Energy (kcal/mol)  RMSD_all (Å)  RMSD_Cα (Å)
GPS        378.973            4.7           3.7
GPS        456.93             1.663         0.994
MADS       382.235            6.356         5.341
already done in [18]; this portion defines 85 dihedral angles. We use GPS and MADS with the number of objective function evaluations fixed to 5 × 10^5, and the bounds on the dihedral angles are deduced from the secondary structure predicted using SCRATCH, as already described for the 1ZDD peptide. From [tab.8] we can see that GPS, even in this case, outperforms MADS in terms of minimum energy value, RMSD_all, and RMSD_Cα: in particular, the solution with the lowest RMSD values is very near to the native conformation.
4 Conclusions and Future Work
Finding the three-dimensional structure of a protein is a central open problem in structural bioinformatics. In the present research work, we introduced a new ab-initio protein structure prediction approach based on two direct search algorithms: Generalized Pattern Search and Mesh Adaptive Direct Search. These two algorithms have proved effective in many academic and real-world applications. Starting from these results, we modeled PSP as a non-linear optimization problem and used GPS and MADS to find the native protein structure, that is, the three-dimensional structure with the lowest possible energy function value. The experiments performed on a well-known set of peptides confirmed that GPS is a suitable algorithm for PSP: at least for the protein instances considered, GPS seems to outperform MADS in terms of the quality of the solutions found and convergence speed. As future work we are working on three fronts: the first is understanding how the bound settings impact the performance of the GPS algorithm; the second regards the use of a more powerful heuristic search procedure than the naive random search; and finally we want to tackle the problem using a multi-objective optimization approach with the combinatorial assembly of structural sub-units. The second point is the most challenging one. In fact, any strategy may be used to select the mesh points that are candidates to replace the best current point (the incumbent). Starting from this consideration, we can introduce a search procedure based on surrogates[23,24]. We can formulate a surrogate definition of the PSP, tackle the optimization of the surrogate function using derivative-based optimization tools or quadratic programming procedures, and then move the solution to a nearby mesh point in the hope of obtaining a better next iterate. This is the approach used in the Boeing Design Explorer software [25], and it is a visionary research topic for the protein structure prediction problem.
References

1. Anfinsen, C.B., Haber, E., Sela, M., White, F.H.: The Kinetics of Formation of Native Ribonuclease during Oxidation of the Reduced Polypeptide Chain. PNAS 47(9), 1309–1314 (1961)
2. Levinthal, C.: Are there pathways for protein folding? J. Chim. Phys. 65(1), 44–45 (1968)
3. Dennis Jr., J.E., Torczon, V.: Direct Search Methods on Parallel Machines. SIAM Journal on Optimization 1, 448 (1991)
4. Lewis, R.M., Torczon, V.: Pattern search algorithms for linearly constrained minimization. SIAM Journal on Optimization 10(3), 917–941 (2000)
5. Audet, C., Dennis Jr., J.E.: Mesh Adaptive Direct Search Algorithms for Constrained Optimization. SIAM Journal on Optimization 17, 188 (2006)
6. Kokkolaras, M., Audet, C., Dennis, J.E.: Mixed variable optimization of the number and composition of heat intercepts in a thermal insulation system. Technical Report, Rice University (June 22, 2000)
7. Kim, D.E., Chivian, D., Baker, D.: Protein structure prediction and analysis using the Robetta server. Nucleic Acids Research 32(Web-Server-Issue), 526–531 (2004)
8. Anile, A.M., Cutello, V., Narzisi, G., Nicosia, G., Spinella, S.: Determination of protein structure and dynamics combining immune algorithms and pattern search methods. Natural Computing 6(1), 55–72 (2007)
9. Shi, J., Blundell, T.L., Mizuguchi, K.: FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol. 310(1), 243–257 (2001)
10. Rost, B., Schneider, R., Sander, C.: Protein fold recognition by prediction-based threading. J. Mol. Biol. 270(1-10), 26 (1997)
11. Kolda, T.G., Lewis, R.M., Torczon, V.: Optimization by Direct Search: New Perspectives on Some Classical and Modern Methods. SIAM Review 45, 385 (2004)
12. Coope, I.D., Price, C.J.: Frame Based Methods for Unconstrained Optimization. Journal of Optimization Theory and Applications 107(2), 261–274 (2000)
13. Coope, I.D., Price, C.J.: On the Convergence of Grid-Based Methods for Unconstrained Optimization. SIAM Journal on Optimization 11, 859 (2001)
14. Congdon, P.: Bayesian Statistical Modelling. Meas. Sci. Technol. 13, 643 (2002)
15. Foloppe, N., MacKerell Jr., A.D.: All-atom empirical force field for nucleic acids: I. Parameter optimization based on small molecule and condensed phase macromolecular target data. Journal of Computational Chemistry 21(2), 86–104 (2000)
16. MacKerell Jr., A.D., Bashford, D., Bellott, M., Dunbrack Jr., R.L., Evanseck, J.D., Field, M.J., Fischer, S., Gao, J., Guo, H., Ha, S., et al.: All-atom empirical potential for molecular modeling and dynamics studies of proteins. J. Phys. Chem. B 102(18), 3586–3616 (1998)
17. Marcote, I., Separovic, F., Auger, M., Gagne, S.: A Multidimensional 1H NMR Investigation of the Conformation of Methionine-Enkephalin in Fast-Tumbling Bicelles. Biophys. J. 86, 5578–5583 (2004)
18. Klepeis, J.L., Pieja, M.J., Floudas, C.A.: Hybrid Global Optimization Algorithms for Protein Structure Prediction: Alternating Hybrids. Biophysical Journal 84(2), 869–882 (2003)
19. Cutello, V., Narzisi, G., Nicosia, G.: A multi-objective evolutionary approach to the protein structure prediction problem. Journal of The Royal Society Interface 3(6), 139–151 (2006)
20. Kaiser Jr., C.E., Lamont, G.B., Merkle, L.D., Gates Jr., G.H., Pachter, R.: Polypeptide structure prediction: real-value versus binary hybrid genetic algorithms. In: Proceedings of the 1997 ACM Symposium on Applied Computing, pp. 279–286. ACM Press, New York (1997)
21. Starovasnik, M.A., Braisted, A.C., Wells, J.A.: Structural mimicry of a native protein by a minimized binding domain. Proc. Natl. Acad. Sci. USA 94, 10080–10085 (1997)
22. Pollastri, G., Przybylski, D., Rost, B., Baldi, P.: Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins Structure Function and Genetics 47(2), 228–235 (2002)
23. Booker, A.J., Dennis, J.E., Frank, P.D., Serafini, D.B., Torczon, V., Trosset, M.W.: A rigorous framework for optimization of expensive functions by surrogates. Structural and Multidisciplinary Optimization 17(1), 1–13 (1999)
24. Audet, C., Booker, A.J., Dennis Jr., J.E., Frank, P.D., Moore, D.W.: A surrogate-model-based method for constrained optimization. AIAA Paper 4891 (2000)
25. Santner, T.J., Williams, B.J., Notz, W.: The design and analysis of computer experiments. Springer, Heidelberg (2003)
Alignment-Free Local Structural Search by Writhe Decomposition

Degui Zhi¹, Maxim Shatsky¹,², and Steven E. Brenner¹,²

¹ Department of Plant and Microbial Biology, UC Berkeley, CA 94720, USA
² Physical Biosciences Division, LBNL, Berkeley, CA 94720, USA
Abstract. In the era of structural genomics, comparing a large number of protein structures can be a dauntingly time-consuming task. Traditional structural alignment methods, although they offer accurate comparison, are not fast enough. Therefore, a number of databases storing pre-computed structural similarities have been created to handle structural comparison queries efficiently. However, these databases cannot be updated in a timely fashion due to the sheer burden of computational requirements, and thus they offer only a rigid classification by some predefined parameters. Therefore, there is an increasingly urgent need for algorithms that can rapidly compare a large set of structures. Recently proposed projection methods, e.g., [1,2,3,4,5], show good promise for the development of fast structural database search solutions. Projection methods map a structure into a point in a high-dimensional space and compare two structures by measuring the distance between their projected points. These methods offer a tremendous increase in speed over residue-level structural alignment methods. However, current projection methods are not practical, partly because they are unable to identify local similarities. We propose a new projection-based approach that can rapidly detect global as well as local structural similarities. Local structural search is enabled by a topology-based writhe decomposition protocol (inspired by [4]) that produces a small number of fragments while ensuring that similar structures are cut in a similar manner. In a benchmark test for local structural similarity detection, we show that our method, Writher, dramatically improves accuracy over the current leading projection methods [4,5] in terms of recognizing SCOP domains out of multidomain proteins.
References

1. Choi, I.G., Kwon, J., Kim, S.H.: Local feature frequency profile: A method to measure structural similarity in proteins. PNAS 101(11), 3797–3802 (2004)
2. Gaspari, Z., Vlahovicek, K., Pongor, S.: Efficient recognition of folds in protein 3D structures by the improved PRIDE algorithm. Bioinformatics 21(15), 3322–3323 (2005)
3. Lisewski, A.M., Lichtarge, O.: Rapid detection of similarity in protein structure and function through contact metric distances. Nucl. Acids Res. 34(22), e152 (2006)
4. Røgen, P., Fain, B.: Automatic classification of protein structure by using Gauss integrals. PNAS 100(1), 119–124 (2003)
5. Zotenko, E., O’Leary, D., Przytycka, T.: Secondary structure spatial conformation footprint: a novel method for fast protein structure comparison and classification. BMC Structural Biology 6(1), 12 (2006)
Defining and Computing Optimum RMSD for Gapped Multiple Structure Alignment

Xueyi Wang and Jack Snoeyink

Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-3175, USA
{xwang,snoeyink}@cs.unc.edu
Abstract. Pairwise structure alignment commonly uses root mean square deviation (RMSD) to measure structural similarity, and methods for optimizing RMSD are well established. However, multiple structure alignment with gaps cannot use these methods directly. We extend RMSD to weighted RMSD for multiple structures, which includes gapped alignment as a special case. By using multiplicative weights, we show that the weighted RMSD over all pairs is the same as the weighted RMSD to an average of the structures. Although we show that the two tasks of finding the optimal translations and rotations for minimizing weighted RMSD cannot be separated for multiple structures as they can for pairs (an inherent difficulty, and a fact ignored by previous work), we develop an iterative algorithm, in which each iteration takes linear time and the number of iterations is small, that converges to a local minimum of weighted RMSD. 10,000 experiments done on each of 23 protein families from HOMSTRAD (where each structure starts with a random translation and rotation) converge rapidly to the same minimum. Finally we propose a heuristic method to iteratively remove the effect of outliers and find the well-aligned positions that determine the structurally conserved region, by modeling B-factors and deviations from the average positions as weights and iteratively assigning higher weights to better-aligned atoms.

Keywords: weighted RMSD, multiple structure alignment, optimization, structurally conserved region.
1 Introduction
Protein structure alignment is an important topic in bioinformatics. Proteins with similar 3D structures may have similar functions and have often evolved from common ancestors[2]. Although available protein sequences outnumber available protein structures by several orders of magnitude, and protein sequence alignment methods have been widely used to determine protein families and find sequence homology, protein structure alignment has its own importance in disclosing the extent of structural similarity. Structure alignment provides confirmation for sequence alignment, and conserved regions determined by structure alignment are good candidates for threading and homology modeling.
If we consider protein structures as rigid bodies, then the problem of protein structure alignment is to translate and rotate these structures so as to minimize a score function. Pairwise alignment commonly uses the root mean squared deviation (RMSD) between corresponding atoms in two structures to measure structural similarity, once a suitable correspondence has been chosen and the molecules have been translated and rotated to the best match[8]. Pairwise RMSD can be extended to measure the goodness of a multiple structure alignment in several ways. Examples from the literature include the sum of all pairwise squared distances[11,17], which we also use, and the average RMSD per aligned position[13]. Multiple structure alignment introduces some interesting considerations. For a first example, lower B-factors (values that measure the mobility or uncertainty of given atoms' positions) may suggest that the positions of the atoms should be regarded as more precisely known, and thus should count more towards an alignment or consensus structure. For a second, if the correspondence between atoms is derived by multiple sequence alignment, one would like to use conserved atoms in the alignment and omit, or at least reduce the influence of, the exceptions: outlier atoms in a family of structures should not force the removal of all other atoms that were reliably determined at that position in the sequence. In both examples, we would like to be able to assign weights that indicate the confidence in an atom's position. Weighting individual atoms allows a measure of local control in RMSD structure alignment that is otherwise missing, because RMSD is a global measure. Gapped alignment is a special case in which the weight of each atom is assigned either zero or one. In the next section, we show how to use the weights assigned to atoms to determine the weights of pairs in RMSD, and develop an algorithm for multiple structure alignment with weighted atoms.

Many algorithms for multiple structure alignment have been presented. Some first do pairwise structure alignments, then combine structures in pairs. STRUCTAL[6] chooses the structure that has minimum total RMSD to all other structures as the consensus structure and aligns the other structures to it; MAMMOTH-mult[11] chooses one structure at a time and minimizes total RMSD to all previously aligned structures until all structures are aligned; STAMP[15] combines closest pairs to build a tree structure; and MULTAL[18] progressively combines the most similar sequences into a consensus using vectors. Other algorithms align all the structures together instead of combining aligned pairs. Sutcliffe et al.[17], Verboon and Gabriel[19], and Pennec[14] iteratively align protein structures to their average structure and achieve minimum RMSD by optimizing rotations for each structure; our algorithm is a refinement of theirs that correctly handles weights on atoms and optimizes both translations and rotations. CE[7] uses Monte Carlo optimization to achieve a tradeoff between the average atom distance and the number of aligned columns. MUSTA[10] and MASS[4] use geometric hashing on Cα atoms and secondary structures, respectively, and find a consensus structure. MultiProt[16] and MALECON[13] iteratively use each structure as a consensus, align the other structures to it, and determine the largest core. CBA[5] and MUSTANG[9] progressively group similar structures, recalculate atom correspondences, and optimize the alignment.
In this paper, first we extend RMSD to weighted RMSD for multiple structures. We show that with the right definition of weight for multiple structure alignment, the weighted RMSD for all pairs is the same as the weighted RMSD to the average of the structures. Next we show that for minimizing weighted RMSD, translations and rotations cannot be separated, whereas previous works[14,17,19] focus on rotations only. We propose an iterative algorithm to optimize both translations and rotations in weighted RMSD. In our tests of 10,000 runs on protein families from HOMSTRAD[12], where each run starts with a random translation and rotation for each structure, this algorithm quickly reaches the same optimum alignment. By modeling B-factors and deviations from the average positions as weights and minimizing weighted RMSD, we show that we can find well-aligned positions that determine the conserved region.
2 Methods
We define the average of structures and weighted RMSD for multiple structures, and then establish the properties of wRMSD.

2.1 Weighted Root Mean Square Deviation
We assume there are n structures, each having m points (atoms), so that structure S_i (1 ≤ i ≤ n) has points p_i1, p_i2, ..., p_im. For a fixed position k, the n points p_ik (1 ≤ i ≤ n) are assumed to correspond. We assign a weight $w_{ik} \ge 0$ to point p_ik, and the weighted centroid of each structure is $\sum_{k=1}^m w_{ik} p_{ik} / \sum_{k=1}^m w_{ik}$. We assign zero weights to gaps, where the coordinates of points in the gaps do not matter. We assume at least one nonzero weight at each aligned position, and define the weight normalized by position as $\bar w_{ik} = n\, w_{ik} / \sum_{l=1}^n w_{lk}$ (note that $\sum_{i=1}^n \bar w_{ik} = n$). We define the weighted average structure $\bar S$ to have points

$$\bar p_k = \sum_{i=1}^n w_{ik}\, p_{ik} \Big/ \sum_{l=1}^n w_{lk} \;=\; \frac{1}{n} \sum_{i=1}^n \bar w_{ik}\, p_{ik} \qquad \text{for } 1 \le k \le m.$$

Given n structures, we define wRMSD as the square root of the weighted average of all squared pairwise distances. Note there are n(n − 1)/2 structure pairs, and each structure pair has m squared distances. Thus, if $w_{ijk} = \bar w_{ik} w_{jk} = w_{ik} \bar w_{jk}$ is the weight for the point pair (p_ik, p_jk), then we define

$$\mathrm{wRMSD} = \sqrt{ \frac{2}{m\,n(n-1)} \sum_{i=2}^{n} \sum_{j=1}^{i-1} \sum_{k=1}^{m} w_{ijk}\, \| p_{ik} - p_{jk} \|^2 }.$$
There are many ways to define a combined weight $w_{ijk}$; we choose to multiply the weights $w_{ik}$ and $w_{jk}$ to capture the confidence we have in aligning atoms from structure i and structure j at position k. If either $w_{ik}$ or $w_{jk}$ is zero, then the combination is zero; if all atoms at a position have equal confidence, then they all factor equally into the combination. Our choice is compatible with unweighted
RMSD or RMSD weighted at aligned positions, and captures gapped alignment as a special case. As we will see in the mathematics, with this choice we can align structures to an average structure and speed up computation. Alternate ways to define $w_{ijk}$ may not work as well: for example, if we define $w_{ijk} = (w_{ik}+w_{jk})/2$, then when one of $w_{ik}$ or $w_{jk}$ is zero and the other is nonzero, the wRMSD value will be influenced by an atom position in which we have no confidence. Since m and n are fixed, we can equivalently minimize the weighted sum of all squared pairwise distances instead of wRMSD. We list three theorems relating the weighted sum of all squared pairwise distances to the average structure. Our first theorem says that if wRMSD is used to compare multiple structures, then what is really happening is that all structures are being compared to the average structure: the average structure $\bar S$ is a consensus. By comparing to the average structure, we reduce the number of pairs of structures that must be compared from n(n − 1)/2 to n; see Wang and Snoeyink[20] for related theorems on the unweighted case.

Theorem 1. The weighted sum of squared distances for all pairs equals the weighted sum of squared distances to the average structure $\bar S$:

$$\sum_{i=2}^{n}\sum_{j=1}^{i-1}\sum_{k=1}^{m} w_{ijk}\,\|p_{ik}-p_{jk}\|^2 \;=\; n \sum_{i=1}^{n}\sum_{k=1}^{m} w_{ik}\,\|p_{ik}-\bar p_k\|^2.$$
Proof. Algebraic manipulation after expanding $w_{ijk}$ according to its definition.

Our second and third theorems suggest how to choose the structure closest to a given set of structures. If you can choose any structure, then choose the average $\bar S$; if you must choose from a limited set, then choose the structure closest to the average $\bar S$. The proofs use the Cauchy-Schwartz inequality and some algebra.

Theorem 2. The average structure $\bar S$ minimizes the weighted sum of squared distances from all the structures; i.e., for any structure Q with points q_1, q_2, ..., q_m,

$$\sum_{i=1}^{n}\sum_{k=1}^{m} w_{ik}\,\|p_{ik}-q_k\|^2 \;\ge\; \sum_{i=1}^{n}\sum_{k=1}^{m} w_{ik}\,\|p_{ik}-\bar p_k\|^2,$$

and equality holds if and only if $q_k = \bar p_k$ for all positions with $w_{ik} > 0$.

Theorem 3. The structure from Q_1, Q_2, ..., Q_m with minimum wRMSD to $\bar S$ minimizes the weighted sum of squared distances to all structures S_i.
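Theorem 1 is easy to check numerically; the following NumPy snippet (our own sanity check, not from the paper) evaluates both sides on random weighted structures with a gap.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 8                                   # structures, aligned positions
P = rng.normal(size=(n, m, 3))                # P[i, k] = p_ik
W = rng.uniform(0.1, 1.0, size=(n, m))        # w_ik >= 0
W[0, 0] = 0.0                                 # a gap is just a zero weight

Wbar = n * W / W.sum(axis=0)                  # normalized weights wbar_ik
Pbar = (Wbar[:, :, None] * P).sum(axis=0) / n # average structure pbar_k

# Left side: all-pairs weighted sum with w_ijk = wbar_ik * w_jk.
lhs = sum(
    (Wbar[i] * W[j] * ((P[i] - P[j]) ** 2).sum(axis=1)).sum()
    for i in range(1, n) for j in range(i)
)
# Right side: n times the weighted sum of squared distances to the average.
rhs = n * (W * ((P - Pbar) ** 2).sum(axis=2)).sum()

assert np.isclose(lhs, rhs)                   # Theorem 1 holds
```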
2.2 Rotation and Translation to Minimize wRMSD
In structure alignment, we translate and rotate structures in 3D space to minimize wRMSD. We define R_i as a 3×3 rotation matrix and T_i as a 3×1 translation vector for structure S_i. We aim to find the optimal T_i and R_i for each structure to minimize the wRMSD. The target function is

$$\operatorname*{argmin}_{R,T} \sum_{i=2}^{n}\sum_{j=1}^{i-1}\sum_{k=1}^{m} w_{ijk}\, \|(R_i p_{ik} - T_i) - (R_j p_{jk} - T_j)\|^2.$$
Applying Theorem 1 to the target function, we obtain

$$\operatorname*{argmin}_{R,T}\; n \sum_{i=1}^{n}\sum_{k=1}^{m} w_{ik}\, \| R_i p_{ik} - T_i - \overline{Rp}_k + \bar T_k \|^2,$$

where $\overline{Rp}_k = \frac{1}{n}\sum_{l=1}^{n} \bar w_{lk} R_l p_{lk}$ and $\bar T_k = \frac{1}{n}\sum_{l=1}^{n} \bar w_{lk} T_l$. In this way, we change the minimization of wRMSD over all pairs into the minimization of wRMSD from all structures to the average structure.

Optimum translation and rotation. Horn[8] shows that to align a pair of structures to minimize wRMSD, one can first translate both structures so their centroids coincide (say, at the origin), then solve for the optimum rotation. For weighted multiple structure alignment, however, this is no longer true. Consider the example of Fig. 1 with three structures S_1, S_2, and S_3, each containing three weighted atoms in correspondence from left to right. Black dots denote weights = 1 and white dots denote weights = 0, i.e., the gaps. The alignment in Fig. 1a moves the weighted centroids to the origin and obtains wRMSD = √6; moving unweighted centroids to the origin would give wRMSD = 2/3. The alignment in Fig. 1b achieves the optimum RMSD = 0 by translating S_2 by −1 and S_3 by 1. The difference arises because centroids are defined for each structure independently, but contributions to the alignment score depend also on the weights assigned to the structures being compared.
[Fig. 1. Example of aligning three structures with gaps: (a) moving centroids to the origin; (b) achieving optimum RMSD. Black dots denote weight = 1, white dots denote gaps (weight = 0). Dashes indicate corresponding points.]
Verboon and Gabriel[19] and Pennec[14] present iterative algorithms to minimize RMSD for multiple structure alignment by translating the centroids of all structures to the origin and optimizing rotations, but our example shows that their algorithms may not find the optimum RMSD in weighted structure alignment. It turns out that the optimum translations cannot be found easily; translation and rotation cannot be separated when minimizing wRMSD. Theorem 4 (see the appendix for a proof) shows the relation between the optimum translations and rotations. In the general case, translation and rotation cannot be separated for minimizing wRMSD in multiple structure alignment. If all the weights at the same position k have the same value $w_k$, then $\bar w_{ik} = 1$ and $w_{ijk} = w_k$, and we can obtain from the equations in Theorem 4 that the optimal translation moves the centroids to the origin, as expected[8,20].
Theorem 4. The optimum translations T_i and optimum rotations R_i for the structures S_i (1 ≤ i ≤ n) satisfy the following n linear equations, of which n − 1 are independent:

$$\sum_{k=1}^{m} w_{ik}\,(R_i p_{ik} - T_i) \;=\; \frac{1}{n} \sum_{k=1}^{m} w_{ik} \Big( \sum_{l=1}^{n} \bar w_{lk}\,(R_l p_{lk} - T_l) \Big).$$

Given all optimal rotations R_i (1 ≤ i ≤ n) and one translation T_j (1 ≤ j ≤ n), the remaining optimal translations T_i can be obtained by

$$T_i \;=\; T_j \;-\; \frac{ \dfrac{1}{n}\displaystyle\sum_{l=1}^{n} R_l \sum_{k=1}^{m} p_{lk}\,(w_{ilk} - w_{jlk}) \;-\; R_i \sum_{k=1}^{m} w_{ik}\, p_{ik} \;+\; R_j \sum_{k=1}^{m} w_{jk}\, p_{jk} }{ \displaystyle\sum_{k=1}^{m} w_{ik} }.$$
Algorithm for minimizing wRMSD. Finding optimal translations and rotations for multiple structures is harder than for a pair because the minimization problem no longer reduces to a linear equation. Instead of directly finding the optimal translations and rotations, we use the fact that the average is the best consensus (Theorem 1), and present an iterative algorithm that converges to a minimum of wRMSD. We align each structure to the average structure separately in each iteration. Because translating and rotating structures also changes the average structure, we repeat until the algorithm converges to a local minimum of wRMSD.

Algorithm 1. Given n structures with m points (atoms) each and weights $w_{ik} \ge 0$ at each position, minimize wRMSD to within a chosen ε, e.g., ε = 10⁻⁵.

1. Calculate the average structure $\bar S$ with points $\bar p_k = \frac{1}{n}\sum_{i=1}^n \bar w_{ik} p_{ik}$, and the weighted sum of squared distances to $\bar S$: $SD = \sum_{i=1}^n \sum_{k=1}^m w_{ik} \|p_{ik} - \bar p_k\|^2$.
2. For each structure S_i (1 ≤ i ≤ n), compute the weighted centroid $B_i = \sum_{k=1}^m w_{ik} p_{ik} / \sum_{k=1}^m w_{ik}$ and translate the p_ik (1 ≤ k ≤ m) to the new positions $p_{ik} = p_{ik} - B_i$. Compute the weighted centroid of $\bar S$ using the weights of S_i, $C_i = \sum_{k=1}^m w_{ik} \bar p_k / \sum_{k=1}^m w_{ik}$, and translate the $\bar p_k$ (1 ≤ k ≤ m) to bring this centroid to the origin: $\bar p_k = \bar p_k - C_i$. Align S_i to $\bar S$ using Horn's method[8] to find the optimal rotation matrix R_i that minimizes $\sum_{k=1}^m w_{ik} \|R_i p_{ik} - \bar p_k\|^2$, and replace $p^{new}_{ik} = R_i p_{ik}$ (1 ≤ k ≤ m).
3. For each structure S_i (1 ≤ i ≤ n), compute the partial sum $D_i = \sum_{j=i+1}^n C_j$ and translate the $p^{new}_{ik}$ to the new positions $p^{new}_{ik} = p^{new}_{ik} - D_i$.
4. Calculate the new average $\bar S^{new}$ and $SD^{new} = \sum_{i=1}^n \sum_{k=1}^m w_{ik} \|p^{new}_{ik} - \bar p^{new}_k\|^2$.
5. If $(SD - SD^{new})/SD < \varepsilon$, the algorithm terminates; otherwise, set $SD = SD^{new}$ and $\bar S = \bar S^{new}$ and go to step 2.

The translations in step 3 preserve the weighted sum of squared distances between S_i and $\bar S$ obtained in step 2 by Horn's optimization. In step 2, after minimizing SD with structure S_i (1 ≤ i ≤ n), the average structure $\bar S$ has been translated by $D_i = \sum_{j=i+1}^n C_j$, so in step 3 we translate S_i by D_i to keep its SD to $\bar S$ unchanged.

Horn's method and our theorems imply that the deviation SD decreases monotonically in each iteration. From Theorem 1, we know that minimizing
the deviation SD to the average minimizes the global wRMSD. From Horn[8], in step 2 we have

$$\sum_{i=1}^n \sum_{k=1}^m w_{ik}\, \|p^{new}_{ik} - \bar p_k\|^2 \;\le\; \sum_{i=1}^n \sum_{k=1}^m w_{ik}\, \|p_{ik} - \bar p_k\|^2 = SD.$$

From Theorem 2, in step 4 we have

$$SD^{new} = \sum_{i=1}^n \sum_{k=1}^m w_{ik}\, \|p^{new}_{ik} - \bar p^{new}_k\|^2 \;\le\; \sum_{i=1}^n \sum_{k=1}^m w_{ik}\, \|p^{new}_{ik} - \bar p_k\|^2.$$

So $SD^{new} \le SD$, and SD decreases in each iteration. The algorithm stops when the decrease is less than a threshold ε and achieves a local minimum of SD. Horn's method calculates the optimal rotation matrix for two m-atom structures in O(m) operations, and the translations in steps 2 and 3 take O(nm) in total, so initialization and each iteration take O(nm) operations.
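A compact NumPy sketch of this iteration is given below. It is a reading of Algorithm 1, not the authors' MATLAB code: each weighted rotation is solved with the SVD-based Kabsch solution (which yields the same optimum as Horn's quaternion method), and the centroid bookkeeping of steps 2-3 is folded into a single per-structure superposition onto the current average; the monotone decrease of SD follows from the same argument.

```python
import numpy as np

def weighted_rotation(P, Q, w):
    """Rotation R minimizing sum_k w_k ||R p_k - q_k||^2 for centered point
    sets, via SVD of the weighted covariance (same optimum as Horn's
    quaternion method)."""
    U, _, Vt = np.linalg.svd((w[:, None] * P).T @ Q)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

def minimize_wrmsd(P, W, eps=1e-5, max_iter=200):
    """Iteratively superpose each structure onto the weighted average until
    the weighted sum of squared deviations (SD) stops decreasing.
    P: (n, m, 3) coordinates; W: (n, m) weights, zeros marking gaps."""
    P = np.array(P, dtype=float)
    n = P.shape[0]
    Wbar = n * W / W.sum(axis=0)
    average = lambda P: (Wbar[:, :, None] * P).sum(axis=0) / n
    SD = np.inf
    for _ in range(max_iter):
        A = average(P)
        for i in range(n):
            w = W[i]
            B = (w[:, None] * P[i]).sum(axis=0) / w.sum()  # centroid of S_i
            C = (w[:, None] * A).sum(axis=0) / w.sum()     # centroid of A under w
            R = weighted_rotation(P[i] - B, A - C, w)
            P[i] = (P[i] - B) @ R.T + C                    # superpose S_i on A
        SD_new = (W * ((P - average(P)) ** 2).sum(axis=2)).sum()
        if SD - SD_new < eps * SD:
            return P, SD_new
        SD = SD_new
    return P, SD
```

Gapped positions are handled simply by giving them zero weight.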
3 Results and Discussion
3.1 Performance

We test our algorithm by minimizing wRMSD for the 23 protein families from HOMSTRAD[12] that have more than 10 structures and a total aligned length longer than 100 (each aligned position containing more than two Cα atoms). We assign weight = 1 to aligned Cα atoms and weight = 0 to gaps. We run our algorithm 10,000 times for each protein family; each time we randomly translate (within 100 Å) and rotate each structure in 3D space, then minimize wRMSD. The results are shown in Table 1. For each protein family's 10,000 tests, the difference between the maximum and minimum RMSD is less than 1.0 × 10⁻⁵, so they converge to the same local minimum, which is most probably the global minimum. All optimal RMSD values found by our algorithm are less than the original RMSD of the alignments in HOMSTRAD. Fig. 2 shows that for all 23 families, each iteration decreases RMSD rapidly, within 5-6 iterations, whereas the maximum number of iterations for ε = 1.0 × 10⁻⁵ is 21. The code is written in MATLAB and is downloadable at http://www.cs.unc.edu/~xwang. The experiment was run on a 1.8 GHz Pentium M laptop with 768M of memory. Fig. 3 indicates that the observed average running time is linear in the number of atoms in the structures, so our algorithm approaches the lower bound for multiple structure alignment, Θ(nm).
Fig. 2. Convergence of wRMSD for 23 protein families. Each structure starts with a random translation and rotation.
Fig. 3. Average running time vs. number of atoms for 23 protein families
Table 1. Performance of the algorithm on different protein families from HOMSTRAD. We report n, the number of proteins, m, the number of atoms aligned, the wRMSD of the HOMSTRAD Alignment (HA), the wRMSD of the optimal alignment from our algorithm, and statistics on iterations and time (milliseconds) for 10,000 runs of each alignment.

Protein family  n   m    # gaps  wRMSD HA (Å)  optim. wRMSD  % rel. diff  Iterations (avg, med, max)  Times, ms (avg, med, max)
igvar-h         21  134  27      2.25          2.14          5.42         8.3, 8, 10                  60, 60, 70
glob            41  168  59      2.07          2.01          2.57         10.5, 11, 12                148, 150, 170
phoslip         18  130  19      1.51          1.49          1.53         10.5, 11, 12                66, 70, 150
uce             13  162  48      2.57          2.50          3.14         9.6, 10, 11                 57, 60, 70
lipocalin       15  190  72      3.97          3.88          2.34         11.6, 12, 14                87, 90, 110
ghf22           12  129  10      1.42          1.40          1.50         6.0, 6, 7                   29, 30, 40
fabp            17  137  15      1.89          1.89          0.26         6.9, 7, 8                   44, 40, 60
phc             12  177  29      3.20          2.93          9.24         9.1, 9, 12                  55, 50, 80
proteasome      17  283  135     6.86          6.10          12.50        13.2, 13, 16                137, 140, 160
sdr             13  297  120     4.03          3.73          8.00         9.5, 10, 12                 89, 90, 110
sermam          27  275  94      2.10          2.06          2.18         9.4, 9, 12                  134, 130, 170
cys             13  242  52      1.74          1.71          1.83         13.2, 13, 16                110, 110, 140
gluts           14  230  30      2.84          2.76          2.77         8.0, 8, 10                  62, 60, 80
α-amylase       23  616  415     6.14          6.01          2.11         15.5, 16, 19                382, 391, 471
ltn             12  246  44      1.51          1.49          1.39         7.6, 8, 9                   60, 60, 80
kinase          15  421  216     7.69          7.39          4.04         16.0, 16, 21                212, 210, 280
subt            11  309  87      2.82          2.78          1.53         15.6, 16, 18                148, 150, 180
α-amylase NC    23  741  517     6.24          6.09          2.40         14.5, 15, 20                407, 411, 551
tim             10  254  12      1.47          1.46          0.87         7.3, 7, 9                   51, 50, 70
grs             11  498  236     4.18          3.64          14.69        8.4, 8, 9                   110, 110, 140
ldh             14  352  86      2.64          2.60          1.41         11.6, 12, 14                127, 130, 160
p450            12  481  186     4.08          4.04          1.18         10.0, 10, 13                132, 130, 160
asp             13  346  49      2.20          2.15          2.46         9.4, 9, 12                  100, 100, 130
3.2 Finding Structurally Conserved Regions
Structurally conserved regions are of great importance for classifying molecules, determining active sites and functions, and applying homology modeling. RMSD has an inherent drawback in that outliers have strong effects, so RMSD alone cannot be used to determine conserved regions. Many different measurements have been developed to determine conserved regions[1,3]. Here we show that heuristic methods based on wRMSD can be developed to find conserved regions, overcoming this inherent drawback of RMSD. By modeling B-factors and deviations from the average positions as the weights, we demonstrate one heuristic to find well-aligned positions that determine the conserved region. We use the following iterative steps to adjust the weights (a code sketch follows the list below):

1. Align the protein structures using the algorithm of Section 2.2 by setting $w_{ik} = e^{-b_{ik}/10}$, where $b_{ik}$ is the B-factor of atom k in structure S_i.
2. For each aligned position k, calculate the number l of aligned atoms, the distances $d_{ik} = \|p_{ik} - \bar p_k\|$ for the l aligned structures, and the average squared distance $a_k = (\sum_l d_{lk}^2)/l$. Then calculate the mean $\bar a$ and standard deviation σ of the $a_k$.
3. If all $a_k \le \bar a + 3\sigma$, then exit the algorithm. Otherwise, set the weights $w_{ik} = e^{-b_{ik}/10} \times l/(n\,a_k)$ for the positions with $a_k \le \bar a + 3\sigma$ (1 ≤ k ≤ m) and all other weights to 0, align the structures by wRMSD, and go to step 2.

The B-factor measures the mobility or uncertainty of a given atom position. In general, a lower B-factor suggests that the position of the atom should be regarded as more precisely known, whereas outliers usually have larger B-factors. We introduce the term $e^{-b_{ik}/10}$, which gives higher weights to those atoms whose positions are more accurately determined. The term $1/a_k$ in the weights encourages the alignment at positions where the average squared deviations are small, and the term $l/n$ encourages positions with more aligned atoms. By combining these factors, we reduce the effects of outliers and enhance the weights of atoms in the structurally conserved region.
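A sketch of this weight-update loop, assuming the structures have already been aligned (e.g., with a `minimize_wrmsd` routine like the one sketched in Section 2.2) and using hypothetical input arrays, might look like:

```python
import numpy as np

def conserved_region_weights(P, B, mask, n_rounds=20):
    """Round-based sketch of the weight heuristic. P: (n, m, 3) aligned
    coordinates, B: (n, m) B-factors, mask: (n, m) 0/1 gap mask.
    Returns the final per-atom weights."""
    n = P.shape[0]
    W = mask * np.exp(-B / 10.0)                     # step 1: B-factor weights
    for _ in range(n_rounds):
        Wbar = n * W / np.maximum(W.sum(axis=0), 1e-12)
        A = (Wbar[:, :, None] * P).sum(axis=0) / n   # average structure
        d2 = ((P - A) ** 2).sum(axis=2)              # squared deviations d_ik^2
        l = mask.sum(axis=0)                         # aligned atoms per position
        a = (mask * d2).sum(axis=0) / l              # average sq. distance a_k
        cutoff = a.mean() + 3.0 * a.std()
        if (a <= cutoff).all():
            break                                    # step 3: convergence
        W = np.where(a <= cutoff,
                     mask * np.exp(-B / 10.0) * l / (n * np.maximum(a, 1e-12)),
                     0.0)
        # In the full heuristic the structures are re-aligned here with the
        # new weights, e.g. P, _ = minimize_wrmsd(P, W), before re-checking.
    return W
```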
[Fig. 4. Alignment of the short-chain dehydrogenases/reductases (sdr) and proteasome families before (a, c) and after (b, d) optimizing the conserved region. Positions are colored by number of standard deviations from the average: black a_k ≤ ā, peach ā < a_k ≤ ā + σ, brown ā + σ < a_k ≤ ā + 2σ, and gray a_k > ā + 2σ.]
Table 2. wRMSD before and after optimizing conserved regions for the sdr and proteasome families

Region (before, after)  a_k ≤ ā     a_k ≤ ā + σ  a_k ≤ ā + 2σ  all
sdr                     2.20, 1.84  2.60, 2.40   3.09, 3.12    3.80, 4.11
proteasome              3.46, 2.31  3.83, 3.01   4.18, 3.69    6.17, 6.94
Fig. 4 shows the sdr and proteasome families before and after optimizing the conserved region: Fig. 4a and 4c show the alignments before, and Fig. 4b and 4d the alignments after, optimizing the conserved region. From the figure we can see that the above iterative algorithm significantly improved the alignment of the conserved region. The changes in wRMSD for the regions $a_k \le \bar a$, $a_k \le \bar a + \sigma$, $a_k \le \bar a + 2\sigma$, and all $a_k$ are shown in Table 2. We can see that for each family the wRMSD over the whole structure increases, but the wRMSDs for the first three regions decrease, and the overall alignment is improved by achieving better alignments of the conserved regions.
4 Conclusion
In this paper we analyzed the problem of minimizing weighted RMSD for multiple structure alignment, which includes gapped alignment as a special case. Extending our previous work[20], we show that the wRMSD over all pairs is the same as the wRMSD to the average structure. We also show that translation and rotation cannot be separated in minimizing weighted RMSD, which makes the problem hard. To our knowledge, previous works[17,19] focus on optimizing rotations only, which fails to achieve the optimum RMSD in gapped structure alignment. Based on the properties of the average structure, we create an efficient iterative algorithm that achieves optimum translations and rotations in minimizing wRMSD, and we prove its convergence. The 10,000 tests on each of 23 protein families from HOMSTRAD show that our algorithm reaches the same local minimum regardless of the starting positions of the structures, so this local minimum is most probably the global minimum. We further discuss the effects of outliers on alignment using RMSD and present an iterative algorithm to find structurally conserved regions by iteratively assigning higher weights (modeled from the B-factors and deviations from the average positions) to better-aligned positions until reaching convergence. Our future work includes developing faster algorithms to align multiple structures, separating structures that cannot be aligned well within a group of structures, and accurately determining structurally conserved regions.

Acknowledgments. We thank Prof. Jane Richardson and Mr. Jeffrey Headd for helpful discussions. This research is supported by NIH grant GM-074127.
References

1. Altman, R.B., Gerstein, M.: Finding an Average Core Structure: Application to the Globins. In: Proc. 2nd Int. Conf. Intell. Syst. Mol. Biol., pp. 19–27 (1994)
2. Branden, C., Tooze, J.: Introduction to Protein Structure, 2nd edn. Garland Publishing, New York (1999)
3. Chew, L.P., Kedem, K.: Finding the Consensus Shape for a Protein Family. Algorithmica 38(1), 115–129 (2003)
4. Dror, O., Benyamini, H., Nussinov, R., Wolfson, H.J.: Multiple Structural Alignment by Secondary Structures: Algorithm and Applications. Protein Science 12(11), 2492–2507 (2003)
5. Ebert, J., Brutlag, D.: Development and Validation of a Consistency Based Multiple Structure Alignment Algorithm. Bioinformatics 22(9), 1080–1087 (2006)
6. Gerstein, M., Levitt, M.: Comprehensive Assessment of Automatic Structural Alignment Against a Manual Standard, the SCOP Classification of Proteins. Protein Science 7(2), 445–456 (1998)
7. Guda, C., Scheeff, E.D., Bourne, P.E., Shindyalov, I.N.: A New Algorithm for the Alignment of Multiple Protein Structures Using Monte Carlo Optimization. In: Proceedings of Pacific Symposium on Biocomputing, pp. 275–286 (2001)
8. Horn, B.K.P.: Closed-form Solution of Absolute Orientation Using Unit Quaternions. Journal of the Optical Society of America A 4(4), 629–642 (1987)
9. Konagurthu, A.S., Whisstock, J.C., Stuckey, P.J., Lesk, A.M.: MUSTANG: A Multiple Structural Alignment Algorithm. Proteins 64(3), 559–574 (2006)
10. Leibowitz, N., Nussinov, R., Wolfson, H.J.: MUSTA: A General, Efficient, Automated Method for Multiple Structure Alignment and Detection of Common Motifs: Application to Proteins. Journal of Computational Biology 8(2), 93–121 (2001)
11. Lupyan, D., Leo-Macias, A., Ortiz, A.R.: A New Progressive-iterative Algorithm for Multiple Structure Alignment. Bioinformatics 21(15), 3255–3263 (2005)
12. Mizuguchi, K., Deane, C.M., Blundell, T.L., Overington, J.P.: HOMSTRAD: A Database of Protein Structure Alignments for Homologous Families. Protein Science 7, 2469–2471 (1998)
13. Ochagavia, M.E., Wodak, S.: Progressive Combinatorial Algorithm for Multiple Structural Alignments: Application to Distantly Related Proteins. Proteins 55(2), 436–454 (2004)
14. Pennec, X.: Multiple Registration and Mean Rigid Shapes: Application to the 3D Case. In: Proceedings of the 16th Leeds Annual Statistical Workshop, pp. 178–185 (1996)
15. Russell, R.B., Barton, G.J.: Multiple Protein Sequence Alignment from Tertiary Structure Comparison: Assignment of Global and Residue Confidence Levels. Proteins 14(2), 309–323 (1992)
16. Shatsky, M., Nussinov, R., Wolfson, H.J.: A Method for Simultaneous Alignment of Multiple Protein Structures. Proteins 56(1), 143–156 (2004)
17. Sutcliffe, M.J., Haneef, I., Carney, D., Blundell, T.L.: Knowledge Based Modelling of Homologous Proteins, Part I: Three-dimensional Frameworks Derived from the Simultaneous Superposition of Multiple Structures. Protein Engineering 1(5), 377–384 (1987)
18. Taylor, W.R., Flores, T.P., Orengo, C.A.: Multiple Protein Structure Alignment. Protein Science 3(10), 1858–1870 (1994)
19. Verboon, P., Gabriel, K.R.: Generalized Procrustes Analysis with Iterative Weighting to Achieve Resistance. Br. J. Math. Stat. Psychol. 48(1), 57–73 (1995)
20. Wang, X., Snoeyink, J.S.: Multiple Structure Alignment by Optimal RMSD Implies that the Average Structure is a Consensus. In: Proceedings of 2006 LSS Computational Systems Bioinformatics Conference, pp. 79–87 (2006)
Appendix

In the space remaining, we sketch the proof of Theorem 4.

Proof. We aim to find optimal rotations R_i and translations T_i to minimize the target function

$$\operatorname*{argmin}_{R,T} \Big( \sum_{i=1}^{n}\sum_{k=1}^{m} w_{ik}\, \| R_i p_{ik} - T_i - \overline{Rp}_k + \bar T_k \|^2 \Big).$$

Assume that we know the optimal rotations R_i for each structure S_i (1 ≤ i ≤ n) and we need to find the optimal translations T_i. We move each structure S_i by a vector A_i which satisfies the n equations

$$\sum_{k=1}^{m} w_{ik}\, R_i (p_{ik} - A_i) \;=\; \frac{1}{n} \sum_{k=1}^{m} w_{ik} \Big( \sum_{l=1}^{n} \bar w_{lk}\, R_l (p_{lk} - A_l) \Big).$$

Letting $q_{ik} = p_{ik} - A_i$, we have $\sum_{k=1}^{m} w_{ik} R_i q_{ik} = \sum_{k=1}^{m} w_{ik}\, \overline{Rq}_k$, where $\overline{Rq}_k = \frac{1}{n}\sum_{l=1}^{n} \bar w_{lk} R_l q_{lk}$.

The new average structure $\bar S$ built from the $q_{ik}$ (1 ≤ i ≤ n, 1 ≤ k ≤ m) has points $\bar q_k = \frac{1}{n}\sum_{l=1}^{n} \bar w_{lk} q_{lk} = \frac{1}{n}\sum_{l=1}^{n} \bar w_{lk} (p_{lk} - A_l) = \bar p_k - \frac{1}{n}\sum_{l=1}^{n} \bar w_{lk} A_l = \bar p_k - \bar A_k$. Note that we have the equality $\overline{Rp}_k = \overline{Rq}_k + \overline{RA}_k$. The target function after the translation becomes

$$\operatorname*{argmin}_{R,T} \Big( \sum_{i=1}^{n}\sum_{k=1}^{m} w_{ik}\, \| R_i q_{ik} + R_i A_i - T_i - \overline{Rq}_k - \overline{RA}_k + \bar T_k \|^2 \Big).$$

Let $r_{ik} = R_i A_i - T_i - \overline{RA}_k + \bar T_k$ and expand the target function to obtain

$$\sum_{i=1}^{n}\sum_{k=1}^{m} w_{ik}\, \| R_i q_{ik} - \overline{Rq}_k \|^2 \;+\; 2 \sum_{i=1}^{n}\sum_{k=1}^{m} w_{ik}\, (R_i q_{ik} - \overline{Rq}_k) \cdot r_{ik} \;+\; \sum_{i=1}^{n}\sum_{k=1}^{m} w_{ik}\, \| r_{ik} \|^2.$$

Since $\sum_{k=1}^{m} w_{ik} R_i q_{ik} = \sum_{k=1}^{m} w_{ik} \overline{Rq}_k$ for (1 ≤ i ≤ n), the second term is zero and we are left with the first and third terms. The first term does not depend on the T_i (1 ≤ i ≤ n), and $w_{ik} \ge 0$ for (1 ≤ i ≤ n, 1 ≤ k ≤ m), so the target function is minimized by setting $r_{ik} = 0$. Expanding $r_{ik} = 0$ and rearranging, we have $R_i A_i - T_i = \overline{RA}_k - \bar T_k = \frac{1}{n}\sum_{l=1}^{n} \bar w_{lk} (R_l A_l - T_l)$. So the optimum translation is achieved when $T_i = R_i A_i$, i.e., the T_i satisfy the following n linear equations:

$$\sum_{k=1}^{m} w_{ik}\, (R_i p_{ik} - T_i) \;=\; \frac{1}{n} \sum_{k=1}^{m} w_{ik} \Big( \sum_{l=1}^{n} \bar w_{lk}\, (R_l p_{lk} - T_l) \Big).$$

Note that at most n − 1 of these equations are independent; in fact, we have the equality

$$\sum_{i=1}^{n}\sum_{k=1}^{m} w_{ik}\, (R_i p_{ik} - T_i) \;=\; \frac{1}{n} \sum_{i=1}^{n}\sum_{k=1}^{m} w_{ik} \Big( \sum_{l=1}^{n} \bar w_{lk}\, (R_l p_{lk} - T_l) \Big).$$

Last, we solve for the T_i (1 ≤ i ≤ n) from the n equations:

$$T_i - \frac{1}{n}\sum_{l=1}^{n} \Big( \sum_{k=1}^{m} w_{ilk} \Big/ \sum_{k=1}^{m} w_{ik} \Big) T_l \;=\; \Big( \sum_{k=1}^{m} w_{ik} R_i p_{ik} - \frac{1}{n} \sum_{k=1}^{m}\sum_{l=1}^{n} w_{ilk} R_l p_{lk} \Big) \Big/ \sum_{k=1}^{m} w_{ik}.$$

Letting $a_l = \frac{1}{n} \sum_{k=1}^{m} w_{ilk} / \sum_{k=1}^{m} w_{ik}$ and $b_i = \big( \sum_{k=1}^{m} w_{ik} R_i p_{ik} - \frac{1}{n} \sum_{k=1}^{m}\sum_{l=1}^{n} w_{ilk} R_l p_{lk} \big) / \sum_{k=1}^{m} w_{ik}$, the n equations become $T_i - \sum_{l=1}^{n} a_l T_l = b_i$. By fixing one translation T_j (1 ≤ j ≤ n), the remaining n − 1 translations are

$$T_i \;=\; T_j + b_i - b_j \;=\; T_j - \frac{ \frac{1}{n}\sum_{l=1}^{n} R_l \big( \sum_{k=1}^{m} p_{lk} (w_{ilk} - w_{jlk}) \big) - R_i \sum_{k=1}^{m} w_{ik} p_{ik} + R_j \sum_{k=1}^{m} w_{jk} p_{jk} }{ \sum_{k=1}^{m} w_{ik} }.$$
Using Protein Domains to Improve the Accuracy of Ab Initio Gene Finding

Mihaela Pertea and Steven L. Salzberg

Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA
{mpertea,salzberg}@umiacs.umd.edu
Abstract. Background: Protein domains are the common functional elements used by nature to generate tremendous diversity among proteins, and they are used repeatedly in different combinations across all major domains of life. In this paper we address the problem of using similarity to known protein domains to help with the identification of genes in a DNA sequence. We have adapted the generalized hidden Markov model (GHMM) architecture of the ab initio gene finder GlimmerHMM so that a higher probability is assigned to exons that contain homologues of known protein domains. To our knowledge, this domain-homology-based approach has not been used previously in the context of ab initio gene prediction. Results: GlimmerHMM was augmented with a protein domain module that recognizes gene structures that are similar to Pfam models. The augmented system, GlimmerHMM+, shows a 2% improvement in sensitivity and a 1% increase in specificity in predicting exact gene structures compared to GlimmerHMM without this option. These results were obtained on two very different model organisms: Arabidopsis thaliana (mustard weed) and Danio rerio (zebrafish); together, these preliminary results demonstrate the value of using protein domain homology in gene prediction. The results obtained are encouraging, and we believe that a more comprehensive approach, including a model that reflects the statistical characteristics of specific sets of protein domain families, would further increase the accuracy of gene prediction. GlimmerHMM and GlimmerHMM+ are freely available as open source software at http://cbcb.umd.edu/software.

Keywords: Pfam, protein domain, profile HMM, GHMM, ab initio gene finding.
1 Introduction
There are many computational approaches to identifying genes in newly sequenced genomes. The most reliable ones use similarity to expressed messenger RNA sequences (mRNAs or ESTs) or to known protein sequences to find the locations of genes [5]. Although not as successful as similarity-based methods, ab initio gene finders are used to predict genes without any knowledge of homology to other genes, proteins or expressed sequences. When attempting to predict
novel genes, i.e., genes that are unique to the species at hand with no homology to any previously characterized gene, ab initio methods are often the only available choice for gene finding. To improve the accuracy of gene predictions, several gene finders based on ab initio methods now have the capacity to predict genes that are homologous to mRNAs or protein sequences available in public databases, while still retaining their ability to discover novel genes to which no proteins or ESTs align [7,10,11,12]. In this paper, we take a somewhat different approach from these previous methods: instead of using EST and protein alignments to guide the gene predictions, we use much shorter protein domains, which typically comprise a short region of a protein. Because many protein domains are conserved across species and are not necessarily specific to only one protein [9], there is a much greater chance that the amino acid sequence of a predicted gene will contain at least some similarity to a known protein domain than to a complete protein sequence from another organism. In addition, the use of protein domains can take advantage of the very large, relatively comprehensive databases of protein domains that have already been compiled from previously sequenced genomes. To study the effect of using protein domains in improving the accuracy of gene predictions, we employ GlimmerHMM, an ab initio eukaryotic gene finder with accuracy comparable to other state-of-the-art de novo gene finders [1,8]. Protein domain scores, which reflect the similarity of a gene sequence to a known domain, are incorporated into the GHMM mathematical framework of GlimmerHMM. To compute protein domain scores we used HMMER (http://hmmer.wustl.edu/) with Pfam models (release 21.0). A brief description of both GHMMs for gene finding and of protein domain prediction using HMMER is given below.
1.1 GHMM Decoding
In the context of gene finding, a GHMM is a state-based generative model in which each state emits a sequence of bases comprising a feature such as an exon or intron. Therefore, gene finding with a GHMM involves finding the most probable parse φ_max of a given nucleotide sequence S:

$$\varphi_{\max} = \operatorname*{argmax}_{\varphi} \prod_{i=1}^{n} P_e(S_i \mid q_i, d_i)\, P_t(q_i \mid q_{i-1})\, P_d(d_i \mid q_i) \qquad (1)$$
where (1) the concatenation S_1, ..., S_n of individual features (such as exons and introns) forms the input sequence S, (2) $P_e(S_i \mid q_i, d_i)$ is the emission probability of the sequence S_i conditional on the state q_i and duration d_i, (3) $P_t(q_i \mid q_{i-1})$ is the transition probability to state q_i from a previous state q_{i-1}, (4) $P_d(d_i \mid q_i)$ is the duration probability that a sequence of length d_i is generated from a given state q_i, and
(5) each $\varphi = \{(q_i, d_i) \mid 0 < i \le n\}$ specifies a time-ordered series of states and integer durations during a single run of the GHMM starting in an initial non-emitting state q_0.

The GHMM framework offers flexibility by allowing additional states (modeling different features) to be added to the model, and it provides a competitive probabilistic model for the gene finding problem. This model is implemented in numerous ab initio gene finders, including GlimmerHMM [8,13].
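As a toy illustration of equation (1), the sketch below scores a single candidate parse in log space; the state names and probability tables are invented, and a real decoder would of course maximize over all parses (e.g., via a Viterbi-style dynamic program) rather than score one.

```python
import math

def parse_log_prob(parse, trans, dur, emit):
    """Log-probability of one candidate parse under a GHMM, following (1):
    a product of emission, transition, and duration terms per feature.
    parse is a list of (state, feature_sequence) pairs."""
    logp, prev = 0.0, "START"
    for state, seq in parse:
        logp += math.log(trans[prev][state])      # P_t(q_i | q_{i-1})
        logp += emit(seq, state)                  # log P_e(S_i | q_i, d_i)
        logp += math.log(dur[state][len(seq)])    # P_d(d_i | q_i)
        prev = state
    return logp

# Invented toy model: an exon state and an intron state.
trans = {"START": {"exon": 1.0}, "exon": {"intron": 0.9, "exon": 0.1},
         "intron": {"exon": 1.0}}
dur = {"exon": {3: 0.5, 6: 0.5}, "intron": {4: 1.0}}
emit = lambda seq, state: len(seq) * math.log(0.25)   # uniform base model

print(parse_log_prob([("exon", "ATG"), ("intron", "GTAG"),
                      ("exon", "TGATAA")], trans, dur, emit))
```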
1.2 Protein Domain Predictions
A powerful description of protein domains is provided by profile hidden Markov models (profile HMMs, [3]) as stored in the Pfam database [4]. Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families. Profile HMMs can be used to do sensitive database searching using statistical descriptions of a sequence family's consensus. A freely distributable implementation of profile HMMs for protein sequence analysis is provided with the HMMER software. HMMER looks for known domains in a query sequence and returns a ranked list of hits in the Pfam database. Each hit has two scores associated with it: a bit score and an E-value. Six expert-calibrated raw score cutoffs are also maintained for the Pfam models in order to make the protein domain predictions very specific. The HMMER bit score reflects how well the query sequence matches a profile model from the database, while the E-value measures how statistically significant the bit score is. Therefore a hit with a high bit score and a low E-value will likely be a true homologue of a known protein domain. Formally, the bit score of a target domain sequence S is given by

$$\mathrm{BitScore}(S) = \log_2 \frac{P(S \mid HMM)}{P(S \mid null)} \qquad (2)$$
where $P(S \mid HMM)$ is the probability of the target sequence according to a given profile HMM, and $P(S \mid null)$ is the probability of the target sequence under a null-hypothesis model of the statistical properties of a random sequence. The null model is represented by a simple one-state HMM that outputs the symbols in the sequence according to a specific model distribution of residue composition. Thus, a positive score means the profile HMM is a better model of the target sequence than the null model (i.e., the profile HMM gives a higher probability).
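Equation (2) is simply a log2-odds ratio; for instance, under made-up probabilities:

```python
import math

def bit_score(p_hmm: float, p_null: float) -> float:
    """Log2-odds of the profile HMM against the null model, as in (2)."""
    return math.log2(p_hmm / p_null)

# A target sequence 2**20 times more likely under the profile HMM
# than under the null model scores exactly 20 bits:
print(bit_score(2.0 ** -80, 2.0 ** -100))   # -> 20.0
```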
2 Incorporating Protein Domains
In order to allow GlimmerHMM to utilize homology to known domains, we permitted each coding state in the GHMM to also emit any of the known domains from the Pfam database at the same time as the feature initially represented by that state (see Fig. 1). If δ domains are known in the database, then the emission probability in (1) can be replaced by
Using Protein Domains to Improve the Accuracy
Pe (Si |qi , di ) =
m
Pe (Sij , Iij |qij , dji )Pt (qij |qij−1 )
211
(3)
j=1
where q_i^j, 1 ≤ j ≤ m, represents any of the states in Fig. 1(b) (or the state q_i if j = 0), S_i^j, 1 ≤ j ≤ m, is a sequence of length d_i^j emitted from state q_i^j, the concatenation of the sequences S_i^1, S_i^2, ..., S_i^m forms the sequence S_i, I_i^j represents either a null sequence or a known domain (complete or partial), and P_t(q_i^j | q_i^{j−1}) are the transition probabilities. For simplicity, the product of the duration probabilities of the features S_i^j, 1 ≤ j ≤ m, is assumed to be equal to the duration probability of the complete feature S_i. The emission probabilities on the right side of (3) can be decomposed via conditional probability as:

P_e(S_i^j, I_i^j | q_i^j, d_i^j) = P_e(S_i^j | q_i^j, d_i^j) P(I_i^j | S_i^j, q_i^j, d_i^j)    (4)
As described in [8], an efficient way to compute the maximization step required by (1) is to utilize log-likelihood ratios instead of probabilities for the emission probabilities. The denominators of these ratios are the probabilities of the target sequences under a null model that describes the statistical properties of the non-coding sequences. This modification is mathematically valid, and allows us to skip the evaluation of all non-coding states. We refer to such log-likelihood ratios as feature scores. Since the emission probabilities in GlimmerHMM are computed using a Markov chain, which is a multiplicative model, the product of the emission probabilities of the features S_i^j, 1 ≤ j ≤ m, is equal to the emission probability of the complete feature S_i. Therefore, using (4), we can estimate the score of an exon feature E as:

Score(E) = log_2 [ P(E | coding) / P(E | null) ] + ∑_{D⊆E} log_2 [ P(D | E) / P(D | null) ]    (5)
where D is a protein domain (or a part of a domain) included in the exon E. The transition probabilities in (3) would ideally be estimated from the training data, but our implementation assumes that they are all equal, and therefore requires no additional training. Running GlimmerHMM with protein domain homology can be accomplished by providing the gene finder at run time with a file containing all predicted protein domains whose coordinates have been mapped to the genomic sequence. The domain predictions are obtained in a pre-processing step by running HMMER on all open reading frames (ORFs) in the input DNA sequence, after translating these ORFs into proteins. Only predictions with both a positive HMMER bit score and an E-value smaller than 0.1 are retained. GlimmerHMM uses these domain predictions and their scores to compute the score of each exon feature. The log-likelihood ratio corresponding to the presence of a domain in (5) is estimated as:

log_2 [ P(D | E) / P(D | null) ] = BitScore(D) · l_{D∩E} / l_D    (6)
Fig. 1. GHMM exon state architecture in GlimmerHMM: (a) initially, and (b) after including protein domain homology. The state in (a) emits an entire exon, E, at a time, while the diamond-shaped states in (b) emit fragments of an exon, e, and at most one of the known protein domains. The model can cycle arbitrarily many times through this portion of the state graph, emitting any number of protein domains within an exon. The two circle states x and y are non-emitting states and are shown only to simplify the connections between the diamond-shaped states.
where l_D is the length of the domain D, computed in base pairs, l_{D∩E} is the length in base pairs of the part of the domain (complete or fragmentary, as predicted by HMMER) overlapping exon E, and BitScore(D) is the HMMER score of the domain. Note that in (6) we relaxed the constraint D ⊆ E from (5) and allowed the predicted domain to extend over the edges of the exon.
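A minimal sketch of the scoring in (5)-(6): the domain term scales BitScore(D) by the base-pair fraction of the domain overlapping the exon, which also handles domains extending past the exon boundaries. The half-open interval convention and the precomputed coding log-likelihood ratio are assumptions, not GlimmerHMM internals.

```python
def domain_term(domain, exon):
    """Eq. (6): BitScore(D) * l_{D∩E} / l_D for one predicted domain."""
    d_start, d_end, bits = domain        # half-open bp interval + HMMER bits
    e_start, e_end = exon
    overlap = max(0, min(d_end, e_end) - max(d_start, e_start))
    return bits * overlap / (d_end - d_start) if overlap else 0.0

def exon_score(coding_llr, exon, domains):
    """Eq. (5): coding log-likelihood ratio plus the domain evidence terms."""
    return coding_llr + sum(domain_term(d, exon) for d in domains)

# e.g., a 90-bp domain with bit score 20 overlapping 70 bp of the exon:
print(exon_score(4.2, (100, 250), [(80, 170, 20.0)]))
```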
Also note that the architecture in Fig. 1 does not explicitly model the fact that a protein domain could span several exons. This can still happen in this architecture, however, since the predicted domains are not necessarily complete. In many cases, HMMER run on six-frame translations of sequences from our test data predicted fragments of the same domain over several consecutive exons.
3 Results
To evaluate the accuracy of gene finding both with and without protein domain homology, we needed databases of confirmed genes that were accurately annotated. We chose two model organisms for training and testing our new system: the model plant Arabidopsis thaliana and the model fish Danio rerio (zebrafish). A description of the two data sets used, and the gene recognition accuracy improvements obtained by using protein domain homology, are presented below.
3.1 Data Sets
A. thaliana has gained much interest from the scientific community as a model organism for research in plant genetics. Our analyses were done on a set of very high-quality gene models obtained from 5000 full-length transcripts sequenced and released in 2001 [6] (GenBank accession numbers AY084215-AY089214). Since homology in the data set could influence the results, we refined this reference set of gene models by using BLAST [27] to perform pairwise alignments between all genes. Sequences that aligned for more than 80% of their length with a BLAST E-value of less than 10^{−10} were removed. The resulting set includes 4048 genes that we used for training and testing of our algorithm. D. rerio is a widely used model organism for studies of vertebrate development and gene function. A high-confidence data set was downloaded from the Vertebrate Genome Annotation (VEGA) database (the July 25, 2006 update). VEGA [2] is a central repository for high-quality, frequently updated, manual annotation of vertebrate finished genome sequences. We selected only genes coding for a known protein, or genes identical or homologous to cDNAs from the same species, for which an unambiguous ORF could be assigned. We manually inspected all genes for annotation errors and eliminated the ones with no start or stop codons present or with non-canonical splice sites. In total, 2,684 D. rerio genes fully supported by biological data were selected for the final data set.
3.2 Accuracy of Gene Prediction
Running HMMER on all possible open reading frames contained in the input DNA sequences of all the genes included in our data sets resulted in only about 30% coverage of all coding base pairs in both A. thaliana and zebrafish, but with relatively high specificity (see Table 1). It is interesting to note that the average domain length, computed in base pairs, approximates very well the average length of the exons included in both data sets. While the coverage by protein domains at the nucleotide level was quite low, 81% of the zebrafish genes and 99% of the A. thaliana genes had at least one protein domain predicted for them.
Table 1. Coverage by predicted protein domains of the coding nucleotides in the A. thaliana and D. rerio data sets. Sn measures the percentage of coding base pairs covered by predicted protein domains, while Sp represents the percentage of the base pairs included in the predicted protein domains that are also coding.

Organism      CDS (bp)     Predicted domains   Avg. exon length (bp)   Avg. domain length (bp)   Sn (%)   Sp (%)
A. thaliana   3,371,737    5,800               194                     193                       32       96
D. rerio      3,584,426    7,625               164                     173                       31       84
Table 2. Sensitivity and specificity results on the A. thaliana and D. rerio data sets for GlimmerHMM and for GlimmerHMM+, the gene finder enhanced with the ability to use protein domain homology in detecting gene structures.

                 A. thaliana                    D. rerio
                 GlimmerHMM   GlimmerHMM+       GlimmerHMM   GlimmerHMM+
Gene Sn          0.52         0.54              0.20         0.22
Gene Sp          0.49         0.51              0.12         0.13
Exon Sn          0.86         0.86              0.75         0.77
Exon Sp          0.86         0.87              0.69         0.70
Nucleotide Sn    0.98         0.98              0.90         0.93
Nucleotide Sp    0.99         0.99              0.82         0.82
The results in Table 2 show the effect of including protein domain homology in GlimmerHMM. These results were obtained by applying a 5-fold cross-validation procedure for both species: each data set was randomly divided into five non-overlapping subsets, and each subset was held out separately while the system was trained on the remaining four. The most remarkable improvement is at the gene level. Here, GlimmerHMM+ - the gene finder enhanced with the ability to use domain homology - obtained an increase of 2% in the sensitivity of gene detection in both A. thaliana and zebrafish, while at the same time improving specificity by 1% (zebrafish) or 2% (Arabidopsis). Note, though, that specificity cannot be precisely defined at the gene level, since predicted genes need to be verified experimentally in order to confirm whether or not they are real. To evaluate gene specificity, we only looked at the gene predictions that overlapped the real genes in the data set. The increase in accuracy observed at the gene level tended to be maintained at the exon and nucleotide levels in zebrafish, but was not significant in the case of Arabidopsis.
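For concreteness, a minimal sketch of the 5-fold split described above; the random seed and the representation of genes are arbitrary choices, not details of our evaluation pipeline.

```python
import random

def five_fold_splits(genes, seed=0):
    """Randomly divide the data set into five non-overlapping subsets;
    each subset is held out once while training uses the remaining four."""
    idx = list(range(len(genes)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::5] for i in range(5)]
    for held_out in folds:
        test = set(held_out)
        yield ([genes[i] for i in idx if i not in test],   # training set
               [genes[i] for i in held_out])               # held-out set
```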
4 Conclusions
In this study we explored the use of protein domain homology to improve ab initio gene prediction. As a result, the gene finder GlimmerHMM now has the option to use protein domains predicted on the input DNA sequence to guide the structure of the predicted genes. This method for integrating information from protein domains with the GHMM gene predictor has the advantage of improving the accuracy of gene finding when domain homology is present, while recognizing genes at least as accurately when no domains are predicted in the input genome.
References

1. Allen, J.E., Majoros, W.H., Pertea, M., Salzberg, S.L.: JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions. Genome Biol. 7(Suppl 1:S9), 1–13 (2006)
2. Ashurst, J.L., Chen, C.K., Gilbert, J.G., Jekosch, K., Keenan, S., Meidl, P., Searle, S.M., Stalker, J., Storey, R., Trevanion, S., Wilming, L., Hubbard, T.: The Vertebrate Genome Annotation (Vega) database. Nucleic Acids Res. 33(Database issue), D459–D465 (2005)
3. Eddy, S.R.: Profile hidden Markov models. Bioinformatics 14(9), 755–763 (1998)
4. Finn, R.D., Mistry, J., Schuster-Bockler, B., Griffiths-Jones, S., Hollich, V., Lassmann, T., Moxon, S., Marshall, M., Khanna, A., Durbin, R., Eddy, S.R., Sonnhammer, E.L., Bateman, A.: Pfam: clans, web tools and services. Nucleic Acids Res. 34(Database issue), D247–D251 (2006)
5. Guigo, R., Flicek, P., Abril, J.F., Reymond, A., Lagarde, J., Denoeud, F., Antonarakis, S., Ashburner, M., Bajic, V.B., Birney, E., Castelo, R., Eyras, E., Ucla, C., Gingeras, T.R., Harrow, J., Hubbard, T., Lewis, S.E., Reese, M.G.: EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol. 7(Suppl 1:S2), 1–31 (2006)
6. Haas, B.J., Volfovsky, N., Town, C.D., Troukhan, M., Alexandrov, N., Feldmann, K.A., Flavell, R.B., White, O., Salzberg, S.L.: Full-length messenger RNA sequences greatly improve genome annotation. Genome Biol. 3(6), RESEARCH0029 (2002)
7. Krogh, A.: Using database matches with HMMGene for automated gene detection in Drosophila. Genome Res. 10(4), 523–528 (2000)
8. Majoros, W.H., Pertea, M., Salzberg, S.L.: TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics 20(16), 2878–2879 (2004)
9. Ponting, C.P., Russell, R.R.: The natural history of protein domains. Annu. Rev. Biophys. Biomol. Struct. 31, 45–71 (2002)
10. Reese, M.G., Kulp, D., Tammana, H., Haussler, D.: Genie - gene finding in Drosophila melanogaster. Genome Res. 10(4), 529–538 (2000)
11. Solovyev, V., Kosarev, P., Seledsov, I., Vorobyev, D.: Automatic annotation of eukaryotic genes, pseudogenes and promoters. Genome Biol. 7(Suppl 1:S10), 1–12 (2006)
12. Wei, C., Brent, M.R.: Using ESTs to improve the accuracy of de novo gene prediction. BMC Bioinformatics 7, 327 (2006)
13. Zhang, M.Q.: Computational prediction of eukaryotic protein-coding genes. Nat. Rev. Genet. 3(9), 698–709 (2002)
Genomic Signatures in De Bruijn Chains

Lenwood S. Heath and Amrita Pati

Department of Computer Science, Virginia Tech, Blacksburg, VA 24061-0106
{heath,apati}@vt.edu
Abstract. Genomes have both deterministic and random aspects, with the underlying DNA sequences exhibiting features at numerous scales, from codons to regions of conserved or divergent gene order. This work examines, within a graph-theoretic setting, the unique manner in which oligonucleotides fit together to comprise a genome. A de Bruijn chain (DBC) is a generalization of a finite Markov chain. A DNA word graph (DWG) is a generalization of a de Bruijn graph that records the occurrence counts of nodes and edges in a genomic sequence generated by a DBC. We combine the properties of DWGs and DBCs to obtain a powerful genomic signature that is information-rich, efficient, and sufficiently representative of the sequence from which it is derived. We illustrate its practical value in distinguishing genomic sequences and predicting the origin of short DNA sequences of unknown origin, while highlighting its superior performance compared to existing genomic signatures, including the dinucleotide odds ratio.
1 Introduction
The genome G of an organism is a set of long nucleotide sequences modeled, within a formal language framework, as strings over Σ_DNA = {A, C, G, T}, the DNA alphabet. While G itself is a unique mathematical structure for the organism, a genome is typically quite large (e.g., billions of bases for the human genome) and differs slightly from one individual of a species to another. Fix a genomic sequence H that is a substring of some string in G. Intuitively, a genomic signature for an organism is a mathematical structure θ(H) derived from H, which, ideally, can be efficiently computed, is significantly smaller to represent than H, and, if H is sufficiently representative of G, can uniquely identify the original organism. The intent is that the signature of other large substrings from G be highly similar to θ(H) and distinguishable from signatures of other organisms. A genomic signature is judged along two, typically antagonistic, dimensions: (1) the amount of compression achieved by θ(H), and (2) its effectiveness in identifying the genome. In this work, we derive and use genomic signatures that are useful in a number of applications, with emphasis on the identification of short unknown sequences. The species from which a genomic sequence is derived is its origin. A genomic sequence X of unknown origin is to be analyzed. We visualize X as an overlap of several successive short sequences of length w each, in a specific manner.
Fig. 1. Identification of overlapping words of length 4 within the sequence ACGTTGCAGTATT
Figure 1 illustrates this. The order is the word length at which a genomic sequence is analyzed. A pre-defined signature θ_w(X), at order w, is computed from X and compared to the same signature at the same order w for all available species. The amount of difference between θ_w(X) and existing signatures is used to predict the origin of X. Oligonucleotide frequencies have been described as characteristic features of genomes in many works [1,2,3,4,5,6,7,8,9,10,11]. Karlin and Burge [7] were among the first to use the term genomic signature. They define the dinucleotide odds ratio (θ^{dor}), or relative abundance, which is the collection of 16 functions defined for dinucleotides XY by

ρ_XY(H) = f_XY(H) / (f_X(H) f_Y(H)),
where f_x(H) is the frequency of string x as a substring of H. They observe that ρ values are similar throughout a genome, and compare θ^{dor} for a number of organisms to demonstrate its capability of distinguishing organisms. Karlin et al. [8] observe that individual components of the θ^{dor} vector typically range from 0.78 to 1.23. They define a normalized L1-distance, called the delta-distance (δ), to distinguish between species. Jernigan and Baran [6] demonstrate that the δ-distance between the θ^{dor} signatures of strings sampled within a genome is approximately preserved over a wide range of string lengths, while it varies in the case of strings sampled from different genomes. Deschavanne et al. [3] construct images from oligonucleotide frequencies to build the application GENSTYLE [3,5], which predicts the approximate origin of a sequence using L1-distances to the oligonucleotide frequency vectors of all genome sequences in the Entrez database. The application TETRA [10] uses tetranucleotide frequencies to calculate similarity between sequences. For bacterial species, Coenye and Vandamme [2] correlate δ with 16S rDNA sequence similarity and DNA-DNA hybridization values. They find a strong negative correlation between δ and 16S rDNA similarity among groups of species with low δ and high 16S rDNA similarity. For 57 prokaryotic genomes, Sandberg et al. [9] compare G+C content, oligonucleotide frequency, and codon bias. Dufraigne et al. [4] and van Passel et al. [11] employ oligonucleotide frequencies to identify regions of horizontal gene transfer (HGT) in prokaryotes. Carbone et al. [1] correlate the ecological niches of 80 Eubacteria and 16 Archaea to codon bias used as a genomic signature.
on the amount of variation, the identification of unknown DNA, and the effect of short available sequence length on these signatures. As part of our DNA Words program investigating mathematical invariants derived from genomes, we examine the finest scale in graph-theoretic terms, while integrating DNA word graph structure with Markov chain properties. One frequently exploited observation is that a string over Σ_DNA defines a walk in a suitably defined de Bruijn graph. Closely related is the correspondence of such a string to an Eulerian tour in a suitably defined multigraph. Applications include DNA physical mapping, DNA sequence assembly, and multiple sequence alignment problems [12,13,14,15,16]. In previous work [17], we examine signatures derived from the manner in which a DNA word graph fragments when subjected to edge deletion. We showed that these signatures performed much better than oligonucleotide frequency vectors in terms of differentiating between diverse genomes. Further, for unknown sequences of length 1 Mb, these signatures were able to accurately identify the origin. We also showed that bacterial sequences were most conserved at order 5. In this work, we emphasize the importance of being able to identify the origin of much shorter sequences (a few Kb) using the signatures defined in Section 2, and analyze the amount of variation among signatures of different genomic sequences. In Section 2, we formalize the mathematical basis for graph-theoretic genomic signatures and describe the algorithm used to predict the origin of an unknown genomic sequence. In Section 3, we describe the results of using the proposed algorithm, and compare the performance of our method with existing methods. Section 4 draws conclusions and describes ongoing and future work.
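Before formalizing our graph-theoretic signatures, the sketch below computes the baseline θ^{dor} defined above. It counts frequencies on the given strand only and assumes all four bases occur in H; Karlin and Burge's symmetrized variant, which also counts the inverted complementary strand, is a common refinement not shown here.

```python
from collections import Counter

def dinucleotide_odds_ratio(H):
    """The 16 values rho_XY(H) = f_XY(H) / (f_X(H) * f_Y(H))."""
    n = len(H)
    f1 = {a: c / n for a, c in Counter(H).items()}   # mononucleotide freqs
    f2 = Counter(H[i:i + 2] for i in range(n - 1))   # dinucleotide counts
    return {x + y: (f2[x + y] / (n - 1)) / (f1[x] * f1[y])
            for x in "ACGT" for y in "ACGT"}

print(dinucleotide_odds_ratio("ACGTTGCAGTATT")["CG"])  # ~2.35
```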
2 Preliminaries and Methods
An alphabet is a finite, non-empty set of symbols; the DNA alphabet is Σ_DNA = {A, C, G, T}. A string or word x over Σ_DNA is a finite sequence x = σ_1 σ_2 ... σ_w of symbols from Σ_DNA; its length |x| is w. A single chromosome in a genome is typically written as the string of nucleotides on one DNA strand. A genomic sequence is a chromosomal sequence or any substring of it. G is the set of all chromosomal sequences from an organism. Nucleotide frequencies vary among organisms, while, as Fickett et al. [18] observe, the frequencies of A's and T's (and hence of G's and C's) are approximately constant within a single genome. If x and y are strings, then occ(x, y) is the count of occurrences of x in y. Fix a word length w ≥ 1, and let l = 4^w. The order-w state space is S^w = Σ_DNA^w, the set consisting of the l words of length w. The order-w de Bruijn graph DB^w = (S^w, E) is a directed graph, where (x_i, x_j) ∈ E when x_i σ = ι x_j for some σ, ι ∈ Σ_DNA; such an edge is labeled σ [19]. Figure 2 provides one depiction of the order-2 DNA word graph. Let H ∈ Σ_DNA^* have length |H| = n; we think of H as a long genomic sequence that traces a walk in DB^w. The vertex count of x_i in H is vc(x_i, H) = occ(x_i, H), while the edge count of edge (x_i, x_j) ∈ E in H, where x_i σ = ι x_j, is ec((x_i, x_j), H) = occ(x_i σ, H). The order-w DNA word graph DNA^w(H) is DB^w together with labels vc(x_i, H) for each x_i ∈ S^w and ec((x_i, x_j), H) for each (x_i, x_j) ∈ E.
Fig. 2. Representation of the de Bruijn graph DB^2 in terms of supernodes and superedges. Each supernode consists of the 4 nodes with the same 1-symbol prefix in their labels and is enclosed by a dotted boundary. An edge from a node to a supernode represents a set of edges from the node to all nodes in the supernode. For example, the edge from node AC to supernode C represents the set of edges {(AC, CA), (AC, CC), (AC, CG), (AC, CT)}.
For x_i, x_j ∈ S^w, the frequency of x_j after x_i in H is

Freq((x_i, x_j), H) = 0, if (x_i, x_j) ∉ E or vc(x_i, H) = 0; and
Freq((x_i, x_j), H) = ec((x_i, x_j), H) / vc(x_i, H), otherwise.

For 1 ≤ i ≤ l, let x_i be the i-th element of S^w in lexicographic order. The order-w word count vector χ_H^w of H is the l-vector having components occ(x_i, H), in lexicographic order. We consider Markov chains with state space S^w having nonzero transition probabilities only for edges in DB^w; such a Markov chain is called an order-w de Bruijn chain (DBC). In the rest of this paper, we approximate the modeling of genomic signatures by DBCs. This approximation is based on the following premise. Let DC be an order-w DBC with l × l transition probability matrix P = (p_ij); here, p_ij is the probability of a one-step transition from state x_i to state x_j [20]. P is sparse, with at most 4 nonzero entries per row. The order-w DBC, DC^w(H), for genomic sequence H has transition probabilities p_ij = Freq((x_i, x_j), H). Genomic sequences are sufficiently large and diverse in their composition to sample all words in S^w for reasonably small w ∈ [1, 5]. Hence, any DBC generating such a sequence is irreducible. It is also reasonable to assume that DBCs generating genomic sequences are aperiodic and recurrent non-null. Throughout, we assume that all DBCs are ergodic, and hence that there is a unique stationary distribution π = (π_i) on S^w satisfying πP = π [20]. This assumption does not hold for a short genomic sequence that consists of systematic repeats of a small subset of words from S^w, whose DBC might not satisfy ergodicity.
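The counts and transition frequencies above are straightforward to compute in one pass over H. The sketch below is a naive reference implementation of these definitions, not the implementation used in our experiments.

```python
from collections import Counter

def word_graph_counts(H, w):
    """vc and ec labels of the DNA word graph DNA^w(H): vc(x) = occ(x, H),
    ec((x, y)) = occ(x·sigma, H) for each de Bruijn edge x -> y."""
    vc = Counter(H[i:i + w] for i in range(len(H) - w + 1))
    ec = Counter((H[i:i + w], H[i + 1:i + w + 1]) for i in range(len(H) - w))
    return vc, ec

def freq(edge, vc, ec):
    """Freq((x_i, x_j), H), used as the DBC transition probability p_ij."""
    x_i = edge[0]
    return ec[edge] / vc[x_i] if vc[x_i] else 0.0

vc, ec = word_graph_counts("ACGTTGCAGTATT", 2)
print(freq(("AC", "CG"), vc, ec))  # occ("ACG") / occ("AC") = 1.0 here
```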
For a genome G and a genomic sequence H taken from G, a genomic signature for H is a function θ that maps H to a mathematical structure θ(H). Ideally, θ(H) is able to identify sufficiently large substrings that come from G and to distinguish H from genomic sequences of other genomes. To be useful, θ(H) must be efficiently computable. Of course, a representation of G itself satisfies the requirements, but offers no advantage in space. Fixing word length w ≥ 1, we obtain DNA^w(H), with associated vc(x_i, H) and ec((x_i, x_j), H). We define several candidate signatures. The simplest is the vertex count vector θ_w^{cv} = (vc(x_i, H))_{i=1}^{l}, requiring space Θ(4^w lg n). Additional signatures come from the interplay between the graph structure DB^w and the count vectors. Let ψ ≥ 0 be an integer threshold, and let E^{≤ψ} = {(i, j) ∈ E | ec((i, j), H) ≤ ψ} be the set of edges with counts at most ψ. Then edge deletion is the process of deleting edges in E^{≤ψ} from DB^w, while varying ψ from 0 to Ξ = max{ec((i, j), H) | (i, j) ∈ E} and deleting edges with tied counts in arbitrary order. The ψ-edge deletion of DB^w is DB^w(ψ) = (S^w, E − E^{≤ψ}). As ψ increases from 0 to Ξ, the number of connected components in DB^w(ψ) increases from 1 to l, while the number of isolated vertices increases from 0 to l. The vertex deletion order θ^{vdo} is the permutation of S^w giving the order in which vertices become isolated during edge deletion. Let ψ_i be the smallest integer such that DB^w(ψ_i) has precisely i connected components. The component-based edge deletion vector θ^{ced} is the l-vector whose i-th component is the number of edge deletions required to go from i − 1 to i components. The vertex-based edge deletion vector θ^{ved} is the l-vector whose i-th component is the number of edge deletions required to go from i − 1 to i isolated vertices. The ordered vertex-based edge deletion vector θ^{oed} is the l-vector whose i-th component is the total number of edge deletions required to isolate the vertex x_i, where x_i is the i-th element of S^w in lexicographic order. In previous work [17], we established the superiority of the ordered vertex-based edge deletion vector θ^{oed} over the other signatures discussed above. However, the performance of θ^{oed} decreases with decreasing sequence length. Here, we introduce a new mathematical signature that, to the best of our knowledge, performs better than all existing signatures. Define the ordered frequency of vertex deletion vector θ^{ofv} as the l-vector whose i-th component is the ψ at which the vertex labeled with the i-th string in lexicographic order was isolated. The de Bruijn chain vector θ^{dbc} is the 2l-vector π_2 · θ_2^{ofv}, where π_2 is the stationary distribution for the order-2 de Bruijn chain. Our results (not shown here) indicated that the performance of θ^{dbc} was much better than the individual performances of π and θ^{ofv}. For two vector-based signatures θ_1 and θ_2, d(θ_1, θ_2) is the L1 metric in l-dimensional real space and R(θ_1, θ_2) is the Pearson correlation coefficient. In the rest of this paper, we describe the algorithm used to detect the origin of unknown genomic sequences using the θ^{dbc} signature, and study its performance with varying sequence length. We imagine every biological sequence to be generated by a formal model that can be approximated by a DBC. For a set of genomic sequences, let D be the set of their θ^{dbc} signatures. Let H be a genomic sequence whose origin is unknown. Then Algorithm 1 is used to approximate the origin of H.
Algorithm 1. MATCH
Input: Set S of genomic sequences, database D of existing θ_2^{dbc} signatures for the sequences in S, sequence H of unknown origin.
1: Compute θ_2^{ofv}(H)
2: Compute π_2(H)
3: θ_2^{dbc}(H) ← π_2(H) · θ_2^{ofv}(H)
4: maxcorr ← 0
5: origin(H) ← λ
6: for each sequence X ∈ S do
7:   θ_2^{dbc}(X) ← D(X)
8:   ρ ← R(θ_2^{dbc}(H), θ_2^{dbc}(X))
9:   if ρ > maxcorr then
10:    maxcorr ← ρ
11:    origin(H) ← origin(X)
12: return origin(H)
The signature θ_2^{dbc} for a sequence of length n can be computed in O(n + 16 log n + 4096) time and space. In general, the complexity of the order-w θ_w^{dbc} signature for a sequence of length n is O(n + 4^w log n + (4^w)^3). The (4^w)^3 factor is contributed by the Cholesky decomposition performed by MATLAB to compute the stationary distribution. For small w ∈ [1, 4], we observed that the time complexity was dominated by n.
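A compact Python rendering of Algorithm 1, with the signature computation abstracted behind a callable; numpy's corrcoef supplies the Pearson correlation R. The names theta_dbc and the dictionary database layout are illustrative.

```python
import numpy as np

def match(H, database, theta_dbc):
    """Algorithm 1: return the origin whose stored signature has the
    highest Pearson correlation with theta_dbc(H)."""
    sig = theta_dbc(H)                      # steps 1-3: compute the signature
    maxcorr, origin = 0.0, None             # steps 4-5 (None plays lambda)
    for species, ref in database.items():   # steps 6-11
        rho = np.corrcoef(sig, ref)[0, 1]
        if rho > maxcorr:
            maxcorr, origin = rho, species
    return origin                           # step 12
```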
3 Results and Discussion
To evaluate θ^{dbc} and to compare it to existing genomic signatures, we performed a set of experiments using bacterial and eukaryotic genomes. First, we tested the ability of θ^{dbc} to differentiate among diverse genomes. We computed θ_2^{dbc} for chromosomal or whole genome sequences of the prokaryotic bacteria R. leguminosarum (5.1 Mb, NC 008380), E. litoralis (3.1 Mb, NC 007722), M. leprae (3.3 Mb, NC 002677.1), N. meningitidis (2.2 Mb, NC 008767.1), P. falciparum (chr 12, 2.3 Mb, NC 004316.2), P. aeruginosa (6.4 Mb, NC 002516.2), S. pneumoniae (2.1 Mb, NC 008533.1), and E. coli (4.7 Mb, NC 000913), and the eukaryotes C. elegans (chr 1, 15.3 Mb, NC 003279), H. sapiens (chr 1, 228.7 Mb, AC 000044), A. thaliana (chr 4, 18.8 Mb, NC 003075), and S. cerevisiae (chr 4, 1.6 Mb, NC 001136). The computed signatures were stored in a database D. From each genome, 100 sequences of length 10 K each were randomly sampled. For each sample X, the vector θ_2^{dbc}(X) was correlated, using the Pearson correlation coefficient, with all the θ_2^{dbc} vectors in D. Per genome, the distribution of the correlation coefficients of the θ_2^{dbc} of each of the 100 samples with the θ_2^{dbc} of the origin is illustrated using box-and-whisker plots in Fig. 3(a). The same plot also illustrates, per genome, the distribution of the correlation coefficients of the θ_2^{dbc} of each of the 100 samples with the other, non-origin θ_2^{dbc} signatures. Observe that, per genome, the average correlation of the sample sequence θ_2^{dbc} signatures to the θ_2^{dbc} of the origin is extremely high (> 0.95 and concentrated), while the average correlation to the θ_2^{dbc} of non-origin genomes is much lower (< 0.6 and widely distributed). This illustrates that the θ^{dbc} signature picks out distinct
structural characteristics of the DBC for each genome, even for sequences as short as 10 K. In contrast, the sample sequences used in [3] are of length 100 K. The wide distribution of genome sizes used for sampling demonstrates that the θ^{dbc} signature works well for genomes of all sizes. Define a first hit as the scenario where the signature of the sample sequence matches the genomic signature of its origin with the highest correlation. Define a good hit as the scenario where the signature of the sample sequence matches the genomic signature of its origin with a correlation that is among the three highest correlations. Then Fig. 3(b) plots the efficiency of θ_2^{dbc} using first hits for different sample lengths, while Fig. 3(c) does the same using good hits. Efficiency is computed as follows. For a sample X, the matches to θ_2^{dbc}(X) are ranked 1, 2, 3, ... in decreasing order of their correlation coefficients. In a first hit scenario, the origin is ranked 1, whereas, in a good hit scenario, the origin is ranked 1, 2, or 3. The number of first hits (or good hits) per 100 samples is the efficiency. Observe that the performance of θ_2^{dbc} increases with increasing sample size, reaching 100% first hits at length 100 K for 10 out of 12 genomes. Second, we tested the ability of θ_2^{dbc} to differentiate among closely related genomes. We used 20 α-proteobacterial genomes taken from the Entrez genome database. Of the 196 α-proteobacterial genomes in Entrez, we selected the 63 genomic sequences that were greater than 1 Mb in size. Different chromosomes and plasmids of the same genome were concatenated into single sequences, producing a total of 53 sequences. For each of these sequences, θ^{dor} and θ_2^{dbc} were computed and stored in D. For testing purposes, we chose 20 α-proteobacterial genomic sequences whose sizes were distributed between 1 Mb and 7 Mb. These are Wolbachia BM (1.1 Mb, NC 006833), R. typhi (1.1 Mb, NC 006142), A. marginale (1.2 Mb, NC 004842), C. pelagibacter (1.3 Mb, NC 007205), A. phagocytophilum (1.5 Mb, NC 007797), B. suis (chr 1, 2.1 Mb, NC 004310), G. bethesdensis (2.7 Mb, NC 008343), P. denitrificans (chr 1, 2.9 Mb, NC 008686), E. litoralis (3.1 Mb, NC 007722), S. alaskensis (3.4 Mb, NC 008048), H. neptunium (3.8 Mb, NC 008358), C. crescentus (4.1 Mb, NC 002696), S. pomeroyi (4.2 Mb, NC 003911), Jannaschia ssp. CCS1 (4.4 Mb, NC 007802), R. rubrum (4.4 Mb, NC 007643), N. hamburgensis (4.5 Mb, NC 007964), M. magneticum (5.0 Mb, NC 007626), R. leguminosarum (5.1 Mb, NC 008380), R. palustris (5.6 Mb, NC 008435), and M. loti (7.1 Mb, NC 002678). In previous work [17], we proved that graph-based signatures perform better than word count vectors. The only existing signature that performs comparably to θ^{dbc} is the dinucleotide odds ratio θ^{dor}. Figure 4 compares the performances of θ^{dor} and θ_2^{dbc}. We took 100 samples each of lengths 1 K, 5 K, 10 K, and 20 K from each of the 20 genomes. Let X be a sample taken from genome H_X. For X, θ^{dor}(X) and θ_2^{dbc}(X) are computed. θ^{dor}(X) is compared to all θ^{dor} in D, while θ_2^{dbc}(X) is compared to all θ_2^{dbc} in D, as shown in Algorithm 1. For each signature, the rank of the match to the corresponding signature for H_X is computed (where rank 1 indicates the best match). Ranks are compared between signatures for each sample to compare performance. Figure 4 illustrates that, for all sample lengths, θ_2^{dbc} outperforms θ^{dor}. The only genome on which θ^{dor} performs better
Fig. 3. Performance of θ_2^{dbc}. (a) The 12 species are on the x-axis. The small box-and-whisker plots near the top (with associated circles) represent the distribution of correlations of the θ_2^{dbc} signatures of the 100 samples with the θ_2^{dbc} of their origin. The larger box-and-whisker plots represent the distribution of correlations with the θ_2^{dbc} signatures of the other genomes. (b) Plot of the efficiency of θ_2^{dbc} in identifying the origin of unknown sequences of various lengths in the first hit scenario. The 12 species are on the x-axis, the lengths of the sample sequences are on the y-axis, and the efficiency is plotted on the z-axis. (c) Same plot as in (b), but in the good hit scenario.
is that of Wolbachia. We also observe that, as sample length decreases, θ_2^{dbc} outperforms θ^{dor} by greater margins.
Fig. 4. Comparison of the performance of θ^{dbc} and θ^{dor} for 20 α-proteobacteria. In this rotated figure, genomes are arranged in increasing order of size on the x-axis. 100 sequence samples are taken from each genome. Panels (a), (b), (c), and (d) compare performance for sample lengths 1 K, 5 K, 10 K, and 20 K, respectively. Black represents the fraction of samples where θ^{dbc} performs better; dark gray represents the fraction of samples where θ^{dor} performs better; light gray represents the fraction of samples where both perform equally.
Figure 5 illustrates the efficiency of θ_2^{dbc} in distinguishing between closely related genomes. Efficiency is computed in the same way as described before.
Fig. 5. Plot of the percentage of (a) first hits and (b) good hits per α-proteobacterium for sample sizes 1 Kb, 5 Kb, 10 Kb, and 20 Kb
Observe that efficiency increases with increasing sample size. In the first hit scenario, the average efficiencies for sample sizes 1 Kb, 5 Kb, 10 Kb, and 20 Kb are 39.7%, 68.2%, 77.3%, and 81.9%, respectively. In the good hit scenario, the average efficiencies for sample sizes 1 Kb and 5 Kb are 63.8% and 86.3%, respectively. Genomic signatures of order 2 have the simplest computational complexity and yet illustrate the increased performance of θ^{dbc} signatures over the θ^{dor} and θ^{wcv} signatures. Examination of higher-order signatures is one of the directions we are pursuing now. Several applications of the θ^{dbc} signature are possible. Using θ^{dbc} on short genomic sequences to calculate phylogenetic relationships eliminates the need for tedious multiple alignments to compute phylogenetic distances. θ^{dbc}-based distances can be used to pick out non-homogeneous regions in a genome and explain putative phenomena behind the non-homogeneity. θ^{dbc} picks out characteristics of sequences that multiple alignment does not, so it can be used to determine the evolution and origin of rare microbial species.
4 Conclusions
The genomic signatures introduced in this paper are systematically derived from the structure of DNA word graphs obtained from genomic sequences and from properties of de Bruijn chains. When sufficient sequence for an organism is present in a biological sample, the target organism for the sample can be retrieved by querying an already existing database of signatures. We have demonstrated that θ^{dbc} is an extremely powerful signature, able to efficiently identify the origin of an unknown genomic sequence as short as a few kilobases. This implies that the origin and the closest relatives of an unknown sequence can be identified with very little actual sequencing. In [17], we showed that distances between signatures can be characterized within a probabilistic framework in terms of the parameters of the underlying DBC assumed to generate the sequences. In continuing work, we are developing probabilistic bounds to characterize the performance of θ^{dbc} in a theoretical framework. We also continue to investigate the minimum sequence size required so that the signature computed from that sequence is useful for a given order.
References

1. Carbone, A., Kepes, F., Zinovyev, A.: Codon bias signatures, organization of microorganisms in codon space, and lifestyle. Molecular Biology and Evolution 22(3), 547–561 (2005)
2. Coenye, T., Vandamme, P.: Use of the genomic signature in bacterial classification and identification. Systematic and Applied Microbiology 27(2), 175–185 (2004)
3. Deschavanne, P.J., Giron, A., Vilain, J., Fagot, G., Fertil, B.: Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Molecular Biology and Evolution 16(10), 1391–1399 (1999)
4. Dufraigne, C., Fertil, B., Lespinats, S., Giron, A., Deschavanne, P.: Detection and characterization of horizontal transfers in prokaryotes using genomic signature. Nucleic Acids Research 33(1), 12 pages (2005)
5. Fertil, B., Massin, M., Lespinats, S., Devic, C., Dumee, P., Giron, A.: GENSTYLE: exploration and analysis of DNA sequences with genomic signature. Nucleic Acids Research 33(Web Server issue), W512–W515 (2005)
6. Jernigan, R.W., Baran, R.H.: Pervasive properties of the genomic signature. BMC Genomics 3, 9 pages (2002)
7. Karlin, S., Burge, C.: Dinucleotide relative abundance extremes - a genomic signature. Trends in Genetics 11(7), 283–290 (1995)
8. Karlin, S., Mrazek, J., Campbell, A.M.: Compositional biases of bacterial genomes and evolutionary implications. Journal of Bacteriology 179(12), 3899–3913 (1997)
9. Sandberg, R., Branden, C.I., Ernberg, I., Coster, J.: Quantifying the species-specificity in genomic signatures, synonymous codon choice, amino acid usage, and G+C content. Gene 311, 35–42 (2003)
10. Teeling, H., Meyerdierks, A., Bauer, M., Amann, R., Glockner, F.O.: Application of tetranucleotide frequencies for the assignment of genomic fragments. Environmental Microbiology 6, 938–947 (2004)
11. van Passel, M.W.J., Bart, A., Thygesen, H.H., Luyf, A.C.M., van Kampen, A.H.C., van der Ende, A.: An acquisition account of genomic islands based on genome signature comparisons. BMC Genomics 6, 10 pages (2005)
12. Pevzner, P.A.: DNA physical mapping and alternating Eulerian cycles in colored graphs. Algorithmica 13(1-2), 77–105 (1995)
13. Pevzner, P.A., Tang, H.X., Waterman, M.S.: An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences of the United States of America 98(17), 9748–9753 (2001)
14. Zhang, Y., Waterman, M.S.: An Eulerian path approach to global multiple alignment for DNA sequences. Journal of Computational Biology 10(6), 803–819 (2003)
15. Raphael, B., Zhi, D.G., Tang, H.X., Pevzner, P.: A novel method for multiple alignment of sequences with repeated and shuffled elements. Genome Research 14(11), 2336–2346 (2004)
16. Zhang, Y., Waterman, M.S.: An Eulerian path approach to local multiple alignment for DNA sequences. Proceedings of the National Academy of Sciences of the United States of America 102(5), 1285–1290 (2005)
17. Heath, L.S., Pati, A.: Genomic signatures from DNA word graphs. LNCS (LNBI), vol. 4463, pp. 317–328. Springer, Heidelberg (2007)
18. Fickett, J.W., Torney, D.C., Wolf, D.R.: Base compositional structure of genomes. Genomics 13(4), 1056–1064 (1992)
19. Rosenberg, A.L., Heath, L.S.: Graph Separators, with Applications. Frontiers of Computer Science. Kluwer Academic/Plenum Publishers, Dordrecht (2000)
20. Feller, W.: An Introduction to Probability Theory and Its Applications, 3rd edn., vol. I. John Wiley & Sons, New York (1968)
Fast Kernel Methods for SVM Sequence Classifiers

Pavel Kuksa and Vladimir Pavlovic

Department of Computer Science, Rutgers University, Piscataway, NJ 08854
{pkuksa,vladimir}@cs.rutgers.edu
Abstract. In this work we study string kernel methods for sequence analysis and focus on the problem of species-level identification based on short DNA fragments known as barcodes. We introduce efficient sorting-based algorithms for exact string k-mer kernels and then describe a divide-and-conquer technique for kernels with mismatches. Our algorithms for mismatch kernel matrix computations improve currently known time bounds for these computations. We then consider the mismatch kernel problem with feature selection, and present efficient algorithms for it. Our experimental results show that, for string kernels with mismatches, kernel matrices can be computed 100-200 times faster than traditional approaches. Kernel vector evaluations on new sequences show similar computational improvements. On several DNA barcode datasets, k-mer string kernels considerably improve identification accuracy compared to prior results. String kernels with feature selection demonstrate competitive performance with substantially fewer computations.
1 Introduction
Biological species identification through DNA barcodes has been proposed recently in [1]. In the DNA barcoding setting, DNA sequencing of a mitochondrial region is used to obtain a relatively short sequence, the DNA barcode, that is subsequently used as a marker for species identification and classification. This approach contrasts with traditional identification methods that rely on markers from multiple genomic locations. DNA barcoding has shown great promise due to its increased robustness and predictive value for rapid and accurate identification of species. For instance, barcoding analysis via the cox1 gene of moth and fly specimens intercepted at New Zealand's borders resulted in improved correct placement of previously unknown species or increased resolution of specimens [2]. The reliance of DNA barcoding on a short single fragment of DNA sequence necessitates new computational methods to deal efficiently with this single sequence-based assignment. Several methods, based on pairwise alignments [3] or statistical approaches using evolutionary distances [4], have been applied to the tasks of identification and analysis of DNA barcode data. However, a number of challenges remain to be addressed, including the accuracy of identification [3,4,5,6], as well as the efficiency and scalability of computational methods.
In this study we investigate kernel classification methods for DNA barcoding. Kernel-based classification has demonstrated strong performance in many related tasks of biological sequence analysis, such as protein classification and remote homology detection [7,8,9]. There are several types of kernels for biological sequences, including kernels derived from probabilistic models [10], k-mer string kernels [7,8], and weighted-decomposition kernels [11]. In this work we focus on the recently proposed k-mer string kernels. In our approach, species identification is performed by first transforming sequences (potentially of varying length) into fixed-length representations (string spectra) and then classifying them into one of many established species classes using Support Vector Machine (SVM) classifiers [12,13]. As a result, the string kernel-based species identification in our study demonstrates high accuracy and improved classification performance compared to previously employed methods. The improved accuracy of kernel-based classification methods in the sequence domain is typically challenged by their computational complexity. To address the computational aspects of the method, we propose novel and efficient algorithms for solving the string kernel-based learning problems. We also introduce string kernels with feature selection, which perform as well as the methods based on the full feature sets while having significantly lower computational cost. Our experimental results show that, for string kernels with mismatches, kernel matrices can be computed by factors of 100-200 times faster. Identification of new sequences similarly requires significantly less time than with the standard approaches. We also observe that the k-mer string kernels considerably improve identification accuracy compared to the previously reported results on several barcode datasets. String kernels with feature selection demonstrate competitive classification performance with substantially fewer computations. This paper is organized as follows. We start by introducing efficient sorting-based algorithms for exact string k-mer kernels (Section 3). In Section 4 we describe a divide-and-conquer technique for exact k-spectrum kernels and k-mer kernels with m mismatches, which, combined with the sorting, improves currently known time bounds for the mismatch kernel computations. We then introduce the mismatch kernel problem with feature selection and present algorithms for efficient computation of the string kernels with feature selection (Section 5). A comparison of our kernel method with a baseline computational approach is discussed in Section 6. Finally, we evaluate several feature spaces and provide a comparative analysis of the performance of our method on three publicly available barcode datasets in Section 8.
2 Species Identification Problem in the DNA Barcoding Setting
The species identification problem can be described in the following way: given an unlabeled sample (specimen) X and a set SP of known species represented by their reference barcode sequences or models, the task is to assign the given sample to one of the known species (or to decide that this sample does not belong to any of
the known categories). We solve this global multiclass problem by dividing it into a collection of binary membership problems. In a binary membership problem, the task is to decide whether an input sequence belongs to a particular class. We apply the kernel-based formalism [12,13] to the design of classifiers for binary species identification. Given a training set of species barcodes SP and their corresponding species labels S, a sequence kernel is used in an SVM setting to learn a species classifier for new sequences:

species(x = s) = yes, if ∑_{i∈M} α_{i,s} k_s(x, x_i) > 0; no, otherwise,

where the α_{i,s} and M ⊆ SP are estimated using standard SVM methods [13]. A critical point in this formalism, when applied to the domain of sequences, is the complexity of kernel computations. The authors of [7,8] proposed an efficient algorithm using suffix trees to address this problem. We next describe an alternative algorithm for sequence kernels that exhibits improved performance both in time and in space, compared to the traditional approach. The proposed algorithm can further incorporate feature selection to reduce the dimensionality of the problems. Selecting a small subset of features not only leads to computationally more efficient procedures, but is also biologically interesting, since the selected features can facilitate understanding of the species identity.
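A sketch of the resulting per-species classifier using scikit-learn's precomputed-kernel SVM. The kernel matrices are assumed to come from the algorithms of Sections 3-4, and the variable names are illustrative, not part of our released code.

```python
import numpy as np
from sklearn.svm import SVC

def train_species_classifier(K_train, labels, species):
    """Binary membership classifier for one species, given a precomputed
    string kernel matrix K_train (training x training sequences)."""
    y = np.array([1 if l == species else -1 for l in labels])
    clf = SVC(kernel="precomputed")
    clf.fit(K_train, y)
    return clf

# For new barcodes, build K_test with K_test[i, j] = k(x_new_i, x_train_j);
# clf.decision_function(K_test) > 0 realizes the membership rule above.
```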
3 Counting-Sort Formalism for String Kernel Algorithms
In this section we introduce sorting-based algorithms to compute string k-mer kernels more efficiently. The counting-sort based framework leads to fast and scalable practical algorithms for string kernels suitable for large k and m. In a counting-sort framework, each substring of length k (k-mer) is considered as a k-digit integer number in base |Σ|. A list of n integer k-digit numbers, where each digit is from the integer alphabet 1 ... |Σ|, can be sorted in Θ(kn) time using k passes of counting sort. The space complexity of this approach is Θ(n), since we can reuse space and store only the current column to be sorted. Given n integers in sorted order, one pass over the list is sufficient to output, for each distinct element in the list, the frequency of its occurrence. The exact k-spectrum kernel for two sequences can then be computed in time Θ(kn), linear in the length n of the sequences. Given a set of N sequences, the spectrum kernel matrix can be computed in linear O(nN) space and O(knN + min(u, n) · N^2) time, where the last term reflects the complexity of updating the kernel matrix^1 and u is the number of unique k-mers in the set, bounded above by min(nN, |Σ|^k). The proposed algorithm improves the time bounds compared to the suffix tree algorithms with higher O(N^2 kn) complexity, and is simpler and easier to implement.
^1 It is easy to see that the complexity of updating the matrix is min(u, n)N^2. Consider the u-by-N matrix C = [c_{i,j}] of k-mer counts, where c_{i,j} is the number of times k-mer i occurs in the j-th sequence. Since there are no more than min(u, n)N non-zero elements in C, the kernel matrix update complexity is min(u, n)N^2.
In summary, our counting sort algorithm for the spectrum kernel performs the following steps:

Step 1. Extract and store the k-mers from the input sequences, O(knN) time.^2
Step 2. Sort the obtained list L using counting sort, O(knN) time.
Step 3. Compute feature counts by scanning the sorted list and update the kernel matrix on each change in the feature value, O(knN + min(u, n) · N^2) time.

For each unique feature f (there are u of them), in Step 3 the kernel matrix is updated as follows:
(1)
where updf = {i : f ∈ xi } is a set of input sequences that contain f and cf = [nxi (f )]i∈updf is a vector of feature counts for each sequence from updf . In the case of (k, m)-mismatch kernel when up to m mismatches are allowed, a set of unique features u can be extracted first using the above algorithm and then used in computations instead of the original redundant set. This preprocessing step takes O(knN ) time, however, since nN (number of features collected from all the input sequences) is much larger than u in the case of DNA sequences, this preprocessing step results in the improved performance of the algorithm.
4 Divide-and-Conquer Algorithms for Exact and Mismatch String Kernels
While the sorting formalism results in an improved computational method for kernel evaluation, it is possible to gain a further reduction in computational complexity. The efficient computation is achieved by using a linear-time character-based clustering to divide the problem into subproblems, and a merging procedure that updates the kernel matrix. Using a divide-and-conquer technique, the exact and mismatch kernel problems can be solved recursively as follows:

Step 1: The original set L of features, composed of all k-mers extracted from the N input sequences, is divided into subsets L_1, ..., L_{|Σ|} using character-based clustering.
Step 2: The same procedure (the divide step) is applied to each of the subsets L_1, ..., L_{|Σ|} recursively. The depth of the recursion is bounded by k (since clustering continues until there are no substrings left or depth k is reached). Each node at depth k corresponds to some feature f and stores the counts n_{x_i}(f) (the number of times f appears in x_i) for all the sequences that contain f; these counts are used to update the kernel matrix as in (1).
^2 The complexity of the feature extraction step can be reduced if k-mers are stored as integers (e.g., a 32-bit word can store k-mers with values of k up to 16 when |Σ| = 4). Features can then be extracted from all the input sequences in O(N(n + k)) time, since feature i can be computed from feature i − 1 in O(1) time and it takes O(k) time to compute the first feature. The extracted features can then be sorted in Θ(knN) time and Θ(nN) space.
The procedure above builds one recursion tree for all input sequences. At each level l = 1 ... k of the recursion tree there are |Σ|^l clusters; each l-level cluster corresponds to a distinct substring of length l. Each cluster C consists of a number of subclusters SC, where each subcluster is formed by the k-mers from one particular sequence. At level k of the recursion tree, each node contains a collection of subclusters, where each subcluster corresponds to the set of substrings from one particular sequence that are in the neighborhood of the node feature; i.e., each node points to all the substrings that are in the neighborhood of the base string (the node feature). The recursion procedure above results in an incremental algorithm for the mismatch kernel computation.
4.1 Analysis of Incremental Mismatch Kernel Algorithm
The time complexity of the incremental mismatch kernel algorithm can be expressed as

u · ∑_{l=1}^{k} ∑_{i=0}^{min(m,l)} C(l, i) (|Σ| − 1)^i + u · N^2,

where C(l, i) denotes the binomial coefficient. At each level l, the algorithm gives a solution to the (l, min(m, l))-mismatch problem. The last term in the expression for the time complexity reflects the cost of computing the kernel matrix using N-length vectors of feature counts. For small values of m (m ≪ k), the complexity of processing each k-mer can be approximated as k^m |Σ|^m. As m grows, the time complexity per k-mer approaches the limit |Σ|(|Σ|^k − 1)/(|Σ| − 1), i.e., O(u · |Σ|^k) overall.^3 The incremental mismatch kernel algorithm and the recursive exact spectrum kernel algorithm are very similar, the only difference being that, at each step of the clustering, some of the features that do not match the base character may remain in the subset, provided that the number of mismatches is no more than m. The number of mismatches can be tracked using an indicator array that is initialized with k. At each step, the indicator value is reduced by 1 for features that match the base character; then, at step l, all the features for which the indicator value is greater than k − l + m are removed. The filtering step takes linear time. From an implementation point of view, the exact spectrum and incremental mismatch algorithms utilizing character-based clustering are the same except for the filtering step.
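The recursion is compact enough to sketch directly. The code below implements the character-based clustering with the mismatch filtering step: a k-mer survives a branch only while its mismatch count against the branch's base string stays at most m, and each depth-k leaf applies the rank-one kernel update of (1) over its surviving subclusters. This is a reference sketch under those assumptions, not the optimized implementation evaluated later.

```python
import numpy as np

def mismatch_kernel_matrix(seqs, k, m, alphabet="ACGT"):
    """(k, m)-mismatch kernel via recursive character-based clustering."""
    N = len(seqs)
    K = np.zeros((N, N))
    # items: (k-mer, owning sequence index, mismatches accumulated so far)
    items = [(s[i:i + k], j, 0) for j, s in enumerate(seqs)
             for i in range(len(s) - k + 1)]

    def descend(items, depth):
        if not items:
            return
        if depth == k:  # leaf: survivors are neighbors of this base string
            c = np.bincount([j for _, j, _ in items], minlength=N)
            K[:, :] += np.outer(c, c)
            return
        for a in alphabet:  # cluster on character `depth` of each k-mer
            nxt = [(f, j, mm + (f[depth] != a)) for f, j, mm in items
                   if mm + (f[depth] != a) <= m]  # filtering step
            descend(nxt, depth + 1)

    descend(items, 0)
    return K
```

Setting m = 0 recovers the exact spectrum kernel, matching the observation that the two algorithms differ only in the filtering step.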
5 Kernels with Feature Selection
A common approach to feature selection relies on the filtering paradigm, in which the subset F of the most informative features is extracted prior to learning. In our experiments, we use the term frequency tf_c(f_i) = num_c(f_i) / num(f_i) for feature selection, where num_c(f_i) and num(f_i) are the number of times term f_i occurs in class c and in all classes, respectively. We characterize the utility of each feature f_i using the maximum value tf_max(f_i) = max_c tf_c(f_i). Features are then selected globally according to the maximum of log tf_max(f_i). The criterion above is similar to mutual information when all classes are equally likely, and is also suitable for imbalanced data sets, where the number of sequences per class varies substantially.
^3 It should be noted, however, that m = k represents a special case that essentially takes into account only sequence lengths; the kernel matrix can then be computed in O(N^2) time as |Σ|^k len · len^T, where len is the N × 1 vector of sequence lengths.
Fast Kernel Methods for SVM Sequence Classifiers
233
the maximum value tfmax (fi ) = maxc tfc (fi ). Features are then selected globally according to the maximum of log tfmax (fi ). The criteria above is similar to the mutual information when all classes are equally likely, and is also suitable for imbalanced data sets when number of sequences per class varies substantially. 5.1
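As an illustration, a possible rendering of this selection criterion in Python (ours; the input format `kmer_counts_by_class` is a hypothetical convenience, not from the paper):

```python
import math

def select_features(kmer_counts_by_class, top):
    """Select features by maximal class-conditional term frequency.

    `kmer_counts_by_class[c][f]` is the number of occurrences of k-mer f
    in class c.  A feature's utility is log tf_max(f), where
    tf_max(f) = max_c num_c(f) / num(f); the `top` features with the
    largest utility are returned.
    """
    total = {}
    for counts in kmer_counts_by_class.values():
        for f, n in counts.items():
            total[f] = total.get(f, 0) + n
    utility = {}
    for counts in kmer_counts_by_class.values():
        for f, n in counts.items():
            tf = n / total[f]
            utility[f] = max(utility.get(f, 0.0), tf)
    # log is monotone, so ranking by log tf_max equals ranking by tf_max
    ranked = sorted(utility, key=lambda f: math.log(utility[f]), reverse=True)
    return ranked[:top]
```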
5.1 Mismatch Algorithm with Feature Selection
Brute-force algorithms for the exact spectrum and mismatch kernels with feature selection have the same time complexity of O(|F|(knN + N²)). Using the counting-sort formalism (Section 3), the exact matching and mismatch kernels with feature selection can be computed in O(knN + |F|N²) and O(|F|(k·u + N²) + knN) time, respectively, where u is the number of unique features and is bounded above by min(|Σ|^k, nN). In the case of DNA barcode sequences (|Σ| = 4), for typical k, n and N, u ≪ nN, which gives a substantial performance improvement. The mismatch kernel problem with feature selection can be solved in linear time using additional space. It takes O(|F|vk) operations to build a suffix tree for the selected features and their neighbors, where v is the size of the mismatch neighborhood; the complexity of the mismatch kernel matrix computation using this tree is then O(Nkn + |F|N²). An alternative solution is to add the selected features and their neighbors to the set of features extracted from the input sequences and then compute feature counts in linear O(Nkn + |F|vk) time using sorting.
6 Comparison with Baseline Mismatch Kernel Algorithms
In this section we discuss and compare baseline methods for mismatch kernel computations, as well as kernels with feature selection and their complexity. Explicit map algorithm. Each input sequence is mapped explicitly to a vector of size |Σ|^k indexed by all substrings of length k, using O(Nnvk) time and O(|Σ|^k N) space; the kernel matrix can then be computed in O(|Σ|^k N²) time. Although this method is well suited for the exact spectrum kernel (for small k and |Σ|) with constant-time mapping, the mapping in the mismatch kernel is no longer constant time, since each k-mer is now mapped to its v neighbors. Explicit map with presorting. Unique k-mers are first extracted using sorting and then mapped in O(uvk) time. The overall speed improvement can be substantial, since u ≪ nN and the mapping can be performed O(nN/u) times faster. In the case of feature selection, a small subset of k-mers is preselected and only the selected features contribute to the kernel value. The explicit mapping algorithm can be extended to incorporate feature selection by using only the selected positions in the sequence spectrum representations, instead of all positions, to compute the kernel. Table 1 summarizes the complexity of the different algorithms for mismatch kernel computations. It should be noted that the EM approaches require larger O(N|Σ|^k) storage than our divide-and-conquer approaches.
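For reference, a minimal sketch of the explicit map baseline for the exact spectrum kernel (ours; the mismatch variant would additionally spread each k-mer over its v neighbors, losing the constant-time mapping):

```python
def explicit_map_kernel(seqs, k, alphabet="ACGT"):
    """Exact k-spectrum kernel via explicit mapping.

    Each sequence is mapped to a count vector of length |Sigma|^k indexed
    by all k-mers; the kernel matrix is the matrix of dot products.  The
    O(|Sigma|^k N) storage is the main drawback discussed in the text.
    """
    code = {ch: i for i, ch in enumerate(alphabet)}
    dim = len(alphabet) ** k
    vecs = []
    for s in seqs:
        v = [0] * dim
        for i in range(len(s) - k + 1):
            idx = 0
            for ch in s[i:i + k]:
                idx = idx * len(alphabet) + code[ch]
            v[idx] += 1
        vecs.append(v)
    return [[sum(a * b for a, b in zip(x, y)) for y in vecs] for x in vecs]
```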
Table 1. Complexity of the mismatch kernel computations

Method    Mismatch kernel matrix            Mismatch kernel matrix with feature selection   Mismatch kernel vector
EM        Nnvk + |Σ|^k N²                   Nnvk + |F|N²                                    Nnvk + |Σ|^k N
EM+Sort   Nnk + uvk + uvN + |Σ|^k N²        Nkn + uvk + uvN + |F|N²                         Nnk + uvk + uvN + |Σ|^k N
DC        Nnvk + u′N²                       Nn|F|k + |F|N²                                  Nnvk + u′N
DC+Sort   Nkn + uvk + u′N²                  Nkn + uk|F| + |F|N²                             Nkn + uvk + u′N

EM = explicit map, EM+Sort = EM with presorting, DC = divide and conquer; v = neighborhood size, u = number of different k-mers in the input, u′ = number of different k-mers including neighbors.
7 Related Work
Traditional algorithms for computing string kernels [7,8,14] rely on suffix trees and arrays. The exact k-spectrum kernel for two sequences x and y of length n can be computed in O(kn) time using suffix trees. The all-substrings kernel problem [14] has also been solved in linear time using suffix trees and matching statistics. However, it is not necessary to build suffix trees in order to obtain a kernel. Commonly used linear-time suffix tree algorithms (e.g., Ukkonen's algorithm [15]) have large running-time constants and large memory requirements. Moreover, in many applications algorithms make no use of suffix links after construction of the tree. In our framework, the exact k-spectrum kernel K_k(x, y) can likewise be computed in time Θ(kn), linear in the length of sequences x and y; however, the proposed counting-sort formalism provides a much simpler and more efficient implementation, as well as minimal memory requirements. Also, in our framework, the k-mer spectrum kernel can be efficiently computed for N sequences of length n in O(Nnk) time and linear space using sorting, eliminating the large time constants and storage overhead of the suffix-tree based algorithms [7]. The mismatch kernel over a set S of N sequences, each of length n, was reported in [8,16] to have complexity O(N²nk^{m+1}|Σ|^m). Our divide-and-conquer algorithm for mismatch kernel computation improves this bound and has running time O(u·k^{m+1}|Σ|^m + u·N²), where u is the number of different k-mers in the input sequences. The mismatch algorithm in [8] considers the (k, m)-neighborhood of size O(k^m|Σ|^m) for each k-mer. It should be noted that in this case the total number of features considered by the algorithm can exceed the natural upper bound of |Σ|^k. In our divide-and-conquer algorithm for the mismatch kernel we take a different approach: the algorithm clusters the unique features of S and naturally finds groups of features that are (k, m)-neighbors; the sizes of the resulting clusters (subclusters that correspond to different input strings) give the desired counts of the number of times features occur in the input strings.
(For example, one of our datasets contains N = 466 DNA sequences of length n = 600; for |Σ| = 4, k = 5, m = 1, u is bounded by |Σ|^k = 1024, N²nk^{m+1}|Σ|^m is 1.3·10^10, and u(k^{m+1}|Σ|^m + N²) = 2.2·10^8, which is approximately 50 times less.)
8 Experiments and Results
In our experiments we use three data sets of DNA barcodes. The Fish larvae dataset (see [3] for details) consists of 56 barcode sequences from 7 species. The Astraptes fulgerator dataset, from the Barcode of Life Data Systems (BOLD) collection (www.barcodinglife.org; see [17] for a detailed description), contains 466 barcodes from 12 species; the number of sequences per species in this dataset varies from as few as 3 barcodes to as many as 100. Finally, the large Hesperiidae dataset has 2185 sequences and 355 classes.
8.1 Classification Performance
We evaluate the performance of all methods using the ROC50 score [18] as well as the cross-validation error. In the case of the Fisher kernel, profile HMM models are estimated from multiple sequence alignments for each sequence class. For classification, we use the existing SVM implementation from the machine-learning package Spider (available from http://www.kyb.mpg.de/bs/people/spider/). In the experiments with feature selection, we preselect a small subset of k-mers before learning a classifier, using the filtering approach. In Fig. 1 we compare the performance of full-feature kernels and kernels based on the reduced feature set, on the basis of the number of families whose ROC50 score exceeds a given threshold. For the Fisher kernel, the ROC50 plots clearly demonstrate that the Fisher kernel with feature selection achieves higher performance than the full-feature kernel and performs as well as the mismatch kernel with feature selection. However, the Fisher kernel offers position information and can therefore facilitate interpretation of the resulting model. Table 2 displays the performance of all six methods on the A. fulgerator dataset; similar validations are provided in Table 3 for the Fish larvae dataset. The performance of the kernels with feature selection (SFK, SSK, SMK) on both datasets is, at worst, indistinguishable from that of the full-feature kernels (FK, SK, MK). In some cases, as with the Fisher kernel, the feature reduction resulted in a significant improvement (on the A. fulgerator dataset, SFK is better than FK with a p-value of 0.0063 as measured by a two-tailed signed rank test). Figure 2 shows performance on the Hesperiidae dataset. The string kernels with feature selection performed extremely well on this dataset, obtaining perfect results in more than 94% of test cases. Feature selection improved performance for the exact spectrum kernel, and demonstrated performance very similar to that of the full-feature mismatch kernel, while the computation cost is significantly lower, as shown in Sec. 5. Average cross-validation error rates on the Hesperiidae dataset are 3.7·10⁻⁴ for the spectrum and 3.5·10⁻⁴ for the mismatch kernel. The use of feature selection for string kernels resulted in a substantial reduction in the number of features (only 10% of the features were selected) for all our test data sets, whereas performance remained the same or improved. Feature selection for the Fisher kernel not only improved performance, but also even more dramatically decreased the number of features (see Table 2).
Fig. 1. Comparison of performance of the Fisher and string kernels with and without feature selection (Astraptes dataset); the plot shows the number of families (y-axis) achieving at least a given ROC50 score (x-axis) for the full-feature and feature-selected Fisher, 5-spectrum, and (5,1)-mismatch kernels. Note the positive role of the feature selection, especially in the case of Fisher kernels.

Fig. 2. Comparison of performance of string kernels with and without feature selection (Hesperiidae dataset); the plot shows the number of families against the ROC50 score for the 5-spectrum and (5,1)-mismatch kernels (with and without feature selection) and a context-specific kernel. Reducing the number of features correlates mostly positively with the identification performance.
Table 2. Cross-validation error rates (%) for A. fulgerator species

class    FK   SK   MK   SFK  SSK  SMK
BYTTNER  0.43 0    0    0    0    0
CELT     1.07 0    0    0    0    0
FABOV    1.29 0    0    0    0.21 0.21
HIHAMP   0.86 0    0    0    0    0
INGCUP   1.30 0    0.64 0.22 0.64 0.64
LOHAMP   1.30 0    0    0    0    0
LONCHO   1.06 0    0    0    0.21 0
MYST     0.64 0.64 0.85 1.07 1.28 0.64
NUMT     0.22 0    0    0    0    0
SENNOV   2.57 1.07 0.86 1.71 1.07 1.29
TRIGO    0.86 0    0    0    0    0
YESENN   3.86 0.43 0.43 1.29 0.86 0.43

Table 3. Cross-validation error rates (%) for Fish larvae species

class         SK   MK   SSK  SMK
Perca         1.66 1.66 0    1.66
Rutilus       3.33 0.05 5.0  3.33
Gasterosteus  0    0    1.66 0
Barbatula     0    1.66 0    0
Lota          3.33 3.33 3.66 3.33
Anguilla      0    0    0    0
Phoxinus      3.09 3.09 3.66 5.33
FK=SVM-Fisher, SK=5-spectrum, MK=(5,1)-mismatch kernel, S=with feature selection
The number of features for SSK and SMK is 100; for SFK the number of features is class-specific: 5, 2, 5, 30, 10, 10, 5, 5, 50, 5, 5, and 50, respectively.
Table 4. Comparison of the identification accuracy (avg. error / s.d.)

method                              Astraptes species  Fish species     Hesperiidae species
svm w/ linear kernel (one-vs-rest)  0.0074/0.0092      0.1342/0.1358    0.0132/0.0022
svm + pca                           0.0067/0.0074      0.1692/0.0707    0.0168/0.0038
nearest neighbor                    0.0251/0.0304      0.1552/0.0893    0.1038/0.0130
ridge regression (one-vs-rest)      0.0215/0.0207      0.3077/0.1203    0.1121/0.0165
nearest neighbor + pca              0.0300/0.0271      0.1521/0.1387    0.0895/0.0153
PSI-BLAST                           0.0963/0.0658      0.1615/0.0765    0.0160/0.0042
Fisher kernel                       0.0415/0.0182      0.1245/0.0903    -
8.2 Comparison to Other Methods
We compared the performance of the string-kernel-based SVM method with a number of other classification methods. In particular, we evaluated the Fisher kernel method [10], PSI-BLAST, ridge regression, and nearest-neighbor methods. Table 4 displays classification performance results on the three barcode datasets (the complexity of estimating the Fisher kernel is very high, so its results are not included for the large Hesperiidae set). We observe that k-mer string kernels considerably improve identification accuracy compared to previously reported results of [4,5]; for example, on the Astraptes dataset [17], the test error rate of the multi-class SVM is only 0.67%, compared to 9% in [5] or 20% in [4].
8.3 Running Time Analysis
We performed a running time analysis of the proposed algorithms to demonstrate their behavior under a variety of circumstances, including various feature filtering levels, mismatch factors, and sequence feature lengths. We implemented and tested our algorithms in MATLAB. On a 2.8 GHz machine with 1 GB RAM (MATLAB v.7.0.4.352), the running time of our mismatch kernel algorithm on the Astraptes fulgerator data set (N = 466, n = 600) is 16.92 seconds, and 240.36 seconds on the larger Hesperiidae dataset (N = 2185, n = 600) (computing the full N × N kernel matrix, k = 5, m = 1). It takes about 2820 seconds and about 20 hours, respectively, to compute the same matrices with the publicly available string kernel package (from http://www1.cs.columbia.edu/compbio/string-kernels/) that implements the state-of-the-art method for the spectrum/mismatch kernel. Our experiments show orders-of-magnitude running time improvements (Table 5) for the k-mer kernel with m mismatches, by factors of 100–200 depending on the dataset size; this improvement is especially significant in light of the differences in the two implementations, since MATLAB is an interpreted language while the competing package is implemented in C. We also observe that our extension of the explicit map (EM) algorithm that uses sorting as a preprocessing step results in significant speed improvements; however, the explicit map algorithm requires much larger storage (exponential in |Σ| and k) than the divide-and-conquer algorithm. Table 6 shows running times for mismatch kernel matrix computations with feature selection (the filtering level in the table is the fraction of features filtered out). As the results show, the EM algorithms do not scale with the number of selected features, while the divide-and-conquer approach scales almost linearly. Similarly to kernel matrix computations, extracting unique k-mers from support vectors using sorting (O(Nkn) time) accelerates kernel vector computations for new sequences during testing. In mismatch kernel vector computations (Table 7), the divide-and-conquer approach in many cases outperforms explicit mapping with sorting, especially for larger k and m (note that for large k, the EM algorithm exceeded memory capacity), which makes our algorithms also particularly suited for fast identification of new sequences.
Table 5. Running time comparison: mismatch kernel matrix computation (time, s)

data set     k  EM      EM+Sort  D&C     String kernel package
Astraptes    5  202.11  3.14     16.92   2820
Hesperiidae  5  938.79  14.73    240.36  75219
Table 6. Computation of mismatch kernel matrices with feature selection (time, s)

filt.        Astraptes data              Hesperiidae data
level   EM      EM+Sort  D&C        EM      EM+Sort  D&C
0.1     212.23  4.03     12.91      961.94  17.75    163.80
0.2     212.01  3.62     11.43      964.55  19.83    144.61
0.3     185.62  3.88     10.41      982.72  17.20    122.34
0.4     193.08  3.76     9.51       974.43  18.73    113.83
0.5     193.74  2.31     8.54       989.4   18.56    89.25
0.6     196.87  2.30     7.47       969.48  17.37    78.12
0.7     192.54  2.51     6.31       977.07  14.85    52.32
0.8     200.59  2.51     5.18       964.67  10.73    39.18
0.9     184.75  3.60     3.74       966.68  10.57    23.92

Table 7. Computation of the mismatch kernel vector (time, s); "-" marks runs that did not complete (for large k the EM-based computation exceeded memory capacity)

             Astraptes data                  Hesperiidae data
k  m    EM+Sort  D&C     D&C+Sort       EM+Sort  D&C     D&C+Sort
5  1    9.25     3.25    3.06           53.36    14.44   13.70
6  1    12.64    4.95    3.92           70.70    12.31   18.74
7  1    18.01    9.15    5.84           78.39    16.35   48.62
8  1    26.58    20.37   9.99           -        113.46  29.57
9  1    -        39.88   42.70          -        200.62  99.37
5  2    33.28    3.9     5.30           197.34   26.23   20.03
6  2    53.09    5.71    8.28           354.11   34.92   35.60
7  2    80.65    10.14   12.50          537.61   49.31   75.63
8  2    121.02   23.79   20.00          -        797.65  82.53
9  2    -        192.04  70.88          -        1166.6  173.9

9 Conclusions
In this paper, we present a kernel-classification-based approach to the DNA barcoding problem that substantially improves identification accuracy compared to traditional approaches. We also present a framework for efficient computation of string kernels that results in substantial speed improvements, and we introduce string kernels with feature selection, which have lower computational cost and, at the same time, comparable or improved classification performance. The presented algorithmic approaches to implementing string kernels are general and can be applied to many other problems of biological sequence analysis. We have presented a counting-sort formalism and a divide-and-conquer technique for string kernels that provide a fast and scalable solution for many string kernel computation problems. In particular, for an input set S of N sequences over alphabet Σ, with each sequence of typical length n, we developed:
– a Θ(knN + min(u, n)·N²) time and O(nN) space algorithm, based on counting sort, for the exact k-spectrum kernel;
– an O(uk^{m+1}|Σ|^m + u·N²) time divide-and-conquer algorithm for the (k, m)-mismatch kernel;
– an improved O(Nkn + |F|N²) time mismatch algorithm with feature selection for the kernel matrix computation.
We have demonstrated that the use of feature selection applied to the high-dimensional space of string sequence features can often result in a dramatic reduction in the number of features, and may be of particular interest in the DNA
barcoding setting. A reduced set of features not only implies more efficient computation, but may also facilitate biological interpretability of the resulting models, a task that is being addressed in our ongoing work.
References
1. Hebert, P.D.N., Cywinska, A., Ball, S., deWaard, J.: Biological identifications through DNA barcodes. In: Proceedings of the Royal Society of London, pp. 313–322 (2003)
2. Armstrong, K., Ball, S.: DNA barcodes for biosecurity: invasive species identification. Philos. Trans. R. Soc. Lond. B Biol. Sci. 360(1462), 1813–1823 (2005)
3. Steinke, D., Vences, M., Salzburger, W., Meyer, A.: TaxI: a software tool for DNA barcoding using distance methods. Philosophical Transactions of the Royal Society B: Biological Sciences 360(1462), 1975–1980 (2005)
4. Nielsen, R., Matz, M.: Statistical approaches for DNA barcoding. Systematic Biology 55(1), 162–169 (2006)
5. Matz, M.V., Nielsen, R.: A likelihood ratio test for species membership based on DNA sequence data. Philosophical Transactions of the Royal Society B: Biological Sciences 360(1462), 1969–1974 (2005)
6. Meyer, C.P., Paulay, G.: DNA barcoding: error rates based on comprehensive sampling. PLoS Biol. 3(12) (December 2005)
7. Leslie, C.S., Eskin, E., Noble, W.S.: The spectrum kernel: A string kernel for SVM protein classification. In: Pacific Symposium on Biocomputing, pp. 566–575 (2002)
8. Leslie, C.S., Eskin, E., Weston, J., Noble, W.S.: Mismatch string kernels for SVM protein classification. In: Becker, S., Thrun, S., Obermayer, K. (eds.) NIPS, pp. 1417–1424. MIT Press, Cambridge (2002)
9. Kuang, R., Ie, E., Wang, K., Wang, K., Siddiqi, M., Freund, Y., Leslie, C.: Profile-based string kernels for remote homology detection and motif extraction. In: CSB '04: Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference (CSB'04), Washington, DC, USA, pp. 152–160. IEEE Computer Society Press, Los Alamitos (2004)
10. Jaakkola, T., Diekhans, M., Haussler, D.: A discriminative framework for detecting remote protein homologies. Journal of Computational Biology 7(1-2), 95–114 (2000)
11. Menchetti, S., Costa, F., Frasconi, P.: Weighted decomposition kernels. In: ICML '05: Proceedings of the 22nd international conference on Machine learning, New York, NY, USA, pp. 585–592. ACM Press, New York (2005)
12. Schölkopf, B., Smola, A.J.: Learning with kernels. MIT Press, Cambridge (2002)
13. Vapnik, V.: Statistical learning theory. Wiley, Chichester (1998)
14. Vishwanathan, S.V.N., Smola, A.J.: Fast kernels for string and tree matching. In: NIPS, pp. 569–576 (2002)
15. Ukkonen, E.: Constructing suffix trees on-line in linear time. In: Proceedings of the IFIP 12th World Computer Congress on Algorithms, Software, Architecture: Information Processing '92, vol. 1, pp. 484–492. North-Holland, Amsterdam (1992)
16. Leslie, C., Kuang, R.: Fast string kernels using inexact matching for protein sequences. J. Mach. Learn. Res. 5, 1435–1455 (2004)
17. Hebert, P.D.N., Penton, E.H., Burns, J.M., Janzen, D.H., Hallwachs, W.: Ten species in one: DNA barcoding reveals cryptic species in the neotropical skipper butterfly Astraptes fulgerator. In: PNAS, vol. 101, pp. 14812–14817 (2004)
18. Gribskov, M., Robinson, N.L.: Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Computers & Chemistry 20(1), 25–33 (1996)
On-Line Viterbi Algorithm for Analysis of Long Biological Sequences

Rastislav Šrámek¹, Broňa Brejová², and Tomáš Vinař²

¹ Department of Computer Science, Comenius University, 842 48 Bratislava, Slovakia
[email protected]
² Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY 14853, USA
{bb248,tv35}@cornell.edu
Abstract. Hidden Markov models (HMMs) are routinely used for analysis of long genomic sequences to identify various features such as genes, CpG islands, and conserved elements. The commonly used Viterbi algorithm requires O(mn) memory to annotate a sequence of length n with an m-state HMM, which is impractical for analyzing whole chromosomes. In this paper, we introduce the on-line Viterbi algorithm for decoding HMMs in much smaller space. Our analysis shows that on two-state HMMs the expected maximum memory of our algorithm is Θ(m log n). We also experimentally demonstrate that our algorithm significantly reduces the memory needed to decode a simple gene-finding HMM on both simulated and real DNA sequences, without a significant slow-down compared to the classical Viterbi algorithm. Keywords: biological sequence analysis, hidden Markov models, on-line algorithms, Viterbi algorithm, gene finding.
1 Introduction
Hidden Markov models (HMMs) are generative probabilistic models that have been successfully used for annotation of protein and DNA sequences. Their numerous applications in bioinformatics include gene finding [1], promoter detection [2], and CpG island detection [3]. More complex phylogenetic HMMs are used to analyze multiple sequence alignments in comparative gene finding [4] and detection of conserved elements [5]. The linear-time Viterbi algorithm [6] is the most commonly used algorithm for these tasks. Unfortunately, the space required by the Viterbi algorithm grows linearly with the length of the sequence (with a high constant factor), which makes it unsuitable for analysis of very long sequences, such as whole chromosomes or whole-genome multiple alignments. In this paper, we address this problem by proposing an on-line Viterbi algorithm that on average requires much less memory and that can even annotate continuous streams of data on-line, without reading the complete input sequence first.
An HMM, composed of states and transitions, is a probabilistic model that generates sequences over a given alphabet. In each step of this generative process, the current state generates one symbol of the sequence according to the emission probabilities associated with that state. Then, an outgoing transition is randomly chosen according to the transition probability table, and this transition is followed to the new state. This process is repeated until the whole sequence is generated. The states of the HMM represent distinct features of the observed sequences (such as protein-coding and non-coding sequences in a genome), and the emission probabilities in each state represent statistical properties of these features. The HMM thus defines a joint probability Pr(X, S) over all possible sequences X and all state paths S through the HMM that could generate these sequences. To annotate a given sequence X, we find the state path S that maximizes this joint probability. For example, in an HMM with one state for protein-coding sequences and one state for non-coding sequences, the most probable state path marks each symbol of sequence X as either protein-coding or non-coding. To compute the most probable state path, we use the Viterbi dynamic programming algorithm [6]. For every prefix X1 . . . Xi of sequence X and for every state j, we compute the most probable state path generating this prefix and ending in state j. We store the probability of this path in table P(i, j) and its second-to-last state in table B(i, j). These values can be computed from left to right, using the recurrence P(i, j) = max_k {P(i − 1, k) · t_k(j) · e_j(X_i)}, where t_k(j) is the transition probability from state k to state j, and e_j(X_i) is the emission probability of symbol X_i in state j. The back pointer B(i, j) is the value of k that maximizes P(i, j). After computing these values, we can recover the most probable state path S = s_1, . . . , s_n by setting the last state as s_n = arg max_k {P(n, k)}, and then following the back pointers B(i, j) from right to left (i.e., s_i = B(i + 1, s_{i+1})). For an HMM with m states and a sequence X of length n, the running time of the Viterbi algorithm is Θ(nm²), and the space is Θ(nm). This algorithm is well suited for sequences and models of moderate size. However, to annotate all 250 million symbols of human chromosome 1 with a gene-finding HMM consisting of a hundred states, we would require 25 GB of memory to store the back pointers B(i, j). This is clearly impractical on most computational platforms. Several solutions are used in practice to overcome this problem. For example, most practical gene-finding programs process only sequences of limited size. The long input sequence is split into several shorter sequences, which are processed separately; afterwards, the results are merged and conflicts are resolved heuristically. This approach leads to suboptimal solutions, especially if the genes we are looking for cross the boundaries of the split. Grice et al. [7] proposed a checkpointing algorithm that trades running time for space. We divide the input sequence into K blocks of L symbols, and during the forward pass we only keep the first column of each block. To obtain the most probable state path, we recompute the last block and use the back pointers to recover the last L states of the path, as well as the last state of the previous block. This information can now be used to recompute the most probable state
Fig. 1. Example of the back pointer tree structure (states vs. sequence positions). Dashed lines mark the edges that cannot be part of the most probable state path. The square marks the coalescence point of the remaining paths.
path within the previous block in the same way, and the process is repeated for all blocks. If we set K = L = √n, this algorithm only requires Θ(n + √n·m) memory, at the cost of a two-fold slow-down compared to the Viterbi algorithm, since every value of P(i, j) is computed twice. Checkpointing can be further generalized to trade an L-fold slow-down for memory of Θ(n + L·n^{1/L}·m) [8,9]. In this paper, we propose and analyze an on-line Viterbi algorithm that does not use a fixed amount of memory for a given sequence. Instead, the amount of memory varies depending on the properties of the HMM and the input sequence. In the worst case, our algorithm still requires Θ(nm) memory; however, in practice the requirements are much lower. We prove, using results on random walks and the theory of extreme values, that in simple cases the expected space for a sequence of length n is as low as Θ(m log n). We also experimentally demonstrate that the memory requirements are low for more complex HMMs.
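For concreteness, here is a minimal Python rendering of the classical recurrence and traceback (ours; it assumes strictly positive probabilities and uses logs to avoid underflow). The table B is what makes the memory cost Θ(nm):

```python
import math

def viterbi(X, m, init, trans, emit):
    """Classical Viterbi decoding with explicit tables P and B.

    `init[j]`, `trans[k][j]`, `emit[j][x]` are (strictly positive)
    probabilities; log space avoids numerical underflow on long inputs.
    Storing the full back pointer table B costs Theta(n m) memory.
    """
    n = len(X)
    P = [[0.0] * m for _ in range(n)]
    B = [[0] * m for _ in range(n)]
    for j in range(m):
        P[0][j] = math.log(init[j]) + math.log(emit[j][X[0]])
    for i in range(1, n):
        for j in range(m):
            k = max(range(m),
                    key=lambda s: P[i - 1][s] + math.log(trans[s][j]))
            B[i][j] = k
            P[i][j] = (P[i - 1][k] + math.log(trans[k][j])
                       + math.log(emit[j][X[i]]))
    s = max(range(m), key=lambda j: P[n - 1][j])
    path = [s]
    for i in range(n - 1, 0, -1):        # follow back pointers right to left
        s = B[i][s]
        path.append(s)
    return path[::-1]
```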
2 On-Line Viterbi Algorithm
In our algorithm, we represent the back pointer matrix B of the Viterbi algorithm by a tree structure (see [6]), with a node (i, j) for each sequence position i and state j. The parent of node (i, j) is the node (i − 1, B(i, j)). In this data structure, the most probable state path is the path from the leaf node (n, j) with the highest probability P(n, j) to the root of the tree (see Figure 1). This tree is built as the Viterbi algorithm progresses from left to right. After computing column i, all edges that do not lie on one of the paths ending at column i can be removed; these edges will not be used in the most probable path [10]. The remaining m paths represent all possible initial segments of the most probable state path. These paths are not necessarily edge-disjoint; in fact, often all the paths share the same prefix up to some node that we call a coalescence point (see Figure 1). Left of the coalescence point there is only a single candidate for the initial segment of the most probable state path. Therefore we can output this segment and remove all edges and nodes of the tree up to the coalescence point. Forney [6] describes an algorithm that, after processing D symbols of the input sequence, checks whether a coalescence point has been reached; in such a case, the initial segment of the most probable state path is output. If the coalescence point
Fig. 2. An HMM requiring Ω(n) space. Every state emits only the symbol shown, with probability 1. Transition probability is evenly divided among transitions outgoing from a given state. For sequences of the form s{0, 1}^n{0, 2}, any correct decoding algorithm must retain some representation of the input before discovering whether the last symbol is 0 or 2.
was not reached, one potential initial segment is chosen heuristically. Several studies [11,12] suggest how to choose D to limit the expected error caused by such heuristic steps in the context of convolutional codes. Here we show how to detect the existence of a coalescence point dynamically, without introducing significant overhead to the whole computation. We maintain a compressed version of the back pointer tree, in which we omit all internal nodes that have fewer than two children; any path consisting of such nodes is contracted to a single edge. This compressed tree has m leaves and at most m − 1 internal nodes. Each node stores the number of its children and a pointer to its parent node. We also keep a linked list of all the nodes of the compressed tree, ordered by sequence position. Finally, we keep a list of pointers to all the leaves. When processing the i-th sequence position in the Viterbi algorithm, we update the compressed tree as follows. First, we create a new leaf for each node at position i, link it to its parent (one of the former leaves), and insert it into the linked list. Once these new leaves are created, we remove all the former leaves that have no children, and recursively all of their ancestors that are left without children. Finally, we need to compress the new tree: we examine all the nodes in the linked list in order of decreasing sequence position. If a node has zero or one child and is not a current leaf, we simply delete it. For each leaf or node that has at least two children, we follow the parent links until we find its first ancestor (if any) that has at least two children, and link the current node directly to that ancestor. A node (ℓ, j) that does not have an ancestor with at least two children is the coalescence point; it becomes the new root. We can output the most probable state path for all sequence positions up to ℓ, and remove all results of computation for these positions from memory. The running time of this update is O(m) per sequence position, and the representation of the compressed tree takes O(m) space. Thus the asymptotic running time of the Viterbi algorithm is not increased by the maintenance of the compressed tree. Moreover, we have implemented both the standard Viterbi algorithm and our new on-line extension, and time measurements suggest that the overhead required for the compressed tree updates is less than 5%. The worst-case space required by this algorithm is still O(nm). In fact, any algorithm that correctly finds the most probable state path for the HMM shown in Figure 2 requires at least Ω(n) space in the worst case.
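The following simplified Python sketch (ours) conveys the coalescence idea without the compressed-tree machinery: after each column it naively re-walks all m survivor paths and outputs the uniquely determined prefix whenever they coalesce. It computes the same output as the algorithm above, but its per-position cost is higher than the paper's O(m) update; it assumes m ≥ 2 states and log-space parameters.

```python
def online_viterbi_naive(stream, m, loginit, logtrans, logemit):
    """Simplified on-line Viterbi decoding (illustrative sketch only).

    cols[i][j] is the back pointer of state j at position i + 1.  Whenever
    all m survivor paths pass through a single node (a coalescence point),
    the uniquely determined prefix of the state path is yielded and its
    back-pointer columns are discarded.
    """
    it = iter(stream)
    P = [loginit[j] + logemit[j][next(it)] for j in range(m)]
    cols = []
    for x in it:
        col, newP = [], []
        for j in range(m):
            k = max(range(m), key=lambda s: P[s] + logtrans[s][j])
            col.append(k)
            newP.append(P[k] + logtrans[k][j] + logemit[j][x])
        cols.append(col)
        P = newP
        # naive coalescence check: walk all survivor paths backwards
        alive, c = set(range(m)), len(cols)
        while len(alive) > 1 and c > 0:
            c -= 1
            alive = {cols[c][s] for s in alive}
        if len(alive) == 1:
            state = alive.pop()          # coalescence point at position c
            seg = [state]
            for i in range(c - 1, -1, -1):
                state = cols[i][state]
                seg.append(state)
            yield seg[::-1]              # states for positions up to c
            cols = cols[c + 1:]          # discard columns left of the point
    # flush: commit the best full path for the remaining positions
    state = max(range(m), key=lambda j: P[j])
    seg = [state]
    for i in range(len(cols) - 1, -1, -1):
        state = cols[i][state]
        seg.append(state)
    yield seg[::-1]
```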
However, our algorithm rarely requires linear space for realistic data; the space changes dynamically depending on the input. In the next section, we show that for two-state HMMs the expected maximum space required for processing a sequence of length n is Θ(m log n). This is much better than checkpointing, which requires space of Θ(n + m√n) with a significant increase in running time. We conjecture that this trend extends to more complex cases. We also present experimental results on a gene-finding HMM and real DNA sequences, showing that the on-line Viterbi algorithm leads to significant savings in memory. Another advantage of our algorithm is that it can construct initial segments of the most probable state path before the whole input sequence is read. This feature makes it ideal for on-line processing of signal streams (such as sensor readings).
3 Memory Requirements of the On-Line Viterbi Algorithm
In this section, we analyze the space requirements of the on-line Viterbi algorithm. The space varies throughout the execution of the algorithm, but of special interest are asymptotic bounds on the expected maximum memory while decoding a sequence of length n. We use results from random walks and extreme value theory to argue that for two-state HMMs the expected maximum memory is O(m log n), and we give tight bounds for the symmetric case. We also conduct experiments on a gene-finding HMM and both real and simulated DNA sequences.

Symmetric two-state HMMs. Consider a symmetric two-state HMM over a binary alphabet, as shown in Figure 3a. For simplicity, we assume t < 1/2 and e < 1/2. The back pointers between sequence positions i and i + 1 can form configurations i–iii shown in Figure 3b. Denote p_A = log P(i, A) and p_B = log P(i, B), where P(i, j) are the probabilities computed in the Viterbi algorithm. The Viterbi recurrence implies that configuration i occurs when log t − log(1 − t) ≤ p_A − p_B ≤ log(1 − t) − log t, configuration ii occurs when p_A − p_B ≥ log(1 − t) − log t, and configuration iii occurs when p_A − p_B ≤ log t − log(1 − t). Configuration iv never occurs for t < 1/2. For the two-state HMM, a coalescence point occurs whenever one of the configurations ii or iii occurs. Thus the space is proportional to the length of a continuous sequence of configurations i, which we call a run. First, we analyze the length distribution of runs assuming that the input sequence is a sequence of uniform i.i.d. binary random variables. In this case, we represent the run by a symmetric random walk corresponding to the random variable

$$ X = \frac{p_A - p_B - (\log t - \log(1 - t))}{\log(1 - e) - \log e}. $$
We can easily extend the analysis to other values of e and t. If t > 1/2, only configurations ii, iii, and iv may occur; a similar analysis applies by considering two steps of the algorithm together. If t = 1/2, only configurations ii and iii occur and the memory is constant. The case e > 1/2 is equivalent to the case e < 1/2 after relabeling the states. If e = 1/2, the algorithm requires linear memory because of symmetry.
Fig. 3. (a) Symmetric two-state HMM with two parameters: e for emission probabilities and t for transition probabilities. (b) Possible back-pointer configurations for a two-state HMM.
Configuration i occurs whenever X ∈ (0, K), where K = 2·(log(1 − t) − log t)/(log(1 − e) − log e). The quantity p_A − p_B is updated by log(1 − e) − log e if the symbol at the corresponding sequence position is 0, or by log e − log(1 − e) if this symbol is 1; this corresponds to updating X by +1 or −1. When X reaches 0, we have a coalescence point in configuration iii, and p_A − p_B is reinitialized to log t − log(1 − t) ± (log e − log(1 − e)), which means either initialization of X to +1 or another coalescence point, depending on the symbol at the corresponding sequence position. The other case, when X reaches K and we have a coalescence point in configuration ii, is symmetric. We can now apply classical results from the theory of random walks (see [13, ch. 14.3, 14.5]) to analyze the expected length of runs.

Lemma 1. Assuming that the input sequence is uniformly i.i.d., the expected length of a run of a symmetric two-state HMM is K − 1.

The larger K is, the more memory is required to decode the HMM; the worst case happens as e approaches 1/2 and the states become indistinguishable. From the theory of random walks, we can characterize the distribution of run lengths.

Lemma 2. Let R_ℓ be the event that the run length of a symmetric two-state HMM is either 2ℓ + 1 or 2ℓ + 2. Then, assuming that the input sequence is uniformly i.i.d., for some constants b, c > 0:

$$ b \cdot \cos^{2\ell}\frac{\pi}{K} \;\le\; \Pr(R_\ell) \;\le\; c \cdot \cos^{2\ell}\frac{\pi}{K} \qquad (1) $$
Proof. For a symmetric random walk on the interval (0, K) with absorbing barriers and starting point z, the probability of the event W_{z,n} that the random walk ends in point 0 after n steps is zero if n − z is odd, and the following quantity if n − z is even [13, ch. 14.5]:

$$ \Pr(W_{z,n}) = \frac{2}{K} \sum_{0 < v < K} \cos^{n-1}\frac{\pi v}{K}\,\sin\frac{\pi v}{K}\,\sin\frac{\pi z v}{K} \qquad (2) $$
Using symmetry, note that the probability that the same random walk ends after n steps at barrier K equals the probability of W_{K−z,n}. Thus, if K is even:

$$ \Pr(R_\ell) = \Pr(W_{1,2\ell+1}) + \Pr(W_{K-1,2\ell+1}) = \frac{2}{K}\sum_{0<v<K}\cos^{2\ell}\frac{\pi v}{K}\,\sin\frac{\pi v}{K}\left(\sin\frac{\pi v}{K} + (-1)^{v+1}\sin\frac{\pi v}{K}\right) = \frac{4}{K}\sum_{0<v<K,\ v\ \text{odd}}\cos^{2\ell}\frac{\pi v}{K}\,\sin^{2}\frac{\pi v}{K} \qquad (3) $$
There are at most K/4 terms in the sum, and they can all be bounded from above by cos^{2ℓ}(π/K). Thus, we can give both upper and lower bounds on Pr(R_ℓ) using only the first term of the sum, as follows:

$$ \frac{4}{K}\,\sin^{2}\frac{\pi}{K}\,\cos^{2\ell}\frac{\pi}{K} \;\le\; \Pr(R_\ell) \;\le\; \cos^{2\ell}\frac{\pi}{K} \qquad (4) $$
Similarly, if K is odd, we have Pr(R_ℓ) = Pr(W_{1,2ℓ+1}) + Pr(W_{K−1,2ℓ+2}), obtaining a similar bound:

$$ \frac{2}{K}\,\sin^{2}\frac{\pi}{K}\left(1+\cos\frac{\pi}{K}\right)\cos^{2\ell}\frac{\pi}{K} \;\le\; \Pr(R_\ell) \;\le\; 2\cos^{2\ell}\frac{\pi}{K} \qquad (5) $$
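The run-length behavior described by Lemmas 1 and 2 is easy to check empirically; the following small simulation (ours) models a run as a symmetric ±1 walk on (0, K) started at 1 and absorbed at the barriers:

```python
import random

def expected_run_length(K, trials=100_000):
    """Estimate the expected run length for the symmetric two-state HMM.

    A run corresponds to a symmetric +/-1 random walk on (0, K) started
    at 1 and absorbed at 0 or K; by Lemma 1 its expected length is K - 1.
    """
    total = 0
    for _ in range(trials):
        x, steps = 1, 0
        while 0 < x < K:
            x += random.choice((-1, 1))
            steps += 1
        total += steps
    return total / trials

# expected_run_length(5) should come out close to 4
```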
The previous lemma characterizes the length distribution of a single run. However, to analyze the memory requirements for a sequence of length n, we need to consider the maximum over several runs whose total length is n. A similar problem was studied for runs of heads in a sequence of n coin tosses [14,15], where the length distribution of runs is geometric. In our case the runs are only bounded by geometrically decaying functions. Still, we can prove that the expected length of the longest run grows logarithmically with the length of the sequence, as is the case for coin tosses.

Lemma 3. Let X_1, X_2, . . . be a sequence of i.i.d. random variables drawn from a geometrically decaying distribution over positive integers, i.e., there exist constants a, b, c, a ∈ (0, 1), 0 < b ≤ c, such that for all integers k ≥ 1, b·a^k ≤ Pr(X_i > k) ≤ c·a^k. Let N_n be the largest index such that Σ_{i=1..N_n} X_i ≤ n, and let Y_n = max{X_1, X_2, . . . , X_{N_n}, n − Σ_{i=1..N_n} X_i}. Then

$$ E[Y_n] = \log_{1/a} n + o(\log n) \qquad (6) $$
Proof. Let Z_n = max_{i=1..n} X_i be the maximum of the first n runs. Clearly, Pr(Z_n ≤ k) = Pr(X_i ≤ k)^n, and therefore (1 − c·a^k)^n ≤ Pr(Z_n ≤ k) ≤ (1 − b·a^k)^n for all integers k ≥ log_{1/a}(c).
Lower bound: Let t_n = log_{1/a} n − √(ln n). If Y_n ≤ t_n, we need at least n/t_n runs to reach the sum n, i.e., N_n ≥ n/t_n − 1 (discounting the last incomplete run). Therefore

$$ \Pr(Y_n \le t_n) \;\le\; \Pr\!\left(Z_{\frac{n}{t_n}-1} \le t_n\right) \;\le\; \left(1 - b\,a^{t_n}\right)^{\frac{n}{t_n}-1} = \left(\left(1 - b\,a^{t_n}\right)^{a^{-t_n}}\right)^{a^{t_n}\left(\frac{n}{t_n}-1\right)} \qquad (7) $$
Since lim_{n→∞} a^{t_n}(n/t_n − 1) = ∞ and lim_{x→0}(1 − bx)^{1/x} = e^{−b}, we get lim_{n→∞} Pr(Y_n ≤ t_n) = 0. Note that E[Y_n] ≥ t_n(1 − Pr(Y_n ≤ t_n)), and thus we get the desired bound.

Upper bound: Clearly, Y_n ≤ Z_n and so E[Y_n] ≤ E[Z_n]. Let Z′_n be the maximum of n i.i.d. geometric random variables X′_1, . . . , X′_n such that Pr(X′_i ≤ k) = 1 − a^k. We will compare E[Z_n] to the expected value of the variable Z′_n. Without loss of generality, c ≥ 1. For any real x ≥ log_{1/a}(c) + 1 we have:

Pr(Z_n ≤ x) ≥ (1 − c·a^x)^n = (1 − a^{x − log_{1/a}(c)})^n ≥ (1 − a^{x − log_{1/a}(c) − 1})^n = Pr(Z′_n ≤ x − log_{1/a}(c) − 1) = Pr(Z′_n + log_{1/a}(c) + 1 ≤ x).

This inequality holds even for x < log_{1/a}(c) + 1, since the right-hand side is zero in that case. Therefore, E[Z_n] ≤ E[Z′_n + log_{1/a}(c) + 1] = E[Z′_n] + O(1). The expected value of Z′_n is log_{1/a}(n) + o(log n) [16], which proves our claim.

Using the results of Lemma 3 together with the characterization of run-length distributions in Lemma 2, we can conclude that for symmetric two-state HMMs, the expected maximum memory required to process a uniform i.i.d. input sequence of length n is (1/ln(1/cos(π/K)))·ln n + o(log n). Using the Taylor expansion of the constant term as K grows to infinity, 1/ln(1/cos(π/K)) = 2K²/π² + O(1), we obtain that the maximum memory grows approximately as (2K²/π²) ln n. The asymptotic bound Θ(log n) can easily be extended to sequences generated by the symmetric HMM itself, instead of uniform i.i.d. sequences. The underlying process can be described as a random walk with approximately 2K states on two (0, K) lines, each line corresponding to sequence symbols generated by one of the two states. The distribution of run lengths still decays geometrically as required by Lemma 3; the base of the exponent is the largest eigenvalue of the transition matrix with the absorbing states omitted (see e.g. [17, Claim 2]).
We omitted the first run, which has a different starting point and thus does not follow the distribution outlined in Lemma 2; however, the expected length of this run does not depend on n and thus contributes only a lower-order term. We also omitted the runs of length one that start outside the interval (0, K); these runs again contribute only to lower-order terms of the lower bound.
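The quality of the 2K²/π² approximation can be checked numerically with a few lines (ours):

```python
import math

def memory_constants(K):
    """Compare the exact constant 1/ln(1/cos(pi/K)) with 2 K^2 / pi^2."""
    exact = 1.0 / math.log(1.0 / math.cos(math.pi / K))
    approx = 2.0 * K * K / math.pi ** 2
    return exact, approx

# memory_constants(10) returns roughly (19.9, 20.3);
# the gap stays O(1) as K grows, as the Taylor expansion predicts.
```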
Theorem 1. The expected maximum memory used by the on-line Viterbi algorithm on a symmetric two-state HMM with t, e < 1/2 is Θ(m ln n), assuming sequences generated either by a uniform i.i.d. process or by the HMM itself. Moreover, for uniform i.i.d. sequences, the constant factor is approximately 2K²/π², where K = 2·(log(1 − t) − log t)/(log(1 − e) − log e).
General two-state HMMs. We can extend the result of Theorem 1 to general two-state HMMs with states A and B, assuming an i.i.d. generated sequence (not necessarily binary or uniform). If t_B(A)t_A(B) < t_A(A)t_B(B), only configurations i, ii and iii may occur, and we can again analyze the length of a single run as a random walk of the variable log P(i, A) − log P(i, B). Here, the random walk proceeds in steps that are arbitrary real numbers, different in each direction. The step size for symbol x is |log(t_A(A)e_A(x)) − log(t_B(B)e_B(x))|, and we assume the existence of a symbol x with non-zero step size. A sufficiently large number h of consecutive occurrences of x will always take the random walk to one of the absorbing barriers (configurations ii or iii), regardless of the starting point. Thus the length of a single run is bounded from above by the length of the sequence before h consecutive occurrences of x, whose length distribution clearly decays geometrically, as required by Lemma 3. We thus obtain an O(m log n) upper bound on the expected maximum memory for i.i.d. sequences. As long as the emission probability of x is non-zero in all states of the HMM, we can straightforwardly extend this analysis to sequences generated by the HMM; this condition ensures that the appearance of h consecutive symbols x is possible regardless of the starting state. As before, we can also extend the analysis to the case t_B(A)t_A(B) > t_A(A)t_B(B) by considering two steps at a time. Consequently, the O(m log n) upper bound applies to all two-state HMMs, excluding some boundary symmetric cases and cases with zero emission or transition probabilities.

Multi-state HMMs. Our analysis technique cannot be easily extended to HMMs with many states. In two-state HMMs, each new coalescence event clears the memory, and thus the execution of the algorithm can be divided into independent runs. A coalescence event in a multi-state HMM may leave a tree of substantial depth in the memory, so the sizes of consecutive runs are no longer independent (see Figure 4a). To evaluate the memory requirements of our algorithm for multi-state HMMs, we implemented the algorithm and performed several experiments on both simulated and biological sequences. First, we generalized the two-state symmetric HMMs to multiple states. The symmetric HMM with m states emits symbols over an m-letter alphabet, where each state emits one symbol with higher probability than the other symbols. The transition probabilities are equiprobable, except for self-transitions. We tested the algorithm for m ≤ 6 and sequences generated both by a uniform i.i.d. process and by the HMM itself. The observed data
Note that in the boundary case where all symbols have zero step size, it is again impossible to distinguish between the individual states, and in the absence of tie-breakers the algorithm will require linear memory.
Fig. 4. Memory requirements of a gene finding HMM. a) Actual memory (table length) used along a section of human chromosome 1 (positions 15.2M–15.5M). b) Average maximum table length needed for prefixes of 20 MB sequences, for human genome (35), HMM generated (100), and random i.i.d. (35) sequences, plotted against sequence length.
are consistent with a logarithmic growth of the average maximum memory needed to decode a sequence of length n (data not shown). We also evaluated the algorithm using a simplified HMM for gene finding with 265 states. The emission probabilities of the states are defined using Markov chains of order at most 4, and the structure of the HMM reflects known properties of genes (similar to the structure shown in [18]). The HMM was trained on RefSeq annotations of human chromosomes 1 and 22. In gene finding, we segment the input DNA sequence into exons (protein-coding sequence intervals), introns (non-coding sequence separating exons within a gene), and intergenic regions (sequence separating genes). A common measure of accuracy is exon sensitivity (how many of the real exons we have successfully and exactly predicted). The implementation used here has exon sensitivity 37% on the ENCODE testing set [19]. A realistic gene finder, such as ExonHunter [20], achieves sensitivity of 53%; the difference is due to additional features that are not implemented in our test, namely GC content levels, non-geometric length distributions, and sophisticated signal models. We tested the algorithm on 20 MB long sequences: regions from the human genome, simulated sequences generated by the HMM, and i.i.d. sequences (see Figure 4b). Regions of the human genome were chosen from the hg18 assembly so that they do not contain sequencing gaps. The distribution for the i.i.d. sequences mirrors the distribution of bases in human chromosome 1. The average maximum length of the table over several samples appears to grow faster than logarithmically with the length of the sequence, though it seems to be bounded by c·log² n. It is not clear whether the faster growth is an artifact that would disappear with longer sequences or a higher number of samples. The HMM for gene finding has a special structure, with three copies of the intron state that have the same emission probabilities and the same self-transition probability. In two-state symmetric HMMs, similar emission probabilities of the
two states lead to an increase in the length of individual runs; the intron states of a gene finder are an extreme example of this phenomenon. Nonetheless, on average a table of length roughly 100,000 is sufficient to process sequences of length 20 MB, which is a 200-fold improvement compared to the trivial Viterbi algorithm. In addition, the length of the table did not exceed 222,000 on any of the 20 MB human segments. As we can see in Figure 4a, most of the time the program keeps only a relatively short table; the average length on the human segments is 11,000. The low average length can be a significant advantage if multiple processes share the same memory.
4 Conclusion
In this paper, we introduced the on-line Viterbi algorithm. Our algorithm is based on efficient detection of coalescence points in the trees representing the state paths of the dynamic programming algorithm. The algorithm requires variable space that depends on the HMM and on the local properties of the analyzed sequence. The memory savings enable analysis of long biological sequences; previous approaches often artificially split the sequence into smaller pieces, which can negatively influence prediction accuracy. For two-state symmetric HMMs, we have shown that the expected maximum memory needed for a sequence of length n is approximately only (2K²/π²) ln n. Experiments on simulated data suggest that the asymptotic bound Θ(m log n) also extends to some multi-state HMMs; experiments on real DNA sequences in a gene-finding scenario suggest a polylogarithmic bound. In addition, most of the time throughout its execution, the algorithm uses much less memory. Our algorithm can also be used for on-line processing of streamed sequences; all previous algorithms that are guaranteed to produce the optimal state path require the whole sequence to be read before the output can be started. There are still many open problems. We have only been able to analyze the algorithm for two-state HMMs, though the trends predicted by our analysis seem to generalize even to more complex cases. Can our analysis be extended to multi-state HMMs? Apparently, the design of the HMM affects the memory needed for decoding; for example, the presence of states with similar emission probabilities tends to increase memory requirements. Is it possible to characterize HMMs that require large amounts of memory to decode? Can we characterize the states that are likely to serve as coalescence points? Acknowledgments. We thank Richard Durrett for useful discussions. TV is supported by NSF grant DBI-0644111 and NSF/NIGMS grant DMS-0201037. BB is supported by NIH/NCI (subcontract 22XS013A). Recently, we discovered parallel work on this problem by another research group [21], focusing on the implementation of a similar algorithm in their gene finder and on possible applications to parallelization; we focus on the expected space analysis.
References
1. Burge, C., Karlin, S.: Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology 268(1), 78–94 (1997)
2. Ohler, U., Niemann, H., Rubin, G.M.: Joint modeling of DNA sequence and physical properties to improve eukaryotic promoter recognition. Bioinformatics 17(S1), S199–206 (2001)
3. Durbin, R., Eddy, S., Krogh, A., Mitchison, G.: Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge (1998)
4. Pedersen, J.S., Hein, J.: Gene finding with a hidden Markov model of genome structure and evolution. Bioinformatics 19(2), 219–227 (2003)
5. Siepel, A., et al.: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research 15(8), 1034–1040 (2005)
6. Forney Jr., G.D.: The Viterbi algorithm. Proceedings of the IEEE 61(3), 268–278 (1973)
7. Grice, J.A., Hughey, R., Speck, D.: Reduced space sequence alignment. Computer Applications in the Biosciences 13(1), 45–53 (1997)
8. Tarnas, C., Hughey, R.: Reduced space hidden Markov model training. Bioinformatics 14(5), 401–406 (1998)
9. Wheeler, R., Hughey, R.: Optimizing reduced-space sequence analysis. Bioinformatics 16(12), 1082–1090 (2000)
10. Henderson, J., Salzberg, S., Fasman, K.H.: Finding genes in DNA with a hidden Markov model. Journal of Computational Biology 4(2), 127–131 (1997)
11. Hemmati, F., Costello Jr., D.J.: Truncation error probability in Viterbi decoding. IEEE Transactions on Communications 25(5), 530–532 (1977)
12. Onyszchuk, I.: Truncation length for Viterbi decoding. IEEE Transactions on Communications 39(7), 1023–1026 (1991)
13. Feller, W.: An Introduction to Probability Theory and Its Applications, 3rd edn., vol. 1. Wiley, Chichester (1968)
14. Guibas, L.J., Odlyzko, A.M.: Long repetitive patterns in random sequences. Probability Theory and Related Fields 53, 241–262 (1980)
15. Gordon, L., Schilling, M.F., Waterman, M.S.: An extreme value theory for long head runs. Probability Theory and Related Fields 72, 279–287 (1986)
16. Schuster, E.F.: On overwhelming numerical evidence in the settling of Kinney's waiting-time conjecture. SIAM Journal on Scientific and Statistical Computing 6(4), 977–982 (1985)
17. Buhler, J., Keich, U., Sun, Y.: Designing seeds for similarity search in genomic DNA. Journal of Computer and System Sciences 70(3), 342–363 (2005)
18. Brejova, B., Brown, D.G., Vinar, T.: Advances in hidden Markov models for sequence annotation. In: Mandoiu, I., Zelikovsky, A. (eds.) Bioinformatics Algorithms: Techniques and Applications. Wiley, Chichester (to appear, 2007)
19. Guigo, R., et al.: EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biology 7(S1), 1–31 (2006)
20. Brejova, B., Brown, D.G., Li, M., Vinar, T.: ExonHunter: a comprehensive approach to gene finding. Bioinformatics 21(S1), i57–65 (2005)
21. Keibler, E., Arumugam, M., Brent, M.R.: The Treeterbi and Parallel Treeterbi algorithms: efficient, optimal decoding for ordinary, generalized and pair HMMs. Bioinformatics 23(5), 545–554 (2007)
Predicting Protein Folding Kinetics Via Temporal Logic Model Checking (Extended Abstract)

Christopher James Langmead and Sumit Kumar Jha
Department of Computer Science, Carnegie Mellon University
[email protected]
Abstract. We present a novel approach for predicting protein folding kinetics using techniques from the field of model checking. This represents the first time model checking has been applied to a problem in the field of structural biology. The protein's energy landscape is encoded symbolically using Binary Decision Diagrams and related data structures. Questions regarding the kinetics of folding are encoded as formulas in the temporal logic CTL. Model checking algorithms are then used to make quantitative predictions about the kinetics of folding. We show that our approach scales to state spaces as large as 10^23 states when using exact algorithms for model checking. This is at least 14 orders of magnitude larger than the number of configurations considered by comparable techniques. Furthermore, our approach scales to state spaces of at least 10^32 unique configurations when using approximation algorithms for model checking. We tested our method on 19 test proteins. The quantitative predictions regarding folding rates for these test proteins are in good agreement with experimentally measured values, achieving a correlation coefficient of 0.87.
1 Introduction
In the world of proteins, form usually follows function. Consequently, proteins are often studied in terms of their atomic-resolution structures. A detailed analysis of an enzyme's active site, for example, may reveal the mechanism by which it catalyzes a given reaction. Protein structures are not static, however, and conformational changes often play important functional roles. Moreover, large-scale conformational changes are also associated with a number of diseases, most notably the prion-related diseases. For these reasons, and others, it is interesting to study how a given protein moves between conformations. Such examinations may provide valuable insights into basic biology and pathology, as well as into the design of therapeutic or preventative interventions for certain classes of disease. In this paper, we focus on what is typically the largest conformational change a protein will exhibit: folding. By folding we refer to the act of moving from a completely denatured form to the so-called native configuration. Unfortunately,
Corresponding author.
there is no experimental technology that can provide atomic-resolution detail on the entire process of folding (or any other large-scale conformational change, for that matter). For this reason, computational methods are used to study large-scale conformational changes, including folding.

Our work builds on prior research on the protein unfolding problem. In contrast to the well-known protein folding problem, the unfolding problem assumes that the native structure is already known. The computational challenge is to find low-energy pathways between the unfolded and folded states. More specifically, we consider the Gō theory of (un)folding [12], wherein the folding process is driven by the formation of native contacts between residues (i.e., those present in the native structure). Non-native interactions are deemed negligible, and are therefore ignored. Obviously, this is a highly simplified theory of folding. Nevertheless, this theory has been shown capable of making accurate quantitative predictions regarding the kinetics of folding (e.g., [1,8,11,16]).

Like previous algorithms for Gō-like theories, our algorithms operate on finite-state models of the protein's energy landscape. The primary contribution of our work lies in the observation that finite-state models of folding can be formally analyzed using techniques from the field of model checking [10]. Model checking refers to a family of algorithms for automatically verifying dynamic properties of concurrent reactive processes. Historically, model checking has been used to verify the correctness and safety of circuit designs, communications protocols, device drivers, and other classes of software. More recently, model checking algorithms have been introduced for analyzing the properties of stochastic systems. Such model checking algorithms for stochastic systems have been used in the field of systems biology to verify properties of biochemical and regulatory networks (e.g., [15]). To our knowledge, however, model checking has not been applied to any problem within the field of structural biology. This paper is the first to do so.

There are three primary advantages of a model-checking approach to studying protein folding pathways. First, model checking algorithms compute over symbolic representations of finite-state models, not explicit representations. The computational complexity of model checking algorithms is polynomial in the size of the encoding of the finite-state model. Thus, if a given finite-state model can be compressed, extremely large state spaces can be considered. Unfortunately, finding a minimal encoding for an arbitrary finite-state model is NP-hard. However, good heuristics for finding compact encodings exist. For example, model checking algorithms have been able to verify properties of systems having more than 10^20 states since 1990 [7], and have been applied to systems with as many as 10^120 states [5,6]. In this paper, we show that, using exact algorithms for model checking, energy landscapes with as many as 10^23 states are tractable. This is at least 14 orders of magnitude larger than has been attempted by comparable algorithms for studying protein folding pathways. We also show that energy landscapes with at least 10^32 states are tractable when using approximation algorithms for model checking. Second, model checking relies on formulas in a temporal logic to express precise queries about the behavior of the finite-state model.
Temporal logics are very expressive and can be used to ask many questions of interest to protein
folding. Third, model checking algorithms are exact; they are not simply a means for sampling or simulating the behavior of a system. There are, however, finite-state models that are too large for traditional model checking algorithms. For these, we use an algorithm for performing approximate model checking [19], which provides a guarantee on the quality of the computed result.

The organization of this paper is as follows. In Section 2, we define our model of protein folding. In Section 3, we briefly discuss model checking and demonstrate how to encode the protein folding problem in a form suitable for model checking. In Section 4, we report the results of applying our method to 19 proteins and show that our quantitative predictions of folding rates are well-correlated with experimental values. We conclude with a discussion of ongoing work in applying model checking to the study of protein folding pathways.
2 A Simplified Model of Protein (Un)Folding
In this section we describe our model of protein folding; it is identical to that used in [16] and very similar to those reported elsewhere [1,8,11]. The thermodynamics of folding is governed by the Gibbs free energy: ΔG = ΔE − TΔS. Here, E is the energy (in kcal mol^−1) of inter-residue interactions (e.g., hydrogen bonds, hydrophobic interactions, etc.), S is the configurational entropy (in kcal mol^−1 K^−1), and T is the absolute temperature (in Kelvin). Free energy is a balance between the stabilizing contributions of inter-residue interactions and the destabilizing contributions due to the loss of configurational entropy as the protein folds.

Definitions. Let P = a1, a2, ..., an be a protein with n amino acids (aka residues) and m atoms. Let C ⊂ R^3m be the set of possible configurations/embeddings of P such that each Ci ∈ C is consistent with the laws of physics. Let CF ∈ C be the native configuration of P as determined by, say, X-ray crystallography. Following [16], we define a contact as two non-hydrogen atoms from two different residues that are within 4 Å (1 Å = 10^−10 m) of each other. Contacts between residues (i, i ± 1) and (i, i ± 2) are ignored because they tend to be present in every configuration of P. A contact map, M, is an n × n matrix where element M(i, j) is the number of contacts between residues i and j. We define a separate contact strength map, MS, that is the same size as M but whose elements are obtained by mapping the elements of M as follows: 1–5 contacts → 1; 6–10 contacts → 2; 11–15 contacts → 3; 16–20 contacts → 4. Intuitively, MS classifies contacts as being weak, medium, strong, or very strong.

The Gō theory assumes that folding is driven by the formation of the native contacts, and that non-native interactions are negligible. Therefore, the state space of the protein can be modeled using a binary string, B ∈ {0,1}^n. Here, B(i) is 0 if the ith residue is completely unfolded and 1 if it is folded. There is an entropic penalty for each 1 in B, which must be compensated for by the stabilizing energies of the native contacts. In particular, if B(i) = B(j) = 1, then we assume that the contacts between residues i and j (if any) are formed, and that the energy of that interaction can be used to offset the entropic penalty.

Under the model, there are 2^n possible states. Let BU be the bit string containing all 0's, and let BF be the bit string of all 1's. BU and BF correspond to the unfolded and folded states, respectively. Every other bit string corresponds to a partially folded state. Each state can be mapped to its free energy as follows:

    G(B) = Σ_{i=1}^{n} Σ_{j>i} MS(i, j) B(i) B(j) α − T Σ_{i=1}^{n} B(i) β        (1)

where α is the strength of a single contact and β is the entropic penalty for folding a single residue (see [16] for more details on contact energies and entropic penalties). The Boltzmann factor (i.e., weight) for any given configuration is a function of its energy, the gas constant (R), and the temperature, T; it is given by w(B) = exp(−G(B)/RT). Since we are only interested in changes in free energy (i.e., ΔG), we arbitrarily set G(BU) = 0.

A protein's energy landscape is constructed by applying Eq. 1 to every possible configuration. In this paper, it can be thought of as an n-dimensional discrete function. Computationally, our task is to find a low-energy path (or a set of paths) between BU and BF in the energy landscape. Thus, we must define a set of allowable transitions. Under the model, state s can only transition to those states that are similar. In practice, this means that transitions are only allowed between pairs of states whose bit-vector representations have small Hamming distance. In this paper, we allow transitions between pairs of states with Hamming distance 1. A toy example of the model for a 3-residue protein is shown in Figure 1 (left).

Fig. 1. (Left) A toy example of the protein folding model. This finite-state model corresponds to a 3-residue protein. The state variables and the energy (in parentheses) are placed inside each node. The state labeled 000 is the unfolded state; the state labeled 111 is the folded state. In our experiments, we considered proteins with between 16 and 107 residues. (Right) An MTBDD of the function mapping the states of the model on the left to their respective energies. Each level of the MTBDD corresponds to one of the three bits/residues (B1, B2, B3), and each path from the root to a leaf maps one (or more) states to an energy. Notice that the MTBDD is smaller than a complete binary tree encoding the same function from states to energies.
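To make the model concrete, here is a small Python sketch that evaluates Eq. 1 by brute force and enumerates the allowed Hamming-distance-1 transitions. The contact-strength matrix MS and the values of alpha, beta, and T below are illustrative placeholders, not parameters taken from [16].

from itertools import product

def free_energy(B, MS, alpha, beta, T):
    # Eq. 1: stabilizing native-contact term minus T times the entropic term.
    n = len(B)
    contacts = sum(MS[i][j] * B[i] * B[j]
                   for i in range(n) for j in range(i + 1, n))
    return contacts * alpha - T * sum(B) * beta

def landscape(MS, alpha, beta, T):
    # Map every bit string B in {0,1}^n to its free energy G(B).
    n = len(MS)
    return {B: free_energy(B, MS, alpha, beta, T)
            for B in product((0, 1), repeat=n)}

def neighbors(B):
    # Allowed transitions: flip exactly one residue (Hamming distance 1).
    return [B[:i] + (1 - B[i],) + B[i + 1:] for i in range(len(B))]

# Toy 3-residue chain; MS, alpha, beta, and T are illustrative values only.
MS = [[0, 0, 1],
      [0, 0, 2],
      [1, 2, 0]]
G = landscape(MS, alpha=-1.0, beta=-1.0 / 150.0, T=300.0)
for B in sorted(G):
    print(B, round(G[B], 2), "->", neighbors(B))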
Fig. 2. Energy profile for FKBP-12, as computed by our method (x-axis: path length)
2.1 Kinetics

The reaction kinetics of folding are described in terms of an energy profile along a chosen reaction coordinate. A reaction coordinate is a projection of the energy landscape onto a lower-dimensional surface. Given an appropriately chosen reaction coordinate, one can make quantitative predictions regarding the rate of folding from the energy profile. There are a number of potentially relevant reaction coordinates from which to choose when studying protein folding, including radius of gyration, solvent accessible area, number of folded residues, and so forth. Following Muñoz and Eaton [16], we will use the number of folded residues (i.e., the number of 1's in B) as our reaction coordinate.

For each position 0 ≤ k ≤ n along the reaction coordinate, there are (n choose k) binary strings, each with its own energy. Let Bk = {B ∈ {0,1}^n | Σ_{i=1}^{n} B(i) = k} be the set of bit strings with k 1's and n − k 0's. The Boltzmann-weighted total energy for each position k along the reaction coordinate is Gk = −RT ln(Σ_{b∈Bk} w(b)). The energy profile for FKBP-12 is shown in Figure 2.

In theory, it is possible to construct the energy profile by explicitly enumerating all 2^n binary strings. In practice, it is common to sample from the set of possible configurations. The algorithms reported in [1,8,11,16], for example, operate on state spaces ranging in size from 10^4 to 10^9 configurations. In contrast, we seek to consider the entire space of binary strings by adopting symbolic techniques from the field of model checking. We note that because the Boltzmann weight of a configuration is exponentially related to the negative energy of that configuration, we can compute an upper bound for each Gk by considering only the smallest-energy configurations for each k. It is these low-energy configurations we identify via model checking. Specifically, we seek to find the energy of the lowest-energy configuration for each k; we will denote this lowest energy by G̃k. (Our technique can in fact be used to identify the c lowest-energy configurations, for an arbitrary integer c; for ease of presentation, we only consider the case c = 1 in this paper.)

Given the value of G̃k for all 0 ≤ k ≤ n, there are a number of ways to predict folding rates. Under a transition-state theory, for example, the folding rate is k ∝ k0 exp(−ΔG‡/RT), where k0 is a constant and ΔG‡ = max_k G̃k − G̃0. In this paper, we use a more accurate way to predict the folding rate, in terms of the rate of decay of the average number of folded residues starting from the folded state [16].
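A minimal sketch of this brute-force construction, reusing the hypothetical landscape dictionary G from the previous listing; it is feasible only for small n, which is precisely the limitation that motivates the symbolic approach.

import math
from collections import defaultdict

R = 0.0019872  # gas constant, kcal mol^-1 K^-1

def energy_profile(G, T):
    # Boltzmann-weighted total energy G_k at each point k of the reaction
    # coordinate (number of folded residues): G_k = -RT ln(sum_b w(b)).
    Z = defaultdict(float)
    for B, g in G.items():
        Z[sum(B)] += math.exp(-g / (R * T))
    return {k: -R * T * math.log(z) for k, z in sorted(Z.items())}

def lowest_energy_per_k(G):
    # The quantities targeted via model checking: for each k, the energy
    # of the lowest-energy configuration with k folded residues.
    best = {}
    for B, g in G.items():
        best[sum(B)] = min(best.get(sum(B), float("inf")), g)
    return best

# profile = energy_profile(G, T=300.0)          # G from the previous sketch
# barrier = max(profile.values()) - profile[0]  # an estimate of Delta G‡
# under transition-state theory, the rate is proportional to
# exp(-barrier / (R * T))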
3 Model Checking

The field of model checking was born from a need to formally verify the correctness of hardware designs. Since its inception in 1981, it has expanded to encompass a wide range of techniques for formally verifying finite-state transition systems, including those with stochastic behavior. Model checking algorithms are simultaneously theoretically very interesting and very useful in practice. Significantly, they have become the preferred method for formal verification in industrial settings over traditional verification methods like theorem proving, which often need guidance from an expert human user. A complete discussion of model checking theory and practice is beyond the scope of this paper. The interested reader is directed to [10] for a detailed treatment of the subject.

3.1 Modeling Concurrent Systems
Let AP be a set of atomic propositions. An atomic proposition, a, is a Boolean predicate referring to some property of the system. A Kripke structure, M, over AP is a tuple M = (S, S0, R, L). Here, S is a finite set of states, S0 ⊆ S is the set of initial states, R ⊆ S × S is a total transition relation between states, and L : S → 2^AP is a labeling function that labels each state with the set of atomic propositions that are true in that state. Variations on the basic Kripke structure exist. For example, if the system is stochastic, then we replace the transition relation R with a stochastic transition matrix T, where element T(i, j) contains either a transition rate (for continuous-time Markov models) or a transition probability (for discrete-time Markov models).

Application to Energy Landscapes. The Kripke structure used in this paper closely follows the model of protein folding described in Section 2. The set of states, S, is isomorphic to the set of n-bit binary strings. The set of initial states, S0, corresponds to {BU}. The transition relation, R, allows transitions between pairs of states whose bit-vector representations have Hamming distance 1. The labeling function, L, maps each state to an energy and works as follows: recall that Bk is the set of bit strings where k bits are 1 and n − k bits are 0. In this paper, our atomic propositions are generally of the form "is the minimum energy of B ∈ Bk equal to c?". An interesting property of proteins is that the energies of folding are bounded to a relatively small, constant-size range. In particular, the difference between G(BU) and G(BF) is generally 1 to 10 kcal mol^−1. The energy barrier which separates the unfolded and folded states is also typically 10 kcal mol^−1 or smaller at room temperature. Indeed, the energy barrier must be small, or else folding won't occur. Thus, the domain of possible energies is, in effect, bounded by a constant of around 20 kcal mol^−1. This range is not related to the size of the protein. The set of possible states, on the other
hand, is exponential in the size of the protein. Due to the discrete nature of our energy function and the fixed precision of the parameters α and β in Eq. 1, we can then apply the pigeonhole principle and conclude that the number of unique energy values is also constant. This will ultimately lead to a very efficient representation of the labeling function, as discussed in the next section. In summary, assuming a Gō-like model of folding, we have shown that a protein's energy landscape can be encoded as a Kripke structure. In the model checking literature, Kripke structures are not represented explicitly, but rather symbolically. In the next section we discuss techniques for representing Kripke structures symbolically.
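As an explicit (non-symbolic) illustration of this encoding, the following sketch assembles the Kripke structure for the folding model; G is the hypothetical energy dictionary from the Section 2 listing, and the label scheme is illustrative.

def folding_kripke(G):
    # Explicit Kripke structure (S, S0, R, L) for the folding model:
    # states are bit strings, the initial state is B_U, transitions connect
    # Hamming-distance-1 pairs, and L labels each state with its number of
    # folded residues and its energy.
    S = list(G)
    n = len(S[0])
    S0 = {tuple([0] * n)}                        # the unfolded state B_U
    R = {(B, B[:i] + (1 - B[i],) + B[i + 1:])    # flip exactly one residue
         for B in S for i in range(n)}
    L = lambda B: {("folded", sum(B)), ("energy", G[B])}
    return S, S0, R, L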
3.2 Symbolic Encodings of Kripke Structures
The basis for symbolic encodings of Kripke structures, which ultimately facilitated industrial applications of model checking, is the reduced ordered Binary Decision Diagram (BDD) introduced by Bryant [4]. BDDs are directed acyclic graphs that symbolically and compactly represent binary functions, f : {0,1}^n → {0,1}. While the idea of using decision trees to represent Boolean formulae arose directly from Shannon's expansion for Boolean functions, two key extensions made to it were the use of a fixed variable ordering and the sharing of sub-graphs. The first extension made the data structure canonical, while the second allowed for compression in its storage. A third extension, also introduced in [4], is an algorithm for applying Boolean operators to pairs of BDDs, as well as an algorithm for composing the BDD representations of pairs of functions. Briefly, if f and g are Boolean functions, the algorithms implementing the operators apply(f, g, op) and compose(f, g) compute directly on the BDD representations of the functions in time O(|f||g|), where |f| is the size of the BDD encoding f. BDDs can be generalized to Multi-Terminal BDDs (MTBDDs) [9], which encode discrete, real-valued functions of the form f : {0,1}^n → R. Significantly, MTBDDs can be used to encode real-valued vectors and matrices, and algorithms exist for performing matrix addition and multiplication over MTBDDs [9]. These algorithms play an important role in several model checking algorithms for stochastic systems [3].

Application to Energy Landscapes. As previously mentioned, we can encode energy landscapes using Kripke structures. It follows, therefore, that energy landscapes can be encoded symbolically using a combination of BDDs and MTBDDs. In particular, the transition relation, R, and the labeling function, L, can be encoded using BDDs and MTBDDs, respectively. In practice, the construction of the BDDs and MTBDDs is done automatically from a high-level language describing the finite-state system and its behavior. Here, we use the specification formalism of reactive modules [2] as provided in the model checking tool PRISM [13]. Briefly, each residue is modeled as a separate two-state process (i.e., folded or unfolded). Residues change state asynchronously, and only one residue is allowed to change at any given time (thereby enforcing the Hamming-distance rule). The set of possible states of the system corresponds
exactly to the set of n-bit strings. The set of allowable transitions is ultimately encoded as a BDD, and the labeling function as an MTBDD (Fig. 1, right).
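The following sketch illustrates the two ideas that make BDDs/MTBDDs compact, a fixed variable order and hash-consed node sharing; it builds the diagram by full enumeration, which a real tool such as PRISM avoids by constructing diagrams compositionally. All names are illustrative.

class MTBDD:
    # A minimal multi-terminal BDD with a fixed variable order and
    # hash-consing of nodes, so isomorphic sub-graphs are shared.
    def __init__(self):
        self.unique = {}   # (var, low_id, high_id) -> node id
        self.leaves = {}   # terminal value -> node id
        self.nodes = {}    # node id -> (var, low, high) or ('leaf', value)
        self.next_id = 0

    def _fresh(self, payload):
        nid = self.next_id
        self.next_id += 1
        self.nodes[nid] = payload
        return nid

    def leaf(self, value):
        if value not in self.leaves:
            self.leaves[value] = self._fresh(('leaf', value))
        return self.leaves[value]

    def node(self, var, low, high):
        if low == high:              # redundant test: skip the node entirely
            return low
        key = (var, low, high)
        if key not in self.unique:
            self.unique[key] = self._fresh(key)
        return self.unique[key]

    def build(self, f, n, var=0, prefix=()):
        # Encode f : {0,1}^n -> values (exponential build, for illustration).
        if var == n:
            return self.leaf(f(prefix))
        lo = self.build(f, n, var + 1, prefix + (0,))
        hi = self.build(f, n, var + 1, prefix + (1,))
        return self.node(var, lo, hi)

Because the folding energies take only a constant number of distinct values, the terminal set stays small and sharing can compress the 2^n-entry state-to-energy table; e.g., dd = MTBDD(); root = dd.build(lambda bits: round(G[bits], 2), n=3), with G as in the Section 2 sketch.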
3.3 Temporal Logics
Temporal logic is a formalism for describing behaviors of finite-state systems. Temporal logics have been used since 1977 to reason about the properties of concurrent programs [18]. There are a number of different temporal logics from which to choose, and different logics have different expressive powers. In this paper, we use a small subset of the Computation Tree Logic (CTL). CTL formulae can express properties of computation trees. The root of a computation tree corresponds to the set of initial states (i.e., S0), and the rest of the (infinite) tree corresponds to all possible paths from the root. A complete discussion of CTL and temporal logics is beyond the scope of this paper. The interested reader is directed to [10] for more information.

The syntax of CTL is given by the following minimal grammar:

    φ ::= a | true | (¬φ) | (φ1 ∨ φ2) | EXφ | E[φ1 U φ2]

Here, a ∈ AP is an atomic proposition (e.g., "does state s have energy c?"); "true" is a Boolean constant; ¬ and ∨ are the usual logical operators; E is the existential path quantifier (i.e., "there exists some path from the root in the computation tree"); and X and U are temporal operators corresponding to the notions of "in the next state" and "until", respectively. Given these, additional operators can be derived. For example, "false" can be derived as "¬true", and the universal operator AXφ can be defined as ¬EX¬φ. Given some path π = π[0], π[1], ... through the computation tree, the semantics of a CTL formula are defined recursively:

    π |= a             iff a ∈ L(π[0])
    π |= true          for all π
    π |= ¬φ            iff π ⊭ φ
    π |= φ1 ∨ φ2       iff π |= φ1 or π |= φ2
    π |= EXφ           iff π[1] |= φ
    π |= E[φ1 U φ2]    iff ∃i ≥ 0 such that π[i] |= φ2 and ∀j < i, π[j] |= φ1

Here, the notation "π |= α" means that π satisfies α.

Application to Protein Folding. Clearly, CTL formulas can express a rich set of properties concerning reachability (e.g., "will the protein end up in a particular configuration?") and/or the logical ordering of events (e.g., "will the second residue fold before the first one?"). Numerous extensions to CTL exist which facilitate questions regarding explicit timings (e.g., "will the protein fold within t milliseconds?") or likelihoods (e.g., "what is the probability that the protein folds within t milliseconds?"). In this paper, we only consider CTL formulas of the following form: let a ∈ AP be an atomic proposition that asks "does the state s have k folded residues and energy c?"; the CTL formula E[true U a]
asks "Is there a path from S0 to some other state, s ∈ S, such that s |= a?" To find the minimum-energy state for fixed k, we can perform a binary search over different values of c. (In our experiments, we make use of extensions to CTL provided in the tool PRISM [13] that allow one to ask for the minimum energy value directly; therefore, we do not perform an explicit binary search.) Recall that we argued that the range of energies is bounded by a constant and that the number of unique energy values is also constant. Therefore, the cost of the binary search is O(1).
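A sketch of this search strategy; holds_somewhere(k, c) is a hypothetical stand-in for a model-checker call on E[true U a], and the proposition is phrased as "energy ≤ c" so that the predicate is monotone in c, which does not change the minimum found.

def min_energy(k, energies, holds_somewhere):
    # energies: the sorted, constant-size list of possible energy values.
    # holds_somewhere(k, c): hypothetical model-checker call deciding
    # E[true U a] with a = "k folded residues and energy <= c".
    lo, hi = 0, len(energies) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if holds_somewhere(k, energies[mid]):
            hi = mid          # a state at least this good exists
        else:
            lo = mid + 1      # no state in B_k has energy <= energies[mid]
    return energies[lo]       # the lowest energy among states in B_k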
3.4 Model Checking Algorithms
Having defined a Kripke structure and a CTL formula, we can use existing model checking algorithms to verify the formula, given a symbolic encoding of the Kripke structure. A model checking algorithm takes a Kripke structure, M = (S, S0, R, L), and a temporal logic formula, φ, and finds the set of states in S that satisfy φ: {s ∈ S | M, s |= φ}. The complexity of model checking algorithms varies with the temporal logic and the operators used. For the types of formulas used in this paper (i.e., E[φ1 U φ2]), an explicit-state model checking algorithm requires time linear in the size of the finite-state model and in the length of the formula ([10], pp. 35–36).
Fig. 3. Correlation between predicted and experimental folding rates (k): scatter plot of predicted (y-axis, log10 k) and actual (x-axis, log10 k) folding rates. The correlation coefficient is 0.87 (p < 0.001).
Of course, for very large state spaces, even linear time is unacceptable. Symbolic model checking algorithms operate directly on BDD/MTBDD encodings of the Kripke structure and the CTL formula. Briefly, the temporal operators of CTL can be characterized in terms of fixpoints. Let P(S) be the powerset of S. A set S′ ⊆ S is a fixpoint of a function τ : P(S) → P(S) if τ(S′) = S′. Symbolic model checking algorithms define an appropriate function, based on the formula, and then iteratively find the fixpoint of that function. This is done using set operations that operate directly on BDDs/MTBDDs. The fixpoint of the function
corresponds exactly to {s ∈ S | M, s |= φ}. The interested reader is encouraged to read [10], ch. 6, for more details. Explicit-state and symbolic model checking algorithms are exact. There are also approximation algorithms for model checking (e.g., [19]), which employ sampling techniques and hypothesis testing. Such algorithms provide guarantees, in terms of the probability of the property being true, and can scale to much larger state spaces. These do not use BDDs/MTBDDs, but rather operate on the high-level language description of the finite-state model (see Sec. 3.2). We explored the use of both exact and approximate algorithms for model checking in our experiments.
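For intuition, here is the fixpoint iteration for E[φ1 U φ2] written out over explicit sets of states; a symbolic checker performs the same iteration with BDD set operations. The Kripke structure is assumed to be in the form produced by the earlier folding_kripke sketch, and phi1/phi2 are Boolean predicates over label sets.

def check_EU(S, R, L, phi1, phi2):
    # Least fixpoint of  Z = phi2  OR  (phi1 AND EX Z), computed by
    # backward reachability; returns {s | M, s |= E[phi1 U phi2]}.
    pred = {}                              # reverse transition relation
    for (s, t) in R:
        pred.setdefault(t, set()).add(s)
    sat = {s for s in S if phi2(L(s))}
    frontier = set(sat)
    while frontier:
        new = {s for t in frontier for s in pred.get(t, ())
               if s not in sat and phi1(L(s))}
        sat |= new
        frontier = new
    return sat

For E[true U a], phi1 is the constant-true predicate, so the iteration reduces to backward reachability from the a-states.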
4 Experiments and Results
We replicated the experiments of Muñoz and Eaton [16], who made predictions on 19 proteins. (The PDB ids of the 19 proteins are: 1APS, 1COA, 1CSP, 1FKB, 1FNF, 1HDN, 1LMB, 1MJC, 1NYF, 1PBA, 1PGB, 1PKS, 1SHG, 1SRL, 1TEN, 1URN, 2ABD, 2AIT, 2PTL.) The largest protein in that set, FKBP-12 (PDB id 1FKB), has 107 residues. Muñoz and Eaton consider state spaces ranging in size from O(10^3) to O(10^9) states. In contrast, we have successfully performed exact model checking on state spaces of size 2^76 ≈ 10^23 using 2 GB of memory on a single processor of a 4-node cluster. The time taken for these experiments is shown in Table 1. For proteins of up to 74 residues, the longest runtime was under 30 minutes. Then, there is a jump to almost 7 hours for a 76-residue protein. The increase in time is due to thrashing of virtual memory. In general, the computation time is dominated by the time to construct the MTBDD. The actual cost of performing the model checking is under 90 seconds. Both load time and model checking time are correlated with the length of the protein for proteins of up to 74 residues, with correlations of 0.77 and 0.78, respectively (p = 0.02). However, these are not monotonically related to length. No significant correlations between load times, model checking times, and actual folding rates were observed.

We were not able to perform exact model checking on proteins larger than 76 residues on a 2 GB machine due to memory limitations. For this reason, we also ran experiments with an approximation algorithm for model checking [19]. These all completed in under 11 minutes. The time to perform approximate model checking is strongly correlated with protein length (R = 0.97, p < 0.001). The largest state space we considered using the approximation algorithm has 2^107 ≈ 10^32 states.

Figure 2 shows one sample energy profile computed using model checking, for the protein FKBP-12. Using the technique described in [16] for transforming the free-energy profile into a quantitative prediction of folding time, we predicted the folding times for each of the 19 proteins. The correlations between the logarithms of the predicted folding rates and the experimentally measured values [14] are shown in Figure 3. The correlation coefficient between predicted and experimental values is 0.87.
Table 1. Performance Statistics (MC = model checking). MTBDD build times are only relevant to exact MC, because approximate MC does not use MTBDDs. The approximation error bound was set to 1% of the energy for these experiments. Due to memory limitations (2 GB), exact model checking was performed only on proteins of up to 76 residues.

PDB Id  Residues  MTBDD Build Time (s)  MC Time (s)  Approximate MC Time (s)
1PGB    16        0.269                 0.027        29.39
1SRL    56        313.546               18.083       188.69
1SHG    57        452.684               34.767       194.48
1NYF    58        712.788               64.882       195.41
1COA    64        1331.58               110.99       226.80
1CSP    67        973.664               6.57         248.75
1MJC    69        1963.879              86.139       267.32
2AIT    74        1753.331              85.205       318.15
1PKS    76        24647.21              10.55        319.61
2PTL    78        –                     –            328.98
1PBA    81        –                     –            335.82
1HDN    85        –                     –            388.19
2ABD    86        –                     –            378.94
1LMB    87        –                     –            373.36
1TEN    90        –                     –            415.54
1FNF    91        –                     –            447.37
1URN    96        –                     –            485.32
1APS    98        –                     –            511.56
1FKB    107       –                     –            611.59
By comparison, Muñoz and Eaton achieve correlation coefficients between 0.83 and 0.87 on the same proteins, depending on which approximation was used. Plaxco and co-workers developed a simple method for predicting folding rates based on contact order (a length-normalized average sequential distance between contacting residues) [17]. Their correlation coefficient on 18 of the 19 proteins studied in this paper was 0.64. The mean absolute error of our predictions is 1.55. In comparison, the mean errors reported for two different techniques on a similar, but not identical, set of proteins in [8] were 2.77 and 3.42, respectively.
5 Conclusions and Future Work
We have presented an approach to predict the rate of folding using techniques from the field of model checking. We believe this paper represents the first application of model checking to a problem in structural biology. The key advantages of this approach are that it scales to extremely large state spaces and that it is exact. In terms of accuracy, our predictions of folding rate are well-correlated with experimentally determined values. However, it remains to be seen whether such levels of accuracy can be obtained when analyzing significantly larger proteins.
There are numerous extensions to this work that we intend to pursue. First, we have only begun to explore the kinds of queries that can be encoded in temporal logics. Second, a more thorough analysis of the relationship between the answers obtained via exact and approximate model checking is necessary. Finally, our model does not actually include any stochastic behavior. We have developed stochastic variants of our model of folding, and we intend to apply model checking algorithms for stochastic systems to them. A comparison between the stochastic and non-stochastic techniques is planned.
Acknowledgments. We thank Dr. Edmund Clarke for helpful discussions on this topic. This research was supported by a U.S. Department of Energy Career Award (DE-FG0205ER25696) and a Pittsburgh Life-Sciences Greenhouse Young Pioneer Award to C.J.L.
References

1. Alm, E., Baker, D.: Prediction of protein-folding mechanisms from free-energy landscapes derived from native structures. Proc. Natl. Acad. Sci. 96(20), 11305–11310 (1999)
2. Alur, R., Henzinger, T.A.: Reactive modules. Formal Methods in System Design: An International Journal 15(1), 7–48 (1999)
3. Baier, C., Clarke, E., Hartonas-Garmhausen, V., Kwiatkowska, M., Ryan, M.: Symbolic model checking for probabilistic processes. In: Degano, P., Gorrieri, R., Marchetti-Spaccamela, A. (eds.) ICALP 1997. LNCS, vol. 1256, pp. 430–440. Springer, Heidelberg (1997)
4. Bryant, R.E.: Graph-based algorithms for Boolean function manipulation. IEEE Transactions on Computers 35(8), 677–691 (1986)
5. Burch, J., Clarke, E.M., Long, D.E.: Symbolic model checking with partitioned transition relations. In: Proc. 1991 Conf. on VLSI, pp. 49–58 (1991)
6. Burch, J., Clarke, E.M., Long, D.E., McMillan, K.L., Dill, D.L.: Symbolic model checking for sequential circuit verification. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 3(4), 401–424 (1993)
7. Burch, J., Clarke, E.M., McMillan, K.L., Dill, D.L., Hwang, L.J.: Symbolic model checking: 10^20 states and beyond. In: Proc. 5th Ann. IEEE Symposium on Logic in Computer Science, pp. 428–439. IEEE Computer Society Press, Los Alamitos (1990)
8. Chiang, T.H., Apaydin, M.S., Brutlag, D.L., Hsu, D., Latombe, J.C.: Predicting experimental quantities in protein folding kinetics using stochastic roadmap simulation. In: Proceedings of the 2006 ACM International Conference on Research in Computational Molecular Biology (RECOMB), pp. 410–424. ACM Press, New York (2006)
9. Clarke, E., Fujita, M., McGeer, P.C., Yang, J.-Y., Zhao, X.: Multi-terminal binary decision diagrams: An efficient data structure for matrix representation. In: IWLS '93 International Workshop on Logic Synthesis (1993)
10. Clarke, E., Grumberg, O., Peled, D.A.: Model Checking. MIT Press, Cambridge, MA (1999)
11. Garbuzynskiy, S.O., Finkelstein, A.V., Galzitskaya, O.V.: Outlining folding nuclei in globular proteins. J. Mol. Biol. 336, 509–525 (2004)
12. Gō, N., Taketomi, H.: Studies on protein folding, unfolding and fluctuations by computer simulation. IV. Hydrophobic interactions. Int. J. Pept. Protein Res. 13(5), 447–461 (1979)
13. Hinton, A., Kwiatkowska, M., Norman, G., Parker, D.: PRISM: A tool for automatic verification of probabilistic systems. In: Hermanns, H., Palsberg, J. (eds.) TACAS 2006 and ETAPS 2006. LNCS, vol. 3920, pp. 441–444. Springer, Heidelberg (2006)
14. Jackson, S.: How do small single-domain proteins fold? Fold. Des. 33(4), R81–R91 (1998)
15. Kwiatkowska, M., Norman, G., Parker, D., Tymchyshyn, O., Heath, J., Gaffney, E.: Simulation and verification for computational modelling of signalling pathways, pp. 1666–1675 (2006)
16. Muñoz, V., Eaton, W.A.: A simple model for calculating the kinetics of protein folding from three-dimensional structures. Proc. Natl. Acad. Sci. 96(20), 11311–11316 (1999)
17. Plaxco, K.W., Simon, K.T., Baker, D.: Contact order, transition state placement and the refolding rates of single domain proteins. J. Mol. Biol. 277(4), 985–994 (1998)
18. Pnueli, A.: The temporal logic of programs. In: Proceedings of the 18th IEEE Symposium on Foundations of Computer Science (FOCS), pp. 46–57. IEEE Computer Society Press, Los Alamitos (1977)
19. Younes, H.L.S., Simmons, R.G.: Probabilistic verification of discrete event systems using acceptance sampling. In: Brinksma, E., Larsen, K.G. (eds.) CAV 2002. LNCS, vol. 2404, pp. 223–235. Springer, Heidelberg (2002)
Efficient Algorithms to Explore Conformation Spaces of Flexible Protein Loops

Ankur Dhanik¹, Peggy Yao¹, Nathan Marz¹, Ryan Propper¹, Charles Kou¹, Guanfeng Liu¹, Henry van den Bedem², and Jean-Claude Latombe¹

¹ Computer Science Department, Stanford University, Stanford, CA 94305, USA
² Joint Center for Structural Genomics, SLAC, Menlo Park, CA 94025, USA
Abstract. Two efficient and complementary sampling algorithms are presented to explore the space of closed clash-free conformations of a flexible protein loop. The “seed sampling” algorithm samples conformations broadly distributed over this space, while the “deformation sampling” algorithm uses these conformations as starting points to explore more finely selected regions of the space. Computational results are shown for loops ranging from 5 to 25 residues. The algorithms are implemented in a toolkit, LoopTK, available at https://simtk.org/home/looptk.
1 Introduction
Several applications in biology require exploring the conformation space of a flexible fragment (usually, a loop) of a protein. For example, upon binding with a small ligand, a fragment may undergo large deformations to rearrange nonlocal contacts [14]. Incorporating such flexibility in docking algorithms is a major challenge [17]. In X-ray crystallography experiments, electron-density maps often contain noisy regions caused by disorder in the crystalline sample, resulting in an initial model with missing fragments between resolved termini [19]. Similarly, in homology modeling [15], only parts of a protein structure can be reliably inferred from known structures with similar sequences.

This problem requires satisfying two constraints concurrently: closing a kinematic loop and avoiding steric clashes. Each constraint is relatively easy to satisfy alone, but the combination is hard because the two constraints conflict. The subset of closed conformations with no steric clash has a relatively small volume, especially for long loops; conversely, an arbitrary collision-free conformation of a loop has zero probability of being closed. So, the sampling techniques proposed so far have a high rejection ratio. Here, we present two new techniques, seed and deformation sampling, to solve this problem. Each deformation sampling operation starts from a given closed clash-free conformation and deforms this conformation, without breaking closure or introducing clashes, by modifying the loop's degrees of freedom (DOFs) in a coordinated way. In contrast, seed sampling generates new conformations from scratch, by prioritizing the treatment of the two constraints so that the most limiting one is enforced first. In both techniques, detection and prevention of
steric clashes is done using the grid-indexing method described in [11]. Seed and deformation sampling complement each other very well. Seed sampling produces conformations that are broadly distributed over the loop's conformation space, and provides conformations (seeds) later used by deformation sampling to explore more finely certain regions of this space. These algorithms are implemented in a toolkit, LoopTK, available at https://simtk.org/home/looptk. They have been tested on loops ranging from 5 to 25 residues.
2 Previous Work
The problem considered in this paper is a version of the “loop closure” problem studied in [2,6,8,12,13,21]. Several works have focused on kinematic closure. Analytical Inverse Kinematics (IK) methods are described in [6,21] to close a fragment of 3 residues. For longer fragments, iterative techniques have been proposed, like CCD (Cyclic Coordinate Descent) [2] and the “null space” technique [19]. Here we use analytical IK in a new way to close longer fragments. We use the null space technique to deform fragments without breaking closure. Other works also consider clash avoidance. Most (e.g., [5,8,12]) successively sample closed conformations and next test them for steric clashes. Because of its high rejection ratio, this approach is slow when clash-free conformations span a small subset of the closed conformation space, which is the case for most long loops. This observation motivated the prioritized constraint-satisfaction approach embedded in our seed sampling procedure. Some works try to sample conformations that locally minimize an energy function. Some use libraries of fragments obtained from previously solved structures [7,13,18,20]. Others sample conformations at random and refine them through energy minimization [8,9,12,16] or molecular dynamics [1]. But in the case of a truly deformable fragment, it is often more useful to explore the entire closed clash-free conformation space. For example, a fuzzy electron density map can be better explained by a distribution of conformations than by a single one, no matter how well it fits the density. Our goal in this paper is to present such exploration tools. Nevertheless, our deformation sampling technique also allows energy minimization, when this is desirable.
3 Loop Model
A loop L is a sequence of p > 3 consecutive residues in a protein P, such that neither of the two termini of L is also a terminus of P. We number the residues of L from 1 to p, starting from the N terminus. We model the backbone of L as a serial linkage whose DOFs are the n = 2p dihedral angles φi and ψi around the bonds N–Cα and Cα–C, in residues i = 1, ..., p. The rest of the protein, denoted by P\L, is assumed rigid. We let LB denote the backbone of L. It includes the Cβ and O atoms respectively bonded to the Cα and C atoms in the backbone. We attach a Cartesian coordinate frame Ω1 to the N terminus of L and another frame Ω2 to its C terminus. When LB is connected to the rest of the protein, i.e.,
when it adopts a closed conformation, the pose (position and orientation) of Ω2 relative to Ω1 is fixed. We denote this pose by Πg. However, if we arbitrarily pick the values of φi and ψi, i = 1 to p, then in general we get an open conformation of LB, where the pose of Ω2 differs from Πg. The set Q of all open and closed conformations of LB is a space of dimensionality n = 2p. The subset Qclosed of closed conformations is a subspace of Q of dimensionality n − 6. Let Π(q) denote the pose of Ω2 relative to Ω1 when the conformation of LB is q ∈ Q. The function Π and its inverse Π^−1 are the "forward" and "inverse" kinematics maps of LB, respectively. A conformation of LB is clash-free if and only if no two atoms, one in LB, the other in LB or P\L, are such that their centers are closer than ε times the sum of their van der Waals radii, where ε is a constant in (0, 1). In our software, ε is an adjustable parameter, usually set to 0.75, which approximately corresponds to the distance where the van der Waals potential associated with two atoms begins increasing steeply. We denote the set of closed clash-free conformations of LB by Q^free_closed. It has the same dimensionality as Qclosed, but its volume is usually a small fraction of that of Qclosed.
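A sketch of this clash test in Python; the grid-indexing method of [11] is approximated here by a simple spatial hash, and the atom representation (a (center, van der Waals radius) pair) is an assumption for illustration.

import math

def build_grid(atoms, cell):
    # Spatial hash: map each grid cell to the atoms whose centers lie in it.
    grid = {}
    for center, radius in atoms:
        key = tuple(int(c // cell) for c in center)
        grid.setdefault(key, []).append((center, radius))
    return grid

def clashes(atom, grid, cell, eps=0.75):
    # An atom clashes if some stored atom's center is closer than eps times
    # the sum of the two van der Waals radii. Only the 27 cells around the
    # atom are scanned, so cell should be at least the largest clash distance.
    center, r = atom
    bx, by, bz = (int(c // cell) for c in center)
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                for c2, r2 in grid.get((bx + dx, by + dy, bz + dz), ()):
                    if math.dist(center, c2) < eps * (r + r2):
                        return True
    return False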
4 Seed Sampling
Overview. The goal of seed sampling is to generate conformations of LB broadly distributed over Q^free_closed. The challenge comes from the interaction between the kinematic closure and clash avoidance constraints. Computational tests (see Section 6) show that the approach that first samples conformations from Qclosed and next rejects those with steric clashes is often too time-consuming, due to its huge rejection ratio. The reverse approach – sampling the angles φi and ψi of LB to avoid clashes – will inevitably end up with open conformations, since Qclosed has lower dimensionality than Q. These observations led us to develop a prioritized constraint-satisfaction approach. We partition LB into three segments: the front-end F, the mid-portion M, and the back-end B. F starts at the N terminus of LB and B ends at its C terminus. M is the segment between them. The feasible conformations of F and B are more limited by the clash avoidance constraint than by the closure constraint; so, we sample the dihedral angles in F and B to avoid clashes, ignoring the closure constraint. Then, for any pair of conformations of F and B, the possible conformations of M are mainly limited by the closure constraint; so, we sample conformations of M using an IK procedure to close the gap between F and B, and test the clash avoidance constraint afterward. The length of M must be large enough for the IK procedure to succeed with high probability, but not too large, since clash avoidance is only tested afterward. In our software, the number of residues in M is set to half of that of LB or to 4, whichever of these two numbers is larger. The numbers of residues of F and B are then selected equal (± 1). Tests show that these choices are close to optimal on average.

Sampling front/back-end conformations. Consider the front-end F. The angles φ and ψ closest to the fixed terminus of F are the most constrained by possible
clashes with the rest of the protein P\L. So, the angles are sampled in the order in which they appear in F, that is, φ1, ψ1, φ2, etc. In this order, each angle φi (resp., ψi) determines the positions of the next two atoms Cβi and Ci (resp., the next three atoms Oi, Ni+1 and Cαi+1). The angle is sampled so that these atoms do not clash with any atom in P\L or any preceding atom in F. Its value is picked at random, either uniformly or according to a user-input probabilistic distribution (e.g., one based on Ramachandran tables). If no value of the angle prevents the two or three atoms it governs from clashing with other atoms, the algorithm backtracks and re-samples a previously sampled angle. Clash-free conformations of the back-end B are sampled in the same way, by starting from its fixed C terminus and proceeding backward.

Sampling mid-portion conformations. Given two non-clashing conformations of F and B such that the gap between them does not exceed the maximal length that M can achieve, a conformation of M is sampled as follows. The values of the φ and ψ angles in M are picked at random, uniformly or according to a given distribution. This leads to a conformation q of M that is connected to F at one end and open at the other end. To close the gap between M and B, we use the IK method described in [6]. This method solves the IK problem analytically, for any sequence of residues in which exactly three pairs of (φ, ψ) dihedral angles are allowed to vary. These pairs need not be consecutive. Our experiments show that, on average, the IK method is the most likely to succeed in closing the gap when one pair is the last one in M and the other two are distributed in M. Let r and s denote the numbers identifying the first and last residue of M in LB. As the IK method is extremely fast, Analytical-IK(q, i, j, s) is called for all i = r, ..., s − 2 and j = i + 1, ..., s − 1, in a random order, until a closed conformation of M has been generated. If this conformation tests clash-free, then the seed sampling procedure constructs a closed clash-free conformation of LB by concatenating the conformations of F, M, and B. If the above operations fail to generate a closed clash-free conformation of M, then they are repeated (with new values of the φ and ψ angles in M) until a predefined maximal number of iterations has been performed.

Placing side-chains. For each conformation of LB sampled from Q^free_closed, we use SCWRL3 [3] to place the side-chains. We may only compute the placements of the side-chains in LB given the placements of the side-chains in P\L. Alternatively, we may (re-)compute the placements of all the side-chains in the protein. In each case, SCWRL3 does not guarantee a clash-free conformation.
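The control flow of seed sampling might be sketched as follows; sample_angle, clash_free, and analytical_ik are hypothetical stand-ins for the components described above, not LoopTK's API.

import random

def sample_segment(residues, sample_angle, clash_free, max_backtracks=100):
    # Sample phi/psi angles one by one, in order, so that each newly placed
    # atom group is clash-free; backtrack and re-sample on failure.
    conf, i, backtracks = [], 0, 0
    while i < len(residues):
        angle = sample_angle(residues[i], conf)   # uniform or Ramachandran
        if angle is not None and clash_free(conf + [angle]):
            conf.append(angle)
            i += 1
        elif conf and backtracks < max_backtracks:
            conf.pop()                            # re-sample a previous angle
            i -= 1
            backtracks += 1
        else:
            return None
    return conf

def close_mid_portion(q, r, s, analytical_ik, clash_free):
    # Try analytical IK on residue pairs (i, j) with i = r..s-2, j = i+1..s-1
    # (the third varying pair being the last residue s), in random order,
    # until the gap closes without clashes.
    pairs = [(i, j) for i in range(r, s - 1) for j in range(i + 1, s)]
    random.shuffle(pairs)
    for i, j in pairs:
        closed = analytical_ik(q, i, j, s)
        if closed is not None and clash_free(closed):
            return closed
    return None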
5 Deformation Sampling
Overview. The deformation sampling procedure is given a "seed" conformation q in Q^free_closed. It first selects a vector in the tangent space T Qclosed(q) of Qclosed at q. By definition, any vector in this space is a velocity vector [φ̇1, ..., ψ̇p]^T that maps to the null velocity of Ω2 (relative to Ω1); hence, it defines a direction of motion that does not instantaneously break loop closure. A new conformation
of LB is then computed as q′ = q + δq, where δq is a short vector in T Qclosed(q). Since the tangent space is only a local linear approximation of Qclosed at q, the closure constraint is in fact slightly broken at q′. So, Analytical-IK(q′, p − 2, p − 1, p) is called to bring the frame Ω2 back to its goal pose Πg. Finally, the atoms in LB are tested for clashes among themselves and with the rest of the protein. If a clash is detected, the procedure exits with failure. The deformation sampling procedure may be run several times with the same seed conformation q to explore the subset of Q^free_closed around q. Alternatively, each run may use the conformation generated at the previous run as the new seed, to generate a "pathway" in the set Q^free_closed.

Computation of a basis of the tangent space. To select a direction in T Qclosed(q), we must first compute a basis for this space. This can be done as follows [19]. Let J(q) be the 6 × n Jacobian matrix that maps the velocity q̇ = [φ̇1, ..., ψ̇p]^T of the dihedral angles in LB at q to the velocity [ẋ, ẏ, ż, α̇, β̇, γ̇]^T of Ω2, i.e.:

    [ẋ, ẏ, ż, α̇, β̇, γ̇]^T = J(q) q̇.

J(q) can be computed analytically using techniques presented in [4]. For simplicity, assume that J has full rank (i.e., 6) at q. A basis of T Qclosed(q) is built by first computing the Singular Value Decomposition UΣV^T of J(q), where U is a 6 × 6 unitary matrix, Σ is a 6 × n matrix with non-negative numbers on the diagonal and zeros off the diagonal, and V is an n × n unitary matrix [10]. Since the rows 7, ..., n of V^T do not affect the product J(q)q̇, their transposes form an orthogonal basis N(q) of T Qclosed(q).

Selection of a direction in the tangent space. The deformation sampling procedure may select a direction in T Qclosed(q) at random. However, in most cases, it is preferable to minimize an objective function E(q). Let y = −∇E(q) be the negated gradient of E at q, and yN = N N^T y the projection of y into T Qclosed(q). The deformation sampling procedure selects the increment δq along yN. In this way, all the DOFs left available in LB by the closure constraint are used to move the conformation in the direction that most reduces E. E(q) may be a function of the distances between the closest pairs of atoms at conformation q (where each pair consists of one atom in LB and one atom in either P\L or LB). Minimizing E then leads deformation sampling to increase the distances between these pairs of atoms, if this goal does not conflict with the closure constraint. In this way, deformation sampling picks increments δq that have a small risk of causing steric clashes. Another interesting objective function leads to moving a designated atom A in LB toward a desired position xd. This objective function can be defined as E(q) = ‖xA(q) − xd‖², where xA(q) is the position of A when LB's conformation is q. This function can be used to iteratively move an atom as far as possible along selected directions, to explore the boundary of Q^free_closed. E can also be an energy function, or any weighted combination of functions, each designed to achieve a distinct purpose.

Placing side-chains. For each new conformation of LB, side-chains can be placed using SCWRL3, as described in Section 4. Another possibility is to provide an initial seed conformation that already contains the loop's side-chains to the deformation sampling procedure. These side-chains are then considered rigid, and the procedure deforms LB so that the produced conformation remains clash-free.
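A numerical sketch of this projection using numpy; the Jacobian J and the gradient of E are assumed to be supplied by the analytical formulas cited above, and the step size is an illustrative parameter.

import numpy as np

def tangent_step(J, grad_E, step=1e-2):
    # Project the negated gradient of E onto the null space of the 6 x n
    # Jacobian J, i.e., onto the tangent space T Q_closed(q), and take a
    # short step along the projection (y_N = N N^T y).
    U, S, Vt = np.linalg.svd(J)        # J = U diag(S) V^T
    rank = int(np.sum(S > 1e-10))      # 6 when J has full rank
    N = Vt[rank:]                      # rows spanning the null space of J
    y = -np.asarray(grad_E, dtype=float)
    y_N = N.T @ (N @ y)
    norm = np.linalg.norm(y_N)
    return step * y_N / norm if norm > 0 else y_N

# q_new = q + tangent_step(J, grad_E); an Analytical-IK call on the last
# three residues then repairs the slightly broken closure at q_new.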
Table 1. Testset of 20 loops (see main text for comments)

Protein id  Protein size  Loop start  Loop size  Seed sampling (s)  Naive sampling (s)
1XNB        185           SER 31      5          0.22               0.21
1TYS        264           THR 103     5          0.06               0.06
1GPR        158           SER 74      6          0.38               0.38
1K8U        89            GLU 23      7          0.21               0.20
2DRI        271           GLN 130     7          0.42               0.46
1TIB        269           GLY 172     8          2.49               13.03
1PRN        289           ASN 215     8          0.33               0.66
1MPP        325           ILE 214     9          0.53               99.85
4ENL        436           LEU 136     9          1.46               19.35
135L        129           ASN 65      9          0.77               1.54
3SEB        238           HIS 121     10         0.50               3.80
1NLS        237           ASN 216     11         1.30               5.51
1ONC        103           MET 23      11         2.26               5.66
1COA        64            VAL 53      12         19.02              67.49
1TFE        142           GLU 158     12         0.48               8.14
8DFR        186           SER 59      13         2.02               39.36
1THW        207           CYS 177     14         1.48               9.84
1BYI        224           GLU 115     16         2.52               >800
1G5A        628           GLY 433     17         3.28               >800
1HML        123           GLY 51      25         17.74              >800
6 Results
Seed sampling. Table 1 lists the 20 loops, with sizes ranging from 5 to 25 residues, which we used to perform computational tests. Each row lists the PDB id of the protein, the number of residues in the protein, the first residue of the loop, the number of residues in the loop, and the average time to sample one closed clash-free conformation of the loop using each of two distinct procedures. Some loops protrude from the proteins and have much empty space in which they can deform without clashes (e.g., 3SEB), while others are very constrained by the other protein residues (e.g., 1TIB). The loop in 1MPP is constrained in the middle by side-chains protruding from the rest of the protein. In the results presented below, all φ and ψ angles were picked uniformly at random (i.e., no biased distributions, like the Ramachandran ones, were used). Each picture in Figure 1 displays a subset of the backbone conformations generated by seed sampling for the loops in 1TIB, 3SEB, 8DFR, and 1THW. The loop in 1TIB, which resides at the middle of the protein, has very little empty space to move in.
Fig. 1. Some backbone conformations generated by seed sampling for the loops in (a) 1TIB (8-residue loop), (b) 3SEB (10-residue loop), (c) 8DFR (13-residue loop), and (d) 1THW (14-residue loop)
The PDB conformation of the loop in 1THW (shown green in the picture) bends to the right, but our method also found clash-free conformations that are very different. Each picture in Figure 2 shows the distribution of the middle Cα atom in 100 sampled conformations of the loops in proteins 1K8U, 1COA, 1G5A, and 1MPP, along with a few backbone conformations. The loops in 1K8U and 1COA have relatively large empty space to move in, whereas the loops in 1G5A and 1MPP are restricted by the surrounding protein residues. These figures illustrate the ability of seed sampling to generate conformations broadly distributed across the closed clash-free conformation space of a loop.

The average running time (in seconds) to compute one closed clash-free conformation of each loop is shown in Table 1 (column 5). Each average was obtained by running the procedure until it generated 100 conformations of the given loop and dividing the total running time by 100. (The algorithms are written in C++ and run under Linux; running times were obtained on a 3 GHz Intel Pentium processor with 1 GB of RAM.) The last column of Table 1 gives the average running time of the "naive" procedure that first samples closed conformations of the loop backbone and next rejects those which are not clash-free. In both procedures, the factor ε used to define steric clashes (see Section 3) was set to 0.75. Our seed sampling procedure does not break a loop into 3 segments if it has fewer than 8 residues.
Fig. 2. Positions of the middle Cα atom (red dots) in 100 loop conformations computed by seed sampling for four proteins: (a) 1K8U (7-residue loop), (b) 1COA (12-residue loop), (c) 1G5A (17-residue loop), and (d) 1MPP (9-residue loop)
So, the running times of both procedures for the first 5 proteins are essentially the same. For all other proteins, our procedure is faster than the naive procedure, sometimes by a large factor (188 times faster for the highly constrained loop in 1MPP). For the last 3 proteins, the naive procedure failed to sample 100 conformations after running for more than 80,000 seconds. Not surprisingly, the running times vary significantly across loops. Short loops with much empty space around them take a few tenths of a second to sample, while long loops with little empty space can take a few seconds to sample. The loops in 1COA and 1HML take significantly more time to sample than the others. In the case of 1COA, it is difficult to connect the loop's front-end and back-end (3 residues each) with its mid-portion (6 residues). As Figure 5 shows, the termini of the loop are far apart and the protein constrains the loop all along. Due to the local shape of the protein at the two termini of the loop, many sampled front-ends and back-ends tend to point in opposite directions, which then makes it often impossible to close the mid-portion without clashes. In this case, we got a better average running time (4 seconds, instead of 19) by setting the length of the mid-portion to 8 residues (instead of 6). The loop in 1HML is inherently difficult to sample. Not only is it long, but there is also little empty space available for it.
Fig. 3. RMSD histograms for one loop

Fig. 4. Twenty conformations of the loop in 1MPP generated by deforming a given seed conformation along randomly picked directions
Figure 3 displays two RMSD histograms generated for the loop in 3SEB. The red (resp., yellow) histogram was obtained by sampling 100 (resp., 1000) conformations of the corresponding loop and plotting the frequency of the RMSDs between all pairs of conformations. The near-identity of the two histograms indicates that the sampled conformations spread quickly in Q^free_closed. Similar histograms were generated for other loops. For rather long loops, any seed sampling procedure that samples Q^free_closed broadly can only produce a coarse distribution of samples. Indeed, for a loop with n dihedral angles, a set of N evenly distributed conformations defines a grid with N^(1/(n−6)) discretized values for each of the n − 6 dimensions of Q^free_closed. If n = 18 (9-residue loop), a grid with 3 discretized values per dimension requires sampling 3^12 = 531,441 conformations. However, deformation sampling makes it possible to sample more densely "interesting" regions of Q^free_closed.

Deformation sampling. Figure 4 shows 20 conformations of the loop in 1MPP generated by deformation sampling around a conformation computed by seed sampling. To produce each conformation, the deformation sampling procedure started from the same seed conformation and selected a short vector δq in T Qclosed(q) at random. This figure illustrates the ability of deformation sampling to explore Q^free_closed around a given conformation. Figure 5 shows a series of closed clash-free conformations of the loop in 1COA successively sampled by pulling the N atom (shown as a white dot) of THR 58 away from its initial position along a given direction until a steric clash occurs (white circle). The initial conformation shown in red was generated by seed sampling, and the side-chains were placed without clashes using SCWRL3. Each other conformation was sampled by deformation sampling starting at the previously sampled conformation and using the objective function E defined in Section 5. Only the backbone was deformed, and each side-chain remained rigid. Steric clashes were tested for all atoms in the loop. Figure 6 displays the volume (shown green) reachable by the 5th Cα atom in the loop of 1MPP. This volume was obtained by sampling 20 seed conformations of the loop and, for each of these conformations, pulling the 5th Cα atom along several randomly picked directions until a clash occurs. The volume shown green was obtained by rendering the atom at all the positions it reached.
Fig. 5. Deformation of the loop in 1COA by pulling the N atom (white dot) of THR 58 along a specified direction
Fig. 6. Volume reachable by the 5th Cα atom in the loop of 1MPP
Fig. 7. Use of deformation sampling to remove steric clashes involving side chains
The running time of deformation sampling depends on the objective function. In the above experiments, it is less than 0.5 seconds per sample on average.

Placement of side chains. Our software calls SCWRL3 to place side chains. The result, however, is not guaranteed to be clash-free. We ran the seed sampling procedure to sample conformations of the backbones of the loops in 1K8U, 2DRI, 1TIB, 1MPP, and 135L, with the uniform and Ramachandran sampling distributions for the dihedral angles (see Section 4). For each loop, we sampled 50 conformations with the uniform distribution and 50 with the Ramachandran distribution. We then checked each conformation for steric clashes. Table 2 reports the number of clash-free conformations for each loop and each of the two distributions. As expected, the backbone conformations generated using the Ramachandran distribution facilitate the clash-free placement of the side chains.
Table 2. Number of clash-free placements of side chains for five loops

Protein              1K8U  2DRI  1TIB  1MPP  135L
Uniform                 7     9     1     0     9
Ramachandran plots     18    14     6     4    13
When seed sampling generates a conformation q of a loop backbone for which SCWRL3 computes a side-chain placement that is not clash-free, deformation sampling can be used to sample more conformations around q, to produce one where the side chains are placed without clashes. In Figure 7(a), a conformation (shown in blue) of the backbone of the loop in 1MPP was generated using seed sampling and the side chains were placed by SCWRL3. However, there are clashes between two side chains. In Figure 7(b), a conformation (shown in yellow) was generated by the deformation sampling procedure using the conformation in (a) as the start conformation. The new placement of the side chains computed by SCWRL3 is free of clashes. Once such a clash-free conformation has been obtained, many other clash-free conformations can be quickly generated around it, again using deformation sampling, as shown in Figure 4.
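The repair loop just described is simple to express. The following Python sketch is ours, not the authors' code; seed_sampler, scwrl3_place, has_clashes and deform_around are hypothetical stand-ins for seed sampling (Section 4), SCWRL3 placement, clash testing, and deformation sampling around a conformation (Section 5).

    def sample_clash_free(seed_sampler, scwrl3_place, has_clashes,
                          deform_around, max_tries=50):
        q = seed_sampler()                 # closed, clash-free backbone
        conf = scwrl3_place(q)             # side chains may still clash
        for _ in range(max_tries):
            if not has_clashes(conf):
                return conf                # clash-free conformation found
            q = deform_around(q)           # deformation-sample near q
            conf = scwrl3_place(q)
        return conf if not has_clashes(conf) else None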
7 Conclusion
We have described two algorithms to sample the space of closed clash-free conformations of a flexible loop. The seed sampling algorithm produces broadly distributed conformations. It is based on a novel prioritized constraint-satisfaction approach that interweaves the treatment of the clash-avoidance and closure constraints. The deformation sampling algorithm uses these conformations as starting points to explore certain regions of the space more finely. It is based on the computation of the null space of the loop backbone at its current conformation. Tests show that these algorithms can efficiently handle loops ranging from 5 to 25 residues in length. We have successfully used early versions of these algorithms to interpret fuzzy regions in electron-density maps obtained from X-ray crystallography [19]. Our current and future work aims at applying them to other problems, in particular function-driven homology (where available functional information is used to limit the search for adequate loop conformations) and ligand-protein binding.

Acknowledgements. This work has been partially supported by NSF grant DMS-0443939. Peggy Yao was supported by a Bio-X graduate fellowship.
References

1. Bruccoleri, R.E., Karplus, M.: Conformational sampling using high-temperature molecular dynamics. Biopolymers 29, 1847–1862 (1990)
2. Canutescu, A., Dunbrack Jr., R.: Cyclic coordinate descent: A robotics algorithm for protein loop closure. Protein Sci. 12, 963–972 (2003)
3. Canutescu, A., Shelenkov, A., Dunbrack Jr., R.: A graph theory algorithm for protein side-chain prediction. Protein Sci. 12, 2001–2014 (2003)
4. Chang, K.S., Khatib, O.: Operational space dynamics: Efficient algorithm for modeling and control of branching mechanisms. In: Proc. IEEE Int. Conf. on Robotics and Automation, San Francisco, CA, pp. 850–856. IEEE Computer Society Press, Los Alamitos (2000)
5. Cortes, J., Simeon, T., Renaud-Simeon, M., Tran, V.: Geometric algorithms for the conformational analysis of long protein loops. J. Comp. Chem. 25, 956–967 (2004)
6. Coutsias, E.A., Seok, C., Jacobson, M.P., Dill, K.A.: A kinematic view of loop closure. J. Comp. Chem. 25, 510–528 (2004)
7. Deane, C.M., Blundell, T.L.: A novel exhaustive search algorithm for predicting the conformation of polypeptide segments in proteins. Proteins: Struc., Func., and Gene. 40, 135–144 (2000)
8. DePristo, M.A., de Bakker, P.I.W., Lovell, S.C., Blundell, T.L.: Ab initio construction of polypeptide fragments: efficient generation of accurate, representative ensembles. Proteins: Struc., Func., and Gene. 51, 41–55 (2003)
9. Fiser, A., Do, R.K.G., Sali, A.: Modeling of loops in protein structures. Protein Sci. 9, 1753–1773 (2000)
10. Golub, G., van Loan, C.: Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore, MD (1996)
11. Halperin, D., Overmars, M.H.: Spheres, molecules and hidden surface removal. Comp. Geom. Theory and App. 11, 83–102 (1998)
12. Jacobson, M.P., Pincus, D.L., Rapp, C.S., Day, T.J.F., Honig, B., Shaw, D.E., Friesner, R.A.: A hierarchical approach to all-atom protein loop prediction. Proteins: Struc., Func., and Bioinf. 55, 351–367 (2004)
13. Kolodny, R., Guibas, L., Levitt, M., Koehl, P.: Inverse kinematics in biology: the protein loop closure problem. Int. J. Robotics Research 24, 151–163 (2005)
14. Okazaki, K., Koga, N., Takada, S., Onuchic, J.N., Wolynes, P.G.: Multiple-basin energy landscapes for large-amplitude conformational motions of proteins: Structure-based molecular dynamics simulations. PNAS 103, 11844–11849 (2006)
15. Sauder, J.M., Dunbrack Jr., R.: Beyond genomic fold assignment: rational modeling of proteins in biological systems. J. Mol. Biol. 8, 296–306 (2000)
16. Shehu, A., Clementi, C., Kavraki, L.E.: Modeling protein conformational ensembles: from missing loops to equilibrium fluctuations. Proteins: Struc., Func., and Bioinf. 65, 164–179 (2006)
17. Sousa, S.F., Fernandes, P.A., Ramos, M.J.: Protein-ligand docking: current status and future challenges. Proteins: Struc., Func., and Bioinf. 65, 15–26 (2006)
18. Tosatto, S.C.E., Bindewald, E., Hesser, J., Männer, R.: A divide and conquer approach to fast loop modeling. Protein Eng. 15, 279–286 (2002)
19. van den Bedem, H., Lotan, I., Latombe, J.C., Deacon, A.: Real-space protein-model completion: an inverse-kinematic approach. Acta Cryst. D61, 2–13 (2005)
20. van Vlijmen, H.W.T., Karplus, M.: PDB-based protein loop prediction: parameters for selection and methods for optimization. J. Mol. Biol. 267, 975–1001 (1997)
21. Wedemeyer, W.J., Scheraga, H.A.: Exact analytical loop closure in proteins using polynomial equations. J. Comp. Chem. 20, 819–844 (1999)
Algorithms for the Extraction of Synteny Blocks from Comparative Maps

Vicky Choi¹, Chunfang Zheng², Qian Zhu², and David Sankoff²

¹ Department of Computer Science, Virginia Tech, Blacksburg, VA 24061
[email protected]
² Departments of Biology, Biochemistry, and Mathematics and Statistics, University of Ottawa, Ottawa, Canada K1N 6N5
{czhen033,qzhu012,sankoff}@uottawa.ca
Abstract. In comparing genomic maps, we try to distinguish mapping errors and incorrectly resolved paralogies from genuine rearrangements of the genomes. This can be formulated as a Maximum Weight Independent Set (MWIS) search, where vertices are potential strips of markers syntenic on both genomes and edges join conflicting strips, the goal being to extract the subset of compatible strips that accounts for the largest proportion of the data. This problem is computationally hard, since MWIS is NP-hard. We introduce biologically meaningful constraints on the strips, reducing the number of vertices for the MWIS analysis and provoking a decomposition of the graph into more tractable components. New improvements to existing MWIS algorithms greatly improve running time, especially when the strip conflicts define an interval graph structure. A validation of solutions through genome rearrangement analysis enables us to identify the most realistic solution. We apply this to the comparison of the rice and sorghum genomes.
1 Introduction

Comparing two genomic maps containing orthologous sets of markers induces a decomposition of the genomes into synteny blocks, segments of chromosomes containing orthologous markers in the same or reverse order in the two genomes. The blocks may be differently grouped into chromosomes, and differently ordered and oriented, in the two genomes being compared. In the course of genomic evolution, as more and more rearrangements intervene since the common ancestor, the synteny blocks in common between the two genomes become more fragmented, i.e., shorter, and eventually contain only one marker, or none. The construction of the synteny blocks based on traditional comparative maps is different in both spirit and technique from the analogous problem based on genome sequences, and is very vulnerable to errors and ambiguities in the position of the markers on a map, depending on the specific mapping technology. Another kind of problem involves ambiguous homology, leading to the risk of matching up inappropriate pairs of markers as orthologs in the two genomes. These problems tend to artifactually increase the number of synteny blocks induced by the comparison, disrupting true synteny blocks by artifactual blocks containing only one or two markers. Thus, when many rearrangements have intervened since the common ancestor, or where the sampling density of markers on the chromosome is sparse, it may be unclear
whether any particular one of the increasing number of short synteny blocks is due to error or to rearrangement. These considerations suggest the principle that inferences that depend on the position of a single marker should not be given as much weight as inferences that are supported by more markers. We would thus like to construct a set of synteny blocks that are conflict-free, contain as much of the data as possible, and are credible from a genome rearrangement viewpoint. In [9], we proposed the following strategy: first, construct a set of pre-strips, which are certain short common subsequences of one chromosome from each genome; second, extract from this set a subset of mutually compatible (non-intersecting) pre-strips containing a maximum number of markers; third, add to this subset any markers that do not increase the rearrangement distance [7] between the genomes; fourth, assemble the synteny blocks from the markers in the solution. This approach encountered a bottleneck at the second step, formulated in terms of a solution for the NP-hard maximum weight clique (MWC) problem in a graph representing pre-strip compatibilities. It was not feasible to run the whole data set using available algorithms. Thus we devised biologically motivated constraints to reduce the data set and were then able to run moderate-size instances. In this paper, our main contributions are: first, based on a key combinatorial observation, the establishment of constraints on the set of pre-strips that are necessary to a solution, thus reducing the amount of data that must be input to MWIS without losing optimality (Section 3); and second, the design of a new algorithm for the maximum weight independent set (MWIS) problem (equivalent to the MWC formulation on the complementary graph in [9]), specifically motivated by the nature of pre-strip data (Section 4.1). Finally, taking advantage of the source of the incompatibilities in the chromosome-based data, we propose a natural decomposition of the graph which allows us to solve relatively large instances of the problem extremely efficiently – 1 to 2 seconds on a Pentium IV computer for instances that took days or that proved infeasible with the previous techniques. As a prerequisite to this material, in Section 2 we review the definition of strips and pre-strips, as well as a polynomial-time algorithm for generating all pre-strips. After the theoretical development, we discuss the question of restoring additional markers to the solution in Section 6 and analyze the rice and sorghum comparative map in Section 7.
2 Problem and Terminology: Strips, Pre-strips, Pure Strips

Let n be the number of markers in common in two genomes with χ_1 and χ_2 chromosomes. In one genome, number all these markers on any one of the chromosomes from left to right in increasing order, starting with marker 1. Continue the numbering sequence on a second chromosome, and so on, until finishing with the n-th marker on the χ_1-st chromosome. Then each marker in the second genome receives the same label as its supposed ortholog in the first genome. We recall the definition of strips, pre-strips and pure strips from [9]. Consider any l ≥ 2 contiguous markers on a chromosome in one genome. If the same l markers are consecutive on a chromosome in the other genome, with the same (or
reverse) order and with each marker having the same (or opposite) orientation (reading direction, i.e., DNA strand) in both genomes, they constitute a forward strip (resp., reverse strip) of length l. Note that many or most of the markers in a comparative map may not be in any strip. The synteny blocks in the decomposition of the two genomes we are looking for are all strips, but many of these blocks will not be visible in the original data since they are disrupted by erroneously mapped markers and mistaken orthologs, so we have to construct them by discarding the markers disrupting their contiguity property.
ORIGINAL   Genome 1: abcdef  lmnoprq  wxyz
           Genome 2: lbcdpz  -x-q-o-m  we-fry  na
REDUCED    Genome 1: abcd  lmoq  wyz
           Genome 2: lbcdz  -q-o-m  wy  a

Pre-strips: bcd, bc, cd, moq, mo, oq, wy, lp
Pure strip: bcd
Strips: bcd, moq, wy
Singletons not in pre-strips but compatible: a, l, z
Common subsequences that are not pre-strips: bd, mq
Discarded as noise: e, f, n, p, r, x

Fig. 1. Strips and pre-strips. "-" indicates markers with different orientations in the two genomes.
Maximal Strip Recovery (MSR) problem: Given two genomes as described above, discard some subset of the markers, leaving only markers in disjoint strips S_1, ..., S_r of lengths w_1, ..., w_r, respectively, in the genomes thus reduced, such that Σ_{i=1}^{r} w_i is maximized. The MSR problem corresponds to our previously stated goal of constructing a set of compatible strips containing as much of the data as possible. We will search for pre-strips in the two genomes, relying on the subsequent analyses to eliminate the disrupting markers and thus reveal the "underlying" strips. This is illustrated in Fig. 1. A pre-strip P is a common subsequence, or a reverse common subsequence, of the markers on the two chromosomes, such that there is no other marker of appropriate orientation on both chromosomes that is between two successive markers in P. For example, if AB is a pre-strip, then there does not exist C such that ACB is a pre-strip. For all reverse pre-strips, this is indicated by minus signs on the markers involved in the second genome only. Notice that a pre-strip satisfies the same definition as a strip, except that the markers need not be contiguous. A pre-strip that is a strip in the original genome data, and is not contained in another strip, is called a pure strip. Remark: Strips are defined relative to the current state of the two genomes, either before, during or after reducing their size, but pre-strips and pure strips are defined in terms of the original genome data only. In [9] it is shown that every pre-strip P has a unique representation as a string of p's and 1's, where a p represents a pure strip and a 1, called a singleton, represents a marker not in a pure strip. Moreover,
Proposition 1. Any pre-strip can be uniquely represented by a sequence of terms of the form p, 11, 1p, p1, 111 and 1p1.

Proposition 2. All possible strips that can be formed by the deletion of markers from two genomes, and that can be part of a solution to the MSR problem, are pre-strips of these genomes.

Consequently, it suffices to consider only pre-strips of the forms mentioned in Proposition 1. All such pre-strips can be calculated by an algorithm requiring O(n^4) time in the worst case. In practice, the running time is far less. In the following, we show that we can further reduce the set of pre-strips to be considered, and define the conflict graph.
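Before moving on, a small illustration of the strip definitions of this section. The following Python sketch is ours, under the simplifying assumption of one chromosome per genome; it finds forward strips, i.e., maximal runs of length ≥ 2 that are contiguous with identical order and orientation in both genomes. Reverse strips are found symmetrically by matching against the reversed, orientation-flipped chromosome.

    def forward_strips(chrom1, chrom2, min_len=2):
        # A chromosome is a list of (marker, orientation) pairs.
        strips = []
        for i in range(len(chrom1)):
            for j in range(len(chrom2)):
                # skip starts that lie mid-strip (keep only maximal runs)
                if i > 0 and j > 0 and chrom1[i - 1] == chrom2[j - 1]:
                    continue
                l = 0
                while (i + l < len(chrom1) and j + l < len(chrom2)
                       and chrom1[i + l] == chrom2[j + l]):
                    l += 1
                if l >= min_len:
                    strips.append([m for m, _ in chrom1[i:i + l]])
        return strips

    # First chromosomes of the Fig. 1 genomes: the only forward strip
    # on this pair of chromosomes is the pure strip bcd.
    g1 = [(m, +1) for m in "abcdef"]
    g2 = [(m, +1) for m in "lbcdpz"]
    print(forward_strips(g1, g2))          # [['b', 'c', 'd']]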
3 Data Reduction and the Conflict Graph

We say two pre-strips P and Q are in conflict if they share at least one marker, or if one pre-strip, say P, contains a marker between two successive markers, in either genome, of the other pre-strip, Q. Otherwise P and Q are compatible. Let k-pure-strip denote a pure strip of length k. Then we have (proof omitted):

Lemma 1. All k-pure-strips with k ≥ 4 are part of a solution to the MSR problem.

Corollary 1. Pre-strips that are in conflict with a k-pure-strip for k ≥ 4 are not included in the solution to the MSR problem.

In fact, we can eliminate further pre-strips (proof omitted):

Corollary 2. Pre-strips of the form 1p, p1, 111, 1p1 that are in conflict with a k-pure-strip for k ≥ 3 are not included in the solution to the MSR problem.

By these two reductions (Corollaries 1 and 2), we can generate pre-strips more efficiently, namely generate them "on the fly" with the k-pure-strips acting as "terminators". The corollaries also imply that we need not treat any marker in k-pure-strips for k ≥ 3 as a singleton in a reverse pre-strip. At most one marker in each 2-pure-strip need be considered as a singleton in a reverse pre-strip. We define the conflict graph G = (V, E), where V consists of all pre-strips after reduction, and E consists of all pairs of conflicting pre-strips. The conflict graph is the complement of the compatibility graph defined in [9], but it has an important interval-graph-related property (cf. Section 5 below).

Graph theory terminology and notation. Let G = (V, E) be a simple undirected graph. For v ∈ V, the set of neighbors of v in G is denoted by nbr(v) = {u ∈ V : uv ∈ E}. For S ⊂ V, G[S] = (S, E_S) is called a (vertex-)induced subgraph of G, where E_S = {uv ∈ E : u, v ∈ S}. The complement of G = (V, E) is denoted by Ḡ = (V, Ē), where Ē = {uv : u ≠ v ∈ V, uv ∉ E}. For S ⊆ V, S is called an independent set if for all u, v ∈ S, uv ∉ E; S is called a clique if for all u, v ∈ S, uv ∈ E; S is a vertex cover of G if for every uv ∈ E, either u or v is in S. A linear ordering of G = (V, E) with |V| = n is a bijection φ : V → [n] = {1, ..., n}. When φ is understood, we denote φ^{-1}(i) by v_i and write V_i = {v_1, v_2, ..., v_i}, for i = 1, ..., n.
For 1 ≤ i ≤ n, we define the right neighbors of v_i to be rnbr(v_i) = {v_j ∈ V : v_i v_j ∈ E, j < i}. We consider the vertex-weighted graph, where the weights of the vertices are given by a function w : V → Z^+. For S ⊆ V, the weight of S is w(S) = Σ_{v∈S} w(v).

MWIS problem: Given a vertex-weighted graph G = (V, E) with weight function w : V → Z^+, find an independent set S of G such that w(S) is maximized. We denote the optimum independent set by mis(G).

3.1 Reformulation as Maximum Weight Independent Sets (MWIS)

By propositions from [9], the MSR problem (Section 2) is just the maximum weight independent set (MWIS) problem on the conflict graph G, where the weight of each vertex is the number of markers in the corresponding pre-strip.

Proposition 3. Given any set C of pairwise compatible pre-strips, consider the reduced genomes produced by deleting all markers that are in none of the pre-strips in C. In these reduced genomes all of the markers in each pre-strip in C appear as strips. The number of markers in each strip is the same as in the corresponding pre-strip.

Proposition 4. The solution mis(G) of the MWIS problem on G induces a reduction of the original genomes so that they are composed completely of disjoint strips and so that the total strip score is maximized.

It is well known that the MWIS problem, equivalent to the Maximum Weight Clique (MWC) problem and the Minimum Weight Vertex Cover (MWVC) problem, is NP-hard. Exact algorithms and heuristics have been developed for these problems. The most recent MWC algorithm is due to Kumlander [3], itself a minor improvement of Ostergard's [5,6] algorithm.
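For very small instances, the MWIS problem just stated can be solved by exhaustive search. The following Python reference implementation is ours; it is exponential-time and useful only for validating faster algorithms, but it follows the definitions above directly.

    from itertools import combinations

    def mwis_bruteforce(vertices, edges, w):
        # Enumerate all vertex subsets; keep the heaviest independent one.
        E = {frozenset(e) for e in edges}
        best, best_w = set(), 0
        for r in range(len(vertices) + 1):
            for S in combinations(vertices, r):
                if any(frozenset(p) in E for p in combinations(S, 2)):
                    continue                     # S is not independent
                weight = sum(w[v] for v in S)
                if weight > best_w:
                    best, best_w = set(S), weight
        return best, best_w

    # Example: a triangle a-b-c with a pendant vertex d attached to c.
    # mwis_bruteforce('abcd', [('a','b'),('b','c'),('a','c'),('c','d')],
    #                 {'a': 2, 'b': 3, 'c': 4, 'd': 2}) -> ({'b','d'}, 5)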
4 Maximum Weight Independent Sets (MWIS)

In the following, we will first describe a linear-time algorithm for the MWIS problem on interval graphs. We then describe improvements on one of the best exact algorithms – Ostergard's algorithm [5,6] – for the MWIS problem on general graphs. Our improvement consists of (1) better upper and lower bounds (for pruning the search tree); and (2) the ordering of the vertices. In particular, we give a characteristic of a good ordering, partially answering an open problem in [5,6]. Suppose G is linearly ordered, V = {v_1, ..., v_n}. As in [5,6], we consider the induced subgraphs incrementally: G[V_1], G[V_2], ..., G[V_n]. Recall that mis(G) is a maximum-weight independent set of G. Define s_i = w(mis(G[V_i])), for i = 1, ..., n. Thus, we have s_1 = w(v_1) and s_n = w(mis(G)), the weight of the maximum independent set sought. It is easy to see that s_{i-1} ≤ s_i ≤ s_{i-1} + w(v_i). If v_i ∉ mis(G[V_i]), then s_i = s_{i-1}. If v_i ∈ mis(G[V_i]), then by definition, s_i = w(v_i) + w(mis(G[V_{i-1} \ rnbr(v_i)])). Denote this quantity by s_i^1. Hence, to compute s_i incrementally, we compute s_i^1, compare it with s_{i-1}, and set s_i to be the larger of the two.
In the following, we first recall the definition of interval graphs. We then describe a linear-time algorithm for the MWIS problem on interval graphs, which motivated our improved algorithm for the MWIS problem on general graphs.

4.1 Linear-Time Algorithm for MWIS on Interval Graphs

Definition 1. A graph G = (V, E) is an interval graph if and only if it admits an interval graph realization: there exists a set of intervals in one-to-one correspondence with the vertices such that there is an edge between two vertices if and only if their corresponding intervals overlap.

Theorem 1. A graph G = (V, E) is an interval graph if and only if there is a linear ordering of G such that the right neighbors of each vertex are consecutive: i.e., there is an ordering V = {v_1, v_2, ..., v_n} such that for i > j > k, if v_i v_k ∈ E, then v_i v_j ∈ E.

Such an ordering of G is called an I-ordering [1]. An I-ordering can be obtained in linear time (e.g., by the 5-SWEEP LBFS algorithm [2]). Note that for an I-ordering, rnbr(v_i) = {v_{i-1}, v_{i-2}, ..., v_{i-t}} for some t ≥ 0, and V_{i-1} \ rnbr(v_i) = {v_1, v_2, ..., v_{i-t-1}}. Thus we have s_i = max{ s_{i-1}, w(v_i) + s_{i-t-1} }, where the first term corresponds to v_i ∉ mis(G[V_i]) and the second to v_i ∈ mis(G[V_i]). Thus one easily obtains a linear-time, O(|V| + |E|), algorithm for the MWIS problem on interval graphs (compared with the O(|V|^2)-time algorithm of [4]).

4.2 Improved Algorithm for the MWIS Problem on General Graphs

Our algorithm improves upon the branch-and-bound algorithm of Ostergard [5,6]. Branch-and-bound algorithms build a search tree, which associates to each node a current partial solution set and the remaining working set. Critical to the basic branch-and-bound algorithm for the MWC/MWIS problem are a good ordering of the vertices, a good lower bound on the weight of the maximum independent set of the graph, and a good upper bound on the weight of independent sets of the induced subgraph on the working set. For example, colouring the vertices (of the graph and its complement) can be used to obtain both upper and lower bounds. (Indeed, Kumlander's minor improvement over Ostergard's algorithm lies in efficiently computing upper bounds on the working set based on a greedy colouring of the entire graph.) Ostergard resolves the tight lower bound problem by incrementally computing the MWIS of the graph. This method actually gives a best possible bound that, once attained, terminates the search. The motivation for the incremental method, however, is to get a tighter upper bound, namely s_i, where i is the maximum of the remaining working set. The key to our algorithm is the observation that, with no extra work, we can get a better upper bound by dividing the working set into two parts: a disruption list and a consecutive prefix. Recall that if v_i ∈ mis(G[V_i]), then s_i = w(v_i) + w(mis(G[V_{i-1} \ rnbr(v_i)])). In general, we have V_{i-1} \ rnbr(v_i) = {d_{i_1}, ..., d_{i_s}} ∪ V_{i_t}, as shown in Figure 2. We call D_i = {d_{i_1}, ..., d_{i_s}} the disruption list of v_i, and V_{i_t} the consecutive prefix. (If G is an interval graph, we can order the vertices such that D_i = ∅.)
Fig. 2. v_i is adjacent to the black vertices
Thus, a tighter upper bound can be obtained from an upper bound on the disruption list combined with the exact solution for the consecutive prefix. Moreover, we only need to branch on the disruption list, in contrast to the entire working set as in [5,6]. Further, by partitioning the working set into two parts, we also get a good lower bound for the working set (which Ostergard's algorithm lacks), namely the exact solution of the consecutive prefix.
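A minimal Python rendering of the Section 4.1 recurrence, as a sketch rather than the authors' implementation. The graph is assumed to be given in an I-ordering by its vertex weights and by t[i] = |rnbr(v_{i+1})| (zero-based lists):

    def mwis_interval(w, t):
        # w[i]: weight of v_{i+1}; t[i]: number of right neighbors, so
        # that rnbr(v_i) = {v_{i-1}, ..., v_{i-t}} in the I-ordering.
        n = len(w)
        s = [0] * (n + 1)                         # s[i] = w(mis(G[V_i]))
        for i in range(1, n + 1):
            skip = s[i - 1]                       # v_i not in the optimum
            take = w[i - 1] + s[i - 1 - t[i - 1]] # v_i in the optimum
            s[i] = max(skip, take)
        return s[n]

    # Path v_1 - v_2 - v_3 with weights 2, 3, 4 (t = [0, 1, 1]):
    # mwis_interval([2, 3, 4], [0, 1, 1]) -> 6, i.e., the set {v_1, v_3}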
5 Union of Interval Graphs; Decomposition of the Conflict Graph

5.1 Union of Interval Graphs

Recall that in the conflict graph G = (V, E), each vertex v ∈ V corresponds to a pre-strip P(v). For u, v ∈ V, uv ∈ E if and only if P(v) and P(u) are in conflict. Recall that a pre-strip corresponds to two copies of a subsequence of markers, one copy from a chromosome in each genome. We say two pre-strips P(u) and P(v) conflict in genome 1 (resp., 2) if their copies in genome 1 (resp., 2) conflict. By definition, P(u) and P(v) conflict if they conflict in genome 1, or in genome 2, or in both. For i = 1, 2, let E_i = {uv : P(u) and P(v) conflict in genome i, u, v ∈ V}. Then we have E = E_1 ∪ E_2. Further, according to the observation before Section 4.1, a good ordering will have |D_i| as small as possible; a possible objective function for an ordering is Σ_i |D_i|. If G is an interval graph, we can find an ordering such that Σ_i |D_i| = 0. The 5-SWEEP LBFS algorithm gives a good ordering if the graph is "close" to an interval graph. Since the chromosomes are linear, when considering only one genome, each pre-strip corresponds to an interval of a line (chromosome), and two pre-strips are in conflict if and only if their corresponding intervals overlap. Therefore, G_1 = (V, E_1) and G_2 = (V, E_2) are interval graphs. However, G = (V, E) = (V, E_1 ∪ E_2) is not necessarily interval; e.g., a four-cycle can be formed with v_1 v_2, v_2 v_3, v_3 v_4 ∈ E_1 and v_1 v_4 ∈ E_2. In fact, the graph G is in general not an interval graph. In our experience [1], if the graph is only locally distorted, the ordering produced by the 5-SWEEP LBFS algorithm will also be distorted only locally. Namely, the ordering fails only in the local region of a forbidden subgraph, i.e., the vertices before and after the forbidden subgraph satisfy the right-neighborhood consecutiveness property. In other words, if our graph is only locally distorted, then the ordering produced by the 5-SWEEP LBFS algorithm will be a good ordering for our MWIS algorithm. Motivated by this observation, and together with the ideas in [9], we propose the following natural decomposition of our conflict graph.
5.2 Natural Decomposition

Our idea here is to find a small subset of vertices, called separators, such that the removal of these vertices results in a set of computationally tractable connected components (here, locally distorted interval components), and such that the solution on this set of connected components has good properties: either it is a good approximation to the optimal solution of the original problem, or it has biologically desirable properties. The separator vertices ideally correspond to errors in the original genomic maps. Recall that a pre-strip consists of a common subsequence of markers in two genomes. Two markers in a pre-strip may well be located far from each other in a genome. Note that the larger the gap a pre-strip has, the more pre-strips it can conflict with, because conflicts occur when a marker from some other pre-strip falls in the gap. Thus, a computationally and biologically well-motivated way to approximate the MWIS solution is to remove the pre-strips with the largest gaps, that is, to choose the large-gap pre-strips as separators. One would then expect that the non-interval components will only be locally distorted, due to the gap constraints. Indeed, in the example to be discussed in Section 7, if we remove all pre-strips with gap > 4, then the graph decomposes into 36 components, all but one of which are interval. For the only non-interval component, there is one I-critical vertex [1], that is, the component becomes interval when the vertex is removed. As the gap bound increases, the number of connected components decreases and the total number of vertices in non-interval connected components increases. Nevertheless, even when we retain pre-strips with rather large gap sizes, these components are only locally non-interval, and our algorithm based on the LBFS ordering is still very efficient (within one to two seconds). See Table 1 for the statistics. Another way to choose separators is to exclude pre-strips containing only two markers separated by a gap of any non-zero size. Biologically speaking, such a strip is the weakest kind of evidence for a synteny block other than singletons (markers in no pre-strips, which are never even considered in the MWIS input).
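The gap-based decomposition just described can be sketched in a few lines of Python. This sketch is ours; prestrip_gaps (mapping each pre-strip id to its largest gap) and conflicts (the list of conflicting pairs) are assumed inputs, computed from the comparative maps as in Sections 2 and 3.

    from collections import defaultdict

    def gap_decompose(prestrip_gaps, conflicts, max_gap):
        # Drop the large-gap pre-strips (the separators), then split the
        # remaining conflict graph into connected components.
        kept = {p for p, g in prestrip_gaps.items() if g <= max_gap}
        adj = defaultdict(set)
        for u, v in conflicts:
            if u in kept and v in kept:
                adj[u].add(v)
                adj[v].add(u)
        components, seen = [], set()
        for start in kept:
            if start in seen:
                continue
            comp, stack = [], [start]
            seen.add(start)
            while stack:
                u = stack.pop()
                comp.append(u)
                for x in adj[u] - seen:
                    seen.add(x)
                    stack.append(x)
            components.append(comp)
        return components   # each component is then solved by MWIS separately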
6 Restoration of Markers

The MWIS solution is incompatible with any pre-strip not in it, but it is not necessarily incompatible with all parts of such a pre-strip. For example, it is possible that some pre-strip of the form p1 is not in the solution, but the singleton element in this pre-strip does not intervene between any two successive markers of a pre-strip in the solution, and may thus be considered compatible. In addition, singleton markers, in no pre-strip, which play a role neither in the input nor the output of the MWIS, could similarly be compatible with the solution. Since there is no way of identifying, in real data, exactly which markers excluded from the MWIS solution are valid evidence of evolutionary relatedness or divergence of the two genomes, and which are simply erroneous, we have recourse to genome rearrangement analysis. First we use the strips output by the MWIS to calculate the genomic distance between the two genomes [7]. If we were to add a new marker at random ("noise") to both genomes, this would generally increase the distance by 1 or 2, even if it were compatible with all the strips in the solution. Thus, if we add a
marker from among those not in the MWIS, and this does not increase the distance, this means that when one genome is optimally transformed into the other, the new marker falls naturally into place with no extra effort and is fully consistent with the inferred evolutionary history of all the markers in the solution.
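A sketch of this restoration step (ours, not the authors' code); genomic_distance is a stand-in for a multichromosomal rearrangement distance routine such as Tesler's [7], and candidates are the left-out markers compatible with the MWIS strips.

    def restore_markers(solution, candidates, genomic_distance):
        # Greedily add a compatible leftover marker only when it leaves
        # the rearrangement distance unchanged.
        base = genomic_distance(solution)
        for m in candidates:
            trial = solution | {m}
            if genomic_distance(trial) <= base:   # no extra rearrangement
                solution, base = trial, genomic_distance(trial)
        return solution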
7 A Comparison of the Rice and Sorghum Genomes

We compare maps of the rice and sorghum genomes. The construction of the data set, based on resources in [8], is described in [9]. In this comparison, the database reports 567 correspondences between the two genomes, involving n_1 = 481 rice markers and n_2 = 567 sorghum markers. The number of distinct markers in common was n = 481. A total of 69 of these were present in two or more copies in the sorghum data, with a maximum gene family size of 6. The inclusion of paralogous genes is shown in [9] to create no problems for the biological interpretation of the analysis, to require only slight modifications of the definitions in Section 2, and to affect the computation simply by increasing the number of pre-strips. Our algorithm for generating pre-strips produced 1853 pre-strips to enter as vertices into the MWIS routine, which exceeded the capability of our algorithm and, indeed, of the state-of-the-art MWC programs that were tried on it. The results of the analysis on the data reduced by the techniques of Sections 3 and 5 are shown in Table 1. The first thing to note is that even after all possible compatible markers, consistent with the output rearrangement distance, are restored, only 292–324 are present, meaning that 157–189 were discarded, out of the maximum possible represented by n_1 = 481. This illustrates the importance of analyzing the marker data to remove errors and conflicts. Another observation is the slight increase in the number of markers in the output as the gap size criterion is relaxed from gap < 2 to gap < 9, despite the great increase in the number of pre-strips. Thus the extra pre-strips proved to be largely redundant.

Table 1. Pre-strip inclusion criteria and solution characteristics. Strips out, total markers (including restored markers) and distance are averages over ten solutions. Comps/non-int refers to the number of components in the MWIS input and the number of vertices not in interval components.
                          "11"s included                             "11"s excluded
gap <   pre-    reduced  comps/    strips  total  dist-    pre-    reduced  comps/    strips  total  dist-
        strips           non-int   out     mark.  ance     strips           non-int   out     mark.  ance
9        894     739     12/542    127     324     72       616     565     20/275    100     306     52
8        836     700     15/505    126     322     72       577     533     24/203    100     306     52
7        771     654     20/385    124     321     67       529     492     26/149     98     306     53
6        709     608     23/296    125     321     69       484     451     26/129     96     302     51
5        646     567     28/170    126     323     70       449     424     29/88      98     304     51
4        550     503     44/80     126     322     70       382     368     40/46      97     303     51
3        432     410     53/77     124     320     69       302     293     46/45      96     303     52
2        259     255     79/0      115     318     67       183     180     68/0       91     292     51
Finally, we note the great drop in the genomic distance (typically 18 out of 70) as the "11"s-excluded constraint is added. True, this comes at the cost of losing about 18 markers from the output, but the fact that the distance saved is about equal to the number of markers lost suggests that these markers, coming largely from isolated "11" pre-strips (i.e., ones that could not be incorporated in p11 or 11p pre-strips), do not carry authentic evolutionary information, by the same arguments about noise as in Section 6. Our MWIS program took less than two seconds for each instance in Table 1 (in fact, for "11" excluded, gap was taken up to 15; data not shown) on a Pentium IV 3.0 GHz computer with 2 GB of memory running the Fedora 2 Linux OS. (Our previous program could only run on data with gap < 3 for "11" included, and gap < 4 for "11" excluded.) In Fig. 3, we show the result of applying our method to the sorghum and rice comparative maps, with gap < 15, excluding "11" pre-strips. The confusing pattern of alternating markers from many rice chromosomes on each sorghum chromosome is replaced by a more credible set of long strips.
Fig. 3. Top: 567 markers on sorghum chromosomes, colour-keyed by the rice chromosome containing their homologs. Bottom: 303 compatible markers remaining in optimal set of compatible strips. Note that long regions of a single colour generally consist of several synteny blocks whose order and orientation differ from one genome to the other.
Algorithm 1. An improved algorithm for the MWIS problem

Input: G = (V, E), w
Output: w(mis(G))

  compute an ordering of V = <v_1, v_2, ..., v_n> (default: 5-SWEEP LBFS ordering)
  best-so-far = s_1 = w(v_1)
  for i = 2 to n do
      compute disruption-list and consecutive-prefix if v_i is included in the solution set
      best-possible = s_{i-1} + w(v_i)
      found = false
      MWIS-BRANCH-AND-BOUND(w(v_i), disruption-list, consecutive-prefix, best-so-far, best-possible, found)
      s_i = best-so-far
  return s_n

MWIS-BRANCH-AND-BOUND(current-weight, disruption-list, consecutive-prefix, best-so-far, best-possible, found)
  if disruption-list is empty then
      if current-weight + weight of mis(consecutive-prefix) > best-so-far then
          best-so-far = current-weight + weight of mis(consecutive-prefix)
          if best-so-far = best-possible then found = true
      return
  if current-weight + upper bound of disruption-list + weight of mis(consecutive-prefix) ≤ best-so-far then
      return                       /* prune the search tree */
  else if current-weight + weight of mis(consecutive-prefix) = best-possible then
      best-so-far = best-possible
      found = true                 /* found the best possible; terminate the search */
      return
  while disruption-list is not empty do
      d = dequeue(disruption-list)
      new-current-weight = current-weight + w(d)
      compute new-disruption-list and new-consecutive-prefix if d is included in the solution set
      MWIS-BRANCH-AND-BOUND(new-current-weight, new-disruption-list, new-consecutive-prefix, best-so-far, best-possible, found)
      if found = true then return
  return
8 Conclusion

We have studied the conversion of the MSR problem to the MWIS problem, based on the elimination of as few markers as possible from the genomes being compared. We have improved the preparation of the conflict graph input to the MWIS by proving that many cases of the six types of small pre-strip need not be considered. Our main result is an improved algorithm for the general MWIS problem that has superior performance
where the data is "close" to an interval graph structure, precisely the type of data that pre-strip conflicts generate. Our implementation of this new algorithm easily handles data sets with 700 vertices and more, realistic values for available comparative maps. Our analysis of the rice-sorghum comparison, examining the trade-off between loss of markers and inflation of genomic distance, confirms that fully 37% of the common markers cannot be confidently assigned to synteny blocks, in the sense that either such blocks would conflict with larger blocks already in the solution, or else the inclusion of each of the markers would require an extra rearrangement event to account for its presence, which is exactly the effect expected from a randomly placed marker. The extent to which our method "cleans up" the comparative map is rather drastic, and we have probably excluded many correctly mapped markers, but their inclusion could not be justified on the basis of the present inventory of common markers. The fact that the map produced by our method shows evidence of a rather small number of translocations between chromosomes, certainly fewer than 10, suggests that inversion (more than 40 events) is the dominant rearrangement process in the evolution of these cereals.
References

1. Choi, V.: BARNACLE: An assembly algorithm for clone-based sequences of whole genomes. Ph.D. dissertation, Rutgers University (2002)
2. Corneil, D.G., Olariu, S., Stewart, L.: The LBFS structure and recognition of interval graphs. Manuscript; cf. The ultimate interval graph recognition algorithm? In: SODA 1998, pp. 175–180 (2006)
3. Kumlander, D.: A new exact algorithm for the maximum-weight clique problem based on a heuristic vertex-coloring and a backtrack search. Manuscript and poster. In: 4th European Congress of Mathematics (2005)
4. Liang, Y.D., Dhall, S.K., Lakshmivarahan, S.: On the problem of finding all maximum weight independent sets in interval and circular-arc graphs. In: IEEE Symposium on Applied Computing, pp. 465–470 (1991)
5. Ostergard, P.R.J.: A new algorithm for the maximum-weight clique problem. Nordic Journal of Computing 8, 424–436 (2001)
6. Ostergard, P.R.J.: A fast algorithm for the maximum clique problem. Discrete Applied Mathematics 120, 195–205 (2002)
7. Tesler, G.: Efficient algorithms for multichromosomal genome rearrangements. Journal of Computer and System Sciences 65, 587–609 (2002)
8. Ware, D., Jaiswal, P., Ni, J., et al.: Gramene: a resource for comparative grass genomics. Nucleic Acids Research 30, 103–105 (2002)
9. Zheng, C., Zhu, Q., Sankoff, D.: Removing noise and ambiguities from comparative maps in rearrangement analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics (forthcoming, 2007), doi:10.1109/TCBB.2007.1075
Computability of Models for Sequence Assembly

Paul Medvedev¹, Konstantinos Georgiou¹, Gene Myers², and Michael Brudno¹

¹ University of Toronto, Canada
{pashadag,cgeorg,brudno}@cs.toronto.edu
² Janelia Farms, Howard Hughes Medical Institute, USA
[email protected]
Abstract. Graph-theoretic models have come to the forefront as some of the most powerful and practical methods for sequence assembly. Simultaneously, the computational hardness of the underlying graph algorithms has remained open. Here we present two theoretical results about the complexity of these models for sequence assembly. In the first part, we show sequence assembly to be NP-hard under two different models: string graphs and de Bruijn graphs. Together with an earlier result on the NP-hardness of overlap graphs, this demonstrates that all of the popular graph-theoretic sequence assembly paradigms are NP-hard. In our second result, we give the first, to our knowledge, optimal polynomial-time algorithm for genome assembly that explicitly models the double-strandedness of DNA. We solve the Chinese Postman Problem on bidirected graphs using bidirected flow techniques and show how to use it to find the shortest double-stranded DNA sequence which contains a given set of k-long words. This algorithm has applications to sequencing by hybridization and short read assembly.
1 Introduction

Most current technologies for sequencing genomes rely on the shotgun method – the genome (or a portion of it) is broken into many small segments (reads) whose sequence is then determined. The problem of combining these reads to reconstruct the source genome is known as sequence (or genome) assembly, and is one of the fundamental algorithmic problems within bioinformatics. One basic assumption made by assembly algorithms is that every read in the input must be present in the original genome. This follows from the fact that it was read from the genome. Motivated by parsimony, some methods made another, less justifiable assumption: that the original genome should be the shortest sequence that contains every read as a substring. This assumption led to the casting of the genome assembly problem as the Shortest Common Superstring (SCS) problem, which is known to be NP-hard [4]. The problem with modeling genome assembly as the SCS problem is that most genomes have repeats – multiple identical, or nearly identical, stretches of DNA – while the SCS solution would represent each of these repeats only once in the assembled genome. This problem is known as over-collapsing the repeats. One way of solving this problem is to build representative strings or structures for each repeat and allow the assembly algorithm to use these multiple times. Pevzner et al. [12] had the insight that by dividing the reads into shorter k-long stretches (called k-mers), all of the instances of a repeat collapse into a single set of vertices.
3’ 5’
GGCAAT
A
P. Medvedev et al.
ATTGCC
5’
290
B
C
3’
Fig. 1. A. An example of double-stranded DNA. The sequence read from this DNA can be either ATTGCC or GGCAAT. B. Three possible types of overlaps between two reads: each read can be in either of two orientations, but two of the cases (both to the left and both to the right) are symmetric. C. The three corresponding types of bidirected edges. The left node corresponds to the lower read. Note that the arrow points into a node if and only if the overlap covers the start (5') of the read.
They represent each read as a walk on a de Bruijn graph (defined below), and the assembly could then be represented as a superwalk – a walk that includes all of the input walks. In this formulation every edge of the de Bruijn graph has to be present in any solution and can be used multiple times. The solution to the assembly problem is formulated as a variation on finding an Eulerian tour, and because the Eulerian tour problem is solvable in polynomial time, this led to the hope of a polynomial algorithm for sequence assembly. This approach was later expanded to A-Bruijn graphs [13], where the initial subdivision into k-mers is not necessary, but the basic algorithmic problem of searching for a superwalk remains. Myers [10] provides an alternative model of sequence assembly, using a string graph. Instead of dividing the reads into k-mers, he builds an overlap graph – a graph where nodes correspond to reads and edges correspond to overlaps (the prefix of one read is the suffix of the other). Through the process of removing redundant edges he is able to classify all edges as either required or optional, and the goal of the assembly is to find the shortest walk which includes all of the required edges. The main algorithmic difference between the de Bruijn/A-Bruijn and the string graph models for sequence assembly is that while in the latter some edges are required and others are optional, in the former all edges are required, but walks have been pre-specified and must be included in the solution. In our first result, we show that sequence assembly with both string graphs and de Bruijn graphs is NP-hard, by reduction from Hamiltonian Cycle and Shortest Common Superstring, respectively. Together, these two proofs demonstrate that both of the popular graph-theoretic sequence assembly paradigms are unsolvable by optimal polynomial-time algorithms unless P = NP. Another algorithmic problem faced by assembly algorithms is the treatment of double-stranded DNA (see Figure 1A). A DNA molecule consists of two strands which are reverse complements of each other. The start (called 5') of one strand complements the end (called 3') of the other. Whenever DNA is sequenced, the molecule is always read in the same direction, from 5' to 3', but it is impossible to know from which of the two strands the sequence is read. Many sequence assembly algorithms use heuristics to determine the strand for each read. The EULER method [12] uses both the reads and their reverse complements to build the de Bruijn graph and searches heuristically for two
"complementary" paths. In the work of Kececioglu and Myers [6], strand selection for a read is formulated as the NP-hard maximum weight cut problem. In 1992, Kececioglu [8] introduced an elegant method for dealing with double-strandedness by modeling overlaps between DNA molecules using a bidirected graph. Each read is represented by a single node, and each overlap (edge) has an orientation at both endpoints. The three types of bidirected edges correspond to the three possible ways in which the overlap can occur (see Figure 1B & C). Bidirected graphs were further used for sequence assembly in [9,10] and to model breakpoint graphs in [7]. Remarkably, however, bidirected graphs have been studied within the context of graph theory since the 1960s, when Edmonds formulated the problem of bidirected flow (a generalization of network flow to bidirected graphs) and showed it equivalent to perfect b-matchings [1]. Edmonds' work was later extended by Gabow [3], who gave the fastest to-date algorithm for bidirected flow. In our second result, we extend Gabow's and Edmonds' work to give a polynomial-time algorithm for solving the Chinese Postman Problem on bidirected graphs. By combining this algorithm with Pevzner's work on de Bruijn graphs [11,12] and Kececioglu's work on modeling strandedness with bidirected graphs [8], we show how it can be used to find the shortest (double-stranded) DNA sequence containing a given set of k-long DNA fragments. To the best of our knowledge, this is the first optimal polynomial-time assembly algorithm which explicitly deals with the double-stranded nature of DNA.
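The orientation rule in the caption of Fig. 1 is easy to encode. In the Python sketch below (ours, not from [8] or [10]), -1 stands for an arrow pointing into a read's node and +1 for one pointing out; this numeric convention is an assumption for illustration only.

    def overlap_edge(covers_start_u, covers_start_v):
        # Per Fig. 1C: the edge points INTO a read's node if and only
        # if the overlap covers that read's 5' start.
        def orient(covers_start):
            return -1 if covers_start else +1
        return orient(covers_start_u), orient(covers_start_v)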
2 Preliminaries

In this section, we give the background and definitions needed for the rest of this paper.

2.1 Strings, Overlap Graphs, De Bruijn Graphs, and Molecules

Let v and w be two strings over the alphabet Σ. The concatenation of these strings is denoted by v · w. The length of v is denoted by |v|. The i-th character of v is denoted by v[i]. If 1 ≤ i ≤ j ≤ |v|, then v[i, j] is the substring beginning at the i-th position and ending at the j-th position, inclusive. If there exist i, j such that v = w[i, j], then we say v is a substring of w. For x ∈ Σ, x^k is x concatenated with itself k times if k ≥ 1, and the empty string otherwise. A string of length k is called a k-mer. The k-spectrum of v is the set of all k-mers that are substrings of v. A k-molecule is a pair of k-mers which are reverse complements of each other. We say a k-molecule corresponds to each of its two constituent k-mers. The k-molecule-spectrum of a DNA molecule is the set of all k-molecules corresponding to the k-mers of the k-spectrum of either of the DNA strands. We say w overlaps v if there exists a maximal-length non-empty string z which is a prefix of w and a suffix of v (notice this definition is not symmetric). The length of the overlap is ov(v, w) = |z|. If w does not overlap v, then ov(v, w) = 0. Let S = {s_1, ..., s_n} be a set of non-empty strings over an alphabet Σ. An overlap graph of S is a complete weighted directed graph where each string in S is a vertex and the length of the edge x → y is |y| - ov(x, y). We say w is a superstring of S if, for all i, s_i is a substring of w. The Shortest Common Superstring (SCS) problem is to find the shortest superstring of S.
Fig. 2. An example of a bidirected graph and its incidence matrix. We draw an edge that is positive-incident to a vertex using an arrow pointing out of the vertex, but this choice of graphical representation is arbitrary.
It was proven to be NP-hard for |Σ| ≥ 2 [4,5]. We define the de Bruijn graph B^k(S) as a directed graph, using a positive integer parameter k. The vertices of B^k(S) are {d ∈ Σ^k | ∃i such that d is a substring of s_i}. We abuse notation here by referring to a vertex in B^k(S) by the k-mer associated with it. The edges are {d[1..k] → d[2..k+1] | d ∈ Σ^{k+1}, ∃i such that d is a substring of s_i}.

2.2 Bidirected Graphs and Flow

Consider an undirected (multi)graph G with a set of vertices V and a set of edges E. The multiplicity of an edge e is the number of edges in G whose endpoints are the same as e's. If the endpoints are distinct, the edge is called a link; otherwise it is a loop. Additionally, we assign orientations to the edges. Every link has two orientations, one with respect to each of its endpoints, while every loop has one orientation. There are two kinds of orientations – positive and negative – and thus we can say an edge is positive-incident or negative-incident to an endpoint. When taken together with the orientations of its edges, G is called a bidirected graph. If there is additionally a weight function w associated with the edges, we say the graph is weighted. The weight of a graph is the sum of the weights of its edges. A bidirected graph is connected if its underlying undirected graph is connected. The orientations of the edges can be represented by an incidence matrix I_G : V × E → {-2, -1, 0, 1, 2} (we omit G when it is obvious from the context). If an edge e is not incident to a vertex x, then I(x, e) = 0. For a link e and a vertex x, I(x, e) = +1 if e is positive-incident to x, and I(x, e) = -1 if e is negative-incident to x. For a loop e and a vertex x, I(x, e) has the value +2 if e is positive-incident to x, and the value -2 if e is negative-incident to x. See Figure 2 for an example of a bidirected graph and its incidence matrix. The in-degree of a vertex x in graph G is defined as deg^-_G(x) = -Σ_{e∈E : I(x,e)<0} I(x, e). Similarly, the out-degree is defined as deg^+_G(x) = Σ_{e∈E : I(x,e)>0} I(x, e). Let bal_G(x) = deg^+_G(x) - deg^-_G(x) = Σ_{e∈E} I(x, e) be the balance at each vertex. G is balanced if the balance of each vertex is 0. An (x_1, x_k)-walk is a sequence x_1, e_1, ..., x_{k-1}, e_{k-1}, x_k where e_i is an edge incident to x_i and x_{i+1}, and for all 2 ≤ i ≤ k - 1, e_{i-1} and e_i have opposite orientations at x_i. Since the specification of vertices is redundant, we may sometimes omit them and
specify a walk as just a sequence of edges. A walk is said to be cyclical if its endpoints are the same and e_1 and e_{k-1} have opposite orientations at x_1. A bidirected graph is strongly connected if it is connected and for every edge there is a cyclical walk containing it. Note that we can view a loopless directed graph as a special kind of bidirected graph, where every edge is positive-incident to one of its endpoints and negative-incident to the other. In this case, the definition of a walk reduces to its usual meaning in directed graphs. However, there are some caveats. For example, it is possible for the shortest walk between two vertices in a bidirected graph to repeat a vertex. In Figure 2, observe that there does not exist a walk between W and Z which does not repeat a vertex, something that is not possible in a directed graph. A Chinese walk is a cyclical walk that traverses every edge at least once. Given a weighted bidirected graph, the Chinese Postman Problem (CPP) is to find a minimum weight Chinese walk (called a Chinese Postman Tour), or report that one does not exist. An Eulerian tour of a graph is a cyclical walk that contains every edge of the graph exactly once, and a graph which contains an Eulerian tour is called Eulerian. The following is a generalization of a well-known fact for directed graphs, whose proof is almost identical to the directed case and is therefore omitted.

Lemma 1. A bidirected graph G contains an Eulerian tour if and only if it is connected and balanced.

Given a bidirected graph G and vectors a, b ∈ Z^{V(G)} and d, c, w ∈ Z^{E(G)}, a minimum cost bidirected flow problem [14] is an integer linear program whose goal is to find x ∈ Z^{E(G)} that minimizes w · x subject to the constraints d ≤ x ≤ c and a ≤ I_G · x ≤ b. Here, · refers to the inner product of two vectors, and ≤ is a component-wise comparison operator.
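A small Python sketch (ours) of the balance computation behind Lemma 1, following the incidence-matrix convention above. Edges are given as (x, sx, y, sy) tuples with orientations sx, sy ∈ {+1, -1}; the connectivity half of the lemma is omitted for brevity.

    def balances(n_vertices, edges):
        # A link contributes sx at x and sy at y; a loop (x == y)
        # contributes 2*sx at its single endpoint, as in the matrix I.
        bal = [0] * n_vertices
        for x, sx, y, sy in edges:
            if x == y:
                bal[x] += 2 * sx
            else:
                bal[x] += sx
                bal[y] += sy
        return bal

    def is_eulerian(n_vertices, edges):
        # Lemma 1: Eulerian iff connected and balanced (connectivity
        # check on the underlying undirected graph omitted here).
        return all(b == 0 for b in balances(n_vertices, edges))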
3 The String Graph Framework

In [10], Myers introduces a string graph framework for sequence assembly. A string graph is built from an overlap graph through the process of transitively inferable edge reduction: whenever y and z overlap x, and z overlaps y, the overlap of z with x is said to be inferable from the other two overlaps, and is removed from the graph. Myers demonstrates a fast algorithm for removing transitively inferable edges from the graph which, in combination with statistical methods, associates a "selection" constraint with each edge. The selection constraint states that the edge must appear in the target genome either at least once (it is required), exactly once (it is exact), or any number of times (it is optional). The key property of string graphs is that any cyclical walk that respects the selection constraints represents a valid assembly of the genome, and the weight of the walk is the length of the assembled genome. After building the string graph, the algorithmic problem is to find a cyclical walk that visits each edge in accordance with its selection constraint. Appealing to parsimony, the goal is to find a walk with minimum weight. In this section, we show that this problem is NP-hard. Formally, a selection function s is a function that classifies each edge into one of three categories: optional, required, exact. We call a walk which contains all the
required edges at least once, all the exact edges exactly once, and all the optional edges any number of times an s-walk. The Minimum s-Walk Problem (MSWP) for a weighted directed graph G and a selection function s is the problem of finding a minimum weight cyclical s-walk of G, or reporting that one does not exist.

Theorem 1. The Minimum s-Walk Problem is NP-hard.

The proof works by reducing the Hamiltonian Cycle problem in directed graphs to MSWP. A cycle is Hamiltonian if it visits every vertex exactly once. The reduction works by splitting each vertex into 'in' and 'out' counterparts and adding a required edge between them, while making all other edges optional. Having optional edges is essential for the reduction; if they are not present, the problem can be efficiently solved using a variant of the algorithm of Section 5.1. Also note that in [10] the edges of the string graph are bidirected in order to reflect the double-strandedness of DNA. Since directed graphs are a special type of bidirected graph, Theorem 1 holds for bidirected graphs as well.

Proof. Let G be a directed graph, with vertices v_1, ..., v_n, for which we wish to find a Hamiltonian cycle. Let G' be a directed graph with vertex set {v_i^-, v_i^+ | 1 ≤ i ≤ n} and edge set O ∪ R, where O = {v_i^+ → v_j^- | (v_i → v_j) ∈ E(G)} and R = {v_i^- → v_i^+ | 1 ≤ i ≤ n}. The weight of each edge is 1. Let s be a selection function on G' that labels all the O edges as optional and all the R edges as required. We show that G has a Hamiltonian cycle if and only if G' has a cyclical s-walk of weight at most 2n. First, suppose C = v_{i_1} → ... → v_{i_n} → v_{i_1} is a Hamiltonian cycle of G. Then C' = v_{i_1}^- → v_{i_1}^+ → v_{i_2}^- → v_{i_2}^+ → ... → v_{i_n}^- → v_{i_n}^+ → v_{i_1}^- is a cyclical s-walk in G' of weight 2n. For the other direction, let C' be a cyclical s-walk in G' of weight at most 2n. Because the R edges form a matching and all n of them must be in C', the edges of C' must alternate between R and O edges, and thus there are a total of n edges of each kind. If we remove all the R edges from C' and map all the vertices of C' to their counterparts in G, we get a Hamiltonian cycle of G.
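The construction in the proof of Theorem 1 can be written down directly; a Python sketch (ours), with vertices of G' encoded as (i, '-') and (i, '+') and all edge weights equal to 1:

    def mswp_instance(n, arcs):
        # Split each vertex v_i into v_i^- and v_i^+; the original arcs
        # become optional edges, the splitting edges are required.
        optional = [((i, '+'), (j, '-')) for (i, j) in arcs]
        required = [((i, '-'), (i, '+')) for i in range(n)]
        return optional, required          # every edge has weight 1

    # G has a Hamiltonian cycle iff the resulting instance admits a
    # cyclical s-walk of weight at most 2n.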
4 The De Bruijn Graph Framework

One of the original graph-theoretic frameworks for sequence assembly was proposed by Pevzner, Tang, and Waterman in [12]. They note that by tiling every read with (k + 1)-mers they can view the read as a walk in a de Bruijn graph, where the vertices are k-mers and the edges are (k + 1)-mers. Thus, any walk that contains all the reads as subwalks represents a valid assembly. Consequently, they formulate the assembly problem as finding the shortest superwalk, a problem closely related to the polynomial-time Eulerian tour problem (which was previously used to solve the problem of sequencing by hybridization [11]). What we show in this section is that the de Bruijn graph framework does not make the problem of read assembly more tractable.

Let S = {s_1, ..., s_n} be a set of strings over an alphabet Σ and let G = B^k(S) be the de Bruijn graph of S for some k. The strings s_i correspond to walks in B^k(S) via the function w(s) = s[1..k] → s[2..k+1] → ... → s[|s|−k+1..|s|]. A walk is called a superwalk of G if, for all i, it contains w(s_i) as a subwalk. Thus, a superwalk
Fig. 3. An example of the reduction from Shortest Common Superstring to De Bruijn Superwalk. The set of strings S is over the alphabet {A,C,G,T}, and the graph drawn is B^2(f(S)). The cycles in the edge decomposition are C_A, C_C, C_G, C_T and have three edges each. As an example, the walk w(f(ATT)) starts at the central node and is C_A followed by C_T followed by C_T again.
represents a valid assembly of the reads into a genome. Within this framework, the goal of finding a parsimonious genome assembly is to find a minimum length superwalk. The assembly algorithm of [12] looks for such a superwalk; however, it uses heuristics and may not produce the correct answer. Formally, given a set of strings S as defined above and a positive integer k, the De Bruijn Superwalk Problem (BSP) is to find a minimum length superwalk in B^k(S), or report that one does not exist. Observe that since every edge in B^k(S) is covered by at least one walk w(s_i), a superwalk traverses every edge at least once.

We shall show that BSP is NP-hard by a reduction from the Shortest Common Superstring (SCS) problem. Informally, we transform each string by inserting ⋄^k between every two characters, as well as at the beginning and the end, where ⋄ is a special character that does not appear in the input strings. For example, we transform the string 'abc' into '⋄^k a ⋄^k b ⋄^k c ⋄^k'. This transformation preserves overlaps and introduces a ⋄^k overlap between otherwise non-overlapping strings. The idea is that while a superstring can be built by appending non-overlapping strings, a superwalk must correspond to a string built by overlaps of at least k characters. See Figure 3 for an illustration of the de Bruijn graph on a set of transformed strings.

Theorem 2. The De Bruijn Superwalk Problem is NP-hard for |Σ| ≥ 3 and for any positive integer k.

Proof. SCS is NP-hard even if the size of the alphabet is 2 [5]. We reduce an instance of SCS to an instance of BSP which has an additional character ⋄ in the alphabet. Let S = {s_1, ..., s_n} be the set of strings of an SCS instance, and let Σ be the set of characters appearing in S. We define a function f(s)[i] for 1 ≤ i ≤ k(|s| + 1) + |s| as follows: for all i divisible by k + 1, f(s)[i] = s[i/(k+1)]; for all other i, f(s)[i] = ⋄. Let G = B^k(f(S)), where f(S) = {f(s_i) | 1 ≤ i ≤ n}.

We first make some observations about G, which follow directly from the definition of de Bruijn graphs and from f. The vertices of G, which are the k-mers appearing in f(S), are {⋄^k} ∪ {⋄^{k−i} x ⋄^{i−1} | x ∈ Σ, 1 ≤ i ≤ k}. The edges of G are
{E_x | x ∈ Σ}, where E_x = {⋄^k → ⋄^{k−1}x} ∪ {x⋄^{k−1} → ⋄^k} ∪ {⋄^{k−i} x ⋄^{i−1} → ⋄^{k−i−1} x ⋄^i | 1 ≤ i ≤ k − 1}. The edge set of G forms a disjoint union of cycles ∪_{x∈Σ} C_x, where C_x = ⋄^k → ⋄^{k−1}x → ⋄^{k−2}x⋄ → ... → ⋄x⋄^{k−2} → x⋄^{k−1} → ⋄^k. We also note that w(f(s_i)) = w(⋄^k s_i[1] ⋄^k ... ⋄^k s_i[|s_i|] ⋄^k) = C_{s_i[1]} → ... → C_{s_i[|s_i|]}. For an illustration see Figure 3.

Now we show that the length of the shortest superwalk of G is k + 1 times the length of the shortest superstring of S. First, suppose s is a superstring of S. Let w = C_{s[1]} → ... → C_{s[|s|]}. We claim that w is a superwalk of G of length |s|(k + 1). We have to show that w(f(s_i)) is a subwalk of w for all i. Since s_i is a substring of s, there are some j and j' such that s_i = s[j..j']. Then w(f(s_i)) = C_{s[j]} → ... → C_{s[j']}, which is indeed a subwalk of w.

Now, suppose w is a superwalk of G. Every edge that appears before the first ⋄^k and after the last ⋄^k in w can be removed from w while preserving it as a superwalk. Therefore, we can assume that the first and last vertex of w is ⋄^k, and w can be uniquely expressed as a sequence of cycles C_{j_1} → ... → C_{j_{|w|/(k+1)}}. Let s = j_1 · j_2 ··· j_{|w|/(k+1)}. For all i, since w(f(s_i)) is a subwalk of w, we can write it as w(f(s_i)) = C_{j_m} → ... → C_{j_{m+|s_i|−1}} for some m. By definition, w(f(s_i)) = C_{s_i[1]} → ... → C_{s_i[|s_i|]}. Since the decomposition of a walk into cycles C_x is unique, we conclude that s_i[t] = j_{m+t−1} for 1 ≤ t ≤ |s_i|. Therefore, s_i is a substring of s, and s is a superstring of length |w|/(k+1).
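As an illustration of the padding transformation f and the de Bruijn graph it induces, consider the following sketch (ours, not from the paper); '#' stands in for the special character ⋄, and the helper names are hypothetical:

```python
def pad_string(s, k, pad='#'):
    """Insert pad^k before, between, and after the characters of s,
    mirroring the transformation f in the proof of Theorem 2.
    For s='abc', k=2 it returns '##a##b##c##'."""
    assert pad not in s
    return pad * k + (pad * k).join(s) + pad * k

def de_bruijn_edges(strings, k):
    """Collect the (k+1)-mer edges of B^k(strings): each edge goes from
    the first k characters to the last k characters of a (k+1)-mer."""
    edges = set()
    for s in strings:
        for i in range(len(s) - k):
            kp1 = s[i:i + k + 1]
            edges.add((kp1[:k], kp1[1:]))
    return edges

# Each padded string traces the cycles C_a, C_b, C_c through the pad^k hub.
print(sorted(de_bruijn_edges([pad_string('abc', 2)], 2)))
```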
5 Assembly of Double-Stranded DNA with Bidirected Flow

In this section, we demonstrate the first, to our knowledge, polynomial-time algorithm for assembly of a double-stranded genome. First, we give a polynomial time algorithm for solving the Chinese Postman Problem (CPP) on bidirected graphs. Subsequently, we show how to construct a bidirected de Bruijn graph from the set of k-long molecules present in the genome (the k-molecule-spectrum). By solving the CPP on the resulting graph we are able to reconstruct the shortest DNA molecule with the given k-molecule-spectrum.

5.1 The Bidirected Chinese Postman Problem

Given a weighted bidirected graph G, recall that the Chinese Postman Problem (CPP) is to find a minimum weight Chinese walk of G, or report that one does not exist. CPP is polynomial-time solvable on both undirected and directed graphs [2]. It becomes NP-hard on mixed graphs, which are graphs with both directed and undirected edges [5]. For undirected graphs, CPP is reducible to minimum cost perfect matchings. For directed graphs, it is reducible to minimum cost network flow. In this section, we give an efficient algorithm for solving CPP on bidirected graphs via a reduction to minimum cost bidirected flow.

We will show in Lemma 2 that for G to have a Chinese walk it is necessary and sufficient for it to be strongly connected. To find a min-weight Chinese walk, first consider the case where G is Eulerian. An Eulerian tour of G is also a Chinese walk, since it visits every edge exactly once. Furthermore, since any Chinese walk has to visit every edge at least
1: if G is not connected then return "no Chinese walk exists"
2: Use the algorithm of [3] to solve the corresponding minimum cost bidirected flow (see text).
3: if there is no solution then return "no Chinese walk exists"
4: Let G' be the graph G with f_e copies of every edge e, in addition to e itself.
5: Use a standard algorithm to find an Eulerian circuit C of G'.
6: Relabel C according to Theorem 3.
7: return C

Fig. 4. Algorithm for the Chinese Postman Problem on bidirected graphs
once, the Eulerian tour is also a Chinese postman tour. In the general case, however, when G is not Eulerian, our approach is to make the graph Eulerian by duplicating some of the edges, and then use a standard algorithm to find an Eulerian tour. We shall prove that if we minimize the total weight of the duplicated edges, the Eulerian tour we find in the modified graph corresponds to a Chinese postman tour in the original graph. Formally, we say a graph G' is an extension of G if it can be obtained from G by duplicating some of its edges. The Eulerization Problem (EP) is to find a min-weight Eulerian extension of G, or report that one does not exist. The following theorem shows that CPP and EP are polynomially equivalent.

Theorem 3. There exists a Chinese walk of weight i if and only if there exists an Eulerian extension of weight i. Moreover, they can be derived from each other in polynomial time.

Proof. For the only if direction, let W be a Chinese walk in G. Let W̃ be the graph induced by W, where the multiplicity of each edge is the number of times it is traversed by W. Then W̃ is an extension of G because W visits every edge at least once. Also, W is an Eulerian circuit of W̃ whose weight is that of W. Thus W̃ is an Eulerian extension of G with the weight of W. For the if direction, let G' be an Eulerian extension of G. Let W' be an Eulerian circuit in G'. Construct W from W' by replacing every edge e' ∈ G' by the edge e ∈ G such that e' is a duplicate of e. W is thus a valid cyclical walk in G which visits every edge at least once and whose weight is the same as that of W' and of G'.

Now, we give an algorithm for the Eulerization Problem. First, we consider the case where G is not connected. Since any extension of G will also not be connected, our algorithm can safely report that there is no Eulerian extension of G, and hence no Chinese walk. For the case where G is connected, we formulate EP as a min-cost bidirected flow problem. First, we represent an extension G' of G with |E(G)| variables, where each variable f_e represents the number of additional copies of edge e in G'. It is clear that an assignment of non-negative integers to these variables corresponds to an extension of G, and vice versa. Now, we would like to formulate EP in terms of these variables instead of in terms of an extension. The minimization criterion is the weight of the extension, which is Σ_e w_e(1 + f_e). The criterion that G' is Eulerian is, by Lemma 1, the criterion that it is connected and balanced. The connectivity criterion is redundant, since G is connected and thus any extension of G must also be connected. The balance condition for each vertex x can be stated as Σ_e I_G(x, e) · f_e + bal_G(x) = 0. That is, the
balance of x in G' is the balance of x in G plus the contribution of all the copied edges. We are now able to formulate EP as the following integer linear program:

    minimize    Σ_e w_e f_e
    subject to  Σ_e I_G(x, e) · f_e = −bal_G(x)    for each vertex x
                f_e ≥ 0                             for each edge e
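As a brief sketch (ours) of how this program's data can be assembled, assuming the incidence function I_G(x, e) is available as a callable; the actual solving is left to a min-cost bidirected flow routine such as Gabow's [3], which we do not implement here:

```python
def eulerization_lp(vertices, edges, weight, incidence):
    """Assemble the objective and balance constraints of the EP integer
    program (a sketch, not the paper's implementation).

    vertices  -- iterable of vertex names
    edges     -- iterable of edge names
    weight    -- dict edge -> w_e
    incidence -- I(x, e): +1/-1 contribution of edge e to the balance of
                 vertex x, 0 if e does not touch x
    Returns (objective, constraints): per-edge objective coefficients and,
    per vertex x, the constraint (coefficients, right-hand side -bal(x)).
    """
    bal = {x: sum(incidence(x, e) for e in edges) for x in vertices}
    objective = {e: weight[e] for e in edges}        # minimize sum w_e f_e
    constraints = {
        x: ({e: incidence(x, e) for e in edges}, -bal[x])
        for x in vertices
    }
    return objective, constraints
```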
From the definition in Section 2.2, this is actually a minimum cost bidirected flow problem, which can be solved using Gabow's algorithm [3]. Our final algorithm for CPP on bidirected graphs is given in Figure 4. For the running time, we need to bound the size of the solution:

Lemma 2. G has an Eulerian extension if and only if it is strongly connected. Moreover, the min-weight Eulerian extension has at most 2|E||V| edges.

Proof. If G has an Eulerian extension, then it must be connected, and for every edge there is a cyclical walk containing it (namely the one induced by the Euler tour). Conversely, suppose that G is strongly connected. For every edge, we can duplicate all the other edges of the shortest cyclical walk that contains it, thus balancing the graph. Now, suppose G' is a min-weight Eulerian extension of G. We can decompose G' into a set of minimal cycles. Each cycle must contain an edge that no other cycle contains; otherwise it could be removed from G' to get an extension of smaller weight. Therefore, there are at most |E| cycles, and each cycle contains at most 2|V| edges.

Gabow's algorithm runs in time O(|E|^2 log(|V|) log(C)), where C is the largest capacity (C = max c(e), using the definition of Section 2.2). By the above lemma, C = O(|V|^3) if the graph is simple, so the running time for finding the flow, and thus for the whole algorithm, is O(|E|^2 log^2(|V|)).

5.2 The Bidirected De Bruijn Graph

In an earlier work [11], Pevzner showed that the de Bruijn graph B^{k−1} can be used to represent the k-spectrum of a string, and that the (directed) Chinese postman tour on this graph corresponds to the shortest string with the given k-spectrum. When working with double-stranded DNA molecules, however, it is necessary to model k-molecules instead of k-mers in the de Bruijn graph. To do this, Pevzner includes both of the k-mers associated with every k-molecule in the de Bruijn graph. He then searches for two "complementary" walks, each corresponding to one of the DNA strands (see Figure 5). Instead, we show how to construct a bidirected de Bruijn graph where each k-molecule is represented only once.

Our input is the k-molecule-spectrum of the genome. We arbitrarily label one of the k-mers associated with each k-molecule as coming from the "positive" strand and the other from the "negative" strand. Let the nodes of the bidirected de Bruijn graph be all of the possible (k − 1)-molecules. For every k-molecule in the spectrum, let z be one of its two k-mers. Let x and y be the (k − 1)-molecules corresponding to z[1..k − 1] and z[2..k], respectively. We make an edge between the vertices corresponding to x and y.
Fig. 5. Given the k-molecule-spectrum {ATT/AAT, TTG/CAA, TGC/GCA, GCC/GGC, CCA/TGG, CAA/TTG, AAC/GTT}, Pevzner et al.'s [12] approach builds the graph on the left and searches for two complementary paths. The bidirected de Bruijn graph is on the right; one tour that includes all of the edges spells ATTGCCAAC on the forward strand and GTTGGCAAT on the reverse.
This edge is positive-incident to x if z[1..k − 1] is the positive strand of x, and negative-incident otherwise. It is negative-incident to y if z[2..k] is the positive strand of y, and positive-incident otherwise. Note that this edge construction is identical to the one defined by Kececioglu [8] for an overlap between two DNA molecules (also see Figure 1).

The Chinese postman tour of the resulting bidirected de Bruijn graph corresponds to the shortest assembly of the DNA molecule with the given k-molecule-spectrum. The proof follows from the construction: every k-molecule from the spectrum is represented by exactly one edge in the graph. Every valid assembly of the genome corresponds to a walk in the bidirected de Bruijn graph. Because the Chinese postman tour is the shortest such walk, it is also the shortest assembly of the genome. The tour also corresponds to both of the DNA strands. Because a walk is required to use edges with opposite orientations to enter and leave every vertex, but is allowed to enter on either a positively or negatively oriented edge, the Chinese postman tour can be "walked" in either of two directions. If we enter a node on a positive-incident edge we use the positive k-mer; if on a negative-incident edge, the negative k-mer. The two directions correspond exactly to the two strands of DNA, and the sequences "spelled" by them are reverse complements. For the running time, because the de Bruijn graph has a constant degree at every node (|E| ∈ Θ(|V|)), the overall running time is O(|V|^2 log^2(|V|)) using the algorithm of Section 5.1.
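A sketch of the edge construction (ours): we label the lexicographically smaller of a k-mer and its reverse complement as the "positive" strand, which is one arbitrary choice among the labelings the text allows:

```python
def revcomp(s):
    comp = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
    return ''.join(comp[c] for c in reversed(s))

def canon(kmer):
    """The 'positive' strand of a molecule: here, the lexicographically
    smaller of the k-mer and its reverse complement (an arbitrary choice)."""
    return min(kmer, revcomp(kmer))

def bidirected_de_bruijn(spectrum):
    """Build bidirected edges from a k-molecule-spectrum (a sketch).

    spectrum -- iterable of k-mers, one representative per k-molecule
    Returns tuples (x, x_orient, y, y_orient) where x, y are canonical
    (k-1)-molecules and '+'/'-' mark positive-/negative-incidence.
    """
    edges = []
    for z in spectrum:
        left, right = z[:-1], z[1:]
        x, y = canon(left), canon(right)
        x_orient = '+' if left == x else '-'   # positive-incident to x iff
                                               # z[1..k-1] is x's positive strand
        y_orient = '-' if right == y else '+'  # negative-incident to y iff
                                               # z[2..k] is y's positive strand
        edges.append((x, x_orient, y, y_orient))
    return edges

# The spectrum of Figure 5:
print(bidirected_de_bruijn(['ATT', 'TTG', 'TGC', 'GCC', 'CCA', 'CAA', 'AAC']))
```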
6 Discussion

In this work we showed that the assembly problem is NP-hard under both the de Bruijn graph and the string graph models. While this result makes it impractical to look for polynomial time exact algorithms for either of these problems, we believe our work suggests two important areas of investigation. The first is to characterize the computational difficulty of the genome assembly models on real-world genomes. It is well known that
many NP-hard problems are efficiently solvable when restricted to particular classes of inputs. The success of both the de Bruijn and string graph models in practice indicates that, by defining a more restricted model of inputs that nevertheless covers most actual genomes, we may be able to create a model for sequence assembly that can be solved exactly in polynomial time. At the same time, real-life genomes contain repeats, making it unlikely that any real genome will have a unique solution under either the string graph or the de Bruijn graph assembly model. Consequently, it is important to explore what a realistic objective function for an assembly algorithm should be. Conducting a rigorous study of these questions is a promising avenue for improving assembly programs.

In our second result we showed that the computational difficulty of sequence assembly is not due to the double-strandedness of DNA. By unifying Pevzner's work on de Bruijn graphs, Kececioglu's and Myers' work on bidirected graphs in assembly, and Edmonds' and Gabow's work on bidirected flow, we are able to demonstrate an optimal polynomial-time assembly algorithm that explicitly deals with double-strandedness. We believe the use of bidirected flow as a technique will be fruitful for other sequence assembly problems, including the assembly of short DNA reads coming from novel sequencing technologies such as Illumina and 454.
Acknowledgments

We would like to thank Allan Borodin for helpful comments and careful reading of the manuscript. This work was partially supported by an NSERC Discovery Grant to MB.
References

1. Edmonds, J.: An introduction to matching. Notes of engineering summer conference, University of Michigan, Ann Arbor (1967)
2. Edmonds, J., Johnson, E.L.: Matching, Euler tours, and the Chinese postman. Mathematical Programming 5, 88–124 (1973)
3. Gabow, H.N.: An efficient reduction technique for degree-constrained subgraph and bidirected network flow problems. In: STOC, pp. 448–456 (1983)
4. Gallant, J., Maier, D., Storer, J.A.: On finding minimal length superstrings. J. Comput. Syst. Sci. 20(1), 50–58 (1980)
5. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, New York (1979)
6. Kececioglu, J.D., Myers, E.W.: Combinatorial algorithms for DNA sequence assembly. Algorithmica 13(1/2), 7–51 (1995)
7. Kececioglu, J.D., Sankoff, D.: Exact and approximation algorithms for sorting by reversals, with application to genome rearrangement. Algorithmica 13(1/2), 180–210 (1995)
8. Kececioglu, J.D.: Exact and approximation algorithms for DNA sequence reconstruction. PhD thesis, Tucson, AZ, USA (1992)
9. Myers, E.W.: Toward simplifying and accurately formulating fragment assembly. Journal of Computational Biology 2(2), 275–290 (1995)
10. Myers, E.W.: The fragment assembly string graph. In: ECCB/JBI, p. 85 (2005)
11. Pevzner, P.A.: l-Tuple DNA sequencing: computer analysis. J. Biomol. Struct. Dyn. 7(1), 63–73 (1989)
12. Pevzner, P.A., Tang, H., Waterman, M.S.: An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. USA 98, 9748–9753 (2001)
13. Pevzner, P.A., Tang, H., Tesler, G.: De novo repeat classification and fragment assembly. In: RECOMB, pp. 213–222 (2004)
14. Schrijver, A.: Combinatorial Optimization, vol. A. Springer, Heidelberg (2003)
Fast Algorithms for Selecting Specific siRNA in Complete mRNA Data

Jaime Davila, Sudha Balla, and Sanguthevar Rajasekaran

CSE Department, University of Connecticut
{jdavila,ballasudha,rajasek}@engr.uconn.edu
Abstract. The Specific Selection Problem arises from the need to design short interfering RNA (siRNA) aimed at gene silencing. These short sequences target specific messenger RNA (mRNA) and cause the degradation of that mRNA, inhibiting the synthesis of the protein it encodes. In [11] this problem was solved in a reasonable amount of time when restricted to the design of siRNA for a particular mRNA, but that approach becomes too time consuming when designing siRNA for every mRNA in a given organism. We devise simple algorithms based on sorting and hashing techniques that allow us to solve this problem for the complete human mRNA data in less than 4 hours, obtaining a speedup of almost two orders of magnitude over previous approaches.
1 Introduction
The Specific Selection Problem arises from the need to design short interfering RNA (siRNA) aimed at gene silencing [2]. These short sequences target specific messenger RNA (mRNA) and cause the degradation of that mRNA, inhibiting the synthesis of the protein it encodes. These sequences are short, typically consisting of between 20 and 25 nucleotides. In practice a length of 21 is used, and two of the nucleotides are usually predetermined, so the problem becomes one of designing sequences of length 19.

An important criterion in the design of the siRNA is that the sequence should minimize the risk of off-target gene silencing caused by hybridization with the wrong mRNA. This hybridization may occur because the number of mismatches between the sequence and an unrelated sequence is too small, or because they share a long enough subsequence.

In [11] this problem was considered in the context of designing an siRNA that would target a particular mRNA sequence. However, that approach becomes computationally very demanding in the case of selecting such siRNA for every possible mRNA in a given organism. In this paper we design simple algorithms that solve this problem by making use of sorting techniques. The algorithms are shown to be practical when processing the complete mRNA data of human and Drosophila, running in less than 4 hours and outperforming previous approaches [11,9,12].

In this paper we tackle the problem that arises from constraints that consider mismatches with unintended sequences. Some other constraints have also been
considered in the literature [8,10]. These other constraints can be taken into account in pre- or post-processing stages.
2 Specific Selection Problem
We are interested in identifying short l-mers that will target a particular mRNA sequence while minimizing hybridizations with other sequences. Hybridizations could occur, for example, if the number of mismatches between the designed l-mer and an l-mer of another sequence is low (say, less than 3). In this section, we define the problem in formal terms as the (l, d) Specific Selection Problem and we consider practical and efficient algorithms that solve it.

2.1 Problem Definition
We denote by d_H(x, y) the Hamming distance between two strings x and y, i.e., the number of mismatches between x and y.

Definition 1. Let x and s be strings over Σ with |x| = l, |s| = n and l < n.
1. We say x is an l-mer of s if x is a (contiguous) subsequence of s, and we denote this by x ⊑_l s.
2. We denote d_H(x, s) = min_{y ⊑_l s} d_H(x, y).
Definition 2. Let S = {s_1, ..., s_n} and x ⊑_l s_i. We denote

    d̄_H(x, S) = max_{1 ≤ j ≤ n, j ≠ i} d_H(x, s_j).    (1)
A concept similar to d̄_H(·, S) was introduced in [11] under the name of mismatch tolerance.

Definition 3. Let S = {s_1, ..., s_n} be a set of sequences over Σ. Given l and d, the (l, d) Specific Selection Problem consists of finding a set of l-mers X = {x_1, ..., x_n} such that

    ∀(1 ≤ i ≤ n) : x_i ⊑_l s_i and d̄_H(x_i, S) > d.    (2)
That is, x_i appears in s_i and does not appear in any other s_j (j ≠ i) with up to d errors. In case there is no x_i that satisfies (2) for a particular i, we set x_i = ∅.

It is clear that this problem can be solved in O(N^2) time, where N := Σ_{i=1}^{n} |s_i|.
However, such an approach becomes impractical when we are dealing with complete mRNA data, where N can be of the order of 10^8. In [12] this problem was studied under the name of the unique oligo problem, and in [9] a more general problem is considered under the name of the probe design problem, imposing more conditions on the designed l-mers, which include homogeneity (measured by the melting temperature of the probe and the CG content) and sensitivity (calculated using the free energy of the probe).
Their solution strategy is based on determining whether d̄_H(·, S) ≤ d for each candidate l-mer by making use of a precalculated index of small q-mers or seeds, and then extending contiguous hits of q-mers with few mismatches. The running time of these approaches depends critically on the value of q and the number of mismatches allowed, which in turn depend heavily on the combination of values of l and d.

In [11] it is pointed out that in cases such as the ones that arise from designing siRNA, where N ∼ 10^8, 19 ≤ l ≤ 23 and d = 3, 4, the previous strategy is computationally very intensive; hence the value of d̄_H(·, S) is calculated by making use of overlapping (instead of contiguous) q-mers or seeds allowing a few mismatches, and it is shown that this approach outperforms the previous methods by orders of magnitude. In particular, it is claimed that for l = 19, d = 3 and N = 5 × 10^7, d̄(·, S) can be calculated in nearly 10^{-2} seconds on a Xeon CPU with a clock rate of 3.2 GHz and 2 GB of main memory. This would imply that solving the (l, d) Specific Selection Problem in this case would take close to 6 days of computation. Our method would take close to 3 hours on a similar machine.
2.2 SOS: A Solution Based on Radix Sorting
Let x ⊑_l s_i and assume that d̄_H(x, S) ≤ d. This means that there is a y ⊑_l s_j (j ≠ i) with d_H(x, y) ≤ d. If we eliminate the at most d characters where x and y differ, the resulting strings will be identical and easily identifiable if we sort them. Since we do not know which set of positions will work, we need to try all C(l, d) combinations, where C(l, d) denotes the binomial coefficient "l choose d". However, if l and d are small, as they are in our case, the number of possibilities is not that big. Notice that we are using a strategy similar to [7,1], but in a different context. In other words, we are exploiting the fact that if l and d are small enough, the number of cases where two l-mers differ in at most d positions is not that big. The following definition will be used in the description of the algorithm that captures this idea.

Definition 4. Let x be a string over Σ and let 1 ≤ i_1 < ··· < i_h ≤ |x|. We call x̄_{i_1,...,i_h} the (l − h)-mer that omits the characters x[i_1], ..., x[i_h].
Algorithm SOS
1. Given S = {s_1, ..., s_n}, generate X = ∪_{i=1}^{n} {(x, i) : x ⊑_l s_i}. Let C := X.
2. For all (j_1, ..., j_d) with 1 ≤ j_1 < ··· < j_d ≤ l:
   (a) Sort the collection X = {(x, i)} according to the values of x̄_{j_1,...,j_d} using radix sort.
   (b) Scan the sorted collection. If (y, i) and (y', j) appear consecutively, where ȳ_{j_1,...,j_d} = ȳ'_{j_1,...,j_d} and i ≠ j, mark the elements (x, i) such that x̄_{j_1,...,j_d} = ȳ_{j_1,...,j_d}.
3. Output the unmarked elements.
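A compact sketch of SOS (ours); Python's built-in sort stands in for the radix sort, which changes the constant factors but not the idea:

```python
from itertools import combinations

def sos(seqs, l, d):
    """A sketch of Algorithm SOS; real implementations pack the l-mers
    into machine words and radix-sort them."""
    # X holds every l-mer of every sequence, tagged with its sequence index.
    X = [(s[p:p + l], i) for i, s in enumerate(seqs)
         for p in range(len(s) - l + 1)]
    marked = set()
    for positions in combinations(range(l), d):
        def mask(x):  # the (l-d)-mer that omits the d chosen positions
            return ''.join(c for q, c in enumerate(x) if q not in positions)
        X.sort(key=lambda pair: mask(pair[0]))
        start = 0
        for t in range(1, len(X) + 1):  # scan runs of equal masked l-mers
            if t == len(X) or mask(X[t][0]) != mask(X[start][0]):
                run = X[start:t]
                if len({i for _, i in run}) > 1:  # l-mers from two different
                    marked.update(run)            # sequences agree off-positions
                start = t
    return [pair for pair in X if pair not in marked]
```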
Theorem 1. Algorithm SOS can be implemented in O(C(l, d) N l/w) time and O(N l log|Σ| / w) memory, where w is the word size of the computer.

Proof. We just have to notice that we can represent each letter of the alphabet using log|Σ| bits; hence we can store each l-mer using ⌈l log|Σ| / w⌉ words. The time complexity follows from noticing that we execute step 2 C(l, d) times.

One big advantage of Algorithm SOS is the fact that for fixed values of l and d the algorithm is linear in N, making it practical for large values of N. However, it is sensitive to the parameter l and particularly sensitive to the parameter d, making it practical for values of d ≤ 5. Notice that we can decrease the memory used by Algorithm SOS to O(N) by storing the l-mers in the collection X by their position numbers.

At the core of our SOS algorithm is the use of radix sorting, so it is a natural step to try a combination of MSD and LSD radix sorting. This type of approach has been tried before on a different problem in [2], but for the sake of completeness we describe it in full in the following paragraph. Let (j_1, ..., j_d) be a set of positions as in step 2 and let us write x' := x̄_{j_1,...,j_d}, where (x, i) ∈ X. It is clear that the problem of finding "repeated" (l − d)-mers can be solved if we partition the input into buckets B_1, ..., B_{|Σ|^k} according to the first k MSD digits of x'. Then, independently on each bucket, we do an LSD radix sort on the remaining l − d − k digits and eliminate "duplicates" as in step 2(b). An advantage of this approach is the fact that each bucket B_p, p = 1, ..., |Σ|^k, can be processed independently of the others, and if we assume that the strings s_i are generated by a random source with equal character probabilities over Σ, then the expected size of each B_p is N/|Σ|^k. This implies the following result.

Theorem 2. For a fixed k > 0, Algorithm SOS can be implemented in parallel in O(|Σ|^{-k} C(l, d) N l/w) expected time on |Σ|^k processors. The expected memory usage of each processor is O(|Σ|^{-k} N l log|Σ| / w), where w is the word size of the computer.
2.3 SOS-Hash: A Solution Based on Hashing
The use of hashing for pattern-matching-related problems was pioneered in [6] and has been used extensively in the pattern matching literature. Notice that if we fix a given set of positions i_1, ..., i_d, we can find all the l-mers which differ only in those positions by using a hash table, as in [3]. In doing so we make use of the representation of l-mers over Σ as numbers in base |Σ|. In the following algorithm we will use two hash functions, g : Σ^l → {0, 1} and h : Σ^l → {0, ..., n}. g will be used to tell whether a particular l-mer is in the solution set, and h will store the index of the last sequence where a particular l-mer was found.
Algorithm SOS-Hash
1. Initialize g with 0 values.
2. For all (j_1, ..., j_d) with 1 ≤ j_1 < ··· < j_d ≤ l:
   (a) Initialize h with 0 values.
   (b) For all i = 1, ..., n and x ⊑_l s_i:
       i. Let x' = x̄_{j_1,...,j_d}.
       ii. If g(x) = 0 and h(x') = 0, set h(x') = i.
       iii. If g(x) = 0 and 1 ≤ h(x') ≠ i, set g(x) = 1.
3. For all i = 1, ..., n and x ⊑_l s_i, output x if g(x) = 0.

We use as hash functions g(x) = h(x) = x̃ mod p, where x̃ is the representation of x as a number in base |Σ| and p is an integer, usually a prime.

Theorem 3. Algorithm SOS-Hash takes O(N d C(l, d) + p) time and O(p) memory.

Proof. We should only point out that, given two consecutive l-mers x and y in a given s_i, it is known how to calculate g(y) from g(x) in constant time. Furthermore, we can calculate the value of h(y') given the value of g(y) by subtracting an appropriate amount that depends only on the values of y at the positions j_1, ..., j_d.

In case l ≤ log_{|Σ|} N we can use p = |Σ|^l and the algorithm will be exact; its run time will be O(N d C(l, d)), using O(N) memory. For larger values of l the memory consumption can become impractical, so we might settle for a Monte Carlo version of this algorithm. It is a well-known result from [6] that if we choose p to be a prime less than N^{1+ε} l log(N^{1+ε} l), the probability that for any (j_1, ..., j_d) chosen in step 2 we get a false positive is O(1/N).
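A sketch of SOS-Hash (ours): g and h are realized as arrays indexed by the base-|Σ| value mod p; for brevity the hashes are recomputed per l-mer rather than updated in constant time with the rolling hash described in the proof:

```python
from itertools import combinations

def sos_hash(seqs, l, d, p, alphabet='ACGT'):
    """A sketch of Algorithm SOS-Hash (not the paper's implementation)."""
    sigma = len(alphabet)
    code = {c: q for q, c in enumerate(alphabet)}

    def val(s):  # base-|Sigma| value of a string, reduced mod p
        v = 0
        for c in s:
            v = (v * sigma + code[c]) % p
        return v

    g = [0] * p
    for positions in combinations(range(l), d):
        h = [0] * p
        for i, s in enumerate(seqs, start=1):
            for q in range(len(s) - l + 1):
                x = s[q:q + l]
                xm = ''.join(c for t, c in enumerate(x) if t not in positions)
                if g[val(x)] == 0:
                    if h[val(xm)] == 0:
                        h[val(xm)] = i      # first sequence in which x' is seen
                    elif h[val(xm)] != i:
                        g[val(x)] = 1       # x' also occurs in another sequence
    return [s[q:q + l] for s in seqs
            for q in range(len(s) - l + 1) if g[val(s[q:q + l])] == 0]
```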
3 Specific Selection with Longest Common Factor Constraint
In [5] it was shown that a designed siRNA may bind with an off-target mRNA sequence if they share a specific number of consecutive matches. This type of constraint was considered in the work of [11] under the name of longest common factor, and it is made formal in the following definitions.

Definition 5. Let s and x be strings over Σ with |x| = l < |s|. We denote by lcf(x, s) the longest common factor, i.e., the longest contiguous string shared between x and s. In formal terms, y := lcf(x, s) with |y| = t iff

    (y ⊑_t x ∧ y ⊑_t s) and (∀h > t, ∀z : z ⊑_h x ⇒ ¬(z ⊑_h s)).    (3)

Definition 6. Let S = {s_1, ..., s_n} be a set of sequences over Σ. Given l, d, h, the (l, d, h) Specific Selection Problem with Longest Common Factor Constraint consists of finding a set of l-mers X = {x_1, ..., x_n} that satisfies (2) and

    ∀(1 ≤ i ≤ n) : ∀(j ≠ i) : |lcf(x_i, s_j)| < h.    (4)

In case there is no x_i that satisfies (4) for a particular i, we set x_i = ∅.
It is clear that we can solve this problem in two phases. In the first phase we mark the l-mers which do not satisfy (4). This can be accomplished simply by sorting all the h-mers and recording the duplicates coming from different s_i. In the second phase we run either SOS or SOS-Hash. Notice that in this case it is not necessary to try all the possible (i_1, ..., i_d) with 1 ≤ i_1 < ··· < i_d ≤ l, but only the subset of these that satisfy the condition

    i_1 − 1 < h and i_2 − i_1 < h and ... and i_d − i_{d−1} < h and l − i_d < h,    (5)
since by condition (4) the l-mers remaining at the end of Phase I do not share a common factor of length h or greater. It should also be pointed out that if the first phase discards a large fraction of the l-mers in the input, it may be faster, and makes more sense, to calculate d̄(·, S) as in [11] for the unmarked elements.
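A sketch of Phase I (ours): collect every h-mer with its sequence of origin, sort, and mark the l-mers containing an h-mer that also occurs in another sequence:

```python
def lcf_phase1(seqs, l, h):
    """Mark the l-mers violating the longest-common-factor constraint (4):
    an l-mer fails if one of its h-mers also occurs in another sequence."""
    occ = sorted((s[q:q + h], i) for i, s in enumerate(seqs)
                 for q in range(len(s) - h + 1))
    shared, start = set(), 0
    for t in range(1, len(occ) + 1):  # scan runs of equal h-mers
        if t == len(occ) or occ[t][0] != occ[start][0]:
            if len({i for _, i in occ[start:t]}) > 1:
                shared.add(occ[start][0])  # h-mer seen in >= 2 sequences
            start = t
    bad = set()
    for i, s in enumerate(seqs):
        for q in range(len(s) - l + 1):
            x = s[q:q + l]
            if any(x[r:r + h] in shared for r in range(l - h + 1)):
                bad.add((x, i))
    return bad
```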
Table 1. SOS performance for d = 3

Species      Size (bp)    l    Time    Memory Used    Size of Solution    Coverage of Solution
Human        9.3 × 10^7   19   202 m   1.5 Gb         2.1 × 10^5          59%
                          20   251 m   1.5 Gb         2.7 × 10^6          76%
                          21   344 m   1.5 Gb         1.1 × 10^7          80%
Drosophila   4 × 10^7     19    98 m   0.7 Gb         8.5 × 10^5          71%
                          20   124 m   0.7 Gb         4.5 × 10^6          82%
                          21   167 m   0.7 Gb         9.4 × 10^6          88%
4 Experiments
We implemented algorithm SOS as a C program and tested it on the complete mRNA data for human¹, for which N = 9.3 × 10^7 and n = 3.8 × 10^4, and Drosophila melanogaster², for which N = 4 × 10^7 and n = 1.9 × 10^4. The programs were run on a PowerEdge Linux server with 4 GB of RAM and dual Xeon 2.66 GHz CPUs, only one of which was used. In processing the human mRNA data we used close to 1.5 Gb of RAM, and in the case of Drosophila close to 700 Mb of RAM, owing to the fact that we store the l-mers as 64-bit numbers.

In the particular case of the human mRNA with l = 19 and d = 3, our algorithm took 3 hours and 22 minutes, outperforming the results in [11] by almost two orders of magnitude. In Table 1 we show the run time, memory usage and number of l-mers which satisfy the Specific Selection Problem for values of l = 19, 20, 21 and d = 3. By coverage of solution we mean the percentage of mRNA sequences that have at least one specific l-mer. Of particular interest is the fact that as we consider larger values of l, the number of possible l-mers grows exponentially.

¹ From ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/human.rna.gz
² From ftp://hgdownload.cse.ucsc.edu/goldenPath/dm1/bigZips/mrna.fa.gz
In those cases we can use conditions such as (4), or the ones defined in [8,10], in order to prune the space of possible results.

In Table 2 we consider the human mRNA dataset with l = 19 and d = 3 fixed, and consider the variant of SOS for the Specific Selection Problem with Longest Common Factor Constraint, for values of h = 12, 13, 14. We include the number of l-mers at the end of Phases I and II. Notice that in case we use a small value of h (in our case 12), the number of l-mers that remain after Phase I drops drastically and it may be more efficient to use the approach in [11]. However, as the value of l increases it is more practical to use an algorithm like SOS.

Table 2. SOS performance for human mRNA, l = 19, d = 3

h    Time    Size Phase (I)    Size Phase (II)    Coverage of Solution
12   179 m   3.3 × 10^4        6.0 × 10^3         8%
13   192 m   1.1 × 10^6        7.0 × 10^4         41%
14   200 m   9.4 × 10^6        1.7 × 10^5         55%
References

1. Balla, S., Rajasekaran, S.: Sorting and FFT Based Techniques in the Discovery of Biopatterns. In: Pan, Y., Zomaya, A.Y. (eds.) Bioinformatics Algorithms: Techniques and Applications, Wiley Book Series on Bioinformatics, Chichester (to appear)
2. Balla, S., Rajasekaran, S.: Space and Time Efficient Algorithms to Discover Endogenous RNAi Patterns in Complete Genome Data. In: International Symposium on Bioinformatics Research and Applications (ISBRA 2007) (May 2007)
3. Chin, F.Y.L., Leung, H.C.M.: Voting Algorithms for Discovering Long Motifs. In: Proceedings of the 3rd Asia-Pacific Bioinformatics Conference (APBC 2005), Singapore, January 2005, pp. 261–271 (2005)
4. Elbashir, S., Harboth, J., Lendeckel, W., Yalcin, A., Weber, K., Tuschl, T.: Duplexes of 21-nucleotide RNAs mediate RNA interference in cultured mammalian cells. Nature 411, 494–498 (2001)
5. Jackson, A., Bartz, S., Schelter, J., Kobayashi, S., Burchard, J., Mao, M., Li, B., Cavet, G., Linsley, P.: Expression profiling reveals off-target gene regulation by RNAi. Nature Biotechnology 21, 635–637 (2003)
6. Karp, R., Rabin, M.: Efficient randomized pattern matching algorithms. IBM Journal of Research and Development 31, 249–260 (1987)
7. Rajasekaran, S., Balla, S., Huang, C.-H., Thapar, V., Gryk, M., Maciejewski, M., Schiller, M.: Exact Algorithms for Motif Search. In: Proc. of the 3rd Asia-Pacific Bioinformatics Conference, pp. 239–248 (2005)
8. Reynolds, A., Leake, D., Boese, Q., Scaringe, S., Marshall, W.S., Khvorova, A.: Rational siRNA design for RNA interference. Nature Biotechnology 22, 326–330 (2004)
9. Sung, W.K., Lee, W.H.: Fast and Accurate Probe Selection Algorithm for Large Genomes. In: CSB 2003, pp. 65–74 (2003)
10. Ui-Tei, K., Naito, Y., Takahashi, F., Haraguchi, T., Ohki-Hamazaki, H., Juni, A., Ueda, R., Saigo, K.: Guidelines for the Selection of Highly Effective siRNA Sequences for Mammalian and Chick RNA Interference. Nucleic Acids Research 32, 936–948 (2004)
11. Yamada, T., Morishita, S.: Accelerated off-target search algorithm for siRNA. Bioinformatics 21(8), 1316–1324 (2005)
12. Zheng, J., Close, T.J., Jiang, T., Lonardi, S.: Efficient selection of unique and popular oligos for large EST databases. Bioinformatics 20(13), 2101–2112 (2004)
RNA Folding Including Pseudoknots: A New Parameterized Algorithm and Improved Upper Bound

Chunmei Liu¹, Yinglei Song², and Louis Shapiro³

¹ Dept. of Systems and Computer Science, Howard University, Washington, DC 20059, USA. [email protected]
² Dept. of Mathematics and Computer Science, University of Maryland Eastern Shore, Princess Anne, MD 21853, USA. [email protected]
³ Dept. of Mathematics, Howard University, Washington, DC 20059, USA. [email protected]
Abstract. Predicting the secondary structure of an RNA sequence is an important problem in structural bioinformatics. The general RNA folding problem, where the sequence to be folded may contain pseudoknots, is computationally intractable when no prior knowledge on the pseudoknot structures the sequence contains is available. In this paper, we consider stable stems in an RNA sequence and provide a new characterization for its stem graph, a graph-theoretic model that has been used to describe the overlapping relationships among stable stems. Based on this characterization, we identify a new structure parameter for a stem graph, which we call the crossing width. We show that given a sequence whose stem graph has crossing width c, the general RNA folding problem can be solved in time O(2^c k^3 n^2), where n is the length of the sequence and k is the maximum length of stable stems. Moreover, this characterization leads to an O(2^{(1+2k^2)n} n^2 k^3) time algorithm for the general RNA folding problem where the lengths of stems in the sequence are at most k; this result improves the upper bound of the problem to 2^{O(n)} n^2 when the maximum stem length is bounded by a constant.

Keywords: RNA secondary structure prediction, pseudoknot, free energy, maximum weighted independent set, path decomposition.
1 Introduction
The biological functions of an RNA sequence are often closely associated with its secondary structure. In general, the secondary structure of an RNA sequence contains multiple stems that can be nested, parallel or crossing. A secondary structure is a pseudoknot structure if it contains at least two crossing stems [19]. Pseudoknot structures play important roles in many biological processes including transcription regulation, RNA splicing and catalysis [2,9,18]. Determining
the secondary structure of an RNA sequence is thus critical for understanding its biological functions. Experimental methods for determining the secondary structure of an RNA sequence are time consuming and expensive. It is therefore desirable to develop computational tools that can accurately fold an RNA sequence based on its primary sequence. Most of the existing methods perform the prediction by minimizing the free energy of the sequence based on a model originally developed in [21]. However, such a minimization task is computationally intractable when the sequence contains pseudoknot structures [10]. Therefore, most existing tools developed for pseudoknot prediction either use heuristics or apply additional constraints on the pseudoknot structures that the sequence may contain [1,3,5,6,10,13,20].

Iterated Loop Matching (ILM) [17] is a heuristic method that can perform pseudoknot prediction efficiently. It selects stems for a sequence by iteratively finding the most stable stem on the sequence and removing the nucleotides in the base pairs of this stem from the sequence. The secondary structure formed by these selected stems is reported as the predicted structure when the iteration process terminates. HotKnots [14] uses a technique similar to ILM to select stems. However, rather than generating a single structure as the prediction result, it constructs and returns a list of possible structures for the sequence. Computational tools based on genetic algorithms and simulated annealing techniques have also been developed for pseudoknot prediction [1,6]. Although all these methods are computationally efficient, their prediction accuracy is not guaranteed.

On the other hand, a few algorithms have been developed to optimally fold a sequence into certain classes of pseudoknot structures. For example, in [3], a dynamic programming algorithm with time complexity O(n^4) was developed to optimally fold a sequence into a single pseudoknot structure. The time complexity of the prediction algorithm rises when more stem crossing patterns are incorporated into the classes of allowed pseudoknot structures [10,20]. For example, PKNOTS [15] is a program that can predict a wide class of pseudoknot structures in time O(n^6). Recently, pknotsRG [13] introduced the class of canonical simple recursive pseudoknots and implements an efficient algorithm that can predict commonly observed pseudoknots in time O(n^4). However, these algorithms cannot be used to fold a sequence without prior knowledge of the pseudoknot structures the sequence may contain.

Graph-theoretic models have been developed to model the pairing relations of nucleotides, and a few algorithms based on such models have been developed [12,19,22]. Recently, TdFold [22], a new optimal algorithm that does not require such constraints, was developed for pseudoknot prediction. TdFold finds all stable stems in the sequence and constructs a stem graph to store the overlapping relations among stems. In particular, a stem is represented by a vertex in the stem graph, and a pair of vertices are connected if the corresponding stems overlap in their positions on the sequence. Each graph vertex is associated with a weight that is the absolute value of the free energy of its corresponding stem. Without considering the free energy of loops, the structure of the sequence can
be determined by finding the maximum weighted independent set (WIS) in the stem graph. To find such an independent set, TdFold uses a dynamic programming algorithm based on a tree decomposition of the stem graph. However, the tree width of the stem graph determines the computational efficiency of the dynamic programming algorithm, and it is in general not a small number. Although a few techniques have been implemented in TdFold to alleviate this problem, the computational efficiency is still not guaranteed.

In this paper, we provide a new characterization of the stem graph. Based on this new characterization, we develop a new parameterized optimal algorithm for finding the maximum WIS in a stem graph. In particular, we consider the interval overlapping graphs formed by the left and right pairing regions of all stable stems, respectively. We define a binary graph operator called "cross product" and show that the edges in the stem graph can be colored with different colors based on the cross product of these two graphs. We then show that a path decomposition where each path node contains a clique can easily be constructed to cover all edges colored with a certain color, and that the remaining edges form cliques that "cross" the path nodes in this path decomposition. Based on this characterization, we identify a new structure parameter called the "crossing width" of a stem graph. We then show that, for a stem graph with crossing width bounded by c, the maximum WIS can be found in time O(2^c n^2 k^3), where n is the length of the sequence and k is the maximum length of stable stems. In addition, we show that if the length of a stable stem is not larger than a constant k, this algorithm can fold an RNA sequence in time O(2^{(1+2k^2)n} n^2). To our knowledge, this also improves the upper bound of the general RNA folding problem to 2^{O(n)} n^2 for cases with bounded stem lengths, since even in such cases the stem graph may contain up to O(n^2) vertices and a direct application of any available exact algorithm for finding the maximum WIS in a graph may need time 2^{O(n^2)}.

To test whether the crossing widths of stem graphs for RNA sequences are small integers, we performed experiments on sequences from 16 RNA families with various sequence lengths. The sequences we tested contain up to 600 nucleotides. Our testing results show that the crossing width increases slightly as the sequence becomes longer, and for most of the tested sequences the crossing widths of their stem graphs are less than 20. Our experiments demonstrate the possible advantage of our algorithm over other methods in computational efficiency.
2 Algorithms

2.1 Problem Description
Given an RNA sequence S, a base pair is a pair of interacting nucleotides in S. In general, base pairs formed between the nucleotides A and U, and between G and C, are energetically stable; such base pairs are canonical pairs. Base pairs formed between G and U are less stable; such base pairs are wobble pairs. A stem is a set of stacked base pairs. A stem D can be described with a tuple of four
Fig. 1. Stable stems and the stem graph constructed from them; (a) four stable stems a, b, c, d in a given RNA sequence; (b) the stem graph formed by the stable stems in (a)
integers (s_l, t_l, s_r, t_r), where s_l < t_l < s_r < t_r and (S[s_l], S[t_r]) and (S[t_l], S[s_r]) are two canonical base pairs. The subsequence S[s_l ... t_l] is the left region of the stem and the subsequence S[s_r ... t_r] is the right region of the stem. A location i is covered by a stem (s_l, t_l, s_r, t_r) if either s_l ≤ i ≤ t_l or s_r ≤ i ≤ t_r. The free energy of a stem is the sum of the stacking free energies of its base pairs. A stem is stable if its free energy is less than a given threshold E < 0. Two stems T_1 = (s_l, t_l, s_r, t_r) and T_2 = (u_l, v_l, u_r, v_r) overlap if there exists at least one location i in S that is covered by both T_1 and T_2.

A graph G_s = (V_s, E_s) is the stem graph for a given sequence S if there exists a bijective mapping f from V_s to the set of all stable stems in S such that for u, v ∈ V_s, (u, v) ∈ E_s iff f(u) overlaps f(v). Figure 1(a) shows four stable stems a, b, c, and d in a given RNA sequence, and (b) shows the corresponding stem graph. It is straightforward to see that vertices mapped to stems in a valid secondary structure for S form an independent set in G_s, such as the vertex set {a, c} in the stem graph in Figure 1(b). Each vertex u in G_s can be associated with a weight which is the absolute value of the free energy of f(u). In general, a sequence tends to fold into the secondary structure with the lowest free energy. Without considering the free energies of loops, such a structure for S corresponds to the independent set with the maximum total weight in G_s (maximum WIS). The general RNA folding problem can thus be solved by finding such an independent set in the stem graph of the sequence to be folded.
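For illustration, here is a sketch (ours) that enumerates candidate stems by their four coordinates and connects overlapping ones; scoring by stacking free energies is replaced by a simple minimum stack length, which is an assumption made for brevity:

```python
PAIRS = {('A', 'U'), ('U', 'A'), ('G', 'C'), ('C', 'G')}  # canonical pairs only

def covered(stem):
    sl, tl, sr, tr = stem
    return set(range(sl, tl + 1)) | set(range(sr, tr + 1))

def find_stems(S, min_len=3):
    """Enumerate perfectly stacked candidate stems (sl, tl, sr, tr);
    'stability' here is simply a minimum number of stacked pairs."""
    n, stems = len(S), []
    for sl in range(n):
        for tr in range(n - 1, sl + 2 * min_len, -1):
            m = 0
            while sl + m < tr - m and (S[sl + m], S[tr - m]) in PAIRS:
                m += 1
            if m >= min_len:
                stems.append((sl, sl + m - 1, tr - m + 1, tr))
    return stems

def stem_graph(stems):
    """Edges connect stems that share at least one covered position."""
    cov = [covered(st) for st in stems]
    return {(i, j) for i in range(len(stems))
            for j in range(i + 1, len(stems)) if cov[i] & cov[j]}
```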
2.2 Tree Decomposition
Definition 1 ([16]). Let G = (V, E) be a graph, where V is the set of vertices of G and E is the set of edges of G. A pair (T, X) is a tree decomposition of the graph G if it satisfies the following conditions:
1. T = (I, F) defines a tree whose sets of nodes and edges are I and F, respectively;
2. X = {X_i | i ∈ I, X_i ⊆ V}, and ∀u ∈ V, ∃i ∈ I such that u ∈ X_i;
3. ∀(u, v) ∈ E, ∃i ∈ I such that u ∈ X_i and v ∈ X_i;
4. ∀i, j, k ∈ I, if k is on the path that connects i and j in the tree T, then X_i ∩ X_j ⊆ X_k.

The tree width of the tree decomposition (T, X) is defined as max_{i∈I} |X_i| − 1. The tree width of the graph G is the minimum tree width over all possible tree decompositions of G.

Figure 2(a) and (b) provide an example of a graph and a tree decomposition for it. It can be seen from the figure that a tree decomposition provides an alternative view of the topology of a graph. A tree node in a tree decomposition is often a separator of the graph. A divide-and-conquer based dynamic programming framework [4] has been developed to solve some NP-hard graph optimization problems by finding and combining partial optimal solutions on subproblems of smaller sizes. Methods based on such a framework are often computationally efficient when the tree width of the underlying graph in the problem is small. For example, given a tree decomposition (T, X) for a graph G = (V, E) with tree width t, the maximum independent set in G can be computed using a dynamic programming algorithm in time O(2^t |V|) [4]. Such an algorithm can be sketched as follows.

Without loss of generality, we assume T is a binary tree. A dynamic programming table with t + 3 columns is maintained for each tree node X_i ∈ X. Each of the at most t + 1 vertices u_1, u_2, ..., u_{t+1} contained in X_i is associated with a column in the table, and the two additional columns are marked V_i and S_i, respectively. Given an independent set I in G, a vertex in X_i can be marked 1 if it is in I and 0 otherwise. It is therefore clear that we can use a binary string of length t + 1 to describe the status of the vertices in X_i with respect to I. The number of such binary strings is at most 2^{t+1}, and the dynamic programming table thus contains up to 2^{t+1} entries. We use G_i to denote the subgraph induced by the vertices contained in the tree nodes of the subtree rooted at X_i in T. For each entry in the table, V_i is 1 if there exists an independent set in G_i that includes all the vertices marked 1 in the entry, and 0 otherwise; S_i stores the maximum size of such independent sets. An entry is valid if its V_i is 1 and invalid otherwise.

The algorithm then follows a bottom-up procedure to fill the dynamic programming tables in the tree nodes. For a leaf tree node X_l, the algorithm enumerates all 2^{t+1} entries for vertices in X_l and determines the values of V_l and S_l for each of them. For an internal tree node X_i with two child nodes X_j and X_k, the algorithm also enumerates all 2^{t+1} entries for vertices in X_i. For a given enumerated entry e_i, the algorithm first checks whether two vertices marked 1 in the entry are connected. If so, the V_i value for e_i is set to 0 and the algorithm continues with another entry. Otherwise, it queries the table of X_j for valid entries that share the markings of vertices in X_j ∩ X_i and finds the one with the maximum S_j value. We denote this value by S_j(e_i),
Fig. 2. (a) An example of a graph; (b) a tree decomposition for it; (c) a path decomposition for it
and the V_i value for e_i is set to 0 if no such entries exist. The table in X_k is queried similarly, and a value S_k(e_i) is obtained. The algorithm then determines the number of vertices that are marked 1 in e_i and not contained in the parent of X_i in T; we use I(e_i) to denote this number. The S_i value for e_i is computed by adding S_j(e_i) and S_k(e_i) to I(e_i).

Once all dynamic programming tables are filled, the table of the root node X_r of T is queried to determine the valid entry with the maximum S_r value. This value is returned as the size of the maximum independent set in G. Based on this entry, a similar top-down trace-back procedure can be used to determine the vertices in this independent set. The computation time of the algorithm is O(2^t |V|), since each table may contain up to 2^{t+1} entries and T may contain up to |V| tree nodes.
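A compact recursive sketch of this DP (ours): for simplicity it computes the size of the maximum independent set (the weighted case replaces |A| by a weight sum) and does not assume a binary tree:

```python
from itertools import combinations, chain

def max_independent_set(root, edges):
    """DP over a tree decomposition, mirroring the procedure above.

    root  -- node given as (bag, children), bag a frozenset of vertices,
             children a list of nodes of the same shape
    edges -- set of frozenset({u, v}) edges of the graph
    """
    def independent(A):
        return all(frozenset(p) not in edges for p in combinations(A, 2))

    def subsets(bag):  # one candidate entry per 0/1 marking of the bag
        return chain.from_iterable(combinations(bag, r)
                                   for r in range(len(bag) + 1))

    def table(node):
        bag, children = node
        child_tables = [(c[0], table(c)) for c in children]
        dp = {}
        for tup in subsets(bag):
            A = frozenset(tup)
            if not independent(A):
                continue                  # invalid entry (its V value is 0)
            total, ok = len(A), True
            for cbag, cdp in child_tables:
                # a child entry must agree with A on the shared vertices;
                # subtract the doubly counted shared chosen vertices
                best = max((v - len(Ac & bag) for Ac, v in cdp.items()
                            if Ac & bag == A & cbag), default=None)
                if best is None:
                    ok = False
                    break
                total += best
            if ok:
                dp[A] = total
        return dp

    return max(table(root).values())
```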
2.3 The Parameterized Algorithm
Given an RNA sequence S, the left region set of S is the set of the left regions of its stable stems. Similarly, the right region set of S is the set of their right regions. Given the left region set of S, the overlapping relations among the left regions in the set can be modeled with a left region graph. In particular, each left region is represented by a vertex, and two vertices are connected by an edge if the corresponding left regions overlap. A right region graph can be defined similarly. Both types of graphs are in fact interval overlapping graphs, and we have the following lemma for interval overlapping graphs.

Lemma 1 ([7]). There exists a unique path decomposition (P, X) for an interval overlapping graph G such that:
1. ∀X_i ∈ X, the vertices contained in X_i induce a clique in G;
2. ∀X_i, X_j ∈ X with i < j, there exist vertices u ∈ X_i, v ∈ X_j such that the intervals of u and v do not overlap and the interval of u is to the left of that of v.

Such a path decomposition can be computed by a linear time algorithm.

We can infer from Lemma 1 that, for both types of graphs, there exists a unique path decomposition along the sequence backbone such that the vertices contained in each path node induce a complete graph. For example, the left region graph and right region graph for the stable stems shown in Figure 3(a) can be path decomposed as shown in (b).

Definition 2. Let G_1 = (V_1, E_1) and G_2 = (V_2, E_2) be two graphs. The cross product of G_1 and G_2, i.e., G = G_1 ∗ G_2, is a graph defined on V_1 × V_2 such that any two vertices (u_1, v_1) and (u_2, v_2) are connected in G if and only if v_1 = v_2 or (v_1, v_2) ∈ E_2.

The cross product graph G_1 ∗ G_2 has the following property if both G_1 and G_2 are interval overlapping graphs.

Lemma 2. Given two interval overlapping graphs G_1 and G_2, there exists a path decomposition (P, Z) for the graph G = G_1 ∗ G_2 such that the vertices in any path node Z_i induce a clique in G. In addition, this path decomposition can be computed in time O(|V_1||V_2|).

Proof. Since both G_1 and G_2 are interval overlapping graphs, we can apply the linear time algorithm described in Lemma 1 to find path decompositions (P_1, X) for G_1 and (P_2, Y) for G_2. In particular, we assume X = {X_1, X_2, ..., X_s} and Y = {Y_1, Y_2, ..., Y_t}, and we now show that we can construct a path decomposition for G based on X and Y. Indeed, consider the following partition of the vertices of G:

    Z_q = ∪_{i=1}^{s} (X_i × Y_q),    (1)
where 1 ≤ q ≤ t. From the definition of the cross product, it is not difficult to verify that {Z_1, Z_2, ..., Z_t} form a path decomposition for G. In fact, this path decomposition corresponds to the columns of a grid structure arising from the cross product. In addition, from the definition of the cross product operator, all vertices in Z_q induce a clique in G. Such a path decomposition can be obtained in time O(|V_1||V_2|).
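A small sketch (ours) of the construction in the proof: given the path decompositions of two interval overlapping graphs, form one path node Z_q per node of the second decomposition, as in (1); the bag names below are hypothetical:

```python
def cross_product_path_decomposition(X_nodes, Y_nodes):
    """Build the path decomposition {Z_1, ..., Z_t} of G_1 * G_2 from path
    decompositions X_nodes of G_1 and Y_nodes of G_2, following (1).
    Each returned node is a set of vertex pairs (u, v)."""
    V1 = set().union(*X_nodes)  # all vertices of G_1
    return [{(u, v) for u in V1 for v in Yq} for Yq in Y_nodes]

# Toy example with interval names echoing Figure 3:
Z = cross_product_path_decomposition([{'la', 'lb'}, {'lc', 'ld'}],
                                     [{'ra'}, {'rb', 'rc', 'rd'}])
```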
Fig. 3. Path decompositions of the left region graph and the right region graph for stable stems in a sequence; (a) stable stems in a given sequence; (b) the path decompositions where vertices in each path node induce a clique for both the left region and right region graph
Two stems are connected with a yellow edge if the left region of one stem overlaps the right region of the other. Note that an edge can be colored with multiple colors. Red edges in Gs induce a subgraph in Gl ∗ Gr; we call this subgraph H. We then consider the path decompositions (P1, X) and (P2, Y), as described in Lemma 1, for the graphs Gl and Gr, respectively. For each path node Xi ∈ X, we can find a vertex subset Mi in H such that Mi = {(u, v) | u ∈ Xi}. The vertices in Mi are connected into a clique by green edges; we call such a clique a green clique. Gs may thus contain s green cliques if the path decomposition (P1, X) contains s path nodes. We now consider the yellow edges in Gs. For a given right region u, we denote by Ny(u) the set of vertices in Gs whose left regions overlap u. Consider the subgraph induced by green edges on the vertices in Ny(u); such a graph is also an interval overlapping graph. By Lemma 1, there exists a path decomposition for this graph in which the vertices in each path node are connected into cliques by green edges. Each such clique forms a clique with all the vertices in Gs whose right region is u; such a clique is called a yellow clique. Given a path node in the path decomposition described in Lemma 2 for H, a yellow or green clique may "cross" the path node. We show later that the number of cliques crossing a given path node determines the complexity of the problem.

Definition 3. Given a path decomposition (P, Z) as described in Lemma 2 for H, a green or yellow clique C crosses a given path node Zt if there exist u, v ∈ C such that u ∈ Zi, v ∈ Zj, and i ≤ t ≤ j.

Definition 4. Given a path decomposition (P, Z) as described in Lemma 2 for H, the crossing width of a given path node Zt is the number of green and yellow cliques that cross Zt. The crossing width of H is the maximum crossing width over all path nodes in Z.
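Definitions 3 and 4 translate directly into code. The sketch below is illustrative only (the vertex-set representation of path nodes and cliques is our choice, and every clique vertex is assumed to occur in some path node).

```python
def crossing_width(path_nodes, cliques):
    """Crossing width per Definitions 3 and 4: a clique C crosses path
    node Z_t if C has members u in Z_i and v in Z_j with i <= t <= j.
    path_nodes: list of vertex sets (the decomposition Z);
    cliques: list of vertex sets (the green and yellow cliques)."""
    first, last = {}, {}
    for t, node in enumerate(path_nodes):
        for v in node:
            first.setdefault(v, t)   # first path node containing v
            last[v] = t              # last path node containing v
    width = 0
    for t in range(len(path_nodes)):
        crossing = sum(
            1 for C in cliques
            if min(first[v] for v in C) <= t <= max(last[v] for v in C)
        )
        width = max(width, crossing)
    return width
```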
Theorem 1. Given the stem graph Gs for a sequence of length n and the subgraph H induced by the red edges in Gs, the maximum WIS in Gs can be computed in time O(2^c k^3 n^2) if the lengths of all stable stems are at most k and the crossing width of H is c.

Proof. We use a dynamic programming algorithm based on the path decomposition described in Lemma 2 to compute the maximum WIS in Gs. First, the algorithm finds such a path decomposition (P, Z) in time O(n^2). It then follows the generic dynamic programming framework we have described to compute the maximum WIS in Gs. Note that the path decomposition (P, Z) only covers the red edges in Gs; we must therefore also account for the green and yellow cliques. Since each path node crosses at most c such cliques, we can include c additional columns in each path node to store the status of the cliques that cross it. In particular, the status of a clique is marked 0 if none of the vertices in the clique is included in the independent set, and 1 otherwise. Since the vertices in each path node are connected into a clique by red edges, the algorithm only needs to enumerate |Z(l)| + 1 entries for the vertices in a path node Z(l). However, up to 2^c different combinations of the status of crossing cliques can be associated with each such entry, so the total number of entries in a table, including the status of crossing cliques, can be up to 2^c (|Z(l)| + 1). From Lemma 2, the number of vertices in a path node is at most k^3 n if no stable stem has length larger than k. Since the number of path nodes is at most n, the total computation time needed for finding the maximum WIS is thus at most O(2^c k^3 n^2).
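As a rough illustration of the proof idea, the following simplified sketch computes a maximum weight independent set over a path decomposition in which every node induces a clique. The 2^c status bookkeeping for crossing green and yellow cliques is omitted, so this covers only the red-edge part of the table; the weight map is a hypothetical input.

```python
def max_wis_on_clique_path(path_nodes, weight):
    """Max weight independent set when every path node induces a clique
    (so at most one vertex per node may be selected).  A vertex occupies
    a run of consecutive path nodes; its weight is counted when it is
    first selected.  States map the currently held vertex (or None) to
    the best score so far."""
    first = {}
    for t, node in enumerate(path_nodes):
        for v in node:
            first.setdefault(v, t)
    states = {None: 0.0}
    for t, node in enumerate(path_nodes):
        nxt = {}
        for chosen, score in states.items():
            # drop the previous choice once its run of nodes has ended
            carry = chosen if chosen in node else None
            if score > nxt.get(carry, float("-inf")):
                nxt[carry] = score
            if carry is None:
                for v in node:
                    if first[v] == t:          # select v at first sight
                        s = score + weight[v]
                        if s > nxt.get(v, float("-inf")):
                            nxt[v] = s
        states = nxt
    return max(states.values())
```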
2.4 Improved Upper Bound
We are ready to show that the characterization we have provided for a stem graph leads to a 2^{O(n)} n^2 time algorithm for the general RNA folding problem when the length of a stable stem is at most a constant k. This result has algorithmic significance, since even in the case of bounded stem length the number of stable stems in a sequence can be up to O(n^2), and a direct application of any available optimal algorithm for finding the maximum WIS in the stem graph would result in an algorithm with worst-case computation time 2^{O(n^2)}. In addition, our algorithm can be slightly modified to generate a list of structures with free energies close to the minimum; we omit the proof here.

Theorem 2. Given an RNA sequence of length n, if the lengths of stable stems are bounded by a constant k, the general RNA folding problem can be solved in time O(2^{(2k^2+1)n} k^3 n^2) without considering the free energy of loops.

Proof. Based on Theorem 1, we only need to show that the crossing width of a stem graph is bounded by (2k^2 + 1)n. To show this, we bound the total number of green and yellow cliques in a stem graph. In particular, the number of green cliques is bounded by n, since each green clique corresponds to a path node in the path decomposition described in Lemma 1 for the left region graph, and such a path decomposition can have at most n path nodes.
Fig. 4. The distribution of the crossing widths for the RNA families in both groups; Left: for families with an average sequence length at most 200; Right: for families with an average sequence length larger than 300.
For yellow cliques, we consider a right region u in the right region graph Gr. The set of all vertices whose left regions overlap u can be partitioned into yellow cliques, and the number of such yellow cliques is bounded by 2k, due to the length restriction on the stem of u. The number of yellow cliques is thus bounded by 2k × kn = 2k^2 n, since there are at most kn different right regions in a stem graph. The total number of yellow and green cliques is therefore bounded by (2k^2 + 1)n, and we can use the algorithm described in Theorem 1 to carry out the computation in time O(2^{(2k^2+1)n} k^3 n^2).
3 Values of Crossing Widths
We performed experiments to evaluate the crossing widths for sequences in a few RNA families. We downloaded 16 sequence families from the Rfam database [8]; the lengths of sequences in these families range from 30 to 600. We then divided the sequence families into two groups: one group contains all families with average sequence length at most 200, and the other group contains the remaining families. We computed the distribution of crossing widths for each sequence family in the two groups. To compute the stable stems, we used the same stacking energies as those used in mfold [11], a program that implements Zuker's algorithm for RNA folding [21]. The energy threshold used to select stable stems is −3.0 kcal/mol, and the maximum allowed stem length is 15. The two groups of testing RNA families and their average sequence lengths are shown in Table 1. Figure 4 shows the distribution of crossing widths for the two groups. It can be seen that, for the group with shorter sequences, the crossing widths are generally less than 10. The only exception is Telomerase_ci, where some sequences have a crossing width of up to 18. For the second group, where all sequences contain more than 300 nucleotides, the crossing widths rise to between 10 and 20.
Table 1. The testing RNA families with different average sequence lengths. PS indicates whether sequences in a family contain pseudoknot structures or not.

RNA Family Name    Number of Sequences   Average Length   PS
Alpha_RBS          42                    109.2            Yes
Flavi_CRE          302                   95.4             No
Prion_pseudo       1597                  40.8             Yes
S15                46                    117.4            Yes
SraB_RNA           10                    168.6            No
Telomerase_ci      59                    179.0            Yes
U1                 42                    161.0            No
ctRNA_pGA1         20                    78.6             No
7SK                170                   317.0            No
CsrB               42                    359.7            No
IRES_Apatho        187                   461.5            No
IRES_HepA          8                     389.0            No
Intron_gpI         29                    397.7            No
Telomerase_vert    59                    430.7            Yes
bicoid_3           43                    550.6            No
tmRNA              349                   351.3            Yes
This value could possibly be reduced further by using the method developed in pknotsRG [13] to resolve the contention between crossing stems. Experiments on the testing families suggest that the crossing widths are in general of moderate magnitude, and the parameterized algorithm we have developed may therefore have advantages in computational efficiency over other optimal methods.
4 Conclusions
In this paper, we study the general RNA folding problem, in which the structure of a sequence may contain pseudoknots and no prior knowledge of the pseudoknot structures is available. Based on a new characterization of the stem graph, a model introduced in previous work, we develop a new parameterized algorithm for the general RNA folding problem. This characterization also leads to an improved upper bound for the problem when the lengths of the stems in the structure are bounded by a given constant. Our future work includes the implementation and testing of this algorithm and a comparison of its performance with other software tools.
Acknowledgement. We would like to thank the anonymous reviewers for their constructive comments on an earlier version of this paper. CL's work was supported in part by the new faculty startup award at Howard University.
References

1. J. Abrahams, M. van den Berg, E. van Batenburg, and C. Pleij, "Prediction of RNA secondary structure, including pseudoknotting, by computer simulation", Nucleic Acids Research, 18:3035-3044, 1990.
2. P. L. Adams, M. R. Stahley, A. B. Kosek, J. Wang, and S. A. Strobel, "Crystal structure of a self-splicing group I intron with both exons", Nature, 430:45-50, 2004.
3. T. Akutsu, "Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots", Discrete Applied Mathematics, 104:45-62, 2000.
4. H. L. Bodlaender, "Dynamic programming algorithms on graphs with bounded tree-width", Proceedings of the 15th International Colloquium on Automata, Languages and Programming, pp. 105-119, Springer Verlag, Lecture Notes in Computer Science, vol. 317, 1988.
5. L. Cai, R. L. Malmberg, and Y. Wu, "Stochastic modeling of pseudoknotted structures: a grammatical approach", Proceedings of the 11th International Conference on Intelligent Systems for Molecular Biology, pp. 66-73, 2003.
6. J.-H. Chen, S.-Y. Le, and J. V. Maizel, "Prediction of common secondary structures of RNAs: a genetic algorithm approach", Nucleic Acids Research, 28(4):991-999, 2000.
7. P. C. Gilmore and A. J. Hoffman, "A characterization of comparability graphs and of interval graphs", Canadian Journal of Mathematics, 16:539-548, 1964.
8. S. Griffiths-Jones, S. Moxon, M. Marshall, A. Khanna, S. R. Eddy, and A. Bateman, "Rfam: annotating non-coding RNAs in complete genomes", Nucleic Acids Research, 33:D121-D124, 2005.
9. A. Ke, K. Zhou, F. Ding, J. H. Cate, and J. A. Doudna, "A conformational switch controls hepatitis delta virus ribozyme catalysis", Nature, 429:201-205, 2004.
10. R. B. Lyngso and C. N. S. Pedersen, "RNA pseudoknot prediction in energy-based models", Journal of Computational Biology, 7(3-4):409-427, 2000.
11. Available at: http://www.bioinfo.rpi.edu/~zukerm/rna/energy/node2.html
12. R. Nussinov, G. Pieczenik, J. Griggs, and D. Kleitman, "Algorithms for loop matchings", SIAM Journal on Applied Mathematics, 35:68-82, 1978.
13. J. Reeder and R. Giegerich, "Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics", BMC Bioinformatics, 5:104, 2004.
14. J. Ren, B. Rastegari, A. Condon, and H. H. Hoos, "HotKnots: heuristic prediction of RNA structures including pseudoknots", RNA, 11:1494-1504, 2005.
15. E. Rivas and S. R. Eddy, "A dynamic programming algorithm for RNA structure prediction including pseudoknots", Journal of Molecular Biology, 285:2053-2068, 1999.
16. N. Robertson and P. D. Seymour, "Graph minors II: algorithmic aspects of tree-width", Journal of Algorithms, 7:309-322, 1986.
17. J. Ruan, G. D. Stormo, and W. Zhang, "An iterated loop matching approach to the prediction of RNA secondary structures with pseudoknots", Bioinformatics, 20(1):58-66, 2004.
18. P. Schimmel, "RNA pseudoknots that interact with components of the translation apparatus", Cell, 58(1):9-12, 1989.
19. J. Tabaska, R. Gary, H. Gabow, and G. Stormo, "An RNA folding method capable of identifying pseudoknots and base triples", Bioinformatics, 14(8):691-699, 1998.
20. Y. Uemura, A. Hasegawa, S. Kobayashi, and T. Yokomori, "Tree adjoining grammars for RNA structure prediction", Theoretical Computer Science, 210:277-303, 1999.
21. M. Zuker and P. Stiegler, "Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information", Nucleic Acids Research, 9(1):133-148, 1981.
22. J. Zhao, R. L. Malmberg, and L. Cai, "Rapid ab initio RNA folding including pseudoknots via graph tree decomposition", Proceedings of the 6th Workshop on Algorithms in Bioinformatics, Springer Verlag, Lecture Notes in Bioinformatics, vol. 4175, pp. 262-273, 2006.
HFold: RNA Pseudoknotted Secondary Structure Prediction Using Hierarchical Folding

Hosna Jabbari¹, Anne Condon¹, Ana Pop², Cristina Pop², and Yinglei Zhao¹

¹ Dept. of Computer Science, U. of British Columbia, {hjabbari,condon}@cs.ubc.ca
² Dept. of Electrical and Computer Engineering, U. of British Columbia
Abstract. Improving the accuracy and efficiency of computational RNA secondary structure prediction is an important challenge, particularly for pseudoknotted secondary structures. We propose a new approach for prediction of pseudoknotted structures, motivated by the hypothesis that RNA structures fold hierarchically, with pseudoknot free pairs forming initially, and pseudoknots forming later so as to minimize energy relative to the initial pseudoknot free structure. Our HFold (Hierarchical Fold) algorithm has O(n^3) running time, and can handle a wide range of biological structures, including nested kissing hairpins, which have previously required Θ(n^6) time using traditional minimum free energy approaches. We also report on an experimental evaluation of HFold.

Keywords: RNA, Secondary Structure Prediction, Folding Pathways, Pseudoknot, Hierarchical Folding.
1 Introduction
RNA molecules aid in translation and replication of the genetic code, catalyze cellular processes, and regulate the expression level of genes [1]. Structure is key to the function of RNA molecules, and so methods for predicting RNA structure from the base sequence are of great value. Currently, prediction methods focus on secondary structure, the set of base pairs that form when the RNA molecule folds. There has been significant success in prediction of pseudoknot free secondary structures, which have no crossing base pairs (see Fig. 1). State-of-the-art prediction algorithms, such as Mfold [2] or RNAfold [3], find the structure with minimum free energy (MFE) from the set of all possible pseudoknot free secondary structures. While many small RNA secondary structures are pseudoknot free, pseudoknots do arise frequently in biologically important RNA molecules. Examples include simple H-type pseudoknots, with two interleaved stems, which are essential for certain catalytic functions and for ribosomal frameshifting [4], as well as kissing hairpins, which are essential for replication in the coxsackie B virus [5]. Unfortunately, MFE pseudoknotted secondary structure prediction is NP-hard [6,7]. Polynomial-time MFE-based approaches to pseudoknotted structure prediction have been proposed [6,8,9], which find the MFE structure for a given input sequence from a restricted class of structures.
Fig. 1. An H-type pseudoknotted structure (left) and a pseudoknot free structure (right) in graphical (top) and arc diagram (bottom) formats
Algorithms for MFE pseudoknotted secondary structure prediction trade off run-time complexity against generality, that is, the class of structures over which the algorithms optimize. For example, kissing hairpins are not handled by Θ(n^5) algorithms [6,8], but can be handled in Θ(n^6) time [9]. (We note that, even when the true structure R for a sequence is handled by an algorithm, the algorithm still may not correctly predict R, because correctness depends also on the energy model and energy parameters.) Our work is motivated by two limitations of MFE-based algorithms for pseudoknotted secondary structure prediction: they have high time complexity, and they ignore the folding pathway from unfolded sequence to stable structure. Several experts have provided evidence for, and support, the hierarchical folding hypothesis [10,11], which is succinctly stated by Tinoco and Bustamante as follows: "An RNA molecule [has] a hierarchical structure in which the primary sequence determines the secondary structure which, in turn, determines its tertiary folding, whose formation alters only minimally the secondary structure" [10]. (These and other authors consider the initially-formed secondary structure to be pseudoknot free, and refer to base pairs that form pseudoknots as part of the tertiary structure. However, in this paper we refer to all canonical base pairs, namely A-U, C-G, and G-U, as secondary structure.) We note that while the hierarchical folding hypothesis is a common assumption, some counterexamples have been reported, notably the formation of the structure of a subdomain of the Tetrahymena thermophila group I intron ribozyme [12]. However, even in this case, 15 of the 19 base pairs in the
initially-formed pseudoknot free secondary structure are retained upon formation of tertiary structure, and the 4 missing base pairs lie at the ends of stems. In this paper, we show how to efficiently predict RNA secondary structures in a manner consistent with a natural formalization of the hierarchical folding hypothesis. We consider the Hierarchical-MFE secondary structure prediction problem: given a sequence S and a pseudoknot free secondary structure G, find a pseudoknot free secondary structure G′ for S such that the free energy of G ∪ G′ is less than or equal to the free energy of G ∪ G′′ for all pseudoknot free structures G′′ ≠ G′. Since both G and G′ are pseudoknot free, the most general class of structures that could be handled by an algorithm for hierarchical-MFE secondary structure prediction would be the bi-secondary structures of Witwer et al. [13], i.e., those structures which can be partitioned into two pseudoknot free secondary structures G and G′. We solve the problem with respect to a subclass of the bi-secondary structures, which we call density-2 structures, defined in Section 2. This is quite a general class, including H-type pseudoknots and kissing hairpins, as well as structures containing nested instances of these structural motifs. The only known algorithm for predicting MFE nested kissing hairpins, that of Rivas and Eddy, requires Ω(n^6) time. In Section 3, we present HFold, a dynamic programming algorithm that solves the hierarchical-MFE secondary structure prediction problem for the class of density-2 secondary structures in O(n^3) time and O(n^2) space. Combined with a pseudoknot free secondary structure prediction algorithm, HFold can be used to efficiently predict pseudoknotted secondary structures in the following way. First, a pseudoknot free secondary structure G is predicted. Then, HFold is run, with G as input, to calculate a (potentially) pseudoknotted secondary structure G ∪ G′. In general, the structure G ∪ G′ obtained in this way may not be the true MFE pseudoknotted secondary structure, since G is fixed when G′ is calculated. Our experimental evaluation of HFold in Section 4 shows that, when provided with the true pseudoknot free substructure for the input sequence, HFold adds pseudoknots which, on average, improve the accuracy (measured as the fraction of correctly predicted bases) by 7%. However, HFold does not significantly improve accuracy when given as input a computational prediction of the MFE pseudoknot free secondary structure G, since HFold cannot correct errors in G.
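The two-stage use of HFold just described can be summarized in a few lines of glue code. The sketch below is purely illustrative: both predictors are passed in as stand-ins, and none of the function names correspond to actual interfaces of SimFold or HFold.

```python
def predict_hierarchically(seq, predict_pk_free, hfold):
    """Two-stage prediction per the hierarchical folding hypothesis:
    first fix a pseudoknot free structure G, then let an HFold-style
    step add a disjoint pseudoknot free structure G' minimizing the
    energy of the union G u G'.  Both callables are hypothetical
    stand-ins; structures are encoded as sets of base pairs (i, j)."""
    G = predict_pk_free(seq)    # e.g. an MFE pseudoknot free predictor
    G_prime = hfold(seq, G)     # pseudoknots added relative to fixed G
    return G | G_prime
```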
2 Background on RNA Secondary Structure
Following the definitions of Rastegari and Condon [14], we introduce notation on secondary structure needed to describe our algorithm.

Basic definitions. We model an RNA molecule as a sequence (string) over the alphabet of bases {A, C, G, U}, with distinct ends, called the 5′ (left) and 3′ (right) ends. We index the bases consecutively from the 5′ end starting from 1, and refer to a base by its index. When an RNA molecule folds, bonds may form between certain pairs of bases (namely A-U, C-G, and G-U; see Fig. 1), where each base may pair with at most one other base. A secondary structure for a sequence is a
set of base pairs. In what follows, all definitions are with respect to a fixed secondary structure R for a sequence S with n bases. We use i.j to denote a base pair, or arc, between i and j, where i < j. We let bpR(i) denote the base that is paired with base i in R, if any; otherwise bpR(i) = 0. Pair i.j is pseudoknotted if it crosses some base pair i′.j′, that is, exactly one of i′ and j′ is in the region [i, j] (namely, the set of bases {i, ..., j}). Generally we use R to refer to a structure that may be pseudoknotted (that is, contains at least one pseudoknotted base pair), and use G to refer to a structure that we know to be pseudoknot free. Base pair i.j covers base k if i ≤ k ≤ j and there is no other base pair i′.j′ ∈ G with i < i′ < k < j′ < j. In this case, we denote i.j by cover(k). The predicate isCovered(G, k) indicates that some base pair of G covers k. Two secondary structures R and R′ are disjoint if no base is paired in both R and R′. Rij is the set of base pairs of R that involve only bases in the region [i, j].

Loops, bands, and weakly closed regions. Loops in pseudoknot free structures are comprised of regions of unpaired bases, separated by "closing" base pairs, from which stems of base pairs emanate. Loops are classified by the number of emanating stems: hairpin, internal, and multiloops have one, two, and at least three emanating stems, respectively. (Stacked pairs are special types of interior loops.) Pseudoknotted base pairs can be partitioned into bands. Two base pairs belong to, and are said to span, the same band if they cross exactly the same set of base pairs [14]. Let i1.j1, i2.j2, ..., ik.jk be the arcs that span a fixed band, where we assume without loss of generality that i1 < i2 < ... < ik < jk < ... < j2 < j1 (arcs in a band must be nested). We call i1.j1 and ik.jk the outer and inner base pairs of the band, respectively. Between two successive base pairs ir.jr and ir+1.jr+1 lies either an internal loop, as illustrated in Fig. 1, or a multiloop. For example, the structure in the left of Fig. 1 has two bands, one of which has outer base pair 2.20 and inner base pair 6.17. Pseudoloops are comprised of regions of unpaired bases that are not in any band but are directly flanked by pseudoknotted base pairs, along with the base pairs that mark the beginning or end of such regions [14]. The structure in the left of Fig. 1 has one pseudoloop, which includes the unpaired bases in regions [7, 13] and [21, 27], as well as the base pairs 2.20, 6.17, 14.30, and 16.28. Here, 2 and 30 are the start and the end of this pseudoloop, respectively. A region [i, j] is weakly closed if no base pair connects a base in the region to a base outside the region, as shown in Fig. 1.

Energy model. Roughly speaking, base pairs tend to stabilize an RNA structure, whereas unpaired bases form destabilizing loops. The free energy of a strand S with respect to a fixed secondary structure R is the sum of the free energies of the loops of R. We use the standard Turner model for energies of pseudoknot free loops [2]. Our model for pseudoknot energies is based on that of Dirks and Pierce [8] (our algorithm could easily be adapted to other energy models, such as that of Rivas and Eddy). In Table 1 we summarize some of the parameters used in our model. In addition to standard energies for pseudoknot free loops, the model parameters include a penalty Ps for pseudoloop initiation, as well as penalties Pb and Pup for each band and for each unpaired base in a pseudoloop, respectively.
Table 1. Energy parameters used in this paper

Name                Description                                     Value (kcal/mol)
Ps                  pseudoloop initiation penalty                   9.6
Pb                  band penalty                                    0.2
Pup                 penalty for an unpaired base in a pseudoloop    0.1
eH(i, j)            energy of a hairpin loop closed by i.j
eInt(i, r, r′, j)   energy of a pseudoknot free internal loop
eIntP(i, r, r′, j)  energy of an internal loop that spans a band    eInt(i, r, r′, j) × 0.83
The function eIntP(i, r, r′, j) denotes the energy of an internal loop that spans a band. To illustrate the model, we calculate the free energy of the pseudoknotted structure of region [1, 32] of the left part of Fig. 1 as follows: Ps + 2Pb + 14Pup + eIntP(2, 3, 19, 20) + eIntP(3, 5, 18, 19) + eIntP(5, 6, 17, 18) + eIntP(14, 15, 29, 30) + eIntP(15, 16, 28, 29).

Density-2 pseudoknotted structures. A density-2 secondary structure is a bi-secondary structure R with an additional constraint, which is easy to describe intuitively in terms of the structure's arc diagram. Take any region [i, j], and remove all proper nested substructures (that is, arcs in all weakly closed proper subregions of region [i, j]). Choose any base l ∈ [i, j] and draw a vertical line through base l in the arc diagram. Then the vertical line should intersect at most two bands of Rij. Figure 2 shows an example of a density-2 structure with four interleaving bands. Density-2 structures can also have nested pseudoloops of arbitrarily large nesting depth. However, not all bi-secondary structures are density-2 structures. For example, the structure formed by base pairs 1.7, 2.4, 3.6, and 5.8 is not density-2: each of these base pairs forms a distinct band (since no two of the base pairs cross the same set of other base pairs), and in an arc diagram, three of these base pairs intersect a vertical line drawn at position 3, 4, 5, or 6.

As will become clearer later, the reason that our HFold algorithm works for density-2 structures is the following lemma, which is key for the efficient decomposition of energies in our recurrences. Roughly, the lemma shows how to calculate the band borders for a given region that is not weakly closed.

Lemma 1. Let G and G′ be disjoint, pseudoknot free secondary structures such that G ∪ G′ is a density-2 secondary structure, and let i, j be the start and end of a pseudoloop of G ∪ G′. Let l ∈ [i, j] be paired in G′ (but not in G) with bpG′(l) ∈ [i, j], such that l.bpG′(l) crosses an arc of G.
Let

    b^(i,l) = min( {k | i ≤ k < l < bpG(k)} ∪ {∞} )   and   b_(i,l) = max( {k | i ≤ k < l < bpG(k)} ∪ {−1} ).

Then either both of these quantities have finite, positive values, in which case the structure G ∪ G′ contains a band with outer base pair b^(i,l).bpG(b^(i,l)) and inner base pair b_(i,l).bpG(b_(i,l)), or neither of the two quantities has a finite, positive value, in which case l is not covered by a base pair of Gij.

The bottom left part of Fig. 1 illustrates Lemma 1, showing the borders of the band whose arcs cross the base pair involving base l = 14. If [i, j] is the region [1, 31], then b^(1,14) = 2 and b_(1,14) = 6.
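The band-border computation of Lemma 1 is a simple scan. The following sketch is an illustrative, direct transcription of the min/max formulas; representing G as a partner map is our own choice.

```python
def band_borders(bp_G, i, l):
    """Band borders per Lemma 1 (1-indexed): scan k in [i, l) for arcs
    k.bp_G(k) of G that cross position l.  bp_G maps a base to its
    partner in G (0 if unpaired).  Returns (outer, inner), i.e. the
    smallest and largest such k, or None when l is not covered by a
    base pair of G within the region."""
    candidates = [k for k in range(i, l) if bp_G.get(k, 0) > l]
    if not candidates:
        return None
    return min(candidates), max(candidates)

# Toy version of the example above: arcs 2.20 and 6.17 cross l = 14.
bp_G = {2: 20, 20: 2, 6: 17, 17: 6}
print(band_borders(bp_G, 1, 14))   # -> (2, 6)
```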
3 The HFold Algorithm
Here we outline our hierarchical fold algorithm. We first briefly review key ideas of the dynamic programming algorithm which predicts the energy of the MFE pseudoknot free secondary structure for a fixed sequence S = s1 s2 ... sn [2]. Let Wi,j be the energy of the MFE pseudoknot free secondary structure for the subsequence si si+1 ... sj. If i ≥ j, Wi,j = 0, since in this case the structure has no loops. Otherwise, either i.j is a base pair in the MFE structure for si ... sj, or the MFE structure can be decomposed into two independent subparts. These two cases correspond to the first two rows of the recurrence for Wi,j below. Vi,j is the free energy of the MFE structure for si ... sj that contains i.j. The recurrence for Vi,j can in turn be expressed in terms of the energies of the hairpin loop (eH(i, j)), an internal loop, or a multiloop closed by i.j.

We extend the definition of Wi,j for our hierarchical folding algorithm as follows. Let G be a given pseudoknot free structure for S. If some arc of G covers i or j, then Wi,j = ∞. If i ≥ j, then Wi,j = 0. Otherwise we define Wi,j to be the energy of the MFE secondary structure Gij ∪ G′ij for the strand si ... sj, taken over all choices of G′ij which are pseudoknot free, disjoint from Gij, and such that Gij ∪ G′ij is density-2. In this case, Wi,j satisfies the following recurrence:

    Wi,j = min { Vi,j ;  min_{i ≤ r < j} ( Wi,r + W(r+1),j ) ;  WMBi,j + Ps }

where the first two cases are the same as in the pseudoknot free case, and the last case is specific to pseudoknotted structures; we omit further details here. Ps is the pseudoknot initiation penalty given in Table 1.

The third row of this recurrence accounts for the case when the optimal secondary structure Gij ∪ G′ij includes pseudoknotted base pairs and cannot be partitioned into two independent substructures for two regions [i, r] and [r+1, j], for some r. Such a structure must contain a chain of two or more successively-overlapping bands, which must alternate between Gij and G′ij, possibly with nested substructures interspersed throughout.
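For concreteness, the extended W recurrence can be sketched as follows. V, WMB, and Ps are assumed to be supplied externally, and the infinity case for positions covered by an arc of G is omitted; this is a minimal illustration, not HFold's implementation.

```python
from functools import lru_cache

def fold_energy(n, V, WMB, Ps):
    """Sketch of the extended W recurrence (0-indexed): V[i][j] is the
    energy of the best structure pairing i.j, WMB(i, j) the
    pseudoknotted term, and Ps the pseudoloop initiation penalty."""
    @lru_cache(maxsize=None)
    def W(i, j):
        if i >= j:
            return 0.0
        best = V[i][j]                          # i.j pairs
        for r in range(i, j):                   # split into two parts
            best = min(best, W(i, r) + W(r + 1, j))
        return min(best, WMB(i, j) + Ps)        # pseudoknotted case
    return W(0, n - 1)
```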
Fig. 2. Illustration of how the WMB recurrence unwinds to calculate WMBi,j. Arcs above the horizontal line from i to j represent base pairs of Gij, and arcs below the line represent base pairs of G′ij. Case (a) of the WMB recurrence handles the overall structure whose energy is WMBi,j, with l = l1, with terms to account for the energy of the right upper band (BE) and the right lower nested closed region (WI(l1+1),(bpG(b_(i,l1))−1)), as well as the remaining prefix (WMB′i,l1). The term WMB′i,l1 is handled by case (a) of the WMB′ recurrence, with l = l2 and terms to account for the lower right substructure labeled VPl2,l1, the upper left band (BE), and the remaining "prefix" of the overall structure (WMB′i,(l2−1)). WMB′i,(l2−1) is then handled by case (b) of the WMB′ recurrence, with l = l3, and terms to account for WI(l3+1),(l2−1) and WMB′i,l3. Finally, the WMB′i,l3 term is handled by end case (c) of the WMB′ recurrence.
Figure 2 provides an example, and shows how the recurrence for WMB, given below, unwinds when the example structure is the MFE structure. In order to calculate the energies of substructures in such a structure in our recurrences, we use three additional terms: BE, VP, and WI. Roughly, these account for the energies of bands spanned by base pairs of Gij, of regions enclosed by pseudoknotted base pairs of G′ij (excluding the parts of those regions that are within a band of Gij), and of nested, weakly closed regions, respectively. We do not include the recurrences for BE, VP, and WI here, but note that VP and WI are somewhat similar to V and W. We now give the recurrence for WMBi,j. As the base case, we set WMBi,j = +∞ if i ≥ j, since if i ≥ j the structure is empty and thus cannot be pseudoknotted. Otherwise, there are two cases, depending on whether j is paired in G or not. In case (a), j is paired in G. Then, in the MFE structure, some base l with bpG(j) < l < j must be paired in G′, causing bpG(j).j to be pseudoknotted. We minimize the energy over all possible choices of l (note that l must be unpaired in G, since it will be paired in G′, which is disjoint from G). By Lemma 1, once l is fixed, the inner base pair of the band whose outer base pair is bpG(j).j is also
determined. The Pb + BE term in case (a) of the recurrence accounts for the energy of the band, a WI term accounts for a weakly closed region that is nested in the band, and the remaining energy is represented by the WMB′ term. In case (b), j is not paired in G, and the recurrence is unwound by moving directly to a WMB′ term. Thus, we have:

    WMBi,j = min {
      (a)  Pb + min_{bpG(j) < l < j} ( BE(b^(i,l), b_(i,l)) + WMB′i,l + WI(l+1),(bpG(b_(i,l))−1) ),   if 0 < bpG(j) < j;
      (b)  WMB′i,j,   if bpG(j) = 0
    }
Complementing case (a) of the WMB recurrence, WMB′ handles the case in which the rightmost band is not in G, but is part of the structure G′. In the recurrence for WMB′, case (a) is the complex case, accounting for the energy of the region spanned by the rightmost two bands using the 2Pb, VP, and BE terms, and recursively calling WMB′. Case (b) is called when one iteration of WMBi,j or WMB′i,j case (a) is done and the rightmost substructure of the overall "prefix" up to position j is a weakly closed region. Note that WIi,j = +∞ when cover(i) = cover(j), which ensures that case (b) is not entered as the first iteration of WMB′. Cases (c) and (d) are end cases, in which only one or two bands, respectively, need to be accounted for, so no recursive call to WMB′ is made. Thus we have:

    WMB′i,j = min {
      (a)  2Pb + min_{i < l < j} ( BE(b^(i,l), b_(i,l)) + WMB′i,(l−1) + VPl,j );
      (b)  min_{i < l < j} ( WMB′i,l + WI(l+1),j );
      (c), (d)  the corresponding end-case terms for one or two remaining bands
    }

with the cases distinguished by the conditions bpG(j) = 0, bpG′(j) < j, and 0 = bpG′(j) < bpG′(i).
4 Results
We implemented HFold and performed several analyses on a large dataset. The goals of our analyses were fourfold: first, a baseline test to see whether HFold finds pseudoknots when presented with the true pseudoknot free secondary structure for a sequence; second, to assess the accuracy of HFold when using the predicted MFE structure for a sequence as input; third, to see whether, by taking the best HFold output over several runs in which suboptimal pseudoknot free
structures are used as input, it is possible to obtain significant improvements in accuracy; and finally, to measure the run-time performance of HFold.

In our analyses, we measure the accuracy of a predicted structure R for a sequence S with true structure T as follows. Each base (position) i of S gets a score of 1 if bpR(i) = bpT(i), and a score of 0 otherwise. The accuracy is the total score over all bases of S, divided by the length of the sequence. Thus, the accuracy lies between 0 and 1, with 1 indicating perfect accuracy.

Our data set of 70 sequences includes 6 with pseudoknot free and 64 with pseudoknotted structures. These 64 structures include 6 sequences with kissing interactions, 4 with H-type pseudoknots with nested structures, and 3 sequences with more than one pseudoknot. The length of the sequences varies from 26 to 214 bases. The sequences include the following types: mRNA, tmRNA, viral RNA, signal recognition particle RNA, small nuclear RNA, and tRNA [15,16,17,18,19]. We partitioned each pseudoknotted structure in our data set into two pseudoknot free structures, Gbig and Gsmall. Structure Gbig was created by removing the minimum number of base pairs needed to obtain a pseudoknot free structure from the input pseudoknotted structure, and structure Gsmall consists of the removed base pairs. We call each of these pseudoknot free structures a true pseudoknot free secondary structure for the corresponding sequence.

Accuracy when input is the true pseudoknot free structure. We first test the accuracy of HFold when presented with a sequence S whose true structure is pseudoknotted, together with the corresponding true pseudoknot free secondary structure Gbig. The average accuracy of the Gbig structures obtained from pseudoknotted structures in our dataset is 78%. Figure 3(a) plots the accuracy of the structures Gbig ∪ G′ for each sequence in our data set. The average accuracy of the structure obtained after running HFold, namely the structure Gbig ∪ G′, is 85% averaged over 70 structures, with 16 cases achieving perfect accuracy. It is perhaps not surprising that HFold has high accuracy in these cases, since it is presented with a true pseudoknot free structure, but the result is nevertheless encouraging. We also ran HFold on each sequence S with the corresponding true pseudoknot free secondary structure Gsmall as its input. In this case, the average accuracy of the Gsmall structures is 63%. The average accuracy of the HFold output, Gsmall ∪ G′, is 84%, with 27 cases achieving perfect accuracy. Interestingly, when the better of the two accuracies obtained by running HFold on Gbig and Gsmall is taken for each sequence, the average over all sequences is 92%.

On 5 of the 6 sequences for which the true structure is pseudoknot free, HFold achieves perfect accuracy (that is, it does not predict any pseudoknotted base pairs) when the input is the true pseudoknot free structure. In the 6th case, the accuracy is 88%. In 28 cases of predicted pseudoknotted structures, presented in Fig. 3(a), the prediction accuracy is between 53% and 80%; 22 of these cases are the result of the high penalty for introducing a pseudoknot. In all of these cases, the expected output structures require the addition of stems of fewer than 6 bases, and the pseudoknot initiation penalty is much higher than the free energy of the stem. That is why HFold does not add these stems, and thus the accuracy is not perfect.
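The accuracy measure defined at the start of this section is easy to restate in code; the partner-map encoding below is our own convention, not HFold's.

```python
def accuracy(bp_pred, bp_true, n):
    """Accuracy as defined above: base i scores 1 iff its predicted
    partner equals its true partner (0 denotes 'unpaired'), and the
    total is divided by the sequence length.  bp_pred and bp_true map
    each base in 1..n to its partner or 0."""
    score = sum(1 for i in range(1, n + 1)
                if bp_pred.get(i, 0) == bp_true.get(i, 0))
    return score / n
```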
Fig. 3. Accuracy of HFold when the input is (a) the true pseudoknot free secondary structure (Gbig), and (b) the predicted MFE secondary structure. The horizontal axis represents the accuracy of the predictions, while the vertical axis represents the length of the sequences. Pseudoknotted and pseudoknot free structures are represented by 'x' and 'O', respectively.
In 3 cases HFold adds extra bands, but in the wrong places; in the rest of the cases it does not add anything.

Accuracy when input is the MFE pseudoknot free structure. More realistically, we need to be able to predict the secondary structure of a sequence without knowing the true pseudoknot free secondary structure for the sequence. A natural approach is to run HFold on the predicted MFE secondary structure for the sequence. We use SimFold [20] to produce the pseudoknot free secondary structure G, which becomes the input to HFold. Figure 3(b) plots the accuracy for each data sequence. The average accuracy is 60%, which is just a 1% improvement over the average accuracy of the MFE pseudoknot free structures themselves. Figure 3(b) shows 14 points with accuracy between 18% and 31%. In all of these cases, the MFE pseudoknot free structures bear little resemblance to the true structures, and HFold does not add any pseudoknotted base pairs to them. In 4 of these 14 cases, the low accuracy of the MFE pseudoknot free structures is explained by stems that are "shifts" of stems in the true structure.

Best accuracy, taken over suboptimal folds. Since HFold does not significantly improve accuracy, on average, when using MFE pseudoknot free predictions as input, we also measured the best accuracy obtainable using HFold with suboptimal structures as input. For each sequence in our dataset, we used SimFold to calculate the 25 lowest-energy structures for that sequence. (We chose the 25 lowest-energy structures because, when we compared the 1000 lowest-energy structures from SimFold with the true pseudoknot free structures for
each sequence in our dataset, in only 39 cases was the true structure found among these 1000 structures, and in 29 of the 39 cases the true structure was among the 25 lowest-energy structures.) We then ran HFold on all 25 structures and calculated the accuracy of the HFold output on each run. We took the best accuracy among the 10 lowest-energy structures, and the best accuracy among the 25 lowest-energy structures. When averaged over all sequences, the best-of-10 accuracy is 70% and the best-of-25 accuracy is 76%. These accuracies are a significant improvement over the 60% average accuracy obtained using MFE structures. In many cases the improvement is simply due to the fact that a suboptimal structure is close to Gbig. In 12 of the 14 cases, HFold adds nothing, but in 2 cases it perfectly identifies Gsmall. In 9 of the 14 cases in which MFE structures perform poorly as input to HFold, a structure close to the true pseudoknot free structure Gbig is among the 25 lowest-energy structures, and thus the accuracy of the prediction increases.

Runtime performance. We implemented our HFold algorithm in C, using the energy model described in Section 2. Our implementation builds on the SimFold algorithm of Andronescu [20]. All experiments were run on a SUSE Linux 9.1 system with two Intel 2.40 GHz processors and 1 GB of RAM. For all of our sequences, the final result was produced in less than 2 seconds (clock time).
5 Conclusion and Future Work
In this paper, we presented HFold, a fast new dynamic programming algorithm that efficiently predicts RNA secondary structure including pseudoknots, based on the hierarchical folding hypothesis. HFold can predict kissing hairpins and pseudoloops with an arbitrary number of bands. An empirical analysis of HFold's performance shows that, when presented with the true pseudoknot free structure, improved accuracy is obtained over the accuracy of the pseudoknot free structure alone. This study also shows that using Gbig as the input to HFold results in 85% accuracy, with 16 cases of perfect accuracy; the high pseudoknot initiation penalty and stem shifts are the main causes of imperfect accuracy in the remaining cases. Based on the analyses presented in this work, using only MFE structures as input to HFold does not result in high accuracies, since the MFE structures are usually overcrowded with long stems and small loops, and running HFold then does not result in the addition of any extra base pairings. We are not yet able to make a sound comparison of the prediction accuracy of HFold with MFE-based methods, since it would be important to ensure that the same energy model is used by both methods. Therefore, one of the main goals of our future work is to compare hierarchical and MFE algorithms implemented with the same energy model, at least for H-type pseudoknots. Another goal is to use a better energy model for pseudoknotted structures, such as that of Cao and Chen [16], and to obtain better energy parameters. Finally, we plan to incorporate other techniques to produce better input structures for HFold, such as confidence estimates on predicted base pairs, or information obtained from chemical modification data.
References

1. Dennis, C.: The brave new world of RNA. Nature 418(6894), 122–124 (2002)
2. Mathews, D.H., Sabina, J., Zuker, M., Turner, D.H.: Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 288(5), 911–940 (1999)
3. Hofacker, I.L., Fontana, W., Stadler, P.F., Bonhoeffer, L.S., Tacker, M., Schuster, P.: Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie / Chemical Monthly 125(2), 167–188 (1994)
4. Alam, S.L., Atkins, J.F., Gesteland, R.F.: Programmed ribosomal frameshifting: Much ado about knotting! PNAS 96(25), 14177–14179 (1999)
5. Melchers, W., Hoenderop, J., Bruins Slot, H., Pleij, C., Pilipenko, E., Agol, V., Galama, J.: Kissing of the two predominant hairpin loops in the coxsackie B virus 3' untranslated region is the essential structural feature of the origin of replication required for negative-strand RNA synthesis. J. Virol. 71(1), 686–696 (1997)
6. Akutsu, T.: Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots. Disc. App. Math. 104(1-3), 45–62 (2000)
7. Lyngsø, R.B.: Complexity of pseudoknot prediction in simple models. In: Díaz, J., Karhumäki, J., Lepistö, A., Sannella, D. (eds.) ICALP 2004. LNCS, vol. 3142, pp. 919–931. Springer, Heidelberg (2004)
8. Dirks, R.M., Pierce, N.A.: A partition function algorithm for nucleic acid secondary structure including pseudoknots. J. Comput. Chem. 24(13), 1664–1677 (2003)
9. Rivas, E., Eddy, S.R.: A dynamic programming algorithm for RNA structure prediction including pseudoknots. J. Mol. Biol. 285(5), 2053–2068 (1999)
10. Tinoco, I., Bustamante, C.: How RNA folds. J. Mol. Biol. 293(2), 271–281 (1999)
11. Mathews, D.: Predicting RNA secondary structure by free energy minimization. Theor. Chem. Acc. 116(1-3), 160–168 (2006)
12. Wu, M., Tinoco, I.: RNA folding causes secondary structure rearrangement. Proc. Natl. Acad. Sci. USA 95(20), 11555–11560 (1998)
13. Witwer, C., Hofacker, I.L., Stadler, P.F.: Prediction of consensus RNA secondary structures including pseudoknots. IEEE/ACM Trans. Comput. Biol. Bioinformatics 1(2), 66–77 (2004)
14. Rastegari, B., Condon, A.: Parsing nucleic acid pseudoknotted secondary structure: algorithm and applications. J. Comput. Biol. 14(1), 16–32 (2007)
15. van Batenburg, F.H., Gultyaev, A.P., Pleij, C.W.: Pseudobase: structural information on RNA pseudoknots. Nucleic Acids Res. 29(1), 194–195 (2001)
16. Cao, S., Chen, S.J.: Predicting RNA pseudoknot folding thermodynamics. Nucleic Acids Res. 34(9), 2634–2652 (2006)
17. Deiman, B., Pleij, C.W.A.: Pseudoknots: A vital feature in viral RNA. Seminars in Virol. 8(3), 166–175 (1997)
18. Griffiths-Jones, S., Moxon, S., Marshall, M., Khanna, A., Eddy, S.R., Bateman, A.: Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. 33 (Database issue) (January 2005)
19. Sprinzl, M., Horn, C., Brown, M., Ioudovitch, A., Steinberg, S.: Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res. 26(1), 148–153 (1998)
20. Andronescu, M., Aguirre-Hernández, R., Condon, A., Hoos, H.H.: RNAsoft: A suite of RNA secondary structure prediction and design software tools. Nucleic Acids Res. 31(13), 3416–3422 (2003)
Homology Search with Fragmented Nucleic Acid Sequence Patterns

Axel Mosig¹,⁸, Julian J.-L. Chen²,³, and Peter F. Stadler⁴,⁵,⁶,⁷

¹ Department of Combinatorics and Geometry (DCG), CAS-MPG Partner Institute for Computational Biology (PICB), Shanghai Institutes for Biological Sciences (SIBS) Campus, Shanghai, China, [email protected]
² School of Life Sciences, Arizona State University, Tempe, AZ 85287, USA, [email protected]
³ Department of Chemistry and Biochemistry, Arizona State University, Tempe, AZ 85287, USA
⁴ Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstraße 16-18, D-04107 Leipzig, Germany
⁵ RNomics Group, Fraunhofer Institut für Zelltherapie und Immunologie (IZI), Deutscher Platz 5e, D-04103 Leipzig, Germany
⁶ Department of Theoretical Chemistry, University of Vienna, Währingerstraße 17, A-1090 Wien, Austria
⁷ Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501, USA, [email protected]
⁸ Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany
Abstract. The comprehensive annotation of non-coding RNAs in newly sequenced genomes is still a largely unsolved problem, because many functional RNAs exhibit not only poorly conserved sequences but also large variability in structure. In many cases, such as Y RNAs, vault RNAs, or telomerase RNAs, sequences differ by large insertions or deletions and have only a few small sequence patterns in common. Here we present fragrep2, a purely sequence-based approach to detect such patterns in complete genomes. A fragrep2 pattern consists of an ordered list of position-specific weight matrices (PWMs) describing short, approximately conserved sequence elements that are separated by intervals of non-conserved regions of bounded length. The program uses a fractional programming approach to align the PWMs to genomic DNA, in order to allow a bounded number of insertions and deletions in the patterns. The PWM matches are then combined into significant combinations; at this step, a subset of the PWMs may be deleted, i.e., have no match in the current region of the genome. The program furthermore estimates p- and E-values for the matches. We apply fragrep2 to homology searches for RNase MRP, unveiling two previously unidentified matches as well as reproducing the results of two previous surveys. Furthermore, we complement the picture of vertebrate vault RNAs, a class of ncRNAs that has not received much attention so far.
1 Introduction
With the rapidly increasing number of completely sequenced genomes, their annotation becomes an important task in bioinformatics. In contrast to protein coding genes, fast and reliable annotation of ncRNAs is still in its infancy, with few exceptions such as tRNA annotation using tRNAscan-SE [1]. The reason is that most families of ncRNAs are short and rather variable at the sequence level. Often secondary structure is well conserved; in this case structure-based tools such as infernal [2] can be used. These approaches are too slow to be used on their own, however. Consequently, the ENSEMBL ncRNA pipeline uses a non-restrictive blast search to filter likely regions, which are then searched with infernal. Similarly, in [3] a sequence-based filtering step using an HMM-based heuristic is proposed. For many ncRNA families, dramatic variations in sequence length pose an additional problem. In conjunction with rapid substitutions, it becomes surprisingly hard to find homologs even of fairly long ncRNAs. For example, telomerase RNAs (with a length of more than 300 nt) cannot be found by means of blast in the genomes of teleost fishes, even though more than 20 sequences representing all other major vertebrate clades are known [4]. On the other hand, most ncRNAs exhibit a few short, well conserved sequence motifs. The fragrep program [5] has been designed to exploit this fact: in its original version it searches for (almost) exact sequence patterns separated by non-conserved sequence intervals which are specified only in terms of upper and lower bounds on their length. In practical applications we found, however, that almost exact gap-less patterns are too restrictive for searches in evolutionarily distant genomes. A relaxed version, using a heuristic to match PWMs to genomic DNA, was indeed employed successfully for discovering telomerase RNAs in teleosts [6]. This work also showed that entire conserved blocks can be lost (or mutated beyond recognition) in certain lineages, so it is important to allow for the "deletion" of entire sub-patterns. In this contribution we describe a novel version of fragrep which (1) uses an exact fractional programming algorithm for matching PWMs to genomic DNA with insertions and deletions, and (2) accounts for deleted subpatterns as well as missing data at the ends of contigs in incomplete sequence assemblies.
2 Theory
A fragrep pattern, Fig. 1, is specified as a vector of triplets ([ui, vi], Wi, di | i = 1, ..., N), where [ui, vi] denotes the lower and upper bounds on the number of nucleotides between the patterns Wi−1 and Wi, which are specified as PWMs, and di is the maximal number of in/dels that are tolerated in a match of Wi to the genomic sequence. The sequence interval before the first pattern need not be specified; hence we set [u1, v1] = [0, 0]. The problem of finding a fragrep pattern in a genomic sequence x consists of two parts: (1) finding matches of the individual patterns Wi in the genome, and (2) combining such matches so that they satisfy the distance constraints.
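For illustration, a fragrep pattern can be represented directly as this list of triplets; the class and field names below are ours, not fragrep2's actual data structures.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PatternElement:
    """One triplet ([u, v], W, d) of a fragrep pattern: an interval of
    allowed spacer lengths, a PWM (one row per position, columns in
    A,C,G,U order with frequencies summing to 1), and the in/del
    budget for this PWM."""
    u: int                    # minimal spacer before this PWM
    v: int                    # maximal spacer before this PWM
    W: List[List[float]]      # the position weight matrix
    d: int                    # maximal number of in/dels tolerated

# A whole pattern is an ordered list; [u1, v1] = [0, 0] by convention.
Pattern = List[PatternElement]
```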
Fig. 1. Example of a fragrep pattern for vault RNAs derived from 40 tetrapod vault RNA sequences retrieved from the Rfam database, ENSEMBL 43 annotation, and blast searches of additional vertebrate genomes
2.1 Position-Specific Weight Matrices with InDels
Let W = (Wkα) be a position specific weight matrix (PWM) representing a pattern of length n over the alphabet A, i.e., k ∈ [1, n] and α ∈ A, normalized such that Σ_α Wkα = 1. Following the slight abuse of notation in the literature on modelling transcription factor binding sites, we refer to these normalized frequency count matrices as PWMs. One common way [7] of scoring the match of W with a sequence (x1, ..., xn), xi ∈ A, is to compute position-wise scores of the form

    s_i = H_i W_{i,x_i},    s_i^+ = H_i max_{α∈A} W_{i,α},    s_i^− = H_i min_{α∈A} W_{i,α},    (1)

where H_i = Σ_{α∈A} W_{iα} ln W_{iα} is the information content of column i. These scores are then combined into the so-called MATCH score [7]:

    S = ( Σ_{i=1}^{n} s_i − Σ_{i=1}^{n} s_i^− ) / ( Σ_{i=1}^{n} s_i^+ − Σ_{i=1}^{n} s_i^− ).    (2)
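Equations (1) and (2) can be sketched as follows. Since the text leaves the normalization of H_i implicit, the sketch computes the information content relative to the uniform background so that H_i ≥ 0; this offset is our assumption, not stated in the paper.

```python
import math

def match_score(W, x, alphabet="ACGU"):
    """MATCH-style score (2) of a PWM W against an equally long string
    x.  W is a list of frequency rows, one per position.  Assumption:
    H_i = sum_a W[i][a] * ln(|A| * W[i][a]), i.e. information content
    relative to the uniform background, so H_i is nonnegative."""
    idx = {a: k for k, a in enumerate(alphabet)}
    A = len(alphabet)
    s = s_min = s_max = 0.0
    for i, base in enumerate(x):
        H = sum(w * math.log(A * w) for w in W[i] if w > 0)
        s += H * W[i][idx[base]]
        s_max += H * max(W[i])
        s_min += H * min(W[i])
    return (s - s_min) / (s_max - s_min)

# Example: a 2-column PWM strongly preferring "GG" scores 1.0 on "GG".
W = [[0.1, 0.1, 0.7, 0.1], [0.05, 0.05, 0.85, 0.05]]
print(match_score(W, "GG"))
```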
Here we generalize this approach to incorporate insertions and deletions. We observe that this problem is inherently asymmetric, i.e., the effect of deleting a letter in the sequence x is not the same as that of deleting a column of the PWM. For a given "alignment path" A, i.e., a sequence of matches (i, j), insertions (−, j), and deletions (i, −), we modify the initial scores by setting s_{i,j} = H_i W_{i,x_j} for the match of the i-th column of the PWM with the j-th letter in x. Introducing gap penalties Δ_s for the sequence and Δ_pwm for the PWM, respectively, we evaluate a given alignment path A by means of

    S(A) = ( Σ_{matches (i,j)} s_{i,j} − Σ_{insertions (−,j)} Δ_s − Σ_{deletions (i,−)} Δ_pwm H_i − Σ_{i:(i,∗)} s_i^− ) / ( Σ_{i:(i,∗)} s_i^+ − Σ_{i:(i,∗)} s_i^− ).    (3)
If there are no indels, S(A) coincides with the usual MATCH score. Furthermore, S(A) ≤ 1. The score can, however, become negative if there are too many insertions or deletions.

We follow the general framework of fractional programming [8,9]: since optimizing a fractional target function is often difficult, one turns the optimization problem into the corresponding decision problem. In many cases, this decision problem can be solved by a non-fractional optimization problem, and the optimal value of the original fractional optimization problem can then be found using nesting intervals. In our case, we ask whether there is an alignment path A such that S(A) ≥ ϑ for a given threshold ϑ ≤ 1. This decision problem is satisfied if and only if there is an alignment path A such that

\sum_{\mathrm{matches}(i,j)} \underbrace{\bigl[\, s_{i,j} - s_i^- - \vartheta\,(s_i^+ - s_i^-) \,\bigr]}_{\sigma_{ij}(\vartheta)} \;-\; \sum_{\mathrm{insertion}(-,j)} \Delta_s \;-\; \sum_{\mathrm{deletion}(i,-)} \Delta_{\mathrm{pwm}} H_i \;\ge\; 0   (4)

The decision problem (4) is closely related to normalized editing problems, which can be solved in an analogous way; see [10,11] for details. In order to decide whether (4) has a non-negative solution, it suffices to solve an asymmetric variant of the standard sequence alignment problem with match scores that depend explicitly on the parameter ϑ. For fixed ϑ this can be done by means of the following asymmetric variant of the Needleman-Wunsch algorithm [12]:

S_{i,j} = \max \begin{cases} S_{i-1,j-1} + \sigma_{ij}(\vartheta) \\ S_{i-1,j} - \Delta_{\mathrm{pwm}} H_i \\ S_{i,j-1} - \Delta_s \end{cases}   (5)

with initial condition S_{0,0} = 0 and the convention that alternatives with negative indices are ignored.

Following the logic of earlier versions of fragrep [5], we optionally restrict the number of insertions and deletions. This leads to the following simple generalization of algorithm (5). Let d be the number of deleted PWM columns in the alignment path before (i, j) and denote by S_{i,j;d} the optimal score of an alignment up to positions i and j, respectively, with exactly d deletions in the PWM. One easily verifies that these quantities satisfy the recursion

S_{i,j;d} = \max \begin{cases} S_{i-1,j-1;d} + \sigma_{ij}(\vartheta) \\ S_{i-1,j;d-1} - \Delta_{\mathrm{pwm}} H_i \\ S_{i,j-1;d} - \Delta_s \end{cases} \quad \text{if } j - (i + d) \le d_{\mathrm{threshold}}   (6)

with initial condition S_{0,0;0} = 0. Note that j − (i + d) equals the number of deletions from the sequence within the alignment path. Denote the optimal score of this dynamic programming step, with or without the constraint on the maximum number d of insertions and deletions, by S[ϑ]. Fractional programming [8,9] is based on the observation that S[ϑ] ≥ 0 if and only if the fractional score can exceed ϑ. This implies that the optimal value of ϑ can be obtained by means of the following iteration:

\vartheta \leftarrow \vartheta + \operatorname{sgn}(S[\vartheta]) \cdot \Delta\vartheta \,, \qquad \Delta\vartheta \leftarrow \Delta\vartheta / 2   (7)
with initial condition ϑ = 1 and Δϑ = 1 − ϑ*, where ϑ* is the a priori threshold below which a match of W with x is deemed insignificant.
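To make the search concrete, the following minimal Python sketch implements recursion (5) and iteration (7). This is not the fragrep2 implementation: the names best_score and optimal_theta are ours, the sequence x is assumed to be encoded as letter indices matching the PWM columns, and the deletion-bounded variant (6) is omitted for brevity.

import math

def best_score(W, H, x, theta, delta_s, delta_pwm):
    """Recursion (5): asymmetric Needleman-Wunsch whose match score
    sigma_ij(theta) = s_ij - s_i^- - theta*(s_i^+ - s_i^-) depends on theta.
    W: list of PWM columns (frequency vectors), H: information contents,
    x: sequence as letter indices into the PWM columns."""
    n, m = len(W), len(x)
    svals = [[H[i] * w for w in W[i]] for i in range(n)]  # candidate s_{i,j}
    s_hi = [max(v) for v in svals]                        # s_i^+
    s_lo = [min(v) for v in svals]                        # s_i^-
    S = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    S[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j:  # match PWM column i with letter x_j
                sig = (svals[i - 1][x[j - 1]] - s_lo[i - 1]
                       - theta * (s_hi[i - 1] - s_lo[i - 1]))
                S[i][j] = max(S[i][j], S[i - 1][j - 1] + sig)
            if i:        # delete PWM column i
                S[i][j] = max(S[i][j], S[i - 1][j] - delta_pwm * H[i - 1])
            if j:        # insert letter x_j
                S[i][j] = max(S[i][j], S[i][j - 1] - delta_s)
    return S[n][m]       # S[theta] >= 0 iff some path achieves S(A) >= theta

def optimal_theta(W, H, x, delta_s, delta_pwm, theta_star, iters=30):
    """Iteration (7): nesting intervals on theta, starting from theta = 1
    and Delta-theta = 1 - theta_star."""
    theta, d_theta = 1.0, 1.0 - theta_star
    for _ in range(iters):
        theta += math.copysign(d_theta, best_score(W, H, x, theta,
                                                   delta_s, delta_pwm))
        d_theta /= 2.0
    return theta

A match of W with x would then be reported whenever the returned value is at least ϑ*.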
2.2 Significance Statistics
Once the optimal value of ϑ is determined, we can retrieve a corresponding optimal alignment path by means of backtracking in the most recently computed S-matrix. We either use ϑ as the weight of the match, or we may employ a p- or E-value statistic for this purpose. The probability p that the PWM W of length n matches a random sequence (α_i) of length n at least as well as the actual sequence x = (x_1, ..., x_n) can be computed as

p(W, x) = \operatorname{Prob}\left[\, u := \sum_{i=1}^{n} H_i W_{i,\alpha_i} \;>\; \sum_{i=1}^{n} H_i W_{i,x_i} \right]   (8)
where the letters α_i are drawn i.i.d. from the background distribution corresponding to the genomic frequencies π_α. By the central limit theorem, the random variable u follows an at least approximately Gaussian distribution, whose mean μ and variance σ² are given by

\mu = \sum_{i=1}^{n} H_i \sum_{\alpha} W_{i,\alpha}\, \pi_\alpha \,, \qquad \sigma^2 = \mu^2 - \sum_{i=1}^{n} H_i^2 \sum_{\alpha} W_{i,\alpha}^2\, \pi_\alpha   (9)
and hence depends only on the PWM W and on the background nucleotide frequencies π_α. In the case of indels we match a sequence of length n + I with a PWM pattern of length n − D, where I and D are the numbers of insertions and deletions, respectively. Thus we can estimate the probability of a match with indels, at least roughly, as p ≈ \binom{n+I}{n-D} p(W, x). As in the first version of fragrep [5], we account for the variable-length intervening sequences by considering all combinations of the lengths of the intervening sequence pieces, i.e.,

p \approx p'(W_1, x_1) \prod_{k=2}^{N} p'(W_k, x_k)\,(v_k - u_k + 1)   (10)
While by no means an exact p-value, this approximation is good enough to rank-order the hits of a database search, to allow a comparison of results obtained with different patterns, and to weed out obviously insignificant results.
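To illustrate the significance estimate, here is a small Python sketch of the Gaussian approximation (8). The naming and letter encoding are ours; for clarity, the variance is written out directly as the sum of the per-column variances of the independent terms H_i W_{i,α_i}, rather than in the compact closed form of eq. (9).

from math import erfc, sqrt

def match_pvalue(W, H, x, pi):
    """Gaussian estimate of p(W, x) = Prob(u > t) from eq. (8), where
    u = sum_i H_i W[i][alpha_i] for i.i.d. background letters alpha_i ~ pi
    and t is the score of the actual sequence x (letter indices 0..3)."""
    n = len(W)
    t = sum(H[i] * W[i][x[i]] for i in range(n))
    mu = sum(H[i] * sum(W[i][a] * pi[a] for a in range(4)) for i in range(n))
    var = sum(H[i] ** 2 * (sum(W[i][a] ** 2 * pi[a] for a in range(4))
                           - sum(W[i][a] * pi[a] for a in range(4)) ** 2)
              for i in range(n))
    z = (t - mu) / sqrt(var)
    return 0.5 * erfc(z / sqrt(2.0))  # Prob(N(0,1) > z)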
2.3 Optimal Combinations of PWM Patterns
As indicated above, we are interested in combinations of PWM occurrences that satisfy certain distance constraints. To formalize this, we assume that for each PWM W_i a threshold θ_i as well as a maximum number of deletions d_i are specified. Given these constraints, we now consider all those pairs (i, ν) where
W_i achieves a match score of at least θ_i with at most d_i deletions at position ν in the genomic sequence to be searched; we will refer to index pairs (i, ν) satisfying these constraints as occurrences. Denote by (i, ν) an occurrence of PWM pattern W_i at genomic sequence position ν with weight w_{(i,ν)}. We say that (i′, ν′) is a predecessor of (i, ν), in symbols (i′, ν′) ≺ (i, ν), if the following condition is satisfied:

(i', \nu') \prec (i, \nu) \;\Longleftrightarrow\; \sum_{j=i'}^{i} u_j \;\le\; \nu - \nu' \;\le\; \sum_{j=i'}^{i} (v_j + n_j + d_j)   (11)
where n_i is the length of the pattern W_i. Rather than requiring all PWMs in the search pattern to be matched, we allow a fixed (and typically small) number δ of patterns to be deleted in a matching sequence. In other words, we are interested in finding chains of length k − δ in the above (partial) order, k denoting the number of PWMs in our search pattern. Now let D_{(i,ν)} be the minimum number of deletions associated with a path ending in (i, ν). It satisfies the recursion

D_{(i,\nu)} = \min \begin{cases} \min_{(i',\nu') \prec (i,\nu)} D_{(i',\nu')} + (i - i' - 1) \\ i - 1 \\ 0 \quad \text{if } (0,0) \prec (i,\nu) \end{cases}   (12)

Here (0, 0) denotes the origin of the sequence. The last alternative allows us to ignore the deletion of a partial pattern at the beginning of a sequence. This is useful, e.g., when dealing with unfinished genome assemblies or even unassembled shotgun reads. Matches for a given query can be derived in a straightforward manner from the entries in the table D: whenever we find D_{(i,ν)} ≤ δ − (N − i), where N = k is the number of PWMs in the search pattern, we can identify a match by tracing back the corresponding chain of occurrences in the matrix D.

In practice, there can be a large number of very similar matches at essentially the same genomic location, due to multiple occurrences of individual patterns W_i with low individual significance. To avoid redundant output, fragrep2 reports only the most significant match out of a bundle of essentially equivalent ones. Furthermore, fragrep2 initiates the dynamic programming of table D only at those locations where an occurrence of one of the δ + 1 most significant patterns has been reported. As a further performance-tuning measure, the partial order underlying table D is maintained in a dynamic data structure, so that no genomic location is ever searched more than once for any pattern W_i.
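The chaining step can be pictured with the following rough Python sketch of (11) and (12). All names are ours rather than fragrep2's; u, v, n and d are lists indexed by pattern number 1..N, with index 0 holding zero sentinels for the origin, and the summation bounds mirror eq. (11) as printed, so the tool's actual indexing may differ in detail.

def precedes(occ1, occ2, u, v, n, d):
    """Predecessor test (11): the genomic distance nu - nu' must lie within
    the accumulated distance constraints of the patterns between i' and i."""
    (ip, nup), (i, nu) = occ1, occ2
    if ip >= i:
        return False
    span = range(ip, i + 1)
    lo = sum(u[j] for j in span)
    hi = sum(v[j] + n[j] + d[j] for j in span)
    return lo <= nu - nup <= hi

def min_deletions(occs, u, v, n, d):
    """Table D of (12): minimum number of deleted patterns on a chain
    ending in each occurrence (i, nu); occs processed by genomic position."""
    D = {}
    for (i, nu) in sorted(occs, key=lambda o: o[1]):
        best = i - 1                              # delete all patterns before i
        if precedes((0, 0), (i, nu), u, v, n, d):
            best = 0                              # chain may start at the origin
        for (ip, nup) in list(D):                 # earlier occurrences only
            if precedes((ip, nup), (i, nu), u, v, n, d):
                best = min(best, D[(ip, nup)] + (i - ip - 1))
        D[(i, nu)] = best
    return D

# a match ending at (i, nu) is reported when D[(i, nu)] + (N - i) <= delta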
3 Generating Search Patterns
The most convenient starting points for creating PWM-based search patterns are alignments of non-coding RNA families, such as those contained in databases
such as Rfam [13]. Conserved blocks with a length typically varying between three and 20 nucleotides are frequently observed in these alignments; these conserved blocks are easily identified and annotated by manual inspection. In order to facilitate the application of fragrep2, we developed a simple tool, aln2pattern, which reads an annotated sequence alignment and computes both the PWMs (from contiguous marked alignment columns) and the distance constraints between them (from the lengths of the sequences between marked blocks). The match score thresholds for the PWMs, as well as the column deletion bounds and distance constraints between the matrices, are determined in such a way that they represent the most restrictive search pattern that is capable of identifying all sequences in the given alignment. In a typical application scenario, these constraints will then be manually relaxed by the user, depending on how far the genome to be searched is evolutionarily removed from the training sequences in the given multiple sequence alignment.
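The computation performed by aln2pattern can be sketched as follows (our own reading of the description above, not the actual tool): marked columns are grouped into contiguous blocks, each block yields a PWM of column frequencies, and the spacer length range between consecutive blocks is taken over the ungapped sequence segments.

def aln2pattern(seqs, marked, alphabet="ACGU"):
    """seqs: aligned sequences of equal length ('-' for gaps);
    marked[c]: True for columns annotated as part of a conserved block.
    Returns a list of (spacer_range, pwm) pairs in 5'-to-3' order."""
    ncol, blocks, c = len(marked), [], 0
    while c < ncol:                      # find runs of marked columns
        if marked[c]:
            start = c
            while c < ncol and marked[c]:
                c += 1
            blocks.append((start, c))
        else:
            c += 1
    pattern, prev_end = [], 0
    for a, b in blocks:
        # spacer bounds: min/max ungapped length between consecutive blocks
        spacers = [sum(ch != '-' for ch in s[prev_end:a]) for s in seqs]
        # PWM: per-column letter frequencies (gapped rows ignored)
        pwm = []
        for col in range(a, b):
            letters = [s[col].upper() for s in seqs if s[col] != '-']
            pwm.append([letters.count(x) / max(len(letters), 1)
                        for x in alphabet])
        pattern.append(((min(spacers), max(spacers)), pwm))
        prev_end = b
    return pattern

The per-PWM match thresholds and deletion bounds would then be set to the most restrictive values that still accept every training sequence, as described above.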
4 Applications
4.1 RNase MRP
RNase MRP is a ribonucleoprotein whose RNA component, MRP RNA, is enzymatically active. It is associated with at least two distinct roles in eukaryotes: (1) initiation of mitochondrial DNA replication and (2) cleavage of primary rRNA transcripts. RNase MRP is a distant relative of RNase P, a ribonucleoprotein involved in tRNA maturation. Two recent studies describe systematic searches for RNase MRP homologs in diverse eukaryotic genomes [14,15]. In both studies, combinations of manually composed sequence and secondary structure patterns were used for the search.
Fig. 2. Search patterns successfully used for finding RNase MRP: the pattern on the left identifies fungal species; the pattern on the right reveals unique and previously unidentified candidate matches in amphioxus and Saccoglossus kovalevskii.
As a first test case for fragrep2, we used a pattern from known yeast RNase MRPs for whole-genome surveys in other yeast genomes. The pattern depicted in Figure 2 (left) indeed retrieves salient matches in the genome sequences of Ashbya gossypii, Candida glabrata, Debaryomyces hansenii, Kluyveromyces
lactis, and Yarrowia lipolytica. While most of these have been identified previously, the full advantage of the approach implemented in fragrep2 becomes obvious through the MRP candidates we identified in amphioxus and Saccoglossus kovalevskii using the pattern depicted in Figure 2 (right); fragrep2 returns unique matches in both of these species, in which RNase MRP had not been identified previously. This setting is particularly challenging, since all species with annotated MRP genes are phylogenetically quite distant, and the homology signal is consequently very tenuous. While the examples shown above reveal unique matches in a relatively direct way, the search procedure in more challenging genome-wide surveys eventually needs to be conducted in several stages: starting with a low-stringency search, a long list of candidate matches is reduced by variants of the first pattern with lower fault tolerance (i.e., higher match thresholds for the individual PWMs). Final matches are then usually identified among one or a few final candidates using either phylogenetic or secondary structure analysis with state-of-the-art tools.
4.2 Vault RNAs
Vault RNAs (vRNAs) are short (80–150 nt) polymerase III transcripts that comprise about 5% of the vault complex. Vaults are large ribonucleoprotein particles with a characteristic barrel-like shape that are ubiquitous in eukaryotic cells and have a still poorly understood function in multi-drug resistance; see [16] for a review. Vault RNAs exhibit little sequence conservation beyond their box A and box B internal promoter elements. So far, only a few examples have been studied experimentally. The human genome contains three expressed vRNAs located in a cluster on Chr. 5 [17], while mouse and rat have only a single vRNA [18,19,20]. Outside the mammals, only two vRNAs, from the bullfrog Rana catesbeiana, have been sequenced [18]. Due to their high sequence variability, ENSEMBL's ncRNA annotation pipeline has identified vRNAs only in mammals and in Xenopus (due to its close relationship with the bullfrog). While additional mammalian vRNA sequences can easily be found by means of blastn, blastn searches outside the mammals have been unsuccessful. Using fragrep2 with the pattern in Fig. 1, we not only recover all vRNAs that are experimentally known or annotated in ENSEMBL (based on infernal), we also find plausible vRNA sequences in the genomes of chicken as well as zebrafish and tetraodon, which had previously remained undetected¹. In general, fragrep2 returns a handful of candidates, among which the true homologs are easily recognized in a sequence alignment or using infernal.

¹ The alignment in Fig. 3 can be downloaded from http://www.bioinf.uni-leipzig.de/Publications/SUPPLEMENTS/07-011/
4.3 Telomerase RNAs
Telomerase maintains telomere length by adding G-rich telomeric repeats to the ends of linear eukaryotic chromosomes. The core telomerase enzyme consists of two components: an essential RNA component, which also serves as the template for the repeat sequence, and the catalytic protein component TERT. The RNA component varies dramatically in sequence composition and in size. Although dozens of tel-RNAs (usually called TER in vertebrates and TLC-1 in yeasts) have been cloned and sequenced in recent years, the known examples are restricted to three narrow phylogenetic groups: vertebrates, yeasts, and ciliates. Although there appears to be a common core structure shared by all these tel-RNAs [21], and despite their length, tel-RNAs remain a worst-case scenario for homology search. Indeed, a survey of vertebrate tel-RNAs [4] shows dramatic sequence variation, with only a few short, well-conserved sequence patterns separated by regions of highly variable length. Some vertebrate groups, however, are still not represented; for example, blast-based approaches were unsuccessful in finding tel-RNAs in the genomes of teleost fishes. Using fragrep2, however, we were able to find candidates in all five available teleost genomes and to verify their function experimentally. These results are described in a separate publication [6]. Yeast tel-RNAs appear to be even less well conserved: in [22], only seven short sequence motifs are reported in the more than 1.2 kb transcripts of Kluyveromyces species, and of these only a few are partially conserved in Saccharomyces. In fact, Saccharomyces and Kluyveromyces TLC-1 genes cannot be aligned by programs such as clustalw. As a consequence, search patterns for fragrep2 have to be obtained manually. The search for tel-RNAs in a diverse set of fungal species is ongoing research, on which we will report elsewhere.

Fig. 3. Alignment (top) and consensus secondary structure (below) of vertebrate vRNAs. The two internal polIII promoter sequences, Box A and Box B, as well as the terminator sequence CTTT are highlighted. The ENSEMBL vRNA predictions end before the terminator motif, which, strictly speaking, does not belong to the vRNA sequence but is indicative of a true polIII transcript. In the secondary structure, the variable region has been left out (denoted by three shaded gap symbols). The IDs of the experimentally known vRNA sequences are marked in green; novel candidate sequences from chicken and teleosts are highlighted in red. The remaining sequences can also be found with blast using one of the known sequences as query.
5 Concluding Remarks
We developed fragrep2 to overcome limitations of the fragrep program [5] that became apparent in several applications. First, sequence patterns are rarely so well conserved that a classical IUPAC consensus sequence is a good descriptor; hence we use PWMs. Second, in contrast to features such as transcription factor binding sites, even the conserved regions of ncRNAs seem to allow occasional indels; hence we allow indels when matching PWMs to the genome. Third, entire domains that are well conserved in a wide range of clades are sometimes lost or mutated beyond recognition; hence we allow the deletion of entire PWMs. As a result we obtain a tool, fragrep2, which is sensitive enough to find distant homologs that are not detectable with blastn, while at the same time being fast enough to scan mammalian genomes even at the stage of shotgun reads. In the case of teleost fish telomerase RNA, we successfully characterized several predicted sequences experimentally [6].

Of course, fragrep2 also has its limitations. In particular, if the individual PWM patterns become small and ambiguous, and if the lengths of the intervals separating them become very variable, fragrep2 becomes slow and, like all homology search tools operating at the limits of their sensitivity, tends to produce many false positives. Thus we are currently exploring the possibility of combining fragrep2 with other homology search tools, either as a pre-processor for structure-based approaches, or as a post-processor for fast regular-expression filters.

Availability: An implementation of fragrep2 can be downloaded from http://www.bioinf.uni-leipzig.de/Software/fragrep
References

1. Lowe, T., Eddy, S.: tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucl. Acids Res. 25, 955–964 (1997)
2. Nawrocki, E.P., Eddy, S.R.: Query-dependent banding for faster RNA similarity searches. PLoS Comput. Biol. 3, e56 (2007)
3. Weinberg, Z., Ruzzo, W.L.: Sequence-based heuristics for faster annotation of non-coding RNA families. Bioinformatics 22, 35–39 (2006)
4. Chen, J.L., Blasco, M.A., Greider, C.W.: Secondary structure of vertebrate telomerase RNA. Cell 100, 503–514 (2000)
5. Mosig, A., Sameith, K., Stadler, P.F.: fragrep: efficient search for fragmented patterns in genomic sequences. Geno. Prot. Bioinfo. 4, 56–60 (2005)
6. Xie, M., Mosig, A., Qi, X., Li, Y., Stadler, P.F., Chen, J.L.: Structure and function of the smallest vertebrate telomerase RNA from teleost fish (in preparation)
7. Kel, A.E., Gößling, E., Reuter, I., Cheremushkin, E., Kel-Margoulis, O.V., Wingender, E.: MATCH: a tool for searching transcription factor binding sites in DNA sequences. Nucl. Acids Res. 31, 3576–3579 (2003)
8. Dinkelbach, W.: On nonlinear fractional programming. Manage. Sci. 13, 492–498 (1967)
9. Schaible, S.: Fractional programming. Z. Operations Res. 27, 39–54 (1983)
10. Arslan, A.N., Eğecioğlu, Ö.: Efficient algorithms for normalized edit distance. J. Discr. Algorithms 1, 3–20 (2000)
11. Arslan, A.N., Eğecioğlu, Ö., Pevzner, P.: A new approach to sequence comparison: normalized sequence alignment. Bioinformatics 17, 327–337 (2001)
12. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Biol. 48, 443–452 (1970)
13. Griffiths-Jones, S., Moxon, S., Marshall, M., Khanna, A., Eddy, S.R., Bateman, A.: Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. 33, D121–D124 (2005)
14. Piccinelli, P., Rosenblad, M.A., Samuelsson, T.: Identification and analysis of ribonuclease P and MRP RNA in a broad range of eukaryotes. Nucleic Acids Res. 33, 4485–4495 (2005)
15. Woodhams, M.D., Stadler, P.F., Penny, D., Collins, L.J.: RNase MRP and the RNA processing cascade in the eukaryotic ancestor. BMC Evol. Biol. 7, S13 (2007)
16. van Zon, A., Mossink, M.H., Scheper, R.J., Sonneveld, P., Wiemer, E.A.C.: The vault complex. Cell. Mol. Life Sci. 60, 1828–1837 (2003)
17. van Zon, A., Mossink, M.H., Schoester, M., Scheffer, G.L., Scheper, R.J., Sonneveld, P., Wiemer, E.A.C.: Multiple human vault RNAs. J. Biol. Chem. 276, 37715–37721 (2001)
18. Kickhoefer, V.A., Searles, R.P., Kedersha, N.L., Garber, M.E., Johnson, D.L., Rome, L.H.: Vault ribonucleoprotein particles from rat and bullfrog contain a related small RNA that is transcribed by RNA polymerase III. J. Biol. Chem. 268, 7868–7873 (1993)
19. Vilalta, A., Kickhoefer, V.A., Rome, L.H., Johnson, D.L.: The rat vault RNA gene contains a unique RNA polymerase III promoter composed of both external and internal elements that function synergistically. J. Biol. Chem. 269, 29752–29759 (1994)
20. Kickhoefer, V.A., Emre, N., Stephen, A.G., Poderycki, M.J., Rome, L.H.: Identification of conserved vault RNA expression elements and a non-expressed mouse vault RNA gene. Gene 309, 65–70 (2003)
21. Chen, J.L., Greider, C.W.: An emerging consensus for telomerase RNA structure. Proc. Natl. Acad. Sci. USA 101, 14683–14684 (2004)
22. Tzfati, Y., Knight, Z., Roy, J., Blackburn, E.H.: A novel pseudoknot element is essential for the action of a yeast telomerase. Genes Dev. 17, 1779–1788 (2003)
Fast Computation of Good Multiple Spaced Seeds

Lucian Ilie¹,⋆ and Silvana Ilie²,⋆⋆

¹ Department of Computer Science, University of Western Ontario, N6A 5B7, London, Ontario, Canada. [email protected]
² Numerical Analysis, Centre for Mathematical Sciences, Lund University, Box 118, SE-221 00 Lund, Sweden. [email protected]
Abstract. Homology search finds similar segments between two biological sequences, such as DNA or protein sequences. A significant fraction of the computing power in the world is dedicated to performing such tasks. The introduction of optimal spaced seeds by Ma et al. has increased both the sensitivity and the speed of homology search, and they have been adopted by many alignment programs such as BLAST. With the further improvement provided by the multiple spaced seeds of PatternHunterII, the sensitivity of dynamic programming is approached at BLASTn speed. Whereas computing optimal multiple spaced seeds was proved to be NP-hard, we show that, from a practical point of view, computing good ones can be very efficient. We give a simple heuristic algorithm which computes good multiple seeds in polynomial time; computing sensitivity is not required. When allowing the computation of sensitivity for a few seeds, we obtain better multiple seeds than previous ones in much shorter time.

Keywords: homology search, multiple spaced seeds, sensitivity, string overlaps, PatternHunterII, BLAST.
1 Introduction
Homology search finds similar segments between two biological sequences, such as DNA or protein sequences. A significant fraction of computing power in the world is dedicated to performing such tasks. The increase in genomic data is quickly outgrowing computer advances and hence better mathematical solutions are required. As the classical dynamic programming techniques of [21,26] became overwhelmed by the task, popular programs such as FASTA [17] and BLAST [1] used heuristic algorithms. BLAST used a filtration technique in which positions with short consecutive matches, or hits, were identified first and then extended into local alignments. Speed was traded for sensitivity since longer initial matches
⋆ Research partially supported by NSERC. ⋆⋆ Supported by a postdoctoral fellowship from the Natural Sciences and Engineering Research Council of Canada.

missed many local alignments, hence decreasing sensitivity, whereas short initial matches produced too many hits, thus decreasing speed. A breakthrough came with PatternHunter [20], where the hits are no longer required to consist of consecutive matches. Precisely, PatternHunter looks for runs of 18 consecutive nucleotides in each sequence such that only those positions specified by 1's in the string 111*1**1*1**11*111 are required to match. Such a string is called a spaced seed, and the number of 1's in it is its weight. In this terminology, BLAST requires a hit according to a consecutive seed such as 11111111111. The filtration principle had been used before in approximate string matching [5,10,24], but the important novelty of PatternHunter was the use of optimal spaced seeds, that is, spaced seeds that have optimal sensitivity. Impressively, the approach of PatternHunter increases both speed and sensitivity. The idea has since been adopted by the new versions of BLAST (MegaBLAST, BLASTZ) and by other software programs [3,12,23]. As noticed in [20], multiple spaced seeds (sets of seeds that hit whenever one of the components does so) are better¹, and with their introduction in PatternHunterII [16], the sensitivity of dynamic programming is approached whereas the speed is that of BLASTn.

The sensitivity of a seed is the probability of hitting a random region of a given length. For two spaced seeds of the same weight, the expected number of hits is the same, but their sensitivities need not be: the hits of one seed may appear more clustered than those of the other. A good intuitive example is searching for the strings aaa and abc in a random text. For each occurrence of aaa, the chance of having another one sharing two letters with it is 1/26, whereas starting afresh would require (1/26)³. Therefore, the occurrences of abc are more evenly distributed, and it is more likely to see an abc first in the text.

Quite a few papers have been written about spaced seeds: evaluating the advantages of spaced seeds over consecutive ones [4,6,11,18], showing that the relevant computational problems are NP-hard [16,18], giving exact (exponential) algorithms for computing sensitivity [4,6,7,11,16], polynomial-time approximation schemes [18] or heuristic algorithms [7,9,13,16,25,29], adapting the seeds to more specific biological tasks [3,14,23,27], or building models to understand the mechanism that makes spaced seeds powerful [4,25,27].

Finding optimal (multiple) spaced seeds by exhaustive search involves two exponential-time steps: (i) there are exponentially many seeds to be tried, and (ii) computing the sensitivity of each takes exponential time as well. Several approaches [4,11,16] tried to deal with the latter exponential by approximating the sensitivity. For the former, the number of seeds to be considered has been reduced by various heuristics [7,13,25], but it remained exponential.

In [9], a new approach was given for the case of one seed, based on the overlaps between its hits. We use here a similar idea for general-purpose multiple seeds. A new measure, overlap complexity, is introduced and shown to be experimentally well correlated with sensitivity. It is easily computable, in low polynomial time, which takes care of (ii) above. Then, in order to avoid (i), we give a simple algorithm which quickly improves the overlap complexity of an initial multiple seed, thus providing a good multiple seed in polynomial time. Note that computing sensitivity is not required, which, on the one hand, allows polynomial-time computation of good multiple seeds and, on the other hand, opens the possibility of computing long seeds, beyond the size for which sensitivity can be computed. When combining our algorithm with the computation of sensitivity for a few seeds, we manage to find better multiple seeds than previous ones in much shorter time.

A number of problems remain to be investigated, such as proving guarantees about the correlation between overlap complexity and sensitivity, and establishing the approximation ratio and exact running time of our heuristic algorithm for approximating the overlap complexity. However, such problems may be mostly of theoretical interest, as in practice our algorithms produce very good seeds in very short time.

The paper is organized as follows. The next section formally introduces multiple spaced seeds and all the concepts needed later. Our new measure is introduced in Section 3. Uniformly spaced seeds are known to have the lowest sensitivity; in Section 4 we prove that multiple uniformly spaced seeds have the lowest overlap complexity. Numerical results are recalled in Section 5 to show the good correlation between overlap complexity and sensitivity. Our polynomial-time algorithm for computing good multiple seeds is given in Section 6. Better seeds than previous ones are computed in Section 7. We conclude with a brief discussion in Section 8.

¹ For instance, the optimal weight-11 spaced seed [20] has sensitivity 0.467122 for similarity level 70%, whereas the multiple spaced seed of the same weight used in PatternHunterII [16] has sensitivity 0.924114 for the same similarity level.
2 Spaced Seeds
We start with some basic definitions. An alphabet is a finite nonempty set, denoted by A. The set of finite strings over A is denoted by A*. For a string x ∈ A*, the length of x is denoted by |x|. For 1 ≤ i ≤ |x|, the i-th letter of x is denoted by x[i]. If u = xy, for some x, y ∈ A*, then x (resp. y) is called a prefix (resp. suffix) of u. For two strings u and v, an overlap between u and v is any string that is both a suffix of u and a prefix of v. A spaced seed is any² string over the alphabet {1, *}; 1 stands for a 'match' and * for a 'don't care' position. For a seed s, the length of s is ℓ(s) = |s| and the weight of s is w(s), the number of 1's in s. A multiple spaced seed S is any finite nonempty set of spaced seeds. Given two DNA sequences and a seed s, we say that s simultaneously matches (hits) the two sequences at given positions if each 1 in s corresponds to a match between the corresponding nucleotides in the two sequences; see Fig. 1 for an example using PatternHunter's seed.

DNA sequence                    A C G A G G C A C T G T A T G T A T A T C T A
DNA sequence                    A G T A G G C A A T G C A T T T A A A T C T C
matches/mismatches (Bernoulli)  1 0 0 1 1 1 1 1 0 1 1 0 1 1 0 1 1 0 1 1 1 1 0
spaced seed                           1 1 1 * 1 * * 1 * 1 * * 1 1 * 1 1 1

Fig. 1. An example of a hit using PatternHunter's spaced seed

The above process can be reformulated as follows. Assume there are two DNA sequences S1 and S2 such that the events that they are identical at any given position are jointly independent and each event has probability p, called the similarity level. The sequence of equalities/inequalities between the two DNA sequences then translates into a sequence R of 1's and 0's, corresponding to matches and mismatches, which appear with probability p and 1 − p, respectively. Therefore, given an (infinite) Bernoulli random sequence R and a seed s, we say that s hits R (ending) at position k if aligning the end of s with position k of R causes all 1's in s to align with 1's in R; see Fig. 1. The sensitivity of a seed s is the probability that s hits R at or before position n. Note that the sensitivity depends on both the similarity level p and the length n of the random region. A multiple spaced seed hits a sequence R if and only if one of its seeds hits R; its sensitivity is defined in the same way. In the light of the tradeoff between search speed and sensitivity, it makes sense to consider multiple seeds in which all seeds have the same weight (they may have different lengths).

² From a biological point of view, only strings starting and ending with 1 are spaced seeds. The seeds we shall eventually compute satisfy this condition.
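Sensitivity as just defined can be computed exactly, in time exponential in the seed length, by a small dynamic program over the last L − 1 bits of R. The following Python sketch is ours (it is not one of the published algorithms [4,6,11]) and is practical only for short seeds.

from collections import defaultdict

def hits(seed, window):
    """A seed hits a 0/1 window of equal length iff every '1' meets a 1."""
    return all(b == 1 for c, b in zip(seed, window) if c == '1')

def sensitivity(seeds, n, p):
    """Probability that at least one seed hits a Bernoulli(p) region at or
    before position n; states are the last L-1 bits of no-hit histories."""
    L = max(len(s) for s in seeds)
    dist = {(): 1.0}                    # probability mass of no-hit suffixes
    for _ in range(n):
        new = defaultdict(float)
        for suf, prob in dist.items():
            for bit, q in ((1, p), (0, 1.0 - p)):
                ext = suf + (bit,)
                if any(len(ext) >= len(s) and hits(s, ext[-len(s):])
                       for s in seeds):
                    continue            # a hit: mass leaves the no-hit set
                new[ext[-(L - 1):]] += prob * q
        dist = new
    return 1.0 - sum(dist.values())

For example, sensitivity(["111*1**1*1**11*111"], 64, 0.7) should reproduce, up to floating point, the value 0.467122 quoted in the footnote of the Introduction for the optimal weight-11 seed.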
3 Overlap Complexity
The hits of a seed can overlap, but overlapping hits detect a single local alignment. Since the expected number of hits is the same for all seeds of a given weight, the sensitivity of a seed is inversely related to the number of overlapping hits; good seeds should therefore have a low number of overlapping hits. The definitive proof that (non-uniformly) spaced seeds are better than consecutive seeds, due to [18], involves estimating the expected number of non-overlapping hits. However, computing this number in general is as difficult as computing sensitivity. Therefore, we look here for simpler ways to detect low numbers of overlapping hits. We shall define a measure that is independent of the similarity level p.

Counting overlaps can be done in several ways, and the most appropriate for our purpose is as follows; details of a similar choice for single seeds are given in [9]. Consider two seeds s_1 and s_2 and denote by σ[i] the number of pairs of 1's aligned together when a copy of s_2 shifted by i positions is aligned against s_1. The shift i takes values from 1 − |s_2| to |s_1| − 1; a negative shift means s_2 starts first. Precisely, if we denote t_1 = *^{|s_2|-1} s_1 *^{|s_2|-1} and t_{2,i} = *^{|s_2|-1+i} s_2 *^{|s_1|-i-1}, then σ[i] = card{ j | 1 ≤ j ≤ |s_1| + 2|s_2| − 2, t_1[j] = t_{2,i}[j] = 1 }. The overlap complexity of two seeds is

OC(s_1, s_2) = \sum_{i=1-|s_2|}^{|s_1|-1} 2^{\sigma[i]}

An example is shown in Fig. 2.

shift i    −3   −2   −1    0    1    2    3    4    5    6
σ[i]        1    2    1    1    2    1    1    2    0    1

Fig. 2. Overlap complexity of two seeds: OC(11**1*1, 1*11) = \sum_{i=-3}^{6} 2^{\sigma[i]} = 25

For a multiple seed S = {s_1, s_2, ..., s_k}, the overlap complexity is defined as an extension of the two-seed case, that is,

OC(S) = \sum_{1 \le i \le j \le k} OC(s_i, s_j)

This definition is a generalization of the one in [9], where single seeds are considered. The measure of [9] counts only (strictly) positive shifts and, if we denote it by oc(s) for a seed s, then the relation between the two measures is given by OC({s}) = OC(s, s) = 2 oc(s) + 2^{w(s)}. This means the order induced by the two measures on single seeds of a given weight is the same, and hence the results of [9] apply here as well. Note that the overlap complexity is invariant with respect to the order of the seeds and to reversal (assuming all seeds are reversed simultaneously). This is expected of any measure well correlated with sensitivity.
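In code, the definitions above amount to the following direct Python transcription, checked against the example of Fig. 2.

def sigma(s1, s2, i):
    """Number of aligned pairs of 1's when s2, shifted by i, is laid
    against s1 (0-based; position j of s2 faces position j + i of s1)."""
    return sum(1 for j, c in enumerate(s2)
               if c == '1' and 0 <= j + i < len(s1) and s1[j + i] == '1')

def OC(s1, s2):
    """Overlap complexity of two seeds: sum of 2^sigma[i] over all shifts."""
    return sum(2 ** sigma(s1, s2, i) for i in range(1 - len(s2), len(s1)))

def OC_multiple(seeds):
    """Overlap complexity of a multiple seed: sum over all pairs i <= j."""
    return sum(OC(seeds[i], seeds[j])
               for i in range(len(seeds)) for j in range(i, len(seeds)))

assert OC("11**1*1", "1*11") == 25   # the value computed in Fig. 2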
4 Multiple Uniformly Spaced Seeds
The spaced seed of PatternHunter clearly outperformed the former consecutive seeds, but whether spaced seeds are always better than consecutive ones required investigation. In fact, it is not true for all spaced seeds: uniformly spaced seeds, i.e., seeds of the form u_{g,r} = *^r (1 *^g)^w *^{ℓ - r - w(g+1)}, for g ≥ 0 and 0 ≤ r ≤ ℓ − w(g+1), are not better. Note that consecutive seeds are a particular case of this definition for g = 0. For the remaining ones, indication of their superiority was given in [6,11], and [18] gave a rigorous proof. As shown in [9], uniformly spaced seeds (consecutive ones included) are the worst with respect to overlap complexity. (Here "worst" means of highest overlap complexity.) Moreover, they are the only ones with this property. It seems natural to define multiple uniformly spaced seeds as sets of spaced seeds that share the same w and g. Then OC(s_1, s_2) is highest among all multiple spaced seeds of a given weight when s_1 = u_{g,r_1} and s_2 = u_{g,r_2}, for some g and r_1, r_2. Therefore, for a fixed w, among all multiple spaced seeds S that contain only seeds of weight w, OC(S) is highest for uniformly spaced ones.
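For completeness, generating a uniformly spaced seed is a one-line transcription of the definition (our helper, with l, w, g, r as above):

def uniform_seed(l, w, g, r):
    """u_{g,r}: r leading *'s, then w 1's each followed by g *'s, padded
    with *'s to total length l (requires 0 <= r <= l - w*(g+1))."""
    return '*' * r + ('1' + '*' * g) * w + '*' * (l - r - w * (g + 1))

For instance, uniform_seed(11, 11, 0, 0) yields the consecutive seed 11111111111 used by BLAST.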
5 Correlation Between Overlap Complexity and Sensitivity
We recall some numerical data from [9] which experimentally show the good correlation between overlap complexity and sensitivity. Essentially, top sensitivity seeds have top (lowest!) overlap complexity and vice versa. For multiple spaced seeds such a comparison cannot be done, since optimal multiple spaced seeds cannot be computed. Consider the top seeds computed by Choi, Zeng, and Zhang in [7]. While we do not have the space to show all the data, we mention only that all top seeds in the sensitivity ranking are at the top, or very close to it, of the ranking induced by overlap complexity. In all cases, at least one sensitivity-optimal seed tops the overlap ranking.

The opposite comparison, even more important from our point of view, is shown in Table 1. We give only the mean statistics. The second and third columns contain the mean and standard deviation, respectively, of the difference between the optimal sensitivity S_optimal (taken from [7]) and the sensitivity S_best overlap of the seed with the best overlap complexity for the same weight and length. (In fact, almost all differences are zero and the remaining ones are negligible, which proves a remarkable correlation between the two measures.) The data are computed for the weight range 9 to 18; the optimal sensitivity for higher weights is unknown. The last two columns will be discussed at the end of the next section.

Table 1. Heuristic versus optimal sensitivity for weights 9 to 18 and length of random region 64

              S_optimal − S_best overlap     S_optimal − S_heuristic
similarity    mean       std. dev.           mean       std. dev.
65%           0.000024   0.000043            0.000152   0.000217
70%           0.000033   0.000098            0.000454   0.000492
75%           0.000029   0.000061            0.000812   0.000810
80%           0.000089   0.000148            0.001153   0.001077
85%           0.000383   0.000809            0.001309   0.001449
90%           0.000270   0.000644            0.000707   0.001137
6 A Polynomial Time Algorithm
The exact algorithms for computing sensitivity are all exponential, see [4,6,11], which is expected since the problem is NP-hard [18]. Some are nevertheless better than others. The one of [6] runs in time O(nℓ2^{2(ℓ-w)}) for seeds of length ℓ and weight w, which makes it the slowest. The other two have running times O(nℓ²2^{ℓ-w}) for [11] and, a bit less, O(nℓw2^{ℓ-w}) for [4]. Therefore, finding optimal seeds by trying all seeds of a given weight (and length) and selecting the best is computationally very expensive. In fact, it has been shown by [16] to be NP-hard for an arbitrary distribution. (On the other hand, [18] explains why finding an optimal seed in a uniform distribution is probably not NP-hard; we refer to [18] for details.) Choi, Zeng and Zhang [7] used their O(nℓ2^{2(ℓ-w)})-time algorithm for sensitivity to find optimal seeds. The total complexity of the exact algorithm is then O(\binom{ℓ}{w} nℓ2^{2(ℓ-w)}). They noticed that the sensitivity for a random region of length n = 2ℓ is a good indicator of the target sensitivity for n = 64 (as indicated by [20]). Therefore, they also have a heuristic algorithm of complexity O(\binom{ℓ}{w} ℓ²2^{2(ℓ-w)}).
A much faster heuristic algorithm has been proposed by Preparata et al. [25], where heuristics derived from a complicated analysis of a probability leakage model are used. The complexity of their algorithm is not explicitly given but, from the description of the tests performed, it seems to be something like O(2^w), in addition to computing the exact sensitivity for a few seeds. A couple of other heuristics are presented in [29] and [13]. As with our approach, they find measures that are well correlated with sensitivity, but they still need to consider exponentially many seeds. For multiple seeds, Li et al. [16] gave a dynamic programming algorithm that runs in time O((k + L + n) \sum_{i=1}^{k} \ell_i 2^{\ell_i - w}), where k is the number of seeds, the ℓ_i are the lengths of the seeds, and L = max_{1≤i≤k} ℓ_i.

The heuristic algorithm we derive from our overlap complexity is very simple: compute the seed with the lowest overlap complexity. This simple heuristic for single seeds provided better results than those of [25]; see [9]. It may have been surprising that the simple algorithm based on overlap complexity gives good results, but the results in this section are presumably even less expected. As we replaced sensitivity by the apparently well-correlated overlap complexity, we got rid of the second of the two exponentials mentioned in the Introduction. Still, we would need to consider all seeds of a given weight and length, and there are exponentially many. To reduce the complexity of this step, we start with a fixed set of seeds and repeatedly modify it to improve its overlap complexity. Each improvement consists of swapping a 1 with a *, as long as the overlap complexity improves; we greedily choose a swap that produces the greatest improvement. To obtain a polynomial-time algorithm, we bound the number of such swaps by the weight of the seeds. We shall say that 1 flipped is * and vice versa. For a seed s and two positions i, j, we denote by flip(s, i, j) the seed obtained from s by flipping the letters in positions i and j. For instance, flip(1*11*11, 3, 5) = 1**1111. If S = {s_1, ..., s_k}, denote flip(S, s_r, i, j) = {s_1, ..., s_{r-1}, flip(s_r, i, j), s_{r+1}, ..., s_k}. With this notation, the algorithm swap is described in Fig. 3.

swap(w, ℓ_1, ℓ_2, ..., ℓ_k)
   given: the weight w and lengths ℓ_1, ℓ_2, ..., ℓ_k
   returns: S = {s_1, ..., s_k} of weight w, lengths |s_i| = ℓ_i, and low overlap complexity
1. for i from 1 to k do s_i ← *^{ℓ_i - w} 1^w
2. swaps ← 0
3. while ∃ r, i, j with OC(flip(S, s_r, i, j)) < OC(S) and (swaps ≤ k × w) do
4.    choose a triple (r, i, j) that reduces OC(S) the most
5.    S ← flip(S, s_r, i, j)
6.    swaps ← swaps + 1
7. return(S)

Fig. 3. The swap algorithm which, given the weight and lengths of the seeds, computes a multiple seed with low overlap complexity and, therefore, high sensitivity
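Putting the pieces together, the swap algorithm of Fig. 3 takes only a few lines of Python, reusing OC_multiple from the sketch in Section 3. This is our own transcription; ties in step 4 may be broken differently than in the authors' implementation, so the exact seeds produced can vary.

def flip(seed, i, j):
    """Exchange the letters at 0-based positions i and j of a seed."""
    s = list(seed)
    s[i], s[j] = s[j], s[i]
    return ''.join(s)

def swap(w, lengths):
    """Greedy swap heuristic of Fig. 3: start from *^(l-w) 1^w and repeatedly
    apply the single 1/* exchange that lowers OC(S) the most."""
    seeds = ['*' * (l - w) + '1' * w for l in lengths]
    k = len(seeds)
    for _ in range(k * w):                       # bound on the number of swaps
        best, best_oc = None, OC_multiple(seeds)
        for r in range(k):
            for i, ci in enumerate(seeds[r]):
                if ci != '1':
                    continue
                for j, cj in enumerate(seeds[r]):
                    if cj != '*':
                        continue
                    cand = seeds[:]
                    cand[r] = flip(cand[r], i, j)
                    oc = OC_multiple(cand)
                    if oc < best_oc:
                        best, best_oc = cand, oc
        if best is None:                         # no swap improves OC(S)
            break
        seeds = best
    return seeds

On a single seed, swap(11, [18]) should reach PatternHunter's seed 111*1**1*1**11*111 after four swaps, as traced in Fig. 4 (modulo tie-breaking).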
intermediate seed       pair swapped
*******11111111111
1******1111*111111      (1, 12)
1*1****1111*11*111      (3, 15)
111****1*11*11*111      (2, 9)
111*1**1*1**11*111      (5, 11)

Fig. 4. Intermediate seeds computed by swap(11, 18) to find PatternHunter's seed 111*1**1*1**11*111; the pair of positions flipped at each step is listed on the right.
Remarkably, PatternHunter's seed is obtained by performing only 4 swaps in the algorithm swap(11, 18); see Fig. 4. This could be done by hand! To perform a swap, all possibilities for the triple (r, i, j) in step 3 are considered, that is, \sum_{i=1}^{k} w(\ell_i - w) many. For each, we compute the new overlap complexity in O(\ell_r \sum_{i=1}^{k} \ell_i) time. (This is because the overlap complexity of two seeds is computed in time proportional to the product of their lengths, and here we need only update the pairs containing the seed s_r.) If we set L = max_{1≤i≤k} ℓ_i, then the total time complexity of the swap algorithm is O(k³L²w²(L − w)). If we assume that, in practice, k is bounded and L is linear in w, then this becomes O(w⁵). That makes it possible to compute multiple seeds long enough for any practical purpose.

For single seeds, the algorithm in [9] is slightly different. It starts with the reverse of our seed in step 1 and, because it handles a single seed, it affords swapping first two positions and then four. However, it is still very simple, and it is unexpected that it obtains better seeds than those of [25]; see [9]. The last two columns of Table 1 give the statistics of the difference between the sensitivity of single seeds computed this way and the optimum.
7 Better Multiple Seeds
We have seen above that our swap algorithm gives good results in spite of its very low running time. The only drawback of the overlap complexity is that it works only within a fixed weight and length. Therefore, in this section we allow the computation of sensitivity for a few seeds, which, time-wise, we can afford, so that we can compare multiple seeds across lengths. The algorithms are then no longer polynomial, but still much faster than previous ones, and in this case we are always able to obtain better seeds. All sensitivities computed below are for length of the random region 64.

We compare first with the results of Yang et al. [29]; the results are shown in Table 2. Our seeds are always better. An intricate heuristic of Kong [13], based on generalized correlation functions, is next; we consider only the top seeds given in [13]. In Tables 3 and 4, we give first the seeds of [13] and then two sets computed by us. The first is for the same weight and length, whereas for the second we choose the best combination of lengths from an interval. For each length set only swap is used, but the results are compared using sensitivity for the similarity levels shown in the tables.
Table 2. Comparison with the multiple seeds of Yang et al. [29] for weight 12

                                           sensitivity for a similarity level
                                           60%       65%       70%       75%       80%
Yang et al. [29], all lengths 19           0.287255  0.500277  0.727770  0.897822  0.977895
Yang et al. [29], lengths in 16..19        0.282883  0.494700  0.723226  0.895810  0.977596
ours, same weight, all lengths 19          0.291278  0.505876  0.733351  0.901464  0.979235
ours, same weight, all lengths 24          0.305564  0.527296  0.754333  0.913957  0.983149
ours, same weight, lengths from 16..19     0.294044  0.509942  0.737806  0.904716  0.980569
Table 3. Comparison with the multiple seeds of Kong [13], for weight 9 and length 15

                                multiple seeds              sensitivity
Kong [13], similarity 50%       { 111***1*11**111,          50%: 0.185211   60%: 0.549139
                                  11*111***1*1*11,          70%: 0.900644   80%: 0.996495
                                  1*11*1*1**111*1 }
ours, same weight and length    { 111*1*1**1**111,          50%: 0.183314   60%: 0.544791
                                  11*1**11**111*1,          70%: 0.897741   80%: 0.996226
                                  111*1***1*11*11 }
ours, same weight, lengths      { 111*11*1**111,            50%: 0.190035   60%: 0.563903
from the interval 13..23          11*1****11**1*111,        70%: 0.910949   80%: 0.997357
                                  111***1**1****1*1*11 }
Kong [13], similarity 75%       { 111**11***1*111,          65%: 0.745150   70%: 0.895981
                                  11*11**1*1*1*11,          75%: 0.972460   80%: 0.996091
                                  1*11*1*11**11*1 }
ours, same weight and length    { 111*1*1**1**111,          65%: 0.747975   70%: 0.897741
                                  11*1**11**111*1,          75%: 0.973134   80%: 0.996226
                                  111*1***1*11*11 }
ours, same weight, lengths      { 111*11*1**111,            65%: 0.767413   70%: 0.910949
from the interval 13..23          11*1****11**1*111,        75%: 0.978558   80%: 0.997357
                                  111***1**1****1*1*11 }
The heuristic of [13] is much more complicated than ours, but the seeds for the same weight and length are comparable with ours (recall that ours are obtained using an approximation of the overlap complexity heuristic). When a length interval is considered, our seeds are always better.

The most difficult test is comparing with the multiple seeds of Li et al. [16], whose sensitivities were kindly provided by the authors [15,19]. The multiple seed of 16 weight-11 seeds in [16] took 12 days to compute greedily, that is, assuming the first i seeds are given, the (i + 1)-st seed is selected by exhaustive search in a length interval so that it maximizes the sensitivity of all i + 1 seeds; its sensitivities are given in the first line of Table 6.
Table 4. Comparison with the multiple seeds of Kong [13], weight 11 and length 18

                                multiple seeds                  sensitivity
Kong [13], for similarity 50%   { 11*1***111*11*1*11,           50%: 0.038393   60%: 0.210134
                                  11*1111***1**1*111 }          70%: 0.613583   80%: 0.947737
ours, same weight and length    { 11*11*11*1*1***111,           50%: 0.038157   60%: 0.209031
                                  1111***1*1**111*11 }          70%: 0.611758   80%: 0.947030
ours, same weight, lengths      { 11*1*111**1*1111,             50%: 0.038597   60%: 0.211797
from 16..23                       111*1***11**1**1*111 }        70%: 0.617672   80%: 0.949583
Kong [13], for similarity 75%   { 11*111***1*11*1*11,           65%: 0.385600   70%: 0.608639
                                  1111**11**1**1*111 }          75%: 0.815865   80%: 0.945525
ours, same weight and length    { 11*111**1*1***111*1,          65%: 0.389571   70%: 0.614122
                                  111***1*11**11*1*11 }         75%: 0.820824   80%: 0.948132
ours, same weight, lengths      { 111**1*1*1**11*111,           65%: 0.394900   70%: 0.621924
from 16..23                       111*1****11****1**1*111 }     75%: 0.828010   80%: 0.951880
Table 5. Comparison with the 4 weight-11 seeds of Li et al. [16]

                            multiple seeds                      sensitivity
Li et al. [16],             { 111*1**1*1**11*111,               65%: 0.533728   70%: 0.754809
4 seeds of weight 11          1111**11**1*1****1*11,            75%: 0.912346   80%: 0.982355
                              11*1****11***1*1*1111,
                              111*111*1***1111 }
ours, same weight,          { 1111**11*1*1*111,                 65%: 0.538731   70%: 0.759849
lengths from 16..24           1111**1*1***1**1*111,             75%: 0.915409   80%: 0.983326
                              11*1*1****11*1*11**11,
                              111*1****1***1***1**1111 }
We first computed a multiple seed with the same weight and number of seeds by considering only seeds of equal length, between 16 and 24. That is, we applied swap 9 times and computed the sensitivity of 9 multiple seeds (for similarity 70%). It took only 7 minutes on a 1.80 GHz laptop to compute a multiple seed which is practically just as good; see the second line of Table 6. Then, we spent a couple of hours computing a better multiple seed; see the third line of Table 6. We first computed the optimal length set for the first four seeds. The seeds computed by swap for these lengths are shown in Table 5 and compared with the first four seeds of [16]. Then the length set is extended by greedily adding one best length at a time from the interval 16 to 24; different lengths are compared by computing the sensitivity for similarity level 70%. The fourth line of Table 6 gives the sensitivities for a set of 15 seeds we computed as a by-product of the previous ones; having the length set, we just applied swap. Note that the last three values are higher than the ones of [16]. Finally, we took the challenge further and computed, in half a day, a multiple seed for weight 11 which consists of 32 spaced seeds. The sensitivities, shown in the last line of Table 6, are significantly higher than all the other corresponding ones in the same table.
Table 6. Comparison with the 16 weight-11 seeds of Li et al. [16]

                                              sensitivity for a similarity level
                                              60%       65%       70%       75%       80%
Li et al. [16], 16 seeds of weight 11         0.566640  0.781508  0.924114  0.984289  0.998449
ours, same weight, equal length, 16 seeds     0.565971  0.780308  0.922707  0.983500  0.998265
ours, same weight, better 16 seeds            0.575998  0.790088  0.929016  0.985849  0.998676
ours, same weight, 15 seeds                   0.564483  0.780967  0.924398  0.984530  0.998504
ours, same weight, 32 seeds                   0.695484  0.874409  0.966406  0.995014  0.999679
The implementation of swap is straightforward, and we used an unoptimized implementation of the dynamic programming algorithm of [16] for computing sensitivity. The focus here is on fast algorithms, not on efficient implementation.
8 Conclusion and Further Research
The introduction of optimal spaced seeds in [20], followed by multiple spaced seeds in [16], revolutionized homology search. It is therefore important to compute good multiple spaced seeds fast. The optimal ones are hard to compute, and people have been looking for faster ways of finding less-than-optimal but still good seeds. Our approach above is faster and better than previous ones, as shown by comparing our results with the best previous ones.

While our experimental results are very good, the theory to support them needs development. Open problems include proving guarantees for the correlation between overlap complexity and sensitivity, finding bounds on the approximation ratio of our heuristic algorithm, and approximating the number of swaps needed. (The bound we set for the number of swaps in the algorithm was never reached in practice.) On the one hand, these theoretical questions are probably difficult to solve and they are not essential for the practical aspect of our study; on the other hand, they may bring new ideas to further improve our approach.

From a practical point of view, the best way of using the overlap complexity is not clear yet and should be further investigated. Recalling that it does not require the computation of sensitivity for producing good multiple seeds, we may try, as done in [9] for single seeds, to compute longer good multiple spaced seeds, beyond the size for which sensitivity can be computed. The expected improvement of better seeds on homology search should also be investigated.

Acknowledgements. We would like to thank Ming Li [15] and Bin Ma [19] for kindly providing the details of their computations in [16].
References

1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.: Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990)
2. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.: Gapped Blast and Psi-Blast: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997)
3. Brejova, B., Brown, D., Vinar, T.: Optimal spaced seeds for homologous coding regions. J. Bioinf. and Comput. Biol. 1, 595–610 (2004)
4. Buhler, J., Keich, U., Sun, Y.: Designing seeds for similarity search in genomic DNA. In: Proc. of RECOMB'03, pp. 67–75. ACM Press, New York (2003)
5. Burkhardt, S., Kärkkäinen, J.: Better filtering with gapped q-grams. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 73–85. Springer, Heidelberg (2001)
6. Choi, K.P., Zhang, L.: Sensitivity analysis and efficient method for identifying optimal spaced seeds. J. Comput. Sys. Sci. 68, 22–40 (2004)
7. Choi, K.P., Zeng, F., Zhang, L.: Good spaced seeds for homology search. Bioinformatics 20, 1053–1059 (2004)
8. Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. Johns Hopkins Univ. Press, Baltimore (1996)
9. Ilie, L., Ilie, S.: Long spaced seeds for finding similarities between biological sequences. In: Proc. of BIOCOMP'07 (to appear)
10. Karp, R., Rabin, M.O.: Efficient randomized pattern-matching algorithms. IBM J. Res. Develop. 31, 249–260 (1987)
11. Keich, U., Li, M., Ma, B., Tromp, J.: On spaced seeds for similarity search. Discrete Appl. Math. 3, 253–263 (2004)
12. Kisman, D., Li, M., Ma, B., Wang, L.: tPatternHunter: gapped, fast and sensitive translated homology search. Bioinformatics 21, 542–544 (2005)
13. Kong, Y.: Generalized correlation functions and their applications in selection of optimal multiple spaced seeds for homology search. J. Comput. Biol. (to appear)
14. Kucherov, G., Noé, L., Ponty, Y.: Estimating seed sensitivity on homogeneous alignments. In: Proc. of BIBE'04, Taiwan, pp. 387–394 (2004)
15. Li, M.: personal communication
16. Li, M., Ma, B., Kisman, D., Tromp, J.: PatternHunter II: highly sensitive and fast homology search. J. Bioinformatics and Comput. Biol. 2, 417–440 (2004)
17. Lipman, D.J., Pearson, W.R.: Rapid and sensitive protein similarity searches. Science 227, 1435–1441 (1985)
18. Li, M., Ma, B., Zhang, L.: Superiority and complexity of spaced seeds. In: Proc. of SODA'06, pp. 444–453. SIAM (2006)
19. Ma, B.: personal communication
20. Ma, B., Tromp, J., Li, M.: PatternHunter: faster and more sensitive homology search. Bioinformatics 18, 440–445 (2002)
21. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453 (1970)
22. Ning, Z., Cox, A.J., Mullikin, J.C.: SSAHA: a fast search method for large DNA databases. Genome Res. 11, 1725–1729 (2001)
23. Noé, L., Kucherov, G.: YASS: enhancing the sensitivity of DNA similarity search. Nucleic Acids Res. 33, 540–543 (2005)
24. Pevzner, P., Waterman, M.S.: Multiple filtration and approximate pattern matching. Algorithmica 13, 135–154 (1995)
25. Preparata, F.P., Zhang, L., Choi, K.P.: Quick, practical selection of effective seeds for homology search. J. Comput. Biol. 12, 1137–1152 (2005)
26. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981)
27. Sun, Y., Buhler, J.: Designing multiple simultaneous seeds for DNA similarity search. In: Proc. of RECOMB'04, pp. 76–85. ACM Press, New York (2004)
28. Xu, J., Brown, D., Li, M., Ma, B.: Optimizing multiple spaced seeds for homology search. In: Sahinalp, S.C., Muthukrishnan, S.M., Dogrusoz, U. (eds.) CPM 2004. LNCS, vol. 3109, pp. 47–58. Springer, Heidelberg (2004)
29. Yang, I.-H., Wang, S.-H., Chen, H.-H., Huang, P.-H., Chao, K.-M.: Efficient methods for generating optimal single and multiple spaced seeds. In: Proc. of IEEE 4th Symp. on Bioinformatics and Bioengineering, Taiwan, pp. 411–418. IEEE Computer Society Press, Los Alamitos (2004)
30. Zhang, Z., Schwartz, S., Wagner, L., Miller, W.: A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7, 203–214 (2000)
Inverse Sequence Alignment from Partial Examples

Eagu Kim and John Kececioglu

Department of Computer Science, The University of Arizona, Tucson AZ 85721, USA
{egkim,kece}@cs.arizona.edu
Abstract. When aligning biological sequences, the choice of parameter values for the alignment scoring function is critical. Small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to compute parameter values that are appropriate for biological sequences is inverse parametric sequence alignment. Given a collection of examples of biologically correct alignments, this is the problem of finding parameter values that make the example alignments score close to optimal. We extend prior work on inverse alignment to partial examples and to an improved model based on minimizing the average error of the examples. Experiments on benchmark biological alignments show we can find parameters that generalize across protein families and that boost the recovery rate for multiple sequence alignment by up to 25%.
1 Introduction
A fundamental issue in molecular sequence analysis is deciding what parameter values to use when aligning biological sequences. For example, the standard scoring function for protein sequence alignment requires determining values for 210 substitution scores and two gap penalties. An interesting approach to determining these values is inverse parametric sequence alignment [6,10], where parameters are set using examples of correct alignments. Informally, inverse alignment tries to find parameter values that make the examples be optimal-scoring alignments of their strings. In practice, parameter values rarely exist that make a collection of biological examples optimal, so the problem becomes finding values that make the examples score close to optimal. An important issue is determining what measure of error between the scores of the examples and the scores of optimal alignments should be optimized. Recently, Kececioglu and Kim [8] discovered a new method for inverse alignment based on linear programming that for the first time could quickly find values for all 212 parameters in the standard protein sequence alignment model from hundreds of examples of complete alignments. Their approach minimized the maximum relative error across the examples. In this paper we extend this work in three directions: (1) to examples consisting of partial alignments, which are the type of examples currently available in the standard suites of benchmark protein alignments, and which consist of incomplete sequence alignments; (2) to
an improved error model involving minimization of the average error across the examples; and (3) to experimentally study the performance of parameters learned by inverse alignment in terms of their recovery rate on benchmark alignments.

Related Work. Inverse parametric alignment was introduced in the seminal paper of Gusfield and Stelling [6]. They considered the problem for two parameters and one example, and gave an indirect approach to inverse alignment that attempted to avoid computing a parametric decomposition of the parameter space. Sun, Fernández-Baca and Yu [10] gave the first direct algorithm for inverse alignment for the case of three parameters and one example; given two strings of length n, their algorithm finds parameters that make the example optimal in O(n² log n) time. Kececioglu and Kim [8] gave the first polynomial-time algorithm for arbitrarily-many parameters and examples; their algorithm finds parameters that make the examples score as close to optimal as possible in terms of relative error. As they demonstrated, it is also fast in practice. The authors recently learned that Eppstein [5] independently discovered a general approach to inverse parametric optimization that is similar to [8]. Eppstein applied it in the context of minimum spanning trees, and considered finding parameters that make an example tree the unique optimal solution; in the context of biological sequence alignment, however, this rarely has a solution. Alternate approaches for determining alignment parameters have recently been proposed based on machine learning. Do, Gross and Batzoglou [4] use discriminative training on conditional random fields to find parameter values for a hidden Markov model of sequence alignment. Their approach requires solving a convex numerical optimization problem (which becomes nonconvex in the presence of partial alignments), and does not provide a polynomial-time guarantee on running time. Yu, Joachims, Elber and Pillardy [12] describe a support-vector-machine approach for learning parameters to align a protein sequence to a protein structure. Their approach involves solving a quadratic numerical optimization problem with linear constraints, and for the first time incorporates a measure of alignment recovery directly into the problem formulation. In contrast to these machine learning approaches, our method for inverse alignment uses linear programming (which can be solved quickly even for very large instances), and for the first time we rigorously address partial examples.

Overview. The next section presents several variations of inverse alignment, with relative or absolute error, and with complete or partial examples. Section 3 reduces these variations to linear programming, and develops an iterative approach to partial examples. Finally, Section 4 presents results from experiments on recovering benchmark protein alignments when using learned parameters for both pairwise and multiple sequence alignment.
2 Inverse Alignment and Its Variations
The conventional sequence alignment problem is, given a pair of strings and a scoring function f on alignments, find an alignment A of the strings that has optimal score under f . The inverse alignment problem turns this around: given
an alignment A of a pair of strings, find parameter values for scoring function f that make A an optimal alignment of its strings. To learn parameter values that are useful in practice, this basic form of inverse alignment, which was originally studied in [6,10], must be generalized in several directions. When function f has many parameters, many input alignments A are needed to determine reliable values for the parameters. Accordingly, we consider inverse alignment where the input is a large collection of example alignments. In practice there are usually no parameter values that make the example alignments have optimal score. Consequently, we consider finding parameters that make the examples score near-optimal, and we examine two criteria for measuring the error between the example scores and the optimal alignment scores: minimizing relative error or absolute error. Finally, the type of benchmark alignments that are available in practice for learning parameters actually consist of regions where the alignment is specified, interspersed with stretches where no alignment is specified. We call such an input alignment a partial example, since it is only a partial alignment of its strings. When an example specifies a complete alignment of its strings, we call it a complete example. Our approach to inverse alignment from partial examples builds upon a solution to the problem with complete examples, which we discuss first.

Complete Examples. Inverse alignment from complete examples with arbitrarily many parameters was first considered by Kececioglu and Kim [8]. They examined the relative-error criterion, which we review below. Let f be the alignment scoring function, which gives score f(A) to alignment A. Typically f is a function of several parameters p_1, p_2, . . . , p_t, which assign scores or penalties to various alignment features such as substitutions and gaps. (For example, the standard scoring model for aligning protein sequences has 210 substitution scores for all unordered pairs of amino acids, plus two gap penalties for opening and extending a gap, for a total of t = 212 parameters.) We view the entire set of parameters as a vector p = (p_1, . . . , p_t). When we want to emphasize the dependence of f on its parameters p, we write f_p. The input consists of many example alignments A_i, where each example aligns a corresponding set of strings S_i. (Typically the examples A_i are induced pairwise alignments that come from a structural multiple alignment; in this case, each S_i contains two strings.) For scoring function f and parameters p, we write f_p^*(S_i) for the score of an optimal alignment of strings S_i under f_p. The following definitions assume that an optimal alignment maximizes scoring function f. (The original formulation [8] was in terms of minimizing f.)

Definition 1 (Complete examples under relative error). Inverse Alignment from complete examples under the relative error criterion is the following problem. Let the alignment scoring function be f_p with parameter vector p = (p_1, . . . , p_t) drawn from domain D. The input is a collection of complete alignments A_1, . . . , A_k that respectively align the sets of strings S_1, . . . , S_k. The output is parameter vector x* := argmin_{x ∈ D} E_rel(x), where

    E_rel(x) := max_{1 ≤ i ≤ k} ( f_x^*(S_i) − f_x(A_i) ) / f_x^*(S_i).
In other words, the output vector x* minimizes the maximum relative error of the alignment scores of the examples. While E_rel(x) is not well-defined for f_x^*(S_i) ≤ 0, we will avoid this in Section 3. When scoring function f_p is linear in its parameters p, inverse alignment under relative error can be solved in polynomial time [8], as long as an optimal alignment can be computed in polynomial time for any fixed parameter choice. We review the solution in Section 3, which uses a reduction to linear programming. The above formulation considers the maximum error of the examples because, as we will see later, minimizing the average relative error would lead to an optimization problem with nonlinear constraints. We also consider here a new model: minimizing absolute error. This has the advantage that we can minimize the average error of the examples and still have a formulation that is efficiently solvable by linear programming.

Definition 2 (Complete examples under absolute error). Inverse Alignment from complete examples under the absolute error criterion is the following problem. The input is a collection of complete alignments A_i of strings S_i for 1 ≤ i ≤ k. The output is parameter vector x* := argmin_{x ∈ D} E_abs(x), where

    E_abs(x) := (1/k) Σ_{1 ≤ i ≤ k} ( f_x^*(S_i) − f_x(A_i) ).
Output vector x∗ minimizes the average absolute error of the example scores. A key issue in the above formulations of inverse alignment is that the problem is degenerate. In both formulations, the trivial parameter choice x = (0, 0, . . . , 0) is an optimal solution (as it makes every alignment, including the example, be an optimal alignment). This trivial solution must be ruled out in an applicationspecific manner that depends on the particular form of the alignment problem being considered. Section 3 presents a new approach for avoiding degeneracy that applies to both global and local alignment of protein sequences. Partial Examples. For inverse alignment of protein sequences, the best example alignments that are available come from multiple alignments of protein families that are determined by aligning the three-dimensional structures of family members. Several suites of such benchmark alignments are now available [1] and are widely used for evaluating the accuracy of software for multiple alignment of protein sequences. Most all these benchmark alignments, however, are partial alignments. The benchmark alignment has regions that are reliable and where the alignment is specified, but between these regions the alignment of the strings is effectively left unspecified. These reliable regions are usually the core blocks of the multiple alignment, which are gapless sections of the alignment where structure is conserved across the family. For our purposes a partial example is an alignment A of strings S where each column of A is labeled as being either reliable or unreliable. A complete example is a partial example whose columns are all labeled reliable.
When learning parameters by inverse alignment from partial examples, we treat the unreliable columns as missing information: such columns do not specify the alignment of the strings. Given a partial example A for strings S, a completion Ā of A is a complete example for S that agrees with the reliable columns of A. In other words, a completion Ā can change A on the substrings that are in unreliable columns, but must not alter A in reliable columns. We define inverse alignment from partial examples as the problem of finding the optimal parameter choice over all possible completions of the examples.

Definition 3 (Partial examples). Inverse Alignment from partial examples is the following problem. The input is a collection of partial alignments A_i of strings S_i for 1 ≤ i ≤ k. The output is parameter vector

    x* := argmin_{x ∈ D}  min_{Ā_1, . . . , Ā_k}  E(x),

where error function E is either E_abs or E_rel, measured on the completions Ā_i. In other words, vector x* minimizes the error of the example scores over all completions of the partial examples. In the next section we reduce inverse alignment from complete examples to linear programming, and approach the problem with partial examples by solving a series of problems on complete examples.
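For concreteness, the two error criteria of Definitions 1 and 2 can be computed in a few lines, given the example scores f_x(A_i) and the optimal scores f_x^*(S_i). This small sketch assumes the optimal scores are strictly positive, so that E_rel is well-defined (Section 3 constrains the linear program so this holds).

```python
def relative_error(example_scores, optimal_scores):
    # E_rel(x) = max_i (f*_x(S_i) - f_x(A_i)) / f*_x(S_i);
    # assumes every optimal score is strictly positive.
    return max((opt - ex) / opt
               for ex, opt in zip(example_scores, optimal_scores))

def absolute_error(example_scores, optimal_scores):
    # E_abs(x) = (1/k) * sum_i (f*_x(S_i) - f_x(A_i))
    k = len(example_scores)
    return sum(opt - ex
               for ex, opt in zip(example_scores, optimal_scores)) / k
```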
3 Solution by Linear Programming
When the alignment scoring function f_p is linear in its parameters p, inverse alignment from complete examples under relative error can be reduced to linear programming [8], and a similar reduction applies to absolute error. We define a linear scoring function as follows. Suppose f scores an alignment A by measuring t + 1 features of A through functions g_0, g_1, . . . , g_t, and combines these measures into one score through a weighted sum involving parameter vector p = (p_1, . . . , p_t) by

    f_p(A) := g_0(A) + Σ_{1 ≤ i ≤ t} p_i g_i(A).

Then we say f is linear in parameters p_1, . . . , p_t. For example, in the standard scoring model for alignment of protein sequences, for every unordered pair a, b of amino acids there is a substitution score σ_ab, plus gap penalty γ for opening a gap and penalty λ for extending a gap. This gives a scoring function with 212 parameters σ_ab, γ, λ for the alphabet of 20 amino acids. The functions g_ab count the number of substitutions of each type a, b in A, and functions g_γ and g_λ count the number of gaps and the total length of all gaps in A.

Complete Examples. As described in [8], for inverse alignment from complete examples with relative error, we first consider the problem assuming a fixed upper bound ε on the relative error. For a given bound ε, we test whether there is a feasible solution x with relative error at most ε by solving a linear program. We then find the smallest ε*, to a given accuracy, for which there is a feasible solution, using binary search on ε. The feasible solution x* found at bound ε* is an optimal solution to inverse alignment under relative error.
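As one concrete reading of this feature decomposition (with the constant term g_0 omitted), the sketch below counts the features of the standard 212-parameter protein model from a pairwise alignment given as two gapped rows. The function and variable names are ours, and the sign convention in score is one common choice, not necessarily the authors' exact one.

```python
def alignment_features(row1, row2, gap='-'):
    """Feature counts for a complete pairwise alignment given as two
    equal-length gapped strings: substitution counts g_ab, number of
    gaps opened g_gamma, and total gap length g_lambda."""
    subs = {}                  # unordered pair (a, b) -> count
    opens = 0                  # g_gamma
    length = 0                 # g_lambda
    prev_gap_row = 0           # 0: no gap, 1: gap in row1, 2: gap in row2
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            length += 1
            row = 1 if a == gap else 2
            if row != prev_gap_row:
                opens += 1     # a maximal run of gaps in one row = one gap
            prev_gap_row = row
        else:
            pair = tuple(sorted((a, b)))
            subs[pair] = subs.get(pair, 0) + 1
            prev_gap_row = 0
    return subs, opens, length

def score(row1, row2, sigma, gamma, lam, gap='-'):
    # f_p(A) = sum_ab sigma_ab g_ab(A) - gamma g_gamma(A) - lam g_lambda(A),
    # where gamma and lam are positive penalties subtracted from the score.
    subs, opens, length = alignment_features(row1, row2, gap)
    return (sum(sigma[p] * c for p, c in subs.items())
            - gamma * opens - lam * length)
```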
We briefly summarize this linear programming approach for the standard model of protein sequence alignment. The parameters of the scoring function are the variables of the linear program. The domain D of the parameters is described by the inequalities (−1, 0, 0) ≤ (σ_ab, γ, λ) ≤ (1, 1, 1). When the alignment problem is to maximize the score f of an alignment, substitution scores σ_ab are usually allowed to be both positive and negative, and parameter values can always be rescaled so the largest magnitude hits 1 without changing the alignment problem. We also add the inequalities σ_ab ≤ σ_aa for all a ≠ b, since an identity should score better than any substitution involving that letter. To ensure relative error E_rel(x) is well-defined, we constrain f_x(A_i) ≥ 0 for all examples. For the relative error criterion, the remaining inequalities in the linear program enforce that the relative error of all examples is at most ε. For each example A_i, and every alignment B_i of strings S_i, the linear program has an inequality

    f_x(A_i) ≥ (1 − ε) f_x(B_i).

Notice that for a fixed value of ε, this is a linear inequality in parameters x, since function f_x is linear in x. Example A_i satisfies all these inequalities iff the inequality with B_i = B* is satisfied, where B* is an optimal-scoring alignment of S_i under parameters x. In other words, the inequalities are all satisfied iff the score of A_i has relative error at most ε under f_x. Finding the minimum ε for which this system of inequalities has a feasible solution x corresponds to minimizing the maximum relative error of the example scores. This linear program has an exponential number of inequalities, since for an example A_i there are exponentially-many alignments B_i of S_i (in terms of the lengths of the strings). Nevertheless, this program can be solved in polynomial time using a far-reaching result from linear programming theory. This result, known as the equivalence of optimization and separation [2], states that one can solve a linear program in polynomial time iff one can solve the separation problem for the linear program in polynomial time. The separation problem is, given a possibly infeasible vector x̃ of parameter values, to report an inequality from the linear program that is violated by x̃, or to report that x̃ satisfies the linear program if there is no violated inequality. We can solve the separation problem in polynomial time for the above linear program by the following algorithm. Given a vector x̃ of parameter values, for each example A_i we compute an optimal-scoring alignment B* of S_i under f_x̃. If the above inequality is satisfied when B_i = B*, the inequalities are satisfied for all B_i; and if the above inequality is not satisfied for B*, this gives the requisite violated inequality. For a problem with k examples, solving the separation problem involves computing at most k optimal alignments. In practice, this leads to the following cutting plane algorithm [2] for solving a linear program consisting of inequalities L.

(1) Start with a small subset P of the inequalities in L.
(2) Compute an optimal solution x̃ to the linear program given by subset P. If no such solution exists, halt and report that L is infeasible.
(3) Call the separation algorithm for L on x̃. If the algorithm reports that x̃ satisfies L, output x̃ and halt: x̃ is an optimal solution for L.
(4) Otherwise, add the violated inequality returned by the separation algorithm in Step (3) to P, and loop back to Step (2).

While such cutting plane algorithms are not guaranteed to terminate in polynomial time, they can be fast in practice [8]. For inverse alignment, we start with subset P containing just the trivial inequalities that specify parameter domain D. For the absolute error criterion, we modify the linear program as follows. For each example A_i we have an additional error variable δ_i. The inequalities for each example A_i are replaced by

    f_x(A_i) ≥ f_x(B_i) − δ_i.

Finally, the objective function for the linear program is to minimize Σ_i δ_i. An optimal solution x* to this linear program gives a parameter vector that minimizes the average absolute error of the example scores. Again the program has exponentially-many inequalities, but the same separation algorithm that computes an optimal alignment B* solves the separation problem in polynomial time, so in principle the linear program can be solved in polynomial time. In practice we use a cutting plane algorithm as described above.

Partial Examples. Inverse alignment from partial examples involves optimizing over all possible completions of the examples. While for partial examples we do not know how to efficiently find an optimal solution, we present a practical iterative approach which, as demonstrated in Section 4, finds a good solution. Start with an initial completion Ā_i^(0) for each partial example A_i. These initial completions may be formed by computing alignments of the unreliable regions that are optimal with respect to a default parameter choice x^(0). (In practice for x^(0) we use a standard substitution matrix [7] with appropriate gap penalties.) Alternately, an initial completion may be trivially obtained by taking the alignment of the unreliable regions in the partial example as the completion. We then iterate the following for j = 0, 1, . . .: compute an optimal parameter choice x^(j+1) by solving the inverse alignment problem on the complete examples Ā_i^(j); given x^(j+1), form a new completion Ā_i^(j+1) of A_i by (1) computing alignments of the unreliable regions that are optimal with respect to parameters x^(j+1), and (2) concatenating them to form a complete example. Such a completion optimally stitches together the reliable regions of the partial example, using the current estimate for parameter values.
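The overall procedure can be summarized in a short sketch: the cutting-plane loop of steps (1)-(4) as the inner LP solver, wrapped in the completion iteration just described. This is a simplified illustration, not the authors' implementation: it uses scipy.optimize.linprog, drops the constant feature g_0 and the degeneracy constraint introduced later in this section, and the routines features, optimal_alignment, and complete are caller-supplied stand-ins (with names of our own choosing) for the alignment machinery of the paper.

```python
import numpy as np
from scipy.optimize import linprog

def solve_lp_by_cutting_planes(examples, features, optimal_alignment,
                               t, bounds, max_cuts=10000):
    """Average-absolute-error LP via cutting planes. Variables are the
    t parameters x followed by k error variables delta. features(A)
    returns a length-t numpy feature vector g(A), so f_x(A) = g(A).x;
    optimal_alignment(x, i) is the separation oracle: a highest-scoring
    alignment of the strings S_i under parameters x."""
    k = len(examples)
    c = np.concatenate([np.zeros(t), np.ones(k)])    # minimize sum(delta)
    var_bounds = list(bounds) + [(0.0, None)] * k
    cuts_A, cuts_b = [], []
    for _ in range(max_cuts):
        res = linprog(c,
                      A_ub=np.array(cuts_A) if cuts_A else None,
                      b_ub=np.array(cuts_b) if cuts_b else None,
                      bounds=var_bounds, method='highs')
        if not res.success:
            raise RuntimeError('LP infeasible')
        x = res.x[:t]
        violated = False
        for i, A in enumerate(examples):
            B = optimal_alignment(x, i)
            # Enforce f_x(A_i) >= f_x(B) - delta_i, i.e.
            # (g(B) - g(A_i)) . x - delta_i <= 0.
            g_diff = features(B) - features(A)
            if g_diff.dot(x) - res.x[t + i] > 1e-9:
                row = np.concatenate([g_diff, np.zeros(k)])
                row[t + i] = -1.0
                cuts_A.append(row)
                cuts_b.append(0.0)
                violated = True
        if not violated:
            return x
    return x

def iterate_completions(partials, complete, solve, rounds=6):
    """Outer loop for partial examples: complete the unreliable regions
    under the current parameters, re-solve, and repeat.
    complete(P, x) returns a completion of partial example P under x
    (with x=None meaning a default parameter choice)."""
    x = None
    for _ in range(rounds):
        completions = [complete(P, x) for P in partials]
        x = solve(completions)   # e.g. solve_lp_by_cutting_planes(...)
    return x
```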
This iterative scheme repeatedly solves inverse alignment using improved complete examples. As the following result shows, each iteration yields a better parameter estimate.

Theorem 1 (Error convergence for partial examples). For the iterative scheme for inverse alignment from partial examples, denote the error in score for iteration j ≥ 1 by e_j := E(x^(j)), where E is error criterion E_abs or E_rel measured on the completions Ā_i^(j−1). Then

    e_1 ≥ e_2 ≥ · · · ≥ e*,

where e* is the optimum error for inverse alignment from the partial examples A_i.

Proof sketch. Since Ā_i^(j) is an optimal-scoring completion of A_i with respect to parameters x^(j),

    f_{x^(j)}( Ā_i^(j) ) ≥ f_{x^(j)}( Ā_i^(j−1) ).

This implies that with respect to the new complete examples Ā_i^(j), the old parameters x^(j) are still feasible at error e_j. So for the new examples, error e_j is achievable. Since the optimum error e_{j+1} for the new examples cannot be worse, e_{j+1} ≤ e_j. Furthermore, e* lower bounds the error for all completions.

By the above result, the error of the iterative scheme converges, though it may converge to a value larger than the optimum error e*. As shown in Section 4, choosing a good initial completion can reduce the error. In practice we iterate this scheme until the improvement in error becomes too small or a bound on the number of iterations is reached. Moreover, as the error improves across iterations, recovery of the examples generally improves as well.

Eliminating Degeneracy. To eliminate the degenerate solution x = (0, . . . , 0) we use the following approach. When the alignment problem is to maximize scoring function f, substitution scores σ_ab are typically both positive and negative, where a positive score indicates letters a, b are similar, and a negative score indicates they are dissimilar. For the σ_ab to be appropriate for local alignment, the expected score of a substitution in an alignment of two random strings should be strictly negative. (Otherwise, extending a local alignment by concatenating columns tends to increase its score, so an optimal local alignment degenerates into a trivial global alignment that substitutes as much as possible.) Similarly for global alignment, this expected score should be negative so random substitutions are considered dissimilar. Let threshold τ be the expected score of a random substitution for a default substitution scoring matrix. Values of τ for the commonly-used BLOSUM [7] and PAM [3] substitution matrices at standard amino acid frequencies are shown below, where each matrix has been scaled so its scores lie in the interval [−1, 1].
         BLOSUM45   BLOSUM62   BLOSUM80   PAM250   PAM160   PAM120
    τ    −0.056     −0.091     −0.136     −0.050   −0.056   −0.138
Note that as the percent identity value for the matrix increases (corresponding to increasing BLOSUM or decreasing PAM numbers), threshold τ gets more negative. To eliminate degeneracy, we add to the linear program the inequality

    Σ_a q_a² σ_aa  +  2 Σ_{a,b : a ≠ b} q_a q_b σ_ab  ≤  τ,

where q_a is the probability of amino acid a appearing in a random protein sequence. This forces the optimal solution x* of the linear program to be as nondegenerate as the default substitution matrix from which τ was measured. (In our experiments we use the τ value of BLOSUM62.) When τ is negative, which holds for standard scoring schemes, this inequality cuts off the trivial solution (0, . . . , 0).

Table 1. Dataset characteristics. For sets U, P, Q, and S of PALI benchmarks, the table reports the number of benchmarks in each set, and, averaged across its benchmarks, their number of strings, their string length, and the percent identity of their induced pairwise alignments. Also shown, averaged for the core blocks (the reliable regions of the benchmarks), are their percent coverage of the strings and their percent identity.

                                                          core blocks
    Datasets   benchmarks   strings   length   identity   coverage   identity
    U          102          14        239      29.1       40.4       35.8
    P          51           15        239      29.7       41.0       37.1
    Q          51           12        239      28.1       39.9       33.8
    S          25           20        245      27.4       33.2       33.5

[Figure 1: recovery and average absolute error plotted against iterations 1 through 6, for the default and trivial initial completions.]

Fig. 1. Improvement in recovery and error for the iterative approach to partial examples. Each curve shows either the recovery or the error across the iterations, starting from a given initial completion. Recovery is the percentage of columns from reliable regions that are present in an optimal alignment computed using the estimated parameters. Results are plotted for two initial completions: the default, which aligns the unreliable regions using default parameters, and the trivial, which takes all columns of the partial alignment, including the unreliable ones. The set of examples for the curves is all induced pairwise alignments of the PALI benchmark with SCOP identifier b.1.8.1.
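The threshold τ and the degeneracy inequality above can be evaluated directly from a scaled substitution matrix and background amino-acid frequencies; a minimal sketch (the frequencies q are an input here, not the published values):

```python
def expected_substitution_score(sigma, q):
    """tau = sum_a q_a^2 sigma_aa + 2 sum_{a<b} q_a q_b sigma_ab:
    the expected score of a random substitution under background
    frequencies q, for a matrix sigma (keyed by sorted letter pairs)
    scaled into [-1, 1]."""
    aas = sorted(q)
    tau = sum(q[a] ** 2 * sigma[(a, a)] for a in aas)
    for i, a in enumerate(aas):
        for b in aas[i + 1:]:
            tau += 2 * q[a] * q[b] * sigma[(a, b)]
    return tau
```

The same coefficients (q_a² on σ_aa and 2 q_a q_b on σ_ab) give the single extra row added to the linear program over the substitution-score variables.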
4 Experimental Results
To evaluate the performance of this approach to inverse alignment, we ran several types of experiments on biological data. For the examples, we used benchmark alignments from the PALI [1] suite of structural multiple alignments of proteins. For each family from the SCOP [9] classification of protein families, PALI contains
a multiple alignment of the sequences of the family members, computed by aligning their three-dimensional structures. In total, PALI has 1655 benchmark alignments, from which we selected a subset U of 102 benchmarks consisting of all alignments with at least 7 sequences that have nontrivial gap structure. We also perform a detailed study of a smaller subset S ⊂ U containing the 25 benchmarks with the most sequences. Set U was also partitioned into two equal-size subsets P, Q for the purpose of conducting training-set/test-set cross-validation experiments. Table 1 summarizes the characteristics of these datasets. Each of these PALI benchmarks consists of partial (not complete) examples.

Figure 1 illustrates the improvement in error for the iterative approach to partial examples discussed in Section 3. As the error in alignment scores improves across the iterations, the recovery of the example alignments tends to improve as well. Generally, smaller error correlates with higher recovery.

Table 2 shows a detailed comparison of recovery rates from different scenarios for inverse alignment. Parameters are learned from all induced pairwise alignments in a given PALI benchmark, and are applied to the strings in the same benchmark, either to compute pairwise alignments or a multiple alignment of the strings. A key conclusion from this comparison is that the absolute error criterion substantially outperforms the relative error criterion with respect to recovery of the example alignments.

Table 2. Recovery rates for variations of inverse alignment. For PALI benchmarks in set S, the table reports the average recovery rate across the examples, which are all induced pairwise alignments of the benchmark. Recovery is measured by using the learned parameters either to compute optimal pairwise alignments of the example strings, or to compute a multiple alignment of the benchmark strings using the tool Opal [11]. Parameters are learned under the absolute or relative error criteria. Recovery is also shown for pairwise alignments computed using the BLOSUM62 substitution matrix with learned gap penalties, and for multiple alignments computed with Opal using its default parameters, which are BLOSUM62 with carefully-chosen gap penalties. To save space we do not list every benchmark, but the average row is across all benchmarks in S. The relative-error binary search is to a precision of 0.01%. When parameters are used for alignment they are rounded to an integer scale of 100.

    SCOP          Pairwise alignment                  Multiple alignment
    identifier    absolute   relative   BLOSUM62     default   absolute
    c.95.1.1      47.8       34.3       39.0         45.2      70.1
    e.3.1.1       50.6       33.1       46.2         66.6      77.4
    c.95.1.2      69.4       32.9       46.8         64.9      82.9
    d.32.1.3      56.6       38.9       45.4         64.3      84.7
    a.127.1.1     73.1       47.5       71.1         82.9      89.8
    a.104.1.1     80.0       69.0       80.7         89.6      90.7
    d.54.1.1      67.4       54.3       51.9         70.4      91.7
    b.43.3.1      76.0       57.2       58.6         82.2      93.4
    d.81.1.1      85.6       67.8       69.2         85.2      98.4
    b.1.8.1       91.0       85.9       79.2         89.6      98.6
    average       78.6       64.9       68.3         82.3      91.5
When used for pairwise alignment, the parameters learned using absolute error outperform the standard BLOSUM62 [7] matrix in recovery by up to 20%; when used for multiple alignment in the tool Opal [11], which scores alignments under the sum-of-pairs objective, they outperform the default parameters of Opal in recovery by up to 25%. Also note that the recovery rates of parameters when used for multiple alignment are generally much higher than when used for pairwise alignment. In short, by performing inverse alignment from partial examples one can learn parameters for multiple sequence alignment that are tailored to a given protein family and that yield very high recovery. Finally, Table 3 presents recovery results from cross-validation experiments. Parameters learned on sparse training sets using the absolute error criterion are applied to full test sets. Their recovery is measured when computing multiple sequence alignments of the benchmarks in the test sets using the learned parameters within Opal. Note that there is only a small difference in recovery when parameters are applied for multiple sequence alignment to disjoint test sets, compared to their recovery on their training set. This suggests that the absolute error method is not overfitting the parameters to the training data. To give a sense of running time, performing inverse alignment on a given training set involved around 6 iterations for completing partial examples and took about 4 hours total on a 3.2 GHz Pentium 4 with 1 GB of RAM. An iteration took roughly 40 minutes and required around 4,000 cutting planes.

Table 3. Recovery rates for cross-validation experiments on training and test sets. Parameters learned on training sets Ũ, P̃, Q̃ using the absolute error criterion are applied to test sets U, P, Q. For a generic set X of PALI benchmarks, the examples in training set X̃ are a subset of the induced pairwise alignments of the benchmarks in X. Set X̃ contains pairwise alignments selected by their recovery rate in a multiple alignment of the benchmark computed with Opal using default parameters. Set X̃ selects one alignment of median rank from each benchmark, together with a sample of alignments that occur at equally-spaced ranks in the union of the benchmarks in X. Parameters from a given training set were used in Opal to compute multiple alignments of the benchmarks in the test set, and the table reports the average benchmark recovery.

    Training set characteristics            Test set recovery
    dataset    examples    identity         U       P       Q
    Ũ          204         33.1             83.4    84.8    82.0
    P̃          153         34.5             82.9    86.1    82.2
    Q̃          153         31.9             82.8    84.4    81.2
5 Conclusion
We have explored a new approach to inverse parametric sequence alignment that for the first time carefully treats partial examples. The approach minimizes the average absolute error of alignment scores, and iterates over completions of partial examples. We also studied for the first time the performance of learned
parameters when used for multiple sequence alignment, and showed that a substantial improvement in alignment accuracy can be achieved on individual protein families. Furthermore our results suggest that parameters learned across a sampling of protein families generalize well to other families. Further Research. Inverse alignment can be extended in several directions: to more general models of protein sequence alignment that use an ensemble of hydrophobic gap penalties [4], to formulations that directly incorporate example recovery [12], and to formulations that use regularization to improve parameter generalization [4,12]. Acknowledgements. We wish to thank Chuong Do for helpful discussions, and Travis Wheeler for assistance with using Opal [11]. This research was supported by the US National Science Foundation through grant DBI-0317498.
References

1. Balaji, S., Sujatha, S., Kumar, S.S.C., Srinivasan, N.: PALI: a database of alignments and phylogeny of homologous protein structures. Nucleic Acids Research 29(1), 61–65 (2001)
2. Cook, W., Cunningham, W., Pulleyblank, W., Schrijver, A.: Combinatorial Optimization. John Wiley and Sons, New York (1998)
3. Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C.: A model of evolutionary change in proteins. In: Dayhoff, M.O. (ed.) Atlas of Protein Sequence and Structure, vol. 5(3), pp. 345–352. National Biomedical Research Foundation, Washington DC (1978)
4. Do, C., Gross, S., Batzoglou, S.: CONTRAlign: discriminative training for protein sequence alignment. In: Proceedings of the 10th ACM Conference on Research in Computational Molecular Biology, pp. 160–174. ACM Press, New York (2006)
5. Eppstein, D.: Setting parameters by example. SIAM Journal on Computing 32(3), 643–653 (2003)
6. Gusfield, D., Stelling, P.: Parametric and inverse-parametric sequence alignment with XPARAL. Methods in Enzymology 266, 481–494 (1996)
7. Henikoff, S., Henikoff, J.G.: Amino acid substitution matrices from protein blocks. Proc. National Academy of Sciences USA 89, 10915–10919 (1992)
8. Kececioglu, J., Kim, E.: Simple and fast inverse alignment. In: Proc. 10th ACM Conference on Research in Computational Molecular Biology, pp. 441–455. ACM Press, New York (2006)
9. Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C.: SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology 247, 536–540 (1995)
10. Sun, F., Fernández-Baca, D., Yu, W.: Inverse parametric sequence alignment. Journal of Algorithms 53, 36–54 (2004)
11. Wheeler, T., Kececioglu, J.: Multiple alignment by aligning alignments. In: Proc. 15th Conference on Intelligent Systems for Molecular Biology (2007)
12. Yu, C.-N., Joachims, T., Elber, R., Pillardy, J.: Support vector training of protein alignment models. In: Proceedings of the 11th ACM Conference on Research in Computational Molecular Biology, pp. 253–267. ACM Press, New York (2007)
Novel Approaches in Psychiatric Genomics

Maja Bucan

Professor, Department of Genetics, Penn Center for Bioinformatics, University of Pennsylvania, Philadelphia, PA 19104
Abstract. A central challenge in biomedical research remains the identification of genetic factors influencing human behavior and susceptibility to common disease. Recent advances in genotyping technologies, coupled with the fundamental information provided by the Human Genome and International HapMap Projects, permit characterization and analysis of high-density SNP genotype data for thousands of individuals. Using genotype data obtained on the Illumina HumanHap550 SNP platform for 1000 individuals (in 170 autism multiplex families from the Autism Genetic Resource Exchange, 120 HapMap samples, and 320 neurologically normal children from the Children's Hospital of Philadelphia), we performed a pilot study to identify genetic risk factors for autism susceptibility. Using these data, we illustrate that novel statistical and computational biology approaches provide additional insights that cannot be obtained using simple association tests. These novel approaches include: a) phylogenetic analysis of population structure; b) pathway-based analysis of genome-wide association data; c) selection of variants based on a priori potential to affect phenotypes; and d) analysis of copy number variants (CNVs) in autistic and healthy (control) children. In addition, our results demonstrate that high-density SNP genotyping arrays can be used to detect copy number variants (CNVs) at an unprecedentedly high resolution, and that regions with a high density of CNVs in many cases correspond to syntenic breakpoints, providing further support for the importance of evolutionary analysis of structural variation to disease and disease susceptibility. This is joint work with Junhyong Kim (Penn), Mingyao Li (Penn), and Hakon Hakonarson (CHOP).
The Point Placement Problem on a Line – Improved Bounds for Pairwise Distance Queries

Francis Y.L. Chin (1), Henry C.M. Leung (1), W.K. Sung (2), and S.M. Yiu (1)

(1) Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong
    {chin, cmleung2, smyiu}@cs.hku.hk
(2) Department of Computer Science, National University of Singapore, Singapore
    [email protected]

* This research is supported by the RGC grant HKU7119/05E.
Abstract. In this paper, we study the adaptive version of the point placement problem on a line, which is motivated by a DNA mapping problem. To identify the relative positions of n distinct points on a straight line, we are allowed to ask queries of pairwise distances of the points in rounds. The problem is to find the number of queries required to determine a unique solution for the positions of the points up to translation and reflection. We improve the bounds for several cases. We show that 4n/3 + O(√n) queries are sufficient for the case of two rounds, while the best known result was 3n/2 queries. For an unlimited number of rounds, the best result was 4n/3 queries. We obtain a much better result of 5n/4 + O(√n) queries using three rounds only. We also improve the lower bound of 30n/29 to 17n/16 for the case of two rounds.
1 Introduction
The point placement problem on a line is defined as follows. Suppose that there are n points located at n distinct positions on a straight line. We are allowed to ask queries of pairwise distances of the points. The problem is to find the number of queries required to determine a unique solution for the positions of the points up to translation and reflection. The points are distinguishable (i.e., each point has a unique label). In this paper, we study the adaptive version of the problem. We submit our queries in rounds. In each round, the queries to be asked can be based on the answers to the queries of the previous rounds. For the non-adaptive version of the problem, only one round of queries is allowed. The problem is motivated by a biological problem, called DNA mapping. In general, we are interested in recovering the whole DNA sequence of an organism. One approach is to make use of some known short substrings, called markers or restriction sites. The first step in this approach is to find the relative positions of these markers in the DNA sequence. Unfortunately, there is no direct method to identify these positions. A common technique for handling this is fluorescent in situ hybridization (FISH) [5]. Given any two markers, a FISH experiment can
measure the distance between the markers in the sequence. The locations of these markers can be deduced if enough information about the pairwise distances of the markers is known. For more details about this problem, one can refer to [6]. Note that doing experiments consumes resources such as time and money. To reduce the cost, one would like to perform as few experiments as possible. We could design the experiments adaptively, each depending on the results of all previous experiments; however, this is too time-consuming. A more reasonable approach is to perform the experiments in a few rounds. In each round, we design a set of FISH experiments based on the measurements obtained from the previous rounds and perform the designed FISH experiments in parallel. This motivates the study of the adaptive versions of the point placement problem on a line. The problem is easy to describe and solve optimally if the point positions need not be distinct; it was shown in [1] that C(n, 2) nonadaptive and 2n − 3 adaptive queries are necessary and sufficient to solve the problem. However, when the point positions are distinct (a more realistic assumption for the DNA mapping problem, in which each marker should be distinct), this point placement problem becomes surprisingly difficult. It was shown that 8n/5 queries are sufficient, while the lower bound is 4n/3 except for some small n cases [1]. For the case of two rounds, it was shown that 3n/2 queries are sufficient [1], while the lower bound could only be shown to be 30n/29 [2]. For an unlimited number of rounds, the upper bound of 3n/2 in [1] was improved to 4n/3 in [2] (to be more precise, they derived a strategy with [4/3 + O(1/t)]n queries for O(t) rounds; note that this result only applies when the number of rounds is sufficiently large). So the best upper bound for two rounds was still 3n/2, while for an unlimited number of rounds the best upper bound was 4n/3. In this paper, we manage to improve these results, and our contribution is summarized in the following table. For two rounds, we obtained an upper bound of 4n/3 + O(√n) queries and improved the lower bound to 17n/16. For an unlimited number of rounds, we are able to reduce the number of queries to 5n/4 + O(√n), and we need only three rounds.
                   Two Rounds                  Unlimited number of rounds
    Upper Bound    3n/2 → 4n/3 + O(√n)         4n/3 → 5n/4 + O(√n) (in 3 rounds)
    Lower Bound    30n/29 → 17n/16             —
The improvement we obtained is based on an observation on the induced graph (we call it the point placement graph) of the given points and the pairwise distances being queried. We give a characterization of these graphs for which the corresponding locations of the points can be determined uniquely (up to translation and reflection). The definitions of the point placement graph and the characterization are given in Section 2. Section 3 shows the algorithm for solving the two-round case. The solution for the three-round case is presented in Section 4. The lower bound for the two-round case is discussed in Section 5. As a remark, there are some other related works in this area. For example, [4] studied the problem of finding the best estimation of the locations of points from
a partial set of pairwise distances of the points. In their study, they assumed that there may be errors in measuring the pairwise distances and provided heuristic algorithms to solve the problem. In [3], a similar problem was studied by assuming that the pairwise distances are given in terms of distance intervals. They proposed a randomized algorithm to solve the problem. They also proved that the problem is NP-hard even for those point placement graphs which have only chordless cycles.
2 Preliminaries
Given a set S of n distinct points, p_1, p_2, . . . , p_n, on a line and a set D of pairwise distances of some of these points, we can construct a point placement graph G(S, D) as follows. Each point is represented as a node, and if the distance between p_i and p_j is given, there is an edge between the nodes representing p_i and p_j with weight equal to the given distance. The distance is denoted |p_i p_j|. Not every point placement graph corresponds to a unique set of n distinct points (up to translation and reflection) on a line. For a given point placement graph, any set of n distinct points on a line for which the pairwise distances are consistent with the graph is called a linear layout of the graph. The following lemma shows a simple observation on a point placement graph.

Lemma 1. There are at most 2 edges with the same length l adjacent to any node in the graph G. [2]

Proof. Consider the linear layout of G; there are at most two distinct points whose distance from a node is l.

Definition 1. A point placement graph G is line rigid if there is only one unique set of n distinct points (up to translation and reflection) on a line that is consistent with G, i.e. there exists a unique linear layout of G.

The following shows a characterization of a line rigid point placement graph. We first define what a layer graph is.

Definition 2. A point placement graph G is called a layer graph if G can be plotted in an xy-plane (represented by two unit vectors x and y) with the following properties.

1. All edges uv are parallel to x or y, i.e. |u − v| = (u − v) · x or (u − v) · y.
2. The length of edge uv, |u − v|, is the same as the weight of the edge.
3. There are two nodes p and q with different x-coordinates and y-coordinates, i.e. (p − q) · x ≠ |p − q| and (p − q) · y ≠ |p − q|.
4. When the angle between x and y tends to 0 or π, no nodes overlap.

Figure 1 contains three examples of layer graphs with x and y as two perpendicular vectors.
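As a small illustration, Lemma 1 gives an easy necessary condition to check on any candidate point placement graph; a minimal sketch (the representation of edges as triples is our own convention):

```python
from collections import defaultdict

def violates_lemma1(edges):
    """edges: iterable of (u, v, length) triples of a point placement
    graph. By Lemma 1, any graph with a linear layout of distinct
    points has at most two equal-length edges incident to each node;
    return True if some node has three or more."""
    count = defaultdict(int)
    for u, v, length in edges:
        count[(u, length)] += 1
        count[(v, length)] += 1
    return any(c > 2 for c in count.values())
```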
Theorem 1. A point placement graph G is line rigid iff it is not a layer graph.

Proof. Without loss of generality, we assume that G is connected; otherwise, Theorem 1 can be proved by considering each connected component of G. If G is a layer graph, we can find two linear layouts of the graph by considering the angle between x and y tending to 0 and π (Property 4). Assume that G is not line rigid. Consider a spanning tree S of G and two linear layouts P and P′ of G. Pick any node of S as the root. We plot the root of S at the origin and the remaining edges uv with length l in S as follows:

1. Node v is on the right of node u in both P and P′: v = u + lx.
2. Node v is on the left of node u in both P and P′: v = u − lx.
3. Node v is on the right of node u in P and on the left of node u in P′: v = u + ly.
4. Node v is on the left of node u in P and on the right of node u in P′: v = u − ly.

It is easy to see that the spanning tree S constructed by the above procedure satisfies Properties 1, 2, and 3 of a layer graph. When the angle between x and y tends to 0 and π, S degenerates to a straight line on the x-axis with the positions of all vertices the same as in the linear layouts P and P′ respectively; therefore S satisfies Property 4 and is a layer graph. Consider the edges of G not in S. Let uv be a length-l edge between vertices u(x_1, y_1) and v(x_2, y_2). When the angle between x and y tends to 0, i.e. the positions of the vertices are the same as in the linear layout P, v(x_2, y_2) is on the right of u(x_1, y_1) with distance (x_2 − x_1) + (y_2 − y_1). Note that when (x_2 − x_1) + (y_2 − y_1) is negative, v(x_2, y_2) is on the left of u(x_1, y_1). When the angle between x and y tends to π, i.e. the positions of the vertices are the same as in the linear layout P′, v(x_2, y_2) is on the right of u(x_1, y_1) with distance (x_2 − x_1) − (y_2 − y_1). Note that when (x_2 − x_1) − (y_2 − y_1) is negative, v(x_2, y_2) is on the left of u(x_1, y_1). Since |(x_2 − x_1) + (y_2 − y_1)| = |(x_2 − x_1) − (y_2 − y_1)| = l, we have two possible solutions: i) (x_2 − x_1) = l and (y_2 − y_1) = 0; ii) (x_2 − x_1) = 0 and (y_2 − y_1) = l. Therefore, uv is either parallel to x or y, and the distance between u(x_1, y_1) and v(x_2, y_2) is l. G is a layer graph.

Lemma 2. A 4-cycle C with nodes p, q, r, s and edges pq, qr, rs, sp is not line rigid if |pq| = |rs| and |qr| = |sp|. [1] (proof omitted)

Based on the above characterization, we identify the following properties for a 5-cycle and a 6-cycle point placement graph to be line rigid. Note that since we only consider point placement graphs, we will simply refer to them as graphs.

Lemma 3. A 5-cycle C with nodes p, q, r, s, t and edges pq, qr, rs, st, tp is line rigid if |pq| ∉ {|rs|, |st|, ||rs| ± |st||} and |st| ∉ {|qr|, ||pq| ± |qr||}.

Proof. A 5-cycle layer graph with nodes a, b, c, d, e must be plotted in one of the three ways in Figure 1 with (a, b, c, d, e) = (p, q, r, s, t) or (t, p, q, r, s) or (s, t, p, q, r) or (r, s, t, p, q) or (q, r, s, t, p). In all cases, we show that C cannot be mapped to any of these three layer graphs because (a, b, c, d, e) cannot be equal to
Fig. 1. Layer graphs for 5-cycle
i. (p, q, r, s, t), because |pq| ≠ ||rs| ± |st||
ii. (t, p, q, r, s), because |pq| ≠ |st|
iii. (s, t, p, q, r), because |st| ≠ ||pq| ± |qr||
iv. (r, s, t, p, q), because |st| ≠ |qr|
v. (q, r, s, t, p), because |pq| ≠ |rs|
Since C cannot be mapped to these three graphs, C is not a layer graph. Therefore, it is line rigid (Theorem 1).

Lemma 4. A 6-cycle C with nodes o, p, q, r, s, t and edges op, pq, qr, rs, st, to is line rigid if

1. |op| ∉ {|qr|, |rs|, |st|, ||qr| ± |rs||, ||rs| ± |st||, ||qr| ± |st||, ||qr| ± |rs| ± |st||}
2. |pq| ∉ {|rs|, |st|, ||rs| ± |st||}
3. |qr| ∉ {|st|, ||op| ± |st||}
4. |rs| ≠ ||op| ± |pq||
5. |st| ∉ {||op| ± |pq||, ||pq| ± |qr||, ||op| ± |qr||, ||op| ± |pq| ± |qr||}
6. ||op| ± |pq|| ≠ ||rs| ± |st||
Proof. Similar to the proof of Lemma 3, Lemma 4 can be proved by considering all possible mapping functions from (op, pq, qr, rs, st, to) to the 15 ways of plotting a 6-cycle layer graph in the xy-plane.
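Lemma 3 gives a purely arithmetic test on four of the five edge lengths, which is easy to implement. The sketch below assumes exact edge lengths (e.g. integers or fractions from the fractions module), since equality tests on floating-point lengths would be unreliable; since lengths are positive, ||rs| ± |st|| expands to the two values rs + st and |rs − st|.

```python
def rigid_5cycle(pq, qr, rs, st):
    """Sufficient condition of Lemma 3 for the 5-cycle p-q-r-s-t-p to
    be line rigid; note the fifth length |tp| plays no role in the
    test. Arguments are positive exact edge lengths."""
    if pq in {rs, st, rs + st, abs(rs - st)}:
        return False
    if st in {qr, pq + qr, abs(pq - qr)}:
        return False
    return True

# Example: rigid_5cycle(3, 5, 7, 11) -> True
```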
3 Upper Bound for Two Rounds
In this section, we show that 4n/3 + O(√n) pairwise distance queries are sufficient to determine the positions of n distinct points on a line using two rounds. The main idea is based on Lemmas 1 and 3. From Lemma 3, we can always have a line rigid 5-cycle (p, q, r, s, t) no matter what the distance between node p and node t is. Since at each point there are at most two edges with the same length, we can always find many links of 4 edges pq, qr, rs, st satisfying the conditions stated in Lemma 3, and form 5-cycles by making the last query on the distance between node p and node t. Since each 5-cycle can determine the positions of 3 points (2 points of the 5-cycle should be fixed) with 4 queries, we might achieve the ratio 4/3 when the number of points is large.

Algorithm 1. Let n = 3b² + 17b + 31 for some positive integer b. In the first round, we choose 3b² + 17b + 30 queries represented by the tree in Figure 2. Let vertex r be the root of the tree. There are b² subtrees (2-links) with exactly one child ((p_i q_i, q_i r), i = 1, 2, . . . , b²) and b + 2 subtrees (b-trees) with roots s_j, j = 1, 2, . . . , b + 2, each with b + 14 children t_jk, k = 1, 2, . . . , b + 14.
Fig. 2. First round queries for 2-round algorithm
In the second round, for each 2-link (p_i q_i, q_i r), we find a distinct path (r s_j, s_j t_jk) in the b + 2 b-trees ((b + 2)(b + 14) = b² + 16b + 28 possible paths) to form a 5-cycle which satisfies the requirements stated in Lemma 3, and query the length of t_jk p_i. Note that there are at most 16b + 28 paths that do not satisfy Lemma 3, because there are at most a) 2 b-trees with |p_i q_i| = |r s_j| (Lemma 1), and b) for each tree rooted at s_j, at most 2 edges s_j t_jk for each of the following cases (a sketch of this matching step is given after the list):

i. |p_i q_i| = |s_j t_jk|
ii. |p_i q_i| = |r s_j| + |s_j t_jk|
iii. |p_i q_i| = |r s_j| − |s_j t_jk|
iv. |p_i q_i| = |s_j t_jk| − |r s_j|
v. |s_j t_jk| = |q_i r|
vi. |s_j t_jk| = |p_i q_i| + |q_i r|
vii. |s_j t_jk| = |p_i q_i| − |q_i r| or |q_i r| − |p_i q_i|
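A greedy sketch of this matching step, combining the tests of cases i-vii with the exclusion |p_i q_i| = |r s_j|; exact (non-floating-point) lengths are assumed, and the data layout is our own:

```python
def second_round_matching(two_links, paths):
    """Greedy matching for the second round of Algorithm 1 (sketch).
    two_links : list of (|p_i q_i|, |q_i r|) pairs
    paths     : list of (|r s_j|, |s_j t_jk|) pairs, one per leaf t_jk
    Returns, for each 2-link, the index of a distinct path that avoids
    |p_i q_i| = |r s_j| and cases i-vii, so that querying |t_jk p_i|
    closes a line-rigid 5-cycle by Lemma 3."""
    used, match = set(), []
    for pq, qr in two_links:
        for idx, (rs, st) in enumerate(paths):
            if idx in used:
                continue
            if pq in {rs, st, rs + st, abs(rs - st)}:  # |rs| and i-iv
                continue
            if st in {qr, pq + qr, abs(pq - qr)}:      # cases v-vii
                continue
            used.add(idx)
            match.append(idx)
            break
        else:
            raise ValueError("no admissible path; the counting argument "
                             "above guarantees this cannot happen")
    return match
```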
So we can always find a distinct path (r s_j, s_j t_jk) by matching with (p_i q_i, q_i r). For each of the unused 16b + 28 leaves t_jk in the b-trees, we query the distance between node t_jk and the root r (r t_jk). Last, we query the length of s_j s_{j+1}, j = 1, 2, . . . , b + 1.

Theorem 2. The graph constructed by Algorithm 1 is line rigid.

Proof. The relative positions of the b + 2 b-trees and the root r can be determined because they form b + 1 connected triangles (r s_j, r s_{j+1}, s_j s_{j+1}) with the root r [1]. The relative positions of the 16b + 28 unused leaves t_jk are also fixed because each forms a triangle (s_j t_jk, r s_j, r t_jk) with the root r and its parent [1]. Each 5-cycle formed in the second round is line rigid (Lemma 3). Therefore the graph is line rigid.

Theorem 3. Algorithm 1 uses 4n/3 + O(√n) queries.

Proof. In the first round, we have chosen 3b² + 17b + 30 queries. In the second round, we have chosen b² + (16b + 28) + (b + 1) = b² + 17b + 29 queries. Therefore, the total number of queries is

    (3b² + 17b + 30) + (b² + 17b + 29)
    = (4/3) n + (34/3) b + 53/3
    ≤ (4/3) n + (34/3) √n
    = (4/3) n + O(√n).
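The count in Theorem 3 is easy to check numerically; the excess of the total number of queries over 4n/3 grows like √n:

```python
def algorithm1_counts(b):
    # Query counts from Algorithm 1 and the proof of Theorem 3.
    n = 3 * b * b + 17 * b + 31
    total = (3 * b * b + 17 * b + 30) + (b * b + 17 * b + 29)
    return n, total

for b in (1, 10, 100, 1000):
    n, total = algorithm1_counts(b)
    excess = total - 4 * n / 3          # equals (34/3) b + 53/3
    print(f"b={b:5d}  n={n:9d}  queries={total:9d}  "
          f"excess/sqrt(n)={excess / n ** 0.5:.3f}")
```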
≤ (4/3)n + (34/3)√n = (4/3)n + O(√n).

Fig. 3. First round queries for the three-round algorithm
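The algebra in the proof of Theorem 3 above can be sanity-checked mechanically; a throwaway check of ours:

```python
from fractions import Fraction as F

# verify the identity used in the proof of Theorem 3 for many values of b
for b in range(1, 1000):
    n = 3 * b * b + 17 * b + 31
    total = (3 * b * b + 17 * b + 30) + (b * b + 17 * b + 29)
    assert F(total) == F(4, 3) * n + F(34, 3) * b + F(53, 3)
```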
4 Upper Bound for Three Rounds
In this section, we show that 5n/4 (more precisely, 5n/4 + O(√n)) pairwise distance queries are sufficient to determine the positions of n distinct points on a line using only three rounds. The approach is based on the same idea as in the two-round case. Instead of building 2-links, we try to build 3-links. However, there is a problem with using 3-links. Let (oi pi, pi qi, qi r) be a particular 3-link (note that we assume r is the root as in Algorithm 1). If |oi pi| = |qi r|, then we may not be able to get a unique solution (Lemma 4). However, according to Lemma 1, for any edge with weight |oi pi|, there are at most two edges qβ r, β = 1, 2, . . . , b², such that |qβ r| = |oi pi|. Instead of constructing b² 3-links in a single round, we first construct b² + 2 1-links and b² edges oi pi, i = 1, 2, . . . , b². For each edge oi pi, we can always find a distinct 1-link qβ r such that |oi pi| ≠ |qβ r|. By querying the length of pi qβ, we will get a 3-link (oi pi, pi qβ, qβ r) with |oi pi| ≠ |qβ r|.

Algorithm 2. Let n = 4b² + 87b + 773 for some positive integer b. In the first round, we choose 2b² + 87b + 772 queries represented by the tree in Figure 3. We also pair up the remaining 2b² nodes and query their pairwise distances |oi pi|, i = 1, 2, . . . , b². Let vertex r be the root of the tree. There are b² + 2 children (1-links) which are leaves (qβ r, β = 1, 2, . . . , b² + 2) and b + 10 subtrees (b-trees) with roots sj, j = 1, 2, . . . , b + 10, each with b + 76 children tjk, k = 1, 2, . . . , b + 76. In the second round, for each edge oi pi, we find a distinct 1-link qβ r such that |oi pi| ≠ |qβ r| and query the length of pi qβ to form a 3-link (oi pi, pi qβ, qβ r). For the remaining two 1-links qβ r, we query the length of qβ s1. In the third round, for each 3-link (oi pi, pi qβ, qβ r), we find a distinct path (rsj, sj tjk) in the b + 10 b-trees ((b + 10)(b + 76) = b² + 86b + 760 possible paths) to form a 6-cycle which satisfies the requirements stated in Lemma 4 and query
the length of tjk oi. Note that there are at most 86b + 760 paths that do not satisfy Lemma 4 (Lemma 1). So we can always find a distinct path (rsj, sj tjk) for (oi pi, pi qβ, qβ r). For each of the unused 86b + 760 leaves tjk in the b-trees, we query the distance between node tjk and the root r (rtjk). Last, we query the length of sj sj+1, j = 1, 2, . . . , b + 9.

Theorem 4. The graph constructed by Algorithm 2 is line rigid.

Proof. The relative positions of the b + 10 b-trees and the root r can be determined because they form b + 9 connected triangles (rsj, rsj+1, sj sj+1) with the root r [1]. The two unused 1-links rqβ are line rigid because they form two triangles (rs1, rqβ, qβ s1) with the root r and node s1. The relative positions of the 86b + 760 unused leaves tjk are also fixed because each forms a triangle (sj tjk, rsj, rtjk) with the root r and its parent [1]. Each 6-cycle formed in the third round is line rigid (Lemma 4). Therefore the graph is line rigid.

Theorem 5. Algorithm 2 uses 5n/4 + O(√n) queries.

Proof. In the first round, we have chosen (2b² + 87b + 772) + b² = 3b² + 87b + 772 queries. In the second round, we have chosen b² + 2 queries. In the third round, we have chosen b² + 86b + 760 + (b + 9) = b² + 87b + 769 queries. Therefore, the total number of queries is

(3b² + 87b + 772) + (b² + 2) + (b² + 87b + 769) = (5/4)(4b² + 87b + 773) + (261/4)b + 2307/4 ≤ (5/4)n + (261/8)√n + O(1) = (5/4)n + O(√n).
5 Lower Bound for Two Rounds
In this section, we show that any 2-round adaptive algorithm for solving the 1-dimensional point placement problem requires at least 17n/16 queries. Let V be the set of points. Suppose that a particular 2-round algorithm can uniquely determine the relative positions of a set V of n nodes in two rounds. Let G1 = (V, E1) and G2 = (V, E1 ∪ E2) be the graphs of queried node pairs (or edges), where Ei (i = 1, 2) contains all edges whose lengths have been measured in round i. Let us consider round 1 first. The algorithm has queried the edges in G1. The adversary will report the lengths of the edges based on the following strategy:

1. For nodes of degree at least 3, the adversary fixes the layout of all these nodes and answers the queries of E1 related to these nodes accordingly.
2. For nodes of degree 2, we consider the maximal paths (p1, p2, p3, . . . , pk) formed by these nodes such that the number of degree-2 nodes is at least 3. Let p0 and pk+1 be the nodes of degree not equal to 2 which are adjacent to p1 and pk, respectively. The adversary fixes the layout of nodes pi with i ≡ 0 (mod 3). Denote (pi, pi+1) as a special node pair if i ≡ 1 (mod 3). For each special node pair, say (pi, pi+1), suppose pi is adjacent to pi−1 and pi+1 is adjacent to pi+2. Note that the positions of pi−1 and pi+2 are fixed. The adversary sets |pi−1 pi| = |pi+1 pi+2| and |pi pi+1| = |pi−1 pi+2|. By setting the lengths in this way, the positions of pi and pi+1 are ambiguous. (More precisely, pi and pi+1 have two possible layouts: (1) pi to the left of pi−1 and pi+1 to the left of pi+2, or (2) pi to the right of pi−1 and pi+1 to the right of pi+2.)

Then, we consider G2 (that is, round 2). We have the following two properties:

Lemma 5. In G2, for each special node pair (pi, pi+1), there exists at least one edge in E2 connecting to either pi or pi+1.

Proof. Suppose that neither pi nor pi+1 is attached to any edge in E2. Then, the ambiguity we introduced in round 1 cannot be resolved.

Lemma 6. For any maximal path P of degree-2 nodes in G2, the number of nodes in the path is at most 4.

Proof. By construction, no three consecutive edges in the maximal path can be in E1: if there were three consecutive edges in E1, two of the nodes would form a special node pair. Suppose that the number of degree-2 nodes is 5, and let the nodes be p0, p1, p2, p3, p4, p5, p6, where p0 and p6 are heavy nodes adjacent to the length-5 maximal path. Suppose that |p0 p1| = w1, |p1 p2| = w2, |p2 p3| = w3, |p3 p4| = w4, |p4 p5| = w5, and |p5 p6| = w6. There are the following combinations of E1 and E2 edges for (w1, w2, w3, w4, w5, w6):
1. E2, E1, E1, E2, E1, E1
2. E1, E2, E1, E2, E1, E1
3. E1, E2, E1, E1, E2, E1
4. E1, E1, E2, E2, E1, E1
5. E1, E1, E2, E1, E2, E1
6. E1, E1, E2, E1, E1, E2
For each combination, we can set the lengths of the E2 edges to make the layout ambiguous. For example, for combination 1, we can set w1 + w2 + w3 = w5 + w6 and w4 = |p6 p0|; then (p0, p1, p2, p3, p4, p5) degenerates to a 4-cycle, which is not line rigid (Lemma 2) and becomes ambiguous.

Below, we analyze the average number of edges per node. Nodes of degree at least 3, or nodes belonging to a special node pair, are called heavy nodes; otherwise, they are light nodes. We split every edge into two fractional edges owned by the
two incident nodes. For an edge joining two heavy nodes or two light nodes, each incident node owns 1/2 edge count. For an edge joining a light and a heavy node, the light node owns 1/2 + g edge count and the heavy node owns 1/2 − g edge count. (We will specify g below.) The nodes in V can be divided into three types: (a) special node pairs, (b) normal nodes of degree at least 3, and (c) maximal paths formed by degree-2 normal nodes connected using edges in E1 and E2.

For type (a), by Lemma 5, each special node pair (p1, p2) has at least one edge in E2 connecting to either p1 or p2 (say p1). Then, the edge count of p1 is at least 1/2 + 2(1/2 − g) = 3/2 − 2g, while the edge count of p2 is at least 1/2 + 1/2 − g = 1 − g. So, the average edge count of p1 and p2 is at least (5/2 − 3g)/2 = (5 − 6g)/4. For type (b), each node has degree at least 3, so its edge count is at least 3(1/2 − g) = 3/2 − 3g. For type (c), by Lemma 6, each maximal path has length k (k ≤ 4); the total edge count of all nodes in the path is 2(1/2 + g) + (k − 1) = k + 2g. Hence, the average edge count is 1 + 2g/k. (In the worst case, k = 4 and the average edge count is 1 + g/2.) In total, the average edge count of each node in G2 is at least min{(5 − 6g)/4, 3/2 − 3g, 1 + g/2}. By setting g = 1/8, the total edge count of all n nodes in G2 is at least 17n/16. Hence we have the following theorem.

Theorem 6. Any deterministic 2-round adaptive algorithm for solving the 1-dimensional point placement problem requires at least 17n/16 queries.
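The choice g = 1/8 maximizes the minimum of the three per-node bounds above; a quick numerical sweep (an illustration of ours, not part of the proof) confirms it:

```python
# maximize the per-node lower bound min{(5-6g)/4, 3/2-3g, 1+g/2} over g
best = max((min((5 - 6 * g) / 4, 1.5 - 3 * g, 1 + g / 2), g)
           for g in (i / 10000 for i in range(2001)))
print(best)  # (1.0625, 0.125): the bound 17/16 is attained at g = 1/8
```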
6 Conclusions
There are quite a number of related open problems. For example, whether it is possible to have a strategy that matches the lower bound, or whether a better lower bound exists for the case of two rounds. For an unlimited number of rounds, the only lower bound we have is the trivial bound of n. It is challenging to obtain non-trivial lower bounds for the case of r rounds. Of course, it is also challenging to obtain better strategies for all cases, including the non-adaptive version, in which the gap between the upper bound (8n/5) and the lower bound (4n/3) has not been closed yet. Another direction is to design randomized algorithms to solve the problem (e.g., [2]).
References

1. Damaschke, P.: Point Placement on the Line by Distance Data. Discrete Applied Mathematics 127, 53–62 (2003)
2. Damaschke, P.: Randomized vs. Deterministic Distance Query Strategies for Point Location on the Line. Discrete Applied Mathematics 154, 478–484 (2006)
3. Mumey, B.: Probe Location in the Presence of Errors: A Problem from DNA Mapping. Discrete Applied Mathematics 104, 187–201 (2000)
4. Redstone, J., Ruzzo, W.L.: Algorithms for a Simple Point Placement Problem. In: Bongiovanni, G., Petreschi, R., Gambosi, G. (eds.) CIAC 2000. LNCS, vol. 1767, pp. 32–43. Springer, Heidelberg (2000)
5. Trask, B.J., Allen, S., Massa, H., Fertitta, A., Sachs, R., Engh, G., Wu, M.: Studies of Metaphase and Interphase Chromosomes using Fluorescence in situ Hybridization. Cold Spring Harbor Symposia on Quantitative Biology LVIII, 767–775 (1993)
6. Waterman, M.S.: Introduction to Computational Biology. Maps, Sequences and Genomes. Chapman and Hall, London (1995)
Efficient Computational Design of Tiling Arrays Using a Shortest Path Approach

Alexander Schliep1 and Roland Krause1,2

1 Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany
[email protected]
2 Max Planck Institute for Infection Biology, 10117 Berlin, Germany
Abstract. Genomic tiling arrays are a type of DNA microarray which can investigate the complete genome of arbitrary species for which the sequence is known. The design or selection of suitable oligonucleotide probes for such arrays is, however, computationally difficult if features such as oligonucleotide quality and repetitive regions are to be considered. We formulate the minimal cost tiling path problem for the selection of oligonucleotides from a set of candidates, which is equivalent to a shortest path problem. An efficient implementation of Dijkstra's shortest path algorithm allows us to compute globally optimal tiling paths from millions of candidate oligonucleotides on a standard desktop computer. The solution to this multi-criterion optimization is spatially adaptive to the problem instance. Our formulation easily incorporates experimental constraints with respect to specific regions of interest and tradeoffs between hybridization parameters, probe quality and tiling density. Solutions for the basic formulation can be obtained more efficiently from Monge theory.
1 Introduction
Tiling arrays. DNA microarrays can be manufactured today with the high density required to comprehensively sample the complete genomic sequence of an organism. These chips are typically referred to as genomic or tiling arrays and are widely used in transcriptome analysis. Representing a genome completely opens new routes, such as transcription profiles that do not rely on prior gene prediction, discovery of microRNAs, or chromatin-immunoprecipitation-chip studies for the detection of transcriptional regulation. Depending on the type of question, the probe selection for the array needs to be customized, which is economically feasible as several providers offer custom-designed arrays. The determination of suitable probe sequences, typically contiguous subsequences of the genomic sequence, is however computationally demanding even for relatively small genome sizes of bacteria if the quality of probes is considered in the design.
Oligonucleotide quality. There are many aspects determining the quality of oligonucleotide probes, which arise from the specifics of the hybridization reaction between an oligonucleotide probe immobilized on the chip surface and the region of genomic DNA it reacts with. Prominent aspects are the melting temperature or binding energy of the duplex formed and the cross-hybridization potential. The latter reflects the potential for forming stable duplexes with genomic regions other than the intended one and is approximated heuristically through sequence similarity. Kane et al [1] measured the amount of cross-hybridization and summarized the results into two criteria for unique probe selection for oligonucleotide probes of length 50, or 50mers. The first criterion is satisfied if the probe has no match of global sequence identity exceeding 75% in another region in the target genome. The second criterion requires that there is no exact match of 15 bases or more within unintended matches of a sequence identity between 50% and 75%. Later efforts tried to extend Kane's work to larger sets of genes, oligonucleotide lengths and sequence compositions [2,3]. Other aspects are under discussion, for instance hairpin structures or the occurrence of guanine or cytosine at the 5' or 3' end of the oligonucleotide. Recently, a disillusioning study showed that designing for optimal oligonucleotide quality is ill-understood and that thermodynamic considerations do not necessarily improve results [4]. The major body of work to assess hybridization was done on Affymetrix microarrays, which typically employ 25mers and particular experimental arrangements to assess the quality. It is unclear how to make best use of this work in the context of other array designs with oligonucleotides of different lengths and different hybridization conditions.

Prior work. One possible approach to the design is the selection of evenly spaced oligonucleotide probes which provide dense coverage of a genome. Selinger et al [5] constructed a dense microarray for the 4 Mb genome of Escherichia coli with a density of one 25mer for each 30-base window. The design of such an array is straightforward but suffers from the problem that cross-hybridization between probes and several genomic regions is to be expected. While the analysis of such an array can yield important results as to whether particular genomic regions are expressed, it is compromised by the high error rates of unspecific probes, in particular if relative expression levels are to be quantified and compared. The question of finding specific probes is further complicated by the vast amount of repetitive sequence in higher organisms, which is often handled by repeat detection and subsequent masking. The large proportion of repetitive sequence in the human genome requires enhanced design considerations to achieve high coverage. One approach [6] concentrates on maximizing the size of segments which can be covered with evenly spaced oligonucleotide probes by post-processing genomic sequences with masked repeats: segments of non-repetitive DNA are joined if they are separated by short segments of repetitive DNA, as long as the joined segment is sufficiently longer. As probe quality is ignored in the process, probes with a large cross-hybridization potential will often be chosen for the array.
A two-stage approach was proposed by Lipson et al [7]. Following computation of candidate probes, they propose to choose a whenever possible (WP) ℓ-cover. A WP ℓ-cover is a subset of candidates so that for any chromosomal position x the following holds: either there are probes i, j, i ≤ x ≤ j (probes are identified with their chromosomal location) in the cover with j − i ≤ ℓ, or there is no candidate between i and j. They provide a greedy algorithm for computation of WP ℓ-covers and for minimizing ℓ for a given array size, with a log-linear complexity of their entire approach. For a problem instance with candidates at positions 1, 2, . . . , ℓ, ℓ + 1, 2ℓ + 1, 2ℓ + 2, . . . , 3ℓ + 1, 3ℓ + 2, 4ℓ + 2 the greedy algorithm will arrive at the WP ℓ-cover ℓ, ℓ + 1, 2ℓ + 1, 3ℓ + 1, 3ℓ + 2, 4ℓ + 2, . . .. This is clearly undesirable, as one third of the probes are essentially uninformative; for example, ℓ + 1, 2ℓ + 1, 3ℓ + 2, 4ℓ + 2 uses fewer probes to cover the same segment with the same resolution. Furthermore, their approach cannot optimize with respect to probe quality.

Computation of candidates. There are also two stages in our approach. First, candidate oligonucleotides are computed for a given genome with one of the several tools available. For this work, we compute candidates with the recently introduced tool Flog [8]. It employs a suffix array [9] for the selection of unique probes that satisfy particular experimental conditions such as GC content and melting temperature. It implements a set of filters to satisfy Kane's criteria [1] (see above). We used the margin with which the second criterion was fulfilled as the quality value for probes. For specific applications it is desirable to use oligos of different lengths to satisfy experimental constraints, in particular the melting temperature Tm. The particular tool used in this step is not essential, and other tools that satisfy experimental requirements or considerations such as hairpin formation tendencies can easily be employed. The use of Flog removes the need to scan for repetitive regions. If particular repetitive regions are to be covered explicitly, our algorithm can incorporate additional selected oligonucleotides. Note that candidate computation has to be done only once per genome.

Novel contributions. We formulate the multi-criterion optimization problem of selecting an optimal subset of oligonucleotide probes from a set of candidates as the minimal cost tiling path problem. We show that our formulation is equivalent to a shortest path problem, which can be solved in log-linear time to global optimality using Dijkstra's shortest path algorithm. The optimal solutions adapt to spatial differences within the problem instance and can be controlled globally with respect to tradeoffs between hybridization parameters, probe quality and tiling density. Additionally, the solutions can be locally constrained, for example to reflect regions of particular interest to the biologist. We also provide an efficient implementation of Dijkstra's shortest path algorithm. The resulting computational efficiency and memory footprint make it feasible to solve instances with millions of candidate oligonucleotides on standard desktop computers. Furthermore, optimal solutions for the case of homogeneous (position-independent) tradeoffs and design parameters can be obtained from Monge theory in linear time.
2 Minimal Cost Tiling Paths
We formalize the problem of designing a tiling array in the following. The primary input is a collection of n candidate oligonucleotide probes and relevant information about them. Let i ∈ {1, . . . , n} denote the probes covering a DNA sequence S, and p(i), Tm(i) and q(i) the position, the melting temperature and the probe quality of probe i, respectively. The melting temperatures Tm(i) for selected probes should be within a narrow range to optimize comparability of probes and thus consistency of the experiments. There are various measures for probe quality, mainly reflecting the potential for cross-hybridization. Higher quality probes with a lower cross-hybridization potential not only have a lower false positive rate but also a lower variance of probe intensities, so high quality probes should be used. Here we assume that the q(i) are computed while creating the list of candidates; the details do not matter. To simplify the analysis, which typically relies on homogeneous proximity effects in signal correlation between adjacent probes, it is also desirable to keep the distance p(j) − p(i) between any two adjacent probes close to constant. Further relevant parameters exist and our formulation can be trivially extended to accommodate them. The question remains how to choose a tuple T ⊂ {1, . . . , n} of probes, which we will refer to as the tiling path, such that p(j) − p(i) = d* for any two adjacent probes, Tm(i) = Tm* and q(i) is maximal for all i ∈ T. Here d* is the desired tiling distance and Tm* the desired melting temperature. The multi-criterion optimization has conflicting criteria. Fulfilling the distance criterion is trivial when choosing arbitrarily bad probes with incorrect melting temperature, and vice versa. Which tradeoffs are made has to be decided individually depending on the organism selected, possible density maxima given by the size of the array, and sensitivity differences in the melting temperature, which varies in time and stringency between experimental conditions. To simplify, we first cancel units in probe parameters and bring the different criteria onto a common scale. Let
d(i, j) := |d* − (p(j) − p(i))| / d*    (1)

denote the penalty for choosing adjacent probes i and j in T. Similarly, let

pt(j) := |Tm* − Tm(j)| / Tm*    (2)

denote the penalty contribution from Tm violations, and finally let

pq(j) := (q* − q(j)) / q* if q(j) < q*, and pq(j) := 0 otherwise    (3)

denote the penalty for selecting probes of quality below q*. Note that this formulation does not distinguish between probe qualities exceeding q*, as this reflects the practice of microarray design.
Fig. 1. We show the neighborhood structure for node i. It is adjacent to nodes i + 1, i + 2, . . . , i + k. Other edges are not shown.
We can now compute the cost of a tiling path T as

C(T) = Σ_{i=0}^{|T|} d(Ti, Ti+1) + Σ_{i=1}^{|T|} (pt(Ti) + pq(Ti)),    (4)

or consider a weighted version to allow global control of the tradeoffs:

Cw(T) = wd × Σ_{i=0}^{|T|} d(Ti, Ti+1) + Σ_{i=1}^{|T|} (wt × pt(Ti) + wq × pq(Ti)).    (5)
Note that d(0, i) and d(i, |T| + 1) only penalize probe choices which are too far away from either sequence end; that is, d(0, i) = (p(i) − d*)/d* for p(i) > d* and 0 otherwise; d(i, |T| + 1) is defined analogously. This leads us to the definition of the minimal cost tiling path problem.

Problem 1. Find a tiling path T of minimal cost Cw(T) given candidate probes {1, . . . , n}, probe parameters p(i), Tm(i) and q(i), and design parameters d*, Tm* and q* with criteria weights wd, wt and wq.

2.1 Shortest Path Solution
The problem can be reformulated as a shortest path problem. Consider the graph in Figure 1. The set of vertices are the probes {1, . . . , n} with special nodes 0 and n + 1, which are "virtual probes" before the start and after the end of the sequence we want to tile. For an edge (i, j), with 1 ≤ i, j ≤ n, we compute its weight w(i, j) as the terms j contributes to the sum in Eq. 5. That is, w(i, j) = wd × d(i, j) + wt × pt(j) + wq × pq(j). The weights w(0, i) and w(i, n + 1) are defined via d(0, i) as above. Clearly, the cost of a path 0, i1, i2, . . . , n + 1 is exactly as defined in Eq. 5, and its vertices define the tiling path. Dijkstra's shortest path algorithm [10] with a Fibonacci heap based priority queue [11] can compute a shortest path in O(|E| + |V| log |V|), where |V| and |E| are order and size of the input. To improve the running time we bound the cardinality of the neighborhoods by k. Theoretically, all possible edges (i, j), 0 ≤ i < j ≤ n + 1, have to be considered. However, a simple analysis of the behavior of the cost function reveals that the distance penalty dominates the other penalties for typical choices of
Fig. 2. The graph-theoretic distance of vertices i, j, which are adjacent on the shortest path in the input graph. This experiment with about 40,000 probes selected from 2,000,000 candidates in Mycobacterium smegmatis exemplifies the existence of a bound for j − i on real data. Computations were performed with k = 1000.
tradeoff weights and instances. As p(j) − p(i) ≥ j − i, we can see that k depends on d* rather than n. Hence larger arrays, with lower d* for the same genome length |S|, are actually faster to compute. Note that from k being independent of n it follows that we can solve the minimal cost tiling path problem with a worst-case complexity of O(n log n), as in the corresponding graph |E| < k × |V|, which is dominated by |V| log |V|. We present the derivation for the unweighted cost function C(T); similar results for Cw(T) are straightforward to obtain.

Proposition 1. Let T be a minimal unweighted cost tiling path, and let i and j be adjacent probes in T. Then j − i < 2d* + 1, provided max_l pt(l) + max_l pq(l) < 1.

Proof. Assume that j − i ≥ 2d* + 1. We show that for i′ = i + ⌊(j − i)/2⌋ the cost of T is lowered when we replace (i, j) by (i, i′), (i′, j). That is, T is not minimal. Using Eq. 4 we can rewrite w(i, j) > w(i, i′) + w(i′, j) (note that i, i′ and j are chosen such that, with p(j) − p(i) > j − i, the position-specific terms are positive) as

−d* + (p(j) − p(i)) > −d* + (p(i′) − p(i)) + d*(pq(i′) + pt(i′)) − d* + (p(j) − p(i′)).

Canceling further terms we arrive at 0 > −d* + d*(pq(i′) + pt(i′)), which finishes the proof.

Note that typically max_l pt(l) < 0.3; the quality penalties are equally bounded. Based on experimental results, cf. Figure 2, and manual inspection of gap size, melting temperature and probe quality distributions, we choose k = 400 for d* = 160. In the instances for which we computed tiling paths we never observed two selected probes i and j with j − i > 200. Our choice of k clearly leaves room for further improvement, in particular as a k too small could lead to a shortest path which is not a minimal cost tiling path.
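The penalties of Eqs. (1)-(3) and the edge weight w(i, j) translate directly into code. In the sketch below (a minimal illustration of ours, not the authors' implementation), p, tm and q are arrays of probe positions, melting temperatures and qualities, and d_star, tm_star and q_star stand for the design parameters d*, Tm* and q*.

```python
def edge_weight(i, j, p, tm, q, d_star, tm_star, q_star,
                wd=1.0, wt=1.0, wq=1.0):
    dist_pen = abs(d_star - (p[j] - p[i])) / d_star   # Eq. (1)
    temp_pen = abs(tm_star - tm[j]) / tm_star         # Eq. (2)
    qual_pen = max(q_star - q[j], 0.0) / q_star       # Eq. (3)
    return wd * dist_pen + wt * temp_pen + wq * qual_pen
```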
2.2 Implementation
As problem instances are quite large (from order 2,000,000 and size 800,000,000 up to order 30,000,000 and size 1,200,000,000 for a small bacterium like M. tuberculosis and human chromosome 2, respectively) computation of the shortest path is a veritable challenge. Even state-of-the-art libraries like Boost (http://www.boost.org) or LEDA [12] cannot effectively cope with graphs of this size. For n = 3,200,000 and k = 300 a Boost-based implementation needs over 50 GB of memory and over 20 minutes of CPU time for allocating the graph. This does not include time for computing edge weights or the shortest path. As Dijkstra's shortest path algorithm explores the neighborhood of minimal vertices removed from the priority queue, we simply compute the neighborhood when needed instead of precomputing all neighbors and storing them as a graph. This quite obvious optimization of shortest path computations was implemented for example in [13]. Our method is implemented in Python (http://www.python.org) using the numpy (http://numpy.scipy.org/) package for linear algebra and a priority queue implementation from http://py.vaults.ca/apyllo.py/514463245.769244789.44776582. Figure 3 shows the growth of running time and memory usage for increasing problem sizes. The memory usage is minimized as we can use large arrays to store probe position, quality, and hybridization parameters, which can be allocated in constant time (neglecting the priority queue). Neighborhood computations are performed by vector-valued operations using numpy, where the vectors are slices. A slice is a vector consisting of a consecutive number of elements of another vector; for example, the positions of candidate oligonucleotides i, i + 1, . . . , i + k form a slice in the vector of all candidate positions. Numpy utilizes either vendor supplied basic linear algebra system (BLAS) libraries or self-optimizing BLAS such as Atlas [14] which are tuned to specific hardware.
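The lazy-neighborhood idea can be sketched as follows. This is a minimal Python sketch of ours (not the paper's implementation): it uses the standard-library binary heap rather than a Fibonacci heap, and `weight(i, j)` stands for the edge weight w(i, j) defined above, including the end-penalty edges.

```python
import heapq

def min_cost_tiling_path(n, k, weight):
    """Dijkstra with lazily generated neighborhoods.

    Vertices 0..n+1, where 0 and n+1 are the virtual end probes; vertex i
    is adjacent to i+1, ..., i+k.  weight(i, j) returns w(i, j).
    """
    INF = float("inf")
    dist = [INF] * (n + 2)
    prev = [-1] * (n + 2)
    dist[0] = 0.0
    heap = [(0.0, 0)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale heap entry
        if u == n + 1:
            break                         # target settled: path is optimal
        for v in range(u + 1, min(u + k, n + 1) + 1):
            nd = d + weight(u, v)         # neighborhood computed on the fly
            if nd < dist[v]:
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, v = [], prev[n + 1]
    while v > 0:                          # backtrack, dropping virtual probes
        path.append(v)
        v = prev[v]
    return path[::-1]
```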
2.3 Local Constraints in the Formulation
There is considerable interest in custom tiling arrays for specific applications. Novel, complete genome sequences are released almost daily now, and many different experimental applications exist for tiling arrays. Theoretically, oligonucleotide probes should be evenly spaced, but often the experimental question warrants a particular interest in specific genes or their upstream regions and calls for higher coverage (or better probes) for particular regions of the chromosome.

– Inclusion of obligatory oligonucleotides. The algorithm can easily incorporate the selection of specific oligonucleotides, such as probes positioned at the start of a gene or exon, to ensure specific coverage and optimal tiling density. Another potential constraint is the selection of probes that span exon boundaries [15]. For comparison with previous results and standardization, it might be helpful to use oligonucleotide probes from previous experimental designs. This constraint is trivially implemented for an obligatory probe o by removal of edges (i, j), i < o < j (see the sketch after this list).
Fig. 3. Memory usage and running time of the algorithm for segments of human chromosome 2. In this experiment, n denotes the number of probe candidates and k = 400 denotes the neighborhood cardinality. Further parameters were d* = 150, Tm* = 87.5.
– Heterogeneous probe density. To improve the density in gene-rich or other regions of interest, one can specify a shorter spacing of the oligonucleotides by a position-specific definition of the desired distance d*(p), where p designates the chromosomal location.
– Position-dependent penalty weights. Likewise, to achieve a more even spacing for particular regions, the algorithm can relax criteria locally by using position-specific weights wd(p) instead of global weights, and similarly for melting temperature and quality.
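As an example of the obligatory-probe constraint in the list above, the lazy neighbor generation of Sect. 2.2 only needs a small filter. This is a sketch under our own naming, not the authors' code:

```python
def neighbors(u, n, k, obligatory):
    """Successors of u, skipping edges that jump over an obligatory probe."""
    for v in range(u + 1, min(u + k, n + 1) + 1):
        if any(u < o < v for o in obligatory):
            continue   # edge (u, v) would skip probe o, so it is removed
        yield v
```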
2.4 Optimal Solutions from Monge Theory
In the case of homogeneous (position-independent) weights wd, wt and wq and design parameter d*, optimal solutions can be obtained in linear time from the theory of Monge matrices. See Burkard et al [16] for a review.

Definition 1. A matrix C = (cij) is called an upper triangular Monge matrix if for all integers i, r, j, s with 1 ≤ i < r ≤ j < s ≤ n the following condition holds:

cij + crs ≤ cis + crj.    (6)
Lemma 1. The matrix W = {w(i, j)} is an upper triangular Monge matrix.

Proof. Let i, r, j, s be integers such that 1 ≤ i < r ≤ j < s ≤ n holds. We can substitute our definition of edge weights into the Monge condition (6):

wd × d(i, j) + wt × pt(j) + wq × pq(j) + wd × d(r, s) + wt × pt(s) + wq × pq(s) ≤ wd × d(i, s) + wt × pt(s) + wq × pq(s) + wd × d(r, j) + wt × pt(j) + wq × pq(j).    (7)
Fig. 4. Distance between adjacent candidates and the density of candidate probes in 2.5 kb windows for M. tuberculosis (top). The low-density regions are caused by known repetitive regions. On the bottom we show distance and density for a minimal tiling path computed for d* = 90 and stringent conditions on oligonucleotide quality, causing gaps. Note that there is little deviation from d* towards smaller values.
Note that the quality and temperature penalties are identical on both sides and hence cancel. We multiply both sides by d*/wd to simplify the remaining distance penalties. We obtain

|d* − (p(j) − p(i))| + |d* − (p(s) − p(r))| ≤ |d* − (p(s) − p(i))| + |d* − (p(j) − p(r))|.    (8)
To check that the inequality holds we have to consider cases depending on the signs of the individual terms, which is tedious but fully elementary. If, for example, p(j) − p(r) ≥ d*, it follows that also p(j) − p(i) ≥ d*, p(s) − p(r) ≥ d* and p(s) − p(i) ≥ d*. Then the inequality (8) is equivalent to −d* + p(j) − p(i) − d* + p(s) − p(r) ≤ −d* + p(s) − p(i) − d* + p(j) − p(r), where all terms cancel. The other cases are left to the reader. This is relevant, as linear time algorithms [17,18,19] exist, which are on-line variants of the SMAWK algorithm of Aggarwal et al [20], and which solve the shortest path problem for upper triangular Monge matrices. Moreover, efficient algorithms to compute shortest paths with a prescribed number of edges have also been proposed, e.g. [21].
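The case analysis behind inequality (8) can also be spot-checked numerically on random instances; the snippet below is a quick illustration of ours, not a proof.

```python
import random

def monge_holds(p, d_star, trials=10000):
    """Randomized check of condition (6) for the pure distance penalty."""
    c = lambda i, j: abs(d_star - (p[j] - p[i])) / d_star
    for _ in range(trials):
        # distinct sorted indices give i < r < j < s, a subcase of i < r <= j < s
        i, r, j, s = sorted(random.sample(range(len(p)), 4))
        if c(i, j) + c(r, s) > c(i, s) + c(r, j) + 1e-12:
            return False
    return True

positions = sorted(random.sample(range(10**6), 500))
print(monge_holds(positions, d_star=150))  # expect True
```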
3 Application
We have designed a tiling array for the genome of Mycobacterium tuberculosis H37Rv comprising 44,000 spots [22], with a desired Tm* of 87.5 °C and a desired distance of d* = 150 bp, using oligomers of length 50 to 60. Mycobacteria have GC-rich genomes and contain several repetitive regions of varying degree of similarity. Instead of masking known biases, we explored the maximal possible tiling path defined by sequence properties only. Figure 4 shows that regions without coverage have no candidate oligonucleotides. As manual inspection shows, the gaps in the coverage (spikes) are typically caused by repeated regions such as insertion sequences, which result in 2 kb regions without placement of oligonucleotides. The largest regions (6 kb) without oligo placement correspond to proteins belonging to the families of PE and PPE genes, which are highly repetitive structures with low sequence complexity. These regions are of no particular interest for our experimental settings and can safely be ignored. Note that any naive equidistant tiling with d* = 150 selects more than 75% of its probes from the over 5.3 million oligonucleotides filtered out by Flog.
4 Discussion
We formulate a multi-criterion optimization problem in the design of oligonucleotide tiling or genome arrays. The minimal cost tiling path problem lends itself to a reformulation as a shortest path problem and a solution with Dijkstra's algorithm. Due to the structure of our cost function we can bound the neighborhood cardinality in the graph by a constant independent of n, the number of candidate probes, leading to an O(n log n) complexity. We demonstrate on real data that our efficient implementation of Dijkstra's algorithm allows us to solve problem instances with millions of candidates on standard desktop computers. Solutions can be controlled globally by criteria weights and locally to incorporate different constraints with respect to solution specifics, and even to extend partial solutions. A case study on a bacterial genome shows the effectiveness in covering genomes without prior handling of repetitive regions, in contrast to [6], which also does not optimize for oligonucleotide quality, and in obtaining balanced solutions of high-quality oligonucleotides with low cross-hybridization potential. The size of the final design, m, cannot be specified directly. Rather, the genome size divided by the desired distance between probe start positions gives the expected number of probes, that is, m ≈ |S|/d*. Depending on parameter choices this may or may not be realized, and consequently m can possibly not take on arbitrary values. Nevertheless a parameter exploration starting with d* = |S|/m quickly provides reasonable solutions in practice. A prior approach [7] of the same log-linear complexity as ours allows specification of m directly but produces solutions with possibly large variations in probe distances and without discrimination of probe quality. Further improvements concern the inclusion of non-unique probes [23], probes which hybridize very well in two or more genomic positions but otherwise do not
cross-hybridize, in the candidate generation, which will facilitate the closing of gaps due to repetitive regions. A spatially adaptive quality measure [7] for candidates can easily be incorporated. We will also explore an implementation of the Monge linear time algorithms in practice. An implementation of our method is available from http://algorithmics.molgen.mpg.de/Tileomatic.

Acknowledgments. Thanks to Jörg Schreiber at the MPI for Infection Biology for helpful discussions, to Stefan Bienert for making Flog [8] available to us, and to Janne Grunau for computational experiments. We also thank one of the reviewers for helpful comments regarding Monge theory.
References

1. Kane, M.D., Jatkoe, T.A., Stumpf, C.R., Lu, J., Thomas, J.D., Madore, S.J.: Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays. Nucleic Acids Res. 28(22), 4552–4557 (2000)
2. Matveeva, O., Shabalina, S., Nemtsov, V., Tsodikov, A., Gesteland, R., Atkins, J.: Thermodynamic calculations and statistical correlations for oligoprobes design. Nucleic Acids Research 31(14), 4211–4217 (2003)
3. He, Z., Wu, L., Li, X., Fields, M., Zhou, J.: Empirical Establishment of Oligonucleotide Probe Design Criteria. Applied and Environmental Microbiology 71(7), 3753–3760 (2004)
4. Pozhitkov, A., Noble, P.A., Domazet-Loso, T., Nolte, A.W., Sonnenberg, R., Staehler, P., Beier, M., Tautz, D.: Tests of rRNA hybridization to microarrays suggest that hybridization characteristics of oligonucleotide probes for species discrimination cannot be predicted. Nucleic Acids Res. 34(9) (2006)
5. Selinger, D.W., Cheung, K.J., Mei, R., Johansson, E.M., Richmond, C.S., Blattner, F.R., Lockhart, D.J., Church, G.M.: RNA expression analysis using a 30 base pair resolution Escherichia coli genome array. Nat. Biotechnol. 18(12), 1262–1268 (2000)
6. Bertone, P., Trifonov, V., Rozowsky, J.S., Schubert, F., Emanuelsson, O., Karro, J., Kao, M.Y., Snyder, M., Gerstein, M.: Design optimization methods for genomic DNA tiling arrays. Genome Res. 16(2), 271–281 (2006)
7. Lipson, D., Yakhini, Z., Aumann, Y.: Optimization of probe coverage for high-resolution oligonucleotide aCGH. Bioinformatics 23(2), 77–83 (2007)
8. Bienert, S.: Flexible combination of filters for oligodesign. Diploma thesis, Center for Bioinformatics, Universität Hamburg (2006)
9. Abouelhoda, M., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2(1), 53–86 (2004)
10. Dijkstra, E.W.: A note on two problems in connexion with graphs. Numerische Mathematik 1, 269–271 (1959)
11. Fredman, M.L., Tarjan, R.E.: Fibonacci heaps and their uses in improved network optimization algorithms. J. ACM 34(3), 596–615 (1987)
12. LEDA: http://www.algorithmic-solutions.com/
13. Schliep, A.: The software GADAR and its application to extremal graph theory. In: Proceedings of the Twenty-fifth Southeastern International Conference on Combinatorics, Graph Theory and Computing, Boca Raton, FL, vol. 104, pp. 193–203 (1994)
14. Automatically Tuned Linear Algebra Software (ATLAS): http://math-atlas.sourceforge.net/
15. Shai, O., Morris, Q., Blencowe, B., Frey, B.: Inferring global levels of alternative splicing isoforms using a generative model of microarray data. Bioinformatics 22(5), 606 (2006)
16. Burkard, R.E., Klinz, B., Rudolf, R.: Perspectives of Monge properties in optimization. Discrete Applied Mathematics 70(2), 95–161 (1996)
17. Wilber, R.: The concave least-weight subsequence problem revisited. J. Algorithms 9(3), 418–425 (1988)
18. Eppstein, D.: Sequence comparison with mixed convex and concave costs. J. Algorithms 11(1), 85–101 (1990)
19. Galil, Z., Park, K.: A linear-time algorithm for concave one-dimensional dynamic programming. Inf. Process. Lett. 33(6), 309–311 (1990)
20. Aggarwal, A., Klawe, M., Moran, S., Shor, P., Wilber, R.: Geometric applications of a matrix searching algorithm. In: SCG '86: Proceedings of the Second Annual Symposium on Computational Geometry, pp. 285–292. ACM Press, New York (1986)
21. Aggarwal, A., Schieber, B., Tokuyama, T.: Finding a minimum weight k-link path in graphs with Monge property and applications. In: SCG '93: Proceedings of the Ninth Annual Symposium on Computational Geometry, pp. 189–197. ACM Press, New York (1993)
22. Cole, S., Brosch, R., Parkhill, J., Garnier, T., Churcher, C., Harris, D., Gordon, S., Eiglmeier, K., Gas, S., Barry III, C., et al.: Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature 393, 537–544 (1998)
23. Schliep, A., Torney, D., Rahmann, S.: Group testing with DNA chips: generating designs and decoding experiments. In: Proceedings of the 2nd IEEE Computer Society Bioinformatics Conference, pp. 84–93. IEEE Computer Society Press, Los Alamitos (2003)
Efficient and Accurate Construction of Genetic Linkage Maps from Noisy and Missing Genotyping Data

Yonghui Wu1, Prasanna Bhat2, Timothy J. Close2, and Stefano Lonardi1

1 Dept. of Computer Science and Eng., University of California, Riverside, CA
2 Dept. of Botany & Plant Sciences, University of California, Riverside, CA
[email protected]
Abstract. We introduce a novel algorithm to cluster and order markers on a genetic linkage map, which is based on several theoretical observations. In most cases, the true order of the markers in a linkage group can be efficiently computed from the minimum spanning tree of a graph. Our empirical studies confirm our theoretical observations, and show that our algorithm consistently outperforms the best available tool in the literature, in particular when the genotyping data is noisy or in case of missing observations.
1 Introduction

Genetic linkage mapping dates back to the early 20th century when scientists began to understand the recombinational nature and cellular behavior of chromosomes. In his landmark paper published in 1913, Sturtevant studied the first genetic linkage map of chromosome X of Drosophila melanogaster [19]. Ever since its introduction, genetic linkage mapping has been a cornerstone of a variety of applications in biology, including map-assisted breeding, disease association analysis and map-assisted gene cloning, just to name a few. Genetic linkage maps historically began with just a few to several tens of phenotypic markers obtained one by one by observing morphological and biochemical variations of an organism, mainly following mutation. During the past few decades the introduction of DNA-based markers such as RFLPs, RAPDs, SSRs and AFLPs caused genetic maps to become much more densely populated, generally into the range of several hundred to more than 1,000 markers per linkage map. Most recently, the accumulation of sequence information has led to a further leap in marker density, principally driven by very high throughput and highly accurate genotyping that can accommodate thousands, or even hundreds of thousands, of simultaneous genotyping reactions by one person in a single day. In plants, one of the most densely populated maps is that of Brassica napus [20], which consists of 13,551 markers. High density genetic maps do not require complete genome sequencing but rather are a critical step in the study of organisms for which the whole genome sequence is unlikely to be available in the near future. A genetic map is a linear ordering of markers (also called loci) along the chromosome. The map is built using input data typically composed of the states of the loci on a set of individuals obtained from controlled crosses. When an order of the markers is computed from the data, the genetic distance between nearby markers can be relatively
Corresponding author. This project was supported in part by NSF CAREER IIS-0447773 and NSF DBI-0321756.
easily estimated. In order to characterize the quality of an order, various objective functions have been proposed in the literature, e.g., minimum Sum of Squared Errors (SSE) [18], minimum number of recombination events [16], Maximum Likelihood [11], maximum Sum of adjacent LOD scores [21], minimum Sum of Adjacent Recombination Fractions [3], and minimum Product of Adjacent Recombination Fractions [22]. Searching for an optimal order with respect to any of the objective functions mentioned above is computationally difficult. Enumerating all the possible orders quickly becomes infeasible, since the total number of distinct orders is proportional to n!, which is too large even if n is rather small. With the exception of SSE, the rest of the objective functions listed above can be decomposed into a simple sum of terms involving only pairs of markers. Liu [14] first observed the connection between the marker ordering problem and the traveling salesman problem (TSP). Various search heuristics that were originally developed for the TSP, such as simulated annealing [12], genetic algorithms [8], tabu search [6,7], ant colony optimization, and iterative heuristics such as K-opt and the Lin-Kernighan heuristic [13], have been applied to the genetic mapping problem in various computational packages. For example, JOINMAP [11] implements simulated annealing, CARTHAGENE [17,2] uses a combination of the Lin-Kernighan heuristic, tabu search and genetic algorithms, ANTMAP [10] exploits the ant colony optimization heuristic, [5] is based on genetic algorithms, and [15] takes advantage of evolutionary algorithms. Finally, RECORD [16] implements a combination of greedy and Lin-Kernighan heuristics. Most of the algorithms proposed in the literature for genetic linkage mapping find reasonably good solutions. Nonetheless, they fail to identify and exploit the combinatorial structures hidden in the data. Some of them simply start to explore the space of the solutions from a purely random order (see, e.g., [17,15,11,10]), while others start from a simple greedy solution (see, e.g., [16,18]). In this paper, we will show, both theoretically and empirically, that when the data quality is high, the optimal order can be identified via the simple minimal spanning tree algorithm. We will also show that when the data is noisy or data is missing, our algorithm consistently constructs better genetic maps than the best available tools in the literature.
2 Basic Concepts and Notations

First we introduce some basic concepts of genetics and establish a common notation. A single nucleotide polymorphism (SNP) is a variation on a chromosome where a single nucleotide (A, C, G, T) differs between members of a species or between paired chromosomes in an individual. SNP sites can be used as genetic markers. The organisms considered here are an ideal case of fully homozygous (or nearly so) diploids derived from two highly inbred (fully homozygous) parents. In this system, for every locus there is one maternal and one paternal allele, and with rare exception each locus exists in only two possible fully homozygous states distinguished by two alternative nucleotides. By convention, the two states are denoted as A or B. A genetic marker is said to be homozygous if the two alleles at a given locus have the same state, and it is said to be heterozygous otherwise. Various population types have been studied in association with the genetic mapping problem, which include BackCross (BC1), Doubled Haploid (DH), Haploid (Hap),
Recombinant Inbred Line (RIL), etc. Our algorithm can handle Hap, advanced RIL (low heterozygosity) and BC1 populations in addition to DHs. In what follows, we will concentrate on the DH population, but the extension to the Hap, advanced RIL and BC1 populations is straightforward. Briefly, a DH population for genetic map construction is prepared as follows. Let M be the set of markers of interest. Pick two highly inbred (fully homozygous) parents p1 and p2. We assume that the parents p1 and p2 are homozygous on every marker in M (those markers that are heterozygous in either p1 or p2 are simply excluded from consideration), and the same marker always has different allelic states in the two parents (those markers having the same allelic state in both parents are also excluded from M). By convention, we use symbol A to denote the allelic states appearing in p1 and B to denote the allelic states appearing in p2. Parent p1 is crossed with parent p2 to produce the first generation, called F1. The individuals in the F1 generation are heterozygous for every marker in M, with one chromosome being all A and the other chromosome being all B. In the DH system, gametes produced by meiosis from the F1 generation are fixed in a homozygous state by doubling the haploid chromosomes to produce a doubled haploid individual (hence the name doubled haploid). The doubled haploid individuals, denoted by N, are then genotyped on the set M of markers, i.e., the state of each marker is determined in a wet lab experiment. The genotypes are either homozygous A or homozygous B. The genotyping data, which will be fed into our algorithm, is collected into a matrix A of size m × n, where m = |M| and n = |N|. Each row of A corresponds to a marker in M, and each column of A corresponds to an individual in N. Given a marker li ∈ M, we use A[i, ] to refer to the row corresponding to li. Given an individual ck ∈ N, we use A[, k] to refer to the column corresponding to ck. Each entry in the matrix can be either A (i.e., the same allelic state as its parent p1) or B (i.e., the same allelic state as its parent p2). For the time being, we assume there are no missing observations in the matrix. The case of missing data will be discussed later in the paper. Building a genetic map from the matrix A is a two-step process. First, one has to partition the markers in A into groups, each of which corresponds to a chromosome. More specifically, one needs to determine which markers are from the same linkage group¹. This problem is essentially a clustering problem. Second, given a set of markers in the same linkage group, one needs to determine their correct order on the chromosome. For a pair of markers l1, l2 ∈ M and an individual c ∈ N, we say that c is a recombinant with respect to l1 and l2 if c has genotype A on l1 and genotype B on l2 (or vice versa). If l1 and l2 are in the same linkage group, then a recombinant is produced if an odd number of crossovers occurred between the paternal chromosome and the maternal chromosome within the region spanned by l1 and l2 during meiosis. We denote with Pi,j the probability of a recombinant event with respect to a pair of markers (li, lj). Pi,j varies from 0.0 to 0.5 depending on the distance between li and lj. At one
¹ A linkage group is a group of loci known to be physically connected, that is, they tend to act as a single group in meiosis (except for recombination of alleles by crossing-over) instead of undergoing independent assortment. Ideally, each linkage group corresponds to one chromosome, but sometimes multiple LGs can reside on the same chromosome if they are too far apart.
extreme, if li and lj belong to different LGs, then Pi,j = 0.5, because alleles at li and lj are passed down to the next generation independently from each other. At the other extreme, when the two markers li and lj are so close to each other that no recombination can occur between them, then Pi,j = 0.0. Let (li, lj) and (lp, lq) be two pairs of markers from the same linkage group. We say that (li, lj) is enclosed in (lp, lq) if the region of the chromosome spanned by li and lj is fully contained in the region spanned by lp and lq. A fundamental law in genetics is that if (li, lj) is fully contained in (lp, lq) then Pi,j ≤ Pp,q. The total number of recombinants in N with respect to the pair (li, lj) can be easily determined by computing the Hamming distance di,j between rows A[i, ] and A[j, ]. It is easy to prove that di,j/n is the maximum likelihood estimate (MLE) for Pi,j.
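With the matrix A encoded numerically (say 0/1 for the A/B calls; the encoding and function name below are our own choices for illustration), all the estimates can be computed in a few lines:

```python
import numpy as np

def pairwise_recomb_estimates(A):
    """MLEs d_{i,j}/n for all marker pairs; rows of A are markers."""
    A = np.asarray(A, dtype=np.int8)
    m, n = A.shape
    D = np.zeros((m, m))
    for i in range(m):
        D[i] = (A != A[i]).sum(axis=1)   # Hamming distances d_{i,j}
    return D / n                          # estimated P_{i,j}
```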
3 Some Theoretical Observations

Our first observation is that when two markers li and lj belong to two different linkage groups, then di,j will be large with high probability. This is formally captured in the following theorem. Recall that in this case, Pi,j = 0.5.

Theorem 1. Let li and lj be two markers that belong to two different LGs, and let di,j be the Hamming distance between A[i, ] and A[j, ]. Then,

E(di,j) = n/2  and  P(di,j < δ) ≤ e^(−2(n/2 − δ)²/n),

where δ < n/2.

Proof. Let ck ∈ N and let X^k_{i,j} be a random indicator variable which is equal to 1 if ck is a recombinant with respect to li and lj, and to 0 otherwise. Clearly E(X^k_{i,j}) = 1/2, and di,j = Σ_k X^k_{i,j}. The random variables {X^k_{i,j} : 1 ≤ k ≤ n} are i.i.d. By linearity of expectation, E(di,j) = n/2. The bound P(di,j < δ) ≤ e^(−2(n/2 − δ)²/n) derives directly from Hoeffding's inequality [9].
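Solving the tail bound of Theorem 1 for δ yields the clustering threshold used later in Sect. 4.1; a one-line sketch under our own naming:

```python
import math

def cluster_threshold(n, eps=1e-6):
    # solve 2*(n/2 - delta)**2 / n = -ln(eps) for delta < n/2
    return n / 2 - math.sqrt(-math.log(eps) * n / 2)
```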
Theorem 1 allows us to partition the markers into linkage groups fairly easily. The algorithmic details will be presented in the next section. In the following, we will assume that the markers have been successfully clustered into linkage groups, and we will focus on the ordering problem only. Let us assume now that all the markers in M belong to the same linkage group. Let G(M, E) be an edge-weighted complete graph on the set of vertices M. The weight of an edge (li, lj) ∈ E is set to Pi,j, which is the pairwise recombinant probability between the corresponding markers. A traveling salesman path (TSP path) Γ in G is a path that visits every marker/vertex once and only once². The weight w(Γ) of a TSP path Γ is simply the sum of the weights of the edges on Γ. Since each TSP path Γ defines an order Π of the markers (and vice versa), we define the weight of an order Π as the weight of the corresponding TSP path Γ in G. A linear ordering of the markers is also called a map of the markers.
² Note the difference between a traveling salesman path and a traveling salesman tour. A tour is a cycle (i.e., the salesman returns to the origin).
Lemma 1. Let Π0 be the true order of the markers (according to their positions on the chromosome). Then, the weight of Π0 is minimum among all the possible orders of M.

Proof. Let l1, l2, . . . , lm be the markers in M in the true order Π0, and let Π1 = li1, li2, . . . , lim be any order. We have w(Π0) = Σ_{2≤i≤m} P_{i−1,i} and w(Π1) = Σ_{2≤j≤m} P_{ij−1,ij}. Let S0 = {P_{i−1,i} | 2 ≤ i ≤ m} and S1 = {P_{ij−1,ij} | 2 ≤ j ≤ m}. Now observe that there is a one-to-one correspondence between the elements in S0 and the elements in S1 such that if P_{i−1,i} ∈ S0 is mapped to P_{ij−1,ij} ∈ S1 then the pair of markers (li−1, li) is fully contained in the pair (lij−1, lij), and hence P_{i−1,i} ≤ P_{ij−1,ij}. Therefore, we conclude that w(Π0) ≤ w(Π1).

According to Lemma 1, in order to determine the correct order of the markers, one has to find the minimum weight TSP path in G. Although the problem of finding the minimum weight TSP path in a general graph is NP-complete [4], in our case it is rather easy, as shown next. Recall that a minimum (weight) spanning tree (MST) of G is a subgraph of G which is a tree that spans all the vertices of G and has minimum total weight. To be technically accurate, we assume that the graph G has only one minimum weight spanning tree.

Lemma 2. Let Γ0 be the MST of G. Then, Γ0 is also the minimum weight TSP path.

Proof. Let l1, l2, . . . , lm be the markers in their true order. Let us run Prim's minimum spanning tree algorithm [1] on G starting from the first marker in the linkage group, i.e., l1. Prim's algorithm iteratively adds nodes (and edges) to a partially discovered tree until all the nodes are included. The next node to be added is the closest one to the partially discovered tree. Let li−1 be the node added in the previous step of Prim's algorithm. Due to the way the edge weights are assigned in G, the next marker to be added will be li. Therefore, the MST is also a TSP path in G.

Theorem 2. The true order of the markers in M can be determined by computing the MST of G.

Proof. Follows directly from Lemmas 1 and 2.
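A naive quadratic-per-step rendering of this argument is shown below (a sketch of ours, not the authors' implementation); when started at an end marker of the linkage group and the tree grown is a path, the visit order is the marker order of Theorem 2.

```python
def prim_order(D, start=0):
    """Prim's algorithm on a distance matrix D; returns the visit order."""
    m = len(D)
    in_tree, order = {start}, [start]
    while len(in_tree) < m:
        # u is the tree vertex the new vertex v attaches to
        u, v = min(((i, j) for i in in_tree for j in range(m)
                    if j not in in_tree), key=lambda e: D[e[0]][e[1]])
        in_tree.add(v)
        order.append(v)
    return order   # equals the true order when the MST is a path (Theorem 2)
```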
Theorem 2 claims that the true order of the markers in a linkage group can be identified by simply running Prim's MST algorithm (or any other MST algorithm) on the graph G, which would take quadratic time in the number of markers. Unfortunately, we do not know the exact pairwise recombinant probabilities Pi,j. What we have are their maximum likelihood estimates di,j/n for those probabilities. Thus, we replace Pi,j by di,j as the edge weights in G, and we call H the resulting graph. Our objective is to find a minimum weight TSP path in the graph H (which turns out to be the same objective function as used in [16]). When n → ∞, the maximum likelihood estimates converge to the true probabilities Pi,j. According to Lemma 1, the minimum weight TSP path will reveal the true order of the markers. Thus, we run Prim's algorithm on H to compute the minimum spanning tree. If the MLEs are accurate, according to Lemma 2, the MST will be a TSP path. In practice, due to noise in the genotyping data or due to an insufficient number of individuals, the spanning tree may not be a path, but hopefully "very close" to a path. As we will show in Section 5, this is exactly what we observed when running our algorithm on both real data and simulated data: the MST produced is "almost" a path. In any case, when the tree is not a path, we will try to transform it into a path, as explained next.
4 Algorithmic Methods

The construction of a genetic linkage map consists of two steps. In the first step, one clusters the markers into linkage groups. In the second, one determines the order of the markers in each LG.

4.1 Clustering Markers into Linkage Groups

In order to cluster the markers into linkage groups, we construct a complete graph H(M, E) over the set of markers to be clustered. The weight of an edge (lᵢ, lⱼ) ∈ E is equal to the pairwise distance d_{i,j} between lᵢ and lⱼ. As shown in Theorem 1, if two markers belong to different LGs, then the distance between them will be large with high probability. Once a small probability ε is chosen by the user (the default is ε = 0.000001), we can determine δ by solving the equation $2(n/2 - \delta)^2/n = -\log_e \varepsilon$. We then remove from H(M, E) all the edges whose weight is larger than or equal to δ. The resulting graph breaks up into connected components, each of which is assigned to a linkage group. The parameter ε should be chosen according to the quality of the data. In practice, this is not a critical issue, since the recombinant probability between nearby markers on the same linkage group is usually very small (less than 0.05). In our experience, our algorithm is capable of determining the correct number of LGs for a fairly large range of values of ε. A minimal sketch of this step follows the footnote below.

4.2 Ordering Markers in Each Linkage Group

Before ordering the markers in each linkage group, we preprocess the data by collecting cosegregating markers³ into bins. Each bin is uniquely identified by one of its members. Given one of the linkage groups obtained in the step above, we first construct a complete graph H(B, E), where B corresponds to the bins in that linkage group and the weight on each edge is the pairwise distance between the corresponding representative markers. As mentioned earlier, in order to find a good TSP path in H(B, E), we start by constructing an MST. If the MST turns out to be a path, we are done. Otherwise, we need to transform the MST into a path in a way that preserves the ordering captured by the tree as much as possible. We proceed as follows. First, we find the longest path in the MST, hereafter referred to as the backbone. The bins/vertices that do not belong to this path are first disconnected from it. Then, the disconnected bins/vertices are re-inserted into the backbone one by one, each at the position where the resulting backbone has the minimum total weight. The path resulting at the end of this process is our initial solution, which might not be locally optimal.

Once the initial solution is computed, we apply two heuristics that iteratively perform local perturbations to it in an attempt to improve its quality. First, we use the commonly used K-opt (here K = 2) heuristic: we cut the current path into three pieces and try all possible rearrangements of the three pieces; if any of the resulting paths has a smaller total weight, it is saved. This procedure is repeated until no further improvement is possible. In the second heuristic, we try to relocate each node in the path to all the other possible positions; if a relocation reduces the weight, the new path is saved. We apply the 2-opt heuristic and the relocation heuristic iteratively until neither of them can further reduce the weight of the path. The resulting TSP path represents our final solution.
³ Cosegregating markers are those markers for which the pairwise distance is 0.
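Returning to the clustering step of Section 4.1, here is a minimal Python sketch (ours, not the authors' code) that derives δ from ε using the threshold equation reconstructed above, removes the heavy edges, and returns the connected components via union-find:

```python
import math
from itertools import combinations

def cluster_markers(d, n, eps=1e-6):
    """Partition markers into linkage groups: drop every edge whose
    pairwise distance d[i][j] is >= delta, where delta solves
    2*(n/2 - delta)**2 / n = -ln(eps), then return the connected
    components of the remaining graph."""
    delta = n / 2.0 - math.sqrt(-n * math.log(eps) / 2.0)
    m = len(d)
    parent = list(range(m))                 # union-find over markers
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for i, j in combinations(range(m), 2):
        if d[i][j] < delta:                 # keep only "light" edges
            parent[find(i)] = find(j)
    groups = {}
    for i in range(m):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```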
4.3 Dealing with Missing Data

In our discussion so far, we assumed no missing data. This assumption is, however, not very realistic: in practice, it is rather common for the state of a marker to be missing. In fact, as we will show in our experimental results, missing observations do not have as much negative impact on the accuracy of the final map as genotyping errors do. Thus, it appears more beneficial to leave uncertain genotypes as missing observations than to arbitrarily call them one way or the other.

We deal with missing observations via an Expectation Maximization (EM) algorithm. Observe that if we knew the order of the markers (or bins, if we have cosegregating markers), the process of imputing the missing data would be relatively straightforward. For example, suppose we knew that marker l₃ immediately follows marker l₂, and that l₂ immediately follows marker l₁. Let us denote by $\hat{P}_{1,2}$ the estimate of the recombinant probability between markers l₁ and l₂, and by $\hat{P}_{2,3}$ that between markers l₂ and l₃. Let us assume that for an individual c the genotype at locus l₂ is missing, but the genotypes at loci l₁ and l₃ are available. Without loss of generality, let us suppose that they are both A. Then, the posterior probability for the genotype at locus l₂ in individual c is

$$P\{\text{genotype in } c \text{ at } l_2 \text{ is } A\} = \frac{(1-\hat{P}_{1,2})(1-\hat{P}_{2,3})}{(1-\hat{P}_{1,2})(1-\hat{P}_{2,3}) + \hat{P}_{1,2}\hat{P}_{2,3}}$$

and $P\{\text{genotype in } c \text{ at } l_2 \text{ is } B\} = 1 - P\{\text{genotype in } c \text{ at } l_2 \text{ is } A\}$. This posterior probability is the best estimate for the genotype of the missing observation. Similarly, one can compute the posterior probabilities for the other combinations of the genotypes at loci l₁ and l₃.

In order to deal with uncertainties in the data, we replace each entry in the genotype matrix A that used to contain the symbols A/B with a probability. The probability A[i, j] represents the confidence we have that marker lᵢ in individual cⱼ is in state A. For the known observations the probabilities are fixed to 1 or 0, depending on whether the observed genotype is A or B, respectively. The probabilities for the missing observations are initially set to 0.5. Given that A now contains probabilities, the pairwise distance between two markers lᵢ and lⱼ can be computed as follows:

$$d_{i,j} = \sum_{1 \le k \le n} \big( A[i,k](1-A[j,k]) + (1-A[i,k])A[j,k] \big) \qquad (1)$$
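To make (1) concrete, a minimal Python sketch of the distance computation over the probabilistic genotype matrix (ours, not the MSTMap code):

```python
def pairwise_distance(A, i, j, n):
    """Equation (1): expected number of recombinants between markers
    i and j, given probabilistic genotype calls A[i][k] = P(state A
    for marker i in individual k)."""
    return sum(A[i][k] * (1 - A[j][k]) + (1 - A[i][k]) * A[j][k]
               for k in range(n))
```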
Our iterative algorithm works as follows. First, given the input matrix A, we compute the pairwise distances according to (1). Then, we run our MST-based algorithm to find the most probable order of the markers. This constitutes the M step in the EM framework. Given the new marker order, we can adjust the estimate for a missing observation of marker i₂ on individual j as follows:

$$A[i_2, j] = \frac{\sum_{a \in \{A,B\},\, c \in \{A,B\}} L_{a,A,c}}{\sum_{a \in \{A,B\},\, b \in \{A,B\},\, c \in \{A,B\}} L_{a,b,c}} \qquad (2)$$
where i₁ is the marker immediately preceding i₂ in the most recent ordering, i₃ is the marker immediately following i₂, and $L_{a,b,c}$ is the likelihood of the event (l₁ = a, l₂ = b, l₃ = c) at the three consecutive loci. The quantity $L_{a,b,c}$ is straightforward to compute. For example, $L_{A,A,A} = A[i_1, j](1 - \hat{P}_{i_1,i_2})(1 - \hat{P}_{i_2,i_3})A[i_3, j]$, where $\hat{P}_{i_1,i_2} = d_{i_1,i_2}/n$ and $\hat{P}_{i_2,i_3} = d_{i_2,i_3}/n$ are the MLEs for $P_{i_1,i_2}$ and $P_{i_2,i_3}$, respectively, and $d_{i_1,i_2}$ and $d_{i_2,i_3}$ are computed according to (1) in the most recent M step. In the case where the missing observation is at the beginning or the end of the map, the above estimates have to be adjusted slightly. The new estimation of the probabilities corresponds to the E step in the EM framework. An E-step is followed by another M-step, and this iterative process continues until the marker order converges. In our experimental evaluations, the algorithm converges quickly, usually in fewer than ten iterations. The pseudo-code of our algorithm, called MSTMap, is presented in the Appendix.
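Putting the pieces together, a minimal Python sketch of the E-step update (2) for one missing observation (ours, not MSTMap's implementation; boundary markers would need the slight adjustment mentioned above):

```python
def e_step_update(A, d, i1, i2, i3, j, n):
    """Return the new estimate A[i2][j] per equation (2), where
    i1, i2, i3 are consecutive markers in the current order and
    d is the distance matrix from the most recent M step."""
    p12, p23 = d[i1][i2] / n, d[i2][i3] / n   # MLEs of recombinant probs
    def prob(i, state):                        # P(marker i is `state`) for individual j
        return A[i][j] if state == 'A' else 1 - A[i][j]
    def L(a, b, c):                            # likelihood of (l1=a, l2=b, l3=c)
        t12 = (1 - p12) if a == b else p12     # recombination between l1 and l2?
        t23 = (1 - p23) if b == c else p23     # recombination between l2 and l3?
        return prob(i1, a) * t12 * t23 * prob(i3, c)
    states = ('A', 'B')
    num = sum(L(a, 'A', c) for a in states for c in states)
    den = sum(L(a, b, c) for a in states for b in states for c in states)
    return num / den
```

For instance, L('A', 'A', 'A') evaluates to A[i1][j]·(1 − p12)·(1 − p23)·A[i3][j], matching the worked example above.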
5 Experimental Results

We implemented our algorithm in C++ and carried out extensive evaluations on both real and simulated data. The real data comes from our ongoing genetic mapping project for Hordeum vulgare (barley) at the University of California, Riverside. In total, three mapping populations are being studied, all of which are DH populations. The first mapping population is the result of crossing Oregon Wolfe Barley Dominant with Oregon Wolfe Barley Recessive (see http://barleyworld.org/oregonwolfe.php); from here on, we will refer to it as the OWB data set. The OWB data set consists of 1,020 markers genotyped on 93 individuals. The second mapping population is the result of a cross of Steptoe with Morex (see http://wheat.pw.usda.gov/ggpages/SxM/), which consists of 800 markers genotyped on 149 individuals; it will be referred to as the SM data set. The third mapping population is the result of a cross of Morex with Barke, recently developed by Nils Stein and colleagues at the Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), which contains 1,068 markers on 93 individuals; this data set will be referred to as MB. The genotypes of the SNPs for the above data sets were determined via the Illumina GoldenGate Assay. These three barley data sets were expected to contain seven LGs, one for each of the seven barley chromosomes.

We generated the simulated data sets according to the following procedure (the same as that used in [16]). First, we decide how many markers to place on the genetic map, how many individuals to genotype, and what the error rate and the missing rate are. We then produce a "skeleton" map, according to which the genotypes for the individuals will be generated. The markers on the skeleton map are spaced at a distance of 0.5 centimorgan plus a random distance drawn according to a Poisson process; on average, adjacent markers are 1 centimorgan apart. We generate the genotypes for the individuals as follows. The genotype at the first marker is generated at random, with probability 0.5 of being A and probability 0.5 of being B. The genotype at the next marker depends upon the genotype at the previous marker and the distance between them: if the distance between the current marker and the previous marker is x centimorgans, then with probability x% the genotype at the current locus is the opposite of that at the previous locus, and with probability (1 − x)% the two genotypes are the same.
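A minimal Python sketch of this generation procedure (ours, not the authors' simulator; drawing the Poisson-process gap as an exponential with mean 0.5 cM is our reading of the description, and the error/missing injection described in the text that follows is omitted):

```python
import random

def simulate_genotypes(n_markers, n_individuals, seed=0):
    """Generate DH genotypes along a skeleton map: adjacent markers
    are 0.5 cM apart plus an exponential gap (1 cM apart on average);
    a gap of x cM flips the genotype with probability x%."""
    rng = random.Random(seed)
    gaps = [0.5 + rng.expovariate(2.0) for _ in range(n_markers - 1)]
    individuals = []
    for _ in range(n_individuals):
        g = [rng.choice('AB')]                 # first marker: fair coin
        for x in gaps:                         # x centimorgans to previous marker
            flip = rng.random() < x / 100.0
            g.append(('B' if g[-1] == 'A' else 'A') if flip else g[-1])
        individuals.append(''.join(g))
    return individuals
```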
Table 1. Summary of the clustering results for the barley data sets. ρ̄ is the average ρ of the seven largest LGs in each population.

  Data set | # markers | # LGs | Sizes of the LGs               |   ρ̄
  OWB      |     1,020 |     7 | 105,178,165,132,146,129,165    | 0.9948
  SM       |       800 |     8 | 149,105,147,73,85,147,93,1     | 0.9972
  MB       |     1,068 |     8 | 150,198,136,162,96,130,194,2   | 0.9950
Finally, according to the specified error rate and missing rate, we flip the current genotype to purposely introduce an error, or simply delete it to introduce a missing observation. Following this procedure, we generated various data sets over a wide range of choices for n, m, the error rate and the missing rate.

5.1 Evaluation of the Clustering Algorithm

First, we evaluated the effectiveness of our clustering algorithm on the barley data sets. Since the genome of barley consists of seven chromosome pairs, we expected the clustering algorithm to produce seven linkage groups. Using the default value for ε, our algorithm produced seven linkage groups for the OWB data set, and eight linkage groups for the MB and SM data sets. The same results are obtained over a rather wide range of values of ε; for example, for ε ∈ [0.000001, 0.0001] the OWB data set is always clustered into seven LGs. The smallest linkage group in the MB data set contains only two markers, whereas the smallest linkage group in the SM data set is a singleton. The explanation for these exceptions is not yet certain, but we conjecture some problems with the Illumina data. The results of the clustering algorithm are summarized in Table 1.

We also compared the clusters produced by our algorithm against the ones produced by JoinMap, a commercial software package for linkage analysis. The clusters turned out to be identical. However, since JoinMap implements a hierarchical clustering algorithm based on pairwise LOD scores, the help of an expert is needed to decide where to cut the dendrogram in order to produce meaningful clusters. Our algorithm is conceptually much simpler.

5.2 Evaluation of the Quality of the Minimum Spanning Trees

Second, we verified that on real and simulated data, the MSTs produced by MSTMap are indeed very close to TSP paths. This experimental evaluation corroborates the fact that the MST provides a very good initial solution. Here, we computed the fraction ρ of the total number of bins/vertices in the graph that belong to the longest path (backbone) of the MST; the closer ρ is to 1, the closer the MST is to a path. Table 1 shows that on the barley data sets, the average ρ for the seven linkage groups (not including the smallest ones in the SM and MB data sets) is always very close to 1. Indeed, 16 of the 21 MSTs are paths, and the remaining 5 are all very close to paths, with just one node hanging off the backbone. When our algorithm generates MSTs that are paths, we are guaranteed that they are optimal, thus increasing our confidence in the correctness of the order obtained.
On the simulated data set with no genotyping errors, ρ is again close to one (see Figure 1-LEFT) for both n = 100 and n = 200 individuals. When the error rate is 1%, the ratio drops sharply to about 0.5. This is due to the fact that the average distance between nearby markers is only one centimorgan: a one percent error introduces an additional distance of two centimorgans, which is likely to move a marker around within its neighborhood. We computed ρ for several values of the error rate, up to 15%. At a 15% error rate, the backbone contains only about 1/4 of the markers. However, this short backbone is still very useful in obtaining a good map, since it can be thought of as a sample of the markers in their true order. Also, observe that increasing the number of individuals slightly increases the length of the backbone, while the ratio remains the same irrespective of the number of markers we include on a map (data not shown).

5.3 Evaluation of the Accuracy of the Ordering

In the third and final evaluation, we used the simulated data to compare the accuracy of the maps produced by MSTMap against the ones generated by RECORD [16]. We selected RECORD because, to the best of our knowledge, it is the best tool available for genetic mapping. RECORD is a recent software tool which, according to the authors and based on our experience, outperforms JoinMap, another widely used commercial software package for linkage analysis. RECORD is also rather fast: the algorithm implemented in RECORD runs in time quadratic in the total number of markers, the same as MSTMap. Last but not least, RECORD is command-line based, which allows us to run extensive tests without much human intervention. All the other tools we are aware of (e.g., CARTHAGENE, AntMap, JoinMap) are GUI-based.

We used Kendall's concordance correlation to evaluate the quality of the maps. Kendall's metric, denoted by τ, is commonly used in statistics to measure the correlation between two rankings. The metric is given by

$$\tau = \frac{4 \cdot (\#\ \text{concordant pairs})}{m(m-1)} - 1,$$

where a pair of markers is concordant if the relative order between them is the same in the two maps (one is the true map and the other is the map generated by either MSTMap or RECORD), and m is the total number of markers. The value of τ ranges from −1 to 1: τ = 1 when the two maps are identical, and τ = −1 when one map is the reverse of the other. Since the orientation of the map is not important in this context, whenever τ is negative we flip one of the maps and recompute τ; in our case, therefore, τ ∈ [0, 1]. The higher the value of τ, the closer the produced map is to the true map. Note that τ is more sensitive to global reshuffling than to local reshuffling. For example, assume that the true order is the identity permutation. Then τ for the order n/2, n/2+1, n/2+2, ..., n, 1, 2, 3, ..., n/2−1 is 0, whereas τ for the order 2, 1, 4, 3, 6, 5, ..., n, n−1 is 1 − 2/(n−1) (approximately 1 − 2/n), which is still close to 1 when n is large. The fact that τ is more sensitive to global reshuffling is a desirable property, since biologists are more interested in the correctness of the global order of the markers than in the local order.

The results of the evaluation based on τ for n = 100 individuals are summarized in Figures 1-RIGHT, 2-LEFT, and 2-RIGHT. Four observations are in order. First, Figure 1-RIGHT shows that when the error rate is low, both MSTMap and RECORD perform equally well. However, when the error rate gets higher, MSTMap consistently builds much more accurate maps than RECORD.
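As an aside, the τ metric just defined is simple to compute; the following Python sketch (ours, not part of MSTMap) evaluates it for two orders of the same markers, where flipping the orientation when τ is negative is equivalent to taking |τ|:

```python
from itertools import combinations

def kendall_tau(true_order, est_order):
    """tau = 4 * (# concordant pairs) / (m * (m - 1)) - 1, with the
    estimated map flipped if tau is negative (map orientation is
    irrelevant, and reversing a map negates tau)."""
    m = len(true_order)
    rank = {marker: i for i, marker in enumerate(est_order)}
    concordant = sum(1 for a, b in combinations(true_order, 2)
                     if rank[a] < rank[b])
    tau = 4.0 * concordant / (m * (m - 1)) - 1.0
    return abs(tau)
```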
[Figure 1 appears here. LEFT panel (m = 100, missing rate = 0): ρ versus error rate, for 100 and 200 individuals. RIGHT panel (n = 100, missing rate = 0): τ versus error rate, for MSTMap and RECORD with 100, 300 and 500 markers.]
Fig. 1. Average ρ (LEFT) and τ (RIGHT) for thirty runs on simulated data for several choices of the error rates (and no missing data). n is the number of individuals, and m is the number of markers.
[Figure 2 appears here. LEFT panel (n = 100, error rate = 0): τ versus missing rate. RIGHT panel (n = 100): τ versus error/missing rate. Both panels compare MSTMap and RECORD with 100, 300 and 500 markers.]
Fig. 2. Average τ for thirty runs on simulated data. LEFT: various choices of missing rates (error rate=0); RIGHT: various choices of error/missing rates (missing rate=error rate).
Second, observe in Figure 2-LEFT that the missing-data rate does not have as much of a negative impact on the quality of the assembled maps as the error rate does. Even at a missing rate of 15%, the assembled maps are still very accurate (τ is larger than 0.99). Third, Figures 1-RIGHT and 2-RIGHT show that when the data is noisy, the performance of MSTMap improves as the number of markers m increases. Fourth, we observe that if the number n of individuals increases, the quality of the maps constructed by both algorithms also improves (data not shown).
6 Conclusions

We presented a novel method to cluster and order genetic markers based on genotyping data. Our method rests on solid theoretical foundations, is computationally very efficient, gracefully handles missing observations, and performs as well as the best tool in the scientific literature. Additionally, in the presence of noisy data, our method clearly outperforms the other tools.
References

1. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn. The MIT Press and McGraw-Hill Book Company, Cambridge (2001)
2. de Givry, S., Bouchez, M., Chabrier, P., Milan, D., Schiex, T.: CARTHAGENE: multipopulation integrated genetic and radiation hybrid mapping. Bioinformatics (2004)
3. Falk, C.T.: Preliminary ordering of multiple linked loci using pairwise linkage data. Genetic Epidemiology 9, 367–375 (1992)
4. Garey, M., Johnson, D.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Company, New York (1979)
5. Gaspin, C., Schiex, T.: Genetic algorithms for genetic mapping. In: Hao, J.-K., Lutton, E., Ronald, E., Schoenauer, M., Snyers, D. (eds.) AE 1997. LNCS, vol. 1363, pp. 145–155. Springer, Heidelberg (1998)
6. Glover, F.: Tabu search - part I. ORSA Journal on Computing 1, 190–206 (1989)
7. Glover, F.: Tabu search - part II. ORSA Journal on Computing 2, 4–31 (1990)
8. Goldberg, D.E.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley Professional, Reading (1989)
9. Hoeffding, W.: Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58(301), 13–30 (1963)
10. Iwata, H., Ninomiya, S.: AntMap: constructing genetic linkage maps using an ant colony optimization algorithm. Breeding Science 56, 371–377 (2006)
11. Jansen, J., de Jong, A.G., van Ooijen, J.W.: Constructing dense genetic linkage maps. Theor. Appl. Genet. 102, 1113–1122 (2001)
12. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing. Science 220(4598), 671–680 (1983)
13. Lin, S., Kernighan, B.: An effective heuristic algorithm for the traveling salesman problem. Operations Research 21, 498–516 (1973)
14. Liu, B.: The gene ordering problem: an analog of the traveling salesman problem. Plant Genome (1995)
15. Mester, D., Ronin, Y., Minkov, D., Nevo, E., Korol, A.: Constructing large-scale genetic maps using an evolutionary strategy algorithm. In: Hao, J.-K., Lutton, E., Ronald, E., Schoenauer, M., Snyers, D. (eds.) AE 1997. LNCS, vol. 1363, pp. 145–155. Springer, Heidelberg (1998)
16. van Os, H., Stam, P., Visser, R.G.F., van Eck, H.J.: RECORD: a novel method for ordering loci on a genetic linkage map. Theor. Appl. Genet. 112, 30–40 (2005)
17. Schiex, T., Gaspin, C.: CARTHAGENE: constructing and joining maximum likelihood genetic maps. In: ISMB, pp. 258–267 (1997)
18. Stam, P.: Construction of integrated genetic linkage maps by means of a new computer package: JoinMap. The Plant Journal 3, 739–744 (1993)
19. Sturtevant, A.H.: The linear arrangement of six sex-linked factors in Drosophila, as shown by their mode of association. Journal of Experimental Zoology 14, 43–59 (1913)
20. Sun, Z., Wang, Z., Tu, J., Zhang, J., Yu, F., McVetty, P.B., Li, G.: An ultradense genetic recombination map for Brassica napus, consisting of 13551 SRAP markers. Theor. Appl. Genet. (2007)
21. Weeks, D., Lange, K.: Preliminary ranking procedures for multilocus ordering. Genomics 1, 236–242 (1987)
22. Wilson, S.R.: A major simplification in the preliminary ordering of linked loci. Genetic Epidemiology 5, 75–80 (1988)
A Novel Method for Signal Transduction Network Inference from Indirect Experimental Evidence

Réka Albert¹, Bhaskar DasGupta², Riccardo Dondi³, Sema Kachalo⁴, Eduardo Sontag⁵, Alexander Zelikovsky⁶, and Kelly Westbrooks⁶

¹ Department of Physics, Pennsylvania State University, University Park, PA 16802. [email protected]
² Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607. [email protected]
³ Dipartimento di Scienze dei Linguaggi, della Comunicazione e degli Studi Culturali, Università degli Studi di Bergamo, Bergamo, Italy, 24129. [email protected]
⁴ Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607. [email protected]
⁵ Department of Mathematics, Rutgers University, New Brunswick, NJ 08903. [email protected]
⁶ Department of Computer Science, Georgia State University, Atlanta, GA 30303. {alexz,kelly}@cs.gsu.edu
Abstract. In this paper we introduce a new method of combined synthesis and inference of biological signal transduction networks. A main idea of our method lies in representing observed causal relationships as network paths and using techniques from combinatorial optimization to find the sparsest graph consistent with all experimental observations. Our contributions are twofold: on the theoretical and algorithmic side, we formalize our approach, study its computational complexity and prove new results for exact and approximate solutions of the computationally hard transitive reduction substep of the approach. On the application side, we validate the biological usability of our approach by successfully applying it to a previously published signal transduction network by Li et al. [20] and show that our algorithm for the transitive reduction substep performs well on graphs with a structure similar to those observed in transcriptional regulatory and signal transduction networks.
A full version of this paper will appear in Journal of Computational Biology. Partly supported by a Sloan Research Fellowship, NSF grants DMI-0537992, MCB-0618402 and USDA grant 2006-35100-17254. Corresponding author. Partly supported by NSF grants IIS-0346973, IIS-0612044 and DBI-0543365. Supported by NSF grant IIS-0346973. Partly supported by NSF grant DMS-0614371.
1 Introduction
Most biological characteristics of a cell arise from the complex interactions between its numerous constituents, such as DNA, RNA, proteins and small molecules [3]. Cells use signaling pathways and regulatory mechanisms to coordinate multiple functions, allowing them to respond to and acclimate to an ever-changing environment. Genome-wide experimental methods now identify interactions among thousands of proteins [18,11,12,19]; however, these experiments are rarely conducted in the specific cell type of interest and are not able to probe the directionality of the interactions (i.e., to distinguish between the regulatory source and target). Identification of every reaction and regulatory interaction participating in even a relatively simple function of a single-celled organism requires a concerted and decades-long effort. Consequently, the state-of-the-art understanding of many signaling processes is limited to the knowledge of key mediators and of their positive or negative effects on the whole process.

Experimental information about the involvement of a specific component in a given signal transduction network can be partitioned into three categories. First, biochemical evidence provides information on enzymatic activity or protein-protein interactions. This first category describes direct interactions, e.g., the binding of two proteins, a transcription factor activating the transcription of a gene, or a chemical reaction with a single reactant and a single product. Second, pharmacological evidence, in which a chemical is used either to mimic the elimination of a particular component or to exogenously provide a certain component, leads to observed relationships that are not direct interactions but indirect causal effects, most probably resulting from a chain of interactions and reactions. For example, the binding of a chemical to a receptor protein starts a cascade of protein-protein interactions and chemical reactions that ultimately results in the transcription of a gene. Observing gene transcription after exogenous application of the chemical allows inferring a causal relationship between the chemical and the gene that, however, is not a direct interaction. Third, genetic evidence of differential responses to a stimulus in wild-type versus mutant organisms implicates the product of the mutated gene in the signal transduction process. This category is a three-component inference that in a minority of cases could correspond to a single reaction (namely, when the stimulus is the reactant of the reaction, the mutated gene encodes the enzyme catalyzing the reaction, and the studied output is the product of the reaction), but more often it is indirect.

As stated above, the last two types of inference do not give direct interactions but indirect causal relationships that correspond to reachability relationships in the unknown interaction network. Here we describe a method for synthesizing indirect (path-level) information into a consistent network by constructing the sparsest graph that maintains all reachability relationships. This method's novelty over other network inference approaches is that it does not require expression information (as all reverse-engineering approaches do; for a review see [5]). Moreover, our method significantly expands the capability for incorporating indirect (pathway-level) information. Previous methods for synthesizing signal transduction networks [21] only include direct biochemical
interactions, and are therefore restricted by the incompleteness of the experimental knowledge on pairwise interactions. Our method is able to incorporate indirect causal effects as network paths with known starting and end vertices and (yet) unknown intermediary vertices. The first step of our method is to distill experimental conclusions into qualitative regulatory relations between cellular components. Following [8,20], we distinguish between positive and negative regulation, usually denoted by the verbs “promote” and “inhibit” and represented graphically as → and . Biochemical and pharmacological evidence is represented as component-to-component relationships, such as “A promotes B”, and is incorporated as a directed arc from A to B. Arcs corresponding to direct interactions are marked as such. Genetic evidence leads to double causal inferences of the type “C promotes the process through which A promotes B”. The only way this statement can correspond to a direct interaction is if C is an enzyme catalyzing a reaction in which A is transformed into B. We represent supported enzyme-catalized reactions as both A (the substrate) and C (the enzyme) activating B (the product). If the interaction between A and B is direct and C is not a catalyst of the A-B interaction, we assume that C activates A. In all other cases we assume that the three-node indirect inference corresponds to an intersection of two paths (A ⇒ B and C ⇒ B) in the interaction network; in other words, we assume that C activates an unknown intermediary (pseudo)-vertex of the AB path. The main idea of our method is finding the minimal graph, both in terms of pseudo vertex numbers and non-critical edge numbers, that is consistent with all reachability relationships between real vertices. The algorithms involved are of two kinds: (i) transitive reduction of the resulting graph subject to the constraints that no edges flagged as direct are eliminated and (ii) pseudo-vertex collapse subject to the constraints that real vertices are not eliminated. Note that we are not claiming that real signal transduction networks are the sparsest possible; our goal is to minimize false positive (spurious) inferences, even if risking false negatives. Thus we want to be as close as possible to a “tree topology” while supporting all experimental observations. The implicit assumption of chain-like or tree-like topologies permeates the traditional molecular biology literature: signal transduction and metabolic pathways were assumed to be close to linear chains, genes were assumed to be regulated by one or two transcription factors [3]. According to current observations the reality is not far: the average in/out degree of transcriptional regulatory networks[23,18] and the mammalian signal transduction network [21] is close to 1.
2 A Formal Description of the Network Synthesis
The goal of this section is to introduce a formal framework of the network synthesis procedure that is sufficiently general in nature, and amenable to algorithmic analysis and consequent automation. First, we need to describe a graph-theoretic problem which we refer to as the binary transitive reduction (BTR) problem. We are given a directed graph G = (V, E) with an edge labeling
function w : E → {0, 1}. Biologically, edge labels 0 and 1 in edges u →⁰ v and u →¹ v correspond to "u promotes v" and "u inhibits v", respectively. The following definitions and notations are used throughout the paper. All paths are (possibly self-intersecting) directed paths unless otherwise stated. A non-self-intersecting path or cycle is called a simple path or cycle. If edge labels are removed or not mentioned, they are assumed to be 0 for the purpose of any problem that needs them. The parity of a path P from vertex u to vertex v is $\sum_{e \in P} w(e) \pmod 2$. A path of parity 0 (resp., 1) is called a path of even (resp., odd) parity. The same notions carry over to cycles in an obvious manner. The notation u ⇒ˣ v denotes a path from u to v of parity x ∈ {0, 1}. If we do not care about the parity, we simply denote the path as u ⇒ v. An edge will simply be denoted by u → v. For a subset of edges E′ ⊆ E, reachable(E′) is the set of all ordered triples (u, v, x) such that u ⇒ˣ v is a path of the restricted subgraph (V, E′).

The BTR problem is defined as follows. An input instance is a directed graph G = (V, E) with an edge labeling function w : E → {0, 1} and a set of critical edges E_critical ⊆ E. A valid solution is a subgraph G′ = (V, E′) where E_critical ⊆ E′ ⊆ E and reachable(E′) = reachable(E). The objective is to find a valid solution that minimizes |E′|. Intuitively, the BTR problem is useful for determining a sparsest graph consistent with a set of experimental observations. The set of critical edges represents edges that are known, with concrete evidence, to be direct interactions. By maximizing sparseness we do not simply mean to minimize the number of edges per se; we seek to minimize the number of spurious feed-forward loops (i.e., a node regulating another both directly and indirectly). Thus we want to be as close as possible to a "tree topology" while supporting the experimental observations.

We also need to define one more problem that will be used in the formal framework of the network synthesis approach. The pseudo-vertex collapse (PVC) problem is defined as follows. An input instance is a directed graph G = (V, E) with an edge labeling function w : E → {0, 1} and a subset V′ ⊂ V of vertices called pseudo-vertices. The vertices in V \ V′ are called "real" vertices. For any vertex v, let in(v) = {(u, x) | u ⇒ˣ v, x ∈ {0, 1}} \ {v} and out(v) = {(u, x) | v ⇒ˣ u, x ∈ {0, 1}} \ {v}. Collapsing two vertices u and v is permissible provided both are not "real" vertices, in(u) = in(v) and out(u) = out(v). If permissible, the collapse of two vertices u and v creates a new vertex w, makes every incoming (resp., outgoing) edge to (resp., from) either u or v an incoming (resp., outgoing) edge of w, removes any parallel edge that may result from the collapse operation, and removes both vertices u and v. A valid solution of the problem is then any graph G* = (V*, E*) that can be obtained from G by a sequence of permissible collapse operations, and the objective is to find a valid solution that minimizes |V*|. Intuitively, the PVC problem is useful for reducing the pseudo-vertex set to the minimal set that keeps the graph consistent with all indirect experimental observations. As in the case of the BTR problem, our goal is to minimize false positive (spurious) inferences of additional components in the network.
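To make the permissibility test concrete, the following Python sketch checks whether two vertices may be collapsed, computing parity-labeled reachability with a BFS over (vertex, parity) states. It is a direct transcription of the definitions above, not the authors' implementation; `adj` and `radj` are hypothetical forward and reverse adjacency maps from a vertex to a list of (neighbor, label) pairs, and `real` maps a vertex to True if it is a real vertex.

```python
from collections import deque

def reach_sets(adj, s):
    """All (v, x) such that a path s =x=> v exists, x in {0, 1}:
    BFS over (vertex, parity) product states."""
    seen = set()
    queue = deque(adj.get(s, []))
    while queue:
        v, x = queue.popleft()
        if (v, x) in seen:
            continue
        seen.add((v, x))
        for w, y in adj.get(v, []):
            queue.append((w, (x + y) % 2))
    return seen

def permissible_collapse(u, v, adj, radj, real):
    """u and v may be collapsed iff they are not both real and their
    parity-labeled in- and out-reachability sets coincide."""
    if real[u] and real[v]:
        return False
    def in_out(z):
        out_z = {(w, x) for w, x in reach_sets(adj, z) if w != z}
        in_z = {(w, x) for w, x in reach_sets(radj, z) if w != z}
        return in_z, out_z
    return in_out(u) == in_out(v)
```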
A formal framework for the network synthesis procedure is presented in Figure 1. As described in Section 1, in the first step we incorporate biochemical interaction or causal evidence as labeled edges, noting the critical edges corresponding to direct interactions. Then we perform a binary transitive reduction to eliminate spurious inferred edges (i.e., edges that can be explained by paths of the same label). In step two we incorporate double causal relationships A →ˣ (B →ʸ C) by (i) adding a new edge A →ˣ B if B →ʸ C is a critical edge, (ii) doing nothing if existing paths in the network already explain the relationship, or (iii) adding a new pseudo-vertex and three new edges. To correctly incorporate the parity x + y (mod 2) of the indirect A ⇒ C relationship, B ⇒ʸ C paths with even y will be broken into two edges of 0 parity, while paths of odd parity will be broken into an edge of parity a = 0 and an edge of parity b = 1; this is summarized concisely by the equation b ≡ a + b ≡ y (mod 2). The unnecessary redundancy of the resulting graph is reduced by performing pseudo-vertex collapse, then a second round of binary transitive reduction.

Intuitively, the approach in Figure 1 first expands the network by the addition of pseudo-vertices at the intersections of the two paths corresponding to three-node inferences, then uses the additional information available in the network to collapse these pseudo-vertices, i.e., to identify them with real vertices or with each other. The PVC step is the heart of the algorithm; the final BTR is akin to a
1 [Encoding single causal inferences]
  1.1 Build a network for each causal inference of the type A →⁰ B or A →¹ B, noting each critical edge.
  1.2 Solve the BTR problem for this network.
2 [Encoding double causal inferences]
  2.1 Consider each indirect causal relationship A →ˣ (B →ʸ C), where x, y ∈ {0, 1}. We add new nodes and/or edges to the network based on the following cases:
      – If B →ʸ C ∈ E_critical, then add the edge A →ˣ B.
      – If there is no subgraph of the form B ⇒ᵃ D ⇒ᵇ C with A ⇒ˣ D, for some node D, where b ≡ a + b ≡ y (mod 2), then add the subgraph B →ᵃ P →ᵇ C with A →ˣ P to the network, where P is a newly added "pseudo-node" and b ≡ a + b ≡ y (mod 2).
  2.2 Solve the PVC problem for the resulting graph.
3 [Final reduction] Solve the BTR problem for the network.
Fig. 1. The overall network synthesis approach
final cleanup step; thus it is important to perform PVC before BTR in Step 2.2 of Figure 1.

Proposition 1. All the steps in the network synthesis procedure, except the steps that involve BTR, can be solved in polynomial time.
3 Summary of Pertinent Previous Works
The idea of transitive reduction, though in a more simplistic setting and/or integrated in an approach different from what appears in this paper, has been used by a few researchers before. For example, in [25] Wagner's goal is to find the network from the reachability information. He constructs uniformly random graphs and scale-free networks in a range of connectivities (average degrees), and matches their reachability information to the range of gene reachability information found from yeast perturbation studies. He concludes that the expected number of direct regulatory interactions per gene is around 1 (if the underlying graph is uniformly random) or less than 0.5 (if the underlying graph is scale-free with a degree exponent of 2). Chen et al. [6] use time-dependent gene expression information to determine candidate activators and inhibitors of each gene, then prune the edges by assuming that no single gene functions both as activator and inhibitor. This assumption is too restrictive, given that transcription factors can have both activation and inhibition domains, and the same protein-level interactions (for example, phosphorylation by a kinase) can have positive or negative functional character depending on the target. Li et al. [20] manually synthesize a plant signal transduction network from indirect (single and double) inferences, introducing a first version of pseudo-vertex collapse. They assume that if A →⁰ B, A →⁰ C and C →⁰ (A →⁰ B), the most parsimonious explanation is A →⁰ C →⁰ B. The reader is referred to the excellent surveys in [9,15] for further general information on biological network inference and modelling.

Special cases of the BTR problem have been looked at by the theoretical computer science community in the different context of designing reliable communication networks. Obviously, BTR is NP-complete, since the special case with all-zero edge labels includes the problem of finding a directed Hamiltonian cycle in a graph. If E_critical = ∅, BTR with all-zero edge labels is known as the minimum equivalent digraph (MED) problem. MED is known to be MAX-SNP-hard, admits a polynomial-time algorithm with an approximation ratio of 1.617 + ε for any constant ε > 0 [16], and can be solved in polynomial time for directed acyclic graphs [1]. More recently, Vetta [24] has claimed a 3/2-approximation for the MED problem. A weighted version of the MED problem admits a 2-approximation [10]; this implies a 2-approximation for the BTR problem with all-zero edge labels. In a previous publication [4], we considered the BTR problem, generalized it to a so-called p-ary transitive reduction problem and provided an approximation algorithm for this generalization. In particular, we designed a (2 + o(1))-approximation for the generalized problem, observed that the general problem can be solved in polynomial time if the input graph is a DAG, and provided a
1.78-approximation for the BTR problem when all edge labels are zero but critical edges are allowed. The results in [4] are purely theoretical in nature, with no experimental or implementation results; moreover, the network synthesis process described in Figure 1 does not appear in [4]. All the theoretical results reported in this paper are disjoint from the results reported in [4].
4 New Algorithmic Results for BTR
Theorem 1. BTR can be solved in polynomial time if the graph has no cycles of length more than 3.

Theorem 2. The GREEDY procedure, namely the following approach:

  Definition: an edge u →ˣ v is redundant if there is an alternate path u ⇒ˣ v.
  GREEDY: while (there exists a redundant edge) delete the redundant edge

is a 3-approximation for the BTR problem. Moreover, there are input instances of BTR for which GREEDY has an approximation ratio of at least 2.
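As a concrete illustration, here is a minimal Python sketch of GREEDY (ours, not the authors' code): edges are (u, v, x) triples with parity label x, `critical` is the set of edges that must be kept, and redundancy of an edge is tested by a BFS over (vertex, parity) states in the graph with that edge removed. Since deleting edges can only destroy alternate paths, an edge found non-redundant never becomes redundant later, so a single pass over the candidates suffices.

```python
import random
from collections import deque

def greedy_btr(edges, critical, seed=0):
    """GREEDY of Theorem 2: repeatedly delete a redundant, non-critical
    edge.  edges and critical are sets of (u, v, x) triples."""
    E = set(edges)

    def reachable(u, v, x, F):
        # Is there a path u =x=> v using only the edges in F?
        out = {}
        for a, b, y in F:
            out.setdefault(a, []).append((b, y))
        seen = set()
        queue = deque(out.get(u, []))
        while queue:
            w, p = queue.popleft()
            if (w, p) in seen:
                continue
            seen.add((w, p))
            if (w, p) == (v, x):
                return True
            for b, y in out.get(w, []):
                queue.append((b, (p + y) % 2))
        return False

    candidates = [e for e in E if e not in critical]
    random.Random(seed).shuffle(candidates)   # randomized selection order
    for e in candidates:
        u, v, x = e
        if reachable(u, v, x, E - {e}):       # redundant: an alternate path exists
            E.discard(e)
    return E
```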
5 Our Implementation for the BTR Problem
Given an instance graph G = (V, E) of the BTR problem, one can design a straightforward dynamic programming approach to determine, for every u, v ∈ V and every x ∈ {0, 1}, whether u ⇒ˣ v exists in G. The worst-case running time of the algorithm is O(|V|³). To solve the BTR problem within an acceptable time complexity while ensuring good accuracy, we have implemented the following two major approaches. In Approach 1 (applicable to smaller graphs), if the number of nodes in the graph is at most a threshold N, we implement the GREEDY heuristic of Theorem 2 on the entire graph. The heuristic is implemented by iteratively selecting a new non-critical edge e = u →ˣ v for removal, tentatively removing it from G and checking whether the resulting graph has a path u ⇒ˣ v. If so, we remove the edge; otherwise, we keep it and mark it so that we never select it again. We stop when we have no more edges to select for deletion. In Approach 2 (applicable to larger graphs), if the number of nodes in the graph is above the threshold N, we first use Approach 1 for every strongly connected component of G. Then we use two procedures, T_cycle-to-gadget and T_gadget-to-cycle, described in the proof of Theorem 2 in the full version of this paper, to identify the remaining edges that can be deleted. To speed up our implementations and to improve accuracy, we also use some algorithmic engineering approaches. For example, we stop the Floyd-Warshall iteration in Approach 1 as soon as an alternate path u ⇒ˣ v is discovered, we randomize the selection of the next edge for removal, and, in Approach 2, if a strongly connected component has very few vertices, we calculate an exact solution of BTR on this component exhaustively. Both Approach 1 and Approach 2 are guaranteed to produce a 3-approximate solution by Theorem 2. However, in Approach 1 there is no bias towards a particular candidate edge for removal among all candidate edges; in contrast, in Approach 2 a bias is introduced via the removal of duplicate edges in the gadget replacement procedure. Thus, the two approaches may return slightly different solutions in practice.
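A sketch of the all-pairs reachability test mentioned above: for each source vertex, a BFS over the (vertex, parity) product graph decides, for all v and x, whether u ⇒ˣ v exists. On a dense graph this runs in O(|V|·|E|) ⊆ O(|V|³), consistent with the stated worst-case bound (this is an illustrative sketch, not the authors' implementation):

```python
from collections import deque

def all_pairs_parity_reachability(n, adj):
    """reach[x][u][v] is True iff a path u =x=> v exists.
    adj[u] is a list of (v, label) pairs; vertices are 0..n-1."""
    reach = [[[False] * n for _ in range(n)] for _ in range(2)]
    for u in range(n):
        seen = set()
        queue = deque(adj[u])
        while queue:
            v, x = queue.popleft()
            if (v, x) in seen:
                continue
            seen.add((v, x))
            reach[x][u][v] = True
            for w, y in adj[v]:
                queue.append((w, (x + y) % 2))
    return reach
```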
Choosing N to be 150, our implementation takes mostly negligible time to run on networks with up to thousands of nodes: on the order of seconds for the manually curated network described in Section 6, and about a minute for the 1000-node random biological networks described in Section 7, on which we tested the performance of our implementations. Theoretical worst-case estimates of the running times of the two approaches are as follows. Approach 1 runs in O(d · |V|³) time, where d is the number of non-critical edges. By using a linear-time solution of the BTR problem on a DAG, Approach 2 runs in $O(m^2 + |E| + \sum_{i=1}^{m} d_i \cdot n_i^3)$ time, where the given graph has m strongly connected components and dᵢ (nᵢ) is the number of non-critical edges (vertices) in the i-th strongly connected component.
6 Synthesizing ABA-Induced Stomatal Closure Network
Network inference algorithms applied to gene expression (microarray) data, based on several types of analysis, lead to indirect causal relationships among genes. Large-scale repositories of microarray data for several organisms, such as Many Microbe Microarrays, NASCArrays and Gene Expression Omnibus, contain expression information for thousands of genes under tens to hundreds of experimental conditions. Our methods are applicable for filtering redundant information by binary transitive reduction of indirect pairwise data, and for incorporating differential gene expression under experimental perturbations by pseudo-vertex collapse. Signal transduction pathway repositories such as TRANSPATH and protein interaction databases contain up to thousands of interactions, a large number of which are not supported by direct binding evidence. Our methods can be used to selectively filter redundant information while keeping all direct interactions.

In this section we discuss our computational results on synthesizing experimental results into a consistent guard-cell signal transduction network for ABA-induced stomatal closure, using the detailed procedure described in Section 2, and compare it with the manually curated network obtained in [20]. Our starting point is the list of experimentally observed causal relationships in ABA-induced closure collected by Li et al. and published as Table S1 of [20]. This table contains around 140 interactions and causal inferences, both of the type "A promotes B" and of the type "C promotes process(A promotes B)". We augment this list with critical edges drawn from biophysical/biochemical knowledge of enzymatic reactions and ion flows, and with simplifying hypotheses made by Li et al., both described in Text S1 of [20]. The synthesis of the network is carried out using the formal method described in Section 2. We also formalize an additional rule, specific to the context of this network (and implicitly assumed by [20]), regarding enzyme-catalyzed reactions.
A Novel Method for Signal Transduction Network Inference
ABA
PEPC
415
ABA
RCN1
PEPC
RCN1 Sph
SphK
Sph
SphK
Malate NOS
Arg NIA12Nitrite NADPH
S1P
GCR1
NO
PLC
PIP2
NAD ADPRc GTP
GC
GPA1
AGB1
PLD
PC
InsP3
cGMP
cADPR
InsP6
CIS ROP10
Ca2+ATPase
KEV
Ca2+c
NAD ADPRc GTP
InsP3
GC
cGMP
cADPR
pHc
AGB1
PA
InsP6
Actin ERA1
Ca2+ATPase
KOUT
KEV
Ca2+c
Depolarization AnionEM
*
Closure
pHc ABI1
HATPase
CaIM
Depolarization
ROS
ROP2
CIS ROP10
Atrboh
PC *
InsPK
ABH1
AnionEM
AtPP2C
PIP2
ABI1
KAP
GPA1
RAC1
HATPase
CaIM
OST1
PLD
DAG
Actin ERA1
GCR1
ROS
ROP2
ABH1
S1P
NO
PLC PA
Arg NIA12Nitrite NADPH
Atrboh
InsPK RAC1
DAG
Malate NOS
OST1
AtPP2C
*
KAP
KOUT
Closure
(a)
(b)
Fig. 2. (a) The network manually synthesized by Li et al. [20]. (b) The network synthesized in this paper. A pseudo-vertex is displayed as .
We follow Li et al. in representing each of these reactions by two directed critical edges, one from the reaction substrate to the product and one from the enzyme to the product. As the reactants (substrates) of the reactions in [20] are abundant, the only way to regulate the product is by regulating the enzyme. The enzyme, being a catalyst, always promotes the product's synthesis; thus positive indirect regulation of a product will be interpreted as positive regulation of the enzyme, and negative indirect regulation of the product will be interpreted as negative regulation of the enzyme. In graph-theoretic terms, this leads to the following rule. We have a subset E_enzymatic ⊆ E_critical of edges that are all labeled 0. Suppose that we have a path A →ᵃ x →ᵇ B and an edge C →⁰ B ∈ E_enzymatic. Then we identify the node C with x by collapsing them together, and set the parities of the edges A → (x = C) and (x = C) → B based on the following two cases: if a + b ≡ 0 (mod 2), then both A → (x = C) and (x = C) → B have zero parity; otherwise, if a + b ≡ 1 (mod 2), then A → (x = C) has parity 1 and (x = C) → B has parity 0.

The manually synthesized network of Li et al. includes a pseudo-vertex for each non-critical edge, indicating the existence of unknown biological mediators. For ease of comparison we omit these degree-two pseudo-vertices. The two networks are shown in Figures 2(a)-(b). Here is a brief summary of an overall comparison of the two networks:
– The network of [20] has 54 vertices and 92 edges; our network has 57 vertices (3 extra pseudo-vertices) but only 84 edges. The two networks have 71 common edges.
– Both [20] and our network have identical strongly connected components (SCCs) of vertices. There is one SCC of size 18 (KOUT, Depolarization, KAP, CaIM, Ca2+c, Ca2+ATPase, HATPase, KEV, PLC, InsP3, NOS, NO, GC, cGMP, ADPRc, cADPR, CIS, AnionEM), one SCC of size 3 (Atrboh, ROS, ABI1), one SCC of size 2 (GPA1, AGB1), and the rest of the SCCs are of size 1 each.
– All the paths present in the [20] reconstruction are present in our network as well. Our network has the extra path ROP10 ⇒¹ Closure that Li et al. cited in their Table S1 but did not include in their network due to weak supporting evidence.
Thus the two networks are highly similar but diverge on a number of edges. Li et al. keep a few graph-theoretically redundant edges, such as ABA →⁰ PLC, PA →¹ ABI1 and ROS →⁰ CaIM, that would be explainable by feedback processes. Some of our edges, such as NO →⁰ AnionEM, correspond to paths in Li et al.'s reconstruction. Our graph contains the full pseudo-vertex-based representation of the process AtPP2C →¹ (ABA →⁰ Closure) that Li et al. simplify to AtPP2C →¹ ABA. We have pHc →⁰ ROS and ROS ⇒⁰ Atrboh where [20] has pH →⁰ Atrboh and a positive feedback loop on Atrboh. All these discrepancies are due not to algorithmic deficiencies but to human decisions. Finally, the entire network synthesis process was completed within a few seconds by our implemented algorithm.
7 BTR Algorithm's Performance on Simulated Networks
A variety of cellular interaction and regulatory networks have been mapped and graph-theoretically characterized. One of the most frequently reported graph measures is the distribution of node degrees, i.e., the distribution of the number of incoming or outgoing edges per node. A variety of networks, including many cellular interaction networks, are heterogeneous (diverse) in terms of node degrees and exhibit a degree distribution that is close to a power law or a mixture of a power law and an exponential distribution [14,2,11,19,21]. Transcriptional regulatory networks exhibit a power-law out-degree distribution, while the in-degree distribution is more restricted [23,18]. To test our algorithm on networks similar to the observed features, we generate random networks with a prescribed degree distribution using the methods in [22].

Fig. 3. A plot of the empirical performance of our BTR algorithm on the 561 simulated interaction networks. E′ is our solution, OPT is the loose lower bound on the minimum number of edges, and 100 × (|E′|/OPT − 1) is the percentage of additional edges that our algorithm keeps. On average, we use no more than 5.5% more edges than the optimum (with about 4.8% as the standard deviation).

We base the degree
distributions on the yeast transcriptional regulatory network, which has a maximum out-degree of ∼150 and a maximum in-degree of ∼15 [18]. In our generated networks the in-degree distribution is exponential, i.e., Pr[in-degree = x] = Le^(−Lx) with L between 1/2 and 1/3, and the maximum in-degree is 12. The out-degree distribution is governed by a power law, i.e., for x ≥ 1, Pr[out-degree = x] = c·x^(−c), and for x = 0, Pr[out-degree = 0] ≥ c, with c between 2 and 3, and the maximum out-degree is 200. We varied the ratio of excitory to inhibitory edges between 2 and 4. Since there are no known biological estimates of critical edges, we tried a few small and large values, such as 1%, 2% and 50%, for the percentage of edges that are critical, to catch qualitatively all regions of dynamics of the network that are of interest.¹

To empirically test the performance of our algorithm, we used the (rather loose) lower bound OPT ≥ max{n + s − c, t, L}, where n is the number of vertices, s is the number of strongly connected components, c is the number of connected components of the underlying undirected graph, t is the number of those edges u →ˣ v such that either u →ˣ v ∈ E_critical or there is no alternate path u ⇒ˣ v in the graph, and L is a lower bound described in the full version of the paper. We tested the performance of our BTR algorithm on 561 randomly generated networks, varying the number of vertices between roughly 100 and 900. A summary of the performance is shown in Figure 3, indicating that our transitive reduction procedure returns solutions close to optimal in many cases, even with such a simple lower bound on OPT. The running time of BTR on an individual network is negligible (from about one second for a 100-node network to no more than about a minute for a 1000-node network). A summary of the various statistics of these 561 networks is given in Figure 4.

Fig. 4. Basic statistics of the simulated networks used in Figure 3.

  number of nodes (range)   average number of edges
                            total   excitory   inhibitory   critical
  98–100                      206        147           59         31
  250–282                     690        552          138         33
  882–907                    2489       1991          498        118
¹ By "estimates of critical edges", we mean an accurate estimate of the percentage of total edges that are critical on average in a biological network. Depending on the experimental or inference methods, different network reconstructions have widely varying expected fractions of critical edges. For example, the curated network of Ma'ayan et al. [21] is expected to have close to 100% critical edges as they specifically focused on collecting direct interactions only. Protein interaction networks are expected to be mostly critical [11,12,19]. The so-called genetic interactions (e.g., synthetic lethal interactions) represent compensatory relationships, and only a minority of them are direct interactions. Network inference (reverse engineering) approaches lead to networks whose interactions are close to 0% critical.
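For illustration, a minimal Python sketch of how such degree sequences can be sampled (the wiring of the sampled sequences into a graph follows [22]; the parameter defaults, the Pareto-based power-law sampler and the crude balancing step are our simplifications, not the authors' generator):

```python
import random

def sample_degree_sequences(n, L=0.4, c=2.5, max_in=12, max_out=200, seed=0):
    """In-degrees ~ exponential with rate L, capped at 12; out-degrees
    ~ power law with exponent roughly c on x >= 1, capped at 200.
    Totals are then balanced so a graph with these sequences exists."""
    rng = random.Random(seed)
    ins = [min(max_in, int(rng.expovariate(L))) for _ in range(n)]
    # Pareto(c-1) has tail ~ x**(-(c-1)); flooring it gives an integer
    # distribution with Pr[k] ~ k**(-c) for large k.
    outs = [min(max_out, int(rng.paretovariate(c - 1))) for _ in range(n)]
    while sum(ins) != sum(outs):          # crude balancing of edge totals
        i = rng.randrange(n)
        if sum(ins) > sum(outs):
            ins[i] = max(0, ins[i] - 1)
        else:
            outs[i] = max(0, outs[i] - 1)
    return ins, outs
```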
To verify the robustness of our BTR algorithm, we perturb most of these networks with increasing amounts of additional random edges, chosen such that they do not change the optimal solution of the original graph. In most cases, our algorithm returns a solution that is either optimal or very close to the original network to which the additional edges were added.

Software. See http://www.cs.uic.edu/~dasgupta/network-synthesis/.
References

1. Aho, A., Garey, M.R., Ullman, J.D.: The transitive reduction of a directed graph. SIAM Journal on Computing 1(2), 131–137 (1972)
2. Albert, R., Barabási, A.-L.: Statistical mechanics of complex networks. Reviews of Modern Physics 74(1), 47–97 (2002)
3. Alberts, B.: Molecular Biology of the Cell. Garland, New York (1994)
4. Albert, R., DasGupta, B., Dondi, R., Sontag, E.: Inferring (biological) signal transduction networks via transitive reductions of directed graphs. Algorithmica (to appear)
5. Carter, G.W.: Inferring network interactions within a cell. Briefings in Bioinformatics 6(4), 380–389 (2005)
6. Chen, T., Filkov, V., Skiena, S.: Identifying gene regulatory networks from experimental data. In: Proc. of the third RECOMB, pp. 94–103 (1999)
7. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. The MIT Press, Cambridge (2001)
8. DasGupta, B., Enciso, G.A., Sontag, E.D., Zhang, Y.: Algorithmic and complexity results for decompositions of biological networks into monotone subsystems. In: Àlvarez, C., Serna, M. (eds.) WEA 2006. LNCS, vol. 4007, pp. 253–264. Springer, Heidelberg (2006)
9. Filkov, V.: Identifying gene regulatory networks from gene expression data. In: Aluru, S. (ed.) Handbook of Computational Molecular Biology. Chapman & Hall/CRC Press (2005)
10. Frederickson, G.N., JáJá, J.: Approximation algorithms for several graph augmentation problems. SIAM Journal on Computing 10(2), 270–283 (1981)
11. Giot, L., Bader, J.S., et al.: A protein interaction map of Drosophila melanogaster. Science 302, 1727–1736 (2003)
12. Han, J.D., Bertin, N., et al.: Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature 430, 88–93 (2004)
13. Heinrich, R., Schuster, S.: The Regulation of Cellular Systems. Chapman & Hall, New York (1996)
14. Jeong, H., Tombor, B., Albert, R., Oltvai, Z.N., Barabási, A.-L.: The large-scale organization of metabolic networks. Nature 407, 651–654 (2000)
15. de Jong, H.: Modelling and simulation of genetic regulatory systems: a literature review. Journal of Computational Biology 9(1), 67–103 (2002)
16. Khuller, S., Raghavachari, B., Young, N.: Approximating the minimum equivalent digraph. SIAM Journal on Computing 24(4), 859–872 (1995)
17. Khuller, S., Raghavachari, B., Young, N.: On strongly connected digraphs with bounded cycle length. Discrete Applied Mathematics 69(3), 281–289 (1996)
18. Lee, T.I., Rinaldi, N.J., et al.: Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298, 799–804 (2002)
19. Li, S., Armstrong, C.M., et al.: A map of the interactome network of the metazoan C. elegans. Science 303, 540–543 (2004)
20. Li, S., Assmann, S.M., Albert, R.: Predicting Essential Components of Signal Transduction Networks: A Dynamic Model of Guard Cell Abscisic Acid Signaling. PLoS Biology 4(10) (October 2006)
21. Ma’ayan, A., et al.: Formation of Regulatory Patterns During Signal Propagation in a Mammalian Cellular Network. Science 309(5737), 1078–1083 (2005)
22. Newman, M.E.J., Strogatz, S.H., Watts, D.J.: Random graphs with arbitrary degree distributions and their applications. Phys. Rev. E 64(2), 26118–26134 (2001)
23. Shen-Orr, S.S., Milo, R., Mangan, S., Alon, U.: Network motifs in the transcriptional regulation network of Escherichia coli. Nature Genetics 31, 64–68 (2002)
24. Vetta, A.: Approximating the minimum strongly connected subgraph via a matching lower bound. In: Proc. of the 12th ACM-SIAM Symposium on Discrete Algorithms, pp. 417–426 (2001)
25. Wagner, A.: Estimating Coarse Gene Network Structure from Large-Scale Gene Perturbation Data. Genome Research 12, 309–315 (2002)
Composing Globally Consistent Pathway Parameter Estimates Through Belief Propagation

Geoffrey Koh 1, Lisa Tucker-Kellogg 2, David Hsu 1,2, and P.S. Thiagarajan 1,2

1 Graduate School for Integrative Sciences and Engineering, National University of Singapore
[email protected]
2 Department of Computer Science, National University of Singapore
{tucker,dyhsu,thiagu}@comp.nus.edu.sg
Abstract. Parameter estimation of large bio-pathway models is an important and difficult problem. To reduce the prohibitive computational cost, one approach is to decompose a large model into components and estimate their parameters separately. However, the decomposed components often share common parts that may have conflicting parameter estimates, as they are computed independently within each component. In this paper, we propose to use a probabilistic inference technique called belief propagation to reconcile these independent estimates in a principled manner and compute new estimates that are globally consistent and fit well with data. An important advantage of our approach in practice is that it naturally handles incomplete or noisy data. Preliminary results based on synthetic data show promising performance in terms of both accuracy and efficiency.
1 Introduction
Quantitative modeling of the dynamics of bio-pathways (gene regulatory networks, metabolic pathways and signaling pathways) can play a vital role in understanding fundamental intra- and inter-cellular processes. Abstractly, a bio-pathway can be viewed as a network of biochemical reactions modeled as a coupled system of differential equations. In practice, the values of many of the rate parameters governing these reactions (equations) are often unknown. Hence techniques for estimating the values of these unknown parameters are of considerable importance. To perform parameter estimation of large bio-pathway models, we face two major challenges: (i) a high-dimensional search space, due to a large number of unknown parameters, and (ii) insufficient and noisy data. In our earlier work (Koh et al., 2006), we proposed a decompositional approach to address the first challenge. To decompose a large model into smaller components, we exploit the structure of the pathway model and the distribution of the locations within the pathway where experimental data is available. We then estimate the parameters for each component separately. The decompositional approach dramatically
improves computational efficiency; however, the decomposed pathway components often share common parts that may have conflicting parameter estimates, as they are computed independently within each individual component. A key question, and the focus of this work, is how to reconcile these different estimates in a principled manner. Our main idea is to use a probabilistic inference technique called belief propagation (Yedidia et al., 2003). We model the value of an unknown parameter as a probability distribution, commonly called a belief in this context. We then propagate and update the beliefs of the unknown parameters within the pathway model. This way, local beliefs arising from different pathway components are collated to construct globally consistent beliefs for all the parameters. Another important advantage of our approach is that it naturally handles incomplete or noisy data and helps to address the second challenge described earlier. We have implemented a discretized version of our belief propagation algorithm and performed initial tests. Simulation results using the estimated parameter values show good correlation with (synthetic) experimental data. Furthermore, based on the performance of belief propagation in many other applications (e.g., Felzenszwalb and Huttenlocher, 2006; Ihler et al., 2004; Friedman, 2004), we expect that our algorithm will scale well with pathway size. The rest of this paper is organized as follows. In Section 2, we review background material and related work. In Section 3, we describe a probabilistic graphical model called the Factor Graph and show how it can be used to represent a bio-pathway. We then explain the details of our parameter estimation algorithm. It applies belief propagation on a Factor Graph model of a pathway in order to reconcile the parameter estimates for the individual components and yield globally consistent parameter estimates for the entire pathway. In Section 4, we present simulation results on the performance of our algorithm. In the final section, we conclude and discuss prospects for future work.
2 Background
We first briefly review the background on bio-pathway modeling and then the parameter estimation problem. Here and in the rest of the paper, we focus on signaling pathways in eukaryotic cells. The dynamics of a signaling pathway is usually represented as a system of nonlinear ordinary differential equations (ODEs):

\[ \dot{x}_i = f_i(x(t), p) \tag{1} \]
where $\dot{x}_i$ denotes the rate of change of the concentration level of the species $x_i$ (typically a protein). The vector $x(t)$ denotes the concentration levels of the various species at time $t$, while $p$ is the set of parameters, many of which will be unknown and have to be estimated. The nonlinear function $f_i$ encodes the rates of the reactions that produce or consume $x_i$. Without loss of generality, we restrict our discussion and examples to mass action kinetics, where the rate of a reaction is proportional to the concentrations of the participating molecules.
In this setting, a typical reaction can be written as Equation 2. The substrates $x_1$ and $x_2$ bind reversibly to form the complex $x_3$, at a rate governed by the forward and reverse rate constants $k_1$ and $k_2$:

\[ x_1 + x_2 \;\underset{k_2}{\overset{k_1}{\rightleftharpoons}}\; x_3 \tag{2} \]
The system of equations that describes this reaction is

\[ \dot{x}_1 = k_2 x_3 - k_1 x_1 x_2 \]
\[ \dot{x}_2 = k_2 x_3 - k_1 x_1 x_2 \]
\[ \dot{x}_3 = k_1 x_1 x_2 - k_2 x_3 \]

(a simulation sketch of this system is given at the end of this section). Clearly, the correctness of a pathway model crucially depends on the values of the parameters. Determining these parameters through wet-lab experiments consumes significant time and cost, and is sometimes impossible. Hence, one must resort to computational techniques to estimate their values, based on available experimental data.

Parameter estimation can be viewed as an optimization problem with differential-algebraic constraints (e.g., Kikuchi et al., 2003; Moles et al., 2003). The algebraic constraints result from the input data, which consists of experimentally measured gene expression levels or protein concentration levels at selected discrete time points. The differential constraints come from the ODEs that govern the biochemical reactions in the pathway. The problem is to estimate the pathway parameters (initial conditions and kinetic rate constants) and all the unknown gene expression levels and protein concentration levels so as to fit the experimental data as closely as possible according to a suitable error measure (see Koh et al., 2006 for details).

Many approaches are available to solve this optimization problem, including standard local descent and evolutionary strategies. Each algorithm has its own merits and limitations (Mendes and Kell, 1998; Moles et al., 2003). A common characteristic of these techniques is that they consider the entire pathway and all its parameters simultaneously during estimation. In general, this leads to a combinatorial explosion in the dimensionality of the parameter space being searched. Therefore, these methods do not scale well to large pathway models with many unknown parameters. Furthermore, the existing methods provide only point estimates as solutions, which is unrealistic.

In our previous work, we tackled the high-dimensionality barrier by exploiting the structure of the pathway to break it down into smaller components, so that parameter estimation can be done by parts. A component is identified by back-tracing from the observed molecules until there are no more molecules to include in the component, or until we encounter molecules that are already observed at all time points, possibly due to prior simulations. As a key consequence, the resulting component can be simulated on its own. We have shown that this approach can produce reasonable estimates using much less time than global methods. However, the components defined by our method are not necessarily disjoint. When two components overlap, the parameter values of the common portion are fixed to be those associated with one of
the components. This is a limitation that has not been explored previously, especially when experimental data is available for the unshared parts of both components. In this work, we propose to use belief propagation to address this critical difficulty. Belief propagation is a message-passing protocol that operates on probabilistic graphical models such as Bayesian Networks and Markov Random Fields (Pearl, 1988). Belief propagation has been applied in systems biology for the analysis of gene regulatory networks and for inferring network structure; see, for example, Yeang and Jaakkola, 2003 and Friedman, 2004. In Gat-Viks et al., 2005, it was used to develop a method that incorporates prior biological knowledge and learns a refined model with improved fit to experimental data. Interestingly, as we show in this paper, belief propagation can be used to integrate the estimates provided by the different components that share portions of the pathway, such that they are globally consistent.
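To make the mass-action setting above concrete, the following is a minimal simulation sketch of the reaction in Equation 2 and its three ODEs. It is written in Python with SciPy purely for illustration (the paper reports no code for this part); the rate constants, initial concentrations and time horizon are assumed values, not taken from the text.

    import numpy as np
    from scipy.integrate import solve_ivp

    k1, k2 = 0.5, 0.2   # assumed forward/reverse rate constants (illustrative)

    def rhs(t, x):
        x1, x2, x3 = x
        v = k1 * x1 * x2 - k2 * x3   # net mass-action flux of the binding reaction
        return [-v, -v, v]           # matches the three ODEs above

    # Assumed initial concentrations: free substrates only, no complex yet.
    sol = solve_ivp(rhs, (0.0, 50.0), [1.0, 1.0, 0.0],
                    t_eval=np.linspace(0.0, 50.0, 21))
    print(sol.y[:, -1])   # concentrations settle toward the binding equilibrium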
3 Parameter Estimation by Belief Propagation
Belief propagation provides a principled way of reconciling the inconsistent estimates that arise from decomposition. The reasons for these inconsistencies are twofold: (i) the multimodal solution landscape, which may lead the algorithms to converge to different solutions, and (ii) the noisy and sparse data sets used for the various components and pathways. Our method consists of the following steps. We first decompose the pathway model into smaller components using the method presented in Koh et al., 2006 (see Section 2 for a short summary). In the second step, we convert each component into a probabilistic graphical model, which in our current setting is the Factor Graph (Kschischang et al., 2001). The parameters are given a probability distribution (belief) over the interval between their upper and lower bounds. The initial beliefs of the unknown parameters are uniformly distributed. However, one can also assign them non-uniform distributions to reflect any prior knowledge about their values. Functional dependencies between the parameters are captured by building compatibility functions for each factor node via sampling. The next step is to compose the Factor Graphs together to form a larger Factor Graph. Finally, we use belief propagation to reconcile the local estimates, so as to generate globally consistent estimates for all the parameters. In addition, we may refine the estimates through local descent on the entire pathway.
3.1 Factor Graph
A Factor Graph is an undirected bipartite graph consisting of factor nodes and variable nodes. In the present setting, where each equation in the system of ODEs is of the form $\dot{x}_i = f_i(x(t), p)$, we will have one factor node for each such $f_i$. For convenience, we will denote this factor node as $F_i$. The variable nodes then denote the unknown parameters. An edge exists between the factor node $F_i$ and
the parameter $k$ iff $k$ appears in $f_i$, or some $x_j$ (representing the concentration level of the $j$th molecular species) appears in $f_i$ and $k$ is an argument of $f_j$. An example of a simple pathway model, its system of equations and the associated Factor Graph is shown in Figure 1. In this Factor Graph, there is an edge from the factor node $F_2$ to each of the variable nodes $k_1$, $k_2$, $k_3$ and $k_4$. This is because $k_3$ and $k_4$ are arguments of $f_2$, while $k_1$ and $k_2$ directly affect the rate of change of $x_1$, which is also an argument of $f_2$.
\[ \dot{x}_1 = k_1 - k_2 x_1 \]
\[ \dot{x}_2 = k_4 x_3 - k_3 x_1 x_2 \]
\[ \dot{x}_3 = k_3 x_1 x_2 - k_4 x_3 \]
\[ \dot{x}_4 = k_6 x_5 - k_5 x_1 x_4 \]
\[ \dot{x}_5 = k_5 x_1 x_4 - k_6 x_5 \]
Fig. 1. (A) A simple pathway model with its system of ODEs. (B) The Factor Graph representation of the pathway. The round nodes are variable nodes and the square nodes are factor nodes.
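As an illustration of the edge rule just described, the following sketch derives the Factor Graph edges of Figure 1 from the dependency structure of its ODEs. The code and its dictionary names are ours (Python, illustrative only); the dictionaries simply transcribe which parameters and species appear in each $f_i$ above.

    # Dependency structure of the Figure 1 model: for each f_i, the parameters
    # appearing in f_i and the species x_j appearing in f_i.
    params_of = {1: {"k1", "k2"}, 2: {"k3", "k4"}, 3: {"k3", "k4"},
                 4: {"k5", "k6"}, 5: {"k5", "k6"}}
    species_of = {1: {1}, 2: {1, 2, 3}, 3: {1, 2, 3}, 4: {1, 4, 5}, 5: {1, 4, 5}}

    # Edge rule: F_i -- k iff k appears in f_i, or some x_j in f_i has k in f_j.
    edges = {}
    for i in params_of:
        nbrs = set(params_of[i])
        for j in species_of[i]:
            nbrs |= params_of[j]
        edges["F%d" % i] = sorted(nbrs)

    print(edges["F2"])   # ['k1', 'k2', 'k3', 'k4'], matching the text above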
3.2 Compatibility Function
With each factor node $F_i$, we associate a compatibility function $\psi_i$ over the parameters that are connected to it. This compatibility function is defined by a joint probability distribution between the values of the parameters. This distribution has to be built up by sampling the parameter space and scoring the samples using the following objective function:

\[ J(p_i) = \left( \sum_{k \in x^{\mathrm{obs}},\, j} \left( x^{\mathrm{sim}}_{kj}(p) - x^{\mathrm{exp}}_{kj} \right)^2 / w_{kj}^2 \right)^{1/2} \tag{3} \]

where $x^{\mathrm{exp}}_{kj}$ is the $j$th experimental data point of variable $x_k$ and $x^{\mathrm{sim}}_{kj}(p)$ is the corresponding predicted value generated by simulating the ODEs using the sampled values of the parameters $p$. The set $p_i$ consists of the parameters that are connected to the factor node $F_i$. Finally, $w_{kj}$ is a weight, typically the maximal value of $x_k$, used to normalize its contribution to the objective function. Within a component, we distinguish between a local score and a global score. A local score is obtained by applying the objective function to a single observed variable, while a global score is obtained by using all the observed variables. For a factor node whose corresponding variable $x_i$ is observed, its compatibility function will be derived using the local score, i.e., $x^{\mathrm{obs}} = \{x_i\}$. Otherwise the global score will be used.
The scores are then converted into probabilities by the following equation:

\[ \psi_i(p_i) = \frac{1}{z}\, e^{\mu_i \left( 1 - J(p_i)/\max J(p_i) \right)} \tag{4} \]

where $z$ is a normalizing constant that ensures the probabilities sum to 1, and $\mu_i$ is a scaling factor, usually set to a value between 10 and 20 depending on the accuracy of the experimental data. These distributions capture the dependencies between the parameters, and they are immutable.
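The following is a minimal sketch of Equations 3 and 4 in Python (illustrative only; the paper's implementation is in C++ and is not shown). The names `simulate`, `data` and `weights` are hypothetical placeholders for the component simulator, the experimental time series and the normalizing weights.

    import numpy as np

    def score(p, simulate, data, weights):
        # Eq. (3): weighted sum-of-squares deviation between simulation and data.
        # simulate(p) is assumed to return a dict of predicted time series keyed
        # like `data`; weights[k] is typically the maximal value of x_k.
        x_sim = simulate(p)
        total = 0.0
        for kvar, x_exp in data.items():
            total += np.sum((x_sim[kvar] - x_exp) ** 2 / weights[kvar] ** 2)
        return np.sqrt(total)

    def compatibility(scores, mu=15.0):
        # Eq. (4): turn the sampled scores J(p_i) into a normalized distribution;
        # mu plays the role of the scaling factor (10-20 in the text).
        s = np.asarray(scores, dtype=float)
        psi = np.exp(mu * (1.0 - s / s.max()))
        return psi / psi.sum()           # division by the constant z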
3.3 Composing the Factor Graph for the Entire Pathway
Suppose that we have possibly overlapping components after pathway decomposition. Each of them is converted into a Factor Graph as described above and sampled separately to build its compatibility functions. To reconcile the beliefs of their parameters, we compose a single Factor Graph for the entire pathway by fusing together their common variable nodes (e.g., Figure 3B). These variable nodes allow information from one Factor Graph to propagate to the other (a small sketch of this composition follows).
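A minimal sketch of the composition step, under the simplifying assumption that a Factor Graph is represented as a map from (component-qualified) factor nodes to the parameter variable nodes they touch; identical parameter names across components denote the same fused node. All names below are illustrative, not from the paper.

    # Compose per-component Factor Graphs by fusing shared variable nodes.
    def compose(*graphs):
        fused = {}
        for g in graphs:
            fused.update(g)        # factor nodes stay distinct per component
        return fused

    fg_c1 = {"C1:F2": ["k1", "k2", "k3", "k4"], "C1:F3": ["k3", "k4"]}
    fg_c2 = {"C2:F2": ["k1", "k2", "k3", "k4"], "C2:F8": ["k9", "k10"]}
    fg = compose(fg_c1, fg_c2)     # k1..k4 are now shared, so beliefs can flow
                                   # between the two components through them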
3.4 Message Passing and Updating
Having constructed the joint probability distributions between the parameters and composed the Factor Graphs, we can now update the beliefs of the parameters. This is achieved by forming messages and passing them between the nodes of the Factor Graph. As the messages are propagated, they cause the beliefs of the receiving parameters to be updated. Since there are two types of nodes, there are two types of messages. Denoting the message from node $n_i$ to $n_j$ as $m_{ij}(n_j)$, a message from a factor node $n_f$ to a variable node $n_v$ is defined as:

\[ m_{fv}(n_v) = \max_{i \in N(n_f) \setminus n_v} \; \psi_f \prod_{i \in N(n_f) \setminus n_v} m_{if}(n_f) \tag{5} \]

and a message from a variable node $n_v$ to a factor node $n_f$ is then:

\[ m_{vf}(n_f) = \alpha_v \, \phi(k_v) \prod_{j \in N(n_v) \setminus n_f} m_{jv}(n_v) \tag{6} \]

where $N(n)$ denotes the set of nodes neighboring $n$, $\alpha_v$ is a normalizing constant for $n_v$ and $k_v$ is the parameter represented by $n_v$. $\phi(k_v)$ denotes the current belief of $k_v$; it is updated after receiving messages from all the neighboring factor nodes. The scheme described above is also known as the max-product algorithm. Applied to a probabilistic graphical model, it computes the maximum a posteriori (MAP) probabilities of the variable nodes. On loop-free graphs, this algorithm converges to a unique probability distribution. Further, the assignment based on this distribution yields the most probable values for the nodes, which can
then be reported as the estimated values. Although the Factor Graphs induced by bio-pathways are seldom loop-free, it has been shown that applying belief propagation on such loopy graphs still yields a distribution that gives a neighborhood maximum, and that for some graphs this neighborhood can be exponentially large (Weiss and Freeman, 2001). We have implemented this loopy belief propagation algorithm (Murphy et al., 1999) on parameters that have been discretized. In this algorithm, messages are generated and propagated throughout the Factor Graph until the beliefs of the parameters have converged or a pre-determined number of iterations has been executed.
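For concreteness, here is a minimal sketch of discretized loopy max-product message passing on a toy Factor Graph (Python, illustrative only; the authors' implementation is in C++ and its details are not given in the paper). For simplicity the sketch treats $\phi(k_v)$ as a fixed prior and reads off the beliefs at the end; the factor tables, domain size and iteration count are assumed.

    import numpy as np

    D = 8                                    # partitions per parameter (assumed)
    rng = np.random.default_rng(0)
    factors = {"F1": (("k1",), rng.random(D)),        # toy compatibility tables
               "F2": (("k1", "k2"), rng.random((D, D)))}
    neighbors = {"k1": ["F1", "F2"], "k2": ["F2"]}
    prior = {v: np.ones(D) / D for v in neighbors}    # phi(k_v), kept fixed here

    msgs = {}
    for f, (vs, _) in factors.items():
        for v in vs:
            msgs[(f, v)] = np.ones(D) / D
            msgs[(v, f)] = np.ones(D) / D

    for _ in range(50):                      # or stop once the beliefs converge
        for f, (vs, psi) in factors.items():         # factor -> variable, Eq. (5)
            for i, v in enumerate(vs):
                m = psi.copy()
                for j, u in enumerate(vs):
                    if u != v:               # fold in the other variables' messages
                        shape = [1] * len(vs)
                        shape[j] = D
                        m = m * msgs[(u, f)].reshape(shape)
                axes = tuple(j for j in range(len(vs)) if j != i)
                out = m.max(axis=axes) if axes else m
                msgs[(f, v)] = out / out.sum()
        for v, fs in neighbors.items():              # variable -> factor, Eq. (6)
            for f in fs:
                m = prior[v].copy()
                for g in fs:
                    if g != f:
                        m = m * msgs[(g, v)]
                msgs[(v, f)] = m / m.sum()

    for v, fs in neighbors.items():          # belief and MAP partition per parameter
        belief = prior[v] * np.prod([msgs[(f, v)] for f in fs], axis=0)
        print(v, "-> MAP partition", int(belief.argmax()))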
4 Simulation Results
Our parameter estimation algorithm is implemented in C++. It works in the discrete domain by dividing each dimension of the parameter space into finitely many partitions. Belief propagation then gives us the most likely domains, or the MAP partitions. We fine-tune the estimates by applying the Levenberg-Marquardt algorithm (Gill et al., 1982), starting from the mid-points of the MAP partitions.
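A minimal sketch of this refinement step, using SciPy's Levenberg-Marquardt solver in place of the authors' C++ code. The function `residuals` is a hypothetical placeholder for the weighted residual vector underlying the score of Equation 3; `lo`, `hi` and `nbins` describe the assumed parameter bounds and discretization.

    import numpy as np
    from scipy.optimize import least_squares

    def refine(map_bins, lo, hi, nbins, residuals):
        # Start Levenberg-Marquardt from the mid-points of the MAP partitions.
        lo, hi = np.asarray(lo, float), np.asarray(hi, float)
        width = (hi - lo) / nbins
        p0 = lo + (np.asarray(map_bins) + 0.5) * width   # partition mid-points
        return least_squares(residuals, p0, method="lm").x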
4.1 Case Study: Branching Signaling Pathway
To test our approach, we have applied it to a synthetic pathway model that exhibits branching, a typical feature of signaling pathways. We constructed a branched pathway with 11 molecules, with “nominal values” for the kinetic rate constants falling in the interval [0.0, 1.0]. Simulation of this model yielded the synthetic data we use in lieu of experimental data; the nominal rate constants were then set aside. We allowed synthetic time series data at 20 discrete time points to be made available for 5 of the molecules: $x_1$, $x_5$, $x_7$, $x_9$ and $x_{11}$. There remain 12 unknown parameters to be estimated. This is a reasonable approximation of the parameter estimation problem in actual settings. Using the decompositional approach of Koh et al., 2006, the pathway can be broken down into multiple overlapping components. We consider the components C1 and C2, which consist of the variable sets $\{x_1, x_2, x_3, x_4, x_5, x_6, x_7\}$ and $\{x_1, x_2, x_3, x_8, x_9, x_{10}, x_{11}\}$, respectively (Figure 3A). We first estimate the parameters by sampling the components separately, and then reconcile their values by belief propagation on the combined Factor Graph (Figure 3B). As an alternative, we also apply our previous method by estimating the parameters for C1 followed by the remaining ones in C2 (Figure 3C, Scheme S1), and vice versa (Figure 3C, Scheme S2). We compare the efficiency and quality of our results with other optimization algorithms: Levenberg-Marquardt (LM), Evolutionary Strategies with Stochastic Ranking (SRES) and a Genetic Algorithm (GA). All simulations are performed on an Intel Pentium M processor with 1 GB of memory. Fine-tuning of the estimates, and the running of the other optimization techniques, are done using the open-source software COPASI (Hoops et al., 2006). We score the resulting parameters obtained from all the algorithms using the weighted sum-of-squares difference between the experimental data and the corresponding simulation profile. The results of the simulations are summarized in Table 1.
\[ \dot{x}_1 = k_1 - k_2 x_1 \]
\[ \dot{x}_2 = k_4 x_3 - k_3 x_1 x_2 \]
\[ \dot{x}_3 = k_3 x_1 x_2 - k_4 x_3 \]
\[ \dot{x}_4 = k_6 x_5 - k_5 x_3 x_4 \]
\[ \dot{x}_5 = k_5 x_3 x_4 - k_6 x_5 \]
\[ \dot{x}_6 = k_8 x_7 - k_7 x_5 x_6 \]
\[ \dot{x}_7 = k_7 x_5 x_6 - k_8 x_7 \]
\[ \dot{x}_8 = k_{10} x_9 - k_9 x_3 x_8 \]
\[ \dot{x}_9 = k_9 x_3 x_8 - k_{10} x_9 \]
\[ \dot{x}_{10} = k_{12} x_{11} - k_{11} x_{10} x_9 \]
\[ \dot{x}_{11} = k_{11} x_{10} x_9 - k_{12} x_{11} \]
Fig. 2. Schema of the pathway model and the system of ODEs that defines it. The dashed arrows in the schema represent enzyme-catalyzed reactions where the enzymes are not consumed by the reactions.
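The synthetic-data setup can be reproduced along the following lines: simulate the ODEs of Figure 2 with the nominal rate constants of Table 1 and keep the time series of the five observed molecules. The sketch below is ours (Python with SciPy); the initial concentrations and the time scale are assumptions, as the paper does not report them.

    import numpy as np
    from scipy.integrate import solve_ivp

    # Nominal rate constants k1..k12, taken from Table 1.
    k1, k2, k3, k4, k5, k6 = 0.625, 0.228, 0.112, 0.96, 0.579, 0.312
    k7, k8, k9, k10, k11, k12 = 0.628, 0.104, 0.04, 0.286, 0.624, 0.88

    def rhs(t, x):
        x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11 = x
        return [k1 - k2*x1,
                k4*x3 - k3*x1*x2,
                k3*x1*x2 - k4*x3,
                k6*x5 - k5*x3*x4,
                k5*x3*x4 - k6*x5,
                k8*x7 - k7*x5*x6,
                k7*x5*x6 - k8*x7,
                k10*x9 - k9*x3*x8,
                k9*x3*x8 - k10*x9,
                k12*x11 - k11*x10*x9,
                k11*x10*x9 - k12*x11]

    x0 = [1.0] * 11                       # assumed initial concentrations
    t_obs = np.linspace(0.0, 20.0, 20)    # 20 time points; time scale assumed
    sol = solve_ivp(rhs, (0.0, 20.0), x0, t_eval=t_obs)
    data = {i: sol.y[i - 1] for i in (1, 5, 7, 9, 11)}   # observed molecules only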
4.2 Results and Discussion
The measure of quality for this set of results is not the closeness of the estimated parameters to the nominal ones. Rather, it is the score: the weighted sum-of-squares difference between the simulated concentration profiles (generated using those parameters) and the pseudo-experimental data sets. We can see from the results that belief propagation outperforms all the other algorithms, both in efficiency (requiring 10440 evaluations, where each evaluation is a complete numerical simulation of a single component) and in quality (with the lowest score of 0.0119). Note that, due to the discretized nature of our implementation, we cannot obtain the best estimates by belief propagation alone. Instead, it provides a starting point within the vicinity of a good solution, which we then locate using the LM algorithm. Without this starting point, it is not surprising that the LM algorithm, starting from the midpoint of the parameter space, performs the worst, with an astonishingly high score of 269.061, even though it requires only 177 evaluations to converge. Clearly, this indicates that the algorithm is trapped in a local minimum. It is interesting to note that the decompositional schemes (S1 and S2) provide better results than SRES, GA and LM. This is largely due to the lower dimensionality of the components and thus of their search spaces. However, the quality varies depending on the order in which the components C1 and C2 are considered. This variation is a result of prematurely fixing the estimates of the parameters in one component when better solutions may exist once the entire pathway is taken into account. The issues of ordering and choosing the components were largely left unaddressed in our previous work, but belief propagation fills this gap nicely by allowing each of the components to be estimated separately, and later combining their estimates through its message-passing scheme.
Fig. 3. (A) Overlapping components C1 and C2. (B) Factor Graph induced by C1 and C2, with the left Factor Graph corresponding to C1 and the right one to C2. They are composed together via the common variable nodes $k_1$, $k_2$, $k_3$ and $k_4$. (C) The two estimation schemes S1 and S2.
Besides requiring fewer model evaluations, our method incurs an additional runtime for message passing that is only quadratic in the size of the pathway. The size of the joint probability distribution table of a factor node $F_i$, on the other hand, is exponential in the number of variable nodes it is connected to. Hence, it is the degree of the Factor Graph, rather than the pathway size, that is the limiting factor for the performance of belief propagation. It will be interesting to determine whether there is a reasonable upper bound on the connectivity of the factor nodes for bio-pathways.
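As a back-of-the-envelope illustration of this point, with assumed numbers not taken from the paper:

    # With D partitions per parameter, a factor of degree d stores a table of
    # D**d entries, regardless of the total pathway size (numbers illustrative).
    D, d = 10, 4
    print(D ** d)   # 10000 entries for a single degree-4 factor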
5 Conclusion
Parameter estimation of large bio-pathway models is an important but difficult problem. To reduce the prohibitive computational cost, we take a decompositional approach that conceptually consists of three main steps: (i) divide a large pathway model into components, (ii) compute the estimates for the parameters in each component separately, and (iii) combine the parameter estimates for all the components into a globally consistent one.
Table 1. Comparison of belief propagation (BP/LM) with the original decompositional approach (S1 and S2) and other optimization techniques (LM, SRES, GA). The zeroth-, first- and second-order rate constants are given in nM·s−1, s−1 and nM−1·s−1, respectively. Search parameters are specific to the individual algorithms: Iter = maximum number of iterations; Tol = tolerance; Gen = number of generations; Pop = population size. An evaluation is a complete numerical simulation of the ODEs using the current parameter estimates. The score for a set of parameters is the weighted sum-of-squares difference between the experimental data and the corresponding simulation profile generated using those parameters.

Parameter         | Nominal | BP/LM                | S1                | S2                | LM                   | GA       | SRES
k1                | 0.625   | 0.6368               | 0.6915            | 0.4635            | 0.3991               | 0.3035   | 0.5031
k2                | 0.228   | 0.2317               | 0.2609            | 0.1311            | 0.7912               | 1.2e−138 | 0.1629
k3                | 0.112   | 0.4007               | 0.0739            | 0.0565            | 0.3236               | 0.3997   | 0.1020
k4                | 0.96    | 0.1917               | 0.3264            | 0.7327            | 1                    | 0.6247   | 0.6063
k5                | 0.579   | 0.1089               | 0.9317            | 1                 | 0.5592               | 0.2361   | 0.6859
k6                | 0.312   | 0.2269               | 0.8134            | 0.3240            | 0.3779               | 0.3815   | 0.4444
k7                | 0.628   | 0.6235               | 0.6349            | 0.7405            | 0.5877               | 0.6680   | 0.8878
k8                | 0.104   | 0.0983               | 0.1657            | 0.1224            | 0.3032               | 0.0481   | 0.5638
k9                | 0.04    | 0.0075               | 0.0456            | 0.1184            | 0.3310               | 0.0192   | 0.0710
k10               | 0.286   | 0.1836               | 0.5523            | 0.8324            | 0.7980               | 0.4360   | 0.8602
k11               | 0.624   | 0.6317               | 0.6551            | 0.7041            | 0.2847               | 0.4132   | 0.5773
k12               | 0.88    | 0.8896               | 0.9468            | 0.8388            | 0.9402               | 0.5318   | 0.7079
Search parameters | −       | Iter: 200, Tol: 1e−5 | Gen: 200, Pop: 20 | Gen: 200, Pop: 20 | Iter: 200, Tol: 1e−5 | Gen: 400, Pop: 40 | Gen: 400, Pop: 40
Score             | −       | 0.0119               | 0.4138            | 2.3498            | 269.061              | 4.75115  | 2.5103
Evaluations       | −       | 10440                | 47820             | 47820             | 177                  | 15258    | 95814
We have shown that belief propagation is an effective method for the critical last step. It takes into account all the local constraints, represented as beliefs, and reconciles them in a principled manner by exploiting the pathway structure. An additional advantage of this method is that it handles incomplete or noisy data well. Preliminary results based on simulation show that our new parameter estimation algorithm, which applies belief propagation followed by local descent, substantially outperforms existing alternatives based on local descent alone or on evolutionary strategies, in both accuracy and efficiency. We are currently working on several extensions of our algorithm to improve the reliability and efficiency of belief propagation. We also plan to test the algorithm on larger pathway models with many components and feedback loops. More importantly, along with our collaborator, we are applying the algorithm to study the Akt-MAPK pathway (Koh et al., 2006) using real experimental data.
References

[Felzenszwalb and Huttenlocher, 2006] Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient belief propagation for early vision. International J. of Comp. Vision 70(1), 41–54 (2006)
[Friedman, 2004] Friedman, N.: Inferring cellular networks using probabilistic graphical models. Science 303, 799–805 (2004)
[Gat-Viks et al., 2005] Gat-Viks, I., Tanay, A., Raijman, D., Shamir, R.: The factor graph network model for biological systems. In: Proc. of the 9th Int. Conf. on Res. in Comp. Mol. Biol., pp. 31–48 (2005)
[Gill et al., 1982] Gill, P.E., Murray, W., Wright, M.H.: Practical Optimization. Academic Press, London (1982)
[Hoops et al., 2006] Hoops, S., Sahle, S., Gauges, R., Lee, C., Pahle, J., Simus, N., Singhal, M., Xu, L., Mendes, P., Kummer, U.: COPASI - a COmplex PAthway SImulator. Bioinformatics 22(24), 3067–3074 (2006)
[Ihler et al., 2004] Ihler, A.T., Fisher, J.W., Moses, R.L., Willsky, A.S.: Nonparametric belief propagation for self-calibration in sensor networks. In: Proc. of the 2004 Int. Conf. on Inf. Proc. in Sensor Networks, pp. 225–233 (2004)
[Kikuchi et al., 2003] Kikuchi, S., Tominaga, D., Arita, M., Takahashi, K., Tomita, M.: Dynamic modeling of genetic networks using genetic algorithm and S-system. Bioinformatics 19(5), 643–650 (2003)
[Koh et al., 2006] Koh, G., Teong, H.F.C., Clement, M.V., Hsu, D., Thiagarajan, P.S.: A decompositional approach to parameter estimation in pathway modeling: a case study of the Akt and MAPK pathways and their crosstalk. Bioinformatics 22(14), e271–e280 (2006)
[Kschischang et al., 2001] Kschischang, F.R., Frey, B.J., Loeliger, H.A.: Factor graphs and the sum-product algorithm. IEEE Trans. on Information Th. 47(2), 498–519 (2001)
[Mendes and Kell, 1998] Mendes, P., Kell, D.B.: Non-linear optimization of biochemical pathways: applications to metabolic engineering and parameter estimation. Bioinformatics 14(10), 869–883 (1998)
[Moles et al., 2003] Moles, C.G., Mendes, P., Banga, J.R.: Parameter estimation in biochemical pathways: a comparison of global optimization methods. Genome Research 13, 2467–2474 (2003)
[Murphy et al., 1999] Murphy, K.P., Weiss, Y., Jordan, M.I.: Loopy belief propagation for approximate inference: an empirical study. In: Proc. of the 15th Ann. Conf. on Uncertainty in AI, pp. 467–475 (1999)
[Pearl, 1988] Pearl, J.: Probabilistic reasoning in intelligent systems: networks of plausible inference, 2nd edn. Morgan Kaufmann Publishers, Inc., San Francisco (1988)
[Weiss and Freeman, 2001] Weiss, Y., Freeman, W.T.: On the optimality of solutions of the max-product belief propagation algorithm in arbitrary graphs. IEEE Trans. on Information Th. 47(2), 736–744 (2001)
[Yeang and Jaakkola, 2003] Yeang, C.H., Jaakkola, T.: Physical network models and multi-source data integration. In: Proc. of the 7th Int. Conf. on Res. in Comp. Mol. Biol., pp. 312–321 (2003)
[Yedidia et al., 2003] Yedidia, J.S., Freeman, W.T., Weiss, Y.: Understanding belief propagation and its generalizations. In: Exploring Artificial Intelligence in the New Millennium, pp. 239–269 (2003)