Lecture Notes in Bioinformatics
5780
Edited by S. Istrail, P. Pevzner, and M. Waterman Editorial Board: A. Apostolico S. Brunak M. Gelfand T. Lengauer S. Miyano G. Myers M.-F. Sagot D. Sankoff R. Shamir T. Speed M. Vingron W. Wong
Subseries of Lecture Notes in Computer Science
Visakan Kadirkamanathan Guido Sanguinetti Mark Girolami Mahesan Niranjan Josselin Noirel (Eds.)
Pattern Recognition in Bioinformatics 4th IAPR International Conference, PRIB 2009 Sheffield, UK, September 7-9, 2009 Proceedings
Series Editors
Sorin Istrail, Brown University, Providence, RI, USA
Pavel Pevzner, University of California, San Diego, CA, USA
Michael Waterman, University of Southern California, Los Angeles, CA, USA

Volume Editors
Visakan Kadirkamanathan, Guido Sanguinetti, Josselin Noirel
University of Sheffield, Mappin Street, Sheffield, S1 3JD, UK
E-mail: {visakan,g.sanguinetti,j.noirel}@sheffield.ac.uk

Mark Girolami
University of Glasgow, Glasgow, G12 8QQ, UK
E-mail: [email protected]

Mahesan Niranjan
University of Southampton, Southampton SO17 1BJ, UK
E-mail: [email protected]
Library of Congress Control Number: 2009933351
CR Subject Classification (1998): J.3, I.5, F.2.2, I.2, H.3.3, H.2.8
LNCS Sublibrary: SL 8 – Bioinformatics
ISSN 0302-9743
ISBN-10 3-642-04030-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-04030-6 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2009 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12747078 06/3180 543210
Preface
The Pattern Recognition in Bioinformatics (PRIB) meeting was established in 2006 under the auspices of the International Association for Pattern Recognition (IAPR) to create a focus for the development and application of pattern recognition techniques in the biological domain. PRIB's aim to explore the full spectrum of pattern recognition applications was reflected in the breadth of techniques represented in this year's submissions and in this book. These range from image analysis for biomedical data to systems biology. We were fortunate to have invited speakers of the highest calibre delivering keynotes at the conference. These were Pierre Baldi (UC Irvine), Alvis Brazma (EMBL-EBI), Gunnar Rätsch (MPI Tübingen) and Michael Unser (EPFL). We acknowledge the support of the EU FP7 Network of Excellence PASCAL2 for partially funding the invited speakers. Immediately prior to the conference, we hosted a half day of tutorial lectures, while a special session on "Machine Learning for Integrative Genomics" was held immediately after the main conference. During the conference, a poster session provided further opportunity for discussion. We would like once again to thank all the authors for the high quality of submissions, as well as Yorkshire South and the University of Sheffield for providing logistical help in organizing the conference. Finally, we would like to thank Springer for their help in assembling this proceedings volume and for the continued support of PRIB. July 2009
Mark Girolami Visakan Kadirkamanathan Mahesan Niranjan Josselin Noirel Guido Sanguinetti
Organization
International Program Committee
Shandar Ahmed, National Institute of Biomedical Innovation, Japan
Jesús S. Aguilar-Ruiz, Escuela Politécnica Superior, Seville, Spain
Tatsuya Akutsu, Kyoto University, Japan
Sanghamitra Bandyopadhyay, Indian Statistical Institute, India
Sebastian Böcker, Friedrich-Schiller-Universität, Jena, Germany
Rainer Breitling, University of Groningen, The Netherlands
Nicolas Brunel, CNRS, Paris, France
Colin Campbell, University of Bristol, UK
Frederic Cazals, Sophia Antipolis, France
CQ Chang, University of Hong Kong, China
Marco Chierici, Bruno Kessler Foundation, Trento, Italy
Theo Damoulas, University of Glasgow, UK
Richard Edwards, University of Southampton, UK
Maurizio Filippone, University of Sheffield, UK
Alexandru Floares, Oncological Institute Cluj, Romania
Jennifer Hallinan, University of Newcastle, UK
Jin-Kao Hao, University of Angers, France
Jaap Heringa, VU University Amsterdam, The Netherlands
Antti Honkela, Helsinki University of Technology, Finland
Giuseppe Jurman, Bruno Kessler Foundation, Trento, Italy
R. Krishna Murthy Karuturi, Genome Institute of Singapore
Samuel Kaski, Helsinki University of Technology, Finland
Alex Kochetov, Russian Academy of Sciences, Russia
Mehmet Koyuturk, Case Western Reserve University, Cleveland, USA
Zoe Lacroix, Arizona State University, USA
Tak-Wah Lam, University of Hong Kong, China
Kee Khoon Lee, Institute of High Performance Computing, Singapore
Pietro Liò, University of Cambridge, UK
Xuejun Liu, Nanjing University of Aeronautics and Astronautics, China
Francesco Masulli, University of Genova, Italy
Mariofanna Milanova, Donaghey College of Engineering and Information Technology, USA
Sach Mukherjee, University of Warwick, UK
Alioune Ngom, University of Windsor, Canada
Carlotta Orsenigo, Politecnico di Milano, Italy
Nikhil Pal, Indian Statistical Institute, India
Magnus Rattray, University of Manchester, UK
Simon Rogers, University of Glasgow, UK
Juho Rousu, University of Helsinki, Finland
Anastasia Samsonova, Harvard University, USA
Alexander Schliep, Max Planck Institute for Molecular Genetics, Berlin, Germany
Roberto Tagliaferri, University of Salerno, Italy
Gwenn Volkert, Kent State University, USA
David Wild, University of Warwick, UK
Hong Yan, City University of Hong Kong, China
Jing Yang, Qingdao Institute of Bioenergy and Bioprocess Technology, China
Yan-Qing Zhang, Georgia State University, USA
Conference Organizing Committee
Conference Chairs: Visakan Kadirkamanathan (University of Sheffield, UK), Guido Sanguinetti (University of Sheffield, UK)
General Co-chairs: Raj Acharya (PennState, USA), Madhu Chetty (Monash University, Australia), Jagath Rajapakse (Nanyang Technological University, Singapore)
Program Chairs: Mahesan Niranjan (University of Southampton, UK), Mark Girolami (University of Glasgow, UK)
Tutorial Chair: Florence d'Alché-Buc (University of Évry, France)
Special Sessions Chair: Cesare Furlanello (Fondazione Bruno Kessler, Italy)
Publicity Chair: Elena Marchiori (Radboud University of Nijmegen, The Netherlands)
Publications Chair: Josselin Noirel (University of Sheffield, UK)
Local Organization Chair: Daniel Coca (University of Sheffield, UK)
Finance Chair: Andrew Zammit Mangion (University of Sheffield, UK)
Webmaster: Maurizio Filippone (University of Sheffield, UK)
Table of Contents
Evolutionary Parameters in Sequence Families: Cold Adaptation of Enzymes . . . 1
    Said Hassan Ahmed and Tor Flå
MProfiler: A Profile-Based Method for DNA Motif Discovery . . . 13
    Doaa Altarawy, Mohamed A. Ismail, and Sahar M. Ghanem
On Utilizing Optimal and Information Theoretic Syntactic Modeling for Peptide Classification . . . 24
    Eser Aygün, B. John Oommen, and Zehra Cataltepe
Joint Tracking of Cell Morphology and Motion . . . 36
    Jierong Cheng, Esther G.L. Koh, Sohail Ahmed, and Jagath C. Rajapakse
Multiclass Microarray Gene Expression Analysis Based on Mutual Dependency Models . . . 46
    Girija Chetty and Madhu Chetty
An Efficient Convex Nonnegative Network Component Analysis for Gene Regulatory Network Reconstruction . . . 56
    Jisheng Dai, Chunqi Chang, Zhongfu Ye, and Yeung Sam Hung
Using Higher-Order Dynamic Bayesian Networks to Model Periodic Data from the Circadian Clock of Arabidopsis Thaliana . . . 67
    Rónán Daly, Kieron D. Edwards, John S. O'Neill, Stuart Aitken, Andrew J. Millar, and Mark Girolami
Sequential Hierarchical Pattern Clustering . . . 79
    Bassam Farran, Amirthalingam Ramanan, and Mahesan Niranjan
Syntactic Pattern Recognition Using Finite Inductive Strings . . . 89
    Paul Fisher, Howard Fisher, Jinsuk Baek, and Cleopas Angaye
Evidence-Based Clustering of Reads and Taxonomic Analysis of Metagenomic Data . . . 102
    Gianluigi Folino, Fabio Gori, Mike S.M. Jetten, and Elena Marchiori
Avoiding Spurious Feedback Loops in the Reconstruction of Gene Regulatory Networks with Dynamic Bayesian Networks . . . 113
    Marco Grzegorczyk and Dirk Husmeier
Ligand Electron Density Shape Recognition Using 3D Zernike Descriptors . . . 125
    Prasad Gunasekaran, Scott Grandison, Kevin Cowtan, Lora Mak, David M. Lawson, and Richard J. Morris
Definition of Valid Proteomic Biomarkers: A Bayesian Solution . . . 137
    Keith Harris, Mark Girolami, and Harald Mischak
Inferring Meta-covariates in Classification . . . 150
    Keith Harris, Lisa McMillan, and Mark Girolami
A Multiobjective Evolutionary Algorithm for Numerical Parameter Space Characterization of Reaction Diffusion Systems . . . 162
    Tim Hohm and Eckart Zitzler
Knowledge-Guided Docking of WW Domain Proteins and Flexible Ligands . . . 175
    Haiyun Lu, Hao Li, Shamima Banu Bte Sm Rashid, Wee Kheng Leow, and Yih-Cherng Liou
Distinguishing Regional from Within-Codon Rate Heterogeneity in DNA Sequence Alignments . . . 187
    Alexander V. Mantzaris and Dirk Husmeier
A Hybrid Metaheuristic for Biclustering Based on Scatter Search and Genetic Algorithms . . . 199
    Juan A. Nepomuceno, Alicia Troncoso, and Jesús S. Aguilar-Ruiz
Di-codon Usage for Gene Classification . . . 211
    Minh N. Nguyen, Jianmin Ma, Gary B. Fogel, and Jagath C. Rajapakse
Counting Patterns in Degenerated Sequences . . . 222
    Grégory Nuel
Modelling Stem Cells Lineages with Markov Trees . . . 233
    Victor Olariu, Daniel Coca, Stephen A. Billings, and Visakan Kadirkamanathan
Bi-clustering of Gene Expression Data Using Conditional Entropy . . . 244
    Afolabi Olomola and Sumeet Dua
c-GAMMA: Comparative Genome Analysis of Molecular Markers . . . 255
    Pierre Peterlongo, Jacques Nicolas, Dominique Lavenier, Raoul Vorc'h, and Joël Querellou
Semi-supervised Prediction of Protein Interaction Sentences Exploiting Semantically Encoded Metrics . . . 270
    Tamara Polajnar and Mark Girolami
Classification of Protein Interaction Sentences via Gaussian Processes . . . 282
    Tamara Polajnar, Simon Rogers, and Mark Girolami
MCMC Based Bayesian Inference for Modeling Gene Networks . . . 293
    Ramesh Ram and Madhu Chetty
Efficient Optimal Multi-level Thresholding for Biofilm Image Segmentation . . . 307
    Darío Rojas, Luis Rueda, Homero Urrutia, and Alioune Ngom
A Pattern Classification Approach to DNA Microarray Image Segmentation . . . 319
    Luis Rueda and Juan Carlos Rojas
Drugs and Drug-Like Compounds: Discriminating Approved Pharmaceuticals from Screening-Library Compounds . . . 331
    Amanda C. Schierz and Ross D. King
Fast SCOP Classification of Structural Class and Fold Using Secondary Structure Mining in Distance Matrix . . . 344
    Jian-Yu Shi and Yan-Ning Zhang
Short Segment Frequency Equalization: A Simple and Effective Alternative Treatment of Background Models in Motif Discovery . . . 354
    Kazuhito Shida
Bayesian Optimization Algorithm for the Non-unique Oligonucleotide Probe Selection Problem . . . 365
    Laleh Soltan Ghoraie, Robin Gras, Lili Wang, and Alioune Ngom
Microarray Time-Series Data Clustering via Multiple Alignment of Gene Expression Profiles . . . 377
    Numanul Subhani, Alioune Ngom, Luis Rueda, and Conrad Burden
Recursive Neural Networks for Undirected Graphs for Learning Molecular Endpoints . . . 391
    Ian Walsh, Alessandro Vullo, and Gianluca Pollastri
Enhancing the Effectiveness of Fingerprint-Based Virtual Screening: Use of Turbo Similarity Searching and of Fragment Frequencies of Occurrence . . . 404
    Shereena M. Arif, Jérôme Hert, John D. Holliday, Nurul Malim, and Peter Willett
Patterns, Movement and Clinical Diagnosis of Abdominal Adhesions . . . 415
    Benjamin Wright, John Fenner, Richard Gillott, Paul Spencer, Patricia Lawford, and Karna Dev Bardhan
Class Prediction from Disparate Biological Data Sources Using an Iterative Multi-Kernel Algorithm . . . 427
    Yiming Ying, Colin Campbell, Theodoros Damoulas, and Mark Girolami
Cross-Platform Analysis with Binarized Gene Expression Data . . . 439
    Salih Tuna and Mahesan Niranjan
Author Index . . . 451
Evolutionary Parameters in Sequence Families: Cold Adaptation of Enzymes

Said Hassan Ahmed and Tor Flå

Dept. of Mathematics and Statistics, University of Tromsø, 9037 Tromsø, Norway
Abstract. In an attempt to incorporate environmental effects like cold adaptation into models of sequence evolution on a phylogenetic tree, we present a viable way of representing descriptive statistics of sequence observables under reversible Markov models of sequence evolution. Local variation in the amino acid distribution along and across the sequence family can be connected to enzymatic adaptation to different temperatures. Here, we estimate a few amino acid properties and show how the variations of these properties, both with respect to excess mean values (EMVs) and covariance, classify the protein family into clusters. Application of a multiscale and multivariate method to an aligned family of distinct trypsin and elastase sequences shows a drift of the centroid mean sequences of cold-adapted enzymes compared to their warm-active counterparts.
1 Introduction
Phylogenetic tree-building methods presume particular evolutionary models [2]. Current evolutionary models of amino acid sequence evolution generally depend on mathematical models based on empirical observations using either comparisons of the observed amino acid sequences or their physical-chemical properties [1, 2]. These models estimate evolutionary distances in terms of the expected number of substitutions per site by assuming evolution with independent sites: the sequences at each site are assumed to evolve according to a single stochastic process, and this process is fixed across all sites. For instance, in Markov models of amino acid replacement, the Markov process is assumed to be stationary, homogeneous and reversible, so that the amino acid distribution and the rate of replacement are fixed in time and position, and the forward and reverse substitution rates are the same [15, 16]. We will be interested in the possibility to parameterize environmental effects like cold adaptation into the Markov transition and corresponding rate matrices. In particular, we are interested in the amino acid distribution profile in an aligned family with cold-adapted representatives. Cold-adapted enzymes are characterized by clusters of glycine residues, a reduced number of proline residues in loop regions, a general reduction in charged residues on the surface and exposure of hydrophobic residues to solvent [19, 20]. All these features are thought to give rise to the increased structural flexibility observed in some regions of the enzyme. Flexibility seems to be a strategy for cold-adapted enzymes to maintain high catalytic activity at low temperatures [18, 19]. Often, a few conserved
residues at temperature-class-specific sequence positions are the determining factors of the strategy to adapt to cold or warm temperatures. Here, we study how approximations to Markov models for standard phylogeny give us an opportunity to obtain first-hand insight into the statistics of observables related to index and counting variables. Based on parameterized sequence features, we carry out a multiscale and multivariate data analysis on an aligned family of distinct trypsin and elastase sequences. The basis of this multivariate analysis is that covariation of residue sites in evolution is related mainly to structural or functional site fitness, as parameterized by models of mean amino acid distributions and certain amino acid properties. These correlated residues, described through amino acid properties (property sequences), show deviations from a common position-dependent mean value. Such mean deviations, which we refer to as excess mean values (EMVs), are due to species-dependent variations in local and global fitness without affecting the overall 3D fold and the fitness with respect to protein (enzyme) function. Our goal is to extract these EMVs from evolutionary noise both along and across the sequence family. On application, the method revealed a drift of centroids due to features of cold adaptation. Such deviations could be used as measures of evolutionarily adapted fitness landscapes, corresponding both to the folding rate, as parameterized by the global relative energy gap/energy standard deviation ratio (funneling picture), and to the local fitness adaptations at the active site measuring binding activity effects.
2 Parameterization of Sequence Features

2.1 Statistics Based on Amino Acid Unit Count Vectors
We assume an aligned family of $L$ homologous protein sequences of length $N$. Let $\alpha(l,s)$ be the residue at position $s$ and species $l$, where $l \in \{1,\dots,L\}$, $s \in \{1,\dots,N\}$. Then we describe $\alpha(l,s)$ numerically in the vector space of amino acid unit counts, denoted as $Y_{l,s} = Y_{\alpha(l,s)}$, where $Y_{\alpha(l,s)} = (\delta_{\alpha,\alpha(l,s)}) \in \mathbb{R}^{20}$. Here $\alpha \in \mathcal{A}$ is one of the 20 amino acid categories, and $\delta_{\alpha,\alpha(l,s)}$ is 1 if amino acid $\alpha$ equals the amino acid $\alpha(l,s)$ at $(l,s)$, and 0 otherwise (the Kronecker delta). With this representation, the average over the observed present-time leaf distribution of the protein amino acids at $(l,s)$ is given by the amino acid distribution
\[ \langle Y_{l,s} \rangle = p^{l,s} = (p^{l,s}_{\alpha}) , \qquad (1) \]
where $\langle\cdot\rangle$ is the expectation operator (with respect to phylogenetic distributions). For completeness, we have taken into account that $p^{l,s}$ will vary both with the subset of species and with position, due to different species clusters and functional (or structural) constraints, in our case clusters and residue determinants of cold-adapted enzymes. We are interested in the amino acid distribution given in (1) in terms of two sequence ensembles $(Y_{l,s}, Y_{l',s'})$, namely the first- and second-order marginals, $p^{(1)l,s}$ (we will suppress the superscript $(1)$) and $P^{(2)l,l';s,s'}$, respectively. Since the
first-order marginal $p^{l,s}$ is a product of single-site multinomial-type probabilities, carrying no information about the sequence pair probabilities necessary to describe standard phylogenetic tree parameters, we consider correlations of unit count vectors. For simplicity, we look at the two-point covariation, which is given by
\[ \big\langle (Y_{\alpha(l,s)} - p^{l,s}_{\alpha})(Y_{\beta(l',s')} - p^{l',s'}_{\beta}) \big\rangle = P^{(2)l,l';s,s'}_{\alpha\beta} - p^{l,s}_{\alpha} p^{l',s'}_{\beta} = \rho^{(2)l,l';s,s'}_{\alpha\beta}\, p^{l,s}_{\alpha} p^{l',s'}_{\beta} , \qquad (2) \]
where $\rho^{(2)l,l';s,s'}_{\alpha\beta}$ is the pair ($\alpha$ at $(l,s)$ and $\beta$ at $(l',s')$) dependent correction. Based on a reversible Markov model of amino acid replacement, with the instantaneous rate of replacement of amino acid $\alpha$ by amino acid $\beta$ defined by the rate matrix $Q = (Q_{\alpha\beta})$, as described in Sect. 1, the two-point correlation could, for relatively short evolutionary times compared to the mutation rate $Q_{\alpha\alpha}$ and within the same cluster $c \in \{1,2,\dots,K\}$, such that the mean amino acid distribution is fixed, $p^{l,s} = p^{c,s}$, $c(l)=c$, be modelled as (for simplicity, we assume one cluster and skip the cluster index $c$ in this section)
\[ \rho^{(2)l,l';s}_{\alpha\beta}\, p^{s}_{\alpha} p^{s}_{\beta} \approx p^{s}_{\alpha}\delta_{\alpha\beta} - p^{s}_{\alpha} p^{s}_{\beta} + (T_l + T_{l'})\, \Lambda^{s}_{\alpha\beta}\, p^{s}_{\alpha} p^{s}_{\beta} . \qquad (3) \]
Here $T_l$ is the edge length corresponding to species $l$, and $\Lambda^{s}_{\alpha\beta}$ is constrained so that the row sums are all zero, $Q^{s}_{\alpha\alpha} = \Lambda^{s}_{\alpha\alpha} p^{s}_{\alpha} = -\sum_{\beta\neq\alpha} \Lambda^{s}_{\alpha\beta} p^{s}_{\beta}$, with $\Lambda^{s}_{\alpha\beta} = \Lambda^{s}_{\beta\alpha}$ for $\alpha \neq \beta$ (symmetry), which ensures reversibility of the Markov process. Notice that for long evolutionary times between leaf nodes $l$ and $l'$, the process will effectively be independent and $\rho^{(2)l,l';s}_{\alpha\beta} \approx 0$. This effect will tend to divide our protein sequences into clusters of close neighbours in evolutionary time $T_l + T_{l'}$, and the above model will be used within each cluster. We can extend the above model of the two-point correlation to all sites $s = 1,2,\dots,N$ and find the covariance of two unit count vectors between two protein sequences, $\langle (Y_{l,s} - p^{l,s})(Y_{l',s'} - p^{l',s'}) \rangle$, for short evolutionary times as
\[ \rho^{(2)l,l';s,s'}_{\alpha\beta}\, p^{s}_{\alpha} p^{s'}_{\beta} \approx p^{s}_{\alpha}\delta_{\alpha\beta}\delta_{s,s'} - p^{s}_{\alpha} p^{s'}_{\beta} + (T_l + T_{l'})\, \Lambda^{s}_{\alpha\beta}\, p^{s}_{\alpha}\, J^{s,s'}_{\alpha\beta}\, p^{s'}_{\beta} , \quad \forall \alpha,\beta , \qquad (4) \]
where $J^{s,s'}_{\alpha\beta}$ is the pair-dependent correlation that ensures a symmetric, reversible substitution matrix for all $(\alpha,\beta)$ pairs, $\Lambda^{s}_{\alpha\alpha} p^{s}_{\alpha} = -\sum_{\beta\neq\alpha} \Lambda^{s}_{\alpha\beta}\, J^{s,s'}_{\alpha\beta}\, p^{s'}_{\beta}$. We are interested in physical-chemical observables and how they are reflected in the amino acid distribution of the family. As they are linearly dependent on the unit count vectors, parameterized features (covariances) based on physical-chemical observables can be derived by unit count vector projections as described below. Thus, we consider the unit count vectors as our basic observables.
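As an illustration of the unit count representation, here is a minimal sketch (Python with NumPy; the alignment format, helper names and toy data are assumptions for illustration, not part of the original analysis):

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"          # 20 amino acid categories (alphabet A)
IDX = {a: i for i, a in enumerate(AA)}

def unit_counts(alignment):
    """Return Y with shape (L, N, 20): one-hot unit count vectors Y_{l,s}."""
    L, N = len(alignment), len(alignment[0])
    Y = np.zeros((L, N, 20))
    for l, seq in enumerate(alignment):
        for s, a in enumerate(seq):
            if a in IDX:                      # gaps/unknowns contribute no count
                Y[l, s, IDX[a]] = 1.0
    return Y

def site_distribution(Y, members):
    """Empirical p^{c,s}_alpha for a cluster given by row indices `members`."""
    return Y[members].mean(axis=0)            # shape (N, 20)

# toy usage with a hypothetical three-sequence 'family'
Y = unit_counts(["ACDA", "ACDC", "GCDA"])
p = site_distribution(Y, [0, 1, 2])
print(p[0])                                   # distribution at site s = 0
```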
2.2 Statistics Based on Physico-Chemical Properties
Given a vector of amino acid properties $C \in \mathbb{R}^{|\mathcal{A}|}$, we find that for a family of sequences with unit count vectors $Y_{l,s}$ there is a family of property sequences given as
\[ C^{(N)}_{l} = (C_{i(l,s)})_{s\in\{1,\dots,N\}} = \big(C^{T} Y_{l,s}\big)_{s\in\{1,\dots,N\}} , \qquad l = 1,\dots,L , \qquad (5) \]
where the superscript $(N)$ indicates the length of the sequence. Since (5) is a linear mapping of the unit count vectors, the mean of the property sequences can be expressed as
\[ \bar C^{(N)}_{l} = (\bar C_{c(l),s})_{s\in\{1,\dots,N\}} = \big(C^{T} p^{l,s}\big)_{s\in\{1,\dots,N\}} , \qquad (6) \]
where $\bar C_{c(l),s}$ is the mean property in cluster $c(l)=c$ and we assume a fixed amino acid distribution for each cluster, $p^{l,s} = p^{c,s}$. Here, $\bar C^{(N)}_{l}$ could be the mean property of the whole family or, as above, of a cluster $c(l)=c$ within the protein family. Let $\tilde C^{(N)}_{l} = C^{(N)}_{l} - \bar C^{(N)}_{l}$ be the mean-subtracted property sequences (this subtraction is explained below). Then a similar model of covariation as in (4), based on property sequences, can be derived by projecting the mean-subtracted property vector on the parameterized covariance of the unit count vectors in a fixed cluster $c$:
\[ \Sigma^{ll'}_{c} = \frac{1}{N-1}\,\big\langle \tilde C^{(N)}_{l}, \tilde C^{(N)}_{l'} \big\rangle_{(c(l),c(l'))=c} \approx \frac{1}{N-1}\sum_{s}\big(\tau^{s}_{c} + (T_l + T_{l'})\,\bar S^{s}_{c}\big) , \qquad (7) \]
\[ \tau^{s}_{c} = \sum_{\alpha}(C_{\alpha} - \bar C^{s}_{c})^{2}\, p^{c,s}_{\alpha} = \sum_{\alpha} C_{\alpha}^{2}\, p^{c,s}_{\alpha} - (\bar C^{s}_{c})^{2} , \]
\[ \bar S^{s}_{c} = \sum_{\alpha,\beta}(C_{\alpha} - \bar C^{s}_{c})\, p^{c,s}_{\alpha}\, \Lambda^{c,s}_{\alpha\beta}\, p^{c,s}_{\beta}\,(C_{\beta} - \bar C^{s}_{c}) , \]
where $\bar C^{s}_{c} = \sum_{\alpha} C_{\alpha}\, p^{c(l),s}_{\alpha}$ is the average amino acid property for proteins of species $l$ in cluster $c = c(l) \in \{1,2,\dots,K\}$, since we assume that the proteins come in, say, $K$ groups with more or less the same properties within a group and independence between groups. ($\tau^{s}_{c}$ is the local property variance obtained for short evolutionary times compared to the local mutation rate, i.e. $T\mu_{s} \ll 1$, $\mu_{s} = \sum_{\alpha} Q^{s}_{\alpha\alpha}$; the covariance for several clusters is simply obtained by summing the covariance of each cluster weighted by the cluster prior probability $p(c)$, $c = 1,2,\dots,K$.) The logic of the mean subtraction prior to parameterization, which is also the basis for our data analysis, is the relation for the substitution matrix, $\sum_{\alpha}\Lambda^{c,s}_{\alpha\beta} p^{c,s}_{\beta} = 0$, which is valid for equilibrium amino acid distributions and a symmetric substitution matrix. This implies that correlation within a subfamily is described by a simple variance. If some cold-adapted representatives are present within the family or subfamily, the mean amino acid distribution will change. Consequently, both the center of the cluster, as described by the mean properties, and the covariance matrix will move relative to those of the standard mesophilic temperature class.

Excess mean values (EMVs). When there is more than one cluster in the sequence family or subfamily, each cluster might have a different mean amino acid distribution $p^{c,s}$. Often the amount of data contained in each cluster is not sufficient to estimate the sequence-position-dependent mean necessary to observe cluster deviations. In this case, one would have to be satisfied with
a common mean $p^{s}$. This analysis leads to artificial, extrinsic correlations, which we attribute to the EMVs $\delta p^{c,s}_{e} = p^{c,s} - p^{s}$ and $\delta \bar C^{c,s}_{e} = \bar C^{c,s} - \bar C^{s}$. Additional averages might come from intrinsic correlations, interactions and dependencies along the protein sequence, as discussed above, and this would lead to cavity-filter (the coefficients $J^{s,s'}$ in (4)) averaged fields $p^{s}_{\mathrm{cav}}$. Both extrinsic and intrinsic excess mean values lead to extra correlations at linear and quadratic (or higher) order in our measurements, and in the theoretical model of covariance they might indeed give a substantial contribution to the clustering we study below. The theory and detailed discussion of these effects is beyond the scope of this paper. Still, we will refer to EMVs in our discussion of the results of the data analysis below.
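To make the projection in (5)-(6) concrete, here is a small sketch continuing the previous one (Python/NumPy; it assumes the `AA`, `unit_counts` and `Y` names defined above and uses the Kyte-Doolittle hydrophobicity scale with its commonly tabulated values):

```python
import numpy as np

# Kyte-Doolittle hydrophobicity, keyed by one-letter code (commonly tabulated values)
KD = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5, 'E': -3.5,
      'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8,
      'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2}
C = np.array([KD[a] for a in AA])              # property vector C, ordered as AA

def property_sequences(Y):
    """C^(N)_l = C^T Y_{l,s}: project unit counts onto the property scale."""
    return Y @ C                                # shape (L, N)

def excess_mean(prop, members):
    """Mean-subtracted property sequences (EMV candidates) for one cluster."""
    cluster_mean = prop[members].mean(axis=0)   # \bar C^{c,s}
    return prop[members] - cluster_mean         # \tilde C^{(N)}_l

prop = property_sequences(Y)                    # Y from the previous sketch
tilde = excess_mean(prop, [0, 1, 2])
```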
3 Multivariate Data Analysis
The goal of the data analysis is to find excess mean values which maximize covariations between clusters of cold- and warm-adapted enzymes in an aligned family of homologs of differently adapted enzymes.

3.1 Data Sets
As benchmark data, we used trypsin (a well-studied enzyme w.r.t. cold adaptation) sequences studied by Nils Peder Willassen and coworkers [20]. The sequences are divided into 3 groups: trypsins from the higher vertebrates, the cold-adapted fish, and the other fish. Additionally, we used 29 elastase sequences (though we could not show all results due to limited space), composed of the three types of elastases, namely elastase type I (with 3 cold-adapted representatives), type II (with 5 cold-adapted elastases) and type III. The elastases were collected by homology search with BLAST at the databases available at NCBI (http://www.ncbi.nlm.nih.gov/blast) and SIB (http://au.expasy.org/tools/blast). Multiple alignment was performed using Geneious, version 3.7.1 (Biomatters Ltd.). In this paper, the property sequences are based on hydrophobicity (Kyte-Doolittle, 1983) and polarity (Zimmermann, 1968). All analyses subsequently described were implemented in MATLAB 7.0.

3.2 Discrete Wavelet Transform (DWT)
We begin with a brief review of the orthogonal forward discrete wavelet transform (DWT). An important concept in wavelets is the multiresolution analysis (MA) [8], which decomposes the property sequences, given as coefficients $A^{(j+1)}_{n}$ at reference level 0 (unit scale) in orthonormal basis functions $\{\psi^{j+1}_{n}(t)\}$ of the space $V_{j+1}$, into approximation and detail coefficients, $A^{(j)}_{k}$ and $D^{(j)}_{k}$, at level 1 in orthonormal basis functions $\{\psi^{j}_{k}(t), w^{j}_{k}(t)\}$ of the nested spaces $V_{j}$ and $W_{j}$ ($V_{j+1} = V_{j} \oplus W_{j}$), respectively:
\[ f^{j+1}(t) = \sum_{n} A^{(j+1)}_{n}\,\psi^{j+1}_{n}(t) = \sum_{k} A^{(j)}_{k}\,\psi^{j}_{k}(t) + \sum_{k} D^{(j)}_{k}\,w^{j}_{k}(t) , \qquad (8) \]
where $q^{j}_{k}(t) = 2^{j/2}\, q(2^{j}t - k)$. That is, the scaling and wavelet basis functions, $\psi^{j}_{k}(t)$ and $w^{j}_{k}(t)$, are dyadic dilations ($(1/2)^{j}$) and integer translations ($(1/2)^{j}k$) of the father and mother functions $\psi(t)$ and $w(t)$, which connects the DWT to subband filtering. As the basis functions are orthonormal at each level $j$, the corresponding coefficients can be obtained by taking the inner products $\langle f^{j+1}(t), \psi^{j}_{k}(t)\rangle$ and $\langle f^{j+1}(t), w^{j}_{k}(t)\rangle$ to yield (using the refinement equations of $q^{j}_{k}(t)$, i.e. shifting the dilation and wavelet equations $\psi(t) = \sqrt{2}\sum_{m} g(m)\psi(2t-m)$ and $w(t) = \sqrt{2}\sum_{m} h(m)\psi(2t-m)$ by $k$ and setting $m = n - 2k$)
\[ A^{(j)}_{k} = \sum_{n} A^{(0)}_{n}\, g(n-2k) , \qquad D^{(j)}_{k} = \sum_{n} A^{(0)}_{n}\, h(n-2k) . \qquad (9) \]
These are the filtering and downsampling operations of the analysis filter bank [10]: convolution with the time-reversed lowpass ($g(-n)$) and highpass ($h(-n)$) filters. We performed a 4-level wavelet decomposition of each $C^{(N)}_{l}$ using Symlet (sym4), a nearly symmetric, orthogonal wavelet with 4 vanishing moments. In the DWT [8], starting with the approximation coefficients at reference level $j=0$ (unit scale), (8) was recursively applied to $A^{(j)}_{k}$ at coarser levels, i.e. levels $j = 1,\dots,J$, up to the desired level $J=4$. The detail (difference) coefficients at each level and the approximation (average) coefficients at the final level $J$ were extracted. For $l = 1,2,\dots,L$, the set of $L$ detail coefficients was arranged as $L \times N_{j}$ matrices, denoted by $D^{(j)} = (D^{(j)}_{lk})$, and the approximation coefficients at the final level as an $L \times N_{J}$ matrix, denoted by $A^{(J)} = (A^{(J)}_{lk})$, where $N_{j} \approx N(1/2)^{j}$ is the number of coefficients at level $j$. The goal of using the orthogonal DWT is that it produces uncorrelated ensembles (due to the orthonormality of the basis functions) of $C^{(N)}_{l}$ along the sequence family based on $D^{(j)}_{l}$ across $k$, creating a sparse representation (since $A^{(j)}$ contains few data points of low-frequency features, we concentrate on the details, i.e. the variations $D^{(j)}$). An important feature of this representation is that most of the energy of the ensembles is concentrated in a small number of large coefficients $D^{(j)}_{lk}$ that contain correlated features, partly due to intrinsic and mostly due to extrinsic correlations (see Sect. 2.2) across the species labels, which can be associated with "EMV variation" (fitness w.r.t. environmental effects like cold adaptation). In other words, most of the coefficients at finer levels are attributed to evolutionary noise (background noise) across the sequence family, with small energy spread out evenly across the scales. Additionally, the orthogonal DWT represents these ensembles at local positions $(1/2)^{j}k$ and scales $(1/2)^{j}$, hence giving an accurate local description and separation of the high-frequency features (small $j$) at different resolutions. Figure 1 shows histograms of $D^{(j)}$ at levels 1 to 3 (level 4, not shown, displays a similar form). From the figure, it is clear that the multivariate distribution of $D^{(j)}_{l}$, $\forall l$, has small variance with mean close to zero. Thus, a Gaussian distribution is a reasonable probabilistic model for the multivariate distributions at these
Fig. 1. Histograms showing the multivariate distribution of the set of L detail coefficients corresponding to the 27 hydrophobicity sequences based on the trypsin sequence data at levels 1 (a), 2 (b), and 3 (c). The superimposed curves correspond to theoretical probability density functions.
levels (see also the superimposed curves in Fig. 1). Consequently, we can use diagonalizing transformations to eliminate the effects of correlation [7].

3.3 Diagonalization
Let the energy (correlation) of the ensembles of the property sequences based on $D^{(j)}$ at level $j$ be given by
\[ R = \frac{1}{N_{j}-1}\, D^{(j)} (D^{(j)})^{T} , \qquad R_{ll'} = \frac{1}{N_{j}-1}\, D^{(j)}_{l} (D^{(j)}_{l'})^{T} , \qquad (10) \]
where $R_{ll'}$, $l,l' \in \{1,2,\dots,L\}$, is the $(l,l')$-th element of the symmetric matrix $R$, a measure of the correlation between $D^{(j)}_{l}$ and $D^{(j)}_{l'}$. Since $R$ is symmetric and positive definite (in our case), there exists an $L \times L$ orthogonal matrix $U$ such that
\[ R = U\,\mathrm{D}(\sigma_{i})\,U^{T} \;\Longrightarrow\; U^{T} R\, U = \mathrm{D}(\sigma_{i}) , \qquad (11) \]
where the columns of $U$ are given by a set of $L$ orthonormal eigenvectors $(u_{i})$, $(\sigma_{i})$ are the corresponding eigenvariances, ordered from high to low species variation ($\sigma_{1} > \sigma_{2} > \dots > \sigma_{i} > \dots > \sigma_{L}$), and $\mathrm{D}(\cdot)$ is a diagonal matrix with the bracketed eigenvariances as elements. Then projection of $D^{(j)}_{l}$ along the $L$ orthogonal directions $(u_{i})$ creates a set of $L$ uncorrelated coefficient sequences $\tilde D^{(j)}_{i}$ that are normally distributed with mean $\tilde D^{(j)}_{1}$ (the subscript 1 indicates the first row of the transformed sequences):
\[ \tilde D^{(j)} = U^{T} D^{(j)} \sim \mathcal{N}\big(\tilde D^{(j)}_{1}, \mathrm{D}(\sigma_{i})\big) , \qquad (12) \]
where the rows of $\tilde D^{(j)}_{i}$, indexed by $i$, are arranged from high to low species variation, as described by $(\sigma_{i})$. The effect of the diagonalizing transformation in (11) is that two highly correlated transformed detail coefficient sequences $\tilde D^{(j)}_{i}$ and $\tilde D^{(j)}_{i'}$ will contribute less than two nearly uncorrelated transformed sequences of detail coefficients ($\mathrm{D}(\sigma_{i})$), thus eliminating the effect of such correlation. Figure 2 shows
Fig. 2. Species variation at decomposition levels (from left) 1 and 2 based on the transformed detail coefficients along the 2nd and 3rd eigenvectors obtained from energy of the ensembles across the sequence family, based on detail coefficients at these levels
the species variation along the 2nd and 3rd eigenvectors $(u_{2}, u_{3})$ based on the 27 hydrophobicity sequences (trypsin). We see that the extrinsic variations, especially those due to cold- and warm-adapted trypsins, are represented by the detail coefficients at levels 1 and 2, while correlated variations at levels 3 and 4 (not shown) are mostly due to within-cluster variation. We could use the information at the two finest levels to extract extrinsic correlations associated with environmental effects like cold adaptation; instead we chose to remove redundancy from the $L$ sets of transformed sequences. Since each of the $L$ transformed detail coefficient sequences is uncorrelated, we performed one-dimensional hard thresholding (a "keep or kill" approach) using a universal threshold [11] based on the eigenvariances $\sigma_{i}$ derived from diagonalization of $R$ at level 1, that is, $\varepsilon_{i} = \sqrt{2\sigma_{i}\log N}$ for the $i$-th transformed sequence of detail coefficients, $\tilde D^{(j)}_{i}$, $i = 1,\dots,L$, $j = 1,\dots,4$. We chose the eigenvariances derived at level 1 for two main reasons: (1) a better estimate of $\sigma_{i}$ can be obtained due to the high noise level, and (2) $\sigma_{i}$ determined from a coarser level with more large coefficients could eliminate significant coefficients. Finally, we performed diagonalization on the covariance matrix based on the approximation coefficients at the final level, $A^{(4)}$. In this case, we removed redundancy by keeping the first two significant components of $\tilde A^{(j)}_{i}$, $i = 1,2$ (using a scree plot). The output of the inverse-transformed coefficients (by the transpose $U^{T}$ and the transpose of the analysis filter bank, due to orthogonality) corresponds to a smoothed version of the original property sequences $C^{(N)}_{l}$. For visualization and extraction of extrinsic variations both across and along the sequence family, we performed mean subtraction, that is, $\hat C^{(N)}_{l} - \bar{\hat C}^{(N)}_{l}$, before computing the covariance. Components of the first two eigenvectors and the corresponding eigensequences (projections on the first two orthogonal directions), obtained from diagonalization of the covariance matrix based on the mean-subtracted smoothed property sequences, were used to visualize species variation and the underlying residue positions responsible for this variation.
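As a rough illustration of Sects. 3.2-3.3, the following sketch chains the main steps (4-level sym4 DWT, decorrelation of the detail coefficients across species, hard universal thresholding, inverse DWT). It assumes Python with NumPy and PyWavelets, applies the level-1 eigenvectors to all detail bands and leaves the final approximation band untouched, so it is a simplified reading of the procedure rather than the authors' MATLAB implementation:

```python
import numpy as np
import pywt

def denoise_family(prop, wavelet="sym4", level=4):
    """prop: (L, N) property sequences. Returns smoothed sequences per species."""
    # 1. multilevel DWT of each property sequence: [A^(4), D^(4), D^(3), D^(2), D^(1)]
    coeffs = [pywt.wavedec(row, wavelet, level=level) for row in prop]

    # 2. eigenvariances from the level-1 detail correlation matrix (eqs. 10-11)
    D1 = np.array([c[-1] for c in coeffs])                    # (L, N_1)
    R = D1 @ D1.T / (D1.shape[1] - 1)
    sigma, U = np.linalg.eigh(R)                              # ascending eigenvalues
    sigma, U = sigma[::-1], U[:, ::-1]                        # high-to-low variance

    # 3. decorrelate, hard-threshold, and rotate back, band by band
    cleaned = []
    for band in range(1, level + 1):                          # D^(1) .. D^(4)
        Dj = np.array([c[-band] for c in coeffs])
        Tj = U.T @ Dj                                         # across-species rotation
        eps = np.sqrt(2.0 * np.abs(sigma) * np.log(Dj.shape[1]))
        Tj = np.array([pywt.threshold(r, e, mode="hard") for r, e in zip(Tj, eps)])
        cleaned.append(U @ Tj)

    # 4. inverse DWT per sequence with the thresholded detail coefficients
    smoothed = []
    for l, c in enumerate(coeffs):
        new_c = [c[0]] + [cleaned[level - b][l] for b in range(1, level + 1)]
        smoothed.append(pywt.waverec(new_c, wavelet))
    return np.array(smoothed)
```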
4 Results and Discussion
We presented a viable way of representing an aligned family of protein sequences through evolutionary parameterization of features. Based on these parameterized
Fig. 3. Pattern of original hydrophobicity values along the 27 sequences and their smoothed version based on trypsin sequence data
Fig. 4. Species variation based on the trypsin sequence data in the space spanned by the first two eigenvectors (1st row) and the corresponding excess eigensequences, 1st (solid line) and 2nd (dashed line). The sequences are based on hydrophobicity (left) and polarity (right).
features under reversible Markov models, we carried out a multiscale and multivariate data analysis on distinct alignments of $L$ trypsin and elastase sequences of length $N$ ($L$ is 27/29 and $N$ is 221/240 for trypsin/elastase) based on hydrophobicity and polarity sequences $C^{(N)}_{l}$. Since the sequences in both groups of enzymes are closely related, for simplicity we removed the few columns containing gaps. The basis of the data analysis is that covariation of residue sites in evolution is related mainly to structural and functional site fitness, as parameterized by models of amino acid distributions and certain amino acid properties. These correlated residues, described through property sequences, show deviations from a common position-dependent mean value. In principle, this requires sequence data of sufficient size and diversity (at each site) to compute such position-dependent mean values for each cluster. Therefore, we used a multivariate method to remove background noise and extract extrinsic correlations due to environmental effects like cold adaptation. The method is described in detail, with some illustrations, in Sect. 3. The idea is to use orthogonal wavelets to obtain a sparse representation of the property sequences based on detail coefficients and to perform a diagonalizing transformation in the wavelet domain to decorrelate the small number of large detail coefficients (representing the sequence ensembles) with high energy. One-dimensional thresholding, in this case hard thresholding, can then be applied to the uncorrelated wavelet coefficients in order to separate out the larger coefficients that are associated with variations due to environmental effects like cold adaptation. The resulting backward-transformed, denoised property sequences are smoothed versions of the original property sequences, as shown in Fig. 3. In this figure (to the right), the thick horizontal curve represents the
0.2
1st eigensequence (elastases) 3
0
2 Hydrophobicity
3rd eigensquence
4
0.1
−0.1 −0.2 −0.3 −0.4
0.2 0.2
0
−0.2
0 −0.2
1 0 −1 −2 −3 −4 0
25
50
75 100 125 150 175 200 225 Sequence position
Fig. 5. Drift of centroids, as described by the species covariance matrix (left) and the centroid sequence (right), based on hydrophobicity. The solid triangles and circles represent cold-adapted elastases of type I and II, respectively. The corresponding open triangles and circles represent their warm-active counterparts. The squares represent elastase type III, with no cold-adapted representatives.
Fig. 6. 3D structure of trypsin from bovine (PDB: 3PTB) showing the support of the components of the excess eigensequences based on hydrophobicity (green) and polarity (red). The region of the active site is shown in yellow.
centroid sequence of the trypsin family based on the 27 hydrophobicity sequences. The larger spikes are due to extrinsic variations that are associated with residue (hydrophobic) determinants of cold-adapted trypsins. The smaller spikes are due to intrinsic variations, partly because there are several clusters (in the trypsin case, the warm-active higher vertebrates and the other fish), and partly due to asymmetries in covariance induced by the evolutionary time since two leaf nodes were merged. Such drift of centroids, in terms of species and position variations, can be clearly observed in the subspace spanned by the first two eigenvectors with largest variances, derived from diagonalization based on the smoothed property sequences after subtracting the mean profile in Fig. 3, and projecting the excess variations from the mean profile along the two eigenvectors. Drift of centroids in terms of covariation and mean sequences is shown in Fig. 4 and Fig. 5. Fig. 6 shows the support of the excess eigensequences, namely in the N- and C-terminal regions, associated with stability, and around the active site [20].
References

[1] Goldman, N., Yang, Z.: A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11(5), 725–736 (1994)
[2] Felsenstein, J.: Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17, 368–376 (1981)
[3] Pollock, D., Taylor, W., Goldman, N.: Coevolving protein residues: Maximum likelihood identification and relationship to structure. J. Mol. Biol. 287, 187–198 (1999)
[4] Jones, D.T., Taylor, W.R., Thornton, J.M.: The rapid generation of mutation data matrices from protein sequences. Comp. Appl. Biosci. 8, 275–282 (1992)
[5] Kishino, H., Miyata, T., Hasegawa, M.: Maximum likelihood inference of protein phylogenies and the origin of chloroplasts. J. Mol. Evol. 31, 151–160 (1990)
[6] Hasegawa, M., Fujiwara, M.: Relative efficiencies of the maximum likelihood, maximum parsimony, and neighbor joining methods for estimating protein phylogeny. Mol. Phylog. and Evol. 2, 1–5 (1993)
[7] Johnson, R.A., Wichern, D.W.: Applied Multivariate Statistical Analysis, 5th edn. Prentice Hall, Upper Saddle River (2002)
[8] Mallat, S.G.: A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans. Pattern Anal. Mach. Intell. 11(7), 674–693 (1989)
[9] Suzuki, T., Srivastava, A., Kurokawa, T.: cDNA cloning and phylogenetic analysis of pancreatic serine proteases from Japanese flounder, Paralichthys olivaceus. Comp. Biochem. and Physiol. Part B 131, 63–70 (2001)
[10] Strang, G., Nguyen, T.: Wavelets and Filter Banks. Wellesley-Cambridge Press (1997)
[11] Donoho, D., Johnstone, I.: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3), 425–455 (1994)
[12] Koshi, J.M., Mindell, D.P., Goldstein, R.A.: Beyond mutation matrices: physical-chemistry based evolutionary models. In: Miyano, S., Takagi, T. (eds.) Genome Informatics, pp. 80–89. Universal Academy Press, Tokyo (1997)
[13] Casari, G., Sander, C., Valencia, A.: A method to predict functional residues in proteins. Nat. Struc. Biol. 2(2), 171–178 (1995)
[14] Durbin, R., Eddy, S., Krogh, A., Mitchison, G.: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge (1998)
[15] Whelan, S., Goldman, N.: A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. and Evol. 18, 691–699 (2001)
[16] Goldman, N., Whelan, S.: A novel use of equilibrium frequencies in models of sequence evolution. Mol. Evol. 11, 1821–1831 (2002)
[17] Ahmed, S.H., Flå, T.: Estimation of evolutionary average hydrophobicity profile from a family of protein sequences. In: Rajapakse, J.C., Schmidt, B., Volkert, L.G. (eds.) PRIB 2007. LNCS (LNBI), vol. 4774, pp. 158–165. Springer, Heidelberg (2007)
[18] Feller, G., Gerday, C.: Psychrophilic enzymes: molecular basis of cold adaptation. Cell Mol. Life Sci. 53, 830–841 (1997)
[19] Georlette, D., Blaise, V., Collins, T., D'Amico, S., Gratia, E., Hoyoux, A., Marx, J.C., Sonan, G., Feller, G., Gerday, C.: Some like it cold: biocatalysis at low temperatures. FEMS Microbiol. Rev. 28, 25–52 (2004)
[20] Schrøder, H.-K., Willassen, N.P., Smalås, A.O.: Residue determinants and sequence analysis of cold-adapted trypsins. Extremophiles (2), 5–219 (1999)
MProfiler: A Profile-Based Method for DNA Motif Discovery

Doaa Altarawy, Mohamed A. Ismail, and Sahar M. Ghanem

Computer and Systems Engineering Dept., Faculty of Engineering, Alexandria University, Alexandria 21544, Egypt
{doaa.altarawy,maismail,sghanem}@alex.edu.eg
Abstract. Motif finding is one of the most important tasks in gene regulation, which is essential in understanding biological cell functions. Based on recent studies, the performance of current motif finders is not satisfactory. A number of ensemble methods have been proposed to enhance the accuracy of the results. The overall performance of existing ensemble methods is better than that of stand-alone motif finders. A recent ensemble method, MotifVoter, significantly outperforms all existing stand-alone and ensemble methods. In this paper, we propose a method, MProfiler, to increase the accuracy of MotifVoter without increasing the run time, by introducing an idea called center profiling. Our experiments show improvement in the quality of the generated clusters over MotifVoter in both accuracy and cluster compactness. Using 56 datasets, the accuracy of the final results using our method achieves 80% improvement in correlation coefficient (nCC) and 93% improvement in performance coefficient (nPC) over MotifVoter.

Keywords: Bioinformatics, DNA Motif Finding, Clustering.
1 Introduction
Computational identification of overrepresented patterns (motifs) in DNA sequences is a long-standing problem in Bioinformatics. Identification of those patterns is one of the most important tasks in gene regulation, which is essential in understanding biological cell functions. Over the last few years, the sequencing of the complete genomes of a large variety of species (including human) has accelerated the advance in the field of Bioinformatics [1]. The problem of DNA motif finding is to locate common short patterns in a set of co-regulated gene promoters (DNA sequences). Those patterns are conserved but still tend to vary slightly [2]. Normally the patterns (motifs) are fairly short (5 to 20 base pairs long) [3]. Those motifs are the locations where transcription factors (TFs) bind in order to control protein production in cells. DNA motifs are also called transcription factor binding sites (TFBSs). Many computational methods have been proposed to solve this problem. Their strategies can be divided into two main classes: exhaustive enumeration and probabilistic methods [4].
A review of the field and descriptions of some motif finding methods can be found in [1,2,3,4,5,6]. Several studies show that current motif finding methods are unsatisfactory [3,5,7]. In Tompa et al.'s [7] assessment, 13 motif finding methods were examined. Their study shows that the accuracy of those methods in terms of sensitivity and specificity is low. Despite the large number of methods being proposed for motif finding, it is still a challenging problem. Motifs found by different methods are not always the same, meaning that their results can be complementary [7,8]. Although the accuracy of a single motif finder is low, ensemble methods are promising. Ensemble methods are compound algorithms that combine the results of multiple predictions from multiple algorithms. Thus, combining more than one stand-alone method can increase the sensitivity (more true positives), but without a good filtering method it will reduce the specificity (more false positives) [7]. In the last few years, several ensemble methods have been proposed, such as SCOPE [9], BEST [10], EMD [11], and more recently, MotifVoter [8]. MotifVoter significantly outperforms all existing stand-alone and ensemble methods. For example, on Tompa's benchmark MotifVoter increased the accuracy of the results (correlation coefficient nCC) over the best stand-alone method by more than 100%. MotifVoter formulates ensemble motif finding as an optimization search problem and uses a heuristic to generate a search space consisting of clusters of motifs. It uses a variance-based objective function to select the best cluster from the generated search space. In this paper, we propose a method called MProfiler to increase the accuracy of MotifVoter by using a new heuristic to generate the search space. Enhancing the search space in both accuracy and quality improves the final results and reduces the chances of falling into local maxima. A more accurate search space is one that has a higher percentage of motifs with a higher correlation coefficient with the true motifs. The quality of the search space is the compactness of its higher-accuracy clusters, since the selection function is variance-based. The proposed technique for search space generation achieves more than 200% improvement over MotifVoter's in terms of the percentage of generated sets having nCC greater than 0.5. In addition, the generated sets have higher mean and lower variance (i.e. are more compact) when compared to the sets generated by MotifVoter's approach. Having compact sets is a desirable feature for the objective function because it is variance-based. In our experiments, we compare the proposed MProfiler technique with MotifVoter on 56 different datasets proposed in Tompa's benchmark [7]. The correlation coefficient nCC and performance coefficient nPC are used as measures of accuracy for motif finding methods. Our experimental results show that MProfiler increases the correlation coefficient by 80% over MotifVoter on the same benchmark. In addition, MProfiler increases the performance coefficient by 93%. The rest of the paper is organized as follows: Section 2 provides an overview of related work and the motivation of our proposed algorithm. Section 3 introduces
the MProfiler algorithm. Section 4 presents experimental results and discussions. Section 5 concludes the paper along with future work.
2 Motivation
Many ensemble motif finding methods make use of the observation that true motifs are shared by multiple motif finders. In addition, MotifVoter proposed the use of negative information from the false positive motifs, which are usually predicted by only a few or even one motif finder. MotifVoter applies two selection criteria to find an optimal cluster [8]:

1. Discriminative criterion: select a cluster of motifs that are not only similar, but also have the property that motifs outside the cluster are distant from each other. This is done with a variance-based objective function (see equation (6) in the methods section).
2. Consensus criterion: the selected cluster must be predicted by as many motif finders as possible.

After the cluster is chosen, one representative motif is extracted from the cluster (i.e. a cluster center) using a process called site extraction. The enumeration technique is infeasible since it takes exponential time. Instead, MotifVoter uses a simple heuristic to generate the search space. Let P be the set of all input motifs. MotifVoter only considers subsets X_{z,j} = {z, p_1, ..., p_j} for every z in P and for every 1 < j < |P| - 1, where the p_i's are sorted in descending order of their similarity to z, i.e. sim(z, p_i) > sim(z, p_{i+1}) and p_i in P (a small sketch of this enumeration is given at the end of this section). The heuristic used by MotifVoter produces a good search space for the motif finding problem, and MotifVoter outperforms all stand-alone and ensemble motif finding methods in terms of accuracy [8]. Because the objective function is variance-based, it favors compact clusters even if they are not the optimal ones. In addition, using a variance-based function with sets of different sizes can mislead the selection towards smaller clusters, since smaller clusters appear more compact. Therefore, the capability of the objective function to select a more accurate cluster can be improved by making the clustered sets of nearly equal size. In this paper, an ensemble motif finding method, MProfiler, is proposed that improves MotifVoter's search space in three desirable respects:

1. It increases the percentage of higher-accuracy sets.
2. The generated sets are more compact, i.e. they have higher mean and lower variance.
3. The clusters examined by the objective function are of nearly equal size.

The proposed MProfiler technique constructs profiles of similar motifs predicted by different finders. Then, the profiles are used to generate the search space. The constructed profiles increase the similarities between motifs if they exist, thus giving them a higher score from the variance-based function. Details of how to generate and use the profiles are described in the following section.
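For concreteness, here is a minimal sketch (Python) of the sorted-subset enumeration described above; the function and variable names are illustrative, and a pairwise similarity function `sim` is assumed to be given:

```python
def candidate_subsets(P, sim):
    """Yield MotifVoter-style candidate clusters X_{z,j} = {z, p_1, ..., p_j}."""
    for z in P:
        # remaining motifs sorted by decreasing similarity to the seed z
        ranked = sorted((p for p in P if p is not z), key=lambda p: sim(z, p), reverse=True)
        for j in range(1, len(ranked)):
            yield [z] + ranked[:j]
```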
3 MProfiler Methods

3.1 Definitions
Problem Statement: The ensemble DNA motif finding problem can be formalized as follows: given a set of DNA sequences and the output of m different motif finding methods, where each output is a set of n motifs (i.e. n * m motifs in total), it is required to construct a representative motif that is the best approximation of the real motif shared by the input sequences.

Motif: A motif is a set of sites, where each site is a continuous range of positions representing a subsequence of a DNA sequence.

Motif Similarity: In [8], the similarity between two motifs is defined by (1), where cov(x_i) is the set of all positions covered by motif x_i. From this definition, 0 <= sim(x_i, x_j) <= 1 and sim(x_i, x_i) = 1.
\[ \mathrm{sim}(x_i, x_j) = \frac{|\mathrm{cov}(x_i) \cap \mathrm{cov}(x_j)|}{|\mathrm{cov}(x_i) \cup \mathrm{cov}(x_j)|} \qquad (1) \]
Cluster Similarity: The similarity among a cluster X of motifs is defined as the mean pairwise similarity among its members, given by (2), where |X| is the number of motifs in the set X.
\[ \mathrm{sim}(X) = \frac{\sum_{x_i, x_j \in X} \mathrm{sim}(x_i, x_j)}{|X|^{2}} \qquad (2) \]
Cluster Center: We define the center of a cluster as the motif that consists of all positions covered by two or more motifs in the cluster, i.e. it is the pairwise intersection of its members, and it can be calculated using (3).
\[ \mathrm{center}(X) = \bigcup_{\substack{x_i, x_j \in X \\ x_i \neq x_j}} \big[\mathrm{cov}(x_i) \cap \mathrm{cov}(x_j)\big] \qquad (3) \]
Consensus Cluster Center (Profile): We define the consensus center of a cluster as the motif that consists of all positions covered by at least two motifs such that the intersecting motifs are predicted by two different motif finders; it can be calculated using (4). An extra refinement is added by removing sites (continuous positions) that have only two contributing finding methods.
\[ \mathrm{consCenter}(X) = \bigcup_{\substack{x_i, x_j \in X \\ \mathrm{finder}(x_i) \neq \mathrm{finder}(x_j)}} \big[\mathrm{cov}(x_i) \cap \mathrm{cov}(x_j)\big] \qquad (4) \]
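The definitions above translate directly into code. A small sketch (Python), where a motif is represented simply by its set of covered positions and the name of the finder that produced it; this is an illustrative data model, not the paper's implementation:

```python
from itertools import combinations

class Motif:
    def __init__(self, positions, finder):
        self.cov = set(positions)      # positions covered by the motif's sites
        self.finder = finder           # name of the stand-alone finder

def sim(a, b):
    """Eq. (1): overlap of covered positions over their union."""
    union = a.cov | b.cov
    return len(a.cov & b.cov) / len(union) if union else 0.0

def cluster_sim(X):
    """Eq. (2): mean pairwise similarity (all ordered pairs, |X|^2 in the denominator)."""
    return sum(sim(a, b) for a in X for b in X) / (len(X) ** 2)

def center(X):
    """Eq. (3): positions covered by at least two motifs in X."""
    return set().union(*(a.cov & b.cov for a, b in combinations(X, 2))) if len(X) > 1 else set()

def cons_center(X):
    """Eq. (4): like center(), but the overlapping motifs must come from different finders."""
    pairs = [(a, b) for a, b in combinations(X, 2) if a.finder != b.finder]
    return set().union(*(a.cov & b.cov for a, b in pairs)) if pairs else set()
```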
Cluster Weight: There are several weighting functions that can be used to give a score to a set of motifs. In this paper, we compare our technique to MotifVoter [8] and apply the same weight used by MotifVoter, as defined by (5).
\[ \mathrm{weight}(X) = \frac{\mathrm{sim}(X)}{\sum_{x_i, x_j \in X}\big(\mathrm{sim}(x_i, x_j) - \mathrm{sim}(X)\big)^{2}} \qquad (5) \]
Objective Function: The objective function is defined in [8] as the ratio between the weight of a chosen set X and the weight of the remaining motifs not belonging to X (i.e. $\bar X$), as shown in (6).
\[ A(X) = \frac{\mathrm{weight}(X)}{\mathrm{weight}(\bar X)} \qquad (6) \]
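Continuing the same sketch, the cluster weight (5) and objective function (6) could be written as follows (again illustrative; the handling of a zero spread or an empty complement is an assumption):

```python
def weight(X):
    """Eq. (5): cluster similarity over the spread of pairwise similarities."""
    mean = cluster_sim(X)
    spread = sum((sim(a, b) - mean) ** 2 for a in X for b in X)
    return mean / spread if spread else float("inf")   # perfectly tight cluster

def objective(X, P):
    """Eq. (6): A(X) = weight(X) / weight of the motifs outside X."""
    rest = [m for m in P if m not in X]
    return weight(X) / weight(rest) if rest else float("inf")
```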
Accuracy Measures: Following Tompa et al. [7] and others, the following accuracy measures are considered. Sensitivity is the percentage of known sites that the algorithm was able to find correctly. Specificity is the percentage of predicted sites that are correct.

- Nucleotide Correlation Coefficient (nCC): The nucleotide correlation coefficient combines both sensitivity and specificity (positive predictive value). With nCC calculated by (7), if the predicted motif perfectly coincides with the known motif, then the value of nCC is 1; if they are independent, then the value of nCC is 0. TP, FP, TN and FN are the nucleotide-level true positives, false positives, true negatives and false negatives, respectively [7].
\[ nCC = \frac{TP \cdot TN - FN \cdot FP}{\sqrt{(TP+FN)(TN+FP)(TP+FP)(TN+FN)}} \qquad (7) \]
– Performance Coefficient (nPC): The performance coefficient captures both specificity and sensitivity in a single measure. It is the ratio between the true positives (true motif positions) and all regions marked as motifs, whether correctly or incorrectly. The nucleotide-level performance coefficient (nPC) is defined in (8) and ranges from 0 (worst) to 1 (best).

    nPC = \frac{TP}{TP + FN + FP}    (8)
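For completeness, equations (7) and (8) translate directly into code; TP, FP, TN and FN are assumed to be nucleotide-level counts obtained by comparing the predicted and known sites.

    from math import sqrt

    def nCC(TP, FP, TN, FN):
        # Equation (7): nucleotide-level correlation coefficient.
        denom = sqrt((TP + FN) * (TN + FP) * (TP + FP) * (TN + FN))
        return (TP * TN - FN * FP) / denom if denom > 0 else 0.0

    def nPC(TP, FP, FN):
        # Equation (8): nucleotide-level performance coefficient.
        return TP / (TP + FN + FP)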
3.2 MProfiler Algorithm
Given the output of m stand-alone motif finding methods, the goal is to produce a motif that best approximates the real motif. The MotifVoter algorithm has three steps. First, a search space consisting of sets of motifs is generated using the heuristic described in Section 2. Second, a set that maximizes the variance-based objective function in equation (6), with the consensus criterion satisfied, is chosen from the generated search space. Finally, the final motif is extracted from the chosen set as described in MotifVoter [8]. Instead of using the n · m motifs given by the stand-alone finders, as MotifVoter does, the proposed MProfiler technique uses a set of generated motif profiles, the consensus cluster centers defined in Section 3.1. Using those profiles helps
increase the similarities between motifs in the same cluster, when such similarities exist, thus giving them a higher score from the variance-based function in equation (6). A profile has at least three intersecting motifs predicted by three different motif finders. The generation of the profiles is described in Algorithm 1.
Input : set P containing n · m motifs.
Output: one motif and a PWM for its aligned sites.
 1  foreach xi, xj ∈ P do compute sim(xi, xj);
 2  profiles ← ∅;
 3  foreach motif z ∈ P do
 4      X ← ∅;
 5      sortedP ← sort P according to sim(z, pi);
 6      for top n pi ∈ sortedP do
 7          X ← X + pi;
 8          if sim(profiles.lastElement, consCenter(X)) < ε then profiles ← profiles + consCenter(X);
 9      end
10  end
11  acceptedCluster ← MotifVoter(profiles);
12  extractSites and generate PWM;
Algorithm 1. MProfiler pseudocode
The condition in line 8 avoids adding very similar profiles from the same group, which in effect represent the same profile. A new profile is generated only if it differs by at least ε from the last profile of its group, where ε is a similarity threshold between 0 and 1. Small values of ε generate a larger number of profiles, which are merged in line 11. In line 11, the MotifVoter algorithm is used to find the cluster X using the objective function in equation (6). The consensus criterion is not needed in this step because it is already enforced when generating the profiles.

3.3 Site Extraction
Final sites are extracted from the selected cluster of motifs as in equation (3): the accepted positions are those covered by more than one motif in the cluster. The sites are then aligned using MUSCLE [12] and a position weight matrix (PWM) is generated. The PWM is a common representation of motifs: it is a matrix of score values that gives a weighted match to any given substring of fixed length, with one row for each symbol of the alphabet (A, C, G, T) and one column for each position in the motif. A simple sketch of this construction is given below.
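The following frequency-based sketch assumes the aligned sites are equal-length strings over A, C, G, T and uses raw relative frequencies; a production implementation might add pseudocounts or log-odds scoring.

    def position_weight_matrix(aligned_sites):
        # One row per nucleotide, one column per motif position; entries are
        # the relative frequencies of each base at each position.
        length = len(aligned_sites[0])
        pwm = {base: [0.0] * length for base in "ACGT"}
        for site in aligned_sites:
            for j, base in enumerate(site.upper()):
                if base in pwm:
                    pwm[base][j] += 1.0
        n = float(len(aligned_sites))
        return {base: [count / n for count in counts] for base, counts in pwm.items()}

    # position_weight_matrix(["TACGAT", "TATAAT", "TATAAT"])["A"][1] == 1.0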
3.4 Time Complexity
Given m motif finders, each with n predicted motifs, the time complexity of our method is O(m²n²), the same as MotifVoter. First, at most mn² profiles are generated. Then, for each profile, the objective function is calculated for m subsets. As in MotifVoter, since motifs are added one by one, the objective function can be updated in constant time from its previous value. Unlike MotifVoter, MProfiler does not need to add all other profiles to the growing cluster of motifs for each profile, because the sets are more compact; instead, only the m most similar profiles are examined. Thus the final running time is O(m²n²).
4 Results and Discussion
4.1 Stand-Alone Motif Finders
We used the same 10 finders used by MotifVoter, with the same parameters described in [8]. The stand-alone motif finders are: MEME [13], Weeder [14], Bioprospector [15], SPACE [16], MDScan [17], ANN-Spec [18], MotifSampler [19], MITRA [20], AlignACE [21], and Improbizer [22]. Any other DNA motif finder could be used as well. For each finder, the first 30 predicted motifs are taken; the top 30 motifs achieve maximum sensitivity (nSn) on Tompa's benchmark [8]. Since Tompa's benchmark is a good representative of real motifs, using the top 30 motifs for other datasets is a reasonable approximation.

4.2 Datasets
The datasets used in the comparison are the Tompa et al. [7] benchmark, consisting of 56 different datasets that cover four species (mouse, fruit fly, human and yeast). The datasets are constructed from real transcription factor binding sites (TFBS).

4.3 Improvement in Search Space
Accuracy: Using the performance coefficient nPC as a measure of accuracy, MProfiler shows a 380% improvement over MotifVoter in the percentage of generated sets having accuracy nPC > 0.5. Fig. 1 shows the total improvement in nPC over all 56 datasets. The improvement in nPC over MotifVoter is calculated using (9).

    Improvement(nPC) = \frac{nPC_{MProfiler} - nPC_{MotifVoter}}{nPC_{MotifVoter}}    (9)
MProfiler's search space (generated sets) also shows more than a 200% improvement over MotifVoter in the percentage of generated sets having a higher correlation coefficient, i.e. nCC > 0.5, as shown in Fig. 2. The figure shows the combined nCC for all 56 datasets. The improvement in nCC over MotifVoter is calculated using (10).

    Improvement(nCC) = \frac{nCC_{MProfiler} - nCC_{MotifVoter}}{nCC_{MotifVoter}}    (10)
Fig. 1. The overall improvement in accuracy (nPC) of MProfiler over MotifVoter. Y-axis: percentage improvement in the number of generated clusters having nPC greater than or equal to x. MProfiler's search space (generated sets) shows a 380% improvement over MotifVoter in the percentage of generated sets with higher accuracy (i.e. nPC > 0.5).
Fig. 2. The overall improvement in accuracy (nCC) of MProfiler over MotifVoter. Y-axis: percent improvement of number of generated clusters having nCC greater than or equal to x. MProfiler’s search space (generated sets) has more than 200% improvement over MotifVoter in percentage of generated sets having higher accuracy (i.e with nCC > 0.5).
More accurate clusters mean a higher probability of finding the correct set, given that their quality is better. Notice that MProfiler shows a larger improvement at higher nCC and nPC values than at lower ones, which is a desirable feature (it increases the percentage of higher-quality sets more than that of lower-quality sets). Average Mean and Variance: Since the objective function is based on the mean and the variance of the cluster similarity (see equation (6)), it is desirable
Fig. 3. Average mean and variance of similarity in the generated clusters. Y-axis: the average mean/variance of similarity for sets having nCC greater than or equal to x.
to make higher-accuracy clusters more compact, i.e. with higher mean and lower variance. MProfiler improves both the mean (higher value) and the variance (lower value) over MotifVoter, which leads to the improvement in the optimal cluster selected. Fig. 3 shows the improvement in mean and variance of the MProfiler-generated sets over MotifVoter for all 56 datasets.

4.4 Comparison of Final Results
On the 56 datasets, MProfiler achieves an 80% improvement in accuracy (nucleotide correlation coefficient, nCC) over the MotifVoter results, using the same input and
the same objective function, implemented as described by Wijaya et al. [8]. MProfiler also achieves a 93% improvement in accuracy when the performance coefficient nPC is used as the measure. A comparison with the results stated in [8] was not possible because the exact implementation of the objective function is not described in their paper and the source code is not available.
5 Conclusion
Ensemble methods improve motif finding accuracy without the need for additional data (such as phylogenetic information or a characterization of the domain structure of the transcription factor), which are not always available. Our proposed method, MProfiler, improves on the best existing ensemble motif finding method, MotifVoter, in terms of accuracy without increasing the time complexity. On the widely used Tompa benchmark of 56 datasets, MProfiler's search space shows a 200% improvement over MotifVoter in the percentage of generated sets with higher accuracy (i.e. nCC > 0.5), and a 380% improvement for sets with performance coefficient nPC > 0.5. For the final motifs, our method achieves an 80% improvement in accuracy using the correlation coefficient and a 93% improvement using the performance coefficient over MotifVoter.
6 Future Work
The problem of computational motif finding remains open in bioinformatics; even with ensemble methods the accuracy is low. The upper bound on the accuracy of ensemble methods is set by the underlying stand-alone finders, so using better stand-alone finders will raise the maximum possible sensitivity of ensemble methods. Moreover, other objective functions could be proposed to enhance the accuracy. The idea of generating profiles can also be used with other stand-alone or ensemble methods.
References
1. Qiu, P.: Recent advances in computational promoter analysis in understanding the transcriptional regulatory network. Biochemical and Biophysical Research Communications 309(3), 495–501 (2003)
2. Wei, W., Yu, X.D.: Comparative analysis of regulatory motif discovery tools for transcription factor binding sites. Genomics Proteomics Bioinformatics 5(2), 131–142 (2007)
3. Das, M., Dai, H.K.: A survey of DNA motif finding algorithms. BMC Bioinformatics 8(suppl. 7) (2007)
4. Li, N., Tompa, M.: Analysis of computational approaches for motif discovery. Algorithms for Molecular Biology 1(1), 8–15 (2006)
5. Hu, J., Li, B., Kihara, D.: Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res. 33(15), 4899–4913 (2005)
6. Stormo, G.D.: DNA binding sites: representation and discovery. Bioinformatics 16(1), 16–23 (2000)
7. Tompa, M., et al.: Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnology 23, 137–144 (2005)
8. Wijaya, E., Yiu, S., Son, N.T., Kanagasabai, R., Sung, W.: MotifVoter: a novel ensemble method for fine-grained integration of generic motif finders. Bioinformatics 24, 2288–2295 (2008)
9. Chakravarty, A., Carlson, J.M., Khetani, R.S., Gross, R.H.: A novel ensemble learning method for de novo computational identification of DNA binding sites. BMC Bioinformatics 8, 249–263 (2007)
10. Che, D., Jensen, S., Cai, L., Liu, J.S.: BEST: Binding-site estimation suite of tools. Bioinformatics 21(12), 2909–2911 (2005)
11. Hu, J., Yang, Y.D., Kihara, D.: EMD: an ensemble algorithm for discovering regulatory motifs in DNA sequences. BMC Bioinformatics 7, 342–454 (2006)
12. Edgar, R.C.: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5), 1792–1797 (2004)
13. Bailey, T.L., Elkan, C.: Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning 21, 51–80 (1995)
14. Pavesi, G., Mereghetti, P., Mauri, G., Pesole, G.: Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Res. 32(Web Server issue) (July 2004)
15. Liu, X., Brutlag, D.L., Liu, J.S.: BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. In: Pac. Symp. Biocomput., pp. 127–138 (2001)
16. Wijaya, E., Kanagasabai, R., Yiu, S.-M., Sung, W.-K.: Detection of generic spaced motifs using submotif pattern mining. Bioinformatics 23(12), 1476–1485 (2007)
17. Liu, X.S., Brutlag, D.L., Liu, J.S.: An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol. 20(8), 835–839 (2002)
18. Workman, C.T., Stormo, G.D.: ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. In: Pac. Symp. Biocomput., pp. 467–478 (2000)
19. Thijs, G., et al.: A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics 17(12), 1113–1122 (2001)
20. Eskin, E., Pevzner, P.A.: Finding composite regulatory patterns in DNA sequences. Bioinformatics 18(suppl. 1) (2002)
21. Huang, H.-D., Horng, J.-T., Sun, Y.-M., Tsou, A.-P., Huang, S.-L.: Identifying transcriptional regulatory sites in the human genome using an integrated system. Nucleic Acids Res. 32(6), 1948–1956 (2004)
22. Ao, W., Gaudet, J., Kent, W.J., Muttumu, S., Mango, S.E.: Environmentally induced foregut remodeling by PHA-4/FoxA and DAF-12/NHR. Science 305, 1743–1746 (2004)
On Utilizing Optimal and Information Theoretic Syntactic Modeling for Peptide Classification

Eser Aygün1, B. John Oommen2, and Zehra Cataltepe3

1 Department of Computer Eng., Istanbul Technical University, Istanbul, Turkey
2 School of Computer Science, Carleton University, Ottawa, Canada K1S 5B6, and Adjunct Professor at the University of Agder in Grimstad, Norway
3 Department of Computer Eng., Istanbul Technical University, Istanbul, Turkey
[email protected]
Abstract. Syntactic methods in pattern recognition have been used extensively in bioinformatics, and in particular, in the analysis of gene and protein expressions, and in the recognition and classification of biosequences. These methods are almost universally distance-based. This paper concerns the use of an Optimal and Information Theoretic (OIT) probabilistic model [11] to achieve peptide classification using the information residing in their syntactic representations. The latter has traditionally been achieved using the edit distances required in the respective peptide comparisons. We advocate that one can model the differences between compared strings as a mutation model consisting of random Substitutions, Insertions and Deletions (SID) obeying the OIT model. Thus, in this paper, we show that the probability measure obtained from the OIT model can be perceived as a sequence similarity metric, using which a Support Vector Machine (SVM)-based peptide classifier, referred to as OIT SVM, can be devised. The classifier we have built has been tested for eight different "substitution" matrices and for two different data sets, namely, the HIV-1 Protease Cleavage sites and the T-cell Epitopes. The results show that the OIT model performs significantly better than one which uses a Needleman-Wunsch sequence alignment score, and also better than the peptide classification methods previously applied to the same two data sets.
Keywords: Biological Sequence Analysis, Optimal and Information Theoretic Syntactic Classification, Peptide Classification, Sequence Processing, Syntactic Pattern Recognition.
1 Introduction
The syntactic methods that have been traditionally used in the analysis, recognition and classification of bioinformatic data include distance-based methods, and probabilistic schemes which are, for example, Markovian. A probabilistic model, distinct from these, is the one proposed by Oommen and Kashyap [11]. The model, referred to as the OIT model, attains the optimal and information theoretic bound. This paper reports the first known results in which the OIT model has been applied in any bioinformatic application.
Peptides are relatively short amino acid chains that occur either as separate molecules or as building blocks for proteins. Apart from their significance in analyzing proteins, peptides themselves may have various distinct chemical structures that are related to different molecular functions. These functions, such as cleavage or binding, while being interesting in their own right, have also been shown to be important in areas such as biology, medicine, drug design, disease pathology, and nanotechnology. Indeed, for more than a decade, researchers have sought computational techniques to rapidly identify peptides that are known to be, or can be, related to certain molecular functions. The research in peptide classification is not new; indeed, a host of techniques have been proposed for in silico peptide classification1.

In 1998, Cai and Chou [3] presented one of the pioneering works in this area. They classified 8-residue peptides and used artificial neural networks with 20 input nodes per residue, thus involving a total of 160 input nodes. In their work, each amino acid was encoded using 20 bits so that the 20 amino acids were encoded as A = 100...00, B = 010...00, ..., Y = 000...01. Similarly, Zhao et al. [15] mapped the amino acid sequences of peptides directly into feature vectors and fed them into a Support Vector Machine (SVM). They, however, represented the amino acids by a set (more specifically, ten) of their biophysical properties, such as hydrophobicity or beta-structure preference, instead of the orthonormal representation advocated in [3]. By resorting to such a representation, they were eventually able to reduce the dimensionality of the input space by 50%. To further increase the information density of the input vectors, Thomson et al. [13] used bio-basis artificial neural networks, a revision of radial-basis function networks that uses biological similarities rather than spatial distances. This work was subsequently enhanced by Trudgian and Yang [14] by optimizing the substitution matrices used to compute the latter biological similarities. Kim et al. [8] followed a rule-based approach to achieve results which were interpretable. It should be mentioned that there were also earlier studies based on the properties of quantitative matrices, binding motifs and hidden Markov models, which should really be treated as precursors to the results cited above. The differences between our results and those which use Hidden Markov Models (HMMs) will be clarified presently.

A completely different sequence representation technique was introduced in the area of protein fold recognition by Liao and Noble [9], who represented protein sequences by their pairwise biological similarities, measured by ordinary sequence alignment algorithms. Subsequently, by considering these similarities as feature vectors, relatively simple classifiers were trained and successfully utilized for classifying and discriminating between different protein folds.

The primary intention in this study is to use an SVM-based classifier in achieving the classification and discrimination. However, rather than use distances, we shall advocate the use of a rigorous probabilistic model, namely one which has
The review and bibliography presented here is necessarily brief. A more detailed review is found in [1].
been proven to be both optimal and to attain the information theoretic bound. Indeed, in this study, we combine the strategy of Liao and Noble (i.e., to use pairwise SVM classifiers) with a probabilistic similarity metric, and use it to successfully classify peptides. Observe that, instead of resorting to alignment scores, we quantify the similarity by means of the Optimal and Information Theoretic (OIT) garbling probabilities described by Oommen and Kashyap [11]. The OIT garbling probability is the probability of obtaining a sequence Y from a sequence U under the OIT mutation model, whose properties will be clarified later. One clear difference between alignment scores and OIT garbling probabilities is that, whereas an alignment score considers only the shortest path between two sequences, the OIT garbling probability covers all possible paths. Furthermore, since it assigns a probability mass to every possible path (i.e., every possible sequence of garbling operations), it contains more information about the similarity between the two sequences.

It is pertinent to mention that a similar transition probability measurement based on HMMs was earlier proposed by Bucher and Hofmann [2], and since then, HMM-based similarity metrics have been used in many biological applications. The difference between our work and the ones which use HMMs can, in all brevity, be stated as follows: unlike the latter, the OIT model permits non-Geometric distributions for the number of insertions occurring in any sequence of mutations [1,11]. Additionally, the advantages of the OIT model, say Π*, over "distance-based" approaches are: (a) Π* is Functionally Complete, because it comprehensively considers all the ways by which U can be mutated into Y using the three elementary Substitution, Insertion and Deletion (SID) operations; (b) the distributions and the parameters involved for the various garbling operations in Π* can be completely arbitrary; (c) Π* captures scenarios in which the probability of a particular string U being transformed into another string Y is arbitrarily small; (d) for a given U, the length of Y is a random variable whose distribution does not necessarily have to be a mixture of Geometric distributions; and (e) if the input U is itself an element of a dictionary, and the OIT channel is used to model the noisy channel, the technique for computing the probability Pr[Y|U] can be utilized in a Bayesian way to compute the a posteriori probabilities, and thus yield an optimal, minimum-probability-of-error pattern classification rule. Most importantly, however, in both the Bayesian and non-Bayesian approaches, the OIT model actually attains the information theoretic bound for recognition accuracy when compared with all other models which have the same underlying garbling philosophy. These issues are also clarified in greater detail in [1,11].

We have tested our solution, the OIT SVM, which combines the SVM-pairwise scheme and the OIT model, on two peptide classification problems, namely the HIV-1 Protease Cleavage site and the T-cell Epitope prediction problems. Both of these problems are closely related to pharmacological research that has been the focus of a variety of computational approaches [3,8,13,14,15]. The results, which we present in a subsequent section, indicate that our solution paradigm leads to extremely good classification performance.
2 Modeling – The String Generation Process
We now describe the model by which a string Y is generated given an input string U ∈ A*, where A is the alphabet under consideration, and ξ and λ are the input and output null symbols, respectively. First of all, we assume that the model utilizes a probability distribution G over the set of positive integers. The random variable in this case is referred to as Z, and is the number of insertions that are performed in the mutating process. G is called the Quantified Insertion Distribution and, in the most general case, can be conditioned on the input string U. The quantity G(z|U) is the probability that Z = z given that U is the input word. Thus, G has to satisfy the following constraint:

    \sum_{z \ge 0} G(z \mid U) = 1.    (1)
The second distribution that the model utilizes is the probability distribution Q over the alphabet under consideration. Q is called the Qualified Insertion Distribution. The quantity Q(a) is the probability that a ∈ A will be the inserted symbol, conditioned on the fact that an insertion operation is to be performed. Note that Q has to satisfy the following constraint:

    \sum_{a \in A} Q(a) = 1.    (2)
Apart from G and Q, another distribution that the model utilizes is a probability distribution S over A × (A ∪ {λ}), where λ is the output null symbol. S is called the Substitution and Deletion Distribution. The quantity S(b|a) is the conditional probability that a given symbol a ∈ A in the input string is mutated by a stochastic substitution or deletion, in which case it is transformed into a symbol b ∈ (A ∪ {λ}). Hence, S(c|a) is the conditional probability of a ∈ A being substituted for by c ∈ A, and analogously, S(λ|a) is the conditional probability of a ∈ A being deleted. Observe that S has to satisfy the following constraint for all a ∈ A:

    \sum_{b \in (A \cup \{\lambda\})} S(b \mid a) = 1.    (3)
Using the above distributions we now informally describe the OIT model for the garbling mechanism (or equivalently, the noisy string generation process). Let |U| = N. Using the distribution G, the generator first randomly determines the number of symbols to be inserted. Let Z be the random variable denoting the number of insertions to be made in the mutation, and assume that the random number generator yields Z = z. The algorithm then determines the positions of the insertions among the individual symbols of U. This is done by randomly generating an input edit sequence U′ ∈ (A ∪ {ξ})*, where all possible such strings are assumed to be equally likely. Note that the positions of the symbol ξ in U′ represent the positions where symbols will be inserted into U. The non-ξ symbols in U′ are now substituted for
or deleted using the distribution S. Finally, the occurrences of ξ are transformed independently into the individual symbols of the alphabet using the distribution Q. This defines the model completely. The process followed by the model, and its graphical display, are formally included in the unabridged version of this paper, and omitted here in the interest of brevity [1]. The theoretical properties of the OIT model can be found in [11].
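The generation process just described can be sketched as a small sampler. In the sketch below, G_sample, S and Q are assumed inputs: G_sample draws the number of insertions for a given input string, S maps each input symbol to a dictionary of substitution/deletion probabilities (with '-' standing for λ), and Q is a dictionary of insertion probabilities over the alphabet.

    import random

    def generate_noisy_string(U, G_sample, S, Q):
        # Number of insertions, drawn from the Quantified Insertion Distribution G.
        z = G_sample(U)
        L = len(U) + z
        # Choose the insertion slots uniformly among all placements, keeping the
        # original symbols of U in order (this realises the edit sequence U').
        xi_positions = set(random.sample(range(L), z))
        out, u_iter = [], iter(U)
        for i in range(L):
            if i in xi_positions:
                # Inserted position: emit a symbol according to Q.
                out.append(random.choices(list(Q), weights=list(Q.values()))[0])
            else:
                # Original symbol: substitute or delete according to S; '-' encodes lambda.
                u = next(u_iter)
                b = random.choices(list(S[u]), weights=list(S[u].values()))[0]
                if b != '-':
                    out.append(b)
        return ''.join(out)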
3 Proposed Methodology
In this section, we provide the explicit details of the syntactic probabilities of the OIT model, and also explain the way by which we utilize it together with the SVM-pairwise scheme for peptide classification. For a mutation consisting of random SID operations as per the OIT model, Oommen and Kashyap [11] have derived the syntactic probability of obtaining the sequence Y = y1 y2 ... yM from the sequence U = u1 u2 ... uN as:

    P(Y \mid U) = \sum_{z=\max\{0,\, M-N\}}^{M} \frac{G(z)\, N!\, z!}{(N+z)!} \sum_{U', Y'} \prod_{i=1}^{N+z} p(y'_i \mid u'_i),
where G(z) is the probability of inserting z elements into U, and p(y'_i | u'_i) is the probability of substituting the symbol u'_i with the symbol y'_i. Observe that in the above, u'_i = ξ ⇒ y'_i ≠ λ, and y'_i = λ ⇒ u'_i ≠ ξ. The inner sum, over the strings U′ = u'_1 u'_2 ... u'_{N+z} and Y′ = y'_1 y'_2 ... y'_{N+z} (of the same length N + z), runs over all possible pairs of strings generated by inserting ξ's into random positions of U and λ's into random positions of Y, representing the insertion and deletion operations respectively. Although this requires a summation over a combinatorially large number of elements (represented by U′ and Y′), Oommen and Kashyap [11] have shown that it can be computed in an extremely efficient manner in cubic time, i.e., with complexity O(M · N · min{M, N}). Based on the work of Oommen and Kashyap [11], we have programmed our own toolkit to efficiently compute the syntactic probabilities between two arbitrary sequences, and adapted it to this particular domain. Since the OIT model essentially requires three "parameters", namely S for the substitution/deletion probabilities, Q for the insertion distribution, and G, we list the issues crucial to our solution:
1. The input and output alphabets in our application domain consist of the twenty amino acids and one gap element, which for input strings is the null symbol ξ, representing an inserted element, and for output strings is the null symbol λ, representing a deleted element.
2. The substitution of an amino acid with another corresponds to a series of mutations in the biological context. Based on this premise, we have computed
our substitution probabilities from the mutation probability matrix referred to as PAM1, derived by Dayhoff et al. [5]. PAM1 is a 20 × 20 matrix, M, where each cell m_ij corresponds to the probability of replacing amino acid i with amino acid j after 1% of the amino acids have been replaced. It is possible to generate matrices for a series of longer mutations using successive multiplications of PAM1; thus, for example, PAM250 is equal to PAM249 × PAM1 [5].
3. The first major deviation from the traditional PAM matrices involves the operation of deletion. Observe that PAM matrices generally do not specify deletion probabilities for amino acids. As opposed to this, the OIT model of Oommen and Kashyap [11] allows an element to be deleted (substituted by λ) as well as substituted by another element. In this vein, we advocate that the matrix PAM1 be extended by appending another column for λ, where the value Δ is assigned to the deletion probabilities of amino acids, and where each row is normalized to satisfy the probability constraint

    \sum_{y \in A \cup \{\lambda\}} p(y \mid u) = 1,    (4)
where A is the set of all amino acids and u is the amino acid corresponding to the row.
4. There is no standard method of determining the deletion probabilities of amino acids. Comparing the widely used gap penalties as per [12] to the log-odds PAM matrices, we opted to use Δ = 0.0001. The question of how to optimally determine Δ is open, and we are currently considering how it can be obtained from a training phase using known input/output patterns.
5. The second major deviation from utilizing the traditional PAM matrices involves the operation of insertion. As in the case of deletion, we propose to extend the new PAM matrix by appending a row for ξ, and to assign to p(y|ξ) (i.e. the probability that a newly inserted amino acid is y) the relative frequency of observing y, f(y). In our experiments, the relative frequencies were computed in a maximum likelihood manner by evaluating the limit of the PAMn matrix as n goes to infinity, since each row of the limiting matrix converges to f(y). Finally, the remaining cell of our extended PAM matrix, p(λ|ξ), is, by definition, equal to zero. The resulting matrix is referred to as the OIT PAM matrix, and is a 21 × 21 matrix. Table 1 gives a typical OIT PAM matrix for the amino acid application domain. Observe that, as in the case of the traditional PAM matrices, it is possible to derive higher-order OIT PAM matrices for longer mutation sequences by multiplying OIT PAM1 by itself. In our work, we have experimented with OIT PAM matrices of different orders to observe the effect of different assumptions concerning evolutionary distances.
6. The final parameter of the OIT model involves the Quantified Insertion Distribution, G(z), which specifies the probability that the number of insertions during the mutation is z. In our experiments, we have assumed that the probability of inserting an amino acid during a single PAM mutation is equal to the deletion probability of an amino acid, Δ. This assumption leads
to the conclusion that, for longer mutation series, the insertion distribution converges to a Poisson distribution such that

    G(z) = \mathrm{Poisson}(z;\, n\Delta) = \frac{(n\Delta)^z\, e^{-n\Delta}}{z!},    (5)
where n is the number of PAMs (i.e. the length of the mutation series). In other words, we currently use Poisson(z; nΔ) as the insertion distribution whenever we use OIT PAMn as the substitution probability matrix.
7. Using the OIT model and the parameters assigned as described above, a classification methodology based on the SVM-pairwise scheme proposed by Liao and Noble [9] was devised. This is explained in the next subsection.
Having explained how the OIT-based scheme works, we shall now also present the results obtained from our experiments. A sketch of the construction of the extended matrices and of the insertion distribution follows.
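The sketch below assumes a NumPy array pam1 holding the 20 × 20 PAM1 matrix in a fixed amino-acid order; the uniform stand-in for the limiting frequencies f(y), the renormalisation by (1 − Δ), and the default Δ value are illustrative assumptions rather than the exact procedure used in the paper.

    import numpy as np
    from math import exp, factorial

    def oit_pam(pam1, delta=1e-4, n=250):
        # Extend the 20 x 20 PAM1 matrix with a deletion column and an insertion row,
        # then take the n-th matrix power to model a series of n PAM mutations.
        m = np.zeros((21, 21))
        m[:20, :20] = pam1 * (1.0 - delta)   # renormalise each row after adding deletions
        m[:20, 20] = delta                   # column 20: deletion probability p(lambda | u)
        m[20, :20] = 1.0 / 20.0              # row 20: p(y | xi); uniform stand-in for f(y)
        m[20, 20] = 0.0                      # p(lambda | xi) = 0 by definition
        return np.linalg.matrix_power(m, n)

    def insertion_distribution(z, n, delta=1e-4):
        # Equation (5): G(z) = Poisson(z; n * delta).
        lam = n * delta
        return lam ** z * exp(-lam) / factorial(z)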
4 Experimental Results and Discussions
4.1 Experimental Setup
In our experiments, we used two peptide classification data sets which are accepted as benchmarks. The first, referred to as HIV, was produced for the HIV-1 Protease Cleavage site prediction problem by Kim et al. [8]; it contains 754 8-residue peptides, with 396 positives and 358 negatives. The second data set, referred to as TCL, was produced for the T-cell Epitope prediction problem by Zhao et al. [15]; it contains 203 10-residue peptides, of which 36 are positives and 167 are negatives. As mentioned earlier, our classification scheme is based on the SVM-pairwise scheme proposed by Liao and Noble [9] to detect remote evolutionary relationships between proteins. According to this scheme, m representative peptides were chosen a priori from the training set. Subsequently, for each instance, an m-dimensional vector of scores was computed by comparing the instance to the representatives. The classifiers were trained and tested on these feature vectors. As a computational convenience, we used the logarithm of the OIT probability as the measure of similarity, because the logarithm is a monotonic function and it turns out that it can be computed more efficiently than the original OIT probabilities. To compare the performance of the OIT SVM to standard measures, we also used the Needleman-Wunsch (NW) alignment score [10], a commonly used sequence comparison method in bioinformatics, to achieve an analogous classification. Our representative peptides were chosen to be the positive training instances, and in each case we used eight different substitution matrices with mutation lengths 10, 50, 100, 200, 250, 300, 400 and 500. Each feature set was tested with an SVM classifier with a linear kernel. A preliminary evaluation showed that the SVM with a linear kernel performs slightly better than the SVM with a radial-basis kernel on all the feature sets. Based
on this observation, we fixed the classifier prior to the experiments and focused on the comparison of the feature sets themselves. In the testing phase, we estimated the performance of the different methods by means of cross-validation. To do this, we divided the HIV data set into ten partitions and the TCL data set, which is rather small, into five partitions, as was done in [8] and [15] respectively. We chose not to divide the TCL data set into more than five partitions because the number of positive examples was too low, which would have prevented us from providing the necessary variation across the partitions; this choice also rendered our results compatible with those of [15]. Finally, we also ensured that the ratio of positive to negative instances was preserved across the partitions. All the classification and performance estimations were performed on the MathWorks MATLAB [7] system with the help of the PRTools 4.1 pattern recognition toolbox [6] and the LIBSVM 2.88 support vector machine library [4]. A sketch of the feature construction is given below.
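In the sketch, scikit-learn's linear SVM stands in for the MATLAB/PRTools/LIBSVM setup actually used, and log_oit_score is an assumed callable returning the log-OIT similarity between two peptides.

    import numpy as np
    from sklearn.svm import SVC

    def pairwise_features(peptides, representatives, log_oit_score):
        # One log-OIT similarity per representative peptide (SVM-pairwise features).
        return np.array([[log_oit_score(p, r) for r in representatives]
                         for p in peptides])

    # Hypothetical usage for one cross-validation split:
    # reps = [p for p, label in zip(train_peptides, train_labels) if label == 1]
    # clf = SVC(kernel="linear")
    # clf.fit(pairwise_features(train_peptides, reps, log_oit_score), train_labels)
    # preds = clf.predict(pairwise_features(test_peptides, reps, log_oit_score))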
4.2 Experimental Results and Discussions
The performance of the OIT-based features was compared to the scores obtained by a Needleman-Wunsch (NW) alignment strategy. In each case, and for each of the experiments, we recorded the area under the ROC curve (AUC), the Accuracy (Acc), the Sensitivity (Sens) and the Positive Predictive Value (PPV). Tables 2 and 3 show the averaged values and the average widths of the 95% confidence intervals for the HIV and TCL data sets, respectively. It is worth mentioning that the OIT-based scheme is uniformly superior to the NW-based scheme, and in some cases the superiority is categorically marked: for example, whereas the best accuracy for the NW-based method is 85.7%, the corresponding best accuracy for the OIT-based scheme is 91.7%. Also note that the 95% confidence intervals are generally wider for the TCL data set than for the HIV data set, because cross-validation was performed through a five-fold strategy on the former and a ten-fold strategy on the latter. For the HIV data set, [8] report accuracies for ten different methods; our OIT-based method outperforms nine of them, while the accuracy of the tenth is marginally better. With regard to the TCL data set, the OIT SVM leads to better results than those reported by [15] on every performance criterion. The behavior of the two methods for different score matrices can be seen in Fig. 1, whose two plots display how the AUCs vary as the assumed mutation length increases from 10 PAMs to 500 PAMs. The reader will observe that for the HIV data set both the OIT and the NW reach their highest performance between 100 and 300 PAMs, whereas for the TCL data set the NW prefers PAM400. In terms of the mean average AUC, the OIT outperforms the NW even in its worst cases; Table 4 records the t-test results that validate this observation. Also, the average widths of the confidence intervals point to the conclusion that the OIT leads to more robust classifications than the NW.
A R N D C Q E G H I L K M F P S T W Y V ξ
A -0.01 -8.52 -7.01 -6.91 -8.11 -7.13 -6.38 -6.17 -8.52 -7.42 -7.82 -8.52 -7.42 -8.52 -6.12 -5.66 -5.74 -36.04 -8.52 -6.32 -2.43
R -9.21 -0.01 -9.21 -36.04 -9.21 -6.91 -36.04 -36.04 -6.91 -8.11 -9.21 -6.27 -7.82 -9.21 -7.82 -7.42 -9.21 -7.13 -36.04 -9.21 -3.21
N -7.82 -9.21 -0.02 -5.63 -36.04 -7.82 -7.42 -7.42 -6.17 -8.11 -9.21 -6.65 -36.04 -9.21 -8.52 -6.21 -7.01 -9.21 -7.82 -9.21 -3.20
D -7.42 -36.04 -5.47 -0.01 -36.04 -7.42 -5.24 -7.42 -7.82 -9.21 -36.04 -8.11 -36.04 -36.04 -9.21 -7.60 -8.11 -36.04 -36.04 -9.21 -3.04
C -9.21 -9.21 -36.04 -36.04 0.00 -36.04 -36.04 -36.04 -9.21 -9.21 -36.04 -36.04 -36.04 -36.04 -9.21 -7.60 -9.21 -36.04 -8.11 -8.52 -3.43
Q -8.11 -7.01 -7.82 -7.60 -36.04 -0.01 -5.91 -9.21 -6.07 -9.21 -8.11 -7.42 -7.82 -36.04 -7.42 -8.52 -8.52 -36.04 -36.04 -9.21 -3.27
E -6.91 -36.04 -7.26 -5.19 -36.04 -5.66 -0.01 -7.82 -8.52 -8.11 -9.21 -7.82 -9.21 -36.04 -8.11 -7.82 -8.52 -36.04 -9.21 -8.52 -2.99
G -6.17 -9.21 -6.73 -6.81 -9.21 -8.11 -7.26 -0.01 -9.21 -36.04 -9.21 -8.52 -9.21 -9.21 -8.11 -6.17 -8.11 -36.04 -36.04 -7.60 -2.41
H -9.21 -7.13 -6.32 -8.11 -9.21 -6.21 -9.21 -36.04 -0.01 -36.04 -9.21 -9.21 -36.04 -8.52 -8.11 -9.21 -9.21 -9.21 -7.82 -9.21 -3.42
I -8.52 -8.52 -8.11 -9.21 -8.52 -9.21 -8.52 -36.04 -36.04 -0.01 -7.01 -8.52 -6.73 -7.26 -36.04 -9.21 -7.26 -36.04 -9.21 -5.71 -3.33
L -8.11 -9.21 -8.11 -36.04 -36.04 -7.42 -9.21 -9.21 -7.82 -6.12 -0.01 -8.52 -5.40 -6.65 -8.11 -9.21 -8.11 -7.82 -8.52 -6.50 -2.46
K -8.52 -5.60 -5.99 -7.42 -36.04 -6.73 -7.26 -8.52 -8.52 -7.82 -9.21 -0.01 -6.21 -36.04 -8.11 -7.13 -6.81 -36.04 -9.21 -9.21 -2.54
M -9.21 -9.21 -36.04 -36.04 -36.04 -8.52 -36.04 -36.04 -36.04 -7.60 -7.13 -7.82 -0.01 -9.21 -36.04 -9.21 -8.52 -36.04 -36.04 -7.82 -4.21
F -9.21 -9.21 -9.21 -36.04 -36.04 -36.04 -36.04 -9.21 -8.52 -7.13 -7.42 -36.04 -7.82 -0.01 -36.04 -8.52 -9.21 -8.11 -5.88 -36.04 -3.19
P -6.65 -7.60 -8.52 -9.21 -9.21 -7.13 -8.11 -8.52 -7.60 -9.21 -8.52 -8.52 -9.21 -9.21 -0.01 -6.73 -7.82 -36.04 -36.04 -8.52 -2.96
S -5.88 -6.81 -5.68 -7.26 -6.81 -7.82 -7.42 -6.44 -8.52 -8.52 -9.21 -7.26 -7.82 -8.11 -6.38 -0.02 -5.57 -7.60 -8.52 -8.52 -2.66
T -6.12 -8.52 -6.65 -7.82 -9.21 -8.11 -8.52 -8.52 -9.21 -6.81 -8.52 -7.13 -7.42 -9.21 -7.60 -5.74 -0.01 -36.04 -8.52 -7.01 -2.84
W -36.04 -8.52 -36.04 -36.04 -36.04 -36.04 -36.04 -36.04 -36.04 -36.04 -36.04 -36.04 -36.04 -9.21 -36.04 -9.21 -36.04 0.00 -9.21 -36.04 -4.68
Y -9.21 -36.04 -8.11 -36.04 -8.11 -36.04 -9.21 -36.04 -7.82 -9.21 -9.21 -36.04 -36.04 -6.17 -36.04 -9.21 -9.21 -8.52 -0.01 -9.21 -3.45
V -6.65 -8.52 -9.21 -9.21 -8.11 -8.52 -8.52 -8.11 -8.11 -5.17 -6.81 -9.21 -6.38 -9.21 -8.11 -8.52 -6.91 -36.04 -8.52 -0.01 -2.76
λ -9.21 -9.21 -9.21 -9.21 -9.21 -9.21 -9.21 -9.21 -9.21 -9.21 -9.21 -9.21 -9.21 -9.21 -9.21 -9.21 -9.21 -9.21 -9.21 -9.21 -∞
Table 1. The Log-OIT PAM1 matrix used for the OIT model. Each element Mi,j is equal to the logarithm of the probability associated with the event of replacing the ith element with j th element. The symbols ξ and λ represent the insertion and deletion of an element of the alphabet, respectively. Please see Section 3 for more details on the Log-OIT PAM1 matrix.
Table 2. The performance measurements for the HIV data set using the OIT and NW metrics. The highest value over each column is shown in bold. The last row displays the average widths of the 95% confidence intervals (Avg. w) for each measurement.

(O)PAM   OIT AUC  OIT Acc  OIT Sens  OIT PPV   NW AUC   NW Acc   NW Sens   NW PPV
10       0.948    0.887    0.863     0.884     0.906    0.839    0.821     0.837
50       0.962    0.902    0.891     0.904     0.909    0.849    0.841     0.843
100      0.968    0.917    0.897     0.927     0.917    0.846    0.846     0.833
200      0.969    0.911    0.877     0.932     0.927    0.857    0.833     0.862
250      0.965    0.913    0.874     0.938     0.925    0.853    0.830     0.857
300      0.965    0.911    0.863     0.948     0.921    0.849    0.829     0.852
400      0.958    0.901    0.849     0.937     0.912    0.849    0.838     0.848
500      0.949    0.893    0.830     0.938     0.924    0.846    0.813     0.859
Avg. w   0.011    0.018    0.037     0.021     0.019    0.025    0.040     0.029
Table 3. The performance measurements for the TCL data set using the OIT and NW metrics. The highest value over each column is shown in bold. The last row displays the average widths of the 95% confidence intervals (Avg. w) for each measurement.

(O)PAM   OIT AUC  OIT Acc  OIT Sens  OIT PPV   NW AUC   NW Acc   NW Sens   NW PPV
10       0.918    0.852    0.922     0.901     0.883    0.837    0.928     0.882
50       0.937    0.872    0.934     0.912     0.892    0.842    0.922     0.891
100      0.943    0.882    0.929     0.928     0.889    0.847    0.922     0.895
200      0.947    0.897    0.940     0.935     0.889    0.853    0.905     0.917
250      0.944    0.902    0.946     0.936     0.885    0.853    0.893     0.927
300      0.945    0.887    0.940     0.924     0.895    0.852    0.916     0.905
400      0.939    0.887    0.946     0.919     0.904    0.867    0.911     0.928
500      0.936    0.882    0.929     0.928     0.819    0.793    0.881     0.871
Avg. w   0.016    0.023    0.022     0.020     0.028    0.030    0.041     0.021
Table 4. The t-test results for the 1% significance level comparing the AUC values of the OIT- and NW-based schemes.

(O)PAM   HIV: OIT > NW   HIV: p-value   TCL: OIT > NW   TCL: p-value
10       no              0.013          no              0.018
50       yes             0.001          no              0.025
100      yes             <0.001         no              0.047
200      yes             <0.001         no              0.014
250      yes             <0.001         yes             <0.001
300      yes             <0.001         no              0.015
400      no              0.012          yes             0.001
500      no              0.014          yes             0.001
(Fig. 1: two plots of Area Under ROC versus PAMs, 10 to 500, each comparing the OIT and NW metrics.)
Fig. 1. The figure on the left displays the behavior of the OIT and NW similarity metrics on the HIV data set when the mutation length assumption changes between 10 PAMs and 500 PAMs. The figure on the right displays the corresponding behavior of the OIT and NW similarity metrics for the TCL data set. In each case, the error bars display the respective 95% confidence intervals.
5 Conclusions
In this paper, we have considered the problem of classifying peptides using syntactic pattern recognition methodologies. Unlike the traditional distance-based or Markovian methods, we have considered how the pattern recognition can be achieved by using the Optimal and Information Theoretic (OIT) model of Oommen and Kashyap [11]. We have shown that one can model the differences between the compared strings as a mutation model consisting of random SID operations which obeys an OIT model. Consequently, by using the probability measure obtained from the OIT model as a pairwise similarity metric, we have devised a Support Vector Machine (SVM)-based peptide classifier, referred to as the OIT SVM. The classifier has been tested on eight different "substitution" matrices and on two different data sets, namely the HIV-1 Protease Cleavage sites and the T-cell Epitopes, and the results obtained categorically demonstrate that the OIT model performs significantly better than one which uses a Needleman-Wunsch sequence alignment score. Further, when combined with an SVM, it leads to what is probably the best peptide classification method available. The main drawback of the OIT method is its higher time complexity. Otherwise, the avenues for future work include the learning of the PAM matrices using maximum likelihood or Bayesian methods. The use of the OIT model for other bioinformatic pattern recognition problems remains open.
Acknowledgments
The first and third authors, Aygün and Cataltepe, were partially supported by TÜBİTAK, The Scientific and Technological Research Council of Turkey. The second author, Oommen, was partially supported by NSERC, the Natural Sciences and Engineering Research Council of Canada.
References
1. Aygün, E., Oommen, B.J., Cataltepe, Z.: Peptide Classification Using Optimal and Information Theoretic Syntactic Modeling (submitted for publication)
2. Bucher, P., Hofmann, K.: A sequence similarity search algorithm based on a probabilistic interpretation of an alignment scoring system. In: Proceedings of the Conference on Intelligent Systems for Molecular Biology, pp. 44–51 (1996)
3. Cai, Y.D., Chou, K.C.: Artificial neural network model for predicting HIV protease cleavage sites in protein. Advances in Engineering Software 29(2), 119–128 (1998)
4. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm
5. Dayhoff, M., Schwartz, R., Orcutt, B.: A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure 5(suppl. 3), 345–352 (1978)
6. Duin, R.P.W., Juszczak, P., Paclik, P., Pekalska, E., de Ridder, D., Tax, D.M.J.: PRTools, a Matlab Toolbox for Pattern Recognition. Delft University of Technology (2004)
7. Guide, M.R.: The MathWorks, Inc., Natick, MA (1998)
8. Kim, H., Zhang, Y., Heo, Y.S., Oh, H.B., Chen, S.S.: Specificity rule discovery in HIV-1 protease cleavage site analysis. Computational Biology and Chemistry 32(1), 71–78 (2008)
9. Liao, L., Noble, W.S.: Combining Pairwise Sequence Similarity and Support Vector Machines for Detecting Remote Protein Evolutionary and Structural Relationships. Journal of Computational Biology 10(6), 857–868 (2003)
10. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48(3), 443–453 (1970)
11. Oommen, B.J., Kashyap, R.L.: A formal theory for optimal and information theoretic syntactic pattern recognition. Pattern Recognition 31(8), 1159–1177 (1998)
12. Tatusova, T.A., Madden, T.L.: BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiology Letters 174(2), 247–250 (1999)
13. Thomson, R., Hodgman, T.C., Yang, Z.R., Doyle, A.K.: Characterizing proteolytic cleavage site activity using bio-basis function neural networks. Bioinformatics 19(14), 1741–1747 (2003)
14. Trudgian, D.C., Yang, Z.R.: Substitution Matrix Optimisation for Peptide Classification. In: Marchiori, E., Moore, J.H., Rajapakse, J.C. (eds.) EvoBIO 2007. LNCS, vol. 4447, pp. 291–300. Springer, Heidelberg (2007)
15. Zhao, Y., Pinilla, C., Valmori, D., Martin, R., Simon, R.: Application of support vector machines for T-cell epitopes prediction. Bioinformatics 19(15), 1978–1984 (2003)
Joint Tracking of Cell Morphology and Motion

Jierong Cheng1,2, Esther G.L. Koh3, Sohail Ahmed3, and Jagath C. Rajapakse1,2,4

1 BioInformatics Research Center and School of Computer Engineering, Nanyang Technological University, Singapore
[email protected]
2 Singapore-MIT Alliance, Singapore
3 Institute of Medical Biology, Singapore
4 Department of Biological Engineering, Massachusetts Institute of Technology, USA
Abstract. A new method is proposed for joint tracking of cell morphology and motion from 3D temporal cellular images. We adopt the framework of region-based active contours for segmentation, which is able to cope with objects having blurred boundaries. Motion estimation is performed by optical flow to increase robustness and accuracy. Cell morphology and motion are modelled via a unified energy formulation and estimated iteratively by searching for the minimum-energy configuration. Experiments are carried out on synthetic and real cellular images to demonstrate the performance of the method.
Keywords: Cell segmentation, tracking, motion estimation, optical flow, active contours, level sets.
1 Introduction
Three-dimensional (3D) molecular imaging techniques are now capable of generating large time-lapsed image datasets rapidly. Manual segmentation of cells in these images is almost impossible. Automatic segmentation and tracking methods are therefore necessary to enable quantitative understanding and analysis of cellular images. In recent years, many image analysis approaches have been adopted for cell segmentation from microscopic images. Parametric active contours, or snakes, are widely applied in boundary detection and motion tracking [1],[2]. Geometric active contours based on level sets are becoming increasingly popular because they are not constrained by the topology and thus are capable of detecting an arbitrary number of cells from an arbitrary initial front. In the model of active contours without edges [3], the image energy terms are computed using intensity variances inside and outside the contour. This region-based approach provides strong robustness to noise and allows segmentation of cells with blurred edges.
Corresponding author. Esther G.L. Koh is now with the Singapore Immunology Network, Singapore.
The approach has been used for segmenting and tracking cells in 2D images [4], [5], [6] and 3D images [7]. Much work has been devoted to data analysis on time-lapsed cellular images. Existing cell tracking methods can be roughly divided into frame-to-frame linking and mathematical shape modelling methods. Promising methods for cell tracking include hidden Markov modelling [8], dynamic programming optimization [9], the Kalman filter [10], and optical flow [11]. In an overview of cell motility analysis and particle tracking methods [12], it is concluded that region-oriented models are far more robust than purely edge-oriented models. Optical flow is a convenient and useful representation of image motion, and discontinuities in the optical flow can help in segmenting images into regions of different motions. By directly coupling a motion field and a segmentation map, a dense, boundary-preserving optical flow field can be computed without relying on specific constraints at motion boundaries. Joint motion estimation and segmentation combining optical flow with active contours has been applied to photographic images in [13], [14], [15], and [16]. The energy functional proposed by Feghali [14] embeds the constraints of both motion estimation and segmentation, and the spatial-temporal surface evolution is entirely driven by the magnitude of the motion field. However, 3D temporal images are often of low resolution, and the cell boundaries are blurred and noisy, so the existing gradient-based methods have difficulty dealing with them. Based on the method proposed by Feghali, we introduce a regional force in the curve evolution, which proves effective in segmenting cells with blurred edges. This work is motivated by the study of neuronal cell growth. In order to obtain a basic understanding of cell structure development, cell morphology and motion need to be tracked accurately from 3D temporal cellular images. It is of great interest to know how neuronal stem cells grow into mature neurons, and quantitative measurements are important as the cells are allowed to grow. We propose an automatic technique to quantify the growth of cells by simultaneously tracking cell morphology and motion. The proposed method is more robust and accurate for 3D cell segmentation than the existing methods, by combining optical flow and region-based active contours. The paper is organized as follows. In Section 2, a short overview of active contours and optical flow is given and our combined model is described. Experimental results are shown in Section 3 and the conclusion is given in Section 4.
2 Method
2.1 Region-Based Active Contours
Active contours compute segmentations of a given image by evolving contours in the direction of the negative gradient of image energy. Traditional active contours make use of image gradient to localize object boundaries. For cell segmentation from 3D microscopic images, we choose to adopt the model of active contours without edges [3]. In this region-based model, image energy is computed from
surface integrals over the entire image. Therefore, region-based models are robust to noise and allow segmentation of objects with blurred edges. The active contours are implicitly represented by a single level set function, and the topology of objects changes automatically as the level set function evolves. This enables automatic detection of an arbitrary number of cells from an arbitrary initial front. Let I(ω, t) be the image intensity at location ω = (x, y, z) ∈ Ω at time t, Ω ⊂ R³ being the 3D image domain. φ(ω, t) is a level set function defined on Ω, whose zero-level set {ω ∈ Ω | φ(ω, t) = 0, ∀t} defines the segmentation such that φ > 0 inside the segmented objects and φ < 0 outside. The energy function of the active contour is based on a reduced form of the Mumford-Shah functional for image segmentation (we drop ω for notational simplicity):

    E(\phi, I) = \alpha \int_\Omega \delta(\phi)\,|\nabla\phi|\,d\omega + \lambda_I \int_\Omega H(\phi)\,(I - \mu_I)^2\,d\omega + \lambda_O \int_\Omega (1 - H(\phi))\,(I - \mu_O)^2\,d\omega    (1)
where dω is the elementary volume; μ_I and μ_O are respectively the mean intensities of the pixels inside and outside the zero level set; H and δ are the Heaviside and Dirac functions; and λ_I, λ_O, and α are positive constants. The minimization of the image energy is accomplished by letting the level set function evolve as a function of an abstract time t, starting from an initialization φ(ω, t = 0), according to:

    \frac{\partial \phi}{\partial t} = \left[ \alpha \nabla \cdot \frac{\nabla\phi}{|\nabla\phi|} - \lambda_I (I - \mu_I)^2 + \lambda_O (I - \mu_O)^2 \right] \delta(\phi).    (2)
Here ∇ · (∇φ/|∇φ|) is the (mean) curvature of the level set, generating a regularizing force which smoothes the contours. The other two forces on the right-hand side move the contour toward the actual boundary of the objects. A simplified sketch of one such update is given below.
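For intuition, one explicit update of equation (2) on a 2D slice might look like the following sketch; the curvature term is approximated by a Laplacian and the Dirac delta by a smoothed version, so this is illustrative rather than the implementation used in the paper.

    import numpy as np

    def chan_vese_step(phi, I, alpha=0.0, lam_i=1.0, lam_o=1.0, dt=0.1, eps=1.0):
        # Region statistics: mean intensity inside (phi > 0) and outside the contour.
        inside = phi > 0
        mu_i = I[inside].mean() if inside.any() else 0.0
        mu_o = I[~inside].mean() if (~inside).any() else 0.0
        # Smoothed Dirac delta confines the update to a band around the zero level set.
        delta = eps / (np.pi * (eps ** 2 + phi ** 2))
        # Laplacian as a rough stand-in for the curvature term of equation (2).
        lap = (np.roll(phi, 1, 0) + np.roll(phi, -1, 0) +
               np.roll(phi, 1, 1) + np.roll(phi, -1, 1) - 4.0 * phi)
        force = alpha * lap - lam_i * (I - mu_i) ** 2 + lam_o * (I - mu_o) ** 2
        return phi + dt * delta * force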
2.2 Optical Flow Field
Let u(ω, t) and v(ω, t) be the x and y velocity components of the apparent motion of the brightness patterns of the image, respectively. The classical regularization of the optical flow field used by Horn and Schunk [17] is based on the quadratic functional

    \min_{(u,v)} \int_\Omega (u_x^2 + u_y^2 + v_x^2 + v_y^2)\,d\omega    (3)

subject to the optical flow constraint

    I_x u + I_y v + I_t = 0.    (4)
where u_x = ∂u/∂x, u_y = ∂u/∂y, v_x = ∂v/∂x, and v_y = ∂v/∂y are the optical flow gradients; I_x, I_y, and I_t are the intensity gradients in the x, y, and temporal directions,
respectively. Horn and Schunk's smoothness constraint seeks to improve some measure of consistency among flow vectors that are close to each other. The energy function combining the smoothing error and the image motion error is

    E(I, u, v) = \int_\Omega \left[ \beta (u_x^2 + u_y^2 + v_x^2 + v_y^2) + (I_x u + I_y v + I_t)^2 \right] d\omega    (5)
2.3 Joint Tracking of Cell Morphology and Motion
We propose the following energy formulation for joint tracking of cell morphology and motion:

    E(\phi, I, u, v) = \alpha \int_\Omega \delta(\phi)\,|\nabla\phi|\,d\omega + \lambda_I \int_\Omega H(\phi)\,(I - \mu_I)^2\,d\omega + \lambda_O \int_\Omega (1 - H(\phi))\,(I - \mu_O)^2\,d\omega + \int_\Omega \left[ \beta (u_x^2 + u_y^2 + v_x^2 + v_y^2) + (I_x u + I_y v + I_t)^2 \right] d\omega + \gamma_I \int_\Omega H(\phi)\, \frac{1}{1 + u^2 + v^2}\, d\omega + \gamma_O \int_\Omega (1 - H(\phi))\,(u^2 + v^2)\, d\omega    (6)

where γ_I and γ_O are positive constants. The last two terms on the right-hand side of (6) correspond to motion segmentation constraints: a large motion field is expected in the spatiotemporal region that encloses the moving objects, and a small motion field in the region that encloses the background. The evolution equations for (φ, u, v) are derived by applying the Euler-Lagrange equations to (6). The level set evolution is given by the following descent equation:

    \frac{\partial \phi}{\partial t} = \left[ \alpha \nabla \cdot \frac{\nabla\phi}{|\nabla\phi|} - \lambda_I (I - \mu_I)^2 + \lambda_O (I - \mu_O)^2 - \gamma_I \frac{1}{1 + u^2 + v^2} + \gamma_O (u^2 + v^2) \right] \delta(\phi).    (7)

The minimization of the energy function with respect to the motion field (u, v) yields a set of two descent equations that can be solved iteratively by
    u = \frac{\beta (I_y^2 + \beta + \varepsilon)\,\bar{u} - \beta I_x I_y\,\bar{v} - (\beta + \varepsilon) I_x I_t}{(\beta + \varepsilon)^2 + (\beta + \varepsilon)(I_x^2 + I_y^2)}    (8)

    v = \frac{\beta (I_x^2 + \beta + \varepsilon)\,\bar{v} - \beta I_x I_y\,\bar{u} - (\beta + \varepsilon) I_y I_t}{(\beta + \varepsilon)^2 + (\beta + \varepsilon)(I_x^2 + I_y^2)}    (9)

where

    \varepsilon = -\frac{2\gamma_I H(\phi)}{1 + \bar{u}^2 + \bar{v}^2} + 2\gamma_O (1 - H(\phi))    (10)

and

    \bar{u}(\omega, t) = \frac{\sum_{(\omega_i, t) \in N_\phi(\omega, t)} u(\omega_i, t)}{|N_\phi(\omega, t)|},
with N_\phi(\omega, t) = \{\omega_i \in N(\omega, t) : \mathrm{sign}[\phi(\omega_i, t)] = \mathrm{sign}[\phi(\omega, t)]\}, N(ω, t) being the neighborhood of (ω, t). In other words, ū(ω, t) is the average taken only over the region of the moving objects or that of the static background. The expression for v̄ is defined similarly. The proposed algorithm is summarized in Algorithm 1.

Algorithm 1. Joint tracking of cell morphology and motion
begin
  t = 0
  initialize φ(ω, t) as a small circle at the center of the image
  u(ω, t) = v(ω, t) = 0, ∀ω ∈ Ω
  repeat
    for all ω ∈ Ω do
      update u(ω, t) and v(ω, t) according to (8) and (9)
      update φ(ω, t) according to (7)
    end for
    t = t + 1
  until convergence
end
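To make the level-set half of Algorithm 1 concrete, here is a NumPy sketch of one descent step of Eq. (7) on a 2D slice, using a smoothed Heaviside/Dirac pair; the regularisation width eps and the time step dt are illustrative choices, while the default weights are the values reported below for the synthetic experiments. In the full algorithm this step alternates with the flow updates (8)-(10) (the Horn-Schunck sketch above, corrected by the ε term).

import numpy as np

def smoothed_heaviside(phi, eps=1.0):
    """Smoothed Heaviside H and its derivative (Dirac delta)."""
    H = 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(phi / eps))
    delta = (eps / np.pi) / (eps**2 + phi**2)
    return H, delta

def curvature(phi, tiny=1e-8):
    """Mean curvature div(grad(phi)/|grad(phi)|) via central differences."""
    gy, gx = np.gradient(phi)
    norm = np.sqrt(gx**2 + gy**2) + tiny
    return np.gradient(gy / norm, axis=0) + np.gradient(gx / norm, axis=1)

def evolve_phi(phi, I, u, v, alpha=0.0, lam_i=1.0, lam_o=1.0,
               gamma_i=0.05, gamma_o=0.27, dt=0.1):
    """One descent step of the level-set evolution (7)."""
    _, delta = smoothed_heaviside(phi)
    inside, outside = phi > 0, phi <= 0
    mu_i = I[inside].mean() if inside.any() else 0.0    # mean intensity inside
    mu_o = I[outside].mean() if outside.any() else 0.0  # ... and outside
    force = (alpha * curvature(phi)
             - lam_i * (I - mu_i)**2 + lam_o * (I - mu_o)**2
             - gamma_i / (1.0 + u**2 + v**2)
             + gamma_o * (u**2 + v**2))
    return phi + dt * delta * force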
3 Experimental Results
Experiments are performed on synthetic images as well as on real 3D temporal cellular images. Our results are compared with the method proposed by Feghali [14].

3.1 Synthetic Data
In the original synthetic image shown in Fig. 1(a), the intensity on the square-shaped object is set to the Euclidean distance to the object boundary. This is intended to simulate real microscopic images, where the intensity contrast at the cell boundary is relatively low. The image at the next time point is generated by moving the same object one pixel to the bottom-right. An optical flow field (Fig. 1(d)) is estimated from the two noise-free images. The noise-free image and the corresponding optical flow field are regarded as ground truth for object segmentation and motion estimation. Gaussian noise with a mean of 0 and a standard deviation of 0.02 is added to the images. The parameter values, chosen heuristically for our experiments on the synthetic images, are: α = 0, λI = 1, λO = 1, β = 0.01, γI = 0.05, and γO = 0.27. From Fig. 1 we find that the segmentation by the proposed method is smoother and closer to the ground truth than that by the method by Feghali, as is the estimated optical flow field. Two error metrics are adopted to compare the segmentation results to the ground truth: the mean absolute difference estimates the average disagreement between the object boundaries, and the Jaccard similarity coefficient (intersection over union) compares the areas of the segmented objects.
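A sketch of the two segmentation measures for binary masks follows; the boundary-distance routine is one plausible reading of the paper's mean absolute difference between object boundaries, not necessarily the authors' exact definition.

import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def jaccard(seg, truth):
    """Jaccard similarity coefficient (intersection over union)."""
    seg, truth = seg.astype(bool), truth.astype(bool)
    union = np.logical_or(seg, truth).sum()
    return np.logical_and(seg, truth).sum() / union if union else 1.0

def mean_boundary_distance(seg, truth):
    """Average distance (pixels) from the segmented boundary to the true one."""
    seg, truth = seg.astype(bool), truth.astype(bool)
    seg_boundary = seg & ~binary_erosion(seg)
    truth_boundary = truth & ~binary_erosion(truth)
    dist_to_truth = distance_transform_edt(~truth_boundary)
    return dist_to_truth[seg_boundary].mean()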
Fig. 1. Comparison results on synthetic image. (a) Original intensity image, (b) segmentation on noisy image by the method by Feghali, (c) segmentation on noisy image by the proposed method, (d) magnitude of motion field estimated from original intensity image, (e) magnitude of motion field estimated from noisy image by the method by Feghali, (f) magnitude of motion field estimated from noisy image by the proposed method.
The value ranges from 0 (no overlap) to 1 (total overlap); thus, the higher the value, the better the agreement. The measurement results at different levels of noise are plotted in Figs. 2 and 3.

3.2 Neuronal Data
The cells in our neuronal images are ND7 cells, a hybrid of mouse neuroblastoma and rat dorsal root ganglia. The purpose of the experiment is to characterize the dynamic properties of the filopodia. These highly mobile projections are continuously extended and retracted from the cell body. Cells were seeded on plastic petri dishes and transfected with an expression plasmid for cherry-actin. The next day, cells were harvested and embedded in a collagen gel matrix (0.4 mg/mL collagen) in a glass-bottomed petri dish of 35 mm diameter. One mL of DMEM (without phenol red) supplemented with 5% fetal calf serum was added to the dish after the gel had solidified. The petri dish of cells was incubated on the
Fig. 2. Mean absolute differences at different noise levels on synthetic images
Fig. 3. Jaccard similarity coefficients at different noise levels on synthetic images
microscope for 30 minutes to stabilize the temperature and CO2 before images were captured. The imaging was performed on an Olympus IX81 equipped with a 60× 1.20 UPLANSAPO objective. The images were acquired by a Photometrics QuantEM:512SC and the MetaMorph Imaging Software from Molecular Devices. The stack size of our 3D neuronal images is 281 × 233 × 10 voxels and the spatial
Fig. 4. Comparison results on cellular image. (a) First image of the temporal sequence, (b) segmentation by the method by Feghali, (c) segmentation by the proposed method, (d) second image of the temporal sequence, (e) magnitude of motion field estimated by the method by Feghali, (f) magnitude of motion field estimated by the proposed method.
resolution is 0.275 × 0.275 × 0.5 μm/pixel. The temporal sequence contains 31 stacks, taken at an interval of 10 seconds. The position of the cell is fixed, but the cell morphology changes due to dendrite growth and movement. In this case, the estimated motion field indicates the movement of each voxel within the cell. The parameter values for the experiments on the neuronal images are: α = 0, λI = 1, λO = 1, β = 0.01, γI = 0.02 and γO = 0.025. Fig. 4(a) and Fig. 4(d) show the middle sections from two successive stacks. The segmentation by the method by Feghali sometimes misses the cell boundary and contains singularities inside the cell. In contrast, the proposed method is more robust to noise and finds the cell boundary accurately in spite of the low contrast; meanwhile, an accurate motion field is obtained.
4 Conclusion
We proposed a new method for joint tracking of cell morphology and motion from 3D temporal cellular images. Optical flow and region-based active contour methods are combined via a unified energy formulation and solved iteratively. Motion estimation and segmentation are coupled directly so that the temporal dimension is taken into account. In addition, we make use of regional information in additional energy terms. The advantage of this method is that the segmentation relies on the average intensity of different regions rather than on the presence of a strong edge around the object boundary. This increases the robustness and accuracy of our method. Experimental results on synthetic and real images demonstrate that the proposed method is able to give smooth and precise segmentations of noisy images with blurred boundaries. Another advantage of our method is that it is highly automatic. No human intervention is required during the process, except that some parameter adjustments may be needed for the processing of a new dataset. Our method enables the quantitative analysis of cell morphology and motion from large amounts of 3D temporal cellular images. The method can help the automatic detection and separation of dendrites from the cell body. The lengths of dendrites, the number of branches, and other quantitative features can be measured further from the segmentation results. The robustness and the accuracy of the method are vital for biological studies that make inferences about cell morphology and growth. In our method, we assume that the motion of voxels within the cell is always larger than the motion of the background. This assumption is roughly true since the motion is large near the cell boundary, where growth is likely to occur. However, there could be near zero-motion regions inside the cell, and these will cause holes in the segmentation. In future studies, we will incorporate an adaptive hole-filling algorithm into our method.
References
1. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: active contour models. International Journal of Computer Vision 1(4), 321–331 (1987)
2. Meegama, R.G.N., Rajapakse, J.C.: Nurbs snakes. Image and Vision Computing 21, 551–562 (2003)
3. Chan, T.F., Vese, L.A.: Active contours without edges. IEEE Trans. Image Processing 10(2), 266–277 (2001)
4. Zhang, B., Zimmer, C., Olivo-Marin, J.-C.: Tracking fluorescent cells with coupled geometric active contours. In: Proc. IEEE Int'l Symp. Biomedical Imaging (ISBI), pp. 476–479 (2004)
5. Cheng, J., Rajapakse, J.C.: Segmentation of clustered nuclei with shape markers and marking function. IEEE Trans. Biomedical Engineering 53(3) (2009)
6. Yu, W., Lee, H.K., Hariharan, S., Bu, W., Ahmed, S.: Quantitative neurite outgrowth measurement based on image segmentation with topological dependence. Cytometry Part A 75A(4), 289–297 (2008)
7. Dufour, A., Shinin, V., Tajbakhsh, S., Guillen, N., Olivo-Marin, J.-C., Zimmer, C.: Segmenting and tracking fluorescent cells in dynamic 3-D microscopy with coupled active surfaces. IEEE Trans. Image Processing 14(9), 1396–1410 (2005)
8. Althoff, K., Degerman, J., Wåhlby, C., Thorlin, T., Faijerson, J., Eriksson, P.S., Gustavsson, T.: Time-lapse microscopy and classification of in vitro cell migration using hidden markov modeling. In: Proc. IEEE Int'l Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 5, pp. 1165–1168 (2006)
9. Sage, D., Neumann, F.R., Hediger, F., Gasser, S.M., Unser, M.: Automatic tracking of individual fluorescence particles: Application to the study of chromosome dynamics. IEEE Trans. Image Processing 14(9), 1372–1383 (2005)
10. Yang, X., Li, H., Zhou, X.: Nuclei segmentation using marker-controlled watershed, tracking using mean-shift, and kalman filter in time-lapse microscopy. IEEE Trans. Circ. Sys. -I 53(11), 2405–2414 (2006)
11. Melani, C., Campana, M., Lombardot, B., Rizzi, B., Veronesi, F., Zanella, C., Bourgine, P., Mikula, K., Peyrieras, N., Sarti, A.: Cells tracking in a live zebrafish embryo. In: Proc. 29th Annual International Conference of IEEE Engineering in Medicine and Biology Society, pp. 1631–1634 (2007)
12. Olivo-Marin, J.-C.: An overview of image analysis in multidimensional biological microscopy. In: Proc. IEEE Int'l Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 5, pp. 1173–1176 (2006)
13. Mitiche, A., Feghali, R., Mansouri, A.: Motion tracking as spatio-temporal motion boundary detection. Robotics and Autonomous Systems 43, 39–50 (2003)
14. Feghali, R.: Multi-frame simultaneous motion estimation and segmentation. IEEE Trans. Consumer Electronics 51(1), 245–248 (2005)
15. Mitiche, A., Sekkati, H.: Optical flow 3d segmentation and interpretation: A variational method with active curve evolution and level sets. IEEE Trans. Pattern Anal. Machine Intell. 28(11), 1818–1829 (2006)
16. Sekkati, H., Mitiche, A.: Joint optical flow estimation, segmentation, and 3d interpretation with level sets. Computer Vision and Image Understanding 103(2), 89–154 (2006)
17. Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artif. Intell. 23, 185–203 (1981)
Multiclass Microarray Gene Expression Analysis Based on Mutual Dependency Models

Girija Chetty1 and Madhu Chetty2

1 Faculty of Information Sciences and Engineering, University of Canberra, ACT, Australia
2 Faculty of Information Technology, Monash University, Victoria, Australia
[email protected],
[email protected]
Abstract. In this paper a novel feature selection technique based on mutual dependency modelling between genes is proposed for multiclass microarray gene expression classification. Several studies on the analysis of gene expression data have shown that genes (whether or not they belong to the same gene group) get co-expressed via a variety of pathways. Further, a gene may participate in multiple pathways that may or may not be co-active for all samples. It is therefore biologically meaningful to simultaneously divide genes into functional groups and samples into co-active categories. This can be done by modelling gene profiles for multiclass microarray gene datasets with mutual dependency models, which capture complex gene interactions. Most of the current work in multiclass microarray gene expression studies is based on statistical models with little or no consideration of gene interactions. This has led to a lack of robustness and overly optimistic estimates of accuracy and noise reduction. In this paper, we propose multivariate analysis techniques which model the mutual dependency between the features and take into account complex interactions for extracting a subset of genes. The two techniques, cross modal factor analysis (CFA) and canonical correlation analysis (CCA), show a significant reduction in dimensionality and class-prediction error, and an improvement in classification accuracy for multiclass microarray gene expression datasets.
1 Introduction

Molecular classification involves the classification of tumour samples into groups of biological phenotypes [1, 2, 3, 4]. Studies on molecular classification have great significance for cancer diagnosis. Molecular classification of tumour samples from patients into different molecular types or subtypes is vital for diagnosis, prognosis, and effective treatment of cancer [3, 4, 5]. Traditionally, such classification relies on observations regarding the location and microscopic appearance of the cancerous cells. These methods have proven to be slow and ineffective, and there is no way of predicting with reliable accuracy the progress of the disease, since tumours of similar appearance have been known to take different paths in the course of time. Some tumours may grow aggressively after the point of the previous observations, and hence require equally aggressive treatment regimes. Other tumours may stay inactive and thus require no treatment at all [2, 3, 5]. Since cancer treatment often
produces adverse side effects on patients, patients whose tumours are predicted to stay inactive should be spared the unnecessary treatment. The problem is the risk involved in withholding treatment when the classification method (used to predict the aggressiveness of the tumour) is not reliable. Some tumours are particularly resistant to the more commonly prescribed anticancer drugs, while others are not. Predicting resistance to anticancer drugs will also ensure the optimal treatment regime for each patient. A patient predicted to be resistant to the more commonly prescribed anticancer drugs can then be prescribed alternative anticancer drugs, or can be recommended as a potential candidate for clinical trials of new anticancer drugs. As a genome is not just a collection of genes working in isolation, but rather encompasses the global and highly coordinated control of information to carry out a range of cellular functions [1], any cellular activity requires elaborate patterns of gene interaction to marshal appropriate processes. In addition, the genome also incorporates information that controls when and where the parts of living organisms should be made. Therefore, it is imperative to conduct proper genome-wide studies so as to facilitate: 1. an effective identification of correlated genes, and 2. a better understanding of the mechanisms underlying gene transcription and regulation. The expression of several thousands of genes can be measured simultaneously by DNA microarrays. With the advent of microarray technology, data regarding the gene expression levels in each tumour sample proved to be a useful tool in molecular classification [2,3,4]. Microarrays have been effectively used to classify clinical samples, to investigate the mechanism of drug action and to examine the effects of drugs on gene expression in various organisms. The advantage of microarrays is that gene expression analysis is computationally less demanding than sequencing. Furthermore, machine learning and statistical analysis tools for expression profiling have become more mature and cost-effective. However, microarrays also have their own limitations. In particular, when the data are very noisy and contain artefacts, gene prediction can be very difficult. Moreover, the feature dimensions of the genes are usually too large (causing a large search space) while the numbers of samples are too small (causing statistical errors). The problem of high feature dimensionality and small sample size has been addressed by several feature selection techniques in the literature [1,2]. Due to the large number of genes in a typical microarray dataset, feature selection plays an important role in reducing noise and computational cost in gene expression based tissue classification while improving accuracy at the same time. However, the current feature selection techniques have not quite resulted in an appreciable noise reduction or accuracy improvement, particularly for multiclass microarray datasets. This could be because many current feature selection techniques applied to microarray datasets do not take into account accurate feature dependencies or complex interactions between the genes, resulting in incorrect and overly optimistic estimates of accuracy. Only a relatively small number of genes (out of the thousands) monitored in microarray experiments actually influence the biological state of interest (such as tumour type or subtype, or resistance to anticancer drugs).
Since the majority of genes are not relevant (i.e., they do not supply useful information for distinguishing among
samples of different classes [2, 3, 4, 5]), adding them to the reduced feature set or predictor set will not increase the multiclass classification accuracy. In fact, doing so will increase classifier complexity, and will also increase the noise in the classifier (and therefore decrease accuracy).
2 Role of Optimal Feature Selection Techniques

In general, the objective of feature selection techniques is to find, from an overall set of N features, an optimal subset of features S that gives the best classification accuracy. This reduced feature subset is also known as the predictor set; its size |S| is generally much smaller than N. Some of the important reasons for using good feature selection approaches are [2, 3, 10, 11]:

a) to gain a better understanding of the data and an insight into the way the selected features or genes affect the phenotypes of the samples;
b) to reduce noise, overfitting and classifier complexity.
Identifying the members of the predictor set can indicate the genes involved in the biological pathways which are responsible for the observed biological state of the sample (i.e., the class membership of the sample). This information is important to the field of pharmacological gene therapy, where drugs are designed to target specific genes in order to achieve the desired biological state (e.g., from a highly aggressive tumour to a less aggressive tumour). Only a relatively small number of genes out of the thousands that are monitored in microarray experiments actually influence the biological state of interest (such as tumour type or subtype, or resistance to anticancer drugs). One of the significant works on extracting a predictor set using different feature selection techniques for the multiclass microarray gene expression problem is reported in [2, 12, 13, 14]. Here the authors proposed simple correlation-based criteria such as relevance and redundancy for the formation of the predictor set. For addressing multiclass scenarios, a third criterion, called the differential prioritization criterion, was used, which assigned higher priority to maximizing relevance than to minimizing redundancy. The degree of differential prioritization (DDP) measure in this criterion ensured that the optimal balance between relevance and redundancy was achieved in the multiclass microarray gene expression classification problem [2, 12, 14]. The DDP measure also allowed increasing the importance of minimizing redundancy as the number of classes increases. For instance, in order to achieve the best accuracy, minimizing redundancy in a 14-class problem can be considered more important than minimizing redundancy in a two-class problem. Further, this technique was extended by developing another measure called antiredundancy, which was used in conjunction with the DDP measure. The DDP criterion, along with the antiredundancy measure, resulted in a unique ability to differentially prioritize the optimization of relevance against redundancy (and vice versa), ensuring optimal accuracy for the multiclass microarray data analysis problem. However, the authors of this work [2] did not consider feature dependencies or complex gene interactions in extracting the predictor set. Hence, the evaluation results reported for the joint DDP-antiredundancy technique in [2] seemed
to be overly optimistic, though it provided good insight into the multiclass microarray problem. In this paper, we propose new feature selection techniques for extracting the predictor set based on mutual dependency models, which model the feature dependencies and complex gene interactions using multivariate analysis techniques. The proposed technique is expected to enhance the performance of multiclass microarray gene expression classification. An evaluation of the proposed mutual dependency modelling techniques on several multiclass microarray gene expression datasets showed a significant improvement in dimensionality reduction, deviation error and classification accuracy. The rest of the paper is organized as follows. The next section describes the approach used for feature dependency modelling and the proposed multivariate analysis techniques. The details of the experiments and the performance evaluation for different multiclass microarray datasets are discussed in Section 4, and the paper concludes with some conclusions and a plan for further work in Section 5.
3 Mutual Dependency Models

We examine two different cross modal analysis (CMA) techniques based on multivariate statistical analysis for modelling the feature dependencies: cross modal factor analysis (CFA) and canonical correlation analysis (CCA). Our contribution in this paper is to point out that CMA (CFA/CCA) can be used to extract an optimal predictor set that takes into consideration gene dependencies and complex interactions. Commonalities in data sources or genes are exploited by these methods, which search for statistical dependencies between them. Methods that model mutual dependencies tend to find the optimal transformations that can best represent or identify the coupled patterns between the features of the two different subsets. For the CFA technique, the following optimization criterion can be used to obtain the optimal transformations: given two mean-centred matrices X and Y, which consist of row-by-row coupled samples from two subsets of features, we want orthogonal transformation matrices A and B that minimise
\|XA - YB\|_F^2, \qquad \text{where } A^T A = I \text{ and } B^T B = I.

Here \|M\|_F denotes the Frobenius norm of the matrix M and can be expressed as

\|M\|_F = \left( \sum_i \sum_j m_{ij}^2 \right)^{1/2}.
The earliest method was classical linear Canonical Correlation Analysis (CCA) [15], which has later been extended to nonlinear variants and more general methods that maximize mutual information instead of correlation. In other words, A and B define two orthogonal transformation spaces where coupled data in X and Y can be projected as close to each other as possible.
Since we have

\|XA - YB\|_F^2 = \mathrm{trace}\left( (XA - YB)(XA - YB)^T \right)
= \mathrm{trace}\left( XAA^T X^T + YBB^T Y^T - XAB^T Y^T - YBA^T X^T \right)
= \mathrm{trace}(XX^T) + \mathrm{trace}(YY^T) - 2\,\mathrm{trace}(XAB^T Y^T),
where the trace of a matrix is defined to be the sum of its diagonal elements, we can easily see from the above that matrices A and B which maximise trace(XAB^T Y^T) will minimise the expression. It can be shown that such matrices are given by

A = S_{xy}, \qquad B = D_{xy},

where

X^T Y = S_{xy} \cdot V_{xy} \cdot D_{xy}^T

is the singular value decomposition of X^T Y.
With the optimal transformation matrices A and B, we can calculate the transformed versions of X and Y as

\tilde{X} = X \cdot A, \qquad \tilde{Y} = Y \cdot B.

Corresponding vectors in \tilde{X} and \tilde{Y} are thus optimised to represent the coupled relationships between the two feature subsets without being affected by distribution patterns within each subset. Traditional Pearson correlation or mutual information calculations [15] can then be performed on the first and most important k corresponding vectors in \tilde{X} and \tilde{Y}, which preserve the principal coupled patterns in much lower dimensions. In addition to feature dimension reduction, feature selection capability is another advantage of CFA: the weights in A and B automatically reflect the significance of individual features. Following the development of the CFA technique, we can adopt a different optimization criterion for the canonical correlation analysis (CCA) method: instead of minimizing the projected distance, we attempt to find transformation matrices A and B that maximise the correlation between XA and YB. This can be described more specifically using the following mathematical formulation: given two mean-centred matrices X and Y as defined in the previous section, we seek matrices A and B such that
\mathrm{correlation}(XA, YB) = \mathrm{correlation}(\tilde{X}, \tilde{Y}) = \mathrm{diag}(\lambda_1, \ldots, \lambda_i, \ldots, \lambda_l),

where \tilde{Y} = Y \cdot B, and 1 \ge \lambda_1 \ge \ldots \ge \lambda_i \ge \ldots \ge \lambda_l \ge 0; \lambda_i represents the largest possible correlation between the i-th translated features in \tilde{X} and \tilde{Y}. Note that A and B are only determined up to a constant factor. A statistical method called canonical correlation analysis [15] can solve the above problem with additional norm and orthogonality constraints on the translated features:
E\{\tilde{X}^T \tilde{X}\} = I \quad \text{and} \quad E\{\tilde{Y}^T \tilde{Y}\} = I.

In CCA, A and B are calculated as

A = \Sigma_{xx}^{-1/2} \cdot S_K \qquad \text{and} \qquad B = \Sigma_{yy}^{-1/2} \cdot D_K,

where

\Sigma_{xx} = E\{X^T X\}, \quad \Sigma_{yy} = E\{Y^T Y\}, \quad \Sigma_{xy} = E\{X^T Y\},

and

L = \Sigma_{xx}^{-1/2}\, \Sigma_{xy}\, \Sigma_{yy}^{-1/2} = S_K \cdot V_K \cdot D_K^T.
The calculation of the inverse matrices requires that no linear dependency exists between any two vectors within X or Y. The major differences between CCA and CFA include:

• The transformations provided by CFA are orthogonal, while this is not necessarily true for CCA. A and B given by CFA satisfy A^T A = I and B^T B = I, where I is the identity matrix. CCA, however, does not provide such orthogonal transformations in most cases.
• CFA favours coupled patterns with high variations (i.e., large amplitude changes), while CCA is more sensitive to highly coupled, but low-variation, patterns. This is mainly due to the whitening of X and Y in CCA by calculating \Sigma_{xx}^{-1/2} and \Sigma_{yy}^{-1/2}.

The optimization criteria used for both CFA and CCA exhibit a high degree of noise tolerance. Hence the extracted correlation features perform better than normal Pearson correlation analysis under noisy environmental conditions.
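A compact NumPy sketch of both transforms as derived above is given below; X and Y are assumed to be mean-centred as in the text, and the small ridge term and the choice of k are illustrative additions rather than part of the original formulation.

import numpy as np

def cfa(X, Y, k):
    """Cross modal factor analysis: orthogonal A, B minimising ||XA - YB||_F,
    obtained from the singular vectors of X^T Y; only k components are kept."""
    S, _, Dt = np.linalg.svd(X.T @ Y, full_matrices=False)
    A, B = S[:, :k], Dt.T[:, :k]
    return X @ A, Y @ B, A, B

def cca(X, Y, k, ridge=1e-6):
    """Linear CCA via whitening: L = Sxx^{-1/2} Sxy Syy^{-1/2} = S V D^T."""
    n = X.shape[0]
    Sxx = X.T @ X / n + ridge * np.eye(X.shape[1])
    Syy = Y.T @ Y / n + ridge * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n

    def inv_sqrt(M):
        w, V = np.linalg.eigh(M)
        return V @ np.diag(1.0 / np.sqrt(np.clip(w, ridge, None))) @ V.T

    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)
    Sk, corr, Dkt = np.linalg.svd(Wx @ Sxy @ Wy, full_matrices=False)
    A, B = Wx @ Sk[:, :k], Wy @ Dkt.T[:, :k]
    return X @ A, Y @ B, corr[:k]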
The CCA- or CFA-transformed features form the predictor set, or reduced feature set, for training and testing on the multiclass microarray datasets. The dimensionality of the predictor set should be high enough to preserve most of the shared variation and yet low enough to avoid overfitting. Ideally, an optimal dimensionality should be sought. Though a sophisticated optimization criterion could be used for finding the optimal dimensionality of the predictor set, we found that an approach based on empirical and experimental observations was quite satisfactory. This is because the first few CCA or CFA components normally contain most of the reliable shared variation among the datasets, while the last components may actually represent just noise, and thus dropping some of the dimensions makes the method more robust. The details of extracting the predictor set are described in the next section.
4 Experiments and Results

The performance of the mutual dependency models based on cross modal analysis techniques (CFA/CCA) was evaluated on five different multiclass microarray datasets. Further, as a baseline comparison, all the experiments were carried out with a predictor set obtained by the standard principal component analysis technique, which is one of the most popular multivariate analysis techniques for dimensionality reduction [2, 15]. The five multiclass microarray datasets used as benchmark datasets were:

• The PDL dataset [10], which consists of 6 classes, each class representing a diagnostic group of childhood leukemia.
• The SRBC dataset [11], consisting of 4 subtypes of small, round, blue cell tumors (SRBCTs).
• The Lung dataset [12], which is a 5-class dataset, with 4 classes as subtypes of lung cancer and the fifth class consisting of normal samples.
• The MLL dataset [13], which contains 3 subtypes of leukemia: ALL, MLL and AML.
• The AML/ALL dataset [14], which also contains 3 subtypes of leukemia: AML, B-cell and T-cell ALL.
The dimensionality of the predictor sets, ranging from size P = 2 to P = Pmax, was obtained by an empirical experimental technique. In this technique, we increase the dimensionality one at a time, testing with a randomization test that the new dimension captures shared variation. To protect against overfitting, all estimates of captured variation were computed using a validation set, i.e., on data that had not been used when computing the components (dimensions). The randomization test essentially compares the shared variance along the new dimension to the shared variance we would get under the null hypothesis of mutual independence. When the shared variance does not differ significantly from the null hypothesis, the final dimensionality has been reached. It was observed that this technique works quite well and, in fact, the dimensionality of the predictor set is significantly better compared to some of the previous studies [2]. This could be due to the inherently superior modelling and dimensionality reduction capability of the mutual dependency models (the CCA/CFA methods). As can be seen in Table 1, as the number of classes in the datasets increases (from K = 3 to K = 5), it was possible to get better estimates of accuracy with a maximum predictor set dimension of Pmax = 10 to Pmax = 20, which is a significant reduction in the predictor set dimensionality compared to previous studies [2, 3]. Two feature selection experiments were run on each dataset: one using the predictor set based on CFA features (cross modal factor analysis features) and the other using the predictor set based on CCA features (canonical correlation analysis features). The DAGSVM classifier was used in evaluating the performance of all resulting predictor sets from both experiments. The DAGSVM is an all-pairs SVM-based multiclassifier which uses substantially less training time compared to neural networks, and has been shown to produce good accuracy in some of the previous studies [2, 16].
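The randomization test described above can be sketched as follows; the permutation count, the significance level and the use of squared correlation as the "shared variance" statistic are illustrative choices made here, not specifics stated in the paper.

import numpy as np

def choose_dimensionality(Xv, Yv, A, B, n_perm=200, alpha=0.05, seed=0):
    """Add components while the held-out shared variance of the paired
    projections beats the row-permutation (mutual independence) null."""
    rng = np.random.default_rng(seed)
    P = 0
    for i in range(A.shape[1]):
        x, y = Xv @ A[:, i], Yv @ B[:, i]
        observed = np.corrcoef(x, y)[0, 1] ** 2
        null = np.array([np.corrcoef(rng.permutation(x), y)[0, 1] ** 2
                         for _ in range(n_perm)])
        if (null >= observed).mean() > alpha:   # not significant: stop here
            break
        P += 1
    return P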
Table 1. Dimensionality of Predictor set for different benchmark datasets
Dataset   Type         N       K   Pmax
PDL       Affymetrix   12011   5   25
Lung      Affymetrix   1741    5   20
SRBC      cDNA         2308    4   15
MLL       Affymetrix   8681    3   12
ALL       Affymetrix   3571    3   10
N is the number of CMF(CCA/CFA) features. K is the number of classes in the dataset.
To evaluate the performance of the proposed CFA/CCA-based predictor sets, we used two measures: classification accuracy and the class-prediction error in class accuracy. For each class, the class accuracy denotes the ratio of correctly classified samples of that class to the class size in the test set. The class-prediction error in class accuracies is the difference between the best class accuracy and the worst class accuracy among the K class accuracies in a K-class dataset. In the ideal situation, with the overall accuracy being exactly 1, each class accuracy is 1, so the perfect range of class accuracies is 0. Hence, the lower the class-prediction error, the better the classifier performance.
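These two measures reduce to a few lines of NumPy (assuming every class appears in the test set):

import numpy as np

def class_accuracies(y_true, y_pred, classes):
    """Per-class accuracy: correctly classified fraction of each class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return {c: float(np.mean(y_pred[y_true == c] == c)) for c in classes}

def class_prediction_error(y_true, y_pred, classes):
    """Spread between the best and the worst class accuracy (0 is ideal)."""
    acc = list(class_accuracies(y_true, y_pred, classes).values())
    return max(acc) - min(acc)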
Table 2. Classification Accuracy at Predictor set size of Pmax

Dataset   CFA     CCA     PCA
NC160     65%     63%     61%
PDL       92.6%   90.4%   86%
Lung      88.3%   82.2%   79.1%
SRBC      91.6%   89.8%   81.5%
MLL       94.8%   92.3%   88.8%
ALL       94.9%   93.7%   91.4%
As can be seen in Table 2, the predictor set obtained by the CFA method outperforms the CCA method, yielding better classification accuracy for all the datasets. Further, the class-prediction errors in class accuracies shown in Table 3 also indicate a better performance of the CFA method over the CCA method. Finally, both cross modal methods significantly outperform the PCA method.

Table 3. Class-prediction error in Classification Accuracy at Predictor set size of Pmax
Dataset   CFA    CCA    PCA
NC160     0.69   0.71   0.78
PDL       0.34   0.38   0.41
Lung      0.53   0.59   0.62
SRBC      0.12   0.18   0.21
MLL       0.18   0.20   0.28
ALL       0.16   0.21   0.24
5 Conclusions and Further Plan

In this paper a novel feature selection technique based on mutual dependency modelling between genes is proposed for multiclass microarray gene expression classification. The two techniques, based on cross modal factor analysis and canonical correlation analysis, show a significant reduction in dimensionality and an improvement in classification accuracy and deviation error for multiclass microarray gene expression datasets. Further research will focus on feature selection techniques based on other multivariate analysis techniques, such as co-inertia analysis and latent semantic analysis, and on the fusion of predictor sets obtained from different mutual dependency models for large multiclass microarray datasets.
References
1. Dudoit, S., Fridly, J., Speed, T.P.: Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data (June 2000), http://www.stat.berkeley.edu/tech-reports/576.pdf
2. Ooi, C.H., Chetty, M., Teng, S.W.: Differential prioritization between relevance and redundancy in correlation-based feature selection techniques for multiclass gene expression data. BMVC Journal 47, 1–19 (2006)
3. Tripathi, A., Klami, A., Kaski, S.: Simple integrative preprocessing preserves what is shared in data sources. BMC Bioinformatics 9, 111 (2008)
4. Bittner, M., et al.: Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 406(3), 536–540 (2000)
5. Cheng, Y., Church, G.M.: Biclustering of expression data. In: Proc. Eighth Int'l Conf. Intelligent Systems for Molecular Biology (ISMB), vol. 8, pp. 93–103 (2000)
6. Duggan, D.J., Bittner, M.L., Chen, Y., Meltzer, P., Trent, J.M.: Expression profiling using cDNA microarrays. Nature Genetics 21, 10–14 (1999)
7. Munagala, K., Tibshirani, R., Brown, P.: Cancer characterization and feature set extraction by discriminative margin clustering. BMC Bioinformatics 5, 21 (2004)
8. Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., et al.: Multi-class cancer diagnosis using tumor gene expression signatures. Proc. Natl. Acad. Sci. USA 98, 15149–15154 (2001)
9. Ross, D.T., Scherf, U., Eisen, M.B., Perou, C.M., Rees, C., Spellman, P., Iyer, V., Jeffrey, S.S., Van de Rijn, M., Waltham, M., et al.: Systematic variation in gene expression patterns in human cancer cell lines. Nat. Genet. 24, 227–235 (2000)
10. Yeoh, E.-J., Ross, M.E., Shurtleff, S.A., Williams, W.K., Patel, D., Mahfouz, R., Behm, F.G., Raimondi, S.C., Relling, M.V., Patel, A., et al.: Classification, subtype discovery, and prediction of outcome in pediatric lymphoblastic leukemia by gene expression profiling. Cancer Cell 1(2), 133–143 (2002)
11. Khan, J., Wei, J.S., Ringner, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C.R., Peterson, C., et al.: Classification and diagnostic prediction of cancers using expression profiling and artificial neural networks. Nat. Med. 7, 673–679 (2001)
12. Bhattacharjee, A., Richards, W.G., Staunton, J.E., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., et al.: Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98, 13790–13795 (2001)
13. Armstrong, S.A., Staunton, J.E., Silverman, L.B., Pieters, R., den Boer, M.L., Minden, M.D., Sallan, S.E., Lander, E.S., Golub, T.R., Korsmeyer, S.J.: MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat. Genet. 30, 41–47 (2002)
14. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., et al.: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537 (1999)
15. Borga, M.: Canonical correlation a tutorial (1999), http://www.imt.liu.se/mi/Publications/magnus.html
16. Hsu, C.-W., Lin, C.-J.: A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks 13, 415–425 (2002)
An Efficient Convex Nonnegative Network Component Analysis for Gene Regulatory Network Reconstruction

Jisheng Dai1,2, Chunqi Chang1, Zhongfu Ye2, and Yeung Sam Hung1

1 Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong
2 Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, Anhui, P.R. China
[email protected],
[email protected],
[email protected],
[email protected]
Abstract. A systems biology problem of reconstructing gene regulatory networks from time-course gene expression microarray data via network component analysis (NCA) is investigated in this paper. Inspired by the idea that each column of the connectivity matrix can be estimated independently, we propose a fast and stable convex approach for nonnegative NCA (nnNCA). Compared with the existing method, our new method reduces the computational cost substantially while maintaining a reasonable accuracy. Both the simulation results and the experimental results demonstrate the effectiveness of our method. Keywords: Gene regulatory network, microarray, network component analysis, convex programming, positivity constraints.
1 Introduction
Data from high-throughput DNA microarrays and ChIP-chip binding assays have become the basis of transcriptional regulatory analysis in the post-genome period. The study of gene regulatory networks is important to understand the information embedded in these data. To reveal the underlying inter-dependency and cause-and-effect relationships between various cellular functions, much effort has been devoted to research on gene regulatory network reconstruction from microarray data. Developing tractable system identification techniques capable of reconstructing gene regulatory networks of a large size from small microarray datasets is a key challenge for gene regulatory network reconstruction. Several different approaches have been proposed for modeling gene regulatory networks in early studies, including dynamic Bayesian networks (DBN) [1], probabilistic Boolean networks (PBN) [2], and differential/difference equations [3]. More recently, with the assumption of linear instantaneous signal models, statistical techniques such as principal component analysis (PCA) [4] and independent component analysis (ICA) [5] have been applied successfully to deduce biologically significant information from
microarray datasets. However, neither the reconstructed networks nor the reconstructed regulatory signals are consistent with real biological systems in general, because in these approaches the assumptions of mutual orthogonality or statistical independence of the regulatory signals are not usually valid. The network component analysis (NCA) approach has been shown to be very effective because it incorporates helpful and biologically sound assumptions [6, 7, 8]. It is well known that a real gene regulatory network always has a sparse structure, and sometimes the non-zero entries of its (sparse) connectivity matrix may be obtained from ChIP-chip experiments [9]. Even when experimental data on the structure of the network are unavailable, the structural information may sometimes be extracted partially from the literature or predicted by bioinformatics methods [10]. The NCA approach can make use of this (sparse) structural information to fully reconstruct the network (including the connectivity matrix and the regulatory signals) in the noise-free case, if some mild conditions, called the NCA criteria, are satisfied. Most recently, the method was further improved in [11], where a closed-form solution to the NCA problem is obtained through fitting the model by a series of subspace projections. So far, NCA is one of the most effective approaches to gene regulatory network reconstruction; existing algorithms, however, lack accuracy and consistency. To improve the performance of NCA and to develop a more robust network reconstruction method, convex programming methods incorporating nonnegativity constraints on the connectivity matrix were proposed in [12, 13]. Although this convex optimization problem can be solved by an interior-point algorithm, the required complexity is still very high if the number of entries in the connectivity matrix is large, which often happens in biological environments. This motivates us to find an efficient nonnegative network component analysis (nnNCA) approach for gene regulatory network reconstruction subject to the nonnegativity constraints. Inspired by the idea that each column of the connectivity matrix may be solved independently [11], we derive a very fast and stable approach to nnNCA in this paper. Compared with the existing method, the computational cost is reduced substantially. Both the simulation results and the experimental results demonstrate the effectiveness of our method.
2 Network Component Analysis
Figure 1 illustrates the instantaneous linear regulation model we are addressing, which is a bipartite network. Gene expressions are regulated by transcription factors. The upper layer in the figure represents the expression levels of activated transcription factors (TF), or transcription factor activities (TFA), and the lower layer represents the microarray gene expression data. In general, gene regulation processes are dynamic and nonlinear, but are approximately log-linear [14]. Assume that in the network we have N genes and M transcription factors, and that the length of the time series is K. Then, if the gene expressions are represented as log-ratios, the network can be modeled as

X = AS + \Gamma   (1)
Fig. 1. Transcriptional regulatory model. The expression of genes (lower layer) are regulated by transcription factors (upper layer) through a bipartite network.
where X \in \mathbb{R}^{N \times K}, A \in \mathbb{R}^{N \times M}, S \in \mathbb{R}^{M \times K} and \Gamma \in \mathbb{R}^{N \times K} are the gene expression levels (which can be measured by microarray data), the connectivity matrix of the network, the unknown TFAs, and the measurement noise, respectively. It should be noted that the connectivity matrix A is assumed sparse, which agrees well with observations in biological environments. Our goal is to recover the connectivity matrix A and the TFAs S from the time-course microarray data X. It is proven in [6] that A and S can be estimated up to possible scaling factors, which is the inherent indeterminacy of any blind source separation problem, if the following NCA criteria are satisfied:

1. the connectivity matrix A must be of full column rank;
2. when an element in the regulatory domain is removed together with the rows corresponding to the nonzero entries of this column, the connectivity matrix of the resulting network is still of full column rank;
3. the TFAs S must have full row rank.

Under the constraint that the non-zero pattern of A has the structure corresponding to the known a priori network topology, A and S can be uniquely determined by solving the following optimization problem:

\min_{A,S} \|X - AS\|_F \quad \text{subject to} \quad A(I) = 0   (2)

where \|\cdot\|_F is the Frobenius norm of a matrix, and I contains the indices where the entries of A are zero, which is induced by the network structure. Since the actual estimation of A and S is performed by a two-step alternating least-squares algorithm, a very high computational complexity is required. To reduce the computational cost, [11] makes an effort to develop a fast, stable and globally convergent algorithm for reconstructing gene regulatory networks, referred to as the FastNCA algorithm, where a closed-form solution to the NCA problem is given. Interestingly, it was shown that the connectivity matrix A can be estimated
first, and in particular each column of A can be obtained independently; the TFAs can then be obtained simply by

S = A^\dagger X   (3)

where A^\dagger is the pseudo-inverse of A. However, all these NCA approaches only use the structural information about the connectivity matrix. In an attempt to improve the performance of the NCA method, as well as to develop a more accurate and robust method for network reconstruction, a convex algorithm incorporating another piece of prior information (the entries of the connectivity matrix A are all nonnegative, also referred to as nnNCA) is proposed in [12]. It is reasonable to put the nonnegativity constraint on the entries of the connectivity matrix A, as long as any specific transcription factor has the same effect on all genes. A sound biological justification for this assumption can be found in [15]. With this nonnegativity constraint, it is shown in [12] that the connectivity matrix A can be estimated by solving the following optimization problem:

\min_{A} \|C^T A\|_F \quad \text{subject to} \quad A(I) = 0, \; A(J) \ge c   (4)

where C is the null space of the matrix X, J contains the indices where the entries of A are nonzero (positive), and c is a small positive constant. Although the above convex optimization problem can be solved by an interior-point algorithm [16], the required computational complexity is still very high if the number of entries in the matrix A is large, which often happens in real biological circumstances. Inspired by the idea that each column of the connectivity matrix A may be solved independently [11], we propose a very efficient approach to nnNCA, based on individual estimation of the columns of A, in the next section.
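For reference, a minimal NumPy sketch of the generic two-step alternating least-squares scheme for the basic NCA problem (2) mentioned above is given below; this is an illustration of that scheme under the stated topology constraint, not the authors' implementation, and the inherent scaling indeterminacy is left unresolved.

import numpy as np

def nca_als(X, support, n_iter=100, seed=0):
    """Alternate least-squares fits of S and of the allowed entries of A.

    support is an N x M boolean mask of permitted non-zero entries of A
    (the a priori network topology); entries outside it stay at zero."""
    N, M = support.shape
    rng = np.random.default_rng(seed)
    A = rng.random((N, M)) * support          # random feasible starting point
    S = np.zeros((M, X.shape[1]))
    for _ in range(n_iter):
        S, *_ = np.linalg.lstsq(A, X, rcond=None)      # S-step, given A
        for n in range(N):                             # A-step, row by row
            idx = np.flatnonzero(support[n])
            if idx.size:
                A[n, idx], *_ = np.linalg.lstsq(S[idx].T, X[n], rcond=None)
    return A, S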
3 Efficient Approach to nnNCA
It is well known that if there is noise in the model of Eq. (1), the effect of the noise can be coped with by applying the singular value decomposition (SVD) [17] to the matrix X. For simplicity of analysis, we first write X (assuming no noise) in the standard SVD form

X = U \begin{bmatrix} \Sigma & 0 \\ 0 & 0 \end{bmatrix} V^T = \begin{bmatrix} U_s & U_n \end{bmatrix} \begin{bmatrix} \Sigma & 0 \\ 0 & 0 \end{bmatrix} V^T = U_s \begin{bmatrix} \Sigma & 0 \end{bmatrix} V^T   (5)

where U \in \mathbb{R}^{N \times N} and V \in \mathbb{R}^{K \times K} are the singular-vector matrices, U_s \in \mathbb{R}^{N \times M} and U_n \in \mathbb{R}^{N \times (N-M)} are the singular-vector matrices corresponding to the M nonzero singular values and the N - M zero singular values, respectively, and \Sigma \in \mathbb{R}^{M \times M} is the singular-value matrix containing the nonzero singular values arranged
in decreasing order from top-left to bottom-right. From the subspace principle [18], we know that there always exists a matrix T \in \mathbb{R}^{M \times M} such that

U_s = AT.   (6)
Notice that the above equation only holds if there is no noise in Eq. (1). Nevertheless, AT can provide a robust estimate of X in the presence of noise. Therefore, we still assume that Eq. (6) holds in the noisy case. Since we can always re-order the rows of X and A, we assume without loss of generality that the first column of A can be partitioned as

a_1 = \begin{bmatrix} \tilde{a} \\ 0_{L \times 1} \end{bmatrix}   (7)

where L is the number of zero entries in the vector a_1 and \tilde{a} \in \mathbb{R}^{(N-L) \times 1}. To show that each column of the connectivity matrix A can be solved independently for nnNCA, we need the following lemma, which forms the basis of our method.

Lemma 1. If there is a vector t \in \mathbb{R}^{M \times 1} satisfying

U_s t = \begin{bmatrix} b \\ 0_{L \times 1} \end{bmatrix}   (8)

where b \in \mathbb{R}^{(N-L) \times 1}, then b must be an estimate of \tilde{a} up to a scaling ambiguity \eta, i.e.,

b = \eta \tilde{a}.   (9)
where ˆt Tt. We partition the above equation accordingly as ˜ A12 a η b ˆ At = = 0 A22 t2 0
(10)
(11)
or, equivalently, b = η˜ a + A12 t2
(12)
0 = A22 t2
(13)
Since A22 is of full-column rank determined by NCA criterion (2), according to Eq. (13) we are able to obtain t2 = 0 (14) Then substituting the above equation into Eq. (12), we achieve the desired result b = η˜ a
(15) End of proof
Convex Nonnegative NCA for Gene Regulatory Network Reconstruction
61
Relying on Lemma 1, we are able to derive a very simply approach to estimate a ˜1 . Since we can always re-order the columns of A and the rows of T accordingly while keeping the equality in Eq. (6), the other columns of A can be estimated in the same way. Keeping these results in mind, we now proceed to solve the NCA problem with positivity constraints on the nonzero entries of the connectivity matrix. Unlike the strategy employed in [12], we follow a different approach where each column of the connectivity matrix A is estimated independently, so that the computational cost will be reduced substantially. The optimization problem for estimating the i-th column (denoted as ai ) with positivity constraints can be formulated as min Us (Ii , :) · ti F ti
(16)
subject to : Us (Ji , :) · ti ≥ c where Ii contains the indices where the entries of ai are zeros, Ji contains the indices where the entries of ai are nonzero, and Us (Ii , :) denotes the rows of Us corresponding to indices Ii . Eq. (16) is convex and can be solved by effective algorithms [16]. Recall Lemma 1, it is not difficult to see that the estimate of ai can be given by ai = Us ti (17) From the above discussions, now we summarize the proposed algorithm as follows. Efficient Algorithm for nnNCA: 1. Perform a standard SVD on X and obtain the Us as in Eq. (5); 2. Estimation of A For i = 1, 2, · · · , M a) solve the optimization problem (16) and get the optimal ti ; b) let ai = Us ti . End 3. Estimate the TFA matrix by S = A† X.
4 Results

4.1 Simulation Results
In order to test the proposed method, we use the simulation data described in [6, 12]. This is a hemoglobin spectroscopy data set with a 7 × 3 connectivity matrix and a 7 × 321 measurement data matrix. If positivity constraints on the nonzero entries of the connectivity matrix and the noise are not considered, the same performances are achieved by FastNCA, the method in [12], and our proposed
Table 1. True connectivity matrix and its estimation by FastNCA, the method in [12], and our proposed method without positivity constraints and noise

True                   FastNCA                Method in [12]         Proposed
0.417  1.184  0        0.435  1.196  0        0.435  1.196  0        0.435  1.196  0
2.083  0      1.25     2.12   0      1.267    2.12   0      1.267    2.12   0      1.267
0      1.184  0.25     0      1.164  0.202    0      1.164  0.202    0      1.164  0.202
0.417  1.053  0.25     0.429  1.056  0.236    0.429  1.056  0.236    0.429  1.056  0.236
1.25   0.921  0        1.176  0.904  0        1.176  0.904  0        1.176  0.904  0
0.833  0      2.00     0.841  0      2.015    0.841  0      2.015    0.841  0      2.015
0      0.658  1.25     0      0.681  1.281    0      0.681  1.281    0      0.681  1.281
Fig. 2. True sources and their corresponding estimates by FastNCA, the method in [12], and our proposed method subject to positivity constraints under noisy case
method. The normalized true (original) and estimated connectivity matrices are shown in Table 1. In the next simulation, we study the performance of the proposed algorithm subject to positivity constraints in the noisy case, where the signal-to-noise ratio (SNR) is set to 9 dB. The true and the estimated source spectra are shown in Figure 2, and the normalized true and estimated connectivity matrices are shown in Table 2. The results in the table show that the connectivity matrix estimated by FastNCA contains both positive and negative entries. Since the true entries of the connectivity matrix are either nonnegative or nonpositive, such an estimate is unfavorable. The results in the figure show that the estimates achieved by the method in [12] and by our method are almost the same. This well demonstrates the optimality of our method.
Table 2. True connectivity matrix and its estimation by FastNCA, the method in [12], and our proposed method subject to positivity constraints under noisy case

True                   FastNCA                  Method in [12]         Proposed
0.417  1.184  0        -1.785  1.028  0         0.985  1.029  0        1.12   1.028  0
2.083  0      1.25     -0.202  0      2.065     1.505  0      1.236    1.485  0      1.358
0      1.184  0.25      0      1.244 -0.468     0      1.245  0.67     0      1.244  0.604
0.417  1.053  0.25     -0.947  0.966  0.096     1.027  0.965  0.67     0.917  0.966  0.604
1.25   0.921  0        -0.279  1.087  0         0.741  1.087  0        0.74   1.087  0
0.833  0      2.00      1.788  0      1.782     0.741  0      1.754    0.74   0      1.537
0      0.658  1.25      0      0.675  0.59      0      0.674  0.67     0      0.675  0.898
Table 3. The running time comparison between our method and the method in [12]

              Proposed         Method in [12]
Running Time  6.71 (seconds)   1144.62 (seconds)
4.2 Experimental Results
To test the proposed method on real biological networks, we apply all the methods to analyze the time-course yeast cell cycle microarray data used in [19]. Note that the network topology data required by the algorithm are from [9]. The microarray data contain 441 genes and 56 time points. In this analysis, we are interested in the 11 transcription factors that are known to regulate the expression of genes involved in the cell cycle process. Three experiments with different synchronization methods, represented by alpha, cdc15, and elutriation, are studied.
Fig. 3. Estimated yeast cell cycle related TFAs for alpha synchronization by FastNCA, the method in [12], and our proposed method
The estimated TFAs for these 11 cell cycle related transcription factors with different synchronization methods are shown in Figure 3, Figure 4 and Figure 5 for FastNCA, the method in [12], and our proposed method, respectively. It can be seen that the method in [12] and our proposed method produce almost the same results, and it is obvious that the estimated TFAs subject to positivity constraints show more cyclic behavior than those achieved by FastNCA. Cyclic behavior is expected in this case since all these transcription factors are cell cycle related. The results demonstrate the superior robustness of the nnNCA methods and the equivalence between our method and the existing method.
Method in [12]
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.1 0 −0.1 0 0.1 0 −0.1 0 0.1 0 −0.1 0 0.2 0 −0.2 0 0.2 0 −0.2 0 0.2 0 −0.2 0 0.2 0 −0.2 0 0.2 0 −0.2 0 0.1 0 −0.1 0 0.2 0 −0.2 0 0.1 0 −0.1 0
Proposed
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.1 0 −0.1 0 0.1 0 −0.1 0 0.1 0 −0.1 0 0.2 0 −0.2 0 0.2 0 −0.2 0 0.2 0 −0.2 0 0.2 0 −0.2 0 0.2 0 −0.2 0 0.1 0 −0.1 0 0.2 0 −0.2 0 0.1 0 −0.1 0
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
0.5
1
1.5
2
Fig. 4. Estimated yeast cell cycle related TFAs for cdc15 synchronization by FastNCA, the method in [12], and our proposed method
Fig. 5. Estimated yeast cell cycle related TFAs for elutriation synchronization by FastNCA, the method in [12], and our proposed method
Finally, to verify the improvement in computational cost, Table 3 gives the running times for our method and the method in [12]. All code is implemented in MATLAB 7.7.0 on an Intel Core 2 2.66 GHz machine with 2 GB of RAM. The result in the table shows that our method is much faster than the method in [12]. Both the simulation results and the experimental results demonstrate the effectiveness of our proposed method. Compared with the existing method, our new method reduces the computational cost substantially while maintaining a reasonable accuracy.
5
Conclusion
Both the simulation results and the experimental results show the superior robustness of the nnNCA methods, and demonstrate that the performance of our method and that of the method in [12] are very similar. The contribution of this paper is a convex approach to nnNCA that reduces the computational cost substantially while maintaining a reasonable accuracy.
Acknowledgements. This work is supported by the Seed Funding for Basic Research of the University of Hong Kong.
References
1. Friedman, N., Linial, M., Nachman, I., Pe'er, D.: Using Bayesian networks to analyze expression data. Journal of Computational Biology 7(3), 601–620 (2000)
2. Shmulevich, I., Dougherty, E.R., Kim, S., Zhang, W.: Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics 18(2), 261–274 (2002)
3. Dougherty, E.R., Shmulevich, I., Chen, J., Wang, Z.J.: Genomic Signal Processing and Statistics. Hindawi Publishing Corporation (2005)
4. Alter, O., Brown, P.O.: Singular value decomposition for genome-wide expression data processing and modeling. Proceedings of the National Academy of Sciences of the United States of America 97(18), 10101–10106 (2000)
5. Lee, S.I., Batzoglou, S.: Application of independent component analysis to microarrays. Genome Biology 4(11), R76 (2003)
6. Liao, J.C., Boscolo, R., Yang, Y.L., Tran, L.M., Sabatti, C., Roychowdhury, V.P.: Network component analysis: Reconstruction of regulatory signals in biological systems. Proceedings of the National Academy of Sciences of the United States of America 100(26), 15522–15527 (2003)
7. Chang, C.Q., Hung, Y.S., Fung, P.C.W., Ding, Z.: Network component analysis for blind source separation. In: Proc. 2006 International Conference on Communications, Circuits and Systems, Guilin, China, pp. 323–326 (2006)
8. Chang, C.Q., Ding, Z., Hung, Y.S., Fung, P.C.W.: Fast network component analysis for gene regulation networks. In: Proc. 2007 IEEE International Workshop on Machine Learning for Signal Processing, Thessaloniki, Greece, pp. 21–26 (2007)
9. Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T., Bar-Joseph, Z., Gerber, G.K., Hannett, N.M., Harbison, C.T., Thompson, C.M., Simon, I., Zeitlinger, J., Jennings, E.G., Murray, H.L., Gordon, D.B., Ren, B., Wyrick, J.J., Tagne, J.B., Volkert, T.L., Fraenkel, E., Gifford, D.K., Young, R.A.: Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298(5594), 799–804 (2002)
10. Wang, C., Xuan, J., Chen, L., Zhao, P., Wang, Y., Clarke, R., Hoffman, E.: Motif-directed network component analysis for regulatory network inference. BMC Bioinformatics 9(suppl. 1), S21 (2008)
11. Chang, C.Q., Ding, Z., Hung, Y.S., Fung, P.C.W.: Fast network component analysis (FastNCA) for gene regulatory network reconstruction from microarray data. Bioinformatics 24(11), 1349–1358 (2008)
12. Chang, C.Q., Ding, Z., Hung, Y.S.: A new optimization algorithm for network component analysis based on convex programming. In: Proc. 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, Taiwan (2009)
13. Chang, C.Q., Ding, Z., Hung, Y.S.: Nonnegative network component analysis by linear programming for gene regulatory network reconstruction. In: Adali, T., et al. (eds.) ICA 2009. LNCS, vol. 5441, pp. 395–402. Springer, Heidelberg (2009)
14. Savageau, M.A.: Biochemical Systems Analysis: A Study of Function and Design in Molecular Biology. Addison-Wesley, Reading (1976)
15. Alon, U.: An Introduction to Systems Biology: Design Principles of Biological Circuits. Chapman & Hall/CRC, Boca Raton (2007)
16. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
17. Golub, G.H., van Loan, C.F.: Matrix Computations, 3rd edn. The Johns Hopkins University Press (1996)
18. Lay, D.C.: Linear Algebra and Its Applications, 2nd edn. Addison-Wesley, New York (2000)
19. Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D., Futcher, B.: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 9(12), 3273–3297 (1998)
Using Higher-Order Dynamic Bayesian Networks to Model Periodic Data from the Circadian Clock of Arabidopsis Thaliana Rónán Daly1 , Kieron D. Edwards2 , John S. O’Neill3 , Stuart Aitken4 , Andrew J. Millar4 , and Mark Girolami1 1
Inference Group, Department of Computing Science, University of Glasgow {rdaly,girolami}@dcs.gla.ac.uk 2 Advanced Technologies (Cambridge) Limited
[email protected] 3 Institute of Metabolic Science, Metabolic Research Laboratories, University of Cambridge
[email protected] 4 Centre for Systems Biology at Edinburgh, The University of Edinburgh {s.aitken,andrew.millar}@ed.ac.uk
Abstract. Modelling gene regulatory networks in organisms is an important task that has recently become possible due to large scale assays using technologies such as microarrays. In this paper, the circadian clock of Arabidopsis thaliana is modelled by fitting dynamic Bayesian networks to luminescence data gathered from experiments. This work differs from previous modelling attempts by using higher-order dynamic Bayesian networks to explicitly model the time lag between the various genes being expressed. In order to achieve this goal, new techniques in preprocessing the data and in evaluating a learned model are proposed. It is shown that it is possible, to some extent, to model these time delays using a higher-order dynamic Bayesian network. Keywords: Dynamic Bayesian Network, Gene Regulatory Network, Gene Expression, Arabidopsis Thaliana.
1
Introduction
The analysis of data obtained from experiments looking at the expression of genes has become an important topic in the realm of bioinformatics. Whilst other bioinformatics tasks such as sequence alignment, gene finding, genome assembly and genome annotation have proven very amenable to computational analysis, there are tasks which are intrinsically difficult at the current level of knowledge. These include protein structure prediction and modelling gene regulatory networks. The latter will be examined in this paper. It is well known that genes code for proteins and that certain proteins can affect how certain genes are expressed. Gene regulatory networks encode these dependencies between genes in an organism, abstracting away the influence of
proteins so that nodes correspond to genes and directed arcs correspond to causal gene-expression relations. In this paper, an attempt will be made to model these gene regulatory networks and in particular, the time lag between the expression of a gene and its effect on the expression of another gene. This modelling will proceed by using an algorithm based on ant colony optimisation known as ACO-E [1] to build dynamic Bayesian networks (DBNs). DBNs are an extension of Bayesian networks, graphical models that can be useful in modelling probabilistic relations between variables [2]. They consist of a prior network, used to model initial conditions, and a transition network, used to model the effect of interactions over time. Whilst DBNs have been used to model gene regulatory networks [3,4,5,6,7], the method presented here differs in how the data are treated and in how it can infer interactions at multiple time periods. To test the methods proposed, a set of gene expression data obtained from experiments observing particular genes of the plant Arabidopsis thaliana will be used. These genes are known to behave in a clock-like fashion that regulates the function of the plant [8,9]. The data will be preprocessed and experiments will be conducted to learn the structure of DBNs using these preprocessed data. Using a new technique, these networks will be compared to a standard network developed by an expert in the field being analysed. Using these results, the ability of DBNs to reconstruct gene-regulatory networks will be investigated.
2
Background
The analysis of gene regulatory networks is a hard task. In order to see how various genes interact, experiments must be performed which measure the expression levels of these genes. Using this data, it is possible to develop models of genetic systems behaviour. One of the earliest and still commonly used techniques is to use differential equations that show how expression levels vary with respect to other expression levels. However, recent advances in measurement of gene expression levels, using techniques such as DNA microarrays, have allowed the measurements of tens of thousands of genes to be performed simultaneously. With this amount of genes in an organism, all possible interactions between all the genes cannot be looked at. Therefore, methods involving graphical representations have become more widely used. These include Boolean networks, networks similar to neural networks, stochastic process calculi and Bayesian networks [10]. 2.1
Data from Arabidopsis Thaliana Experiments
The experimental study in this chapter will involve luminescence data obtained from experiments on Arabidopsis thaliana. The luminescence is an indirect measure of the synthesis of new RNA transcripts from each gene, originating from a copy of the cognate promoter and 5 untranslated region that are fused to the firefly luciferase reporter gene and integrated into the plant genome at a random location. The particular experiments in question were designed to study
Table 1. Ratio of light to darkness for each experimental condition

Condition   AT0029   AT0030   AT0031b   AT0032   AT0033   AT0047
Ratio       6:18     9:15     12:12     15:9     18:6     3:21
the ‘plant circadian rhythms’ of this organism, i.e. the oscillating behaviour of plants as they respond to the change in sunlight [11]. Briefly, the expression levels of the genes in plants tend to synchronise with the rising and setting of the sun, oscillating between high and low levels. However, when the light source is removed, the expression levels continue to oscillate with the same frequency, with the behaviour decaying over time. In the experiments, there were in general two phases. The first phase, called the entrainment, switched the light on and off at regular intervals and lasted three days. The second phase had constant light and also lasted three days. At intervals of 1.5 hours, readings were taken of the expression levels of the genes being investigated. The experiments involved ten different genes: CAB, GI, CCA1, CCR2, LHY, CAT3, ELF3, PRR9, TOC1 and ELF4. Further information about these genes can be found elsewhere [11]. There were also six different conditions under which the experiments were conducted; each condition had a different light period for the entrainment phase of the experiment. Table 1 shows, for each experimental condition, the ratio of light to dark for that condition. E.g. for condition AT0029, there are 6 hours of light followed by 18 hours of darkness in the entrainment phase. Because the experiments lasted six days, and a sample was taken every 1.5 hours, there were a total of 96 samples taken for each condition. The first of these was taken as a reference datum and subtracted from each of the following samples, leading to 95 samples with which to work.
3
Modelling Gene Expression Data Using DBNs
It is the stochastic nature of the gene expression process that makes probabilistic models an appropriate choice in modelling them. One interesting approach is in using Bayesian networks as a probabilistic model. This enables large scale, non-linear interactions between many genes to be represented and simulated. E.g., recent studies have looked at learning Bayesian networks with thousands of genes [12]. The large amounts of data present in microarray studies can be used to learn both the structure and parameters of the network – missing values and noise can also be taken care of. Also, Bayesian networks can be given a causal interpretation. A problem arises however in that loops are not permitted in Bayesian networks. Therefore, feedback processes cannot be represented. A solution to this problem is to use dynamic Bayesian networks (DBNs). With DBNs, it is possible to ‘unroll’ what would be a loop in a Bayesian network across
Fig. 1. A 4-layer dynamic Bayesian network
time. In doing so, a temporal order is brought in amongst the various variables. In this respect, DBNs are a natural way to model temporal data. 3.1
Learning a Higher-Order Dynamic Bayesian Network
Most Bayesian network structure learning algorithms are designed to learn a static structure. Whilst dynamic Bayesian networks are very similar to static networks, there are slight differences that mean extra care has to be taken when trying to learn them. Bearing in mind that a DBN has two parts, the prior and transition network, this section will focus on learning the transition network; the prior network is exactly the same as a normal Bayesian network. In a transition network, the nodes are grouped into multiple layers that designate different timepoints. Arcs can never go back in time and the head of any arc can only be in a single layer, normally that layer at time t. It is very normal to have two layers, such that they correspond to variables at time t and t − 1. In fact, in all the general literature on Bayesian networks surveyed, learning DBNs amounted to learning DBNs with two layers. However, in general, n + 1 layers can be specified, from t − n to t. An example of a 4-layer DBN is shown in Fig. 1. When learning such a DBN, it is sufficient to add an extra constraint to the learning algorithm, i.e. the head of an arc must always be in the layer at time t. By following this, any Bayesian network structure learning algorithm that searches through the space of DAGs can be used and will produce a valid network.
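As a minimal illustration of this constraint (a sketch in Python; the (variable, lag) node encoding used here is an assumption made for illustration, not part of the original description), the following check accepts an arc only if its head lies in the slice at time t:

# A node is a (variable, lag) pair: lag 0 means time t, lag k means time t - k.
def arc_is_valid(tail, head, max_lag):
    tail_var, tail_lag = tail
    head_var, head_lag = head
    if head_lag != 0:                   # the head of every arc must be in the layer at time t
        return False
    if not (0 <= tail_lag <= max_lag):  # the tail may reach back at most n slices
        return False
    return True

# With a 4-layer transition network (max_lag = 3), an arc from X at t-2 into Y at t is
# admissible, whereas an arc into Y at t-1 is not.
print(arc_is_valid(("X", 2), ("Y", 0), max_lag=3))   # True
print(arc_is_valid(("X", 2), ("Y", 1), max_lag=3))   # False

Any candidate move proposed by the structure search can be filtered through such a check, which is the extra constraint referred to above.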
4
Experimental Methodology
In order to test the ability of multi-layered DBNs to model a genetic regulatory network, experiments were conducted using the Arabidopsis thaliana data discussed in Sect. 2.1. This section will describe the steps taken to achieve this. This includes preprocessing the data, formulating prior knowledge, designing a methodology to run the algorithm and testing the results obtained against expert knowledge.
4.1
Preprocessing the Data
The data obtained from the experiments described in Sect. 2.1 are continuous in nature, i.e. they show the level of gene expression as measured. However, most scoring functions for use with structure learning are based on nominal data, i.e. data of a discrete nature. For the purposes of these experiments, the BDeu scoring function was used [13]. For this to be utilised, the data have to be discretised. Whilst binary discretisation is an easy option, there is a problem with this as shown by Fig. 2a. It can be seen that as the expression level decays after entrainment, the discretised expression level flatlines, even though there is still oscillatory action occurring. One way to counter this is to discretise the data into multiple levels. However, in doing so, the amount of data at each level is reduced. This can lead to less support for dependencies when a model is fitted to the data, something that should be avoided when there is not much data to begin with.

In order to avoid this problem, the first derivative of the data was taken and the data discretised according to whether this derivative was greater than or less than zero. The results of this procedure are shown in Fig. 2b. It can be seen that this discretisation captures the oscillations present in the original data. In a sense, instead of looking for correlations in the mRNA level, we are looking for correlations in the rising and falling of the mRNA level. To the authors' knowledge, this is the first time this has been done when learning Bayesian networks that model gene regulatory networks.

Finally, because the expert was able to provide knowledge on the behaviour of five of the genes and the light source, these were selected as the variables that would be used in the construction of the DBN. The genes in question were GI, CCA1, LHY, PRR9 and TOC1. Whilst the other genes could have been included in the analysis, any connections among them would be unverifiable by expert knowledge in this study.
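A minimal sketch of this preprocessing step (illustrative Python, assuming the readings for one gene are held in a plain list; not the authors' code):

# Discretise an expression time series by the sign of its first difference:
# 1 if the level is rising, 0 if it is falling or unchanged.
def discretise_by_derivative(levels):
    diffs = [b - a for a, b in zip(levels, levels[1:])]   # first difference
    return [1 if d > 0 else 0 for d in diffs]

# Example: an oscillating, decaying signal still produces alternating states,
# whereas a binary threshold on the raw levels would eventually flatline.
signal = [5.0, 6.5, 5.5, 4.8, 5.1, 4.6, 4.7, 4.4]
print(discretise_by_derivative(signal))   # [1, 0, 0, 1, 0, 1, 0]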
Fig. 2. Expression levels for PRR9 in the AT0029 condition: (a) expression level; (b) expression level derivative
4.2
Prior Knowledge
In order to provide more meaningful results and to compensate for the absence of much data, prior knowledge was incorporated into the learning algorithm in the form of constraints on the allowable edges. These constraints are summarised below:

– Connections from a variable to the same variable at a later time point were disallowed. This type of information is trivial and does not add anything interesting to the model.
– No connection was allowed to the light variable, as this was causally independent of the gene expression values.
– Since interactions at this genetic level took time in the region of at least one time step, no connections were allowed among the variables at time t.
– A dynamic constraint was introduced that specified that if a connection was made from a variable X at time t − i to a variable Y at time t, then no other connection could be made from X to Y. This is equivalent to saying that variables only affect each other temporally in a single manner, i.e. the influence from X to Y only ever takes a set amount of time. Without a constraint such as this, in the absence of much data, multiple interactions between the same genes at different time points could easily be inferred.
– The number of slices of a possible dynamic Bayesian network was set to 9, i.e. the time slice at t and 8 slices back.

Other authors have used DBNs with multiple layers to model gene regulatory networks with multiple time lags [14]. However, their methods consider arcs between arbitrary layers that are not at time t. This method is incorrect, as it means that arcs could be added between nodes without taking into account the other parents these nodes have. E.g. a local search could add an arc between X_{t−2} and X_{t−1} and between Y_{t−1} and X_t. This implies that X_t has parents X_{t−1} and Y_{t−1}, but both arcs were added without considering the other.
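To make these constraints concrete, the following sketch (Python, with an assumed (variable, lag) arc encoding and an assumed variable name "LIGHT"; it is an illustration, not the authors' implementation) rejects candidate arcs that violate them:

# A candidate arc goes from (tail_var, lag) to (head_var, time t), with lag in 1..8.
def arc_allowed(tail_var, lag, head_var, existing_arcs, max_lag=8):
    if head_var == "LIGHT":            # nothing may point into the light variable
        return False
    if tail_var == head_var:           # no self-connections across time
        return False
    if not (1 <= lag <= max_lag):      # no intra-slice arcs at time t; at most 8 slices back
        return False
    for (t_var, t_lag, h_var) in existing_arcs:
        if t_var == tail_var and h_var == head_var:
            return False               # a pair of genes may interact at only one lag
    return True

arcs = [("GI", 3, "TOC1")]
print(arc_allowed("GI", 5, "TOC1", arcs))    # False: GI -> TOC1 already present at lag 3
print(arc_allowed("LHY", 2, "PRR9", arcs))   # True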
4.3
Testing Methodology
In order to test the ability of DBNs to recreate the expert's knowledge, a series of experiments was conducted. Each experiment used the data described in the preprocessing section, with the prior knowledge as described above. Also, the BDeu scoring criterion was used with a uniform structure prior. The parameters of the structure learning algorithm (ACO-E) were set as ρ = 0.3, q0 = 0.8 and β = 1. These were selected as reasonable values that should perform well in most cases [1]. It has been known for some time that the value of the equivalent sample size parameter N for the BDeu scoring function has a large effect on learning structure [15,16,17]. In a sense, the value of N can be seen as a regularisation parameter: the larger the value of N, the more edges are supported in the learned graph [18]. This can be contrasted to scores like the BIC and AIC, where regularisation is implicit in the function and cannot be adjusted. Therefore, in
learning a Bayesian network structure from data, it is important to see how the N parameter will affect the learned graph. Because of this, for each condition described in Table 1, experiments were performed that varied the equivalent sample size N over 15 values ranging from 0.0001 to 500. These values were 0.0001, 0.001, 0.01, 0.1, 1, 3, 7, 10, 20, 30, 50, 80, 100, 300, 500. At each level of N , 10 experiments were performed. Also, the data from all the conditions was concatenated and experiments performed, varying N over 20 values ranging from 0.01 to 8000. These values were 0.01, 0.1, 1, 3, 7, 10, 20, 30, 50, 80, 100, 300, 500, 750, 1000, 2000, 4000, 6000, 7500, 8000. Again, at each level of N , 10 experiments were performed. For each experiment, the resulting DBN was saved for later comparison against the expert supplied knowledge. 4.4
Evaluation Criteria
Evaluating the learned DBNs was achieved by comparing them against the expert supplied knowledge. For this task, the knowledge was transformed into a DBN representing the state of the expert's knowledge on the domain. The DBN was validated by the expert and is shown in Fig. 3. Note that although the DBN looks different to published networks in the literature, it captures the same semantics. Please also note that nodes are not shown if they have no connection to any other node. Also, although the arcs in the network are definite, the knowledge of the domain expert was not as definite. Often the time points supplied were of the order of six hours long, i.e. instead of a definite time lag being given, the lag was bounded by two times roughly six hours apart. The effect of this would be to reduce the measured accuracy of any learned network.

To compare a learned network against the standard network, two measures were used: the true positive rate (TPR) and the false positive rate (FPR). These are defined as

TPR = TP / (TP + FN)    and    FPR = FP / (FP + TN),

where TP is the number of true positives, i.e. the number of arcs computed present that are present. For the evaluation of the networks, this was calculated in two different ways:

Time-Independent True Positive. If there was an arc from gene X to gene Y in the standard network and also an arc from gene X to gene Y in the testing network, irrespective of the time slice that X is in, then this counts as a true positive.

Time-Dependent True Positive. If there was an arc from gene X to gene Y in the standard network and also an arc from gene X to gene Y in the testing network, then this counts as a time-dependent true positive. The
Fig. 3. Expert developed DBN for Arabidopsis thaliana circadian clock
degree to which it is a true positive depends on the difference of time slices that the tail of each arc comes from. A difference of 0 gives a true positive of 1 and a difference of 8 gives a true positive of 0. Other values are calculated linearly between these two values. E.g. with an arc in the standard network having the tail in slice 4 and an arc in the testing network having a tail in slice 2, this leads to a difference of 2, which gives a time dependent true positive of 1 − 2/8 = 6/8. FP is the number of false positives, i.e. the number of arcs computed present that are absent. If there exists an arc from gene X to gene Y in the testing network and there does not exist an arc in the standard network, then this counts as a false positive. TN is the number of true negatives, i.e. the number of arcs computed absent that are absent. If there does not exist an arc from gene X to gene Y in the testing network and there does not exist an arc in the standard network, then this counts as a true negative. FN is the number of false negatives, i.e. the number of arcs computed absent that are present. If there does not exist an arc from gene X to gene Y in the testing network and there does exist an arc in the standard network, then this counts as a false negative. In previous work of a similar nature, evaluating the learned network proceeds in a different manner than above [14]. Some problems in their methods include: – Firstly, as mentioned in Sect. 4.2, the method they use to add arcs can lead to incorrect conclusions about the structure of the dynamic network;
– Secondly, they don't take into account the time delay when evaluating the learned network; and
– Thirdly, they don't take any false positive rate into account.
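As a concrete illustration of the time-dependent measure defined above, the following sketch (Python; it assumes each network is given as a set of (source gene, target gene, lag) arcs, an encoding chosen here for illustration rather than taken from the authors' evaluation code) computes the time-dependent true positive count:

# Each network is a set of arcs (source_gene, target_gene, lag), lag in 1..8.
def time_dependent_tp(standard, learned, max_diff=8):
    tp = 0.0
    for (src, dst, lag) in learned:
        matches = [s_lag for (s_src, s_dst, s_lag) in standard
                   if s_src == src and s_dst == dst]
        if matches:
            diff = min(abs(lag - s_lag) for s_lag in matches)
            tp += max(0.0, 1.0 - diff / max_diff)   # 1 at a lag difference of 0, 0 at 8
    return tp

standard = {("GI", "TOC1", 4), ("CCA1", "PRR9", 2)}
learned = {("GI", "TOC1", 2), ("LHY", "PRR9", 1)}
print(time_dependent_tp(standard, learned))   # 0.75, since |4 - 2| = 2 gives 1 - 2/8

The resulting count can then be used in place of the ordinary TP in the TPR formula above.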
5
Results and Discussion
For each condition and each level of N, the mean of the TPR and that of the FPR were taken over the 10 experiments. This was done for the Time-Independent TP (TITP) and the Time-Dependent TP (TDTP). With these results, two different types of graphs were plotted. The first was the receiver operating characteristic (ROC) curve, which plots the FPR against the TPR. The second plots the FPR and TPR as a function of N. The results of these plots for the concatenated data are shown in Fig. 4. Due to reasons of space it was not possible to show the graphs for the individual conditions. To compensate, the area under the curve (AUC) statistic is given in Table 2.
5.1
Discussion of Experimental Results
The value of the equivalent sample size N has a large effect on learning structure [15,16,17]. Therefore, in learning a Bayesian network structure from data, it is important to see how this parameter will affect the learned graph. This is the reason why the experiments described above were conducted over various values of N. Looking first at Figs. 4a and 4b, it can be seen that the learned DBN structures do a good job of identifying all of the connections between the genes as supplied by the domain expert. In this case, the best value of N is at 1, where the true positive rate is 1 and the false positive rate at just over 0.3. With the time dependent TPRs as seen in Figs. 4c and 4d, again the best value of N is at 1, with a time dependent TPR of around 0.8 and a false positive rate of 0.3. In this case, 'best' is being taken as the absolute difference between the TPR and FPR. These results show that the learned models are good at showing all the connections between the genes, but not as proficient at keeping out bad connections.

Table 2. Area under the ROC curve for both the time-independent and time-dependent true positive cases

Condition   Time-Independent AUC   Time-Dependent AUC   Light Ratio
All         0.7913                 0.6749               —
AT0029      0.6943                 0.5749               6:18
AT0030      0.6597                 0.5942               9:15
AT0031b     0.8234                 0.7046               12:12
AT0032      0.7135                 0.5696               15:9
AT0033      0.8156                 0.7155               18:6
AT0047      0.5200                 0.4326               3:21
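The AUC values in Table 2 each summarise one ROC curve in a single number. Assuming the curve is available as a list of (FPR, TPR) operating points, one plausible way to obtain such a value is trapezoidal integration, sketched below in Python (a generic illustration, not necessarily how the figures in Table 2 were computed):

def roc_auc(points):
    # Trapezoidal area under a ROC curve given as (FPR, TPR) pairs.
    pts = [(0.0, 0.0)] + sorted(points) + [(1.0, 1.0)]   # anchor the curve at (0,0) and (1,1)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Example with three operating points obtained at different values of N.
print(roc_auc([(0.1, 0.5), (0.3, 0.8), (0.6, 0.95)]))   # 0.8075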
Fig. 4. Results for concatenated data: (a) ROC for the time-independent TPR and FPR; (b) the time-independent TPR and FPR as a function of N; (c) ROC for the time-dependent TPR and FPR; (d) the time-dependent TPR and FPR as a function of N
Indeed, when examining a trace of the algorithm it can be seen that spurious connections between highly synchronised genes are often inserted. E.g. if X causes Y and X causes Z, then a connection between Y and Z can easily appear as well as the connections from X to Y and Z. Problems such as these often appear with a small amount of data and prior knowledge becomes increasingly important in these situations. With the results of the individual conditions, different behaviours can be observed. The value of N for which best results are obtained differs widely depending on the condition used, ranging from 1 to 100. The different conditions also differ in how good the results are. E.g. the best performance came from the results with the larger light to dark ratio, i.e. AT0031b, AT0032 and AT0033. The worst performance came from the other conditions, i.e. AT0029, AT0030 and AT0047. Results such as these are plausible, as having more light in the entrainment phase equates to higher expression levels, which are less likely to be affected by noise. It should be noted that not all the expert supplied knowledge is as accurate as may seem from the standard network. Whilst the time in that network was
given in time steps of 1.5 hours, the knowledge of the expert was often given in terms of much larger grained steps (e.g. ‘the morning’), as opposed to a definite time period. This is an artifact of the domain in question, as it is not completely understood at the present time. In turn, it has an effect on the time dependent true positive values obtained from comparing two networks; with better prior knowledge a more accurate comparison could be used that bounds what the correct time lags might be for an arc. It should also be noted that the layers in the network impose a discretisation in time that might not be accurately reflected in the domain.
6
Conclusions
This paper asked the question on whether it is possible to use multi-layered dynamic Bayesian networks to model gene regulatory networks using gene expression data. To help answer this question new techniques were devised. Firstly, the gene expression data was preprocessed by a novel method in relation to learning gene regulatory networks; the first difference was obtained and this difference discretised into two bins – rising or falling. Secondly, because of the lack of much previous work on learning higher-order DBNs, new evaluation criteria were defined to judge how well the learned network reconstructed the expert supplied network. From the results it can be seen that it is possible, to some extent, to model both the connections between various genes and the time lags associated with these connections. Acknowledgements. We thank Adrian Thompson for his contribution to the experimental work. Kieron Edwards and Adrian Thompson performed the experimental work. John O’Neill helped develop the reference model of the circadian clock. The experimental work was funded by BBSRC awards 88/G19886 and BB/E015263/1 and by the EU FP6 award EUClock to AJM. The Centre for Systems Biology at Edinburgh (CSBE) is a Centre for Integrative Systems Biology (CISB) funded by BBSRC and EPSRC, reference BB/D019621/1. JSO, SA and AJM are funded by CSBE. SA is also partially funded by BBSRC grant BB/F015976/1.
References
1. Daly, R., Shen, Q.: Learning Bayesian network equivalence classes with ant colony optimization. Journal of Artificial Intelligence Research 35, 391–447 (2009)
2. Daly, R., Shen, Q., Aitken, S.: Learning Bayesian networks: Approaches and issues. The Knowledge Engineering Review (in press, 2009)
3. Husmeier, D.: Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics 19(17), 2271–2282 (2003)
4. Kim, S.Y., Imoto, S., Miyano, S.: Inferring gene networks from time series microarray data using dynamic Bayesian networks. Briefings in Bioinformatics 4(3), 228–235 (2003)
5. Perrin, B.E., Ralaivola, L., Mazurie, A., Bottani, S., Mallet, J., d'Alché Buc, F.: Gene networks inference using dynamic Bayesian networks. Bioinformatics 19(suppl. 2), ii138–ii148 (2003)
6. Zou, M., Conzen, S.D.: A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics 21(1), 71–79 (2005)
7. Geier, F., Timmer, J., Fleck, C.: Reconstructing gene-regulatory networks from time series, knock-out data, and prior knowledge. BMC Systems Biology 1, 11 (2007)
8. Edwards, K.D., Anderson, P.E., Hall, A., Salathia, N.S., Locke, J.C., Lynn, J.R., Straume, M., Smith, J.Q., Millar, A.J.: FLOWERING LOCUS C mediates natural variation in the high-temperature response of the Arabidopsis circadian clock. The Plant Cell 18, 639–650 (2006)
9. Locke, J.C.W., Kozma-Bognár, L., Gould, P.D., Fehér, B., Kevei, É., Nagy, F., Turner, M.S., Hall, A., Millar, A.J.: Experimental validation of a predicted feedback loop in the multi-oscillator clock of Arabidopsis thaliana. Molecular Systems Biology 2(59) (2006)
10. Schlitt, T., Brazma, A.: Current approaches to gene regulatory network modelling. BMC Bioinformatics 8(suppl. 6), S9 (2007)
11. McClung, C.R.: Plant circadian rhythms. The Plant Cell 18, 792–803 (2006)
12. Huang, Z., Lib, J., Su, H., Watts, G.S., Chen, H.: Large-scale regulatory network analysis from microarray data: modified Bayesian network learning and association rule mining. Decision Support Systems 43(4), 1207–1225 (2007)
13. Heckerman, D., Geiger, D., Chickering, D.M.: Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning 20(3), 197–243 (1995)
14. Xing, Z., Wu, D.: Modeling multiple time units delayed gene regulatory network using dynamic Bayesian network. In: Proceedings of the Sixth IEEE International Conference on Data Mining - Workshops (ICDMW 2006), pp. 190–195 (2006)
15. Steck, H., Jaakkola, T.S.: On the Dirichlet prior and Bayesian regularization. In: Becker, S., Thrun, S., Obermayer, K. (eds.) Advances in Neural Information Processing Systems 15 (NIPS 2002), pp. 697–704. MIT Press, Cambridge (2003)
16. Silander, T., Kontkanen, P., Myllymaki, P.: On sensitivity of the MAP Bayesian network structure to the equivalent sample size parameter. In: Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2007 (2007)
17. Kayaalp, M., Cooper, G.F.: A Bayesian network scoring metric that is based on globally uniform parameter priors. In: Darwiche, A., Friedman, N. (eds.) Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI 2002), pp. 251–258. Morgan Kaufmann, San Francisco (2002)
18. Steck, H.: Learning the Bayesian network structure: Dirichlet prior vs data. In: McAllester, D.A., Myllymäki, P. (eds.) Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI 2008), pp. 511–518. AUAI Press (2008)
Sequential Hierarchical Pattern Clustering Bassam Farran, Amirthalingam Ramanan, and Mahesan Niranjan School of Electronics and Computer Science, University of Southampton Southampton, SO17 1BJ, United Kingdom {bf06r,ar07r,mn}@ecs.soton.ac.uk
Abstract. Clustering is a widely used unsupervised data analysis technique in machine learning. However, a common requirement amongst many existing clustering methods is that all pairwise distances between patterns must be computed in advance. This makes it computationally expensive and difficult to cope with large scale data used in several applications, such as in bioinformatics. In this paper we propose a novel sequential hierarchical clustering technique that initially builds a hierarchical tree from a small fraction of the entire data, while the remaining data is processed sequentially and the tree adapted constructively. Preliminary results using this approach show that the quality of the clusters obtained does not degrade while reducing the computational needs. Keywords: On-line clustering, Hierarchical clustering, Large scale data, Gene expression.
1
Introduction
Clustering as a tool in pattern analysis has a wide spectrum of applications: mining in large data warehouse environments [1], dynamic routing in optical networks [6], text classification [19] and codebook construction for bag-of-keypoint visual scene analysis problems [20] are examples of this. For pattern recognition in bioinformatics, the popular use of clustering is in gene expression analysis [4], where expression profiles of genes measured across different biological conditions are clustered. Genes that fall into the same cluster may be assumed to have common functional properties, such as acting under control of the same regulatory mechanism, or acting in tandem along a signalling pathway. Another example of clustering in bioinformatics is the analysis of protein sequences to assign putative function [5]. Repositories of protein sequences have seen massive growth in recent years [18]. As the number of sequenced proteins grows at a much faster rate than those whose structure is determined, or function characterised, automatically predicting function by clustering the sequence space is an active topic of interest. While clustering algorithms and their performance characteristics have been studied extensively over recent years, a particular property of several of the new problems, including the bioinformatics problems, is their massive scale. In other areas, too, data mining examples with a million or more data points are
becoming available (e.g. KDD Cup1 ). The UniProt database of proteins [18] now consists of over six million sequences. A matrix of pairwise similarity scores of all these proteins has a file size of 2.6GB. At these scales, classical clustering algorithms such as K-means or hierarchical clustering aren’t straightforward to apply and a need for novel approaches arises. Our approach to such very large scale clustering algorithms is to study sequential and constructive algorithms, much in the spirit of the resource allocating network by Platt [14] and its variants by Kadirkamanathan et al. [7], and Molina et al. [11]. Another online approach for large scale learning (> 4M datapoints) was proposed by Farran et al. [2]. In recent work Ramanan et al. [15] developed a codebook design strategy for visual object categorization which uses a resource-allocating clustering approach. In this paper we investigate whether a formulation for hierarchical clustering by sequentially processing the data in a one-pass setting can be designed. Computational saving in such an approach will come from not having to evaluate all pairwise similarities between data items. This clearly is not possible for all the data as a measure of the scale of the distribution is required. Thus, the approach we take involves the construction of an initial tree of hierarchical clustering by processing a random subset of the data in batch mode. To a cluster structure formulated in this way, any further data may be sequentially included, adaptively changing the structure of the cluster tree, at a heavily reduced cost of similarity computations.
2
Previous Work
Clustering has been widely applied in bioinformatics in the past years. Hierarchical clustering techniques are useful for representing protein sequence family relationships [8]. Eisen et al. [4] applied a variant of the hierarchical average-linkage clustering algorithm to identify groups of co-regulated genome-wide expression patterns. Loewenstein et al. [10] applied agglomerative clustering to protein sequences that automatically builds a comprehensive evolutionary driven hierarchy of proteins from sequence alone. Frey et al. [5] applied an affinity propagation clustering technique to detect putative exons comprising genes from mouse chromosomes. While they claim lower computational cost in comparison to other algorithms, they do not include the cost of pairwise similarity computations. Since this is the most expensive stage in large scale problems, the claimed advantage is exaggerated. Our focus is on reducing the computational complexity arising from this cost. El-Sonbaty et al. [17] proposed an on-line hierarchical clustering algorithm based on the single-linkage method that finds at each step the nearest k patterns with the arrival of a new pattern and updates the nearest seen k patterns so far. Finally the patterns and the nearest k patterns are sorted to construct the hierarchical dendrogram. While they claim their method is sequential, at the arrival of each data item they compute similarity to all the data seen previously. Thus there is little computational saving in their method, and it is equivalent to re-training a new model at the arrival of new data. Contrast that with a truly sequential algorithm, where it is the model that is adapted, similar in fashion to the Kalman filter.

1 http://www.sigkdd.org/kddcup/index.php
3
Methodology
Our algorithm for sequentially updating a hierarchical tree is shown as pseudocode in Algorithm 1. We construct an initial hierarchical tree by computing all pairwise similarities between a small subset of the data, and then passing these to the Single-Round-MC-UPGMA algorithm [10]. Following the construction of the initial tree using Single-Round-MC-UPGMA, the remaining data are sequentially processed using Algorithm 1. Whenever a new pattern (xi) arrives for clustering, its similarity distance d to the root of the current hierarchical tree is computed. If d is greater than a predefined threshold (θ), a new root is created having the current pattern (xi) and the previous root as its children, and as a consequence the depth of the tree increases by one. The value of the new root is assigned the arithmetic mean of all the leaf nodes. However, if d is less than θ, the nearest child of the current node is retrieved. If the distance of xi to this child node is also smaller than θ, then we continue to find the closest child until either the distance to the current node is greater than θ or we reach a leaf node. In either of the two cases, xi is created as a sibling of the node under consideration, and xi's value is propagated up the tree to update its ancestors.

Algorithm 1. Update the initial hierarchical tree in an online fashion
Input: Root of the initial tree (CurNode), the new pattern (NewNode), and the novelty threshold (θ)
Output: Updated hierarchical tree

  simdistCN ← similarity_distance(CurNode, NewNode)
  if (simdistCN ≤ θ) then
    (*) Children ← getChildrenOf(CurNode)
    if (Children == NULL) then
      Make NewNode a sibling of CurNode and update ancestors
    else {CurNode has children}
      nearestNode ← the child in Children with minimum similarity_distance to NewNode
      if (similarity_distance(nearestNode, NewNode) ≤ θ) then
        CurNode ← nearestNode
        go to (*)
      else
        Make NewNode a sibling of CurNode and update ancestors
      end if
    end if
  else
    Make NewNode a sibling of CurNode by creating a new root
  end if
B. Farran, A. Ramanan, and M. Niranjan
Fig. 1. Hierarchical tree constructed using Single-Round-MC-UPGMA on the entire capitals dataset
Fig. 2. Hierarchical tree constructed by our approach in an online fashion with the aid of an initial tree constructed by Single-Round-MC-UPGMA. The initial tree was constructed with the first 15 capitals in the dataset. Capitals Stockholm and Washington (depicted in dotted hexagons) were sequentially inserted using Algorithm 1.
Fig. 3. Hierarchical tree constructed by the approach proposed in [17] on the entire capitals dataset
By adjusting θ, one can obtain different numbers of clusters at different levels of granularity. This threshold is seen as a hyperparameter in the algorithm and can be tuned in different ways. The specific way in which we set this is described in section 4. The leaf nodes of the hierarchical tree are the input patterns, and
Sequential Hierarchical Pattern Clustering
83
Fig. 4. Hierarchical tree constructed by our approach with the aid of an initial tree constructed using the method in [17] on the capitals dataset
each intermediate node (up to and including the root) contains the arithmetic mean of every leaf node it represents. Because of this, we need not traverse the entire tree during the update process. We use the capitals dataset2 for illustration and evaluation purposes of the proposed algorithm, as the structure of the clusters is known. We numbered the capital cities in the following way: Tallinn (1), Beijing (2), Berlin (3), Buenos Aires (4), Cairo (5), Canberra (6), Cape Town (7), Helsinki (8), London (9), Moscow (10), Ottawa (11), Paris (12), Riga (13), Rome (14), Singapore (15), Stockholm (16), Washington (17). In Fig. 1, we show the tree constructed using Single-Round-MC-UPGMA using the enire capitals data set, which is identical to the tree depicted on the data’s website. Fig. 2 shows how our approach inserts the last two points into the tree, given that the first 15 points were used for the construction of the initial tree. Cutting at the second level gives us the exact same 4 clusters obtained by cutting the tree in Fig. 1 at the same level. To illustrate the importance of having a good initial tree, we use El-Sonbaty et al. [17]’s method on the entire capitals data set (shown in Fig. 3). Fig. 4 shows how Algorithm 1 inserts the last 2 points, given that the first 15 points were used for the initial tree using El-Sonbaty et al.’s method. Even though Algorithm 1 inserted the last 2 points correctly with respect to Fig. 3, the same 4 clusters obtained by the trees in Fig.1 and Fig.2 are not attainable due to the incorrect initial tree.
4
Experiments and Results
Evaluation of clustering algorithms is not straightforward. Unlike in supervised learning problems, there is no ground truth information available to verify whether the clusters obtained are correct or not. To evaluate our algorithm we used two bioinformatic datasets. The first is Eisen et al. [4]’s gene expression clusters consisting of ten clusters formed by their clustering algorithm. We make 2
http://www.quretec.com/HappieClust/
84
B. Farran, A. Ramanan, and M. Niranjan
Fig. 5. Hierarchical tree constructed by the Single-Round-MC-UPGMA scheme on Eisen’s data clusters labelled as B and D [4]. The tree was constructed by using the whole data of the selected two clusters. The dendrogram was cut at the root node (shown in dotted lines) to obtain two clusters.
Fig. 6. Tree constructed by the proposed approach with the aid of an initial tree constructed by Single-Round-MC-UPGMA on Eisen’s clusters labelled as B and D [4]. The initial tree was constructed with 20% of the data and the rest was clustered using Algorithm 1. The dendrogram was cut at the root node (shown in dotted lines) to obtain two clusters.
the weak assumption that the clusters published by these authors are perfect associations and quantify how close our approach gets to this solution. The second dataset is from a protein fold classification problem, constructed by Ding & Dubchak [3] on a subset of the Structural Classification of Proteins (SCOP) [12] database. This is essentially a classification problem, but we apply clustering to it (without using the class labels) and evaluate how well the resulting cluster membership matches the clusters returned by Single-Round-MC-UPGMA applied on the entire subset. We combined both training and testing sets provided in [3]. To quantify the quality of clusters, we use the F1 measure, widely used in information retrieval literature, and the results given in Table 1 are the average of the 10 runs where we randomised the initial subset and the order of presentation of the remaining data. precision =
TP TP + FP
F1 =
recall =
2 × precision × recall precision + recall
TP TP + FN
Sequential Hierarchical Pattern Clustering
85
Fig. 7. Tree constructed by Single-Round-MC-UPGMA on the two selected folds Alpha: four-helical up-and-down bundle (depicted as filled circles) and Beta: ConA-like lectins (depicted as diamonds) of the SCOP data subset [3]. We used the whole 28 data points for the construction of this tree.
Fig. 8. Tree constructed by the proposed approach with the aid of an initial tree constructed by Single-Round-MC-UPGMA on the two selected folds Alpha: four-helical up-and-down bundle (depicted as filled circles) and Beta: ConA-like lectins (depicted as diamonds) of the SCOP data subset [3]. We used 20 out of 28 data for the construction of initial tree and used the rest in an online manner. Table 1. Preliminary results of the hierarchical clustering performed on a subset of SCOP and Eisen’s data. (1) Beta: ConA-like lectins, (2) Alpha: Four-helical up-anddown bundle, (3) Beta: Immunoglobulin-like beta-sandwich, and (4) A/B: beta/alpha (TIM)-barrel. Dataset SCOP (1) & (2) SCOP (3) & (4) Eisen (B & D) Eisen (C & I) Eisen (C, B & I) Eisen (C, B, D & I)
No. of data 28 151 20 104 113 124
No. of data used F1-measure or the initial tree 20 1.0 100 0.9921 ± 0.0086 4 1.0 26 1.0 25 1.0 30 0.9247 ± 0.0652
The threshold θ used in this paper was determined from the data sample used to construct the initial tree. θ was set as the sum of the pairwise Euclidean distances between the patterns in this data sample.
86
B. Farran, A. Ramanan, and M. Niranjan
(a)
(b)
Fig. 9. Tree constructed by our approach on the selected clusters of Eisen’s data (a) Clusters B, C, D and I; 30 out of 124 data were used for the construction of the initial tree. The tree was cut at the level indicated by the dotted line to yield four perfect clusters. (b) Clusters B, C, and I; 25 out of 113 data were used for the construction of the intial tree. The tree was cut at the level indicated by dotted line to yield three perfect clusters.
Fig. 5 shows the tree constructed by the Single-Round-MC-UPGMA scheme on the Eisen’s clusters labelled as B and D in [4]. Fig. 6 shows the tree obtained by using 4 points for the initial tree, and the remaining 16 points inserted using our approach. Cutting at depth 1 to obtain 2 clusters shows that we get the exact same 2 clusters as in Fig. 5. However when we used our approach on the protein fold data of Ding & Dubchak [3] (see Fig. 8), we obtained a better separation than when the entire data was used with Single-Round-MC-UPGMA (see Fig. 7). Finally, Fig. 9 shows the trees obtained by using Eisen’s data clusters (C,B,I and C,B,D, & I), and the cuts that return the desired clusters.
5
Dealing with Categorical or Symbolic Patterns
In the previous section, the illustration of the proposed algorithm was restricted to the Euclidean space. Here, we discuss the capability of our algorithm to handle categorical or symbolic patterns. Instead of updating the parent nodes using the arithmetic means of their leaf nodes, the most informative child node can be selected to act as a parent. However, for simplicity in this paper, we choose a random child to act as the parent node in Algorithm 1. Also, the measure of similarity used depends on the application of interest, and is not limited to numerical data. For example, if we are clustering protein sequences, then Smith-Waterman local alignment [16] or Needleman-Wunsch global alignment [13] measures can be used instead of Euclidean distances. For immediate comparison purposes with the previously shown cluster trees in section 3 (Capitals data and the selected subsets of Eisen’s data), we used the Euclidean distances with the selective node approach. Our preliminary results gave us exactly the same clusters as in the Euclidean case, and hence the same F1-measure.
Sequential Hierarchical Pattern Clustering
6
87
Conclusion and Future Work
In this paper we present an algorithm for on-line hierarchical clustering. The approach depends on the construction of an initial clustering stage using a random subset of the data. This establishes a scale structure of the input space. Subsequent data can be processed sequentially and the tree adapted constructively. We have shown that on small bioinformatics problems such an approach does not degrade the quality of the clusters obtained while saving computational cost. The proposed technique could be significantly improved with an appropriate choice of the novelty threshold (θ). θ can be better estimated by taking into account the inter-cluster and/or intra-cluster information of the initial tree. This can be subsequently updated after the insertion of a newly arrived pattern. Another way of better estimating θ might be to use local thresholds associated with each parent or level of the tree, instead of a global threshold. The greatest benefit of the proposed technique lies in its application on very-large datasets such as UniRef90 proteins [18]. Acknowledgments. We thank the anonymous reviewers for their useful comments on this paper. AR is partially supported by a grant from the School of Electronics and Computer Science, University of Southampton, United Kingdom, and University of Jaffna, Sri Lanka, under the IRQUE Project funded by the World Bank.
References
1. Achtert, E., Böhm, C., Kriegel, H.-P., Kröger, P.: Online Hierarchical Clustering in a Data Warehouse Environment. In: Proceedings of the Fifth IEEE International Conference on Data Mining, pp. 10–17 (2005)
2. Farran, B., Saunders, C.: Voted Spheres: An Online Fast Approach to Large Scale Learning. In: IEEE International Symposium on Mining and Web (2009)
3. Ding, C.H., Dubchak, I.: Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics 17(4), 349–358 (2001)
4. Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D.: Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences USA 95(25), 14863–14868 (1998)
5. Frey, B.J., Dueck, D.: Clustering by Passing Messages between Data Points. Science 315, 972–976 (2007)
6. Hasan, M., Jue, J.: Online Clustering for Hierarchical WDM Networks. In: IEEE/OSA Conference on Optical Fiber Communication, San Diego, CA, pp. 1–3 (2008)
7. Kadirkamanathan, V., Niranjan, M.: A Function Estimation Approach to Sequential Learning with Neural Networks. Neural Computation 5, 954–975 (1993)
8. Kaplan, N., Friedlich, M., Fromer, M., Linial, M.: A functional hierarchical organization of the protein sequence space. BMC Bioinformatics 5, 196 (2004)
9. Kull, M., Vilo, J.: Fast approximate hierarchical clustering using similarity heuristics. BioData Mining 1, 9 (2008)
10. Loewenstein, Y., Portugaly, E., Fromer, M., Linial, M.: Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space. Bioinformatics 24(13), 41–49 (2008)
11. Molina, C., Niranjan, M.: Pruning with replacement on limited resource allocating networks by f-projections. Neural Computation 8(4), 855–868 (1996)
12. Lo Conte, L., Ailey, B., Hubbard, T.J., Brenner, S.E., Murzin, A.G., Chothia, C.: SCOP: a Structural Classification Of Proteins database. Nucleic Acids Research 28(1), 257–259 (2000)
13. Needleman, S.B., Wunsch, C.D.: A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of two Proteins. Journal of Molecular Biology 48(3), 443–453 (1970)
14. Platt, J.C.: A Resource-Allocating Network for Function Interpolation. Neural Computation 3, 213–225 (1991)
15. Ramanan, A., Niranjan, M.: Designing a Resource-Allocating Discriminant Codebook for Visual Object Recognition. Neural Computation (2009) (under review)
16. Smith, T.F., Waterman, M.S.: Identification of Common Molecular Subsequences. Journal of Molecular Biology 147, 195–197 (1981)
17. El-Sonbaty, Y., Ismail, M.A.: On-line hierarchical clustering. Pattern Recognition Letters 19, 1285–1291 (1998)
18. Wu, C.H., Apweiler, R., Bairoch, A., Natale, D.A., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., et al.: The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Research 34, D187–D191 (2006)
19. Zhao, Y., Karypis, G., Fayyad, U.: Hierarchical Clustering Algorithms for Document Datasets. Data Mining and Knowledge Discovery 10(2), 141–168 (2005)
20. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study. International Journal of Computer Vision 73, 213–238 (2007)
Syntactic Pattern Recognition Using Finite Inductive Strings

Paul Fisher1, Howard Fisher2, Jinsuk Baek1, and Cleopas Angaye3

1 Department of Computer Science, Winston-Salem State University, Winston-Salem, North Carolina, United States
2 Fisher Company, Salt Lake City, Utah, United States
3 National Information Technology Development Agency, Nigeria
Abstract. A syntactic pattern recognition technique is described, based upon a mathematical principle associated with finite sequences of symbols. The technique allows fast recognition of patterns within strings, including the ability to recognize symbols that are close to the expected ones, mutations, and both local and global substring matching. This tolerance of deviation permits sequences that contain errors to still be recognized. Some examples are provided illustrating the technique. Keywords: Pattern recognition, finite inductive sequences, syntactic pattern recognition, genome recognition.
1 Introduction

We provide a description, including the properties, of finite inductive sequences, and two programs which utilize this formulation to process strings of unknown patterns for recognition of subsequences of symbols that occur within other sequences. Syntactic pattern recognition systems have been, and continue to be, applied within the biomedical field as well as in many other fields. The basis for this work began many years ago with the consideration of sequences of control structures for autonomous robots. Needing a concise and simple way of representing learned activities, and using that learned experience to drive the next steps, we proposed the concept of Finite Inductive Sequences. As new activities were experienced, the results would provide new data to be incorporated into the 'learned' model. Both stationary models, in which the rules do not evolve (and which are therefore less interesting), and non-stationary models, in which the rules are altered to adapt to changing conditions, provided the impetus to find a solution [3]. The need for more efficient pattern detection algorithms is especially urgent in bioinformatics, where genome-wide association studies (GWASs) are becoming increasingly popular in determining the relationship between genetic variations, such as single nucleotide polymorphisms (SNPs) and copy-number variations (CNVs), and disease. Current software tools used to analyze these large datasets are computationally intensive. The strength of the finite inductive method is in its ability to determine small variations over a large data set. We believe that this algorithm is ideally suited to improve the speed of GWASs. We note that many tools exist for processing the
genome sequences, and new applications using graphical display, color, 3D and other extensions have been and are being developed [2]. At issue is both speed and accuracy, with sufficient flexibility to deal with insertions and deletions in the sequence comparison processes. This has given birth to extensions of the dynamic programming approach first developed for the longest common subsequence (LCS) problem, with modifications to the Smith-Waterman algorithm implemented in both hardware and software systems [1], [5], [6], [7], [8], [10]. These systems utilize algorithms ranging over neural nets, Bayesian statistics, tree structures and many other approaches incorporating video algorithms, visualization, data mining, etc., and each has been used in the analysis and recognition process for both protein and amino acid sequences. One need only look at any conference or journal to find articles detailing such research and development activities. All implemented systems must incorporate work from many aspects of science to accomplish the continued analysis of the data, which arrives at a rate estimated to be two million DNA bases each day [8]. We take some space to describe the algorithm which we propose for the purpose of subsequence processing, and indicate some extensions this approach provides. We also provide a measure of the storage and processing complexity of this approach, and lastly, we provide some small examples illustrating the application of FI to genomic problems.
2 Finite Inductive Sequences

In the next sections we describe first a sequence of symbols ('string') which is finitely inductive. Following that, we provide a description of the algorithm for building a set of rules from exemplar sequences together with their associated storage structure, called a ruling. Lastly, we show how these rules can be used to recognize patterns in similar types of sequences. Here similar implies that all sequences are formed from identical alphabets. 2.1 Definition of Finite Inductive Sequences In this section, we will describe a particular kind of sequence composed of symbols coming from some alphabet, such as the representation of the genome with the alphabet A, C, G and T. The technique which we shall describe resembles neural net capabilities [9] and emulates Markov models, but with some important differences to both approaches. Consider one or more sequences of symbols; we can process them as a single entity or as multiple strings representing data from several sources to be processed in parallel. These sequences we have termed Finitely Inductive Sequences (FI), and such sequences form a countable collection of sequences of finite length. (We will not address infinite sequences, although there is a class of such infinite sequences that are amenable to this approach.) FI has a formal background in mathematics and lies between a mathematical characterization, on the one hand, and a statistical characterization, on the other. The idea is generally to deal with sequences or subsequences observed previously—such as might have been seen previously or learned through exemplar sequences—then such
sequences can be used to recognize new, similar subsequences from other strings that one desires to process. Unlike neural nets, it is additionally possible to dynamically modify the data structure formed from exemplar sequences so that new experience can be acquired in near real time from new data sequences. The primary data structure of FI, called a ruling, is a finite state machine that can use a short driving sequence to generate another sequence that may be much longer, or it can be used in reverse to produce a residual short sequence from a presented long sequence. The residual is the 'left over' symbols from the string that are not part of the experience base encapsulated within the ruling. In a sense, the size of the residual is a measure of the common experience between an unknown event sequence and the string(s) placed in the ruling. The ruling embodies patterns of interest, and acts as a filter through which unknown strings can be passed, looking for identical or similar subsequences found within the ruling and the unknown string. Finite sequences of symbols are said to be finitely inductive (FI) over a finite alphabet if the choice of any symbol at any particular position in a sequence depends immediately upon only the choices of symbols at the previous n points (similar to Markov models, except that this is generally deterministic rather than probabilistic). Given an FI sequence (deterministic), an implicant is a pair (w, p) consisting of a word w over the alphabet and a symbol p of the alphabet such that w occurs at least once as a substring of the sequence, and whenever w occurs as a substring of the sequence there is a succeeding entry and it is p. For non-deterministic implicants, there might be two or more values for p with a probability associated with each such value. The antecedent is w while p is called the consequent. A reduced form implicant has no proper segment of its antecedent in another implicant. We note the following:
a. Every finite sequence is finitely inductive.
b. For any finitely inductive sequence, the inductive base is the maximal length of the antecedents when considering all of its reduced form implicants.
c. If an FI sequence has inductive base n and an alphabet of k symbols, then k^n is an upper bound for the number of its reduced form implicants.
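These definitions translate into a small search for a deterministic antecedent at each position of a string. The following is a minimal illustrative sketch (not code from the paper): for every position of a finite sequence it finds the shortest preceding word whose occurrences are always followed by the same symbol, and takes the longest such antecedent as the inductive base. The paper's reduced-form condition adds a further constraint on overlapping antecedents, so the exact rule set it produces can differ slightly from this sketch.

```python
def shortest_deterministic_rules(seq):
    """For each position i, find the shortest antecedent w ending at i-1 such
    that every occurrence of w in seq has a successor and that successor is
    always seq[i] (a deterministic implicant in the sense defined above)."""
    rules = {}
    n = len(seq)
    for i in range(1, n):
        for length in range(1, i + 1):
            w = seq[i - length:i]
            occ = [j for j in range(n - length + 1) if seq[j:j + length] == w]
            if all(j + length < n for j in occ) and \
               len({seq[j + length] for j in occ}) == 1:
                rules[w] = seq[i]
                break
    return rules

rules = shortest_deterministic_rules("S2446664882443")  # the example string used in Section 2.2
print(max(len(w) for w in rules))                       # prints 4: the inductive base of this string
```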
2.2 Description of the Storage Structure for Finite Inductive Sequences

Figure 1 is an example of a function table (called a Ruling) resulting from one or more FI sequences, meeting the requirement for an optimal storage system for streams of symbols. The system consists of a structure with k levels. The rules are defined from the exemplar patterns according to a push-up formulation of the strings. The inductive base in Figure 1 is as defined above and is equivalent to the longest antecedent in reduced form. The rules are formulated much like those characterizing a Markov process. (We point out another difference between Markov models and FI sequences: within the theory of FI sequences, the order of the sequence, i.e., its inductive base, can be altered to any a priori value desired, unlike the order of a Markov model.) Given some number n, the inductive base, and a sequence of symbols, we can say that the previous n characters uniquely determine the next character in the string. As the inductive base is reduced, the number of levels will increase. The number of rules remains nearly invariant as the number of levels changes. The last three elements in Figure 1—the rules, starting sequence, and the shift register—will be explained by an example.
[Figure 1 shows the k-level structure of a ruling: each level (Level 1 up to Level k) contains its Rules, a Starting Sequence, a Shift Register of width equal to the Inductive Base, an Output, and an Appeal connection to the next level up.]
Fig. 1. The FI Ruling Structure for Representing a Sequence(s) in k-Levels
Consider the following string to build the structure of Figure 1. We note that the starting symbol 'S' is not part of the sequence, but it allows a simpler representation.

S2446664882443        (1)
This sequence comes from an alphabet of eight symbols: 1, 2, 3, 4, 5, 6, 7, 8. It is finite in length, and therefore is finitely inductive. We can form rules for this sequence as a function of the inductive base. However, since the inductive base must be large enough to uniquely guarantee the next symbol, an inductive base of two or less would be unsuitable, since the occurrence of '66' does not uniquely specify the next symbol ('66' precedes both a '6' and a '4'). We can easily extend this representation to the non-deterministic case and to a probabilistic case by specifying that the sequence 66 specifies both a '6' and a '4'; if we associate a probability with the two alternatives, then we have the probabilistic representation. We illustrate the table that is generated for the reduced form rules. The first rule in (2) states that each time we see the antecedent 'S', then a '2' follows, without ambiguity, as the next symbol in the string.

S → 2,   S2 → 4,   24 → 4,   S244 → 6,   8244 → 3,   466 → 6,
666 → 4,   64 → 8,   46 → 6,   48 → 8,   88 → 2,   82 → 4        (2)
From (2) the inductive base is 4 (the largest of the antecedent lengths). Some of the rules have inductive base 1 and 2; when this variability occurs, we can reduce the inductive base (here 4) to any smaller value. Suppose we wish to use an inductive
base of 2, we would get the FI structure of Figure 1, and the process (called Factoring) for deriving this new FI ruling structure is shown in Example 1.

Example 1. Simple String Factoring and Regeneration
For the string S2446664882443, using the reduced rules (2), define the ruling as follows:

Step 1: Keep all rules shown in (2) with inductive base 2 or less. For everything else, push the consequent symbol up out of the string to the next level, which we designate Level 1. This results in the following pattern:

Level 1:   S 6 6 4 3
Level 0:   S2446664882443
The rules for Level 0 are those representing all symbols not pushed up:

S → 2,   S2 → 4,   24 → 4,   46 → 6,   64 → 8,   48 → 8,   88 → 2,   82 → 4
Step 2: For the string of Level 1, we find the reduced rules. These are:

S → 6,   S6 → 6,   66 → 4,   4 → 3
This process has produced rules of inductive base less than or equal to 2, in a ruling of two levels, and since there are no symbols that need to be processed further, we stop the process. We also point out that the symbols in Level 1 are called the residual from Level 0. As noted, there is no residual from Level 1. We also point out that every FI string of finite length produces a ruling where the last level has no residual, when that string is processed independently of any other string. The inductive base for the sequence is 2. The starting sequences are loaded in the shift registers, and the system is ready to regenerate the original sequence. This is done in the following steps:

Step 1: The S in the shift register of Level 0 is matched against the antecedents of the rules in that level. In this case the antecedent of the rule S → 2 matches the character in the shift register.
Step 2: Push the consequent '2' into the right side of the shift register.
Step 3: Since the shift register now contains 'S2', see whether this string or the symbol '2' (top of the shift register) matches any antecedent in this level. There is a match, so the consequent '4' is placed in the top of the shift register, and the 'S' is output.
Step 4: Again the shift register content of '24' is matched to the antecedents of Level 0. A match is found, and the consequent '4' is pushed into the shift register while the '2' is pushed out to the output.
Step 5: The Level 0 shift register now contains '44', and this does not match any of the antecedents in Level 0, so an appeal is made to Level 1. The current status is now:
Level 1: shift register: S
Level 0: shift register: 44
and the output 'S2' has been generated.
Step 6: In Level 1, the contents of the shift register are matched, the consequent '6' is placed in the shift register, and the '6' is pushed down to Level 0 and placed in the top of the shift register for that level. Now we have the following configuration:
Level 1: Shift Register: '6'
Level 0: Shift Register: '46'
Output: 'S24'
Step 7: Continue with processing in Level 0 until no processing can continue, and then output the shift register of Level 0 to complete the original sequence regeneration.

From this simple example, we see the FI ruling structure: how the rules are fashioned and processed using Figure 1, and how the string is regenerated. We have seen that the inductive base was reduced to a smaller value for the example. Non-FI strings are those that are pseudo-random; in this case the antecedents of all rules have equal length, or they grow at the same rate as the string being processed. Examples are strings that have no pattern, such as the expansion of π or other non-repeating, non-terminating sequences (transcendental numbers). We have looked [3] at the occurrence of non-FI strings in the case where binary symbols are used as the alphabet, and their occurrence is rare. We have also shown [3] that when the alphabet is richer than binary, the occurrence of non-FI sequences is even less frequent.

2.3 Application of a Ruling in Recognizing an Unknown Pattern

The next example uses the Ruling shown in Table 1, resulting from the simultaneous factoring of four strings. Each string uses the same alphabet, although this is not a hard requirement. We also state the obvious: the more data one has about a pattern, the more accurately one can identify that pattern in unknown sequences. In order to show how rulings apply in terms of recognition of subsequences, we now provide some patterns. In this case, we are no longer concerned about regenerating the original sequences from the rulings.

Example 2. Ruling Built from Four Sequences (Simultaneous Factoring)
Suppose there are four strings designated a, b, c and d. We have added the start symbol 'S' to each string for convenience. Consider the following four strings:

a: S2446664882443          b: S66666254322187
c: S666662225442217        d: S7765442218
Since we have started each string with the symbol S, this will shorten the residuals from each string. As can be seen from the four strings, the starting implicants are S → 2, S → 6 and S → 7; however, since this is simultaneous factoring and these implicants are contradictory, they are not acceptable in the deterministic case. We require in simultaneous factoring that the implicants be consistent intra- as well as inter-string. The implicants S2 → 4, S6 → 6 and S7 → 7 are consistent for all four
sequences. The final residuals shown in (3) for each of the strings are those symbols that remain due to intra- or inter-string conflicts. The resulting table of implicants, augmented with their source string, for each of the levels in the FI ruling table is given in Table 1, with the inductive base held to 2.

a: S → 2    b: S6 → 5    c: S6 → 2    d: S → 7        (3)
Table 1. FI Ruling Table for the Strings a, b, c and d with an Inductive Base of 2
[The table distributes the following implicants over Levels 1–5, each annotated with its source string(s): S2 → 4 (a); S2 → 6 (a); S2 → 4 (a); S6 → 6 (b, c); 62 → 2 (c); S6 → 6 (b, c); S6 → 6 (b, c); S6 → 2 (b, c); 22 → 5 (c); S7 → 7 (d); S6 → 6 (b, c); S7 → 4 (d); S7 → 8 (d); 65 → 3 (b); 82 → 4 (a); 32 → 1 (b); 3 → 8 (b); 25 → 4 (c); 32 → 2 (b); 42 → 1 (d, c); 54 → 7 (c); 42 → 2 (d, c); 53 → 2 (b); 38 → 7 (b); 24 → 4 (a); 64 → 3 (a); 64 → 8 (a); 54 → 2 (c); 5 → 4 (b, c, d); 74 → 2 (d); 46 → 6 (a); 26 → 6 (a); 76 → 5 (d); 77 → 6 (d); 48 → 8 (a); 88 → 2 (a).]
The processing that takes place in this example shows how the incoming string is matched with the known data as represented by the implicants contained within the ruling. If an antecedent matches, then we can delete the corresponding consequent symbol from the unknown string at the end of the process for this level. We can mark the position of this deleted symbol for later identification. The residual is the string remaining after all levels have removed the consequents matching the antecedents of the appropriate implicants in each level. The algorithm used in this matching process is called Following. As was indicated above, the residual is important: its length, as a percentage of the original string, suggests an overall measure of similarity between the ruling and the string. Residuals also represent a measure of similarity based upon differences between multiple strings, with the ruling acting as the adjudicator. These residuals in a sense represent the entropy of the string when the universe of experience or knowledge is contained within the ruling.
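To make the Following process concrete, here is a minimal illustrative sketch (not code from the paper). A ruling is represented as a list of levels, each level being a dictionary of antecedent → consequent rules; at each level, every symbol predicted by a matching antecedent is deleted, and the survivors form the residual passed to the next level. The details of antecedent matching are one plausible reading of the description above.

```python
def follow(ruling, s, max_base=2):
    """Apply Following: remove, level by level, the symbols predicted by the
    ruling; return the final residual (a short residual means a good match)."""
    residual = s
    for rules in ruling:
        survivors = []
        for i, sym in enumerate(residual):
            predicted = any(rules.get(residual[max(0, i - length):i]) == sym
                            for length in range(1, max_base + 1))
            if not predicted:
                survivors.append(sym)
        residual = "".join(survivors)
    return residual

# Ruling from Example 1 (Level 0 rules, then Level 1 rules):
ruling = [
    {"S": "2", "S2": "4", "24": "4", "46": "6",
     "64": "8", "48": "8", "88": "2", "82": "4"},
    {"S": "6", "S6": "6", "66": "4", "4": "3"},
]
print(follow(ruling, "S2446664882443"))   # only the start symbol 'S' survives
```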
Example 3. Recognition of Patterns in a New Sequence
Consider the following two strings formed from the newly sampled events. We will designate them by e and f. The strings are:

e: s666662278122345
f: s666662543443244666488244187        (4)
The Following results are shown in Table 2, when the Ruling of Table 1 is applied to the two strings. In addition, the implicants used from Table 1 have their originating string annotation, and this annotation is copied under the bottom row of Table 2 in the line called 'Source'. In order to represent implicants with multiple sources, we encode the sources as: (a) = 1, (b) = 2, (c) = 3, (d) = 4, (b, c) = 5, (d, c) = 6, (b, c, d) = 7.

Table 2. Following Applied to the Two Patterns e and f Respectively Using Table 1
Level   For String e
5: s 6 2278122345
4: s 6 62278122345
3: s 6 662278122345
2: s 6 6 6 6 2 2 7 8 1 2 2 3 4 5
1: s 6 6 6 6 6 2 2 7 8 1 2 2 3 4 5
Source: 5555 3

Level   For String f
5: s 6 25 344324 6664 221
4: s 6 625 344324 6664 221
3: s 6 6625 344324 6664 221
2: s 6 6 6 6 2 5 3 4 4 3 2 4 6 6 6 4 221
1: s 6 6 6 6 6 2 5 4 3 4 4 3 2 4 4 6 6 6 4 8 8 2 2 2 1
Source: 5 5 5 5 7 1 111

From the results of the Following over e and f shown in Table 2 we conclude:
• Neither string matches very well the data contained in the ruling.
• Best matches come from a combination of b and c for string e.
• The best match for string f comes initially from b or c, and then evolves into the string of a.
• This would indicate, in both sequences e and f, that the data in the ruling is not sufficiently rich to represent the sequence, so this data should be added to the ruling.
• Modification of the ruling on the fly can be accommodated, and depending upon storage space, one can add strings at will.
• In order to control growth, the implicants can be eliminated based upon a usage criterion. A counter can be added to each rule, and when the counter drops beneath some usage limit, the rule can be deleted, or the sequence containing the rule can be deleted.
• When a symbol has not been recognized, but the antecedent was matched, then that symbol represents a possible single-symbol mutation. In this case, the system will substitute the consequent in the string, mark the substitution, and continue recognition with the altered symbol. This is called Replacement Following.
• For sequence substitutions, the sequence will be processed with the ruling, and the substituted sequence will not be recognized; except for accidental similarities, replacement will not solve the similarity problem. Thus there will be a block of symbols that will remain in the residual. These symbols can be used to form a new ruling, linked to the previous ruling, so that over time an assemblage of rulings will be formed that represents the evolution of the original string.
3 Finite Inductive Sequences and Biometric Pattern Examples

There are several applications that are of interest using the model that we have proposed. The first simply deals with recognition of a gene from one source to another. If we extend the recognition technique described in Section 2.3, we can deal with the mutation issues within a gene. Besides the Exact Following of Section 2.3, there are several more extensions. The first allows recognition of genes that have changed from the exemplars stored in the Ruling by mutations of deletion, duplication, inversion, insertion or translocation. Consider the issue of insertion and the Skip Following process. In this process the user defines the size of a skip region; in the gene, this corresponds to a region that potentially has been inserted and is not in the exemplar set of the ruling. We now provide some examples to show how FI can be used in genomic pattern matching when various kinds of mutations are allowed.

Example 4. Mutation by insertion
Consider sequence (5) as an exemplar sequence for the ruling content.

GTGAGTGGTCTTAGGTGAGTCAGTGCAG        (5)
The ruling is given in Table 3, where the levels are divided by horizontal lines and the Order column gives the order in which the implicants were formed in processing (5).

Table 3. Ruling for Sequence (5)
[Table body: the implicants of the ruling, one per row, with columns Antecedent, Consequent and Order; the Order entries give, for each implicant, the positions in (5) at which it was formed.]
Now consider the new sequence and the associated positions:

      G T G A G T G G T  C  c  g  t  a  t  T  T  A  G  G  T  G  A  G        (6)
Loc:  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

We have defined a variable called Skip in the system to permit ignoring some number of symbols in the recognition (Following) process. Processing the string of (6) with the ruling of Table 3, we get the results shown in (7) when we process left to right:

Residual:  g t a t T T
Lev 1:     S G G g t a t T T
Lev 0:     S G T G A G T G G T C c g t a t T T A G ...        (7)
Rule:      1 2 3 4 5 6 7 8 9 10 21 13 14 ...

We now know that there is a difference between the ruling and the unknown sequence because of the residual. If we set the skip value to 6, for example, then starting from the first symbol that occurs in the residual (the g) and adding the skip value gives the next starting symbol, the A in position 18. Looking for a rule in Table 3 after rule 10, maintaining rule order (another parameter setting in the system), we find for the A rules 4, 13, 18 and 27. All of these rules are unacceptable except rule 13, due to its antecedent TT and its relative order. Since the skip value is only a guess, we need to see whether backing up prior to location 18 can trim some symbols off of the mutation region. We also note that rule 21 is not in the correct place, so removing 21 and backing up from location 18, we get the final insertion mutation candidate: c g t a t. The skip value, being arbitrary for this example, can be set by the user to an expected value, or just set to some maximal value. The only difference in reality is the length of time required to search backwards from the skipped location. If rule order is maintained, then the processing will be much faster, since all previous rules used up to the first location where a nucleotide appears in the residual will not need to be considered.

Example 5. SNP Processing
Suppose we have an exemplar sequence around a SNP locus where an allele exists as shown in Table 4. The problem is to determine the frequency of SNPs over a set of loci to allow likelihood patterns for determining both missing nucleotides and the predominant nucleotide in a chromosome.

Table 4. SNP Patterns from Individuals (Rows) and from a Chromosome (Column)
Id.   M     Id.   M     Id.   M
85    CC    22    CG    45    CG
91    CC    29    CC    41    CC
93    CG    34    CG    57    CC
94    CG    48    CC    61    CG
83    CC    55    CG    27    CC
The symbols surrounding these alleles are identical in the chromosome; suppose that we define the precursor symbols to be CACC, with the two possible sequences for M being CACCC or CACCG, with frequencies of eight and seven. If one does not know where the SNP happens to reside, then one can follow one gene in one chromosome from the ruling of a gene in the exemplar chromosome with the Skip value set to one, or one can use Replacement Following and keep track of the loci and types of replacements made. Consider the rules derived for the strings CACCC and CACCG. The rules, permitting non-determinism, are shown in (8). The current, internal form of the Ruling is a Finite State Machine (FSM); to add non-determinism, we simply add another state for the additional consequent, and a counter to record the number of times each rule was used. For the last two rules of (8), a single rule of the form CC → C(8)/G(7) would suffice in the FSM, where the 8 and 7 are the frequency counts for occurrences.

S → C,   SC → A,   A → C,   AC → C,   CC → C(8),   CC → G(7)        (8)
For Replacement Following, suppose the exemplar sequence produced only the first five rules, implying that the symbol pair CC always produces a C. With replacement, suppose we found the string CACCG; the result would be to delete the G and replace it with the C. If counting is turned on, then the number of each of the replaced symbols is counted. Additionally, a position marker can be output to identify the location of each replacement.

Example 6. Trace for Following
Suppose we have the ruling of Table 3 and the unknown sequence shown in (9). Here the sequence is only similar to that used as the exemplar in Table 3. One of the things which the system will do is provide a trace of the implicant usage.

S G T T G G C C C G T G G T T T T A G G T G A A T G G G T G C A G        (9)

[Displays (9a), (9b) and (9c) appear here, annotated with the numbers of the implicants used at each step.]
(9a) represents the Level 0 implicant application to the sequence (9), and (9b) represents the residual from Level 0. (9b) and (9c) represent the processing from Level 1. The final residual is given in (10).

S TGG  CGTGG  TTT  ATGGG  G        (10)
The residual contains seventeen symbols, implying that fifteen symbols were deleted. The more interesting observation is the subsequence of symbols beginning with rule number 13 and running consecutively through rule number 18, a cluster with no skipped nucleotides, indicating the presence of a subsequence of potential interest.
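As a concrete illustration of the Replacement Following and counting ideas of Example 5, the following is a minimal sketch (not code from the paper); the rule lookup order (longest antecedent first) and the data structures are assumptions made for this illustration.

```python
from collections import defaultdict

def replacement_follow(rules, s, max_base=2):
    """Hedged sketch of Replacement Following: when an antecedent matches but
    the observed symbol differs from the rule's consequent, the consequent is
    substituted, the position is recorded, and a counter for the observed
    variant is incremented (e.g. how often CC was followed by C vs. G)."""
    out = list(s)
    replacements = []                 # (position, observed, substituted)
    counts = defaultdict(int)         # frequency of each (antecedent, observed) pair
    for i in range(1, len(out)):
        for length in range(max_base, 0, -1):       # try longest antecedent first
            w = "".join(out[max(0, i - length):i])
            if w in rules:
                expected = rules[w]
                counts[(w, out[i])] += 1
                if out[i] != expected:
                    replacements.append((i, out[i], expected))
                    out[i] = expected                # continue with the altered symbol
                break
    return "".join(out), replacements, dict(counts)

# Deterministic variant of the first five rules of (8), i.e. CC always -> C:
rules = {"S": "C", "SC": "A", "A": "C", "AC": "C", "CC": "C"}
print(replacement_follow(rules, "SCACCG"))   # the G is replaced by C, position 5 recorded
```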
4 Conclusion

The need for exact matching in a fast algorithm is important for continued research and development in genomic understanding and disease characterization. We have implemented the algorithms for both Factoring and Following, and they operate in their basic mode in near O(n) time [4], where n is the length of the sequence to follow. With the newest version of the algorithm, based upon a direct addressing scheme for the antecedents in an implicant, we have yet to obtain real performance data. It should be clear from how the ruling is structured that one can achieve faster processing by placing the levels of a ruling on several independent processors and connecting the processors in a pipeline configuration. This provides a simple, pipelined, parallel algorithm with speed O(n/p), where p is the number of processors. Alternatively, one can place the entire ruling on each of several processors in a cluster or network, and then supply overlapping (by the inductive base) subsets of the unknown string to each processor. Again the process complexity becomes O(n/p). In order to achieve these speeds, we first used a finite state machine model, indicated earlier, as the ruling structure. Storage space has an upper limit of (inductive base) × n, again where n is the length of the string to be factored. With the new algorithm we use a large amount of internal memory to hold the ruling, and the Following process is simply a direct access into that storage structure. This has limited the inductive base to approximately three codons; the access cost for any nucleotide is then, if found, about half the number of ruling levels, and if not found, simply the number of levels in the ruling. Since this number is small, the complexity for Following would be at most O(n × number-of-levels-in-ruling), which is O(n·k) or just O(n). This is the more important time constraint, due to the fact that a ruling is created once and used many times in identifying patterns.
References
1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. Journal of Molecular Biology 215(3), 403–410 (1990)
2. Buckingham, S.D.: Scientific software: seeing the SNPs between us. Nature Methods 5, 903–908 (2008)
3. Case, J., Fisher, P.S.: Long Term Memory Modules. Bulletin of Mathematical Biology 46(2) (1984)
4. Das, S., Fisher, P.S., Zhang, H.: Efficient Parallel Algorithms for Pattern Recognition. In: Proceedings of the Twenty-Sixth Hawaii International Conference on System Sciences (January 1993)
5. Pearson, W.: Flexible sequence similarity searching with the FASTA3 program package. In: Misener, S., Krawetz, S.A. (eds.) Bioinformatics Methods and Protocols, pp. 185–219. Humana Press, Totowa (2000)
6. Pearson, W.R., Lipman, D.J.: Improved tools for biological sequence comparison. Proc. National Academy of Sciences 85, 2444–2448 (1988)
7. Smith, T.F., Waterman, M.S.: Identification of Common Molecular Subsequences. Journal of Molecular Biology 147, 195–197 (1981)
8. Uberbacher, E.: Computing the Genome, http://www.ornl.gov/info/ornlreview/v30n3-4/genome.htm
9. Wang, G.Y., Fisher, P.: Knowledge Acquisition: Neural Network Learning. In: Data Mining and Knowledge Discovery: Theory, Tools, and Technology II, SPIE, vol. 4057, pp. 117–128 (2000)
10. Xu, Y., Mural, R.J., Einstein, J.R., Shah, M., Uberbacher, E.C.: GRAIL: A Multi-Agent Neural Network System for Gene Identification. Proceedings of the IEEE 84(10), 1544–1552 (1996)
Evidence-Based Clustering of Reads and Taxonomic Analysis of Metagenomic Data

Gianluigi Folino1, Fabio Gori2, Mike S.M. Jetten2, and Elena Marchiori2

1 ICAR-CNR, Rende, Italy
2 Radboud University, Nijmegen, The Netherlands
[email protected]
Abstract. The rapidly emerging field of metagenomics seeks to examine the genomic content of communities of organisms to understand their roles and interactions in an ecosystem. In this paper we focus on clustering methods and their application to taxonomic analysis of metagenomic data. Clustering analysis for metagenomics amounts to grouping similar partial sequences, such as raw sequence reads, into clusters in order to discover information about the internal structure of the considered dataset, or the relative abundance of protein families. Different methods for clustering analysis of metagenomic datasets have been proposed. Here we focus on evidence-based methods for clustering that employ knowledge extracted from proteins identified by a BLASTx search (proxygenes). We consider two clustering algorithms introduced in previous works and a new one. We discuss advantages and drawbacks of the algorithms, and use them to perform taxonomic analysis of metagenomic data. To this aim, three real-life benchmark datasets used in previous work on metagenomic data analysis are used. Comparison of the results indicates satisfactory coherence of the taxonomies output by the three algorithms, with respect to phylogenetic content at the class level and taxonomic distribution at phylum level. In general, the experimental comparative analysis substantiates the effectiveness of evidence-based clustering methods for taxonomic analysis of metagenomic data.
1 Introduction
The rapidly emerging field of metagenomics seeks to examine the genomic content of communities of organisms to understand their roles and interactions in an ecosystem. Given the wide-ranging roles microbes play in many ecosystems, metagenomics studies of microbial communities will reveal insights into protein families and their evolution. Because most microbes cannot grow in the laboratory using current cultivation techniques, scientists have turned to cultivation-independent techniques to study microbial diversity. At first, shotgun Sanger sequencing was used to survey the metagenomic content, but nowadays massively parallel sequencing technologies like 454 or Illumina allow random sampling of DNA sequences to examine the genomic material present in a microbial community [1].
Corresponding author.
For a given sample, one would like to determine the phylogenetic provenance of the obtained fragments, the relative abundance of its different members, their metabolic capabilities, and the functional properties of the community as a whole. To this end, computational analysis is becoming increasingly indispensable [2,3]. In particular, clustering methods are used for rapid analysis of sequence diversity and internal structure of the sample [4], for discovering protein families present in the sample [5], and as a pre-processing step for performing comparative genome assembly [6], where a closely related reference organism is employed to guide the assembly process. In this paper we focus on clustering methods and their application to taxonomic analysis of metagenomic data. Clustering analysis for metagenomics amounts to grouping similar partial sequences, such as raw sequence reads or candidate ORF (Open Reading Frame) sequences generated by an assembly program, into clusters in order to discover information about the internal structure of the considered dataset, or the relative abundance of protein families. Different methods for clustering analysis of metagenomic datasets have been proposed, which can be divided into two main approaches: sequence-based and evidence-based methods. Sequence-based methods compare sequences directly, using a similarity measure based either on sequence overlapping [4] or on extracted features such as oligonucleotide frequency [7]. Evidence-based methods employ knowledge extracted from external sources in the clustering process, such as proteins identified by a BLASTx search (proxygenes) [5]. In this paper we focus on the latter approach for clustering short reads. We consider two clustering algorithms introduced in previous works [5,8] and a refinement of the latter one based on ensemble techniques. These algorithms cluster reads using weighted proteins as evidence. Such proteins are obtained by a specialized version of BLAST (Basic Local Alignment Search Tool), called BLASTx, which associates a list of hits to each read. Each hit consists of one protein; two score values, called bit and identities, which measure the quality of the read–protein matching; and one confidence value, called E-value, which amounts to a confidence measure of the matching between the read and the protein. Specifically, in [5] an algorithm, here called LWproxy (Local Weight proxy), is introduced, which clusters reads and the proteins in their sets of hits simultaneously, in such a way that one cluster of proteins is associated to one cluster of reads. It then assigns one local weight to each protein of a cluster, using the cumulative BLASTx bit score of those reads in the corresponding cluster having that protein as one of their hits. The protein with best weight (highest cumulative bit score) is selected as proxygene of the cluster of reads. In [8], an alternative method for clustering metagenome short reads based on weighted proteins is proposed, here called GWproxy (Global Weight proxy). The method first assigns global weights to each protein using the BLASTx identity and bit score of those reads having that protein as one of their hits. Next, the method groups reads into clusters using an instance of the weighted set covering
problem, with reads as rows and proteins as columns. It seeks the smallest set of columns covering all rows and having best total weight. A solution corresponds to a clustering of reads and one protein (proxygene) associated to each cluster. While in [5] the proxygene of a cluster is selected within a set of proteins associated to that cluster, in GWproxy clustering and proxygene selection are performed at the same time. In this paper we introduce a refinement of GWproxy based on the following ensemble technique, called EGWproxy (Ensemble Global Weight proxy). The algorithm associates a list of proteins to each cluster resulting from application of GWproxy, such that each protein occurs as a hit of each of the reads of that cluster. Such a list is used for refining the biological analysis of the cluster, for instance by assigning a taxonomic identifier (taxID) by means of a weighted majority vote among the taxID's of the proteins in the associated list. We discuss advantages and drawbacks of the above clustering algorithms, and use them to perform taxonomic analysis of metagenomic data. To this aim, we use three real-life benchmark datasets from previous work on metagenomic data analysis. These datasets were introduced in [5] and used to perform a thorough analysis of evidence-based direct and indirect (that is, using proxygenes) annotation methods for short metagenomic reads. The results of such analysis substantiated advantages and effectiveness of indirect methods over direct ones. The results of the three considered evidence-based clustering algorithms indicate satisfactory coherence of the taxonomies output by the algorithms. In general, the experimental comparative analysis substantiates the effectiveness of evidence-based methods for taxonomic analysis of metagenomic data.
2 Clustering Metagenome Short Reads Using Proxygenes
Different methods for clustering analysis of metagenomic datasets have been proposed, which can be divided into two main approaches: sequence-based and evidence-based methods. Sequence-based methods compare sequences directly, using a similarity measure based either on sequence overlapping [4] or on extracted features such as oligonucleotide frequency [7]. Evidence-based methods employ knowledge extracted from external sources in the clustering process, such as proteins identified by a BLASTx search (proxygenes) [5]. Here we consider the latter approach for clustering short reads. The knowledge used by the clustering algorithms considered here is extracted from a reference proteome database by matching reads to that database by means of BLASTx, a powerful search program. BLASTx belongs to the BLAST (Basic Local Alignment Search Tool) family, a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA [9,10]. BLASTx is the BLAST program designed to evaluate the similarities between DNA sequences and proteins; it compares nucleotide sequence queries, dynamically translated in all six reading frames, to peptide sequence databases. The scores assigned in a BLAST search have a statistical interpretation, making real matches easier to distinguish from random background hits. In the following we summarize the main features of BLAST.
2.1 The BLAST Alignment Method
BLAST uses a heuristic algorithm that seeks local, as opposed to global, alignments and is therefore able to detect relationships among sequences that share only isolated regions of similarity [11]. When a query is submitted, BLAST works by first making a look-up table of all the words (short subsequences, three letters in our case) and neighboring words, i.e., similar words, in the query sequence. The sequence database is then scanned for these strings; the locations in the database of all these words are called word hits. Only those regions with word hits will be used as alignment seeds. When one of these matches is identified, it is used to initiate gap-free and gapped extensions of the word. After the algorithm has looked up all possible words from the query sequence and extended them maximally, it assembles the statistically significant alignment for each query–sequence pair, called a High-scoring Segment Pair (HSP). The matching reliability of read r and protein p is evaluated through the Bit Score, denoted by SB(r, p), and the E-value, denoted by E(r, p). The bit score of one HSP is computed as the sum of the scoring matrix values for that segment pair. The E-value is the number of times one might expect to see such a query–sequence match (or a better one) merely by chance. Another important BLASTx score of the matching between r and p is the Identities score, denoted by Id(r, p), defined as the proportion of the amino acids in the database sequence p that are matched by the amino-acid translation of the current query frame r. We refer to [9] for a formal description of these measures. We turn now to describe the three methods used here for taxonomic analysis of metagenomic data. Here and in the sequel we assume that BLASTx has been applied to a metagenomic dataset with a given E-value cutoff. We denote by R = {r1, ..., rm} the resulting set of reads having at least one BLASTx hit for the given cutoff, and by P = {p1, ..., pn} the set of proteins occurring in the hit list of at least one read of R.

2.2 LWproxy
LWproxy generates a collection of pairs (Ci, Pi), where Ci is a set of reads and Pi a set of proteins. The algorithm can be summarized as follows.

1. Set i = 0.
2. Set X = R.
3. If X is empty then terminate, otherwise set i = i + 1.
4. Select randomly1 one read ri from X as seed of cluster Ci = {ri}.
5. Set Pi to the set of hits of ri.
6. Remove ri from X.
7. Add to Ci all the reads having one element of Pi as a best hit, and remove them from X.
8. Add to Pi all hits of those reads added to Ci in the previous step.
9. If no reads are added then go to step 3, otherwise go to step 7.
1 We consider here random seed selection. However, in [5] the criterion for selecting a seed is not specified.
When the clustering process is terminated, the method assigns one proxygene to each Ci by selecting from Pi the protein having highest cumulative bit score.

Example 1. Suppose given a set of five reads {r1, ..., r5} and suppose that the proteins occurring in their hits are:
– {p1, p3, p5} for read r1, with best hit p3.
– {p1, p3, p5} for read r2, with best hit p3.
– {p2, p4} for read r3, with best hit p2.
– {p2} for read r4, with best hit p2.
– {p2, p3, p5} for read r5, with best hit p2.
If LWproxy starts from r1 as seed for C1, then only r2 is added to C1, since p2 (the best hit of each of the other reads) does not occur in the list of proteins associated with C1. Then construction of a second cluster, say C2, begins. C2 is filled with the rest of the reads, {r3, r4, r5}.
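For illustration, here is a minimal Python sketch of the LWproxy procedure summarized above; the dictionaries hits, best_hit and bit_score are assumed representations of the BLASTx output, not structures from the original paper.

```python
import random

def lwproxy(hits, best_hit, bit_score):
    """hits[r]: set of proteins in the hit list of read r; best_hit[r]: its best
    hit; bit_score[(r, p)]: bit score of the pair. Returns clusters of reads and
    one proxygene per cluster."""
    unassigned = set(hits)
    clusters, proxygenes = [], []
    while unassigned:
        seed = random.choice(sorted(unassigned))          # step 4: random seed
        cluster, proteins = {seed}, set(hits[seed])
        unassigned.discard(seed)
        grown = True
        while grown:                                      # steps 7-9
            newly = {r for r in unassigned if best_hit[r] in proteins}
            grown = bool(newly)
            cluster |= newly
            unassigned -= newly
            for r in newly:
                proteins |= hits[r]
        # proxygene: protein with highest cumulative bit score over the cluster
        def cumulative(p):
            return sum(bit_score.get((r, p), 0.0) for r in cluster if p in hits[r])
        clusters.append(cluster)
        proxygenes.append(max(proteins, key=cumulative))
    return clusters, proxygenes
```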
2.3 GWproxy
While LWproxy constructs clusters incrementally, GWproxy searches for clusters in a given search space, consisting of the clusters characterized by the proteins, as follows. We say that a protein covers a read if the protein occurs as one of the hits of that read. Then each protein characterizes one cluster, consisting of the reads it covers. Moreover, we can assign to each protein a global weight, representing the cost of selecting that protein as cluster representative. The weight of protein p is defined as

w(p) = 1 + ⌈ (1/N_p) Σ_{r : p hit of r} ( 100 · (S_B^max − S_B(r, p)) / (S_B^max − S_B^min) + 100 − Id(r, p) ) ⌉,

where ⌈v⌉ denotes the smallest integer greater than or equal to v, and N_p is the number of reads having p as one of their hits. The maximum and minimum values of S_B over the considered pairs of reads and proteins, S_B^max and S_B^min respectively, are used to scale S_B(r, p). The weight w(p) is in inverse proportion to the average quality of the matchings between p and the reads having p as a hit, so that better proteins have a smaller w(p) value (smaller cost). Clustering then amounts to finding a minimum set of proteins that, together, cover all the reads in R and have minimum total cost. Formally, consider the vector of protein weights w ∈ N^n and the matrix A ∈ {0, 1}^{m×n} whose elements a_ij are such that a_ij = 1 if p_j covers r_i, and a_ij = 0 otherwise. We want to solve the following constrained optimization problem (the weighted set covering problem, WSC in short):

min_{x ∈ {0,1}^n}  Σ_{j=1}^{n} x_j w_j,   such that   Σ_{j=1}^{n} a_ij x_j ≥ 1   for i = 1, ..., m.        (WSC)
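As an illustration, the weight computation and a covering step can be sketched as follows. This is a hedged sketch: the data structures are assumptions, and the greedy covering below is a generic stand-in for the dedicated heuristic of [12] that the authors actually use, so it need not return the same solution.

```python
import math

def protein_weights(reads, bit, ident):
    """reads[p]: set of reads having p among their hits; bit[(r, p)] and
    ident[(r, p)]: BLASTx bit and identities scores. Returns w(p) as defined above."""
    pairs = [(r, p) for p in reads for r in reads[p]]
    sb_max = max(bit[rp] for rp in pairs)
    sb_min = min(bit[rp] for rp in pairs)
    span = max(sb_max - sb_min, 1e-9)          # guard against identical bit scores
    w = {}
    for p, rs in reads.items():
        avg = sum(100.0 * (sb_max - bit[(r, p)]) / span + 100.0 - ident[(r, p)]
                  for r in rs) / len(rs)
        w[p] = 1 + math.ceil(avg)
    return w

def greedy_wsc(reads, w):
    """Greedy stand-in for the weighted set covering heuristic: repeatedly pick
    the protein with the lowest weight per newly covered read."""
    uncovered = set().union(*reads.values())
    chosen = []
    while uncovered:
        p = min((q for q in reads if reads[q] & uncovered),
                key=lambda q: w[q] / len(reads[q] & uncovered))
        chosen.append(p)                        # p becomes a proxygene
        uncovered -= reads[p]
    return {p: reads[p] for p in chosen}        # one cluster of reads per selected protein
```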
Table 1. Left: input covering matrix; position (i, j) contains a 1 if protein pj occurs in the set of selected hits of read ri, otherwise it contains a 0. Right: the same matrix with the solution vector x; the proxygenes selected by GWproxy are those with x = 1.

Left:
        p1   p2   p3   p4   p5
 r1      1    0    1    1    0
 r2      1    0    1    0    1
 r3      0    1    0    1    0
 r4      0    1    0    0    0
 r5      0    1    1    0    1
 w  =   15   10   10   20    5

Right:
        p1   p2   p3   p4   p5
 r1      1    0    1    1    0
 r2      1    0    1    0    1
 r3      0    1    0    1    0
 r4      0    1    0    0    0
 r5      0    1    1    0    1
 w  =   15   10   10   20    5
 x  =    0    1    1    0    0
The variable xj indicates whether pj belongs to the solution (xj = 1) or not (xj = 0). The m constraint inequalities express the requirement that each read ri be covered by at least one protein. The weight wj specifies the cost of protein pj. Here a fast heuristic algorithm2 for WSC [12] is applied to find a solution in an acceptable time. A solution corresponds to a subset of P consisting of those proteins pj such that xj = 1. Each of the selected proteins is a proxygene; it represents the cluster consisting of those reads covered by that protein.

Example 2. We illustrate the application of GWproxy on the toy problem of Example 1. Let w = (15 10 10 20 5) be the vector of protein weights. Then Table 1 (left part) shows the corresponding matrix A ∈ {0, 1}^{5×5}. Application of GWproxy outputs proteins p2, p3 (see Table 1, right part). The selected proteins correspond to the two clusters of reads {r3, r4, r5} and {r1, r2, r5}, with p2 and p3 as associated proxygenes, respectively.
2.4 EGWproxy
The aim of the EGWproxy algorithm is to refine the clustering produced by GWproxy as follows. As described before, each cluster that GWproxy outputs is represented by one protein. However, because of the short length of reads, and because in general the size of clusters is not very big (see the analysis in [8]), it may well happen that more than one protein covers all the reads of a cluster. Although GWproxy selects only one of such proteins, among those having best score, each of the proteins covering a cluster can be considered an equivalent representative of that cluster. Taking this fact into account, we can increase the robustness of GWproxy's proxygene choice. For instance, for performing taxonomic analysis of cluster C, the following criterion can be used in order to decide which taxID to associate to C. The set T of taxID's of the list of proteins that EGWproxy associates to C is considered. Then the final taxID t_fin of C is computed as

t_fin = arg min_{t ∈ T}  Σ_{p with taxID equal to t}  N_p / w_p ,
2 Publicly available at http://www.cs.ru.nl/~elenam
where the operator arg min gives the element at which the objective function takes its minimum.
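A minimal sketch of this assignment criterion, following the displayed formula literally (the dictionaries used are assumptions made for the illustration):

```python
def cluster_taxid(cluster_proteins, taxid, n_reads, w):
    """Assign a taxID to a cluster: among the taxID's of the proteins covering
    the cluster, pick the t that minimizes the sum of N_p / w_p over the
    proteins carrying that taxID."""
    totals = {}
    for p in cluster_proteins:
        t = taxid[p]                                   # taxID of protein p
        totals[t] = totals.get(t, 0.0) + n_reads[p] / w[p]
    return min(totals, key=totals.get)
```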
2.5 Comparison of Algorithms
GWproxy and LWproxy use different clustering heuristics: the first algorithm searches for a clustering in a fixed search space, characterized by the sets of reads covered by each protein, while LWproxy constructs clusters of reads and of proteins incrementally. Furthermore, GWproxy scores proteins using the bit and identities scores, while LWproxy uses only the bit score. Finally, GWproxy scores each protein globally, that is, using all the reads it covers, while LWproxy scores only the proteins of a cluster, where each protein is scored locally using the reads it covers that belong to that cluster. Both EGWproxy and LWproxy associate to each cluster of reads one set of proteins. However, while LWproxy selects one protein as the final representative of a cluster, EGWproxy employs an ensemble technique in order to exploit the information of all the proteins of that set. A drawback of LWproxy is that results may be affected by the choice of the read used in the first step of the algorithm, as illustrated by the following example.

Example 3. Consider the toy problem in Example 1. Suppose LWproxy selects r5 as seed of the first cluster C1. Then it adds all the other reads to C1, since their best hit is in the list of hits of r5. P1 becomes equal to the entire set of proteins. Suppose for simplicity that all proteins have equal bit score. Then LWproxy selects p2 as proxygene, since it has the highest cumulative bit score. Hence, this clustering differs from the one obtained in Example 1, where r1 was chosen as seed.

While in the experiments conducted here this drawback does not seem to affect the results, it remains to be investigated whether it affects results in general. A drawback of GWproxy is that it outputs only one solution, while in general there may be more than one "optimal" clustering of reads. This is because the weighted set covering problem seeks one optimal solution, not the set of all optimal solutions. EGWproxy tries to overcome this drawback by using a post-processing step, followed by the application of an ensemble technique for merging multiple solutions. However, the post-processing step acts only on the set of proteins, while the clusters of reads remain those produced by GWproxy. It remains to be investigated whether application of ensemble techniques also at the level of clusters of reads can improve the performance of the method.
3 Taxonomic Analysis of Metagenome Data
We consider three complex metagenome datasets introduced in [5], called in the following M1, M2 and M3. These datasets were generated, respectively, from 9, 5 and 8 genome projects, sequenced at the Joint Genome Institute (JGI) using the 454 GS20 pyrosequencing platform that produces ∼ 100 bp reads. From
Table 2. Characteristics of the organisms used in the experiments: the identifier and name of the organism, the size of its genome and the total number of reads sampled (M1 dataset)

Id.  Organism                                        genome size (bp)  reads sampled
a    Clostridium phytofermentans ISDg                4 533 512         4638
b    Prochlorococcus marinus NATL2A                  1 842 899         1866
c    Lactobacillus reuteri 100-23                    2 174 299         2371
d    Caldicellulosiruptor saccharolyticus DSM 8903   2 970 275         2950
e    Clostridium sp. OhILAs                          2 997 608         2934
f    Herpetosiphon aurantiacus ATCC 23779            6 605 151         6937
g    Bacillus weihenstephanensis KBAB4               5 602 503         4158
h    Halothermothrix orenii H 168                    2 578 146         2698
i    Clostridium cellulolyticum H10                  3 958 683         3978
each genome project, reads were sampled randomly at coverage level 0.1X. The coverage is defined as the average number of times a nucleotide is sampled. This resulted in a total of 35230, 28870 and 35861 reads, respectively. Table 2 shows the names of the organisms and the number of reads generated for the M1 dataset. The reader is referred to [5] for a detailed description of all the datasets. In our experiments we use the NR3 (non-redundant) protein sequence database as the reference database for BLASTx. The parameters of the external software we used are set as follows. For BLASTx the default parameters were used. In all experiments we used an E-value cutoff E = 10^-6. Moreover, WSCP was run with pre-processing (−p), number of iterations equal to 1000 (−x1000), one tenth of the best actual cover used as starting partial solution (−a0.1), and 150 columns to be selected for building the initial partial cover at the first iteration (−b150). For lack of space, we refer to [12] for a detailed description of the WSCP program.
3.1 Results
We extract taxonomic information from each metagenome dataset as follows. For LWproxy and GWproxy each cluster of reads is represented by one protein. The taxID of that protein is used as taxonomic information of that cluster. For EGWproxy the list of proteins associated to each cluster is transformed into one taxID as described in Section 2.4. In this way, the metagenomic data is transformed into a set of taxID's of proteins, one for each cluster of reads. Taxonomic information is then retrieved from the NCBI taxonomy (see http://www.ncbi.nlm.nih.gov/Taxonomy/). The NCBI Taxonomy database is a curated set of taxonomic classifications for all the organisms that are represented in GenBank. Each taxon in the database is associated with a unique numerical identifier called the taxID. In the present analysis, the taxonomic information of these known proxygenes is used to determine the taxonomic content of the metagenomic data.
3 Publicly available at ftp://ftp.ncbi.nlm.nih.gov/blast/db
We visualize the resulting taxonomic information in two ways.
– Graph representation of the taxonomic distribution of reads, as done e.g. in [14]. Here analysis at the taxonomic levels of phylum and class is performed, where resulting taxa containing fewer than 100 reads are discarded.
– Histogram of phylogenetic identities, as done e.g. in [13]. Shown are the percentages of the total of identifiable hits assigned to the phylogenetic groups obtained by means of the taxID of the proxygenes. Here analysis at the class taxonomic level is performed.
We apply the above techniques to the proxygenes and taxIDs obtained from the considered algorithms, as well as to the known taxIDs of the original metagenome datasets, provided by the producers of the benchmark data [5]. We use these latter results as "golden truth" (GT in short) to evaluate the methods. For lack of space, we show graphs of the taxonomic distribution at phylum and class level only for M1, in Figure 1.
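A hedged sketch of how such a taxonomic distribution of reads can be tabulated from the clustering output (the inputs — cluster sizes and per-cluster taxa — are assumptions made for the illustration):

```python
from collections import Counter

def taxonomic_distribution(cluster_sizes, cluster_taxon, min_reads=100):
    """Aggregate read counts per taxon: cluster_sizes[c] is the number of reads
    in cluster c and cluster_taxon[c] the taxon (e.g. phylum or class name)
    assigned to it; taxa with fewer than min_reads reads are discarded."""
    counts = Counter()
    for c, size in cluster_sizes.items():
        counts[cluster_taxon[c]] += size
    return {t: n for t, n in counts.items() if n >= min_reads}
```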
[Figure 1: four phylogenetic graphs sharing the structure root → Cyanobacteria; Firmicutes → Bacilli, Clostridia; Chloroflexi → Chloroflexi (class). Read counts per taxon — LWproxy: Cyanobacteria 815, Firmicutes 12919 (Bacilli 3907, Clostridia 9011), Chloroflexi 4178 (class 4178); GWproxy: Cyanobacteria 832, Firmicutes 12863 (Bacilli 3827, Clostridia 9036), Chloroflexi 4206 (class 4206); EGWproxy: Cyanobacteria 833, Firmicutes 12842 (Bacilli 3852, Clostridia 8989), Chloroflexi 4188 (class 4187); Golden Truth: Cyanobacteria 1868, Firmicutes 21031 (Bacilli 6529, Clostridia 14502), Chloroflexi 9635 (class 9635).]
Fig. 1. Phylogenetic graph for M1. From top to bottom: LWproxy, GWproxy, EGWproxy and "Golden Truth".
Fig. 2. Taxonomic distribution at taxonomic class level of the three datasets. From left to right: M1, M2 and M3. From top to bottom: ”golden truth”, GWproxy, EGWproxy and LWproxy.
Results indicate satisfactory consensus among the three methods, yielding similar types of graphs. The methods give graphs with the same structure as the "golden truth" for the M1 and M3 datasets. For M2, instead, GWproxy and EGWproxy give graphs with a phylum subtree (Streptophyta) that does not appear in the "golden truth"; however, these subtrees account for just 1% and 2% of the covered reads, respectively. Histograms of phylogenetic identities at the class level for the three datasets are shown in Figure 2. The results achieved by the three methods are almost identical.
4 Conclusion and Future Work
In this paper we compared three methods for clustering reads and their application to the taxonomic analysis of metagenome data. We discussed the advantages and drawbacks of the methods and applied them to the taxonomic analysis of three real-life metagenome datasets with known taxonomic content. The results of this analysis indicate satisfactory consensus among all three methods, and very good performance with respect to taxonomic distribution and phylogenetic content. A drawback of LWproxy is that its results can be affected by the choice of the read used in the first step of the algorithm, as shown in Example 3. While in the experiments conducted here this drawback
does not seem to affect the results, it remains to be investigated whether this holds in general. The EGWproxy algorithm can potentially refine the proxygene selection made by GWproxy, although we have found no evidence of this in our analysis yet. In future work we intend to introduce a statistical test for measuring the significance of taxonomic assignments, in order to discard assignments possibly due to the composition of the reference proteome database used when applying BLASTx. Such a test will consider not only the number of reads assigned to a taxon, but also the divergence of their proxygenes as well as the nucleotide composition of the reads.
Acknowledgements. We would like to thank Konstantinos Mavromatis for providing the datasets in [5] as well as useful information about these data.
References
1. Yooseph, S., et al.: The Sorcerer II Global Ocean Sampling expedition: Expanding the universe of protein families. PLoS Biol. 5(3), e16 (2007)
2. McHardy, A., Rigoutsos, I.: What's in the mix: phylogenetic classification of metagenome sequence samples. Current Opinion in Microbiology 10, 499–503 (2007)
3. Raes, J., Foerstner, K., Bork, P.: Get the most out of your metagenome: computational analysis of environmental sequence data. Current Opinion in Microbiology 10, 490–498 (2007)
4. Li, W., Wooley, J., Godzik, A.: Probing metagenomics by rapid cluster analysis of very large datasets. PLoS One 3(10) (2008)
5. Dalevi, D., Ivanova, N., Mavromatis, K., Hooper, S., Szeto, E., Hugenholtz, P., Kyrpides, N., Markowitz, V.: Annotation of metagenome short reads using proxygenes. Bioinformatics 24(16) (2008)
6. Pop, M., Phillippy, A., Delcher, A., Salzberg, S.: Comparative genome assembly. Briefings in Bioinformatics 5(3), 237–248 (2004)
7. Chan, C., Hsu, A., Tang, S., Halgamuge, S.: Using growing self-organising maps to improve the binning process in environmental whole-genome shotgun sequencing. Journal of Biomedicine and Biotechnology (2008)
8. Folino, G., Gori, F., Jetten, M.S.M., Marchiori, E.: Clustering metagenome short reads using weighted proteins. In: EvoBIO 2009. LNCS, vol. 5483, pp. 152–163. Springer, Heidelberg (2009)
9. Korf, I., Yandell, M., Bedell, J.: BLAST. O'Reilly & Associates, Inc., Sebastopol (2003)
10. Madden, T.: The BLAST Sequence Analysis Tool, Chapter 16. Bethesda, MD (2002)
11. Altschul, S., Gish, W., Miller, W., Myers, E., Lipman, D.: Basic local alignment search tool. J. Molecular Biology 215(3), 403–410 (1990)
12. Marchiori, E., Steenbeek, A.: An evolutionary algorithm for large scale set covering problems with application to airline crew scheduling. In: Oates, M.J., et al. (eds.) EvoWorkshops 2000 (EvoIASP, EvoFlight, EvoSCONDI, EvoSTIM, EvoTEL, EvoROB/EvoRobot). LNCS, vol. 1803, pp. 367–381. Springer, Heidelberg (2000)
13. Biddle, J.F., et al.: Metagenomic signatures of the Peru Margin subseafloor biosphere show a genetically distinct environment. PNAS 105, 10583–10588 (2008)
14. Venter, J., et al.: Environmental genome shotgun sequencing of the Sargasso Sea. Science 304, 66–74 (2004)
Avoiding Spurious Feedback Loops in the Reconstruction of Gene Regulatory Networks with Dynamic Bayesian Networks

Marco Grzegorczyk (1) and Dirk Husmeier (2)

(1) Department of Statistics, TU Dortmund University, 44221 Dortmund, Germany
(2) Biomathematics and Statistics Scotland, JCMB, KB, Edinburgh EH9 3JZ, UK
Abstract. Feedback loops and recurrent structures are essential to the regulation and stable control of complex biological systems. The application of dynamic as opposed to static Bayesian networks is promising in that, in principle, these feedback loops can be learned. However, we show that the widely applied BGe score is susceptible to learning spurious feedback loops, which are a consequence of non-linear regulation and autocorrelation in the data. We propose a non-linear generalisation of the BGe model, based on a mixture model, and demonstrate that this approach successfully represses spurious feedback loops.
1 Introduction
In systems biology, there has been increased interest in learning regulatory networks and signalling pathways from postgenomic data. Following up on the seminal paper by Friedman et al. [1], Bayesian networks have been widely applied to this end. Their popularity partially stems from the tractability of the marginal likelihood of the network structure, which is the consistent scoring scheme for model selection in the Bayesian context. The practical computation requires the integration of the likelihood over the entire parameter space, though. To obtain a closed-form expression, two probabilistic models with their respective conjugate prior distributions have been employed in the past: the multinomial distribution with the Dirichlet prior, leading to the so-called BDe score [2], and the linear Gaussian distribution with the normal-Wishart prior, leading to the BGe score [3]. These approaches are restricted in that they either require the data to be discretised (BDe) or can only capture linear regulatory relationships (BGe). A non-linear non-discretised model based on heteroscedastic regression has been proposed by Imoto et al. [4]. However, this approach no longer allows the marginal likelihood to be obtained in closed-form and requires a restrictive approximation (the Laplace approximation) to be adopted. Another non-linear model based on node-specific Gaussian mixture models has been proposed [5]. Here, Ko et al. resort to the Bayesian information criterion BIC of Schwarz [6] for model selection, which is only a good approximation to the marginal likelihood in the limit of very large data sets. Recently, we proposed a generalisation
of the BGe score [8] based on a combination of a mixture model with the allocation sampler proposed in [7]. In this approach the latent variable allocation is common to the whole network, though, which results in a heterogeneous linear rather than a proper non-linear model. Our work aims to generalise our earlier work [8] by the introduction of node-specific change-points. This model is similar to the model by Ko et al. [5], with the difference that the marginal likelihood is computed properly, rather than approximated by BIC.
2 Problems of the BGe Score
When the objective is to infer regulatory networks from time series, as is typically the case in systems biology, the restriction of the model to linear processes can result in the prediction of spurious feedback loops. Consider the simple example shown in Figure 1. The graph shows two interacting nodes. Node X is a regulator of node Y, and it also has a regulatory feedback loop acting back on itself. Node Y is regulated by node X, but does not contain a feedback loop. The figure shows both the state space representation, i.e. the recurrent graph, and the corresponding dynamic Bayesian network. Note that the latter is a valid DAG obtained by the standard procedure of unfolding the state space graph in time. First assume that the data generation processes are consistent with the BGe model assumption, e.g.

X(t+1) = X(t) + c + \sigma_x \cdot \phi_X(t)  and  Y(t+1) = w \cdot X(t) + m + \sigma_y \cdot \phi_Y(t),

where w, m, c, \sigma_x, \sigma_y are constants, and \phi_{\cdot}(\cdot) are iid normally distributed random variables. Under fairly general regularity conditions, the marginal likelihood and, hence, the BGe score is a consistent estimator. This implies that the correct model structure will be learned as m \to \infty, where m is the data set size. Next, consider the scenario of a non-linear regulatory influence that X exerts on Y:

X(t+1) = X(t) + c + \sigma_x \cdot \phi_X(t), \qquad Y(t+1) = f(X(t)) + \sigma_y \cdot \phi_Y(t)    (1)

for some non-linear function f(\cdot). This non-linear function cannot be modelled with a linear Bayesian network based on the BGe model. Consequently, the prediction of Y(t+1) from X(t) will tend to be poor. Note that for sufficiently small noise levels, the Y(t)'s will exhibit a strong autocorrelation, by virtue of the autocorrelation of the X(t)'s and the regulatory influence of X(t) on Y(t+1). If
Fig. 1. State space graph and corresponding dynamic Bayesian network. Left: Recurrent state space graph containing two nodes. Node X has a recurrent feedback loop and acts as a regulator of node Y. Right: The same graph unfolded in time.
the latter regulatory influence cannot be learned owing to the linear restriction of our model, the next best explanation is a direct modelling of the autocorrelation between the Y(t)'s themselves. This autocorrelation corresponds to a feedback loop of Y acting back on itself in the state-space graph, or, equivalently, an edge from Y(t) to Y(t+1) in the dynamic Bayesian network. We would therefore conjecture that the linear restriction of the Bayesian network model may result in the prediction of spurious feedback loops and, hence, in the reconstruction of wrong network structures. Ruling out feedback loops altogether, as we did in [8], will not provide a sufficient remedy for this problem, as some nodes – X in the example above – will exhibit regulatory feedback loops (e.g. in molecular biology: transcription factors regulating their own transcription), and it is generally not known in advance where these nodes are.
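As an illustration of this effect, the short Python sketch below simulates the two-node system of Eq. (1) with the sinusoidal regulation and drift used later in the synthetic study (Section 4) and reports the lag-one autocorrelation of Y; this autocorrelation is exactly the signal that a linear model can only explain through a spurious Y → Y self-loop. Noise levels and the random seed are illustrative choices, not values from the paper.

```python
import numpy as np

def simulate(m=40, c=2 * np.pi / 41, sigma_x=0.1, sigma_y=0.1, f=np.sin, seed=0):
    """Simulate X(t+1) = X(t) + c + sigma_x*noise, Y(t+1) = f(X(t)) + sigma_y*noise (Eq. 1)."""
    rng = np.random.default_rng(seed)
    X, Y = np.zeros(m), np.zeros(m)
    for t in range(m - 1):
        X[t + 1] = X[t] + c + sigma_x * rng.standard_normal()
        Y[t + 1] = f(X[t]) + sigma_y * rng.standard_normal()
    return X, Y

def lag1_autocorr(y):
    """Lag-one autocorrelation of a time series."""
    y = y - y.mean()
    return float(np.sum(y[1:] * y[:-1]) / np.sum(y * y))

X, Y = simulate()
print("lag-1 autocorrelation of Y:", round(lag1_autocorr(Y), 3))
```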
3 Methodology

3.1 The Dynamical BGe Network
Dynamical Bayesian networks (DBNs) are flexible models for representing probabilistic relationships between interacting variables X1, ..., XN. The graph G of a DBN describes the relationships between the variables, which have been measured at equidistant time points t = 1, ..., m, in the form of conditional probability distributions. An edge pointing from Xi to Xj means that the realisation of Xj at time point t, symbolically Xj(t), is influenced by the realisation of Xi at time point t−1, symbolically Xi(t−1). πn = πn(G) denotes the parent node set of node Xn in G, i.e. the set of all nodes from which an edge points to node Xn in G. Given a data set D, where Dn,t and D(πn,t) are the t-th realisations Xn(t) and πn(t) of Xn and πn, respectively, DBNs are based on the following homogeneous Markov chain expansion:

P(D \mid G, \theta) = \prod_{n=1}^{N} \prod_{t=2}^{m} P\bigl(X_n(t) = D_{n,t} \mid \pi_n(t-1) = D_{(\pi_n, t-1)}, \theta_n\bigr)    (2)
where θ is the total parameter vector, composed of subvectors θn, which specify the local conditional distributions in the factorisation. The BGe model [3] specifies the distributional form P(D|G, θ = {μ, Σ}) as multivariate Gaussian distribution with expectation vector μ and covariance matrix Σ, and assumes a normal-Wishart distribution as prior distribution P({μ, Σ^{-1}}). The local probability distributions P(Xn|πn, θn) are then given by conditional Gaussian distributions so that only linear relationships between Xn and its parent nodes πn can be modelled. Under fairly weak conditions imposed on the parameters θ (prior independence and modularity) and prior distribution P(θ) (conjugacy), the parameters can be integrated out analytically, as shown by Geiger and Heckerman [3], and the marginal likelihood satisfies the same expansion rule as the Bayesian network with fixed parameters:

P(D \mid G) = \prod_{n=1}^{N} \prod_{t=2}^{m} P\bigl(X_n(t) = D_{n,t} \mid \pi_n(t-1) = D_{(\pi_n, t-1)}\bigr) = \prod_{n=1}^{N} \Psi(D_n^{\pi_n})    (3)
where D_n^{\pi_n} := {(D_{n,t}, D_{\pi_n,t-1}) : 2 ≤ t ≤ m} is the subset of the data pertaining to node Xn and parent set πn. For the BGe model the factors Ψ(D_n^{\pi_n}) can be computed according to Eqns. (15) and (24) in [3].
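The factorisation in Eq. (3) only ever touches the data through these node-specific lagged subsets. The following minimal Python sketch shows how such a subset is assembled from a time-series matrix and how per-node scores are summed; the closed-form BGe factor itself is not reproduced here, so `local_score` is only a placeholder for Eqns. (15) and (24) of [3].

```python
import numpy as np

def lagged_subset(data, n, parents):
    """Return (targets, regressors) for node n.

    data    : array of shape (N, m); rows are variables, columns are time points
    parents : list of parent indices pi_n in the graph G
    Pairs are (X_n(t), pi_n(t-1)) for t = 2, ..., m, i.e. the subset D_n^{pi_n}
    entering one factor of Eq. (3).
    """
    targets = data[n, 1:]               # X_n(2), ..., X_n(m)
    regressors = data[parents, :-1].T   # parent values one time step earlier
    return targets, regressors

def log_marginal_likelihood(data, parent_sets, local_score):
    """Sum of per-node log scores, mirroring the product in Eq. (3).

    local_score(targets, regressors) -> log Psi(D_n^{pi_n}); placeholder for the
    closed-form BGe factor of Geiger and Heckerman.
    """
    return sum(local_score(*lagged_subset(data, n, pa))
               for n, pa in enumerate(parent_sets))
```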
3.2 The New Mixture/Change-Point BGe Model
To obtain a more flexible model for DBNs we generalise Eq. (2) with a node-specific mixture model:

P(D \mid G, V, K, \theta) = \prod_{n=1}^{N} \prod_{t=2}^{m} \prod_{k=1}^{K_n} P\bigl(X_n(t) = D_{n,t} \mid \pi_n(t-1) = D_{(\pi_n, t-1)}, \theta_k^n\bigr)^{\delta_{V_n(t),k}}    (4)

where \delta_{V_n(t),k} is the Kronecker delta, V is a matrix of latent variables V_n(t), V_n(t) = k indicates that the t-th realisation of node X_n, symbolically X_n(t), has been generated by the k-th component of a mixture with K_n components, and K = (K_1, ..., K_N). Note that the matrix V divides the data into several disjoint subsets, each of which can be regarded as pertaining to a separate BGe model with parameters \theta_k^n. The probability model defined in Eq. (4) is effectively a mixture model with local probability distributions P(X_n|\pi_n, \theta_n) and it can hence, in principle, approximate any probability distribution arbitrarily closely. The vectors V_n are node-specific, i.e. different nodes can have different break-points, so that the proposed model has a higher flexibility in modelling non-linear relationships than the BGM model proposed in [8]. Different from the free allocation of latent variables in [8], we change the assignment of data points to mixture components via a change-point process. This effectively reduces the complexity of the latent variable space and incorporates our prior belief that, in a time series, adjacent time points are likely to be assigned to the same component. Conditional on the latent variables V and under the assumption that the regularity conditions defined in Geiger and Heckerman [3] are satisfied, the marginal likelihood can be computed in closed form:
P(D \mid G, V, K) = \int P(D \mid G, V, K, \theta)\, P(\theta)\, d\theta = \prod_{n=1}^{N} \prod_{k=1}^{K_n} \Psi(D_n^{\pi_n}[k, V_n])    (5)
where Dnπn [k, Vn ] := {(Dn,t , Dπn ,t−1 ) : Vn (t) = k, 2 ≤ t ≤ m} is the subset of the data pertaining to node Xn and its parents πn that has been assigned to the kth mixture component, symbolically: Vn (t) = k. In the absence of genuine prior knowledge about the regulatory network structure, we assume for P (G) a uniform distribution on graphs, subject to a fan-in restriction of |πn | ≤ 3. As prior probability distribution on the number of mixture components Kn , P (Kn ), we take an iid truncated Poisson distribution with shape parameter λ = 1 restricted to 1 ≤ Kn ≤ 10. We assume that the prior distributions P (Vn |Kn ) of the latent variable vectors Vn conditional on Kn are iid, and we identify Kn with Kn − 1 break-points: bn = {bn,1 , . . . , bn,Kn−1 } on the continuous interval [2, m]. For notational convenience we introduce the pseudo-break-points bn,0 = 2
and b_{n,K_n} = m. For node X_n the observation at time point t is assigned to the k-th component, symbolically V_n(t) = k, if b_{n,k-1} ≤ t < b_{n,k}. Following [10] we assume that the break-points are distributed as the even-numbered order statistics of L := 2(K_n - 1) + 1 points u_1, ..., u_L uniformly and independently distributed on the interval [2, m]. The motivation for this prior, instead of taking K_n uniformly distributed points, is to encourage a priori an equal spacing between the break-points, i.e. to discourage mixture components that contain only a few observations. The joint probability distribution of the proposed mixture BGe model is given by:

P(G, V, K, D) = P(D \mid G, V, K) \cdot P(G) \cdot P(V \mid K) \cdot P(K) = P(G) \cdot \prod_{n=1}^{N} \Bigl\{ P(V_n \mid K_n) \cdot P(K_n) \cdot \prod_{k=1}^{K_n} \Psi(D_n^{\pi_n}[k, V_n]) \Bigr\}    (6)

Here P(G) is the graph prior, and P(K_n) the Poisson prior on the number of mixture components for the n-th node. The local marginal likelihood terms \Psi(D_n^{\pi_n}[k, V_n]), which result from Eq. (5), can be computed independently for each k using Eqns. (15) and (24) in [3]. Note that each vector V_n acts as a filter which divides the data of X_n into K_n different compartments, for which separate independent BGe scores can be computed in closed form. When a mixture component is empty, then \Psi(D_n^{\pi_n}[k, V_n]) = 1. The term P(V \mid K) = \prod_{n=1}^{N} P(V_n \mid K_n) is the prior distribution on the node-specific allocation vectors V_n, which is induced by the even-numbered order statistics prior on the break-point locations b_n. Deriving a closed-form expression is involved. However, the MCMC scheme we discuss in the next section does not sample V_n directly, but is based on local modifications of V_n via birth, death and reallocation moves. All that is required for the acceptance probabilities of these moves are P(V_n \mid K_n) ratios, which are straightforward to compute.
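The following Python sketch (illustrative, not the authors' code) draws break-points from the even-numbered order statistics prior described above and converts them into the allocation vector V_n that enters Eqs. (4)–(6).

```python
import numpy as np

def sample_breakpoints(K_n, m, rng):
    """Even-numbered order statistics of L = 2(K_n - 1) + 1 uniform points on [2, m]."""
    if K_n == 1:
        return np.array([])                      # no break-points
    L = 2 * (K_n - 1) + 1
    u = np.sort(rng.uniform(2, m, size=L))
    return u[1::2]                               # keep every second (even-numbered) point

def allocation_vector(breakpoints, m):
    """V_n(t) = k iff b_{n,k-1} <= t < b_{n,k}, with b_{n,0} = 2 and b_{n,K_n} = m."""
    edges = np.concatenate(([2.0], breakpoints, [float(m)]))
    t = np.arange(2, m + 1)
    V = np.searchsorted(edges, t, side="right")  # 1-based component index
    return np.minimum(V, len(edges) - 1)         # keep t = m inside the last segment

rng = np.random.default_rng(1)
b = sample_breakpoints(K_n=3, m=40, rng=rng)
V = allocation_vector(b, m=40)
```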
3.3 MCMC Inference
We now describe an MCMC algorithm to obtain a sample {G i , Vi , Ki }i=1,...,I from the posterior distribution P (G, V, K|D) ∝ P (G, V, K, D) of Eq. (6). We combine the structure MCMC algorithm for Bayesian networks of Madigan and York [9] with the change-point model of Green [10], and draw on the fact that conditional on the allocation vectors V, the model parameters can be integrated out to obtain the marginal likelihood terms Ψ (Dnπn [k, Vn ]). Note that this approach is equivalent to the idea underlying the allocation sampler proposed in [7]. The resulting algorithm is effectively an RJMCMC scheme [10] in the discrete space of network structures and latent allocation vectors, where the Jacobian in the acceptance criterion is always 1 and can be omitted. With probability pG = 0.5 we perform a structure MCMC move on the current graph G i and leave the latent variable matrix and the numbers of mixture components unchanged, symbolically: Vi+1 = Vi and Ki+1 = Ki . A new candidate graph G i+1 is randomly drawn out of the set of graphs N (G i ) that can be reached from the
current graph G^i by deletion or addition of a single edge. The proposed graph G^{i+1} is accepted with probability

A(G^{i+1} \mid G^{i}) = \min\Bigl\{1, \; \frac{P(D \mid G^{i+1}, V^{i}, K^{i})}{P(D \mid G^{i}, V^{i}, K^{i})} \cdot \frac{P(G^{i+1})}{P(G^{i})} \cdot \frac{|N(G^{i})|}{|N(G^{i+1})|}\Bigr\}    (7)

where |.| is the cardinality and the marginal likelihood terms have been specified in Eq. (5). The graph is left unchanged, symbolically G^{i+1} := G^{i}, if the move is not accepted. We note that the subsequent analysis will be based on the marginal posterior probabilities of individual edges, which can be estimated for each edge from the MCMC sample G^1, ..., G^I by the fraction of graphs in the sample that contain the edge of interest. With the complementary probability 1 - p_G we leave the graph G^i unchanged, and perform a move on (V^i, K^i), where V_n^i is the latent variable vector of X_n in V^i, and K^i = (K_1^i, ..., K_N^i). We randomly select a node X_n and change its current number of components K_n^i via a break-point birth or death move, or its latent variable vector V_n^i by a break-point re-allocation move. The break-point birth (death) move increases (decreases) K_n^i by 1 and may also have an effect on V_n^i. The break-point reallocation move leaves K_n^i unchanged and may have an effect on V_n^i. If with probability (1 - p_G)/N a break-point move on (K_n^i, V_n^i) is performed, we randomly draw the move type. Under fairly mild regularity conditions (ergodicity, detailed balance), discussed in [10], the MCMC sampling scheme converges to the desired posterior distribution. To ensure detailed balance, the acceptance probabilities for the three break-point moves (K_n^i, V_n^i) \to (K_n^{i+1}, V_n^{i+1}) are chosen of the form min(1, R), see [10], with
R = \frac{\prod_{k=1}^{K_n^{i+1}} \Psi(D_n^{\pi_n}[k, V_n^{i+1}])}{\prod_{k=1}^{K_n^{i}} \Psi(D_n^{\pi_n}[k, V_n^{i}])} \times A \times B    (8)
where A = P(V_n^{i+1} \mid K_n^{i+1}) P(K_n^{i+1}) / P(V_n^{i} \mid K_n^{i}) P(K_n^{i}) is the prior probability ratio, and B is the inverse proposal probability ratio. The exact form of these factors depends on the move type. (i) For a break-point reallocation (r) we randomly select one of the existing break-points b_{n,j} \in \{b_{n,1}, ..., b_{n,K_n-1}\}, and the replacement value b^{\dagger}_{n,j} is drawn from a uniform distribution on [b_{n,j-1}, b_{n,j+1}], where b_{n,0} = 2 and b_{n,K_n} = m. Hence, the proposal probability ratio is one, the prior probabilities P(K_n^{i+1}) = P(K_n^{i}) cancel out, and the remaining prior probability ratio P(V_n^{i+1} \mid K_n^{i+1}) / P(V_n^{i} \mid K_n^{i}) can be obtained from p. 720 in [10]:

A_r = \frac{(b_{n,j+1} - b^{\dagger}_{n,j})(b^{\dagger}_{n,j} - b_{n,j-1})}{(b_{n,j+1} - b_{n,j})(b_{n,j} - b_{n,j-1})}, \qquad B_r = 1    (9)
If there is no break-point (K_n^i = 1) the move is rejected and the Markov chain is left unchanged. (ii) If a break-point birth move (b) on K_n^i is proposed, the location of the new break-point b^{\dagger} is randomly drawn from a uniform distribution on the interval [2, m]; the proposal probability for this move is b_{K_n^i}/(m - 2), where b_{K_n^i} is the (K_n^i-dependent) probability of selecting a birth move. The reverse death move, which is selected with probability d_{(K_n^i+1)}, consists in discarding
randomly one of the K_n^i - 1 + 1 = K_n^i change-points. The inverse proposal probability ratio is thus given by B = d_{(K_n^i+1)}(m - 2)/(b_{K_n^i} K_n^i). The prior probability ratio is given by the expression at the bottom of p. 720 in [10] (slightly modified to allow for the fact that K_n components correspond to K_n - 1 break-points), and we get:

A_b = \frac{P(K_n^i + 1)}{P(K_n^i)} \cdot \frac{2 K_n^i (2 K_n^i + 1)}{(m-2)^2} \cdot \frac{(b_{n,j+1} - b^{\dagger})(b^{\dagger} - b_{n,j})}{(b_{n,j+1} - b_{n,j})}, \qquad B_b = \frac{d_{(K_n^i+1)}\,(m-2)}{b_{K_n^i}\, K_n^i}    (10)
For K_n^i = K_{MAX} the birth of a new break-point is invalid and the Markov chain is left unchanged. Note that the ratio of the proposal probabilities for birth versus death moves, d_{(K_n^i+1)}/b_{K_n^i}, can be chosen such that it cancels out against the prior ratio P(K_n^i + 1)/P(K_n^i), and the expression simplifies:

A_b B_b = \frac{2(2 K_n^i + 1)}{(m-2)} \cdot \frac{(b_{n,j+1} - b^{\dagger})(b^{\dagger} - b_{n,j})}{(b_{n,j+1} - b_{n,j})}    (11)
(iii) A break-point death move (d) is the reverse of the birth move, and we get:

A_d B_d = \frac{(m-2)}{2(2 K_n^i - 1)} \cdot \frac{(b_{n,j+1} - b_{n,j})}{(b_{n,j+1} - b^{\dagger})(b^{\dagger} - b_{n,j})}    (12)
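To illustrate how these ratios enter the sampler, here is a hedged Python sketch of the break-point reallocation move (case (i) above) for a single node; `log_local_score` stands in for the sum of log Ψ(D_n^{π_n}[k, V_n]) terms and is not the actual closed-form BGe computation.

```python
import numpy as np

def reallocation_move(breakpoints, m, log_local_score, rng):
    """One Metropolis-Hastings reallocation move for node n (Eq. 8 with Eq. 9).

    breakpoints     : sorted array of current break-points on [2, m]
    log_local_score : function(breakpoints) -> sum_k log Psi(D_n^{pi_n}[k, V_n]);
                      a placeholder for the closed-form BGe factors
    """
    if len(breakpoints) == 0:
        return breakpoints                       # no break-point: move is rejected
    j = rng.integers(len(breakpoints))
    left = breakpoints[j - 1] if j > 0 else 2.0
    right = breakpoints[j + 1] if j + 1 < len(breakpoints) else float(m)
    b_old, b_new = breakpoints[j], rng.uniform(left, right)

    proposal = breakpoints.copy()
    proposal[j] = b_new
    # prior ratio A_r of Eq. (9); the proposal ratio B_r equals 1
    log_A = (np.log(right - b_new) + np.log(b_new - left)
             - np.log(right - b_old) - np.log(b_old - left))
    log_R = log_local_score(proposal) - log_local_score(breakpoints) + log_A
    if np.log(rng.uniform()) < min(0.0, log_R):
        return proposal
    return breakpoints
```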
4 Data
We have evaluated our method, which we henceforth refer to as the Mix-BGe model, on various synthetic data sets. For illustration purposes we present the results obtained for two studies with small networks. The first network consists of two domain nodes X and Y . The true network structure is shown in Figure 1. The dynamics of the system are given by the non-linear state-space equations (see Eq. (1)), with the non-linear function f (.) = sin(.). We generated 40 observations by applying Eq. (1), setting the drift term c = 2π/41 to ensure that the complete period [0, 2π] of the sinusoid is involved. The second network is a generalisation of this two node domain where three nodes Y1 , Y2 , and Y3 are regulated by X. The regulatory relationships are again realised by sinusoids, whereby we shift the periods. More precisely, we set: Yi (t+ 1) = sin(X(t)+ τi ·π)+ σy ·φy,i (t) where τ1 = 0, τ2 = 2/3, and τ3 = 4/3. Again we set c = 2π/41, initialised the variables at t = 1 randomly, and generated 40 further observations. We have also applied our method to three gene expression time series of the Interferon regulatory factors Irf1, Irf2 and Irf3 from bone marrow derived macrophages, which we analysed in our earlier work [8]. Data of Irf1, Irf2 and Irf3 were collected at 25 × 30 minute time intervals under three external conditions: (1) infection with Cytomegalovirus (CMV), (2) treatment with Interferon Gamma (IFNγ), and (3) infection with Cytomegalovirus after pre-treatment with IFNγ (CMV+IFNγ). Finally we have applied our method to two gene expression time series from Arabidopsis thaliana cells, which were sampled at 13 × 2 hour time intervals. As in [8] we focus on 9 genes: LHY, CCA1, TOC1, ELF4, ELF3, GI, PRR9, PRR5, and
120
M. Grzegorczyk and D. Husmeier
PRR3, which are known to be involved in circadian regulation [8]. The expressions were measured independently in 2 plants under experimentally generated constant light condition but with different pre-histories. In the experiment T20 (T28 ), the plant was entrained in a 10h:10h (14h:14h) light/dark-cycle.
5 Simulations
In all our simulations, data were standardised to zero mean and marginal variance of 1 for all dimensions. The hyperparameters of the normal-Wishart prior were chosen as uninformative as possible subject to certain regularity conditions discussed in [3]. For the synthetic (real) data we set both the burn-in and the sampling-phase lengths of our MCMC simulations to 50,000 (500,000) each and sampled every 1,000 iterations during the sampling phase. (We note that even for the largest network with N = 9 nodes, each single simulation was accomplished within a few hours using non-optimised Matlab code on a SunFire X4100M2 machine with an AMD Opteron 2224 SE dual-core processor.) For the real data we started 5 independent MCMC simulations from different initialisations on each data set, and we computed the potential scale reduction factor (PSRF) based on the marginal edge posterior probabilities to monitor convergence. As we observed a sufficient degree of convergence for all these data sets (PSRF < 1.2), we report only the results of the empty-seeded MCMC runs. For the evaluation of the results, we proceeded in different ways. For the synthetic study based on the two-node network of Figure 1, we computed the marginal posterior probabilities for the individual edges. Our main interest was to test our conjecture from Sect. 2 that the BGe score is susceptible to inferring a spurious self-loop for node Y. We wanted to test whether this susceptibility could be reduced by the proposed Mix-BGe model. For the second synthetic simulation study, we assessed the network reconstruction accuracy via the area under the ROC (receiver operator characteristic) curve, AUC; this is a standard criterion that has been applied in numerous related articles. In this study we also compare the Mix-BGe model with our BGM model [8], whereby we exchanged the random allocation of the latent variables of our original BGM model for a change-point process. This slight modification ensures a fair comparison, and actually improved the performance of the BGM model on the synthetic data sets. Finally, for the real data, we focused on the self-loops again. Since we do not know the true network for the Arabidopsis thaliana data, computing AUC scores is impossible. However, as discussed in Sect. 2, we conjecture that in the presence of temporal autocorrelation in the signals of the (unknown) regulators, many downstream nodes will show spurious self-loops when the network is reconstructed with the BGe model, whereas this susceptibility should be reduced with the proposed Mix-BGe model. We therefore take as an alternative figure of merit the difference between the average marginal posterior probability of a self-loop and the average marginal posterior probability of a non-self-loop:

\xi = \frac{1}{N_{sl}} \sum_{sl=1}^{N_{sl}} P(e_{sl} \mid D) \; - \; \frac{1}{N_{nl}} \sum_{nl=1}^{N_{nl}} P(e_{nl} \mid D)    (13)
Avoiding Spurious Feedback Loops
121
where esl is an edge corresponding to a self-loop, enl is an edge corresponding to a non-self-loop, Nsl is the total number of self loops, and Nnl is the total number of non-self-loops. Lower values of ξ are taken as an indication of a better performance.
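A minimal Python sketch of this score, computed from a matrix of marginal edge posterior probabilities (entry [i, j] being the posterior of an edge from node i to node j), is given below; the example values are illustrative.

```python
import numpy as np

def self_loop_score(edge_posteriors):
    """xi of Eq. (13): mean posterior of self-loops minus mean posterior of non-self-loops."""
    P = np.asarray(edge_posteriors, dtype=float)
    self_loops = np.diag(P)
    non_self_loops = P[~np.eye(P.shape[0], dtype=bool)]
    return self_loops.mean() - non_self_loops.mean()

# e.g. for a 3-gene network reconstructed from one of the time series
P = np.array([[0.9, 0.2, 0.1],
              [0.3, 0.8, 0.4],
              [0.1, 0.2, 0.7]])
print(self_loop_score(P))   # lower values indicate fewer (potentially spurious) self-loops
```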
6 Results
Panels (a) and (b) of Figure 2 show the AUC scores for the synthetic data. Both panels are laid out as matrices, in which the rows and columns correspond to different standard deviations of the noise in X (rows) and in Y (columns). Note that an increase of the noise in X reduces the autocorrelation of X, while increasing the noise in Y blurs the functional dependence of Y(t+1) on X(t). The autocorrelation of Y is jointly influenced by both noise levels. The proposed Mix-BGe model consistently outperforms both the linear BGe model and the common change-point BGM model of Grzegorczyk et al. [8]. Only when the noise in the signal of the hub node X is large (right-most columns) does the proposed Mix-BGe model fail to achieve an improvement. As discussed above, this is a consequence of an increased mis-classification of latent variables, owing to the nature of the change-point process. This could in principle be addressed by combining the node-specific change-points of the present paper with the allocation sampler used in [8], albeit at the cost of a greatly inflated configuration space in latent space. Panels (c) and (d) in Figure 2 show the marginal posterior probabilities of the four possible edges in the two-node network of Figure 1 and the non-linear state space process of Eq. (1). The results in panel (c) of Figure 2 were obtained with the linear BGe model and show a clear propensity for inferring the spurious self-loop Y → Y, in confirmation of our earlier conjecture (see Sect. 2). Compare this with the results for the proposed Mix-BGe model, shown in panel (d) of Figure 2. Here, the spurious self-loop Y → Y is suppressed in favour of the correct edge X → Y. There are two noise regimes in which the spurious self-loop Y → Y has a marginal posterior probability that is higher than or equal to that of the correct edge X → Y. One noise regime is where both noise levels in X and in Y are low (top left corner in panels (c) and (d) of Figure 2). Here, the autocorrelation of Y is so high that the spurious self-loop Y → Y is still favoured over the true edge X → Y; this is a consequence of the fact that the functional dependence of Y(t+1) on X(t) is only learned approximately (namely approximated by a mixture model). The second regime is where both noise levels are high (top right corners in panels (c) and (d) of Figure 2). High noise in Y blurs the functional dependence of Y(t+1) on X(t), while high noise in X leads to a high mis-classification of latent variables and, consequently, a deterioration of the model accuracy; this is a consequence of the fact that latent variables are not allocated individually, as in [8], but according to a change-point process. However, in the majority of noise scenarios, the marginal posterior probability of the correct edge X → Y is significantly higher than that of the self-loop Y → Y. This suggests that the proposed Mix-BGe model is successful at suppressing spurious feedback loops. Finally, Figure 3 shows the results for the real data. It is seen that for four out of five data sets,
[Figure 2: panel (a) AUC for N=2; panel (b) AUC for N=4; panel (c) BGe edge posteriors (N=2); panel (d) MIX-BGe edge posteriors (N=2). Rows and columns of each panel correspond to noise levels σX, σY ∈ {0.1, 0.25, 0.5, 1}.]
Fig. 2. AUC scores and marginal edge posterior probabilities for the synthetic data. All four panels are laid out as matrices, whose cells correspond to standard deviations σX and σY of the noise in X and Y (or Yi ). All histograms show averages (means/std. deviations) from 20 independent data instantiations. Panels (a) and (b) were obtained for data generated from the 4-node-network described in Section 4 and show average AUC score histograms for BGe (left grey bar) and MIX-BGe (right white bar). The centre black bar shows the AUC score for a mixture model in which each change-point applies to all the variables, as proposed in [8]. In (c) and (d) the histograms show the posterior probabilities of the edges in the 2-node network, as obtained with BGe (c) and the MIX-BGe model (d). Each histogram contains 4 bars, which represent the average posterior probabilities of the 4 possible edges: Left: self-loop X → X (true); centre left: X → Y (true); centre right: self-loop Y → Y (false); right: Y → X (false). It is seen that BGe has a high propensity for learning the spurious feedback loop Y → Y , while MIX-BGe tends to learn an increased probability of the correct edge X → Y (centre left bars).
Fig. 3. Results on the Interferon and Arabidopsis thaliana gene expression time series. The histograms show the self-loop score ξ of Eq. (13) for the BGe model (dark bar) and the proposed MIX-BGe model (light bar). Lower values are taken as an indication of a better performance. The histograms in (a)-(c) were obtained from the Interferon regulatory factor gene expression time series: (a) infection with CMV, (b) pre-treatment with IFNγ, and (c) infection and pre-treatment. The histograms in (d)-(e) were obtained from the Arabidopsis thaliana gene expression time series: (d) 10h:10h light/dark entrainment, (e) 14h:14h light/dark entrainment.
employing the proposed Mix-BGe model leads to a significant suppression of the marginal posterior probabilities of potentially spurious self-loops. We note that the chosen criterion is not based on a proper gold-standard, as we do not know the true number of genuine feedback loops. The difference between the results shown in panels (d) and (e) of Figure 3 may appear surprising. It cannot be ruled out, though, that the entrainment with different light-dark cycles may indeed lead to the activation of different recurrent pathways, especially given that the interactions between day and evening genes in Arabidopsis thaliana are intrinsically of a recurrent nature [12]. Overall, our findings summarised in Figure 3 point to a general reduction of the posterior probability of feedback loops inferred with Mix-BGe as compared with BGe. Given that BGe is intrinsically susceptible to inferring spurious feedback loops, as discussed in Section 2, this points to a potential improvement in the network reconstruction accuracy.
7 Discussion
We have demonstrated that when learning dynamic Bayesian networks from time series data, the presence of temporal autocorrelations in the signals of the regulating nodes renders an approach based on the linear BGe score susceptible to spurious feedback loops. We have proposed a non-linear generalisation of the BGe score based on a mixture model and node-specific change-point processes; this is also a generalisation of the BGM model, where the allocation of data points to mixture components was not node-specific, but affected all nodes simultaneously [8]. Our simulations have shown that the network reconstruction accuracy is improved, and that spurious feedback loops are avoided. We note that there is a close similarity between our model and the one proposed in [13]. The essential difference is that the model in [13] learns separate network structures for different time series segments. This assumption is reasonable for some scenarios, like morphogenesis. However, for most cellular processes on a shorter time scale, it is questionable whether it is the structure rather than
124
M. Grzegorczyk and D. Husmeier
just the strength of the regulatory interactions that changes with time. The practical problem is potential model over-flexibility. Owing to the high costs of postgenomic high-throughput experiments, time series in systems biology are typically rather short. Modelling short time series segments with separate network structures will almost inevitably lead to inflated inference uncertainty. For that reason we have constrained the network structure to remain invariant and only allow the interaction parameters to change. As a direction for future research one might consider the implementation of a hybrid scheme with a soft rather than hard constraint on the network structures, based on the hierarchical Bayesian model proposed in [14].
References
1. Friedman, N., Linial, M., Nachman, I., Pe'er, D.: Using Bayesian networks to analyze expression data. Journal of Computational Biology 7, 601–620 (2000)
2. Cooper, G.F., Herskovits, E.: A Bayesian method for the induction of probabilistic networks from data. Machine Learning 9, 309–347 (1992)
3. Geiger, D., Heckerman, D.: Learning Gaussian networks. In: Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, pp. 235–243 (1995)
4. Imoto, S., Kim, S., Goto, T., Aburatani, S., Tashiro, K., Kuhara, S., Miyano, S.: Bayesian network and nonparametric heteroscedastic regression for nonlinear modeling of genetic network. Journal of Bioinformatics and Computational Biology 1(2), 231–251 (2003)
5. Ko, Y., Zhai, C., Rodriguez-Zas, S.L.: Inference of gene pathways using Gaussian mixture models. In: BIBM International Conference on Bioinformatics and Biomedicine, Fremont, CA, pp. 362–367 (2007)
6. Schwarz, G.: Estimating the dimension of a model. Annals of Statistics 6(2), 461–464 (1978)
7. Nobile, A., Fearnside, A.T.: Bayesian finite mixtures with an unknown number of components: The allocation sampler. Statistics and Computing 17(2), 147–162 (2007)
8. Grzegorczyk, M., Husmeier, D., Edwards, K.D., Ghazal, P., Millar, A.J.: Modelling non-stationary gene regulatory processes with a non-homogeneous Bayesian network and the allocation sampler. Bioinformatics 24, 2071–2078 (2008)
9. Madigan, D., York, J.: Bayesian graphical models for discrete data. International Statistical Review 63, 215–232 (1995)
10. Green, P.J.: Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82(4), 711–732 (1995)
11. Gelman, A., Rubin, D.B.: Inference from iterative simulation using multiple sequences. Statistical Science 7(4), 457–472 (1992)
12. Salome, P., McClung, C.: The Arabidopsis thaliana clock. Journal of Biological Rhythms 19, 425–435 (2004)
13. Lèbre, S.: Analyse de processus stochastiques pour la génomique: étude du modèle MTD et inférence de réseaux bayésiens dynamiques. PhD thesis, Évry-Val-d'Essonne (2008)
14. Werhli, A.V., Husmeier, D.: Gene regulatory network reconstruction by Bayesian integration of prior knowledge and/or different experimental conditions. Journal of Bioinformatics and Computational Biology 6(3), 543–572 (2008)
Ligand Electron Density Shape Recognition Using 3D Zernike Descriptors

Prasad Gunasekaran (1,*), Scott Grandison (1,*), Kevin Cowtan (2), Lora Mak (3), David M. Lawson (4), and Richard J. Morris (1)

(1) Department of Computational & Systems Biology, John Innes Centre, Norwich Research Park, Colney Lane, NR4 7UH Norwich, UK
(2) Structural Biology Laboratory, Department of Chemistry, University of York, Heslington, York, YO10 5DD, UK
(3) Theoretical Systems Biology, Institute of Food Research, Norwich Research Park, Colney Lane, NR4 7UA Norwich, UK
(4) Department of Biological Chemistry, John Innes Centre, Norwich Research Park, Colney Lane, NR4 7UH Norwich, UK
Abstract. We present a novel approach to crystallographic ligand density interpretation based on Zernike shape descriptors. Electron density for a bound ligand is expanded in an orthogonal polynomial series (3D Zernike polynomials) and the coefficients from this expansion are employed to construct rotation-invariant descriptors. These descriptors can be compared highly efficiently against large databases of descriptors computed from other molecules. In this manuscript we describe this process and show initial results from an electron density interpretation study on a dataset containing over a hundred OMIT maps. We could identify the correct ligand as the first hit in about 30% of the cases and within the top five in a further 30% of the cases, with an 80% probability of the correct ligand appearing within the top ten matches. In all but a few examples, the top hit was highly similar to the correct ligand in both shape and chemistry. Further extensions and intrinsic limitations of the method are discussed.
Keywords: pattern recognition, structural bioinformatics, electron density, protein crystallography, 3D Zernike moments.
1 Introduction
With the success of structural genomics worldwide and the recognised importance of 3D information in unravelling protein function and enzymatic mechanism, computational methods are increasingly serving a vital role in structural biology. From target selection (using sequence analysis, phylogeny, fold recognition and homology modeling), registration, laboratory information management
* Equally contributing first authors.
systems, crystal optimisation, data collection and analysis, the solution of the phase problem (often using molecular replacement techniques with databases of fragments or homology models), to complete and validated models and the prediction of protein function, computational biology has a major role to play. Automated model building into crystallographic electron density has come a long way in the last decade. Although many highly sophisticated algorithms and graphical approaches existed to aid the model building process, prior to the release of packages such as ARP/wARP [1], RESOLVE [2], PHENIX [3], and BUCCANEER [4], the construction of a macromolecular model into electron density often required many months of expert crystallographer time. Electron density interpretation has always been driven by pattern recognition [5]. Initially this process was carried out by the human brain based on structural insights and experience, but alternative approaches based on algorithmic pattern recognition developments such as skeletonization [6] quickly reduced the demand on human effort. Automated methods [7,4,8,9,10,11,12,13,14] contributed greatly to and benefitted hugely from structural genomics initiatives and recent developments [15,16,17,18,4] have reached an impressive level of sophistication, providing an intelligent expert system approach to model building for protein structures. Methods for determining structures from small molecule crystallographic data are completely automated and very well advanced. The success of these techniques relies strongly on direct methods, Patterson approaches and sufficient data to warrant atomicity [19,20], thus allowing identified atoms to be connected and the molecule to be built in a rather straightforward manner. These methods break down with decreasing resolution. Whereas proteins have regular repeating elements that can be distinguished already at lower resolution, ligands exhibit less pronounced and regular patterns, making their interpretation problematic. Thus, automated building of protein ligands into macromolecular structures is by comparison less well developed than for proteins, and DNA/RNA molecules seem to have either been ignored or successfully resisted automation. However, a number of original, diverse, and interesting approaches have been developed for the identification and modeling of bound ligands. Zwart et al. [21] describe methods for automated identification and building of protein-bound ligands in electron-density maps. Their approach relies on a graph-based search for geometrical features which match stereochemical expectations and an atomic labeling algorithm which explores the combinatorics of the problem. This approach has recently been assessed [22] and has been shown to be most successful for high resolution data. Based on his powerful methods for density modification using structural motifs [23], Terwilliger successfully enhanced and extended the approach to ligand fitting [24]. This technique relies on the placement of core fragments into electron density and then uses a novel procedure to extend and build the remainder of the ligand. The authors performed extensive testing over a large dataset and found their methods functioned well, placing the ligand within 2 Å r.m.s.d. of the coordinates in the original structure. Recently, a new development [25] has been described which allows for the identification of ligands bound in crystal structures of
macromolecules. This method uses density and fingerprint correlations. The density correlation is computed from the density after optimization of each entry of a test set of ligands and the ligand density. The fingerprint correlations are lists of computed model density features between the ligands. An interesting development geared towards structure-based drug design is presented in [26]. Their approach aims to automate protein-ligand crystallography for drug design. Ligands are positioned by docking directly into electron density, whilst taking care of protein-ligand interactions. In addition to the identification of full ligands, the authors show the potential for analysing fragment-library screening experiments. A force-field approach can be included to provide a better fit whilst maintaining good geometry [27]. An alternative approach captures the central axis of the electron density isosurface with a graph. This graph is matched against a graph of the molecular model [28]. This approach was reported to work as well as density peak picking methods and showed promise for the extension to lower resolution [28]. In a novel development using rotation-invariant features derived from moments of various orders, Lamzin and co-workers [29] developed a powerful method for accurately locating planar objects in electron density. This approach was tested on protein and DNA/RNA crystal structures and was able to place the plane centres within 0.5 Å of the correct position. Although not specifically designed for ligand density interpretation, this approach seems a promising strategy and has much in common with our methodology. In this manuscript we present a method for ligand identification based on global pattern recognition using 3D Zernike descriptors. Zernike polynomial techniques have recently been applied with success in the biomolecular sciences [30,31,32]. The advantages of this approach include accurate feature representation and speed of comparison, meaning that large libraries can be scanned efficiently. Although for the current application the top hits already identified the correct ligand in many cases, a more promising setup might be to combine this technique with other methods such as those mentioned above to build a hierarchical search system which filters quickly through large databases and progressively invests more time only in relatively few promising candidates.
2 Methodology
An overview of the ligand density identification method described in this manuscript is shown in Figure 1. The individual steps are explained in detail below. To evaluate the success of this approach, we chose datasets of experimental crystallographic density which we attempted to interpret. The success was judged by how many times we predicted the correct ligand and, in cases where we got the wrong ligand, how distant our prediction was from the correct solution.

2.1 Density Extraction
Clipper library routines [33] were used to 1) compute crystallographic OMIT maps (OMIT maps show the difference between the density computed from the
Fig. 1. Flowchart of the 3D Zernike descriptor method for ligand density interpretation. Crystallographic difference density was computed using Clipper routines and passed through our region growing algorithm. The resulting density is projected into Zernike space (the individual colours in the multi-colour sphere representation correspond to different Zernike polynomials). The Zernike descriptors are scanned against a database of pre-computed descriptors from existing ligands and the best hit is returned as the most probable ligand.
observed crystallographic structure factors and the theoretical density from a given arrangement of atoms and can be used to spot those parts of the density that have not been modeled or contain errors) by leaving out all the ligands from the model; 2) compute a crystallographic R-factor to check the density was reasonable; 3) extract an orthogonal grid of difference density around each ligand. Such electron density maps are shown in the left column in Figures 2 and 3. Using OMIT maps based on models that have 'seen' the ligand during
Fig. 2. Density, segmentation and ligands. A selection of density maps for which the 3D Zernike descriptor ligand identification gave the correct answer.
refinement introduces a bias towards improved ligand density, and yet these maps are by nature often noisy and contain errors from missing data and/or poorly modelled parts of the structure. To address this, we implemented a region-growing algorithm following the approach described in [34]. This method allows for the automated segmentation of 3D images. The algorithm starts from a seed point and progresses to visit neighbouring pixels, marks them, and either adds them to the current region list, or not, depending on the defined growth criteria. As the seed point we chose the highest point in the OMIT map, and as the growth criterion we required that at least two neighbouring points be above a given threshold. The choice of threshold was map dependent and our selected values ranged from 1.2 to 2.0 σ (standard deviation of the electron density values). The threshold was varied to produce the most ligand-like shape as judged by the human eye. The average threshold value was about 1.4 σ, so this value was employed for automated density extraction. Images of automatically extracted difference density are shown in the second column of Figures 2 and 3.
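A simplified Python sketch of this segmentation step is shown below; it grows a region on a 3D density grid from the highest grid point, adding a voxel when it touches the current region and, under one reading of the growth criterion, at least two of its 6-neighbours also lie above the threshold. This is an illustration of the procedure described above, not the Clipper-based implementation used by the authors.

```python
import numpy as np
from collections import deque

def grow_region(density, threshold):
    """Segment a connected high-density region from a 3D map by region growing."""
    above = density >= threshold
    seed = np.unravel_index(np.argmax(density), density.shape)  # highest point as seed
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

    def neighbours(p):
        for d in offsets:
            q = tuple(np.add(p, d))
            if all(0 <= q[i] < density.shape[i] for i in range(3)):
                yield q

    region = np.zeros(density.shape, dtype=bool)
    region[seed] = True
    queue = deque([seed])
    while queue:
        p = queue.popleft()
        for q in neighbours(p):
            if region[q] or not above[q]:
                continue
            # growth criterion: at least two neighbours above the threshold
            if sum(above[r] for r in neighbours(q)) >= 2:
                region[q] = True
                queue.append(q)
    return region

# e.g. region = grow_region(omit_map, threshold=1.4 * omit_map.std())
```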
2.2 Zernike Moments
The success of 3D Zernike moments for object recognition in computer science has been documented in a number of publications, outperforming many other well-established techniques in the area. Following developments by [35] and [36] for 3D shape retrieval, we have employed 3D Zernike polynomials as a basis set for molecular shape comparison [30]. This approach was initially used to describe and compare binary objects, but the underlying mathematics are sufficiently general to allow any function in 3D to be described within the same framework.
Fig. 3. Crystallographic difference densities, extracted ligand densities and the correct ligands. A selection of density maps for which the correct answer was not obtained. For 1JKX another conformation was ranked top, for 1RY8 the density segmentation picked up only half the ligand density because of a density break, and for 1E2K the region growing algorithm missed the actual ligand density.
As demonstrated in [31] and [37], the Zernike descriptor approach is well-suited for molecular comparisons and can be adapted to a range of physicochemical properties. We recently demonstrated the application of 3D Zernike moments to capture model uncertainty and protein flexibility [32]. A detailed description of the method for matching shapes using Zernike moments has been presented in [35,36,30]. We summarise some of the main points below. Any square-integrable function on the unit ball, f, can be represented as

f(\mathbf{r}) = \sum_{n=0}^{\infty} \sum_{l=0}^{n} \sum_{m=-l}^{l} c_{nlm} Z_{nlm}(\mathbf{r}),    (1)
in which Z_{nlm}(\mathbf{r}) are the 3D Zernike polynomials given here as functions of the spherical coordinates, \mathbf{r} = (r, θ, φ). The 3D Zernike polynomials are basis functions on the unit ball consisting of a radial term, R_{nl}(r), and an angular term, Y_{lm}(θ, φ). The angular functions, Y_{lm}, are known as spherical harmonics and can themselves be used for molecular shape recognition purposes [38,39]. The function indices are denoted by nlm, where n ranges from 0 to the maximum expansion order, l ≤ n and −l ≤ m ≤ l. The radial part of the Zernike polynomials is given by

R_{nl}(r) = \sum_{k=0}^{(n-l)/2} N_{nlk}\, r^{n-2k}    (2)
for n − l even, otherwise zero. N_{nlk} is a normalisation constant. See [40,41] for further details and cartesian transformations. The expansion coefficients, c_{nlm}, are called 3D Zernike moments. The determination of the 3D Zernike moments requires the integration of the function of interest, f(\mathbf{r}), multiplied by the complex conjugate Zernike polynomials, Z^{*}_{nlm}(\mathbf{r}), over the unit ball,

c_{nlm} = \int_{0}^{1} \int_{0}^{2\pi} \int_{0}^{\pi} Z^{*}_{nlm}(\mathbf{r})\, f(\mathbf{r})\, r^{2} \sin\theta \, d\theta \, d\phi \, dr.    (3)
for a maximum expansion order equal to nmax . Thus, in terms of ligand identification the Zernike descriptors can be employed to find promising molecules of similar shape to the crystallographic density without the need for translational or rotational searches and optimisation. Once potential candidates have been identified a more sensitive method could be used, such as [8], and coupled with refinement.
3 Datasets

3.1 Electron Density Maps
We downloaded a selection of PDB structures with their crystallographic structure factors. In total we selected 110 structures in a resolution range of 1.5 Å to 2.8 Å, which gave rise to 586 ligand density maps. Only maps corresponding to ligands with more than 10 non-hydrogen atoms were selected, leaving us with 164 maps in total. The size of these ligands varies from 10 to 111 atoms, with the average being 22.7 non-hydrogen atoms.
3.2 Ligand Database
We downloaded the latest version, release 12.1, of the HIC-Up database [42]. This version contains 7870 small molecules extracted from the PDB. The 3D Zernike moments were computed following the procedure outlined in Grandison et al. [32]. In short, each molecule was projected onto the unit ball such that the largest distance from the centre of geometry was scaled to 0.6. The atoms were projected onto a 64³ grid using a Gaussian function with a variance defined by the van der Waals radius. For each voxel the geometric moments were computed from which the Zernike moments were determined. The Zernike moments were used to compute rotationally-invariant descriptors, following Equation 4.
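A hedged Python sketch of this pre-processing step is given below: atom coordinates are centred, scaled so that the most distant atom sits at radius 0.6 of the unit ball, and smeared onto a 64³ grid with per-atom Gaussians whose width is set by the van der Waals radius. Grid size, scaling and the use of Gaussians follow the description above; the exact Gaussian parameterisation and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def voxelise(coords, vdw_radii, grid_size=64, scale=0.6):
    """Project atoms onto a grid_size^3 grid inside the unit ball.

    coords    : (n_atoms, 3) array of Cartesian coordinates
    vdw_radii : (n_atoms,) array of van der Waals radii (same units as coords)
    """
    coords = np.asarray(coords, dtype=float)
    centred = coords - coords.mean(axis=0)
    factor = scale / np.linalg.norm(centred, axis=1).max()
    pos = centred * factor                        # atoms now lie within radius 0.6
    sigmas = np.asarray(vdw_radii, dtype=float) * factor

    axis = np.linspace(-1.0, 1.0, grid_size)
    X, Y, Z = np.meshgrid(axis, axis, axis, indexing="ij")
    grid = np.zeros((grid_size,) * 3)
    for p, s in zip(pos, sigmas):
        d2 = (X - p[0]) ** 2 + (Y - p[1]) ** 2 + (Z - p[2]) ** 2
        grid += np.exp(-d2 / (2.0 * s ** 2))      # Gaussian width set by the vdW radius
    return grid

# e.g. grid = voxelise(atom_xyz, atom_vdw); Zernike moments are then computed from grid
```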
4 Results
For each density map we applied the noise reduction algorithm described in the methods section to choose a continuous high density (> 1.5σ) region. This density was expanded in 3D Zernike polynomials to an order of 30 and the rotationally invariant shape descriptors were computed. These shape descriptors were scanned against a database consisting of 7870 pre-computed small molecule shape descriptors and the 164 ligand shape descriptors from the PDB files used for the density maps. Surprisingly, the density shape was not able to recognise its own ligand in a large number of cases. There are a number of reasons for this. First, although the region-growing algorithm performed well overall, it failed to extract the correct region in almost 30% of the cases, see Figure 3, due to poor density. There are numerous ways to address this problem but for the
Fig. 4. This plot shows the frequency of correctly identified ligands (y-axis) as a function of the number of selected molecules (x-axis) over the entire data set. A perfect method would give 1.0 already for the number of molecules equal to one (this is the case for our approach using computed ligand densities from the ligand model). Poor density and insufficiently powerful segmentation and image enhancement algorithms limit the performance in our test cases. For real applications the conformational flexibility of ligands will reduce the performance further.
present analysis we simply excluded those cases in order to evaluate the shape matching procedure rather than the region growing algorithm. Second, in cases where the region-growing procedure performed well, the density was still poor in places and resulted in parts being missed (and other ligands matching better) or moved (resulting in correct ligands in different conformations being the best hit). Another surprising finding is that a higher expansion order often decreases the performance. In [38] we found that for shape matching purposes using spherical harmonics a maximum expansion order of about 6 was sufficient and the performance increase for higher order was negligible. Similarly for Zernike descriptors [30], we found an order of about 10 to be sufficient for molecule classification and that the performance increased only marginally overall with higher orders, dropping slightly in only a few cases for protein comparisons. Here, the higher orders are capturing more noise in the density, whereas the lower orders are making good approximations to the overall shape and thereby performing noise reduction. We found the ligand interpretation performance peaked at a maximum expansion order of 8.
5
Discussion
We have described a novel method for the interpretation of electron density which can be used for the identification of ligands in crystallographic electron difference density. We have performed an initial analysis on a realistic test dataset which covers a range of ligands from experimental crystallographic data of varying resolution and map quality. We have developed methods to automatically extract the difference density maps, place them on an orthogonal grid, expand them in 3D Zernike polynomials, and to efficiently compare their rotationally invariant moments against other shape databases. Overall, the performance was fair but not outstanding. We could identify the correct ligand in 30% of cases as the top hit and in a further 30% of cases within the top five matches. Despite encouraging results, several points became apparent during this analysis that require further investigation before the method can be properly automated. The first point is the density extraction itself. Advanced image processing tools, more specifically better techniques for image segmentation such as a sophisticated watershed transform, may improve this step. Another option would be to allow the user to manually select and modify the density (via the placement of atoms) in existing model building graphics software. However, even with the perfect ligand density this method has limitations in that it performs static global shape matching, although multi-conformer searching presents a tractable solution given the speed of comparison. With the current test set, the conformational diversity was limited and the full effect of this limitation is therefore reduced. This conformational sampling issue could represent a problem, although initial analyses suggest that the number of bound conformations of flexible ligands is rather small [43]. We stress the point that the use of OMIT maps based on models that have 'seen' the ligand during refinement implies that the current results will be better than in a more realistic scenario with poorer ligand density.
Overall, the performance in selecting the correct ligand as the top hit did not meet our expectations. For this dataset the correct match was usually within the top twenty and with 80% frequency within the top ten hits. In 30% of the cases, we could identify the correct ligand with the first hit. However, this approach does act as an impressive filter and delivers a small set of potential candidates. More elaborate methods could then be employed to select the most likely ligand from this set. We therefore suggest that this approach would be best suited as an efficient pre-screening method, prior to launching more computationally costly techniques.
Acknowledgments. The idea for this work arose from discussions with Dr Victor S Lamzin (EMBL Hamburg) about using spherical harmonics to interpret ligand density shapes. We thank Abdullah Kahraman for discussions. This project was funded in part by a BBSRC Tools and Resources grant (grant number CA340H10B).
References 1. Perrakis, A., Morris, R., Lamzin, V.S.: Automated protein model building combined with iterative structure refinement. Nat. Struct. Biol. 6(5), 458–463 (1999) 2. Terwilliger, T.C.: Solve and resolve: automated structure solution and density modification. Methods Enzymol. 374, 22–37 (2003) 3. Adams, P.D., Grosse-Kunstleve, R.W., Hung, L.W., Ioerger, T.R., McCoy, A.J., Moriarty, N.W., Read, R.J., Sacchettini, J.C., Sauter, N.K., Terwilliger, T.C.: Phenix: building new software for automated crystallographic structure determination. Acta Crystallogr. D Biol. Crystallogr. 58(Pt 11), 1948–1954 (2002) 4. Cowtan, K.: The buccaneer software for automated model building. 1. tracing protein chains. Acta Crystallogr. D Biol. Crystallogr. 62(Pt 9), 1002–1011 (2006) 5. Morris, R.J.: Statistical pattern recognition for macromolecular crystallographers. Acta Crystallogr. D Biol. Crystallogr. 60(Pt 12 Pt 1), 2133–2143 (2004) 6. Greer, J.: Computer skeletonization and automatic electron density map analysis. Methods Enzymol. 115, 206–224 (1985) 7. Cowtan, K.: Fast fourier feature recognition. Acta Crystallogr. D Biol. Crystallogr. 57(Pt 10), 1435–1444 (2001) 8. Cowtan, K.: Fitting molecular fragments into electron density. Acta Crystallogr. D Biol. Crystallogr. 64(Pt 1), 83–89 (2008) 9. Morris, R.J., Perrakis, A., Lamzin, V.S.: ARP/wARP model-building algorithms. i. the main chain. Acta Crystallogr. D Biol. Crystallogr. 58(Pt 6 Pt 2), 968–975 (2002) 10. Morris, R.J., Perrakis, A., Lamzin, V.S.: ARP/wARP and automatic interpretation of protein electron density maps. Methods Enzymol. 374, 229–244 (2003) 11. Morris, R.J., Zwart, P.H., Cohen, S., Fernandez, F.J., Kakaris, M., Kirillova, O., Vonrhein, C., Perrakis, A., Lamzin, V.S.: Breaking good resolutions with ARP/wARP. J. Synchrotron. Radiat. 11(Pt 1), 56–59 (2004)
12. Adams, P.D., Gopal, K., Grosse-Kunstleve, R.W., Hung, L.W., Ioerger, T.R., McCoy, A.J., Moriarty, N.W., Pai, R.K., Read, R.J., Romo, T.D., Sacchettini, J.C., Sauter, N.K., Storoni, L.C., Terwilliger, T.C.: Recent developments in the phenix software for automated crystallographic structure determination. J. Synchrotron. Radiat. 11(Pt 1), 53–55 (2004) 13. Terwilliger, T.C.: Automated structure solution, density modification and model building. Acta Crystallogr. D Biol. Crystallogr. 58(Pt 11), 1937–1940 (2002) 14. Terwilliger, T.: Solve and resolve: automated structure solution, density modification and model building. J. Synchrotron. Radiat. 11(Pt 1), 49–52 (2004) 15. Cohen, S.X., Morris, R.J., Fernandez, F.J., Ben Jelloul, M., Kakaris, M., Parthasarathy, V., Lamzin, V.S., Kleywegt, G.J., Perrakis, A.: Towards complete validated models in the next generation of ARP/wARP. Acta Crystallogr. D Biol. Crystallogr. 60(Pt 12 Pt 1), 2222–2229 (2004) 16. Terwilliger, T.C., Grosse-Kunstleve, R.W., Afonine, P.V., Moriarty, N.W., Zwart, P.H., Hung, L.W., Read, R.J., Adams, P.D.: Iterative model building, structure refinement and density modification with the phenix autobuild wizard. Acta Crystallogr. D Biol. Crystallogr. 64(Pt 1), 61–69 (2008) 17. Joosten, K., Cohen, S.X., Emsley, P., Mooij, W., Lamzin, V.S., Perrakis, A.: A knowledge-driven approach for crystallographic protein model completion. Acta Crystallogr. D Biol. Crystallogr. 64(Pt 4), 416–424 (2008) 18. Terwilliger, T.C., Grosse-Kunstleve, R.W., Afonine, P.V., Moriarty, N.W., Adams, P.D., Read, R.J., Zwart, P.H., Hung, L.W.: Iterative-build omit maps: map improvement by iterative model building and refinement without model bias. Acta Crystallogr. D Biol. Crystallogr. 64(Pt 5), 515–524 (2008) 19. Morris, R.J., Bricogne, G.: Sheldrick’s 1.2 ˚ A rule and beyond. Acta Crystallogr. D Biol. Crystallogr. 59(Pt 3), 615–617 (2003) 20. Morris, R.J., Blanc, E., Bricogne, G.: On the interpretation and use of < |E|2 > (d∗) profiles. Acta Crystallogr. D Biol. Crystallogr. 60(Pt 2), 227–240 (2004) 21. Zwart, P.H., Langer, G.G., Lamzin, V.S.: Modelling bound ligands in protein crystal structures. Acta Crystallogr. D Biol. Crystallogr. 60(Pt 12 Pt 1), 2230–2239 (2004) 22. Evrard, G.X., Langer, G.G., Perrakis, A., Lamzin, V.S.: Assessment of automatic ligand building in arp/warp. Acta Crystallogr. D Biol. Crystallogr. 63(Pt 1), 108– 117 (2007) 23. Terwilliger, T.C.: Maximum-likelihood density modification using pattern recognition of structural motifs. Acta Crystallogr. D Biol. Crystallogr. 57(Pt 12), 1755– 1762 (2001) 24. Terwilliger, T.C., Klei, H., Adams, P.D., Moriarty, N.W., Cohn, J.D.: Automated ligand fitting by core-fragment fitting and extension into density. Acta Crystallogr. D Biol. Crystallogr. 62(Pt 8), 915–922 (2006) 25. Terwilliger, T.C., Adams, P.D., Moriarty, N.W., Cohn, J.D.: Ligand identification using electron-density map correlations. Acta Crystallogr. D Biol. Crystallogr. 63(Pt 1), 101–107 (2007) 26. Mooij, W.T.M., Hartshorn, M.J., Tickle, I.J., Sharff, A.J., Verdonk, M.L., Jhoti, H.: Automated protein-ligand crystallography for structure-based drug design. Chem. Med. Chem. 1(8), 827–838 (2006) 27. Wlodek, S., Skillman, A.G., Nicholls, A.: Automated ligand placement and refinement with a combined force field and shape potential. Acta Crystallogr. D Biol. Crystallogr. 62(Pt 7), 741–749 (2006)
28. Aishima, J., Russel, D.S., Guibas, L.J., Adams, P.D., Brunger, A.T.: Automated crystallographic ligand building using the medial axis transform of an electrondensity isosurface. Acta Crystallogr. D Biol. Crystallogr. 61(Pt 10), 1354–1363 (2005) 29. Hattne, J., Lamzin, V.S.: Pattern-recognition-based detection of planar objects in three-dimensional electron-density maps. Acta Crystallogr. D Biol. Crystallogr. D64(Pt 8), 834–842 (2008) 30. Mak, L., Grandison, S., Morris, R.J.: An extension of spherical harmonics to regionbased rotationally invariant descriptors for molecular shape description and comparison. J. Mol. Graph. Model 26(7), 1035–1045 (2008) 31. Sael, L., Li, B., La, D., Fang, Y., Ramani, K., Rustamov, R., Kihara, D.: Fast protein tertiary structure retrieval based on global surface shape similarity. Proteins (2008) 32. Grandison, S., Roberts, C., Morris, R.J.: The application of 3d zernike moments for the description of “model-free” molecular structure, functional motion, and structural reliability. J. Comput. Biol. 16(3), 487–500 (2009) 33. Cowtan, K.: The clipper C++ libraries for x-ray crystallography. IUCr Computing Commission Newsletter 2, 4–9 (2003) 34. Revol-Muller, C., Peyrin, F., Carrillon, Y., Odet, C.: Automated 3d region growing algorithm based on an assessment function. Pattern Recogn. Lett. 23(1-3), 137–150 (2002) 35. Canterakis, N.: 3D zernike moments and zernike affine invariants for 3D image analysis and recognition. In: Scandinavian Conference on Image Analysis (1999) 36. Novotni, M., Klein, R.: Shape retrieval using 3d zernike descriptors. ComputerAided Design 36(11), 1047–1062 (2004) 37. Sael, L., La, D., Li, B., Rustamov, R., Kihara, D.: Rapid comparison of properties on protein surface. Proteins 73(1), 1–10 (2008) 38. Morris, R.J., Najmanovich, R.J., Kahraman, A., Thornton, J.M.: Real spherical harmonic expansion coefficients as 3d shape descriptors for protein binding pocket and ligand comparisons. Bioinformatics 21(10), 2347–2355 (2005) 39. Morris, R.J.: An evaluation of spherical designs for molecular-like surfaces. J. Mol. Graph. Model 24(5), 356–361 (2006) 40. Mathar, R.J.: Third order newton’s method for zernike polynomial zeros, arXiv: math.NA/0705.1329 (2007) 41. Mathar, R.J.: Zernike basis to cartesian transformations, arXiv: 0809.2368 math-ph (2008) 42. Kleywegt, G.J.: Crystallographic refinement of ligand complexes. Acta Crystallogr. D Biol. Crystallogr. 63(Pt 1), 94–100 (2007) 43. Funkhouser, T., Glaser, F., Laskowski, R.A., Morris, R.J., Najmanovich, R., Stockwell, G., Thornton, J.M.: Shape-based classification of bound ligands. In: Barber, S., Baxter, P.D., Mardia, K.V., Wells, R.E. (eds.) Quantitative Biology Shape Analysis and Wavelets, pp. 39–42 (2005)
Definition of Valid Proteomic Biomarkers: A Bayesian Solution
Keith Harris¹, Mark Girolami¹, and Harald Mischak²
¹ Inference Group, Department of Computing Science, University of Glasgow, UK
{keithh,girolami}@dcs.gla.ac.uk, http://www.dcs.gla.ac.uk/inference
² Mosaiques Diagnostics and Therapeutics AG, Hannover, Germany
Abstract. Clinical proteomics is suffering from high hopes generated by reports on apparent biomarkers, most of which could not later be substantiated via validation. This has brought into focus the need for improved methods of finding a panel of clearly defined biomarkers. To examine this problem, urinary proteome data was collected from healthy adult males and females, and analysed to find biomarkers that differentiated between genders. We believe that models that incorporate sparsity in terms of variables are desirable for biomarker selection, as proteomics data typically contains a huge number of variables (peptides) and few samples, making the selection process potentially unstable. This suggests the application of a two-level hierarchical Bayesian probit regression model for variable selection which assumes a prior that favours sparseness. The classification performance of this method is shown to improve on that of the Probabilistic K-Nearest Neighbour model.
Keywords: Proteomic biomarkers, classification, sparsity, feature selection, Bayesian inference.
1
Introduction
Proteins and peptides in body fluids hold considerable information on the physiology of an organism and thus can serve as biomarkers for disease. However, the fields of biomarker discovery and clinical proteomics are suffering from high hopes generated by reports on potential biomarkers, most of which subsequently could not be substantiated via validation [1]. This development has resulted in much scepticism from both clinicians and regulatory agencies, which will make the application of valid biomarkers even more of a challenge. This vicious circle has to be broken by pinpointing the major errors made in earlier research and highlighting good practice that will enable the definition of valid biomarkers with a much higher probability than currently observed. While some of the initial issues have already been dealt with satisfactorily, others are still unresolved. For example, it is now generally accepted that single biomarkers should not be applied, as the complexity of a disease is unlikely to be thoroughly displayed by just one marker, and that a panel of such biomarkers should be employed
instead [1,2]. However, it is equally evident that such a panel must consist of clearly defined biomarkers, and not of an ill-defined signature, as reported in several of the original manuscripts, almost exclusively based on the Surface-Enhanced Laser Desorption/Ionization (SELDI) technology, that subsequently could not be validated [3,4]. This brings the issue of definition of a valid biomarker into focus. The fundamental question that should be asked is whether the change observed in the disease (frequency or abundance) of a certain molecule, based on data from a proteomics study, is in fact a result of the disease, or whether it merely reflects an artefact due to technical variability in the pre-analytical steps, or in the analysis. Other likely suspects for suggesting an apparent but erroneous association with disease are biological variability or bias introduced in the study (for example, due to lifestyle, age and gender). In fact, these two problems are likely responsible for the majority of erroneous biomarkers. The most appropriate answer to this challenge appears to be the application of stringent statistical analysis. Not only does good statistical practice need to be highlighted, but also more sophisticated multivariate selection methods need to be developed, so that valid biomarkers will be defined with a much higher probability than currently observed. To this end, we adopt a Bayesian approach to classification and feature selection, as this approach offers formal and well-calibrated probabilities for class prediction, which is useful for medical decision making. We compare the Probabilistic K-Nearest Neighbour and hierarchic linear probit regression classifiers. Feature selection was incorporated in the latter method by assuming priors that favoured sparse solutions. It should be noted that other classification methods like support vector machines with recursive feature elimination, adaptive boosting and random forests could have been used, but that in this paper we decided to focus solely on highlighting possible Bayesian approaches for proteomic biomarker selection. The rest of this paper is organised as follows: in Sect. 2 we discuss the illustrative experiment to find biomarkers that differentiate between males and females. Section 3 describes the classification and feature selection methods used in this paper in more detail. Section 4 presents the results of our experiments comparing the classification performance of the two methods and the feature selection performance of the three priors used to induce sparsity in the probit regression model. Finally, Sect. 5 discusses the conclusions that can be drawn from our experimental results.
2
Application
To avoid any uncertainty in the assignment of a physiological condition, we chose as an illustrative example the definition of proteomic differences between apparently healthy adult males and females. While clinical diagnoses or pathophysiological conditions are in general associated with a certain degree of uncertainty, gender can be assessed with almost 100% confidence. Furthermore, the differences between male and female, while quite obvious at first sight, are likely to be rather subtle at the proteomic level.
We chose urine to be the body fluid of interest, since urine has been found to be of much higher stability than blood-derived samples (serum or plasma), hence reducing pre-analytical variability [2,5]. Capillary electrophoresis-mass spectrometry (CE-MS) was used to analyse the urine samples, as this technology allows the routine analysis of a large number of samples and has been thoroughly validated as a platform technology [6,7]. The second urine of the morning was collected from a group of apparently healthy male and female volunteers (aged 21-40) during a routine medical checkup before recruitment at the Hannover Medical School. All samples were prepared and analysed using CE-MS as described in [6,7]. The goal of the analysis was to define biomarkers that would enable differentiation between male and female samples (based on the hypothesis that such biomarkers must exist).
3
Methods
3.1
Probabilistic K-Nearest Neighbour Classification
The Probabilistic K-Nearest Neighbour (PKNN) classification method (see [8] for an empirical analysis) adopts a fully Bayesian approach to obtaining posterior probabilities over the scaling parameter and the number of nearest neighbours to be employed. Markov chain Monte Carlo using the Metropolis-Hastings algorithm is employed to perform posterior sampling, and Monte Carlo averaging provides the predictive probabilities of class labels. Some more detail is provided below. Consider a finite data sample {(t_1, x_1), ..., (t_N, x_N)} where each t_n ∈ {1, ..., C} denotes the class label associated with the D-dimensional feature vector x_n ∈ IR^D, and the feature space IR^D has an associated metric with parameters θ, denoted M_θ. To define a probabilistic representation of the KNN method, an approximate conditional joint likelihood is defined in [9] such that

p(\mathbf{t} \mid X, \beta, k, \theta, \mathcal{M}) \approx \prod_{n=1}^{N} \frac{\exp\left(\frac{\beta}{k}\sum_{j \sim n|k}^{\mathcal{M}_\theta}\delta_{t_n t_j}\right)}{\sum_{c=1}^{C}\exp\left(\frac{\beta}{k}\sum_{j \sim n|k}^{\mathcal{M}_\theta}\delta_{c t_n}\right)}   (1)

where we define the N × 1-dimensional vector t as [t_1, ..., t_N]^T and the N × D-dimensional matrix X = [x_1, ..., x_N]^T, M denotes the metric employed in the feature space and θ are the associated parameters. The number of nearest neighbours is k and β defines a scaling variable. The expression

\sum_{j \sim n|k}^{\mathcal{M}_\theta}\delta_{t_n t_j}   (2)
denotes the number of the nearest k neighbours of xn , as measured under the metric Mθ within N − 1 samples from X remaining when xn is removed which
we denote as X_{-n}, and which have the class label value t_n, whilst each of the terms in the summation of the denominator provides a count of the number of the k neighbours of x_n which have class label equal to c. Full posterior inference follows by obtaining the parameter posterior distribution p(β, k, θ | t, X, M), and subsequent predictions of the target class label t_* of a new datum x_* are made by posterior averaging such that

p(t_* \mid x_*, \mathbf{t}, X, \mathcal{M}) = \sum_{k}\int p(t_* \mid x_*, \mathbf{t}, X, \beta, k, \theta, \mathcal{M})\, p(\beta, k, \theta \mid \mathbf{t}, X, \mathcal{M})\, d\beta\, d\theta.   (3)

As the required posterior takes an intractable form, an MCMC procedure is proposed in [9] and extended in [10] to enable metric inference, so that the following Monte Carlo estimate is employed:

\hat{p}(t_* \mid x_*, \mathbf{t}, X, \mathcal{M}) = \frac{1}{N_s}\sum_{s=1}^{N_s} p(t_* \mid x_*, \mathbf{t}, X, \beta^{(s)}, k^{(s)}, \theta^{(s)}, \mathcal{M})   (4)

where each β^{(s)}, k^{(s)}, θ^{(s)} is a sample obtained from the full parameter posterior p(β, k, θ | t, X, M) using a Metropolis-style sampler. As the standard KNN method has no straightforward way to learn the metric, we restrict this study to posterior inference over k and β and fix the metric to the standard Euclidean metric. We therefore adopt the Metropolis scheme detailed in [9], obtain samples from the posterior p(β, k | t, X, M), and employ the Monte Carlo estimate \hat{p}(t_* \mid x_*, \mathbf{t}, X, \mathcal{M}) = \frac{1}{N_s}\sum_{s=1}^{N_s} p(t_* \mid x_*, \mathbf{t}, X, \beta^{(s)}, k^{(s)}, \mathcal{M}) in the following experimental section.
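As a rough illustration of how predictions are formed, the sketch below (an illustrative Python rendering, not the authors' implementation) evaluates the per-datum factor of Equation (1) for a query point under the Euclidean metric and averages it over posterior samples of (β, k) as in Equation (4); the Metropolis sampler that would supply those samples is omitted.

```python
import numpy as np

def knn_class_counts(X, t, x_query, k, n_classes):
    """Counts of each class label among the k nearest neighbours of x_query
    under the Euclidean metric; labels t are assumed to be 0-indexed here."""
    d = np.linalg.norm(X - x_query, axis=1)
    idx = np.argsort(d)[:k]
    return np.bincount(t[idx], minlength=n_classes)

def pknn_class_probs(X, t, x_query, beta, k, n_classes):
    """Per-datum factor of Equation (1): softmax of (beta / k) * neighbour counts."""
    counts = knn_class_counts(X, t, x_query, k, n_classes)
    logits = (beta / k) * counts
    w = np.exp(logits - logits.max())
    return w / w.sum()

def pknn_predict(X, t, x_query, samples, n_classes):
    """Monte Carlo average of Equation (4) over posterior samples of (beta, k)."""
    probs = [pknn_class_probs(X, t, x_query, b, k, n_classes) for b, k in samples]
    return np.mean(probs, axis=0)
```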
3.2
Hierarchic Linear Probit Regression Models
The fundamental problem of biomarker selection via CE-MS data is to identify which peptides best discriminate between different types of protein samples, in this case between male and female samples. CE-MS data contains a large number of variables (peptides) and the sample size tends to be relatively small so the selection process can be unstable. Hence, models which incorporate sparsity in terms of variables are desirable for this kind of problem. Bae and Mallick [11] proposed a two-level hierarchical Bayesian probit regression model for variable selection which used three different priors to incorporate different levels of sparsity in the model. Details of this model, the sparsity inducing priors and the Gibbs sampler used to perform posterior sampling are given below. This method is preferable to using support vector machines for performing variable selection as we can obtain predictive probabilities of the class labels for new observations by Monte Carlo averaging, similar to the Probabilistic K-Nearest Neighbour method mentioned earlier. Model. Consider a finite data sample {(t1 , x1 ), · · · , (tN , xN )} where each tn ∈ {1, 2} denotes the class label associated with the D-dimensional feature vector xn ∈ IRD . Define the binary regression model as pi = P (ti = 2) = Φ(xTi β),
i = 1, ..., n, where β is the D × 1-dimensional vector of unknown regression parameters and Φ is the standard normal cumulative distribution function linking the probability p_i with the linear structure x_i^T β. Albert and Chib [12] introduced n independent latent variables z = [z_1, ..., z_n]^T into the problem, where z_i ∼ N(x_i^T β, 1), and define t_i = 2 if z_i > 0 and t_i = 1 if z_i ≤ 0. This approach connects the probit binary regression model for t_i to a normal linear regression model for the latent variable z_i. Bae and Mallick [11] considered different sparsity-inducing priors for β in a two-level hierarchical Bayesian model. They placed a zero-mean Gaussian prior on β with unknown variances and assigned three different priors for the variances under the assumption that they were independent, i.e., β | Λ ∼ N(0, Λ), where 0 = [0, ..., 0]^T, Λ = diag(λ_1, ..., λ_D) and λ_i is the variance of β_i.

Prior distributions for Λ. Model I - conjugate Inverse Gamma priors for each λ_i, i.e.,

\Lambda \sim \prod_{i=1}^{D}\mathrm{IG}\left(\frac{a}{2}, \frac{2}{b}\right) \propto \prod_{i=1}^{D}\left(\frac{1}{\lambda_i}\right)^{(a/2)+1}\exp\left(-\frac{b}{2\lambda_i}\right).   (5)

Model II - exponential priors for each λ_i, i.e.,

\Lambda \sim \prod_{i=1}^{D}\mathrm{Exponential}(\gamma) \propto \prod_{i=1}^{D}\exp\left(-\frac{\gamma\lambda_i}{2}\right).   (6)

Model III - non-informative Jeffreys priors for each λ_i, i.e.,

\Lambda \sim \prod_{i=1}^{D}\frac{1}{\lambda_i}.   (7)
Note that Model III is the special case of Model I in which the hyperparameters a and b are both set to 0.
Gibbs sampler.
1. Sample z_i, for i = 1, ..., n, from its full conditional distribution

z_i \mid \beta, t_i \propto \begin{cases} N(x_i^T\beta, 1) \text{ truncated at the left by } 0 & \text{if } t_i = 2, \\ N(x_i^T\beta, 1) \text{ truncated at the right by } 0 & \text{if } t_i = 1. \end{cases}   (8)

2. Sample β from its full conditional distribution p(β | z, t, Λ) ∝ N(ΣX^T z, Σ), where Σ = (X^T X + Λ^{-1})^{-1}.
3. Sample Λ from its full conditional distribution. The full conditional distributions for Models I, II and III, respectively, are:

p(\Lambda^{-1} \mid z, t, \beta) \propto \prod_{i=1}^{D}\mathrm{Gamma}\left(\frac{a+1}{2}, \frac{2}{b+\beta_i^2}\right),   (9)

p(\Lambda^{-1} \mid z, t, \beta) \propto \prod_{i=1}^{D}\mathrm{InverseGaussian}\left(\frac{\sqrt{\gamma}}{|\beta_i|}, \gamma\right),   (10)
and

p(\Lambda^{-1} \mid z, t, \beta) \propto \prod_{i=1}^{D}\mathrm{Gamma}\left(\frac{1}{2}, \frac{2}{\beta_i^2}\right).   (11)
Predictive classification. The predictive classification of the target class label t_* of a new datum x_* is given by the following Monte Carlo estimate:

\hat{P}(t_* = 2 \mid x_*) = \frac{1}{N_s}\sum_{s=1}^{N_s} p(t_* = 2 \mid x_*, \beta^{(s)}, z^{(s)}, \Lambda^{(s)}) = \frac{1}{N_s}\sum_{s=1}^{N_s}\Phi(x_*^T\beta^{(s)}),   (12)
where β (s) , z (s) and Λ(s) are the MCMC samples from the posterior distribution.
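A compact sketch of the sampler is given below for Model III, whose Jeffreys prior has no hyperparameters. This is an illustrative Python rendering of Equations (8)-(12), not the authors' implementation, and the truncated-normal and Gamma parameterisations used here should be checked against the text before being relied on.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def gibbs_probit_jeffreys(X, t, n_iter=50000, burn_in=20000, rng=None):
    """Gibbs sampler sketch for the hierarchical probit model with Jeffreys
    priors on the variances (Model III; Equations 8 and 11).
    X: n x D design matrix, t: labels coded 1/2 as in the text."""
    rng = np.random.default_rng() if rng is None else rng
    n, D = X.shape
    beta = np.zeros(D)
    lam = np.ones(D)                         # lambda_i = Var(beta_i)
    samples = []
    for it in range(n_iter):
        # 1. latent z_i ~ N(x_i^T beta, 1), truncated by the label (Equation 8)
        mu = X @ beta
        lo = np.where(t == 2, -mu, -np.inf)  # z > 0  when t_i = 2
        hi = np.where(t == 2, np.inf, -mu)   # z <= 0 when t_i = 1
        z = mu + truncnorm.rvs(lo, hi, size=n, random_state=rng)
        # 2. beta | z ~ N(Sigma X^T z, Sigma), Sigma = (X^T X + Lambda^{-1})^{-1}
        Sigma = np.linalg.inv(X.T @ X + np.diag(1.0 / lam))
        beta = rng.multivariate_normal(Sigma @ X.T @ z, Sigma)
        # 3. lambda_i^{-1} ~ Gamma(1/2, scale 2 / beta_i^2) (Equation 11)
        lam = 1.0 / rng.gamma(0.5, 2.0 / (beta ** 2 + 1e-12))
        if it >= burn_in:
            samples.append(beta.copy())
    return np.array(samples)

def predict_prob(beta_samples, x_new):
    # Equation (12): Monte Carlo average of probit probabilities P(t* = 2 | x*)
    return norm.cdf(beta_samples @ x_new).mean()
```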
4
Experimental Results
4.1
PKNN Classifiers
To maintain consistency of data representation with other studies on these data, the same arbitrary threshold for data sparsity (80%) was employed to reduce the number of covariates to 1524. However, instead of normalising samples with their sum of intensity values, we normalised features with their sum of intensity values, so as not to distort the original feature space before applying our models. A Wilcoxon rank-sum non-parametric test was used to provide a ranking of individual covariates based on p-value, and from this it is clear that a very small percentage of the 1524 peptides have any statistical evidence supporting their discrimination ability. Setting a p-value threshold of 2%, the number of peptides was reduced further to 229 and these were used in devising a series of PKNN classifiers. Starting with the full set of 229 peptides, a PKNN classifier was devised using Metropolis sampling with a burn-in of 5000 samples and a further 45000 post-burn-in samples retained for Monte Carlo averaging. The proposal distribution was tuned to achieve acceptance rates between 35% and 50%. A randomised ten-fold cross validation (10-CV) was used to obtain estimates of predictive performance; only 0-1 error loss is reported here, although predictive probabilities are obtained from PKNN. These probabilities are used to make decisions based on the selected cost and threshold, which in this case, as the classes are balanced, was set at 0.5. The 10-CV score and associated standard error are reported when 229 peptides are used, then when 228 are used with the peptide with the highest p-value removed. This is done for fourteen peptides, after which groups of 10 peptides were removed each time and the 10-CV score was measured. These results are shown in Fig. 1. A minimum mean 10-CV error of 8.68% is achieved; however, this is rather meaningless taken on its own without considering the standard error, which would increase if multiple randomisations were employed. At around 220 to 210 peptides the range of error is minimal, and the error increases as more of the low p-value peptides are removed.
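The filtering pipeline described above can be sketched as follows (illustrative Python using scipy's rank-sum test; the function name and defaults are ours, with 0.8 and 0.02 mirroring the sparsity and p-value thresholds quoted in the text).

```python
import numpy as np
from scipy.stats import ranksums

def filter_peptides(X, y, sparsity_cutoff=0.8, p_cutoff=0.02):
    """X: samples x peptides intensity matrix, y: binary class labels (0/1).
    Returns the retained column indices and the per-peptide p-values."""
    # 1. drop peptides that are zero in more than `sparsity_cutoff` of samples
    keep = (X == 0).mean(axis=0) <= sparsity_cutoff
    X = X[:, keep]
    cols = np.where(keep)[0]
    # 2. normalise each feature by its total intensity across samples
    X = X / X.sum(axis=0, keepdims=True)
    # 3. rank peptides by Wilcoxon rank-sum p-value between the two classes
    pvals = np.array([ranksums(X[y == 0, j], X[y == 1, j]).pvalue
                      for j in range(X.shape[1])])
    selected = cols[pvals < p_cutoff]
    return selected, pvals
```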
Fig. 1. This graph shows the estimated 10-CV prediction error (mean ± one standard error); y-axis: Percent Error (10-CV), x-axis: Ranked Peptides by p-value.
It is conjectured that, due to the relatively high levels of sparsity that remain in the data, a large number of peptides are required to make reasonable predictions; this will be discussed further on in the paper.
4.2
Hierarchic Linear Probit Regression Models
As in Sect. 4.1, we remove all peptides that have more than 80% zero intensity values and normalise each feature with their sum of intensity values. We again use the ranking of covariates provided by the p-values of the Wilcoxon test to further reduce the number of peptides, but this time choose a p-value threshold of roughly 5% to reduce the number of peptides to 350. These peptides were then used to build the three classifiers discussed in Sect. 3.2. Like Bae and Mallick [11], we fixed the hyperparameters for Models I and II so that E(λi ) = 10 and Var(λi ) = 100. We ran the Gibbs sampler of Sect. 3.2 for 50,000 iterations and discarded the first 20,000 iterations as burn-in. As in Sect. 4.1, a randomised 10-CV was used to assess the predictive performance of the three models. Both Models I and II gave an average test error of 8.2%±2.1%, while Model III gave an average test error of 11.2%±2.0%. It should be noted that tuning the hyperparameters of Models I and II could potentially lead to improved performance. It is not surprising that Models I and II performed similarly, as although the form of the prior distribution for the variance of the regression coefficients was different, the mean and variance was set to be the same, and this result is consistent with the findings of Bae and Mallick [11]. We believe that the poorer performance of Model III is due to the Jeffreys prior inducing too much sparsity in the model, similar to the worsening performance of the PKNN classifier seen in Fig. 1 after the number of peptides in the model is reduced below 210. We select potential biomarkers using the posterior variance of β with the idea being that the peptides with larger variance are more important in discriminating between the different types of protein samples than those with smaller variance. Figures 2, 3 and 4 show the variance of βi for Models I, II and III, respectively.
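For concreteness, the hyperparameter values implied by E(λi) = 10 and Var(λi) = 100 can be back-calculated as below; this arithmetic is our own addition (the text does not state the resulting values) and assumes the parameterisations written in Equations (5) and (6).

```python
def inverse_gamma_hyperparams(mean, var):
    # Prior (5) is InverseGamma(shape = a/2, scale = b/2):
    #   mean = (b/2)/(a/2 - 1),  var = mean^2 / (a/2 - 2)
    shape = mean ** 2 / var + 2          # a/2
    scale = mean * (shape - 1)           # b/2
    return 2 * shape, 2 * scale          # (a, b)

def exponential_hyperparam(mean):
    # Prior (6) is exp(-gamma * lambda / 2), i.e. rate gamma/2, so mean = 2/gamma
    return 2.0 / mean

print(inverse_gamma_hyperparams(10, 100))   # (6.0, 40.0)
print(exponential_hyperparam(10))            # 0.2
```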
Fig. 2. Plot of the variance of βi versus the peptide ID (Model I); y-axis: mean(lambda), x-axis: Peptide ID (×10^5).
Fig. 3. Plot of the variance of βi versus the peptide ID (Model II); y-axis: mean(lambda), x-axis: Peptide ID (×10^5).
Fig. 4. Plot of the variance of βi versus the peptide ID (Model III); y-axis: median(lambda) (×10^7), x-axis: Peptide ID (×10^5).
We see that Models I and II give very similar results and there are roughly 20 peptides that have significantly larger variance than the others. Comparing the top 20 peptides for Models I and II we see that Models I and II give very consistent selections, as 18 peptides are in both top 20s. In particular, both models rank peptides 186673 and 187114 as the two most important peptides by a comfortable margin. Of these 18 peptides, all but three had p-values less than 0.005 in the original Wilcoxon tests, and the top two peptides both had p-values less than 3 × 10−6. We also checked the sensitivity of the peptide rankings with the 10-CV mentioned earlier and found that the rankings
of Models I and II were broadly consistent between folds with the same peptides being ranked highly in the majority of folds. We see from Fig. 4 that Model III induces sparseness much more strongly than Models I and II, as there are only 8 peptides that have significantly larger variance than the others. Only two of these peptides were selected by Models I and II, but both models ranked them outside the top 10. Unlike Models I and II, when we checked the sensitivity of the peptide rankings with the results for the 10-CV we found that the selected peptides were rarely consistent between folds. This suggests that the Jeffreys prior over-prunes the model and puts very little weight on many peptides useful for classifying the binary response, leading to its worse performance in terms of the average test error found earlier. We thus conclude that the peptides suggested by Models I and II are more likely to make a good set of biomarkers for this problem than those suggested by Model III. 4.3
Classification for the Blinded Test Data
Figures 5, 6, 7 and 8 show the posterior predictive probabilities obtained for each test sample from the PKNN classifier of Sect. 4.1 and Bae and Mallick’s sparse probit regression Models I, II and III of Sect. 4.2, respectively. We see that the PKNN classifier and Bae and Mallick’s Model III tend to give the most confident predictions, while the posterior predictive probabilities are very similar for Bae and Mallick’s Models I and II and tend to give the least confident predictions. Note that it is easy to compare the predictive performance of two competing classifiers graphically by plotting the predictive probabilities of one method against the other. We also see that the four classifiers tend to allocate the test samples to the same class. In fact, all four classifiers are in agreement for 71 of the 92 test samples. The test samples where there is disagreement in the predictions of the four classifiers tend to happen when at least one classifier gives an unconfident prediction, that is, a posterior predictive probability close to 0.5. Even with the PKNN classifier there are four samples that have a posterior predictive probability of between 0.4 and 0.6. In such cases we would advocate
Fig. 5. Plot of the posterior predictive probabilities from the PKNN classifier; y-axis: Predictive Class Probability, x-axis: Blinded Test Sample, with the decision threshold P > 0.5 marked.
Fig. 6. Plot of the posterior predictive probabilities from Bae and Mallick Model I (axes as in Fig. 5).
Fig. 7. Plot of the posterior predictive probabilities from Bae and Mallick Model II (axes as in Fig. 5).
not allocating the test sample to either class, as there is great uncertainty over the true class of the test sample. This transparency in the confidence of our class predictions is a huge advantage of the Bayesian approach over the more commonly used SVM techniques, which cannot provide such a formal and well-calibrated measure of the confidence of a class prediction. The performance of the four classifiers on the blinded test set turned out to be very similar: Bae and Mallick's Model II misclassified 14 out of the 92 samples, both their Models I and III made 15 misclassifications, and the PKNN classifier performed slightly worse with 17 misclassifications. As we would expect, the test samples that were misclassified tended to have posterior predictive probabilities between 0.3 and 0.7, and thus had class predictions that were not very confident.
Fig. 8. Plot of the posterior predictive probabilities from Bae and Mallick Model III (axes as in Fig. 5).
We then trained Bae and Mallick's three models on three smaller data sets with 33, 20 and 7 cases and controls, in order to assess how model performance was affected by smaller training set sizes. We discovered that the confidence in our predictions declines significantly as the number of training samples decreases. Indeed, when the number of training samples is only 14, almost all the predictive probabilities are between 0.3 and 0.7, which suggests that our predictive performance may be little better than guessing and that the biomarkers suggested by such a small data set would not be substantiated in practice. This suggestion of deteriorating predictive performance as the number of training samples is reduced was confirmed when we unblinded the test samples (see Table 1).

Table 1. Test error for different training set sizes

Training set size   Model I   Model II   Model III
14                  28.3%     27.2%      25%
40                  27.2%     27.2%      23.9%
66                  21.7%     21.7%      25%
134                 16.3%     15.2%      16.3%

5
Conclusions
Sparse models enable us to identify a small number of peptides having the greatest discriminating power, thereby allowing researchers to quickly focus on the most promising candidates for diagnostics and prognostics. The Bayesian approach yields a coherent way to assign new samples to particular classes. Rather than hard rules of assignment, we can evaluate the probability that the new sample will be of a certain type which is more helpful for medical decision making.
Meaningful results will only be obtained if the number of training samples collected is sufficient to allow the definition of statistically valid biomarkers.
Acknowledgements. K. Harris & M. Girolami are supported by an Engineering and Physical Sciences Research Council (EPSRC) grant EP/F009429/1 - Advancing Machine Learning Methodology for New Classes of Prediction Problems. M. Girolami is funded by an EPSRC Advanced Research Fellowship EP/E052029/1. H. Mischak is supported by EU funding through the InGenious HyperCare consortium, grant LSHM-CT-2006-037093.
References 1. Mischak, H., Apweiler, R., Banks, R.E., Conaway, M., Coon, J., Dominiczak, A., Ehrich, J.H.H., Fliser, D., Girolami, M., Hermjakob, H., Hochstrasser, D., Jankowski, J., Julian, B.A., Kolch, W., Massy, Z.A., Neusuess, C., Novak, J., Peter, K., Rossing, K., Schanstra, J., Semmes, O.J., Theodorescu, D., Thongboonkerd, V., Weissinger, E.M., Van Eyk, J.E., Yamamoto, T.: Clinical proteomics: A need to define the field and to begin to set adequate standards. Proteomics - Clinical Applications 1(2), 148–156 (2007) 2. Decramer, S., de Peredo, A.G., Breuil, B., Mischak, H., Monsarrat, B., Bascands, J.L., Schanstra, J.P.: Urine in clinical proteomics. Molecular and Cellular Proteomics 7(10), 1850–1862 (2008) 3. Petricoin, E.F., Ardekani, A.M., Hitt, B.A., Levine, P.J., Fusaro, V.A., Steinberg, S.M., Mills, G.B., Simone, C., Fishman, D.A., Kohn, E.C., Liotta, L.A.: Use of proteomic patterns in serum to identify ovarian cancer. The Lancet 359(9306), 572–577 (2002) 4. Check, E.: Proteomics and cancer - running before we can walk? Nature 429(6991), 496–497 (2004) 5. Mischak, H., Coon, J.J., Novak, J., Weissinger, E.M., Schanstra, J.P., Dominiczak, A.F.: Capillary electrophoresis-mass spectrometry as a powerful tool in biomarker discovery and clinical diagnosis: An update of recent developments. Mass Spectrometry Reviews (October 2008) (in press) 6. Coon, J.J., Z¨ urbig, P., Dakna, M., Dominiczak, A.F., Decramer, S., Fliser, D., Frommberger, M., Golovko, I., Good, D.M., Herget-Rosenthal, S., Jankowski, J., Julian, B.A., Kellmann, M., Kolch, W., Massy, Z., Novak, J., Rossing, K., Schanstra, J.P., Schiffer, E., Theodorescu, D., Vanholder, R., Weissinger, E.M., Mischak, H., Schmitt-Kopplin, P.: CE-MS analysis of the human urinary proteome for biomarker discovery and disease diagnostics. Proteomics - Clinical Applications 2(7-8), 964–973 (2008) 7. Jantos-Siwy, J., Schiffer, E., Brand, K., Schumann, G., Rossing, K., Delles, C., Mischak, H., Metzger, J.: Quantitative urinary proteome analysis for biomarker evaluation in chronic kidney disease. Journal of Proteome Research 8(1), 268–281 (2009) 8. Manocha, S., Girolami, M.: An empirical analysis of the probabilistic k-nearest neighbour classifier. Pattern Recognition Letters 28(13), 1818–1824 (2007) 9. Holmes, C.C., Adams, N.M.: A probabilistic nearest neighbour method for statistical pattern recognition. J. R. Statist. Soc. B 64(2), 295–306 (2002)
10. Everson, R.M., Fieldsend, J.E.: A variable metric probabilistic k-nearestneighbours classifier. In: Yang, Z.R., Yin, H., Everson, R.M. (eds.) IDEAL 2004. LNCS, vol. 3177, pp. 654–659. Springer, Heidelberg (2004) 11. Bae, K., Mallick, B.K.: Gene selection using a two-level hierarchical Bayesian model. Bioinformatics 20(18), 3423–3430 (2004) 12. Albert, J., Chib, S.: Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88, 669–679 (1993)
Inferring Meta-covariates in Classification
Keith Harris, Lisa McMillan, and Mark Girolami
Inference Group, Department of Computing Science, University of Glasgow, UK
{keithh,lisa,girolami}@dcs.gla.ac.uk, http://www.dcs.gla.ac.uk/inference
Abstract. This paper develops an alternative method for gene selection that combines model based clustering and binary classification. By averaging the covariates within the clusters obtained from model based clustering, we define “meta-covariates” and use them to build a probit regression model, thereby selecting clusters of similarly behaving genes, aiding interpretation. This simultaneous learning task is accomplished by an EM algorithm that optimises a single likelihood function which rewards good performance at both classification and clustering. We explore the performance of our methodology on a well known leukaemia dataset and use the Gene Ontology to interpret our results. Keywords: Gene selection, clustering, classification, EM algorithm, Gene Ontology.
1
Introduction
In this paper, we develop a procedure for potentially improving the classification of gene expression profiles through coupling with the method of model based clustering. Such DNA microarray data typically consists of several thousands of genes (covariates) and a much smaller number of samples. Analysing this data is statistically challenging, as the covariates are highly correlated, which results in unstable parameter estimates and inaccurate prediction. To alleviate this problem, we use the averages of covariate clusters, rather than all the original covariates, to classify DNA samples. The advantage of this approach over using a sparse classification model [1,2] is that we can extract a much larger subset of genes with essential predictive power and partition this subset into groups, within which the genes are similar. An overview of our procedure that combines model based clustering and binary classification is as follows. By averaging the features within the clusters obtained from a Gaussian mixture model [3,4], we define “superfeatures” or “meta-covariates” and use them in a probit regression model, thereby attaining concise interpretation and accuracy. Similar ideas, from a non-Bayesian two-step perspective, have been looked at by Hanczar et al. [5] and Park et al. [6]. With our simultaneous procedure, the clusters are formed considering the correlation of the predictors with the response in addition to the correlations among the predictors. The proposed methodology should have wide applicability in areas such as gene selection and proteomic biomarker selection.
The rest of this paper is organized as follows: in Sect. 2 we introduce our meta-covariate classification model and provide an EM algorithm for learning the parameters of our model from data. In Sect. 3 we illustrate our method with a DNA microarray data example and use the Gene Ontology (GO) to interpret our results. Section 4 discusses the conclusions we draw from our experimental results. Finally, Appendix A gives the full details of our model and shows the derivation of our EM algorithm.
2
Methodology
2.1
Model
In the following discussion, we will denote the N × D design matrix as X = [x1 , . . . , xN ]T and the N × 1 vector of associated response values as t where each element tn ∈ {−1, 1}. The K × N matrix of clustering mean parameters θkn is denoted by θ. We represent the K × 1-dimensional columns of θ by θ n and the corresponding N × 1-dimensional rows of θ by θk . The D × K matrix of clustering latent variables zdk is represented as Z. The K × 1 vector of regression coefficients is denoted by w. Finally, we denote the N × 1 vector of classification auxiliary variables by y. The graphical representation of the conditional dependency structure in the meta-covariate classification model is shown in Fig. 1. From Fig. 1 we see that the joint distribution of the meta-covariate classification model is given by p(t, y, X, θ, w) = p(t, y|θ, w)p(X|θ)p(θ)p(w).
(1)
The distribution p(X|θ) is the likelihood contribution from our clustering model, which we chose to be a normal mixture model with equal weights and identity covariance matrices. Similarly, p(t, y|θ, w) is the likelihood contribution from our classification model, which we chose to be a probit regression model whose covariates are the means of each cluster, that is, θk , k = 1, . . . , K. Finally, the model was completed by specifying vague normal priors for θ and w. Full details of our model along with the derivation of the following EM algorithm that we used for inference is given in Appendix A. 2.2
Summary of the EM Algorithm
Given the number of clusters K, the goal is to maximise the joint distribution with respect to the parameters (comprising the means of the clusters and the regression coefficients).
1. Initialise θ, w, the responsibilities γ(z_dk) and E(y), and evaluate the initial value of the log likelihood.
2. E-step. Evaluate:

\gamma(z_{dk}) = \frac{\exp\left(-\frac{1}{2}\lVert x_d - \theta_k\rVert^2\right)}{\sum_{j=1}^{K}\exp\left(-\frac{1}{2}\lVert x_d - \theta_j\rVert^2\right)}   (2)
Fig. 1. Graphical representation of the conditional dependencies within the meta-covariate classification model (nodes: X, θ, y, t, w).
and

E(y_n) = \begin{cases} w^T\theta_n + \dfrac{\phi(-w^T\theta_n)}{1 - \Phi(-w^T\theta_n)} & \text{if } t_n = 1, \\[6pt] w^T\theta_n - \dfrac{\phi(-w^T\theta_n)}{\Phi(-w^T\theta_n)} & \text{otherwise.} \end{cases}   (3)
3. M-step. Evaluate:

\theta_k = \frac{\left(E(y) - \theta^T w_{-k}\right)w_k + X\gamma_k + h_1\theta_0}{w_k^2 + \sum_{d=1}^{D}\gamma(z_{dk}) + h_1}   (4)

and

w = \left(\theta\theta^T + \frac{1}{l}I\right)^{-1}\theta E(y).   (5)
After updating w in this manner, set the first component of the vector to 1, so that the model is identifiable. 4. Evaluate the log likelihood and check for convergence. If the convergence criterion is not satisfied return to step 2.
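A single iteration of the above updates can be sketched as follows. This is an illustrative Python rendering of Equations (2)-(5), not the authors' code; the hyperparameters h1, θ0 and l are left as placeholder defaults since their values are not given in this summary.

```python
import numpy as np
from scipy.stats import norm

def em_step(X, t, theta, w, h1=1.0, theta0=None, l=1.0):
    """One EM iteration for the meta-covariate classification model.
    X: N x D data, t: labels in {-1, +1}, theta: K x N cluster means,
    w: length-K regression weights. h1, theta0, l are prior hyperparameters
    (placeholder defaults, not values from the paper)."""
    N, D = X.shape
    K = theta.shape[0]
    theta0 = np.zeros(N) if theta0 is None else theta0
    # E-step: responsibilities gamma(z_dk), Equation (2)
    sq = ((X.T[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)   # D x K
    gamma = np.exp(-0.5 * sq)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # E-step: truncated-normal means E(y_n), Equation (3)
    m = theta.T @ w
    ratio = norm.pdf(-m) / np.where(t == 1, 1 - norm.cdf(-m), norm.cdf(-m))
    Ey = m + np.where(t == 1, ratio, -ratio)
    # M-step: cluster means theta_k, Equation (4)
    for k in range(K):
        w_minus = w.copy(); w_minus[k] = 0.0
        num = (Ey - theta.T @ w_minus) * w[k] + X @ gamma[:, k] + h1 * theta0
        den = w[k] ** 2 + gamma[:, k].sum() + h1
        theta[k] = num / den
    # M-step: regression weights w, Equation (5); first component pinned to 1
    w = np.linalg.solve(theta @ theta.T + np.eye(K) / l, theta @ Ey)
    w[0] = 1.0
    return theta, w, gamma, Ey
```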
3
Experimental Results - Acute Leukemia Data
3.1
Data Description
A typical application where clustering and classification have become common tasks is the analysis of DNA microarray data, where thousands of gene expression levels are monitored on a few samples of different types. We thus decided to illustrate our proposed methodology for inferring meta-covariates in classification with the widely analysed leukaemia microarray dataset of Golub et al. [7], which was downloaded from the Broad Institute Website1 . Bone marrow or peripheral blood samples were taken from 72 patients with either acute myeloid leukaemia (AML) or acute lymphoblastic leukaemia (ALL). Gene expression levels were measured using Affymetrix high-density oligonucleotide arrays containing 7129 probes for 6817 human genes. Following the experimental setup of the original 1
http://www.broad.mit.edu/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=43
paper, the dataset was split into a training set of 38 samples of which 27 are ALL and 11 are AML, and a test set of 34 samples, 20 ALL and 14 AML. The data was preprocessed as recommended in [8]: (a) thresholding, floor of 100 and ceiling of 16000; (b) filtering, exclusion of probes with max/min ≤ 5 and (max − min) ≤ 500; (c) base 10 logarithmic transformation; and (d) standardising, so that each sample has mean 0 and variance 1. This left us with 3571 probes for analysis. Finally, GO annotations for the appropriate gene chip (Hu6800) were obtained via the Affymetrix NetAffx analysis centre².
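A direct transcription of steps (a)-(d) into code might look as follows; this is a hedged sketch of the recipe quoted from [8] (the exact filtering conventions in [8] may differ in detail).

```python
import numpy as np

def preprocess_expression(X):
    """Preprocessing steps (a)-(d) for the raw expression matrix X
    (samples x probes)."""
    # (a) thresholding: floor of 100 and ceiling of 16000
    X = np.clip(X, 100, 16000)
    # (b) filtering: drop probes with max/min <= 5 or (max - min) <= 500
    mx, mn = X.max(axis=0), X.min(axis=0)
    keep = (mx / mn > 5) & (mx - mn > 500)
    X = X[:, keep]
    # (c) base-10 logarithmic transformation
    X = np.log10(X)
    # (d) standardise each sample to mean 0 and variance 1
    X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    return X, keep
```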
3.2
Results and Discussion
EM algorithm results. Figure 2 shows the minimum and mean test error from 200 runs of our EM algorithm for different values of the number of clusters K. It should be noted that we used the K-means clustering algorithm to initialise the matrix of clustering mean parameters θ, while the other parameters were initialised randomly. We see from Fig. 2 that on average the algorithm performs best for around 15 to 25 clusters, with the best case yielding an average test error rate of 9.93% for K = 21 clusters. We also see that for K = 21 clusters, the run that achieved the highest likelihood value also achieved the minimum test error of 2.94%, that is, just one misclassification in the test set. The predictions from the highest likelihood model with K = 21 clusters thus appear to improve predictions made by Golub et al. [7], who made five misclassifications on the test set, and is competitive with the methods of Lee et al. [1] and Bae and Mallick [2], who misclassified one and two test samples, respectively. We will now use the Gene Ontology to interpret the results from this model.
Fig. 2. Minimum and mean test error after 200 runs of the EM algorithm; y-axis: Test error, x-axis: K, with the run achieving the maximum likelihood value marked.
http://www.affymetrix.com/analysis/index.affx
Table 1. The best clusters (K = 21)

Cluster  Probes  Controls  w      rank(abs(w))  Genes
1        20      0          1.00   10            16
2        486     4          0.22   19            412
3        20      0         -1.22   8             20
4        253     0         -1.88   7             230
5        182     0          0.55   15            173
6        240     1         -3.08   4             199
7        110     2         -0.37   18            99
8        60      4          0.16   20            50
9        4       4         -0.15   21            1
10       230     0         -2.66   5             214
11       189     1         -1.10   9             166
12       210     1          0.88   12            183
13       228     0          0.79   13            200
14       230     0          0.55   16            187
15       61      0          3.87   1             56
16       240     0          3.21   3             204
17       213     0         -0.50   17            205
18       17      0         -0.95   11            16
19       267     1          0.75   14            235
20       101     1         -3.79   2             85
21       210     3          2.46   6             175
GO analysis. Table 1 describes each of the 21 probe clusters, with respect to the number of probes allocated to the cluster; the number of control probes allocated to the cluster; its regression coefficient (w); its rank by descending absolute regression coefficient; and the number of genes represented by the probe set. The number of unique Entrez Gene IDs (as obtained from NetAffx) was used to count the number of unique genes. 22 of the 59 controls on the microarray survive the initial filtering process (all 22 of these are endogenous controls). Control probes, by design, should not be functionally significant. It is therefore encouraging that most (63.64%) of the control probes belong to the four least influential clusters (with respect to abs(w)): clusters 9 (w = −0.15, ranked 21st), 8 (w = 0.16, ranked 20th), 2 (w = 0.22, ranked 19th) and 7 (w = −0.37, ranked 18th). Furthermore, cluster 8 – the cluster with the lowest absolute regression coefficient – contains only four probes, all of which are control probes. It should be noted that six control probes do occur in the ten ‘significant’ clusters; the extent to which these probes are appropriate controls should be investigated further. The clusters are reasonably well balanced, with most clusters containing approximately 200 genes. The largest and smallest clusters (numbers 2 and 9 respectively) have small regression coefficients, indicating that they have limited influence on the classifier. Using w = 1 as a baseline, ten clusters (numbers 15, 20, 16, 6, 10, 21, 4, 3, 11, 1) are sufficiently weighted to be of interest (these ten clusters will be
described as the ‘significant’ clusters). The aim of this work is to assess whether there is any biological significance in the clustering of the probes (or genes): the expectation is that genes clustered together will be carrying out a similar function or functions. As such, GO annotations from the molecular function aspect of the GO were used. The total number of occurrences for each GO term across all genes in a cluster was calculated. By comparing this to the occurrences for each GO term across the entire chip and using the hypergeometric distribution, we can calculate the probability that the terms were encountered by chance. By comparing the occurrence of the GO term in the cluster and the entire chip, we can describe it as over- or under-represented in the gene cluster. Cluster 15, w = 3.87. Most noticeably, metal ion (and specifically zinc ion) annotations are under-represented in this gene cluster. Further, nucleotide and nucleic acid binding are seen less often than would be expected. Several very specific terms are found enriched in this gene cluster; of particular interest is a cluster of three enzyme inhibitor activity subterms. Cluster 20, w = −3.79. There is a concentration of very specific transmembrane transporter activities and oxidoreductase terms. Unlike the previous cluster, protein kinase activity is under-represented; nucleic acid binding is over-represented and receptor activity is under-represented in this cluster. Cluster 16, w = 3.21. In this cluster, zinc ion binding is over-represented, unlike in clusters 15 and 20 (where the term was under-represented and not significant respectively). Also interesting is the overrepresentation of the “damaged DNA binding” term - particularly relevant in the context of cancer. Like cluster 15, several general receptor binding terms are over-represented. A small cluster of pyrophosphatase subterms are also over-represented. Cluster 6, w = −3.08. Several metal ion binding terms are over-represented here, including calcium and zinc, and most interestingly – particularly in the context of leukaemia, cancer of the blood – heme binding. Again, several receptor binding and activity terms are over-represented. Cluster 10, w = −2.66. Most noticeably, a small cluster of under-represented terms describe signal transducer activity and several kinds of receptor activities. This is an area of the Gene Ontology that was enriched in clusters 15, 16 and 6 and under-represented in cluster 20. There is significant enrichment of DNA binding terms (specifically DNA topoisomerases). Cluster 21, w = 2.46. Cluster 21 has the most extensive coverage and deepest annotation of the ten significant clusters, despite being of comparable size to many others (e.g., 16, 6, 10, 4 and 11). In addition, none of the significant annotations are seen less than would be expected: they are all enriched in this cluster. Multiple metal ion binding terms are enriched here as are DNA binding, receptor activity and kinase activity.
156
K. Harris, L. McMillan, and M. Girolami
Cluster 4, w = −1.88. Cluster 4 is enriched for several transcription regulation terms, kinase activities, and DNA and nucleotide binding. Here, enzyme regulator activities are under-represented. Cluster 3, w = −1.22. The genes in cluster 3 are enriched for receptor activity and a specific receptor activity: fibroblast growth factor receptor activity. Again, receptor binding and activity terms are over-represented and metal ion terms are under-represented. There is enrichment of a specific enzyme activator activity, apoptotic protease activator activity, of particular interest in the context of cancer. Cluster 11, w = −1.10. A cluster of signal transducer activity/receptor activities are under-represented here; similar to patterns observed in clusters 20, 4 and 10. There are fewer metal (iron, calcium and zinc) ion binding terms and protein kinase annotations than would be expected by chance. Cluster 1, w = 1.00. Cluster 1 defines the ‘baseline’ for regression model coefficients. This cluster is enriched for ion binding (including iron, ferrous, haem and haemoglobin), ferrochelatase and oxygen transporter activity, significant in the context of leukaemia. Table 2 describes each of the ten significant clusters with respect to an annotation profile, which considers over-representation and under-representation of metal ion binding terms; DNA or RNA binding terms; receptor activity terms; enzyme regulation terms; receptor binding terms; kinase activity terms; transmembrane transport terms and transcriptional regulation terms. It is clear that none of the clusters are identical with respect to this profile. Receptor activity terms and metal ion binding terms are more often overrepresentated in the gene clusters with positive regression coefficients, and more often under-represented in the gene clusters with negative regression coefficients. Comparison to other methods. In their original paper, Golub et al. [7] identified 50 genes that were highly correlated with the AML/ALL class distinction. 68% of these genes are assigned to a cluster with an absolute regression coefficient of ≥ 1. Cluster 15, the top ranking cluster with respect to absolute regression coefficient, contains six of these genes and cluster 20, the next most influential cluster, contains four of these genes. Surprisingly, eight genes are found in cluster 5, which has a low regression coefficient (w = 0.55). More recently, Lee et al. [1] identified 27 genes as informative, using a Bayesian method for variable selection. In this more refined set, eight (29.63%) of the genes belong to the most influential cluster (15). In a follow up study where sparsity was imposed on the priors, Bae and Mallick [2] identified 10 genes using various models. Here, three genes are found in cluster 15 and two genes are found in cluster 20, and only two genes are mapped to clusters with an absolute regression coefficient < 1. Three genes are identified by all three methods [1,2,7]: Cystatin C, Zyxin and CF3 (transcription factor 3). CF3 is assigned to cluster 5, a comparatively weakly informative cluster; however, both Zyxin and Cystatin C are assigned to cluster 15, the most influential cluster in the regression model.
Inferring Meta-covariates in Classification
157
Table 2. Summary of cluster annotations Cluster 15 16 21 1 11 3 4 10 6 20
w MIB D/RB RA 3.87 n n y 3.21 y y y 2.46 y y y 1.00 y y -1.10 n n -1.22 n y -1.88 y n -2.66 n y n -3.08 y y -3.79 y n
ER RB KA TMT TRR y y y y y y y ∼ y y y y
y y
y
y y n
y y y y y
y
n
y
MIB = metal ion binding; D/RB = DNA or RNA binding; RA = receptor activity; ER = enzyme regulation; RB = receptor binding; KA = kinase activity; TMT= transmembrane transport; TRR = transcription regulation. y indicates over-representation; n indicates under-representation; ∼ indicates conflicting results.
4 Conclusions
The method is successful in assigning limited influence to control probes. The clustering of probes reflects functional differences between the genes that they represent. Furthermore, enrichment of metal ion binding and receptor activity annotations appear to correspond with the sign of the regression coefficients; that is, clusters with positive regression coefficients are more often enriched for such annotations, while clusters with negative regression coefficients are often under-represented by such annotations. In a comparison with methods of variable selection in the same dataset, genes important in the discrimination between AML and ALL tend to belong to clusters with high absolute regression coefficients in the model; this is particularly true as the variable selection methods become more sophisticated and fewer genes are found to be significant. Of the three genes that are common in three different analyses of these data, two (Zyxin and Cystatin C) are assigned to the most influential cluster in our model. Our experimental results thus indicate that our EM algorithm approach of inferring meta-covariates in classification is a promising new methodology with wide applicability. Moreover, the approach can be naturally extended to multiclass classification and to incorporate sparsity by employing an Inverse Gamma prior on the variance of the regression coefficients. Future research will focus on developing a Bayesian sampler for the “meta-covariate” classification model, possibly using reversible jump Markov chain Monte Carlo or an infinite mixture model to infer directly from the data the optimal number of clusters. Acknowledgements. K. Harris & M. Girolami are supported by the Engineering and Physical Sciences Research Council (EPSRC) grant EP/F009429/
1 - Advancing Machine Learning Methodology for New Classes of Prediction Problems. M. Girolami is funded by an EPSRC Advanced Research Fellowship EP/E052029/1. L. McMillan is funded by a grant from SHEFC SRDG.
References

1. Lee, K.E., Sha, N., Dougherty, E.R., Vannucci, M., Mallick, B.K.: Gene selection: a Bayesian variable selection approach. Bioinformatics 19(1), 90–97 (2003)
2. Bae, K., Mallick, B.K.: Gene selection using a two-level hierarchical Bayesian model. Bioinformatics 20(18), 3423–3430 (2004)
3. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006)
4. Fraley, C., Raftery, A.E.: Bayesian regularization for normal mixture estimation and model-based clustering. Journal of Classification 24(2), 155–181 (2007)
5. Hanczar, B., Courtine, M., Benis, A., Henegar, C., Clément, K., Zucker, J.D.: Improving classification of microarray data using prototype-based feature selection. SIGKDD Explorations 5(2), 23–30 (2003)
6. Park, M.Y., Hastie, T., Tibshirani, R.: Averaged gene expressions for regression. Biostatistics 8(2), 212–227 (2007)
7. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439), 531–537 (1999)
8. Dudoit, S., Fridlyand, J., Speed, T.P.: Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 97(457), 77–87 (2002)
A Derivation of the EM Algorithm

A.1 Regression
Modelling. In the following subsection, y denotes an N × 1 continuous response vector.

Joint distribution.

    p(y, X, θ, w) = p(y | θ, w) p(X | θ) p(θ) p(w).   (6)

Regression model.

    y_n = w^T θ_n + ε_n,   where ε_n ~ N(0, 1).   (7)

    ⇒ p(y | θ, w) = ∏_{n=1}^{N} p(y_n | θ_n, w) = ∏_{n=1}^{N} (2π)^{-1/2} exp( -(1/2)(y_n - w^T θ_n)² ).   (8)

    ⇒ log p(y | θ, w) = -(1/2) ∑_{n=1}^{N} (y_n - w^T θ_n)² - (N/2) log(2π).   (9)
Clustering model. Normal mixture model with equal weights and identity covariance matrices:

    ⇒ p(x) = (1/K) ∑_{k=1}^{K} N(x | θ_k, I).   (10)

From [3] we have that

    log p(X | θ) ≥ L(q, θ) = ∑_Z q(Z) log [ p(X, Z | θ) / q(Z) ],   (11)

where Z is a D × K matrix of latent variables with rows z_d^T such that z_d is a K-dimensional binary random variable having a 1-of-K representation, in which a particular element z_k is equal to 1 and all other elements are equal to 0, and q(Z) is a distribution defined over the latent variables.

    ⇒ log p(X | θ) ≥ ∑_Z p(Z | X, θ^old) log p(X, Z | θ) - ∑_Z p(Z | X, θ^old) log p(Z | X, θ^old)   (12)

                  = Q(θ, θ^old) + const.   (13)

    p(X, Z | θ) = ∏_{d=1}^{D} ∏_{k=1}^{K} (1/K)^{z_dk} N(x_d | θ_k, I)^{z_dk}.   (14)

    ⇒ E_Z[ log p(X, Z | θ) ] ≥ -(1/2) ∑_{d=1}^{D} ∑_{k=1}^{K} E(z_dk) ∑_{n=1}^{N} (x_nd - θ_kn)² + const.   (15)

Prior distributions.

    p(θ) = ∏_{k=1}^{K} N(θ_k | θ_0, hI),   (16)

where each element of θ_0 is set to the corresponding covariate interval midpoint and h is chosen arbitrarily large in order to prevent the specification of priors that do not overlap with the likelihood and to allow for mixtures with widely different component means.

    p(w) = N(w | 0, lI).   (17)

E-step.

    E(z_dk) = γ(z_dk) = (1/K) N(x_d | θ_k, I) / ∑_{j=1}^{K} (1/K) N(x_d | θ_j, I)   (18)

            = exp( -(1/2) ‖x_d - θ_k‖² ) / ∑_{j=1}^{K} exp( -(1/2) ‖x_d - θ_j‖² ).   (19)
M-step.

    log p(y, X, θ, w) ≥ -(1/2) ∑_{n=1}^{N} ( y_n - ∑_{k'=1}^{K} w_{k'} θ_{k'n} )² - (1/2) ∑_{d=1}^{D} ∑_{k=1}^{K} γ(z_dk) ∑_{n=1}^{N} (x_nd - θ_kn)²
                        - (1/2h) ∑_{k=1}^{K} ∑_{n=1}^{N} (θ_kn - θ_0n)² - (1/2l) ∑_{k=1}^{K} w_k² + const.   (20)

    ∂ log p(y, X, θ, w) / ∂θ_kn = ( y_n - ∑_{k'=1}^{K} w_{k'} θ_{k'n} ) w_k + ∑_{d=1}^{D} γ(z_dk)(x_nd - θ_kn) - (1/h)(θ_kn - θ_0n) = 0.   (21)

    ⇒ θ_k = [ (y - θ^T w_{-k}) w_k + X γ_k + (1/h) θ_0 ] / [ w_k² + ∑_{d=1}^{D} γ(z_dk) + 1/h ],   (22)

where w_{-k} is w with the kth element set to 0 and γ_k is the D × 1-dimensional column of the D × K matrix of responsibilities [γ(z_dk)].

    ∂ log p(y, X, θ, w) / ∂w_k = ∑_{n=1}^{N} ( y_n - ∑_{k'=1}^{K} w_{k'} θ_{k'n} ) θ_kn - (1/l) w_k = 0.   (23)

    ⇒ w = ( θθ^T + (1/l) I )^{-1} θ y.   (24)
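A compact sketch of one M-step sweep following equations (22) and (24) is given below. The shapes (theta stored as N x K, gamma as D x K) are our own convention for illustration, not the authors' implementation.

import numpy as np

def m_step(y, X, theta, w, gamma, theta0, h, l):
    """Update theta column by column via equation (22), then w via equation (24)."""
    theta = theta.copy()
    K = theta.shape[1]
    for k in range(K):
        w_minus_k = w.copy()
        w_minus_k[k] = 0.0
        num = (y - theta @ w_minus_k) * w[k] + X @ gamma[:, k] + theta0 / h
        den = w[k] ** 2 + gamma[:, k].sum() + 1.0 / h
        theta[:, k] = num / den                                         # equation (22)
    w = np.linalg.solve(theta.T @ theta + np.eye(K) / l, theta.T @ y)   # equation (24)
    return theta, w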
A.2 Extension to Binary Classification
Modelling.

Joint distribution. The joint distribution now becomes

    p(t, y, X, θ, w) = p(t, y | θ, w) p(X | θ) p(θ) p(w).   (25)

Classification model.

    t_n = 1 if y_n > 0, and t_n = -1 otherwise.   (26)

    y_n = w^T θ_n + ε_n,   where ε_n ~ N(0, 1).   (27)

    ⇒ p(t, y | θ, w) = ∏_{n=1}^{N} p(t_n, y_n | θ_n, w)   (28)

                     = ∏_{n=1}^{N} p(t_n | y_n) p(y_n | θ_n, w)   (29)

                     = ∏_{n=1}^{N} p(t_n | y_n) N(y_n | w^T θ_n, 1),   (30)

where

    p(t_n | y_n) = δ(y_n > 0) if t_n = 1, and δ(y_n ≤ 0) otherwise.   (31)
E-step. Then, by taking logarithms and applying Jensen's inequality, we obtain the following result:

    E_y[ log p(t, y | θ, w) ] ≥ ∑_{n=1}^{N} log [ p(t_n | E(y_n)) N(E(y_n) | w^T θ_n, 1) ].   (32)

    y_n | t_n, θ, w ∝ δ(y_n > 0) N(y_n | w^T θ_n, 1) if t_n = 1, and δ(y_n ≤ 0) N(y_n | w^T θ_n, 1) otherwise.   (33)

    ⇒ E(y_n) = w^T θ_n + φ(-w^T θ_n) / (1 - Φ(-w^T θ_n)) if t_n = 1, and E(y_n) = w^T θ_n - φ(-w^T θ_n) / Φ(-w^T θ_n) otherwise.   (34)

We now see that p(t_n | E(y_n)) = 1 and equation (32) simplifies to

    E_y[ log p(t, y | θ, w) ] ≥ ∑_{n=1}^{N} log N(E(y_n) | w^T θ_n, 1)   (35)

                              = -(1/2) ∑_{n=1}^{N} ( E(y_n) - w^T θ_n )² - (N/2) log(2π).   (36)
We thus see that the only difference between equations (9) and (36) is that yn is replaced by E(yn ). Hence, the E-step now involves evaluating E(yn ) using equation (34), in addition to evaluating the responsibilities γ(zdk ) using equation (19). M-step. As the clustering model and the prior distributions are left unchanged, the M-step also remains unchanged except for y being replaced by E(y) in equations (22) and (24).
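Equation (34) is the mean of a Gaussian truncated at zero, which can be evaluated with the standard normal pdf and cdf. A minimal sketch, with array shapes as our own convention:

import numpy as np
from scipy.stats import norm

def expected_latent(t, theta, w):
    """E-step update of equation (34) for all n. t: labels in {-1, +1}, theta: N x K, w: K."""
    m = theta @ w                        # w^T theta_n for every n
    phi = norm.pdf(-m)
    Phi = norm.cdf(-m)
    pos = m + phi / (1.0 - Phi)          # t_n = +1: mean of N(m, 1) truncated to y > 0
    neg = m - phi / Phi                  # t_n = -1: mean of N(m, 1) truncated to y <= 0
    return np.where(t > 0, pos, neg)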
A Multiobjective Evolutionary Algorithm for Numerical Parameter Space Characterization of Reaction Diffusion Systems

Tim Hohm and Eckart Zitzler

Computer Engineering and Networks Laboratory, ETH Zurich, 8092 Zurich, Switzerland
{tim.hohm,eckart.zitzler}@tik.ee.ethz.ch
http://www.tik.ee.ethz.ch/sop/
Abstract. Mathematical modeling is used to assist in studying complex biological systems. Still, setting up and characterizing models pose challenges of its own: identifying suitable model parameters, even when highresolution time course data concerning the system behavior is available, is a difficult task. This task is further complicated when this high-resolution data remains unavailable like for the tissue level systems considered in developmental biology—a type of systems we focus on in the present study. In addition, costly simulations for tissue level systems prohibit excessive simulations during the parameter estimation phase. Here, we propose an approach that is dedicated to assist in the task of parameter space characterization for reaction diffusion models—a common type of models in developmental biology. We investigate a method to numerically identify boundaries that partition the parameter space of a given model in regions that result in qualitatively different system behavior. Using an Evolutionary Algorithm (EA) combined with an Artificial Neural Network (ANN), we try to identify a representative set of parameter settings minimizing the distance to such boundaries. In detail we train the ANN on numerical data annotated using analytical results to learn the mapping between parameter space and distances to boundaries, thereby guiding the optimization process of the EA to identify such a set of parameter settings. The approach is tested with respect to its boundary identification and generalization capabilities on three different reaction diffusion systems—for all three we are capable of reliably identifying boundaries using the proposed approach.
1 Introduction
Mathematical modeling is a powerful tool to help understand processes in complex biological systems [14,20,22]. Especially in the field of developmental biology a certain type of models, so-called reaction diffusion systems, are among the most cited approaches [20]. Dating back to Turing [21], different reaction diffusion systems are used to explain a range of pattern formation in different
biological systems [5,11,12,13,16,23]. Although mathematical modeling thereby has proven its value in studying biological systems, the task of setting up suitable models poses some challenges of its own: after translating a set of initial hypothesis in a model structure, the usually parameterized models need to be tuned, i.e., the model parameters need to be optimized in order to minimize the deviation between known experimental data and simulation output. This task is complicated especially for tissue level systems considered in developmental biology for which in many cases only scarcely high-resolution quantitative timecourse data is available and due to the fact that the interactions between model components tend to be non-linear [8,17]. In addition, simulations for tissue level simulations are computationally expensive which prohibits excessive simulations during the parameter estimation phase. In the literature, there are mainly three different approaches to tackle the afore mentioned task of parameter estimation for multi cell systems where mostly qualitative data is used: (i) tuning model parameters by hand [11,23], (ii) exploiting analytical results concerning the system to guide parameter choices [12], (iii) employing optimization techniques to minimize deviations between experimental data and simulation output [8,17]. All three techniques suffer from some limitations: tuning by hand and analytical characterizations of target systems are feasible only for small systems due to the increasingly cumbersome process of the analytical characterization for growing numbers of involved independent parameters and due to the fact that the size of respective parameter spaces grows exponentially with the system size and single simulations are computationally costly. This prohibits the necessary number of tuning steps by hand. Optimization techniques in turn are capable of handling up to mid-sized models but with further growing systems as well suffer from the exponential explosion of the parameter space, complexly structured parameter spaces due to non-linear dependencies between model components, and costly simulations. To address the main problem of exponential explosion of these complexly structured parameter spaces we propose a method where we couple an analytical approach with an optimization method. By exploiting analytical results to guide the optimization process, our approach is supposed to numerically reveal the structure of the parameter space comparable to what an analytical characterization would yield. Thereby, we could learn about for instance qualitatively different system behaviors a considered model is capable of showing and since the target behavior is described in rather qualitative terms, a parameter setting contained in a matching region of the parameter space should show good agreement with a sought target behavior. In addition, further fine tuning should be possible either by hand or by using one of the afore mentioned optimization techniques where the searched parameter space is restricted to parameter space partitions identified by our approach. Focusing on reaction diffusion systems, in detail we use analytical results gained for a simple system (a simplified variant of the activator inhibitor system [12]) and machine learning techniques (ANNs [3]) in order to train a predictor to estimate the distance of a given parameter setting from a boundary in parameter
space that discriminates between qualitatively different system behaviors. This predictor ANN is then used in a stochastic optimization technique (EAs [1,4]) to guide the search process to identify a well distributed set of parameter settings constituting boundaries in parameter space between qualitatively different behaving regions of the parameter space. Due to the fact that the general principles inducing different system behaviors are similar for all reaction diffusion systems, namely the concept of local self-activation and long range inhibition [5,13], the predictor ANN trained on data for the simple reaction diffusion system is supposed to generalize well for other systems. After testing the ANN and EA on the activator inhibitor system used for training the ANN, we provide a proof of principles concerning the generalization capabilities of our approach by applying it to two further reaction diffusions systems: an activator substrate system [13] and the Brusselator system [16]. In the following, we will first introduce the concept underlying our approach and give a detailed description of the approach itself (Sec. 2). We then show results gained for the three considered test systems, activator inhibitor system, activator substrate system, and Brusselator (Sec. 3), and conclude the paper with some final remarks (Sec. 4).
2 Approach
Briefly summarizing the concept underlying our approach, we propose to use an EA in order to identify parameter settings for reaction diffusion systems that delimit regions in parameter space resulting in qualitatively different system behavior. To guide the EA we employ an ANN that was trained using analytical characterization data for a simple reaction diffusion system in order to predict for a given parameter setting its distance to such a boundary. In detail we use analytical information concerning a simplified version of the activator inhibitor system as found in the appendix of [12]. Numerically simulating a grid of parameter settings covering the interesting part of the parameter space, we generate time course data that shows the typical behaviors this system is capable of generating (cf. Fig. 1). Since, due to peculiarities of the used integrators, the empirical boundaries are slightly shifted with respect to the analytically determined boundaries, we manually adjust the theoretical boundaries to the numerical data. We then compute the shortest distance of each parameter set to the empirical boundaries. In addition we process the numerical data in order to reduce it to some meta characteristics that capture important features to determine to which qualitative region a given parameter setting belongs, a necessary step so that the considered characteristics become invariant to the exact specification of the considered reaction diffusion system and therefore allow for generalization. Using these meta characteristics as inputs and the calculated distances as outputs, an ANN is employed to learn the mapping from input characteristics to distance from parameter setting to a boundary in parameter space. This predictor is then used to guide an EA in order to identify boundaries delimiting regions in parameter space resulting in qualitatively different behavior of a given system.
Fig. 1. Time courses representing the three qualitatively different behaviors of the one dimensional activator inhibitor system. The y-axis of each plot shows the reached concentrations while the x-axis represents time and each of the 100 curves per plot represents the time behavior of the activator of a single cell of the system: (a) a typical oscillating system, (b) a spatially heterogeneous pattern, and (c) a spatially homogeneous behavior.
Fig. 2. Sketch of the training phase of the ANN as a predictor for the distance between a parameter setting and the closest boundary in parameter space delimiting regions resulting in qualitatively different behavior (left) and the EA that identifies points on such a boundary for an unknown reaction diffusion system, building on the ANN (right)
In this context, the EA generates a parameter setting, which is then simulated. Using the simulated time course, the inputs for the ANN are determined and, building on the ANN's prediction of the distance to a boundary, the EA then refines the proposed parameter setting in order to better match a supposed boundary. Both the training process of the ANN and the EA are sketched in Fig. 2, and further details are given in the following.

2.1 Training Data Generation
To generate the training data for the ANN we use an already analytically characterized simplified variant of the activator inhibitor system [12], given by the following equations:

    ∂a/∂t = D Δa + ξ a²/h − a + σ   (1)

    ∂h/∂t = Δh + ξ μ (a² − h)   (2)
This system consists of two interlinked species, an activator a and an inhibitor h. Their respective time behavior is described by partial differential equations that in addition to time depend on spatial information: a diffusion term represented by the Laplace operator Δ and a diffusion constant D. Both species encompass a reaction term, perturbed by a uniformly random value ξ ∈ [0.95, 1.05], and a decay term. The reaction- and decay term of the inhibitor are quantified with a constant μ. In addition, the activator contains the term σ that represents basal expression. This system depends on three constants: D, σ, and μ. To generate numerical data for this system, we consider an implementation of this system in a one-dimensional spatial domain consisting of 100 cells with periodic boundary conditions. We sample the parameter space using an equidistant grid of 5000 parameter settings on the parameter sub-space spanned by (D, μ) and fix σ = 0.001 as well as the initial conditions (ai , hi ) = (0.01, 0.01) of all cells i for both species a and h. The grid spans (D, μ) = [0.006, 0.3] × [0.04, 4] with respective steps of (0.006, 0.04). For numerical integration we consider the interval [0, 1000] of dimensionless time and use an implicit explicit scheme consisting of a modified Crank-Nicolson integrator and an Adams-Bashford integrator [18]. For time discretization we use a time step of δt = 0.125 and for space discretization we apply a spatial grid in cellular resolution. After identifying the analytically determined boundaries in (D, μ) parameter space in the numerical data, we compute the shortest Euclidean distance for each simulated parameter setting to these boundaries after normalizing the (D, μ) = [0, 0.3] × [0, 4] parameter space to [0, 1] × [0, 1]. The resulting distances are shown in Fig. 3. After thereby generating the outputs used for training the ANN, in a last step we need to reduce the integration data (per parameter setting an n × m matrix X with n being the number of cells and m being the number of considered time points) to a set of meta characteristics that capture system invariant features that allow the ANN to learn the mapping between parameter setting (represented by the features) and the shortest distance to a boundary in parameter space. Analyzing the available time course data we found out that in principle two characteristics should be sufficient to characterize the different parameter settings: (i) the spatial difference occurring between all cells during a late
integration step and (ii) the dominating oscillation period estimated from the data (for non-oscillating time courses it can be set to a very small positive value, here 10e−14). These two characteristics have the advantage that they are invariant with respect to variations of the simulated domain, both in numbers of cells and changes of dimensionality. Still, these two characteristics allow us to capture features discriminating between oscillatory and stable system behavior and spatially homogeneous or heterogeneous states. When in addition considering these two characteristics only for the activator, we gain further invariance with respect to possible other realizations of an inhibition; e.g., instead of a direct inhibition by an inhibitor, inhibition could be realized by depleting a substrate. While the computation of the spatial difference is a straightforward procedure, we briefly explain how we estimate the dominant oscillation period. In a first step, we reduce the existing time course data X to a consensus time course X_max by taking the maximum over all cells for each time point. This has the advantage of generating a more regular signal since, due to the stochastic ξ terms, the considered time course might show some irregularities in single cells. In a second step we discretize the data using a threshold δ_thresh = (1/2) mean(X_max). For all time points where X_max ≥ δ_thresh, the discretized time course data X_disc is set to 1 and 0 otherwise. Then, the periods between ‘1’ peaks are determined and gathered in a histogram with buckets encompassing 5 time steps. Using a sliding window covering 5 consecutive buckets, the period with the most occurrences is determined, where in case of equal occurrences we have a preference for shorter periods. The process of determining the dominant period is sketched in Fig. 4.

Fig. 3. Normalized distances of each parameter setting in the (D, μ) = [0, 0.3] × [0, 4] parameter space as determined from the simulation data (a) and predicted by the ANN (b)

Fig. 4. A sketch of the process used to estimate the dominant oscillation period in time course data: (a) time course data of the activator for a 100 cell activator inhibitor system, (b) time courses are reduced to a single time course that represents the maximum for each time point over the 100 cells (solid curve) and the threshold δ_thresh used for discretization (dashed curve), (c) discretized time course data for which the periods between ‘1’ peaks are computed, and (d) histogram of the calculated periods and the sliding window used to determine the dominant period in terms of occurrences
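A compact NumPy transcription of the period-estimation procedure just described could look as follows. It is a sketch under our own assumptions (rising edges of the discretized signal mark the ‘1’ peaks; the winning window is represented by its first bucket edge), not the authors' code.

import numpy as np

def dominant_period(X, dt=0.125, bucket=5, window=5):
    """Estimate the dominant oscillation period from time course data X (cells x time points)."""
    x_max = X.max(axis=0)                          # consensus time course over all cells
    disc = (x_max >= 0.5 * x_max.mean()).astype(int)
    rises = np.flatnonzero(np.diff(disc) == 1)     # time steps where the signal switches 0 -> 1
    if len(rises) < 2:
        return 1e-14                               # non-oscillating: tiny positive placeholder value
    periods = np.diff(rises)                       # periods between '1' peaks, in time steps
    edges = np.arange(0, periods.max() + bucket + 1, bucket)
    counts, _ = np.histogram(periods, bins=edges)
    window = min(window, len(counts))
    # Slide a window over consecutive buckets; ties are broken in favour of shorter periods.
    best = max(range(len(counts) - window + 1),
               key=lambda i: (counts[i:i + window].sum(), -i))
    return edges[best] * dt                        # representative period in simulated time units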
2.2 Artificial Neural Networks
To learn the mapping between inputs extracted from the time course data of the numerical integration to the distance of the respective parameter setting to a boundary in parameter space delimiting partitions of qualitatively different system behavior, we chose to use ANNs [3]. Since learning the described mapping is a regression problem, we decided to choose feed-forward multi-layer perceptrons with two layers of adaptive weights and in total three layers: an input layer with two neurons, a hidden layer of 50 neurons with hyperbolic tangent functions as activation functions, and an output layer with a single neuron and a linear activation function. To further enhance the predictive capabilities of the ANN, instead of a single ANN we decided to train an ensemble of ANNs [19]. In detail we use the W-SECA procedure proposed by Granitto et al. [6] to construct the ANN ensemble, where the ensemble prediction is the weighted mean of all ANNs in the ensemble, using the following weighting function w_i for each ensemble member i [6]:

    w_i = e_i^{-2} / ∑_j e_j^{-2}.   (3)

Here, e_i is the prediction error of an ensemble member with respect to the data set containing all 5000 data points and j iterates over all ensemble members. Since the single input values can become rather large, to facilitate training we transform the inputs by taking their logarithm. Each ensemble member is trained using the scaled conjugate gradients algorithm [3]. For training, the available data is divided in a training set and a validation set using bootstrapping: the training set consists of 5000 bootstrap samples while the not-sampled points are used as validation set. The ANNs are trained using the training set until the prediction error for the validation set in successive training epochs gets worse. Ensemble predictions of the ensemble used in the following for all parameter settings used during training are shown in Fig. 3.
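The ensemble combination of equation (3) amounts to an inverse-squared-error weighted mean; a minimal sketch (function and argument names are our own) is:

import numpy as np

def ensemble_predict(member_predictions, member_errors):
    """Weighted-mean ensemble prediction with weights w_i = e_i^-2 / sum_j e_j^-2 (equation (3))."""
    e = np.asarray(member_errors, dtype=float)
    w = e ** -2 / np.sum(e ** -2)
    return w @ np.asarray(member_predictions)   # one row of predictions per ensemble member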
2.3 Evolutionary Algorithm
Aiming at the identification of a representative set of parameter settings delimiting regions in parameter space resulting in qualitatively different system behavior, we complement the optimization criterion of minimizing the distance to a boundary with a second objective: the coverage of the parameter space by the identified parameter settings. In order to identify trade-offs between these two objectives we apply the Multiobjective Covariance Matrix Adaptation Evolution Strategy (MO-CMA-ES) [10], belonging to a class of EAs designed to identify compromises between conflicting objectives like distance to a boundary and coverage of parameter space. (We slightly vary the original MO-CMA-ES: instead of the exact hypervolume we use a Monte Carlo sampling method for hypervolume estimation [2] during environmental selection [9].) The MO-CMA-ES already showed good results in a comparable situation where, on top of the core optimization criterion, the coverage of the parameter space had to be considered [9]. The coverage of the parameter space is assessed using a criterion proposed in [9]: the parameter settings x_i in a population G of an EA are ranked using their distance to uniformly random parameter settings x_j ∈ S (see [9] for a detailed description). In total we draw |S| = 29958 random parameter settings; following Hoeffding's inequality [7], this results in a probability of 0.95 of an error in coverage computation err ≤ 0.01 for the considered two-dimensional normed parameter space [0, 1]².
3 Simulations and Results
In the following we present results of our approach on three test systems: as a proof of principle we use the activator inhibitor system that was used for training data generation; to test the generalizability we use two further, conceptually different systems. Before we present the obtained results, we briefly describe the used experimental setup as well as the means of validation for the found settings.

3.1 Methodology
For the optimization process, on each system we used the same ANN ensemble and ran the EA 10 times. For each EA run we allowed 2500 function evaluations using a population size of 50. For the HypE function [2] employed during environmental selection we used 10000 samples and the reference point (1, 100) for the two objectives (i) distance to boundary and (ii) coverage of the parameter space. Each of the EA runs took approximately 2 days on a two-chip dual-core AMD Opteron 2.6GHz 64-bit machine with 8GB RAM using MATLAB 7.6 (R2008a) and the NETLAB [15] implementation for ANNs and related algorithms. For the evaluation of the EA runs we considered two different factors: (i) the reproducibility of the identified sets of parameter settings over all EA runs for each system, and (ii) the goodness of the identified boundaries. Although the reproducibility of the found sets of parameter settings is difficult to assess quantitatively, visual inspection of the sets clearly showed that certain subspaces contained no identified parameter settings while others were well populated for all runs; we deem this visual inspection sufficient to document the reproducibility. In order to validate the identified boundaries we used two different approaches: since for the activator inhibitor system the boundaries are known, we visually compared the identified parameter settings to the known boundaries. For the remaining two systems we validated the putative boundaries inferred from the identified parameter settings by probing the behavior around the putative boundaries: we simulate parameter settings residing on vectors orthogonal to the assumed boundaries in order to test if a qualitative change of system behavior occurs in the vicinity of the putative boundaries. In addition, using the same probing technique we test if parameter settings located in regions for which no boundary is detected exhibit qualitatively similar behavior.
3.2 Proof of Principle
We used the proposed method to identify boundaries partitioning the parameter space in regions resulting in qualitatively different system behavior for the activator inhibitor system (Eqs. 1–2) that was used for training the ANN ensemble. In a first run we observed that the coverage of the parameter space became worse during the optimization process, corresponding with a reduction in the number of distinct parameter settings constituting the estimated boundaries. Eventually, the algorithm converged, ending up with only one or two parameter settings. When analyzing the landscape of distances predicted by the ANN ensemble, we found that although the general distance landscape is in good agreement with the calculated distances (see Fig. 3), not all boundary-constituting parameter settings are mapped to the same globally optimal value: for example, in the region with small D-values and large μ-values the predicted distances become negative and, in terms of minimization, better than those for other boundary points. Thereby, our approach traded off coverage for concentrating on the regions containing negative values. In order to prevent these false global optima from dominating the optimization process we decided to cut off the predicted distance values at the level of 0. Thereby we achieve that most boundary-constituting parameter settings are mapped to the globally optimal value of 0, but at the same time introduce some false positive boundary points, e.g., again in the region with small D-values and large μ-values. Using this modification the boundary determined by our approach is in good agreement with the known boundary (see Fig. 5). Still, the (D, μ) ∈ [0, 0.3] × [0, 1.56] regime corresponding to oscillating system behavior contains a considerable number of false positive settings. When again checking the predicted distances (see Fig. 3a) it can be seen that these false positive settings correspond to narrow spikes in the predicted landscape, a fact that could be addressed either by considering the robustness of the predicted distance to a boundary with respect to some sort of neighborhood around the considered parameter setting or by further refining the training process of the ANNs, e.g., by including regularization terms to smooth ANN outputs by preventing possible overfitting. Nevertheless, although a number of parameter settings corresponds to false positive boundary points, the approach in its current form already clearly shows that large parts of the parameter space belong to qualitatively similar regions and therefore can be neglected.
3.3 Test of Generalizability
After this proof of principle, we tested the generalization capabilities of our approach by running it on the remaining two test systems. When checking the data for the activator substrate system (Eqs. 4–5), the identified parameter settings clearly outline a boundary from small D-values and large μ-values towards large D-values and small μ-values. To validate if these settings constitute a true boundary between qualitatively differently behaving parameter space regions we probed the behavior in a neighborhood around the putative boundary using the vectors shown in Fig. 5b. Evaluating the corresponding simulations we could
confirm that on the lower border of the identified boundary the system shows a change in behavior from a spatially heterogeneous pattern (lower region in Fig. 5b) to a spatially homogeneous pattern (upper region in Fig. 5b). In addition, along the probing vectors located in regions for which no boundary was predicted, indeed no qualitative change in system behavior could be observed. When looking at the putative boundary-constituting parameter settings identified for the Brusselator (see Fig. 5c), one recognizes that identifying boundaries becomes increasingly more difficult when dealing with higher dimensional search spaces, especially when the boundaries stem from non-linear relations between parameters. Still, we have been able to identify a hyperplane outlined by found parameter settings. Using the same probing approach (see Fig. 5c for the exact location of the hyperplane and probing vector) to validate this putative boundary, we observed a change from spatially homogeneous, temporally stable solutions to temporal oscillations when following the probing vector in the direction of increasing b. Again, probing regions that according to our approach were not supposed to contain boundaries showed no qualitative change in system behavior.

Fig. 5. Plots showing the probable boundary delimiting parameter settings identified for the three test systems: (a) identified parameter settings (circles) and analytically determined boundary points (squares) for the activator inhibitor system, (b) identified parameter settings for the activator substrate system and the used probing vectors, and (c) identified parameter settings for the Brusselator as well as the used probing vectors and the hyperplane outlining an assumed boundary
4 Conclusions
In this study we investigated the proposed approach to exploit analytical information in order to numerically characterize reaction diffusion systems. Using an EA, we tried to identify parameter settings that constitute boundaries that partition the parameter space in regions showing qualitatively different system behavior. To guide the search process of the EA we employed an ANN ensemble which was trained using numerical data generated for a simple reaction diffusion system and annotated with analytical results. We tested our approach on three different reaction diffusion systems, the activator inhibitor system that was used
for training data generation, and two conceptually different reaction diffusion systems: an activator substrate system and the Brusselator. With the presented results we documented the reliable identification of parameter settings residing on boundaries in parameter space as well as the generalizability of our approach for different reaction diffusion systems. In order to further test out approach we plan to apply it to new and larger systems—although the results obtained for the Brusselator indicate that it might be necessary to generate exponentially growing numbers of parameter settings to reliably outline boundaries in high-dimensional parameter spaces as well as it could become difficult to infer the putative boundaries outlined by the identified parameter setting with growing dimensionality. Addressing these concerns it could be interesting to slightly alter the scope of our approach: although knowing the complete structure of the parameter space provides valuable information concerning the characterization of a system, in many situations it is sufficient to identify a region in parameter space showing a certain qualitatively behavior. Therefore it should be possible to train an ANN ensemble, instead for boundary identification, for identification of a region in parameter space showing the target behavior. In turn, a small number of parameter settings is sufficient to, e.g., indicate the centroid of such a region, as well as it solves the problem of having to derive the exact location of a putative boundary from a set of parameter settings.
References 1. B¨ ack, T., Fogel, D.B., Michalewicz, Z. (eds.): Handbook of Evolutionary Computation. IOP Publishing and Oxford University Press (1997) 2. Bader, J., Zitzler, E.: HypE: An Algorithm for Fast Hypervolume-Based ManyObjective Optimization. TIK Report 286, Computer Engineering and Networks Laboratory (TIK), ETH Zurich (November 2008) 3. Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press, Oxford (1995) 4. Foster, J.A.: Evolutionary Computation. Nat. Rev. Genet. 2(6), 428–436 (2001) 5. Gierer, A., Meinhardt, H.: A theory of biological pattern formation. Kybernetik 12, 30–39 (1972) 6. Granitto, P.M., Verdes, P.F., Cecatto, H.A.: Neural network ensembles: evaluation of aggregation algorithms. Artif. Intell. 163, 139–162 (2005) 7. Hoeffding, W.: Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association 58(301), 13–30 (1963) 8. Hohm, T., Zitzler, E.: Modeling the Shoot Apical Meristem in A. thaliana: Parameter Estimation for Spatial Pattern Formation. In: Marchiori, E., Moore, J.H., Rajapakse, J.C. (eds.) EvoBIO 2007. LNCS, vol. 4447, pp. 102–113. Springer, Heidelberg (2007) 9. Hohm, T., Zitzler, E.: Multiobjectivization for Parameter Estimation: a Case-Study on the Segment Polarity Network of Drosophila. In: Raidl, G., et al. (eds.) Genetic and Evolutionary Computation Conference (GECCO 2009). ACM, New York (to appear, 2009)
10. Igel, C., Hansen, N., Roth, S.: The Multi-objective Variable Metric Evolution Strategy, Part I. Technical Report IRINI 2005-04, Institut f¨ ur Neuroinformatik, RuhrUniversit¨ at Bochum, 44780 Bochum (2005) 11. J¨ onsson, H., Heisler, M., Reddy, G.V., Agrawal, V., Gor, V., Shapiro, B.E., Mjolsness, E., Meyerowitz, E.M.: Modeling the organization of the WUSCHEL expression domain in the shoot apical meristem. Bioinformatics 21, i232–i240 (2005) 12. Koch, A.J., Meinhardt, H.: Biological pattern formation: from basic mechanisms to complex structures. Rev. Mod. Phys. 66(4), 1481–1510 (1994) 13. Meinhardt, H.: Models of Biological Pattern Formation. Academic Press, London (1982) 14. Murray, J.D.: Mathematical Biology. Springer, New York (2003) 15. Nabney, I.T.: NETLAB: Algorithms for Pattern Recognition. In: Advances in Pattern Recognition, 2nd edn. Springer, Oxford (2003) 16. Prigogine, I., Lefever, R.: Symmetry Breaking Instabilities in Dissipative Systems. J. Chem. Phys. 48, 1695–1700 (1968) 17. Raffard, R., Amonlirdviman, K., Axelrod, J.D., Tomlin, C.J.: Automatic parameter identification via the adjoint method, with application to understanding planar cell polarity. In: IEEE Conference on Decision and Control, Piscataway, NJ, USA, pp. 13–18. IEEE Press, Los Alamitos (2006) 18. Ruuth, S.J.: Implicit-explicit methods for reaction-diffusion problems in pattern formation. J. Math. Biol. 34(2), 148–176 (1995) 19. Sharkey, A.J.C. (ed.): Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems. Springer, London (1999) 20. Tomlin, C.J., Axelrod, D.: Biology by numbers: mathematical modelling in developmental biology. Nat. Rev. Genet. 8, 331–340 (2007) 21. Turing, A.: The chemical basis for morphogenesis. Philos. Trans. R Soc. Lond, B 237, 37–72 (1952) 22. Voit, E.O.: Computational Analysis of Biochemical Systems. Cambridge University Press, Cambridge (2000) 23. Yamaguchi, M., Yoshimoto, E., Kondo, S.: Pattern regulation in the stripe of zebrafish suggests an underlying dynamic and autonomous mechanism. Proc. Natl. Acad. Sci. USA 104(12), 4790–4793 (2007)
A Further Test Systems
In the following we would like to briefly introduce the remaining two test systems that were used in this study. To be able to test the generalization capabilities of the proposed approach we chose to use conceptually different reaction diffusion systems compared to the activator inhibitor system used for training purposes; both realize the long-range inhibition by some sort of depleting substrate. Equations 4–5 thereby constitute the activator substrate system [5,12,13]. Alike to the activator inhibitor system used for training, we fix σ = 0.001 and explore the thereby reduced (D, μ) parameter space. The remaining two Eqs. 6–7 form the Brusselator [16] for which we consider the three-dimensional (D, a, b) parameter space. In all four equations ξ represents a random perturbation uniformly drawn from the interval [0.95, 1.05].
    ∂a/∂t = D Δa + ξ a² s − a + σ   (4)

    ∂s/∂t = Δs + ξ μ (1 − s a²)   (5)

    ∂x/∂t = D Δx + a − (b + 1) x + ξ x² y + σ   (6)

    ∂y/∂t = Δy + b x − ξ x² y   (7)
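The following sketch integrates the activator-substrate system of equations (4)-(5) on a one-dimensional ring of cells. It uses a simple explicit Euler scheme and illustrative initial conditions of our own choosing; the authors use an implicit-explicit Crank-Nicolson/Adams-Bashforth scheme, so this is only a rough illustration of the model, not a reproduction of their integrator.

import numpy as np

def simulate_activator_substrate(D=0.1, mu=1.0, sigma=0.001, n_cells=100, t_end=1000.0, dt=0.0025):
    """Explicit-Euler integration of equations (4)-(5) with periodic boundary conditions."""
    rng = np.random.default_rng(0)
    a = np.full(n_cells, 0.01)           # illustrative initial conditions (not from the paper)
    s = np.full(n_cells, 0.01)
    for _ in range(int(t_end / dt)):
        lap_a = np.roll(a, 1) - 2 * a + np.roll(a, -1)   # discrete Laplacian on a ring
        lap_s = np.roll(s, 1) - 2 * s + np.roll(s, -1)
        xi = rng.uniform(0.95, 1.05, n_cells)            # random perturbation term
        a = a + dt * (D * lap_a + xi * a**2 * s - a + sigma)
        s = s + dt * (lap_s + xi * mu * (1 - s * a**2))
    return a, s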
Knowledge-Guided Docking of WW Domain Proteins and Flexible Ligands

Haiyun Lu¹, Hao Li¹, Shamima Banu Bte Sm Rashid¹, Wee Kheng Leow¹, and Yih-Cherng Liou²

¹ Dept. of Computer Science, School of Computing, National University of Singapore, Singapore 117590
{luhaiyun,lihao,shamima,leowwk}@comp.nus.edu.sg
² Dept. of Biological Sciences, Faculty of Science, National University of Singapore, Singapore 117543
[email protected]

This research is supported by NUS R-252-000-293-112.
Abstract. Studies of interactions between protein domains and ligands are important in many aspects such as cellular signaling. We present a knowledge-guided approach for docking protein domains and flexible ligands. The approach is applied to the WW domain, a small protein module mediating signaling complexes which have been implicated in diseases such as muscular dystrophy and Liddle’s syndrome. The first stage of the approach employs a substring search for two binding grooves of WW domains and possible binding motifs of peptide ligands based on known features. The second stage aligns the ligand’s peptide backbone to the two binding grooves using a quasi-Newton constrained optimization algorithm. The backbone-aligned ligands produced serve as good starting points to the third stage which uses any flexible docking algorithm to perform the docking. The experimental results demonstrate that the backbone alignment method in the second stage performs better than conventional rigid superposition given two binding constraints. It is also shown that using the backbone-aligned ligands as initial configurations improves the flexible docking in the third stage. The presented approach can also be applied to other protein domains that involve binding of flexible ligand to two or more binding sites.
1 Introduction
Protein domains are the fundamental units of tertiary structure of many proteins. One of the most important functions of protein domains is to bind specific ligands to assemble intracellular signaling networks to perform distinct biological functions. The number of defined protein domains has expanded considerably in recent years. Studies of interactions between protein domains and their ligands are crucial for deeper insight of the binding affinities involved. With this vital understanding target prediction of novel domain-binding ligands would be possible, allowing
for subsequent cloning and expression. Determination of possible target ligands by laboratory experimental techniques alone is a known bottleneck requiring intensive consumption of time and resources. Therefore computational techniques are needed to effectively simulate domain bindings. Many protein docking algorithms have been developed to solve the problem. Two broad classifications [10] of docking algorithms are rigid docking and flexible docking. Rigid docking algorithms solve a simpler version of the protein docking problem termed bound docking by reconstruction of a protein complex from the bound structures of the two proteins that constitute the complex [5,8,14,19,23,27,28]. Docking is framed as a rigid alignment problem of two rigid objects with complementary shapes. Flexible docking algorithms solve the general protein docking problem termed unbound or predictive docking by prediction of binding of two proteins in their free or unbound states [7,9,12,16,18,20,22,26]. This problem regards one or both proteins as flexible objects to account for significant conformational shape changes which occur during protein interactions. A flexible molecule often presents a very large number of degrees of freedom posing great difficulty for the flexible docking problem. Flexible docking should be used to predict possible binding and potential novel targets for protein domains as the correct conformations of novel targets are usually unknown. Generally, this remains a very difficult and challenging task. Fortunately, known binding site characteristics of protein domains can be used to help solve the difficult docking problem. The knowledge of binding sites of protein domains is very useful for predicting possible ligand bindings. General flexible docking algorithms often make use of the binding site information. However, what information to use and how to use it for effective and accurate flexible docking is a challenge. For example, AutoDock [20] requires the user to specify a bounding box around the protein binding site in which an optimal ligand conformation is searched for. The amount of binding site information used in AutoDock is minimal and thus not very effectively used. This paper presents an approach for docking protein domains and flexible ligands using known binding site information as the constraints. Our approach uses known binding site knowledge to first search for the protein domains and the ligand residues recognized by the domains. Then the ligand’s peptide backbone is aligned to the domains based on the binding constraints. Finally, existing algorithms can be used to perform flexible docking, with the backbone-aligned ligands as the initial configuration. To be specific, we apply the approach to WW domains as an application example in this paper.
2 WW Domains
WW domains are present in signaling proteins found in all living things. They have been implicated in signal mediation of human diseases such as muscular dystrophy, Alzheimer’s disease, Huntington’s disease, hypertension (Liddle’s syndrome) and cancer [3,11,24,25]. WW domains are distinguished by the characteristic presence of two signature tryptophan residues (W) that are spaced 20– 22 amino acids apart (Table 1). They are known to recognize proline-containing
peptide ligands and they share similarities with other proline recognition domains such as SH3 and EVH1 domains [17,29]. WW domains are classified into four groups [11] based on ligand specificity. Group I binds to ligands containing the Proline-Proline-‘Any amino acid’-Tyrosine (PPxY) motif (Table 1). Group II binds to ligands containing the Proline-Proline-Leucine-Proline (PPLP) motif (Table 1). Group III recognizes Proline-rich segments interspersed with Arginine (R) residues. Group IV binds to short amino acid sequences containing phosphorylated Serine or Threonine followed by Proline. Recent studies show that Group II and III WW domains have very similar or almost indistinguishable ligand preferences, suggesting that they should be classified into a single group [15]. Our study focuses on the first three groups of WW domains, as fewer Group IV samples are available in the RCSB Protein Data Bank (PDB) [2]. Examples of WW domains and their corresponding ligand amino acid sequences are presented in Table 1. Group I and II/III WW domains have two binding grooves that recognize ligands [24]. Group I WW domains contain the Tyrosine groove, Group II/III WW domains contain the XP2 groove, and both groups contain the XP groove. A Tyrosine groove is formed by three residues, Ile/Leu/Val, His, and Lys/Arg/Gln, and it recognizes the Tyrosine (Y) residue of the ligand. An XP groove is formed by Tyr/Phe and Trp residues, whereas an XP2 groove is formed by Tyr and Tyr/Trp residues. Both recognize Xaa-Pro (P), including Pro-Pro, segments of the ligand. It is to be noted that the grooves are formed by non-consecutive residues in the amino acid sequence because the WW domain protein folds in 3-D to give rise to the grooves (Fig. 1).

Table 1. Residue sequences of sample WW domains and ligands

Group    PDB   WW domain sequence                             Ligand sequence
I        1EG4  HFLSTSVQGPWERAISPNKVPYYINHETQTTCWDHPKMTELYQ    KNMTPYRSPPPYVPP
II/III   2DYF  GSWTEHKSPDGRTYYYNTETKQSTWEKPDD                 GSTAPPLPR
IV       1PIN  KLPPGWEKRMSRSSGRVYYFNHITNASQWERPSGNSSSG
Fig. 1. WW domain proteins and ligands. (a) 1EG4: Group I WW domain complexed with β-dystroglycan peptide. (b) 2DYF: Group II WW domain complexed with a peptide ligand with PPLP motif. Gray: proteins containing WW domains, blue: Tyrosine groove, green: XP/XP2 grooves, red: ligands, yellow: groove-binding residues of ligands.
3 Related Work
Flexible docking algorithms can be classified into three categories. Rigid docking with refinement methods perform rigid docking of the proteins followed by refinement of their side chains [7,9,12,16,26]. By applying side chain refinement, side chain flexibility can be accounted for to improve docking results. The method of [9] performs optimization of both backbone displacement and side chain conformations based on simulated annealing Monte Carlo. The methods of [7,26] apply biased probability Monte Carlo minimization of the ligand-interacting side chains while [16] uses energy minimization. The algorithm in [12] uses side chain rotamers and rigid body minimization to relax the interfaces of docking results. These methods handle side chain flexibility but not backbone conformational changes. To handle backbone flexibility, HADDOCK [6] performs rigid-body docking followed by refinement of both the backbone and the side chains using simulated annealing based on molecular dynamics (MD). Biochemical and biophysical interaction data such as chemical shift perturbation data resulting from NMR titration experiments are used so it is not a general docking algorithm. Incremental construction algorithms place ligand fragments one at a time at the binding sites of the binding protein [18,22]. They require the knowledge of binding sites to place ligand fragments at the sites. Their computation speed while satisfactory for smaller ligands remains unsuitable for large ligands. Energy minimization methods apply optimization algorithms to search for the ligand conformation with minimum binding energy [13,20]. The objective is to determine the ligand conformation with minimal binding energy. Various optimization algorithms can be applied such as simulated annealing, Monte Carlo and genetic algorithms. In particular AutoDock [20] uses a hybrid Lamarckian genetic algorithm to optimize an empirical energy function that includes van der Waals potential, hydrogen bonding, Coulombic electrostatics and desolvation. The computational cost of such an energy function is very high. So the number of degrees of freedom is often limited to reduce the search space in practice.
4 Knowledge-Guided Protein Docking
The three stages in our approach for docking WW domains and flexible ligands are binding groove and motif search, backbone alignment, and flexible docking.

4.1 Binding Groove and Binding Motif Search
Given a WW domain protein with known group classification (Group I or Group II/III), the two types of binding grooves present in the WW domain are also known. Each binding groove is formed by residues in a special pattern (Table 2). Residues which form the binding grooves are determined by applying a substring search on the WW domain's amino acid sequence. From the binding grooves, the corresponding ligand motifs (PPxY or PPLP) are also known. Ligand residues forming possible motifs recognized by the binding grooves on WW domains are determined by a substring search applied on the ligand's amino acid sequence (a small sketch of such a search is given below).
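A hedged sketch of the stage-one substring searches follows. The groove patterns are a simplified rendering of Table 2 (the real grooves are formed by non-consecutive residues, so the wildcards are only illustrative), the motif regular expressions encode PPxY and PPLP, and all function names are our own.

import re

GROOVE_PATTERNS = {
    "tyrosine": re.compile(r"[ILV].*H.*[KRQ]"),   # simplified: residues need not be contiguous
    "xp":       re.compile(r"[YF].*W"),
    "xp2":      re.compile(r"Y.*[YW]"),
}
LIGAND_MOTIFS = {"PPxY": re.compile(r"PP.Y"), "PPLP": re.compile(r"PPLP")}

def find_binding_motifs(ligand_seq):
    """Return (motif name, start, end) for every candidate binding motif in the ligand sequence."""
    hits = []
    for name, pattern in LIGAND_MOTIFS.items():
        for m in pattern.finditer(ligand_seq):
            hits.append((name, m.start(), m.end()))
    return hits

print(find_binding_motifs("KNMTPYRSPPPYVPP"))   # the 1EG4 ligand: one PPxY hit expected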
Table 2. Residue patterns of WW domain binding grooves

Binding groove    Pattern            Example
Tyrosine groove   I/L/V, H, K/R/Q    1EG4: ...WERAISPNKVPYYINHETQTTCW...
XP groove         Y/F, W             1EG4: ...WERAISPNKVPYYINHETQTTCW...
XP2 groove        Y, Y/W             2DYF: ...WTEHKSPDGRTYYYNTETKQSTW...

4.2 Backbone Alignment
Our backbone alignment method performs flexible alignment of a ligand’s backbone to binding sites given the two binding grooves of a WW domain protein and possible binding motifs of the ligand. A ligand’s residue sequence is divided into 3 segments according to the binding motifs. As an example the ligand sequence of 1EG4 complex (KNMTPYRSPPPYVPP) is divided into KNMTPYRS, PPPY and VPP. The middle segment, an instance of the PPxY motif, is flanked by two residues recognized by two binding grooves of the WW domain. The backbone alignment method aligns the backbone of the middle segment to the WW domain such that the flanking residues fit the grooves. The knowledge of relative positions and orientations of the two flanking residues with respect to the (grooves of) WW domain serve as binding constraints. The other two segments and ligand side chain atoms are added after backbone alignment. The bond angle and bond length between two neighboring atoms are assumed to be fixed, but the torsion angle of a rotatable bond can change to give rise to various conformations of a protein molecule (Fig. 2(a)). Similar assumptions are made in existing flexible docking algorithms. Let ai , i ∈ {1, 2, ..., n}, denote the positions of the n atoms in the middle segment of the ligand backbone (Figure 2(b)). The middle segment has n/3 residues because for each residue 3 backbone atoms N, Cα, and C are considered. The
two binding constraints specify the atom positions a_0, a_1, a_2 and a_{n−2}, a_{n−1}, a_n, which represent the two flanking residues. We denote the target positions of these constrained atoms as {a*_0, a*_1, a*_2} and {a*_{n−2}, a*_{n−1}, a*_n}. To satisfy the constraint on the first flanking residue, rigid transformation is applied on the backbone to align {a_0, a_1, a_2} to {a*_0, a*_1, a*_2}. To satisfy the constraint on the second flanking residue, we introduce the cost function

    C_s = (1/2) ∑_{j=0}^{2} ‖a_{n−j} − a*_{n−j}‖².   (1)

Fig. 2. Backbone model. (a) The bond angle b_i and bond length l_i between atoms a_i and a_{i+1} are fixed. However, torsion of the bond (indicated by arrow) can rotate atom a_{i+2} to a new position. (b) Model of atoms in the backbone.
Minimizing Cs minimizes the distances between the atoms in the last residue and their target positions. Since the backbone can twist but not bend or stretch, constraints on the bond angles and bond lengths should be incorporated in order to correctly deform the backbone to satisfy the constraints on the residues. To represent the stretching and bending constraints we introduce the bond direction ei and bond length li as illustrated in Figure 2(b). Since the first three atoms are fixed by constraints after their rigid transformation, we have, for i ≥ 3,

$$a_{i+1} = a_i + l_i\, e_i \qquad (2)$$
Thus, given {a0, a1, a2}, li and ei determine the positions of the other atoms, so Cs can also be expressed in terms of li and ei. Since a bond cannot stretch, li is kept fixed to ensure this condition. Also, ei · ei+1 corresponds to the bond angle bi, so the bending constraint is encoded by the cost function

$$C_b = \frac{1}{2}\sum_{i=1}^{n-2} \left\| e_i \cdot e_{i+1} - e^{0}_i \cdot e^{0}_{i+1} \right\|^2 \qquad (3)$$
where e0i · e0i+1 corresponds to the initial value of the bond angle bi. Minimizing Cb minimizes the change of the bond angles while keeping the bond lengths fixed. The peptide bond nearly always has the trans configuration, since it is energetically more favorable than cis. The backbone omega torsion angles are therefore limited to values of 180° ± 5°, except for proline residues; there is no limitation on the omega torsion angle for proline because it can be in either the trans or the cis configuration. Additionally, WW domains often bind to proline-rich ligands, and the average distribution of the phi, psi torsion angles for polyproline stretches (4 or more consecutive prolines) is (−75°, +145°) ± 10° [1]. Let ti denote the torsion angle formed by atoms ai−1, ai, ai+1, ai+2. The torsional constraint is represented by the cost function

$$C_t = \sum_{i:\ t_i \text{ is limited}} \left\| t_i - t^{0}_i \right\|^2 \qquad (4)$$

where $t_i = -\mathrm{atan2}\big(\|e_i\|\, e_{i-1} \cdot (e_i \times e_{i+1}),\ (e_{i-1} \times e_i) \cdot (e_i \times e_{i+1})\big)$ and $t^{0}_i$ denotes the preferred value of ti. Minimizing Ct minimizes the difference between the torsion angles and their preferred values.
The total cost function for backbone alignment is then

$$C = k_b C_b + k_s C_s + k_t C_t \qquad (5)$$

where kb, ks and kt are weighting factors. In Eq. 5, the independent variables are the ei's. Varying ei changes the torsion angles, while changes of the bond angles are penalized by Cb. A quasi-Newton algorithm [21] is applied to compute the optimal ei that minimizes C, yielding the aligned configuration of the ligand's backbone.
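The following is a minimal sketch, under our own simplifying assumptions, of how the cost of Eq. (5) could be minimized with a quasi-Newton optimizer (here SciPy's L-BFGS-B). It is not the authors' implementation; all function and parameter names are hypothetical, and the bond directions are re-normalized inside the cost rather than constrained explicitly.

```python
import numpy as np
from scipy.optimize import minimize

def align_backbone(a_fixed, targets_last, e_init, lengths, t0, kb=1.0, ks=1.0, kt=0.1):
    """a_fixed: (3,3) fixed positions of the first constrained residue's atoms.
    targets_last: (3,3) target positions of the last residue's atoms.
    e_init: (m,3) initial unit bond directions; lengths: (m,) fixed bond lengths.
    t0: dict {bond index i (1 <= i <= m-2): preferred torsion angle in radians}."""
    a_fixed = np.asarray(a_fixed, float)
    targets_last = np.asarray(targets_last, float)
    e_init = np.asarray(e_init, float)
    ang0 = np.einsum('ij,ij->i', e_init[:-1], e_init[1:])   # initial cos(bond angles)

    def positions(e):
        a = [a_fixed[0], a_fixed[1], a_fixed[2]]
        for i in range(2, len(lengths)):          # Eq. (2): a_{i+1} = a_i + l_i e_i
            a.append(a[-1] + lengths[i] * e[i])
        return np.array(a)

    def torsion(e, i):                            # dihedral from consecutive bond directions
        num = np.linalg.norm(e[i]) * np.dot(e[i - 1], np.cross(e[i], e[i + 1]))
        den = np.dot(np.cross(e[i - 1], e[i]), np.cross(e[i], e[i + 1]))
        return -np.arctan2(num, den)

    def cost(x):
        e = x.reshape(-1, 3)
        e = e / np.linalg.norm(e, axis=1, keepdims=True)      # keep directions unit length
        a = positions(e)
        Cs = 0.5 * np.sum((a[-3:] - targets_last) ** 2)                        # Eq. (1)
        Cb = 0.5 * np.sum((np.einsum('ij,ij->i', e[:-1], e[1:]) - ang0) ** 2)  # Eq. (3)
        Ct = sum((torsion(e, i) - t0[i]) ** 2 for i in t0)                     # Eq. (4)
        return kb * Cb + ks * Cs + kt * Ct                                     # Eq. (5)

    res = minimize(cost, e_init.ravel(), method='L-BFGS-B')   # quasi-Newton optimizer
    return res.x.reshape(-1, 3)
```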
4.3 Flexible Docking
In the second stage, knowledge of WW domain binding specificity is used to obtain backbone-aligned ligands. In the third stage, these ligands are used as starting configurations for flexible docking. Any existing flexible docking algorithm can be employed in our approach.
5 Experiments
Known WW domain binding targets are used in experiments to test the performance of our approach. 14 WW domain proteins complexed with proline-containing ligands were collected from the RCSB Protein Data Bank (PDB) [2]. Complexes 1EG4, 1K9R, 1K5R, 1JMQ, 1I5H, 2JO9 and 2DJY form the WW domain Group I test cases, and 2HO2, 2OEI, 2DYF, 1YWI, 2JUP, 2RLY and 2RM0 form the Group II/III test cases. The WW domain proteins were separated from their ligands, and Molecular Dynamics (MD) simulations were run using the AMBER program [4] to simulate possible unbound ligand conformations. Firstly, the backbone alignment algorithm was tested against rigid superposition given binding site knowledge. Backbone alignment was performed 30 times for each test run; the results are ranked according to the cost computed by Eq. 5 and only the top-ranked alignments are recorded. Rigid superposition was performed for each test case based on the known binding placements of the two ligand residues which bind to the two binding grooves of the WW domain: a least-squares fit is used to compute the rigid transformation of the ligand that minimizes the distance between the two binding residues and their ideal positions in the binding sites. Results were evaluated by comparing the backbone atoms N, Cα and C between the two binding constraints of the ligand with those in the bound structure and computing the root mean square deviation (RMSD) (Table 3); the RMSD computation is sketched below. The average RMSD of the results produced by the backbone alignment method is 0.30 Å with a standard deviation of 0.21 Å, while the average RMSD of the results produced by rigid superposition is 1.54 Å with a standard deviation of 0.72 Å. Our backbone alignment method thus clearly produced better results than rigid superposition. Figure 3 visualizes the results of backbone alignment compared with rigid superposition for the 14 test cases. The ligand backbone conformations between the two binding constraints resulting from our method are very close to the bound structures, with RMSD smaller than 0.5 Å in all cases except 1JMQ.
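A hypothetical helper (not from the paper) for the evaluation just described: RMSD over matched backbone atoms N, Cα and C between an aligned ligand segment and the corresponding atoms of the bound crystal structure.

```python
import numpy as np

def backbone_rmsd(pred, ref):
    """pred, ref: (k, 3) arrays of matching backbone-atom coordinates (in Angstrom)."""
    diff = np.asarray(pred, float) - np.asarray(ref, float)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))
```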
Table 3. RMSD (Å) of backbone alignment results and rigid superposition results

Test Case   Backbone Alignment   Rigid Superposition
1EG4        0.30                 1.38
1K9R        0.13                 1.77
1K5R        0.28                 1.50
1JMQ        0.94                 1.71
1I5H        0.42                 1.23
2JO9        0.13                 0.59
2DJY        0.30                 1.02
2HO2        0.27                 1.39
2OEI        0.34                 1.13
2DYF        0.48                 1.99
1YWI        0.19                 3.41
2JUP        0.18                 0.54
2RLY        0.17                 2.19
2RM0        0.12                 1.67
However, the placements produced by rigid superposition are far from optimal, with good results for only two test cases, 2JO9 and 2JUP, where the input ligand shapes are similar to the bound structures. When the input ligand shapes differ greatly from the native complexes, as in test cases 1K9R, 1YWI, 2RLY and 2RM0, rigid superposition fails to create good ligand placements satisfying the two binding constraints. To further test our approach's performance, one of the most widely used flexible docking programs, AutoDock, was employed at the third stage of our scheme in three experimental settings. In the first setting, the top-ranked ligand backbone alignments generated by the second stage were used as initial configurations for AutoDock. In the second setting, the ligand placements resulting from rigid superposition in the previous experiment were used as initial configurations. In the third setting, AutoDock was run using random initial ligand placements. The configuration files for AutoDock were prepared using AutoDockTools. The grid parameter files specifying the 3D search space were manually defined to surround the two WW domain binding sites. The WW domain protein (receptor) was held rigid and the ligand flexible. As AutoDock limits the number of torsional degrees of freedom to 32, AutoDockTools was used to select up to 32 bonds whose torsional rotations moved the largest number of atoms. Several parameters were adjusted in the same way for all settings; for example, the maximum number of energy evaluations was set to 25,000,000 and the maximum number of generations to 27,000. 50 AutoDock runs were performed for each test case in each experimental setting, and solutions were ranked according to their scores evaluated by AutoDock. Usually a solution with RMSD smaller than 2 Å is classified as a successful docking and is considered a very good result, while a solution with RMSD smaller than 3 Å is classified as partially successful; the docking is considered a success if the top-scored solution is successful for each test case. We relaxed these criteria in our evaluation because most of our test cases possess a large number of degrees of freedom,
Fig. 3. The backbone alignment results (green) are closer than rigid superposition results (red) to the bound structure (blue). (a) WW domain Group I test cases. (b) Group II/III test cases.
which makes the docking problem extremely difficult. We considered the docking successful if one of the top 10 score-ranked solutions has an RMSD smaller than 3 Å. Table 4 lists the RMSD results for all three experimental settings. Only 3 test cases (1K9R, 2OEI and 1YWI) are successful in the second and third settings. They all have short ligands comprising at most 9 amino acids and 12 torsional degrees of freedom; the difficulty of docking small flexible ligands is relatively low and AutoDock is able to produce good docking results. However, for the more difficult test cases AutoDock failed to give successful docking solutions using initial ligand structures produced by rigid superposition or using random initial ligand placements. Unsurprisingly, the AutoDock results in the second and third settings are fairly close to each other. The reason is that rigid superposition is unable to produce optimal ligand conformations that satisfy the two binding constraints, and thus the initial ligand structures used by AutoDock in the second
Table 4. RMSD analysis of AutoDock results in three experimental settings. The minimum, maximum and average RMSD (Å) of the top 10 ranked solutions are listed.

Test   Torsion   Backbone Alignment     Rigid Superposition    Random Placement
Case   DoF       min    max    avg      min    max    avg      min    max    avg
1EG4   41        4.05   8.73   6.79     4.36   11.56  7.22     4.62   9.66   6.52
1K9R   12        2.80   5.07   3.71     2.86   5.93   4.71     2.85   6.38   4.65
1K5R   24        3.33   6.20   4.43     3.73   7.60   5.69     3.81   7.30   5.29
1JMQ   24        3.46   6.08   4.84     3.80   5.71   4.78     3.14   8.00   4.91
1I5H   50        4.68   6.71   5.62     4.76   8.92   7.42     4.96   10.89  7.73
2JO9   28        3.52   6.86   5.08     4.30   7.62   5.92     4.16   10.53  7.06
2DJY   65        4.51   7.87   6.47     5.41   8.24   7.26     5.35   9.79   7.20
2HO2   13        3.73   6.45   5.15     3.03   5.75   4.57     3.74   7.70   5.38
2OEI   12        2.93   5.00   4.16     2.18   4.35   3.27     2.93   5.48   4.16
2DYF   25        2.68   6.72   4.41     3.74   7.85   5.43     3.14   5.32   4.04
1YWI   9         1.90   4.94   3.86     2.48   4.34   3.15     2.82   4.26   3.38
2JUP   16        2.59   5.93   4.39     4.10   6.06   5.36     3.41   6.87   4.88
2RLY   14        2.57   5.86   4.31     3.06   4.13   3.58     3.18   5.25   4.15
2RM0   15        3.43   5.49   4.51     3.19   5.55   4.31     3.06   5.53   4.09
setting are no better than the random ligand conformations used in the third setting. In the first setting, 6 test cases (1K9R, 2OEI, 2DYF, 1YWI, 2JUP and 2RLY) are successful. Besides the three simple test cases, AutoDock in our approach succeeded in three more cases with larger numbers of torsional degrees of freedom; in particular, 2DYF has 25 torsional degrees of freedom, which is difficult for flexible docking. Among the 8 failed test cases, the results in the first setting are still better than those for the other two settings: in 5 out of the 8 failed cases the average RMSD of the top 10 ranked solutions is better than in the second setting, and in 6 out of 8 cases better than in the third setting. Clearly, using our backbone alignment method to create initial ligand structures improves the overall performance of AutoDock.
6 Conclusions
This paper presents a three-stage approach for docking of WW domains and flexible ligands. The first stage searches for possible binding motifs of ligands using a substring search. The second stage aligns the ligand’s peptide backbone to binding grooves in WW domains using a quasi-Newton constrained optimization algorithm. The cost function used in the optimization represents multiple constraints on the alignment including positional constraints of ligand residues at the binding grooves, bond angle constraints of backbone atoms and torsion constraints of selected phi, psi as well as omega torsion angles of the backbone atoms. Knowledge of WW domain binding grooves and ligand residues bound to the grooves is used to set the cost function. As shown from the experimental results, the backbone alignment method in stage two works better than
conventional rigid superposition. The backbone-aligned ligands produced in this stage serve as good starting structures for the third stage, which can use any flexible docking algorithm to perform docking. In the experiments, AutoDock in our approach yields better results than using rigid superposition to create initial structures or using random initial ligands. The presented approach can also be applied to other protein domains that involve binding of flexible ligands to two or more binding sites. The optimal placement of ligands near binding sites produced by our backbone alignment stage can be used as good initial structures for subsequent stages.
References

1. Adzhubei, A.A., Sternberg, M.J.E.: Left-handed polyproline II helices commonly occur in globular proteins. Journal of Molecular Biology 229, 472–493 (1993)
2. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E.: The protein data bank. Nucleic Acids Research 28(1), 235–242 (2000)
3. Bork, P., Sudol, M.: The WW domain: a protein module that binds proline-rich or proline-containing ligands (2000)
4. Case, D.A., Cheatham, T.E., Darden, T., Gohlke, H., Luo, R., Merz Jr., K.M., Onufriev, A., Simmerling, C., Wang, B., Woods, R.J.: The Amber biomolecular simulation programs. Journal of Computational Chemistry 26, 1668–1688 (2005)
5. Chen, R., Li, L., Weng, Z.: ZDOCK: an initial-stage protein-docking algorithm. Proteins 52, 80–87 (2003)
6. Dominguez, C., Boelens, R., Bonvin, A.M.: HADDOCK: a protein-protein docking approach based on biochemical or biophysical information. Journal of the American Chemical Society 125(7), 1731–1737 (2003)
7. Fernández-Recio, J., Totrov, M., Abagyan, R.: ICM-DISCO docking by global energy optimization with fully flexible side-chains. Proteins 52, 113–117 (2003)
8. Gabb, H.A., Jackson, R.M., Sternberg, M.J.E.: Modelling protein docking using shape complementarity, electrostatics, and biochemical information. Journal of Molecular Biology 272, 106–120 (1997)
9. Gray, J.J., Moughon, S., Wang, C., Schueler-Furman, O., Kuhlman, B., Rohl, C.A., Baker, D.: Protein-protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations. Journal of Molecular Biology 331, 281–299 (2003)
10. Halperin, I., Ma, B., Wolfson, H., Nussinov, R.: Principles of docking: An overview of search algorithms and a guide to scoring functions. Proteins 47, 409–443 (2002)
11. Ilsley, J.L., Sudol, M., Winder, S.J.: The WW domain: Linking cell signalling to the membrane cytoskeleton. Cellular Signalling 14, 183–189 (2002)
12. Jackson, R.M., Gabb, H.A., Sternberg, M.J.: Rapid refinement of protein interfaces incorporating solvation: application to the docking problem. Journal of Molecular Biology 276, 265–285 (1998)
13. Jones, G., Willett, P., Glen, R.C., Leach, A.R., Taylor, R.: Development and validation of a genetic algorithm for flexible docking. Journal of Molecular Biology 267, 727–748 (1997)
14. Katchalski-Katzir, E., Shariv, I., Eisenstein, M., Friesem, A., Aflalo, C., Vakser, I.: Molecular surface recognition: Determination of geometric fit between proteins and their ligands by correlation techniques. Proceedings of the National Academy of Sciences of the United States of America 89, 2195–2199 (1992)
15. Kato, Y., Nagata, K., Takahashi, M., Lian, L., Herrero, J.J., Sudol, M., Tanokura, M.: Common mechanism of ligand recognition by group II/III WW domains. Journal of Biological Chemistry 279(30), 31833–31841 (2004)
16. Li, L., Chen, R., Weng, Z.: RDOCK: refinement of rigid-body protein docking predictions. Proteins 53, 693–707 (2003)
17. Macias, M.J., Wiesner, S., Sudol, M.: WW and SH3 domains, two different scaffolds to recognize proline-rich ligands. FEBS Letters 53(1), 30–37 (2002)
18. Makino, S., Kuntz, I.D.: Automated flexible ligand docking method and its application for database search. Journal of Computational Chemistry 18, 1812–1825 (1997)
19. Mandell, J.G., Roberts, V.A., Pique, M.E., Kotlovyi, V., Mitchell, J.C., Nelson, E., Tsigelny, I., Ten Eyck, L.F.: Protein docking using continuum electrostatics and geometric fit. Protein Engineering 14, 105–113 (2001)
20. Morris, G.M., Goodsell, D.S., Halliday, R.S., Huey, R., Hart, W.E., Belew, R.K., Olson, A.J.: Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function. Journal of Computational Chemistry 19, 1639–1662 (1998)
21. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C++: The Art of Scientific Computing. Cambridge University Press, Cambridge (2002)
22. Rarey, M., Kramer, B., Lengauer, T., Klebe, G.: A fast flexible docking method using an incremental construction algorithm. Journal of Molecular Biology 261, 470–489 (1996)
23. Ritchie, D., Kemp, G.: Protein docking using spherical polar Fourier correlations. Proteins 39(2), 178–194 (2000)
24. Sudol, M.: Structure and function of the WW domain. Progress in Biophysics and Molecular Biology 65(1-2), 113–132 (1996)
25. Sudol, M.: From Src homology domains to other signaling modules: proposal of the 'protein recognition code'. Oncogene 17, 1469–1474 (1998)
26. Totrov, M., Abagyan, R.: Flexible protein-ligand docking by global energy optimization in internal coordinates. Proteins 1, 215–220 (1997)
27. Tovchigrechko, A., Vakser, I.A.: GRAMM-X public web server for protein-protein docking. Nucleic Acids Research 34, W310–W314 (2006)
28. Vakser, I.A.: Protein docking for low-resolution structures. Protein Engineering 8, 371–377 (1995)
29. Zarrinpar, A., Bhattacharyya, R.P., Lim, W.A.: The structure and function of proline recognition domains. Science's STKE 179, re8 (2003)
Distinguishing Regional from Within-Codon Rate Heterogeneity in DNA Sequence Alignments

Alexander V. Mantzaris and Dirk Husmeier

Biomathematics and Statistics Scotland, JCMB, KB, Edinburgh EH9 3JZ, UK
[email protected],
[email protected]
Abstract. We present an improved phylogenetic factorial hidden Markov model (FHMM) for detecting two types of mosaic structures in DNA sequence alignments, related to (1) recombination and (2) rate heterogeneity. The focus of the present work is on improving the modelling of the latter aspect. Earlier papers have modelled different degrees of rate heterogeneity with separate hidden states of the FHMM. This approach fails to appreciate the intrinsic difference between two types of rate heterogeneity: long-range regional effects, which are potentially related to differences in the selective pressure, and the short-term periodic patterns within the codons, which merely capture the signature of the genetic code. We propose an improved model that explicitly distinguishes between these two effects, and we assess its performance on a set of simulated DNA sequence alignments.
1 Introduction
DNA sequence alignments are usually not homogeneous. Mosaic structures may result as a consequence of recombination or rate heterogeneity. Interspecific recombination, in which DNA subsequences are transferred between different (typically viral or bacterial) species may result in a change of the topology of the underlying phylogenetic tree. Rate heterogeneity corresponds to a change of the nucleotide substitution rate. Two Bayesian methods for simultaneously detecting recombination and rate heterogeneity in DNA sequence alignments are the dual multiple change-point model (DMCP) of [13], and the phylogenetic factorial hidden Markov model (PFHMM) of [9] and [12]. The idea underlying the DMCP is to segment the DNA sequence alignment by the insertion of change-points, and to infer different phylogenetic trees and nucleotide substitution rates for the separate segments thus obtained. Two separate change-point processes associated with the tree topology and the nucleotide substitution rate are employed. Inference is carried out in a Bayesian way with reversible jump (RJ) Markov chain Monte Carlo (MCMC). Of particular interest are the number and locations of the change-points, which mark putative recombination break-points and regions
This work was funded by RERAD of the Scottish Government.
putatively under different selective pressures. A related modelling paradigm is provided by the PFHMM, where two a priori independent hidden Markov chains are introduced, whose states represent the tree topology and nucleotide substitution rate, respectively. While the earlier work of [9] kept the number of hidden states fixed, [12] generalised the inference procedure with RJMCMC and showed that this framework subsumes the DMCP as a special case. This model has recently been extended to larger numbers of species [16]. Common to all these models are two simplifications. First, the no-common-mechanism model of [15] is introduced, which assumes separate branch lengths for each site in the DNA sequence alignment. Second, there is no distinction between regional and within-codon rate heterogeneity. Following [14], the first assumption was introduced with the objective of reducing the computational complexity of the inference scheme. The no-common-mechanism model allows the branch lengths to be integrated out analytically. This is convenient, as the marginal likelihood of the tree topology, the nucleotide substitution rate, and further parameters of the nucleotide substitution model (like the transition-transversion ratio) can be computed in closed form. In this way, the computational complexity of sampling break-points (DMCP) or hidden state sequences (PFHMM) from the posterior distribution with MCMC is substantially reduced. However, in the no-common-mechanism model the branch lengths are incidental rather than structural parameters. As we discussed in [10], this implies that maximum likelihood no longer provides a consistent estimator, and that the method systematically infers the wrong tree topology in the Felsenstein zone defined in [3]. The second simplification does not distinguish between two different types of rate heterogeneity: (1) a regional effect, where larger consecutive segments of the DNA sequence alignment might be differently evolved, e.g. as a consequence of changes of the selective pressure; and (2) a codon effect, where the third codon position shows more variation than the first or the second. Not allowing for this difference and treating both sources of rate heterogeneity on an equal footing implies the risk that subtle regional effects might be obscured by the short-range codon effect, as discussed in [12]. The latter effect is of no biological interest, though, as it only represents the signature of the genetic code. In the present work, we address this issue and develop a model that properly distinguishes between these two effects. Our work is based on the model we introduced in [10]. We modify this approach so as to explicitly take the signature of the genetic code into account. In this way, the within-codon effect of rate heterogeneity is imposed on the model a priori, which makes it easier to learn the biologically more interesting effect of regional rate heterogeneity a posteriori.
2 Methodology

2.1 Modelling Recombination and Rate Heterogeneity with a Phylogenetic FHMM
Consider an alignment D of m DNA sequences, N nucleotides long. Let each column in the alignment be represented by yt , where the subscript t represents
the site, 1 ≤ t ≤ N. Hence yt is an m-dimensional column vector containing the nucleotides at the tth site of the alignment, and D = (y1, ..., yN). Given a probabilistic model of nucleotide substitutions based on a homogeneous Markov chain with instantaneous rate matrix Q, a phylogenetic tree topology S, and a vector of branch lengths w, the probability of each column yt, P(yt|S, w, θ), can be computed, as e.g. discussed in [4]. Here, θ denotes a (vector of) free nucleotide substitution parameters extracted from Q. For instance, for the HKY85 model of [7], π = (πA, πC, πG, πT), with πi ∈ [0, 1] and Σi πi = 1, is a vector of nucleotide equilibrium frequencies, and α, β ≥ 0 are separate nucleotide substitution rates for transitions and transversions. For identifiability between w and Q, the constraint Σi Qii πi = −1 is commonly introduced, which allows the branch lengths to be interpreted as expected numbers of mutations per site (see, e.g., [13]). The normalisation constraint on π further reduces the number of free parameters by one, so that without loss of generality we have θ = (πA, πC, πG, ζ), where ζ = α/(2β) ≥ 0 is the transition-transversion ratio. In what follows, we do not make the dependence on θ explicit in our notation. We simultaneously model recombination and rate heterogeneity with a phylogenetic FHMM, as originally proposed in [9], with the modification discussed in [10]. A hidden variable St ∈ {τ1, ..., τK} is introduced, which represents one out of K possible tree topologies τi at site t. To allow for correlations between nearby sites – while keeping the computational complexity limited – a Markovian dependence structure is introduced: P(S) = P(S1, ..., SN) = P(S1) ∏_{t=2}^{N} P(St|St−1). Following [5], the transition probabilities are defined as

$$P(S_t \mid S_{t-1}, \nu_S) = \nu_S^{\,\delta(S_t,S_{t-1})} \left(\frac{1-\nu_S}{K-1}\right)^{1-\delta(S_t,S_{t-1})} \qquad (1)$$

where δ(St, St−1) denotes the Kronecker delta symbol, which is 1 when St = St−1, and 0 otherwise. The parameter νS denotes the probability of not changing the tree topology between adjacent sites. Associated with each tree topology τi is a vector of branch lengths, wτi, which defines the probability of a column of nucleotides, P(yt|St, wSt). The practical computation follows standard methodology based on the pruning algorithm [4]. For notational convenience we rewrite these emission probabilities as P(yt|St, w), where St ∈ {τ1, ..., τK} determines which of the subvectors w = (w1, ..., wK) is selected. To model rate heterogeneity, a second type of hidden states Rt is introduced. Correlations between adjacent sites are again modelled by a Markovian dependence structure: P(R) = P(R1, ..., RN) = P(R1) ∏_{t=2}^{N} P(Rt|Rt−1). The transition probabilities are defined as in (1):

$$P(R_t \mid R_{t-1}, \nu_R) = \nu_R^{\,\delta(R_t,R_{t-1})} \left(\frac{1-\nu_R}{\tilde{K}-1}\right)^{1-\delta(R_t,R_{t-1})} \qquad (2)$$

where K̃ is the total number of different rate states. Each rate state is associated with a scaling parameter Rt ∈ ρ = {ρ1, ..., ρK̃} by which the branch lengths are rescaled: P(yt|St, w) → P(yt|St, Rt w). To ensure that the model is identifiable, we constrain the L1-norm of the branch length vectors to be equal to one:
||wk||1 = 1 for k = 1, ..., K. To complete the specification of the probabilistic model, we introduce prior probabilities on the transition parameters νS and νR, which are given conjugate beta distributions (which subsume the uniform distribution for the uninformative case). The initial state probabilities P(S1) and P(R1) are set to the uniform distribution, as in [11]. The prediction of recombination break-points and rate heterogeneity is based on the marginal posterior probabilities

$$P(S_t \mid D) = \sum_{S_1}\cdots\sum_{S_{t-1}}\sum_{S_{t+1}}\cdots\sum_{S_N} P(S \mid D) \qquad (3)$$

$$P(R_t \mid D) = \sum_{R_1}\cdots\sum_{R_{t-1}}\sum_{R_{t+1}}\cdots\sum_{R_N} P(R \mid D) \qquad (4)$$
The distributions P(S|D) and P(R|D) are obtained by the marginalisation

$$P(S \mid D) = \sum_{R}\int P(S, R, \nu_S, \nu_R, w \mid D)\, d\nu_S\, d\nu_R\, dw \qquad (5)$$

$$P(R \mid D) = \sum_{S}\int P(R, S, \nu_S, \nu_R, w \mid D)\, d\nu_S\, d\nu_R\, dw \qquad (6)$$
where P(S, R, νS, νR, w|D) ∝ P(D, S, R, νS, νR, w) = P(S1) P(R1) P(νS) P(νR) ∏_{t=1}^{N} P(yt|St, Rt w) ∏_{t=2}^{N} P(St|St−1, νS) ∏_{t=2}^{N} P(Rt|Rt−1, νR). The respective integrations and summations are intractable and have to be numerically approximated with Markov chain Monte Carlo (MCMC): we sample from the joint posterior distribution P(S, R, νS, νR, w|D) and then marginalise with respect to the entities of interest. Sampling from the joint posterior distribution follows a Gibbs sampling procedure [2], where each parameter group is iteratively sampled separately conditional on the others. So if the superscript (i) denotes the ith sample of the Markov chain, we obtain the (i + 1)th sample as follows:

$$S^{(i+1)} \sim P(\cdot \mid R^{(i)}, \nu_S^{(i)}, \nu_R^{(i)}, w^{(i)}, D) \qquad (7)$$
$$R^{(i+1)} \sim P(\cdot \mid S^{(i+1)}, \nu_S^{(i)}, \nu_R^{(i)}, w^{(i)}, D) \qquad (8)$$
$$\nu_S^{(i+1)} \sim P(\cdot \mid S^{(i+1)}, R^{(i+1)}, \nu_R^{(i)}, w^{(i)}, D) \qquad (9)$$
$$\nu_R^{(i+1)} \sim P(\cdot \mid S^{(i+1)}, R^{(i+1)}, \nu_S^{(i+1)}, w^{(i)}, D) \qquad (10)$$
$$w^{(i+1)} \sim P(\cdot \mid S^{(i+1)}, R^{(i+1)}, \nu_S^{(i+1)}, \nu_R^{(i+1)}, D) \qquad (11)$$
The order of these sampling steps is arbitrary. Note that, in principle, the nucleotide substitution parameters θ should be included in the Gibbs scheme, as described in [11]. In practice, a fixation of θ at a priori estimated values makes little difference to the prediction of P (St |D) and P (Rt |D) and has the advantage of reduced computational costs. Sampling the hidden state sequences S and R in (7) and (8) is effected with the stochastic forward-backward algorithm of [1]. Sampling the transition probabilities νS and νR in (9) and (10) is straightforward due to the conjugacy of the beta distribution. Sampling the branch lengths in
(11) cannot be effected from a closed-form distribution, and we have to resort to a Metropolis-Hastings-within-Gibbs scheme. Note that the branch lengths have to satisfy the constraint ||wk||1 = 1, k = 1, ..., K, as well as the positivity constraint wki ≥ 0. This is automatically guaranteed when proposing new branch length vectors w*k from a Dirichlet distribution: $Q(w_k^* \mid w_k) \propto \prod_i [w_{ki}^*]^{\alpha w_{ki} - 1}$, where α is a tuning parameter that can be adapted during burn-in to improve mixing. The acceptance probability for the proposed branch lengths is then given by the standard Metropolis-Hastings criterion [8].
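Below is a hypothetical sketch of one such Metropolis-Hastings-within-Gibbs update for a branch-length vector, assuming a log-likelihood function (with all other states and parameters held fixed) is available and assuming a flat prior on the simplex, so that only the likelihood ratio and the Hastings correction enter the acceptance probability. It is not the authors' code; all names are ours, and the entries of w_k are assumed strictly positive.

```python
import numpy as np
from scipy.stats import dirichlet

def mh_branch_length_step(w_k, log_lik, alpha, rng):
    """One MH-within-Gibbs update of a normalised branch-length vector w_k.
    log_lik(w): log P(D | ..., w) with the hidden states and other parameters fixed.
    alpha: Dirichlet concentration (proposal tuning parameter); rng: numpy Generator."""
    w_new = rng.dirichlet(alpha * w_k)                 # propose w* ~ Dir(alpha * w_k)
    # The proposal is not symmetric, so include the Hastings correction.
    log_q_fwd = dirichlet.logpdf(w_new, alpha * w_k)
    log_q_bwd = dirichlet.logpdf(w_k, alpha * w_new)
    log_accept = (log_lik(w_new) - log_lik(w_k)) + (log_q_bwd - log_q_fwd)
    if np.log(rng.uniform()) < log_accept:
        return w_new, True
    return w_k, False
```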
2.2 Distinguishing Regional from Within-Codon Rate Heterogeneity
We improve the model described in the previous subsection, which was proposed in [10], in two respects. First, we adapt ρ and sample it along with w from the posterior distribution. To make this explicit in the notation, we slightly change the definition of the rate state as Rt ∈ {1, ..., K̃} and rewrite: P(yt|St, Rt w) → P(yt|St, ρRt w). Second, we explicitly model codon-position-specific rate heterogeneity in a way similar to [5]. To this end, we introduce the indicator variable It ∈ {0, 1, 2, 3}, where It = 0 indicates that the tth position of the alignment does not code for protein, and It = i ∈ {1, 2, 3} indicates that site t is the ith position of a codon. Each of the four categories is associated with a positive factor taken from λ = (λ0, λ1, λ2, λ3), by which the branch lengths are modulated. The emission probabilities are thus given by P̃(yt|St, Rt, It, ρ, λ, w) := P(yt|St, ρRt λIt w), where P(.) was defined below equation (1), and P̃(.) makes the dependence on ρ and λ explicit. Note that, as opposed to [5], we do not keep λ fixed, but sample it from the posterior distribution with MCMC. For identifiability we introduce the same constraint as for the branch lengths: ||λ||1 = 1, which is automatically guaranteed when proposing λ from a Dirichlet distribution. Hence, to sample ρ and λ from the posterior distribution P(S, R, νS, νR, ρ, λ, w|D), we have to add two Metropolis-Hastings-within-Gibbs steps akin to equation (11) to the Gibbs sampling procedure (7–11):
$$[\rho^{(i+1)}, \lambda^{(i+1)}] \sim P(\cdot \mid S^{(i+1)}, R^{(i+1)}, \nu_S^{(i+1)}, \nu_R^{(i+1)}, w^{(i+1)}, D) \qquad (12)$$
With all other parameters and hidden states fixed, we propose new values for ρ and λ, and accept or reject according to the Metropolis-Hastings criterion. As discussed above, we propose new values for λ from a Dirichlet distribution. New values for ρ are proposed from a uniform distribution (on the log scale), centred on the current values. The dispersal parameters of the proposal distributions can be adjusted during the burn-in phase using standard criteria.
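The following small sketch (an assumption of ours, not the authors' code) shows the effective per-site branch-length scaling used by the new model: the regional rate factor ρ[R_t] multiplied by the codon-position factor λ[I_t], with I_t ∈ {0, 1, 2, 3}.

```python
import numpy as np

def effective_scaling(R, I, rho, lam):
    """R: (N,) rate-state indices; I: (N,) codon-position indicators (0 = non-coding);
    rho: regional rate factors; lam: codon-position factors with ||lam||_1 = 1."""
    R, I = np.asarray(R), np.asarray(I)
    return np.asarray(rho)[R] * np.asarray(lam)[I]   # per-site factor rho_{R_t} * lambda_{I_t}
```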
3 Data
To assess the performance of the method, we tested it on synthetic DNA sequence alignments; this has the advantage that we have a known gold-standard. For a realistic simulation, we generated sequence alignments with Seq-Gen, developed
Fig. 1. Illustration of regional versus within-codon rate heterogeneity. Each circle corresponds to a nucleotide in a DNA sequence, and the circle diameter symbolises the average nucleotide substitution rate at the respective position. The top panel (a) shows a "homogeneous" DNA sequence composed of six codons, where each third position is more diverged as a consequence of the nature of the genetic code. The bottom panel (b) shows a hypothetical DNA sequence subject to regional rate heterogeneity, where the second half on the right of the dashed vertical line constitutes a region that is more evolved. The sequences used in our simulation study were similar, but longer (1.5 Kbp).
by Rambaut and Grassly. This software package is widely used for Monte Carlo simulations of molecular sequence evolution along phylogenetic trees; see e.g. http://bioweb2.pasteur.fr/docs/seq-gen/ or http://tree.bio.ed.ac.uk/software/seqgen/ for details. We generated a DNA sequence alignment from a phylogenetic tree of four hypothetical taxa with equal branch lengths, using the HKY model of nucleotide substitution [7] with a uniform nucleotide equilibrium distribution, πA = πC = πG = πT = 0.25, and a transition-transversion ratio of ζ = 2. We generated two types of alignments. In the first alignment, the normalised branch lengths associated with the three codon positions were set to wi = [0.5 − c/2, 0.5 − c/2, 0.5 + c]/1.5, where the codon offset parameter 0 ≤ c ≤ 0.99 was varied in increments of 0.1. All codons had the same structure, as illustrated in Figure 1a. We refer to these sequence alignments as "homogeneous". The second type of alignment, which we refer to as "heterogeneous" or "subject to regional rate heterogeneity", is illustrated in Figure 1b. The codons have a similar structure as before; the second half of the alignment is more evolved, though, and the branch lengths are expanded by a factor of ς = 2. In all simulations, the total length of the alignment was 1.5 Kbp.
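A hypothetical helper mirroring this simulation setup (not part of the original study): the per-site branch-length scaling for a 1.5 Kbp alignment with codon offset c and, for the "heterogeneous" case, a second half expanded by a regional factor.

```python
import numpy as np

def site_scalings(c, n_sites=1500, regional_factor=1.0):
    """Per-site branch-length scaling: codon profile [0.5-c/2, 0.5-c/2, 0.5+c]/1.5,
    with the second half of the alignment multiplied by regional_factor."""
    codon = np.array([0.5 - c / 2, 0.5 - c / 2, 0.5 + c]) / 1.5
    scal = np.tile(codon, n_sites // 3)
    scal[n_sites // 2:] *= regional_factor   # second half more diverged
    return scal

homogeneous = site_scalings(c=0.8)                        # Figure 1a
heterogeneous = site_scalings(c=0.8, regional_factor=2)   # Figure 1b
```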
4 Simulations
Our objective is to sample topology and rate state sequences S, R, their associated transition probabilities νS , νR and rate vectors ρ, the branch lengths w and (for the new model) the within-codon rate vector λ from the posterior distribution P (S, R, νS , νR , ρ, λ, w|D). To this end, we apply the Gibbs sampling scheme of (7–12), which we have described in Sections 2.1 and 2.2. Our current software has not yet been optimised for speed. Hence, to improve the convergence of the Markov chain and to focus on the aspect of interest for the present study (rate heterogeneity), we have set all states in S to the same tree topology without allowing for recombination: νS = 1. We also set K = 2 fixed. The model was initialised with the maximum likelihood tree obtained with DNAML from
Felsentein’s PHYLIP package, available from http://evolution.genetics. washington.edu/phylip/. We tested the convergence of the MCMC simulations by computing the potential scale reduction factor of Gelman and Rubin [6] from the within and between trajectory variances of various monitoring quantities (e.g. w, P (Rt |D), etc.), and took a value of 1.2 as an indication of sufficient convergence. The main objective of our study is to evaluate the performance of the proposed model that allows for within-codon rate heterogeneity; we refer to this as the “new” model. We compare its performance with a model that does not include within-codon rate heterogeneity, that is, where λ = 1 is constant. We refer to this as the “old” model. Note that the latter model is equivalent to the one proposed in [10], but with the improvement that ρ is sampled from the posterior distribution, rather than kept fixed. In order to evaluate the performance of the methods, we want to compute the marginal posterior probability of the average effective branch length scaling for the three codon positions. The effective branch lengths are given by w ˜t = ρRt λIt wt , where wt are the normalised branch lengths. The entity of interest is Υt =
||w ˜ t ||1 = ρRt λIt ||wt ||1
(13)
˜ t associated with which is the scaling factor by which the branch length vector w position t deviates from the normalised branch lengths wt . Note that Υt is composed of two terms, associated with a region (ρRt ) and a codon (λIt ) effect. We are interested in the marginal posterior distribution of this factor, P (Υ |D, I = k), for the three codon positions I ∈ {1, 2, 3}. In practice, this distribution is estimated from the MCMC sample by the appropriate marginalisation with respect to all other quantities: M N P (Υ |D, I = k) ≈
i=1
t=1 δIt ,k δ(Υ − N M t=1 δIt ,k
ρiRi λiIt ) t
(14)
where the subscript t refers to positions in the alignment (of total length N), the superscript i refers to MCMC samples (sample size M), δ(.) is the delta function, the quantities on the right of its argument, ρ^i_{R^i_t} and λ^i_{I_t}, are obtained from the MCMC sample, and δi,k is the Kronecker delta. For the conventional model without explicit codon effect, we set λIt = 1/3 ∀t.
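A hypothetical post-processing sketch of Eq. (14) (not the authors' code): the marginal P(Υ|D, I = k) is estimated from MCMC samples of the rate states and codon factors, with the Dirac delta replaced by a histogram; a kernel density estimator, as used in the figures below, could equally be substituted.

```python
import numpy as np

def codon_position_marginal(rho_samples, lam_samples, R_samples, I, k, bins=50):
    """rho_samples: (M, n_rates) rate factors per MCMC sample; lam_samples: (M, 4)
    codon factors; R_samples: (M, N) rate-state indices; I: (N,) codon positions."""
    mask = (np.asarray(I) == k)                       # sites at codon position k
    values = []
    for rho, lam, R in zip(rho_samples, lam_samples, R_samples):
        values.append(rho[R[mask]] * lam[k])          # Upsilon contributions per site
    hist, edges = np.histogram(np.concatenate(values), bins=bins, density=True)
    return hist, edges
```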
5 Results
Figure 2 shows the posterior distribution of the (complementary) transition probability νR . The two models were applied to the “homogeneous” DNA sequence alignment that corresponds to the top panel in Figure 1. The left panel shows the results obtained with the old model, which does not explicitly include the codon effect. For small values of the offset parameter c, the posterior distribution
Fig. 2. Posterior distribution of νR (vertical axis) for different codon offsets c (horizontal axis), where the offset indicates to what extent the nucleotide substitution rate associated with the third codon position is increased over that of the first two positions. The left panel (a) shows the results obtained with the old model, the centre panel (b) shows the results obtained with the new model. The grey levels represent probabilities, as indicated by the legend in the panel on the right (c). The distributions were obtained from a “homogeneous” DNA sequence alignment, corresponding to Figure 1a.
of νR is concentrated on νR = 1, which corresponds to a homogeneous sequence alignment. As the offset increases, the posterior distribution of νR gets shifted to smaller values, with a mode at νR = 0.5. Note that νR is related to the average segment length l̄ via the relation

$$\bar{l} \;=\; (1-\nu_R)\sum_{l} l\,\nu_R^{\,l-1} \;=\; (1-\nu_R)\,\frac{d}{d\nu_R}\sum_{l}\nu_R^{\,l} \;=\; (1-\nu_R)\,\frac{d}{d\nu_R}\,\frac{1}{1-\nu_R} \;=\; \frac{1}{1-\nu_R}$$

For νR = 0.5 we get l̄ = 2. The model has thus learned the within-codon rate heterogeneity intrinsic to the genetic code; compare with Figure 1. The right panel of Figure 2 shows the posterior distribution of νR obtained with the new model. Irrespective of the codon offset c, the distribution is always concentrated on νR = 1. This correctly indicates that there is no regional rate heterogeneity in the DNA sequence alignment. Recall that the within-codon rate heterogeneity has been explicitly incorporated into the new model and, hence, need not be learned separately via νR and transitions between rate states Rt. Figure 3 shows the posterior distribution of the scaling factor Υt, defined in (13), for the "homogeneous" DNA sequence alignment corresponding to Figure 1a. The columns in Figure 3 correspond to the three codon positions. The posterior distribution was obtained from the MCMC samples via (14). For the new model (bottom row of Figure 3), the distributions of Υt are unimodal and sharply peaked. This is consistent with the fact that we have no regional rate heterogeneity, and the shift in the peak locations for the third codon position clearly indicates the within-codon rate heterogeneity. For the old model (top panel of Figure 3), the posterior distribution is always bimodal. This is a consequence of the fact that the within-codon rate heterogeneity has to be learned via the assignment of rate states Rt to the respective codon positions. The bimodality and increased width of the distribution stem from a misassignment of rate states. Note that for an alignment of N = 1500 sites, 500 state transitions have to be learned to model the within-codon rate heterogeneity correctly.
Fig. 3. Posterior distribution (vertical axes) of the combined rate Υt (horizontal axes), defined in equation (13), for a “homogeneous” DNA sequence alignment, corresponding to Figure 1a, with codon offset parameter c = 0.8. The three columns correspond to the three codon positions. The top row shows the distribution obtained with the old model. The bottom row shows the distribution obtained with the new model. The distributions were obtained from the MCMC samples with a kernel density estimator, where the delta function in (14) was replaced by a Gaussian (standard deviation: a tenth of the total range).
Fig. 4. Posterior distribution (vertical axes) of the rate ρRt (horizontal axes) for a “heterogeneous” DNA sequence alignment, corresponding to Figure 1b, with codon offset parameter c = 0.8 and regional factor ς = 2. The three columns correspond to the three codon positions. The top row shows the distribution obtained with the old model. The bottom row shows the distribution obtained with new model. The distributions were obtained from the MCMC samples with a kernel density estimator, where the delta function in (15) was replaced by a Gaussian (standard deviation: a tenth of the total range).
Figure 4 is similar to Figure 3, but was obtained for the heterogeneous DNA sequence alignment corresponding to Figure 1b. For better clarity we have shown the codon site-specific posterior distributions of the rate ρRt rather than the scale factor Υt; that is, in equation (14) we have ignored the factor λ^i_{It}:

$$P(\rho \mid D, I = k) \;\approx\; \frac{\sum_{i=1}^{M}\sum_{t=1}^{N} \delta_{I_t,k}\,\delta\big(\rho - \rho^{\,i}_{R^i_t}\big)}{M \sum_{t=1}^{N} \delta_{I_t,k}} \qquad (15)$$
Fig. 5. Alternative representation of the posterior distribution (vertical axes) of the rate ρRt (horizontal axes) for the “heterogeneous” DNA sequence alignment. The figure corresponds to Figure 4, but shows a separation of the distributions with respect to regions rather than codon positions. The distribution of ρRt is defined in (16). The two columns correspond to the two differently diverged segments in the DNA sequence alignments, with the left column representing the first 750 positions, and the right column representing the last 750 positions; the latter were evolved at double the nucleotide substitution rate. The two rows correspond to the two models. The top row shows the distribution obtained with the old model. The bottom row shows the distribution obtained with new model. The distributions were obtained from the MCMC samples with a kernel density estimator, where the delta function in (16) was replaced by a Gaussian (standard deviation: a tenth of the total range).
The bottom row shows the distributions obtained with the new model. They have a symmetric bimodal form. The bimodality reflects the regional rate heterogeneity. The symmetry reflects the nature of the DNA sequence alignment, which contains two differently diverged regions of equal size (see Figure 1b). The top panel shows the distributions obtained with the old model. The distributions are still bimodal, but the symmetry has been destroyed. This distortion results from the fact that two effects – regional and within-codon rate heterogeneity – are modelled via the same mechanism: the rate states Rt. Consequently, these two forms of rate heterogeneity are not clearly separated. To illustrate this effect from a different perspective, Figure 5 shows the posterior distributions of the rate ρRt not separated according to codon positions, but according to differently diverged regions. That is, from the MCMC sample we compute the following distribution:

$$P(\rho \mid D, t \in r) \;\approx\; \frac{\sum_{i=1}^{M}\sum_{t=1}^{N} I(t \in r)\,\delta\big(\rho - \rho^{\,i}_{R^i_t}\big)}{M \sum_{t=1}^{N} I(t \in r)} \qquad (16)$$
where r represents the two regions: r = 1 for 1 ≤ t ≤ 750, and r = 2 for 751 ≤ t ≤ 1500, I(t ∈ r) is the indicator function, which is one if the argument is true, and zero otherwise, and the remaining symbols are as defined below equation (14).
The bottom panel shows the distributions obtained with the new model, where the two columns represent the two regions. The distributions are unimodal and clearly separated, which indicates that the modelling of regional rate heterogeneity is properly disentangled from the within-codon rate variation. The top panel shows the distributions obtained with the old model. Here, the distributions are bimodal, which results from a lack of separation between regional and within-codon rate heterogeneity, and a tangling-up of these two effects.
6 Discussion
We have generalised the phylogenetic FHMM of [10] in two respects. First, by sampling the rate vector ρ from the posterior distribution with MCMC (rather than keeping it fixed) we have made the modelling of regional rate heterogeneity more flexible. Second, we explicitly model within-codon rate heterogeneity via a separate rate modification vector λ. In this way, the within-codon effect of rate heterogeneity is imposed on the model a priori, which should facilitate the learning of the biologically more interesting effect of regional rate heterogeneity a posteriori. We have carried out simulations on synthetic DNA sequence alignments, which have borne out our conjecture. The old model, which does not explicitly include the within-codon rate variation, has to model both effects with the same mechanism: the rate states Rt with associated rate factors ρRt . As expected, it was found to fail to disentangle these two effects. On the contrary, the new model was found to clearly separate within-codon from regional rate heterogeneity, resulting in a more accurate prediction. We emphasise that our paper describes work in progress, and we have not yet applied our method to real DNA sequence alignments. This is partly a consequence of the fact that our software has not been optimised for computational efficiency yet, resulting in long MCMC simulation runs. Note that the computational complexity of our algorithm is larger than for the model described in [12]. The latter approach is based on the no-common-mechanism model of [15], which leads to a substantial model simplification, though at the price of potential inconsistency problems (as discussed in [10]). The increased computational complexity of the method proposed in the present article might require the application of more sophisticated MCMC schemes, e.g. population MCMC, which will be the objective of our future work. As a final remark, we note that a conceptually superior approach would be the modelling of substitution processes at the codon rather than nucleotide level. However, the application of this approach to standard Bayesian analysis of single phylogenetic trees has turned out to be computationally exorbitant. A generalisation to phylogenetic FHMMs for modelling DNA mosaic structures, as described in the present article, is unlikely to be computationally feasible in the near future. We therefore believe that the method we have proposed, which is based on individual nucleotide substitution processes while taking the codon structure into account, promises a better compromise between model accuracy and practical viability.
References

1. Boys, R.J., Henderson, D.A., Wilkinson, D.J.: Detecting homogeneous segments in DNA sequences by using hidden Markov models. Applied Statistics 49, 269–285 (2000)
2. Casella, G., George, E.I.: Explaining the Gibbs sampler. The American Statistician 46(3), 167–174 (1992)
3. Felsenstein, J.: Cases in which parsimony or compatibility methods will be positively misleading. Systematic Zoology 27, 401–440 (1978)
4. Felsenstein, J.: Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution 17, 368–376 (1981)
5. Felsenstein, J., Churchill, G.A.: A hidden Markov model approach to variation among sites in rate of evolution. Molecular Biology and Evolution 13(1), 93–104 (1996)
6. Gelman, A., Rubin, D.B.: Inference from iterative simulation using multiple sequences. Statistical Science 7, 457–472 (1992)
7. Hasegawa, M., Kishino, H., Yano, T.: Dating the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution 22, 160–174 (1985)
8. Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109 (1970)
9. Husmeier, D.: Discriminating between rate heterogeneity and interspecific recombination in DNA sequence alignments with phylogenetic factorial hidden Markov models. Bioinformatics 21(Suppl. 2), ii166–ii172 (2005)
10. Husmeier, D., Mantzaris, A.V.: Addressing the shortcomings of three recent Bayesian methods for detecting interspecific recombination in DNA sequence alignments. Statistical Applications in Genetics and Molecular Biology 7(1), Article 34 (2008)
11. Husmeier, D., McGuire, G.: Detecting recombination in 4-taxa DNA sequence alignments with Bayesian hidden Markov models and Markov chain Monte Carlo. Molecular Biology and Evolution 20(3), 315–337 (2003)
12. Lehrach, W.P., Husmeier, D.: Segmenting bacterial and viral DNA sequence alignments with a trans-dimensional phylogenetic factorial hidden Markov model. Applied Statistics 58(3), 307–327 (2009)
13. Minin, V.N., Dorman, K.S., Fang, F., Suchard, M.A.: Dual multiple change-point model leads to more accurate recombination detection. Bioinformatics 21(13), 3034–3042 (2005)
14. Suchard, M.A., Weiss, R.E., Dorman, K.S., Sinsheimer, J.S.: Inferring spatial phylogenetic variation along nucleotide sequences: A multiple changepoint model. Journal of the American Statistical Association 98(462), 427–437 (2003)
15. Tuffley, C., Steel, M.: Links between maximum likelihood and maximum parsimony under a simple model of site substitution. Bulletin of Mathematical Biology 59, 581–607 (1997)
16. Webb, A., Hancock, J., Holmes, C.: Phylogenetic inference under recombination using Bayesian stochastic topology selection. Bioinformatics 25(2), 197–203 (2009)
A Hybrid Metaheuristic for Biclustering Based on Scatter Search and Genetic Algorithms

Juan A. Nepomuceno1, Alicia Troncoso2, and Jesús S. Aguilar–Ruiz2

1 Department of Computer Science, University of Sevilla, Spain
[email protected]
2 Area of Computer Science, Pablo de Olavide University of Sevilla, Spain
{ali,aguilar}@upo.es
Abstract. In this paper a hybrid metaheuristic for biclustering based on Scatter Search and Genetic Algorithms is presented. A general scheme of Scatter Search has been used to obtain high–quality biclusters, but a way of generating the initial population and a method of combination based on Genetic Algorithms have been chosen. Experimental results from yeast cell cycle and human B-cell lymphoma are reported. Finally, the performance of the proposed hybrid algorithm is compared with a genetic algorithm recently published. Keywords: Biclustering, Gene Expression Data, Scatter Search, Evolutionary Computation.
1 Introduction
Recently, data mining techniques are being applied to microarray data analysis in order to extract useful information [1]. Clustering techniques find groups of genes with similar patterns from a microarray. However, genes are not necessarily related to every condition. Thus, the goal of biclustering is to identify genes with the same behavior only under a specific group of conditions. In the context of microarray analysis, many approaches have been proposed for biclustering [2]. Biclustering techniques have two important aspects: the search algorithm and the measure used to evaluate the quality of biclusters. Most of the proposed approaches in the literature focus on different search methods. Thus, in [3] an iterative hierarchical clustering is applied to each dimension separately, and biclusters are built by combining the results obtained for each dimension. In [4] an iterative search method was presented which builds biclusters by adding or removing genes or conditions in order to improve the measure of quality called Mean Squared Residue (MSR). An exhaustive bicluster enumeration based on a bipartite graph model, in which nodes were added or removed in order to find maximum-weight subgraphs, was presented in [5]. The FLOC algorithm [6] improved the method presented in [4], obtaining a set of biclusters simultaneously and adding techniques for handling missing values. In [7], a simple linear model for gene expression was used, assuming normally distributed expression levels for each gene or condition. Also, geometrical characterizations
such as hyperplanes in a high-dimensional data space have been used to find biclusters [8]. In the last few years, global optimization techniques such as Simulated Annealing [9] or Evolutionary Computation [10,11] have been applied to obtain biclusters, due to their good performance in several environments. Recently, several papers have focused on the measure used to evaluate the quality of biclusters. In [12] an analysis of the MSR was made, showing that this measure is good at finding biclusters with shifting patterns but not scaling patterns. A new measure based on unconstrained optimization techniques was proposed in [13] as an alternative to the MSR in order to find biclusters with certain patterns. In this paper a hybrid metaheuristic for biclustering based on Scatter Search and Genetic Algorithms (SS&GA) is presented. A general scheme of Scatter Search has been used to obtain high-quality biclusters, but a way of generating the initial population and a method of combination based on Genetic Algorithms have been chosen. Finally, the performance of the proposed hybrid algorithm is compared with a recently published genetic algorithm [10]. Scatter Search has been selected because of its recent success in solving hard optimization problems and because, to the best of our knowledge, no applications of Scatter Search to biclustering have been reported in the literature. This paper is organized as follows. Section 2 presents basic concepts about Scatter Search. The proposed metaheuristic is described in Section 3. Some experimental results from two real datasets and a comparison between the proposed method and a genetic algorithm are reported in Section 4. Finally, Section 5 outlines the main conclusions of the paper and future work.
2 Scatter Search
Scatter Search [14] is a population-based optimization algorithm introduced in the seventies. Recently, Scatter Search algorithms have been applied to many nonlinear and combinatorial optimization problems, providing remarkable outcomes mainly due to their flexibility to adopt different search strategies. Basically, a standard Scatter Search can be summarized by the following steps:

1. Generate an initial population in a deterministic manner to assure the diversity of the population with regard to a distance.
2. Build a set, called the reference set, with the best individuals from this population. The notion of best is not limited to the quality measured by the fitness function: an individual that improves the diversity of the set can also be added to the reference set.
3. Create new individuals by the deterministic combination of individuals of the reference set; all individuals of the reference set are selected to be combined.
4. Update the reference set using the new individuals, and repeat the combination until the reference set does not change.
5. Rebuild the reference set and, if the maximum number of iterations has not been reached, go to step 3.
Therefore, the search strategies of a Scatter Search depend on a diversification method to generate the initial population, a method to build the reference set, a method to combine individuals, and a method to rebuild the reference set. The main differences between a Genetic Algorithm and a Scatter Search are: the way of generating the initial population, which is random in the former and deterministic in the latter; the selection of individuals to create offspring, since a probabilistic procedure is applied to select parents in Genetic Algorithms whereas all individuals of the reference set are combined in Scatter Search; and the evolution of the population, which is based on the survival of the fittest according to the fitness function in Genetic Algorithms and on the rebuilding method of the reference set in Scatter Search. Finally, the size of the population in Genetic Algorithms is bigger than that of the reference set in Scatter Search. A typical size is 100 in Genetic Algorithms and 10 in Scatter Search, because the combination method in Scatter Search takes into account all pairs of individuals to create new individuals. In short, the underlying idea of Scatter Search is to emphasize systematic processes over random procedures to generate populations, to create new individuals and to inject diversity into the population.
3
Description of the Algorithm
In this section the proposed SS&GA algorithm to obtain biclusters is described, detailing the steps outlined in the previous section: the generation, combination, updating and rebuilding methods. The pseudocode of the proposed SS&GA algorithm is presented in Algorithm 1.
3.1
Biclusters Codification and Generation
Formally, a microarray is a real matrix composed of N genes and M conditions. The element (i, j) of the matrix represents the expression level of gene i under condition j. A bicluster is a submatrix of the matrix M composed of n ≤ N rows or genes and m ≤ M columns or conditions. Biclusters are encoded by binary strings of length N + M [10]. Each of the first N bits of the binary string is related to the genes and the remaining M bits to the conditions of the microarray M. For instance, the bicluster shown in Fig. 1 is encoded by the following string:
0010110000|01100
(1)
Thus, this string codifies the bicluster composed of genes 3, 5 and 6 and conditions 2 and 3 of a microarray comprising 10 genes and 5 conditions. The initial population of biclusters is strictly randomly generated (as is typical in Genetic Algorithms) without taking the diversity into account (as would be typical in Scatter Search). Random strings composed of 0s and 1s are generated until nB biclusters are built, where nB is the size of the starting population, i.e. the number of biclusters.
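As an informal illustration of this codification (not part of the original algorithm; the helper names and the NumPy-based slicing are my own), a short sketch that decodes such a string and extracts the corresponding sub-matrix could look as follows:

```python
import numpy as np

def decode_bicluster(bits, n_genes, n_conditions):
    """Return the (gene, condition) index lists encoded by a 0/1 string of length N + M."""
    assert len(bits) == n_genes + n_conditions
    genes = [i for i in range(n_genes) if bits[i] == "1"]
    conditions = [j for j in range(n_conditions) if bits[n_genes + j] == "1"]
    return genes, conditions

def bicluster_submatrix(expression, bits):
    """Slice the bicluster out of an expression matrix (genes in rows, conditions in columns)."""
    genes, conditions = decode_bicluster(bits, *expression.shape)
    return expression[np.ix_(genes, conditions)]

# The string of the example above: genes 3, 5 and 6 and conditions 2 and 3 (1-based).
print(decode_bicluster("0010110000" + "01100", 10, 5))   # ([2, 4, 5], [1, 2]) in 0-based indices
```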
Algorithm 1. SS&GA for Biclustering
INPUT: Microarray M, penalization factors M1 and M2, size of the population nB, size of the reference set S, and maximum number of iterations MaxIter.
OUTPUT: The reference set RefSet.
begin
  Initialize P randomly with nB biclusters
  // Building the reference set
  R1 ← S/2 best biclusters from P (according to their fitness function)
  R2 ← S/2 most scattered biclusters, regarding R1, from P \ R1 (according to a distance)
  RefSet ← R1 ∪ R2
  P ← P \ RefSet
  // Initialization
  stable ← FALSE
  iter ← 0
  while (iter ≤ MaxIter) do
    // Updating the reference set
    while (NOT stable) do
      A ← RefSet
      B ← CombinationMethod(RefSet)
      RefSet ← S best biclusters from RefSet ∪ B
      if (A = RefSet) then
        stable ← TRUE
      end if
    end while
    // Rebuilding the reference set
    R1 ← S/2 best biclusters from RefSet (according to their fitness function)
    R2 ← S/2 most scattered biclusters, regarding R1, from P \ R1
    RefSet ← R1 ∪ R2
    P ← P \ RefSet
    iter ← iter + 1
  end while
end
3.2
Building Reference Set
The reference set comprises the best S biclusters of the initial population P, where S is the number of biclusters that belong to this set. The reference set is built taking into account both the quality and the scattering of the biclusters. The quality of a bicluster is measured by evaluating the fitness function considered in the evolutionary process. Thus, a bicluster is better than another if its fitness function value is lower. On the other hand, a distance must be defined in order to measure how the scattering is introduced in the search space. In the proposed SS&GA approach the distance used is the Hamming distance. The Hamming distance between two binary strings is defined by the number of positions for which their corresponding 0/1 values are different. For example, the Hamming distance between the strings 001001001|001 and 001011001|101 is 2.
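A one-line sketch of this distance (my own illustration, not code from the paper):

```python
def hamming(a, b):
    """Number of positions at which two equal-length binary strings differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

# Example from the text, with the '|' separator removed before counting.
print(hamming("001001001001", "001011001101"))   # 2
```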
            C1     C2     C3     C4     C5
G1          2.2    3.6    5.3   -2.6    0.3
G2          1.3    1.5   -3.1   -2.1    2.2
G3          4.7    1.0    1.0    7.9    0.4
G4         -3.8   -0.3    2.2    3.1    1.4
G5          7.5    1.0    1.0    2.1   -2.3
G6          0.4    1.0    1.0    0.4    0.3
G7          3.2    8.3   -2.5   -2.5    3.1
G8          2.5    3.1    4.1    0.3    0.1
G9          3.1    0.4    6.9    9.2    0.2
G10         0.3    0.5    0.3    0.3   -0.1

bicluster (genes G3, G5, G6; conditions C2, C3; all entries equal to 1.0), codification 0010110000|01100
Fig. 1. Microarray and bicluster along with its codification
Therefore, the reference set is formed by the S/2 best biclusters from P (set R1) according to their fitness function and the S/2 biclusters from P \ R1 (set R2) with the highest distances to the set R1 according to the Hamming distance.
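A possible sketch of this construction (my own illustration; the paper does not specify whether the distance of a candidate to the set R1 is its minimal or its average Hamming distance to the members of R1, so the minimum is assumed here), reusing the hamming function above:

```python
def build_reference_set(population, fitness, S):
    """R1: the S/2 best biclusters (lowest fitness); R2: the S/2 most distant from R1."""
    ranked = sorted(population, key=fitness)        # lower fitness value = better bicluster
    R1, rest = ranked[: S // 2], ranked[S // 2:]
    dist_to_R1 = lambda b: min(hamming(b, r) for r in R1)
    R2 = sorted(rest, key=dist_to_R1, reverse=True)[: S // 2]
    return R1 + R2
```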
3.3
Combination Method and Updating Reference Set
The combination method is the mechanism used to create new biclusters in Scatter Search. All pairs of biclusters of the reference set are combined, generating S·(S−1)/2 new biclusters. In the SS&GA algorithm the combination method is the typical uniform crossover operator used in Genetic Algorithms. This crossover operator is shown in Fig. 2. A binary mask is randomly generated and a child is composed of the values from the first parent where there is a 1 in the mask, and from the second parent where there is a 0. The reference set is updated with the S best biclusters from the reference set and the new biclusters generated by the combination method, according to the fitness function. This process is repeated iteratively until the reference set does not change.

mask:      1 0 0 1 0 0 1
parent 1:  1 1 0 1 1 0 1
parent 2:  0 0 0 1 1 1 0
child:     1 0 0 1 1 1 1
Fig. 2. Uniform crossover operator of Genetic Algorithms
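A minimal sketch of this combination method (my own illustration, assuming biclusters are represented as 0/1 strings as in Section 3.1):

```python
import random

def uniform_crossover(parent1, parent2):
    """Take each bit from parent1 where the random mask is 1 and from parent2 where it is 0."""
    mask = [random.randint(0, 1) for _ in parent1]
    return "".join(b1 if m == 1 else b2 for m, b1, b2 in zip(mask, parent1, parent2))

def combination_method(ref_set):
    """All S(S-1)/2 pairs of the reference set produce one new bicluster each."""
    return [uniform_crossover(a, b)
            for i, a in enumerate(ref_set) for b in ref_set[i + 1:]]
```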
3.4
Rebuilding Reference Set
Once the reference set has become stable in the updating process, it is rebuilt to introduce diversity into the search process. In Genetic Algorithms this task is performed by mutation operators. Thus, the reference set is composed of the S/2 best biclusters from the updated reference set (set R1) according to the fitness function and the S/2 biclusters from P \ R1 most distant from R1 according to the Hamming distance.
3.5
Biclusters Evaluation
The fitness function is fundamental in order to evaluate the quality of biclusters. Cheng and Church proposed the MSR, which measures the correlation of a bicluster. Given a bicluster comprising the subset of genes I and the subset of conditions J, the MSR is defined as follows:

MSR(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} R(i, j)^2    (2)

where

R(i, j) = e_{ij} - e_{Ij} - e_{iJ} + e_{IJ}    (3)

e_{Ij} = \frac{1}{|I|} \sum_{i \in I} e_{ij}    (4)

e_{iJ} = \frac{1}{|J|} \sum_{j \in J} e_{ij}    (5)

e_{IJ} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} e_{ij}    (6)
In this work, biclusters with low residue and high volume are preferred. Therefore, the fitness function is defined by

f(B) = MSR(B) + M_1 \frac{1}{G} + M_2 \frac{1}{C}    (7)

where MSR(B) is the MSR of the bicluster B, M_1 and M_2 are penalization factors used to control the volume of the bicluster B, and G and C are the number of genes and conditions of the bicluster B, respectively. The use of the MSR in the fitness function of the proposed SS&GA algorithm allows a comparison to be established with a previous evolutionary-based biclustering method and with the Cheng and Church algorithm.
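For reference, a direct NumPy transcription of Eqs. (2)-(7) (my own sketch; the expression sub-matrix of the bicluster B is assumed to be extracted beforehand, e.g. with the helper of Section 3.1):

```python
import numpy as np

def msr(submatrix):
    """Mean Squared Residue (Eqs. 2-6) of a bicluster sub-matrix."""
    e_iJ = submatrix.mean(axis=1, keepdims=True)   # row (gene) means
    e_Ij = submatrix.mean(axis=0, keepdims=True)   # column (condition) means
    e_IJ = submatrix.mean()                        # overall mean
    residue = submatrix - e_iJ - e_Ij + e_IJ
    return float((residue ** 2).mean())

def fitness(submatrix, M1, M2):
    """Fitness function of Eq. (7): MSR penalized by the inverse volume of the bicluster."""
    G, C = submatrix.shape
    return msr(submatrix) + M1 / G + M2 / C
```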
4
Experimental Results
Two well-known datasets [4] have been used to show the performance of the proposed SS&GA algorithm. The first dataset is the yeast Saccharomyces cerevisiae cell cycle expression and the second one is the human B-cells expression
data originated from [15] and [16], respectively. The original data were preprocessed in [4] by replacing missing values with random numbers. The Yeast dataset contains 2884 genes and 17 experimental conditions and the Human dataset consists of 4026 genes and 96 conditions. The main parameters of the proposed SS&GA algorithm are as follows: 200 for the initial population, 10 for the reference set and 20 for the maximum number of iterations. The penalization factor for the number of conditions has been chosen one order of magnitude larger than the range in which the fitness function varies, for both datasets. However, that for the number of genes has been chosen of the same order of magnitude as the range of values of the fitness function, for both datasets. The main goal of this choice is to test the influence of the penalization factors on the volume of the biclusters.
4.1
Yeast Data Set
Table 1 presents several biclusters obtained by the application of the SS&GA approach to the Yeast dataset. For each bicluster, an identifier, the value of its MSR, the number of genes and the number of conditions are shown. It can be observed that high-quality biclusters have been obtained, as the values of the MSR are lower than 220. Moreover, the volume of the obtained biclusters is satisfactory, showing that the SS&GA approach finds non-trivial biclusters. In particular, biclusters rather than clusters are obtained, since the number of conditions is always less than 17. The biclusters presented in Table 1 are shown in Fig. 3. Although the biclusters are good taking into account the values of their MSR, their trends cannot be observed easily in this figure. This is due to the overlapping among biclusters, as the same gene can be found in different biclusters. Fig. 4 shows the evolution of the average MSR, fitness function values and volume of the reference set throughout the evolutionary process for the Yeast dataset. The values of the MSR and the volume are represented on the left axis and those of the fitness function on the right axis. It can be noticed that the initial reference set improves the average MSR throughout the iterations

Table 1. Results obtained by SS&GA algorithm for Yeast dataset

Bicluster    MSR     Genes  Conditions
bi.1        74.72     10       13
bi.2       106.25     13       13
bi.3       125.9      22       13
bi.4       216.16     25       14
bi.5        97.04     26       11
bi.6       117.25     14       14
bi.7       136.67     25       13
bi.8       159.44     39       13
bi.9       121.89     26       11
(Nine panels, bi1–bi9, each plotting expression value against conditions for one bicluster.)
Fig. 3. Biclusters from Yeast dataset
and the SS&GA algorithm converges in approximately 8 iterations. The average volume of the reference set decreases with the number of iterations because the chosen penalization factors are not too large.
4.2
Lymphoma Data Set
Table 2 presents information about several biclusters found by the SS&GA approach for the Human dataset. The values of the MSR are considerably low, since all are lower than 1100. Thus, it can be stated that the obtained biclusters have a remarkable quality. Moreover, in general the obtained biclusters have a large number of genes, especially biclusters 1, 2 and 4. These biclusters are also represented in Fig. 5. Figure 6 presents the performance of the proposed algorithm for the Human dataset. The evolution of the average MSR, fitness function values and volume
(Average MSR and volume of the reference set on the left axis and average fitness function on the right axis, plotted against the number of iterations, 1–20.)
Fig. 4. Performance of the proposed SS&GA algorithm for Yeast dataset

Table 2. Results obtained by SS&GA for Human dataset

Bicluster    MSR      Genes  Conditions
bi.1        855.17     109      13
bi.2        813.70     127      12
bi.3        642.13      85      11
bi.4        815.74     122      10
bi.5        771.69      48      12
bi.6        595.69      44       9
bi.7       1074.10      56      13
bi.8        507.17      67       8
bi.9        794.07      70      11
Table 3. Comparison of the results obtained by SS&GA, SEBI and CC algorithms

Algorithm–Dataset   Avg. Residue       Avg. gene num.     Avg. cond. num.
SS&GA–Yeast         128.37 (40.71)      22.23 (8.86)       12.78 (1.09)
SS&GA–Human         763.27 (165.73)     80.89 (36.61)      11 (1.73)
SEBI–Yeast          205.18 (4.49)       13.61 (10.38)      15.25 (1.37)
SEBI–Human          1028.84 (29.19)     14.07 (5.39)       43.57 (6.20)
CC–Yeast            204.29 (42.78)     166.71 (226.37)     12.09 (4.39)
CC–Human            850.04 (153.91)    269.22 (204.71)     24.5 (20.92)
for the reference set is shown. A good performance of the SS&GA technique and a fast convergence can be appreciated. The values of the fitness function decrease quickly and only about ten iterations are needed to find high-quality biclusters. In this case, the choice of penalization parameters to keep the volume of the biclusters under control provides a nearly constant volume in the last iterations. Finally, a comparison between the results obtained with the SS&GA algorithm and two representative techniques reported in the literature is provided. Concretely, the SS&GA algorithm is compared to SEBI [10] and to the Cheng and
(Nine panels, bi1–bi9, each plotting expression value against conditions for one bicluster.)
Fig. 5. Biclusters from Human dataset
Church (CC) algorithm [4]. The SEBI approach is a genetic algorithm which introduces mechanisms to avoid the overlapping among biclusters. On the other hand, most of the biclusters obtained by the CC algorithm overlap. Table 3 presents the average MSR and the average numbers of genes and conditions of the biclusters found by the three approaches. Furthermore, the standard deviation is shown in brackets. It can be observed that the proposed algorithm improves all the average MSR values, even though SEBI obtains smaller biclusters than the CC and SS&GA methods. The small volume of the biclusters found by the SEBI algorithm, due to the control of the overlapping, should lead to a lower MSR. As regards the standard deviation, the SEBI approach shows the most stable behavior, since the CC and SS&GA methods have standard deviations larger than those of the SEBI algorithm. In short, it can be stated that the SS&GA algorithm has a good performance, yielding results that are competitive with those of other techniques.
(Average MSR and volume of the reference set on the left axis and average fitness function on the right axis, plotted against the number of iterations, 1–20.)
Fig. 6. Performance of the proposed SS&GA algorithm for Human dataset
5
Conclusions
A hybrid metaheuristic for biclustering based on Scatter Search and Genetic Algorithms has been presented in this work. A general scheme of Scatter Search has been used to obtain high-quality biclusters, but the starting population has been generated randomly and a uniform crossover operator, borrowed from Genetic Algorithms, has been chosen to create new biclusters. Experimental results on the yeast cell cycle and human B-cell lymphoma datasets have been reported, and the outcomes of the proposed hybrid algorithm have been compared with those of a genetic algorithm, showing a satisfactory performance taking into account the difficulty of the biclustering problem. Future work will focus on the use of deterministic combination methods and of diversification methods to generate the initial population. Moreover, other measures based on scaling and shifting patterns to evaluate biclusters will be tested.
Acknowledgments The financial support given by the Spanish Ministry of Science and Technology, project TIN-68084-C02-01 and by the Junta de Andalucía, project P07-TIC02611 is acknowledged.
References 1. Larranaga, P., et al.: Machine learning in bioinformatics. Briefings in Bioinformatics 7(1), 86–112 (2006) 2. Busygin, S., Prokopyev, O., Pardalos, P.M.: Biclustering in data mining. Computers and Operations Research 35(9), 2964–2987 (2008) 3. Levine, E., Getz, G., Domany, E.: Couple two-way clustering analysis of gene microarray data. Proceedings of the National Academy of Sciences (PNAS) of the USA 97(22), 12079–12084 (2000)
4. Cheng, Y., Church, G.M.: Biclustering of Expression Data. In: 8th International Conference on Intelligent Systems for Molecular Biology, pp. 93–103 (2000) 5. Tanay, A., Sharan, R., Shamir, R.: Discovering statistically significant biclusters in gene expression data. Bioinformatics 18(1), 136–144 (2002) 6. Yang, J., Wang, H., Wang, W., Yu, P.: Enhanced biclustering on expression data. In: 3rd IEEE Symposium on Bioinformatics and Bioengineering, pp. 321–327 (2003) 7. Bergmann, S., Ihmels, J., Barkai, N.: Iterative signature algorithm for the analysis of large-scale gene expression data. Physical Review E 67(3), 31902 (2003) 8. Harpaz, R., Haralick, R.: Exploiting the geometry of gene expression patterns for unsupervised learning. In: 18th International Conference on Pattern Recognition (ICPR 2006), pp. 670–674 (2006) 9. Bryan, K., Cunningham, P., Bolshakova, N., Coll, T., Dublin, I.: Biclustering of expression data using simulated annealing. In: 18th IEEE International Symposium on Computer-Based Medical Systems, pp. 383–388 (2005) 10. Divina, F., Aguilar-Ruiz, J.S.: Biclustering of Expression Data with Evolutionary Computation. IEEE Transactions on Knowledge and Data Engineering 18(5), 590– 602 (2006) 11. Mitra, S., Banka, H.: Multi-objective evolutionary biclustering of gene expression data. Pattern Recognition 39(12), 2464–2477 (2006) 12. Aguilar-Ruiz, J.S.: Shifting and scaling patterns from gene expression data. Bioinformatics 21(20), 3840–3845 (2005) 13. Nepomuceno, J.A., Troncoso, A., Aguilar-Ruiz, J.S., Garcıa-Gutierrez, J.: Biclusters Evaluation Based on Shifting and Scaling Patterns. In: Yin, H., Tino, P., Corchado, E., Byrne, W., Yao, X. (eds.) IDEAL 2007. LNCS, vol. 4881, pp. 840–849. Springer, Heidelberg (2007) 14. Marti, R., Laguna, M.: Scatter Search. Methodology and Implementation in C. Kluwer Academic Publishers, Boston (2003) 15. Cho, R.J., et al.: A Genome-Wide Transcriptional Analysis of the Mitotic Cell Cycle. Molecular Cell 2(1), 65–73 (1998) 16. Alizadeh, A.A., et al.: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511 (2000)
Di-codon Usage for Gene Classification
Minh N. Nguyen1, Jianmin Ma1, Gary B. Fogel2, and Jagath C. Rajapakse3,4,5
1 Bioinformatics Institute, Singapore
2 Natural Selection Inc., San Diego, USA
3 BioInformatics Research Centre, Nanyang Technological University, Singapore
4 Singapore-MIT Alliance, Singapore
5 Department of Biological Engineering, Massachusetts Institute of Technology, USA
Abstract. Classification of genes into biologically related groups facilitates inference of their functions. Codon usage bias has been described previously as a potential feature for gene classification. In this paper, we demonstrate that di-codon usage can further improve classification of genes. By using both codon and di-codon features, we achieve near perfect accuracies for the classification of HLA molecules into major classes and sub-classes. The method is illustrated on 1,841 HLA sequences which are classified into two major classes, HLA-I and HLA-II. Major classes are further classified into sub-groups. A binary SVM using di-codon usage patterns achieved 99.95% accuracy in the classification of HLA genes into major HLA classes; and multi-class SVM achieved accuracy rates of 99.82% and 99.03% for sub-class classification of HLA-I and HLA-II genes, respectively. Furthermore, by combining codon and di-codon usages, the prediction accuracies reached 100%, 99.82%, and 99.84% for HLA major class classification, and for sub-class classification of HLA-I and HLA-II genes, respectively.
1
Introduction
Genetic information encoded in nucleic acids is transferred to proteins via codons. The study of codon usage is important because it is an integral component of the translation of nucleic acids into their functional forms, i.e. proteins, and because of its relevance to mutation studies. When a synonymous mutation occurs, the codon usage varies, but the resulting protein product remains unchanged. Therefore, codon usage is a good indicator for studies of mutation and molecular evolution. The pattern of codon usage has been found to be highly variable [1] and is implicated in the function of genes in different species. The use of codon usage bias for gene classification was rarely explored in the past, except by Kanaya et al. [2], who used the species-specific characteristics of codon usage to classify genes from 18 different species, mainly prokaryotes and unicellular eukaryotes. We recently showed that codon usage is a potential feature for gene classification [3]. Furthermore, using human leukocyte antigen (HLA) molecules, classification based on codon usage bias was shown to be inconsistent with molecular structure and biological function of the genes.
Experimental approaches for gene classification often use microarray data, yet such methods are costly and tedious. Researchers have begun to use computational approaches such as machine learning techniques to extract features and thereby classify gene expressions from microarray experiments to identify genes belonging to biologically meaningful groups [4]. Because of the large dimension and the limited sample sizes, these methods have limited utility on larger datasets. Sequence-based gene classification provides an alternate to expressionbased methods of gene classification. Other sequence-based methods of gene classification includes homology-based approaches through multiple sequence alignment [5]. Because of time and space complexities in multiple sequence alignment, such approaches are relatively difficult to use on a large number of sequences. Moreover, if the lengths or evolutionary distances of sequences differ, correct alignments are difficult to achieve, resulting in lower gene classification accuracy. More importantly, the information from synonymous mutations is often neglected in homology-based approaches despite their importance in evolution. The classification of genes based on structural features also neglects synonymous mutations [6]. In this paper, we demonstrate the use of di-codon usage as a promising feature for gene classification. Di-codon usage patterns contain additional information for gene classification to those given by codon usage as di-codon usage patterns encapsulate more global (di-codon frequency) information of a DNA sequence. Given that ribosomes actually reside over two codon positions when they slide along mRNA, di-codon usage has a biological rationale to translation of genes. Noguchi et al. developed a prokaryotic gene-finding program, MetaGene, which utilizes di-codon frequencies estimated by the GC content of a given sequence with other various measures [7]. By using di-codon frequencies, their method achieved a higher prediction accuracy than by using codon frequencies alone [7]. A hidden Markov model with self-identification learning for finding protein coding regions from un-annotated genome sequences has been studied and shown that the di-codon model outperforms other competitive features such as aminoacid pairs, codon usage, and G+C content in terms of sensitivity as well as specificity [8]. The gene finding program, DicodonUse, is based on frequencies of di-codons and used for identification of open reading frames that have a high probability of being genes [9]. Uno et al. demonstrated that the main reading frame of Chi sequences (5’-GCTGGTGG-3’) increased as a result of the di-codon CTG-GTG increasing under a genomewide pressure for adapting to the codon usage and base composition of the E. coli K-12 strain [10]. In this paper, we use binary and multi-class support vector machines (SVM) for the classification of genes based on codon and di-codon usage features. Their good generalization capabilities in classification [11,12,13] make them ideal for gene classification. We have used SVMs successfully for classifying protein features [14,15,16], gene expressions [17], mass spectra [18], and genes based on codon usage [3]. Others have also demonstrated their use in other bioinformatics problems: Lin et al. [19] to study conserved codon composition of ribosomal protein coding genes in E. coli, M. tuberculosis, and S. cerevisiae; Bhasin and
Raghava [20,21] for the prediction of HLA-DRB1*0401 binding protein and Cytotoxic T lymphocyte (CTL) epitopes; Donnes and Elofsson for the prediction of MHC class I binding peptides [22]; and Zhao et al. for the prediction of T-cell epitopes [23]. By using di-codon usage pattern as input feature for SVM, we demonstrate our method for gene classification on a dataset of 1,841 HLA gene sequences collected from the IMGT/HLA Sequence Database. The proposed approach achieved substantial improvement in classification accuracies of HLA molecules into HLA-I and HLA-II classes, and their subclasses. We compare our results when using codon usage alone as input feature, and with homology-based methods.
2
Materials and Methods
2.1
Data
Recently, there has been an increase of the number of nucleic acid and protein sequences in the international immunogenetics databases [24,25,26], which has enabled computational biologists to study human and primate immune systems. In order to demonstrate our method, we use a set of HLA genes, obtained from HLA ImmunoGenetics (IMGT/HLA) database of European Bioinformatics Institute (EBI) (http://www.ebi.ac.uk/). The Major Histocompatibility Complex (MHC) is determined by a suite of genes located on a specific chromosome (e.g., HLA is located on chromosome 6 while mouse MHC is located on chromosome 11) and produces glycoprotein products to initiate the immune response of the body [27]. HLA or human MHC molecules are a vital component of immune response and take part in the selection process of thymus cells, genetic control of immunological reaction, and interactions between immunocytes. The primary function of HLA molecules is to bind and present antigens on cell surfaces for recognition by antigen-specific T-cell receptors (TCR) of lymphocytes. Immune reactions involve interactions between HLA molecules and T lymphocytes [28]; T-cell response has subsequently been restricted not only by the antigen but also by HLA molecule [29]. Furthermore, HLA molecules are involved in the production of antibodies, which process is also HLA restricted by gene products from the class II molecules [30,31]. HLA gene products are involved in the pathogenesis of many diseases including autoimmune disorders. The exact mechanisms behind HLA associated risk of autoimmune diseases remain to be fully understood. We first demonstrate our approach through the classification of HLA genes into major classes HLA-I and HLA-II. The major classes are then divided into sub-classes: HLA-I molecules are classified into HLA-A, HLA-B, HLAC, HLA-E, HLA-F, and HLA-G types, and HLA-II molecules are classified into HLA-DMA, HLA-DMB, HLA-DOA, HLA-DOB, HLA-DPA1, HLA-DPB1, HLA-DQA1, HLA-DQB1, HLA-DRA, HLA-DRB1, HLA-DRB3, HLA-DRB4, and HLA-DRB5. Expression of HLA-I genes is constitutive and ubiquitous in most cell types. This is consistent with the protective function of cytotoxic T lymphocytes (Tc) which continuously survey cell surfaces and destroy cells
harboring metabolically active microorganisms. HLA-II molecules are expressed only within cells that present antigens, such as antigen-presenting macrophages, dendritic cells, and B cells. This is in accordance with the functions of helper T lymphocytes (Th) activated locally wherever they encounter antigen presenting cells that have internalized and processed antigens produced by pathogens. HLA genes were extracted from the IMGT/HLA Sequence Database [24,25,26] of EBI (Release 2.7, 10/08/2004, http://www.ebi.ac.uk/imgt/hla/) which is part of the international ImMunoGeneTics project (IMGT) providing specialist databases of the sequences of HLA molecules, including official sequences for Nomenclature Committee for Factors of HLA System of the World Health Organization. Extracted HLA gene sequences were checked individually for errors such as incorrect assignment of translation initiation sites, inconsistencies with the reference sequences in EMBL or GenBank nucleotide databases, etc. and the errors were then curated manually. Because there are 61 different codons coding for amino acids, in order to have a sufficient sampling of codons for computation, coding sequences of less than 50 amino acids were excluded from this analysis [3], resulting in 1,841 HLA genes. The details of this dataset are available in [3]. Di-codon usage patterns were calculated for each sequence and used as input features for SVM in classifying input HLA sequences into main- and sub-classes. The input to SVM was a 4096dimensional vector derived from di-codon usage values. Binary SVM was adopted for classification of main classes and multi-class SVM was adopted for sub-class identification of HLA-I and HLA-II molecules. 2.2
Di-codon Usage
Let the coding sequence of the gene in terms of codons be denoted by s = (s_1, s_2, ..., s_n), where s_i ∈ Ω, n is the length of the sequence in codons, and Ω = {c_1, c_2, ..., c_64} is the alphabet of codons. The di-codon usage pattern is given by the fractions of di-codon types within the coding sequence and captures the global information about the gene sequence. The di-codon usage r_{c_j c_k} is measured by the fraction of di-codons (c_j, c_k) ∈ Ω² in the sequence s:

r_{c_j c_k} = \frac{1}{n-1} \sum_{i=1}^{n-1} \delta(s_i = c_j) \, \delta(s_{i+1} = c_k)    (1)
where δ(·) = 1 if the argument inside is satisfied and 0 otherwise. Di-codon patterns have a fixed length of 4096 (64 × 64) irrespective of the length of the sequence. Let r = (r_1, r_2, ..., r_k, ..., r_4096), where r_k ∈ [0, 1], denote the feature vector consisting of the di-codon usages derived from the input sequence s.
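A small sketch of how such a feature vector can be computed for one coding sequence (my own illustration; positions containing degenerate or partial codons are simply skipped):

```python
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]                        # the 64 codons
DICODON_INDEX = {a + b: k for k, (a, b) in enumerate(product(CODONS, CODONS))}  # 4096 di-codons

def dicodon_usage(coding_sequence):
    """Return the 4096-dimensional vector r of Eq. (1) for a coding DNA sequence."""
    codons = [coding_sequence[i:i + 3] for i in range(0, len(coding_sequence) - 2, 3)]
    counts = Counter(a + b for a, b in zip(codons, codons[1:]))
    r = [0.0] * len(DICODON_INDEX)
    n = len(codons)
    for dicodon, c in counts.items():
        if dicodon in DICODON_INDEX:
            r[DICODON_INDEX[dicodon]] = c / (n - 1)
    return r
```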
2.3
Binary SVM
A binary SVM classifier was adopted to classify HLA gene sequences into two main classes: HLA-I and HLA-II. The problem of classifying HLA sequence, s,
into major classes is seen as finding the optimal mapping from the space of di-codon usage patterns to the HLA-I and HLA-II classes. Let {(r_j, q_j) : j = 1, 2, ..., N} denote the set of all training exemplars, where q_j denotes the desired classification, HLA-I or HLA-II, for the input di-codon usage pattern r_j, so that the output q_j is −1 if the correct class is HLA-I or +1 if the class is HLA-II; N denotes the number of training sequences. SVM implicitly projects the input to a higher dimensional space with a kernel function K and then linearly combines the projections with a weight vector w to obtain the output. The binary SVM was trained to classify input vectors of di-codon usage patterns to the correct major class of HLA by solving the following optimization problem:

minimize \frac{1}{2} w^T w + \gamma \sum_{j=1}^{N} \xi_j  subject to  q_j (w^T \phi(r_j) + b) \ge 1 - \xi_j  and  \xi_j \ge 0    (2)
where the slack variables ξ_j represent the magnitude of the classification error, φ represents the mapping function to the higher-dimensional space, b is the bias used to classify samples, and γ (> 0) is the sensitivity parameter which decides the trade-off between the training error and the margin of separation [11,12]. The minimization of the above optimization problem was done by solving a quadratic programming problem, and the class corresponding to an input pattern of di-codon usage values is determined by the resulting discriminant function obtained from the optimization [3].
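As a rough equivalent of this training step, the following sketch uses scikit-learn's SVC rather than the LIBSVM/BSVM tools mentioned in the paper (the toy sequences and labels are invented for illustration; SVC's C plays the role of the sensitivity parameter γ of Eq. (2) and its gamma that of the Gaussian-kernel parameter σ, reusing the dicodon_usage helper above):

```python
from sklearn.svm import SVC

# Toy stand-ins for the 1,841 HLA coding sequences and their major-class labels.
sequences = ["ATGGCTGCTACC", "ATGGCAGCAACC", "ATGTTTGGTCGA", "ATGTTCGGACGA"]
labels = [-1, -1, +1, +1]                      # -1: HLA-I, +1: HLA-II

X = [dicodon_usage(s) for s in sequences]
clf = SVC(kernel="rbf", C=2.0, gamma=0.125)    # parameter values reported in Section 3
clf.fit(X, labels)
print(clf.predict(X))
```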
2.4
Multi-class SVM
Multi-class SVM was adopted to classify HLA sequences into the sub-classes of HLA-I and HLA-II molecules. A scheme proposed by Crammer and Singer [32] for multi-class SVM was used, which has the capacity to solve the optimization problem in one step while minimizing the generalization error of the prediction [16]. For HLA-I classification, SVM was used to construct three discriminant functions, all of which are obtained by solving one single optimization problem:

minimize \frac{1}{2} \sum_{c \in \Omega_1} (w^c)^T w^c + \gamma \sum_{j=1}^{N_1} \xi_j  subject to  (w^{t_j})^T \phi(r_j) - (w^c)^T \phi(r_j) \ge d_j^c - \xi_j    (3)

where t_j ∈ Ω_1 = {HLA-A, HLA-B, HLA-C} denotes the desired subclass for input r_j, N_1 denotes the number of training sequences of HLA-I molecules, the slack variables ξ_j represent the magnitude of the classification error, c ∈ Ω_1 denotes the predicted subclass of an HLA-I sequence, and d_j^c = 0 if t_j = c and d_j^c = 1 if t_j ≠ c.
The minimization of the above optimization problem in Eq. (3) was done by solving the quadratic programming problem. Based on the resulting discriminant function, the subclass of HLA-I corresponding to the input pattern of di-codon usage values is determined [3]. For HLA-II, five discriminant functions f c , c ∈ Ω2 , and Ω2 = {HLA-DPB1,HLA-DQA1,HLA-DQB1,HLA-DRB1,HLA-DRB3} are constructed, each obtained by solving one single optimization problem as formulated in Eq. (3). The subclass of HLA-II, corresponding to the input pattern of di-codon usage was determined by the resulting discriminant function obtained from the optimization [3].
3
Results
Binary SVM was implemented using LIBSVM [33], known to have faster convergence properties than other tools available for solving the quadratic programming problem [34]. For sub-class classification of HLA-I and HLA-II molecules, multi-class SVM was implemented using the BSVM libraries [34]. Ten-fold cross-validation was used to evaluate the accuracy in HLA major class classification as well as in HLA-I and HLA-II subclass classification. In order to avoid the selection of extremely biased partitions in cross-validation, the dataset was divided randomly into ten balanced partitions of equal size. In addition, we also used specificity and sensitivity to assess the performance of the prediction scheme [3]. For binary and multi-class SVM, the Gaussian kernel K(x, y) = e^{-σ||x-y||^2} gave superior performance over linear and polynomial kernels for the classification of HLA molecules. This was also observed in the case of gene classification using codon bias as features [3]. The sensitivity parameter γ and the Gaussian kernel parameter σ were determined by using the grid-search method [34]. Grid-search provides useful parameter estimates for multi-class SVM in a relatively short time. The classification accuracy of binning the 1,841 HLA sequences into either HLA-I or HLA-II classes using binary SVMs was evaluated using ten-fold cross-validation. The optimal estimates of the sensitivity parameter, γ = 2, and of the Gaussian kernel parameter, σ = 0.125, achieved an accuracy of 99.95% for the classification of HLA molecules. For HLA-I subclass classification, we first considered the subclasses HLA-A, HLA-B, and HLA-C, as the numbers of sequences in other sub-classes such as HLA-E, HLA-F, and HLA-G were too small (less than 25 sequences) to be included in the analysis, so the total number of sequences for the experiment was 1,124. For a similar reason, we only considered the subclasses HLA-DPB1, HLA-DQA1, HLA-DQB1, HLA-DRB1, and HLA-DRB3 for HLA-II subclass classification, so the total number of sequences included in the experiment was 617. For HLA-I sub-class classification on the dataset of 1,124 sequences, the parameters γ = 1 and σ = 0.25 resulted in the best predictive accuracy of 99.82%, and for HLA-II sub-class classification on the dataset of 617 sequences, the parameters γ = 1 and σ = 0.25 gave an accuracy of 99.03%. The performance of binary SVM for major class classification and of multi-class SVM for sub-class classification of HLA-I and HLA-II molecules is presented in Table 1. The standard deviation of the cross-validation accuracies of HLA
Table 1. Accuracy (Acc), sensitivity (Sn), and specificity (Sp) of the classification of HLA molecules by using codon and di-codon usage as features for the SVM classifier

                       codon                  di-codon               codon + di-codon
HLA Classification     Acc    Sn     Sp       Acc    Sn     Sp       Acc     Sn     Sp
Major Class            99.30  98.99  99.48    99.95  99.86  100.0    100.0   100.0  100.0
HLA-I Sub-class        99.73  99.47  99.87    99.82  99.75  99.90    99.82   99.75  99.90
HLA-II Sub-class       98.38  93.82  99.59    99.03  96.35  100.0    99.84   99.40  100
Table 2. Comparison of performances of the present approach using codon and di-codon usage on the dataset of 1841 HLA genes

                                               Testing Accuracy    Cross-validation Accuracy
HLA Classification   Features/Method           mean     SD         mean     SD
Major class          Codon                     98.72    0.01       99.30    0.01
                     Di-codon                  99.13    0.01       99.95    0.01
                     Codon + Di-codon          99.78    0.01       100      0.00
                     Homology based method     96.14    0.04       96.65    0.04
HLA-I Sub-class      Codon                     98.60    0.03       99.73    0.03
                     Di-codon                  99.47    0.02       99.82    0.01
                     Codon + Di-codon          99.64    0.01       99.82    0.01
                     Homology based method     97.51    0.23       97.83    0.23
HLA-II Sub-class     Codon                     97.67    0.03       98.38    0.02
                     Di-codon                  98.70    0.02       99.03    0.02
                     Codon + Di-codon          99.35    0.02       99.84    0.01
                     Homology based method     96.27    0.24       96.74    0.24
major class classification, HLA-I subclass classification, and HLA-II subclass classification were 0.01, 0.01, and 0.02, respectively, indicating little effect of data partitioning (see Table 2). We also investigated the combination of codon and di-codon features for the classification of HLA molecules into major classes and of HLA-I/HLA-II molecules into their subclasses. A total of 4155 features including relative synonymous
codon usage of 59 codons [3] and 4096 di-codon usage values were used as input for the classification. Table 1 shows the ten-fold cross-validation accuracies, sensitivities, and specificities of binary SVM for major class classification and of multi-class SVM for sub-class classification of HLA-I and HLA-II molecules, achieved with the best parameter values. By combining codon and di-codon features for HLA sequence classification, the binary SVM achieved the highest accuracy of 100% with sensitivity parameter γ = 2 and kernel parameter σ = 0.125 of the Gaussian kernel; multi-class SVM achieved accuracies of 99.82% and 99.84% for HLA-I and HLA-II sub-class classification, respectively, with parameters γ = 1 and σ = 0.25, interestingly, for both classes. In order to evaluate the testing accuracy of the present method, the dataset was randomly divided into two balanced halves of major- and sub-classes of HLA sequences. One partition was selected for training and the other was reserved for testing. SVM was trained with the training dataset, and the kernels and parameters were selected based on the best accuracies on the training dataset. The test accuracies were calculated on the testing dataset with the parameters obtained during training. This procedure was repeated 25 times and the mean and standard deviation of the accuracy were calculated and are given in Table 2. As seen, the testing and cross-validation accuracies are close, indicating good generalization ability of the method.
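A minimal sketch of the grid search and ten-fold cross-validation protocol described above, again with scikit-learn instead of LIBSVM/BSVM (X and y stand for the feature vectors and class labels of the full dataset; the candidate parameter grids are illustrative only, not the ones actually used in the paper):

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

param_grid = {"C": [0.5, 1, 2, 4], "gamma": [0.0625, 0.125, 0.25, 0.5]}
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # balanced partitions
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=folds, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```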
3.1
Comparison with Homology-Based Methods
In order to compare discriminating power of di-codon usage pattern, homology based distance matrices were used for the classification of HLA sequences, HLA-I sequences, and HLA-II sequences. The multiple sequence alignment on sequences was performed by using ClustalX [35] and the distance matrix was constructed by pairwise similarities of aligned sequences. The distance matrix has been shown previously as an effective feature for clustering or classification of aligned sequences [36]. Using the distance matrix as input features, SVM was used to classify the sequences; and ten-fold cross-validation accuracies are reported in Table 2. These results show that di-codon usage pattern gives improvement in classification accuracy and is an effective feature for classification of HLA genes.
4
Discussion and Conclusion
Codon and di-codon usage are useful features in synonymous mutation studies in molecular evolution because when a synonymous mutation occurs, though the phenotype (the coded protein) does not change, the codon usage pattern as well as features such as the gene expression level are affected. Di-codon usage patterns provide additional information on codon usage as ribosomes actually reside over two codon positions during translation. Therefore, di-codon usage is a good indicator in gene expression and molecular evolution studies and, as seen in our experiments, provides a good feature for gene classification.
The efficacy of our method was demonstrated on a set of HLA genes collected from IMGT/HLA database. Once HLA genes were classified according to major classes, di-codon usage were further explored for more precise classification of the molecules. In major class classification of HLA molecules and subclass classifications of HLA-I and HLA-II molecules, the present approach using di-codon usage patterns achieved better overall accuracies than obtained by the classifiers using codon usage bias. The results in classification of HLA genes, using codon and di-codon usage as features for SVM were near perfect. The method is independent of the lengths of sequences and useful when homology-based methods tend to fail on datasets having genes of varying length. Also, in case of SVM, testing and cross-validation accuracies were close, indicating that the parameter estimation and kernel selection procedures were not sensitive to data. Since the classifications of HLA molecules into their subclasses were accurately achieved with di-codon usage patterns, the functions of HLA molecules should be closely related to di-codon usage. Although our demonstration was limited to HLA molecules, the approach could be generalized and applicable for the classification of other groups of molecules as well. As the method generalized well in the experiments, it could also help in the prediction of the function of novel genes. The authors are unaware of any public datasets for benchmarking gene classification algorithms such as the approach presented here. Di-codon usage is a complicated phenomenon affected by many factors, such as species, gene function, protein structure, gene expression level, tRNA abundance, etc. Building a correlation between di-codon usage patterns and biological phenotypes and finding the relationships and interactions can result in unfolding valuable biological information from nucleic acid sequences. For novel genes, di-codon usage patterns could be used for their classification and helpful in inferring their function. Therefore, analyses of di-codon usage patterns with computational techniques that capture inherent rules of translation could be useful for both basic and applied research in life sciences. Investigating usage patterns of which codons and di-codons most affect the classification of genes is worthy of further exploration. Recently, error-correcting output codes (ECOC) provide a general-purpose method for improving the performance of inductive learning programs on multi-class problems. Therefore, a comparison of the multi-class SVM with ECOC methods for multi-class gene classifications could be helpful and is reserved for future work.
References 1. Sharp, P.M., Cowe, E., Higgins, D.G.: Codon usage patterns in Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster and Homo sapiens: a review of the considerable withinspecies diversity. Nucleic Acids Res. 16, 8207–8211 (1988) 2. Kanaya, S., Yamada, Y., Kudo, Y., Ikemura, T.: Studies of codon usage and tRNA genes of 18 unicellular organisms and quantification of Bacillus subtilis tRNAs: gene expression level and species-specific diversity of codon usage based on multivariate analysis. Gene 238, 143–155 (1999)
3. Ma, J.M., Nguyen, M.N., Rajapakse, J.C.: Gene Classification using codon usage and support vector machines. IEEE/ACM Transactions on Computational Biology and Bioinformatics 6(1), 134–143 (2009) 4. Zhang, Y., Rajapakse, J.C. (eds.): Machine Learning in Bioinformatics. John Wiley and Sons Inc., Chichester (2009) 5. Wallace, I.M., Blackshields, G., Higgins, D.G.: Multiple sequence alignments. Curr. Opin. Struct. Biol. 15, 261–266 (2005) 6. Shatsky, M., Nussinov, R., Wolfson, H.J.: Optimization of multiple-sequence alignment based on multiple-structure alignment. Proteins: Structure, Function, and Bioinformatics 62, 209–217 (2006) 7. Noguchi, H., Park, J., Takagi, T.: MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Research 34(19), 5623–5630 (2006) 8. Kim, C., Konagaya, A., Asai, K.: A generic criterion for gene recognitions in genomic sequences. Genome Inform. Ser. Workshop Genome Inform. 10, 13–22 (1999) 9. Paces, J., Paces, V.: DicodonUse: the programme for dicodon bias visualization in prokaryotes. Folia Biol. (Praha) 48(6), 246–249 (2002) 10. Uno, R., Nakayama, Y., Tomita, M.: Over-representation of Chi sequences caused by di-codon increase in Escherichia coli K-12. Gene 380(1), 30–37 (2006) 11. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995) 12. Vapnik, V.: Statistical Learning Theory. Wiley and Sons, Inc., New York (1998) 13. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and other kernel-based learning methods. Cambridge University Press, Cambridge (2000) 14. Nguyen, M.N., Rajapakse, J.C.: Prediction of protein relative solvent accessibility with a two-stage SVM approach. Proteins: Structure, Function, and Bioinformatics 59, 30–37 (2005) 15. Nguyen, M.N., Rajapakse, J.C.: Two-stage support vector regression approach for predicting accessible surface areas of amino acids. Proteins: Structure, Function, and Bioinformatics 63, 542–550 (2006) 16. Nguyen, M.N., Rajapakse, J.C.: Prediction of protein secondary structure with two-stage multi-class SVM approach. International Journal of Data Mining and Bioinformatics 1(3), 248–269 (2007) 17. Duan, K.B., Rajapakse, J.C.: Multiple SVM-RFE for gene selection in cancer classification with expression data. IEEE Trans. Nanobioscience 4(3), 228–234 (2005) 18. Rajapakse, J.C., Duan, K.B., Yeo, W.K.: Proteomic cancer classification with mass spectrometry data. American Journal of Pharmacology 5(5), 281–292 (2005) 19. Lin, K., Kuang, Y., Joseph, J.S., Kolatkar, P.R.: Conserved codon composition of ribosomal protein coding genes in Escherichia coli, Mycobacterium tuberculosis and Saccharomyces cerevisiae: lessons from supervised machine learning in functional genomics. Nucleic Acids Res. 30, 2599–2607 (2002) 20. Bhasin, M., Raghava, G.P.: SVM based method for predicting HLA-DRB1*0401 binding peptides in an antigen sequence. Bioinformatics 20, 421–423 (2004) 21. Bhasin, M., Raghava, G.P.: Prediction of CTL epitopes using QM, SVM and ANN techniques. Vaccine 22, 3195–3204 (2004) 22. Donnes, P., Elofsson, A.: Prediction of MHC class I binding peptides, using SVMHC. BMC Bioinformatics 3(1), 25–32 (2002) 23. Zhao, Y., Pinilla, C., Valmori, D., Martin, R., Simon, R.: Application of support vector machines for T-cell epitopes prediction. Bioinformatics 19, 1978–1984 (2003)
24. Robinson, J., Waller, M.J., Parham, P., Bodmer, J.G., Marsh, S.G.E.: IMGT/HLA Sequence Database - a sequence database for the human major histocompatibility complex. Nucleic Acids Res. 29, 210–213 (2001) 25. Robinson, J., Waller, M.J., Parham, P., de Groot, N., Bontrop, R., Kennedy, L.J., Stoehr, P., Marsh, S.G.E.: IMGT/HLA and IMGT/MHC: sequence databases for the study of the major histocompatibility complex. Nucleic Acids Res. 31, 311–314 (2003) 26. Galperin, M.: The Molecular Biology Database Collection: 2004 update. Nucleic Acids Res. 32, D2–D22 (2004) 27. Bodmer, J.G., Marsh, S.G.E., Albert, E.D., Bodmer, W.F., Bontrop, R.E., Charron, D., Dupont, B., Erlish, H.A., Mach, B., Mayr, W.R., Parham, P., Sasazuki, T., Schreuder, G.M.T., Strom-inger, J.L., Svejgaard, A., Terasaki, P.I.: Nomenclature for factors of the HLA system, 1995. Tissue Antigens 46, 1–18 (1995) 28. Rosenthal, A.S., Shevach, E.: Function of macrophages in antigen recognition by guinea pig T lymphocytes. I. Requirement for histocompatibile macrophages and lymphocytes. J. Exp. Med. 138, 1194–1212 (1973) 29. Zinkernagel, R.M., Doherty, P.C.: Restriction of in vitro T cell-mediated cytotoxicity in lymphocytic choriomeningitis within a syngeneic or semiallogeneic system. Nature 248, 701–702 (1974) 30. Katz, D.H., Hamoaka, T., Benacerraf, B.: Cell interactions between histocompatible T and B lymphocytes. Failure of physiologic cooperation interactions between T and B lymphocytes from allogeneic donor strains in humoral response to haptenprotein conjugates. J. Exp. Med. 137, 1405–1418 (1973) 31. Han, H.X., Kong, F.H., Xi, Y.Z.: Progress of studies on the function of MHC in immuno-recognition. J. Immunol. (Chinese) 16(4), 15–17 (2000) 32. Crammer, K., Singer, Y.: On the Learnability and Design of Output Codes for Multiclass Problems. Machine Learning 47, 201–233 (2002) 33. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines, http://www.csie.ntu.edu.tw/~ cjlin/libsvm 34. Hsu, C.W., Lin, C.J.: A comparison on methods for multi-class support vector machines. IEEE Transactions on Neural Networks 13, 415–425 (2002) 35. Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F., Higgins, D.G.: The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 24, 4876–4882 (1997) 36. Grishin, V.N., Grishin, N.V.: Euclidian space and grouping of biological objects. Bioinformatics 18, 1523–1534 (2002)
Counting Patterns in Degenerated Sequences
Grégory Nuel
MAP5, CNRS 8145, University Paris Descartes, 45 rue des Saints-Pères, F-75006 Paris, France
[email protected]
Abstract. Biological sequences like DNA or proteins are always obtained through a sequencing process which might produce some uncertainty. As a result, such sequences are usually written in a degenerated alphabet where some symbols may correspond to several possible letters (ex: IUPAC DNA alphabet). When counting patterns in such degenerated sequences, the question that naturally arises is: how to deal with degenerated positions? Since most (usually 99%) of the positions are not degenerated, it is considered harmless to discard the degenerated positions in order to get an observation, but the exact consequences of such a practice are unclear. In this paper, we introduce a rigorous method to take into account the uncertainty of sequencing for biological sequences (DNA, proteins). We first introduce a Forward-Backward approach to compute the marginal distribution of the constrained sequence and use it both to perform an Expectation-Maximization estimation of the parameters and to derive a heterogeneous Markov distribution for the constrained sequence. This distribution is then used along with known DFA-based pattern approaches to obtain the exact distribution of the pattern count under the constraints. As an illustration, we consider an EST dataset from the EMBL database. Despite the fact that only 1% of the positions in this dataset are degenerated, we show that not taking these positions into account might lead to erroneous observations, further proving the interest of our approach. Keywords: Forward-Backward algorithm, Expectation-Maximization algorithm, Markov chain embedding, Deterministic Finite state Automaton.
1
Introduction
Biological sequences like DNA or proteins are always obtained through a sequencing process which might produce some uncertainty. As a result, such sequences are usually written in a degenerated alphabet where some symbols may correspond to several possible letters. For example, the IUPAC [1] protein alphabet includes the following degenerated symbols: X for "any amino-acid", Z for "glutamic acid or glutamine", and B for "aspartic acid or asparagine". For DNA sequences, there are even more such degenerated symbols, whose exhaustive list and meaning are given in Table 1 along with observed frequencies in several datasets from the EMBL [2] database.
Table 1. Meaning and frequency of the IUPAC [1] DNA symbols in several files of the release 97 of the EMBL nucleotide sequence database [2]. Degenerated symbols (lowest part of the table) contribute to 0.5% to 1% of the data.

symbol  meaning                  est pro 01  htg pro 01  htc fun 01  std hum 21
A       Adenine                       67459     1268408     1347782     1190205
C       Cytosine                      53294     1706478     1444861     1031369
G       Guanine                       54194     1719016     1325070      809651
T       Thymine                       66139     1277939     1334061     1067933
U       Uracil                            0           0           0           0
R       Purine (A or G)                  13           0           7          39
Y       Pyrimidine (C, T, or U)           6           0           9          37
M       C or A                            2           0           6          31
K       T, U, or G                        6           0           5          30
W       T, U, or A                        6           0           8          26
S       C or G                           21           0           4          28
B       not A                             0           0           0           0
D       not C                             3           0           0           0
H       not G                             0           0           1           0
V       not G, not U                      0           0           0           0
N       any base                       1792      115485       28165       19272
When counting patterns in such degenerated sequences, the question that naturally arises is: how to deal with degenerated positions? Since most (usually 99%) of the positions are not degenerated, it is usually considered harmless to discard the degenerated positions in order to get an observation. Another option might be to preprocess the dataset by replacing each special letter by the most likely compatible symbol at that position (with reference to some background model). Finally, one might come up with some ad hoc counting rule like: "whenever the pattern might occur I add one¹ to the observed count". However practical, all these solutions remain quite unsatisfactory from the statistician's point of view and their possible consequences (like adding or missing occurrences) remain unclear. In this paper, we want to deal rigorously with the problem of degenerated symbols in sequences by introducing the distribution of sequences under the uncertainty of their sequencing, and then by using this distribution to study the "observed" number of occurrences of a pattern of interest. To do so we place ourselves in a Markovian framework by assuming that the sequence X_1^ℓ = X_1 ... X_ℓ is an order-d² homogeneous Markov chain over the finite alphabet A. We denote by ν its starting distribution and by π its transition matrix. For all a_1^d ∈ A^d and for all b ∈ A we then have: P(X_1^d = a_1^d) = ν(a_1^d) and P(X_{i+d} = b | X_i^{i+d-1} = a_1^d) = π(a_1^d, b), with 1 ≤ i ≤ ℓ − d.
1 One might also think to add a fraction of one which corresponds to the probability of seeing the corresponding letter at the degenerated position.
2 For the sake of simplicity, the particular degenerated case where d = 0 is left to the reader.
For all 1 ≤ i ≤ ℓ we denote by X_i ⊂ A the subset of all possible values taken by X_i according to the data. For example, if we consider an IUPAC DNA sequence "ANTWY . . ." we have X_1 = {A}, X_2 = {A, C, G, T}, X_3 = {T}, X_4 = {A, T}, X_5 = {C, T}, . . . In a first part we establish the distribution of X_1^ℓ under the constraint that X_1^ℓ ∈ X_1^ℓ using an adaptation of the Baum-Welch algorithm [3]. We then demonstrate that the constrained sequence is distributed according to a heterogeneous Markov model whose starting distribution and transition function have explicit expressions. This result hence allows us to obtain the exact constrained distribution of a pattern by the application of known Markov chain embedding techniques. The interest of the method is finally illustrated with EST data and discussed.
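The sets X_i are obtained directly from the IUPAC code of Table 1; a short illustration (my own sketch, using the standard IUPAC DNA meanings):

```python
# Possible-letter set for every IUPAC DNA symbol (standard code, cf. Table 1).
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "M": {"A", "C"}, "K": {"G", "T"},
    "W": {"A", "T"}, "S": {"C", "G"},
    "B": {"C", "G", "T"}, "D": {"A", "G", "T"}, "H": {"A", "C", "T"}, "V": {"A", "C", "G"},
    "N": {"A", "C", "G", "T"},
}

def possible_sets(sequence):
    """Return the list (X_1, ..., X_l) of compatible letters at each position."""
    return [IUPAC[symbol] for symbol in sequence]

print(possible_sets("ANTWY"))
# X_1 = {A}, X_2 = {A, C, G, T}, X_3 = {T}, X_4 = {A, T}, X_5 = {C, T}
```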
2
Constrained Distribution
In order to compute the constrained probability P(X_1^ℓ | X_1^ℓ ∈ X_1^ℓ) we follow the sketch of the Baum-Welch algorithm [3] by introducing the Forward and Backward quantities.

Proposition 1 (Forward). For all x_1^ℓ ∈ A^ℓ and for all i, 1 ≤ i ≤ ℓ − d, we define the forward quantity F_i(x_i^{i+d-1}) = P(X_i^{i+d-1} = x_i^{i+d-1}, X_1^{i+d-1} ∈ X_1^{i+d-1}), which is computable by recurrence through

F_i(x_i^{i+d-1}) = \sum_{x_{i-1} \in X_{i-1}} F_{i-1}(x_{i-1}^{i+d-2}) \, \pi(x_{i-1}^{i+d-2}, x_{i+d-1})    (1)
for 2 ≤ i ≤ ℓ − d + 1 and with the initialization F_1(x_1^d) = ν(x_1^d) I_{X_1^d}(x_1^d), where I is the indicator function³. We then obtain that

P(X_1^ℓ ∈ X_1^ℓ) = \sum_{x_{ℓ-d}^{ℓ} \in X_{ℓ-d}^{ℓ}} F_{ℓ-d}(x_{ℓ-d}^{ℓ-1}) \, \pi(x_{ℓ-d}^{ℓ-1}, x_ℓ).    (2)
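To make the recursion concrete, here is a small sketch of the forward pass for the simplest case d = 1 (my own illustration; nu maps each base to its initial probability, pi maps each pair of bases to a transition probability, and possible is the list of sets X_1, ..., X_l, e.g. produced by possible_sets above):

```python
def forward_d1(nu, pi, possible, bases="ACGT"):
    """Forward quantities F_i(x) = P(X_i = x, X_1..X_i in their allowed sets), order d = 1."""
    F = [{x: (nu[x] if x in possible[0] else 0.0) for x in bases}]
    for i in range(1, len(possible)):
        prev = F[-1]
        F.append({x: (sum(prev[y] * pi[(y, x)] for y in possible[i - 1])
                      if x in possible[i] else 0.0)
                  for x in bases})
    return F

def constraint_probability(F, possible):
    """P(X_1^l in X_1^l): sum of the last forward column over the allowed letters."""
    return sum(F[-1][x] for x in possible[-1])
```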
Proof. We prove Equation (1) by simply rewriting Fi (xi+d−1 ) as: i i+d−1 i+d−1 Fi xi = P(Xi−1 = xi+d−1 , X1i+d−1 ∈ X1i+d−1 ) i−1 xi−1 ∈Xi−1
=
i+d−2 P(Xi−1 = xi+d−2 , X1i+d−2 ∈ X1i+d−2 ) i−1 xi−1 ∈Xi−1 Fi−1 (xi+d−2 ) i−1
i+d−2 ×P(Xi+d−1 = xi+d−1 , Xi+d−1 ∈ Xi+d−1 |Xi−1 = xi+d−2 , X1i+d−2 ∈ X1i+d−2 ) . i−1 π (xi+d−2 ,xi+d−1 )IXi+d−1 (xi+d−1 ) i−1
The proof of Equation (2) is established in a similar manner. 3
For any set E, subset A ⊂ E and element a ∈ E, IA (a) = 1 if a ∈ A and IA (a) = 0 otherwise.
Proposition 2 (Backward). For all x_1^ℓ ∈ A^ℓ and for all i, 1 ≤ i ≤ ℓ - d, we define the backward quantity B_i(x_i^{i+d-1}) := P(X_{i+d}^{ℓ} ∈ 𝒳_{i+d}^{ℓ} | X_i^{i+d-1} = x_i^{i+d-1}), which is computable by recurrence through:

B_i(x_i^{i+d-1}) = Σ_{x_{i+d} ∈ 𝒳_{i+d}} π(x_i^{i+d-1}, x_{i+d}) B_{i+1}(x_{i+1}^{i+d})   (3)

for 1 ≤ i ≤ ℓ - d - 1 and with the initialization B_{ℓ-d}(x_{ℓ-d}^{ℓ-1}) = Σ_{x_ℓ ∈ 𝒳_ℓ} π(x_{ℓ-d}^{ℓ-1}, x_ℓ). We then obtain that:

P(X_1^ℓ ∈ 𝒳_1^ℓ) = Σ_{x_1^d ∈ 𝒳_1^d} ν(x_1^d) B_1(x_1^d).   (4)
Proof. The proof is very similar to the one of Proposition 1 and is hence omitted.

Theorem 1 (Marginal distributions). For all x_1^ℓ ∈ A^ℓ we have the following results:

a) P(X_1^d = x_1^d, X_1^ℓ ∈ 𝒳_1^ℓ) = ν(x_1^d) B_1(x_1^d);
b) P(X_i^{i+d} = x_i^{i+d}, X_1^ℓ ∈ 𝒳_1^ℓ) = F_i(x_i^{i+d-1}) π(x_i^{i+d-1}, x_{i+d}) B_{i+1}(x_{i+1}^{i+d});
c) P(X_{ℓ-d}^{ℓ} = x_{ℓ-d}^{ℓ}, X_1^ℓ ∈ 𝒳_1^ℓ) = F_{ℓ-d}(x_{ℓ-d}^{ℓ-1}) π(x_{ℓ-d}^{ℓ-1}, x_ℓ);
d) P(X_i^{i+d-1} = x_i^{i+d-1}, X_1^ℓ ∈ 𝒳_1^ℓ) = F_i(x_i^{i+d-1}) B_i(x_i^{i+d-1}).

Proof. a), b), and c) are proved using the same conditioning mechanisms as in the proofs of Propositions 1 and 2. One can note that a) is a direct consequence of Equation (4), while c) can be derived from Equation (2). Thanks to Equation (3), it is also clear that b) ⇒ d), which achieves the proof.

From now on we denote by P^C(A) := P(A | X_1^ℓ ∈ 𝒳_1^ℓ) the probability of an event A under the constraint that X_1^ℓ ∈ 𝒳_1^ℓ.

Theorem 2 (Heterogeneous Markov chain). Under P^C, X_1^ℓ is an order-d heterogeneous Markov chain whose starting distribution ν^C is given by:

ν^C(x_1^d) := P^C(X_1^d = x_1^d) ∝ ν(x_1^d) B_1(x_1^d)   (5)

and whose transition matrix π^C_{i+d} (toward position i + d) is given by:

π^C_{i+d}(x_i^{i+d-1}, x_{i+d}) := P^C(X_{i+d} = x_{i+d} | X_i^{i+d-1} = x_i^{i+d-1}) ∝ π(x_i^{i+d-1}, x_{i+d}) B_{i+1}(x_{i+1}^{i+d}).   (6)

Proof. Equation (5) is a direct consequence of Theorem 1 a) and Equation (4). For Equation (6) we start by writing P^C(X_{i+d} = x_{i+d} | X_i^{i+d-1} = x_i^{i+d-1}) = P(A | B, C, D) with A = {X_{i+d} = x_{i+d}}, B = {X_i^{i+d-1} = x_i^{i+d-1}}, C = {X_1^{i+d-1} ∈ 𝒳_1^{i+d-1}}, and D = {X_{i+d}^{ℓ} ∈ 𝒳_{i+d}^{ℓ}}. Thanks to Bayes' formula we get that P(A | B, C, D) ∝ P(D | A, B, C) × P(A | B, C). We finally use the Markov property to get P(D | A, B, C) = B_{i+1}(x_{i+1}^{i+d}) and P(A | B, C) = π(x_i^{i+d-1}, x_{i+d}), which achieves the proof.
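As an illustration of Propositions 1-2 and Theorem 2, the following sketch (ours, not from the paper) implements the Forward-Backward recursions in the simplest case d = 1 only; the argument names (`subsets`, `nu`, `pi`) and the explicit indicator masks are our own choices, and the general order-d version would index F and B by d-letter words rather than single letters.

```python
import numpy as np

def constrained_forward_backward(subsets, nu, pi):
    """d = 1 sketch: subsets[i] is the set of allowed symbol indices at position i,
    nu the starting distribution, pi the transition matrix.
    Returns P(X_1^l in X_1^l), the constrained start nu_C and the
    position-dependent constrained transition matrices pi_C (cf. Theorem 2)."""
    l, A = len(subsets), len(nu)
    mask = np.zeros((l, A))
    for i, s in enumerate(subsets):
        mask[i, list(s)] = 1.0
    # Forward: F[i, x] = P(X_i = x, X_1..X_i all compatible with the data)
    F = np.zeros((l, A))
    F[0] = nu * mask[0]
    for i in range(1, l):
        F[i] = (F[i - 1] @ pi) * mask[i]
    # Backward: B[i, x] = P(X_{i+1}..X_l compatible | X_i = x)
    B = np.ones((l, A))
    for i in range(l - 2, -1, -1):
        B[i] = pi @ (B[i + 1] * mask[i + 1])
    prob = F[-1].sum()                     # P(X_1^l in the constraint set)
    nu_C = F[0] * B[0]
    nu_C /= nu_C.sum()
    pi_C = []
    for i in range(l - 1):
        T = pi * (mask[i + 1] * B[i + 1])[None, :]   # proportional to Eq. (6)
        rows = T.sum(axis=1, keepdims=True)
        pi_C.append(np.divide(T, rows, out=np.zeros_like(T), where=rows > 0))
    return prob, nu_C, pi_C
```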
One should note that the reverse sequence X_ℓ ... X_1 is also a heterogeneous order-d Markov model whose parameters can be expressed through the Forward quantities.
3 Estimating the Background Model
Let us denote by θ := (ν, π) the parameters of our order-d Markov model, and by P_θ all probability computations performed using the parameter θ. Since the (log-)likelihood L(θ | X_1^ℓ ∈ 𝒳_1^ℓ) := log P_θ(X_1^ℓ ∈ 𝒳_1^ℓ) may be derived either from the Forward or from the Backward quantities, it is possible to maximize this likelihood numerically to get the Maximum Likelihood Estimator (MLE) θ̂ := arg max_θ L(θ | X_1^ℓ ∈ 𝒳_1^ℓ). We suggest here an alternative approach based on the classical Expectation-Maximization (EM) algorithm for maximum likelihood estimation from incomplete data [4]. To do so, we simply consider that X_1^ℓ ∈ 𝒳_1^ℓ is the observed data, while X_1^ℓ = x_1^ℓ is the unobserved data. We then get the following result:
Proposition 3 (EM algorithm). For any starting parameter θ_0 := (ν_0, π_0), we consider the sequence (θ_j)_{j≥0} defined for all j ≥ 0 by θ_{j+1} := (ν_{j+1}, π_{j+1}) with:

ν_{j+1}(a_1^d) = I_{{a_1^d ∈ 𝒳_1^d}} ν_j(a_1^d) B_1^{θ_j}(a_1^d) / P_{θ_j}(X_1^ℓ ∈ 𝒳_1^ℓ)   (7)

π_{j+1}(a_1^d, b) = [ Σ_{i=1}^{ℓ-d} I_{{a_1^d ∈ 𝒳_i^{i+d-1}, b ∈ 𝒳_{i+d}}} F_i^{θ_j}(a_1^d) π_j(a_1^d, b) B_{i+1}^{θ_j}(a_2^d b) ] / [ Σ_{i=1}^{ℓ-d} I_{{a_1^d ∈ 𝒳_i^{i+d-1}}} F_i^{θ_j}(a_1^d) B_i^{θ_j}(a_1^d) ]   (8)

where F_i^{θ_j} and B_i^{θ_j} denote respectively the Forward and Backward quantities computed with the current value θ_j of the parameter, and with the convention that B_{ℓ-d+1}^{θ_j} ≡ 1. The sequence (θ_j)_{j≥0} converges towards a local maximum of L(θ | X_1^ℓ ∈ 𝒳_1^ℓ).

Proof. This comes from a special application of the EM algorithm [4] where the Expectation step (Step E) consists in computing

Q(θ | θ_j) := Σ_{x_1^ℓ ∈ 𝒳_1^ℓ} P_{θ_j}(X_1^ℓ = x_1^ℓ | X_1^ℓ ∈ 𝒳_1^ℓ) log P_θ(X_1^ℓ = x_1^ℓ)

while the Maximization step (Step M) consists in computing θ_{j+1} = arg max_θ Q(θ | θ_j). Equations (7) and (8) then simply come from a natural adaptation of the classical MLE of order-d Markov chains, using the pseudo-counts that come directly from Theorem 1.
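Again for the d = 1 case only, one EM iteration in the spirit of Proposition 3 can be sketched as follows (ours, not the paper's general order-d update): expected transition counts under the constrained distribution are accumulated from the Forward-Backward quantities and then renormalised. All names are illustrative.

```python
import numpy as np

def em_step(subsets, nu, pi):
    """One EM update for a d = 1 chain observed only through per-position subsets."""
    l, A = len(subsets), len(nu)
    mask = np.zeros((l, A))
    for i, s in enumerate(subsets):
        mask[i, list(s)] = 1.0
    F = np.zeros((l, A))
    B = np.ones((l, A))
    F[0] = nu * mask[0]
    for i in range(1, l):
        F[i] = (F[i - 1] @ pi) * mask[i]
    for i in range(l - 2, -1, -1):
        B[i] = pi @ (B[i + 1] * mask[i + 1])
    Z = F[-1].sum()                                  # P_theta(X_1^l in the constraint set)
    counts = np.zeros((A, A))
    for i in range(l - 1):
        # expected number of x -> y transitions at position i+1 under P^C
        counts += (F[i][:, None] * pi) * (mask[i + 1] * B[i + 1])[None, :] / Z
    nu_new = F[0] * B[0] / Z
    # assumes every letter is visited at least once; add smoothing otherwise
    pi_new = counts / counts.sum(axis=1, keepdims=True)
    return nu_new, pi_new
```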
4 Counting Patterns
Let us consider a finite set W of words over A. We want to count the number of positions where W occurs in our degenerated sequence. Unfortunately, since the sequence itself is not observed, we study instead the number N of matching positions in the random sequence X_1^ℓ under P^C. Thanks to Theorem 2 we hence need to establish the distribution of N over a heterogeneous order-d Markov chain. To do so, we perform an optimal Markov chain embedding of the problem through a Deterministic Finite Automaton (DFA), as suggested in [5; 6; 7; 8]. We use here the notation of [8]. Let (A, Q, s, F, δ) be a minimal DFA recognizing the language4 A*W of all texts over A ending with an occurrence of W (see Figure 1 for an example of such a minimal DFA). Q is a finite state space, s ∈ Q is the starting state, F ⊂ Q is the subset of final states, and δ : Q × A → Q is the transition function. We recursively extend the definition of δ over Q × A* thanks to the relation δ(p, aw) := δ(δ(p, a), w) for all p ∈ Q, a ∈ A, w ∈ A*. We additionally suppose that this automaton is non d-ambiguous5, which means that for all q ∈ Q, δ^{-d}(q) := {a_1^d ∈ A^d : ∃ p ∈ Q, δ(p, a_1^d) = q} is either a singleton or the empty set.
Fig. 1. Minimal DFA recognizing the language of all DNA sequences ending with an occurrence of the IUPAC pattern TTNGT. This DFA has a total of 8 states, s = 0 being the starting state, and F = {7} being the subset of final states. This DFA is 1-ambiguous since one can reach state 0 or state 3 with more than one letter.
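For readers who want to experiment, here is a small sketch (ours, and not the minimal-DFA construction behind Figure 1) that expands an IUPAC pattern into its plain DNA words and builds a DFA recognizing A*W. States are the proper prefixes of the words, so the automaton is correct but not necessarily minimal, and the transitions flagged as "counting" correspond to positions where an occurrence of W ends.

```python
from itertools import product

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
         "S": "CG", "W": "AT", "K": "GT", "M": "AC", "B": "CGT",
         "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}

def expand(pattern):
    """All plain DNA words matched by an IUPAC pattern (e.g. TTNGT -> 4 words)."""
    return {"".join(w) for w in product(*(IUPAC[c] for c in pattern))}

def build_dfa(words, alphabet="ACGT"):
    """DFA over the proper prefixes of the words, recognizing A* W.
    Returns (states, delta, counting) where counting holds the transitions
    on which an occurrence of W ends (the 'counting transitions')."""
    prefixes = {w[:i] for w in words for i in range(len(w))}
    states = sorted(prefixes, key=len)
    delta, counting = {}, set()
    for q in states:
        for a in alphabet:
            s = q + a
            # next state: longest suffix of s that is a proper prefix of some word
            delta[(q, a)] = next(s[i:] for i in range(len(s) + 1) if s[i:] in prefixes)
            if any(s.endswith(w) for w in words):
                counting.add((q, a))
    return states, delta, counting

# Example: count occurrences of TTNGT along a plain DNA text.
# states, delta, counting = build_dfa(expand("TTNGT"))
# q, n = "", 0
# for a in "GTTAGTTTGT":
#     n += (q, a) in counting
#     q = delta[(q, a)]
```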
Theorem 3 (Markov chain embedding). We consider the random sequence (X̃_i) over Q defined by X̃_0 := s and X̃_i := δ(X̃_{i-1}, X_i) for all i, 1 ≤ i ≤ ℓ. Under P^C, (X̃_i)_{i≥d} is a heterogeneous order-1 Markov chain over Q' := δ(s, A^d A*) such that, for all p, q ∈ Q' and 1 ≤ i ≤ ℓ - d, the starting distribution μ_d(p) := P^C(X̃_d = p) and the transition matrix T_{i+d}(p, q) := P^C(X̃_{i+d} = q | X̃_{i+d-1} = p) are given by:

μ_d(p) = ν^C(a_1^d) if ∃ a_1^d ∈ A^d such that δ(s, a_1^d) = p, and μ_d(p) = 0 else;
T_{i+d}(p, q) = π^C_{i+d}(δ^{-d}(p), b) if ∃ b ∈ A such that δ(p, b) = q, and T_{i+d}(p, q) = 0 else.

Since Q_{i+d} contains all counting transitions, we keep track of the number of occurrences by associating a dummy variable y with these transitions. Computing the marginal distribution at the end of the sequence then gives us access to the moment generating function (mgf) of the random number of occurrences (see [5; 6; 7; 8] for more details):

Corollary 1 (Moment generating function). The moment generating function F(y) of the random number N under P^C is given by:

F(y) := Σ_{k=0}^{+∞} P^C(N = k) y^k = μ_d [ ∏_{i=1}^{ℓ-d} (P_{i+d} + y Q_{i+d}) ] 𝟙   (9)

where 𝟙 is a column vector of ones and where, for all 1 ≤ i ≤ ℓ - d, T_{i+d} := P_{i+d} + Q_{i+d} with P_{i+d}(p, q) := I_{q∉F} T_{i+d}(p, q) and Q_{i+d}(p, q) := I_{q∈F} T_{i+d}(p, q) for all p, q ∈ Q'.

4 A* denotes the set of all (possibly empty) texts over A.
5 A DFA having this property is also called a d-th order DFA in [7].
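As a computational note (ours, under the assumption that μ_d and the split matrices P_{i+d}, Q_{i+d} of Corollary 1 have already been built), the coefficients of F(y) up to a chosen maximal count can be obtained by propagating the joint law of (number of occurrences so far, current state) instead of carrying the dummy variable y symbolically:

```python
import numpy as np

def count_distribution(mu_d, P_list, Q_list, max_count):
    """v[k, q] = P^C(k occurrences so far, current state q); returns the law of N.
    Mass corresponding to more than max_count occurrences is discarded."""
    v = np.zeros((max_count + 1, len(mu_d)))
    v[0, :] = mu_d
    for P, Q in zip(P_list, Q_list):       # one (P_{i+d}, Q_{i+d}) pair per position
        new = v @ P                         # transitions without an occurrence
        new[1:, :] += v[:-1, :] @ Q         # counting transitions shift k by one
        v = new
    return v.sum(axis=1)                    # P^C(N = k) for k = 0..max_count
```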
5 Algorithm
The practical implementation of these results requires two main steps: 1) compute the Forward and Backward quantities; 2) compute the mgf using Corollary 1. For the first step, the resulting complexity is O(ℓ) both in space and time. For the second step the space complexity is O(D × |Q'|), where D is the difference between the maximum and the minimum degree of F(y), and the time complexity is O(ℓ × D × |Q'| × |A|) (taking advantage here of the sparse structure of T_{i+d}). Using this approach on a large dataset (e.g., ℓ = 5 × 10^6 or ℓ = 3 × 10^9) may then result in high memory requirements and/or long running times. Fortunately, it is possible to reduce these complexities dramatically when considering degenerated sequences in which most positions are deterministic, as is the case with biological sequences.

Let us denote by I := {1 ≤ i ≤ ℓ : |𝒳_i| > 1} the set of degenerated positions in the sequence. It is then clear that the random state X̃_j is completely deterministic for all j ∈ J := {1 ≤ j ≤ ℓ : j ∉ [i, i + d] for all i ∈ I}. The positions j ∈ J thus contribute in a deterministic way to N with a fixed number of occurrences n.
It hence remains only to take into account the variable part N - n = N_1 + ... + N_k, where the N_i are the independent contributions of each of the k segments of J̄ (the complement of J in {1, ..., ℓ}). If we denote by F_i(y) the mgf of N_i, we get that

F(y) = y^n × ∏_{i=1}^{k} F_i(y)

which dramatically reduces the complexity of the problem. Each F_i(y) may be obtained by a simple application of Corollary 1 to the particular (short) segment of interest, and one only needs to compute the Forward-Backward quantities for that segment. For example, let us consider that the observed IUPAC sequence is x_1^ℓ = AAYGCANGBTAGGCTTATCWATGRT and that d = 2. We have I = {3, 7, 9, 20, 24} and J̄ = [3, 5] ∪ [7, 11] ∪ [20, 22] ∪ [24, 25]. In order to compute F_1(y), F_2(y), F_3(y) and F_4(y), one just needs to know the order d = 2 past before each of the corresponding segments: AA for the first, CA for the second, TC for the third, and TG for the last one.
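The decomposition into short segments around the degenerated positions can be sketched as follows (ours; positions are 1-based, `subsets` is the per-position list of possible letters as in the earlier sketch, and each degenerate position i is taken to affect the window [i, i+d], merging overlapping windows):

```python
def variable_segments(subsets, d):
    """Maximal intervals of positions whose contribution to N is not deterministic."""
    ell = len(subsets)
    degenerate = [i for i, s in enumerate(subsets, start=1) if len(s) > 1]
    segments = []
    for i in degenerate:
        lo, hi = i, min(i + d, ell)
        if segments and lo <= segments[-1][1] + 1:
            segments[-1][1] = max(segments[-1][1], hi)   # merge overlapping windows
        else:
            segments.append([lo, hi])
    return [tuple(seg) for seg in segments]

# variable_segments(subsets("AAYGCANGBTAGGCTTATCWATGRT"), d=2)
# -> [(3, 5), (7, 11), (20, 22), (24, 25)]
```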
6 Discussion
Let us consider the dataset est pro 01, which is described in Table 1. Here is the transition matrix of an order d = 1 homogeneous Markov model over A = {A, C, G, T} estimated on this dataset by MLE (through the EM algorithm):

        ( 0.3337 0.1706 0.2363 0.2595 )
π̂  =   ( 0.2636 0.2609 0.1775 0.2980 )
        ( 0.2946 0.2218 0.2666 0.2169 )
        ( 0.2280 0.2413 0.2106 0.3201 )

Since only 1% of the dataset is degenerated, we observe little difference between this rigorous estimate and one obtained through a rough heuristic (like discarding all degenerated positions in the data). However, this result should not be taken as a rule, especially when considering more degenerated sequences (e.g. with 10% degenerated positions) and/or higher order Markov models (e.g. d = 4). Using this model, it is possible to study the observed distribution of a pattern in the dataset by computing through Corollary 1 the distribution of its random number of occurrences N under the constrained probability P^C. Table 2 compares the number of occurrences obtained by discarding all degenerated positions in the data (Count1) to the observed distribution. Despite the fact that only 1% of the data are degenerated, we can see that there are great differences between our naive approach and the real observed distribution. For example, if we consider the simple pattern GCTA we can see that the naive count of 715 occurrences lies well outside the 90% credibility interval [727, 740], and we have similar results for the other considered patterns.
Table 2. Distribution of patterns in the degenerated IUPAC sequences from est pro 01. Count1 is obtained by discarding all degenerated positions in the dataset, and Count2 by replacing each special letter by the most likely compatible symbol. Since the observed distribution is discrete, percentiles and median are rounded to the closest value.

pattern      Count1  Count2  min   5%-tile  median  95%-tile  max
GCTA         715     732     715   727      733     740       824
TTAGT        197     211     197   201      205     209       253
TTNGT        839     853     853   874      881     889       1005
TRNANNNSTM   472     505     477   488      493     498       535
For more complex patterns like TTNGT the difference between the naive count and the observed distribution is even more dramatic, since 839 does not even belong to the support [853, 1005] of the observed distribution. This is due to the fact that the string TTNGT actually occurs 853 - 839 = 14 times in the dataset. Since our naive approach discards all positions in the data where a symbol other than A, C, G or T appears, these 14 occurrences are omitted. If we now preprocess the dataset by replacing all degenerated symbols by the most frequent letter in the corresponding subset, we get the number of occurrences denoted Count2. While this heuristic seems to give an interesting result for pattern GCTA (a count close to the median), this is unfortunately not the case for the other patterns, for which the method results either in under-counting (pattern TTNGT) or over-counting (patterns TTAGT and TRNANNNSTM). As a general rule, it is usually difficult to predict the bias introduced by a particular heuristic, since it can lead either to under- or over-counting (for example, Count1 always results in under-counting), and this may even depend on the pattern of interest (as with Count2). The rigorous method we have developed here may hence also provide a way to test the statistical properties of a particular heuristic. Finally, let us point out that, thanks to the optimal Markov chain embedding provided by the DFA-based approach presented above, we are able to deal with relatively complex patterns like TRNANNNSTM.
7 Conclusion
In this paper, we provide a rigorous way to deal with the distribution of Markov chains over a finite alphabet A under the constraint that each position X_i of the sequence belongs to a restricted subset 𝒳_i ⊂ A. We provide a Forward-Backward framework to compute marginal distributions and derive from it an EM estimation procedure. We also prove that the resulting constrained distribution is a heterogeneous Markov chain and provide explicit formulas to recursively compute its transition matrix. Thanks to this result, it is possible to apply known DFA-based methods from pattern theory to study the distribution of a pattern of interest in this constrained sequence, hence providing a trustworthy observed distribution for
the pattern's number of occurrences. This information may then be used to derive a p-value p for a pattern by combining p_n, the p-value of observing n occurrences in an unconstrained dataset, with the observed distribution through formulas like p = Σ_n p_n P^C(N = n). One should note that the approach we introduce here may have more applications than just counting patterns in IUPAC sequences. For example, one might use a similar approach to take into account the occurrence positions of known patterns of interest, thus allowing one to derive the distribution of patterns conditionally on a possibly complex set of other patterns. One should also point out that the constraint X_i ∈ 𝒳_i could easily be made more complex, for example by considering a specific distribution over 𝒳_i. For instance, such a distribution may come from the posterior decoding probabilities of a sequencing machine. From the computational point of view, it is essential to understand that the heterogeneous nature of the Markov chain we consider forbids the use of classical computational tricks like matrix power computations. The resulting complexity is hence linear in the sequence length rather than logarithmic. However, one should expect a dramatic improvement of the method by restricting the use of heterogeneous Markov models to the vicinity of degenerated positions, as suggested in Section 5. With such an approach, one might rely on classical pattern matching for 99% of the data, and the method presented above would be restricted to the study of the remaining 1%. Using this computational trick, it hence seems possible to rely on the rigorous exact computation introduced here rather than on a biased heuristic. Finally, we have demonstrated with our example that even a small amount of degenerated data may have huge consequences in terms of pattern frequencies, and may thus affect every subsequent analysis method involving these frequencies, like Markov and hidden Markov model estimations and pattern studies. Considering the possible bias caused by degenerated letters in biological data, and the reasonable complexity of the exact solution we introduce in this paper, our study suggests that the problem of degenerated data in pattern-related analyses should no longer be ignored.
References

[1] IUPAC: International Union of Pure and Applied Chemistry (2009), http://www.iupac.org
[2] EMBL: European Molecular Biology Laboratory Nucleotide Sequence Database (2009), http://www.ebi.ac.uk/embl/
[3] Baum, L.E., Petrie, T., Soules, G., Weiss, N.: A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist. 41(1), 164–171 (1970)
[4] Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Stat. Society, Series B 39(1), 1–38 (1977)
[5] Nicodème, P., Salvy, B., Flajolet, P.: Motif statistics. Theoretical Comp. Sci. 287(2), 593–617 (2002)
[6] Crochemore, M., Stefanov, V.: Waiting time and complexity for matching patterns with automata. Info. Proc. Letters 87(3), 119–125 (2003)
[7] Lladser, M.E.: Minimal Markov chain embeddings of pattern problems. In: Information Theory and Applications Workshop, pp. 251–255 (2007)
[8] Nuel, G.: Pattern Markov chains: optimal Markov chain embedding through deterministic finite automata. J. of Applied Prob. 45(1), 226–243 (2008)
Modelling Stem Cells Lineages with Markov Trees Victor Olariu, Daniel Coca, Stephen A. Billings, and Visakan Kadirkamanathan Department of Automatic Control and Systems Engineering The University of Sheffield UK
Abstract. A variational Bayesian EM with smoothed probabilities algorithm for hidden Markov trees (HMT) is proposed for incomplete tree structured data. The full posterior of the HMT parameters is determined and the underflow problems associated with previous algorithms are eliminated. Example results for the prediction of the types of cells in real stem cell lineage trees are presented.
1 Introduction
The existence of stem cells in inter-convertible sub-states and the kinetics of the cells switching between the sub-states are observed using cell tracking and real-time phenotype monitoring techniques. However, the available technologies are limited and the resulting stem cell lineage trees are incomplete. To confront this problem we use probabilistic techniques for cell lineage tree reconstruction based on observations gathered in real time (cell division rates) and combine this with particular surface antigen expression information gathered at the end of the set of cell divisions being monitored. We take as our starting point Hidden Markov Models (HMMs), which are used in various fields like speech recognition, financial time series prediction [7], natural language processing, ion channel kinetics [17] and general data compression [21]. They have played important roles in the modeling and analysis of biological sequences, in particular DNA [26], [5], and they have proven to be useful tools for statistical signal and image processing. Baum and colleagues developed the core theory of Hidden Markov Models [3]. In 1972 they proposed the forward-backward algorithm as an iterative technique for the maximum likelihood statistical estimation of probabilistic functions of Markov chains. Devijver demonstrated that the computation of joint likelihoods in Baum's algorithm could be converted to the computation of posterior probabilities [11]. The resulting algorithm was similar to Baum's except for the presence of a scaling factor suggested by Levinson et al. [22], which was robust to computational underflow. Further developments in HMMs have been made by MacKay [23], Beal and Ghahramani [4], Watanabe et al. [27], and Ji et al. [19], who apply a variational Bayesian approach to these models.
Hidden Markov tree (HMT) models have been proposed by Crouse et al. for modelling the statistical dependencies between wavelet coefficients in signal processing [9]. They have been applied successfully to image de-noising and segmentation [8], [24], [6], to signal processing and classification [12], [14], and to tree-structured data modelling [15], [5]. The forward-backward algorithm proposed by Baum was transposed to the hidden Markov tree context by Crouse et al. [9]. The resulting algorithm has been called the upward-downward algorithm, but it suffered from computational underflow problems, as in Baum's algorithm. The upward-downward recursions were proposed by Ronen et al. for the E-step in maximum likelihood estimation of dependence tree models with missing observations [25]. The upward-downward algorithm was revisited by Durand et al. [14], who made changes to solve the computational underflow problems by adapting the ideas from Devijver's changes to the forward-backward algorithm. Romberg et al. proposed a Bayesian HMT model for image processing using wavelets [24] and later, Dasgupta and Carin developed the variational Bayesian hidden Markov tree model based on the model proposed by Crouse et al., with a similar application [10]. In this study we derive the variational Bayesian with smoothed probabilities implementation of hidden Markov trees. We extend Durand's HMT framework [14] to the variational Bayesian setting with the critical embodiment of prior probability distributions. Inclusion of prior probability distributions for a class of HMT models, such as in the case of cell lineages, is essential to avoid ill-posedness of the estimation problem. We demonstrate this through an application to modelling stem cell lineages using real data.
2 Hidden Markov Tree Model
A Hidden Markov Tree (HMT) model is composed of an observed random tree X = x_1, ..., x_N and a hidden random tree S = s_1, ..., s_N which has the same indexing structure as the observed tree. S takes values in a set of k discrete states, referred to as 1, ..., k. A distribution P(·) satisfies the HMT property if and only if:

P(X, S) = P(s_1) ∏_{t≠1} P(s_t | s_ρ(t)) ∏_t P(x_t | s_t)   (1)

ρ(t) represents the parent of node t, C(t) is the notation for the children of node t, X_t is the subtree rooted in t and X_{1/t} represents the entire tree except for the subtree rooted in t. The parameters describing the HMT model are similar to the HMM model parameters:

π_j = P(s_1 = j) (the initial hidden state prior),
P_ij = P(s_t = j | s_ρ(t) = i) (the transition probability matrix),
C_jh = P(x_t = h | s_t = j) (the emission probability matrix),

for j = 1, ..., k, where k is the number of possible discrete values of states.
Fig. 1. Hidden Markov Tree (HMT) representation with observed nodes (x) and hidden nodes (s). The straight arrows show the state transitions, curly arrows show emission probabilities.
In the next sections we will refer to all parameters of the HMT model as θ := [π, vec(P), vec(C)]^T, where vec(·) rearranges a matrix into a column vector.
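To fix notation, here is a small sketch (ours, not from the paper) of the HMT joint probability of Eq. (1) for a tree given by parent pointers; pi, P and C are the parameter arrays defined above and all names are illustrative.

```python
import numpy as np

def hmt_joint_log_prob(parent, s, x, pi, P, C):
    """log P(X, S) following Eq. (1): root prior, one transition per non-root
    node, one emission per node; parent[0] is ignored (node 0 is the root)."""
    logp = np.log(pi[s[0]])
    for t in range(1, len(s)):
        logp += np.log(P[s[parent[t]], s[t]])    # P(s_t | s_rho(t))
    for t in range(len(s)):
        logp += np.log(C[s[t], x[t]])            # P(x_t | s_t)
    return logp
```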
3 Maximum Likelihood Estimation for Hidden Markov Tree Model
Considering that the S states are not observable, a popular approach to determine maximum likelihood estimates is the EM algorithm. For the E-step, Crouse et al. realised a direct transposition of the forward-backward algorithm to the Hidden Markov Tree (HMT) context [9]. The result is the upward-downward algorithm, which suffers from underflow problems [16]; [22]. In order to overcome this, Durand et al. [14] proposed the upward-downward algorithm for smoothed probabilities.

Upward recursion for the leaves of the tree:

β_t(j) = C_{j,x_t} P(s_t = j) / N_t,   where N_t = P(x_t) = Σ_j C_{j,x_t} P(s_t = j)   (2)
Upward recursion for non-leaves:

β_t(j) = { ∏_{v ∈ C(t)} β_{t,v}(j) } C_{j,x_t} P(s_t = j) / N_t   (3)

where N_t = Σ_j { ∏_{v ∈ C(t)} β_{t,v}(j) } C_{j,x_t} P(s_t = j)

and β_{ρ(t),t}(j) = Σ_k β_t(k) P_{jk} / P(s_t = k).

Downward recursion:

α_t(j) = (1 / P(s_t = j)) Σ_i P_{ij} β_{ρ(t)}(i) α_{ρ(t)}(i) / β_{ρ(t),t}(i)   (4)
At the M-step, the maximisation of the expectation of the log-likelihood of the complete data, ⟨ln Q(X|θ)⟩, is performed in order to re-estimate the model parameters to be used in the next iteration:

θ^{τ+1} = arg max_θ ⟨ln Q^τ(X|θ)⟩   (5)

where θ^{τ+1} represents the model parameters at iteration τ + 1. The α and β probabilities determined at the E-step are used to find the expression of ⟨ln Q(X|θ)⟩ as a function of the parameter θ of the hidden Markov tree [25]:

⟨ln Q(X|θ)⟩ = Σ_j ⟨P(s_1 = j | X, θ)⟩ ln P(s_1 = j) + Σ_j Σ_k ⟨P(s_t = j | X, θ)⟩ ln P(s_t = j | s_ρ(t) = k)   (6)

where the angled brackets ⟨·⟩ denote the expectation of a conditional probability function with respect to the missing components. Taking the derivative of ⟨ln Q(X|θ)⟩ and equating it to zero gives the new parameters as shown:

θ_t^{i,j} = (1/λ_{j,t}) ( Σ_{v=ρ(leaf)} A^j_{ρ(t),v} + Σ_{v≠ρ(leaf)} B^{i,j}_{t,v} )   (7)

where

A^j_{t,v} = α_t(j) β_t(j) / Σ_j α_t(j) β_t(j)   and   B^{i,j}_{t,v} = α_{ρ(t)}(j) β_{ρ(t),t}(j) β_t(j) θ_t^{i,j} / Σ_j α_t(j) β_t(j).

The upward-downward algorithm mentioned above forms the basis of the E-step of the variational Bayesian EM algorithm with smoothed probabilities developed in the next section.
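For concreteness, here is a sketch (ours) of the smoothed upward pass as reconstructed in Eqs. (2)-(3). It assumes nodes are numbered so that parents precede children, precomputes the marginal state priors P(s_t = j) top-down, and uses variable names of our own choosing; it is an illustration of the recursion, not the paper's implementation.

```python
import numpy as np

def upward_pass(children, parent, x, pi, P, C):
    """Smoothed upward recursion: beta[t, j] = P(s_t = j | observations in subtree t)."""
    n, k = len(x), len(pi)
    prior = np.zeros((n, k))            # P(s_t = j), computed top-down
    prior[0] = pi
    for t in range(1, n):
        prior[t] = prior[parent[t]] @ P
    beta = np.zeros((n, k))
    beta_up = np.zeros((n, k))          # beta_{rho(t),t}(j)
    for t in reversed(range(n)):        # children are processed before parents
        prod = np.ones(k)
        for v in children[t]:
            prod *= beta_up[v]
        unnorm = prod * C[:, x[t]] * prior[t]
        beta[t] = unnorm / unnorm.sum()          # division by N_t, Eqs. (2)-(3)
        if t > 0:
            beta_up[t] = P @ (beta[t] / prior[t])
    return prior, beta, beta_up
```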
4 Variational Bayesian EM with Smoothed Probabilities Algorithm (VBEMS)
The ML approach for estimating the hidden Markov tree (HMT) model parameters produces just a single point estimate, and at the same time ML tends to overfit the data. Solutions to these problems are given by the variational Bayesian approach proposed by [2]. This framework applied to HMTs is able to estimate approximate posterior distributions over the hidden variables and parameters of the model. The computation is closely related to that of the EM algorithm, which guarantees convergence. The algorithm proposed here for HMTs uses the same strategy as the algorithms proposed for HMMs by [23] and by [4]. The variational Bayesian framework in this study is adapted to HMTs in a similar fashion to [10], but instead of using the simple upward-downward algorithm for the E-step we use the upward-downward method with smoothed probabilities. The aim is to determine the posterior probability distribution of the hidden variables S and parameters θ based on the set of observed variables X. For most model structures the exact deduction of hidden variables and parameters based on observed variables is not possible, therefore a variational distribution q(S, θ) which approximates the true posterior distribution must be obtained [4]. The log-marginal probability of the observed variables X can be decomposed as:

ln P(X) = L(q(S, θ)) + KL(q(S, θ) || P(S, θ | X))   (8)

where

L(q(S, θ)) = ∫ dθ Σ_S q(S, θ) ln { P(X, S | θ) / q(S, θ) }   and
KL(q(S, θ) || P(S, θ | X)) = - ∫ dθ Σ_S q(S, θ) ln { P(S, θ | X) / q(S, θ) }
We consider q(S, θ) to be a member of the conjugate-exponential family and we seek the member of this family for which the KL divergence [20] between the variational posterior distribution approximation and the true posterior distribution is minimised. Minimising the KL divergence with respect to q(S, θ) is equivalent to maximising the lower bound L(q(S, θ)). To achieve tractability we make the assumption that P(S, θ | X) ≈ q(θ) q(S) [4]:

L(q(θ), q(S)) = ∫ dθ Σ_S q(θ, S) ln { p(θ) P(S, X | θ) / q(θ, S) }
             = ∫ dθ q(θ) [ ln { p(θ) / q(θ) } + Σ_S q(S) ln { P(S, X | θ) / q(S) } ]   (9)

where p(θ) represents the prior distribution of the parameters and q(S, θ) are the variational posterior distributions. The prior distribution is restricted to the factorisation p(θ) = p(π) p(P_ij) p(C_jh). We choose the parameter priors over π, the rows of P_ij and the rows of C_jh to be Dirichlet distributions:
p(π) = ( Γ(U) / ∏_j Γ(u_j) ) ∏_j π_j^{u_j - 1}   (10)

where U = Σ_j u_j is the strength of the prior and the hyperparameters u_j are subject to the constraint u_j > 0. The Dirichlet distributions have the advantage that they are conjugate to the complete-data likelihood terms, and they are appropriate for our model parameters, which are probabilities and hence restricted to the interval [0, 1]. The variational posterior distributions have the same form as the priors, with hyperparameters incremented by statistics of the observations and hidden states. At the E-step, the posterior distribution over the hidden nodes is computed by calculating the solution of δL(q)/δq(S) = 0:

ln q(S) = s_1 ⟨ln π⟩_{q(π)} + Σ_t s_{t-1} ⟨ln P_ij⟩_{q(P_ij)} s_t + Σ_t s_t ⟨ln C_jh⟩_{q(C_jh)} x_t - Z   (11)

where Z is a normalisation constant. The expression of the updated parameters is:

θ̂ = (π̂, P̂_ij, Ĉ_jh) = ( exp⟨ln π⟩_{q(π)}, exp⟨ln P_ij⟩_{q(P_ij)}, exp⟨ln C_jh⟩_{q(C_jh)} )   (12)
Based on the result ∫ dπ Dir(π; u) ln π_j = ψ(u_j) - ψ(Σ_{j=1}^{k} u_j), where ψ is the digamma function, we calculate the expectation of the logarithm of the parameters under the Dirichlet distributions:

π̂_j = exp[ ψ(ω_j^π) - ψ(Σ_{j=1}^{k} ω_j^π) ]   (13)

P̂_ij = exp[ ψ(ω_j^{P_ij}) - ψ(Σ_{j=1}^{k} ω_j^{P_ij}) ]   (14)

Ĉ_jh = exp[ ψ(ω_j^{C_jh}) - ψ(Σ_{j=1}^{k} ω_j^{C_jh}) ]   (15)

where k represents the number of possible discrete values of states.
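The exponentiated digamma updates of Eqs. (13)-(15) are one-liners with SciPy; this small helper (ours, with an illustrative name) computes exp(⟨ln θ⟩) row-wise for Dirichlet-distributed parameter rows, i.e. the sub-normalised parameters that are fed to the E-step below.

```python
import numpy as np
from scipy.special import digamma

def dirichlet_log_expectation(omega):
    """exp(E[ln theta]) for each row theta ~ Dirichlet(omega_row), cf. Eqs. (13)-(15)."""
    omega = np.atleast_2d(np.asarray(omega, dtype=float))
    return np.exp(digamma(omega) - digamma(omega.sum(axis=1, keepdims=True)))
```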
For the expectation step, we use the upward-downward algorithm with smoothed probabilities, run with the sub-normalised parameters, for which the normalisation constants change accordingly. In this way numerical stability is guaranteed and we are able to determine the β and α probabilities necessary for the maximisation step:

β_t(j) = { ∏_{v ∈ C(t)} β_{t,v}(j) } Ĉ_{j,x_t} P(s_t = j) / N_t   (16)

where N_t = Σ_j { ∏_{v ∈ C(t)} β_{t,v}(j) } Ĉ_{j,x_t} P(s_t = j)

and β_{ρ(t),t}(i) = Σ_k β_t(k) P̂_{ik} / P(s_t = k),

α_t(j) = (1 / P(s_t = j)) Σ_i P̂_{ij} β_{ρ(t)}(i) α_{ρ(t)}(i) / β_{ρ(t),t}(i)   (17)
The M-step involves the calculation of the variational posterior distribution of each parameter of the HMT model by solving δL(q)/δq(θ) = 0. These are Dirichlet distributions, and they are functions of expected values which can be determined using the upward and downward probabilities from the E-step. The expressions are similar to those used in the original variational Bayesian algorithm for hidden Markov models, the difference being in the expectations, which are functions of the smoothed α and β probabilities. The M-step results in:

q(π) = Dir(π_{1:k}; ω^π_{1:k}),   ω^π_j = u^π_j + ⟨s_1 = j⟩_{q(S)}   (18)

q(P_ij) = ∏_{i=1}^{k} Dir(P_{i,1:k}; ω^{P_ij}_{i,1:k}),   ω^{P_ij}_{ij} = u^{P_ij}_j + ⟨s_{τ-1} = i, s_τ = j⟩_{q(S)}   (19)

q(C_jh) = ∏_{j=1}^{k} Dir(C_{j,1:p}; ω^{C_jh}_{j,1:p}),   ω^{C_jh}_{jh} = u^{C_jh}_h + ⟨s_τ = j, x_τ = h⟩_{q(S)}   (20)

The variational posterior distributions have the same functional form as the Dirichlet priors. The hyperparameters are equal to the sum of the strength of the prior distribution and statistics of the hidden states and observations, which are functions of the α and β probabilities determined at the E-step.
5 Experimental Results
At this moment the technology for monitoring cell divisions is not able to determine the types of cells with respect to the SSEA3 marker antigen at all levels of division in a stem cell lineage tree. SSEA3 is a cell surface antigen which is rapidly down-regulated as human Embryonic Stem (hES) cells differentiate [13]. The challenge for scientists is to reconstruct the cell lineage trees based on the observations gathered from experimental data. Frumkin et al. reconstructed cell lineage trees based on somatic microsatellite mutation rates [18]. In this study the reconstruction of the lineage trees is based on observations of the SSEA3 expression level of the cells.
Fig. 2. Experimental stem cell lineage tree where light grey cells are positive definite, black cells are negative definite, the cross shape cells are dead cells
The lineage trees were obtained from a purified population of SSEA3-negative NTERA2 stem cells. The pluripotent embryonal carcinoma (EC) cell line NTERA2 represents a human embryonic stem cell (hES) substitute [1]. The EC cells were subjected to time-lapse microscopy for seventy-two hours. After this time, the cells were sacrificed and examined by immunofluorescent labelling for SSEA3 expression. Cell division in relation to time was obtained from the time-lapse images and annotated in the form of lineage trees. The outcome of the time-lapse experiment consists of a data set of 30 stem cell lineage trees in which the cells' expression of SSEA3 can only be observed at the leaf and the root levels, as can be seen in Figure 2. In the experimental lineage trees used here the stem cells can be either SSEA3-positive, SSEA3-negative or dead. We estimated the hidden Markov tree model parameters using the variational Bayesian approach with smoothed probabilities. The VBS-HMT model is applied to incomplete stem cell lineage tree data. The experimental data used in this study consist of 30 lineage trees, where just the type of cells at the start and at the end of each tree is known. The model developed here confronts the challenge of stem cell lineage tree reconstruction by determining the most likely state tree corresponding to the observed stem cell lineage tree. Using the proposed model we predicted the presence or absence of SSEA3 expression at the unobserved positions within the trees; see Figure 3.
Fig. 3. Diagram representing complete stem cell lineage trees predicted by the VBS-HMT model. The light grey cells are SSEA3-positive cells, black cells are SSEA3-negative cells, and the cross-shaped cells are dead cells.
In several lineage trees our model predicted that SSEA3-negative cells have SSEA3-positive progeny. This suggests that NTERA2 stem cells could regain SSEA3 expression, i.e., the transition from SSEA3-negative to SSEA3-positive is possible. Our conclusion has been validated by the real stem cell experiment, in which a percentage of the root cells which were SSEA3-negative stem cells produced only SSEA3-positive progeny.
6 Conclusion
In this paper we developed the variational Bayesian expectation maximisation with smoothed probabilities for the hidden Markov tree model (VBS-HMT) and applied it to incomplete tree-structured data. The model proved to be superior to the maximum likelihood approach and to the classical variational Bayesian method when tested on the prediction of the type of cells at each division level within a lineage tree, as well as on the estimation of model parameters. We succeeded in confronting the underflow problems by combining the variational Bayesian method with the upward-downward algorithm with smoothed probabilities as an expectation step in the EM context. The resulting algorithm was demonstrated to have superior performance over the competing approaches and
was applied to the real stem cell lineage modelling problem. The VBS-HMT model provides the means to objectively predict a cell's phenotype from knowing the phenotype of the cells at the root and leaf levels within the cell lineage tree. It is important to note that the proposed inference algorithm is able to predict novel behaviours based on incomplete data, which are not directly observable. These predictions can subsequently be validated by targeted experiments.

Acknowledgments. The authors acknowledge that this work was supported by the Engineering and Physical Sciences Research Council (EPSRC).
References 1. Andrews, P.W.: Retinoic acid induces neuronal differentiation of a cloned human embryonal carcinoma cell line in vitro. Dev. Biol. 103, 285–293 (1984) 2. Attias, H.: A variational Bayesian framework for graphical models. In: Advances in Neural Information Processing Systems, vol. 12, pp. 209–215 (2000) 3. Baum, L.E., Petrie, T., Soules, G., Weiss, N.: A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics 41(1), 164–171 (1970) 4. Beal, M., Ghahramani, Z.: The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. Bayesian Statistics 7, 453–464 (2003) 5. Beerenwinkel, N., Drton, M.: A mutagenetic tree hidden Markov model for longitudinal clonal HIV sequence data. Biostat. 8(1), 53–71 (2007) 6. Bharadwaj, P., Carin, L.: Infrared-image classification using hidden Markov trees. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(10), 1394– 1398 (2002) 7. Bulla, J., Bulla, I.: Stylized facts of financial time series and hidden semi-Markov models. Computational Statistics Data Annals 51(4), 2192–2209 (2006) 8. Choi, H., Baraniuk, R.G.: Multiscale image segmentation using wavelet-domain hidden Markov models. IEEE Transactions on Image Processing 10, 1309–1321 (2001) 9. Crouse, M., Nowak, R., Baraniuk, R.: Wavelet-based statistical signal processing using hidden Markov models. IEEE Transactions on Signal Processing (1997) 10. Dasgupta, N., Carin, L.: Texture analysis with variational hidden Markov trees. IEEE Transactions on Signal Processing 54(6), 2353–2356 (2006) 11. Devijver, P.A.: Baum’s forward-backward algorithm revisited. Pattern recognition Letters 3, 369–373 (1985) 12. Diligenti, M., Frasconi, P., Gori, M.: Hidden Markov tree models for document image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(4), 519–523 (2003) 13. Draper, J.S., Pigott, C., Thomson, J.A., Andrews, P.W.: Surface antigens of human embryonic stem cells: changes upon differentiation in culture. Journal of Anatomy 200, 249–258 (2002) 14. Durand, J.-B., Goncalves, P., Guedon, Y.: Computational methods for hidden Markov tree models-an application to wavelet trees. IEEE Transactions on Signal Processing 52(9), 2551–2560 (2004)
15. Durand, J.-B., Gu´edon, Y., Caraglio, Y., Costes, E.: Analysis of the plant architecture via tree-structured statistical models: The hidden Markov tree models. New Phytologist 166, 813–825 (2005) 16. Ephraim, Y., Merhav, N.: Hidden Markov processes. IEEE Transaction on Informormation Theory 48, 1518–1569 (2002) 17. Fredkin, D.R., Rice, J.A.: Fast evaluation of the likelihood of an HMM: Ion channel currents with filtering and colored noise. IEEE Transactions on Signal Processing 49, 625–633 (1997) 18. Frumkin, D., Wasserstrom, A., Kaplan, S., Feige, U., Shapiro, E.: Genomic variability within an organism exposes its cell lineage tree. PLoS Computational Biology 1, 382–394 (2005) 19. Ji, S., Krishnapuram, B., Carin, L.: Variational Bayes for continuous hidden Markov models and its application to active learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(4), 522–532 (2006) 20. Kullback, S., Leibler, R.A.: On information and sufficiency. The Annals of Mathematical Statistics 22(1), 79–86 (1951) 21. Lee, D.-S.: Substitution deciphering based on HMMs with applications to compressed document processing. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(12), 1661–1666 (2002) 22. Levinson, S.E., Rabiner, L.R., Sondhi, M.M.: An introduction to the application of the theory of probabilistic functions of a Markov process in automatic speech recognition. Bell System Technology J. 62, 1035–1074 (1983) 23. Mackay, D.J.C.: Ensemble learning for hidden Markov models (1997) 24. Romberg, J.K., Choi, H., Baraniuk, R.G.: Bayesian tree-structured image modeling using wavelet-domain hidden Markov models. IEEE Transactions on Image Processing 10, 1056–1068 (2001) 25. Ronen, O., Rohlicek, J., Ostendorf, M.: Parameter estimation of dependence tree models using the EM algorithm. IEEE Signal Processing Letters 2(8), 157–159 (1995) 26. Schliep, A., Costa, I.G., Steinhoff, C., Schnhuth, A.: Analyzing gene expression time-courses. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2(3), 179–193 (2005) 27. Watanabe, S., Minami, Y., Nakamura, A., Ueda, N.: Variational Bayesian estimation and clustering for speech recognition. IEEE Transactions on Speech Audio Process 12, 365–381 (2004)
Bi-clustering of Gene Expression Data Using Conditional Entropy Afolabi Olomola1 and Sumeet Dua1,2 1
Data Mining Research Laboratory (DMRL), Department of Computer Science Louisiana Tech University, Ruston, LA, U.S.A. 2 School of Medicine, Louisiana State University Health Sciences, New Orleans, LA, U.S.A. {aol003,sdua}@latech.edu
Abstract. The inherent sparseness of gene expression data and the rare exhibition of similar expression patterns across a wide range of conditions make traditional clustering techniques unsuitable for gene expression analysis. Biclustering methods currently used to identify correlated gene patterns based on a subset of conditions do not effectively mine constant, coherent, or overlapping biclusters, partially because they perform poorly in the presence of noise. In this paper, we present a new methodology (BiEntropy) that combines information entropy and graph theory techniques to identify co-expressed gene patterns that are relevant to a subset of the sample. Our goal is to discover different types of biclusters in the presence of noise and to demonstrate the superiority of our method over existing methods in terms of discovering functionally enriched biclusters. We demonstrate the effectiveness of our method using both synthetic and real data. Keywords: Gene expression, biclustering, conditional entropy.
1 Background

A major challenge in the analysis of gene expression datasets is the discovery of local structures composed of gene sets that show coherent expression patterns across subsets of experimental conditions. These patterns may provide clues about the biological processes associated with physiological states. Recently, researchers have focused on using biclustering methods to find local patterns in which genes in a subset might be similar based only on a condition subset. Hartigan first defined biclustering as a distinct class of clustering algorithms that perform simultaneous row-column clustering [1]. Cheng and Church first applied biclustering to analyze DNA microarray experimental data [2]. They proposed a greedy algorithm to find a given number of δ-biclusters whose mean squared residues are less than given thresholds. Kupiec et al. [3] presented SAMBA, a graph-theory approach combined with a statistical data model. In the SAMBA framework, the expression matrix is modeled as a bipartite graph and a likelihood score is used to assess the significance of observed subgraphs. The Order-Preserving Sub-Matrix (OPSM) algorithm [4] defines a bicluster as a submatrix that preserves the order of the selected columns for all selected rows. Based on
a stochastic model, the authors developed a deterministic algorithm to find large and statistically significant biclusters. The Iterative Signature Algorithm (ISA) [5] uses gene signatures and condition signatures to find biclusters with both up- and down-regulated expression values. Murali and Kasif (2003) [6] have proposed xMotif, a framework that seeks biclusters containing genes that are expressed across the selected samples. The method involves an iterative search that runs on random seeds to identify the largest valid biclusters. Zimmerman et al. (2006) [7] have proposed BiMax to find constant biclusters by discretizing the input expression matrix into a binary matrix. However, this discretization makes it harder to determine coherent biclusters. These biclustering methods are not capable of handling noise and discovering several types of biclusters in gene expression data. In this paper, we present a new biclustering method that combines information entropy (conditional entropy) and graph theory techniques to identify co-expression patterns of genes that might be relevant to a subset of the conditions. This method is motivated by the use of conditional entropy to measure interdependence between pairs of entities. The measure of interdependence between pairs of conditions and genes helps to predict the presence of a good cluster. Our method transforms the conditional entropy between pairs of conditions into an unweighted graph and reduces the search for groups of related conditions to the problem of finding maximal cliques. This experiment has two objectives: to show that BiEntropy can find constant, coherent, and overlapping biclusters even if the cluster contains noise, and to demonstrate the superiority of BiEntropy over existing biclustering methods in terms of identifying meaningful gene groups related to GO categories. Our biclustering method consists of the following steps: (1) normalization and discretization, (2) generation of the conditional entropy matrix, (3) construction of an unweighted graph, (4) finding of maximal cliques, and (5) identification of biclusters. The experimental results on both synthetic and real data (Saccharomyces cerevisiae and NCI 60 datasets) demonstrate the effectiveness of BiEntropy in discovering artificially embedded biclusters as well as biologically significant biclusters with high precision.
2 Definitions and Notations

In this section, we present a general definition of a bicluster. Let G = {g_1, ..., g_M} be a set of genes (rows), and let C = {c_1, ..., c_N} be a set of conditions (columns). The data can be viewed as an M × N expression matrix EM, where rows signify genes and columns signify experimental conditions. EM is a matrix of real numbers, where each entry g_ij corresponds to the logarithm of the relative abundance of the mRNA of a gene g_i under a specific condition c_j. A bicluster corresponds to a sub-matrix that exhibits some coherent tendency. Each bicluster can be identified by a unique set of genes and experimental conditions that determine the sub-matrix. Thus, a bicluster is a matrix I × J, denoted as (I, J), where I and J are a set of genes (rows) and conditions (columns), respectively. In this case, I ⊆ M and J ⊆ N. We define the volume or size of a bicluster (I, J) as the number of elements g_ij such that i ∈ I and j ∈ J.
A sub-matrix A(I, J) with I ⊆ M and J ⊆ N is a constant bicluster for a reference gene i* if for any i ∈ I and any j ∈ J, g_ij = g_{i*j}. A sub-matrix A(I, J) with I ⊆ M and J ⊆ N is an additive bicluster for a reference gene i* if for any i ∈ I and j ∈ J, g_ij - g_{i*j} = θ_i, where θ_i is a constant for any row i.
Fig. 1. Example of types of biclusters: (a) constant biclusters, (b) coherent (additive), (c) overlapping
3 Materials and Method

Before normalizing gene expression data, we temporarily remove data beyond a threshold (three standard deviations) to reduce the effect of outliers. Then, we linearly normalize each condition (column) of data to a mean of 0 and a variance of 1. We repeat the procedure until no outliers remain. Next, we assign the temporarily removed outlier values to the corresponding extreme value of the final normalized data (minimum for outliers below the mean, maximum for outliers above the mean). We discretize each gene expression level into intervals by uniformly dividing the difference between the maximum and minimum values in the normalized data.

3.1 Generation of Conditional Entropy Matrix

We generate a symmetric matrix by finding the conditional entropy between all pairs of conditions (columns) in the discretized data. The conditional entropy measures the mutual interaction between pairs of conditions (columns) and predicts each pair's ability to form a good cluster. Higher conditional entropy between a pair of conditions indicates a lower possibility of forming a significant cluster. Therefore, a low conditional entropy value between two conditions denotes the presence of a clustering relationship between the two conditions.

Lemma 1 (Conditional Entropy). Let {c_1^1, c_1^2, c_1^3, ..., c_1^K} and {c_2^1, c_2^2, c_2^3, ..., c_2^K} be the sets of intervals in conditions (columns) c_1 and c_2, respectively.
We set the conditional entropy for condition c_1 given c_2 as:

H(c_1 | c_2) = - Σ_{k=1}^{K} P(c_1^k) Σ_{l=1}^{K} P(c_1^l | c_2^k) log P(c_1^l | c_2^k).   (1)
P(c_1^k) is the probability of data at interval k of sample c_1, and P(c_1^l | c_2^k) is the conditional probability of a data point in interval l of sample c_1 given a data point in interval k of column c_2.

Lemma 2 (Conditional Entropy Matrix). Since H(c_j | c_i) ≠ H(c_i | c_j), the measure of dependence between c_i and c_j is represented in symmetric matrix form (M_c) as:

M_c(c_i, c_j) = H(c_i | c_j) × H(c_j | c_i) for all c_i, c_j ∈ C with i ≠ j, and M_c(c_i, c_j) = 0 else.   (2)
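A direct sketch of Lemmas 1 and 2 follows (ours; the columns are assumed to be already discretized into labels 0..K-1, and the outer weight P(c_1^k) follows Eq. (1) as printed):

```python
import numpy as np

def conditional_entropy(c1, c2, K):
    """H(c1 | c2) of Lemma 1 for two columns of discretized interval labels."""
    joint = np.zeros((K, K))
    for a, b in zip(c1, c2):
        joint[a, b] += 1
    joint /= joint.sum()
    p1 = joint.sum(axis=1)                     # P(c1 = k)
    p2 = joint.sum(axis=0)                     # P(c2 = k)
    h = 0.0
    for k in range(K):                         # interval of c2 (and outer weight index)
        for l in range(K):                     # interval of c1
            if joint[l, k] > 0:
                cond = joint[l, k] / p2[k]     # P(c1 = l | c2 = k)
                h -= p1[k] * cond * np.log(cond)
    return h

def entropy_matrix(D, K):
    """Symmetric matrix M_c of Eq. (2): product of the two conditional entropies."""
    n = D.shape[1]
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            M[i, j] = M[j, i] = (conditional_entropy(D[:, i], D[:, j], K) *
                                 conditional_entropy(D[:, j], D[:, i], K))
    return M
```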
3.2 Construction of Unweighted Graph
In order to map the generated symmetric matrix to an unweighted graph, we transform the matrix by replacing all values greater than the entropy threshold with 0 and those less than or equal to the threshold with 1. We consider a graph G(V, E) with the nodes V being the set of conditions and the edges E as described below:

E_ij = 1 if M_c(c_i, c_j) ≤ entropy threshold(λ),   E_ij = 0 else.   (3)

E_ij = 1 denotes a clustering relationship between conditions c_i and c_j due to the conditional entropy between the two conditions being lower than the threshold (λ), and E_ij = 0 if there is no edge. We estimate the entropy threshold (λ) as:

Entropy threshold(λ) = M_c^min + β (M_c^max - M_c^min)   (4)

where M_c^min and M_c^max are the minimum and maximum values in the conditional entropy matrix M_c, respectively, and β is an entropy ratio with values ranging from 0.0 to 1.0.
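Continuing the sketch, the threshold of Eq. (4) and the adjacency relation of Eq. (3) translate directly (ours; we take the minimum and maximum over the off-diagonal entries of M_c, since the diagonal is fixed to zero, which is an assumption on our part):

```python
import numpy as np

def build_graph(M, beta):
    """Boolean adjacency matrix: edge iff M_c(c_i, c_j) <= lambda, Eqs. (3)-(4)."""
    n = M.shape[0]
    off_diag = M[~np.eye(n, dtype=bool)]
    lam = off_diag.min() + beta * (off_diag.max() - off_diag.min())   # Eq. (4)
    return (M <= lam) & ~np.eye(n, dtype=bool)
```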
3.3 Finding the Cliques

We enumerate the maximal cliques in the graph to identify the groups of experimental conditions in which relevant biclusters can be located. We utilize the Bron-Kerbosch maximal clique algorithm described in [13]. The algorithm operates by means of a backtracking tree search. It maintains three disjoint sets of nodes R, P, X: R represents the currently growing clique; P represents the prospective nodes, which are connected to all nodes in R; X contains nodes already processed, i.e., nodes which were previously in P and for which all maximal cliques containing them have already been reported. All nodes which are connected to every node of R are either in P or X. The purpose of this work is to present a framework into which any effective clique enumeration algorithm can be plugged.
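The basic (pivot-free) Bron-Kerbosch recursion described above fits in a few lines; this sketch (ours) takes the graph as a dict mapping each condition index to the set of its neighbours:

```python
def bron_kerbosch(neighbours, R=None, P=None, X=None, cliques=None):
    """Report every maximal clique: R is the growing clique, P the candidate
    extensions, X the already-processed vertices."""
    if R is None:
        R, P, X, cliques = set(), set(neighbours), set(), []
    if not P and not X:
        cliques.append(set(R))          # R cannot be extended: maximal clique
    for v in list(P):
        bron_kerbosch(neighbours, R | {v}, P & neighbours[v], X & neighbours[v], cliques)
        P.remove(v)
        X.add(v)
    return cliques
```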
Algorithm. Biclustering with Entropy (BiEntropy)

Input: an M-by-N discretized gene expression matrix D; entropy ratio β; number of discretization intervals K; minimum number of genes g_min.
// G = {g_1, ..., g_M} denotes the set of genes (rows) and C = {c_1, ..., c_N} the set of conditions (columns)
Output: biclusters {(G_1, C_1), ..., (G_m, C_m)}

// Generate the pairwise conditional entropy matrix
M_c(c_i, c_j) ← H(c_i | c_j) × H(c_j | c_i) for all c_i, c_j ∈ C with i ≠ j; M_c(c_i, c_i) ← 0
// Construct the unweighted graph
G(V, E) ← {}; V = {c_1, c_2, ..., c_N}; E_ij ← 1 if M_c(c_i, c_j) ≤ entropy threshold(λ), E_ij ← 0 else
// Find the cliques
CQ ← Bron-Kerbosch(∅, G(V, E), ∅)
// Identify the biclusters
Biclusters ← ClusterIdentification(CQ, D, g_min)
After identifying a group of correlated conditions through clique enumeration, we perform bicluster identification by identifying a group of genes correlated across the condition using conditional entropy, as described below. K
K
k =1
l =1
H ( g1 g2 ) = −∑ P( g1k ) ∑ P( g1l g2k ) log P( g2l g1k )
.
(5)
Where P( g1k ) is the probability of data in discretization level k of gene g1 , and P( g1l g 2k ) is the conditional probability of the data point in the interval l gene g1
given a data point in interval k of gene g 2 . The two genes are said to be strongly dependent if H ( g1 g2 ) = H ( g1 g2 ) = 0 . We identify biclusters for each group of related conditions through the following procedure: 1. Generate a sub-matrix M ′ from existing discretized data, so that it has values of all genes that make up a clique, 2. Iteratively perform the following sub-steps until all the genes have been clustered,
Bi-clustering of Gene Expression Data Using Conditional Entropy
249
a. Randomly select a gene gi from M ′ , and estimate the conditional entropy gi and each of the genes in M ′ , b. Identify genes with the conditional entropy equal to zero with gene gi and assign them to a cluster. 3. Remove identified genes from M ′ .
4 Complexity Analysis In the first stage, normalized expression levels of each gene are discretized into linear intervals M and N , O( MN ) . In the second stage, the conditional entropy matrix, which serves as an input to the clique graph is computed. We estimate conditional entropy among all pairs of conditions (samples) and represent the relationship as a graph. We have potential worst-case complexity of O( N 2 ) . In the third stage, we find the maximal cliques. If the number of cliques is small or the cliques are relatively small and disjointed, this operation is linear in N . In most cases, it will not significantly affect the overall efficiency by exceeding O( N 2 ) . Having found the cliques, stage four, involves the identification of biclusters through linear or log-linear experimental conditions N and intervals K . Overall, assuming the maximal clique enumeration does not hit a ‘hard’ graph, the efficiency of the whole algorithm is O ( N ( M + N + K )) .
5 Implementation We implemented BiEntropy using MATLAB and input the normalized gene expression matrix and two parameters: K and β , where K is the number of discretization intervals and β is an entropy ratio that ranges from 0 to 1. We can choose K = {3,5,7,9} for the discretization level.
6 Experimental Results Our objective is to show that BiEntropy discovers both constant and coherent (additive) biclusters with respect to noise level and overlap among biclusters when compared with other biclustering algorithms. In addition, we intend to demonstrate our algorithm’s ability to find biclusters with biological significance in gene expression data. Other algorithms include CC [2], SAMBA [3], ISA [5], and Bimax [7]. Biclustering analysis tools (BICAT) developed by Prelic et al. [10] were used to implement Bimax, ISA, CC, OPSM [4], and xMotif [6]. EXPANDER software developed by Maron-Katz et al. [11] was also used to implement SAMBA. The parameters of these algorithms were set to the values recommended in the corresponding publications.
250
A. Olomola and S. Dua
6.1 Synthetic Dataset
Our model for the generation of synthetic gene expression data is based on the proposal from Zimmerman et al. [7]. This dataset includes data matrices with three types of artificial implanted modules: constant, coherent (additive), and overlapping. For the constant situation, the matrix with implanted constant bicluster is generated in four steps. (1) Generate a 100*100-background matrix A so that all elements of A are 0. (2) Generate ten biclusters of size10 *10, so that all elements of the biclusters are 1. (3) Add noise to the biclusters by adding random values from uniform distribution (−σ , σ ) . (4) Implant 10-biclusters to A without overlap. We define the noise level from 0.0 to 0.25 for all experimentation. Ten modules (biclusters) are implanted to background matrix A . We use ten degrees of overlap (d = 0,1..........9) , where the size of the background matrix and modules vary from 100 × 100 to 110 × 110 and from 10 ×10 to 20 × 20 , respectively. In coherent (additive) data, the procedure is the same as that of the constant data type, but we let the biclusters have a 0.02 increasing trend on the rows and columns. To validate the accuracy of our algorithm, we apply the gene match score proposed by Prelic et al. [5]. Let M1 and M 2 be two sets of biclusters. The match score of M1 with respect to M 2 is: SG ( M 1 , M 2 ) =
G1 ∩ G2 1 , ∑ max(G2 , C2 ) ∈ M 2 M1 (G ,C )∈M1 G1 ∪ G2 1 1
(5)
where G and C are a set of genes and a set of conditions in a bicluster, respectively. This score measures the degree of similarity between the computed biclusters and the true transcribed modules implanted in the synthetic data. Let M opt be the set of implanted modules, and let M be the set of biclusters obtained by a biclustering algorithm. The average relevance, S ( M , M opt ) , represents the extent to which the generated biclusters match true modules in the gene dimension. In contrast, the average module recovery, given by S ( M opt , M ) quantifies how well each true bicluster is recovered by the biclustering algorithm under consideration. Both scores take the maximum value of 1 if M opt = M . 6.2 Parameter Selection
The two parameters needed to implement our algorithm are discretization interval ( K ) and Entropy ratio ( β ) . Since the entropy threshold depends on the entropy ratio values, we implement our biclustering method using entropy ratio values between 0 and 1. We use a synthetic dataset with 100*100 matrix and implant 10 non-overlapping 10*10 constant biclusters. Figure 2 shows the high performance of the algorithms at entropy ratio value between 7.5 and 9.0 on three discretization intervals. Out of three discretization interval trials, 5-interval discretization gives the best average match score in our implementation.
Bi-clustering of Gene Expression Data Using Conditional Entropy
Enrichment with GO Biological Process
7-interval
α=0.001%
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
0.95
1
20 0 O PS M
0.1
40
Entropy Ratio
C
0
60
xM ot if
0.2
α=5%
C
0.4
α=1%
80
IS A
0.6
α=0.5%
100
B ie nt ro py
The proportion of bicluster per sigf level(%)
Avg Match Score
1 0.8
α=0.1%
120
Sa m ba
3-interval
1.2
B im ax
5-interval
251
Biclus tering Algorithm
Fig. 2. Effect of Entropy ratio significantly enriched by GO
Fig. 3. The proportion of bicluster
6.3 Effect of Noise
To show the performance of BiEntropy against noise, we summarize the results on both constant and additive data in Figures 4a and 4b. For constant biclusters, BiEntropy, ISA, and Bimax show high accuracy in the absence of noise, whereas the relevance and recovery scores obtained by CC and xMotif are low. This occurs because these algorithms do not focus on changes in gene expression but consider only the similarity of the selected values as the clustering criterion. When the noise level is high, ISA and BiEntropy have the best accuracies. The performance of the algorithms in Figure 4b shows that only three algorithms (BiEntropy, ISA, and SAMBA) are robust against noise in additive biclusters. The figure also shows that
[Figure 4 appears here: four panels plotting average match score against noise level (0 to 0.25 for the constant data in (a), 0 to 0.1 for the additive data in (b)); the top panels show relevance of biclusters and the bottom panels show recovery of modules.]
Fig. 4. Results of experiments on synthetic dataset: non-overlapping module with increasing noise level for (a) constant and (b) additive biclusters
Bimax has a high accuracy with constant biclusters but performs poorly on the additive data type at a high noise level.
6.4 Effect of Overlapping Biclusters
The goal of this section is to study the behavior of the chosen biclustering methods with respect to increased regulatory complexity. Figures 5a and 5b show the performance of the biclustering methods at different overlap degrees in the absence of noise. Bimax is the only method that fully recovers all hidden modules in the data matrix. BiEntropy and SAMBA also perform considerably well compared to the remaining methods. OPSM is not significantly affected by the degree of overlap in additive biclusters, but it cannot handle constant biclusters with identical expression values. ISA appears more sensitive to a high degree of overlap, especially for additive biclusters. For CC, performance increases with larger overlap degrees, owing to the diminishing number of background cells, but the gene match scores remain lower than those of Bimax, BiEntropy, SAMBA, and ISA.
6.5 Real Data
We apply our biclustering method to the Saccharomyces cerevisiae dataset. The dataset, which has 2,993 genes and 173 conditions, is provided by Gasch et al. [8] and is available at http://www.tik.ee.ethz.ch/sop/bimax. We follow the approach used by Zimmermann et al. [7] to evaluate the performance of BiEntropy against other biclustering methods on real expression data. The web tool FuncAssociate [9] was used to evaluate the discovered biclusters using Gene Ontology (GO) annotations [12]. Table 1 lists the parameter settings and the number of biclusters identified by each method. We filter out biclusters with more than 25% overlap with larger biclusters and output the rest in order of descending size. The adjusted significance score of each discovered bicluster is computed with FuncAssociate. The histogram in Figure 3 shows the proportion of biclusters from each method that contain one or several overrepresented GO categories for the Saccharomyces cerevisiae dataset. BiEntropy and OPSM obtain the best results. We attribute the good performance of BiEntropy to its unbiased discretization scheme, which accurately identifies bicluster types enriched with GO Biological Process terms. OPSM performs well because it returns a small number of biclusters. Bimax, ISA, and SAMBA also provide a high proportion of functionally enriched biclusters; Bimax and ISA (~90% at a significance level of 5%) have a slight advantage over SAMBA (~80% at a significance level of 5%). In contrast, CC and xMotif perform poorly; the scores for CC are ~30%. The NCI60 dataset represents the gene expression patterns of 9,703 genes in 60 human cancer cell lines and is available at http://www.discover.nci.nih.gov/nature2000. The complete dataset contains missing values. We first select genes that have at most three missing values; there are 8,161 such genes. We use the k-nearest-neighbors impute function in MATLAB to estimate the missing values. We then calculate the variance of each gene expression profile and filter out the profiles whose variance is below the 25th percentile of all profile variances. The total number of genes left after
filtering is 6,344. We apply BiEntropy to the NCI60 dataset with parameters K = 5 and β = 0.9 to generate 92 biclusters, 76 of which are retained after filtering out those with more than 25% overlap with larger biclusters. We evaluate the discovered biclusters by calculating the hypergeometric functional enrichment score using FuncAssociate. Table 2 shows partial results of the biclusters found by BiEntropy.
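The following is a sketch of the NCI60 preprocessing steps just described (missing-value filtering, k-nearest-neighbour imputation, and variance filtering). The text names MATLAB's impute function; scikit-learn's KNNImputer is used here only as a stand-in, and the value of k is an assumption.

```python
import numpy as np
from sklearn.impute import KNNImputer  # stand-in for MATLAB's knnimpute

def preprocess_nci60(X):
    """X: genes x cell-lines expression matrix with NaNs marking missing values."""
    # keep genes with at most three missing values
    X = X[np.isnan(X).sum(axis=1) <= 3]
    # estimate remaining missing values from nearby profiles (k=5 is an assumed default)
    X = KNNImputer(n_neighbors=5).fit_transform(X)
    # drop profiles whose variance is below the 25th percentile of all profile variances
    var = X.var(axis=1)
    return X[var >= np.percentile(var, 25)]
```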
[Figure 5 appears here: four panels plotting average match score against overlap degree (0 to 0.1); the top panels show relevance of biclusters and the bottom panels show recovery of modules, for (a) constant and (b) additive biclusters.]
Fig. 5. Results of experiments on synthetic dataset: overlapping modules with increasing overlap degree and noise for (a) constant and (b) additive biclusters
Table 1. Summary of parameter settings and total number of biclusters
Table 2. Partial results of biclusters found in NCI60; size is given by the number of genes
7 Conclusion
We have proposed and implemented a novel biclustering method, called BiEntropy, to discover constant, coherent (additive), and overlapping biclusters in the presence of noise. The method combines conditional entropy and graph-theoretic techniques to identify a subset of conditions in which biclusters can be located. The experimental results on both synthetic and real data (Saccharomyces cerevisiae and NCI60) show that BiEntropy is robust against noise and overlap in both constant and additive biclusters, and that it achieves better accuracy than most of the compared biclustering methods.
References
[1] Hartigan, J.: Direct Clustering of a Data Matrix. J. Am. Statistical Assoc. 67, 123–129 (1972)
[2] Cheng, Y., Church, G.M.: Biclustering of Expression Data. In: Proceedings of Intelligent Systems for Molecular Biology (2000)
[3] Kupiec, M., Shamir, R., Tanay, A., Sharan, R.: Revealing Modularity and Organization in the Yeast Molecular Network by Integrated Analysis of Highly Heterogeneous Genome-Wide Data. PNAS 101, 2981–2986 (2004)
[4] Karp, R., Ben-Dor, A., Chor, B., Yakhini, Z.: Discovering Local Structure in Gene Expression Data: The Order-Preserving Submatrix Problem. In: Proceedings of the 6th Int. Conf. on Computational Molecular Biology (RECOMB), pp. 49–57 (2002)
[5] Bergmann, S., Ihmels, J., Barkai, N.: Defining Transcription Modules Using Large-Scale Gene Expression Data. Bioinformatics 20, 1993–2003 (2004)
[6] Murali, T.M., Kasif, S.: Extracting Conserved Gene Expression Motifs from Gene Expression Data. In: Proceedings of the 8th Pacific Symposium on Biocomputing (2003)
[7] Zimmermann, P., Wille, A., Buhlmann, P., Gruissem, W., Hennig, L., Thiele, L., Zitzler, E., Prelic, A., Bleuler, S.: A Systematic Comparison and Evaluation of Biclustering Methods for Gene Expression Data. Bioinformatics (2006)
[8] Gasch, A.P.: Genomic Expression Programs in the Response of Yeast Cells to Environmental Changes. Mol. Biol. Cell 11, 4241–4257 (2000)
[9] Berriz, G., Bryant, O., Sander, C., Roth, F.: Characterizing Gene Sets with FuncAssociate. Bioinformatics 19, 2502–2504 (2003)
[10] Prelic, A., Zimmermann, P., Barkow, S., Bleuler, S., Zitzler, E.: BicAT: A Biclustering Analysis Toolbox. Bioinformatics 22, 1282–1283 (2006)
[11] Maron-Katz, A., Sharan, R., Shamir, R.: CLICK and EXPANDER: A System for Clustering and Visualizing Gene Expression Data. Bioinformatics 19, 1787–1799 (2003)
[12] Gene Ontology Consortium, http://www.geneontology.org
[13] Bron, C., Kerbosch, J.: Algorithm 457: Finding All Cliques of an Undirected Graph. ACM Comm. 16 (1973)
c-GAMMA: Comparative Genome Analysis of Molecular Markers Pierre Peterlongo1, Jacques Nicolas1 , Dominique Lavenier2, Raoul Vorc’h1 , and Joël Querellou3 1
Équipe-projet INRIA Symbiose, Campus de Beaulieu, Rennes, France http://www.irisa.fr/symbiose/ 2 ENS Cachan - IRISA, France 3 LM2E UMR6197 Ifremer, Centre de Brest, France
Abstract. Discovery of molecular markers for the efficient identification of living organisms remains a challenge of high interest. The diversity of species can now be observed in detail thanks to low-cost genomic sequences produced by the new generation of sequencers. A method, called c-GAMMA, is proposed which formalizes the design of new markers from such data. It is based on a series of filters on forbidden pairs of words, followed by an optimization step on the discriminative power of candidate markers. First results are presented on a set of microbial genomes. The importance of further developments is stressed, to face the huge amounts of data that will soon become available in all kingdoms of life.
1 Introduction
The past decade of genomics started with the complete sequencing of the Haemophilus influenzae genome in 1995 [1]. This period was characterized by the multiplication of sequencing projects aimed at getting a better comprehensive view of the whole tree of life. During this time, an exponential rate of sequencing projects was observed, with the number of projects doubling every 20 months [2]. Comparative analyses of complete genomes, from Bacteria and Archaea to human, have a huge impact on all aspects of the life sciences and are deeply reshaping evolutionary theory in the light of genomics [3]. To better understand the driving forces in speciation, the diversity in virulence of pathogens, and the diversity in metabolic pathways in various key species, more complete genomes of closely related strains of the same species (or species of the same genus) are needed. This recently triggered a flood of sequencing projects for novel strains of key pathogens (Campylobacter, Haemophilus, Mycobacterium, Streptococcus, etc.), model species (Bacillus, Escherichia), ecological key players (Prochlorococcus, Synechococcus) and species potentially interesting for biotechnology (Pyrococcus, Thermococcus). It appears that for these species the number of sequencing projects is growing exponentially, and the time has come to specifically address comparative genomics at the scale of micro-evolution (Table 1).
Table 1. Number of genome projects related to important prokaryotic genera and species (Source: GOLD http://www.genomesonline.org/ and Microbesonline http://www.microbesonline.org/, modified, April 2009)

Phylum           Genus             Species, strains   Genome projects   Genomes completed
Arthropoda       Drosophila                           10                10
Euryarchaeota    Methanococcus                        7                 6
                 Pyrococcus                           4                 3
                 Thermococcus                         5                 3
Firmicutes       Bacillus          anthracis          8                 3
                 Bacillus          cereus             14                4
                 Bacillus          other species      13                8
                 Clostridium       botulinum          7                 4
                 Clostridium       other species      29                10
                 Lactobacillus                        12                11
                 Staphylococcus    aureus             12                12
                 Staphylococcus    other species      4                 4
                 Streptococcus     pneumoniae         17                3
                 Streptococcus     other species      30                22
Spirochaetes     Borrelia          burgdorferi        7                 1
                 Borrelia          other species      4                 2
Proteobacteria   Burkholderia                         45                13
                 Campylobacter     jejuni             9                 4
                 Campylobacter     other species      6                 2
                 Escherichia       coli               31                10
                 Haemophilus       influenzae         13                4
                 Haemophilus       other species      4                 2
                 Pseudomonas                          16                11
                 Rickettsia                           13                10
                 Salmonella        enterica           23                6
                 Shewanella                           10                6
                 Vibrio            cholerae           6                 1
                 Vibrio            other species      14                6
                 Yersinia          pestis             13                6
                 Yersinia          other species      7                 3
Actinobacteria   Mycobacterium     tuberculosis       5                 4
                 Mycobacterium     other species      12                12
Tenericutes      Mycoplasma                           14                13
                 Ureaplasma        urealyticum        11                1
                 Ureaplasma        other species      5                 0
Cyanobacteria    Prochlorococcus   marinus            12                12
                 Synechococcus                        15                10
One of the main needs is the design of molecular markers that can achieve a high level of discrimination between different species or strains. The use of molecular markers has become increasingly popular in many fields: phylogenetic reconstruction in microbiology, quality control in food industry, traceability in epizooty and epidemic diseases, barcoding of life, forensics, etc. Each domain of activity has its favourite marker(s) working optimally for a specific purpose. The increasing number of complete genomes of related species available in databases raises the question of rapid determination of additional molecular markers through comparative genomics. This paper proposes a novel approach to characterize molecular markers within a set of complete genomes of related strains or species targeting PCR (Polymerase Chain Reaction). PCR is one of the most important tools in genetic
engineering for amplifying specific DNA fragments defined by flanking pairs of words on both sides. These pairs of words are matched by complementary short synthetic oligonucleotides, called primers. Potential applications include strain quality control, identification, taxonomy and possibly phylogeny.
2 Identification of Genome Species Markers Using PCR
Let us first explain the way markers are used during PCR. For a word s over the four-letter alphabet {A, T, C, G}, let s̄ denote its reverse complement, and let v denote the marker to be selectively amplified. A DNA double helix corresponds to the hybridization of a sequence x.u.v.w.y with its reverse complement ȳ.w̄.v̄.ū.x̄ (x, u, v, w and y are words). PCR aims at hybridizing the subsequence u.v.w with its complementary strand w̄.v̄.ū, initiated by two short synthetic oligonucleotides, the primers, which will match u and w respectively. Thus, the word v corresponding to the marker itself is produced in the context of two fixed words corresponding to the primer sequences. Most of the specific sequences that are used as molecular markers come from ubiquitous components of the cell with limited nucleic material, such as ribosomes. One of the main resources concerns 16S rRNA and can be found on various dedicated websites, including the Ribosomal Database Project [4]. The last release (April 3, 2009) reports 836 814 annotated and aligned 16S rRNA sequences. They mostly come from uncultured microbes, as a result of the standard investigation of microbial diversity by molecular methods and, more recently, by metagenomics. The limits of 16S rRNA for species identification become apparent when handling this high number of species. Firstly, there is no linear relationship between 16S rRNA similarity and DNA-DNA hybridization. A consensus was reached specifying that a 16S rRNA similarity level lower than 97% between two strains is equivalent to a DNA-DNA hybridization level lower than 70%, and discriminates two different species. However, many different species display 16S rRNA sequence similarity within the range of 98-99%, and in those cases 16S rRNA cannot be used to establish a strain as a novel species. Other variable molecular markers, like housekeeping genes, are frequently used in addition to 16S rRNA [5]. The major drawback is that none of these additional markers is universal. Secondly, in phylogeny reconstruction, 16S rRNA cannot solve all the problems, and for some taxonomic groups tree topologies remain uncertain. The help of additional sequences is required, and the current trend is to use a set of sequences corresponding to the concatenation of ribosomal proteins. Another widely used molecular marker, for Eukarya "barcoding", is the 648-bp region of cytochrome oxidase I (COI). DNA barcoding employs a minimum 500-bp sequence of COI to help species identification and discovery in large assemblages of life [6]. Although well adapted to species identification, COI sequences can be used neither in phylogeny nor in ecotype identification. Here again, additional molecular markers need to be found in the set of complete genomes currently available, for various tasks ranging from quality control in laboratory
collections of pico- and micro-eukaryotes to the traceability of pathogens, of pests in the environment, etc. Biologists need help both for choosing markers on a less restrictive set of sequences and for choosing the primers that will select these markers. Since most authors consider short sequences, they generally rely on multiple alignments for a subset of species of interest, so that conserved and non-conserved regions become directly visible. Targets are then defined as conserved regions delimiting some highly variable regions, and potential primers are then checked against the whole database in order to prune solutions matching elsewhere in the sequence. Convenient software environments have been developed in this context [7]. However, the task of finding suitable markers is not fully automated and does not scale to many species or long sequences, owing to the multiple alignment step. In [8], A. Pozhitkov and D. Tautz propose a program for simply finding one probe discriminating a given set of sequences from others. Although the algorithm could be greatly improved, they use the interesting idea of building a crude index of words of fixed size in order to speed up the search for common patterns. Finding the best primer pairs where each primer is a substring of a given sequence is by itself a complex multicriteria task that has been well described in [9]. It extends beyond string matching and the use of edit distances, since it involves criteria including the proximity between primer melting temperatures, the minimization of hybridization effects between forward and reverse primers, and the avoidance of hybridization of primers with themselves. It may be solved efficiently using dynamic programming schemes that extend approximate string matching equations. The large-scale design of primers has been tackled in another context: the observation of gene expression for a given organism. The issue is to produce, for a subset of genes or the complete set of genes of some genomes, a set of markers that identify each gene. The technique used in such a case, microarrays, involves an array of spots, each being attached to a primer. The main objective in this context is to find a set of primers working at the same temperature (called probes), each one recognizing a unique gene in the given set. Combining suffix-tree indexing, dynamic programming and ad hoc filters, Kaderali and Schliep [10] showed that it is possible to identify organisms, but that this technique requires long probes for identifying many species. A recent review of tools for identifying long primers (size greater than 40) is available in [11]. Producing a microarray with many long probes remains an expensive operation. One of the advantages of working on species instead of gene expression is that the primer-gene association does not need to be bijective. The issue becomes the choice of a minimal set of primers, a problem easily reduced to the minimum set covering problem [12]. The present study addresses yet another variation of the primer design problem. The idea of working on whole genomes is kept, but restricted to PCR as a low-cost identification technique. The genome-based marker design problem consists in determining within a set of genomes (i) primer pairs conserved over these genomes, (ii) usable for PCR amplification for genome differentiation, (iii) associated to at least one homologous flanking region, (iv) that can be used for
diverse objectives: speciation, rapid strain and species identification, taxonomy, search for variable regions, and contextual gene analysis. To the best of our knowledge, this problem has never been stated before. The closest study we are aware of, in terms of constraints to be solved, corresponds to a very different application: the study of the variability of individual genomes in terms of deletions or translocations that can occur in mutants and pathogenic states like cancer. A recent clever experimental protocol, PAMP [13], uses multiplex PCR to selectively amplify the variations observed in pathogenic cells. The authors have developed an optimization technique to design primer sets based on simulated annealing and integer programming [14]. This technique can process sequences up to one Mbp. Although the setting is different, it shares an interesting characteristic with our approach, namely the comparison of ordered pairs of primers.
3 Model
We propose a generic formalization and a model for designing primer pairs for PCR amplification, aimed at finding markers able to differentiate genomes. The model relies on four steps (see Figure 1 for an overview):
1. Given a set of sequences and primer parameters, detection of oligonucleotides that could be primers and that theoretically hybridize on each of these sequences, on the direct or reverse-complementary strand. This detection is based on physico-chemical properties of DNA fragments (Section 3.1);
2. From this set of possible primers, selection of all pairs that respect location properties on each sequence. These properties derive from PCR amplification technical constraints (Section 3.2);
3. Selection of primer pairs that define fragments considered as molecular markers; they make it possible to differentiate sequences from each other using a simple length criterion (Section 3.3);
4. Selection of all pairs that further define flanking regions (fragments on the left- and right-hand sides of the primer pairs) sharing homology or being highly variable (Section 3.4).
Fig. 1. Overview of the model. A set S of four sequences guides the design of two primer sets Pd and Pr (red and green rectangles), whose pairs generate the sets Cpcr, Cdiff and Csim.
3.1 Primers Characteristics
The primary goal of a primer is to hybridize a complementary strand of a DNA sequence at a well-defined position. Optimal primers are produced on the basis of a hybridization model taking into account various criteria:
1. G+C content: the G+C percentage of a primer is framed between a minimum and a maximum threshold value, typically between 40% and 60%.
2. Melting temperature: the melting temperature of a primer must lie in a bounded interval. The computation is based on the nearest-neighbour method [15]. The melting temperature calculation also takes into account the concentration of nucleotides and the concentration of salt.
3. Repeats: primers containing long runs of identical nucleotides or dinucleotides are eliminated.
4. Hairpin loops: primers must not include hairpin loops. The size of the stem or the size of the loop must be lower than a predefined value.
5. Self-complementarity: a primer must not self-hybridize during PCR. Thus, primers that form a duplex with their complementary strand are removed.
6. Thermodynamic stability at primer ends: the Gibbs free energy (ΔG, in units of kcal/mole) values computed on the 5' and 3' ends of the primers are bounded. The ΔG value determines the strength of the hybridization and triggers the decision of considering the position as a potential hybridization site.
If a nucleic sequence possesses all these qualities, it can be considered a successful primer. The next question is: under what conditions will this primer hybridize with a DNA sequence? In other words, given this primer and any portion of the genome, can they hybridize together? The answer is given by the calculation of the thermodynamic stability between the two strands. The nearest-neighbour method proposed by SantaLucia [15] is used to compute ΔG along the two oligonucleotide sequences, with special care at the 5' and 3' extremities of the primer.
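As a toy illustration of this filter pipeline, the snippet below applies only the G+C-content and repeat criteria; the melting-temperature, hairpin, self-complementarity and ΔG computations listed above are omitted, and the thresholds are assumptions.

```python
import re

def gc_content(primer):
    """Fraction of G and C bases in the primer."""
    return (primer.count("G") + primer.count("C")) / len(primer)

def passes_basic_filters(primer, gc_min=0.40, gc_max=0.60, max_run=4):
    """Apply the cheapest criteria first, as in the pipeline described above."""
    if not gc_min <= gc_content(primer) <= gc_max:
        return False
    # reject runs of more than max_run identical nucleotides
    if re.search(r"(A{%d}|C{%d}|G{%d}|T{%d})" % ((max_run + 1,) * 4), primer):
        return False
    # reject long dinucleotide repeats
    if re.search(r"(..)\1{%d}" % max_run, primer):
        return False
    return True
```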
3.2 Primer Pairs for PCR Amplification
Interesting primer pairs are those defining a fragment that may be amplified by PCR. Their hybridization positions (called PCPP, for Primers Couple PCR Positions) must respect certain distance characteristics and certain distribution conditions over the hybridization locations of each of these primers. Figure 2 gives an example of a PCPP. For a non-ambiguous characterization of PCR results, for a given primer pair two PCPPs may not start or end at the same position. This avoids amplification of fragments of alternative sizes at the same position. In the following, the set Cpcr of primer pairs defining at least one PCPP on each sequence is defined. Given a set of sequences S, we have two primer sets: Pd, containing primers that hybridize on the direct strand of each
Fig. 2. A portion of two DNA strands is shown. On each strand, a primer has two hybridization locations: at positions a and a′ on the direct strand (red rectangles on the bottom line) and at positions b and b′ on the reverse complementary one (green rectangles on the top line). This pair shows hybridization sites (a and b) which respect conditions (1) to (3). The shaded area corresponds to a putative molecular marker. If condition (4) is also respected, (a, b) is a PCPP.
sequence, and Pr, containing primers that hybridize on the reverse complementary strand of each sequence. pos(s, p) is then defined as the set of positions where the primer p ∈ Pd hybridizes on the sequence s ∈ S, and pos(s, p′) as the set of positions where the primer p′ ∈ Pr hybridizes on the reverse complement of s. For the sake of clarity, all positions are reported on the direct strand. Cpcr is defined as the set of pairs c = (p, p′) of primers from Pd × Pr such that for each s ∈ S:

∃a ∈ pos(s, p) and ∃b ∈ pos(s, p′) such that min ≤ b − a ≤ max    (1)
Moreover, the conditions of uniqueness for fragments starting or ending at a given position can be expressed as follows:

∀ a′ ≠ a ∈ pos(s, p): a′ < b ⇒ b − a′ ≥ limit    (2)

∀ b′ ≠ b ∈ pos(s, p′): a < b′ ⇒ b′ − a ≥ limit    (3)
Condition (1) ensures that the pair of primers defines at least one fragment whose length lies in [min + primer length, max]. Conditions (2) and (3) ensure that the selected pair of primers defines non-ambiguous fragments at given positions. Figure 2 represents hybridization locations respecting conditions (1) to (3). In order to avoid amplification of fragments of alternative sizes at the same position, Cpcr does not contain pairs of primers with hybridization sites respecting condition (1) but not respecting conditions (2) and (3). Formally, ∀(p, p′) ∈ Cpcr and ∀s ∈ S:

∀(a, b) ∈ pos(s, p) × pos(s, p′): min ≤ b − a ≤ max ⇒ (2) ∧ (3)    (4)

3.3 Primer Pairs for Sequence Differentiation
Primer pairs in Cpcr are potential candidates for PCR amplification. Let Cdiff be the subset of Cpcr containing all pairs of primers defining inner fragments whose lengths enable sequences to be differentiated.
To do so, lengths(s, c) (s ∈ S and c ∈ Cpcr) is defined as the set of lengths of the inner fragments defined by the PCPPs of c on the sequence s. Cdiff is then defined as the subset of pairs c from Cpcr such that ∀s, s′ ∈ S, s ≠ s′, ∃l ∈ lengths(s, c) such that ∀l′ ∈ lengths(s′, c):

max(l/l′, l′/l) ≥ δ, with δ a fixed parameter.    (5)

Informally, condition (5) ensures that, for each pair of primers c ∈ Cdiff and for each pair of sequences s, s′ ∈ S, at least one of the fragments defined by a PCPP of c on s has a length different enough from all fragments defined by the PCPPs of c on sequence s′. This property enables the selected fragments to differentiate sequences from each other with a simple length-based test. Moreover, in order to provide readable PCR results by clearly distinguishing amplified fragments, an additional parameter max occ. is applied: pairs c whose number of PCPPs on some sequence is larger than max occ. are removed from Cdiff. Formally, ∀c ∈ Cdiff, ∀s ∈ S: |lengths(s, c)| ≤ max occ., with |E| denoting the cardinality of the set E (this notation is used in the rest of the paper).
3.4 Sequence Similarity / Variability
At last, the sequence composition of the fragments defined by PCPPs is taken into consideration. Given a PCPP, the internal region (red area in Figure 3) and the two flanking regions (yellow fragments in Figure 3) are considered. Depending on the application, one may want these areas to be homologous or variable. Bearing in mind that any combination of searched homology is possible, Csim (see Figure 1) is defined here as the subset of pairs of primers from Cdiff such that there exists at least one PCPP for these pairs with a variable centre fragment and at least one homologous flanking region. Each fragment is considered both on the direct and on the reverse strand.
Fig. 3. A pair of primers c = (p, p′) ∈ Csim has one PCPP on each of four genomes. For the sake of clarity only one strand is represented. The inner fragments defined by this PCPP (red) present, for instance, high variability, while the left or right flanking regions (yellow) present high similarity.
4 Methods
This section presents the methods for finding potential primers (Section 4.1). Sections 4.2, 4.3 and 4.4 then show how the previously defined sets Cpcr, Cdiff and Csim are detected.
4.1 Methods for Primers Detection (Pd and Pr)
Given a set S of n sequences, ideally all potential primers that may hybridize at least once on each sequence should be generated. Such an approach is unfeasible by enumeration of all primer configurations: in this case study, considering primers of length 25 would lead to testing 4^25 elements, which is unrealistic. Instead, the following approach is used. To search for common primers of length l, all l-mers of each sequence s ∈ S are first considered. In addition, to extend the search space, these l-mers are degenerated in their middle. Practically, 2 nucleotides are modified, leading to the generation of up to 4^2 l-mers per position. After this stage, a huge set of l-mers is considered as potential primers. Only those respecting the conditions presented in Section 3.1 are selected. More precisely, the selection of primers is achieved through a pipeline of filters. Each stage of the pipeline eliminates the candidates which do not fit specific criteria. For efficiency purposes, the most stringent criteria are taken into consideration first. The implementation is based on a series of functions which do not present algorithmic challenges and are not detailed here. After this process, a new set of l-mers considered as putative primers is available. From this set, only those that hybridize on all the different sequences are selected. The whole set of primers is thus checked against S. In that way, a list of hybridizing primers is associated with each sequence. The intersection of these lists results in the set of primers that hybridize at least once on every sequence. To speed up the hybridization test, the sequences are first indexed with a seed-based index technique. The length of the seeds is set to 6, meaning that a primer hybridization will be reported only if the primer and the genome share at least 6 common nucleotides (or, more exactly, two complementary 6-nt words). In that case, a ΔG value is computed as presented in Section 3.1. Depending on the ΔG value, the primer is added or not to the primer list associated with the genome.
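The following sketch illustrates this candidate enumeration and the seed index; the function names and the choice of which two central positions are degenerated are our own assumptions.

```python
from itertools import product

BASES = "ACGT"

def degenerate_lmers(genome, l=25):
    """Yield every l-mer of the genome together with all variants obtained by
    substituting its two central nucleotides (up to 4^2 variants per position)."""
    mid = l // 2
    for i in range(len(genome) - l + 1):
        lmer = genome[i:i + l]
        for a, b in product(BASES, repeat=2):
            yield lmer[:mid - 1] + a + b + lmer[mid + 1:]

def seed_index(genome, k=6):
    """Map every k-mer (seed) to its positions; a candidate primer is tested for
    hybridization only where it shares a 6-nt seed with the genome."""
    index = {}
    for i in range(len(genome) - k + 1):
        index.setdefault(genome[i:i + k], []).append(i)
    return index
```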
4.2 Methods for Detection of Primer Pairs for PCR Amplification (Cpcr)
From this point, two sets of potential primers are available: Pd and Pr, which hybridize respectively at least once on each sequence s and at least once on each reverse complementary sequence s̄. In order to verify conditions (1) to (4), all possible primer pairs (p, p′) ∈ Pd × Pr are checked. On each sequence s, the ordered hybridization locations pos(s, p) and pos(s, p′) are available from the previous steps. Briefly, the algorithm works as follows: the positions in pos(s, p) and pos(s, p′) are read conjointly as long as condition (1) is not fulfilled. In case a
pair of hybridization positions (a, b) ∈ pos(s, p) × pos(s, p′) respecting condition (1) is found, the previous positions a′ in pos(s, p) (resp. the next positions b′ in pos(s, p′)) are checked in order to validate that condition (2) (resp. (3)) is respected. In case of success, the pair (p, p′) is tagged as a potential pair for PCR; otherwise the pair is rejected (condition (4)) and the reading of its positions is stopped. All pairs of primers respecting conditions (1) to (4) are stored in the set Cpcr. For a pair of primers (p, p′) ∈ Pd × Pr, this approach reads all positions in pos(s, p) and in pos(s, p′), leading to a complexity of O(|pos(s, p)| + |pos(s, p′)|), that is O(N) with N the total length of the input genomes. As this computation is done for each possible pair of primers, the overall time complexity of this step is O(|Pd| × |Pr| × N), that is O(N^3). In practice, the running time is much lower, as confirmed by the experimental tests described in Section 5.
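A straightforward (quadratic rather than conjointly scanned) sketch of the PCPP test for one primer pair on one sequence, checking conditions (1) to (3) as reconstructed above; pos_p and pos_pr are assumed to be sorted hybridization positions reported on the direct strand.

```python
def find_pcpp(pos_p, pos_pr, min_len=200, max_len=2000, limit=3500):
    """Return (a, b) pairs with min_len <= b - a <= max_len such that no other
    hybridization site of either primer yields a second short amplifiable
    fragment with the same start or end (conditions (2) and (3))."""
    pcpp = []
    for a in pos_p:
        for b in pos_pr:
            if b - a < min_len:
                continue
            if b - a > max_len:
                break                       # positions are sorted
            # condition (2): other forward sites upstream of b must be far away
            ok = all(b - a2 >= limit for a2 in pos_p if a2 != a and a2 < b)
            # condition (3): other reverse sites downstream of a must be far away
            ok = ok and all(b2 - a >= limit for b2 in pos_pr if b2 != b and b2 > a)
            if ok:
                pcpp.append((a, b))
    return pcpp
```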
4.3 Methods for Detecting Primer Pairs for Sequence Differentiation (Cdiff)
Finding the subset Cdiff of the set Cpcr is straightforward. For each primer pair c ∈ Cpcr and each sequence s ∈ S, lengths(s, c) is known (see Section 3.3). Trivially, for each primer pair c in Cpcr and for each pair of sequences s, s′ ∈ S, s ≠ s′, c is conserved in Cdiff if there exists l ∈ lengths(s, c) that is different enough from all l′ ∈ lengths(s′, c) so that condition (5) is respected. Simultaneously, it is trivial to conserve in Cdiff only the primer pairs for which the number of occurrences of PCPPs on each sequence respects the max occ. parameter. This check is done in O(|lengths(s, c)| × |lengths(s′, c)|) for each pair of sequences s, s′ and each primer pair c ∈ Cpcr. Thus, for each primer pair, the check is done in O(n² × |lengths(s, c)| × |lengths(s′, c)|), leading to an overall time complexity of O(|Cpcr| × n² × |lengths(s, c)| × |lengths(s′, c)|). Note that in practice n, |lengths(s, c)| and |lengths(s′, c)| are negligible with regard to |Cpcr|.
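A direct transcription of the length-differentiation test, combining condition (5) with the max occ. filter; the representation of lengths(s, c) as one set of fragment lengths per sequence is an assumption.

```python
def differentiates(lengths_by_seq, delta=1.05, max_occ=2):
    """lengths_by_seq: list of sets of fragment lengths, one per sequence,
    for a single primer pair c (i.e. lengths(s, c) for every s in S)."""
    if any(not L or len(L) > max_occ for L in lengths_by_seq):
        return False
    for i, Ls in enumerate(lengths_by_seq):
        for j, Ls2 in enumerate(lengths_by_seq):
            if i == j:
                continue
            # at least one fragment of s must differ enough from every fragment of s'
            if not any(all(max(l / l2, l2 / l) >= delta for l2 in Ls2) for l in Ls):
                return False
    return True
```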
4.4 Methods for Detection of Primer Pairs Taking into Account Sequence Similarity and Variability (Csim)
Pairs of primers from Cdiff that define fragments respecting the conditions exposed in Section 3.4 (see also Figure 3) are selected to be included in Csim. Given the large amount of work previously done on finding multiple local alignments, we decided not to develop our own algorithm. In this framework, we used MEME [16], which provides an e-value estimation that can be used as a formal criterion for creating the set Csim. As stated earlier, this step is highly tunable depending on the biological application. In this framework, the method is the following: for each primer pair c ∈ Cdiff, MEME is applied on all combinations of PCPPs of c on the set of genomes. The primer pair c is stored in Csim if one of the MEME results provides both:
– an e-value larger than a fixed threshold for the centre fragment alignments, and
– an e-value below another threshold for the flanking region alignments.
5 Results
The method has been implemented in the c-GAMMA tool, acting as a pipeline of programs. As a preliminary test, c-GAMMA was applied on a set S of eight Thermococcales genomes (source: GOLD database, http://www.genomesonline.org/) of total length N ≈ 16 Mb. Thermococcales were chosen owing to their high interest in biotechnology: species belonging to this family display thermostable hydrolases of interest. It is therefore important to find molecular markers that can help to identify strains within Thermococcus and Pyrococcus species and ensure quality control. The goal of our study was to design pairs of primers defining molecular markers identifiable both by PCR and by a sequence homology criterion. All experiments were computed using a PC with an Intel dual-core 2.40 GHz processor running Linux Fedora with 2 GB of memory.
5.1 Primer Detection Results (Pd and Pr)
The method exposed in Section 4.1 was applied to generate primers of length 25 that hybridize at least once on each genome (direct and reverse complementary strands) in S. Primer generation was done by testing all 25-mers present on each genome (direct and reverse complementary strands) and degenerating two central positions of each of them. Thus ≈ 512 million 25-mers were tested. Each of these 25-mers was selected for further analysis if the classical parameters for PCR amplification were respected. This method generated 2 803 510 primers on the direct strand and 2 796 747 on the reverse complementary strand. Then only primers that hybridized at least once on each sequence (direct and reverse complementary strands) were conserved. This step conserved 62 247 primers on the direct strands (set Pd), defining 6 309 356 hybridization sites. On the reverse complementary strands, 62 764 primers were conserved (set Pr), with a total of 6 295 992 hybridization sites. Note that, on average, a primer hybridization site is found every ≈ 2.38 positions on each strand and each primer has ≈ 100 hybridization sites. This step is the most time-consuming; it was performed in less than six hours.
5.2 Primer Pairs for PCR Amplification and Sequence Differentiation Results (Cpcr and Cdiff)
To create the sets Cpcr and Cdiff from Pd and Pr, the methods presented in Sections 4.2 and 4.3 were sequentially applied. For defining Cpcr, the parameters were the following: min = 200, max = 2000 and limit = 3500. These parameters facilitate the standard PCR procedure used by most diagnostic laboratories. First, the set Cpcr, containing pairs of primers that respect conditions (1) to (4), is selected. This was done on all possible primer pairs in Pd × Pr (≈ 3.9 billion pairs in this experimentation). The computation took less than four hours and provided 63 877 pairs.
Table 2. Quantitative results while varying the parameters for finding Cdiff from Cpcr. Cpcr initially contained 63 877 primer pairs. Tests (a) vary the maximal number of occurrences (max occ.) of PCPPs of each pair on each genome. Tests (b) (resp. (c)) vary the parameter δ (see Section 3.3) using at most 2 (resp. 1) occurrences of PCPPs of each pair on each genome.
|Cdif f | 63872 63865 63782 63518 63193 62018 59050 53218 42187 18122
(a)
(b) δ max occ. 1.01 2 1.02 2 1.03 2 1.04 2 1.05 2 1.06 2 1.07 2 1.08 2 1.09 2 1.10 2
(c) |Cdif f | δ max occ. |Cdif f | 1149 1.01 1 137 301 1.02 1 68 180 1.03 1 41 107 1.04 1 36 71 1.05 1 24 56 1.06 1 23 37 1.07 1 17 11 1.08 1 0 11 11
(b)
Fig. 4. (a) A randomly chosen theoretical PCR obtained on the studied set of genomes using a pair of primers respecting conditions min = 200, max = 2000, limit = 3500, δ = 1.10 and max occ. = 2. (b) theoretical PCR obtained on a primer pair respecting conditions min = 200, max = 2000, limit = 3500, δ = 1.05 and max occ. = 1 and defining a variable marker region and a homologous flanking region between set of genomes.
For obtaining Cdif f from Cpcr , a set of tests using several distinct parameters was performed. Each test was computed in less than 30 seconds. Results are shown Table 2.
This experiment shows that the max occ. parameter (table (a)) has a strong influence and that most of the primer pairs define between 1 and 5 PCPPs per genome. However, even when constraining to exactly one occurrence per genome (last line of (a)), 18 122 pairs still respect the parameters. Moreover, these results show that, fortunately, even when applying very stringent parameters, some primer pairs are found. For instance, when asking for a minimal fragment length difference of δ = 10% and at most 2 fragment occurrences on each genome (last line of (b) in Table 2), 11 primer pairs are found. Figure 4(a) shows the theoretical PCR result that would be obtained on the studied set of genomes with a randomly chosen primer pair respecting such conditions. It is worth mentioning that, as expected, this single PCR result clearly permits strains to be distinguished from each other, as would any primer pair respecting the required parameters.
5.3 Detection of Primer Pairs Taking into Account Sequence Similarity and Variability: Results (Csim)
The goal here is to show that an approach involving a similarity criterion in addition to length attributes provides realistic biological results. Thus, we show the results of an experiment run on the set of 24 primer pairs generating one PCPP on each sequence with at least δ = 5% difference of length between them (boldfaced line of (c) in Table 2). For the PCPPs of each of these primer pairs, MEME was applied both on the central fragment and on the two flanking areas (over 1000 bp). We selected primer pairs for which the best alignment had an e-value higher than 1 for the central fragment and lower than 10⁻¹ for any of the flanking regions. Among these 24 primer pairs, one gave satisfying results. Indeed, the primer pair (CGCAGGATTAGCTACAGCCCCACTC, GGCCAATAATACCCAAAGCGGAGGA), having exactly one PCPP on each genome (see Figure 4(b)), defines a highly variable central fragment (the best local alignment found has an e-value equal to 4.2e+5) and a left flanking area containing a homologous motif (shown in Figure 5) of length 98 with an e-value of 1.3e−2.
Fig. 5. Motif found by MEME in the left flanking region of the primer pair CGCAGGATTAGCTACAGCCCCACTC and GGCCAATAATACCCAAAGCGGAGGA
268
6
P. Peterlongo et al.
Conclusion
This paper proposes a generic model to efficiently (1) detect primers on a set of genomes and (2) define suitable molecular markers for genome differentiation. The differentiation occurs at two levels: a simple length criterion, and a more precise criterion on flanking-region homology and/or variability. The model is fully implemented within a bioinformatics pipeline called c-GAMMA. Applied on a set of eight microbial genomes (16 Mb), c-GAMMA designed primers for the detection of molecular markers in 12 hours on a standard workstation, making genome differentiation possible using both length and homology criteria. These encouraging preliminary results open the way to other experiments on the huge source of data produced by next-generation sequencing machines. Moreover, the methods proposed in this framework mark a step forward in molecular marker detection. They are highly suitable for further enhancements such as:
– improving the primer generation by producing all oligonucleotides that may hybridize on a genome fragment. Generation is currently achieved through a simple degeneration scheme on the middle part of the fragments. Such an approach would provide more suitable results; however, it would dramatically increase the number of possible primer pairs (|Pd| × |Pr|) and would raise computational issues for finding hybridization sites;
– instead of considering only one primer pair on each sequence, the model may be improved by considering several primer pairs simultaneously to perform multiplex PCR in order to efficiently differentiate close species.
References
1. Fleischmann, R., Adams, M., White, O., Clayton, R., Kirkness, E., Kerlavage, A., Bult, C., Tomb, J., Dougherty, B., Merrick, J., et al.: Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269(5223), 496–512 (1995)
2. Koonin, E., Wolf, Y.: Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucl. Acids Res. 36(21), 6688–6719 (2008)
3. Koonin, E.: Darwinian evolution in the light of genomics. Nucl. Acids Res. 37(4), 1011–1034 (2009)
4. Cole, J., Wang, Q., Cardenas, E., Fish, J., Chai, B., Farris, R., Kulam-Syed-Mohideen, A., McGarrell, D., Marsh, T., Garrity, G., Tiedje, J.: The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucl. Acids Res. 37(suppl. 1), D141–D145 (2009)
5. Stackebrandt, E., Frederiksen, W., Garrity, G., Grimont, P., Kampfer, P., Maiden, M., Nesme, X., Rossello-Mora, R., Swings, J., Truper, H., Vauterin, L., Ward, A., Whitman, W.: Report of the ad hoc committee for the re-evaluation of the species definition in bacteriology. Int. J. Syst. Evol. Microbiol. 52(3), 1043–1047 (2002)
6. Ratnasingham, S., Hebert, P.: BOLD: the Barcode of Life Data System. Mol. Ecol. Notes (2007)
7. Ludwig, W., Strunk, O., Westram, R., Richter, L., Meier, H., Yadhukumar, Buchner, A., Lai, T., Steppi, S., Jobb, G., Forster, W., Brettske, I., Gerber, S., Ginhart, A.W., Gross, O., Grumann, S., Hermann, S., Jost, R., Konig, A., Liss, T., Lubmann, R., May, M., Nonhoff, B., Reichel, B., Strehlow, R., Stamatakis, A., Stuckmann, N., Vilbig, A., Lenke, M., Ludwig, T., Bode, A., Schleifer, K.H.: ARB: a software environment for sequence data. Nucl. Acids Res. 32(4), 1363–1371 (2004)
8. Pozhitkov, A., Tautz, D.: An algorithm and program for finding sequence specific oligonucleotide probes for species identification. BMC Bioinformatics 3(9) (2002)
9. Kampke, T., Kieninger, M., Mecklenburg, M.: Efficient primer design algorithms. Bioinformatics 17(3), 214–225 (2001)
10. Kaderali, L., Schliep, A.: Selecting signature oligonucleotides to identify organisms using DNA arrays. Bioinformatics 18(10), 1340–1349 (2002)
11. Lemoine, S., Combes, F., Le Crom, S.: An evaluation of custom microarray applications: the oligonucleotide design challenge. Nucl. Acids Res. 37(6), 1726–1739 (2009)
12. Wang, J., Li, K., Sung, W.: G-primer: greedy algorithm for selecting minimal primer set. Bioinformatics 20(15), 2473–2475 (2004)
13. Liu, Y., Carson, D.: A novel approach for determining cancer genomic breakpoints in the presence of normal DNA. PLoS One 2(4) (2007)
14. Bashir, A., Liu, Y.T., Raphael, B.J., Carson, D., Bafna, V.: Optimization of primer design for the detection of variable genomic lesions in cancer. Bioinformatics 23(21), 2807–2815 (2007)
15. SantaLucia, J.J.: A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc. Natl. Acad. Sci. USA 95(4), 1460–1465 (1998)
16. Bailey, T.L., Elkan, C.: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In: Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28–36 (1994)
Semi-supervised Prediction of Protein Interaction Sentences Exploiting Semantically Encoded Metrics Tamara Polajnar and Mark Girolami University of Glasgow, Glasgow, Scotland, G12 8QQ
[email protected] http://www.dcs.gla.ac.uk/inference/
Abstract. Protein-protein interaction (PPI) identification is an integral component of many biomedical research and database curation tools. Automation of this task through classification is one of the key goals of text mining (TM). However, labelled PPI corpora required to train classifiers are generally small. In order to overcome this sparsity in the training data, we propose a novel method of integrating corpora that do not contain relevance judgements. Our approach uses a semantic language model to gather word similarity from a large unlabelled corpus. This additional information is integrated into the sentence classification process using kernel transformations and has a re-weighting effect on the training features that leads to an 8% improvement in F-score over the baseline results. Furthermore, we discover that some words which are generally considered indicative of interactions are actually neutralised by this process.
1 Introduction
Lack of fully annotated training data is one of the major bottlenecks in biomedical text mining. Even for PPI detection, which is one of the most investigated TM problems, there are only a few standard data sets. The usefulness of these data sets is limited by their size and annotation schema [6,3,22]. In this paper we present a new method that integrates unlabelled data in order to improve the performance of a classifier trained on a smaller, minimally annotated data set. A PPI is a relation between two protein entities linked by an action descriptor, which is usually either a verb or a present (-ing) or past (-ed) participial adjective. Identification of interactions requires significant biological knowledge. In addition, annotation may also require grammatical expertise, depending on whether entities, interaction identifiers, or even sentence parse trees are considered. Therefore, the simplest kind of annotation possible is the one where segments of text are simply marked for relevance by biologists. This type of labelling is useful for training algorithms that detect passages containing PPIs as a first step in a full interaction extraction pipeline [14]. We use the
AImed data set, in which the protein entities are annotated and interacting pairs are specified [3]. We use the pairs annotation to judge which sentences contain interactions. The AImed corpus is emerging as a standard and is being used in a variety of ways [8,1], yet it contains fewer than 2000 sentences. Attempts to overcome this shortage in labelled data usually involve semi-supervised learning, where samples without class labels are added to the training set [8]. This approach generally leads to the greatest improvements in classification performance when there are few labelled sentences and many unlabelled sentences. However, semi-supervised learning is also volatile and could lead to a significant loss in accuracy [23]. Furthermore, the underlying assumption is that the labelled and unlabelled data come from the same distribution. Unfortunately, this prevents us from expanding a fully labelled corpus by combining corpora created by other queries. In order to address these concerns, we present a novel method of integrating unlabelled data into the classification process. We first create a word-word co-occurrence matrix from a large unlabelled corpus through unsupervised means. This corpus has a related topic and contains the words from the training set vocabulary. The matrix is then used to re-weight the words in the sentence documents according to their meaning in the larger corpus, thereby implicitly including external information in the training process. We consider two semantic representations, the Hyperspace Analogue to Language (HAL) [17,5,4] and the Bound Encoding of the Aggregate Language Environment (BEAGLE) [11,12]. Both HAL and BEAGLE model semantic memory using co-occurrence of words within a defined context window. Therefore they are slightly different from Latent Semantic Analysis (LSA) [15], which is based in the word-document space. Statistical word co-occurrence information has been successfully used for synonym identification and word-sense disambiguation [20], as well as query expansion in information retrieval [24,2]. We are not aware of any previous work that uses these semantic models to integrate external knowledge into the classification process. However, the Wikipedia¹ corpus has previously been used, with LSA, to improve the semantic linking of words to aid in the classification of news texts; the results did not show any improvement over linear classification methods [19]. In this paper, we show, for the first time, that this type of knowledge can help enhance classification in the document-document space used by kernel classifiers. We gain statistically significant improvements in classification by incorporating the semantic matrices into the kernel space. In addition, we obtain significant insights into word usage and the importance of particular features in classification. These initial experiments show that interesting results can be achieved through exploitation of the complexity of biomedical terms. Semantic models, such as HAL and BEAGLE, can help explore linguistic phenomena like polysemy, which in general make biomedical text mining more difficult than text processing in other domains [14].
¹ http://wikipedia.org
2 Semantic Spaces
Semantic spaces were initially introduced as a way of modelling psycholinguistic phenomena such as language acquisition and semantic priming. More recently, semantic models have been applied to and tailored for natural language processing tasks, resulting in a proliferation of models [20]. We use semantic models to improve kernel-based classification techniques. We do this by constructing word similarity matrices based on HAL and BEAGLE and then incorporating them into the kernels as described in Sect. 3.3. Both HAL and BEAGLE calculate the co-occurrence between a target word, t, and the words within a specified context. The context can be defined as a document, a sentence, a window of words, or even a path in a dependency parse tree anchored at the target [20]. In HAL it is defined as a sliding window where the target is the last word in the window, while in BEAGLE it is the sentence containing the target word. The words within the context are called the basis, b. The set of all target words, T, and the set of all basis words, B, are not necessarily equivalent. In general, the co-occurrence models are created by counting the number of times a basis word occurs in the context of a target word. These counts are recorded in a |T| × |B| matrix, where the targets are represented by the row vectors, while the basis words correspond to the columns. Semantic models also include a vector space distance metric that is used to calculate the similarity between target row vectors. In classification, the data are encoded as vectors of features, representing points in some multi-dimensional space. The kernel, k(x_i, x_j) = φ(x_i)^T φ(x_j), is a function that takes these data vectors and transforms them into a linear product space which represents the distances between the points. We investigate the use of two kernel functions, commonly employed for text classification, to calculate the distance between the word vectors. The cosine kernel is defined as k_c(x_i, x_j) = (x_i · x_j)/(|x_i||x_j|), and the Radial Basis Function (RBF) kernel as k_r(x_i, x_j) = exp(−θ|x_i − x_j|²).
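For reference, the two kernel functions written out for dense NumPy vectors (a sketch, not the authors' code):

```python
import numpy as np

def cosine_kernel(xi, xj):
    """kc(xi, xj) = (xi . xj) / (|xi| |xj|)"""
    return xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj))

def rbf_kernel(xi, xj, theta=1.0):
    """kr(xi, xj) = exp(-theta |xi - xj|^2)"""
    return np.exp(-theta * np.sum((xi - xj) ** 2))
```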
2.1 Hyperspace Analogue to Language
The HAL matrix, H, is constructed by passing a window of fixed length, L, across the corpus. The last word in the window is considered the target and the preceding words are the basis. Because the window slides across the corpus uniformly, the basis words are previous targets, and therefore T = B. The strength of the co-occurrence between a target and a basis word depends on the distance, l (1 ≤ l ≤ L), between the two words within the window. The co-occurrence scoring formula, L − l + 1, assigns lower significance to words that are further apart. The overall co-occurrence of a target-basis pair is the sum of the scores assigned every time they coincide within the sliding window across the whole corpus. Even though the matrix is square, it is not symmetric. In fact, the transpose of the matrix reflects the co-occurrence scores between the target and the basis words that occur within the window of length L after the target. Thus H and H^T together
reflect the full context surrounding a target. There are two ways of combining this information so that it is considered when the distance between targets is calculated. The first way is to concatenate H and H^T to produce a |T| × 2|B| matrix. The second way is to add the two matrices together, H + H^T. We found that for our kernel combination method the latter strategy is more effective. This was also the case when HAL was employed for query expansion [24]. Therefore, from now on, when we refer to H we will assume H = H + H^T.
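A sketch of the HAL construction with the L − l + 1 weighting and the H + H^T combination just described; the tokenisation and vocabulary indexing are our own assumptions.

```python
import numpy as np

def hal_matrix(tokens, vocab, L=10):
    """Sliding-window HAL co-occurrence matrix over a tokenised corpus.
    vocab maps each target/basis word to a row/column index."""
    H = np.zeros((len(vocab), len(vocab)))
    for i, target in enumerate(tokens):
        if target not in vocab:
            continue
        for l in range(1, min(L, i) + 1):   # l = distance to a preceding basis word
            basis = tokens[i - l]
            if basis in vocab:
                H[vocab[target], vocab[basis]] += L - l + 1
    return H + H.T                           # combine preceding and following contexts
```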
2.2 Bound Encoding of the Aggregate Language Environment
The BEAGLE model [11,12] was proposed as a combined semantic space that incorporates word co-occurrence and word order. For the purpose of comparison with HAL, we only consider the word co-occurrence construction. BEAGLE differs from HAL in that it does not use the raw word counts directly. Instead, it represents each target t with a 1 × D signal vector, e(t), of points drawn from the Gaussian distribution N(0, (1/√D)²). The number of dimensions D is chosen manually so that it is large enough to ensure that this vector is unique for each target or basis word, yet small enough to reduce the burden on memory. It is suggested in [11] that multiples of 1024 are an appropriate choice for D, and the authors use D = 2048 to encode larger corpora. D is generally much smaller than the number of basis words in a large corpus, so this representation also provides a more compact encoding. The context in BEAGLE is made of the basis words that occur in the same sentence as the target word. The target vectors in the BEAGLE co-occurrence matrix, B, are sums of the environmental vectors of the basis words that occur within the context of the target word. The more times a certain basis word is found in the same sentence as the target, the stronger its signal will be within the vector B[t].
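A sketch of the context-vector part of BEAGLE as described above: each word receives a random environmental vector drawn from N(0, (1/√D)²), and a target's representation accumulates the environmental vectors of its sentence co-occurrents. Function names are ours, and the order-encoding half of BEAGLE is omitted, as in the paper.

```python
import numpy as np

def beagle_context_vectors(sentences, targets, D=2048, seed=0):
    """sentences: list of token lists; targets: set of target words."""
    rng = np.random.default_rng(seed)
    env = {}                                   # environmental signal vectors e(w)
    def e(word):
        if word not in env:
            env[word] = rng.normal(0.0, 1.0 / np.sqrt(D), D)
        return env[word]
    B = {t: np.zeros(D) for t in targets}
    for sent in sentences:
        for t in sent:
            if t in B:
                for w in sent:
                    if w != t:
                        B[t] += e(w)           # sum of basis environmental vectors
    return B
```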
3
Methods
We assess the performance of the semantic kernels using the Gaussian process (GP) classifier [9]. We have previously found that GPs outperform the support vector machine [10] on the AImed [3] data set for the task of PPI sentence detection [21]. We formulate the interaction detection problem as a PPI sentence classification task. This allows us to use bag-of-words (BOW) [16] features with which we can examine the information gain from semantic kernels. In addition, the baseline features we employ are easier to extract and require no annotation. We also use protein names as features. While we rely on gold standard annotations, the proteins could also be annotated automatically.
3.1
Corpora
We use the AImed [3] data set for classifier training and testing and the GENIA [13] corpus to construct the semantic models.
AImed has been used in multiple studies recently for exact interacting pair extraction [3,8,1]. It is rapidly becoming one of the standard data sets for PPI classification. AImed has nearly 55,000 words and is annotated with PPIs. On the other hand, the larger GENIA corpus, which was constructed from the MEDLINE queries human, blood cell, and transcription factor, has over 432,000 words. It is only annotated with named entities including proteins; thus, the information in GENIA cannot be directly used for PPI classification. Consequently, any relevant subset of MEDLINE would be equally useful for this task. The protein names can be found automatically and therefore the annotations in GENIA are not strictly necessary.
3.2
Features
We consider two types of features for this task, short and protein. In short, each feature corresponds to a word. The words are defined as sequences of letters limited to a length of ten characters, as in [7]. We also used full words, including any that contained numbers and letters. Unfortunately, this technique led to lower classification performance, and therefore we do not report detailed results here. For protein features, the basic word extraction technique is the same as for short. However, we substitute the manually annotated protein names in the AImed corpus with placeholder strings enumerating each of the proteins in the sentence. Thus, in each sentence the first protein is named ptngne1, the second is ptngne2 and so on. This method effectively anonymises the proteins across the whole corpus, turning the sentences into patterns.
3.3
Kernel Construction
The target words used for the construction of the semantic matrices are the words occurring in the AImed data set. For BEAGLE the basis are all words that occur in the sentences with the target words, while in HAL the basis are the same as the target words. Some features that occur in AImed cannot be found in GENIA. During the construction of the HAL matrix we find some empty rows, which can cause problems during similarity calculations. We add a small scalar value to the entire matrix to avoid this problem. The baseline classification results were obtained with the k_c and k_r (as defined in Sect. 2) kernels directly on the sentence data from the AImed corpus, X = x_1, . . . , x_M. M is the number of sentences in X and N is the number of features, i.e. the length of the vectors x. The N × N HAL and BEAGLE word-similarity matrices were constructed using the semantic co-occurrence matrices generated from the GENIA corpus and transformed by the kernel functions, for example H_c = {k_c(h_i, h_j)}_{i,j=1}^{N}. The sentence-sentence kernels are then constructed so that they include the word similarity matrix, for example K_ij = x_i H_c x_j^T is the HAL + cosine kernel for sentence classification.
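Putting the pieces together, the following sketch builds the N × N word-similarity matrix from co-occurrence rows (e.g. HAL rows restricted to the AImed features) and then forms the sentence-sentence kernel K_ij = x_i S x_j^T. The kernel argument can be, for instance, the cosine_kernel function from the earlier sketch; eps stands in for the small scalar mentioned above for avoiding empty rows.

import numpy as np

def word_similarity_matrix(C, kernel, eps=1e-6):
    # C: N x B co-occurrence matrix with one row per feature word
    C = C + eps                      # avoid all-zero rows
    N = C.shape[0]
    S = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            S[i, j] = kernel(C[i], C[j])
    return S

def semantic_sentence_kernel(X, S):
    # X: M x N bag-of-words matrix; returns the M x M kernel with K_ij = x_i S x_j^T
    return X @ S @ X.T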
3.4
Experiment Description
In order to effectively use HAL and BEAGLE as kernels, we need to determine initial settings for the comparison experiment. We examined the effects of different distance metrics, parameters, and window sizes (L = 1 . . . 30) for HAL for several feature types on the AImed corpus. We investigated the effects that the number of dimensions, D, and the cosine and RBF distance metrics have on BEAGLE. In [11] it is claimed that if D is large enough, i.e. D > 1000, the lists of similar words produced do not change. Nevertheless, similarity values will make a difference in our experiments, so it is a parameter worth considering. We tested for D = {2048, 4096}. In Sect. 4 we report the observations gathered from these initial experiments and then present further experiments using the best results for each of the methods. The initial experiments for HAL encompassed a wide search space and as such were only ten-fold cross-validations. On the other hand, since the search space was much smaller, the final comparison results are an average of ten ten-fold cross-validations.
3.5
Evaluation Measures
Results were evaluated using the error (E), precision (P), recall (R), and F measures, which are defined in terms of true positives (tp), false positives (fp), true negatives (tn), and false negatives (fn) as follows: E = (fp + fn)/(tp + tn + fp + fn), P = tp/(tp + fp), R = tp/(tp + fn), F = 2·P·R/(P + R) [25]. The area under the receiver operator characteristic (ROC) curve is also employed as a standard measure. The ROC is a plot of the true positive rate vs. the false positive rate, and the larger the area under the curve (AUC) the better the performance of the classifier. When perfect classifier performance is achieved the AUC is 1. We also provide the average of the predictive likelihood (PL) for each of the cross-validation experiments.
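For reference, the four count-based measures translate directly into code; AUC and PL require the classifier scores and are omitted here.

def evaluate(tp, fp, tn, fn):
    E = (fp + fn) / (tp + tn + fp + fn)   # error
    P = tp / (tp + fp)                    # precision
    R = tp / (tp + fn)                    # recall
    F = 2 * P * R / (P + R)               # F-measure
    return E, P, R, F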
4 Experimental Results
4.1 Experimental Parameter Selection
We found that for the sentence classification without semantic information, the cosine kernel always gave a higher F-score than the RBF. Therefore, we use the results obtained using the cosine kernel as the baseline for comparison with the semantic kernels. The experiments to find the right parameters for the HAL kernel were conducted in two stages. Firstly, we found that the scalar value that is added to the matrix H, to prevent division by zero while performing similarity transformations, does not have any influence on classification. In addition, testing shows that RBF parameter θ makes little difference when the kernel is applied to the HAL and BEAGLE matrices. Next, we tested which of the similarity measures will give the highest classification results, for each of the window sizes. We found that the contents of the HAL matrix are highly influenced by the choice of window width parameter L.
The right choice of L and the similarity metric could give variations of over 5% in the F-score. We chose three sets of parameters for further experimentation: the ones that gave the highest F-score, the highest AUC, and the lowest error. Unlike HAL, the co-occurrence component of BEAGLE has only one parameter D, resulting in a smaller search space. In general, we found that for BEAGLE the length D of the signal vector e(t) has a lesser effect than the choice of similarity metric.
4.2
The Effects of HAL and BEAGLE on Target Words
The word-similarity lists that semantic spaces produce are difficult to evaluate quantitatively. For biomedical texts, there are no large-scale user-driven linguistic study results that could be used to evaluate these types of lists. For example, Table 1 shows lists of the most similar words to TNF from both the HAL and BEAGLE matrices as transformed by the two similarity metrics. It is obvious that there are differences in the lists; however, it is difficult to quantify which list is the best. TNF is a cytokine that is involved in several essential cellular processes and consequently it appears to be a key factor in many diseases including cancer. There are many studies that evaluate TNF interactions and their consequences. The different similarity lists appear to reflect some of the types of different articles written. For example, the BEAGLE matrix transformed by cosine, B_c, tends to weight highly the words that have to do with the function of TNF in different organs. This is supported by the fact that the words liver and kidney appear further down the list, at positions 11 and 18, respectively. The lists produced by the BEAGLE with RBF (B_r) and HAL with cosine (H_c) similarity matrices reflect more of a biomolecular experimental view, while the list from H_r appears to contain more words that would be found in clinical medical abstracts.
Table 1. Examples of the top ranked words similar to TNF (tumor necrosis factor). Definition of TNF from RefSeq: This cytokine is involved in the regulation of a wide spectrum of biological processes including cell proliferation, differentiation, apoptosis, lipid metabolism, and coagulation. This cytokine has been implicated in a variety of diseases, including autoimmune diseases, insulin resistance, and cancer.
BEAGLE Cosine  BEAGLE RBF   HAL Cosine   HAL RBF
tnf            tnf          tnf          tnf
capacities     treated      glutamic     slightly
architectu     cip          egg          fra
biofunctio     angiotensi   slightly     vector
shptp          testament    fra          progressio
myogenic       subjected    bind         hearts
increases      activated    uninfected   augmented
inhibitors     immunodefi   vector       indirectly
bcl            mol          progressio   searched
immobilize     transfecti   hearts       diagnosis
4.3
The Effects of HAL and BEAGLE on Sentences
When we examine the similarity vectors of individual words within the HAL and BEAGLE spaces we find that some words are highly similar to many other targets while others are only similar to themselves. Due to the way that each of the sentences is multiplied by the similarity vector, the sum of the similarity values for each of the target words becomes the key. For example, if we concentrate on the similarity space created from GENIA, using short features and the RBF similarity metric, we can observe the transformations that happen to a single sentence from the AImed corpus. So, from the sentence: We have identified a new TNF - related ligand , designated human GITR ligand ( hGITRL ) , and its human receptor ( hGITR ) , an ortholog of the recently discovered murine glucocorticoid - induced TNFR - related ( mGITR ) protein [ 4 ] .
we can extract the following vector x_1 represented by non-zero features: tnfr:1, tnf:1, discovered:1, designated:1, protein:1, glucocorti:1, ortholog:1, induced:1, recently:1, hgitrl:1, identified:1, receptor:1, hgitr:1, gitr:1, murine:1, ligand:2, human:2, related:2. In general, it would be highly correlated with other sentences that contain these same words in high proportions. However, after including the global knowledge encoded in the B_r kernel, we found that these values were greatly altered. If the sentence contains features that are related to many others the similarity with itself will be higher, but also these words will be boosted in significance when calculating the inner product with other sentence vectors. So for x_1, after transformation we got x_1 B_r x_1^T = 53.7142. The features in the sentence were weighted as follows: designated:1, receptor:1, hgitrl:1, protein:1, induced:1, gitr:1, ortholog:1, tnfr:1, tnf:1.0055, glucocorti:1.0492, hgitr:1.0533, human:4.0001, related:4.1569, identified:5.3208, murine:5.8166, discovered:5.8180, recently:5.8195, ligand:11.6744.
Word indices: (21) tnfr, (153) tnf, (216) ligand, (667) human, (1274) discovered, (1298) designated, (1430) protein, (1453) glucocorti, (1879) ortholog, (1977) induced, (2199) recently, (2551) hgitrl, (2780) identified, (2785) receptor, (2797) related, (2881) hgitr, (3079) gitr, (3207) murine.
Fig. 1. Re-weighting of words in a sentence by the BEAGLE and HAL kernels. This figure demonstrates the neutralisation of some features while others are given higher importance.
We can visualise this transformation in Fig. 1 for both the BEAGLE and HAL kernels. This is an example of an entry on the diagonal of the kernel, but the same calculations were made between any two sentences, e.g. x_1 B_r x_3^T = 23.3594.
4.4
The Effects of BEAGLE and HAL on Classification
Incorporation of semantic information from the HAL and BEAGLE matrices significantly increases the classification performance (Table 2). With the basic short features we find that the BEAGLE matrix with RBF similarity increases the F-score by nearly 8%. When employing protein features we see less of an improvement, though it is still statistically significant. Using HAL with RBF similarity leads to a 5% improvement in the F-score.
Table 2. Average results over ten ten-fold cross-validation experiments where the best settings for each of the methods were used. Two types of features were examined, plain words truncated to a maximum of ten letters (short) and the same feature set but with protein names replaced by placeholder strings (protein). The † indicates that all F-scores and AUCs are significantly different from all the other results using the same features.
Cosine kernel
  features: short, kernel: cosine
    †F=0.5384 ± 0.0049  E=23.1394 ± 0.2890  P=0.7186 ± 0.0065  R=0.4346 ± 0.0060  †AUC=0.7934 ± 0.0034  PL=0.0315 ± 0.0036
  features: protein, kernel: cosine
    †F=0.6789 ± 0.0043  E=18.6717 ± 0.2460  P=0.7258 ± 0.0056  R=0.6414 ± 0.0057  †AUC=0.8688 ± 0.0025  PL=0.1341 ± 0.0038
HAL kernel
  features: short, L: 8, kernel: H + RBF
    †F=0.5750 ± 0.0055  E=23.6515 ± 0.2850  P=0.6482 ± 0.0068  R=0.5197 ± 0.0060  †AUC=0.7820 ± 0.0034  PL=0.0241 ± 0.0047
  features: protein, L: 1, kernel: H + RBF
    †F=0.7267 ± 0.0040  E=16.3737 ± 0.2296  P=0.7514 ± 0.0055  R=0.7061 ± 0.0048  †AUC=0.8953 ± 0.0022  PL=0.2237 ± 0.0055
BEAGLE kernel
  features: short, D: 2048, kernel: B + RBF
    †F=0.6167 ± 0.0052  E=21.6869 ± 0.2566  P=0.6801 ± 0.0064  R=0.5671 ± 0.0059  †AUC=0.7997 ± 0.0033  PL=0.0555 ± 0.0049
  features: protein, D: 4096, kernel: B + cosine
    †F=0.7103 ± 0.0043  E=17.3131 ± 0.2535  P=0.7378 ± 0.0061  R=0.6880 ± 0.0051  †AUC=0.8895 ± 0.0022  PL=0.2110 ± 0.0055
4.5
Feature Re-weighting and Classification Performance
In order to understand the increase in performance we have to examine the effects of the kernels on the features. In general, the RBF kernel produces a sparser kernel with higher contrast, i.e. a sharper decline in similarity values. This can also be observed by examining the highest weighted word in the B_r matrix, asp, and one of the lowest weighted, protein. Their weight vectors are plotted in Fig. 2.
Fig. 2. Similarity calculations between the chosen words and the rest of the lexicon as calculated by the different kernels. This figure demonstrates the neutralising effect of the BEAGLE kernel on the high-frequency word protein.
Protein is one of the words that is generally considered to be an indicator of interactions. For example, [18] use a list of 83 discriminating words to score abstracts according to their presence or absence. Some of the top words they use are: complex, interaction, two-hybrid, interact, proteins, protein, domain, interactions, required, kinase, interacts, complexes, function, essential, binding, component, etc. We find that the B_r kernel actually reduces the weight for many of these words. For example, complex, interaction, interact, protein, binding, domain, kinase, complexes, and function all get multiplied only by a factor of 1. This implies that these words are only similar to themselves. However, other words including hybrid, proteins, required, interacts, essential, and component get multiplied by numbers orders of magnitude larger, for example 800, implying high similarity with many words. This has the effect of drastically reordering the significance of words in a way that cascades into the final sentence-sentence similarity space. When we examine the properties of the AImed corpus we can see the advantages of the B_r scaling. The most frequent words in the positive data are: binding, protein, receptor, interactio, il, beta, domain, complex, cells, human, cell, kinase, . . . , while the top negative words are: protein, receptor, cell, binding, cells, human, proteins, il, transcript, interactio, domain, expression, . . . Therefore we can gather that, actually, for this data there is a large intersection of positive and negative high-frequency words, and thus they are not very discriminative. On the other hand, the words that occur more in the positive data than in the negative data are: interacts, binds, complex, hsp, gp, ccr, cdk, . . . ; so, the higher weights assigned to these words improve classification.
5
Discussion
In this paper, we have presented a new method of integrating unlabelled data, via kernel substitution, in order to improve supervised classification performance. We use the unsupervised semantic models to combine word usage information from a large external corpus into the kernel space. With this method we are able to integrate data that does not necessarily come from the same distribution as the
training data, which is a requirement of traditional semi-supervised approaches. Integration of word co-occurrence data in this manner leads to almost an 8% improvement in the F-score on BOW features and a 5% improvement when using protein annotations in the feature set. This is the first time HAL and BEAGLE semantic spaces have been combined within a kernel classifier in this way. These models re-introduce the semantic links that had been originally lost through the choice of BOW features. By re-weighting the words in a sentence, these models emphasise terms that have many synonyms and thus are more interchangeable with terms that occur in other sentences. Therefore by equating semantically synonymous terms we were able to increase classification performance. The same type of improvement was observed when we artificially anonymised the proteins by substituting a placeholder string for a protein name. However, the proposed semantic models are unsupervised and not limited to handling only manually chosen entity types. These initial experiments introduce new avenues of research that can be undertaken to further explore unlabelled data integration through the kernel space.
Acknowledgements TP is funded by a Scottish Enterprise PhD studentship. MG is funded by an EPSRC Advanced Research Fellowship EP/E052029/1 and EPSRC project CLIMB EP/F009429/1.
References 1. Airola, A., Pyysalo, S., Bj¨ orne, J., Pahikkala, T., Ginter, F., Salakoski, T.: Allpaths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning. BMC Bioinformatics 9(suppl. 11) (2008) 2. Azzopardi, L., Girolami, M., Crowe, M.: Probabilistic hyperspace analogue to language. In: SIGIR 2005: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 575–576. ACM, New York (2005) 3. Bunescu, R., Ge, R., Kate, R.J., Marcotte, E.M., Mooney, R.J., Ramani, A.K., Wong, Y.W.: Comparative experiments on learning information extractors for proteins and their interactions. Artif. Intell. Med. 33(2), 139–155 (2005) 4. Burgess, C., Livesay, K., Lund, K.: Explorations in context space: Words, sentences, discourse. Discourse Processes 25, 211–257 (1998) 5. Burgess, C., Lund, K.: Modeling parsing constraints with high-dimensional context space. In: Language and Cognitive Processes, vol. 12, pp. 177–210 (1997) 6. Cohen, K.B., Fox, L., Ogren, P.V., Hunter, L.: Corpus design for biomedical natural language processing. In: Proceedings of the ACL-ISMB workshop on linking biological literature, ontologies and databases: mining biological semantics, pp. 38–45 (2005) 7. Donaldson, I., Martin, J., de Bruijn, B., Wolting, C., Lay, V., Tuekam, B., Zhang, S., Baskin, B., Bader, G.D., Michalickova, K., Pawson, T., Hogue, C.W.: PreBIND and Textomy - mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics 4(11) (2003)
Semi-supervised PPI Sentences Exploiting Semantically Encoded Metrics
281
8. Erkan, G., Ozgur, A., Radev, D.R.: Semi-supervised classification for extracting protein interaction sentences using dependency parsing. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 228–237 (2007) 9. Girolami, M., Rogers, S.: Variational bayesian multinomial probit regression with gaussian process priors. Neural Computation 18(8), 1790–1817 (2006) 10. Joachims, T.: Making large-Scale SVM Learning Practical. In: Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge (1999) 11. Jones, M.N., Kintsch, W., Mewhort, D.J.: High-dimensional semantic space accounts of priming. Journal of Memory and Language 55(4), 534–552 (2006) 12. Jones, M.N., Mewhort, D.J.K.: Representing word meaning and order information in a composite holographic lexicon. Psychological Review 114, 1–37 (2007) 13. Kim, J.D., Ohta, T., Tateisi, Y., Tsujii, J.: GENIA corpus–semantically annotated corpus for bio-textmining. Bioinformatics 19(suppl. 1), 180–182 (2003) 14. Krallinger, M., Leitner, F., Rodriguez-Penagos, C., Valencia, A.: Overview of the protein-protein interaction annotation extraction task of biocreative ii. Genome. Biol. 9(suppl. 2) (2008) 15. Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Processes 25, 259–284 (1998) 16. Lewis, D.D.: Naive (Bayes) at forty: The independence assumption in information retrieval. In: N´edellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 4–15. Springer, Heidelberg (1998) 17. Lund, K., Burgess, C.: Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instrumentation, and Computers 28, 203–208 (1996) 18. Marcotte, E.M., Xenarios, I., Eisenberg, D.: Mining literature for protein-protein interactions. Bioinformatics 17, 359–363 (2001) 19. Minier, Z., Bodo, Z., Csato, L.: Wikipedia-based kernels for text categorization. In: SYNASC 2007: Proceedings of the Ninth International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, Washington, DC, USA, pp. 157–164. IEEE Computer Society, Los Alamitos (2007) 20. Pad´ o, S., Lapata, M.: Dependency-based construction of semantic space models. Comput. Linguist. 33(2), 161–199 (2007) 21. Polajnar, T., Rogers, S., Girolami, M.: An evaluation of gaussian processes for sentence classification and protein interaction detection. Technical report, University of Glasgow, Department of Computing Science (2008) 22. Pyysalo, S., Ginter, F., Heimonen, J., Bj¨ orne, J., Boberg, J., J¨ arvinen, J., Salakoski, T.: Bioinfer: a corpus for information extraction in the biomedical domain. BMC Bioinformatics 8, 50 (2007) 23. Rogers, S., Girolami, M.: Multi-class semi-supervised learning with the - truncated multinomial probit gaussian process. Journal of Machine Learning Research Workshop and Conference Proceedings 1, 17–32 (2007) 24. Song, D., Bruza, P.D.: Discovering information flow using a high dimensional conceptual space. In: Proceedings of ACM SIGIR 2001, pp. 327–333 (2001) 25. Van Rijsbergen, C.J.: Information Retrieval, 2nd edn. Dept. of Computer Science, University of Glasgow (1979)
Classification of Protein Interaction Sentences via Gaussian Processes Tamara Polajnar, Simon Rogers, and Mark Girolami University of Glasgow, Glasgow, Scotland, G12 8QQ
[email protected] http://www.dcs.gla.ac.uk/inference/
Abstract. The increase in the availability of protein interaction studies in textual format coupled with the demand for easier access to the key results has led to a need for text mining solutions. In the text processing pipeline, classification is a key step for extraction of small sections of relevant text. Consequently, for the task of locating protein-protein interaction sentences, we examine the use of a classifier which has rarely been applied to text, Gaussian processes (GPs). GPs are a nonparametric probabilistic analogue to the more popular support vector machines (SVMs). We find that GPs outperform the SVM and naïve Bayes classifiers on binary sentence data, whilst showing equivalent performance on abstract and multiclass sentence corpora. In addition, the lack of the margin parameter, which requires costly tuning, along with the principled multiclass extensions enabled by the probabilistic framework make GPs an appealing alternative worthy of further adoption.
1
Introduction
Biomedical research information is disseminated through several types of knowledge repositories. The foremost mode of academic communication is peer reviewed journals where results are evaluated and reported in a structure primarily aimed for human consumption. Alternative sources provide this information in a distilled format that is often designed for purposes of increasing the availability of particular types of results. This is typically achieved by accelerating the speed of access, cross-referencing, annotating with extra information, or restructuring the data for easier interpretation by both humans and computer programs. These resources often link the results directly to the citation in MEDLINE1, a manually-curated publicly-available database of biomedical publication citations. Protein interactions, in particular, are a subject of many studies, the outcomes of which are stored in databases such as HPID2, MIPS3, and DIP4.
1 http://www.nlm.nih.gov/databases/databases_medline.html
2 Human Protein Interaction Database (http://www.hpid.org/)
3 Mammalian Protein-Protein Interaction Database (http://mips.gsf.de/proj/ppi/)
4 Database of Interacting Proteins (http://dip.doe-mbi.ucla.edu/)
The electronic availability of these resources has led to an increased interest in the automation of the process by which the relevant information is extracted from the original articles and entered into the specific knowledge repositories. We examine the task of locating sentences that describe protein-protein interactions (PPIs) using Gaussian processes (GPs) [35], a Bayesian analogue of the frequently applied support vector machine (SVM) [43] kernel-based classifier. PPI detection is one of the key tasks in biomedical TM [13]. Proteins are essential parts of living organisms that, through interactions with cellular components (including other proteins), regulate many functions of the life-cycle. Approaches to PPI detection vary greatly, spanning from information retrieval solutions to fully integrated parsing-based systems. For example, Chilibot is a search engine tool for finding PPIs in MEDLINE abstracts. Given a list of potential interactants, Chilibot first constructs a query that specifies combinations of the proteins, and then it processes the results to find interactions that co-occur in a sentence [9]. In a different approach, an automated pattern-based system described in [22] learns patterns from a corpus of example interaction sentences. On yet another track, a range of customised Bayesian methods is also available. For example, [33] present an approach that gives the likelihood that a MEDLINE abstract contains an interaction based on a dictionary of 80 discriminative words (e.g. complex, interaction, two-hybrid, protein, domain, etc.). [37] describe a Bayesian net model that is able to discriminate between multiple types of interaction sentences and detect protein entities at the same time. However, a non-probabilistic discriminative method has recently emerged as a highly-effective popular choice for PPI extraction. In the past ten years, SVMs have been frequently used for PPI sentence detection, where they have been proven to be highly effective [41,18]. In particular, the kernel has been used to manipulate the input knowledge. For example, [6], [1], and [18] use structural features derived from dependency parses of the sentences with graph kernels, while [21], for example, uses kernel combinations of context-based features. In a comparative study between several classifiers, including decision trees and naïve Bayes, [23] find that SVMs perform the best on their PPI detection data set. GPs are a Bayesian classification method analogous to the SVM that has rarely been applied to text classification; however, the probabilistic framework within which it is defined allows for elegant extensions that particularly suit TM tasks. For this reason we seek to evaluate GPs and compare them to the more frequently used SVMs and naïve Bayes (NB) [30] classifiers. Both GPs and SVMs are non-parametric, meaning that they scale with the number of training documents, learn effectively from data with a large number of features, and allow for more relevant information to be captured by the data. Likewise, the covariance function in the GP classifier corresponds to the kernel in the SVM algorithm, allowing for comparable data input and data transformations. Thus, while GPs have properties similar to SVMs [35, pp. 141–146] they have failed to attract the same kind of attention in the text processing community. They have been applied to a variety of other bioinformatics tasks, such as protein fold prediction [20,27] and biomarker discovery in microarray data [11]. GPs have also
been applied to text classification in a few instances. Online Gaussian processes [8] and Informative Vector Machines were investigated for multiple classes on the Reuters collection in [40]. In addition, GPs and SVMs were compared for preference learning on the OHSUMED corpus [12] and an extension of GPs for sequential data such as named entities was proposed by [4]. In this article we will investigate the detection of sentences that describe PPIs in biomedical abstracts using GP classification with bag-of-words [30] and protein named entity (NE) features. The advantage of simpler features is that the test data does not have to be parsed or annotated in order for the model to be applied. Likewise, the model is more resilient to annotation errors. For example, in the sentence below, taken from the AImed [6] corpus, the number of interactions was correctly annotated, but the main interacting protein IL-8 was marked in a way that is incorrect and grammatically difficult to process. The effect is that the subject protein of the sentence is no longer interacting with the object proteins. This work shows that single and double Ala substitutions of His18 and Phe21 in < prot> IL - 8 reduced up to 77 - fold the binding affinity to <prot> < p1 pair=1 >
<prot> IL - 8 receptor subtypes A (
<prot> CXCR1 p2> ) and B ( <prot> CXCR2 ) and to the <prot> Duffy antigen .
In addition, we consider only PPI sentence detection and not full PPI extraction. This is a simplified view that yields a higher precision-recall balance than extraction of interacting pairs. It is a method that is not sufficient for automatic database population, but may be preferable for database curation and research purposes. The whole original sentence is returned and thus would allow the direct application of end-user relevance and quality judgments. If these judgments were logged, the system could be retrained for individual users.
2
Background
Input into all three algorithms is a matrix representation of the data. In sentence classification, using a bag-of-words model, each sentence is represented as a row in the data matrix, X. Considering N documents containing M unique features, the ith document corresponds to the vector x_i = [x_{i1}, . . . , x_{iM}] where each x_{ij} is a count of how many times word j occurs in document i. These vectors are then used directly by the NB, while for the GPs and SVMs the kernel trick [2,5] is used to embed the original feature space into an alternative space where data may be linearly separable. The kernel function transforms the N × M input data to a square N × N matrix, called the kernel, which represents the similarity or distance between the documents. The principal difference between the approaches is in how the kernel is used; while SVMs use geometric means to discriminate between the positive and negative classes, GPs model the posterior probability distribution of each class.
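A small sketch of this representation, assuming sentences arrive as lists of tokens: the first function builds the N × M count matrix and the second maps it to the N × N kernel via any pairwise similarity function k.

import numpy as np

def bow_matrix(sentences):
    vocab = sorted({w for s in sentences for w in s})
    index = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(sentences), len(vocab)))
    for i, sent in enumerate(sentences):
        for w in sent:
            X[i, index[w]] += 1          # word counts
    return X, vocab

def kernel_matrix(X, k):
    N = X.shape[0]
    return np.array([[k(X[i], X[j]) for j in range(N)] for i in range(N)])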
SVMs have benefited from widely available implementations, for example the C implementation SVMlight [24], whose algorithm uses only a subset of the training data. However, informative vector machines (IVMs) [28,19], which are derived from GPs, now offer an analogous probabilistic alternative. A naïve implementation of SVM has a computational complexity O(N^3), due to the quadratic programming optimisation. However, with engineering techniques this can be reduced to O(N^2), or even more optimally, to O(ND^2) where D is the size of a much smaller set of carefully chosen training vectors [25]. Likewise, the GP has O(N^3) complexity; with techniques such as the IVM this can be reduced to the worst case performance of O(ND^2). On the datasets presented in this paper the difference for combined training and classification user time for GPs and SVMs was imperceptible.
2.1
Gaussian Process
Since it operates within a probabilistic framework, the GP classifier does not employ a geometric boundary and hence does not require a margin parameter. Instead, we use the GP framework to predict the probability of class membership for a test vector x_*. This is achieved via a latent function m(x), which is passed through a step-like likelihood function in order to be constrained to the range [0, 1], to represent class membership. The smoothness of m = {m_i = m(x_i) | x_i ∈ X} is regulated by a Gaussian process prior placed over the function and further specified by the mean and covariance functions. In other words, the model is described by the latent function m such that p(m) = N(m | 0, C), where C is analogous to the kernel function in the SVMs and would normally require some parametrisation. The function posterior is p(m | X, T) ∝ p(T | m) p(m | X). In GP regression this is trivial as both terms are Gaussian; however, in the classification case the non-conjugacy of the GP prior and the likelihood p(Y | m), which can be for example probit, makes inference non-trivial. In order to make predictions for a new vector x_*, we need to compute the predictive distribution p(t_* | x_*, X, T) = ∫ p(t_* | x_*, m) p(m | X, T) dm, which is analytically intractable and must be approximated. The strategy chosen to overcome this will depend on the likelihood function chosen (options include the logistic and probit functions). In this work, we follow [19] and use the probit likelihood, p(t_i = 1 | m_i) = Φ(m_i) = ∫_{−∞}^{m_i} N(z | 0, 1) dz, where the auxiliary variable trick [3] enables exact Gibbs sampling or efficient variational approximations.
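The variational probit treatment of [19] is what is actually used here; purely to make the mechanics concrete, the sketch below approximates the posterior over the latent function with the simpler Laplace (Newton) scheme and a logistic likelihood, for labels t ∈ {−1, +1} and a precomputed covariance matrix C. The choice of likelihood, the fixed iteration count and the neglect of the predictive variance are assumptions of the example, not of the paper.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gp_laplace_fit(C, t, iters=20):
    # Newton iterations for the mode of p(m | X, T) under a logistic likelihood
    n = len(t)
    m = np.zeros(n)
    for _ in range(iters):
        pi = sigmoid(m)
        W = np.diag(pi * (1.0 - pi))
        grad = (t + 1) / 2.0 - pi                         # d log p(t|m) / dm
        m = C @ np.linalg.solve(np.eye(n) + W @ C, W @ m + grad)
    return m, (t + 1) / 2.0 - sigmoid(m)                  # mode and gradient at the mode

def gp_predict(c_star, grad_at_mode):
    # c_star: covariances between a test point and the training points
    m_star = c_star @ grad_at_mode                        # predictive mean of the latent function
    return sigmoid(m_star)                                # class probability (variance ignored for brevity)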
2.2 Benefits of the Probabilistic Non-parametric Approach
The clear advantages of the probabilistic approach to classification have inspired attempts to develop probabilistic extensions of SVMs. For example, [34] proposed an ad-hoc mapping of SVM output into probabilities; however, this is not a true probabilistic solution as it yields probabilities that tend to be close to 0 or 1 [35, p. 145]. On the other hand, the GP output probabilities give a more accurate depiction of class membership that can be used to choose the optimal precision-recall trade-off for a particular problem or further post-processing for appropriate decision making.
The Bayesian framework also allows for additional mathematical extensions of the basic algorithm, such as multiple classes [35,19,38], sequential data [4], and ordinal classes [10]. One advantage of the particular Gaussian process classifier used in this paper is its ability to effectively handle unlabelled training data (semi-supervised learning in the multiclass setting [36]). This is especially useful in text classification since there is a wealth of unlabelled documents available, but annotation can be expensive. SVMs can also be used for semi-supervised learning [39]; however, difficulties often arise when multiple class data is used. There are theoretical extensions for SVMs but they are not as elegant as in the Bayesian case. For example, [29] demonstrate the use of multiclass SVM on cancer microarray data; however, the implementation is O(N^3 K^3) [14], where K is the number of classes. Thus most applications of SVM to multiple class problems use combinations of multiple binary classifiers; for example, two popular strategies are one vs. all and one vs. one. When using the former strategy one class is considered positive and the rest are negative, resulting in K classifiers, while in the latter approach each class is trained against each of the others, resulting in K·(K−1)/2 classifiers. For example, [16] use 351 SVM classifiers per feature space to predict 27 protein fold classes. For the same problem, [15] demonstrate how a single probabilistic multiclass kernel machine tailored to learn from multiple types of features for protein fold recognition can outperform a multiple classifier SVM solution.
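The classifier counts implied by the two decomposition strategies are easy to verify; the figures for the protein fold task and for the 25-class experiment reported later drop out directly.

def num_classifiers(K):
    one_vs_all = K
    one_vs_one = K * (K - 1) // 2
    return one_vs_all, one_vs_one

print(num_classifiers(27))   # (27, 351) binary SVMs per feature space in [16]
print(num_classifiers(25))   # (25, 300) for the 25-class experiment below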
3 Results
3.1 Corpora and Experimental Setup
We use three main data sets. AImed is a corpus of abstracts where each individual sentence is annotated for proteins and interactions. We also examine the properties of PreBIND [17], which is only annotated for the presence of interaction within an abstract. We use these two data sets in cross validation experiments to compare the classifiers. In addition we examine if it is possible to train on the minimally annotated PreBIND data set and still classify on the sentence level. Finally, we use the BioText corpus, which is a compilation of full-text articles, referenced in the HIV Human Protein Interaction Database and separated into several types of interactions, including interacts with, stimulates, inhibits, and binds [37]. This is used to compare the algorithms in the multiclass setting.
Kernel Settings. We used the cosine kernel k(x_i, x_*) = (x_i · x_*) / (|x_i| |x_*|) in all of the experiments. We also considered the Gaussian kernel, but found it did not increase the area under the ROC curve for either of the data sets (which was 0.83 for the SVM with both kernels, 0.67 for the GP with the Gaussian and 0.80 with the cosine kernel).
Evaluation Measures. Results were evaluated using the precision, recall, and F measures, which are defined in terms of true positives (tp), false positives (fp), true negatives (tn), and false negatives (fn): precision = tp/(tp + fp), recall = tp/(tp + fn), F = (2·precision·recall)/(precision + recall) [42]. The area under the receiver operator characteristic (ROC) curve is also employed as a standard measure. The ROC is a plot of the true positive rate vs. the false positive rate, and the larger the area under the curve (AUC) the better the performance of the classifier. We also use the information retrieval standard mean average precision (MAP) [31] measure to assess the quality of the top ranked results from each of the classifiers.
Features. Plain features were sequences of letters truncated at a maximum length of 10, with stop words removed. Stemming and term frequency-inverse document frequency (tf-idf) [32, pp. 541–544] word weighting were examined as alternative representations, but both led to a decrease in performance. We examined the effect of individual proteins on classification and found that anonymisation of protein names increased performance on sentence data but decreased it for the PreBIND corpus. The features were constructed so that protein names were replaced by a placeholder string ptngne concatenated with the sequential number of the protein in the sentence. For example in the following sentence: We have identified a new TNF - related ligand , designated human <prot> GITR ligand ( <prot> hGITRL ) , and its human receptor ( <prot> hGITR ) , an ortholog of the recently discovered murine <prot> glucocorticoid - induced TNFR - related ( <prot> mGITR ) protein [ 4 ] .
the extracted features are: identified ptngne1 designated ptngne2 ptngne2 human receptor ortholog recently discovered murine glucocorti induced tnfr related mgitr protein
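A rough sketch of this feature extraction, assuming the tokens of a sentence and the set of annotated protein tokens are already available (the <prot> markup is reduced here to a plain set for simplicity); words are lower-cased, truncated to ten characters, and protein names are replaced by the enumerated placeholder ptngneN.

def extract_features(tokens, protein_tokens, stop_words=frozenset(), max_len=10):
    features, protein_id = [], {}
    for tok in tokens:
        if tok in protein_tokens:
            if tok not in protein_id:
                protein_id[tok] = len(protein_id) + 1   # enumerate proteins per sentence
            features.append("ptngne%d" % protein_id[tok])
        elif tok.isalpha() and tok.lower() not in stop_words:
            features.append(tok.lower()[:max_len])      # truncate to ten letters
    return features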
3.2
Binary Results
The results in Table 1 show that in general the Bayesian methods are performing better on this task than the SVMs. NB has a consistently high F-score, mainly due to perfect recall. However, the precision is quite low, in turn influencing the accuracy and the AUC, both of which are significantly worse than GP and SVM across all of the cross-validation experiments. GP has a significantly higher AUC on plain features with the sentence data; however, on abstract data the difference between GPs and SVMs is not statistically significant. For AImed we found that using protein features increased the performance greatly regardless of whether they were gold standard annotations or automatically annotated NEs. The automatic annotation was done using the Lingpipe5 HMM NE tagger trained on the GENIA [26] corpus. We found that considering protein molecule (pm) features gave the highest quality of partial alignment between the annotations, which was still relatively low (P=0.8359, R=0.5937, and F=0.6943). However, in cross validation, for the PreBIND data set considering only pm features reduced performance, while also using protein family or group (pfg) had less of a detrimental effect.
5 http://alias-i.com/lingpipe/
Table 1. Results for NB, GPs, and SVMs ten-fold cross-validation experiment, repeated ten times. These are presented as F-score (F), accuracy (A), precision (P), recall (R), and area under the ROC (AUC), and include the standard error. The † symbol indicates that the paired t-test significance analysis shows that the difference between the indicated value and the corresponding values from the other two algorithms is significant (P-value < 0.05). In the feature column, NER pm indicates that we used entities labelled protein molecule as features, while pm+pfg indicates we also used entities labelled with protein family or group.
AIM, Plain
  NB:  †F=0.6785 ± 0.0080  †A=51.4009 ± 0.9111  †P=0.5140 ± 0.0091  †R=1.0000 ± 0.0000  †AUC=0.2894 ± 0.0076
  GP:  †F=0.6441 ± 0.0105  †A=77.1309 ± 0.7102  †P=0.6236 ± 0.0096  †R=0.6679 ± 0.0160  †AUC=0.7365 ± 0.0126
  SVM: †F=0.6014 ± 0.0130  †A=74.0353 ± 0.7717  †P=0.5744 ± 0.0118  †R=0.6336 ± 0.0194  †AUC=0.7030 ± 0.0139
AIM, annotated
  NB:  F=0.6915 ± 0.0108  †A=52.9561 ± 1.2742  †P=0.5296 ± 0.0127  †R=1.0000 ± 0.0000  †AUC=0.2617 ± 0.0158
  GP:  †F=0.7099 ± 0.0154  †A=81.0926 ± 0.8885  †P=0.6757 ± 0.0175  R=0.7518 ± 0.0210  †AUC=0.7898 ± 0.0102
  SVM: F=0.6872 ± 0.0178  †A=78.7958 ± 1.2361  †P=0.6350 ± 0.0184  R=0.7532 ± 0.0237  †AUC=0.7738 ± 0.0118
AIM, NER pm
  NB:  †F=0.7243 ± 0.0141  †A=56.9674 ± 1.7439  †P=0.5697 ± 0.0174  †R=1.0000 ± 0.0000  †AUC=0.2399 ± 0.0057
  GP:  †F=0.7117 ± 0.0087  †A=81.4798 ± 0.3983  †P=0.6878 ± 0.0133  †R=0.7413 ± 0.0159  †AUC=0.7886 ± 0.0075
  SVM: †F=0.6611 ± 0.0141  †A=78.1370 ± 0.7351  †P=0.6345 ± 0.0129  †R=0.6926 ± 0.0205  †AUC=0.7500 ± 0.0097
AIM, NER pm+pfg
  NB:  †F=0.6455 ± 0.0153  †A=47.8439 ± 1.6409  †P=0.4784 ± 0.0164  †R=1.0000 ± 0.0000  †AUC=0.3092 ± 0.0082
  GP:  †F=0.5925 ± 0.0180  †A=74.2450 ± 1.1850  †P=0.5876 ± 0.0259  R=0.6074 ± 0.0232  †AUC=0.6942 ± 0.0173
  SVM: †F=0.5556 ± 0.0075  †A=70.1948 ± 0.6240  †P=0.5196 ± 0.0133  R=0.6052 ± 0.0198  †AUC=0.6655 ± 0.0123
PB, Plain
  NB:  †F=0.8350 ± 0.0095  †A=71.7861 ± 1.4432  †P=0.7179 ± 0.0144  †R=1.0000 ± 0.0000  †AUC=0.3590 ± 0.0140
  GP:  F=0.8621 ± 0.0114  A=82.6097 ± 1.2976  P=0.8600 ± 0.0142  †R=0.8651 ± 0.0121  AUC=0.8069 ± 0.0157
  SVM: F=0.8547 ± 0.0091  A=81.7756 ± 1.1916  P=0.8656 ± 0.0165  †R=0.8453 ± 0.0041  AUC=0.8033 ± 0.0158
PB, NER pm
  NB:  †F=0.8141 ± 0.0074  †A=68.7152 ± 1.0689  †P=0.6872 ± 0.0107  †R=1.0000 ± 0.0000  †AUC=0.4131 ± 0.0170
  GP:  F=0.7187 ± 0.0148  A=64.2192 ± 1.6666  P=0.7166 ± 0.0197  R=0.7251 ± 0.0188  AUC=0.6128 ± 0.0213
  SVM: F=0.7264 ± 0.0115  A=65.1232 ± 1.0334  P=0.7205 ± 0.0119  R=0.7358 ± 0.0187  AUC=0.6239 ± 0.0124
PB, NER pm+pfg
  NB:  F=0.8461 ± 0.0073  †A=73.3874 ± 1.0987  †P=0.7339 ± 0.0110  †R=1.0000 ± 0.0000  †AUC=0.3390 ± 0.0161
  GP:  F=0.8535 ± 0.0099  A=81.4715 ± 1.1134  P=0.8530 ± 0.0131  R=0.8553 ± 0.0120  AUC=0.8009 ± 0.0196
  SVM: F=0.8575 ± 0.0130  A=82.0506 ± 1.5046  P=0.8585 ± 0.0125  R=0.8578 ± 0.0169  AUC=0.8163 ± 0.0217
Table 2. Mean average precision for top results of the cross-validation experiments with protein features. The † symbol indicates that the paired t-test significance analysis shows that the difference between the indicated value and the corresponding values from the other two algorithms is significant (P-value < 0.05).
No. of results   NB                 GP                 SVM
5                †0.1790 ± 0.0185   0.3063 ± 0.0273    0.2567 ± 0.0236
10               0.1870 ± 0.0147    0.2470 ± 0.0202    0.2267 ± 0.0193
30               0.1648 ± 0.0069    0.1910 ± 0.0177    0.1726 ± 0.0134
100              0.1367 ± 0.0027    0.1467 ± 0.0099    0.1399 ± 0.0085
When we examined the rankings of the documents in the sentence data set with pm features, we found that the top results returned by the GP are significantly better than those returned by NB, as evaluated by MAP (Sect. 3.1). The variance of the MAP measure is large, so that, even though the numbers appear vastly different, they are not statistically significant, except where indicated (Table 2). The quality converges as we consider more documents.
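One common formulation of the average precision underlying the MAP figures is sketched below; relevance is a 0/1 list in ranked order and the cut-off k corresponds to the 'No. of results' column of Table 2. Normalisation conventions for AP vary, so this is only indicative.

def average_precision(relevance, k):
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(runs, k):
    # runs: one ranked relevance list per cross-validation fold or query
    return sum(average_precision(r, k) for r in runs) / len(runs)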
Table 3. Cross-corpora experiment results for GPs and SVMs. Each row shows whether the classifiers were trained or tested on the PreBIND (PB) or the AImed (AIM) corpus and what features were used (plain bag-of-words, or HMM NER tagged). The results are presented as F-score (F), accuracy (A), precision (P), and recall (R).
Train  Test  Features   GP: F     A        P       R        SVM: F    A        P       R
PB     AIM   Plain      0.5425   50.7092   0.3814  0.9397   0.5674   59.4949   0.4242  0.8567
AIM    PB    Plain      0.2157   44.0476   0.9767  0.1212   0.5697   60.7143   0.9342  0.4098
PB     AIM   NER        0.7031   51.5981   0.5565  0.9544   0.6949   75.8147   0.5737  0.8811
AIM    PB    NER        0.1491   41.4835   0.9655  0.0808   0.6222   63.1868   0.8922  0.4776
3.3
Cross-corpus Evaluation
In this initial study we can observe that GPs learn from the abstract data better than from the sentence data, while for the SVMs it makes very little difference. While using PreBIND for training and AImed for testing we find that GPs have very high recall but low precision, leading to a low F-score. The area under the ROC curve (AUC), however, is the same between the two algorithms, 0.72. Using NER features increases the AUC to 0.79 for the GP and 0.82 for the SVM, a result that is also observable in the F-scores and accuracies. On the other hand, if we reverse the training and testing corpora, the precision-recall relationship is also inverted. This results in the AUC for both of the classifiers decreasing (from 0.75 to 0.70 for the GP and from 0.80 to 0.77 for the SVM), even though pm NER features still increase the SVM F-score. Considering the pm+pfg entities as proteins in PreBIND results in more effective training (as shown in Table 1), but in a smaller AUC increase (GP: 0.78, SVM: 0.79), and higher F-scores (F=0.4472, A=54.0241, P=0.9437, R=0.2930 for the GP and F=0.7420, A=29.6703, P=0.8277, R=0.6724 for the SVM). Thus, the choice of NER features that is more effective in cross validation for the training data leads to a stronger classification model, even when it is applied to data for which different settings are more applicable. This result is close to the AIM cross-validation results, which means that it is possible to annotate only abstracts, but still retrieve sentences with high accuracy. In summary, the abstract data is more conducive to training and the NER features have a positive effect given the correct choice of entities.
3.4
Multiclass Results
Results of the multi-class and semi-supervised extensions indicate that GPs are particularly well suited for biomedical text classification. In the ten-fold cross-validation experiment, repeated ten times, on multiclass data NB was significantly worse than GP and SVM, while there was no difference between GPs and SVMs. The F-score for NB is 0.7169 ± 0.0023, for GPs it is 0.7649 ± 0.021 and 0.7655 ± 0.0016 for SVM. However, the GP algorithm required one single classifier for all 25 classes [19], while the one vs. one SVM multiclass application [7] required K·(K−1)/2 classifiers. For the case of K = 25 classes, it required 300 classifiers. Moreover, the simple bag-of-words model without named entity tagging applied
here outperformed the model originally reported in [37]. Their graphical model only achieved 60% accuracy in classifying this data, although it also performed named entity recognition at the same time.
4
Conclusion
In this paper we have presented an extensive evaluation of the GP classifier for protein interaction detection in biomedical texts. Across the different experiments we can see that GPs either score higher than the SVMs, or that there is no significant difference between them. In the binary cross-validation experiments the NB has a high F-score, but a significantly lower AUC than either GPs or SVMs in all experiments. Likewise, in the binary experiments we demonstrated that using protein features increases classification performance regardless of whether proteins are identified manually or through automatic means. We have shown that the optimal choice of NE features can also improve cross-corpus classification even when applying a model to data with a greatly different distribution of positive to negative examples. In the multiclass setting we find the na¨ıve Bayes classifier accuracy is much lower than that of the GPs and SVMs, whose accuracies are not significantly different. In our evaluation, one multiclass GP is equivalent to a combination of 300 binary SVM classifiers. We believe that the flexibility of the probabilistic framework, the lack of a margin parameter, and the availability of the optimised IVM algorithm are factors that make GP methods an attractive and efficient alternative to SVMs.
Acknowledgements TP was funded by a Scottish Enterprise PhD studentship. SR and MG were funded by the EPSRC grant EP/E052029/1.
References 1. Airola, A., Pyysalo, S., Bj¨ orne, J., Pahikkala, T., Ginter, F., Salakoski, T.: Allpaths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning. BMC bioinformatics 9(suppl. 11) (2008) 2. Aizerman, A., Braverman, E.M., Rozoner, L.I.: Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control 25, 821–837 (1964) 3. Albert, J.H., Chib, S.: Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88(422), 669 (1993) 4. Altun, Y., Hofmann, T., Smola, A.J.: Gaussian process classification for segmenting and annotating sequences. In: ICML (2004) 5. Boser, B.E., Guyon, I., Vapnik, V.: A training algorithm for optimal margin classifiers. In: Computational Learning Theory, pp. 144–152 (1992) 6. Bunescu, R., Ge, R., Kate, R.J., Marcotte, E.M., Mooney, R.J., Ramani, A.K., Wong, Y.W.: Comparative experiments on learning information extractors for proteins and their interactions. Artif. Intell. Med. 33(2), 139–155 (2005)
7. Cawley, G.C.: MATLAB support vector machine toolbox (v0.55β). University of East Anglia, School of Information Systems, Norwich, Norfolk, U.K. NR4 7TJ (2000) 8. Chai, K.M.A., Chieu, H.L., Ng, H.T.: Bayesian online classifiers for text classification and filtering. In: SIGIR 2002: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 97–104. ACM Press, New York (2002) 9. Chen, H., Sharp, B.M.: Content-rich biological network constructed by mining pubmed abstracts. BMC Bioinformatics 5, 147 (2004) 10. Chu, W., Ghahramani, Z.: Gaussian processes for ordinal regression. Journal of Machine Learning Research 6, 1019–1041 (2005) 11. Chu, W., Ghahramani, Z., Falciani, F., Wild, D.L.: Biomarker discovery in microarray gene expression data with gaussian processes. Bioinformatics 21(16), 3385–3393 (2005) 12. Chu, W., Ghahramani, Z.: Preference learning with gaussian processes. In: Twentysecond International Conference on Machine Learning, ICML 2005 (2005) 13. Cohen, A.M., Hersh, W.R.: A survey of current work in biomedical text mining. Briefings in Bioinformatics 6(1), 51–71 (2005) 14. Crammer, K., Singer, Y.: On the algorithmic implementation of multiclass kernelbased vector machines. Journal of Machine Learning Research 2, 265–292 (2001) 15. Damoulas, T., Girolami, M.A.: Probabilistic multi-class multi-kernel learning: On protein fold recognition and remote homology detection. Bioinformatics (March 2008) 16. Ding, C.H., Dubchak, I.: Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics 17(4), 349–358 (2001) 17. Donaldson, I., Martin, J., de Bruijn, B., Wolting, C., Lay, V., Tuekam, B., Zhang, S., Baskin, B., Bader, G.D., Michalickova, K., Pawson, T., Hogue, C.W.: PreBIND and Textomy - mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics 4(11) (2003) 18. Erkan, G., Ozgur, A., Radev, D.R.: Semi-supervised classification for extracting protein interaction sentences using dependency parsing. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 228–237 (2007) 19. Girolami, M., Rogers, S.: Variational bayesian multinomial probit regression with gaussian process priors. Neural Computation 18(8), 1790–1817 (2006) 20. Girolami, M., Zhong, M.: Data integration for classification problems employing gaussian process priors. In: Sch¨ olkopf, B., Platt, J., Hoffman, T. (eds.) Advances in Neural Information Processing Systems, vol. 19, pp. 465–472. MIT Press, Cambridge (2007) 21. Giuliano, C., Lavelli, A., Romano, L.: Exploiting shallow linguistic information for relation extraction from biomedical literature. In: Proc. EACL 2006 (2006) 22. Hao, Y., Zhu, X., Huang, M., Li, M.: Discovering patterns to extract proteinprotein interactions from the literature: Part II. Bioinformatics 21(15), 3294–3300 (2005) 23. Huang, J., Lu, J., Ling, C.X.: Comparing naive bayes, decision trees, and svm with auc and accuracy. In: ICDM 2003: Proceedings of the Third IEEE International Conference on Data Mining, Washington, DC, USA, p. 553. IEEE Computer Society, Los Alamitos (2003) 24. Joachims, T.: Making large-Scale SVM Learning Practical. In: Advances in Kernel Methods - Support Vector Learning. MIT-Press, Cambridge (1999)
25. Keerthi, S.S., Chapelle, O., DeCoste, D.: Building support vector machines with reduced classifier complexity. Journal of Machine Learning Research 7, 1493–1515 (2006) 26. Kim, J.D., Ohta, T., Tateisi, Y., Tsujii, J.: GENIA corpus–semantically annotated corpus for bio-textmining. Bioinformatics 19(suppl. 1), 180–182 (2003) 27. Lama, N., Girolami, M.: Vbmp: variational Bayesian Multinomial Probit Regression for multi-class classification in R. Bioinformatics 24(1), 135–136 (2008) 28. Lawrence, N., Platt, J.C., Jordan, M.I.: Extensions of the informative vector machine. In: Winkler, J., Lawrence, N.D., Niranjan, M. (eds.) Proceedings of the Sheffield Machine Learning Workshop, Berlin. Springer, Heidelberg (2005) 29. Lee, Y., Lin, Y., Wahba, G.: Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association 99, 67–81 (2004) 30. Lewis, D.D.: Naive (Bayes) at forty: The independence assumption in information retrieval. In: N´edellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 4–15. Springer, Heidelberg (1998) 31. Manning, C.D., Raghavan, P., Sch¨ utze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008) 32. Manning, C.D., Sch¨ utze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge (1999) 33. Marcotte, E.M., Xenarios, I., Eisenberg, D.: Mining literature for protein-protein interactions. Bioinformatics 17, 359–363 (2001) 34. Platt, J.C.: Probabilities for SV Machines. In: Advances in Large Margin Classifiers, pp. 61–74. MIT Press, Cambridge (1999) 35. Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. MIT Press, Cambridge (2006) 36. Rogers, S., Girolami, M.: Multi-class semi-supervised learning with the - truncated multinomial probit gaussian process. Journal of Machine Learning Research Workshop and Conference Proceedings 1, 17–32 (2007) 37. Rosario, B., Hearst, M.: Multi-way relation classification: Application to proteinprotein interaction. In: Proceedings of HLT-NAACL 2005 (2005) 38. Seeger, M., Jordan, M.I.: Sparse gaussian process classification with multiple classes. Technical Report TR 661, Department of Statistics, University of California at Berkeley (2004) 39. Silva, Catarina, Ribeiro, Bernardete: On text-based mining with active learning and background knowledge using svm. Soft Computing 11(6), 519–530 (2007) 40. Stankovic, M., Moustakis, V., Stankovic, S.: Text categorization using informative vector machine. In: The International Conference on Computer as a Tool, EUROCON 2005, pp. 209–212 (2005) 41. Sugiyama, K., Hatano, K., Yoshikawa, S.U.M.: Extracting information on proteinprotein interactions from biological literature based on machine learning approaches. In: Gribskov, M., Kanehis, M., Miyano, S., Takagi, T. (eds.) Genome Informatics 2003, pp. 701–702. Universal Academy Press, Tokyo (2003) 42. Van Rijsbergen, C.J.: Information Retrieval, 2nd edn. Dept. of Computer Science, University of Glasgow (1979) 43. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995)
MCMC Based Bayesian Inference for Modeling Gene Networks Ramesh Ram and Madhu Chetty Gippsland School of IT, Monash University, Churchill, Victoria 3842, Australia {Ramesh.ram,Madhu.chetty}@infotech.monash.edu.au
Abstract. In this paper, we apply Bayesian networks (BN) to infer a gene regulatory network (GRN) model from gene expression data. This inference process, consisting of structure search and conditional probability estimation, is challenging due to the size and quality of the data that are currently available. Our previous studies of GRN reconstruction, involving an evolutionary search algorithm, obtained a most plausible graph structure referred to as an Independence-map (or simply I-map). However, the limitations of the data (a large number of genes and few samples) can result in many plausible structures that satisfy the data set equally well. In the present study, given the network structures, we estimate the conditional probability distribution of each variable (gene) from the data set to deduce a unique minimal I-map. This is achieved by using a Markov Chain Monte Carlo (MCMC) method whereby the search space is iteratively reduced, resulting in the required convergence within a reasonable computation time. We present empirical results on both synthetic and real-life data sets and also compare our approach with the plain MCMC sampling approach. The inferred minimal I-map on the real-life yeast data set is also presented. Keywords: Bayesian network, gene expression, MCMC, parameter estimation.
1 Introduction Cellular processes are controlled by gene-regulatory networks (GRN). The invention of DNA microarrays, which measure the abundance of thousands of mRNA targets simultaneously, has made way for several computational methods that are currently used to learn the structure of gene-regulatory networks. Some of the computational methods such as Boolean, multiple regression and Bayesian methods have been extensively explored and reviewed [1]. Bayesian networks [2-4] were first applied to the problem of inferring genetic networks from microarray expression data [5-7]. Bayesian networks are interpretable and flexible models for representing probabilistic relationships between multiple interacting genes. At a qualitative level, the structure of a Bayesian network describes the relationships between these genes in the form of conditional independence relations. At a quantitative level, relationships between the interacting genes are described by conditional probability distributions. The probabilistic nature of this approach is capable of handling both biological and technical noise
and makes the inference scheme robust and allows the confidence in the inferred network structures to be estimated objectively. However, the application of BN learning to gene expression data in understanding the mechanism of GRN is particularly hard because the data sets are very sparse, typically containing only a few dozen samples but thousands of genes. In this case, our goal is to devise computational methods that consistently identify causal and dependence relationships between expressions of different genes. The most common approach to learning of GRN based on BN consists of two separate problems: structure learning and parameter estimation. The structure of the network (gene-gene interaction) is unknown and the gene expression data is also incomplete. In such a case, one has to resort to structure search and the approximation of parameters. As the number of BN structures is astronomically large and the problem is NP-hard [8], these strategies for structure search have to be advanced and sophisticated. Further, since the sample size is small compared to the number of genes, there are many sub-optimal models that can fit the data equally well [9]. The essential edges which are present in the majority of these optimal structures are of significance and great importance from the point of view of GRN modeling. Having detected such edges (interactions) between pairs of genes from the inferred structures, important information from the biological literature is used to substantiate the findings. After the inferred structure is validated for biological plausibility, estimation of parameters (i.e. conditional probability distributions (CPD)) of the given GRN is carried out. Estimating CPDs involves specifying P(X | pa(X)) for each gene (variable) X, where the term pa(X) refers to the parents of variable X in the given structure. Assuming that the inferred structure G is an Independence-map (I-map) of a probability distribution P, we note that I(G) ⊆ I(P), where I(G) represents the independence assertions in graph G and I(P) represents the independence assertions in the probability distribution P. Since G is an I-map of P, P factorizes according to the joint probability distribution (JPD) given by equation (1).

P(X1, …, Xn) = ∏i P(Xi | pa(Xi))    (1)
The network is a pair (G, P) where G is specified in edges and P is specified in CPDs. With several optimal graphs G equally representing the distribution P, I(G) becomes a subset of I(P), as shown in Fig. 1 below, implying that we can obtain P(X1, …, Xn) from G. Once we obtain P, it is possible to deduce a unique minimal I-map G. Removing any edge from the minimal G then induces conditional independencies that do not hold in P. Unlike other methods [10, 11], we model the gene expression as continuous rather than discrete. Further, due to the high dimensional data, exact computation of the CPDs is infeasible and computationally expensive.
Fig. 1. Independence Assertions: I(G) ⊆ I(P)
Hence, the joint distribution can be approximated by stochastic simulation, commonly referred to as 'sampling'. Using Monte Carlo algorithms based on random sampling, we can fit a distribution to the data and retain the samples. However, random sampling from the GRN may not be the best strategy since the state space is enormous, with a large number of samples needed to approximate the probabilities reasonably well. One way of picking the most representative samples and increasing efficiency is to create a 'Markov chain' in which each sample is selected using the previous sample, resulting in the well-known Markov Chain Monte Carlo (MCMC) methods and their variants [12-15]. In this paper, we propose a new approach to approximate the conditional probability distributions of complex GRN models with the use of an MCMC method. The proposed approach is essentially based on two novel ideas. The first is an efficient computation of CPDs based on the ordered ranking of Markov Blankets (MB). We choose MBs for ranking because our earlier work using an MB scoring metric to search for a structure produced promising results. The genes with high-scoring MBs tend to be more accurate, allowing much faster convergence to the stationary distribution of the Markov chain. The second novelty is progressively reducing the space by clamping those variables whose samples have converged to a fixed distribution, thereby allowing convergence over a narrower region. Empirical results are presented to illustrate the superiority of the approach over direct MCMC and random sampling. Studies are performed using not only synthetic data sets (which allow variation of parameters) but also the real-life Saccharomyces cerevisiae (yeast) [16] microarray dataset. The rest of the paper is structured as follows. In Section 2, a brief overview of Bayesian learning and MCMC sampling is given. Section 3 elaborates on the system and methods of the proposed approach. Section 4 provides experiments and results. Finally, Section 5 has concluding remarks on the paper and some future work.
2 Background In this section, we briefly elaborate on the probability distribution and sampling for GRN, with a focus on Gibbs sampling, which is a type of Markov Chain Monte Carlo (MCMC) sampling. 2.1 Probability Distribution and Sampling a) Probability distribution: A GRN based on a Bayesian network specifies a probability distribution through a directed acyclic graph (structure) and a collection of conditional probability distributions (parameters), one for each gene Xi in the graph G. The graph G captures conditional independence relationships in its edges. A gene (node) is conditionally independent of all other genes (nodes) in the network given its Markov Blanket (parents, children, and children's parents). The probabilities summarize a potentially infinite set of circumstances that are not explicit in the model but rather appear implicitly in the probability. If each gene (variable) is influenced by at most k others and we have n random genes (variables), then we only need to specify n·2^k probabilities instead of 2^n. Succinctly, the conditional probability distribution shows the probability distributions over all values of gene X given the values of its parent genes. The conditional probability distribution of X=x given Y=y is given by equation (2).
P(x | y) = p(x, y)/p(y) = p(y | x) p(x)/p(y)    (2)
If genes x and y are independent, then P(x | y) = p(x) since p(x, y) = p(x) p(y). Equation (2) above is applied repeatedly to condition X on all parent genes of X. The parents of gene Xi are all those genes that directly influence gene Xi from the set of genes X1, …, Xi−1. Since large GRN models have many parameters, exact computation is intractable, and in such cases a simulation (sampling) technique becomes suitable for approximating the conditional distribution. The structure G, necessary for sampling, is obtained by applying a structure search over the entire space of all possible structures. Hence, given structure G with genes X = {X1, X2, …, Xn}, we can draw a sample from the joint probability distribution as follows: (i) instantiate randomly all except one of the genes, Xi; (ii) compute the probability distribution over the states of Xi, i.e. P(Xi | X1, …, Xi−1, Xi+1, …, Xn); (iii) from this probability distribution, randomly select a state of Xi. If all genes in the network except the gene Xi are instantiated, then, due to the factorization of the joint probability distribution, the full conditional for a given gene in the DAG involves only the subset of genes participating in its Markov blanket (i.e. the set of parents, children and other parents of the children of the gene).

P(Xi | X1, …, Xi−1, Xi+1, …, Xn) = P(Xi | MB(Xi))    (3)
Here, MB(Xi) is the Markov Blanket of gene Xi. Since gene Xi is independent of the rest of the genes in the network (except its Markov blanket), it is necessary to consider only the partial conditional, conditioning on the Markov blanket. Furthermore,

P(Xi | MB(Xi)) = P(Xi | Pa(Xi)) ∏ P(Yi | Pa(Yi))    (4)
Here, Yi, i = 1, …, k, are the children of gene Xi. b) Sampling: Sampling using Monte Carlo methods involves drawing n samples from the GRN with the instantiated genes fixed at their values as explained above. From these samples, the probability distributions are estimated based on the frequency of occurrence of the gene values. Since our model involves continuous expression values, we plot these samples as a histogram and then smooth the histogram to give the probability density function of the genes. The instantiation of the genes is done using the distribution available from the data set. However, due to the typically large number of genes in a GRN, random sampling methods are not suitable because they can be slow and the estimated posterior distribution may not be reliable. A Markov Chain Monte Carlo (MCMC) approach is suitable in such cases for approximating the difficult high-dimensional distributions. From amongst the many MCMC methods available, we chose the Gibbs sampler, which obtains samples asymptotically from the posterior distribution and can provide convergence in reasonable computation time. The Gibbs sampler is discussed in detail in the next section.
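To make equation (4) concrete, the following sketch (in Python) evaluates the Markov-blanket conditional of a single gene on a discretized grid of expression values for a toy linear-Gaussian network. The chain structure A → X → Y, the linear-Gaussian form of the CPDs and all parameter values are assumptions made purely for illustration; they are not part of the method described in this paper.

import numpy as np

def gaussian(x, mean, sd):
    # Density of a normal distribution, used here as a linear-Gaussian CPD.
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def mb_conditional(grid, a, y, w_ax=0.8, w_xy=1.2, sd=0.5):
    # Equation (4): P(X | MB(X)) is proportional to
    # P(X | Pa(X)) * product over children Y of P(Y | Pa(Y)).
    # Assumed toy structure: A -> X -> Y, so Pa(X) = {A} and child(X) = {Y}.
    p_x_given_parent = gaussian(grid, w_ax * a, sd)     # P(X | A = a)
    p_child_given_x = gaussian(y, w_xy * grid, sd)      # P(Y = y | X = grid value)
    unnormalized = p_x_given_parent * p_child_given_x
    return unnormalized / np.trapz(unnormalized, grid)  # normalize over the grid

if __name__ == "__main__":
    grid = np.linspace(-3.0, 3.0, 601)                  # expression range used in the paper
    density = mb_conditional(grid, a=1.0, y=0.5)
    print("mode of P(X | MB(X)):", grid[np.argmax(density)])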
2.2 Markov Chains and Gibbs Sampling An MCMC method such as the Gibbs sampler, which is applied for sampling probability distributions, is based on constructing a Markov chain. Next, we briefly present the concept of a Markov chain followed by the Gibbs sampling technique. a) Markov Chain: The Markov chain describes the probability of transitioning the variables from their current state s to the next state s' based on the transition probability q(s → s'). If the state distribution πt(s) describes the probability of the genes being in state s at the t-th step of the Markov chain, then the stationary (equilibrium, invariant) distribution π*(s) occurs when πt = πt+1, i.e.
π(s') = Σs π(s) q(s → s')   for all s'    (5)
We note that the stationary distribution also satisfies the detailed balance equation (6) given below.
π(s) q(s → s') = π(s') q(s' → s)   for all s, s'    (6)
No matter what the initial state distribution is, a Markov chain converges to a unique π*(s) if it fulfils the conditions of aperiodicity and irreducibility. The aperiodicity condition ensures that the chain cannot get trapped in strictly periodic cycles, while the irreducibility condition ensures that, from any state, there is a positive probability of visiting every other state (i.e. the state transition graph is connected). An aperiodic and irreducible Markov chain is called ergodic [17]: every state must be reachable from every other and there can be no strictly periodic cycles. Using Gibbs sampling, we aim to design a Markov chain whose stationary distribution is the target (desired) distribution, such that gene Xi quickly converges to the stationary distribution irrespective of the initial distribution. We then run the chain to produce samples, throwing away the initial (burn-in) samples as these are likely to be influenced by the initial distribution. The sampling method for the target distribution π* on χv constructs a Markov chain S0, S1, . . . , Sk, . . . with π*(s) as its equilibrium distribution. Since the distribution π*(s) is a unique equilibrium, and the Markov chain is ergodic, we have
π*(s) = lim n→∞ πn*(s) = lim n→∞ (1/n) Σ i=m+1..m+n χs(Si)   for all s    (7)
where n is the number of iterations. The state of the chain obtained after a large number of steps is then used as a sample, and its quality improves with the increase in the number of iterations. When a dynamic equilibrium is reached, the long-term fraction of time spent in each state is exactly its posterior probability for the given conditions. As the number of iterations tends towards infinity, all statistically important regions of the state space will be visited.
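As a worked illustration of equations (5)-(7), the sketch below iterates an arbitrary three-state transition matrix until the state distribution stops changing, and then reports how far the result is from satisfying detailed balance. The transition matrix is a made-up example; it is not derived from any GRN in this paper.

import numpy as np

# A small ergodic transition matrix q(s -> s'); each row sums to one (hypothetical example).
Q = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

pi = np.array([1.0, 0.0, 0.0])         # arbitrary initial state distribution
for _ in range(1000):                  # equation (5): pi_{t+1}(s') = sum_s pi_t(s) q(s -> s')
    new_pi = pi @ Q
    if np.allclose(new_pi, pi, atol=1e-12):
        break
    pi = new_pi

print("stationary distribution:", pi)
# Detailed balance (equation (6)) holds only for reversible chains; here we simply report the gap.
balance_gap = np.max(np.abs(pi[:, None] * Q - (pi[:, None] * Q).T))
print("max |pi(s)q(s->s') - pi(s')q(s'->s)|:", balance_gap)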
b) Gibbs sampling: To perform an MCMC simulation on a GRN where the target distribution is the joint probability distribution, we design a Markov chain where each state is a full joint instantiation of the distribution (i.e. values are assigned to all variables). Hence, a transition is a move from one joint instantiation to another. The target sampling distribution π*(x) of the GRN is the posterior joint distribution P(x | e), where x is the set of unknown variables and e is the set of evidence variables. It is typically the unknown we want to evaluate. Although sampling methods such as logic sampling [18], rejection sampling [19] and importance sampling [20] are available to sample P(x | e), in the absence of evidence e, or with the probability of the evidence being small (i.e. P(e) ≈ 0), these algorithms result in many wasted samples. Gibbs sampling overcomes these limitations as it specifically uses the conditional distribution P(s' | s) to define the state transition rules. In Fig. 2, an example of a Markov chain for a 4-gene GRN is shown. We have fixed the values of genes B and D, while genes A and C are varied to produce 4 states.
Fig. 2. Example Markov Chain for a toy 4-gene network. The genes B and D are instantiated as true while the genes A and C are false.
The working of the Gibbs sampling algorithm is shown by the flow chart in Fig. 3. Consider a GRN with n unknown variables X1, X2, …, Xn, which appears as input to the algorithm. We now recall that a gene Xi is independent of the rest of the network given the variables in the Markov blanket (MB) of Xi, i.e.

P(Xi | X1, …, Xi−1, Xi+1, …, Xn) = P(Xi | MB(Xi))    (8)
The Markov condition that a variable is independent of all other variables (except its neighbors) reduces significant computational overhead especially for large scale problems. Calculating P (Xi | MB (Xi)) can be done using equation (4) and equation (5). The initial states of all the variables can be chosen randomly or these can be chosen from the original small sample dataset. If the current state is X1 = x1, X2 = x2, . . . , Xn = xn, then we can sample a new value x’1 for X1 from P(X1|X2 = x2, . . . , Xn = xn). In similar manner, we can sample the remaining new values for X2, X3 …Xn until we have a new state X1 = x’1, X2 = x’2, . . . ,Xn = x’n. The initial samples are influenced by the
Fig. 3. Gibbs Sampling (flow chart; the loop reads: choose a starting state for each variable at random; select one variable at random; compute the posterior distribution over the states of Xi; select a state Xi from this distribution; replace the value of Xi with the selected state; add the values to the samples; stop after many cycles)
initial distribution. At every step, we weigh our selection towards the most probable sample using the transition probability so that the samples follow the most common states accurately. Moreover, as the process is ergodic (i.e. it is possible to reach every state), it ensures convergence to the correct distribution if a sufficient number of iterations is carried out. However, the application of Gibbs sampling to GRN estimation is somewhat limited due to the high-dimensional data, where the number of genes is significantly higher than the number of samples. This means that the variance in the values taken by a variable is high, can increase dramatically for thousands of genes, and may prohibit the production of independent uniform samples during sampling. The proposed new methodology, based on a novel Gibbs sampling scheme for the GRN estimation problem, can overcome this limitation.
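The Gibbs sampling loop of Fig. 3 can be sketched as follows for a toy continuous GRN, with gene values restricted to two decimal places over the range −3 to +3 as described later in Section 3.2. The three-gene chain, the Gaussian conditionals and the weights are illustrative assumptions only; the actual networks used in this paper come from the guided GA search.

import numpy as np

rng = np.random.default_rng(0)
GRID = np.round(np.linspace(-3.0, 3.0, 601), 2)    # two-decimal sampling precision

def normal_pdf(x, mean, sd=0.5):
    # Unnormalized Gaussian weight; the normalizing constant cancels below.
    return np.exp(-0.5 * ((x - mean) / sd) ** 2)

def conditional(gene, state):
    # Unnormalized Markov-blanket conditional for an assumed toy chain A -> B -> C.
    if gene == "A":
        return normal_pdf(GRID, 0.0) * normal_pdf(state["B"], 0.8 * GRID)
    if gene == "B":
        return normal_pdf(GRID, 0.8 * state["A"]) * normal_pdf(state["C"], 1.2 * GRID)
    return normal_pdf(GRID, 1.2 * state["B"])       # gene C has no children

def gibbs(n_samples=5000, burn_in=1000):
    state = {g: float(rng.choice(GRID)) for g in ("A", "B", "C")}
    samples = []
    for it in range(n_samples + burn_in):
        for gene in ("A", "B", "C"):                # visit every gene once per cycle
            weights = conditional(gene, state)
            weights /= weights.sum()
            state[gene] = float(rng.choice(GRID, p=weights))
        if it >= burn_in:                           # discard the burn-in samples
            samples.append(dict(state))
    return samples

samples = gibbs()
print("posterior mean of B:", np.mean([s["B"] for s in samples]))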
3 Methodology The proposed MCMC sampling scheme is shown in Fig. 4 below. In our earlier work, we employed a guided GA [21] search strategy with which we obtained a set of 10 dissimilar high-scoring network structures closely representing the probability distribution underlying the gene expression data. With the aid of the proposed methodology, we now calculate the Bayesian posterior probability distribution of all the variables (genes) of the ten gene network structures. From the samples drawn from the network structures, we can obtain the posteriors after convergence, and then determine the state sequence and probability estimates of the model in a straightforward manner. Although the inferred high-scoring network structures are disjoint (i.e. they cannot be combined into one network structure), they each independently correspond to the underlying probability distribution. Hence, all these network structures are sampled to estimate the probability distribution accurately. The important features of our approach are the use of high-scoring initial networks and a rank ordering of the network genes using Markov blankets. Convergence is obtained by running several Markov chains in parallel. Let us briefly discuss the major ‘components’ of the proposed method as they occur in Fig. 4.
Fig. 4. Proposed methodology (flow chart; boxes read: Original Data, Histogram, Smoothing, Gibbs Sampler, Rank MB, Sample variables with high ranking MB, Burn-in 1000 samples, Collect samples, Fix values to some variables, Convergence, P(X|Pa(X)))
3.1 Rank Ordering of the Variables As explained before, an ordinary Gibbs sampler (MCMC) chooses genes at random and then samples a new value from the estimated posterior given the neighbouring variables (i.e. the Markov Blanket variables). Friedman and Koller [9] argued that sampling from the space of (total) orders on variables, rather than directly sampling DAGs, is more efficient than applying ordinary MCMC directly in a random manner. In our previous work [22], the evaluation of a network structure was based on summing the scores of the individual genes' Markov Blankets. Since the Gibbs sampler also samples the new value of a gene based on its MB variables, we rank the Markov Blankets based on their scores. 3.2 Gibbs Sampler Before we proceed with the Gibbs sampling scheme, we need to specify a uniform prior distribution for all the genes in the domain. Rather than a random initial state of the network, we apply a standard prior, which is a multivariate Dirichlet distribution [9]. This distribution is assigned to the initial state distribution and also to the state transition distribution of the Markov chain. The initial distribution of the variables in the network (from which the initial state is sampled) is assigned using the density function estimated after smoothing the histogram of normalized gene expression data. Sampling is straightforward as there is no evidence in the network and is done by sampling each variable in the specified rank order. For nodes without parents, sampling is done from their initial distributions, while for nodes with parents, we sample from the conditional distribution given their MBs. Similarly, n independent and identically distributed samples are drawn from the target distribution P(x). Since the samples drawn are continuous (over the normal range of −3 to +3) rather than discrete, the sampling precision is restricted to two decimal places to reduce the complexity of the space. The samples collected are plotted using a histogram with n bins as shown in Fig. 4 above. The probability density function P(x) of a continuous variable (gene expression) x is
approximated by smoothing the histogram of samples, as shown in Fig. 4. Similarly, the conditional probability distribution of all variables is estimated. 3.3 Burn-In and Convergence The process of achieving the stationary probability distribution is called convergence, while the initial phase of convergence is called the ‘burn-in’ phase. In the proposed method, convergence is improved by running several parallel Markov chains, each using a different network structure representing the probability distribution as the starting point. The idea of running multiple chains using different Bayesian network structures is mainly to obtain samples from the entire sample space of the probability distribution underlying all the structures. The chains are merged together at a certain stage of the iterations and made into a single chain. During the process of multiple chain runs, samples are exchanged between the chains, and the overall samples of a number of variables at the top of the specified order are monitored for autocorrelation and a stationary distribution. A sample variation factor is introduced to determine the fraction of samples that go out of range. When the sample values do not exceed the variation factor after a significant number of iterations, we assume the samples have converged. From there onwards, the variable is clamped to the stationary value. This allows the sampling to be concentrated on the variables that are lower in the rank order. In our experiments we find that the rank ordering of variables, multiple Markov chain runs and clamping improve the mixing of samples for the unknown variables more efficiently than an ordinary MCMC approach.
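A hedged sketch of the bookkeeping described in Sections 3.2 and 3.3: the histogram of collected samples is smoothed into a density estimate, and a monitored gene is clamped once its recent samples stay within range. The window length, the out-of-range test and the smoothing kernel are arbitrary choices made for the example; the paper does not prescribe these particular values.

import numpy as np

def has_converged(trace, window=500, variation_factor=0.05):
    # Declare a gene converged when the fraction of recent samples falling far
    # outside the inter-quartile range of the window stays below the factor
    # (one possible reading of the "sample variation factor" described above).
    recent = np.asarray(trace[-window:])
    if recent.size < window:
        return False
    lo, hi = np.percentile(recent, [25, 75])
    spread = hi - lo
    out_of_range = np.mean((recent < lo - 1.5 * spread) | (recent > hi + 1.5 * spread))
    return out_of_range < variation_factor

def smoothed_density(samples, bins=60, kernel_width=3):
    # Histogram of the collected samples followed by a simple moving-average
    # smoothing, standing in for the density estimation step of Section 3.2.
    hist, edges = np.histogram(samples, bins=bins, range=(-3, 3), density=True)
    kernel = np.ones(kernel_width) / kernel_width
    return np.convolve(hist, kernel, mode="same"), edges

# Usage: once has_converged(trace_of_gene) is True, the gene is clamped to the
# stationary value of its recent samples and is no longer resampled.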
4 Experiments and Results The validation of new techniques by comparing with other GRN reconstruction methods becomes difficult and painstaking due to non-availability of suitable benchmark dataset. Furthermore, most methods work with discrete data or perform experiments on small toy networks which also makes comparisons difficult. For this reason, in this section we validate the methods performance by investigations of synthetic datasets. Incorporating realistic relationships in synthetic data is now well established and widely used for GRN models [23]. The general idea is to obtain a synthetic network from which synthetic time course data is generated. In order to conduct tests using synthetic data set [23], several datasets are created by varying network generator parameters. To the responses of artificial GRN, generated according to topological and kinetic properties of real GRN, biological noise and technical noise is added before calculating final artificial expression ratios. The presented method is compared with a plain MCMC method which does not incorporate the improvements. (i) Experiment 1 For the work reported in this paper, three 40 gene synthetic networks were arbitrarily generated with sample size of 100. From the set of reconstructed networks using the guided GA [21, 24] approach, we choose the first 10 high scoring networks. The probability distribution is then estimated using the proposed MCMC method. For each of the 10 structures of the networks, samples from the probability distribution
were obtained with MCMC, after discarding those from the burn-in phase. All simulations were repeated three times for different training data generated from synthetic networks. The results of experiments are summarized in Fig. 5. First we carry out single MCMC simulation runs instead of the proposed multiple parallel MCMC runs. From the estimated probabilities, a set of all edges whose posterior probability exceeds a given threshold θ Є [0, 1] is taken for comparison with the actual network. For a given threshold θ, we count the number of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) edges. We then compute the sensitivity = TP/(TP + FN), the specificity = TN/(TN + FP), and the complementary specificity = (1 − specificity) = FP/(TN + FP). Rather than selecting an arbitrary value for the threshold θ, we repeat this scoring procedure for several different values of θ Є [0, 1] and plot the ensuing sensitivity scores against the corresponding complementary specificity scores. This gives the receiver operating characteristic (ROC) curves of Fig. 5(a). The diagonal dashed line indicates the expected ROC curve for a random predictor. The ROC curve of Fig. 5(a), top left, shows that we can recover more than 80% of the true edges at approximately zero FP rate. We note that the ROC curve corresponds to the network structure obtained based on the estimated probability distribution and not based on the network reconstructed by our earlier GGA causal modeling method. The MCMC trace plot of the objective function versus cycle number for 1000 cycles for the synthetic network of 40 genes is shown in Fig. 5(b). For this plot, the joint probability distribution is considered as the evaluation criterion after every run. The plot shows good mixing with a very low burn-in period. The same synthetic dataset is repeated on a plain MCMC simulation which does not incorporate the presented improvements. The trace plot of the plain MCMC simulation is shown in Fig. 5(c) for 1000 cycles. It is clearly evident that the mixing is poor and the chain has a longer burn-in period. Also, the simulation oscillates around sub-optimal values of the objective function while the proposed method quickly reaches the higher values of the objective function, confirming that the proposed method is better than the simple MCMC. The proposed method is repeated for a 500-gene synthetic network dataset and its trace plot is shown in Fig. 5(d). This shows the method is easily scalable to thousands of genes, as is the case for gene expression data, at a comparatively feasible computational time. With sufficient improvements identified using single MCMC runs of the presented method over the plain MCMC method, we proceed to parallel MCMC runs as presented in Section 3.3. We obtained 3 different network structures for the same synthetic dataset using the GGA [21] search method and applied the network structures in parallel MCMC runs with exchange of samples. Fig. 5(e) shows the trace plot of 3 parallel MCMC runs where each chain corresponds to an individual network. From the results, it was found that the auto-correlation between the samples produced from the 3 chains was far apart during the initial 1000 samples, after which the correlation increased at 2000 cycles, which is an indication of convergence. The parallel runs uncovered the entire probability distribution.
Although the experiments on synthetic data are successful, the time series of 100 gene expression measurements is significantly larger than what is usually available from real world wet lab experiments; hence we also test the approach using real yeast dataset [16].
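The edge-scoring procedure used in Experiment 1 can be summarized by the following sketch, which sweeps the threshold θ over [0, 1] and compares the thresholded posterior edge probabilities against the known synthetic network. The adjacency-matrix representation and the threshold grid are assumptions made for the illustration.

import numpy as np

def roc_points(posterior, truth, thresholds=np.linspace(0.0, 1.0, 101)):
    # posterior: matrix of posterior edge probabilities; truth: 0/1 adjacency matrix of
    # the known synthetic network. Returns (1 - specificity, sensitivity) pairs.
    points = []
    for theta in thresholds:
        predicted = posterior >= theta
        tp = np.sum(predicted & (truth == 1))
        fp = np.sum(predicted & (truth == 0))
        fn = np.sum(~predicted & (truth == 1))
        tn = np.sum(~predicted & (truth == 0))
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        points.append((1.0 - specificity, sensitivity))
    return points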
Fig. 5. Simulation results: (a) ROC plot of sensitivity versus (1-specificity) for a synthetic dataset MCMC simulation. (b) Trace plot of proposed MCMC method on synthetic network of 40 genes (c) Trace plot of plain MCMC method on synthetic network of 40 genes (d) Trace plot on synthetic network of 500 genes (e) Trace Plot of Parallel MCMC runs on 3 different network structures.
(ii) Experiment 2 The real dataset is much more complex than the synthetic data. To demonstrate the performance of the MCMC approach and also as a practical application of the method on a real biological dataset, we consider the yeast dataset [16] containing 800 genes and 77 samples comprising of a comprehensive catalogue of cell cycle-regulated genes in the yeast Saccharomyces cerevisiae. The dataset includes three long time course expression values representing three different ways of synchronizing the normal cell cycle, and five shorter time courses representing the altered environmental conditions. These results were combined with those by Cho et al.,[25] to produce a
more comprehensive collection of data. The test samples were synchronized so that all the cells would be at the same stage in their cell cycle. Using this data set, gene networks have already been reconstructed (Friedman et al., 2001). We also note that the Spellman dataset has classified the 800 genes into different phases of the cell cycle, namely G1, G2, S, M/G1 and G2/M. Using the MCMC-based probability inference, the minimal I-map of the inferred yeast network is obtained. It is shown in Fig. 6 below.
Fig. 6. Yeast Gene Regulatory network Minimal I-map
From the reconstructed network shown in Fig. 6, we note the following interactions taking place, which are in agreement with the available literature [26].
1. MBF (a complex of MBP1 and SWI6) and SBF (a complex of SWI4 and SWI6) control the late G1 genes (e.g. CLN2 and NDD1).
2. MCM1, together with FKH1 or FKH2, recruits the NDD1 protein in late G2 and controls the transcription of G2/M genes.
3. SWI5 and ACE2 regulate genes at M/G1.
4. MCM1 regulates SIC1 and ACE2.
5 Conclusion A new Markov chain Monte Carlo approach using Gibbs sampling is presented for estimating the conditional probability distribution underlying gene regulatory network structures. The approach is novel as it performs a rank ordering of genes based on the Markov Blanket scoring metric, applies parallel Markov chains using different high-scoring starting network structures, and clamps genes which are higher in the order for faster and more efficient convergence. Rather than initializing the Markov chains with randomly chosen networks, our previously reported guided GA is used to generate the high-scoring initial networks and the probability distribution. Both synthetic and real-world yeast cell data sets have been applied in the investigations. The experiments on the synthetic data sets show that the proposed technique performs significantly better than the standard MCMC algorithm for estimating the probability distributions of the genes in the network. From the yeast cell cycle data experiment, we observe that the
minimal network derived using the real-life yeast dataset gives a more accurate reconstruction of regulatory interactions. However, due to the nature of the microarray data set, the resulting minimal GRN is not unique. With the integration of other related data, such as sequence analysis in the form of a prior probability, it would be worth investigating whether a unique minimal network representing the underlying structure of gene expression can be recovered. This is currently under investigation.
References 1. D’haeseleer, P., Liang, S., Somogyi, R.: Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics 16, 707–726 (2000) 2. Heckerman, D.: A Tutorial on Learning with Bayesian Networks. In: Jordan, M. (ed.) Learning in Graphical Models. MIT Press, Cambridge (1999) 3. Pearl, J.: Causality: Models, Reasoning and Inference. Cambridge University Press, Cambridge (2000) 4. Spirtes, P., Glymour, C., Scheines, R.: Causation, Prediction, and Search: Adaptive Computation and Machine Learning, 2nd edn. MIT Press, Cambridge (2001) 5. Friedman, N., Linial, M., Nachman, I., Pe’er, D.: Using Bayesian networks to analyze expression data. J. Comp. Biol. 7, 601–620 (2000) 6. Friedman, N.: Inferring cellular networks using probabilistic graphical models. Science 303, 799–805 (2004) 7. Nachman, I., Friedman, N.: Inferring quantitative models of regulatory networks from expression data. Bioinformatics 20, I248-I256 (2004) 8. Chickering, D.M.: Learning Equivalence Classes of Bayesian-Network Structures. Journal of Machine Learning Research 2, 445–498 (2002) 9. Friedman, N., Koller, D.: Being Bayesian about network structure: A Bayesian approach to structure discovery in Bayesian networks. Machine Learning 50, 95–126 (2003) 10. de Hoon, S.I., Kobayashi, K., Ogasawara, N., Miyano, S.: Inferring Gene Regulatory Networks From Time-Ordered Gene Expression Data Of Bacillus Subtilis Using Differential Equations. In: Pacific symposium on computation biology, vol. 8, pp. 17–28 (2003) 11. Murphy, K., Mian, S.: Modelling gene expression data using dynamic Bayesian networks, in Technical Report. University of California, Berkeley (1999) 12. Madigan, D., Andersson, S., Perlman, M., Volinsky, C.: Bayesian model averaging and model selection for Markov equivalence classes of acyclic graphs. Communications in Statistics: Theory and Methods 25, 2493–2519 (1996) 13. Madigan, D., York, J.: Bayesian graphical models for discrete data. International Statistical Review 63, 215–232 (1995) 14. Giudici, P., Green, P.J.: Decomposable graphical Gaussian model determination. Biometrika 86(4), 785–801 (1999) 15. Giudici, P., Green, P., Tarantola, C.: Efficient model determination for discrete graphical models, in Technical Report. Univ. Pavia, Italy (2000) 16. Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D., Futcher, B.: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell. 9, 3273–3297 (1998) 17. Liu, J.S.: Monte Carlo Strategies in Scientific Computing. Springer, Heidelberg (2001) 18. Henrion, M.: Practical issues in constructing a Bayes belief network. Int. J. Approx. Reasoning 2(3), 337 (1988)
19. Gilks, W.R., Wild, P.: Adaptive Rejective Sampling for Gibbs Sampling. Applied Statistics 41, 337–348 (1992) 20. Shachter, R.D., Peot, M.A.: Simulation approaches to general probabilistic inference on belief networks. In: Uncertainty in Artificial Intelligence, vol. 5, pp. 221–231 (1989) 21. Ram, R., Chetty, M.: A Guided genetic algorithm for Gene Regulatory Network. In: IEEE Congress on Evolutionary Computation, Singapore (2007) 22. Ram, R., Chetty, M.: Constraint Minimization for Efficient Modeling of Gene Regulatory Network. In: Chetty, M., Ngom, A., Ahmad, S. (eds.) PRIB 2008. LNCS (LNBI), vol. 5265, pp. 201–213. Springer, Heidelberg (2008) 23. Ram, R., Chetty, M.: Generating Synthetic Gene Regulatory Networks. In: Chetty, M., Ngom, A., Ahmad, S. (eds.) PRIB 2008. LNCS (LNBI), vol. 5265, pp. 237–249. Springer, Heidelberg (2008) 24. Ram, R., Chetty, M.: A Markov blanket based Probabilistic Genetic Algorithm for Causal Reconstruction of Gene Regulatory Networks. BioSystems Special Issue on Evolving Gene Regulatory Networks (submitted, 2009) 25. Cho, R.J., et al.: A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell 2, 65–73 (1998) 26. Chen, K.C., Calzone, L., Csikasz-Nagy, A., Cross, F.R., Novak, B., Tyson, J.J.: Integrative analysis of cell cycle control in budding yeast. Mol. Biol. Cell 15, 3841–3862 (2004)
Efficient Optimal Multi-level Thresholding for Biofilm Image Segmentation
Darío Rojas 1, Luis Rueda 2, Homero Urrutia 3, and Alioune Ngom 2
1 Department of Computer Science, University of Atacama, 485 Copayapu Ave., Copiapó, Chile [email protected]
2 School of Computer Science, University of Windsor, 401 Sunset Ave., Windsor, ON, N9B 3P4, Canada {lrueda,angom}@cs.uwindsor.ca
3 Biotechnology Center and Faculty of Biological Sciences, University of Concepción, Barrio Universitario, Concepción, Chile [email protected]
Abstract. A microbial biofilm is structured mainly by a protective sticky matrix of extracellular polymeric substances. The appreciation of such structures is useful for the microbiologist and can be subjective to the observer. Thus, quantifying the underlying images in useful parameters by means of an objective image segmentation process helps substantially to reduce errors in quantification. This paper proposes an approach to segmentation of biofilm images using optimal multilevel thresholding and indices of clustering validity. A comparison of automatically segmented images with manual segmentation is done through different thresholding criteria, and clustering validity indices are used to find the correct number of thresholds, obtaining results similar to the segmentation done by an expert.
1 Introduction
A biofilm is a complex aggregation of microorganisms that live upon surfaces, structured mainly by the secretion of a protective sticky matrix of extracellular polymeric substances. In order to understand biofilm structures, scanning electronic microscopy (SEM), confocal laser scanning microscopy (CLSM), and optical microscopy (OM) are currently used [1]. The appreciation of such structures in digital images can be subjective to the observer [2], and hence it is necessary to quantify the underlying images in useful parameters for the microbiologist. Automatic segmentation is crucial in this regard, which, if done correctly, does not propagate errors of appreciation into image quantification. However, the evaluation of automatic segmentation algorithms is subjective, leaving to the designer the responsibility to judge the effectiveness of the technique based only on their intuition and the results of some examples of image segmentation. To solve this problem, the work presented in [3] demonstrated the effectiveness of the normalized probabilistic Rand's Index (RI), which can be used to make a
quantitative comparison between different algorithms for segmentation by means of a set of manually segmented images. The RI index was first presented in [4], and is based on the original work of [5]. On the other hand, in [6,7], two novel methods called COMSAT and PHLIP were proposed, which are able to quantify the characteristics of biofilms obtained through CLSM image stacks. They both use Otsu’s thresholding criterion [8] for image segmentation, but no further studies have been performed regarding the efficiency of the segmentation. Another work related to quantifying the parameters of biofilm structures was presented in [2], and a detailed explanation of the quantification methods can be found in [9]. The algorithms for segmentation used by these approaches are also based on Otsu’s criterion and an iterative method for finding thresholds that was proposed in [2]. In the same context, in [10], a review of several automatic thresholding algorithms for biofilm segmentation was presented, including local entropy, joint entropy, relative entropy, Renyi’s entropy and iterative selection, but none of these methods were used to evaluate criteria for multi-level thresholding. Furthermore, the method for evaluating the differences between segmented and original images is the sum of squares of relative residuals (MSSRR). This method takes threshold values into account but not the differences between images. In addition, the approach of [10] is not capable of comparing the results of images segmented with multi-level thresholding algorithms or images segmented with different numbers of thresholds. All approaches for automatic image segmentation based on thresholding proposed so far do not allow to segment different kinds of biofilm images optimally and without the intervention of an expert. In this paper, an approach to segmentation of biofilm images is proposed, based on optimal multi-level thresholding, which is carried out in polynomial time. Also, clustering validity indices are used for finding the best number of thresholds automatically.
2 The Proposed Method
A method for segmentation of biofilm images was implemented through an efficient optimal multi-level thresholding algorithm. Different thresholding criteria and clustering validity indices were implemented for measuring the performance of segmentation methods and the determination of the best number of thresholds, respectively.
2.1 Polynomial-Time Optimal Multi-level Thresholding
In [11], a polynomial-time algorithm for multi-level thresholding was proposed. This algorithm is polynomial not just in the number of bins of the histogram, but also in the number of thresholds. Moreover, it runs in polynomial time independently of the thresholding criterion. In [11], we defined the optimal solution by searching for an optimal set of thresholds, T = {t0, . . . , tk}, that maximizes a function Ψ as follows:

Ψ(T) = Σ_{j=1}^{k+1} ψ_{t_{j−1}+1, t_j}    (1)
where Ψ : P^k × [0, 1]^n → R+, k is the number of thresholds, P = {p1, . . . , pn} are the probabilities of the bins in the histogram, n is the number of bins in the histogram, and the function ψ_{t_{j−1}+1, t_j} : P^2 × [0, 1]^{t_j − t_{j−1}+2} → R+ ∪ {0} (where t_j is the j-th threshold of T) must satisfy the following conditions:
1. For any histogram P and any threshold set T, Ψ > 0 and ψ ≥ 0.
2. For any m, 1 ≤ m ≤ k + 1, Ψ({t0, ..., tm}) can be expressed as Ψ({t0, ..., t_{m−1}}) + ψ_{t_{m−1}+1, t_m}.
3. If ψ_{t_{j−1}+1, t_j} is known, then ψ_{t_{j−1}+2, t_j} can be computed in O(1) time.
The three thresholding criteria are defined as follows:

Otsu's (OTSU): ψ_{t_{j−1}+1, t_j} = ω_j μ_j^2    (2)

Minimum Error (MINERROR): ψ_{t_{j−1}+1, t_j} = 2ω_j {log σ_j + log ω_j}    (3)

Entropy-based (ENTROPY): ψ_{t_{j−1}+1, t_j} = − Σ_{i=t_{j−1}+1}^{t_j} (p(i)/ω_j) log(p(i)/ω_j)    (4)

where t_j is the j-th threshold of T, ω_j = Σ_{i=t_{j−1}+1}^{t_j} p(i), μ_j = (1/ω_j) Σ_{i=t_{j−1}+1}^{t_j} i p(i), and σ_j = (1/ω_j) Σ_{i=t_{j−1}+1}^{t_j} p(i)(i − μ_j)^2. It is important to note that biofilm images lead to sparse histograms (many bins have zero probabilities), and so, for the sake of efficiency, the algorithm for irregularly sampled histograms as presented in [11] is implemented in our work.
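For illustration, the sketch below implements optimal multi-level thresholding by dynamic programming over an additive criterion of the form of equation (1), using Otsu's term of equation (2). It follows directly from condition 2 above, but it is a generic O(kn²) formulation and does not reproduce the specific algorithm of [11]; in particular, it ignores the speed-up for irregularly sampled histograms.

import numpy as np

def otsu_term(p, lo, hi):
    # psi_{lo,hi} = omega * mu^2 for bins lo..hi inclusive (equation (2)).
    probs = p[lo:hi + 1]
    omega = probs.sum()
    if omega == 0.0:
        return 0.0
    mu = np.dot(np.arange(lo, hi + 1), probs) / omega
    return omega * mu * mu

def optimal_thresholds(p, k):
    # D[m][t] = best value of Psi when bins 0..t are split into m classes.
    n = len(p)
    D = np.full((k + 2, n), -np.inf)
    arg = np.zeros((k + 2, n), dtype=int)
    for t in range(n):
        D[1][t] = otsu_term(p, 0, t)
    for m in range(2, k + 2):                      # m classes use m - 1 thresholds
        for t in range(m - 1, n):
            for s in range(m - 2, t):              # s is the previous threshold (condition 2)
                value = D[m - 1][s] + otsu_term(p, s + 1, t)
                if value > D[m][t]:
                    D[m][t], arg[m][t] = value, s
    thresholds, t = [], n - 1                      # backtrack the k thresholds
    for m in range(k + 1, 1, -1):
        t = arg[m][t]
        thresholds.append(t)
    return sorted(thresholds)

# Example usage: p is a normalized histogram and k the desired number of thresholds.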
2.2 Optimal Number of Thresholds
The optimal thresholding algorithm discussed above is not able to determine the number of thresholds, k, into which the image can be segmented correctly. However, k has a direct relationship with the number of classes, k + 1, into which a histogram is partitioned. Viewing thresholding as a problem of clustering pixel intensities, clustering validity indices can be used to obtain the best number of classes, k + 1, and hence the number of thresholds. In this work, the Davies-Bouldin Index (DB) is used, which is defined as the ratio between the within-cluster scatter and the between-cluster scatter. The goal is to minimize the value of the DB function, defined as [12]:

DB = (1/(k+1)) Σ_{i=1}^{k+1} max_{1≤j≤k+1, j≠i} (S_i + S_j)/d_{ij}    (5)

where k + 1 is the number of clusters, S_j = (1/|ζ_j|) Σ_{i=t_{j−1}+1}^{t_j} p(i) ||i − μ_j|| is the within-cluster scatter of cluster ζ_j, and d_{ij} = ||μ_i − μ_j|| is the distance between clusters ζ_i and ζ_j. Other validity indices [12] such as Dunn's index (DN), Calinski-Harabasz's index (CH) and index I (IndexI) were also evaluated to compare the results.
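The Davies-Bouldin index of equation (5) can be evaluated directly on a histogram partitioned by a threshold set, as in the sketch below. Interpreting the cluster size |ζj| as the total probability mass of the cluster is an assumption made for this example.

import numpy as np

def davies_bouldin(p, thresholds):
    # p: normalized histogram; thresholds: sorted list of k thresholds (bin indices).
    # Assumes every resulting cluster has non-zero probability mass.
    bounds = [-1] + list(thresholds) + [len(p) - 1]
    mus, scatters = [], []
    for lo, hi in zip(bounds[:-1], bounds[1:]):            # cluster j = bins lo+1 .. hi
        idx = np.arange(lo + 1, hi + 1)
        mass = p[idx].sum()
        mu = np.dot(idx, p[idx]) / mass
        scatter = np.dot(p[idx], np.abs(idx - mu)) / mass  # S_j, probability-weighted
        mus.append(mu)
        scatters.append(scatter)
    k1 = len(mus)
    ratios = []
    for i in range(k1):
        worst = max((scatters[i] + scatters[j]) / abs(mus[i] - mus[j])
                    for j in range(k1) if j != i)
        ratios.append(worst)
    return float(np.mean(ratios))                          # equation (5): lower is better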
2.3 Manual Segmentation vs. Automatic Segmentation
As discussed previously, biofilm structures contain information about phenotypic effects of bacteria. Many studies [13,14,15,16] indicate that the structural heterogeneity of a biofilm can affect its dynamic activity and functional properties. In order to obtain this structural information, the segmentation of images in an unsupervised and optimal form (for an objective comparison and reproducibility of experiments) is an important component in any image analysis model for biofilm quantification. In this context, some clustering algorithms used for segmentation do not guarantee an optimal solution, because they require the specification of good initial cluster centers for correct convergence and the solution can be a local optimum [17]. Segmentation approaches such as region growing methods are more tolerant to noise; however, they require the specification of an appropriate initial set of seeds to produce accurate results [18]. Segmentation methods such as image filtering may miss fine but important details of confocal images and hence are only recommended for image visualization, not for image quantification [2]. To determine the best thresholding criterion, a similarity index of partitions is used, namely the Probabilistic Rand Index (RI), which is the percentage of pairs for which there is an agreement. Let L = {l1, ..., lN} and L' = {l'1, ..., l'N} be the ordered sets of labels li and l'i, respectively, for each pixel 1 ≤ i ≤ N of the two segmented images to be compared; the RI index is defined as follows:

RI(L, L') = (1 / C(N,2)) Σ_{i<j} [I(li = lj ∧ l'i = l'j) + I(li ≠ lj ∧ l'i ≠ l'j)]    (6)

where I is the indicator function and C(N,2) is the total number of pairs among the N pixels. This index takes a value of one when L and L' are equal, and zero if they do not agree on anything at all. The best technique for automatic segmentation of biofilm images was found experimentally, using a method that combines optimal multi-level thresholding and clustering validity indices. Biofilm images were acquired from optical and confocal microscopy, and the histogram for each image was computed. Manual multi-level segmentation was performed by an expert, by means of a trial and error process, in order to determine the best number of thresholds, k, and set of thresholds, T. Furthermore, each original image was automatically segmented, by means of the optimal multi-level thresholding algorithm proposed in [11], using the three thresholding criteria mentioned above, and for several values of k. Finally, clustering validity indices were computed for each image that was segmented automatically, and the RI index was calculated for each segmented image, by means of manual and automatic thresholding, in order to determine the best thresholding criteria.
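Equation (6) need not be computed by enumerating all pixel pairs; the standard contingency-table identity for the Rand index gives the same value in time linear in the number of pixels. The sketch below uses that identity and is not necessarily the implementation used by the authors.

import numpy as np
from math import comb

def rand_index(labels_a, labels_b):
    # labels_a, labels_b: non-negative integer label arrays of the two segmentations (length N).
    a = np.asarray(labels_a).ravel()
    b = np.asarray(labels_b).ravel()
    n = a.size
    contingency = np.zeros((a.max() + 1, b.max() + 1), dtype=np.int64)
    np.add.at(contingency, (a, b), 1)
    sum_cells = sum(comb(int(c), 2) for c in contingency.ravel())
    sum_rows = sum(comb(int(c), 2) for c in contingency.sum(axis=1))
    sum_cols = sum(comb(int(c), 2) for c in contingency.sum(axis=0))
    total = comb(n, 2)
    agree_same = sum_cells                                 # pairs joined in both segmentations
    agree_diff = total - sum_rows - sum_cols + sum_cells   # pairs separated in both
    return (agree_same + agree_diff) / total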
3 Experimental Results
A dataset of 649 images was used to perform the evaluation of the proposed biofilm segmentation approach. These images were obtained as follows. Mature
Table 1. Dataset for determining the best combination of techniques

k   No. of images   Microscopy   Resolution
1   616             Confocal     512 × 512
2   10              Optical      1040 × 1392
3   10              Optical      1040 × 1392
4   6               Optical      1040 × 1392
5   6               Optical      1040 × 1392
6   1               Optical      1040 × 1392
biofilms of Pseudomonas syringae strains were developed within the Biofilm and Environmental Microbiology Laboratory (http://www.udec.cl/~bem-lab/). These biofilms were then scanned using CLSM and OM, generating stacks of images that represent the three-dimensional structure of the biofilms. The images were segmented individually, manually and automatically. Table 1 shows the features of each image (images are in 12-bit grayscale) with its number of thresholds, k, found manually by an expert. In order to avoid any bias introduced by the differences in the nature of the images, the dataset was divided into two subsets of images: the set of images which are best segmented with one threshold (obtained from CLSM), and the set of images which are best segmented with more than one threshold (obtained from OM).
3.1 Performance of Thresholding Criteria
The best thresholding criterion was found by using the RI index comparing manual vs. automated segmentation, and by using the following notation: RIall is the RI index for all image datasets, RIclsm is the RI index for images with one threshold found manually, and RIop is the RI index for images with more than one threshold found manually. In Table 2, the resulting values for the RI index are depicted for all image subsets. It is clear that ENTROPY is the best criterion for segmenting images with one threshold. On the other hand, OTSU is the best criterion for segmentation of images with more than one threshold. Overall, the ENTROPY criterion achieved the best performance for all image datasets.

Table 2. RI index for different image subsets (number of thresholds found manually)

Index and Dataset   OTSU     ENTROPY   MINERROR
RIop                0.7897   0.7300    0.7713
RIclsm              0.7283   0.7767    0.6086
RIall               0.6184   0.7566    0.5846
Table 3. MSSRR and R2 for different thresholding criteria (CLSM images)

        OTSU    ENTROPY   MINERROR
MSSRR   3.851   0.827     10.25
R2      0.71    0.76      0.62
Additionally, for images obtained from CLSM, the mean of the sum of squared relative residuals (MSSRR) and the correlation R2 were calculated in order to evaluate the differences between the threshold levels selected by automatic thresholding and the manual method. The MSSRR is defined as follows [10]:

MSSRR = (1/M) Σ_{i=1}^{M} ((t_i − t'_i)/t_i)^2    (7)

where t_i and t'_i are the i-th thresholds for images found manually and automatically, respectively, and M = 616 is the total number of images from CLSM. Figs. 1(a)(b)(c) show correlation plots between manually-obtained sets of thresholds and thresholds obtained from ENTROPY, MINERROR and OTSU, respectively. For an agreement between automatic and manual thresholding we expect that the data points follow the diagonal line (y = x). In Table 3, the resulting values for MSSRR and R2 for each thresholding criterion are shown.
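Equation (7) translates directly into a few lines of code; the sketch below assumes that the manually and automatically selected thresholds are supplied as equal-length arrays.

import numpy as np

def mssrr(manual, automatic):
    # Mean sum of squared relative residuals between manual thresholds t_i
    # and automatically selected thresholds t'_i (equation (7)).
    manual = np.asarray(manual, dtype=float)
    automatic = np.asarray(automatic, dtype=float)
    return float(np.mean(((manual - automatic) / manual) ** 2))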
3.2 Determining the Best Number of Thresholds
The number of thresholds has a direct relation with the number of classes into which an image can be segmented. Therefore, a separate measurement of the error was used to estimate the number of thresholds for the complete dataset of images. Table 4 shows the Mean Squared Error (MSE) for each combination of thresholding criteria and clustering validity indices. As can be observed, the DB index achieves the best performance when combined with the ENTROPY thresholding criterion, which reaffirms that the combination ENTROPY+DB attains the best performance in most of the cases for different datasets of images. The clustering validity indices have a direct relationship among them in their formulation [12]; however, each index has a different behavior depending on the number of thresholds selected. The behavior of each validity index can be seen in Fig. 2.

Table 4. Mean Squared Error (MSE) for the estimation of the best number of thresholds in all datasets

            IndexI   CH       DB       DN
OTSU        7.44     221.97   96.63    212.76
ENTROPY     2.33     212.2    1.18     186.94
MINERROR    2.80     188.31   179.39   220.22
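Putting the pieces together, the model-selection step evaluated in this section can be sketched as a sweep over candidate numbers of thresholds, scoring the optimal threshold set for each k with the Davies-Bouldin index and keeping the minimizer. The functions optimal_thresholds and davies_bouldin refer to the earlier sketches, and the candidate range k_max is an arbitrary choice for the example.

def best_number_of_thresholds(p, k_max=10):
    # Try k = 1..k_max thresholds, segment optimally for each k, and keep the
    # partition with the smallest Davies-Bouldin index (lower is better).
    best_k, best_db, best_T = None, float("inf"), None
    for k in range(1, k_max + 1):
        T = optimal_thresholds(p, k)       # sketch from Section 2.1
        db = davies_bouldin(p, T)          # sketch from Section 2.2
        if db < best_db:
            best_k, best_db, best_T = k, db, T
    return best_k, best_T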
Fig. 1. Correlation plots between thresholds obtained manually and automatically using different criteria: (a) ENTROPY, (b) MINERROR, and (c) OTSU
Although the plots are for one of the images in the dataset, they represent the general behavior of the clustering validity indices for the entire dataset. In Fig. 2, we notice that the indices IndexI, CH and DN are (for most of the values of k) monotonically increasing functions of k (Figs. 2 (a) (b) (d)), obtaining their best performance when the function achieves its maximum, i.e. when k = 61, k = 64 and k = 64, respectively. This behavior, unfortunately, does not provide a clear direction on how to determine the optimal number of clusters with which an image should be segmented. It also illustrates the high MSE values obtained by these indices when estimating the best number of thresholds. On the other hand, the DB index is the only index that shows a high independence with respect to the number of clusters. This index reaches its optimal performance when k = 8, which is a much more meaningful value than those obtained by the others. Moreover, as k grows, the DB index tends towards an almost constant value, which reflects the fact that, beyond a certain point, increasing the number of clusters does not improve the quality of the clustering (Fig. 2 (c)).
Fig. 2. General behavior of clustering validity indices: (a) IndexI, (b) CH, (c) DB, (d) DN
3.3 Performance of Image Segmentation Techniques Combined with Clustering Validity Indices
Table 5 shows the performance of the RI index for all biofilm images. The table shows that the best combination is ENTROPY + DB for the RIall index. Also, it is clear that the thresholding criterion with the best performance for this dataset is the ENTROPY criterion. This result was predictable, because most of the images have one threshold and the best method of segmentation for one threshold is the ENTROPY criterion.

Table 5. The RIall index for all automatically segmented biofilm images

            IndexI   CH       DB       DN
OTSU        0.2163   0.2151   0.2969   0.2187
ENTROPY     0.2506   0.2351   0.7884   0.2385
MINERROR    0.2206   0.2332   0.2613   0.2085
The behaviors for different combinations of techniques for two separate cases, one threshold and more than one threshold, are discussed next.
3.4 One Threshold
All biofilm images obtained by confocal microscopy have a single optimal threshold (manually found by the expert). Table 6 shows the performance of thresholding criteria and clustering validity indices for the segmentation of biofilm images with one threshold determined automatically.

Table 6. The RIclsm index for automatically segmented images of biofilms with one threshold determined automatically

            IndexI   CH       DB       DN
OTSU        0.6176   0.3901   0.5297   0.4002
ENTROPY     0.7573   0.4907   0.7634   0.5029
MINERROR    0.5844   0.3279   0.328    0.3075
The best performance is reached by the combination of ENTROPY and DB, corroborating the overall results. In this case, the analysis shows the same pattern as that of the overall performance, because the ENTROPY criterion is the best criterion for segmenting images with one threshold and the DB index is the best clustering validity index for an estimated value of k.
3.5 More Than One Threshold
Table 7 shows the performance of thresholding methods and cluster validity indices for the segmentation of biofilms with more than one threshold. As can be seen, all methods achieve a very good performance. The OTSU criterion combined with IndexI attains the best value of the RIop index, but for this set of images the performance does not differ significantly from that of the combination ENTROPY+DB. However, it is clear that the performances of the thresholding criteria are significantly influenced by the number of classes estimated by the clustering validity indices.

Table 7. The RIop index for automatically segmented biofilm images with more than one threshold determined automatically

            IndexI   CH       DB       DN
OTSU        0.7739   0.6548   0.7070   0.6564
ENTROPY     0.6889   0.7046   0.7634   0.7077
MINERROR    0.7594   0.6657   0.7222   0.6302
Fig. 3. Multi-level Thresholding Segmentation: (a) Optical image segmented manually. (b) Histogram of (a). (c) Optical image segmented automatically. (d) Histogram of (c).
3.6 Visual Validation
Figs. 3(a) and (c) compare the manual segmentation of a biofilm image with more than one threshold to the automatic segmentation obtained with the combination OTSU + IndexI. As can be seen in Fig. 3(c), the result of the automatic segmentation is close to that of the manual segmentation (Fig. 3(a)), setting the thresholds to almost the same values chosen by the expert (Figs. 3(b) and (d)). On the other hand, rebuilding the structure of a biofilm from CLSM images offers a powerful visualization tool. Figs. 4(a) and (b) show the 3D reconstruction of a biofilm from manually segmented images, and the automatic reconstruction from images segmented automatically with the combination ENTROPY + DB detecting the optimal thresholds. As can be seen, the automatically reconstructed image is quite similar to the manual reconstruction done by the expert.
Fig. 4. Biofilm reconstruction: (a) Manual reconstruction. (b) Automatic reconstruction.
4 Conclusions
A method for the automatic segmentation of biofilm images has been proposed. The method is based on the entropy-based criterion for multi-level thresholding, with the number of thresholds estimated through the well-known Davies-Bouldin clustering validity index. This index is able to find a number of thresholds close to that established by an expert. This was assessed using an objective measure, the Probabilistic Rand Index, which compares the segmentations produced by the proposed method with the segmentations done by an expert. Automatic segmentation of biofilm images leads to a much better quantification process and to a better understanding of bacterial biofilms. Since the multi-level thresholding used is always optimal, the segmentation process can be made free of subjectivity. Although the three main thresholding criteria have been implemented and tested in this work, other criteria can also be used for the segmentation of biofilm images, provided they satisfy the conditions stated in [11], for which optimal thresholding can be achieved in polynomial time.

Acknowledgments. This work has been partially supported by NSERC, the Natural Sciences and Engineering Research Council of Canada (Grants No. RGPIN261360 and RGPIN228117), the Canadian Foundation for Innovation (Grant No. 9263), the Ontario Innovation Trust, and the University of Atacama (University Grant for Research and Artistic Creativity, Grant No. 221172).
References

1. Claxton, N.S., Fellers, T.J., Davidson, M.W.: Laser scanning confocal microscopy. Technical report, Department of Optical Microscopy and Digital Imaging, The Florida State University (2006)
2. Beyenal, H., Donovan, C., Lewandowski, Z., Harkin, G.: Three-dimensional biofilm structure quantification. Journal of Microbiological Methods 59, 395–413 (2004)
3. Unnikrishnan, R., Pantofaru, C., Hebert, M.: Toward objective evaluation of image segmentation algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 929–944 (2007)
4. Unnikrishnan, R., Hebert, M.: Measures of similarity. In: Proceedings of the Seventh IEEE Workshop on Applications of Computer Vision, vol. 1, pp. 394–401 (2005)
5. Rand, W.M.: Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66, 846–850 (1971)
6. Heydorn, A., Nielsen, A.T., Hentzer, M., Sternberg, C., Givskov, M., Ersboll, B.K., Molin, S.: Quantification of biofilm structures by the novel computer program COMSTAT. Microbiology 146, 2395–2407 (2000)
7. Mueller, L.N., de Brouwer, J.F., Almeida, J.S., Stal, L.J., Xavier, J.B.: Analysis of a marine phototrophic biofilm by confocal laser scanning microscopy using the new image quantification software PHLIP. BMC Ecology 6, 1–15 (2006)
8. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man and Cybernetics 9, 62–66 (1979)
9. Beyenal, H., Lewandowski, Z., Harkin, G.: Quantifying biofilm structure: Facts and fiction. Biofouling 20, 1–23 (2004)
10. Yang, X., Beyenal, H., Harkin, G., Lewandowski, Z.: Evaluation of biofilm image thresholding methods. Water Sci. Technology 35, 1149–1158 (2001)
11. Rueda, L.: An efficient algorithm for optimal multilevel thresholding of irregularly sampled histograms. In: da Vitoria Lobo, N., Kasparis, T., Roli, F., Kwok, J.T., Georgiopoulos, M., Anagnostopoulos, G.C., Loog, M. (eds.) S+SSPR 2008. LNCS, vol. 5342, pp. 602–611. Springer, Heidelberg (2008)
12. Maulik, U., Bandyopadhyay, S.: Performance evaluation of some clustering algorithms and validity indices. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 1650–1655 (2002)
13. Costerton, J.W., Douglas, Z.L., Caldwell, E., Lappin-Scott, H.M., Korber, D.R.: Microbial biofilms. Annual Review of Microbiology 49, 711–745 (1995)
14. Johnson, L.R.: Microcolony and biofilm formation as a survival strategy for bacteria. Journal of Theoretical Biology 251, 24–34 (2008)
15. Jorgensen, T.M., Haagensen, J., Sternberg, C., Molin, S.: Quantification of biofilm structure from confocal imaging. Technical report, Optics and Fluids Department, Riso National Laboratory (2003)
16. Klapper, I.: Effect of heterogeneous structure in mechanically unstressed biofilms on overall growth. Bulletin of Mathematical Biology, 809–824 (2006)
17. Theodoridis, S., Koutroumbas, K.: Pattern Recognition. Academic Press, London (2003)
18. Adams, R., Bischof, L.: Seeded region growing. IEEE Transactions on Pattern Analysis and Machine Intelligence 16, 641–647 (1994)
A Pattern Classification Approach to DNA Microarray Image Segmentation

Luis Rueda1 and Juan Carlos Rojas2

1 School of Computer Science, University of Windsor, 401 Sunset Avenue, Windsor, Ontario, N9B 3P4, Canada
[email protected]
2 Department of Computer Science, University of Concepción, Edmundo Larenas 215, Concepción, Chile
[email protected]
Abstract. A new method for DNA microarray image segmentation based on pattern recognition techniques is introduced. The method performs an unsupervised classification of pixels using a clustering algorithm, and a subsequent supervised classification of the resulting regions. Additional fine tuning includes detecting region edges and merging, and morphological operators to eliminate noise from the spots. The results obtained on various microarray images show that the proposed technique is quite promising for segmentation of DNA microarray images, obtaining a very high accuracy on background and noise separation. Keywords: DNA microarray images, segmentation, clustering, classification.
1 Introduction
DNA microarrays are techniques used to evaluate the expression of thousands of genes simultaneously. This paper focuses on DNA microarrays in which the spots are layered in sub-grids. Segmentation is one of the most important steps in microarray image processing, and consists of separating the pixels that belong to the spot from the pixels of the background and noise. Various microarray image segmentation approaches have been proposed; most assume a particular shape for the spots, while some have more freedom in this regard. Fixed circle is a method that assumes a circular shape with the same diameter for all spots [1][2]. Adaptive circle is a method that allows the radius of the circle to be adjusted for each spot [1]. While this method solves the problem of the radius of the circle, it fails to find the proper shape when the spots have irregular shapes. Elliptic methods assume an elliptical shape for the spots, and can adapt to a more general shape than the adaptive circle method, but cannot recognize irregularly shaped spots [3]. Seeded region growing is a method that groups pixels into regions based on a similarity criterion, starting from initial points, the seeds [4][5]. Histogram-based methods and mathematical morphology have also been applied to microarray image segmentation [6][7]. The application of clustering to DNA
microarray image segmentation is based mainly on two algorithms: k-means and expectation maximization [8][9][10]. The advantage of clustering over other techniques is that it is not restricted to any predetermined spot shape. However, the power of clustering and, in general, of pattern recognition techniques has not been exploited in as comprehensive a way as we do here. This paper introduces the use of pattern recognition techniques to devise a method for DNA microarray image segmentation. The combined techniques form a general structure composed of several steps: the main steps are implemented with classifiers, while the others are implemented with algorithms developed for fine tuning.
2 The Proposed Method
The proposed approach is divided into several steps, starting with a method that discards images that do not contain spots, followed by a series of region detection and classification stages, and ending with the use of morphological operators to eliminate noise. Fig. 1 illustrates this structure, in which the boxes represent the steps and the arrows the output of each block. A brief description of these steps follows.
Fig. 1. General scheme of the proposed microarray segmentation technique
Fig. 2. Correlation plot for two images (with presence or absence of a spot). The intensities were increased to improve visualization.
Correlation analysis discards regions that do not have spots by analyzing Pearson's index between the intensities of the pixels and the average intensity of their neighbors [21]. Fig. 2 shows an image with a spot and its correlation plot, where the x-axis represents the average neighbor intensity and the y-axis the pixel intensities, which tend to follow the x = y line, reflecting the high correlation between the features. Fig. 2 also shows an image without a spot and its corresponding correlation plot, which has the shape of a cloud, reflecting the low correlation between the features in this case. Thus, the correlation index is a very good measure for the presence or absence of a spot. Region detection detects the initial regions of an image using k-means and different initial configurations for the number of clusters and centroids, generating a set of 303 different clusterings – we select the best clustering using the I-index [11]. Background-non background classification classifies the initial regions as regions that belong (or do not belong) to the background using a supervised classifier. Border absorption takes the regions that were classified as non background, determines which are the main regions and which are the borders, and proceeds to merge the main regions with their borders. Noise-non noise classification classifies the resulting regions as noise or non noise using a supervised classifier. Morphology is finally used to eliminate noise that was not detected in the previous steps.
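A minimal sketch of the correlation test is shown below; it is not the authors' implementation. Each pixel is paired with the mean of its 8 neighbours and Pearson's r over all pairs is compared against a threshold; the value 0.7384 used here is the experimentally chosen threshold reported later in Section 3.1.

```python
# Sketch: decide whether a single-channel image contains a spot by correlating
# pixel intensity with the mean intensity of its 8-neighbourhood.
import numpy as np
from scipy.ndimage import convolve

def has_spot(image, threshold=0.7384):
    img = np.asarray(image, dtype=float)
    kernel = np.ones((3, 3)) / 8.0
    kernel[1, 1] = 0.0                      # exclude the centre pixel
    neighbour_mean = convolve(img, kernel, mode="nearest")
    r = np.corrcoef(img.ravel(), neighbour_mean.ravel())[0, 1]
    return r > threshold, r

# Smooth, spot-like image vs. pure noise
yy, xx = np.mgrid[0:32, 0:32]
spot = np.exp(-((yy - 16) ** 2 + (xx - 16) ** 2) / 40.0)
noise = np.random.default_rng(1).random((32, 32))
print(has_spot(spot)[1], has_spot(noise)[1])
```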
2.1 Unsupervised Classification
For the unsupervised classification of the pixels into different regions of the image, k-means was used and combined with the I-index to evaluate the quality of the clustering generated [11]. We used the Euclidean distance and the following features to represent each object (pixel) to be classified: pixel intensity, average of neighbor intensities (using an 8-vicinity), distance from the pixel to the center of the region, variance of the neighbor intensities (considering an 8-vicinity), and gradient (a vector that indicates the direction of maximum increment of intensities). The centroids of the clusters were initialized using the percentiles of the distribution of the feature values. We also used a random initialization of the centroids.
When clustering data, it is crucial to know the correct number of clusters. Since this is usually unknown, a difficult task is to find it automatically, i.e. without human intervention. The I-index is a coefficient used to measure the quality of a clustering and hence helps find the best number of clusters. The I-index aims to maximize $I(k) = \left( \frac{1}{k} \cdot \frac{E_1}{E_k} \cdot D_k \right)^p$, where $E_k = \sum_{i=1}^{k} \sum_{j=1}^{|D_i|} u_{ij} \lVert x_j - \mu_i \rVert$ and $D_k = \max_{i,j=1}^{k} \lVert \mu_i - \mu_j \rVert$, with $u_{ij}$ being the membership of $x_j$ to cluster $D_i$ and $\mu_i = \frac{1}{|D_i|} \sum_{x_j \in D_i} x_j$. To avoid the predominance of some features over others, normalization is applied to each feature before using k-means. The general strategy used to determine the final clustering consists of generating a large number of clusterings with different initializations and then, based on the quality of each clustering, determining which one performs best. Following this procedure, the final implementation of the algorithm was configured to recognize between two and four clusters. For each number of clusters, 101 different clusterings were generated: 100 with initial centroids chosen at random, and one with predetermined centroids, yielding a total of 303 different clusterings. Each clustering was evaluated with the I-index and the one with the highest index value was selected.
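The selection strategy can be sketched as follows. The exponent p and the purely random initialisation are assumptions of this sketch (the paper also uses a percentile-based initialisation and does not restate p here); p = 2 is the value commonly used with this index.

```python
# Sketch of the I-index [11] and the "many runs, keep the best" strategy:
# k in {2, 3, 4}, several k-means runs per k, keep the clustering with the
# largest I-index value.
import numpy as np
from sklearn.cluster import KMeans

def i_index(X, labels, centers, p=2):
    grand_mean = X.mean(axis=0, keepdims=True)
    e1 = np.linalg.norm(X - grand_mean, axis=1).sum()            # E_1: single cluster
    ek = sum(np.linalg.norm(X[labels == i] - centers[i], axis=1).sum()
             for i in range(len(centers)))                       # E_k
    dk = max(np.linalg.norm(ci - cj)                             # D_k: largest centroid distance
             for i, ci in enumerate(centers) for cj in centers[i + 1:])
    return ((1.0 / len(centers)) * (e1 / ek) * dk) ** p

def best_clustering(X, ks=(2, 3, 4), runs_per_k=101, seed=0):
    best = None
    for k in ks:
        for run in range(runs_per_k):
            km = KMeans(n_clusters=k, n_init=1, init="random",
                        random_state=seed + run).fit(X)
            score = i_index(X, km.labels_, km.cluster_centers_)
            if best is None or score > best[0]:
                best = (score, k, km.labels_)
    return best

# Normalised pixel features (e.g. intensity and neighbour average), one row per pixel
X = np.random.default_rng(0).random((500, 2))
score, k, labels = best_clustering(X, runs_per_k=10)   # fewer runs for the demo
print(k, round(score, 4))
```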
2.2 Supervised Spot Classification
Once the regions have been identified, the next step is to identify those regions that belong to the spot and those that belong to the background. The first stage of this process consists of classifying a region as background or non background. The features for the background classification are mainly based on the average intensities and the spatial characteristics of the background, such as its distribution in the image. The features used to represent an object (region) in this classification are the following: average of intensities, percentage of the region perimeter that lies on the border of the image, and the largest distance from a pixel that belongs to the region to the geometric center of the image. The second stage consists of separating noise from spots. This procedure takes the regions that were classified as non background and merged with their borders, and classifies them as noise or non noise regions. The features used in this step are the following, grouped in four categories (a short computation sketch follows the feature list):

Statistics of Intensities
– average of intensities
– variance of intensities, calculated as $\frac{1}{n-1}\sum_{i=1}^{n}(I_i - \bar{I})^2$, where $n$ is the total number of pixels that belong to the region, $I_i$ is the intensity of pixel $i$, and $\bar{I}$ is the average intensity of the region
– standard deviation of intensities, calculated as $\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(I_i - \bar{I})^2}$
– average of neighbor regions that are non background
Geometric Features
– total area of a region
– ratio between the total area of a region and its perimeter (excluding its holes)
– distance between the geometric center of the region and the geometric center of the image
– percentage of the border of the region that limits with the background
– percentage of the region perimeter that represents the border of the image
– length of the skeleton of the region (defined as the medial axis of the region, where a point belongs to the medial axis if it has more than one closest neighbor in the border of the region [20])

Comparison with the Edges of Neighboring Regions
– ratio between the border of the region and the border of its neighbors (average of pixel intensities)
– difference between the border of the region and the border of its neighbors (average of pixel intensities)
– ratio between the border of the region and the border of its neighbors that are non background (average of pixel intensities)

Comparison with the Average Intensities of Other Regions
– ratio between the region and the neighbor regions that are non background
– ratio between the region and the largest neighbor that is non background
– ratio between the region and the largest region that is non background
– ratio between the region and the background
For each of the supervised classifications, various classifiers were implemented, tested and compared: the logistic linear classifier (LOG) [13], Fisher's linear classifier (FISH) [12], the nearest mean classifier (NM) that uses the Euclidean distance [14], the k-nearest neighbor classifier (k-NN) using the Euclidean distance [15], a support vector machine (SVM) with a linear kernel [16], the naive Bayes classifier (NV) [17], a linear classifier using principal component analysis (PCA) [18], the quadratic classifier (QUAD) that assumes a normal distribution for each class [12], and the binary decision tree classifier (BTREE) [13][19].
2.3 Post-processing
In order to improve the final segmentation, two post-processing stages are performed on the resulting regions classified as spots. Border absorption is applied to detect the borders of the regions and to eliminate false edges. In a nutshell, this stage detects which regions, classified as non background, are borders of other regions, and then merges the main regions with their border regions, generating a new region. Two conditions are required for a region to be considered a border of another region.
Fig. 3. Border absorption applied on two spots
Firstly, considering the pixels of both regions that form the border between them, the average intensity of the pixels from the region that is a possible border must be lower than the average intensity of the pixels from the possible main region. Secondly, when the morphological dilation operation is applied over the main region [7,20], with an 8-vicinity and ignoring its holes, and this extended main region is overlapped with the possible border region, it must cover at least 85% of its area. The algorithm also detects successive borders, i.e. if region B is a border of region A, and region C is a border of region B, then the algorithm merges the three regions into a single one. Fig. 3 shows two examples of border absorption, showing the simplification of the resulting regions. The other stage involves applying mathematical morphology. The basic operations used in this work are erosion, which produces thinning of the objects, and dilation, which produces thickening of the objects. Both operations are controlled by a shape called the structuring element, which consists of a matrix of zeros and ones that is translated through the domain of the image. The combination of these two operations, a dilation followed by an erosion, is used as the last noise filter on the regions that have been classified as non noise.
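The two conditions and the final closing filter can be sketched as follows, assuming regions are boolean masks; the frontier and hole-filling details are interpretations of the description above rather than the authors' exact implementation.

```python
# Sketch of the two border-absorption conditions and the closing filter,
# using scipy.ndimage; 8-connectivity as in the text.
import numpy as np
from scipy import ndimage

EIGHT = np.ones((3, 3), dtype=bool)          # 8-vicinity structuring element

def is_border_of(image, border_mask, main_mask, coverage=0.85):
    img = np.asarray(image, dtype=float)
    # Condition 1: along the shared frontier, the candidate border is darker
    frontier_main = main_mask & ndimage.binary_dilation(border_mask, EIGHT)
    frontier_border = border_mask & ndimage.binary_dilation(main_mask, EIGHT)
    if not frontier_main.any() or not frontier_border.any():
        return False
    cond1 = img[frontier_border].mean() < img[frontier_main].mean()
    # Condition 2: dilating the (hole-filled) main region covers >= 85% of the border region
    extended = ndimage.binary_dilation(ndimage.binary_fill_holes(main_mask), EIGHT)
    cond2 = (extended & border_mask).sum() >= coverage * border_mask.sum()
    return cond1 and cond2

def close_region(mask):
    """Dilation followed by erosion, used as the last noise filter."""
    return ndimage.binary_erosion(ndimage.binary_dilation(mask, EIGHT), EIGHT)
```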
3 Experimental Results
The DNA microarray images used in the experiments were obtained from the Stanford Microarray Database (SMD), publicly available at smd.stanford.edu. The images used here are mainly from experiments with Arabidopsis thaliana and Austrofundulus limnaeus. The images extracted from the database represent individual spots, selected based on their correlation value and a classification index that indicates whether or not the image contains a spot. The regions were generated using the unsupervised classification, and were then classified as background or non background. The regions used for the second supervised classifier, which correspond to the regions classified as non background in the previous supervised step, were merged with their borders and classified as noise or non noise. Finally, morphological operators were applied to the regions classified as non noise, and the resulting regions were considered as the spots detected by the algorithm.
3.1 Correlation
The experiments on the correlation between the pixel intensities and the average intensities of the neighboring pixels were performed using Pearson's coefficient, and a threshold of 0.7384, found experimentally, was used to eliminate images that do not contain spots with very high accuracy.
3.2 Unsupervised Classification
A series of experiments was then performed with the aim of determining the set of features that gives the best results and the range for the number of clusters, using the I-index as a measure of the quality of the segmentation produced. To find the feature space that gives the best initial image segmentation with the k-means algorithm, a series of experiments with k-means was performed to test the different configurations. This consists of visually comparing the segmentations generated with different feature spaces for k-means. The experiments showed that the feature space involving the intensities of the pixels and the average intensities of the neighbors generates the best results among all feature spaces tested. To determine an appropriate range for the number of clusters to be used in the k-means algorithm, a test with two, three, four, five and six clusters was conducted, and the results were compared afterwards. These experiments allowed us to conclude that the range for the number of clusters that gives the best results is between two and four, because a larger number of clusters generates an excessively large number of regions that are difficult to classify. Fig. 4 shows this scenario. The best number of clusters was found using the I-index, searching over 101 different clusterings for each number of clusters in the range (two, three and four), with a total of 303 clusterings. For each number of clusters, we generated 101 different clusterings: one of them has the initial configuration of the centroids pre-determined based on percentiles of the values obtained for each feature, and the remaining 100 initial configurations of centroids were selected at random in the range of values registered for each feature. Then, the algorithm obtains the I-index value for each clustering produced, and selects the one that delivers the largest value. This procedure is applied to each group of clusters, yielding three different clusterings with two, three, and four clusters respectively, and their corresponding image segmentations and I-index values. These experiments demonstrate the validity of the I-index as an evaluator of the quality of the segmentation produced.
3.3 Background Classification
In this step, a set of supervised classifiers was tested in the classification of the regions into background and non background, following a ten-fold cross-validation setup and obtaining the average error rate over the ten folds. Table 1 shows the results obtained in these experiments, where the error rate for each classifier is listed.
Fig. 4. Comparison of the segmentation generated with k-means using different numbers of clusters (two to six) and the feature set of pixel intensities with the average intensities of the neighbors

Table 1. Error rates (%) for background vs non background classification

  Classifier   Error Rate
  LOG          3.47
  PCA          3.72
  FISH         3.47
  NM           40.89
  QUAD         8.43
  KNN          13.51
  BTREE        9.54
  NV           4.58
  SVM          4.37
These results indicate that the lowest error rates were obtained with the logistic linear classifier and the minimum least square linear classifier, with an error rate of 3.47% in both cases. Both classifiers show a very low error rate, indicating that the schemes recognize background versus non background quite accurately. Border absorption was then applied to the regions that were classified as non background. The experiments show that the algorithm detects quite accurately when a region is a border of another region. This resulted in images that are simpler and easier to process at the next level using supervised classifiers.
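A minimal sketch of such a ten-fold comparison is given below, using scikit-learn stand-ins for some of the classifiers listed in Section 2.2; these are not the implementations used by the authors, and X and y stand for the region feature vectors and the background / non background labels.

```python
# Sketch: ten-fold cross-validation error rates for several classifiers on the
# region features (synthetic data used here only to make the example runnable).
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

def compare_classifiers(X, y):
    models = {
        "LOG": LogisticRegression(max_iter=1000),
        "NM": NearestCentroid(),
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "NV": GaussianNB(),
        "SVM": LinearSVC(),
        "BTREE": DecisionTreeClassifier(),
    }
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for name, model in models.items():
        acc = cross_val_score(model, X, y, cv=cv).mean()
        print(f"{name}: error rate {100 * (1 - acc):.2f}%")

rng = np.random.default_rng(0)
X = rng.random((400, 5))
y = (X[:, 0] + 0.2 * rng.standard_normal(400) > 0.5).astype(int)
compare_classifiers(X, y)
```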
3.4 Noise Classification
In this step, a set of supervised classifiers was tested in the classification of the regions that were classified as non background in the previous step and then merged with their borders, into two classes: noise or non noise.
Table 2. Error rates (%) for noise vs non noise classification

  Classifier   Error Rate
  LOG          18.88
  PCA          21.53
  FISH         19.76
  NM           33.92
  QUAD         23.64
  KNN          26.84
  BTREE        20.35
  NV           24.19
  SVM          23.60
The classifiers were tested using a ten-fold cross-validation procedure. Table 2 shows the results obtained in these experiments, where the error rate of each classifier is listed. These results indicate that the lowest error rate was obtained by the logistic linear classifier, with an error rate of 18.88%. These results imply that the hardest task of the proposed approach is to recognize the signals that include noise, and thus they justify the use of a specific level of supervised classification for detecting them. They also suggest the need for additional filters to detect and remove noise, which are implemented in the next step using morphological operators.
3.5 The Complete Segmentation
We tested the complete segmentation method on a set of spot images selected and classified based on general features. The aim is to visually compare the original image with the segmentation that the algorithm outputs. The supervised classifiers used in the implementation of the algorithm were selected based on their performance in the tests. In this series of experiments, the same classifier was used to implement both the background and noise classification steps, the logistic linear classifier, which was shown to be the most accurate among all classifiers tested. The images included in the experiments have various characteristics:

Regular Spots: This set groups spots that show a circular-like shape, where the image does not present noise signals. The variations between the images are given by the size, intensity and location of the spot in the image.

Irregular Spots: This set groups spots that do not have a circular-like shape, where the image does not present noise signals. Some of the shapes considered in this set are elliptic and half-moon. The variations between the images are given by the size, intensity and location of the spot in the image.

Noisy Spots: This set groups images of spots that present different levels of noise. In addition to the level of noise, other variations between the images are given by the shape, size, intensity and location of the spot in the image.
Fig. 5. Experiments performed with the complete algorithm: the original image and the resulting segmentation for five spots
Fig. 5 shows the results of the experiments over a set of spot images using the complete algorithm. The set of spot images considers different configurations, from nearly perfect spots to quite irregular spots and noisy images. The results show the power of the algorithm to produce accurate segmentations and to adapt easily to all these different configurations. When dealing with regular spots, the algorithm accurately segments images with these characteristics, independently of the size, location or brightness of the spots. The results of the experiments with irregular spots show that the algorithm detects the main features of the spots accurately, and that the quality of the segmentation depends on the smoothness of the spot borders – smoother spot borders produce a better segmentation. The results of the experiments with noisy spots show that the performance of the algorithm depends on the level of noise present in the image. If the magnitude of the noise compared to the spot is low, the algorithm generates a segmentation that is very close to the real spot, while if the magnitude of the noise is high, the segmentation will differ substantially from the real spot. In conclusion, the main factors that affect the quality of the segmentation are the level of noise and the smoothness of the borders of the spots. In general, the proposed approach is able to deal with different types of images, capturing different shapes and eliminating noise accordingly.
4 Conclusions
A combination of techniques from the field of pattern recognition is shown to be a very powerful scheme for the segmentation of DNA microarray images. Supervised and unsupervised classification techniques have been shown to be effective in the segmentation of real-life images, when performed in sequence and complemented with fine tuning that includes border absorption and morphology. Experiments have been performed on real-life images from the Stanford Microarray Database, which show that the system is highly accurate in identifying the pixels belonging to the spots and separating them from background and noise. The proposed approach is a framework for the development of such a system that encourages future research on variations of the different parameters of the system, including the number and selection of the features, the unsupervised and supervised classification schemes, and the evaluation of the quality of the clusters, among others.
Acknowledgements. This research work was partially supported by NSERC, the Natural Sciences and Engineering Research Council of Canada, grant No. RGPIN 261360, and the Chilean National Council for Technological and Scientific Research, FONDECYT grant No. 1060904.
References

1. Yang, Y., Buckley, M., Speed, T.: Analysis of cDNA Microarray Images. Briefings in Bioinformatics 2(4), 341–349 (2001)
2. Eisen, M.: ScanAlyze User Manual. Stanford University (1999)
3. Rueda, L., Qin, L.: A New Method for DNA Microarray Image Segmentation. In: Kamel, M.S., Campilho, A.C. (eds.) ICIAR 2005. LNCS, vol. 3656, pp. 886–893. Springer, Heidelberg (2005)
4. Adams, R., Bischof, L.: Seeded Region Growing. IEEE Trans. on Pattern Analysis and Machine Intelligence 16(6), 641–647 (1994)
5. Talbot, B.: Regularized Seeded Region Growing. In: Proc. of the 6th International Symposium ISMM 2002, pp. 91–99 (2002)
6. Ahmed, A., Vias, M., Iyer, N., Caldas, C., Brenton, J.: Microarray Segmentation Methods Significantly Influence Data Precision. Nucleic Acids Research 32(5), e50 (2004)
7. Angulo, J., Serra, J.: Automatic Analysis of DNA Microarray Images using Mathematical Morphology. Bioinformatics 19(5), 553–562 (2003)
8. Rueda, L., Qin, L.: An Improved Clustering-Based Approach for DNA Microarray Image Segmentation. In: Campilho, A.C., Kamel, M.S. (eds.) ICIAR 2004. LNCS, vol. 3212, pp. 17–24. Springer, Heidelberg (2004)
9. Wu, S., Yan, H.: Microarray Image Processing Based on Clustering and Morphological Analysis. In: Proc. of the First Asia-Pacific Bioinformatics Conference on Bioinformatics, pp. 111–118 (2003)
10. Li, Q., Fraley, C., Bumgarner, R., Yeung, K., Raftery, A.: Donuts, Scratches and Blanks: Robust Model-Based Segmentation of Microarray Images. Technical Report No. 473, Department of Statistics, University of Washington (2005)
11. Maulik, U., Bandyopadhyay, S.: Performance Evaluation of Some Clustering Algorithms and Validity Indices. IEEE Trans. on Pattern Anal. Mach. Intell. 24(12), 1650–1654 (2002)
12. Theodoridis, S., Koutroumbas, K.: Pattern Recognition, 3rd edn. Academic Press, London (2006)
13. Mitchell, T.: Machine Learning. McGraw-Hill, New York (1997)
14. Veenman, C., Tax, D.: A Weighted Nearest Mean Classifier for Sparse Subspaces. Computer Vision and Pattern Recognition 2, 1171–1176 (2005)
15. Song, Y., Huang, J., Zhou, D., Zha, H., Lee, C.: IKNN: Informative K-Nearest Neighbor Pattern Classification. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) PKDD 2007. LNCS (LNAI), vol. 4702, pp. 248–264. Springer, Heidelberg (2007)
16. Abe, S.: Support Vector Machines for Pattern Classification. Springer, Heidelberg (2005)
17. Dash, D., Cooper, G.: Exact Model Averaging with Naive Bayesian Classifiers. In: Proc. of the 19th International Conference on Machine Learning, pp. 91–98 (2002)
18. Baek, K., Draper, B., Beveridge, J., She, K.: PCA vs. ICA: A Comparison on the FERET Data Set. In: Proc. of the 4th Int. Conference on Computer Vision, Pattern Recognition and Image Processing, Durham, NC, pp. 824–827 (2002)
19. Safavian, S., Landgrebe, D.: A Survey of Decision Tree Classifier Methodology. IEEE Trans. on Systems, Man, Cybernetics, 660–674 (1991)
20. Gonzalez, R., Woods, R., Eddins, S.: Digital Image Processing Using Matlab. Prentice-Hall, Englewood Cliffs (2003)
21. Draghici, S.: Data Analysis Tools for DNA Microarrays. Chapman and Hall/CRC, Boca Raton (2003)
Drugs and Drug-Like Compounds: Discriminating Approved Pharmaceuticals from Screening-Library Compounds

Amanda C. Schierz1 and Ross D. King2

1 Software Systems Research Group, Bournemouth University, Poole House, Talbot Campus, Poole, BH12 5BB
2 Computational Biology Research Group, Aberystwyth University, Penglais Campus, Aberystwyth, SY23 3DB
[email protected], [email protected]
Abstract. Compounds in drug screening-libraries should resemble pharmaceuticals. To operationally test this, we analysed the compounds in terms of known drug-like filters and developed a novel machine learning method to discriminate approved pharmaceuticals from “drug-like” compounds. This method uses both structural features and molecular properties for discrimination. The method has an estimated accuracy of 91% in discriminating between the Maybridge HitFinder library and approved pharmaceuticals, and 99% between the NATDiverse collection (from Analyticon Discovery) and approved pharmaceuticals. These results show that Lipinski’s Rule of 5 for oral absorption is not sufficient to describe “drug-likeness” and be the main basis of screening-library design. Keywords: Inductive Logic Programming, drug-likeness, machine learning, Rule of 5, compound screening library.
1 Introduction

The successful development and application of Virtual Screening methods in the drug-discovery process has provided a new area of interest to the computer science community. With High-Throughput Screening (HTS) technology becoming more accessible, together with several commercially-available compound screening-libraries, computer scientists have been given an opportunity to confirm their theoretical observations in wet laboratory experiments. The selection of the most appropriate compound screening-library to purchase for these experiments is a difficult task: there are several ready-built libraries that are commercially available, libraries may be diversity-based or target-based, and the storage and purchase of the libraries is costly. This paper reports an analysis of two commercially-available screening libraries and details an Inductive Logic Programming (ILP) discriminant analysis approach to library design: which library most closely resembles approved pharmaceuticals? The two main criteria for selecting compounds for screening libraries are: they are similar to existing pharmaceutically-active compounds, and they are structurally diverse. Both criteria can be interpreted as maximising the a priori probability that a
compound will be found in the screening-library that is both drug-like and non-toxic. The requirement for diversity is usually explained by the fact that structurally similar compounds tend to exhibit similar activity. The goal is to find compounds that have a similar activity but a dissimilar structure. In this way, a structurally diverse set of compounds covers the activity search space but with fewer redundant compounds [1]. Ideally, screening-library compounds should have a low molecular weight and be of low complexity in order to maximise the chance of binding to a target. These compounds should also be amenable to medicinal chemistry optimisation to increase the chance of the primary-screening hit being developed further and becoming a lead for a specific target. As several hit compounds may never be suitable as a lead compound for a target, some researchers such as Hann et al [2] claim that virtual screening methods should focus on lead-likeness and not drug-likeness. As our interest is in the primary-screening process, the focus here is on the drug-likeness (hit-likeness) of the compounds in the screening-libraries. Drug-like properties are usually defined in terms of ADME - Absorption, Distribution, Metabolism, and Excretion - and describe the action of the drug within an organism, such as intestinal absorption or blood-brain-barrier penetration. One of the first methods, and still the most popular, to model the absorption property was the "Rule of 5" developed by Lipinski et al [3], which identifies the compounds for which the probability of useful oral activity is low. The "Rule of 5" states that poor absorption or permeation of a compound is more likely when:
1. There are more than 5 Hydrogen-bond donors.
2. The Molecular Weight is over 500.
3. The LogP (partition coefficient) is over 5 (or MLogP is over 4.15).
4. There are more than 10 Hydrogen-bond acceptors.
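As an illustration (not code from the paper), the rule can be checked from pre-computed molecular properties as follows; the returned count of satisfied properties is what the Lip-style tags introduced later in Section 3.1 summarise.

```python
# Sketch: count how many of the four Rule-of-5 drug-like properties a compound
# satisfies, given pre-computed property values (e.g. from PowerMV).
def rule_of_5(h_donors, mol_weight, logp, h_acceptors):
    violations = [
        h_donors > 5,       # more than 5 H-bond donors
        mol_weight > 500,   # molecular weight over 500
        logp > 5,           # LogP over 5
        h_acceptors > 10,   # more than 10 H-bond acceptors
    ]
    return 4 - sum(violations)   # number of drug-like properties satisfied

# Example: MW 480, LogP 3.2, 2 donors, 7 acceptors satisfies all four properties
print(rule_of_5(h_donors=2, mol_weight=480, logp=3.2, h_acceptors=7))   # -> 4
```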
Though these rules were never meant to describe the drug-likeness of compounds, their negation is usually used as the main selection filter for the compounds to include in a screening-library. For example, chemical companies such as Maybridge, Chembridge, Analyticon and TimTec, amongst others, all describe their screening-libraries in terms of the number of Lipinski rules covered by the compounds. Though these rules are not definitive, the properties are simple to calculate and provide a guideline not only for good oral absorption of the compound but also for the general drug-likeness of that compound. To assess how well the compounds in the screening-libraries resemble existing pharmaceutically-active compounds, two types of analysis were carried out:
• The comparison of the compounds in the screening-libraries and the set of approved pharmaceuticals in terms of the number of Lipinski rules covered by the compounds (Hydrogen bond donors and acceptors, molecular weight and LogP).
• Machine learning techniques have been used to discriminate between each screening-library and the set of approved pharmaceuticals. Three decision trees per screening-library have been learnt, each based on a different molecular representation – substructures only, quantitative properties only, and both substructures and quantitative properties.
This discriminatory approach is not novel and similar work has been carried out using neural networks [4], [5], [6] and decision trees [7] with relatively good prediction success for drug-likeness. In related work, the success of the Lipinski rules has encouraged research on refining and improving them. For example, Oprea [8], [9] has shown that the “Rule of 5” alone is not sufficient to distinguish between drugs and non-drugs, and proposes other quantitative filters such as rotatable bonds, rigid bonds and ring counts; Veber et al [10] claim that molecular Polar Surface Area is also important when describing drug-likeness and Baurin et al [11] include filters such as tractability and aqueous solubility, amongst others. One important way in which our approach differs from this previous work is that these methods all used the Available Chemicals Directory (ACD) as the dataset of non-drugs, and either the World Drug Index (WDI), MDL Drug Data Report (MDDR) or Medicinal Chemistry database (CMC) as the dataset for drugs (and drugs in development). In our approach, we use approved pharmaceuticals as the drug dataset and commercially-available compound screening-libraries as the non-drug dataset. This adds difficulty to the discrimination task as all the compounds in the screening-libraries are already identified as having drug-like properties. The second significant way that our approach differs is in the representation of molecules. Almost all chemoinformatics is based around using tuples of attributes to describe molecules. An attribute is a proposition which is true or false about a molecule, for example having a Log P of 0.947, the existence of a benzene ring, etc. This representational approach typically results in a matrix where the examples are rows and the columns are attributes. This attribute-based form of data is assumed by standard statistical and machine learning analysis methods. This way of representing molecules has a number of important disadvantages. Perhaps the most important of these is that it is chemically unnatural. Chemists think of molecules as structured objects (atom/bond structures, connected molecular groups, 3D structures, etc.). Such structured objects cannot easily be represented using attributes, and therefore their use forces chemists to use a language that cannot express their most basic concepts. Another important disadvantage of the attribute-based approach is that it is computationally inefficient in terms of space, i.e. to fully capture molecular structure requires an exponential number of attributes to be created. This is the fundamental reason that it is not unusual in chemoinformatic applications to see molecules described using hundreds if not thousands of attributes. A more natural and spatially efficient way to represent molecular structure is to use relations: atom1 bonded to atom2; a benzene ring connected to an amide group, etc. The main disadvantage of using such relational representations is that it requires more complex machine learning methods which are often slower than attribute-based approaches. One machine learning method that can use relational data is Inductive Logic Programming (ILP). The first representation was based on atoms, bonds and some quantitative attributes [12] and a more recent representation has added attributes derived from Richard Bader's Atom in Molecules (AIM) quantum topology theory [13], [14]. ILP enables the usage of background knowledge by defining high-level concepts, e.g. 
functional groups, aromatic rings, etc., and the output of an ILP method is a set of rich, relational rules such as “A compound is active if it has an aliphatic carbon atom attached by a single bond to a nitrogen atom which is in a six-membered aromatic monocycle”.
2 Materials and Methods

2.1 Data Sets

Two compound-screening libraries were chosen for the research – the target-based NatDiverse collection from Analyticon Discovery (Version 070914) and the diversity-based HitFinder (Version 5) collection from Maybridge. The libraries from these companies are publicly available and therefore computational analysis could be carried out: this was the sole reason for their inclusion in this research. We would like to thank Analyticon Discovery and Maybridge for their data. The HitFinder collection includes 14,400 compounds representing the drug-like diversity of the Maybridge Screening Collection (approximately 60,000 compounds). Compounds have generally been selected for inclusion in the library if they are known to be non-reactive and meet 2 or more of Lipinski's Rule of 5 (www.maybridge.com). AnalytiCon Discovery (www.ac-discovery.com) currently offers 13 NatDiverse libraries, which are tailor-made synthetic nitrogen-containing compounds. The libraries are template / target-based and include collections containing quinic acid and shikimic acid, hydroxyproline, santonine, dianhydro-D-glucitol, hydroxypipecolinic acid, andrographolide, piperazine-2-carboxylic acid, cytosine, quinidine, quinine, indoloquinolizidine, cyclopentene and ribose. The total number of compounds is 17,402. The approved pharmaceuticals dataset was obtained from the KEGG Drug database and contains 5,294 approved drugs from the United States and Japan. The compounds were not filtered to remove reactive functionalities [8] or any other undesirable properties. The datasets were randomly split into a training and validation dataset and an independent test set. 20% of the compound libraries and 8% of the approved pharmaceuticals were used for the independent testing.

2.2 Molecular Descriptors

The software PowerMV [15] was used to generate the molecular properties for the compounds. The four properties associated with Lipinski's Rule of 5 – molecular weight, LogP, hydrogen bond acceptors, and hydrogen bond donors – were calculated, together with the number of rotatable bonds, polar surface area, a blood-brain indicator (whether or not the compound penetrates the brain) and the number of chemically reactive or toxic groups in the compound.

2.3 Data Preprocessing

The OpenBabel suite [16] was used to convert the SDF datasets to the MOL2 chemical format so that the aromatic bonds could be identified and hydrogens added. A text-processing script parsed the MOL2 file into a Prolog-readable format containing data on atoms, bonds and aromaticity. The data is fully normalised according to relational database design standards [17], so each compound and atom is assigned a unique identifier. For example, atom(2,4,c) means that atom number 4 in compound number 2 is a carbon; bond(2,4,5,2) means that in compound number 2, atoms 4 and 5 are bonded by a double bond (the final digit 2).
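As an illustration of this normalisation step (not the authors' script), the Prolog facts can be emitted from already-parsed atom and bond records as follows.

```python
# Sketch: emit normalised Prolog facts of the form atom/3 and bond/4 from
# parsed MOL2 atom and bond records.
def to_prolog_facts(compound_id, atoms, bonds):
    """atoms: list of (atom_id, element); bonds: list of (atom1, atom2, bond_type)."""
    facts = []
    for atom_id, element in atoms:
        facts.append(f"atom({compound_id},{atom_id},{element.lower()}).")
    for a1, a2, bond_type in bonds:
        facts.append(f"bond({compound_id},{a1},{a2},{bond_type}).")
    return "\n".join(facts)

print(to_prolog_facts(2, atoms=[(4, "C"), (5, "O")], bonds=[(4, 5, 2)]))
# atom(2,4,c).
# atom(2,5,o).
# bond(2,4,5,2).
```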
2.4 Molecular Structure Generator

A bespoke Molecular Structure Generator (MSG) program, written in Prolog, uses this atom and bond information to generate descriptions of substructures by referring to a pre-coded library of over 200 chemical rings, functional groups, isomers and analogues. Figure 1 shows a fragment of the normalised relational data representation generated for the illustrated compound. The numbers represent the unique identifiers, for example, ring(compound_id, structure_id, ring_name), ring_length(compound_id, structure_id, ring_length).
ring_length(1,1,6). aromatic_ring(1,1). carbon_ring(1,1). ring(1,1,benzene).
fused_pair_name(1,4,naphthalene). carbon_fused_pair(1,4). polycycle(1,6,phenanthrene). carbon_poly(1,6).
poly_no_rings(1,6,3). group(1,7,nitro). group(1,8,aryl_nitro). parent(1,8,nitro).
nextto(1,1,2,fused). nextto(1,6,7,bonded). count_ring(1,benzene,3).
Fig. 1. A fragment of the background knowledge generated for 2-nitrophenanthrene using the Prolog Molecular Structure Generator. Image from Pubchem.
The relational facts can be read as, for example:
• For compound number 1, the first substructure identified is a benzene ring of length 6. It is a carbon ring and it is aromatic.
• For compound number 1, the fourth substructure identified is naphthalene, which is a fused pair of rings and is only carbon.
• For compound number 1, the eighth substructure identified is an aryl-nitro, which is a type of (has parent) nitro.
• For compound number 1, the sixth substructure (phenanthrene) is bonded to the seventh substructure (nitro).
2.5 Decision Trees

The data mining software Tilde [18] is available as part of the ACE data mining system (http://www.cs.kuleuven.ac.be/~dtai/ACE/), which provides a common interface to several relational data mining algorithms. Tilde is an upgrade of the popular C4.5 decision tree learner [19] and can be used for relational data mining: facts represented in Prolog can be both the input and the output of Tilde. For all experiments, the minimal number of cases allowed in a tree node was set to 5, the search heuristic employed was gain, and the Tilde mode was set to classify. All other options were kept at their default values. The complete datasets were split into a training and validation set and an independent test
set. A ten-fold cross-validation was used for Tilde to learn the decision trees. A cross-validation is a standard statistical technique where the training and validation data set is split into several parts of equal size, in this case 10% of the compounds. For each run of Tilde, 10% of the data is excluded from the training set and put in a corresponding validation set. Each training set is used to construct a classification tree which is then used to make predictions for the corresponding validation set. For each of the three scenarios (structural information only, quantitative information only, and both structural and quantitative information), the ten-fold cross-validation was carried out with identical training and validation sets. The classification tree that performed best in the training and validation stage was then applied to the independent test set.
3 Results

3.1 Lipinski Attribute Analysis

The datasets were first analysed according to the Lipinski Rule of 5. This analysis was carried out to see how well the two commercially-available screening-libraries matched the set of approved pharmaceuticals in terms of the Lipinski rule properties (Hydrogen bond donors and acceptors, molecular weight and LogP). Each combination of the rules has been allocated an identifier tag as in Table 1. For example, Lip4 denotes compounds that have all the Lipinski drug-like properties; Lip2b denotes compounds that have 2 Lipinski drug-like properties (less than or equal to 5 hydrogen bond donors and a molecular weight less than or equal to 500). Each compound in the two screening-libraries and the set of approved pharmaceuticals was allocated a tag according to the Lipinski Rule combinations shown in Table 1. Table 2 shows the percentages of compounds from each dataset for each identifier tag.

Table 1. Identifier tags for the combination of Lipinski Rules
  Lipinski Rule ID   H-bond donors ≤ 5   Mol. weight ≤ 500   LogP ≤ 5   H-bond acceptors ≤ 10
  Lip4
  Lip3a
  Lip3b
  Lip3c
  Lip3d
  Lip2a
  Lip2b
  Lip2c
  Lip2d
  Lip2e
  Lip2f
  Lip1a
  Lip1b
  Lip1c
  Lip1d
  Lip0
Table 2. Percentage of compounds with each combination of Lipinski Rules in the compound screening-libraries and approved pharmaceuticals (App)

  Lipinski Rule ID   NATDiverse   HitFinder   App
  Lip4               82.3%        88.9%       74.29%
  Lip3a              2.67%        10.85%      8.05%
  Lip3b              1.08%        0.02%       1.19%
  Lip3c              1.26%        0.02%       2.26%
  Lip3d              9.85%        0.01%       3.53%
  Lip2a              0.51%        0.17%       1.19%
  Lip2b              0            0.02%       0.02%
  Lip2c              1.47%        0           3.53%
  Lip2d              0            0           0
  Lip2e              0.12%        0           0.02%
  Lip2f              0.36%        0           1.10%
  Lip1a              0.03%        0           0.36%
  Lip1b              0            0           0.09%
  Lip1c              0            0           0
  Lip1d              0.33%        0           3.82%
  Lip0               0.02%        0           0.15%
The majority of the compounds in all datasets meet at least 3 of Lipinski's 4 drug-like properties. The most diverse combinations are in the set of approved pharmaceuticals, with just over 10% of compounds meeting 2 or fewer of the Rule of 5 properties. Interestingly, nearly 4% of approved pharmaceuticals only meet the LogP filter. The HitFinder diversity-library has the least diverse coverage, with 0.19% of compounds having 2 or fewer combinations. According to this attribute-based analysis, the NATDiverse targeted-library is more closely matched to the set of approved pharmaceuticals than the HitFinder library in terms of the Lipinski drug-like properties. Interestingly, no dataset has a compound that just satisfies the molecular weight and hydrogen bond acceptor criteria (Lip2d) or just the molecular weight criterion (Lip1c). Essentially this tells us that if the compound violates the rules on LogP and hydrogen bonding, it does not matter what the molecular weight is; the compound is not likely to be a potential drug.

3.2 Discrimination Analysis

Three tests were carried out per dataset pairing (screening-library : approved pharmaceuticals) – one based on structural information only, using the relations generated by the MSG Prolog program; another on quantitative attributes only (molecular weight, LogP, hydrogen bond acceptors, hydrogen bond donors, the number of rotatable bonds, polar surface area, blood-brain indicator and the number of chemically reactive or toxic groups in the compound); and the third based on both structural information and the quantitative attributes. Please note that as the datasets are of uneven size (approximately 3:1, screening-library : approved pharmaceuticals), we have shown the results in terms of True Positives (approved pharmaceuticals correctly classified as such) and False Positives (screening-library compounds that have been incorrectly classified as approved pharmaceuticals). Table 3 shows the average accuracy of the 10 classification models when applied to the validation set, together with the size of the most accurate decision tree produced.
Table 3. Average accuracy of the classification trees when applied to the validation set. For each screening-library, the results for the three data representations are shown.

  Validation Dataset                        Accuracy   False Positives   True Positives   Tree size
  HitFinder/App structures only             87.68%     7%                75%              367
  NATDiverse/App structures only            98.62%     1%                96%              119
  HitFinder/App properties only             83.53%     8%                64%              423
  NATDiverse/App properties only            90.31%     5%                76%              348
  HitFinder/App structures & properties     88.29%     7%                78%              389
  NATDiverse/App structures & properties    97.75%     1%                95%              138
The results of the cross-validation are promising, with high accuracy figures. The classification system has had more difficulty discriminating the approved pharmaceuticals from the HitFinder library than from the NATDiverse library – this has resulted in larger decision trees with lower accuracy rates for the HitFinder library. The best result for the HitFinder / Approved Pharmaceuticals data has been achieved when the data is represented by both structures and quantitative properties; the least accurate is when the data is represented by quantitative properties only. For the NATDiverse / Approved Pharmaceuticals data, the best result is achieved by representing the data by structural information only and the least accurate result is when the data is represented by quantitative properties only. As the datasets are of uneven distribution, the ROC (Receiver Operating Characteristics) points, which illustrate the trade-off between the hit-rate and false-alarm rate, are shown in Figure 2.
Fig. 2. The ROC points of the classifiers when applied to the validation data. The numbers are the data representation: 1 is structural and quantitative, 2 is structural only and 3 is quantitative only.
For each scenario, the classification tree that provided the lowest True Positive : False Positive ratio was applied to the independent test set, see Table 4.
Table 4. Accuracy of the best classification tree when applied to the independent test set. For each screening-library, the results for the three data representations are shown.

  Testing Dataset                           Accuracy   False Positives   True Positives
  HitFinder/App structures only             89.53%     8%                74%
  NATDiverse/App structures only            99.00%     1%                96%
  HitFinder/App properties only             83.43%     10%               62%
  NATDiverse/App properties only            89.29%     8%                74%
  HitFinder/App structures & properties     90.75%     7%                75%
  NATDiverse/App structures & properties    98.98%     1%                97%
The independent test results are very good and even show a slight improvement over the validation results in some scenarios. This shows us that our model has not been over-fitted to the training data. The results also show that the inclusion of quantitative attributes resulted in a slight increase in the classification accuracy for the HitFinder / Approved Pharmaceuticals data but actually decreased the overall accuracy for the NatDiverse / Approved Pharmaceutical data (though there is an increase in the True Positive rate). Figure 3 shows the ROC points of the classifier.
Fig. 3. The ROC points of the classifier when applied to the test data. The numbers are the data representation: 1 is structural and quantitative, 2 is structural only and 3 is quantitative only.
For both screening-libraries, there has been a decrease (5 to 10%) in performance when using physicochemical quantitative properties only. Interestingly, this may mean that even though the screening-library compounds are similar to approved pharmaceuticals in terms of certain drug-likeness filters, they are dissimilar in terms of certain substructures. These results are the converse of the attribute-based Lipinski rules analysis carried out previously. According to Lipinski's criteria, the target-based NATDiverse library more closely resembles approved pharmaceuticals than the diversity-based HitFinder library. Here the opposite is true – it has been harder to discriminate between the HitFinder compounds and approved pharmaceuticals. This means that the compounds in the HitFinder library resemble approved pharmaceuticals more closely than the NATDiverse compounds when more molecular background knowledge is added.
3.3 Pruning the Trees

One of the advantages of using Tilde is that the decision trees may be represented as a set of Prolog rules, each of which represents a decision tree node. The most accurate rules, i.e. those with the maximum positive coverage and minimal negative coverage, were extracted to build a probabilistic decision list. The aim was to find a decision list that had a minimum overall accuracy of 85% and fewer than 10 rules. For the HitFinder / Approved Pharmaceuticals datasets, a pruned decision list of 10 rules was found that had an overall accuracy of 85% and can correctly classify 63% of approved pharmaceuticals with only 7% false positives. Table 5 shows the resulting decision list rules together with their confidence probabilities. The rules may be read as: if the compound has a molecular weight greater than 500.502 then there is a 99.9% probability that the compound is an approved pharmaceutical, else if the compound has a molecular weight smaller than 150.133 then there is a 99.6% probability that the compound is an approved pharmaceutical, and so on; a sketch of such a decision list in code is given after the table.

Table 5. The ten best rules for discriminating between the HitFinder library and the set of approved pharmaceuticals. These rules can successfully classify 63% of the approved pharmaceuticals and 93% of the HitFinder compounds.
1. If molecular weight > 500.502 then approved pharmaceutical (99.9%)
2. else if molecular weight < 150.133 then approved pharmaceutical (99.6%)
3. else if there's more than 1 Hydroxyl then approved pharmaceutical (93%)
4. else if there's a Sulphur-containing Aromatic Monocycle then HitFinder (91%)
5. else if there's a Thiophene then HitFinder (89%)
6. else if there's more than 2 Methylenes then approved pharmaceutical (75%)
7. else if there's a Cyclohexane next to a Cyclopentane and there's a Methyl then approved pharmaceutical (95%)
8. else if there's an Aromatic ring and an Azetine next to an Amide then approved pharmaceutical (97%)
9. else if there's a Cyclohexane next to a Methyl and molecular weight > 269.388 then approved pharmaceutical (86%)
10. else the compound is from the HitFinder library (67%)
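To make the reading of Table 5 concrete, the decision list can be sketched as ordinary code. This is an illustration only: the descriptor names (molecular_weight, hydroxyl_count, and so on) are hypothetical placeholders, whereas the actual rules are relational Prolog clauses produced by Tilde over the structural background knowledge.

def classify_hitfinder_vs_approved(c):
    """Walk the pruned decision list of Table 5 from top to bottom and return
    (predicted class, rule confidence) for the first rule that fires.
    `c` is a dict of hypothetical pre-computed descriptors."""
    if c["molecular_weight"] > 500.502:
        return "approved", 0.999
    if c["molecular_weight"] < 150.133:
        return "approved", 0.996
    if c["hydroxyl_count"] > 1:
        return "approved", 0.93
    if c["s_aromatic_monocycle"]:
        return "HitFinder", 0.91
    if c["thiophene"]:
        return "HitFinder", 0.89
    # ... the remaining rules of Table 5 follow the same pattern ...
    return "HitFinder", 0.67  # default rule 10

example = {"molecular_weight": 523.1, "hydroxyl_count": 0,
           "s_aromatic_monocycle": False, "thiophene": False}
print(classify_hitfinder_vs_approved(example))  # ('approved', 0.999) by rule 1

Each later rule is reached only if every earlier test has failed, so the confidences attached to later rules are conditional on the preceding rules not firing.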
The rules generated are simple to understand and provide insight into the structural differences between the HitFinder library and approved pharmaceuticals. Apart from molecular weight, no other physicochemical property has been employed as a discriminatory feature; this is probably because the library was designed using these types of filters. For the NATDiverse (NAT) / Approved Pharmaceuticals (App) datasets, a pruned decision list with just 8 rules can classify the compounds with 90% accuracy, with 84% of approved pharmaceuticals classified correctly and 8% False Positives. The rules here are longer and include more structural relations than those for the HitFinder library, see Table 6.
Table 6. The eight best rules for discriminating between the NATDiverse library and the set of approved pharmaceuticals. These rules can successfully classify 84% of the approved pharmaceuticals and 92% of the NATDiverse compounds.
1. If there's a non-aromatic ring and less than 6 Amides and a Hetero ring with length < 5 then approved pharmaceutical (100%)
2. else if there's a non-aromatic ring and less than 6 Amides and a fused pair of Hetero rings then NATDiverse (94%)
3. else if there's a non-aromatic ring and less than 6 Amides, a Piperidine bonded to an Amide and Hydrogen Bond Donors = 1 or 2 then NATDiverse (91%)
4. else if there's a non-aromatic ring and an aromatic monocycle and a Nitrogen-containing ring and an Oxygen-containing ring and any ring with length of 5 then NATDiverse (79%)
5. else if there's a non-aromatic ring and less than 6 Amides and more than one 1H-Quinolizine then NATDiverse (100%)
6. else if there's a non-aromatic ring and less than 6 Amides and a Cyclohexane bonded to an Alcohol then NATDiverse (94%)
7. else if there's a non-aromatic ring and less than 6 Amides and Hydrogen Bond Donors > 1 then NATDiverse (62%)
8. else the compound is an approved pharmaceutical (91%)
Whereas the rules for the HitFinder collection were a mixture of rules classifying compounds from both the library and the set of approved pharmaceuticals, here the rules are focused on the library compounds: 91% of the compounds left after applying these rules will probably be approved pharmaceuticals. This is probably due to the nature of target-based screening-libraries; they are normally designed around specific molecular structures. Once again, because the screening-library compounds are close to approved pharmaceuticals in terms of the Lipinski rule filters, the rules are mainly based around differing substructures. This time it is only Hydrogen Bond Donors that have been found in the discriminating rules. Employing an ILP approach to this discrimination task has produced a rich, relational and small set of rules that provides insightful information about the differences between the compounds in the screening-libraries and approved pharmaceuticals.
4 Discussion and Conclusion

This research exercise has been interesting to us for several reasons. From a technical viewpoint, the Prolog Molecular Structure Generator provided descriptive molecular background knowledge, and this has resulted in some clear, easy-to-understand relational rules. From a screening-library compound perspective, we were surprised that the classifiers provided some very accurate results. It was expected that the HitFinder library would be harder to discriminate than the NATDiverse collection as it is diversity-based rather than target-based. However, neither task was too challenging, and this leads back to the concept of lead-likeness and the argument that virtual screening methods should focus on lead-likeness and not drug-likeness [2].
The final interesting perspective is that of screening-library design. The properties associated with the Rule of 5, and others such as Polar Surface Area, are predominantly used for the design of screening-libraries. These properties are treated as filters, and many compounds that are filtered out are simply classed as non-drug-like without further consideration. This research has shown that even though the compounds in the screening-libraries resemble approved pharmaceuticals with regard to these filters, there are many more factors that need to be considered. The filter approach is almost certainly non-optimal because such filters are "soft", i.e. they are only probabilistic and can be contravened under some circumstances. We have taken a discrimination-based approach to the problem of selecting and designing compound libraries for drug screening. We have demonstrated that by using our ILP machine learning method we can accurately discriminate between approved pharmaceuticals and compounds in state-of-the-art screening-libraries. These discrimination functions are expressed as easy-to-understand rules, are relational in nature and provide useful insights into the design of a successful compound screening-library.
References 1. Leach, A.R., Gillet, V.J.: An Introduction to Chemoinformatics. Kluwer Academic Publishers, Dordrecht (2003) 2. Hann, M.M., Leach, A.R., Harper, G.: Molecular Complexity and Its Impact on the Probability of Finding Leads for Drug Discovery. Journal of Chemical Information and Computer Sciences 41(3), 856–864 (2001) 3. Lipinski, C.A., Lombardo, F., Dominy, B.W., Feeney, P.J.: Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Delivery Rev. 23(1-3), 3–25 (1997) 4. Ajay, W., Walters, W.P., Murcko, M.A.: Can We Learn To Distinguish between "Drug-like" and "Nondrug-like" Molecules? J. Med. Chem. 41(18), 3314–3324 (1998) 5. Sadowski, J., Kubinyi, H.: A scoring scheme for discriminating between drugs and nondrugs. J. Med. Chem. 41, 3325–3329 (1998) 6. Murcia-Soler, M., Pérez-Giménez, F., García-March, F.J., Salabert-Salvador, M.T., Díaz-Villanueva, W., Castro-Bleda, M.J.: Drugs and nondrugs: an effective discrimination with topological methods and artificial neural networks. J. Chem. Inf. Comput. Sci. 43(5), 1688–1702 (2003) 7. Wagener, M., van Geerestein, V.J.: Potential drugs and nondrugs: prediction and identification of important structural features. J. Chem. Inf. Comput. Sci. 40 (2000) 8. Oprea, T.I., Davis, A.M., Teague, S.J., Leeson, P.D.: Is there a difference between leads and drugs? A historical perspective. J. Chem. Inf. Comput. Sci. 41, 1308–1315 (2001) 9. Oprea, T.I.: Lead structure searching: Are we looking at the appropriate property? J. Comput.-Aided Mol. Design 16, 325–334 (2002) 10. Veber, D.F., Johnson, S.R., Cheng, H.-Y., Smith, B.R., Ward, K.W., Kopple, K.D.: Molecular properties that influence the oral bioavailability of drug candidates. J. Med. Chem. 45, 2615–2623 (2002) 11. Baurin, N., Baker, R., Richardson, C.M., Chen, I.-J., Foloppe, N., Potter, A., Jordan, A., Roughley, S., Parratt, M.J., Greaney, P., Morley, D., Hubbard, R.E.: Drug-like Annotation and Duplicate Analysis of a 23-Supplier Chemical Database Totalling 2.7 Million Compounds. Journal of Chemical Information and Modeling 44(2), 643–651 (2004)
12. King, R.D., Muggleton, S.H., Srinivasan, A., Sternberg, M.J.E.: Structure activity relationships derived by machine learning: The use of atoms and their bond connectivities to predict mutagenicity using inductive logic programming. Proceedings of the National Academy of Sciences, USA 93, 438–442 (1996) 13. Buttingsrud, B., Ryeng, E., King, R.D., Alsberg, B.K.: Representation of molecular structure using quantum topology with inductive logic programming in structure-activity relationships. Journal of Computer-Aided Molecular Design 20(6), 361–373 (2006) 14. Bader, R.F.W.: Atoms in Molecules - A Quantum Theory. Oxford University Press, Oxford (1990) 15. Liu, K., Feng, J., Young, S.S.: PowerMV: A Software Environment for Molecular Viewing, Descriptor Generation, Data Analysis and Hit Evaluation. J. Chem. Inf. Model. 45, 515–522 (2005) 16. Guha, R., Howard, M.T., Hutchison, G.R., Murray-Rust, P., Rzepa, H., Steinbeck, C., Wegner, J.K., Willighagen, E.: The Blue Obelisk – Interoperability in Chemical Informatics. J. Chem. Inf. Model. 46(3), 991–998 (2006) 17. Codd, E.F.: Recent Investigations into Relational Data Base Systems. IBM Research Report RJ1385 (April 23, 1974); republished in Proc. 1974 Congress, Stockholm, Sweden. North-Holland, New York (1974) 18. Blockeel, H., De Raedt, L.: Top-down induction of first order logical decision trees. Artificial Intelligence 101(1-2), 285–297 (1998) 19. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann series in Machine Learning. Morgan Kaufmann, San Francisco (1993)
Fast SCOP Classification of Structural Class and Fold Using Secondary Structure Mining in Distance Matrix

Jian-Yu Shi 1,2 and Yan-Ning Zhang 2

1 Faculty of Life Sciences, Northwestern Polytechnical University
2 College of Computer Science, Northwestern Polytechnical University
710072 Xi'an, China
{JianyuShi,YnZhang}@nwpu.edu.cn
Abstract. There is an urgent need in the proteomic era to understand the structure–function relationship. One of the important techniques to meet this demand is to analyze and represent the spatial structure of the domain, which is the functional unit of the whole protein, and to perform fast domain classification. In this paper, we introduce a novel method for rapid domain classification. Instead of directly analyzing the protein sequence or the 3-D tertiary structure, the presented method first maps the tertiary structure of a protein domain into a 2-D Cα–Cα distance matrix. Then, two distance functions, for the alpha helix and the beta strand, are modeled by considering their respective geometrical properties. The distance functions are then applied to mine secondary structure elements in the distance matrix in a way similar to image processing. Furthermore, a composition feature and an arrangement feature of secondary structure elements are presented to characterize domain structure for classification of structural class and fold in the Structural Classification of Proteins (SCOP) database. Finally, comparison with other methods shows that the presented method performs automatic domain classification effectively and efficiently, with the benefit of low-dimensional, meaningful features and no need for a complicated classifier system. Keywords: SCOP classification, protein structure, distance matrix, secondary structure mining, image processing, support vector machines.
1 Introduction

The function of a protein is strongly related to its spatial structure [1]. In order to understand the structure–function relationship and discover the evolutionary explanation of conserved structure, biologists and researchers need to retrieve similar tertiary structures from protein structure databases, and further categorize them into different classes in terms of their secondary structure, topology and evolutionary information. The representation, classification and retrieval of protein spatial structure has now become a popular area in computational biology and structural bioinformatics. However, the number of proteins with determined spatial structures but unknown types and unclear functions is still large and increasing continuously. Besides, current
structural classification databases are constructed manually by numerous biologists [2], or implemented automatically by exhaustive and sometimes inaccurate computation [3]. As more and more protein spatial structures are determined, biologists require protein structure classification that is not only automatic but also more accurate and of lower computational cost. Consequently, how to represent the spatial structure of a protein and perform fast structural classification has become an urgent need. Based on the widely held assumption that structural features are closely related to sequence composition [1, 4], one popular approach, called indirect representation of protein spatial structure, extracts features from the sequence to perform classification. Indirect representations can be organized into two types: those based on statistical analysis of amino acid residues [5-8], and those based on amino acid indices [9, 10]. Another approach analyzes the protein spatial structure directly to obtain a representation and extract structural features, and can be grouped into three types: based on spatial atom distribution [11, 12], topological structure [13, 14], and geometrical shape [15-17]. The indirect representation can usually be obtained at lower computational cost but with higher-dimensional features, whereas the direct representation is acquired at higher computational cost but with lower-dimensional features. Moreover, in order to achieve better classification with the indirect representation of protein structure, there is always a need to exploit advanced pattern recognition techniques, for example feature combination [5], fusion [18, 19], selection [7] and hierarchical classifier architectures [20, 21]. In this paper, we present a feature extraction of protein spatial structure to achieve fast SCOP classification at the structural class and fold levels. The paper is organized as follows: Section 2 describes the benchmark dataset of protein structures used in this paper. Section 3 describes the core of our idea, namely how to characterize protein structure in a 2-D matrix and represent it with a compact feature vector. Section 4 presents experimental results and investigates the effectiveness of the proposed algorithm. Finally, we draw conclusions in Section 5.
2 Database

Structural domains of a protein often form functional units, each of which forms a compact three-dimensional structure and can often fold and remain stable independently. A protein always consists of one or more structural domains. On the other hand, one domain may appear in a variety of evolutionarily related proteins. Therefore, one of the important techniques for understanding the structure–function relationship is the analysis of the spatial structure of protein domains. The Structural Classification of Proteins (SCOP) database is a manual classification of protein structural domains based on similarities of their amino acid sequences and three-dimensional structures [2]. Because the SCOP classification is based on human expertise, in contrast to the semi-automatic CATH [3], it is usually accepted that SCOP provides a better-justified classification. SCOP utilizes four levels of hierarchical classification: structural class, fold, superfamily and family. The classifications of
structural class and fold are concerned with structural similarity, while the classifications of superfamily and family are devoted to sequence similarity. The SCOP domain dataset used here is derived from the highly cited work [5], and can be downloaded from http://ranger.uta.edu/~chqding/protein/ or the SCOP database [22]. It contains a training set and a testing set with sample counts of 313 and 385 and sequence similarities below 35% and 40%, respectively. According to the SCOP classification, the whole dataset consists of 4 structural classes which can be further categorized into 27 folds. Only the first two levels of the SCOP classification are considered for the analysis of domain structure in this paper.
3 Method

3.1 Distance Matrix

Various physical and chemical properties, together with the different counts and ordering of amino acids, determine and produce the diversity of protein structures. As a result, it is complicated to describe a protein structure directly by all of its atoms, and even more difficult to analyze and characterize its structure for further understanding of the structure–function relationship. Instead of considering all atoms, much of the computational biology literature therefore uses the Cα atoms of a protein, also known as the protein backbone [23], to characterize the whole protein structure at lower computational complexity. As a backbone-based representation of protein structure, the distance matrix (DM) contains sufficient information to reconstruct the original 3-D backbone structure by using distance geometry methods [24]. Suppose protein $P_i$ is composed of $N$ amino acid residues and its backbone is defined as $B_i = \{Coor_1^i, Coor_2^i, \ldots, Coor_n^i, \ldots, Coor_N^i\}$, where $Coor_n^i$ is the coordinate vector of the $n$th Cα atom. Then $B_i$ yields the distance matrix
$$DM_i = \left\{ dm_i(p,q) = dist\bigl(Coor_p^i, Coor_q^i\bigr) \right\} \qquad (1)$$

where $dist(\cdot)$ is simply the Euclidean distance between the $p$th and the $q$th Cα atoms, and $1 \le p, q \le N$.
Since the DM retains sufficient 3-D structural information, similar protein backbones are expected to have distance matrices with similar properties. Fig. 1 gives the structure snapshots and the DM images of four proteins which fall into the all α, all β, α/β and α+β structural classes and four different folds, respectively. As shown in Fig. 1, different kinds of protein structures have distinct DMs; that is to say, differences in structure are exhibited by their DMs. More importantly, secondary structure elements (SSEs) show regular patterns in the DM.
Fig. 1. Structure snapshots and distance matrix images of four proteins: (a) d1hbga_, all α, Globin-like, chain length 147; (b) d1neua_, all β, Immunoglobulin-like, chain length 115; (c) d1ghra_, α/β, TIM-barrel, chain length 306; (d) d1npka_, α+β, Ferredoxin-like, chain length 150.
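As a minimal sketch of Eq. (1), assuming NumPy and that the Cα coordinates have already been parsed (e.g. from a PDB file), the distance matrix is simply the matrix of pairwise Euclidean distances:

import numpy as np

def distance_matrix(ca_coords):
    """ca_coords: (N, 3) array of Calpha coordinates, one row per residue.
    Returns the symmetric N x N Calpha-Calpha distance matrix of Eq. (1)."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]  # (N, N, 3) pairwise differences
    return np.sqrt((diff ** 2).sum(axis=-1))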
3.2 Secondary Structure Mining

The definitions of structural class and fold suggest that analyzing the composition and arrangement of secondary structures on the basis of the DM can represent protein structure in a compact and effective way. However, the SCOP domain file provides no information on secondary structure, so there is a principal need to mine secondary structure in order to extract structural features of a protein domain. The two most common SSEs are the alpha helix and the beta strand. Geometrically, the backbone of an alpha helix is a kind of spring, while the backbone of a beta strand is a stretch of periodic pleating, and two strands are connected laterally by three or more hydrogen bonds to form a pleated beta sheet. In detail, the Cα atoms in an alpha helix are arranged in a right-handed helical structure, 5.4 Å wide, and each Cα atom corresponds to a 100° turn in the helix, a radius of 2.3 Å and a translation of 1.5 Å along the helical axis. Therefore, the distance between the Cα(i) and Cα(i+t) atoms in an alpha helix can be determined by

$$l^{\alpha}_{i \rightarrow i+t} = \sqrt{2 r^2 \bigl(1 - \cos(t \cdot \theta)\bigr) + (t \cdot d)^2}, \quad t = 1, 2, \ldots \qquad (2)$$
where r = 2.3 Å, d = 1.5 Å and θ = 100°. Due to the tetrahedral chemical bonding at the Cα atom, the pleated appearance of the beta strand causes the distance between Cα(i) and Cα(i+2) to be approximately 6 Å, rather than the sum (2 × 3.8 Å) of the distances of adjacent Cα atom pairs. Therefore, the distance between the Cα(i) and Cα(i+t) atoms in a beta strand can be determined by
$$l^{\beta}_{i \rightarrow i+t} = \begin{cases} \dfrac{d_2 \cdot t}{2}, & t \text{ is even} \\[1ex] \sqrt{\left(\dfrac{d_2 \cdot t}{2}\right)^2 + d_1^2 - \left(\dfrac{d_2}{2}\right)^2}, & t \text{ is odd} \end{cases} \qquad t \ge 1 \qquad (3)$$

where $d_1$ = 3.8 Å and $d_2$ = 6 Å.
Because of the symmetry of the distance matrix, its upper triangular part is enough for mining SSEs with formula (2) or (3). However, the distance between adjacent Cα–Cα atoms varies between proteins derived from different experiments. As a result, we cannot apply these two formulas directly to decide which residues participate in an alpha helix or a beta strand. Here, inspired by techniques of image processing, we determine whether a residue belongs to an SSE by whether the following formula holds:

$$S_i = \sum_{k=1}^{m} \bigl(R_k^i - R_m^i\bigr)^2 - \tau \le 0, \qquad (4)$$
where τ is a threshold which controls the fitting error between the distance function and the elements of the DM,

$$R_m^i = \frac{1}{m}\sum_{k=1}^{m} R_k^i, \qquad R_k^i = \frac{\Delta D(i,k)}{\Delta L(i,k)}, \qquad i = 1, \ldots, \; k = 1, \ldots, \qquad (5)$$
$\Delta D(i,k) = dm(i, i+k) - dm(i, i+k-1)$ and $\Delta L(i,k) = l_{i \rightarrow i+k} - l_{i \rightarrow i+k-1}$. Considering the fact that the N–H group of an amino acid forms a hydrogen bond with the C=O group of the amino acid four residues earlier, we let k take values greater than 4 in order to discover alpha helices containing at least one turn. Based on the fact that a beta strand always connects laterally to another strand by three or more hydrogen bonds to form a pleated beta sheet, we let k take values greater than 3 for beta-strand finding. Although the positions of all strands in the primary structure (sequence) can be determined by formula (4), the count, position and orientation of beta sheets are still unclear. Considering that the "sideways" distance between adjacent Cα atoms in hydrogen-bonded β strands is roughly 5 Å, we build a band-pass filter with the range [5−δ1, 5+δ2] and apply it to the distance matrix. As a result, whether $dm(i, i+k)$ is an indication of a beta sheet can be determined by whether the following formula holds:

$$-\delta_1 \le dm(i, i+k) - 5 \le \delta_2. \qquad (6)$$
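To make the mining step concrete, the ideal distance functions of Eqs. (2) and (3) and the fitting test of Eqs. (4)–(5) can be sketched as follows. This is a simplified illustration using the geometric constants quoted above; the threshold τ and the exact ranges of k are left as free parameters and are not the authors' settings.

import math

R, D_HELIX = 2.3, 1.5        # helix radius and rise per residue (Angstrom)
THETA = math.radians(100.0)  # turn per residue
D1, D2 = 3.8, 6.0            # adjacent and i,i+2 Calpha distances in a strand

def helix_dist(t):
    """Ideal Calpha(i)-Calpha(i+t) distance in an alpha helix, Eq. (2)."""
    return math.sqrt(2 * R**2 * (1 - math.cos(t * THETA)) + (t * D_HELIX)**2)

def strand_dist(t):
    """Ideal Calpha(i)-Calpha(i+t) distance in a beta strand, Eq. (3)."""
    if t % 2 == 0:
        return D2 * t / 2
    return math.sqrt((D2 * t / 2)**2 + D1**2 - (D2 / 2)**2)

def fits_sse(dm, i, ideal, m, tau):
    """Eqs. (4)-(5): does residue i fit the ideal distance profile `ideal`?
    Compares increments of the observed distances dm[i][i+k] with the ideal
    increments and thresholds the spread of their ratios."""
    ratios = []
    for k in range(1, m + 1):
        dD = dm[i][i + k] - dm[i][i + k - 1]
        dL = ideal(k) - ideal(k - 1)
        ratios.append(dD / dL)
    mean = sum(ratios) / m
    return sum((r - mean)**2 for r in ratios) - tau <= 0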
3.3 Feature Extraction of Protein Structure

According to the SCOP description, an all-α protein is a class of structural domains in which the secondary structure is composed entirely of α-helices, with the possible exception of a few isolated β-sheets on the periphery. An all-β protein is a class of structural domains in which the secondary structure is composed entirely of β-sheets, with the possible exception of a few isolated α-helices on the periphery. An α/β protein is a class of structural domains composed of alternating α-helices and mostly parallel β-strands along the backbone. An α+β protein is a class of structural domains composed of α-helices and mostly anti-parallel β-strands that occur
separately along the backbone. If proteins belong to the same structural class, have the same major secondary structures in the same arrangement and have the same topological connections, then they are grouped into a common fold. Inspired by these facts, we regard the composition of SSEs as the key feature in the classification of structural class, and take the arrangement of SSEs, especially of beta sheets, as playing the important role in the classification of fold. First of all, we define the regions of interest (ROI) of the distance matrix from which features for structural class and fold are extracted, respectively. The first ROI looks like a 5-element-wide beam in the distance matrix, while the second ROI is a slightly smaller triangle than the upper triangular part of the distance matrix. Both are shown in Fig. 2 with different gray patterns.
Fig. 2. Regions of Interest
The first feature of a protein is called the composition of SSEs and is defined as

$$F_1 = \#\{S_i\} / N, \qquad (7)$$
where $\#\{S_i\}$ is the count of the $S_i$ that satisfy formula (4). In order to characterize the count, position and orientation of beta sheets, we split the second ROI into several smaller ones by multi-level decomposition, as shown in Fig. 3.
Fig. 3. Multi-level decomposed ROI with one or several sub-regions (Level I and subsequent levels)
As a result, the arrangement feature of the protein is defined as

$$F_2 = \#\{dm(i,j)\} / C_{m,n},$$
where $\#\{dm(i,j)\}$ is the count of the $dm(i,j)$ that satisfy formula (6), and $C_{m,n}$ is the element count of the $n$th sub-region in level $m$. In total, $F_1$ has two dimensions, corresponding to the alpha helix and the beta strand, while the dimension of $F_2$ depends on the level of decomposition and equals $2^m - 1$, where $m$ is the level.
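As an illustration of the arrangement feature, the sketch below computes a level-1 version of F2: the fraction of second-ROI entries that fall inside the roughly 5 Å inter-strand band of Eq. (6). The offset that defines the ROI and the tolerances δ1 and δ2 are assumptions made for the example, not values from the paper.

import numpy as np

def f2_level1(dm, min_offset=5, delta1=0.5, delta2=0.5):
    """Level-1 arrangement feature: fraction of distance-matrix entries in the
    upper-triangular ROI (j - i > min_offset) that lie within
    [5 - delta1, 5 + delta2], i.e. candidate hydrogen-bonded strand pairs of
    a beta sheet, Eq. (6)."""
    n = dm.shape[0]
    i, j = np.triu_indices(n, k=min_offset + 1)  # second ROI: away from the diagonal
    roi = dm[i, j]
    in_band = (roi >= 5.0 - delta1) & (roi <= 5.0 + delta2)
    return in_band.sum() / roi.size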
3.4 Classification

Once the representation of protein structure is set, the next step is to choose a classifier to perform the classification. Support vector machines (SVM) are used here because of their good classification performance. SVM was originally designed for binary classification [25], whereas protein domain classification is an M-class problem. There are mainly two kinds of approaches for multi-class SVM [26]. Extensive experiments [10, 26] have shown that "One-Versus-Rest" (OVR), "One-Versus-One" (OVO) and "Directed Acyclic Graph" (DAG) are the more practical ones. Because of its convenient usage, OVO is used in this paper. In practice, the LibSVM software is used; it can be freely downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/ for academic research [26]. In addition, training is performed with the RBF kernel only in all experiments.
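A minimal sketch of this classification step, using scikit-learn's SVC as a stand-in (it wraps the same LibSVM library and applies the one-versus-one strategy to multi-class data by default). The feature matrices and the C and gamma values below are placeholders, not the settings used in the paper.

import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Placeholder feature matrices: 313 training / 385 test domains, 9-D (F1 + level-3 F2).
rng = np.random.default_rng(0)
X_train, y_train = rng.random((313, 9)), rng.integers(0, 27, 313)  # dummy fold labels
X_test = rng.random((385, 9))

# SVC wraps LibSVM and uses the one-versus-one strategy for multi-class data by default.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
clf.fit(X_train, y_train)
fold_pred = clf.predict(X_test)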
4 Experiment

4.1 Classification Result of Structural Class and Fold
Firstly, we use the composition and arrangement features to perform classification of structural class and fold, respectively. In order to keep the dimension of the feature vector as low as possible, only decompositions of up to 3 levels are used when calculating the arrangement feature. The results are listed in Table 1.

Table 1. The accuracy of classification with a single group of features

Feature        Dimension   Structural class (%)   Fold (%)
F1             2           86.23                  44.42
F2 Level 1     1           74.55                  28.57
F2 Level 2     3           74.03                  55.32
F2 Level 3     7           81.04                  68.05
Obviously, the composition feature (F1) is more effective for the classification of structural class, while the arrangement feature (F2) is better able to characterize the divergence between different folds. In order to achieve a better classification, we simply combine the composition feature with the arrangement feature and show the result in Table 2. Table 2 shows that the combination of the composition feature and the level-1 arrangement feature is enough to obtain a good result for structural class. The combination of the composition feature and the level-3 arrangement feature gives the best result for fold classification. Moreover, the greater the level of decomposition of the second ROI, the higher the accuracy of fold classification, but also the higher the dimension of the feature vector.
Table 2. The accuracy of classification with combined features

Feature           Dimension   Structural class (%)   Fold (%)
F1+F2 Level 1     3           90.65                  50.91
F1+F2 Level 2     5           91.17                  64.42
F1+F2 Level 3     9           90.65                  74.55
Consequently, the decomposition level can be chosen according to the requirements of the running environment when building a real application.

4.2 Comparison with Former Methods
In order to validate the effectiveness of the presented method, we compare it with several methods from the literature, all of which used the same benchmark dataset as the original paper [5]. These methods can be categorized into two groups according to their assessment approach. The first group [5, 7, 19, 20, 27] applies the training set to build the classifier model and uses the independent testing set to evaluate its performance. The second group [16, 21] combines the two sets and assesses classification performance by 10-fold cross-validation. The comparisons with the first and the second group are listed in Table 3 and Table 4, respectively.

Table 3. The comparison in the independent test

Method       Dimension of feature   Structural class (%)   Fold (%)
Ref. [5]     20                     N/A                    49.4
Ref. [5]     125                    N/A                    56.5
Ref. [27]    125                    80.52                  58.18
Ref. [19]    125                    N/A                    61.04
Ref. [20]    1007                   83.6                   65.5
Ref. [7]     1007                   87.0                   69.6
Our method   3/9                    90.65                  74.55

Table 4. The comparison in the 10-CV test

Method       Dimension of feature   Structural class (%)   Fold (%)
Ref. [21]    125                    84                     74
Ref. [16]    183                    N/A                    78
Our method   3/9                    93.70                  78.65
The results demonstrate that our method clearly outperforms the other methods, with both the highest classification accuracy and the lowest feature-vector dimension. Moreover, most of these methods exploited intricate pattern recognition techniques, for example feature fusion [19, 27], feature selection [7] and hierarchical classifier architectures [16, 20, 21]. Such techniques always increase the cost of building an application. In contrast, the presented method is an agile solution for SCOP protein structure classification.
5 Conclusion

In this paper, we have developed a novel method for rapid domain classification. Instead of directly analyzing the protein sequence or the 3-D tertiary structure, the presented method first maps the tertiary structure of a protein domain into a 2-D distance matrix. Then two Cα–Cα distance functions, for the alpha helix and the beta strand, are modeled by considering their respective geometrical properties. The distance functions are then applied to mine secondary structure elements in the distance matrix in a way similar to image filtering. Furthermore, composition and arrangement features of SSEs are presented to characterize domain structure for SCOP classification of structural class and fold. Finally, comparison with other methods shows that the presented method performs automatic domain classification effectively and efficiently, with the benefit of low-dimensional, meaningful features and no need for a complicated classifier system.

Acknowledgments. This work was supported by a grant from the National Natural Science Foundation of China (60872145) and the China Postdoctoral Science Foundation (20070421130).
References 1. Krissinel, E.: On the Relationship between Sequence and Structure Similarities in Proteomics. Bioinformatics 23, 717–723 (2007) 2. Andreeva, A., Howorth, D., Chandonia, J.-M., Brenner, S.E., Hubbard, T.J.P., Chothia, C., Murzin, A.G.: Data Growth and Its Impact on the SCOP Database: New Developments. Nucleic Acids Research 36, D419–D425 (2008) 3. Alison, L.C., Ian, S., Tony, L., Oliver, C.R., Richard, G., Janet, T., Christine, A.: The CATH Classification Revisited–Architectures Reviewed and New Ways to Characterize Structural Divergence in Superfamilies. Nucleic Acids Research 37, D310–D314 (2008) 4. Bastolla, U., Ortíz, A.R., Porto, M., Teichert, F.: Effective Connectivity Profile: A Structural Representation That Evidences the Relationship between Protein Structures and Sequences. Proteins: Structure, Function, and Bioinformatics 73, 872–888 (2008) 5. Ding, C.H.Q., Dubchak, I.: Multi-Class Protein Fold Recognition Using Support Vector Machines and Neural Networks. Bioinformatics 17, 349–358 (2001) 6. Shi, J.-Y., Zhang, S.-W., Liang, Y., Pan, Q.: Prediction of Protein Subcellular Localizations Using Moment Descriptors and Support Vector Machine. In: Rajapakse, J.C., Wong, L., Acharya, R. (eds.) PRIB 2006. LNCS (LNBI), vol. 4146, pp. 105–114. Springer, Heidelberg (2006) 7. Lin, K.L., Lin, C.-Y., Huang, C.-D., Chang, H.-M., Yang, C.-Y., Lin, C.-T., Tang, C.Y., Hsu, D.F.: Feature Selection and Combination Criteria for Improving Accuracy in Protein Structure Prediction. IEEE Transactions on NanoBioscience 6, 186–196 (2007) 8. Shi, J.-Y., Zhang, S.-W., Pan, Q., Zhou, G.-P.: Using Pseudo Amino Acid Composition to Predict Protein Subcellular Location: Approached with Amino Acid Composition Distribution. Amino Acids 35, 321–327 (2008) 9. Cai, Y.D., Liu, X.J., Xu, X.B., Chou, K.C.: Support Vector Machines for Prediction of Protein Subcellular Location by Incorporating Quasi-Sequence-Order Effect. Journal of Cellular Biochemistry 84, 343–348 (2002)
10. Shi, J.-Y., Zhang, S.-W., Pan, Q., Cheng, Y.-M., Xie, J.: Prediction of Protein Subcellular Localization by Support Vector Machines Using Multi-Scale Energy and Pseudo Amino Acid Composition. Amino Acids 33, 69–74 (2007) 11. Ankerst, M., Kastenmüller, G., Kriegel, H.-P., Seidl, T.: 3D shape histograms for similarity search and classification in spatial databases. In: Güting, R.H., Papadias, D., Lochovsky, F.H. (eds.) SSD 1999. LNCS, vol. 1651, pp. 207–228. Springer, Heidelberg (1999) 12. Daras, P., Zarpalas, D., Axenopoulos, A., Tzovaras, D., Strintzis, M.G.: ThreeDimensional Shape-Structure Comparison Method for Protein Classification. IEEE Trans. Comput. Biol. Bioinformatics 3, 193–207 (2006) 13. Gilbert, D., Westhead, D., Viksna, J., Thornton, J.: A Computer System to Perform Structure Comparison Using Tops Representations of Protein Structure. Comput. Chem. 26, 23– 30 (2001) 14. Anne, P.: Voronoi and Voronoi-Related Tessellations in Studies of Protein Structure and Interaction. Current Opinion in Structural Biology 14, 233–241 (2004) 15. Choi, I.-G., Kwon, J., Kim, S.-H.: Local Feature Frequency Profile: A Method to Measure Structural Similarity in Proteins. Proceedings of the National Academy of Sciences of the United States of America 101, 3797–3802 (2004) 16. Marsolo, K., Parthasarathy, S.: Alternate Representation of Distance Matrices for Characterization of Protein Structure. In: Proceedings of the Fifth IEEE International Conference on Data Mining, pp. 298–305. IEEE Computer Society, Los Alamitos (2005) 17. Sayre, T., Singh, R.: Protein Structure Comparison and Alignment Using Residue Contexts. In: Proceedings of the 22nd International Conference on Advanced Information Networking and Applications – Workshops, pp. 796–801. IEEE Computer Society, Los Alamitos (2008) 18. Shi, J.-Y., Zhang, S.-W., Pan, Q., Zhang, Y.-N.: Using Decision Templates to Predict Subcellular Localization of Protein. In: Rajapakse, J.C., Schmidt, B., Volkert, L.G. (eds.) PRIB 2007. LNCS (LNBI), vol. 4774, pp. 71–83. Springer, Heidelberg (2007) 19. Shi, J.-Y., Zhang, S.-W., Pan, Q., Liang, Y.: Protein Fold Recognition with Support Vector Machines Fusion Network. Progress in Biochemistry and Biophysics 33, 155–162 (2006) 20. Huang, C.-D., Lin, C.-T., Pal, N.R.: Hierarchical Learning Architecture with Automatic Feature Selection for Multiclass Protein Fold Classification. IEEE Transactions on NanoBioscience 2, 221–232 (2003) 21. Marsolo, K., Parthasarathy, S., Ding, C.: A Multi-Level Approach to SCOP Fold Recognition. In: Proceedings of the Fifth IEEE Symposium on Bioinformatics and Bioengineering, pp. 57–64. IEEE Computer Society, Los Alamitos (2005) 22. Chandonia, J., Hon, G., Walker, N., Lo Conte, L., Koehl, P., Levitt, M., Brenner, S.: The Astral Compendium in 2004. Nucleic Acids Research 32, D189–D192 (2004) 23. Taylor, W.R., Orengo, C.A.: Protein Structure Alignment. J. Mol. Biol. 208, 1–22 (1989) 24. Timothy, H., Irwin, K., Gordon, C.: The Theory and Practice of Distance Geometry. Bulletin of Mathematical Biology 45, 665–720 (1983) 25. Vapnik, V.N.: An Overview of Statistical Learning Theory. IEEE Transactions on Neural Networks 10, 988–999 (1999) 26. Hsu, C., Lin, C.J.: A Comparison of Methods for Multi-Class Support Vector Machines. IEEE Transactions on Neural Networks 13, 415–425 (2002) 27. Chinnasamy, A., Sung, W.K., Mittal, A.: Protein Structure and Fold Prediction Using Tree-Augmented Naive Bayesian Classifier. Journal of Bioinformatics and Computational Biology 3, 803–820 (2005)
Short Segment Frequency Equalization: A Simple and Effective Alternative Treatment of Background Models in Motif Discovery

Kazuhito Shida

Institute for Material Research, 2-1-1 Katahira, Aoba-ku, 980-8577 Sendai, Japan
[email protected]
Abstract. One of the most important pattern recognition problems in bioinformatics is de novo motif discovery. In particular, there is considerable room for improvement in motif discovery from eukaryotic genomes, where the sequences have complicated background noise. Short segment frequency equalization (SSFE) is a novel method for incorporating Markov background models into de novo motif discovery algorithms, namely Gibbs sampling. Despite its apparent simplicity, SSFE shows a large performance improvement over the current method (the Q/P scheme) when tested on artificial DNA datasets with human and mouse Markov backgrounds. Furthermore, SSFE shows better performance than other methods, including the much more complicated and sophisticated method Weeder 1.3, when tested on several biological datasets from human promoters. Keywords: Motif discovery, Markov background model, Eukaryotic promoters, Stochastic method, Gibbs sampling.
1 Introduction

Reliable de novo motif discovery remains an important pattern recognition problem that is unsolved in bioinformatics [1-7], in particular when the subjects are transcription factor binding sites (TFBS) in eukaryotic genomes [8] such as those of fruit fly, mouse, and human: eukaryotic sequences tend to have a more complicated statistical structure [9] than prokaryotic sequences do. Assuming that the input sequence is a mixture of two sequences generated from two statistical information sources, the Markov background model (noise) and the motif model (signal), many motif discovery algorithms seek a motif model that is maximally differentiated [3, 10-13] from the given background. It is understandable that the separation of signal and noise is difficult when the noise has a complicated statistical structure. However, these two information sources also differ in their spatial scale. In many cases of interest, the motif width is greater than the order of the Markov background model. Although a weak long-range correlation has been reported in genomic sequences, the magnitude of the correlation is a decreasing function of the correlation length [14, 15]. Therefore, it is clear that the non-motif information, the "noise" for motif discovery algorithms, is concentrated in the short-range regime of background
statistics. It is possible to suppress the noise and enhance the performance of motif discovery algorithms by selectively reducing the magnitude of the short-range, or high-frequency, portion of the sequence information. In other words, we need a sort of "low-pass filter" for the sequence information. A "high-pass filter" of sequence information is already realized and used to evaluate the statistical significance of alignments [10, 16]. For example, if the input sequences are cut into numerous non-overlapping pieces of length x and re-organized in a randomly shuffled order, all information contained at the spatial scale of x+1 or longer will be randomized and erased, without a large effect on the information found at shorter scales in the input sequences. This is exactly why the shuffled sequence is useful as the null hypothesis of sequence alignment and motif discovery. A sequential "low-pass filter" based on the shuffling principle seems difficult to realize. This report proposes adding a very simple modification, a "built-in filter", to conventional Gibbs sampling, thereby rendering the resultant sampling behavior low-pass filtered and noise-tolerant. This filtering method is called short segment frequency equalization (SSFE).
2 Method

2.1 Conventional Method

We take Gibbs sampling [11] as our starting point: Gibbs sampling is a type of Markov Chain Monte Carlo (MCMC) method that samples all possible blocks (gapless alignments) of width w in N input sequences of length L, with probability linear in Q/P,

$$Q = \prod_{y=1}^{N} Q(\mathrm{row}_y), \qquad P = \prod_{y=1}^{N} P(\mathrm{row}_y), \qquad (1)$$
where Q(row_y) and P(row_y) respectively signify the likelihood of the y-th row of the block under the current motif model and under the given background model. The likelihoods assigned to the entire block are denoted in boldface. Usually the motif model is a position weight matrix (PWM) from which the likelihood Q is calculated. The value of a PWM element, $q_{si}$, is the number ratio of letter s in the i-th column of the block, calculated with an appropriate pseudocount. The following is the outline of Gibbs-sampling-based motif discovery with the conventional treatment of a Bernoulli background. First, the current PWM, q, is calculated from the current alignment. In the row update (row resampling) phase of conventional Gibbs sampling, the y-th row of the current alignment is updated to be one of all possible length-w substrings (segments) in the y-th input sequence, sampled with probability

$$Q_x / P_x, \qquad (2)$$
where $Q_x$ signifies the likelihood that the x-th substring (comprising the x-th to (x+w−1)-th letters of the sequence, s(x) ~ s(x+w−1)) comes from the current model denoted by q, and $P_x$ is the likelihood that the same substring comes from a Bernoulli background denoted by p,

$$Q_x = \prod_{i=0}^{w-1} q_{s(x+i),\,i}, \qquad P_x = \prod_{i=0}^{w-1} p_{s(x+i)}. \qquad (3)$$
After the update is done, y is changed to (y+1)mod(N), such that all rows are updated in a cyclic manner. Subsequently, the entire process is repeated starting from the updated alignment. When Markov background models are used, the P part of the transition probability that is used in the original Gibbs sampling is changed to a sequence-dependent one, as
$$P_x = P\bigl(s(x+0)\,s(x+1)\,s(x+2)\cdots s(x+w-1)\bigr). \qquad (4)$$
The value of $P_x$ is given by the following formula when w is greater than the order of the Markov background model, m:

$$P_x = P\bigl(s(x+0)\cdots s(x+m)\bigr) \prod_{i=1}^{w-(m+1)} \frac{P\bigl(s(x+i)\cdots s(x+i+m)\bigr)}{\sum_{t=G,A,C,T} P\bigl(s(x+i)\cdots s(x+i+m-1)\,t\bigr)} \qquad (5)$$
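To spell out Eq. (5), the sketch below computes the likelihood of a substring under an m-th order Markov background from a table of (m+1)-mer probabilities. The uniform toy table is an assumption for illustration; in practice the probabilities come from genome-wide frequency files such as those shipped with Weeder.

def markov_likelihood(sub, kmer_prob, m):
    """P_x of Eq. (5): likelihood of substring `sub` under an m-th order Markov
    background.  `kmer_prob` maps every (m+1)-mer to its probability; the marginal
    of an m-mer is obtained by summing the last letter out (denominator of Eq. (5))."""
    p = kmer_prob[sub[:m + 1]]                    # P(s(x)...s(x+m))
    for i in range(1, len(sub) - m):
        num = kmer_prob[sub[i:i + m + 1]]         # P(s(x+i)...s(x+i+m))
        den = sum(kmer_prob[sub[i:i + m] + t] for t in "GACT")
        p *= num / den
    return p

# Toy second-order background (m = 2): uniform 3-mer probabilities.
uniform = {a + b + c: 1 / 64 for a in "GACT" for b in "GACT" for c in "GACT"}
print(markov_likelihood("TATCGT", uniform, m=2))  # 1/64 * (1/4)**3 under the toy table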
Because of its popularity [3, 17, 18], this method of incorporating background models will be designated the "conventional" method, or simply the "Q/P-scheme", throughout this report. The Q/P-scheme is surely an effective noise reduction scheme because it penalizes frequent m-mers being sampled as motif. However, it should be noted that there is no mathematical proof of the quantitative correctness of the penalty.

2.2 Proposal of a New Background Treatment

Basically, the SSFE method differs from the conventional Gibbs sampling scheme in only a very small but crucial point (Fig. 1): a likelihood according to a "modified background model", P', is used in place of P. The main characteristic of P' is that the behavior of the "Q/P'-scheme" is almost totally unbiased toward any short segment.
Fig. 1. Schematic explanation of the SSFE scheme (m=1). The size of the letters indicates the frequency of dimers. The equalization stage (left half) iteratively adjusts the background parameters such that no m+1-mer is preferentially sampled in the detection stage (right half). Note that each stage is a simple Gibbs sampling by itself.
Actually, P' is easily obtained using the following simple iterative process. Like P, P' is based on a Markov model, but of smaller order (hence "short segment"). For this report, the order of the model is chosen as m=2 (this is clearly shorter than most nucleotide motifs). First, some plausible Markov background is prepared as the initial point of the equalization stage. A short Gibbs sampling is then performed with the conventional scheme, using Q/P' calculated from the current background model. After each row update, the newly selected length-w segment is decomposed into w−m short segments (for example, "TATCGT" is decomposed into TAT, ATC, TCG and CGT) to evaluate the frequency with which m+1-mers are sampled under the current P'. If the evaluated sample frequency is biased beyond an appropriate threshold, the background model is adjusted to counterbalance the bias by increasing or decreasing the background parameters by a fixed step (more sophisticated optimization methods, e.g. a high-dimensional Newton–Raphson method on some "flatness of sampling" function, can actually be problematic for SSFE because it is difficult to calculate the Jacobian matrix of such a goal function). The updated background is then used to calculate P' in the next short Gibbs sampling.
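A highly simplified sketch of this equalization loop is given below. The step size, tolerance, round limit and the helper short_gibbs_sampling (assumed to run a conventional Q/P' sampler and return the currently aligned segments) are placeholders standing in for the actual implementation.

import itertools

def equalize(background, short_gibbs_sampling, m=2, step=0.01, tol=0.002, max_rounds=60):
    """Iteratively adjust an order-m background so that the Q/P' sampler draws
    every (m+1)-mer at near-equal frequency (the SSFE equalization stage).
    `background` maps each (m+1)-mer to its parameter in P'."""
    kmers = ["".join(k) for k in itertools.product("GACT", repeat=m + 1)]
    target = 1.0 / len(kmers)                        # e.g. 1/64 for m = 2
    for _ in range(max_rounds):
        segments = short_gibbs_sampling(background)  # rows sampled under the current P'
        counts = {k: 0 for k in kmers}
        for seg in segments:                         # decompose each length-w row
            for i in range(len(seg) - m):
                counts[seg[i:i + m + 1]] += 1
        total = sum(counts.values())
        freqs = {k: counts[k] / total for k in kmers}
        if max(abs(freqs[k] - target) for k in kmers) < tol:
            return background                        # sampling is flat enough
        for k in kmers:
            # Fixed-step counter-adjustment: an over-sampled (m+1)-mer has its
            # background parameter raised so that Q/P' penalizes it more next round.
            background[k] += step if freqs[k] > target else -step
            background[k] = max(background[k], 1e-6)
    return background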
Fig. 2. Sample frequencies (Y-axis) of different triplets converge to the near equal values, 1/64, as the iteration number (X-axis) of short Gibbs sampling in the equalization stage increases
358
K. Shida
Typically, after 30–60 short Gibbs samplings, each 2000–4000 steps long, the bias in the frequencies of m+1-mers is reduced to within the threshold (see Fig. 2), which means that the background model has converged to the optimal one for balancing m+1-mers. From this point on, we can start the detection stage, in which the Gibbs sampler based on the Q/P'-scheme samples any segment of any length at near-equal frequency, unless the sampling is disturbed by information retained at larger spatial scales in the input sequences. The most likely cause of such disturbance is the over-representation of mutually similar sequences of length greater than m+1. In other words, it is highly probable that the disturbance is related to the biological motifs. The selection of m=2 has no implication for the legitimacy of background models of higher order. The main reason to use m=2 segments as the target of equalization is that, although the direct equalization of longer segments is theoretically possible, it requires many more sampling steps in the iterative adjustment (if m=7 is used, at least 65,536 steps are necessary for each iteration) and the sampling error would be much larger.
3 Results

An SSFE sampler is implemented in the C++ language, as an extension of a previously reported motif discovery tool, GibbsST [19]. With minimal changes (the order of the Markov background model is changed, and the equalization stage is omitted), this SSFE sampler can precisely simulate a Q/P sampler with a Markov background of any order. The motif score must be chosen carefully: in Gibbs-sampling-based motif discovery, the value of Q/P is frequently regarded as the score and used for the selection of motif candidates. Because this is the first time that SSFE is proposed and tested, it should be tested in conditions closely resembling those of typical usage of the conventional method. Therefore, Q/P' and Q/P are used as the score functions in the tests of SSFE and the conventional sampler, respectively. Both algorithms start the sampling from a number (50) of randomly generated PWMs and output the best motif model with the largest observed Q/P' or Q/P. In short, we use a typical likelihood ratio score, but the background model may have been adjusted by SSFE. Test datasets are prepared under the following specifications. (1) Artificial motifs implanted in a biological background. Randomly generated (w,d) artificial motif sequences are implanted randomly into artificial sequences with biologically correct statistical features: background generation is performed according to the parameters of a seventh-order Markov background model provided as part of the Weeder 1.3 toolkit for fruit fly, mouse, and human. The motif width, w, is set to 8 (corresponding to the order of the background model). The number of mismatches per occurrence, d, is adjusted in conjunction with the number of sequences, N, and the length of an input, L, such that conventional Gibbs sampling shows modest success on the dataset, because motifs that are too easy or too difficult to find are unable to reveal differences between the two methods. The condition finally used is L=600, N=12, d=10/12. Although this condition seems slightly easier than the artificial motifs reported to be at the limit of detection possibility [4], this difference can be explained by the severe disturbance from the eukaryotic background models.
(2) Biological datasets from a eukaryotic genome. Confirmed human TFBS and their flanking promoter sequences were obtained mainly from a curated database of eukaryotic promoters, ABS [20]. TFBS with too few examples, large gaps, or overly variable structures were omitted by manual inspection to achieve a modest level of difficulty. Finally, five human TFs (CREB, SRF, TBP, USF, and E2F1) were used to construct our biological datasets. All sequences in the datasets have at least one TFBS (OOPS occurrence model), and the average sequence length was 504.6. In a sense, these data can be regarded as a test of SSFE for smaller N and larger w (up to 10). In addition to SSFE and conventional Gibbs sampling, two of the most successful motif discovery programs, MEME (v4.1) [13] and Weeder (v1.3) [21], are tested on this dataset. The seventh-order Markov model for human background sequence is used with Weeder, MEME, and conventional Gibbs sampling. The values of w given to these algorithms are the biologically correct ones, with one exception (TBP is processed by Weeder with w=6, because Weeder cannot use odd values of w). Performance is evaluated as the performance coefficient S, which is defined as
$$s_i = \max\bigl(0,\; \min(x_i + w,\, y_i + z_i) - \max(x_i,\, y_i)\bigr), \qquad S = \sum_{i=1}^{N} s_i \Big/ \sum_{i=1}^{N} (w + z_i - s_i) \qquad (6)$$
where $x_i$, $y_i$, and $z_i$ respectively signify the reported motif starting points, the correct motif starting points, and the correct motif widths in the input sequences. This coefficient is basically the fraction of correctly discovered motif sites (1.0 represents the best performance). In both the artificial and the biological tests, the "correct" length of the motif sequence is given to the algorithms. Moreover, the possibility of motif sequences on the reverse strand is excluded and the algorithms do not search the reverse strand. These settings are intended to give the whole test an appropriate difficulty, and not to favor SSFE sampling.

In Fig. 3, the performance observed for artificial datasets with a biological background model is shown. In the figure, the X-axis corresponds to individual datasets and the Y-axis shows the performance coefficients obtained with the two background cancellation methods, SSFE and Q/P, shown by upward and downward triangles, respectively. Wherever SSFE outperforms the conventional method, the gap separating the two performance coefficients is shaded dark gray; otherwise, the gap is shaded light gray. The X-axis is sorted to gather the light and dark gray regions as much as possible. The average values of the performance coefficients over the different datasets are also shown in the graph.
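Returning to the performance coefficient defined in Eq. (6), a minimal sketch of its computation (site positions taken as 0-based offsets; the example values are illustrative only):

def performance_coefficient(pred_starts, true_starts, true_widths, w):
    """Eq. (6): overlap-based fraction of correctly discovered motif sites.
    pred_starts x_i, true_starts y_i and true_widths z_i are per-sequence values;
    w is the motif width given to the algorithm."""
    s = [max(0, min(x + w, y + z) - max(x, y))
         for x, y, z in zip(pred_starts, true_starts, true_widths)]
    return sum(s) / sum(w + z - si for z, si in zip(true_widths, s))

# Example: two sequences, with predictions offset by 2 and 0 residues from the truth.
print(performance_coefficient([10, 40], [12, 40], [8, 8], w=8))  # 14/18 ~ 0.78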
Fig. 3. Performance improvement of the SSFE scheme on artificial datasets. Percentages (Y-axis) of found sites of artificial (w,d) motifs by the conventional and SSFE schemes for 300 test datasets (X-axis). The dark gray regions indicate where SSFE is superior to the conventional scheme; the light gray regions indicate the opposite. Although the datasets are artificial, their background statistics are those of fruit fly, mouse, and human.
Two features of these data may help to elucidate the difference between the conventional method and SSFE. First, SSFE does not increase the performance uniformly; it performs excellently on datasets for which the conventional method performs poorly, and vice versa, with the former outnumbering the latter. Second, it is unlikely to be a coincidence that the performance of SSFE is merely comparable to the conventional method in the case of the fruit fly, because the 8-mer distribution of the fruit fly is the least heterogeneous among the three background models tested. The ratio of the largest to the smallest 8-mer frequency is 904.3 for fruit fly, a much smaller value than for human (5879.0) and mouse (12992.5). Probably, the strength of SSFE cannot be exhibited when the background model approximates a Bernoulli model. The limited performance of motif discovery based on conventional Gibbs sampling for the human and mouse backgrounds strongly suggests that the conventional method cannot handle heterogeneous backgrounds properly. Although the data are not shown, these general trends do not change when several other settings of L, d, and N are tested. In Fig. 4, the performance observed for the human promoter datasets is shown. In short, the effectiveness of SSFE is not limited to artificial motifs. On all datasets except E2F1, the solution from SSFE is of better quality than the solutions from the other methods tested. It is noteworthy that the E2F1 dataset also requires the largest number
Fig. 4. Result of the SSFE scheme on biological data (human TFBS) compared to other algorithms. Successfully identified portions of TFBS by respective methods are marked black.
(ca. 100) of equalization steps for SSFE to converge. While MEME shows very good performance for TBP and E2F1 and fails completely on the other datasets, Weeder shows relatively good performance for CREB, SRF, and E2F1. For TBP and USF, however, Weeder fails to present the correct answer as the most likely answer. Apparently, the performance of SSFE on these datasets is better than that of the other methods tested. Considering the simplicity of SSFE, this level of performance enhancement observed for human data is remarkable.
4 Discussion

If the conventional scheme cannot handle heterogeneous backgrounds ideally, how did it manage to increase performance [3, 17, 18] in previous reports? The answer is speculated to lie in short low-complexity sequences in the input. According to the Weeder 1.3 frequency files, the 8-mers with the largest P in the human genome are "AAAAAAAA" and "TGTGTGTG"; for mouse, these are followed by "CACACACA" and "AGAGAGAG". It has often been pointed out [22] that these short repeats have a very strong disturbing effect on Gibbs sampling. Consequently, it is plausible that the Q/P-scheme was successful at least in alleviating these largest sources of problems (by imposing the maximum penalty on them) and thereby outperformed the older Gibbs sampling [11] that assumes only a Bernoulli background. The next question is more important but more difficult to answer: how is it possible for something as simple as SSFE to mark such a large increase in performance? To answer this question, more elaborate tests using a wider variety of test data should be conducted. In addition, we must develop at least some theory, not just an analogy like the "low-pass filter", of what the new score P' actually represents. The author is currently investigating the following hypothesis as a candidate for such a theory. There should be a quantitatively correct system of penalties on the score, under which the bias from the background model has absolutely no effect on the result of motif discovery algorithms. In SSFE, P' may work as a crude approximation of such an ideal penalty, because the equalization stage of SSFE basically adjusts its own sampling behavior to be as unaffected as possible by an input constituted of a large amount of background and only a small fraction of (often diverged) motif sequences. If this hypothesis is correct, SSFE will tend to wrongfully exclude correct answers when the motif sequences have a particularly large presence in the input (that is, when the motif is "easy" to discover): a good explanation for the result of SSFE on the E2F1 dataset. A possible solution for this weakness of SSFE is to take the motif score into account in the equalization stage, such that answers with a statistically meaningful level of score will not be penalized by P'. Whatever the true strength of SSFE, its success strongly suggests a large gap in our current understanding of pattern discovery under highly heterogeneous background models. At least, using the Q/P-scheme under a heterogeneous and complicated background model must be seriously re-considered. The basic idea of SSFE can be applied to other problems in bioinformatics. The idea of equalization in terms of sub-pattern sample frequency is applicable to other motif score functions and even to sequence analyses of other types that are strongly affected by the sequence background model. The background sequence statistics are apparently a major source of the
inherent complexity of the biological data. Therefore, improved treatment of background models should be given a much higher priority in future bioinformatics, for better processing of biological patterns. It is hoped that SSFE can serve as a good starting point for efforts in this direction.

Acknowledgments. The initial stage of this work was supported by "Special Coordination Funds for Promoting Science and Technology" of the Ministry of Education, Culture, Sports, Science and Technology.
References
1. Reddy, T.E., DeLisi, C., Shakhnovich, B.E.: Binding site graphs: A new graph theoretical framework for prediction of transcription factor binding sites. PLoS Computational Biology 3, 844–854 (2007)
2. Mahony, S., Hendrix, D., Golden, A., Smith, T.J., Rokhsar, D.S.: Transcription factor binding site identification using the self-organizing map. Bioinformatics 21, 1807–1814 (2005)
3. Liu, X.S., Brutlag, D.L., Liu, J.S.: An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol. 20, 835–839 (2002)
4. Pevzner, P.A., Sze, S.H.: Combinatorial approaches to finding subtle signals in DNA sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol. 8, 269–278 (2000)
5. Rigoutsos, I., Floratos, A.: Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics 14, 55–67 (1998)
6. Sinha, S., Tompa, M.: YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Research 31, 3586–3588 (2003)
7. Pavesi, G., Zambelli, F., Pesole, G.: WeederH: an algorithm for finding conserved regulatory motifs and regions in homologous sequences. BMC Bioinformatics 8 (2007)
8. Tompa, M., Li, N., Bailey, T.L., Church, G.M., De Moor, B., Eskin, E., Favorov, A.V., Frith, M.C., Fu, Y., Kent, W.J., Makeev, V.J., Mironov, A.A., Noble, W.S., Pavesi, G., Pesole, G., Regnier, M., Simonis, N., Sinha, S., Thijs, G., van Helden, J., Vandenbogaert, M., Weng, Z., Workman, C., Ye, C., Zhu, Z.: Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol. 23, 137–144 (2005)
9. Csuros, M., Noe, L., Kucherov, G.: Reconsidering the significance of genomic word frequencies. Trends in Genetics 23, 543–546 (2007)
10. Neuwald, A.F., Liu, J.S., Lawrence, C.E.: Gibbs motif sampling: detection of bacterial outer membrane protein repeats. Protein Sci. 4, 1618–1632 (1995)
11. Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., Wootton, J.C.: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262, 208–214 (1993)
12. Frith, M.C., Hansen, U., Spouge, J.L., Weng, Z.: Finding functional sequence elements by multiple local alignment. Nucleic Acids Res. 32, 189–200 (2004)
13. Bailey, T.L., Elkan, C.: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2, 28–36 (1994)
14. Messer, P.W., Bundschuh, R., Vingron, M., Arndt, P.F.: Effects of long-range correlations in DNA on sequence alignment score statistics. Journal of Computational Biology 14, 655–668 (2007)
15. Herzel, H., Trifonov, E.N., Weiss, O., Grosse, I.: Interpreting correlations in biosequences. Physica A 249, 449–459 (1998)
16. Fitch, W.M.: Random Sequences. Journal of Molecular Biology 163, 171–176 (1983)
17. Thijs, G., Lescot, M., Marchal, K., Rombauts, S., De Moor, B., Rouze, P., Moreau, Y.: A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics 17, 1113–1122 (2001)
18. Narasimhan, C., LoCascio, P., Uberbacher, E.: Background rareness-based iterative multiple sequence alignment algorithm for regulatory element detection. Bioinformatics 19, 1952–1963 (2003)
19. Shida, K.: GibbsST: a Gibbs sampling method for motif discovery with enhanced resistance to local optima. BMC Bioinformatics 7 (2006)
20. Blanco, E., Farre, D., Alba, M.M., Messeguer, X., Guigo, R.: ABS: a database of Annotated regulatory Binding Sites from orthologous promoters. Nucleic Acids Res. 34, D63–D67 (2006)
21. Pavesi, G., Mereghetti, P., Mauri, G., Pesole, G.: Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Research 32, W199–W203 (2004)
22. van Helden, J.: The analysis of regulatory sequences. In: Chatenay, D., Cocco, S., Monasson, R., Thieffry, D., Dailbard, J. (eds.) Multiple aspects of DNA and RNA from biophysics to bioinformatics, pp. 271–304. Elsevier, Amsterdam (2005)
Bayesian Optimization Algorithm for the Non-unique Oligonucleotide Probe Selection Problem
Laleh Soltan Ghoraie, Robin Gras, Lili Wang, and Alioune Ngom
Bioinformatics and PRML Lab, Department of Computer Science, University of Windsor, 401 Sunset Ave., Windsor, ON, N9B 3P4, Canada
{soltanl,rgras,wang111v,angom}@uwindsor.ca
Abstract. DNA microarrays are used in order to recognize the presence or absence of different biological components (targets) in a sample. Therefore, the design of the microarrays which includes selecting short Oligonucleotide sequences (probes) to be affixed on the surface of the microarray becomes a major issue. This paper focuses on the problem of computing the minimal set of probes which is able to identify each target of a sample, referred to as Non-unique Oligonucleotide Probe Selection. We present the application of an Estimation of Distribution Algorithm (EDA) named Bayesian Optimization Algorithm (BOA) to this problem, for the first time. The presented approach considers integration of BOA and state-of-the-art heuristics introduced for the non-unique probe selection problem. This approach provides results that compare favorably with the state-of-the-art methods. It is also able to provide biologists with more information about the dependencies between the probe sequences of each dataset. Keywords: Microarray, Probe Selection, Target, Estimation of Distribution Algorithm, Bayesian Optimization Algorithm, Heuristic.
1 Introduction
Microarrays are the tools typically used for measuring the expression levels of thousands of genes in parallel. They are specifically applicable in performing many simultaneous gene expression experiments [10]. Gene expression level is measured based on the amount of mRNA sequences bound to their complementary sequences affixed on the surface of the microarray. This binding process is called hybridization. The complementary sequences are called probes, which are typically short DNA strands of about 8 to 30 bp [13]. Another important application of microarrays is the identification of unknown biological components in a sample [4]. Knowing the sequences affixed on the microarray and considering the hybridization pattern of the sample, one can infer which targets exist in the sample. These applications require finding a good design for microarrays. By microarray design, we mean finding the appropriate set of probes to be affixed on the surface
of the microarray. The appropriate design should lead to cost-efficient experiments. Therefore, while the quality of the probe set is important, the objective of finding a minimal set of probes should also be considered. Two approaches are considered for the probe selection problem, namely unique and non-unique probe selection. In unique probe selection, for each single target there is one unique probe to which it hybridizes. This means that, under the specified experimental conditions, the probe should not hybridize to any target except its intended target. However, finding unique probes is very difficult, especially for biological samples containing similar genetic sequences [4][5][6][8][10][11][12][13]. In non-unique probe selection, each probe may hybridize to more than one target. Our focus in this paper is on non-unique probe selection. We present a method to find the smallest possible set of probes capable of identifying the targets in a sample. It should be noted that this minimal probe set is chosen with respect to a target-probe incidence matrix consisting of candidate probes and the pattern of hybridization of targets to them. Computing the set of candidate probes (incidence matrix) among all the possible non-unique probes is not a trivial task [4]. Many parameters such as secondary structure, salt concentration, GC content, hybridization energy, and hybridization errors such as cross-hybridization, self-hybridization, and non-sensitive hybridization should be taken into account in computing the set of candidate probes for oligonucleotide probe selection [12]. We assume that the problem of computing the target-probe incidence matrix has been solved, and our focus is minimizing the design given by this matrix. This paper is organized as follows. Section 2 provides a detailed description of the non-unique probe selection problem. Related work is reviewed in Section 3. In Section 4, we describe our approach to solving the non-unique probe selection problem; a review of the main concepts of the Bayesian Optimization Algorithm (BOA) is also presented, its advantages over Genetic Algorithms (GA) are discussed, the heuristics which we have integrated into the BOA are discussed, and a new heuristic is presented. We discuss the results of our experiments in Section 5. Finally, we conclude this work with a discussion of possible future research directions and open problems in Section 6.
2 Problem Definition
We illustrate the probe selection problem with an example. Assume that we have a target-probe incidence matrix H = (hij ) of a set of three targets (t1 ,...,t3 ) and five probes (p1 ,...,p5 ), where hij = 1, if probe j hybridizes to target i, and 0 otherwise (see Table 1). The problem is to find the minimal set of probes which identifies all targets in the sample. First, we assume that the sample contains single target. Using a probe set of {p1 , p2 }, we can recognize the four different situations of ‘no target present in the sample’, ‘t1 is present’, ‘t2 is present’, and ‘t3 is present’ in the sample. The minimal set of probes in this case is {p1 , p2 } since {p1 } or {p2 } cannot detect these four situations. Consider the case that multiple targets are present in the sample. In this case, the chosen probe set should be able to distinguish between the events in which all subsets (of all
Table 1. Sample target-probe incidence matrix

      p1  p2  p3  p4  p5
t1     0   1   1   0   0
t2     1   0   0   1   0
t3     1   1   0   0   1
possible cardinalities) of the target set may occur. The probe set {p1, p2} is not good enough for this purpose: with this probe set, we cannot distinguish between the case of having subset {t1, t2} and subset {t2, t3} in the sample. The probe set {p3, p4, p5}, however, can distinguish between all events in this case. A more formal definition of the probe selection problem is given below. Given the target-probe incidence matrix H, and parameters smin ∈ N and cmin ∈ N, the goal is to select a minimal probe set such that each target is hybridized by at least cmin probes (minimum coverage constraint), and any two subsets of targets are separated by means of at least smin probes (minimum separation constraint) [5] [4]. A probe separates two subsets of targets if it hybridizes to exactly one of them. The probe selection problem is proven to be NP-hard [2], and is a variation of the minimal set covering problem, a classical combinatorial optimization problem. The smallest incidence matrix in the literature contains about 256 targets and 2786 probes. The non-unique probe selection problem can be approached as an optimization problem. The objective function to be minimized is the number of probes (the variables of the function), and the search space of the problem consists of 2^(number of probes) possible solutions, which makes this problem very difficult to solve, even with powerful computers [8]. In this paper, we solve the single-target case, and an EDA (Estimation of Distribution Algorithm) named BOA (Bayesian Optimization Algorithm), integrated with some state-of-the-art probe selection heuristics, is used to design an efficient algorithm.
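As an illustration of the two constraints, the following sketch (not from the paper; the helper name and data layout are our own assumptions) checks whether a candidate probe set satisfies cmin-coverage and, for the single-target case, smin-separation of all target pairs, given a 0/1 incidence matrix such as the one in Table 1.

# Illustrative feasibility check for the coverage and separation constraints of Section 2.
# H is a target-by-probe 0/1 incidence matrix; `selected` holds chosen probe column indices.
from itertools import combinations

def is_feasible(H, selected, c_min, s_min):
    n_targets = len(H)
    # Minimum coverage: every target must be hybridized by at least c_min chosen probes.
    for i in range(n_targets):
        if sum(H[i][j] for j in selected) < c_min:
            return False
    # Minimum separation (single-target case): every pair of targets must be separated
    # by at least s_min chosen probes, i.e. probes hybridizing to exactly one of the pair.
    for i, k in combinations(range(n_targets), 2):
        if sum(abs(H[i][j] - H[k][j]) for j in selected) < s_min:
            return False
    return True

# Toy check with the Table 1 matrix and c_min = s_min = 1:
H = [[0, 1, 1, 0, 0],
     [1, 0, 0, 1, 0],
     [1, 1, 0, 0, 1]]
print(is_feasible(H, {2, 3, 4}, 1, 1))  # {p3, p4, p5} -> True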
3 Previous Work
Several research works have been conducted in both unique and non-unique probe selection. Rash et al. [9] focused on the assumption of a single target in the sample. Considering the probes as substrings of the original strings (genes), they used a suffix tree method and Integer Linear Programming. Assuming the presence of multiple targets, Schliep et al. [10] introduced a fast heuristic which guaranteed the separation of up to a randomly chosen number N (e.g., N = 500000) of pairs of target sets. In this work, cross-hybridization and experimental errors were explicitly taken into account for the first time. Klau et al. [5] extended this work, and presented an ILP (Integer Linear Programming) formulation and a branch-and-cut algorithm to reduce the size of the chosen probe set. The ILP formulation was extended to a more general version which also includes group separation [4]. Meneses et al. [6] used a two-phased heuristic to construct a solution and reduce its size for the case of a single target. Ragle et al.
[8] applied a cutting-plane approach with reasonable computation time, and achieved the best results for some of the benchmark datasets in the case of a single target. It does not use any a priori method to decrease the number of initial probes. Wang et al. [12] focused on the single-target problem, and presented deterministic heuristics to solve the ILP formulation and reduce the size of the final probe set. They applied a model-based approach for coverage and separation in order to guide the search for the appropriate probe set under the assumption of a single target in the sample. Recently, Wang et al. [11] presented a combination of a genetic algorithm and the selection functions used in [12], and obtained results which are in some cases better than those of [8].
4 BOA and Non-unique Probe Selection
Our approach is based on the Bayesian Optimization Algorithm (BOA) in combination with a heuristic. Two of the heuristics, Dominated Row Covering (DRC) and Dominant Probe Selection (DPS), are the ones introduced in [12] for solving the non-unique probe selection problem. We also modify some of the function definitions of DRC, and introduce a new heuristic in order to capture more information.

4.1 Bayesian Optimization Algorithm
The BOA is an EDA (Estimation of Distribution Algorithm) method, first introduced by Pelikan [7]. EDAs are also called Probabilistic Model-Building Genetic Algorithms (PMBGA), which extend the concept of classical GAs. In EDA optimization methods, the principle is to generate a sample of the search space and use the information extracted from that sample to explore the search space more efficiently. The EDA approach is an iterative one consisting of these steps: (1) Initialization: a set of random solutions is generated (the first sample of the search space); (2) Evaluation of the solutions' quality; (3) Biased random choice of a subset of solutions such that higher-quality solutions have a higher probability of being chosen; (4) Construction of a probabilistic model of the sample; (5) Use of the model to generate a new set of solutions, returning to (2). In BOA, the constructed probabilistic model is a Bayesian network. Considering a Bayesian network as a Directed Acyclic Graph, the nodes represent the variables of the problem and the dependencies among the variables are simulated by the directed edges introduced to each node. Constructing a Bayesian network allows discovering and representing the possible dependencies between the variables of the problem. Some difficult optimization problems contain dependencies. Classical GAs have been shown to be unable to solve this category of problems [3], but the BOA approach has been more successful in solving them. It is therefore interesting to apply the BOA approach to the complex non-unique probe selection optimization problem. In this problem each (binary) variable represents the presence or absence of a particular probe in the final design matrix. The dependencies among variables represent the fact that choosing a particular probe has a consequence on the
choice of other probes in an optimal solution. Pelikan and Goldberg [7] [1] have proven that when the number of variables and the number of dependencies are n and k, respectively, the size of the sample should be about O(2^k · n^1.05) to guarantee convergence. There are several advantages in applying this new approach. First, BOA is known as an efficient way to solve complex optimization problems; it is therefore interesting to compare it with other methods applied to the non-unique probe selection problem. Second, the EDA methods, by working on samples of the search space and deducing the properties of dependencies among the variables of the problem, are able to reveal new knowledge about the biological mechanism involved (see Section 5.2). Finally, through the study of the results obtained from experimenting with different values of the parameter k, BOA provides the ability to evaluate the level of complexity of the non-unique probe selection problem in general, and the specific complexity of the classical set of problems used to evaluate the algorithms for this problem in particular.
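The generic EDA cycle of steps (1)-(5) above can be outlined as in the schematic sketch below. This is an illustration only, not the authors' implementation: for brevity the probabilistic model is a simple univariate marginal distribution, whereas BOA itself builds a Bayesian network in step (4).

# Schematic EDA loop (UMDA-style model substituted for BOA's Bayesian network).
import random

def eda_optimize(n_vars, fitness, pop_size=100, generations=50):
    # (1) Initialization: random bit strings sampled from the search space.
    pop = [[random.randint(0, 1) for _ in range(n_vars)] for _ in range(pop_size)]
    for _ in range(generations):
        # (2)-(3) Evaluate and keep the better half (biased selection).
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        # (4) Build a probabilistic model of the selected solutions
        #     (here: per-variable frequency of ones).
        probs = [sum(p[i] for p in parents) / len(parents) for i in range(n_vars)]
        # (5) Sample new solutions from the model and iterate.
        offspring = [[1 if random.random() < probs[i] else 0 for i in range(n_vars)]
                     for _ in range(pop_size - len(parents))]
        pop = parents + offspring
    return max(pop, key=fitness)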
4.2 Our Approach
In this section, we explain the details of our approach to solving the non-unique probe selection problem. Wang et al. [12] introduced two heuristics for solving the non-unique probe selection problem. We integrated these heuristics into BOA in order to guarantee the feasibility of the obtained solutions. A feasible solution is one which satisfies the coverage and separation constraints of the non-unique probe selection problem defined in Section 2. Since we discuss the case of a single target in the sample, the separation constraint is applied to target pairs only. This means that we do not focus on the separation of all possible subsets of targets.
4.3 Heuristics
As mentioned above, our algorithm applies three heuristics in combination with the BOA. Two of the heuristics are those proposed by Wang et al. [12], namely Dominated Row Covering (DRC) and Dominant Probe Selection (DPS). A third heuristic has also been used in our experiments, which we named Sum of Dominated Row Covering (SDRC). In this heuristic, we modified the definitions of the functions C(p_j) (coverage function) and S(p_j) (separation function) of DRC:

C(p_j) = \max_{t_i \in T_{p_j}} \{ \mathrm{cov}(p_j, t_i) \}, \quad 1 \le j \le n \qquad (1)

where T_{p_j} is the set of targets covered by p_j, and

S(p_j) = \max_{t_{ik} \in T^2_{p_j}} \{ \mathrm{sep}(p_j, t_{ik}) \}, \quad 1 \le j \le n \qquad (2)

where T^2_{p_j} is the set of target pairs separated by the probe p_j.
Before discussing our modifications, we describe the probe selection functions used in DRC (for further information on the DPS selection functions, see Wang et al. [12]). Given the target-probe incidence matrix H, the probe set P = {p_1, ..., p_n}, and the target set T = {t_1, ..., t_m}, the functions cov and sep are defined over P × T and P × T^2, respectively, as follows:

\mathrm{sep}(p_j, t_{ik}) = |h_{ij} - h_{kj}| \times \frac{s_{\min}}{|P_{t_{ik}}|}, \quad p_j \in P_{t_{ik}},\ t_{ik} \in T^2 \qquad (3)

\mathrm{cov}(p_j, t_i) = h_{ij} \times \frac{c_{\min}}{|P_{t_i}|}, \quad p_j \in P_{t_i},\ t_i \in T \qquad (4)

where P_{t_i} is the set of probes hybridizing to target t_i, and P_{t_{ik}} is the set of probes separating target pair t_{ik}. Function C favors the selection of probes that c_min-cover dominated targets: target t_i dominates target t_j if P_{t_j} ⊆ P_{t_i}. Function S favors the selection of probes that s_min-separate dominated target pairs: target pair t_{ij} dominates target pair t_{kl} if P_{t_{ij}} ⊆ P_{t_{kl}}. The functions C(p_j) and S(p_j) are defined as the maximum over the values of cov and sep, respectively. The selection function D(p_j), defined as follows, indicates the degree of contribution of p_j:

D(p_j) = \max\{ C(p_j), S(p_j) \}, \quad 1 \le j \le n \qquad (5)
The probes with the highest value of D(p_j) are the candidate probes for the solution probe set. The values of the coverage and separation functions under the DRC definitions are given in rows C and S of Tables 2 and 3, respectively [12]. By the DRC definitions, four of the probes have the same score for the coverage of the dominated targets and the same score for the separation of the dominated target pairs, and D(p_1) = D(p_3) = D(p_4) = D(p_5) = cmin/3. However, it can be noticed from Tables 2 and 3 that each of these probes has distinct covering and separating properties, which are not reflected by the current DRC function definitions. In order to capture this information, we modified the two functions C(p_j) and S(p_j) to C'(p_j) and S'(p_j), respectively, in the SDRC (see Eqs. 6 and 7 below). The values of C'(p_j) and S'(p_j) have also been calculated and are presented in Tables 2 and 3. In the SDRC, the D score is calculated in the same way as the D function in DRC (see Eq. 5).

Table 2. Coverage function table: C has been calculated based on the DRC definition, and C' based on the SDRC definition

         p1          p2          p3           p4           p5          p6
t1     cmin/4      cmin/4        0          cmin/4         0         cmin/4
t2     cmin/3        0         cmin/3         0            0         cmin/3
t3       0         cmin/5      cmin/5       cmin/5       cmin/5      cmin/5
t4       0           0         cmin/3       cmin/3       cmin/3        0
C      cmin/3      cmin/4      cmin/3       cmin/3       cmin/3      cmin/3
C'    7cmin/12    9cmin/20    13cmin/15    47cmin/60     8cmin/15   47cmin/60
Table 3. Separation function table: S has been calculated based on the DRC definition, and S' based on the SDRC definition

          p1          p2          p3          p4          p5          p6
t12       0         smin/3      smin/3      smin/3        0           0
t13     smin/3        0         smin/3        0         smin/3        0
t14     smin/5      smin/5      smin/5        0         smin/5      smin/5
t23     smin/4      smin/4        0         smin/4      smin/4        0
t24     smin/4        0           0         smin/4      smin/4      smin/4
t34       0         smin/2        0           0           0         smin/2
S       smin/3      smin/2      smin/3      smin/3      smin/3      smin/2
S'    31smin/30   77smin/60   13smin/15    5smin/6    31smin/30   19smin/20
C'(p_j) = \sum_{t_i \in T_{p_j}} \mathrm{cov}(p_j, t_i), \quad 1 \le j \le n \qquad (6)

S'(p_j) = \sum_{t_{ik} \in T^2_{p_j}} \mathrm{sep}(p_j, t_{ik}), \quad 1 \le j \le n \qquad (7)
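A small sketch of how these scores could be computed is given below. It is an illustration under our own assumptions (a 0/1 target-by-probe matrix H represented as Python lists), not the authors' implementation; in SDRC the selection score D of Eq. (5) is taken over C' and S'.

# Illustrative computation of the SDRC scores of Eqs. (3), (4), (6) and (7).
from itertools import combinations

def sdrc_scores(H, c_min, s_min):
    m, n = len(H), len(H[0])
    # |P_{t_i}|: number of probes hybridizing to target i
    cover_sizes = [sum(H[i]) for i in range(m)]
    # |P_{t_ik}|: number of probes separating target pair (i, k)
    pairs = list(combinations(range(m), 2))
    sep_sizes = {p: sum(abs(H[p[0]][j] - H[p[1]][j]) for j in range(n)) for p in pairs}
    C_prime, S_prime = [0.0] * n, [0.0] * n
    for j in range(n):
        for i in range(m):                      # Eq. (6): sum of cov(p_j, t_i)
            if H[i][j] and cover_sizes[i]:
                C_prime[j] += c_min / cover_sizes[i]
        for (i, k) in pairs:                    # Eq. (7): sum of sep(p_j, t_ik)
            if abs(H[i][j] - H[k][j]) and sep_sizes[(i, k)]:
                S_prime[j] += s_min / sep_sizes[(i, k)]
    # Selection score: the larger of the two contributions (cf. Eq. (5)).
    D = [max(c, s) for c, s in zip(C_prime, S_prime)]
    return C_prime, S_prime, D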
4.4 The Combination of BOA and Heuristics
We have applied the modified version of BOA to the non-unique probe selection problem. The goal is to find the minimum set of probes that satisfies the coverage and separation constraints. In each iterative step of BOA, we generate a population of solutions. Each solution represents a set of probes and is basically a string of zeros and ones, where each position in the string indicates a probe; the presence or absence of each probe in the solution is denoted by 1 or 0, respectively. After generating the population, the feasibility of each solution is guaranteed by applying one of the heuristics described in Section 4.3. That is, each solution in the current population is transformed so as to respect the problem constraints. All three of the applied heuristics include a reduction phase, in which solutions are shortened while maintaining their feasibility. In order to measure the quality of the obtained solutions and distinguish the best and worst solutions in the population, an objective function should be defined. Since the goal is to find the minimal probe set, we use the inverse of the length of a solution as our objective function. The length of a solution corresponds to the cardinality of the probe set, and it is given by the number of ones in the solution. The larger the objective function value, the higher the quality of the obtained solution.
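The encoding and objective just described can be summarized by the following sketch, in which `repair` is a hypothetical placeholder for one of the heuristics (DRC, DPS or SDRC) applied to enforce feasibility and shorten the bit string; this is an illustration, not the paper's code.

# Fitness of a candidate probe set: the inverse of its cardinality after repair.
def fitness(solution, repair):
    feasible = repair(solution)          # enforce coverage/separation, then reduce
    n_probes = sum(feasible)             # number of ones = size of the probe set
    return 1.0 / n_probes if n_probes else 0.0   # larger value = smaller probe set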
5 Results of Computational Experiments
We combined BOA with the heuristics DRC, DPS, and SDRC for the non-unique probe selection problem, and found that we are able to improve upon the results obtained by the best methods in the literature. It should be noted that our approach
is more time-consuming than other approaches in the literature, but we did not focus on comparing our approach to the latest approaches in terms of execution time, because the design of a microarray is not a repetitive task; the main concern in this process is the quality of the design. Our programs were written in C++, and experiments were performed on Sharcnet systems [14].

5.1 Data Sets
The experiments were performed on ten artificial datasets named a1, ..., a5, b1, ..., b5, and two real datasets HIV1 and HIV2. These datasets have been used in the experiments of all the previous works mentioned in Section 3, except for HIV1 and HIV2, which were not used in [5][4]. The datasets and the related target-probe incidence matrices were kindly provided to us by Dr. Pardalos and Dr. Ragle [8]. The number of targets and probes of each dataset is presented in Table 4, along with the number of virtual probes required for each dataset to guarantee the feasibility of the original probe set.

5.2 Results and Discussions
In all experiments, the parameters cmin and smin were set to ten and five, respectively. Each run of BOA was executed for 100 iterative steps. The number of probes in each dataset is the number of variables (n) used in the BOA. Based on the convergence condition of BOA mentioned in Section 4.1, the population size should be of O(2^k · n^1.05). Two different series of experiments were performed, and the results are presented below. In each series, we chose the population size for each dataset proportional to the number of variables, which is the sum of the number of real probes and the number of virtual probes of the dataset. The considered level of dependency (k) among variables is controlled by a parameter named maximum incoming edges in the BOA software.

Experiments with the default parameters. The first series of experiments was performed with the default parameters of BOA [15]. For instance, the maximum number of incoming edges to each node was set to two, and the percentage of the offspring and parents in the population was set to 50. The results we obtain with this approach are presented in Table 4. The comparison between the results is based on the minimum set of probes obtained from each approach. We have named the combinations of BOA and the heuristics DRC, DPS, and SDRC respectively BOA+DRC, BOA+DPS, and BOA+SDRC. Three columns are included for the experiments performed by the state-of-the-art approaches Integer Linear Programming (ILP) [5][4], Optimal Cutting Plane Algorithm (OCP) [8], and Genetic Algorithm (DRC-GA) [11]. The last three columns show the improvement of our approach over each of the three latest approaches. The improvement is calculated by Eq. (8):

Imp = \frac{P_{\min}^{BOA+DRC} - P_{\min}^{Method}}{P_{\min}^{Method}} \times 100 \qquad (8)

where Method can be substituted by either ILP, OCP, or DRC-GA.
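For instance, Eq. (8) can be checked directly against the a1 row of Table 4 below (a purely illustrative snippet):

# Worked check of Eq. (8) on dataset a1 (values taken from Table 4).
def improvement(p_boa_drc, p_method):
    return (p_boa_drc - p_method) / p_method * 100.0

print(round(improvement(502, 503), 2))  # vs. ILP: -0.20
print(round(improvement(502, 509), 2))  # vs. OCP: -1.38 (reported as -1.37 in Table 4)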
Table 4. Comparison of the cardinality of the minimal probe set for different approaches: performance of various algorithms evaluated on datasets with different numbers of targets (|T|), probes (|P|), and virtual probes (|V|). The last three columns show the improvement of BOA+DRC over the three methods ILP, OCP, and DRC-GA (see Eq. 8).

Set   |T|   |P|   |V|  ILP[5][4]  OCP[8]  DRC-GA[11]  BOA+SDRC  BOA+DPS  BOA+DRC   ILP    OCP   DRC-GA
a1    256  2786    6     503       509       502        503       503      502    -0.20  -1.37     0
a2    256  2821    2     519       494       490        492       491      490    -5.59  -0.81     0
a3    256  2871   16     516       543       534        535       533      533    +1.35  -2.02   -0.18
a4    256  2954    2     540       539       537        540       538      537    -0.55  -0.37     0
a5    256  2968    4     504       529       528        530       530      528    +4.76  -0.19     0
b1    400  6292    0     879       830       839        843       837      834    -5.12  +0.50   -0.60
b2    400  6283    1     938       842       852        853       849      846    -9.81  +0.47   -0.70
b3    400  6311    5     891       827       835        839       831      829    -6.96  +0.24   -0.72
b4    400  6223    0     915       873       879        877       877      875    -4.37  +0.23   -0.45
b5    400  6285    3     946       874       890        887       886      879    -7.08  +0.57   -1.23
HIV1  200  4806   20      -        451       450        452       450      450      -    -0.22     0
HIV2  200  4686   35      -        479       476        479       475      474      -    -1.04   -0.42
The calculated value of Imp is negative (positive) when BOA+DRC returns a probe set smaller (larger) than P_min^Method; therefore, a smaller value of Imp indicates greater efficiency of the BOA+DRC method. For instance, regarding Table 4 (last three columns), for dataset a3 our approach obtained 0.18% and 2.02% better results (a smaller probe set) than DRC-GA and OCP, respectively, and a 1.35% worse result (a larger probe set) than ILP. As shown in Table 4, the best results are obtained with BOA+DRC, while we expected better results from BOA+DPS, because DPS had shown better performance on the non-unique probe selection problem [12]. The results obtained by [8] are considered the best in the literature for the non-unique probe selection problem. As shown in Table 4, Wang et al. [11] have recently reported results (noted as DRC-GA) which are comparable to (and in most cases better than) [8]. Comparing our approach to all three of these efficient approaches, we have been able to improve the result of non-unique probe selection for dataset HIV2, obtaining the shortest solution length of 474. The results we obtained for datasets a1, a2, a4, and HIV1 are also equal to the best results reported for these datasets in the literature. Another comparison, based on the number of datasets, is presented in Table 5. Another important advantage of our approach over other methods is that BOA can provide biologists with useful information about the dependencies between the probes of the dataset. In each experiment, we have stored the scheme of the relations between variables (probes) found by BOA. By means of this information, we can determine which probes are related to each other. Therefore, we can conclude that the targets these probes hybridize
Table 5. Comparison between BOA+DRC and ILP, OCP, and DRC-GA: number of datasets for which our approach obtained results better than, worse than, or equal to the methods ILP, OCP, and DRC-GA. The Average column gives the average improvement of our approach (from the last three columns of Table 4).

          Worse  Equal  Better  Average
ILP         2      0      8      -3.36
OCP         5      0      7      -0.33
DRC-GA      0      5      7      -0.36
Fig. 1. Part of the BOA output for dataset HIV2: the discovered dependencies for probes 30 to 38 by BOA
to also have correlations with each other. A part of the dependencies obtained for dataset HIV2 is presented in Figure 1, which shows part of the output of the BOA software: probes 30 to 38 and their dependencies on other probes are illustrated. As shown, no dependency was discovered for probes 30, 31, and 34. Probe 32 has two incoming edges, from probes 1720 and 4184. This means that when probes 1720 and 4184 are selected for the final probe set, probe 32 has a high probability of also being selected.

Experiments for investigation of dependency. We conducted another series of experiments in order to study the effect of increasing the number of dependencies searched by BOA; the parameter maximum incoming edges represents this in BOA. As mentioned before, this parameter was set to two for the previous experiments. We increased this number to three and four, and repeated the BOA+DRC experiments for some of the datasets. The results and the number of iterative steps to converge are shown in Table 6. We did not notice any improvement in the results, but comparing the cases k = 2 and k = 3, the number of iterative steps to converge was reduced. According to the results, it is possible that the obtained results are the global optimal solutions for some of these datasets. It is also possible that this problem does not contain higher-order dependencies, so that searching for them does not help to solve the problem. These possibilities should be further investigated with more experiments.
Table 6. Cardinality of the minimal probe set for BOA+DRC: the experiment was repeated in order to investigate the effect of increasing the dependency parameter (k). By gen we mean the number of iterative steps of BOA to converge.

Set   k = 2          k = 3          k = 4
a1    502 (gen: 26)  502 (gen: 17)  502 (gen: 19)
a2    490 (gen: 21)  490 (gen: 20)  490 (gen: 15)
a3    533 (gen: 24)  533 (gen: 19)  533 (gen: 17)
a4    537 (gen: 20)  537 (gen: 17)  537 (gen: 22)
a5    528 (gen: 16)  528 (gen: 13)  528 (gen: 15)

6 Conclusions (and Future Research)
In this paper, we presented a new approach for solving the non-unique probe selection problem. Our approach, which is based on an EDA named BOA, obtains results that compare favorably with the state of the art. Compared to all the approaches deployed on the non-unique probe selection problem, our approach proved its efficiency, obtaining the smallest probe set for most datasets. Besides its optimization ability, our approach has the additional advantage of indicating dependencies between the variables (probes) of each dataset; this information can be of interest to biologists. We also investigated the effect of increasing the number of dependencies between variables searched by BOA for some of the datasets. According to the presented results, it is possible that the results found for some of these datasets are the global optimal values; this requires more experiments and investigation. The non-unique probe selection problem has been discussed in this paper under the assumption that a single target exists in the sample. Therefore, one direction for future work is to extend the problem to the assumption of multiple targets in the sample. Also, the dependencies discovered by our approach could be interpreted more precisely by biologists in order to extract more interesting information. As an extension of the presented work, we plan to incorporate several metrics into the solution quality measure and use a multi-objective optimization technique; one of the objectives can be a measure of the ability of the obtained solutions to recognize all targets present in the sample, referred to as decoding ability [10]. In addition to multi-objective optimization, parallelization techniques can be used in the implementation in order to improve the running time of the experiments considerably.
References
1. Goldberg, D.E.: The Design of Innovation: Lessons from and for Competent Genetic Algorithms. Kluwer Academic Publishers, Dordrecht (2002)
2. Garey, M., Johnson, D.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W. Freeman, San Francisco (1979)
3. Gras, R.: How Efficient Are Genetic Algorithms to Solve High Epistasis Deceptive Problems? In: Proc. 2008 IEEE Congress on Evolutionary Computation, Hong Kong, China, June 1-6, pp. 242–249 (2008)
4. Klau, G.W., Rahmann, S., Schliep, A., Vingron, M., Reinert, K.: Integer linear programming approaches for non-unique probe selection. Discrete Applied Mathematics 155, 840–856 (2007)
5. Klau, G.W., Rahmann, S., Schliep, A., Vingron, M., Reinert, K.: Optimal robust non-unique probe selection using integer linear programming. Bioinformatics 20, i186–i193 (2004)
6. Meneses, C.N., Pardalos, P.M., Ragle, M.A.: A new approach to the non-unique probe selection problem. Annals of Biomedical Engineering 35(4), 651–658 (2007)
7. Pelikan, M.: Bayesian Optimization Algorithm: From Single Level to Hierarchy. PhD Thesis, University of Illinois (2002)
8. Ragle, M.A., Smith, J.C., Pardalos, P.M.: An optimal cutting-plane algorithm for solving the non-unique probe selection problem. Annals of Biomedical Engineering 35(11), 2023–2030 (2007)
9. Rash, S., Gusfield, D.: String barcoding: uncovering optimal virus signatures. In: Annual Conference on Research in Computational Molecular Biology, pp. 254–261 (2002)
10. Schliep, A., Torney, D.C., Rahmann, S.: Group testing with DNA chips: generating designs and decoding experiments. In: Proc. IEEE Computer Society Bioinformatics Conference (CSB 2003), pp. 84–91 (2003)
11. Wang, L., Ngom, A., Gras, R.: Non-Unique Oligonucleotide Microarray Probe Selection Method Based on Genetic Algorithms. In: Proc. 2008 IEEE Congress on Evolutionary Computation, Hong Kong, China, June 1-6, pp. 1004–1010 (2008)
12. Wang, L., Ngom, A.: A model-based approach to the non-unique oligonucleotide probe selection problem. In: Second International Conference on Bio-Inspired Models of Network, Information, and Computing Systems (Bionetics 2007), Budapest, Hungary, December 10-13 (2007) ISBN: 978-963-9799-05-9
13. Wang, L., Ngom, A., Gras, R.: Evolution strategy with greedy probe selection heuristics for the non-unique oligonucleotide probe selection problem. In: Proc. 2008 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB 2008), pp. 54–61 (2008)
14. http://www.sharcnet.ca/
15. http://www.cs.umsl.edu/~pelikan/software.html
Microarray Time-Series Data Clustering via Multiple Alignment of Gene Expression Profiles
Numanul Subhani¹, Alioune Ngom¹, Luis Rueda¹, and Conrad Burden²
¹ School of Computer Science, 5115 Lambton Tower, University of Windsor, 401 Sunset Avenue, Windsor, Ontario, N9B 3P4, Canada
{hoque4,angom,lrueda}@uwindsor.ca
² Centre for Bioinformation Science, Mathematical Sciences Institute and John Curtin School of Medical Research, The Australian National University, Canberra, ACT 0200, Australia
[email protected]
Abstract. Genes with similar expression profiles are expected to be functionally related or co-regulated. In this direction, clustering microarray time-series data via pairwise alignment of piece-wise linear profiles has been recently introduced. We propose a k-means clustering approach based on a multiple alignment of natural cubic spline representations of gene expression profiles. The multiple alignment is achieved by minimizing the sum of integrated squared errors over a time-interval, defined on a set of profiles. Preliminary experiments on a well-known data set of 221 pre-clustered Saccharomyces cerevisiae gene expression profiles yields excellent results with 79.64% accuracy. Keywords: Microarrays, Time-Series Data, Gene Expression Profiles, Profile Alignment, Cubic Spline, k-Means Clustering.
1 Introduction
Clustering microarray time-series data is an important process in functional genomic studies, where genes with similar expression profiles are expected to be functionally related [1]. Many clustering methods have been developed in recent years [2,3,4,5,6]. A hidden phase model was used for clustering time-series data to define the parameters of a mixture of normal distributions in a Bayesian-like manner, estimated by using expectation maximization (EM) [3]. A Bayesian approach [7], partitional clustering based on k-means [8], and a Euclidean distance approach [9] have been proposed for clustering time-series gene expression profiles; the latter applied self-organizing maps (SOMs) to visualize and interpret gene temporal expression profile patterns. The methods proposed in [4,10] are based on correlation measures. A method that uses jack-knife correlation with or without seeded candidate profiles was also proposed for clustering time-series microarray data [10]. Specifying expression levels for the candidate profiles in advance for these correlation-based procedures requires estimating each candidate profile, which is done using a small sample of arbitrarily
selected genes. The resulting clusters depend upon the initially chosen template genes, because there is a possibility of missing important genes. A regression-based method suitable for analyzing single or multiple microarrays was proposed in [6] to address the challenges of clustering short time-series expression datasets. Analyzing gene temporal expression profile datasets that are non-uniformly sampled and can contain missing values has been studied in [2], where statistical spline estimation was used to represent temporal expression profiles as continuous curves. Clustering temporal gene expression profiles was studied by identifying homogeneous clusters of genes in [5]; the shapes of the curves were considered instead of the absolute expression ratios. Fuzzy clustering of gene temporal profiles, where the similarities between co-expressed genes are computed based on the rate of change of the expression ratios across time, has been studied in [11]. In [12], the idea of order-restricted inference levels across time was applied to select and cluster genes, where the estimation makes use of known inequalities among parameters. In this approach, two genes' expression profiles fall into the same cluster if they show similar profiles in terms of the directions of the changes of expression ratios, regardless of how big or small the changes are. In [13], pairs of profiles represented by piece-wise linear functions are aligned in such a way as to minimize the integrated squared area between the profiles. An agglomerative method, combined with an area-based distance measure between two aligned profiles, was used to cluster microarray time-series data. We re-formulate the profile alignment problem of [13] in terms of integrals of arbitrary functions, allowing us to generalize from a piecewise linear interpolation to any type of interpolation one believes to be more physically realistic. The expression measurements are basically snapshots taken at time-points chosen by the experimental biologist. The cells expressing genes do not know when the biologist is going to choose to measure gene expression, which one would guess is changing continuously and smoothly all the time. Thus, a smooth spline curve through the known time-points in the cell's expression path would be a better guess. We use natural cubic spline interpolation to represent each gene expression profile; it also gives a handy way to align profiles for which measurements were not taken at the same time-points. We generalize the pairwise expression profile alignment formulae of [13] from the case of piece-wise linear profiles to profiles which are any continuous integrable functions on a finite interval. Next, we extend the concept of pairwise alignment to multiple expression profile alignment, where the profiles from a given set are aligned in such a way that the sum of integrated squared errors, over a time-interval, defined on the set is minimized. Finally, we combine k-means clustering with our multiple alignment approach to cluster microarray time-series data.
2 Pairwise Expression Profile Alignment
Clustering time-series expression data with unequal time intervals is a very special problem, as measurements are not necessarily taken at regular time points.
Taking into account the length of the interval is accomplished by means of analyzing the area between two expression profiles, joined by the corresponding measurements at subsequent time points. This is equivalent to considering the sum or average of squared errors between the infinite points in the two lines. This analysis can be easily achieved by computing the underlying integral, which is analytically resolved in advance, subsequently avoiding expensive computations during the clustering process. Given two profiles, x(t) and y(t) (either piece-wise linear or continuously integrable functions), where y(t) is to be aligned to x(t), the basic idea of alignment is to vertically shift y(t) towards x(t) in such a way that the integrated squared error between the two profiles is minimal. Let \hat{y}(t) be the result of shifting y(t). Here, the error is defined in terms of the areas between x(t) and \hat{y}(t) in the interval [0, T]. The functions x(t) and \hat{y}(t) may cross each other many times, but we want the sum of all the areas where x(t) is above \hat{y}(t), minus the sum of those areas where \hat{y}(t) is above x(t), to be minimal (see Fig. 1). Let a denote the amount of vertical shifting of y(t). Then, we want to find the value a_min of a that minimizes the integrated squared error between x(t) and \hat{y}(t). Once we obtain a_min, the alignment process consists of performing the shift on y(t) as \hat{y}(t) = y(t) - a_min. The pairwise alignment results of [13] generalize from the case of piece-wise linear profiles to profiles which are any integrable functions on a finite interval. Suppose we have two profiles, x(t) and y(t), defined on the time-interval [0, T]. The alignment process consists of finding the value a that minimizes

f_a(x(t), y(t)) = \int_0^T [x(t) - \hat{y}(t)]^2 \, dt = \int_0^T \big( x(t) - [y(t) - a] \big)^2 \, dt. \qquad (1)

Differentiating yields

\frac{d}{da} f_a(x(t), y(t)) = 2 \int_0^T [x(t) + a - y(t)] \, dt = 2 \int_0^T [x(t) - y(t)] \, dt + 2aT. \qquad (2)

Setting \frac{d}{da} f_a(x(t), y(t)) = 0 and solving for a gives

a_{\min} = -\frac{1}{T} \int_0^T [x(t) - y(t)] \, dt, \qquad (3)

and since \frac{d^2}{da^2} f_a(x(t), y(t)) = 2T > 0, a_{\min} is a minimum. The integrated error between x(t) and the shifted \hat{y}(t) = y(t) - a_{\min} is then

\int_0^T [x(t) - \hat{y}(t)] \, dt = \int_0^T [x(t) - y(t)] \, dt + a_{\min} T = 0. \qquad (4)
In terms of Fig. 1, this means that the sum of all the areas where x(t) is above y(t) minus the sum of those areas where y(t) is above x(t), is zero. Given an original profile x(t) = [e1 , e2 , . . . , en ] (with n expression values taken at n time-points t1 , t2 , . . . , tn ) we use natural cubic spline interpolation, with n knots, (t1 , e1 ), . . . , (tn , en ), to represent x(t) as a continuously integrable function
[Figure 1: two panels, (a) and (b), plotting expression ratio against time in hrs. for profiles x and y; see the caption below.]
Fig. 1. (a) Unaligned profiles x(t) and y(t). (b) Aligned profiles x(t) and y(t), after applying y(t) ← y(t) − amin .
x(t) = \begin{cases} x_1(t) & \text{if } t_1 \le t \le t_2 \\ \quad\vdots & \\ x_{n-1}(t) & \text{if } t_{n-1} \le t \le t_n \end{cases} \qquad (5)
where x_j(t) = x_{j3}(t - t_j)^3 + x_{j2}(t - t_j)^2 + x_{j1}(t - t_j)^1 + x_{j0}(t - t_j)^0 interpolates x(t) in the interval [t_j, t_{j+1}], with spline coefficients x_{jk} ∈ ℝ, for 1 ≤ j ≤ n - 1 and 0 ≤ k ≤ 3. For practical purposes, given the coefficients x_{jk} ∈ ℝ associated with x(t) = [e_1, e_2, . . . , e_n] ∈ ℝ^n, we need only transform x(t) into a new space as x(t) = [x_{13}, x_{12}, x_{11}, x_{10}, . . . , x_{j3}, x_{j2}, x_{j1}, x_{j0}, . . . , x_{(n-1)3}, x_{(n-1)2}, x_{(n-1)1}, x_{(n-1)0}] ∈ ℝ^{4(n-1)}. We can add or subtract polynomials given their coefficients, and the polynomials are continuously differentiable. This yields an analytical solution for a_min in Eq. (3) as

a_{\min} = -\frac{1}{T} \sum_{j=1}^{n-1} \int_{t_j}^{t_{j+1}} [x_j(t) - y_j(t)] \, dt = -\frac{1}{T} \sum_{j=1}^{n-1} \sum_{k=0}^{3} \frac{(x_{jk} - y_{jk})(t_{j+1} - t_j)^{k+1}}{k+1}. \qquad (6)

Fig. 1(b) shows a pairwise alignment of the two initial profiles in Fig. 1(a), after applying the vertical shift y(t) ← y(t) - a_min. The two aligned profiles cross each other many times, but the integrated error, Eq. (4), is zero. In particular, from Eq. (4), the horizontal t-axis will bisect a profile x(t) into two halves with equal areas, when x(t) is aligned to the t-axis. In the next section, we use this property of Eq. (4) to define the multiple alignment of a set of profiles.
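The following sketch illustrates this pairwise alignment numerically; it is our own illustration (not the authors' code), assuming SciPy's natural cubic splines as the profile representation.

# Pairwise alignment of Section 2: shift y towards x so the integrated error is zero (Eq. 4).
import numpy as np
from scipy.interpolate import CubicSpline

def pairwise_align(t, x, y):
    sx = CubicSpline(t, x, bc_type='natural')    # x(t) as a natural cubic spline (Eq. 5)
    sy = CubicSpline(t, y, bc_type='natural')
    T = t[-1] - t[0]
    # Eq. (3)/(6): a_min = -(1/T) * integral of [x(t) - y(t)] dt
    a_min = -(sx.integrate(t[0], t[-1]) - sy.integrate(t[0], t[-1])) / T
    return np.asarray(y) - a_min, a_min          # aligned profile \hat{y}(t) = y(t) - a_min

# Toy usage with 17 hypothetical time points (0..160 min, every 10 min):
t = np.arange(0.0, 170.0, 10.0)
x = np.sin(t / 40.0) + 2.0
y = np.sin(t / 40.0) + 5.0
y_hat, a = pairwise_align(t, x, y)   # a is approximately 3.0, so y_hat is close to x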
3 Multiple Expression Profile Alignment
Given a set X = {x_1(t), . . . , x_s(t)}, we want to align the profiles such that the integrated squared error between any two vertically shifted profiles is minimal. Thus, for any x_i(t) and x_j(t), we want to find the values of a_i and a_j that minimize

f_{a_i, a_j}(x_i(t), x_j(t)) = \int_0^T [\hat{x}_i(t) - \hat{x}_j(t)]^2 \, dt = \int_0^T \big( [x_i(t) - a_i] - [x_j(t) - a_j] \big)^2 \, dt, \qquad (7)

where both x_i(t) and x_j(t) are shifted vertically by amounts a_i and a_j, respectively, in possibly different directions, whereas in the pairwise alignment of Eq. (1), profile y(t) is shifted towards a fixed profile x(t). The multiple alignment process then consists of finding the values of a_1, . . . , a_s that minimize

F_{a_1, ..., a_s}(x_1(t), . . . , x_s(t)) = \sum_{1 \le i < j \le s} f_{a_i, a_j}(x_i(t), x_j(t)). \qquad (8)
We use Lemma 1 to find the values a_i and a_j, 1 ≤ i < j ≤ s, that minimize F_{a_1, ..., a_s}.

Lemma 1. If x_i(t) and x_j(t) are pairwise aligned each to a fixed profile z(t), then the integrated error \int_0^T [\hat{x}_i(t) - \hat{x}_j(t)] \, dt = 0.

Proof. If x_i(t) and x_j(t) are pairwise aligned each to z(t), then from Eq. (3) we have a_{min_i} = -\frac{1}{T} \int_0^T [z(t) - x_i(t)] \, dt and a_{min_j} = -\frac{1}{T} \int_0^T [z(t) - x_j(t)] \, dt. Then, \int_0^T [\hat{x}_i(t) - \hat{x}_j(t)] \, dt = \int_0^T \big( [x_i(t) - a_{min_i}] - [x_j(t) - a_{min_j}] \big) \, dt = \int_0^T x_i(t) \, dt + \int_0^T [z(t) - x_i(t)] \, dt - \int_0^T x_j(t) \, dt - \int_0^T [z(t) - x_j(t)] \, dt = 0. In other words, \hat{x}_j(t) is automatically aligned relative to \hat{x}_i(t), given that z(t) is fixed.

Corollary 1. If x_i(t) and x_j(t) are pairwise aligned each to a fixed profile z(t), then f_{a_{min_i}, a_{min_j}}(x_i(t), x_j(t)) is minimal.

Proof. From Lemma 1, \int_0^T [\hat{x}_i(t) - \hat{x}_j(t)] \, dt = 0, which implies that \int_0^T \big( [x_i(t) - a_{min_i}] - [x_j(t) - a_{min_j}] \big)^2 \, dt is minimal.

Lemma 2. If profiles x_1(t), . . . , x_s(t) are pairwise aligned each to a fixed profile z(t), then F_{a_{min_1}, ..., a_{min_s}}(x_1(t), . . . , x_s(t)) is minimal.

Proof. From Corollary 1, f_{a_i, a_j}(x_i(t), x_j(t)) ≥ f_{a_{min_i}, a_{min_j}}(x_i(t), x_j(t)), with equality holding when a_k = a_{min_k}; this is attained by aligning each x_k(t) independently with z(t), 1 ≤ k ≤ s. From the definition of Eq. (8), it follows that F_{a_1, ..., a_s}(x_1(t), . . . , x_s(t)) ≥ \sum_{1 \le i < j \le s} f_{a_{min_i}, a_{min_j}}(x_i(t), x_j(t)) = F_{a_{min_1}, ..., a_{min_s}}(x_1(t), . . . , x_s(t)), with equality holding when a_k = a_{min_k}, 1 ≤ k ≤ s. Thus, given a fixed profile z(t), applying Corollary 1 to all pairs of profiles minimizes F_{a_1, ..., a_s}(x_1(t), . . . , x_s(t)) in Eq. (8).
Theorem 1. Given a fixed profile z(t) and a set of profiles X = {x_1(t), . . . , x_s(t)}, there always exists a multiple alignment \hat{X} = {\hat{x}_1(t), . . . , \hat{x}_s(t)} such that \hat{x}_i(t) = x_i(t) - a_{min_i}, where

a_{min_i} = -\frac{1}{T} \int_0^T [z(t) - x_i(t)] \, dt, \qquad (9)

and, in particular, for the profile z(t) = 0, defined by the horizontal t-axis, we have \hat{x}_i(t) = x_i(t) - a_{min_i}, where

a_{min_i} = \frac{1}{T} \int_0^T x_i(t) \, dt. \qquad (10)
We use the multiple alignment of Eq. (10) in all subsequent discussions. Using spline interpolations, each profile x_i(t), 1 ≤ i ≤ s, is a continuous integrable profile

x_i(t) = \begin{cases} x_{i,1}(t) & \text{if } t_1 \le t \le t_2 \\ \quad\vdots & \\ x_{i,n-1}(t) & \text{if } t_{n-1} \le t \le t_n \end{cases} \qquad (11)

where x_{i,j}(t) = x_{ij3}(t - t_j)^3 + x_{ij2}(t - t_j)^2 + x_{ij1}(t - t_j)^1 + x_{ij0}(t - t_j)^0 represents x_i(t) in the interval [t_j, t_{j+1}], with spline coefficients x_{ijk} for 1 ≤ i ≤ s, 1 ≤ j ≤ n - 1 and 0 ≤ k ≤ 3. Thus the analytical solution for a_{min_i} in Eq. (10) is

a_{min_i} = \frac{1}{T} \sum_{j=1}^{n-1} \sum_{k=0}^{3} \frac{x_{ijk} (t_{j+1} - t_j)^{k+1}}{k+1}. \qquad (12)
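As an illustration (our own sketch assuming SciPy, not the authors' code), the multiple alignment of Eq. (10) amounts to subtracting from each profile its spline-based mean over the time interval:

# Multiple alignment of Eq. (10): align every profile to the horizontal t-axis.
import numpy as np
from scipy.interpolate import CubicSpline

def multiple_align(t, profiles):
    T = t[-1] - t[0]
    aligned = []
    for x in profiles:
        spline = CubicSpline(t, x, bc_type='natural')
        a_min = spline.integrate(t[0], t[-1]) / T   # Eq. (10) / Eq. (12)
        aligned.append(np.asarray(x) - a_min)       # \hat{x}_i(t) = x_i(t) - a_min_i
    return aligned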
4 Distance Function
The distance between any two piecewise linear profiles was defined as f(a_min) in [13]. For convenience here, we change the definition slightly to

d(x, y) = \frac{1}{T} f(a_{\min}) = \frac{1}{T} \int_0^T [x(t) + a_{\min} - y(t)]^2 \, dt. \qquad (13)
For any function φ(t) defined on [0, T], we also define

\langle \varphi \rangle = \frac{1}{T} \int_0^T \varphi(t) \, dt. \qquad (14)
Then, from Eqs. (1) and (3),

d(x, y) = \frac{1}{T} \int_0^T \big( [x(t) - y(t)]^2 + 2 a_{\min} [x(t) - y(t)] + a_{\min}^2 \big) \, dt = \frac{1}{T} \int_0^T [x(t) - y(t)]^2 \, dt - 2 a_{\min}^2 + a_{\min}^2 = \langle [x(t) - y(t)]^2 \rangle - \langle x(t) - y(t) \rangle^2. \qquad (15)
Apart from the factor 1/T, this is precisely the distance d_PA(x, y, t) in [13]. By performing the multiple alignment of Eq. (10) to obtain new profiles \hat{x}(t) and \hat{y}(t), we have:

d(x, y) = \langle [\hat{x}(t) - \hat{y}(t)]^2 \rangle = \frac{1}{T} \int_0^T [\hat{x}(t) - \hat{y}(t)]^2 \, dt. \qquad (16)

Thus, d(x, y)^{1/2} is the 2-norm, satisfying all the properties we might want for a metric. On the other hand, it is easy to show that d(x, y) in Eq. (16) does not satisfy the triangle inequality, and hence it is not a metric. We, however, use d(x, y) in Eq. (16) as our distance function, since it is algebraically easier to work with than the metric d(x, y)^{1/2}. Eq. (16) is closer to the spirit of regression analysis, and thus we can dispense with the requirement of the triangle inequality. Also, the distance as defined in Eq. (16) is unchanged by an additive shift, and hence is order-preserving; that is, d(u, v) ≤ d(x, y) if and only if d(\hat{u}, \hat{v}) ≤ d(\hat{x}, \hat{y}). This property has important implications for distance-based clustering methods that rely on pairwise alignments of profiles, as discussed later in the experiments section. With the spline interpolations of Eq. (5), we derived the analytical solution for d(x, y) in Eq. (16), using the symbolic computation package Maple¹, as follows:
d(x, y) = \frac{P^2 (n^7 - m^7)}{7} + \frac{(2PQ - 6P^2 m)(n^6 - m^6)}{6} + \frac{(2PR - 10PQm + Q^2 + 15P^2 m^2)(n^5 - m^5)}{5} + \frac{(-8PRm - 4Q^2 m + 2PS + 20PQm^2 + 2QR - 20P^2 m^3)(n^4 - m^4)}{4} + \frac{(-6QRm - 20Pm^3 Q + R^2 + 6Q^2 m^2 + 12Pm^2 R - 6PmS + 15P^2 m^4 + 2QS)(n^3 - m^3)}{3} + \frac{(10Pm^4 Q + 6Qm^2 R + 2RS - 8Pm^3 R - 2R^2 m - 6P^2 m^5 + 6Pm^2 S - 4QmS - 4Q^2 m^3)(n^2 - m^2)}{2} + (-2RmS + S^2 + P^2 m^6 + Q^2 m^4 + R^2 m^2 - 2Qm^3 R - 2Pm^5 Q - 2Pm^3 S + 2Pm^4 R + 2Qm^2 S)(n - m) \qquad (17)

where P = (x_{j3} - y_{j3}), Q = (x_{j2} - y_{j2}), R = (x_{j1} - y_{j1}), S = (x_{j0} - y_{j0} + c_y - c_x), m = t_j and n = t_{j+1}.
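In practice, d(x, y) can also be evaluated numerically from Eq. (15) rather than through the closed form of Eq. (17). The following sketch (ours, assuming SciPy) does exactly that:

# Alignment-invariant distance of Eqs. (15)-(16): d = <(x - y)^2> - <x - y>^2.
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.integrate import quad

def distance(t, x, y):
    T = t[-1] - t[0]
    # Natural-spline interpolation is linear in the data, so the difference profile
    # x(t) - y(t) is itself the natural spline through the pointwise differences.
    diff = CubicSpline(t, np.asarray(x) - np.asarray(y), bc_type='natural')
    mean_diff = diff.integrate(t[0], t[-1]) / T                    # <x - y>
    mean_sq, _ = quad(lambda u: float(diff(u)) ** 2, t[0], t[-1])  # integral of (x - y)^2
    return mean_sq / T - mean_diff ** 2                            # Eq. (15)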
5 Centroid of a Set
Given a set of profiles X = {x_1(t), . . . , x_s(t)}, we wish to find a representative centroid profile μ(t) that well represents X. An obvious choice is the function that minimizes

\Delta[\mu] = \sum_{i=1}^{s} d(x_i, \mu). \qquad (18)
where Δ plays the role of the within-cluster scatter defined in [13]. Since d(·, ·) is unchanged by an additive shift x(t) → x(t) - a in either of its arguments, we have

¹ All the analytical solutions in this paper were derived by Maple.
\Delta[\mu] = \sum_{i=1}^{s} d(\hat{x}_i, \mu) = \frac{1}{T} \int_0^T \sum_{i=1}^{s} [\hat{x}_i(t) - \mu(t)]^2 \, dt, \qquad (19)

where \hat{X} = {\hat{x}_1(t), . . . , \hat{x}_s(t)} is the multiple alignment of Eq. (10). This is a functional of μ; that is, a mapping from the set of real-valued functions defined on [0, T] to the set of real numbers. To minimize with respect to μ we set the functional derivative to zero². This functional is of the form

F[\varphi] = \int L(\varphi(t)) \, dt, \qquad (20)

for some function L, for which the functional derivative is simply \frac{\delta F[\varphi]}{\delta \varphi(t)} = \frac{dL(\varphi(t))}{d\varphi(t)}. In our case, we have

\frac{\delta \Delta[\mu]}{\delta \mu(t)} = -\frac{2}{T} \sum_{i=1}^{s} [\hat{x}_i(t) - \mu(t)] = -\frac{2}{T} \Big( \sum_{i=1}^{s} \hat{x}_i(t) - s\mu(t) \Big). \qquad (21)

Setting \frac{\delta \Delta[\mu]}{\delta \mu(t)} = 0 gives

\mu(t) = \frac{1}{s} \sum_{i=1}^{s} \hat{x}_i(t). \qquad (22)

With the spline coefficients x_{ijk} of each x_i(t) interpolated as in Eq. (11), the analytical solution for μ(t) in Eq. (22) is

\mu_j(t) = \frac{1}{s} \sum_{i=1}^{s} \Big( \sum_{k=0}^{3} x_{ijk} (t - t_j)^k - a_{min_i} \Big), \quad \text{in each interval } [t_j, t_{j+1}]. \qquad (23)
Eq. (22) applies to aligned profiles while Eq. (23) can apply to unaligned profiles.
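A minimal illustration (ours, not the paper's code) of Eq. (22) for profiles that have already been multiple-aligned: the centroid is simply the pointwise average of the member profiles at the sampled time points, and its natural spline is the average of the member splines.

# Cluster centroid of Eq. (22) for multiple-aligned profiles sampled at common time points.
import numpy as np

def centroid(aligned_profiles):
    return np.mean(np.vstack(aligned_profiles), axis=0)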
6 k-Means Clustering via Multiple Alignment
Many clustering methods have been developed, and each has its own advantages and disadvantages regarding handling noise in the measurements and the properties of the data set being clustered. None of them is considered the best method. In [13], hierarchical clustering was used and the decision rule was the farthest-neighbor distance between two clusters, computed using an equivalent of Eq. (1) for piece-wise linear profiles. Hierarchical clustering is a greedy method that cannot be readily applied to large data sets. Our approach allows us to apply flat clustering such as k-means, which, though not optimal, provides a fast and practical solution to the problem. This also applies to fuzzy k-means or expectation maximization (EM) clustering methods.

² For a functional F[φ], the functional derivative is defined as \frac{\delta F[\varphi]}{\delta \varphi(t)} = \lim_{\epsilon \to 0} \frac{F[\varphi + \epsilon \delta_t] - F[\varphi]}{\epsilon}, where δ_t(τ) = δ(τ - t) is the Dirac delta function centered at t.
In k-means [14], we want to partition a set of s profiles, D = {x_1(t), . . . , x_s(t)}, into k disjoint clusters C_1, . . . , C_k, 1 ≤ k ≤ s, such that (i) C_i ≠ ∅, i = 1, . . . , k; (ii) \bigcup_{i=1}^{k} C_i = D; (iii) C_i ∩ C_j = ∅, i ≠ j, i, j = 1, . . . , k. Also, each profile is assigned to the cluster whose mean is the closest. It is similar to EM for mixtures of Gaussians in the sense that they both attempt to find the centers of natural clusters in the data. It assumes that the object features form a vector space. Let U = {u_ij} be the membership matrix:

u_{ij} = \begin{cases} 1 & \text{if } d(x_i, \mu_j) = \min_{l=1,...,k} d(x_i, \mu_l) \\ 0 & \text{otherwise} \end{cases} \quad \text{where } i = 1, ..., s \qquad (24)

The aim is to minimize the sum of squared distances:

J(\theta, U) = \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij} \, d(x_i, \mu_j). \qquad (25)
where θ = μ_1, μ_2, . . . , μ_n.

Algorithm 1. k-MCMA: k-Means Clustering with Multiple Alignment
Input: Set of profiles, D = {x_1(t), . . . , x_s(t)}, and desired number of clusters, k
Output: Clusters \hat{C}_{\hat{\mu}_1}, . . . , \hat{C}_{\hat{\mu}_k}
1. Apply natural cubic spline interpolation on x_i(t) ∈ D, for 1 ≤ i ≤ s (see Section 2)
2. Multiple-align the transformed D to obtain \hat{D} = {\hat{x}_1(t), . . . , \hat{x}_s(t)}, using Eq. (10)
3. Randomly initialize centroids \hat{\mu}_i(t), for 1 ≤ i ≤ k
repeat
  4.a. Assign \hat{x}_j(t) to the cluster \hat{C}_{\hat{\mu}_i} with minimal d(\hat{x}_j, \hat{\mu}_i), for 1 ≤ j ≤ s and 1 ≤ i ≤ k
  4.b. Update \hat{\mu}_i(t) of \hat{C}_{\hat{\mu}_i}, for 1 ≤ i ≤ k
until Convergence: that is, no change in \hat{\mu}_i(t), for 1 ≤ i ≤ k
return Clusters \hat{C}_{\hat{\mu}_1}, . . . , \hat{C}_{\hat{\mu}_k}
In k-MCMA (see Algorithm 1), we first multiple-align the set of profiles D, using Eq. (10), and then cluster the multiple-aligned D̂ with k-means. Recall that the process of Eq. (10) pairwise-aligns each profile with the t-axis. The k initial centroids are found by randomly selecting k pairs of profiles in D̂ and then taking the centroid of each pair. In step (4.a), we do not use pairwise alignment to find the centroid μ̂_i(t) closest to a profile x̂_j(t), since, by Lemma 1, they are automatically aligned relative to each other. When profiles are multiple-aligned, any arbitrary distance function other than Eq. (16) can be used in step (4.a), including the Euclidean distance. Also, by Theorem 2 below, there is no need to multiple-align Ĉ_μ̂i in step (4.b) in order to update its centroid μ̂_i(t).

Theorem 2. Let \bar{\mu}(t) be the centroid of a cluster of m multiple-aligned profiles. Then \hat{\bar{\mu}}(t) = \bar{\mu}(t).
Proof. We have \hat{\bar{\mu}}(t) = \bar{\mu}(t) - a_{\min_{\bar{\mu}}}. However, a_{\min_{\bar{\mu}}} = \frac{1}{T} \int_0^T \bar{\mu}(t) \, dt = \frac{1}{T} \int_0^T \frac{1}{m} \sum_{i=1}^{m} \hat{x}_i(t) \, dt = 0, since each \hat{x}_i(t) is aligned with the t-axis.
Thus, Lemma 1 and Theorem 2 make k-MCMA much faster than applying k-means directly to the non-aligned dataset D, and even more so when the Euclidean distance is used to assign a profile to a cluster. An important implication of Eq. (16) is that applying k-means to the non-aligned dataset D (i.e., clustering on D), without any multiple alignment, is equivalent to k-MCMA (i.e., clustering on D̂). That is, if a profile x_i(t) is assigned to a cluster C_μi by k-means on D, its shifted profile x̂_i(t) will be assigned to cluster Ĉ_μ̂i by k-MCMA (k-means on D̂). This follows from the fact that multiple alignment is order-preserving, as pointed out in Section 4. In k-means on D, step (4.a) would require O(sk) pairwise alignments to assign s profiles to k clusters, whereas no pairwise alignment is needed in k-MCMA. In other words, we can multiple-align once and obtain the same k-means clustering results, provided that the means are initialized in the same manner. This also reinforces a fact demonstrated in [15]: a dissimilarity function that is not a metric can be made metric by a shift operation (in our case, any metric, such as the Euclidean distance, can be used in step (4.a)). The objective function of k-means does not change, so convergence is assured. This saves a great deal of computation and opens the door to applying multiple alignment to many other distance-based clustering methods, a future research direction that we plan to investigate.
7 Computational Experiments
This section discusses the performance of the k-MCMA method on the pre-clustered budding yeast, Saccharomyces cerevisiae, data set of [1]³. The data set contains time-series gene expression profiles characterizing mRNA transcript levels during the yeast cell cycle. These experiments measured the expression levels of the 6,220 yeast genes at seventeen time points over the cell cycle, from 0 to 160 minutes at 10-minute intervals. From those gene profiles, 221 profiles were analyzed. We normalized each expression profile as in [1]; that is, we divided each transcript level of a profile by the mean value of that profile. The data set contains five known clusters, called phases: Early G1 phase (32 genes), Late G1 phase (84 genes), S phase (46 genes), G2 phase (28 genes) and M phase (31 genes); the phases are visualized in Fig. 2(b), and Table 1 shows the complete data set. Setting k = 5, we applied k-MCMA to the data set to see whether it could recover these phases accurately. Once the clusters had been found, the next step in comparing the k-MCMA clustering with the pre-clustered dataset of [1] was to label the clusters, where the labels are the "phases" of the pre-clustered dataset.
http://genomics.stanford.edu/yeast_cell_cycle/cellcycle.html
Fig. 2. (a) k-MCMA clusters C1-C5 and (b) yeast phases [1] (Early G1, Late G1, S, G2 and M), with centroids shown; the horizontal axes give the time in min. and the vertical axes the expression ratio
Although this can be done in many different ways, we adopted the following approach. We assigned each k-MCMA cluster to a yeast phase using the Hungarian algorithm [16]. The Hungarian method is a combinatorial optimization algorithm that solves the assignment problem in polynomial time. Our phase-assignment problem is formulated on a complete bipartite graph G = (C, P, E) with k cluster vertices (C) and k phase vertices (P), in which each edge in E has a nonnegative cost c(Ĉ_μ̂i, P̂_ν̂j), Ĉ_μ̂i ∈ C and P̂_ν̂j ∈ P. We want to find a perfect matching with minimum cost. The cost of an edge between a cluster vertex Ĉ_μ̂i and a phase vertex P̂_ν̂j is the distance between their centroids μ̂_i and ν̂_j; that is, c(Ĉ_μ̂i, P̂_ν̂j) = d(μ̂_i, ν̂_j), where the distances are computed using Eq. (16). On this bipartite graph, the Hungarian method selects the k perfect-matching pairs (Ĉ_μ̂i, P̂_ν̂j) with minimum total cost. In Fig. 2, the cluster and the phase of each of the five selected pairs found by the Hungarian algorithm are shown at the same level; e.g., cluster C5 of k-MCMA is assigned to the Late G1 phase of [1] by our phase-assignment approach, and hence they are at the same level in the figure.

The five clusters found by k-MCMA are shown in Fig. 2(a), while the corresponding phases of [1] after the phase assignment are shown in Fig. 2(b). The horizontal axis represents the time points in minutes and the vertical axis the expression values. The dashed black lines are the cluster centroids learned by k-MCMA (Fig. 2(a)) and the known phase centroids of the yeast data (Fig. 2(b)). In the figure, each cluster and phase was multiple-aligned using Eq. (10) to enhance visualization. Fig. 2 clearly shows a high degree of similarity between the k-MCMA clusters and the yeast phases. Visually, each k-MCMA cluster on the left is very similar to exactly one of the yeast phases, shown at the same level on the right. Visually, it even "seems" that the k-MCMA clusters are more accurate than the yeast phases, which suggests that k-MCMA can also correct manual phase-assignment errors, if any.
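As an illustration (not the authors' code), the cluster-to-phase assignment step described above can be sketched using the Hungarian-algorithm implementation in SciPy, assuming that the cluster and phase centroids have been sampled on a common time grid and using the mean squared difference as a stand-in for the distance of Eq. (16):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def assign_clusters_to_phases(cluster_centroids, phase_centroids):
        """Match each k-MCMA cluster to one yeast phase by a minimum-cost
        perfect matching on centroid distances (Hungarian algorithm)."""
        # cost[i, j] = distance between cluster centroid i and phase centroid j
        cost = np.array([[np.mean((mu - nu) ** 2) for nu in phase_centroids]
                         for mu in cluster_centroids])
        rows, cols = linear_sum_assignment(cost)   # minimum-cost perfect matching
        return dict(zip(rows, cols))               # cluster index -> phase index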
Table 1. Pre-clustered yeast genes from Table 1 of [1] with their actual phases and k-MCMA cluster numbers
[Table 1 lists, for each of the 221 genes, the gene name, its phase in [1] (Early G1, Late G1, S, G2, M, or a combination of phases) and the cluster number (1-5) assigned to it by k-MCMA.]
To show the biological significance of the results, the 221 yeast genes are listed in Table 1, where, for each gene, the cluster number that k-MCMA assigns to the gene and the actual yeast phase of that same gene in [1] are shown. An objective measure for comparing the k-MCMA clusters with the yeast phases was computed as follows. For each k-MCMA cluster Ĉ_μ̂c (1 ≤ c ≤ k = 5), we find the shortest distance between each profile x_i(t), 1 ≤ i ≤ |Ĉ_μ̂c|, and all five phase centroids ν_j(t), 1 ≤ j ≤ k = 5, using Eq. (16). Profile x_i(t) is assigned the correct label (i.e., the label of phase P̂_ν̂j) whenever x_i(t) ∈ P̂_ν̂j and (Ĉ_μ̂c, P̂_ν̂j) belongs to S, the set of selected cluster-phase pairs; otherwise, x_i(t) is assigned an incorrect label, since cluster Ĉ_μ̂c was not paired with phase P̂_ν̂j by our pair-assignment method. The percentage of correct assignments over the 221 profiles was used as our measure of accuracy, resulting in 79.64%. That is,
Accuracy = \frac{\sum_{c=1}^{k} \sum_{i=1}^{|\hat{C}_{\hat{\mu}_c}|} E\big(c, \arg\min_{1 \le j \le k} d(x_i, \nu_j)\big)}{221},    (26)

where E(a, b) returns 1 when a = b, and zero otherwise. This criterion is reasonable, as k-MCMA is an unsupervised learning approach that does not know the phases beforehand; the aim is hence to "discover" the phases. In [1], the 5 phases were determined using biological information, including genomic and phenotypic features observed in the yeast cell-cycle experiments. An accuracy of 79.64% is therefore quite high considering that k-MCMA is an unsupervised learning method.
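A minimal sketch of this accuracy measure, assuming profiles and phase centroids sampled on a common grid, the mean squared difference in place of Eq. (16), and a hypothetical pairing dictionary produced by the assignment step above:

    import numpy as np

    def phase_accuracy(cluster_labels, profiles, phase_centroids, pairing):
        """Fraction of profiles whose nearest phase centroid is the phase
        paired with their k-MCMA cluster (Eq. (26))."""
        correct = 0
        for x, c in zip(profiles, cluster_labels):
            nearest = int(np.argmin([np.mean((x - nu) ** 2)
                                     for nu in phase_centroids]))
            correct += (pairing[c] == nearest)
        return correct / len(profiles)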
8 Conclusion
We proposed k-MCMA, a method that combines k-means with multiple alignment of gene expression profiles to cluster microarray time-series data. The profiles are represented as natural cubic spline functions, which allows profiles to be compared even when the expression measurements were not taken at the same time intervals. Multiple alignment is based on minimizing the sum of integrated squared errors over a time interval, defined on a set of profiles. k-MCMA was able to find the 5 yeast cell-cycle phases of [1] with an accuracy of about 80%, and it can also be used to correct manual phase-assignment errors. In the future, we plan to study other distance-based clustering approaches using our multiple alignment method. It will also be interesting to study the effectiveness of such clustering methods on dose-response microarray data sets, and cluster validity indices based on multiple alignment will be investigated as well. In real applications the data can be very noisy, and cubic spline interpolation could then be problematic; splines have the advantage of being tractable, but we also plan to study interpolation methods that account for noise. Currently, we are also carrying out experiments with larger data sets.

Acknowledgements. This research has been partially funded by Canadian NSERC Grant #RGPIN228117-2006 and CFI grant #9263.
References
1. Cho, R., Campbell, M., Winzeler, E., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T., Gabrielian, A., Lockhart, D., Davis, R.: A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell 2(1), 65–73 (1998)
2. Bar-Joseph, Z., Gerber, G., Jaakkola, T., Gifford, D., Simon, I.: Continuous representations of time series gene expression data. Journal of Comp. Biology 10(3-4) (2003)
3. Bréhélin, L.: Clustering gene expression series with prior knowledge. In: Casadio, R., Myers, G. (eds.) WABI 2005. LNCS (LNBI), vol. 3692, pp. 27–38. Springer, Heidelberg (2005)
4. Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P., Herskowitz, I.: The transcriptional program of sporulation in budding yeast. Science 282, 699–705 (1998)
5. Déjean, S., Martin, P., Baccini, A., Besse, P.: Clustering time-series gene expression data using smoothing spline derivatives. EURASIP Journal on Bioinformatics and Systems Biology 70561, 705–761 (2007)
6. Ernst, J., Nau, G., Bar-Joseph, Z.: Clustering short time series gene expression data. Bioinformatics 21(suppl. 1), i159–i168 (2005)
7. Ramoni, M., Sebastiani, P., Kohane, I.: Cluster analysis of gene expression dynamics. Proc. Natl. Acad. Sci. USA 99 (2002)
8. Tavazoie, S., Hughes, J., Campbell, M., Cho, R., Church, G.: Systematic determination of genetic network architecture. Nature Genetics 22, 281–285 (1999)
9. Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E., Golub, T.: Interpreting patterns of gene expression with SOMs: Methods and application to hematopoietic differentiation, vol. 96 (1999)
10. Heyer, L., Kruglyak, S., Yooseph, S.: Exploring expression data: identification and analysis of coexpressed genes. Genome Research 9, 1106–1115 (1999)
11. Moller-Levet, C., Klawonn, F., Cho, K., Wolkenhauer, O.: Clustering of unevenly sampled gene expression time-series data. Fuzzy Sets and Systems 152(1-16), 49–66 (2005)
12. Peddada, S., Lobenhofer, E., Li, L., Afshari, C., Weinberg, C., Umbach, D.: Gene selection and clustering for time-course and dose-response microarray experiments using order-restricted inference. Bioinformatics 19(7), 834–841 (2003)
13. Rueda, L., Bari, A., Ngom, A.: Clustering time-series gene expression data with unequal time intervals. In: Priami, C., Dressler, F., Akan, O.B., Ngom, A. (eds.) Transactions on Computational Systems Biology X. LNCS (LNBI), vol. 5410, pp. 100–123. Springer, Heidelberg (2008)
14. Xu, R., Wunsch, D.: Clustering. Wiley-IEEE Press, Chichester (2008)
15. Roth, V., Laub, J., Kawanabe, M., Buhmann, J.: Optimal cluster preserving embedding of nonmetric proximity data. IEEE Trans. on Pattern Analysis and Machine Intelligence 25(12), 1540–1551 (2003)
16. Kuhn, H.: The Hungarian method for the assignment problem. Naval Research Logistics 52(1), 7–21 (2005)
Recursive Neural Networks for Undirected Graphs for Learning Molecular Endpoints

Ian Walsh, Alessandro Vullo, and Gianluca Pollastri

School of Computer Science and Informatics and Complex and Adaptive Systems Laboratory, University College Dublin, Belfield, Dublin 4, Ireland
Abstract. Accurately predicting the endpoints of chemical compounds is an important step towards drug design and molecular screening in particular. Here we develop a recursive architecture that is capable of mapping Undirected Graphs into individual labels, and apply it to the prediction of a number of different properties of small molecules. The results we obtain are generally state-of-the-art. The final model is completely general and may be applied not only to prediction of molecular properties, but to a vast range of problems in which the input is a graph and the output is either a single property or (with small modifications) a set of properties of the nodes.
1 Introduction

Cost-free, time-efficient computational screening of a large set of compounds, capable of excluding a sizeable fraction of them before the testing phase, would dramatically reduce the cost of drug design and significantly quicken its pace. The Quantitative Structure-Property/Activity Relationship (QSPR/QSAR) approach dates back as far as forty years [1], and relies on finding an appropriate function that maps a molecular compound into a property/activity of interest. Machine learning techniques can be used to tackle this problem, but molecules are inherently structured as graphs and there is no single conclusive solution to designing statistical methods that deal with graphs. An early solution to this problem has been to "flatten" a molecule into a fixed-size vector of properties, or features, generally hand-crafted, which are then input to a traditional machine learning tool such as a Neural Network (NN) or a Support Vector Machine (SVM). For instance, in [2, 3] this approach is followed to predict aqueous solubility by a Multi-Layer Perceptron (MLP), while in [4] features are incrementally selected to be input to an SVM. In [5] a large number of 2D and 3D features, capturing physiochemical and graph properties of molecules, are input to an NN to predict melting point after being compressed by Principal Component Analysis (PCA). In [6] atomic contributions containing correction factors for intramolecular interactions are derived by multivariate regression, yielding accurate predictions of octanol-water partition coefficients. Although it is clear that this two-stage approach (encode a molecule as features, map the features into the property) may be successful, it has a number of drawbacks: often, one or more experts need to design the features, thus creating a bottleneck; features are often problem-specific; features may not be optimal; and however the features are
designed, aspects of the structure/connectivity may be lost or it may be decided arbitrarily which ones to represent, thus potentially missing vital information about the mechanisms involved. Structural alert (SA) methods search for patterns within datasets that are indicative of the molecules' properties. In [7] mutagenicity classification is predicted with good levels of accuracy using a manual derivation of these substructures. An updated version is introduced in [8] to automatically mine the substructures. In [9] a vector of substructure frequencies is input to an SVM, yielding fairly accurate predictions of cancer toxicity, HIV suppression and potential for anthrax binding. Despite the successes of these methods, their generalisation ability is debatable, and their failure to predict carcinogenicity in a sustained way has been attributed to the evolving nature of chemical datasets [10], in which new, unknown substructures keep appearing. Similarly to homology modelling for protein structure prediction, some molecules will need to be predicted "ab initio" as they contain novel active substructures or neighbourhoods thereof. More recently, kernel methods that integrate some form of structural processing into their kernel function have shown state-of-the-art performances at many tasks [11]. Melting point and octanol-water partition coefficient are predicted in [12] by 2D kernels with minmax similarity and 3D histogram kernels. State-of-the-art results are reported for the classification of cancer and HIV suppression in [13], by 2D and 3D weighted decomposition kernels, with the best results reported for a combination of both. In [14] kernels based on molecular fingerprint similarity matching in the 2D case and atomic distances in the 3D case are state-of-the-art for mutagenicity and human tumor suppression. In this work we design a novel class of machine learning algorithms for processing structured data. We tackle "ab initio" predictions of a number of properties of small molecules (i.e. we do not mine substructures). The algorithms we describe are based on recursive neural networks and they deal with molecules directly as graphs, in that no features are manually extracted from the structure, and the networks automatically identify regions and substructures of the molecules that are relevant for the property in question. The basic structural processing cell we use is similar to those described in [15, 16, 17, 18], and adopted in essentially the same form in applications including molecule regression/classification [19, 20, 21], image classification [22], natural language processing [23], and face recognition [24]. In the case of molecules, there are numerous disadvantages in these earlier models: they can only deal with trees, thus molecules (which are more naturally described as Undirected Graphs (UG)) have to be preprocessed before being input; the preprocessing is generally task-dependent; special nodes ("super-sources") have to be defined for each molecule; and application domains are generally limited, thus the effectiveness of the models is hard to gauge. In this work, although we loosely build on these previous works, we extend them in two crucial directions: our models deal directly with UG; and no preprocessing is necessary, so no part of the molecule has to be marked as a "super-source". We term our model UG-RNN, or Recursive Neural Networks for Undirected Graphs.
We apply UG-RNN to the prediction of aqueous solubility, melting point and octanol water partition coefficient (all regression tasks) and to the classification of mutagenicity. Our results are encouraging, outperforming or matching state-of-the-art kernels on the same regression datasets. We alter the models slightly for mutagenicity prediction in
order to test whether our approach incorporates useful contextual information. In this case we show that UG-RNN outperform a state-of-the-art SA method and only perform less accurately than a method based on SVMs fed with a task-specific feature which is not available to our model [25]. UG-RNN are open-ended in that they can be used to learn on any molecular dataset, and are portable to other molecular biology problems that require graph processing, such as phylogenetic graph/tree analysis, protein classification when the protein is represented as a graph of contacts, etc.
2 Methods

A molecule is naturally described as a UG, possibly with cycles, where atoms represent vertices and bonds represent edges. Here we factorise the UG representing a molecule into N Directed Acyclic Graphs (DAGs), where N is the total number of atoms/nodes in the molecule, and the k-th DAG is obtained from the UG describing the molecule by directing all its edges along the shortest path to its k-th atom v_k. The order of the atoms in the molecule is unimportant, as the result of the processing is independent of it. Figure 1 shows how the undirected graph of nitrobenzene can be represented as 9 DAGs. Let ch1[v,k], ..., chn[v,k] be the children of atom/node v in the k-th DAG; we then assume that there is a hidden vector G_{v,k} ∈ R^m describing the contextual information upstream of node v:

G_{v,k} = M^{(G)}\big( i_v, G_{ch1[v,k]}, \ldots, G_{chn[v,k]} \big),    (1)

where i_v ∈ R^l is the label associated with node v (i.e., essentially all the information input about the atom v). When a node has no children, or fewer children than the maximum allowed (n), the empty arguments of M^{(G)}() are set to vectors of zeroes (boundary conditions).
Fig. 1. The undirected graph of nitrobenzene and the 9 DAG’s derived from the molecule
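As an illustration of this factorisation (a sketch under our own assumptions, not the authors' code), the k-th DAG can be obtained with a breadth-first search from the chosen root atom; each node's children are then its neighbours that lie one step further from the root (edges between equidistant atoms are simply ignored in this sketch, and the molecular graph is assumed connected):

    from collections import deque

    def dags_from_molecule(adjacency):
        """For each root atom k, direct the edges of the molecular graph along
        the shortest paths towards k, yielding one DAG per atom.

        adjacency : dict {atom: iterable of neighbouring atoms}
        Returns {root: {atom: [children in the root's DAG]}}.
        """
        dags = {}
        for root in adjacency:
            dist = {root: 0}
            queue = deque([root])
            while queue:                                  # BFS distances to the root
                u = queue.popleft()
                for w in adjacency[u]:
                    if w not in dist:
                        dist[w] = dist[u] + 1
                        queue.append(w)
            # a node's children are the neighbours one step further from the root,
            # so information flows from the periphery towards the root atom
            dags[root] = {v: [w for w in adjacency[v] if dist[w] == dist[v] + 1]
                          for v in adjacency}
        return dags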
The maximum number of children n is set to the maximum outdegree of all the vertices in the structures, and is normally n = 4 for molecules. We realise the function M^{(G)}, or state transition function, by a two-layered perceptron. In the most basic model we assume stationarity, that is, the same function (thus network) is used to process all vertices in all DAGs. This may be regarded as a form of weight sharing, and helps keep the number of free parameters in the model low. Given that there are as many DAGs as there are nodes in the UG describing the molecule, and given that each node v_k is a root in a graph, there will be N vectors associated with root nodes: G_{v_k,k}. Each of these vectors provides a description of the molecule "as seen" from v_k, and may be regarded as a way of "localising" the computations. Although it may be possible to analyse the individual vectors G_{v_k,k} to point out which parts of a molecule are relevant to determine a property, we have not focussed our work on this task at present. To map the complex of these vectors into a single property, we first add them up:

G_{structure} = \sum_{k=1}^{N} G_{v_k,k}.    (2)
Notice how each atom in the molecule is within one transition function (i.e., one two-layered perceptron) of the vector representing the whole molecule, thus minimising the well-known vanishing gradient problem, which affects recursive neural networks and deep networks in general [26]; and how, given that all atoms compete to be represented in G_{structure}, if this vector is selected with the purpose of predicting a given property, then it effectively represents a task-dependent encoding (compression) of the molecule. We map G_{structure} into the property of interest as:

o = M^{(O)}(G_{structure}).    (3)

We implement M^{(O)} by a two-layered perceptron with a linear output when we predict real-valued properties, and with softmax units in the case of classification. The error function we minimise is, respectively, the sum of squared differences between target and network output, and the relative entropy between target and output. The whole network (all DAGs, the sum of the hidden vectors of the root nodes, and the output function) is trained by gradient descent. The gradient of the error is computed exactly by the backpropagation algorithm, which we can apply here given that the overall network encoding the molecule has no cycles.
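A minimal sketch of this forward pass (Eqs. (1)-(3)), under our own assumptions about parameter shapes: hypothetical weight dictionaries W_g and W_o, a hidden size of 16, a maximum of four children, a labels dict mapping each atom to its label vector, the DAG structure from the previous sketch, and a linear output for regression. It is an illustration, not the authors' implementation:

    import numpy as np

    def ugrnn_forward(dags, labels, W_g, W_o, hidden=16, max_children=4):
        """UG-RNN forward pass: recursive transition over each DAG (Eq. (1)),
        sum of the root states (Eq. (2)) and output mapping (Eq. (3))."""
        def perceptron(x, W, linear_out=False):      # generic two-layered perceptron
            h = np.tanh(W['U'] @ x + W['b1'])
            out = W['V'] @ h + W['b2']
            return out if linear_out else np.tanh(out)

        def state(v, dag, memo):                      # G_{v,k} of Eq. (1)
            if v not in memo:
                kids = [state(c, dag, memo) for c in dag[v]]
                kids += [np.zeros(hidden)] * (max_children - len(kids))   # boundary conditions
                memo[v] = perceptron(np.concatenate([labels[v], *kids]), W_g)
            return memo[v]

        g_structure = sum(state(root, dag, {}) for root, dag in dags.items())
        return perceptron(g_structure, W_o, linear_out=True)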
Fig. 2. An example of the model used on acetic acid. All the variables are explained in the text.
Figure 2 shows how the 5 DAGs of acetic acid are composed to build a UG-RNN. Each DAG produces a distinct contextual vector at its root. These vectors are then added and mapped into the final output representing the desired property.

2.1 Relaxing Stationarity

We also implement a system in which we relax the stationarity hypothesis. In this case we search for common bonding patterns for an atom (or common neighbourhoods for a node), and we implement dedicated transition functions for the most frequent ones. For example, the same transition network/function will represent every carbon atom with a single nitrogen and a double-bonded oxygen. A neighbourhood token is created by inspecting each atom in the training set and storing the atom symbol and its immediate neighbours. Neighbourhood tokens that contain only carbon symbols are removed because they are considered non-informative. The T most frequent tokens yield special transition functions, while all the other patterns are handled by a general function:

G^{1}_{v,k} = M^{(1)}\big( i_v, G^{u(1)}_{ch1[v,k]}, \ldots, G^{u(n)}_{chn[v,k]} \big)
  \ldots
G^{T}_{v,k} = M^{(T)}\big( i_v, G^{u(1)}_{ch1[v,k]}, \ldots, G^{u(n)}_{chn[v,k]} \big)
G^{general}_{v,k} = M^{(general)}\big( i_v, G^{u(1)}_{ch1[v,k]}, \ldots, G^{u(n)}_{chn[v,k]} \big)    (4)

where u(c) ∈ {1, ..., T, general} is the index of the transition function applied to chc[v,k] based on its identity and neighbours.

2.2 Atomic Label

We keep the label i_v attached to an atom v as simple as possible in order to make the architecture portable. It should be noted, though, that any feature of an atom or of its neighbourhood can be included in i_v. Each atom is labelled as follows:
– The element type.
– The atom charge.
– Using OpenBabel version 2.1.1 [27] we calculate the Smallest Set of Smallest Rings (SSSR), hybridization and aromaticity of the atom.

2.3 Training Procedure

When training for regression, all target endpoints are normalised to [0,1] by finding the maximum and minimum target values in the training set of each fold and normalising all targets in the training and testing folds to (target − min)/(max − min). We use a sigmoid activation function for the final output neuron and perform gradient descent on a squared error. For classification, a softmax activation function is used at the output neurons with the relative cross-entropy as the cost function. All inner neurons have a tanh() activation function, irrespective of regression or classification.
Weights are randomly initialised. We update the weights once per epoch (batch learning). A gradient component dw is applied directly when its absolute value is in the [0.1,1] range, but set to sign(dw) when greater than 1 and to 0.1×sign(dw) when smaller than 0.1. We train 5 distinct models with different random initial weights and number of units and ensemble them to produce the final output. Each model is trained for 2000 epochs. We can learn on a set of 3000-4000 molecules in one day on one core of a modern PC. The final systems can predict millions of molecules per day on a small cluster of machines, making them particularly suitable for high throughput screening. Classification models estimate the probability of the endpoint given the inputs. This is also advantageous for screening since strict criteria can be imposed on the probability in order to increase the confidence of a prediction.
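The per-component gradient treatment described above can be written compactly; a small sketch (not the authors' code):

    import numpy as np

    def treat_gradient(dw, low=0.1, high=1.0):
        """Use a gradient component as-is when |dw| is in [0.1, 1], clip it to
        sign(dw) when larger than 1, and raise it to 0.1*sign(dw) when smaller."""
        return np.sign(dw) * np.clip(np.abs(dw), low, high)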
3 Results

For all the regression problems we only report the results of the models in which stationarity is relaxed. In the mutagenicity classification results we show that this type of model significantly outperforms the stationary model without the dedicated processing units. We also test other slight variations of the model architecture when predicting mutagenicity (see below).

3.1 Regression

For all the regression tasks described below, three measures are reported. The first is the squared correlation coefficient (r², or squared Pearson correlation coefficient), where the correlation coefficient is

r = \frac{\sum_{i=1}^{N} t_i p_i - N \bar{t} \bar{p}}{(N-1) s_t s_p},    (5)

where \bar{t} and \bar{p} are the means of the targets and predicted values respectively, N is the total number of examples, and s_t and s_p are the standard deviations of the targets and predictions respectively. We also report the root mean squared error (RMSE) and the average absolute error (AAE), which are \sqrt{\frac{1}{N}\sum_{i=1}^{N}(t_i - p_i)^2} and \frac{1}{N}\sum_{i=1}^{N}|t_i - p_i| respectively.

Aqueous Solubility. The biological activity of potential drugs is influenced by their aqueous solubility; the effectiveness of a drug may depend on this property for delivery. Hence computational methods to evaluate the solubility of compounds are important, and many approaches to this task have been described; a review of computational methods for the early phase of drug development can be found in [28]. In [29] the octanol-water partition coefficient (which can be predicted somewhat accurately from the molecular structure, see Section 3.1) and 51 2D descriptors are input to a multiple linear regression model, yielding a squared correlation coefficient of 0.74 and an average absolute error of 0.68 on a dataset of 2688 training compounds and 640 test compounds. Delaney [2] uses the water partition coefficient and three other parameters (molecular weight, number of rotatable bonds and the aromatic proportion in aromatic rings) as inputs to a simple linear model. Although the model is simple, it outperforms the General Solubility Equation [30], which is based on the melting point and the octanol-water partition coefficient.
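The three measures defined at the start of this section can be computed as follows (a small illustrative sketch, not the authors' code):

    import numpy as np

    def regression_metrics(targets, predictions):
        """Squared Pearson correlation (from Eq. (5)), RMSE and AAE."""
        t = np.asarray(targets, dtype=float)
        p = np.asarray(predictions, dtype=float)
        r = np.corrcoef(t, p)[0, 1]              # Pearson correlation coefficient
        rmse = np.sqrt(np.mean((t - p) ** 2))    # root mean squared error
        aae = np.mean(np.abs(t - p))             # average absolute error
        return r ** 2, rmse, aae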
Table 1. Prediction performance for aqueous solubility in 10-fold cross-validation on the 1144 compounds in the Delaney "Small" dataset

                               r2      RMSE    AAE
UG-RNN                         0.914   0.613   0.437
Delaney [2]                    -       -       0.75
GSE [30]                       -       -       0.47
2D kernel (param d=2) [12]     0.91    0.61    0.44
In [12] various kernels are designed, showing state-of-the-art results on the "Small" Delaney dataset for the 2D kernel based on path lengths of two. Table 1 shows our results obtained in 10-fold cross-validation on the Delaney "Small" dataset. Comparisons are made with the kernel method in [12], Delaney's own method [2] and the GSE equation [30]. The dataset contains 1144 compounds of varying types with solubility values (measured as log S) ranging from -11.6 to 1.58 Mol/litre. Our results on this dataset are state of the art, with an r2 of 0.91, an RMSE of 0.61 and an AAE of 0.44, which are identical to those of the best kernel from the work in [12]. The only numbers available for comparison with Delaney's work and the GSE equation are AAEs of 0.75 and 0.47 on this "Small" dataset. Another common dataset is the one in Huuskonen [3], consisting of 1297 compounds with log S values ranging from -11.62 to 1.58 Mol/Litre. Huuskonen's method relies on molecular connectivity, shape, and atom-type electrotopological indices, input to a multi-layer neural network, yielding an r2 of 0.92 and a standard deviation of 0.6. However, no cross-validation is performed in order to assess the true generalisation ability of the method. In [4] a support vector machine is used to learn from derived descriptors, and the final model achieves an r2 value of 0.90 in 8-fold cross-validation on the same set. Again the kernel methods of [12] produce state-of-the-art performances; however, the best results are now obtained with a different (3D) kernel, whereas our method remains the same, indicating general applicability. Although the Huuskonen dataset consists of 1297 compounds, Azencott et al. report results on 1026 compounds without mention of redundancy reduction. Table 2 shows a comparison of the methods on the Huuskonen dataset. On this set we achieve an r2 of 0.92, slightly above the squared correlation for the kernel method in Azencott et al. and the method in Fröhlich et al., and a somewhat worse, if still nearly perfect, AAE (0.43 on a range of 12 log Mol/Litre units). In Fig. 3 we show the correlation graph for the Delaney dataset; we observe a very similar trend on the Huuskonen set.
Table 2. Prediction performance for aqueous solubility in 10-fold cross-validation on the 1297 compounds in the Huuskonen dataset

                               r2      RMSE    AAE
UG-RNN                         0.92    0.35    0.43
Frohlich [4]                   0.90    -       -
3D kernel [12]                 0.91    0.15    0.11
Fig. 3. Plot of experimental log S values against predicted log S values for the Delaney "Small" dataset (S is measured in Mol/Litre); the diagonal line is f(x) = x
Melting point. Melting point can be used for the rapid determination of the purity of a substance, and it is often a core property in QSAR/QSPR analyses for determining solubility and boiling point [31, 30]. The General Solubility Equation [30] is log S = 0.5 - log Pow - 0.01(tm - 25), where log Pow is the octanol-water partition coefficient and tm is the melting point. log Pow can be predicted with high accuracy (see the next section), while the automatic prediction of melting points still remains difficult. The above equation generally works well (RMSE 0.7-0.8 log units) so long as the melting point can be determined accurately. We test our method on a melting point dataset extracted from the literature [5], containing 4173 compounds from the Molecular Diversity Preservation International (MDPI) database [32]. The melting points range from 14°C to 392.5°C. Our results (Table 3) on this set compare favourably with two other methods [5, 12], with a correlation coefficient of 0.753 (r2 of 0.57). Karthikeyan [5] uses a large set of 2D and 3D features which are dimensionally reduced using PCA and then input to a feed-forward neural network. The model producing the best results in Azencott et al. [12] is again the 2D kernel using a minmax similarity measure, but the path length used to determine the similarity is now 10.
Table 3. Prediction performance for melting point in 10-fold cross-validation on the 4173 compounds in the Karthikeyan dataset, together with other methods based on the same dataset

                               r2      RMSE       AAE
UG-RNN                         0.57    42.5 °C    32.6 °C
Karthikeyan [5]                0.42    52.0 °C    41.3 °C
2D kernel (param d=10) [12]    0.56    42.71 °C   32.58 °C
Octanol-water partition coefficient. Accurate determination of the octanol-water partition coefficient, i.e. the ratio of the concentrations of a compound in a mixture of water and octanol at equilibrium (normally measured as the logarithm of the ratio, or log Pow), is central to QSAR/QSPR. The magnitude of log Pow is useful in estimating the distribution of drugs within the body, and it is therefore an important factor for a candidate drug. Moreover, log Pow and the melting point can be used to accurately determine the solubility of a compound. Table 4 shows the results for the prediction of log Pow on the dataset used in [6] and in [12]. The results in [6] are nominally accurate but are measured on the training set, hence meaningless, and are not reported in the table. The 2D kernel (the best performing is again different, this time with d = 5) is marginally more accurate. Our method is unchanged from the previous tests.
Table 4. Prediction performance for the octanol/water partition coefficient (as log Pow) in 10-fold cross-validation on the dataset of [6, 12], together with other methods based on the same dataset

                               r2      RMSE    AAE
This work                      0.934   0.279   0.394
2D kernel (param d=5) [12]     0.94    0.25    0.38
3.2 Classification

Finally, we apply UG-RNN to the problem of mutagenicity classification. In this case we also gauge whether contextual information and relaxing stationarity have an effect on the results. We also test two different variants of the output network M^{(O)}() (Eq. (3)): one in which only the global state of the structure G_{structure} is passed as an argument, and one in which the average of the labels i_v over the molecule is also input. We term these two models, respectively, Moore and Mealy UG-RNN.

If TP, FP, TN and FN are the numbers of true positives, false positives, true negatives and false negatives respectively, the performance measures we use are the precision TP/(TP + FP) and the recall TP/(TP + FN) for each class. The Matthews Correlation Coefficient (MCC) is also computed for the sake of comparison with other works that report recall and MCC and leave out precision. The MCC is defined as

MCC = \frac{TP \cdot TN - FN \cdot FP}{\sqrt{(TP+FN)(TP+FP)(TN+FN)(TN+FP)}}.
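These measures follow directly from the four counts; a small illustrative sketch:

    import math

    def classification_metrics(tp, fp, tn, fn):
        """Precision, recall and Matthews correlation coefficient (MCC)."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        mcc = (tp * tn - fn * fp) / math.sqrt(
            (tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
        return precision, recall, mcc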
Mutagenicity. Mutagenicity is the ability of a compound to cause mutations in DNA, with these mutations often causing the onset of cancerous tumors. It has previously been shown that a positive experimental mutagenicity test (known as the Ames test) corresponds to carcinogenicity in rodents in as many as 77% to 90% of cases [33]. Screening of drug candidates for mutagenicity is a regulatory requirement for drug approval, since mutagenic properties pose risks to humans. At present, a variety of toxicological tests have to be conducted by a drug manufacturer. Although mutagenicity testing has a relatively simple experimental procedure, these tests are generally low-throughput and hence cannot be applied to large-scale screening. A fast, effective method for the prediction of mutagenicity could therefore serve as an initial estimate of carcinogenicity and greatly aid in the manufacture of novel drugs.
We use a dataset collected from the literature [7]. The set consists of 4337 diverse chemical compounds, together with information indicating whether they show mutagenicity in Salmonella typhimurium strains TA98, TA100, TA1535 and either TA1537 or TA97, which are the standard Ames test strains required for regulatory evaluation of drug approval. In this dataset a compound is considered a mutagen if at least one Ames test result was positive, and a non-mutagen otherwise; 54% of the data are mutagenic, making it a well-balanced set. In Table 5 we compare UG-RNN with the substructure/structural-alert method of [8] and with a method based on a novel molecular electrophilicity descriptor input to an SVM [25]. The dataset used in the substructure mining method is a slightly reduced version of the dataset used here, since mixtures, counter-ions, molecules above 500 molecular weight and stereoisomers were removed. We have not removed these noisy examples, which is expected to yield slightly worse results. The automatic substructure mining method [8], known as the elaborate chemical representation (ECR), is an extension of a similar manual method in [7], and simply assigns a chemical to a particular class based on previously identified substructures. The method in [25] constructs a novel feature vector based on the atomic electrophilicity, which is a highly domain-specific vector for mutagenicity prediction. MOLFEA [34] generates molecular fragments by a mining algorithm and then uses three types of machine learning system: decision trees, a rule learner and support vector machines; the authors show that the substructure-based approach improves over machine learning with fixed-length molecular property vectors, and obtain the highest predictive accuracy with optimised structures and a linear SVM.

We test four distinct models:
– Multi-Layer Perceptron (MLP). The input is defined as

I = \frac{1}{N} \sum_{v \in \{V\}} i_v,    (6)

where N is the number of atoms, i_v is the input label of atom v and {V} is the set of all vertices.
– Stationary UG-RNN. A UG-RNN with only one transition function, and therefore only one transition network, to encode the molecule.
– Moore UG-RNN. A UG-RNN with specialised transition functions for the most frequent patterns, and output obtained as o = M^{(O)}(G_{structure}).
– Mealy UG-RNN. A UG-RNN with specialised transition functions for the most frequent patterns, but output obtained as o = M^{(O)}(I, G_{structure}), where I is the average label over the molecule, as in Eq. (6).

Table 5 shows the 10-fold cross-validation results of all the models we tested, compared with those of the other methods described above (which are also all tested in 10-fold cross-validation, although MOLFEA is tested on a different set). It is clear from the table that the SVM+electrophilicity method [25] performs best; however, the feature vector is highly task-dependent, there is no evidence of test vs. validation separation in the choice of the feature vector, and no precision values are reported. Our two best methods (Mealy UG-RNN and Moore UG-RNN) perform better than both Substructure Mining [8] and MOLFEA [34].
Table 5. Performance of the different mutagenicity models in a 10-fold cross-validation. (P) is precision and (R) is recall. (*) a different dataset of 684 compounds was used for training and evaluation.

                            Q_all    mutagen (R)   non-mutagen (R)   mutagen (P)   non-mutagen (P)   MCC
MLP                         77.8%    79.8%         75.2%             80.0%         75.0%             55.0%
Stationary UG-RNN           78.8%    75.9%         81.1%             76.5%         80.6%             57.1%
Moore UG-RNN                81.7%    84.7%         77.9%             82.5%         80.5%             62.7%
Mealy UG-RNN                82.4%    86.0%         78.0%             82.8%         81.8%             64.3%
SVM+electrophilicity [25]   90.1%    87.7%         92.1%             -             -                 80.0%
Substructure mining [8]     80.6%    83.0%         74.5%             80.8%         77.2%             57.4%
MOLFEA [34]*                78.5%    77.5%         79.4%             -             -                 56.9%
Moreover, all UG-RNNs perform better than the static MLP, so contextual information appears to be incorporated effectively. The non-stationary models perform better than the stationary one, and the Mealy model (in which both contextual information and average input labels are fed to the output network) is the best performing of all, with an overall accuracy of 82.4% and an MCC of 64.33%. It should also be noted that the average intra-laboratory reproducibility of a series of Ames test data from the National Toxicological Program (NTP) was determined to be 85% [35]; hence it is unclear what an accuracy of 90% [25] might mean, and our best result is close to the experimental accuracy of the test.
4 Conclusions

We have developed a general class of machine learning models (UG-RNN) that map graphs to global labels, and have tested it on four separate problems in QSAR/QSPR. The models discussed in this work have remained essentially the same throughout all of these tasks. In all cases we obtained results close to or above the state of the art, which, depending on the task, is represented by algorithms based on kernels, substructure mining/structural alerts, or manual or semi-automatic feature extraction: algorithms which are usually domain- and task-specific. The input features we have used are not domain-specific, are very simple and may be expanded, and the method is highly portable to other tasks in QSAR/QSPR, in molecular biology and elsewhere, so long as the input instances are naturally represented as Undirected Graphs. In the future we plan to expand our research in a number of directions, including: testing whether the feature vector automatically generated by UG-RNN (the G_{structure} in Eq. (2)), which is effectively a task-dependent encoding of the molecule, could be used as input to classifiers other than MLPs, for instance SVMs; whether UG-RNN can be expanded to include 3D information, alternatively or alongside 2D, for instance by representing the interaction between two atoms closer than a certain threshold as an edge and introducing the distance as a label on the edges of the model; and incorporating
information about stereoisomers in the model - this is currently overlooked and is likely to have hindered our results in the case of mutagenicity classification, where stereoisomers are present in the set.
Acknowledgments This work is supported by grant RP/2005/219 from the Health Research Board of Ireland.
References
1. Hansch, C., Muir, R.M., Fujita, T., Maloney, P., Geiger, E., Streich, M.: The correlation of biological activity of plant growth regulators and chloromycetin derivatives with Hammett constants and partition coefficients. J. Am. Chem. Soc. 85, 2817 (1963)
2. Delaney, J.: ESOL: Estimating aqueous solubility directly from molecular structure. J. Chem. Inf. Comput. Sci. 44(3), 1000–1005 (2004)
3. Huuskonen, J.: Estimation of aqueous solubility for a diverse set of organic compounds based on molecular topology. J. Chem. Inf. Comput. Sci. 40(3), 773–777 (2000)
4. Fröhlich, H., Wegner, J., Zell, A.: Towards optimal descriptor subset selection with support vector machines in classification and regression. J. Chem. Inf. Comput. Sci. 45(3), 581–590 (2005)
5. Karthikeyan, M.: General melting point prediction based on a diverse compound data set and artificial neural networks. J. Chem. Inf. Comput. Sci. 45(3), 581–590 (2005)
6. Wang, R., Fu, Y., Lai, L.: Towards optimal descriptor subset selection with support vector machines in classification and regression. J. Chem. Inf. Comput. Sci. 37(3), 615–621 (1997)
7. Kazius, J., McGuire, R., Bursi, R.: Derivation and validation of toxicophores for mutagenicity prediction. J. Med. Chem. 48(1), 312–320 (2005)
8. Kazius, J., Nijssen, S., Kok, J., Bäck, T., Ijzerman, A.: Substructure mining using elaborate chemical representation. J. Chem. Inf. Model. 46(2), 597–605 (2006)
9. Deshpande, M., Kuramochi, M., Wale, N., Karypis, G.: Frequent substructure-based approaches for classifying chemical compounds. IEEE Transactions on Knowledge and Data Engineering 17(8), 1036–1050 (2005)
10. Benigni, R., Giuliani, A.: Putting the predictive toxicology challenge into perspective: reflections on the results. Bioinformatics 19(10), 1194–1200 (2003)
11. Mahé, P., Ueda, N., Akutsu, T., Perret, J., Vert, J.: Graph kernels for molecular structure-activity relationship analysis with support vector machines. Journal of Chemical Information and Modeling 45, 939–951 (2005)
12. Azencott, C., Ksikes, A., Swamidass, A., Chen, J., Ralaivola, L., Baldi, P.: One- to four-dimensional kernels for virtual screening and the prediction of physical, chemical, and biological properties. J. Chem. Inf. Comput. Sci. 47(3), 965–974 (2007)
13. Ceroni, A., Costa, F., Frasconi, P.: Classification of small molecules by two- and three-dimensional decomposition kernels. Bioinformatics 23(16), 2038–2045 (2007)
14. Swamidass, S., Chen, J., Bruand, J., Phung, P., Ralaivola, L., Baldi, P.: Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity. Bioinformatics 21(suppl. 1), 359–368 (2005)
15. Micheli, A., Sperduti, A., Starita, A.: An introduction to recursive neural networks and kernel methods for cheminformatics. Current Pharmaceutical Design 13(14), 1469–1495 (2007)
16. Sperduti, A., Starita, A.: Supervised neural networks for the classification of structures. IEEE Transactions on Neural Networks 8(3), 714–735 (1997)
17. Frasconi, P.: An introduction to learning structured information. J. Chem. Inf. Comput. Sci. 1387/1998, 99 (2004)
18. Frasconi, P., Gori, M., Sperduti, A.: A general framework for adaptive processing of data structures. IEEE Transactions on Neural Networks 9(5), 768–786 (1998)
19. Bernazzani, L., Duce, C., Micheli, A., Mollica, V., Sperduti, A., Starita, A., Tiné, M.: Predicting physical-chemical properties of compounds from molecular structures by recursive neural networks. Applied Intelligence 19(1-2), 9–25 (2003)
20. Micheli, A., Portera, F., Sperduti, A.: QSAR/QSPR studies by kernel machines, recursive neural networks and their integration. In: Apolloni, B., Marinaro, M., Tagliaferri, R. (eds.) WIRN 2003. LNCS, vol. 2859, pp. 308–315. Springer, Heidelberg (2003)
21. Bianucci, A., Micheli, A., Sperduti, A., Starita, A.: Application of cascade correlation networks for structures to chemistry. Applied Intelligence 12(1-2), 117–147 (2000)
22. Siu-Yeung, C., Zheru, C.: Genetic evolution processing of data structures for image classification. IEEE Transactions on Knowledge and Data Engineering 17(2), 216–231 (2005)
23. Costa, F., Frasconi, P., Lombardo, V., Soda, G.: Towards incremental parsing of natural language using recursive neural networks. Applied Intelligence 19(1-2), 9–25 (2003)
24. Bianchini, M., Maggini, M., Sarti, L., Scarselli, F.: Recursive neural networks learn to localize faces. Pattern Recognition Letters 26(12), 1885–1895 (2005)
25. Zheng, M., Liu, Z., Xue, C., Zhu, W., Chen, K., Luo, X., Jiang, H.: Mutagenic probability estimation of chemical compounds by a novel molecular electrophilicity vector and support vector machine. Bioinformatics 22(17), 2099–2106 (2006)
26. Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5(2), 157–166 (1994)
27. The Open Babel package, version 2.1.1, http://www.openbabel.org/
28. Huuskonen, J.: Estimation of aqueous solubility in drug design. Combinatorial Chemistry and High Throughput Screening 4(3), 311–316 (2000)
29. Butina, D., Gola, J.: Modeling aqueous solubility. J. Chem. Inf. Comput. Sci. 43, 837–841 (2003)
30. Jain, N., Yalkowsky, S.: Estimation of the aqueous solubility I: Application to organic nonelectrolytes. Journal of Pharmaceutical Sciences 90(2), 234–252 (2001)
31. Abramowitz, R., Yalkowsky, S.: Melting point, boiling point, and symmetry. Pharmaceutical Research 7(9), 942–947 (1990)
32. Molecular Diversity Preservation International database, http://www.mdpi.org/
33. Mortelmans, K., Zeiger, E.: The Ames Salmonella/microsome mutagenicity assay. Mutat. Res. 455(1-2), 29–60 (2000)
34. Helma, C., Cramer, T., Kramer, S., De Raedt, L.: Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds. J. Chem. Inf. Comput. Sci. 44(4), 1402–1411 (2004)
35. Piegorsch, W., Zeiger, E.: Measuring intra-assay agreement for the Ames Salmonella assay. Statistical Methods in Toxicology, Lect. Notes Med. Informatics 43, 35–41 (1991)
Enhancing the Effectiveness of Fingerprint-Based Virtual Screening: Use of Turbo Similarity Searching and of Fragment Frequencies of Occurrence

Shereena M. Arif, Jérôme Hert, John D. Holliday, Nurul Malim, and Peter Willett

Krebs Institute for Biomolecular Research and Department of Information Studies, University of Sheffield, 211 Portobello Street, Sheffield S1 4DP, United Kingdom
[email protected]
Abstract. Binary fingerprints encoding the presence of 2D fragment substructures in molecules are extensively used for similarity-based virtual screening in the agrochemical and pharmaceutical industries. This paper describes two techniques for enhancing the effectiveness of screening: the use of a second-level search based on the nearest neighbours of the initial reference structure; and the use of weighted fingerprints encoding the frequency of occurrence, rather than just the mere presence, of substructures. Experiments using several databases for which both structural and bioactivity data are available demonstrate the effectiveness of these two approaches. Keywords: Chemoinformatics, Fingerprint, Fragment substructure, Similarity measure, Similarity searching, Turbo similarity searching, Virtual screening, Weighting scheme.
1 Introduction

Virtual screening, the ranking of molecules in order of probability of biological activity, plays an increasingly important role in the discovery of novel bioactive molecules in the agrochemical and pharmaceutical industries [1, 2]. There are many ways in which this can be achieved: here, we discuss the use of similarity searching for this purpose [3, 4]. Given a molecule that exhibits some biological activity of interest (the reference structure) and a database of molecules that have not previously been tested for that activity, a similarity search computes a measure of structural similarity between the reference structure and each of the database structures in turn. The database is then ranked in decreasing order of the computed similarities, and the top-ranked, nearest-neighbour molecules are passed on for further consideration as having the greatest a priori probabilities of bioactivity. The most common similarity measure involves the use of 2D fingerprints and the Tanimoto coefficient, where a fingerprint is a binary vector encoding the presence or absence in a molecule of small substructural fragments [4]. Fingerprint-based similarity is clearly simple in concept but has proved to be very effective in operation [5-9]. Hert et al. have described an extension of similarity searching, turbo similarity searching (subsequently referred to here as TSS) [10].
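As an illustration of the conventional fingerprint-based search described above (a sketch, not the screening systems used in this study), the Tanimoto coefficient and the resulting ranking can be computed as follows, with fingerprints represented here as Python sets of the fragment identifiers whose bits are set:

    def tanimoto(fp_a, fp_b):
        """Tanimoto coefficient between two binary fingerprints (sets of 'on' bits)."""
        common = len(fp_a & fp_b)
        return common / (len(fp_a) + len(fp_b) - common)

    def similarity_search(reference_fp, database):
        """Rank database molecules by decreasing similarity to the reference.
        database: dict mapping molecule identifiers to fingerprints."""
        scores = {mol_id: tanimoto(reference_fp, fp) for mol_id, fp in database.items()}
        return sorted(scores, key=scores.get, reverse=True)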
Enhancing the Effectiveness of Fingerprint-Based Virtual Screening
405
states that molecules that are structurally similar are likely to exhibit similar bioactivities and properties [11, 12]: thus, the nearest neighbours of a bioactive reference structure are also expected to possess that particular bioactivity. Recent studies have demonstrated the increased effectiveness of searching that can be obtained if not one but multiple bioactive reference structures are available, using an approach called group fusion [9, 13, 14]. Here, each reference structure in turn is used for a similarity search, and then the resulting rankings combined to give a single consensus ranking [15]. TSS makes the assumption that the nearest neighbours of a reference structure are not just likely to be active (as suggested by the similar property principle) but actually are active; they can thus be used as the multiple reference structures required for the implementation of group fusion [10]. The user of a TSS system needs to do nothing more than is required for conventional similarity searching, i.e., the input of a bioactive reference structure; however, the final, combined search output is expected to yield a better level of enrichment than a conventional similarity search (hereafter SS) based on just the original reference structure. Hert et al. found that TSS yielded favorable results and they hence suggested that the approach provides a simple way of enhancing the effectiveness of current systems for virtual screening [10]. The original TSS experiments used the MDL Drug Data Report (MDDR) database with the molecules represented by one particular type of fingerprint (specifically the Pipeline Pilot ECFP_4 fingerprints). In the first part of the present paper, we consider the effectiveness of TSS when used with other databases and other types of fingerprint to determine the generality of TSS for virtual screening. Fingerprints for similarity searching are normally binary, with each element of the fingerprint denoting the presence (one) or the absence (zero) of a particular substructural fragment in a molecule. Alternatively, it is possible to assign weights to fragments so that a fragment with a high weight that is common to both a reference structure and a database structure makes a greater contribution to the computed similarity than will a common fragment with a lesser weight. In the absence of the extensive training data needed for machine learning approaches to fragment weighting [16], one source of information that can be used for weighting fragments is the number of times that a fragment occurs in an individual molecule [17]. Several previous studies have suggested that occurrence-based fingerprints (i.e., weighted fingerprints that encode how often a substructure occurs in a molecule) can give better screening than incidencebased fingerprints (i.e., conventional binary fingerprints that encode merely the presence or absence of a substructure). However, the results to date have been far from consistent, with the experiments often involving only small datasets and with no attempt to explain the observed levels of performance: the study reported in the second part of this paper was carried out to address these limitations.
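As an illustration of the conventional similarity searching described above, the sketch below ranks a set of binary fingerprints against a reference structure using the Tanimoto coefficient. It is illustrative only: it assumes the fingerprints are already available as 0/1 NumPy arrays (fingerprint generation with a chemoinformatics toolkit is not shown) and is not the software used in the experiments reported here.

```python
import numpy as np

def tanimoto_binary(ref, db):
    """Tanimoto coefficient between one binary fingerprint and a matrix of fingerprints."""
    # ref: (n_bits,) 0/1 array; db: (n_mols, n_bits) 0/1 array
    common = db @ ref                        # bits set in both molecules (c)
    totals = db.sum(axis=1) + ref.sum()      # a + b
    return common / (totals - common)        # c / (a + b - c)

def similarity_search(ref, db, top_fraction=0.05):
    """Rank a database against a reference structure; return the top-ranked indices and all scores."""
    scores = tanimoto_binary(ref, db)
    order = np.argsort(-scores)              # decreasing similarity
    n_keep = max(1, int(len(order) * top_fraction))
    return order[:n_keep], scores

# toy example: 1000 random 1024-bit fingerprints
rng = np.random.default_rng(0)
database = (rng.random((1000, 1024)) < 0.1).astype(int)
reference = database[0]
top_ids, _ = similarity_search(reference, database)
print(top_ids[:10])
```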
2 Experimental Details 2.1 Databases Several databases have been used in our experiments. The largest number of experiments used the MDDR dataset of 102,514 molecules and eleven bioactivity classes first described by Hert et al. [13]. This file of molecules was also screened for ten
activity classes (dataset MDDR-HET) that had been chosen to be as structurally diverse as possible, which provides a tougher test of a screening method’s scaffoldhopping abilities [18]. Further experiments used: a dataset of 138,127 molecules and 14 activity classes taken from the World of Molecular Bioactivity database (WOMBAT, available from Sunset Molecular Discovery LLC); and a dataset of 41,192 molecules with 393 confirmed actives from the NCI database, which contains molecules tested in the US government’s anti-AIDS programme. 2.2 Fingerprints There are two main classes of fingerprint. The dictionary-based approach involves a pre-defined list of fragments: a molecule is checked for the presence of each of the fragments in the dictionary, and a bit set (or not set) when a fragment is present (or absent). The molecule-based approach involves hashing algorithms that allocate multiple fragments to each bit-position: a note is made of all fragments of a specific type (e.g., a chain of four connected non-hydrogen atoms) occurring in a molecule, and then each fragment is hashed to set one or more bits in the fingerprint. The TSS experiments used the following fingerprints (which are described in detail by Gardiner et al. [19]): Pipeline Pilot ECFP_4 and FCFP_4 fingerprints (1024 bits, available from Accelrys Software Inc.); Tripos Unity fingerprints (988 bits, available from Tripos Inc.); BCI fingerprints (1052 bits, available from Digital Chemistry Ltd.); Daylight fingerprints (2048 bits, available from Daylight Chemical Information Systems Inc.); and MDL keys (166 bits, available from Symyx Technologies Inc.). Of these, the BCI and MDL fingerprints are dictionary-based, the Daylight and Pipeline Pilot fingerprints are molecule-based (using linear chains and circular substructures, respectively), and the Unity fingerprints are based on both approaches, thus encompassing both the main classes of fingerprint that are currently available. The weighting experiments in the second part of the paper used the following fingerprints: Tripos holograms (which employ hashed fragments analogous to those used in the Tripos Unity fingerprints, with a fingerprint containing 997 elements); Pipeline Pilot ECFC_4 fingerprints (the occurrence version of the ECFP_4 fingerprints, with a fingerprint containing 1024 elements); and Sunset keys (available from Sunset Molecular Discovery LLC), for which the 559-element fingerprints are rather more generic in character than the other two types of descriptor studied here, as they combine chemical substructure recognition with topologically-relevant pharmacophore patterns based on atom-pairs. 2.3 Weighting Schemes In the weighting experiments, each of the molecular representations (holograms, ECFC_4 or Sunset) was considered as a vector, X, where the i-th element, xi, denotes the weight that the i-th fragment has in that molecule. If the i-th fragment occurs fi times in a molecule (fi ≥ 0) then five weighting schemes (W1-W5) were considered. W1 and W2 are the raw incidence and occurrence data, i.e., W1: xi = 1 (for fi > 0); W2: xi = fi . W3 and W4 are two common standardizations in multivariate statistics:
W3: xi = ln(fi); W4: xi = √fi. W5 involves a further standardization that has proved helpful in weighting studies in text retrieval [20]:
W5: xi = 0.5 + 0.5 · fi / max{fi},
where max{fi} is the frequency of occurrence of the most frequently occurring fragment in a molecule.
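The five schemes can be summarised in the short sketch below. It is illustrative only; in particular, giving absent fragments a weight of zero under W3 and W5 is an assumption made here for concreteness, since the definitions above only address fragments that actually occur.

```python
import numpy as np

def weight_fingerprint(freqs, scheme):
    """Apply one of the W1-W5 weighting schemes to a vector of fragment frequencies f_i >= 0."""
    f = np.asarray(freqs, dtype=float)
    present = f > 0
    if scheme == "W1":                       # incidence: 1 if the fragment occurs at all
        return present.astype(float)
    if scheme == "W2":                       # raw occurrence counts
        return f
    if scheme == "W3":                       # ln of the counts (singletons get ln(1) = 0)
        out = np.zeros_like(f)
        out[present] = np.log(f[present])
        return out
    if scheme == "W4":                       # square root of the counts
        return np.sqrt(f)
    if scheme == "W5":                       # 0.5 + 0.5 * f_i / max{f_i}; absent fragments left at 0
        out = np.zeros_like(f)
        if present.any():
            out[present] = 0.5 + 0.5 * f[present] / f.max()
        return out
    raise ValueError(scheme)

counts = [3, 0, 1, 7, 2]
for w in ("W1", "W2", "W3", "W4", "W5"):
    print(w, weight_fingerprint(counts, w))
```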
3 Turbo Similarity Searching The results of the TSS searches are presented in Table 1. The measure of retrieval effectiveness used here, and also for the weighting experiments in the next section, is the recall, i.e., the fraction of the active molecules retrieved at some cut-off point in the ranking. Our experiments involved a cut-off of 5%, so that, e.g., a recall of 20% of the actives would correspond to a four-fold enrichment of the output as compared with random screening of the database. The recall values in Table 1 are the mean percentage of actives retrieved in the top-5%, averaged over all reference structures for each activity class and then over all activity classes for each database. The searches of the MDDR, MDDR-HET and NCI datasets used each active molecule in turn as the reference structure. The results for these datasets are shown in Tables 1a-1c, where SS denotes a conventional similarity search and where TSS-x denotes a turbo similarity search based on the original reference structure combined with its x nearest neighbours. Results are presented for x= 10, 20, 50 and 100, with the best SS and TSS performance marked as bold-faced and shaded. When the MDDR classes are used (Table 1a) there is often a noticeable increase in the recall of the search as more nearest neighbours are included in a TSS, with the best searches using 50-100 nearest neighbours. However, SS is superior to TSS for the MDL fingerprints, and there is little difference in performance for the Unity fingerprints. The ECFP_4 fingerprint gives the best results, both in the initial SS and in the degree of enhancement when TSS is used: for this fingerprint, the maximum TSS recall corresponds to an increase of ca. 15% of the SS recall, a significant finding since this enhancement is achieved without any additional effort on the part of the user carrying out the similarity search. A very different pattern of behaviour is observed with the MDDR-HET results presented in Table 1b. The degree of enhancement for this more challenging screening task is much less notable, even for the ECFP_4 fingerprint, and for most of the fingerprints there would appear to be little or no advantage in using TSS. Similar comments apply to the searches of the NCI dataset shown in Table 1c. In the WOMBAT experiments, ten molecules were chosen at random from each activity class to be the reference structures for searching. The results of these searches are detailed in Table 1d, from which one can draw similar conclusions as from Table 1a: the initial SS recall is high but there is still a substantial increase in the effectiveness of the TSS searches for the ECFP_4 and (to a lesser extent) the FCFP_4 fingerprints; however, TSS provides only limited benefits with the other types of fingerprint.
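The TSS procedure evaluated in Table 1 (a conventional search, selection of the top-x nearest neighbours, one further search per neighbour, and fusion of the resulting rankings) is sketched below. The MAX fusion rule and the random toy fingerprints are assumptions of this sketch rather than a description of the exact implementation used in these experiments.

```python
import numpy as np

def tanimoto(ref, db):
    c = db @ ref
    return c / (db.sum(axis=1) + ref.sum() - c)

def turbo_similarity_search(ref_idx, db, n_neighbours=50):
    """Turbo similarity search: fuse searches from the reference and its nearest neighbours."""
    base = tanimoto(db[ref_idx], db)
    nn = np.argsort(-base)                    # nearest neighbours of the reference
    nn = nn[nn != ref_idx][:n_neighbours]     # exclude the reference itself
    # one similarity search per (assumed-active) neighbour plus the original reference
    all_refs = np.concatenate(([ref_idx], nn))
    scores = np.stack([tanimoto(db[i], db) for i in all_refs])
    fused = scores.max(axis=0)                # MAX group-fusion rule (an assumption of this sketch)
    return np.argsort(-fused)                 # final consensus ranking

rng = np.random.default_rng(1)
db = (rng.random((500, 512)) < 0.1).astype(int)
ranking = turbo_similarity_search(0, db, n_neighbours=20)
print(ranking[:10])
```

Because the neighbours are simply assumed to be active, the procedure requires no input beyond the single reference structure, which is the point made above.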
Table 1. Similarity (SS) and turbo similarity (TSS) searches of (a) MDDR, (b) MDDR-HET, (c) NCI and (d) WOMBAT datasets
1a (MDDR)
Fingerprint    SS     TSS-10   TSS-20   TSS-50   TSS-100
BCI           32.8     33.8     34.2     34.7     34.9
Daylight      31.5     32.4     32.6     33.1     32.8
ECFP_4        39.2     41.9     42.9     44.5     45.1
FCFP_4        36.1     37.9     38.9     40.1     40.8
MDL           30.2     27.9     28.0     28.1     28.2
Unity         30.2     30.8     30.9     31.0     31.1

1b (MDDR-HET)
Fingerprint    SS     TSS-10   TSS-20   TSS-50   TSS-100
BCI           20.7     20.9     20.6     20.2     19.6
Daylight      18.3     18.0     17.4     16.7     16.4
ECFP_4        20.9     22.3     22.5     22.5     22.0
FCFP_4        20.2     21.1     21.1     20.7     20.1
MDL           20.0     20.0     19.5     18.9     18.3
Unity         16.6     15.8     15.2     14.1     13.8

1c (NCI)
Fingerprint    SS     TSS-10   TSS-20   TSS-50   TSS-100
BCI           12.1     12.3     12.3     12.5     12.8
Daylight      10.4     10.5     10.4     10.2     10.0
ECFP_4        10.5     10.3     10.3     10.4     10.7
FCFP_4        10.8     10.9     11.1     11.1     11.1
MDL           11.9     11.9     12.0     12.1     12.3
Unity         11.5     11.5     11.6     11.8     11.7

1d (WOMBAT)
Fingerprint    SS     TSS-10   TSS-20   TSS-50   TSS-100
BCI           39.0     39.6     39.8     40.0     40.0
Daylight      35.1     35.9     36.0     35.6     36.2
ECFP_4        47.2     48.6     49.5     50.6     51.9
FCFP_4        42.2     43.0     43.9     44.7     45.1
MDL           36.6     37.1     37.1     37.2     36.9
Unity         36.8     37.3     37.8     37.5     37.4
The results in Table 1 hence suggest that TSS can provide a simple way of improving the effectiveness of similarity searching for at least some types of fingerprints if the active molecules are not too structurally diverse. If they are diverse, as is the case with the MDDR-HET or NCI datasets, then Hert et al. suggest an alternative form of TSS – referred to here as TSS-SSA – in which the nearest neighbours from the basic SS search are processed using a machine-learning technique, rather than group fusion as discussed thus far [18]. Machine learning involves analysing a training set containing known active and inactive molecules and then developing a decision rule to rank the remaining test-set molecules in order of decreasing probability of activity. Hert et al. suggested that
Table 2. Similarity (SS) and turbo similarity (TSS-SSA) searches of (a) MDDR-HET and (b) NCI datasets

2a (MDDR-HET)
Fingerprint    SS     TSS-10   TSS-20   TSS-50   TSS-100
BCI           20.7     27.1     27.1     26.0     24.8
Daylight      18.3     25.0     23.3     21.7     21.0
ECFP_4        20.9     21.5     28.5     28.8     27.9
FCFP_4        20.2     18.3     24.0     25.9     25.5
MDL           20.2     26.5     25.5     24.2     23.4
Unity         16.6     25.1     23.4     21.1     19.8

2b (NCI)
Fingerprint    SS     TSS-10   TSS-20   TSS-50   TSS-100
BCI           12.1     12.9     11.9     11.2     11.5
Daylight      10.4     10.7      9.8      9.2      9.5
ECFP_4        10.5     14.5     11.8     10.4     10.4
FCFP_4        10.8     13.3     11.9     10.9     11.0
MDL           11.9     10.9     10.7     10.7     11.0
Unity         11.5     10.4      9.9      9.8     10.1
the nearest neighbours of the known reference structure could form the training-set’s actives with the remainder of the dataset forming the training-set inactives [18]. The decision rule is based on the technique known as substructural analysis (an early form of naïve Bayesian classifier). Substructural analysis (hereafter SSA) computes a weight for each bit in a fingerprint describing the corresponding fragment’s propensity to occur in active or in inactive molecules [21]. The weighting scheme used was the R2 weight, which has the form
R2 = log[ (Aj / NA) / (Ij / NI) ].
Here, Aj and Ij are the numbers of active and inactive training-set molecules with bit j set, and NA and NI are the numbers of active and inactive training-set molecules [22]. A molecule's overall score is the sum of the R2 weights for its constituent fragments, and the molecules in a dataset are ranked in decreasing order of the sum of scores. The results of using TSS-SSA are shown in Table 2, where it will be seen that the use of SSA, rather than of group fusion, in the second-stage search has brought about substantial increases in screening performance with all fingerprints for MDDR-HET; with NCI, substantial performance increases were obtained only with ECFP_4 and FCFP_4 in the TSS-10 searches. Taken together, the results in Tables 1 and 2 suggest that TSS can bring about substantial enhancements in virtual-screening performance in some cases, especially when the highly effective ECFP_4 fingerprint is used.
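A minimal sketch of this second-stage SSA scoring is given below. The small smoothing constant eps is an assumption added here to avoid taking logarithms of zero counts; it is not part of the R2 weight as defined above, and the random data are illustrative only.

```python
import numpy as np

def r2_weights(active_fps, inactive_fps, eps=1e-6):
    """R2 substructural-analysis weights, log[(Aj/NA) / (Ij/NI)], for every bit j."""
    NA, NI = len(active_fps), len(inactive_fps)
    Aj = active_fps.sum(axis=0)
    Ij = inactive_fps.sum(axis=0)
    return np.log(((Aj + eps) / NA) / ((Ij + eps) / NI))

def ssa_score(fps, weights):
    """A molecule's score is the sum of the weights of the bits that are set."""
    return fps @ weights

rng = np.random.default_rng(2)
db = (rng.random((1000, 256)) < 0.1).astype(int)
actives = db[:50]      # e.g. the nearest neighbours found by the first-level search
inactives = db[50:]    # remainder of the dataset treated as assumed inactives
w = r2_weights(actives, inactives)
ranking = np.argsort(-ssa_score(db, w))
print(ranking[:10])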
4 Use of Fragment Occurrence Data 4.1 Screening Performance Previous studies of weighted similarity searching have considered just the incidence (W1) and occurrence (W2) weighting schemes: here, we considered all similarity measures involving either of these, and those where both the reference structure and the database structures were weighted using W3, W4 or W5. In the following, a similarity measure Mab denotes a measure with weight a and b applied to the fingerprints of the database structures and of the reference structure, respectively. Searches were carried out using each of the 19 resulting similarity measures on both the MDDR and WOMBAT datasets, using ten different reference structures for each of the associated activity classes and using holograms, ECFC_4 and Sunset fingerprints. The recall values for these searches are shown in Table 3: the recall here is the mean number of actives retrieved in the top-5%, averaged over the ten reference structures for each class and then over all classes for each database. The scheme with the best mean recall in each column again has the value bold-faced and shaded.

Table 3. Similarity searches of the MDDR and WOMBAT databases using different weighting schemes

          MDDR                                WOMBAT
Weight    Holograms   ECFC_4   Sunset         Holograms   ECFC_4   Sunset
M11         120.8      211.9    162.0           118.9      188.2    157.2
M12         105.7      227.2    152.8           105.6      193.4    153.3
M13         145.3       95.2    143.6           143.6       85.1    137.1
M14         114.6      219.4    164.7           114.7      191.1    165.2
M15         141.5      183.3    135.0           140.3      163.7    131.7
M21          65.3      126.4     16.5            65.0      116.0     10.5
M22         187.2      185.8    127.0           152.5      165.8    139.3
M23         103.2       59.1     24.1            91.7       40.7     15.5
M24         132.2      142.8     32.2           120.0      133.7     24.5
M25          52.4       76.2     16.6            47.3       66.8      9.6
M31         123.5      197.6    165.3           115.5      154.3    154.9
M32         103.0      171.0     87.4           100.4      122.8     74.1
M33         178.8      166.7    151.8           156.1      158.9    159.7
M41         140.0      215.0     92.5           134.7      186.7     90.4
M42         146.0      213.7     95.6           137.0      172.2     84.0
M44         170.7      223.5    159.1           153.5      192.6    162.3
M51          93.3      226.8    157.8            95.5      196.0    160.4
M52         103.1      222.5    130.2           101.9      193.7    132.4
M55         130.7      208.3    161.8           127.1      188.8    157.7
It is possible to assess the consistency of the results using Kendall’s W test of statistical significance, which is used to evaluate the level of agreement between k different sets of ranked judgments of the same set of N different objects [23]. Here, we have considered each of the fingerprint/dataset combinations as a judge ranking the different similarity measures in order of decreasing effectiveness (as measured by the
recall values), i.e., k=6 and N=19. Converting the values in Table 3 to ranks, we obtain a value for W of 0.57, which is significant at the 0.01 level of statistical significance using a modified χ² test with N-1 degrees of freedom. Since a significant level of agreement has been achieved, the best overall ranking of the N objects is the objects' mean ranks when averaged over the k judges [23]. This gives the following ranking of the similarity measures: M44 > M14 > M33=M55 > M11=M12=M51 > M22 > M31 > M42 > M41 > M15 > M52 > M13 > M24 > M32 > M23 > M21 > M25. Thus M44 and M14 (both involving W4, the square root of the raw frequencies of occurrence) are at the top of the rankings; M11, M33, M55, M51 and M22 all do well; and M32, M21, M23, M24 and M25 perform very poorly. The work hence suggests that the inclusion of occurrence information can increase the effectiveness of current similarity searching systems, which predominantly use binary fingerprints. Of the various weighting schemes we have chosen, our results indicate the general effectiveness of the W4 scheme, which seeks to lessen the contribution made by the most frequently occurring fragments within a molecule.

Table 4. Mean values of the non-zero elements of each type of weighted fingerprint for the MDDR and WOMBAT fingerprints

            MDDR                                WOMBAT
Mean value  Holograms   ECFC_4   Sunset         Holograms   ECFC_4   Sunset
W1            1.00       1.00     1.00            1.00       1.00     1.00
W2            2.45       1.70     4.57            2.46       1.76     4.46
W3            1.04       1.07     1.43            1.04       1.08     1.41
W4            1.44       1.22     1.86            1.44       1.24     1.84
W5            0.60       0.61     0.57            0.60       0.61     0.57
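The concordance analysis described above (six fingerprint/dataset "judges" ranking the nineteen similarity measures) can be sketched as follows. The table used in the example is random and purely illustrative, and no tie correction is applied, so the output will not reproduce the W = 0.57 obtained from Table 3.

```python
import numpy as np
from scipy.stats import rankdata, chi2

def kendalls_w(recalls):
    """Kendall's coefficient of concordance for a (judges x objects) table of scores."""
    # rank the similarity measures within each judge (higher recall -> better rank)
    ranks = np.vstack([rankdata(-row) for row in recalls])
    k, N = ranks.shape
    R = ranks.sum(axis=0)                        # total rank of each object
    S = ((R - R.mean()) ** 2).sum()
    W = 12.0 * S / (k ** 2 * (N ** 3 - N))       # no tie correction in this sketch
    chi_sq = k * (N - 1) * W                     # approximate chi-squared statistic
    p = chi2.sf(chi_sq, df=N - 1)
    return W, p

# illustrative only: 6 judges (fingerprint/dataset combinations) x 19 similarity measures
rng = np.random.default_rng(3)
table = rng.random((6, 19)) * 200
W, p = kendalls_w(table)
print(f"W = {W:.2f}, p = {p:.3f}")
```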
4.2 Analysis of Similarity Measures We can draw two further conclusions from our results: that symmetric similarity measures (i.e., measures Mab where a=b) tend to do better than asymmetric measures (i.e., where a≠b); and that many of the measures involving W2 perform very badly. These conclusions may be rationalized by considering the interactions that occur when two weighting schemes a and b are combined to form a measure Mab and when the resulting measure is used to compute the Tanimoto similarity coefficient. The basic form of the Tanimoto coefficient for molecules X and Y is
SXY = Σ xiyi / (Σ xi² + Σ yi² − Σ xiyi),
where the summations are over the non-zero elements in each fingerprint. If a molecule is matched with itself and if a symmetric measure is used, then xi=yi for all i and the Tanimoto coefficient has the value of unity, which is the upper-bound value for this coefficient. However, the upper-bound may be less than unity if an asymmetric
measure is used, as we now demonstrate. Assume that all fragments in a molecule occur equifrequently, and are thus assigned the same weight, WNZ, which is the mean value of the non-zero elements in a molecule’s fingerprint when that molecule is weighted using some particular weighting scheme. Then the self-similarity for a molecule X using the measure Mab, with weights WNZ(a) and WNZ(b), is
SXX = Σ WNZ(a)WNZ(b) / (Σ WNZ(a)² + Σ WNZ(b)² − Σ WNZ(a)WNZ(b)).
Values for WNZ using each of the schemes W1-W5 for the two datasets are shown in Table 4, and these can be used to compute the similarities SXX. For example, if using the MDDR holograms and the W1 and W2 weights, then the values of WNZ from the table are 1.00 and 2.45, respectively: this gives an upper-bound of 0.54 to the self-similarity of a molecule in the W1 representation with itself in the W2 representation (i.e., M12). This value can be compared with the corresponding M12 upper-bounds for MDDR Sunset (0.26) and MDDR ECFC_4 (0.78), demonstrating the wide range of upper-bound values for the same similarity measure that is obtained using the different fingerprints. Analogous upper-bounds can be computed using the data in Table 4 for all of the other measures Mab: these computations show that combinations of the form M2b have low upper-bounds for all three types of fingerprint. Thus, if there is a large discrepancy in the weights computed using the two weighting schemes involved in the chosen similarity measure then there will be a much smaller range of possible similarity values than if the weights are of comparable magnitude. If only a limited range of values is available to the coefficient, then the ranking will be less discriminating, resulting in the poor (and in some cases very poor) screening performance that is demonstrated in Table 3 for some combinations of similarity measure and representation, e.g., WOMBAT Sunset M21 and M25. The similarity analysis above is grossly simplified in that it considers self-similarities (rather than the similarities between a reference structure and a database structure) and it considers only upper-bound values (which are likely to differ from the largest similarities that are actually obtained during a similarity search). Even so, more detailed examination demonstrates the general correctness of the analysis above, with the similarity behaviour observed here mirroring that obtained in searches of entire databases (rather than in self-similarity calculations) using actual (rather than upper-bound) similarities: this more detailed work will be reported shortly. We hence conclude that the upper-bound value for the Tanimoto coefficient depends on the natures of the weighting schemes a and b: if a=b then the upper-bound will be unity; however, if this is not the case and the corresponding weights differ substantially, then the upper-bound can be markedly less than unity. This implies a reduction (and in some cases, a severe reduction) in the discriminatory power of the resulting similarity measure when it is used for virtual screening.
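The upper-bound argument can be checked numerically in a few lines. The sketch below uses the mean non-zero weights from Table 4 and reproduces the M12 upper-bounds quoted above (0.54, 0.26 and 0.78); the assumption that all fragments occur equifrequently is the one made in the analysis itself.

```python
def tanimoto_continuous(x, y):
    """Tanimoto coefficient for weighted (non-binary) fingerprints."""
    num = sum(a * b for a, b in zip(x, y))
    return num / (sum(a * a for a in x) + sum(b * b for b in y) - num)

def self_similarity_bound(w_a, w_b, n_bits=10):
    """Upper bound on S_XX when the two representations use mean non-zero weights w_a and w_b."""
    x = [w_a] * n_bits   # equifrequent fragments, as assumed above; the result is independent of n_bits
    y = [w_b] * n_bits
    return tanimoto_continuous(x, y)

# mean non-zero element values from Table 4 (MDDR)
print(round(self_similarity_bound(1.00, 2.45), 2))   # holograms, M12 -> 0.54
print(round(self_similarity_bound(1.00, 4.57), 2))   # Sunset,    M12 -> 0.26
print(round(self_similarity_bound(1.00, 1.70), 2))   # ECFC_4,    M12 -> 0.78
```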
5 Conclusions Similarity-based approaches are widely used for virtual screening. Conventional similarity searching involves using a binary fingerprint describing a bioactive reference
structure to rank a chemical database in order of decreasing probability of activity. In this paper, we have described two ways in which the conventional approach can be enhanced: turbo similarity searching based on identifying and then exploiting the reference structure’s nearest neighbours; and taking account of fragments’ frequencies of occurrence in molecules. The search results in Tables 1 and 2 show that turbo similarity searching based on a consensus approach called group fusion can provide substantial enhancements in screening performance if the normal similarity search provides a good starting point, i.e., if the similar property principle holds and if the actives are well clustered using the chosen structure representation and similarity measure. This was particularly the case in the searches based on the ECFP_4 fingerprint; indeed, this would appear to be the representation of choice for similarity-based virtual screening using binary fingerprints. The search results in Table 3 show that fingerprint representations encoding the occurrence-frequencies of fragment substructures can perform much better than conventional binary fingerprints in similarity-based screening, especially using symmetric similarity measures that include the W4 square-root weight; that said, some other combinations of weights can perform very badly. An upper-bound analysis provides a rationalization of the observed variations in performance, this demonstrating the subtle interactions that may occur between the representation and the weighting scheme when a chemical similarity measure is created. Current work on similarity-based virtual screening includes considering alternative consensus rules for the implementation of the group fusion stage of TSS, and the use of different similarity coefficients for weighted fingerprint searching. Acknowledgements. We thank the following: Kristian Birchall for assistance with the WOMBAT data; the Government of Malaysia, and the Novartis Institutes for Biomedical Research for funding; and Accelrys Software Inc., Daylight Chemical Information Systems Inc., Digital Chemistry Limited, the Royal Society, SciTegic Inc., Sunset Molecular Discovery LLC, Symyx Technologies Inc., Tripos Inc. and the Wolfson Foundation for data, software and laboratory support.
References 1. Stahura, F.L., Bajorath, J.: Virtual Screening Methods That Complement High-Throughput Screening. Combin. Chem. High-Through. Screening 7, 259–269 (2004) 2. Alvarez, J., Shoichet, B. (eds.): Virtual Screening in Drug Discovery. CRC Press, Boca Raton (2005) 3. Eckert, H., Bajorath, J.: Molecular Similarity Analysis in Virtual Screening: Foundations, Limitation and Novel Approaches. Drug Discov. Today 12, 225–233 (2007) 4. Willett, P.: Similarity Methods in Chemoinformatics. Ann. Rev. Inform. Sci. Technol. 43, 3–71 (2009) 5. Sheridan, R.P., Kearsley, S.K.: Why Do We Need So Many Chemical Similarity Search Methods? Drug Discov. Today 7, 903–911 (2002) 6. Nikolova, N., Jaworska, J.: Approaches to Measure Chemical Similarity - a Review. QSAR Combin. Sci. 22, 1006–1026 (2003) 7. Maldonado, A.G., Doucet, J.P., Petitjean, M., Fan, B.-T.: Molecular Similarity and Diversity in Chemoinformatics: From Theory to Applications. Mol. Diversity 10, 39–79 (2006)
8. Glen, R.C., Adams, S.E.: Similarity Metrics and Descriptor Spaces - Which Combinations to Choose? QSAR Combin. Sci. 25, 1133–1142 (2006) 9. Sheridan, R.P.: Chemical Similarity Searches: When Is Complexity Justified? Expert Opin. Drug Discov. 2, 423–430 (2007) 10. Hert, J., Willett, P., Wilton, D.J., Acklin, P., Azzaoui, K., Jacoby, E., Schuffenhauer, A.: Enhancing the Effectiveness of Similarity-Based Virtual Screening Using NearestNeighbour Information. J. Med. Chem. 48, 7049–7054 (2005) 11. Johnson, M.A., Maggiora, G.M. (eds.): Concepts and Applications of Molecular Similarity. John Wiley, New York (1990) 12. Martin, Y.C., Kofron, J.L., Traphagen, L.M.: Do Structurally Similar Molecules Have Similar Biological Activities? J. Med. Chem. 45, 4350–4358 (2002) 13. Hert, J., Willett, P., Wilton, D.J., Acklin, P., Azzaoui, K., Jacoby, E., Schuffenhauer, A.: Comparison of Fingerprint-Based Methods for Virtual Screening Using Multiple Bioactive Reference Structures. J. Chem. Inf. Comput. Sci. 44, 1177–1185 (2004) 14. Whittle, M., Gillet, V.J., Willett, P., Alex, A., Loesel, J.: Enhancing the Effectiveness of Virtual Screening by Fusing Nearest Neighbor Lists: A Comparison of Similarity Coefficients. J. Chem. Inf. Comput. Sci. 44, 1840–1848 (2004) 15. Willett, P.: Data Fusion in Ligand-Based Virtual Screening. QSAR Combin. Sci. 25, 1143–1152 (2006) 16. Goldman, B.B., Walters, W.P.: Machine Learning in Computational Chemistry. Ann. Report. Comput. Chem. 2, 127–140 (2006) 17. Willett, P., Winterman, V.: A Comparison of Some Measures of Inter-Molecular Structural Similarity. Quant. Struct.-Activ. Relat. 5, 18–25 (1986) 18. Hert, J., Willett, P., Wilton, D.J., Acklin, P., Azzaoui, K., Jacoby, E., Schuffenhauer, A.: New Methods for Ligand-Based Virtual Screening: Use of Data-Fusion and MachineLearning Techniques to Enhance the Effectiveness of Similarity Searching. J. Chem. Inf. Model. 46, 462–470 (2006) 19. Gardiner, E.J., Gillet, V.J., Haranczyk, M., Hert, J., Holliday, J.D., Malim, N., Patel, Y., Willett, P.: Turbo Similarity Searching: Effect of Fingerprint and Dataset on VirtualScreening Performance. Stat. Anal. Data Mining (in press, 2009) 20. Salton, G., Buckley, C.: Term-Weighting Approaches in Automatic Text Retrieval. Inf. Proc. Manag. 24, 513–523 (1988) 21. Cramer, R.D., Redl, G., Berkoff, C.E.: Substructural Analysis. A Novel Approach to the Problem of Drug Design. J. Med. Chem. 17, 533–535 (1974) 22. Ormerod, A., Willett, P., Bawden, D.: Comparison of Fragment Weighting Schemes for Substructural Analysis. Quant. Struct.-Activ. Relat. 8, 115–129 (1989) 23. Siegel, S., Castellan, N.J.: Nonparametric Statistics for the Behavioural Sciences. McGraw-Hill, New York (1988)
Patterns, Movement and Clinical Diagnosis of Abdominal Adhesions
Benjamin Wright1, John Fenner1, Richard Gillott2, Paul Spencer2, Patricia Lawford1, and Karna Dev Bardhan2
1 University of Sheffield, UK
2 Rotherham General Hospital, UK
Abstract. Patterns in normal abdominal movement captured with medical imaging can be recognised by a trained radiologist but the process is time consuming. Abdominal adhesions present a diagnostic problem in which the radiologist is asked to detect abnormal movement that may be indicative of pathology. This paper postulates that the use of image analysis can augment the diagnostic abilities of the radiologist in respect of adhesions. Proof of concept experiments were conducted in-silico to explore the effectiveness of the technique. The results indicate that trained participants are accurate in their assessment of abnormalities when supplied with additional information from image analysis techniques. However, without the additional information, participants made incorrect diagnoses on many occasions. ROC methods were used to quantify the outcomes of the in-silico experiment. Keywords: Pattern recognition, Image analysis, Abdominal Adhesions, In-silico modelling.
1
Introduction
Abdominal adhesions are fibrous bands of connecting tissue that can result from injury to the abdominal contents [1]. Often a direct result of mechanical injury during surgery [2], they adhere anatomical components to one-another or to the abdominal wall and as a result can inhibit the normal function of the abdomen [3,4]. Adhesions often lie unnoticed, however, when symptoms present they are frequently diagnosed only through a process of exclusion of other, more common, disorders [5]. They are structurally similar to surrounding tissues and are often volumetrically insubstantial, producing insufficient signal for direct detection using non-invasive medical imaging. However, it is a premise of this work that imaging can be used for diagnosis by identifying patterns of movement that are characteristic of adhesions. This is supported by the work of other groups that promote the use of Magnetic Resonance Imaging (MRI) [6,7] or Ultrasound Scanning (US) [8] as an effective means of non-invasive diagnosis. Both techniques
The authors would like to thank the Bardhan Research and Education Trust of Rotherham (BRET) and the Engineering and Physical Sciences Research Council (EPSRC) for their financial support of this work.
have advantages but this paper focuses on MRI as the technique of choice and discusses image processing based data reduction techniques to aid the pattern recognition task associated with identifying signatures of disturbed movement caused by adhesions. 1.1
Signatures Observed Using MRI
The most effective protocol for non-invasive diagnosis of adhesions requires that a trained radiologist monitors the movement of the abdominal contents as captured by digital imaging [9]. This is performed using a series of 2-dimensional cine MRI acquisitions. Each cine MRI is a collection of approximately 15 time-sequential 2-dimensional planar images that cover the respiratory cycle of the patient. To examine the whole abdominal cavity, approximately 15 sagittal and transverse planes are acquired. A total of 30 cine MRI scans comprising of more than 450 individual 2-dimensional MRI scans are then presented to the radiologist. The radiologist observes the movement of the abdominal contents throughout the cine sequences in an attempt to detect any abnormal patterns of movement. This is a process that is very time consuming and raises issues about reproducibility. 1.2
Image Analysis
This work postulates that image analysis techniques can be used to augment the aforementioned diagnostic procedures. Movement can be quantified using image registration techniques, so that images in the cine sequence are registered to other temporally consecutive images. A gradient-based registration method matches an image pair and produces a vector map of the transformation required to move from one to the other [10]. The vector map is a continuous field that describes the mapping of structures from one image to equivalent structures in the second image and minimises a cost function appropriate to the medical imaging modality. The vector map can subsequently be analysed and visualised in forms that promote recognition of characteristic signatures, which may be indicative of adhesion induced movement disturbances. Hypothesis. The use of appropriate image analysis methods can augment the diagnostic efficacy of the radiologist through reduction of the pattern recognition task.
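As a concrete illustration of the vector-map analysis described above, the sketch below derives the two display quantities used later in this work (the magnitude of the displacement vectors and the local gradient of that magnitude) from a registration vector field. The registration step itself (e.g., with ShIRT) is not reproduced, and the toy displacement field is an assumption of this sketch.

```python
import numpy as np

def movement_maps(u, v):
    """Derive display maps from a registration vector field (u, v) defined per pixel."""
    magnitude = np.hypot(u, v)         # length of the displacement vector at each pixel
    gy, gx = np.gradient(magnitude)    # local gradients of the magnitude image
    gradient = np.hypot(gx, gy)
    return magnitude, gradient

# toy field: smooth motion plus a localised disturbance standing in for an adhesion
ny, nx = 64, 64
yy, xx = np.mgrid[0:ny, 0:nx]
u = 0.5 * np.ones((ny, nx))
v = 2.0 * np.exp(-((xx - 20) ** 2 + (yy - 40) ** 2) / 50.0)
mag, grad = movement_maps(u, v)
print(mag.max(), grad.max())
```

In practice the two maps would be rendered with a colour scale and presented alongside the cine frames, as shown in Fig. 2 below.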
2
Method
The hypothesis was examined using in-silico models in which a virtual representation of the diagnostic challenge was trialled with numerous observers. The model, which is based on a 2-dimensional finite element structural system, comprises 4-node square elements and is complemented with synthesised images that were derived from anatomical features. Each structural element featured the same base stiffness, but was perturbed with additional noise of up to ±30%. The structural nature of the mesh enabled forces to be applied at the nodes to
Fig. 1. In-silico images. (a) The rest image that is added to the finite element mesh. (b) Maximum force has been applied to the mesh, creating the stretched image.
produce distortions. This facilitated crude representation of abdominal movement, subject to the physiological forces experienced from diaphragmatic movement. An image (Fig. 1) was laid over the finite element mesh and image pixel co-ordinates computed relative to each element. When subjected to specified forces, the mesh distortions were computed and interpolation techniques were used to calculate and redraw a consistent distorted image. By changing the strength of the forces applied to the system, multiple distortions were created and the resulting sequence of 20 images collated into a cine sequence. The movement of the image over time was analogous to images generated in the clinical setting. The design of the model permits modification of the stiffness of particular elements and therefore is capable of introducing anomalous movement and disturbances. The task of identifying stiff elements from features in the moving image was intended to reflect the clinical task of diagnosing abdominal movement disturbances caused by the presence of adhesions and was considered by consultant radiologists to be suitably challenging. 2.1
Image Analysis
Image registration methods were used to compare temporally consecutive images from the in-silico cine sequence. Registration involves the mapping of positions in a source image to their relative positions in a target image. This is an iterative process that involves transforming the source image and measuring the difference between the transformed and the target image. The process continues until an acceptable match is found and an appropriate cost function for the registration minimised. The Sheffield Image Registration Toolkit (ShIRT) was used in this work to perform the registration [11]. Once registration is complete, a vector mapping indicating the required transformation can be analysed. Image registration requires that the difference between the images is small enough for correct estimation of the required transformation. Subsequent analysis of the vector mapping is used to present the data. In this work, the magnitude of the
vectors and the local gradients of the vector magnitudes were considered to be appropriate indicators of movement. Colour scales were used to present the data. 2.2
Isis
These elements were encapsulated in a software package called Isis (In-Silico Investigation System), created to test the effectiveness of the technique. It was specifically designed to present a participant with a cine sequence that randomly featured the presence/absence of an adhesion and asked them to diagnose the condition as presented. This included recording the certainty of diagnosis and was performed with and without the additional information provided by the image analysis. Before conducting the tests, each participant was required to take part in a short training programme. The software for the training was bundled with Isis and featured a similar user interface. Two modes were available to the participant, the first of which was a demonstration mode. In this mode the participant was presented with two cine data sets; one of these featured an adhesion and the other did not. The participant was told which was which and was able to study the differences in both the cine data and the additional vector information provided by the image analysis. Once the participant was satisfied with their ability to detect the presence of an adhesion, the second training mode could
Fig. 2. Isis graphical user interface. Cine data shown in the left hand pane of the window. Optional information provided by image analysis features on the right hand side. The top right pane shows a contour plot of vector magnitude and the bottom right pane shows a gradient based image.
be used. In this mode the participant was presented with a random cine data set. Their challenge was to be able to identify the presence of an adhesion with adequate certainty and consistency. Successful diagnoses at this stage enabled the participant to progress to the main Isis application for quantification of their diagnostic performance. 2.3
Test Protocol
To evaluate the effectiveness of the technique, Isis was used to educate and examine two medically trained participants with 80 cine data sets in a randomised order. Similar to the ratio experienced by radiologists, half of the data sets featured an adhesion and the other half did not. In order to maintain simplicity in the test, the adhesion was always in the same location if present. Each participant was asked to diagnose the presence/absence of an adhesion in all cases with and without the additional information provided by the image processing. When making a diagnosis the participant was asked to quantify their certainty of diagnosis using a subjective four-level scale featuring the levels: Pure guess, Possibly, Probably and Definitely. Isis recorded the participants response, their time taken to respond and the level of certainty. The information was recorded to an output file and later examined using Receiver Operating Characteristic (ROC) methods [12].
3
Results
Isis scores as recorded for the two participants were analysed as described above. The resulting data points were plotted in an ROC space. Data was also available on the time taken for the user to make a decision and their certainty when making it. 3.1
ROC Data
ROC methods were employed to evaluate the effectiveness of the additional information, provided by the image analysis, available to the participant. Total numbers of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) were calculated using the data provided by Isis. The true positive and false positive rates (TPR and FPR respectively) were calculated using the following standard formulae [12]:
TPR = TP / (TP + FN)    (1)
FPR = FP / (FP + TN)    (2)
Figures 3 and 4 display the corresponding ROC curves which plot TPR against FPR for the different decision thresholds recorded in the test. Where appropriate,
Fig. 3. ROC Data for Participant 1 gathered by Isis, where TPR = true positive rate and FPR = false positive rate. Dashed line represents the line of no discrimination. Data is presented for the test without and with additional image processing information, (a) and (b) respectively. Polynomial best fit line shown for Unaided data. Areas under the ROC curves are 0.63 and 1.00, respectively.
Fig. 4. ROC Data for Participant 2 gathered by Isis, where TPR = true positive rate and FPR = false positive rate. Dashed line represents the line of no discrimination. Data is presented for the test without and with additional image processing information, (a) and (b) respectively. Polynomial best fit line shown for Unaided data. Areas under the ROC curves are 0.58 and 1.00, respectively.
a second order polynomial line of best fit has been added. The results show that, rather surprisingly, with the additional information provided by the image analysis both participants were able to obtain perfect scores when diagnosing the presence of an adhesion. Both curves display the line of no discrimination and it is seen that all results lie above this. 3.2
Data Tables
The Isis output file contains information regarding the participants' certainty in their choice of diagnosis and the time taken to make a decision. This information is displayed in Tables 1 and 2.

Table 1. Showing the number of times a diagnosis was made with each level of certainty

                      Participant 1                  Participant 2
                      Image analysis status          Image analysis status
Level of certainty    Unavailable    Available       Unavailable    Available
Pure guess                 21             0                6             0
Possibly                   48             1               58             1
Probably                   11            14               16            16
Definitely                  0            65                0            63
Table 2. Showing the average time taken to make a diagnosis

                      Participant 1                  Participant 2
                      Image analysis status          Image analysis status
Time (sec)            Unavailable    Available       Unavailable    Available
Average time taken     5.6 ± 0.5      1.5 ± 0.5      10.8 ± 0.5     2.5 ± 0.5

4 Discussion
The data confirms the supposition of the hypothesis that the use of appropriate image analysis can augment the diagnostic efficacy of the radiologist through reduction of the pattern recognition task. Participants not supported with image analysis are consistently poor in their diagnosis. In-silico results imply that the technique may have potential to be applied to clinical data where occurrences of adhesions are common after surgery. Radiological methods that allow noninvasive diagnosis of the condition present a difficult and time consuming pattern recognition challenge and this has been replicated in-silico. Furthermore, image analysis has demonstrated the ability to reduce this task through registration of temporally consecutive frames and subsequent presentation of the registration information.
4.1
In-Silico Model Critique
The clinical methodology and its augmentation by image processing has been evaluated using in-silico models designed to capture the essence of free and disturbed abdominal motion during respiration. The motion observed in the model achieves this but its simplicity and structural foundation is associated with numerous limitations. Unlike the abdomen, the model does not currently allow for structures to slide past one another [14]. Instead they are connected by elements that can be made flexible but offer no opportunity for slide. This limits interpretation in the context of clinical images, but nonetheless encapsulates the essence of diagnosing an underlying structural abnormality by observing disturbances to normal movement. The images for this exercise were based on preliminary tests that indicated, in the absence of image processing support, the participant found it increasingly difficult to identify anomalous areas of raised stiffness as the complexity of the image increased. The chosen image was a compromise; being simple enough to offer the participant the best possible chance of detecting a disturbance and yet providing an image with sufficient information content for the image processing algorithm to operate effectively [11]. The structural model can be enhanced beyond its current state. Variables such as the shape of elements and individual stiffness can be altered to make the system bear a closer resemblance to the clinical anatomy. The planar nature of clinical MRI means that sometimes 3-dimensional anatomical structures can move in and out of the imaging plane. The radiologist recognises these occurrences and can cross planes to track anatomical structures. This is something that should be addressed for future in-silico models. Forces applied to the model were simple and linear unlike the forces seen in the abdomen. This produced consistent distortions in the image and made the pattern recognition task easier when considering the image analysis visualisations. 4.2
Experimental Results
Isis was used to evaluate the effectiveness of the proposed diagnostic augmentations. Participants in the test were asked to operate the software which would record their diagnosis of the presence/absence of an adhesion. The test was completed with and without the additional information provided by the image analysis. As seen from the results, the additional information significantly enhanced the diagnostic capability of the participant. The area under the ROC curve is an accepted measurement that quantifies the effectiveness of diagnosis [12]. The area is measured as 1 if the operator is always correct and 0 if they are never correct, an area of 0.5 is equivalent to random guessing and represented by the line of no discrimination. Participant 1 had an area under the curve of 0.63 when not provided with image analysis data. When provided with the additional information, they were able to achieve a perfect score of 1.00 meaning their diagnosis of anomalous stiffness, based on image movement, was always correct. Participant 2 had an area under the curve of 0.58 when not provided with additional information and an area of 1.00 when provided with the image analysis.
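The ROC quantities discussed above can be computed directly from the recorded diagnoses. The sketch below builds operating points from equations (1) and (2) by sweeping a decision threshold over per-trial scores and integrates the curve with the trapezoidal rule; the scores and labels shown are illustrative only and are not the Isis data.

```python
import numpy as np

def roc_points(scores, labels):
    """TPR/FPR operating points obtained by sweeping a decision threshold over the scores."""
    order = np.argsort(-scores)
    labels = np.asarray(labels)[order]
    P = labels.sum()
    N = len(labels) - P
    tpr = np.concatenate(([0.0], np.cumsum(labels) / P))
    fpr = np.concatenate(([0.0], np.cumsum(1 - labels) / N))
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# illustrative data only: 1 = adhesion present; scores combine diagnosis and certainty
labels = np.array([1, 1, 0, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.7, 0.6, 0.55, 0.4, 0.35, 0.3, 0.1])
f, t = roc_points(scores, labels)
print(auc(f, t))
```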
The perfect scores for both participants when provided with additional information from image analysis might indicate that the diagnostic challenge was too easy, but the score recorded when not provided with this information does not support this. Both participants recorded scores that deviated slightly from the line of no discrimination indicating some success in diagnosis without the additional information. This is supported by the data in table 1 which shows that when provided with additional information from image analysis both participants were much more certain of their diagnosis. Furthermore, when account is taken of the time for a diagnosis to be made, the participants were much quicker when diagnosis was supported with image analysis information. 4.3
Technique
The experimental results confirm the hypothesis in this synthetic evaluation. However, it is clear that the in-silico experiment is far removed from the clinical setting. Nonetheless radiologists were consulted throughout the study and from concept to experimental procedure they confirmed that it resonated with the clinical setting. This included the clinical challenge of analysing a vast data set and concerns about reproducibility when additional information from image analysis is absent. When exposed to the software, the radiologists displayed similar results to the medical trainees who made a diagnosis on a total of 160 animations and who also demonstrated misdiagnosis with a degree of confidence; they convinced themselves of distorting movements when in fact there were none. In a clinical context such misdiagnosis could have serious implications for unnecessary surgery and adhesiolysis. In this simulation, the in-silico adhesion was always in the same place if present. Arguably this is similar to congenitally formed adhesions that attach the liver to the diaphragm for example. Radiologists are often presented with a much more widespread and diffuse diagnostic challenge [9]. This could be replicated in the current in-silico model by randomly placing the stiff element elsewhere within the image. However, the clinical problem also presents adhesions in different forms which will precipitate different movement patterns. Again the current model could be adapted to accommodate for this through modification of the underlying finite element mesh and its flexibility. 4.4
Clinical Application
In essence the image analysis methods explored in this paper are data reduction techniques. Vast amounts of complex visual data in the form of cine sequences have been reduced to simpler 2-dimensional plots. This is a typical strategy of pattern recognition solutions and frequently include additional techniques designed to characterise the image under investigation. The complexity of underlying movement disturbances hidden within MRI radiological image (Fig. 5) sequences makes this a very demanding task for any pattern recognition software; but the technique as presented here separates the challenge by isolating data reduction from explicit feature extraction. It is easy for the complex moving
Fig. 5. Contrast enhanced clinical MRI image. Sagittal slice through volunteer showing abdominal contents.
images in the clinical setting to overwhelm the observer but this is dramatically simplified with the data reduction of the image processing which leaves the feature extraction and final classification to the trained eye of the radiologist. This technique does not attempt to completely automate detection of adhesions, but offers to guide the observer to movement anomalies in support of diagnosis.
5
Conclusion
This paper has described a simulation scenario that encapsulates important elements associated with clinical diagnosis of adhesions. The method captures the challenge of identifying anomalous movement within a background of complex, normal motion. The application of image processing support has demonstrated the effectiveness of such a tool in guiding the observer to a correct diagnosis. The technique offers hope for the diagnosis of clinical adhesions by similar methods. It confirms the hypothesis that the use of appropriate image analysis methods can augment the diagnostic efficacy of the radiologist through simplification of the pattern recognition task. Further work to improve the realism of the models is ongoing.
References 1. Boland, G.M., Weigel, R.J.: Formation and prevention of postoperative abdominal adhesions. J. Surg. Res. 132(1), 3–12 (2006) 2. Vrijland, W.W., Jeekel, J., van Geldorp, H.J., Swank, D.J., Bonjer, H.J.: Abdominal Adhesions: Intestinal obstruction, pain, and infertility. Surg. Endosc. 17(7), 1017–1022 (2003)
3. Cheong, Y.C., Laird, S.M., Li, T.C., Shelton, J.B., Ledger, W.L., Cooke, I.D.: Peritoneal healing and adhesion formation/reformation. Hum. Reprod. Update 7(6), 556–566 (2001) 4. Diamond, M.P., Freeman, L.M.: Clinical implications of postsurgical adhesions. Hum. Reprod. Update 7(6), 567–576 (2001) 5. Swank, D.J., Swank-Bordewijk, S.C., Hop, W.C., van Erp, W.F., Janssen, I.M., Bonjer, H.J., Jeekel: Laparoscopic adhesiolysis in patients with chronic abdominal pain: a blinded randomised controlled multi-centre trial. Lancet 361(9365), 1247– 1251 (2003) 6. Katayama, M., Masui, T., Kobayashi, S., Ito, T., Sakahara, H., Nozaki, A., Kabasawa, H.: Evaluation of pelvic adhesions using multiphase and multislice MR imaging with kinematic display. Am. J. Roentgenology 177(1), 107–110 (2001) 7. Lang, R.A., Buhmann, S., Hopman, A., Steitz, H.O., Lienemann, A., Reiser, M.F., Jauch, K.W., Huttl, T.P.: Cine-MRI detection of intraabdominal adhesions: correlation with intraoperative findings in 89 consecutive cases. Surg. Endosc. 22(11), 2455–2461 (2008) 8. Caprini, J.A., Arcelus, J.A., Swanson, J., Coats, R., Hoffman, K., Brosnan, J.J., Blattner, S.: The ultrasonic localization of abdominal wall adhesions. Surg. Endosc. 9(3), 283–285 (1995) 9. Mussack, T., Fischer, T., Ladurner, R., Gangkofer, A., Bensler, S., Hallfeldt, K.K., Reiser, M., Lienemann, A.: Cine magnetic resonance imaging vs high-resolution ultrasonography for detection of adhesions after laparoscopic and open incisional hernia repair: a matched pair pilot analysis. Surg. Endosc. 19(12), 1538–1543 (2005) 10. Crum, W.R., Hartkens, T., Hill, D.L.: Non-rigid image registration: theory and practice. Br. J. Radiol. 77(2), S140–S153 (2004) 11. Barber, D.C., Hose, D.R.: Automatic segmentation of medical images using image registration: diagnostic and simulation applications. J. Med. Eng. Technol. 29(2), 53–63 (2005) 12. Obuchowski, N.A.: Receiver operating characteristic curves and their use in radiology. Radiology 229(1), 3–8 (2003) 13. Ellis, H., Moran, B.J., Thompson, J.N., Parker, M.C., Wilson, M.S., Menzies, D., McGuire, A., Lower, A.M., Hawthorn, R.J., O’Brien, F., Buchan, S., Crowe, A.M.: Adhesion-related hospital readmissions after abdominal and pelvic surgery: a retrospective cohort study. Lancet 353(9163), 1476–1480 (1999) 14. Tan, H.L., Shankar, K.R., Ade-Ajayi, N., Guelfand, M., Kiely, E.M., Drake, D.P., De Bruyn, R., McHugh, K., Smith, A.J., Morris, L., Gent, R.: Reduction in visceral slide is a good sign of underlying postoperative viscero-parietal adhesions in children. J. Pediatr. Surg. 38(5), 714–716 (2003)
Class Prediction from Disparate Biological Data Sources Using an Iterative Multi-Kernel Algorithm Yiming Ying1 , Colin Campbell1 , Theodoros Damoulas2 , and Mark Girolami2 1 Department of Engineering Mathematics, University of Bristol, Bristol BS8 1TR, United Kingdom 2 Department of Computer Science, University of Glasgow, Glasgow, G12 8QQ, United Kingdom [email protected], [email protected], [email protected], [email protected]
Abstract. For many biomedical modelling tasks a number of different types of data may influence predictions made by the model. An established approach to pursuing supervised learning with multiple types of data is to encode these different types of data into separate kernels and use multiple kernel learning. In this paper we propose a simple iterative approach to multiple kernel learning (MKL), focusing on multi-class classification. This approach uses a block L1 -regularization term leading to a jointly convex formulation. It solves a standard multi-class classification problem for a single kernel, and then updates the kernel combinatorial coefficients based on mixed RKHS norms. As opposed to other MKL approaches, our iterative approach delivers a largely ignored message that MKL does not require sophisticated optimization methods while keeping competitive training times and accuracy across a variety of problems. We show that the proposed method outperforms state-of-the-art results on an important protein fold prediction dataset and gives competitive performance on a protein subcellular localization task. Keywords: Multiple kernel learning, multi-class, bioinformatics, protein fold prediction, protein subcellular localization.
1
Introduction
Kernel methods [15,16] have been successfully used for data integration across a number of biological applications. Kernel matrices encode the similarity between data objects within a given space. Data objects can include network graphs and sequence strings in addition to numerical data: all of these types of data can be encoded into kernels. The problem of data integration is therefore transformed into the problem of learning the most appropriate combination of candidate kernel matrices and typically a linear combination is used. This is often termed multi-kernel learning (MKL) in Machine Learning and, due to its practical importance, it has recently received increased attention. Lanckriet et al. [9] proposed a semi-definite programming (SDP) approach to automatically learn a
linear combination of candidate kernels for SVMs. This approach was improved by Bach et al. [3] who used sequential minimization optimization (SMO) and by Sonnenburg et al. [18] who reformulated it as a semi-infinite linear programming (SILP) task. In [11], the authors studied the kernel learning problem for a convex set of possibly infinite kernels under a general regularization framework. Other approaches include the COSSO estimate for additive models [10], Bayesian probabilistic models [5,8], kernel discriminant analysis [20], hyperkernels [12] and kernel learning for structured outputs [21]. Such MKL formulations have been successfully demonstrated in combining multiple data sources to enhance biological inference [5,9]. Most of the above MKL methods were for binary classification. In Section 2 we build on previous contributions [1,3,10,11,13,21] to propose a simple iterative kernel learning approach focusing on multi-class problems. This formulation employs a mixed RKHS norm over a matrix-valued function which promotes common information across classes. We demonstrate that this problem is jointly convex, laying down the theoretical basis for its solution using an extremely simple iterative method. This approach solves a multi-class classification problem for a single kernel, and then updates the kernel combinatorial coefficients based on the mixed RKHS norms. As opposed to other multi-kernel approaches, our iterative approach delivers an important message that MKL does not require sophisticated optimization methods while keeping competitive training times and accuracy across a wide range of problems. In Section 3 we briefly validate our method on UCI benchmark multi-class datasets before applying it to two multiclass multi-feature bioinformatics problems: protein fold recognition and protein subcellular localization.
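A minimal sketch of the iterative scheme summarised above is given below: a standard single-kernel multi-class problem (kernel ridge regression with one-versus-all targets in this sketch) is solved for the current kernel combination, and the combination coefficients are then updated from the blockwise RKHS norms. The base solver, the mapping of the regularisation constant and the toy data are assumptions of this sketch rather than the exact procedure developed in Section 2.

```python
import numpy as np

def iterative_mkl(kernels, Y, mu=1.0, n_iter=20):
    """Alternate between a single-kernel multi-class fit and an update of the kernel weights."""
    m, n = len(kernels), Y.shape[0]
    lam = np.full(m, 1.0 / m)                                  # start from uniform kernel weights
    alpha = np.zeros_like(Y)
    for _ in range(n_iter):
        K = sum(l * k for l, k in zip(lam, kernels))           # combined kernel
        alpha = np.linalg.solve(K + np.eye(n) / (2.0 * mu), Y) # kernel ridge, one column per class
        # block norm of each kernel's share: sqrt(sum_c ||f_lc||^2) = lam_l * sqrt(tr(alpha' K_l alpha))
        norms = np.array([lam[l] * np.sqrt(max(np.trace(alpha.T @ kernels[l] @ alpha), 0.0))
                          for l in range(m)])
        if norms.sum() == 0:
            break
        lam = norms / norms.sum()                              # L1-style update of the coefficients
    return lam, alpha

# toy problem: two random positive semi-definite kernels, three classes, +/-1 one-versus-all targets
rng = np.random.default_rng(4)
X1, X2 = rng.standard_normal((60, 5)), rng.standard_normal((60, 8))
kernels = [X1 @ X1.T, X2 @ X2.T]
y = rng.integers(0, 3, 60)
Y = -np.ones((60, 3))
Y[np.arange(60), y] = 1.0
lam, alpha = iterative_mkl(kernels, Y)
print(lam)
```

The only non-standard computation in the loop is the norm-based update of the coefficients, which is what allows the method to reuse any off-the-shelf single-kernel solver.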
2 The Learning Method
Let $\mathbb{N}_n = \{1, 2, \dots, n\}$ for any $n \in \mathbb{N}$ and let the input/output sample be $z = \{(x_i, y_i) : i \in \mathbb{N}_n\}$ with $y = \{y_i \in [1, C] : i \in \mathbb{N}_n\}$, where $C$ is the number of classes. For input sample $x_i$, there are $m$ different sources of information (feature spaces), i.e. $x_i = (x_i^1, x_i^2, \dots, x_i^m)$ with $x_i^\ell$ from the $\ell$-th data source for any $\ell \in \mathbb{N}_m$. To introduce the learning model, we employ a one-versus-all strategy that encodes the multi-class classification problem as a set of binary ones. To this end, we reconstruct the output vector $y_i = (y_{i1}, \dots, y_{iC})$ such that $y_{ic} = 1$ if $y_i = c$ and $-1$ otherwise. Hence the outputs are represented by an $n \times C$ indicator matrix $Y = (y_{ic})_{i,c}$ whose $c$-th column vector is denoted by $\mathbf{y}_c$. For source $\ell$ and class $c$, we use a reproducing kernel space $\mathcal{H}_\ell$ with reproducing kernel $K_\ell$ to represent this dataset. In particular, let $\mathbf{f} = (f_{\ell c})$ be a matrix-valued function.¹ For each class $c$ and data source $\ell$ we use a function $f_{\ell c} \in \mathcal{H}_\ell$ to learn the output. Then, we simply use the composite function defined by
$$f_c(x_i) = \sum_{\ell \in \mathbb{N}_m} f_{\ell c}(x_i^\ell)$$
¹ We denote with bold type a vector or matrix, e.g. $f_{\ell c}$ is a real function while $\mathbf{f}_c$ denotes a vector of functions and $\mathbf{f}$ denotes a matrix of functions.
to combine the $m$ sources. The accuracy of the approximation at sample $i$ can be measured by, e.g., $\bigl(y_{ic} - f_c(x_i)\bigr)^2$. However, direct minimization of this empirical error will inevitably lead to overfitting. Hence, we need to enforce some penalty term on $\mathbf{f}$. Since we expect to get good performance after combining multiple sources, the penalty term intuitively should play the role of removing redundant sources (feature spaces) across classes. With this motivation, we introduce a block $L_1$ regularization on the matrix-valued function $\mathbf{f} = (f_{\ell c})$. This kind of regularization was used in [1] for multi-task linear feature learning and also in [3,10,11,13] for binary classification kernel learning, with block regularization over a vector of functions instead of over a matrix-valued function. More specifically, let
$$\|\mathbf{f}\|_{(2,1)} = \sum_{\ell \in \mathbb{N}_m} \Bigl(\sum_{c \in \mathbb{N}_C} \|f_{\ell c}\|_{\mathcal{H}_\ell}^2\Bigr)^{1/2}.$$
We now propose the following multi-class multiple kernel learning formulation with least square loss. One can easily extend the method and the arguments that follow to other loss functions.
$$\min_{\mathbf{f}}\ \mu \sum_{i \in \mathbb{N}_n} \sum_{c \in \mathbb{N}_C} \Bigl(y_{ic} - \sum_{\ell \in \mathbb{N}_m} f_{\ell c}(x_i^\ell)\Bigr)^2 + \tfrac{1}{2}\|\mathbf{f}\|_{(2,1)}^2 \qquad \text{s.t. } f_{\ell c} \in \mathcal{H}_\ell,\ \forall\, c \in \mathbb{N}_C,\ \ell \in \mathbb{N}_m. \tag{1}$$
The mixed $(2,1)$-norm of $\mathbf{f}$ in the regularization term is obtained by first computing the $\mathcal{H}_\ell$-norm of the row vector (across all classes) $\mathbf{f}_\ell = (f_{\ell 1}, \dots, f_{\ell C})$ and then the 1-norm of $F(\mathbf{f}) = \bigl((\sum_c \|f_{1c}\|_{\mathcal{H}_1}^2)^{1/2}, \dots, (\sum_c \|f_{mc}\|_{\mathcal{H}_m}^2)^{1/2}\bigr)$. Consequently, the 1-norm of the vector $F(\mathbf{f})$ (the mixed-norm term of $\mathbf{f}$) encourages a sparse representation of the candidate RKHSs $\{\mathcal{H}_\ell : \ell \in \mathbb{N}_m\}$ for the learning task, and thus implies automatically adapting the combination of multiple sources.
In order to deal with the non-differentiable $L_1$ regularizer of equation (1), we turn to an equivalent form. To this end, recall [11] that for any $w = (w_1, \dots, w_m) \in \mathbb{R}^m$,
$$\Bigl(\sum_{\ell \in \mathbb{N}_m} |w_\ell|\Bigr)^2 = \min\Bigl\{\sum_{\ell \in \mathbb{N}_m} \frac{w_\ell^2}{\lambda_\ell} : \sum_{\ell \in \mathbb{N}_m} \lambda_\ell = 1,\ \lambda_\ell \ge 0\Bigr\}.$$
Now, we replace $w_\ell$ by $\bigl(\sum_{c \in \mathbb{N}_C} \|f_{\ell c}\|_{\mathcal{H}_\ell}^2\bigr)^{1/2}$ and obtain the following equivalent formulation of equation (1):
$$\min_{\mathbf{f}, \lambda}\ \mu \sum_{i \in \mathbb{N}_n} \sum_{c \in \mathbb{N}_C} \Bigl(y_{ic} - \sum_{\ell \in \mathbb{N}_m} f_{\ell c}(x_i^\ell)\Bigr)^2 + \tfrac{1}{2} \sum_{\ell \in \mathbb{N}_m} \frac{\sum_{c \in \mathbb{N}_C} \|f_{\ell c}\|_{\mathcal{H}_\ell}^2}{\lambda_\ell} \qquad \text{s.t. } \sum_{\ell \in \mathbb{N}_m} \lambda_\ell = 1,\ \lambda_\ell \ge 0,\ f_{\ell c} \in \mathcal{H}_\ell,\ \forall\, c \in \mathbb{N}_C,\ \ell \in \mathbb{N}_m. \tag{2}$$
From the auxiliary regularization term $\sum_{\ell}\sum_{c}\|f_{\ell c}\|_{\mathcal{H}_\ell}^2/\lambda_\ell$ in equation (2), we note that if $\lambda_\ell$ is close to zero then $\sum_{c}\|f_{\ell c}\|_{\mathcal{H}_\ell}^2$ should also be close to zero, as we are minimizing the objective function. This intuitively explains the role of the auxiliary variable $\lambda$. The following theorem demonstrates the joint convexity of problem (2), which could be shown by adapting the argument in [4]. For completeness, we outline a proof here.
Theorem 1. The objective function in (2) is jointly convex with respect to $\mathbf{f}$ and $\lambda$.

Proof: It suffices to prove the joint convexity of $\|f\|_{\mathcal{H}_\ell}^2/\lambda$ with respect to $f \in \mathcal{H}_\ell$ and $\lambda \in (0,1)$, for all $\ell \in \mathbb{N}_m$. The proof is parallel to that in [2]. For completeness, we briefly prove it again here. We need to show, for any $f_1, f_2 \in \mathcal{H}_\ell$, $\lambda_1, \lambda_2 \in (0,1)$ and $\theta \in (0,1)$, that
$$\frac{\|\theta f_1 + (1-\theta) f_2\|_{\mathcal{H}_\ell}^2}{\theta \lambda_1 + (1-\theta) \lambda_2} \le \frac{\|\theta f_1\|_{\mathcal{H}_\ell}^2}{\theta \lambda_1} + \frac{\|(1-\theta) f_2\|_{\mathcal{H}_\ell}^2}{(1-\theta) \lambda_2}.$$
Let $a = \frac{1}{\theta\lambda_1}$, $b = \frac{1}{(1-\theta)\lambda_2}$, $c = \frac{1}{\theta\lambda_1+(1-\theta)\lambda_2}$ and $F = \theta f_1 + (1-\theta) f_2$, $G = \theta f_1$. Since $f_1, f_2$ are arbitrary, the above inequality reduces to the following:
$$c\|F\|_{\mathcal{H}_\ell}^2 \le a\|G\|_{\mathcal{H}_\ell}^2 + b\|F - G\|_{\mathcal{H}_\ell}^2, \qquad \forall F, G \in \mathcal{H}_\ell.$$
Equivalently,
$$c\|F\|_{\mathcal{H}_\ell}^2 \le \min_{G \in \mathcal{H}_\ell}\bigl\{a\|G\|_{\mathcal{H}_\ell}^2 + b\|F-G\|_{\mathcal{H}_\ell}^2\bigr\} = \|F\|_{\mathcal{H}_\ell}^2\,\frac{ab^2}{(a+b)^2} + \|F\|_{\mathcal{H}_\ell}^2\,\frac{a^2 b}{(a+b)^2}, \qquad \forall F \in \mathcal{H}_\ell,$$
which is obviously true by the definition of $a, b, c$. This completes the proof of the convexity.

Let the composite kernel $K_\lambda$ be defined by $K_\lambda = \sum_{\ell \in \mathbb{N}_m} \lambda_\ell K_\ell$. Then, the role of $\lambda$ becomes more intuitive if we use the following dual formulation of (2):
$$\min_\lambda \max_\alpha\ \sum_{i,c} \alpha_{ic} y_{ic} - \frac{1}{4\mu} \sum_{i,c} \alpha_{ic}^2 - \frac{1}{2} \sum_{i,j,c} \alpha_{ic}\alpha_{jc} K_\lambda(x_i, x_j) \qquad \text{s.t. } \sum_{\ell \in \mathbb{N}_m} \lambda_\ell = 1,\ \lambda_\ell \ge 0,$$
which can be directly derived from the dual of kernel ridge regression [16] by first fixing $\lambda$. It is worth noting that for the equally weighted kernel combination, i.e. $\lambda_\ell = \frac{1}{m}$, equation (2) is reduced to a formulation with a plain $L_2$-regularization term $\sum_{\ell,c}\|f_{\ell c}\|_{\mathcal{H}_\ell}^2$. We also note that [14] proposed a multi-class kernel learning algorithm based on a one-against strategy starting from the dual formulation of the SVM.

We can formulate (2) as a semi-infinite linear programming (SILP) problem, as in [18,20]. First, however, we propose a conceptually simple implementation based on Theorem 1, which will be referred to as MCKL-EM hereafter. We will initialize $\lambda^{(0)}$ with $\lambda_\ell^{(0)} = \frac{1}{m}$ for any $\ell \in \mathbb{N}_m$. We then solve (2) for this equally weighted kernel coefficient $\lambda^{(0)}$ and get $\mathbf{f}^{(0)}$, which is a least-squares ridge regression problem. Next, for any $t \in \mathbb{N}$ we update $\lambda^{(t)}$ for fixed $\mathbf{f}^{(t-1)}$ and update $\mathbf{f}^{(t)}$ for fixed $\lambda^{(t)}$. We repeat this EM-type iteration until convergence. This can reasonably be monitored by the change of the kernel combinatorial coefficients $\sum_{\ell \in \mathbb{N}_m} |\lambda_\ell^{\mathrm{old}} - \lambda_\ell|$ or by the change of the objective function, since we are mainly interested in obtaining an optimal kernel combination. Global convergence is expected since the overall problem (2) is jointly convex by Theorem 1. The updates at step $t \in \mathbb{N}$ are as follows:
1. For fixed $\mathbf{f}^{(t-1)}$,
$$\lambda_\ell^{(t)} = \frac{\bigl(\sum_c \|f_{\ell c}^{(t-1)}\|_{\mathcal{H}_\ell}^2\bigr)^{1/2}}{\sum_{\ell' \in \mathbb{N}_m}\bigl(\sum_c \|f_{\ell' c}^{(t-1)}\|_{\mathcal{H}_{\ell'}}^2\bigr)^{1/2}} \quad \text{for any } \ell \in \mathbb{N}_m.$$
Here we denote the matrix function $\mathbf{f}^{(t-1)} = (f_{\ell c}^{(t-1)})_{\ell c}$.
2. For given $\lambda^{(t)}$, $f_{\ell c}^{(t)}(\cdot) = \lambda_\ell^{(t)} \sum_i \alpha_{ic}^{(t)} K_\ell(x_i, \cdot)$. Here, $\alpha^{(t)} = (\alpha_{ic}^{(t)})$ is an $n \times C$ matrix given by the equation
$$\alpha^{(t)} = \bigl(K_{\lambda^{(t)}} + I/2\mu\bigr)^{-1} Y, \tag{3}$$
where $K_{\lambda^{(t)}} = \bigl(\sum_\ell \lambda_\ell^{(t)} K_\ell(x_i, x_j)\bigr)_{ij}$.

The second update equation follows from standard kernel ridge regression [16] for fixed $\lambda$. The first update for $\lambda$ follows from the fact that $\bigl\{|w_1|/\sum_{\ell \in \mathbb{N}_m}|w_\ell|, \dots, |w_m|/\sum_{\ell \in \mathbb{N}_m}|w_\ell|\bigr\}$ is the optimizer of the minimization problem $\min\bigl\{\sum_{\ell \in \mathbb{N}_m} w_\ell^2/\lambda_\ell : \sum_{\ell \in \mathbb{N}_m}\lambda_\ell = 1,\ \lambda_\ell \ge 0\bigr\}$. Let the convergent solution be $\hat{\mathbf{f}}$. Given a new sample $x^*$, we then assign its class by $y^* = \arg\max_c \hat{f}_c(x^*)$.
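As a concrete illustration of how light-weight this iteration is, the following sketch implements the two alternating updates for the least-squares loss in Python with NumPy. It is a minimal sketch rather than the implementation used for the experiments reported here; the variable names (K_list for the m candidate kernel matrices, Y for the n × C indicator matrix, mu for the regularization parameter) are illustrative only.

```python
import numpy as np

def mckl_em(K_list, Y, mu, n_iter=100, tol=1e-4):
    """EM-type alternating updates for the multi-class MKL problem (2),
    with the least-squares loss (a minimal sketch, not the original code).

    K_list : list of m candidate (n x n) kernel matrices K_l(x_i, x_j)
    Y      : (n x C) class-indicator matrix with +1 / -1 entries
    mu     : regularization parameter
    """
    m, n = len(K_list), Y.shape[0]
    lam = np.full(m, 1.0 / m)                       # lambda^(0): equal weights
    alpha = None
    for _ in range(n_iter):
        # Step 2 / Eq. (3): kernel ridge regression for the current lambda.
        K_lam = sum(l * K for l, K in zip(lam, K_list))
        alpha = np.linalg.solve(K_lam + np.eye(n) / (2.0 * mu), Y)
        # Step 1: with f_{lc} = lambda_l * sum_i alpha_ic K_l(x_i, .),
        # sum_c ||f_{lc}||^2_{H_l} = lambda_l^2 * sum_c alpha_c^T K_l alpha_c.
        norms = np.array([np.sqrt(lam[l] ** 2 * np.sum(alpha * (K_list[l] @ alpha)))
                          for l in range(m)])
        new_lam = norms / norms.sum()
        if np.abs(new_lam - lam).sum() <= tol:      # stopping rule used in Section 3
            lam = new_lam
            break
        lam = new_lam
    return lam, alpha

def mckl_predict(K_test_list, lam, alpha):
    """Classify test points by y* = argmax_c f_c(x*), given (n_test x n) kernel blocks."""
    K_test = sum(l * K for l, K in zip(lam, K_test_list))
    return np.argmax(K_test @ alpha, axis=1)
```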
Recently, the SILP approach has been applied to kernel learning problems for large scale datasets, see [18,20,21]. Since we later use a SILP approach for comparison (MCKL-SILP), we briefly describe this variant here. In a similar fashion to the arguments in [18], we can formulate the dual problem as a semi-infinite linear program. Specifically, let $S_0(\alpha) = \sum_{c,i}\alpha_{ic}y_{ic} - \frac{1}{4\mu}\sum_{c,i}\alpha_{ic}^2$ and, for any $\ell \in \mathbb{N}_m$, $S_\ell(\alpha) = \frac{1}{2}\sum_{c,i,j}\alpha_{ic}\alpha_{jc}K_\ell(x_i, x_j)$. Then, the SILP formulation of algorithm (2) is stated as
$$\max_{\gamma,\lambda}\ \gamma \qquad \text{s.t. } \sum_{\ell \in \mathbb{N}_m}\lambda_\ell = 1,\ 0 \le \lambda_\ell \le 1,\quad \gamma - \sum_{\ell \in \mathbb{N}_m}\lambda_\ell S_\ell(\alpha) \le S_0(\alpha),\ \forall \alpha. \tag{4}$$
The SILP can be solved by an iterative algorithm called column generation (or exchange methods), which is guaranteed to converge to a global optimum. The basic idea is to compute the optimum $(\lambda, \gamma)$ by linear programming for a restricted subset of constraints, and to update the constraint subset based on the obtained suboptimal $(\lambda, \gamma)$. Given a set of restricted constraints $\{\alpha^p : p \in \mathbb{N}_P\}$, first we find the intermediate solution $(\lambda, \gamma)$ by the following linear programming optimization with $P$ linear constraints:
$$\max_{\gamma,\lambda}\ \gamma \qquad \text{s.t. } \sum_{\ell \in \mathbb{N}_m}\lambda_\ell = 1,\ 0 \le \lambda_\ell \le 1,\quad \gamma - \sum_{\ell \in \mathbb{N}_m}\lambda_\ell S_\ell(\alpha^p) \le S_0(\alpha^p),\ \forall p \in \mathbb{N}_P. \tag{5}$$
This problem is often called the restricted master problem. Then, we find the next constraint with the maximum violation for the given intermediate solution $(\lambda, \gamma)$, i.e. $\min_\alpha \sum_{\ell \in \mathbb{N}_m}\lambda_\ell S_\ell(\alpha) + S_0(\alpha)$. If its optimal $\alpha^*$ satisfies $\sum_\ell \lambda_\ell S_\ell(\alpha^*) + S_0(\alpha^*) \ge \gamma$ then the current intermediate solution $(\lambda, \gamma)$ is optimal for the optimization (4). Otherwise, $\alpha^*$ should be added to the restriction set. We repeat the above
iteration until convergence, which is guaranteed to be globally optimal, see e.g. [18]. The convergence criterion for the SILP is usually chosen as
$$\Bigl|\,1 - \frac{\sum_\ell \lambda_\ell^{(t-1)} S_\ell(\alpha^{(t)}) + S_0(\alpha^{(t)})}{\gamma^{(t-1)}}\,\Bigr| \le \epsilon. \tag{6}$$
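For completeness, the restricted master problem (5) is itself only a small linear program in $(\lambda, \gamma)$. A possible sketch with scipy.optimize.linprog is given below; S0_vals and S_mat are placeholders for the quantities $S_0(\alpha^p)$ and $S_\ell(\alpha^p)$ accumulated over the restricted constraint set, and the surrounding column-generation loop (finding the most violated $\alpha$) is omitted.

```python
import numpy as np
from scipy.optimize import linprog

def restricted_master(S0_vals, S_mat):
    """Solve the restricted master problem (5) (a sketch).

    S0_vals : length-P array with S_0(alpha_p) for the restricted constraints
    S_mat   : (P x m) array with S_l(alpha_p)
    Maximizes gamma subject to
      gamma - sum_l lambda_l S_l(alpha_p) <= S_0(alpha_p)  for p = 1..P,
      sum_l lambda_l = 1,  0 <= lambda_l <= 1.
    """
    P, m = S_mat.shape
    # Decision variables x = [lambda_1, ..., lambda_m, gamma]; linprog minimizes,
    # so the objective is -gamma.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    A_ub = np.hstack([-S_mat, np.ones((P, 1))])       # gamma - sum_l lambda_l S_l <= S_0
    b_ub = np.asarray(S0_vals, dtype=float)
    A_eq = np.append(np.ones(m), 0.0).reshape(1, -1)  # sum_l lambda_l = 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]                       # (lambda, gamma)
```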
3 Experiments

3.1 Validation on UCI Datasets
In this section we briefly validate MCKL-EM on UCI datasets [19], to illustrate its performance, before proceeding to bioinformatics datasets. For fairness of comparison, in all kernel learning algorithms we chose the change of the kernel weights, $\sum_\ell |\lambda_\ell^{\mathrm{old}} - \lambda_\ell| \le \varepsilon = 10^{-4}$, as the stopping criterion, and the parameter $\mu$ was set at a value of 10. We compared our iterative approach (MCKL-EM) with the SILP approach (MCKL-SILP) and with a doubly cross-validated ridge regression (LSR-CV) over $\mu$ and $\sigma$. The results are based on 10 random data splits into 60% training and 40% test. We can see from Table 1 that there is no significant difference between MCKL-EM and MCKL-SILP with respect to both computation time and test set
Table 1. Test set accuracy (%) and time complexity (seconds) comparison on UCI datasets denoted wine, waveform3, etc. [19]. LSR-CV denotes ridge regression with double cross validation over μ and the Gaussian kernel parameter.

Dataset     MCKL-EM TSA    MCKL-EM Time   MCKL-SILP TSA   MCKL-SILP Time   LSR-CV TSA
wine        98.19 ± 1.52   1.20           98.05 ± 1.17    0.9498           98.75 ± 1.69
waveform3   85.54 ± 1.78   10.97          85.95 ± 0.79    2.91             86.75 ± 1.77
segment3    98.66 ± 0.61   23.24          98.58 ± 0.65    8.30             97.16 ± 1.36
satimage3   99.58 ± 0.32   7.02           99.58 ± 0.34    4.56             99.66 ± 0.36
segment7    93.76 ± 1.14   106.56         94.12 ± 0.73    89.77            92.71 ± 1.20
satimage6   90.14 ± 1.45   40.06          90.14 ± 1.48    27.93            91.14 ± 0.98
Class Prediction from Disparate Biological Data Sources
750
900 850
MCKL−EM MCKL−SILP
Objective value
MCKL−EM MCKL−SILP
700
Objective value
433
800 750
650
700 650
600
600
550
550
5
10
15
20
25
30
35
40
45
50
55
60
500
5
10
15
20
25
Iteration
30
35
40
45
50
55
60
Iteration
92
96 95
91
94
Accuracy
Accuracy
90 MCKL−EM MCKL−SILP
89
MCKL−EM MCKL−SILP
93 92 91
88
90
87 89
86
85 0
88
20
40
60
80
100
120
140
160
180
87 0
200
20
40
60
80
Iteration
120
140
160
180
200
0.6
0.6
120
140
160
180
200
Kernel weights
0.7
Kernel weights
0.7
0.5
0.5
0.4
0.4
0.3
0.3
0.2
0.2
0.1
0.1
0 0
100
Iteration
20
40
60
80
100
Iteration
120
140
160
180
200
0 0
20
40
60
80
100
Iteration
Fig. 1. Evolution of MCKL-EM (blue line) and MCKL-SILP (red line) on the satimage6 (left column) and segment7 (right column) datasets from the UCI Repository [19]. Top: objective function value of MCKL-EM and MCKL-SILP versus iteration; Middle: accuracy of MCKL-EM and MCKL-SILP versus iteration. Bottom: the largest two kernel weights versus iteration, MCKL-EM (blue line) and MCKL-SILP (red line).
accuracy (TSA), despite the fact that MCKL-EM is much simpler to implement. These accuracies are also equal to, or better than, the corresponding doubly cross-validated results. The first column of Figure 1 shows that the objective function value of MCKL-EM quickly becomes stable while MCKL-SILP oscillates during the first few steps. To validate the global convergence of MCKL-EM, in Figure 1 we also depict the evolution of the test set accuracy and the largest two kernel combinatorial weights for MCKL-EM and MCKL-SILP for two example
datasets. For both methods, we can see from the second column of Figure 1 that the test set accuracy quickly becomes stable.

3.2 Protein Fold Prediction
We now evaluate our algorithm on a well-known protein fold prediction dataset [6]. Prediction of protein three-dimensional structure is a very important problem within computational biology. Protein fold prediction is the sub-task in which we predict a particular class of arrangement of secondary structure components such as alpha-helices or beta-strands. The benchmark dataset is taken from [6]; it has 27 SCOP fold classes with 313 proteins for training and 385 for testing. There are 12 different data-types, or feature spaces, including Amino Acid Composition (C), Predicted Secondary Structure (S), Hydrophobicity (H), Polarity (P), van der Waals volume (V), Polarizability (Z), PseAA λ = 1 (L1), PseAA λ = 4 (L4), PseAA λ = 14 (L14), PseAA λ = 30 (L30), SW with BLOSUM62 (SW1) and SW with PAM50 (SW2). As in [5], we employed linear kernels (Smith-Waterman scores) for SW1 and SW2 and second order polynomial kernels for the others. In [6] and [17], test set accuracies of 56.5% and 62.1% were reported based on various adaptations of binary SVMs and neural networks. Recently, test performance was greatly improved by Damoulas and Girolami [5] using a Bayesian multi-class multi-kernel algorithm. They reported a best test accuracy of 70% on a single run. For this problem, we examined the proposed method MCKL-EM, and compared against MCKL-SVM [21] and kernel learning for regularized kernel discriminant analysis, RKDA [20] (MCKL-RKDA).² For the first two methods, the parameter μ is tuned by 3-fold cross validation based on a grid search over $\{10^{-2}, 10^{-1}, \dots, 10^{6}\}$. For RKDA kernel learning [20], we used the SILP approach and the regularization parameter there is also tuned by 3-fold cross validation with a grid search over $\{10^{-6}, 10^{-4}, \dots, 10^{2}\}$.
Table 2. Performance comparison (test set accuracy as %) for the protein fold recognition [6,17] and PSORT protein localization datasets [7,21]. Results for PSORT are cited from [21].

                            MCKL-EM   MCKL-SVM   MCKL-RKDA
Protein fold (TSA)          74.15     67.36      68.40
PSORT+ (Average F1 score)   93.34     93.8       93.83
PSORT− (Average F1 score)   96.61     96.1       96.49
Table 2 illustrates the result for MCKL-EM with μ adjusted by 3-fold cross validation. The method achieves a 74.15% test set accuracy (TSA) which outperforms the previously reported state-of-the-art result of 70% obtained in [5] using a
² The MATLAB code is available from http://www.public.asu.edu/jye02/Software/DKL
Fig. 2. Performance of MCKL-EM on the protein fold dataset. First subfigure: performance of each individual kernel; dash-dotted red line is for all kernels equally weighted (i.e. plain 2-norm regularization) and the solid blue line is for MCKL-EM. Second one: kernel combinatorial weights i.e. λ. The last two subfigures: evolution of λ and test set accuracy up to 2000 iterations.
probabilistic Bayesian model, the 68.40% TSA attained by the RKDA kernel learning method [20], and the 67.36% TSA of the multi-class SVM multi-kernel learning method [21]. The first subfigure of Figure 2 illustrates the performance with each individual feature. The result for MCKL-EM is depicted by a solid line in the first subfigure of Figure 2. The proposed algorithm was also examined with all kernels equally weighted, i.e. $\lambda_\ell = \frac{1}{m}$ for any $\ell \in \mathbb{N}_m$, which as mentioned above is equivalent to a plain $L_2$-norm regularization. The resulting performance, 70.49%, is depicted by the dash-dotted line. The second subfigure of Figure 2 shows the kernel combinatorial weights $\lambda$. There, the features Amino Acid Composition (C), van der Waals volume (V), SW with BLOSUM62 (SW1), and SW with PAM50 (SW2) are the most prominent sources. Without using the stopping criterion, MCKL-EM was further examined for up to 2000 iterations after μ was selected by cross-validation. The third subfigure shows the convergence of $\lambda$ and the fourth subfigure illustrates accuracy versus number of iterations, which validates the convergence of the iterative algorithm. In Figure 3 the kernel combinatorial weights $\lambda$ for MCKL-SVM and MCKL-RKDA are plotted. They both indicate that the first, fifth and last features are important, which is consistent with previous observations. However, the kernel combinations are sparse and quite different from that of MCKL-EM as depicted
Fig. 3. Kernel weights (i.e. λ) of MCKL-SVM (left subfigure) and MCKL-RKDA (right subfigure) on the protein fold recognition dataset
in the second subfigure of Figure 2. The competing methods also result in worse performance (less than 70%) while MCKL-EM achieves 74.15%. This indicates that different combinations of kernel weights lead to significantly different predictions by kernel learning algorithms, and that sparsity in the kernel weights does not necessarily guarantee good generalization performance. We should note here that the parameter μ in all algorithms is chosen by cross-validation using grid search over the same grid. Moreover, the sparsity usually depends on the parameter μ: the smaller the value of μ, the greater the sparsity in the kernel weights. This may explain why different kernel weights are obtained for different kernel learning algorithms.

3.3 Prediction of Protein Subcellular Localization
The proposed method (MCKL-EM) was further evaluated on two large datasets for bacterial protein localization [7] where 69 kernels are available. The first problem, derived from the PSORT+ dataset, contains four classes and the other, called PSORT−, has five classes. The results are based on 30 random partitions into 80% training and 20% test data.³ In Table 2, the test set accuracies for MCKL-EM, MCKL-SVM and MCKL-RKDA are listed. Zien and Ong [21] provided average F1 scores of 93.8% and 96.1% respectively for the PSORT+ and PSORT− datasets after filtering out 81/541 and 192/1444 ambiguous samples. These outperformed the results of 90.0% and 87.5% reported by Gardy et al. [7]. On the PSORT+ dataset we obtained an average F1 score of 93.34% for MCKL-EM. For the PSORT− dataset, we report an average F1 score of 96.61% for MCKL-EM. Hence, our results outperform those of [7] and are competitive with the methods in [20,21]. As depicted in Figure 4, the kernel weights for MCKL-EM are quite sparse on this dataset, which is consistent with those in [21].
³ http://www.fml.tuebingen.mpg.de/raetsch/suppl/protsubloc
Fig. 4. Averaged kernel combinatorial weights (i.e. λ) with error bars of MCKL-EM on PSORT− (left subfigure) and PSORT+ (right subfigure)
4 Conclusion
In this paper we presented MCKL-EM, a simple iterative algorithm for multiple kernel learning based on the convex formulation of block RKHS norms across classes. As opposed to other MKL algorithms, this iterative approach does not need sophisticated optimization methods while retaining comparable training time and accuracy. The proposed approach yielded state-of-the-art performance on two challenging bioinformatics problems: protein fold prediction and protein subcellular localization. For the latter we report competitive performance. For the former we outperform the previous competing methods and offer a 4.15% improvement over the state-of-the-art result, which is a significant contribution given the large number of protein fold classes. Future work could include possible extensions of the proposed method for tackling multi-task and multi-label problems.
References
1. Argyriou, A., Evgeniou, T., Pontil, M.: Multi-task feature learning. In: NIPS (2006)
2. Argyriou, A., Micchelli, C.A., Pontil, M., Ying, Y.: A spectral regularization framework for multi-task structure learning. In: NIPS (2007)
3. Bach, F., Lanckriet, G.R.G., Jordan, M.I.: Multiple kernel learning, conic duality and the SMO algorithm. In: ICML (2004)
4. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
5. Damoulas, T., Girolami, M.: Probabilistic multi-class multi-kernel learning: On protein fold recognition and remote homology detection. Bioinformatics 24(10), 1264–1270 (2008)
6. Ding, C., Dubchak, I.: Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics 17, 349–358 (2001)
7. Gardy, J.L., et al.: PSORTb v.2.0: expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis. Bioinformatics 21, 617–623 (2004)
8. Girolami, M., Rogers, S.: Hierarchic Bayesian models for kernel learning. In: ICML (2005)
9. Lanckriet, G.R.G., Cristianini, N., Bartlett, P., Ghaoui, L.E., Jordan, M.I.: Learning the kernel matrix with semidefinite programming. J. of Machine Learning Research 5, 27–72 (2004)
10. Lin, Y., Zhang, H.: Component selection and smoothing in multivariate nonparametric regression. Annals of Statistics 34, 2272–2297 (2006)
11. Micchelli, C.A., Pontil, M.: Learning the kernel function via regularization. J. of Machine Learning Research 6, 1099–1125 (2005)
12. Ong, C.S., Smola, A.J., Williamson, R.C.: Learning the kernel with hyperkernels. J. of Machine Learning Research 6, 1043–1071 (2005)
13. Rakotomamonjy, A., Bach, F., Canu, S., Grandvalet, Y.: More efficiency in multiple kernel learning. In: ICML (2007)
14. Rakotomamonjy, A., Bach, F., Canu, S., Grandvalet, Y.: SimpleMKL. J. of Machine Learning Research 9, 2491–2521 (2008)
15. Schölkopf, B., Smola, A.J.: Learning with Kernels. MIT Press, Cambridge (2002)
16. Shawe-Taylor, J., Cristianini, N.: Kernel methods for pattern analysis. Cambridge University Press, Cambridge (2004)
17. Shen, H.B., Chou, K.C.: Ensemble classifier for protein fold pattern recognition. Bioinformatics 22, 1717–1722 (2006)
18. Sonnenburg, S., Rätsch, G., Schäfer, C., Schölkopf, B.: Large scale multiple kernel learning. J. of Machine Learning Research 7, 1531–1565 (2006)
19. http://archive.ics.uci.edu/ml/
20. Ye, J., Ji, S., Chen, J.: Multi-class discriminant kernel learning via convex programming. J. of Machine Learning Research 9, 719–758 (2008)
21. Zien, A., Ong, C.: Multi-class multiple kernel learning. In: ICML (2007)
Cross-Platform Analysis with Binarized Gene Expression Data Salih Tuna and Mahesan Niranjan School of Electronics and Computer Science, ISIS Research Group, University of Southampton, UK {st07r,mn}@ecs.soton.ac.uk
Abstract. With widespread use of microarray technology as a potential diagnostics tool, the comparison of results obtained from the use of different platforms is of interest. When inference methods are designed using data collected using a particular platform, they are unlikely to work directly on measurements taken from a different type of array. We report on this cross-platform transfer problem, and show that working with transcriptome representations at binary numerical precision, similar to the gene expression bar code method, helps circumvent the variability across platforms in several cancer classification tasks. We compare our approach with a recent machine learning method specifically designed for shifting distributions, i.e., problems in which the training and testing data are not drawn from identical probability distributions, and show superior performance in three of the four problems in which we could directly compare.

Keywords: Cross-platform analysis, binary gene expression, classification.

1 Introduction
The ability to observe the expression levels, or relative mRNA abundances, of thousands of genes in a given biological sample makes microarray technology a widely used tool in experimental biology. The potential of the technology as a diagnostic tool, producing a high dimensional feature vector upon which statistical pattern classification techniques such as Support Vector Machines (SVM) can be trained and applied, has received significant attention over the last decade [1]. Datasets from complex diseases including different types of cancer and diabetes have been analyzed in this manner, and subsets of genes that are useful in discriminating the population with a disease from the normal population have been identified for further validation. A particular issue in such studies is variability at the biological and technical levels. Reproducibility of microarray results across different biological samples taken from the same tissue is reported to be very poor [2], while reproducibility
Corresponding author.
across technical replicates of amplified isolated mRNA is generally good [3]. Reasons for this have to do with the fact that mRNA is taken from a population of cells, each of which carries a very small number of copies of each species. Except in experimental settings where the cells are artificially synchronized, this observation is largely true, leading to large biological variability. Similarly, variations in results across different laboratories and across platforms have been noted [4,5]. Much research in microarray studies is aimed at developing analytical techniques that are robust to systematic measurement variations. In our past work [6], motivated by the observation that the high numerical precision with which gene expression levels are reported in archives is incompatible with large biological variability, we showed that the quality of inference drawn from microarray studies is often not affected by progressive quantization of the expression levels. We established this in a number of different inference problems: classification, cluster analysis, detection of periodically expressed genes and the analysis of developmental time-course data. Building on this, we further showed that with a binary representation of the transcriptome, i.e., retaining only the information whether a gene is expressed or not, one could often achieve superior results by proper choice of distance metrics. Specifically, we used the Tanimoto similarity [7], borrowed from the chemoinformatics literature, and were able to explain some of the improvements obtained by a systematic variation in the probe level uncertainties of Affymetrix gene arrays [8]. We also established that in such reduced numerical precision representations, variability of inference arising from algorithmic choice in the pipeline of various pre-processing stages can be significantly reduced. Binary representation of the transcriptome has been shown to be effective in dealing with variation between laboratories by Zilliox and Irizarry [9], in their bar code method. The bar code is simply a binary representation of microarray outputs, but is computed over a very large collection of hybridizations of a particular type of array. In [9], the authors studied the Affymetrix HGU133A Human array and, using their barcodes and a simple nearest distance to template classifier, demonstrated impressive results of tissue specificity of cancer populations. A particular limitation of the approach is distance-to-template classification, because it is known in statistical pattern recognition that such a classifier is optimal only for equal variance isotropic class conditional densities [10]. For gene expression data, this is a poor assumption because genes regulated by common transcription factors and those acting on common signal transduction pathways are often co-expressed. Complex diseases are often realized as disruptions in pathways or regulation, thus correlated expression should be very common in such datasets. While on the data used in [9] good results are obtained, it is not too difficult to find counter examples in which the performance of the bar code method is poor (see section 2.3). Similarly, Shmulevich and Zhang [11] also note the advantage of working with binary transcriptome data. Warnat et al. [12] and Gretton et al. [13] offer novel algorithmic approaches for dealing with cross-platform variations. In their formulation training data for a cancer vs non-cancer SVM classifier is assumed to come from a particular
microarray platform and the unseen test data is assumed to come from a different platform. As one would expect, with no adjustment to the data, test set performance is very poor. In [12], Warnat et al. offer two solutions to improve on this: the use of median rank scores and quantile discretization. The former approach uses ranks of genes as features in computing similarity metrics while the latter quantizes data into eight bins, the ranges of which are set to equalize bin occupancy. The second method is similar in spirit to the method we advocate, in that ours is to quantize down to binary levels. In [13], Gretton et al. develop an approach aimed at the more generic problem of test set distributions being different from training set distributions. A weighting scheme known as kernel mean matching (KMM) is developed and microarray cross-platform inference is used as a test problem to evaluate their algorithm. Binarizing continuous valued data as a means of improving the performance of specific classifiers has been reported in the machine learning literature in the past [14]. Such work, however, is not generic and is merely a statement about accidental improvements over weak baseline classifiers (naive Bayes, decision trees etc.). Our results are specific to transcriptome data and build on observed properties of the measurement environment. Further, our comparisons are against a classifier with high performance (i.e., SVM). In this paper we show that a binary representation of the transcriptome, when combined with a suitable similarity metric and cast in a kernel classifier setting, can yield performance that is competitive with, and often superior to, methods developed in the literature to address this problem. This, and other examples of high performance from binary representations we have reported previously, arise largely from the fact that often the useful information relating to gene expression is simply whether it is transcribed or not, rather than in the actual cellular concentration of the transcripts. Even if the information is in transcript abundances, as noted earlier, heterogeneity within a population of cells makes the measurement unreliable. In this context, quantization of the data has a noise rejection property which our method takes advantage of.
2 Methods

2.1 Quantization
Quantization of microarray data has been studied in the literature, for example [15,16,17]. Among possible methods, we choose the quantization method of Zhou et al. [15], where a mixture of Gaussians is used for the different states of gene expression values. Our justification for choosing the method of [15] is that it is relatively more principled than other approaches for quantization. Arbitrary thresholds set by other researchers are not necessarily transferable across different platforms or experiments due to variabilities induced by image processing and normalization, while the method in [15] depends on the underlying probability density of the expression levels and hence the idea is portable to any situation. We focused on a binary representation of these measurements. Gene expression values are quantized by fitting a Gaussian mixture model to the expression values:
Fig. 1. Histogram of expression levels taken from [25] and a two component Gaussian mixture model of the distribution. The quantization threshold is a function of the means and standard deviations of the two mixture components (Eqn. 2).
$$p(x) = \sum_{k=1}^{M} \lambda_k\, N(\mu_k, \sigma_k) \tag{1}$$
where $p(x)$ is the probability density of the gene expression measurement, $M$ the number of mixture components, and $N(\mu, \sigma)$ is a Gaussian density of mean $\mu$ and standard deviation $\sigma$. Fitting such a model is by standard maximum likelihood techniques, and we used the gmm function in the NETLAB software (http://www.ncrg.aston.ac.uk) for this purpose. We used two component mixtures, corresponding to $M = 2$ in the above equation. Fig. 1 shows an example of gene expression values fitted to a two-centre GMM. After learning the parameters of the model, the threshold $T_h$ is chosen as
$$T_h = \frac{\mu_1 + \sigma_1 + \mu_2 - \sigma_2}{2} \tag{2}$$
to achieve binary quantization.
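In our experiments the mixture was fitted with the gmm routine of NETLAB in MATLAB; purely as an illustration, an equivalent sketch in Python using scikit-learn's GaussianMixture as a stand-in is given below. The ordering of the two components (component 1 taken as the lower, unexpressed mode) is an assumption of the sketch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def binarize_array(expression, random_state=0):
    """Binarize one array's expression values with a two-component GMM, Eqns. (1)-(2).

    expression : 1-D array of expression values for a single array.
    Returns a 0/1 vector: 1 where the value exceeds the threshold Th.
    """
    x = np.asarray(expression, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=random_state).fit(x)
    means = gmm.means_.ravel()
    sigmas = np.sqrt(gmm.covariances_.ravel())
    lo, hi = np.argsort(means)            # assume component 1 = lower (unexpressed) mode
    th = (means[lo] + sigmas[lo] + means[hi] - sigmas[hi]) / 2.0   # Eq. (2)
    return (x.ravel() > th).astype(int)
```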
2.2 Tanimoto Kernel
Tanimoto coefficient (T ) [7], between two binary vectors of gene expressions, is defined as:
$$T = \frac{c}{a + b - c} \tag{3}$$
where $a$ is the number of expressed points for the first gene, $b$ is the number of expressed points for the second gene and $c$ is the number of commonly expressed points in the two genes. Tanimoto similarity ranges from 0 (no points in common) to 1 (exact match) and is the ratio of the number of common bits that are on to the total number of bits that are on in the two vectors. It focuses on the number of common bits that are on. Following the definition of Tanimoto similarity, the Tanimoto kernel is defined as [18,19]:
$$K_{Tan}(\mathbf{x}, \mathbf{z}) = \frac{\mathbf{x}^T\mathbf{z}}{\mathbf{x}^T\mathbf{x} + \mathbf{z}^T\mathbf{z} - \mathbf{x}^T\mathbf{z}} \tag{4}$$
where $a = \mathbf{x}^T\mathbf{x}$, $b = \mathbf{z}^T\mathbf{z}$ and $c = \mathbf{x}^T\mathbf{z}$. It follows from the work of Swamidass et al. [18] and Trotter [19] that this similarity metric is useful as a valid kernel, i.e., kernel computations in the space of the given binary vectors map onto inner products in a higher dimensional space, so that SVM-type optimizations for large margin class boundaries are possible. We incorporated this kernel into the MATLAB SVM implementation of Steve Gunn [20] (http://www.isis.ecs.soton.ac.uk/isystems/kernel/).
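A direct vectorized implementation of Eq. (4) is straightforward; the sketch below is in Python rather than the MATLAB code actually used, and assumes binary (0/1) profile matrices with no all-zero rows.

```python
import numpy as np

def tanimoto_kernel(X, Z):
    """Tanimoto kernel of Eq. (4) between the rows of binary matrices X and Z.

    X : (n x p) 0/1 matrix, Z : (q x p) 0/1 matrix.
    Returns the (n x q) kernel matrix x^T z / (x^T x + z^T z - x^T z).
    """
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    cross = X @ Z.T                           # x^T z for every pair of rows
    x_sq = X.sum(axis=1)[:, None]             # x^T x (number of bits on, for 0/1 data)
    z_sq = Z.sum(axis=1)[None, :]             # z^T z
    return cross / (x_sq + z_sq - cross)      # assumes no all-zero profiles
```

Such a precomputed kernel matrix can then be passed to any SVM solver that accepts user-supplied kernels.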
2.3 Bar Code vs. SVM
Since the bar code method of Zilliox and Irizarry [9] is the closest in the literature to our work, we give a quick overview and evaluation of its performance. The binary representation for a class of data (tumor in a particular tissue) is derived for a particular array, the Affymetrix HGU133A Human array, by scanning through a large collection of expression levels archived in microarray repositories. Predictions on test data are made by computing the nearest Euclidean distance to pre-computed bar codes. As we note in the introduction, we should be skeptical about high performance from a distance-to-template classifier as such an approach is only Bayes' optimal under isotropic equal variance assumptions. To verify this, we first established that the bar code approach cannot compete with SVM. We used the R code made available by the authors at their web page: http://rafalab.jhsph.edu/barcode/ and used three datasets downloaded from ArrayExpress (http://www.ebi.ac.uk/arrayexpress/) and Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih.gov/geo/). Two of these were used in [9] and the other was not. Prediction accuracies for these three, comparing the bar code method to Tanimoto-SVM, are shown in Table 1. We note that training and testing on the same database, as we have done with Tanimoto-SVM, achieves consistently better prediction accuracies than the bar code method. But in fairness to the bar code method we remark that their intention is to make predictions on a new dataset based on accumulated historic knowledge, rather than repeat the training/testing process all over again. On this point, while there is impressive performance reported on the datasets
Table 1. Comparison of Tanimoto-SVM with [9]'s bar code

Dataset                   Data type   Method         Accuracy
E-GEOD-10072              Binary      Bar code       0.50
Lung                      Binary      Tanimoto-SVM   0.89 ± 0.03
Lung tumor vs. normal     Binary      Tanimoto-SVM   0.99 ± 0.03
GSE2665                   Binary      Bar code       0.95
lymph node/tonsil         Binary      Tanimoto-SVM   0.99 ± 0.02
lymph node vs. tonsil     Binary      Tanimoto-SVM   1.0 ± 0.0
GSE2603                   Binary      Bar code       0.90
Breast Tumor              Binary      Tanimoto-SVM   0.99 ± 0.01
Breast Tumor vs. normal   Binary      Tanimoto-SVM   0.99 ± 0.01
Zilliox and Irizarry [9] worked on, the method can fail badly too, as in the case of the lung cancer prediction task E-GEOD-10072 shown in Table 1. In Table 1, 'Lung' corresponds to classifying lung vs. breast and lymph node/tonsil, which is a similar approach to the bar code. 'Lung tumor vs. normal' corresponds to classifying tumor vs. normal in lung only. The same terminology applies to the other two problems as well. Part of the success of the Tanimoto kernel in the microarray setting comes from a systematic variability at the probe level of Affymetrix arrays. We have noted [8] that in a given experiment, the average probe level uncertainty computed amongst expressed genes systematically reduces with the number of expressed genes; i.e., the larger the number of expressed genes, the lower the uncertainty of the measurements. Amongst 50 experiments we looked at there was only one experiment for which this observation did not hold. This variability has a direct bearing when using Tanimoto similarity. For two pairs of expression profiles which differ by the same Hamming (or Euclidean) distance, Tanimoto similarity will be higher for the pair that has a greater number of expressed genes (thereby placing a higher emphasis on experiments with lower probe level uncertainties). Other authors have also exploited probe level uncertainties in principal component analysis [22,23] and cluster analysis [24].
3 Experiments

3.1 Datasets
To demonstrate how binary representations help in cross-platform inference, we carried out experiments on breast and prostate cancer datasets. These datasets are the same as those used in [12] and [13] and were given to us by the authors in processed format (i.e., we worked with the expression levels rather than with the raw data at the CEL file or image levels). These data come from spotted cDNA and Affymetrix platforms, and details of the four datasets are summarized in Tables 2 and 3. Warnat et al. [12] preprocessed all the data and found the
Table 2. Details of breast cancer studies

Study                    Platform     No. of common genes   Samples   Target variable
West et al. [25]         Affymetrix   2166                  49        ER-status: 25(+), 24(-)
Gruvberger et al. [26]   cDNA         2166                  58        ER-status: 28(+), 30(-)

Table 3. Details of prostate cancer studies

Study                      Platform     No. of common genes   Samples   Target variable
Welsh et al. [27]          Affymetrix   4344                  33        9 normal, 24 tumor
Dhanasekaran et al. [28]   cDNA         4344                  53        19 normal, 34 tumor
subset of common genes by means of the Unigene database (http://www.ncbi.nlm.nih.gov/unigene).
3.2 SVM Classification
In implementing SVM classifiers, we first ensured that our implementation achieves the same results as reported in [12]. The "cont-not normalized" column of Table 6 confirms that our implementation achieves the same results reported previously. Then, following the suggestion in [13], we normalized each array to have a mean of zero and standard deviation one, and trained and tested our SVM implementations. This normalization has a significant impact on the results ("cont-normalized" in Table 6). As in these papers, we used linear kernel SVMs with a setting of C = 1000 for the margin parameter, and confirmed that previously quoted results are reproducible. We then quantized the data and applied the Tanimoto kernel SVM. Note that this kernel has no tuning parameters. We implemented quantization on an array-by-array basis. In previous work we have experimented with different ways of quantization (array by array, gene by gene and a global method), and noted only small differences between these over a range of quantization thresholds [6].
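Putting the pieces together, a possible sketch of the cross-platform protocol (train on one platform, test on the other) is given below. It reuses the binarize_array and tanimoto_kernel sketches from Section 2, uses scikit-learn's SVC with a precomputed kernel as a stand-in for the MATLAB implementation, and the margin parameter C is a placeholder, since its value for the Tanimoto-SVM is not specified in the text (the kernel itself has no parameters).

```python
import numpy as np
from sklearn.svm import SVC

def cross_platform_tanimoto_svm(X_train, y_train, X_test, C=1.0):
    """Train on one platform, predict on another, from binary profiles (a sketch).

    X_train, X_test : expression matrices (samples x common genes) from the two
                      platforms, restricted to the shared gene set.
    C               : placeholder margin parameter (not specified in the text).
    """
    # Array-by-array binarization: one GMM threshold per sample/array.
    B_train = np.vstack([binarize_array(row) for row in X_train])
    B_test = np.vstack([binarize_array(row) for row in X_test])
    # Precomputed Tanimoto kernels between binary profiles.
    K_train = tanimoto_kernel(B_train, B_train)
    K_test = tanimoto_kernel(B_test, B_train)
    clf = SVC(C=C, kernel='precomputed').fit(K_train, y_train)
    return clf.predict(K_test)
```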
3.3 Results
Tables 4 and 5 show the difference in classification between continuous and binary representations on the two cancer classification problems. Accuracies are shown for 25 random partitions of the data into training and testing sets, along with standard deviations quantifying the uncertainty in this process. We see that in three out of the four cases binarization, and the use of the Tanimoto kernel, offer significant improvements, and perform no worse than continuous data in the fourth. In Warnat et al. [12], results are averaged over 10 cross validation runs, but the paper does not report the variation across results. Table 6 presents results of training SVMs with one type of data and testing the performance on data from a different platform. In this cross-platform
Table 4. Breast cancer results for cross-platform analysis. Data is randomly partitioned into training and testing 25 times.

Dataset             Data type   Method         Accuracy
Gruvberger et al.   Cont.       Linear-SVM     0.80 ± 0.07
Gruvberger et al.   Binary      Tanimoto-SVM   0.82 ± 0.08
West et al.         Cont.       Linear-SVM     0.76 ± 0.15
West et al.         Binary      Tanimoto-SVM   0.79 ± 0.11

Table 5. Prostate cancer results for cross-platform analysis. Data is randomly partitioned into training and testing 25 times.

Dataset               Data type   Method         Accuracy
Dhanasekaran et al.   Cont.       Linear-SVM     0.89 ± 0.06
Dhanasekaran et al.   Binary      Tanimoto-SVM   0.89 ± 0.05
Welsh et al.          Cont.       Linear-SVM     0.92 ± 0.06
Welsh et al.          Binary      Tanimoto-SVM   0.96 ± 0.06

Table 6. Cross-platform results. Array-by-array quantization. The notation "Gruvberger → West" indicates that we train on Gruvberger's data and test on West's data.

Dataset                 Data type                Accuracy
Gruvberger → West       Cont. (not normalized)   0.49
Gruvberger → West       Cont. (normalized)       0.94
Gruvberger → West       Binary                   0.96
West → Gruvberger       Cont. (not normalized)   0.52
West → Gruvberger       Cont. (normalized)       0.93
West → Gruvberger       Binary                   0.90
Dhanasekaran → Welsh    Cont. (not normalized)   0.27
Dhanasekaran → Welsh    Cont. (normalized)       1
Dhanasekaran → Welsh    Binary                   1
Welsh → Dhanasekaran    Cont. (not normalized)   0.64
Welsh → Dhanasekaran    Cont. (normalized)       0.93
Welsh → Dhanasekaran    Binary                   1
comparison, normalization as a first step has a big impact. Further improvement is obtained by our binarized Tanimoto approach. While in one of the four experiments this approach gives poor performance, it proves useful in the other three. In Table 7 we give a comparison with other previously published results on the same datasets, namely the median rank and quantile discretization of [12] and the kernel mean matching approach of [13]. While the number of experiments is small, we note that the binarized Tanimoto method we advance has merit in terms of its performance in a cross-platform setting.
Table 7. Comparison of our approach to the published results in literature. Accuracies obtained by SVM are compared.

Study             Train → Test         MRS    QD     KMM    Binary
Breast cancer     Gruvberger → West    0.63   0.86   0.94   0.96
Breast cancer     West → Gruvberger    0.95   0.92   0.95   0.90
Prostate cancer   Dhana → Welsh        0.88   0.97   0.91   1
Prostate cancer   Welsh → Dhana        0.89   0.91   0.83   1
Note that KMM is a sample re-weighting process designed to match the test set distribution to the training set (in feature space means) by a quadratic programming formulation. With microarray data, imposing such a shift is an artificial construct, whereas our results show that similar, if not superior, performance is achievable simply by choosing an appropriate data representation.
4 Conclusion
In this paper we show that a binary representation of gene expression profiles, combined with a kernel similarity metric that is appropriate for such data, has the potential to address the important problem in microarray-based phenotype classification of cross-platform inference. While the experimental work is on a very small number of datasets, which were the only ones available to us at this time from previous studies, we believe this advantage comes from using a data representation that respects properties of the measurement environment. This approach is not limited to cross-platform analysis but can also be successfully applied in Affymetrix vs. Affymetrix settings (e.g. see results in Table 1), where we show that data from one Affymetrix platform can be robustly transferred to another. Our current work is on extending the study to a larger collection of datasets, the difficulty in doing this being the matching of the gene identities.

Acknowledgments. We are grateful to Arthur Gretton and Karsten Borgwardt for providing the datasets used in this study.
References
1. Brown, M.P.S., Grundy, W.N., Lin, D., Cristianini, N., Sugnet, C.W., Furey, T.S., Ares Jr., M., Haussler, D.: Knowledge-based analysis of microarray gene expression data by using support vector machines. PNAS 97(1), 262–267 (2000)
2. Tomayko, M.M., Anderson, S.M., Brayton, C.E., Sadanand, S., Steinel, N.C., Behrens, T.W., Shlomchik, M.J.: Systematic Comparison of Gene Expression between Murine Memory and Naive B Cells Demonstrates That Memory B Cells Have Unique Signaling Capabilities. J. Immunol. 181(1), 27 (2008)
3. MAQC consortium, The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat. Biotechnol. 24, 1151–1161 (2006)
4. Draghici, S., Khatri, P., Eklund, A.C., Szallasi, Z.: Reliability and reproducibility issues in DNA microarray measurements. Trends Genet. 22, 101–109 (2006)
5. Kuo, W.P., Jenssen, T.K., Butte, A.J., Ohno-Machado, L., Kohane, I.S.: Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics 18(3), 405–412 (2002)
6. Tuna, S., Niranjan, M.: Inference from low precision transcriptome data representation. Journal of Signal Processing Systems (April 22, 2009), doi:10.1007/s11265-009-0363-2
7. Tanimoto, T.T.: IBM Internal Report, An elementary mathematical theory of classification and prediction (1958)
8. Tuna, S., Niranjan, M.: Classification with binary gene expressions. Journal of Biomedical Sciences and Engineering (in press, 2009)
9. Zilliox, M.J., Irizarry, R.A.: A gene expression bar code for microarray data. Nat. Met. 4(11), 911–913 (2007)
10. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. John Wiley & Sons, USA (2001)
11. Shmulevich, I., Zhang, W.: Binary analysis and optimization-based normalization of gene expression data. Bioinformatics 18(4), 555–565 (2002)
12. Warnat, P., Eils, R., Brors, B.: Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes. BMC Bioinformatics 6, 265 (2005)
13. Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., Scholkopf, B.: Covariate shift by kernel mean matching. In: Quionero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D. (eds.) Dataset shift in machine learning, pp. 131–160. Springer/The MIT Press, London (2009)
14. Dougherty, J., Kohavi, R., Sahami, M.: Supervised and Unsupervised Discretization of Continuous Features. In: International Conference on Machine Learning, pp. 194–202 (1995)
15. Zhou, X., Wang, X., Dougherty, E.R.: Binarization of microarray data on the basis of a mixture model. Mol. Cancer Ther. 2(7), 679–684 (2003)
16. Friedman, N., Linial, M., Nachman, I., Pe'er, D.: Using Bayesian networks to analyze expression data. J. Comput. Biol. 7(3-4), 601–620 (2000)
17. Brazma, A., Jonassen, I., Vilo, J., Ukkonen, E.: Predicting Gene Regulatory Elements in Silico on a Genomic Scale. Genome Res. 8(11), 1202–1215 (1998)
18. Swamidass, S.J., Chen, J., Bruand, J., Phung, P., Ralaivola, L., Baldi, P.: Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity. Bioinformatics 21(suppl. 1), i359–i368 (2005)
19. Trotter, M.W.B.: Support vector machines for drug discovery. Ph.D. thesis, University College London, UK (2006)
20. Gunn, S.R.: Support vector machines for classification and regression. Technical Report, University of Southampton (1997), http://www.isis.ecs.soton.ac.uk/isystems/kernel/
21. Milo, M., Fazeli, A., Niranjan, M., Lawrence, N.D.: A probabilistic model for the extraction of expression levels from oligonucleotide arrays. Biochem. Soc. Trans. 31(Pt 6), 1510–1512 (2003)
22. Rattray, M., Liu, X., Sanguinetti, G., Milo, M., Lawrence, N.D.: Propagating uncertainty in microarray data analysis. Brief Bioinform. 7(1), 37–47 (2006)
23. Sanguinetti, G., Milo, M., Rattray, M., Lawrence, N.D.: Accounting for probe-level noise in principal component analysis of microarray data. Bioinformatics 21(19), 3748–3754 (2005)
24. Liu, X., Lin, K., Andersen, B., Rattray, M.: Including probe-level uncertainty in model-based gene expression clustering. BMC Bioinformatics 8(1), 98 (2007)
25. West, M., Blanchette, C., Dressman, H., Huang, E., Ishida, S., Spang, R., Zuzan, H., Olson Jr., J.A., Marks, J.R., Nevins, J.R.: Predicting the clinical status of human breast cancer by using gene expression profiles. PNAS 98(20), 11462–11467 (2001)
26. Gruvberger, S., Ringnér, M., Chen, Y., Panavally, S., Saal, L.H., Borg, A., Ferno, M., Peterson, C., Meltzer, P.S.: Estrogen Receptor Status in Breast Cancer Is Associated with Remarkably Distinct Gene Expression Patterns. Cancer Res. 61(16), 5979–5984 (2001)
27. Welsh, J.B., Sapinoso, L.M., Su, A.I., Kern, S.G., Wang-Rodriguez, J., Moskaluk, C.A., Frierson, H.F., Hampton, G.M.: Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer. Cancer Res. 61(16), 5974–5978 (2001)
28. Dhanasekaran, S.M., Barrette, T.R., Ghosh, D., Shah, R., Varambally, S., Kurachi, K., Pienta, K.J., Rubin, M.A., Chinnaiyan, A.M.: Delineation of prognostic biomarkers in prostate cancer. Nature 412(6849), 822–826 (2001)
Author Index
Aguilar–Ruiz, Jes´ us S. 199 Ahmed, Said Hassan 1 Ahmed, Sohail 36 Aitken, Stuart 67 Altarawy, Doaa 13 Angaye, Cleopas 89 Arif, Shereena M. 404 Ayg¨ un, Eser 24 Baek, Jinsuk 89 Banu Bte Sm Rashid, Shamima Bardhan, Karna Dev 415 Billings, Stephen A. 233 Burden, Conrad 377
Gras, Robin 365 Grzegorczyk, Marco 113 Gunasekaran, Prasad 125 Harris, Keith 137, 150 Hert, J´erˆ ome 404 Hohm, Tim 162 Holliday, John D. 404 Hung, Yeung Sam 56 Husmeier, Dirk 113, 187 175
Campbell, Colin 427 Cataltepe, Zehra 24 Chang, Chunqi 56 Cheng, Jierong 36 Chetty, Girija 46 Chetty, Madhu 46, 293 Coca, Daniel 233 Cowtan, Kevin 125 Dai, Jisheng 56 Daly, R´ on´ an 67 Damoulas, Theodoros Dua, Sumeet 244 Edwards, Kieron D.
427
67
Farran, Bassam 79 Fenner, John 415 Fisher, Howard 89 Fisher, Paul 89 Fl˚ a, Tor 1 Fogel, Gary B. 211 Folino, Gianluigi 102 Ghanem, Sahar M. 13 Gillott, Richard 415 Girolami, Mark 67, 137, 150, 270, 282, 427 Gori, Fabio 102 Grandison, Scott 125
Ismail, Mohamed A. Jetten, Mike S.M.
13 102
Kadirkamanathan, Visakan King, Ross D. 331 Koh, Esther G.L. 36
233
Lavenier, Dominique 255 Lawford, Patricia 415 Lawson, David M. 125 Leow, Wee Kheng 175 Li, Hao 175 Liou, Yih-Cherng 175 Lu, Haiyun 175 Ma, Jianmin 211 Mak, Lora 125 Malim, Nurul 404 Mantzaris, Alexander V. Marchiori, Elena 102 McMillan, Lisa 150 Millar, Andrew J. 67 Mischak, Harald 137 Morris, Richard J. 125
187
Nepomuceno, Juan A. 199 Ngom, Alioune 307, 365, 377 Nguyen, Minh N. 211 Nicolas, Jacques 255 Niranjan, Mahesan 79, 439 Nuel, Gr´egory 222 Olariu, Victor 233 Olomola, Afolabi 244
O’Neill, John S. 67 Oommen, B. John 24
Spencer, Paul 415 Subhani, Numanul 377
Peterlongo, Pierre 255 Polajnar, Tamara 270, 282 Pollastri, Gianluca 391
Troncoso, Alicia 199 Tuna, Salih 439 Urrutia, Homero
Querellou, Jo¨el
307
255
Rajapakse, Jagath C. 36, 211 Ram, Ramesh 293 Ramanan, Amirthalingam 79 Rogers, Simon 282 Rojas, Dar´ıo 307 Rojas, Juan Carlos 319 Rueda, Luis 307, 319, 377 Schierz, Amanda C. 331 Shi, Jian-Yu 344 Shida, Kazuhito 354 Soltan Ghoraie, Laleh 365
Vorc’h, Raoul 255 Vullo, Alessandro 391 Walsh, Ian 391 Wang, Lili 365 Willett, Peter 404 Wright, Benjamin 415 Ye, Zhongfu 56 Ying, Yiming 427 Zhang, Yan-Ning 344 Zitzler, Eckart 162